最近程序遇到了一个问题,写的一个C#客户端程序在Windows 11系统休眠之后再次唤醒,有一定的概率不能够正常地接收到服务端发送的数据。这个程序有一个专门用于连接服务端的C# 线程,名字为"WatchServiceTransport",它的内部是一个while循环,不停的从服务端使用Socket
的NetworkStream.Read
读取数据。正常的时候,程序是能够接收数据的,但是当Windows休眠之后(此时网络连接可能断开),此后即使Windows唤醒仍然不能连接收取数据。起初怀疑是否是因为Windows 11休眠后,连接TCP的线程被系统强制杀死了?那如何验证是否是这个原因导致的呢?这就需要查看在出现问题时,名为“WatchServiceTransport”的线程是否还存在。
幸好今天这个问题又复现了,于是导出了DMP文件,用WinDbg分析了一下,顺便记录如何打印所有的C#线程。
首先就是使用!threads命令查看当前的所有线程:
0:008> !threads
ThreadCount: 57
UnstartedThread: 0
BackgroundThread: 7
PendingThread: 0
DeadThread: 49
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 68f8 0000021cb7f40cf0 26020 Preemptive 0000021CBAFEF9E8:0000021CBAFF00D8 0000021cb7f17200 0 STA
2 2 75f0 0000021cb7f68d50 2b220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Finalizer)
3 3 7a78 0000021cd2510d10 102a220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Threadpool Worker)
6 4 78b4 0000021cd258bb90 202b220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA
7 5 5dbc 0000021cd2589520 202b220 Preemptive 0000021CBAE422E0:0000021CBAE440D8 0000021cb7f17200 0 MTA
8 6 5a60 0000021cd25a8100 2b220 Preemptive 0000021CBAE40658:0000021CBAE420D8 0000021cb7f17200 0 MTA
10 7 753c 0000021cda0737e0 8029220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Threadpool Completion Port)
XXXX 63 0 0000021cd43a7aa0 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
XXXX 24 0 0000021cd43a6330 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
XXXX 23 0 0000021cd43a6b00 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
-------------------------------------------------------------省略很多个类似-----------------------------------------------------------------------------------
XXXX 31 0 0000021cd4247600 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
XXXX 32 0 0000021cd42a84b0 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
XXXX 12 0 0000021cd42a5da0 1039820 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 Ukn (Threadpool Worker)
13 15 7c18 0000021cd42a8c80 1029220 Preemptive 0000021CBAFEA1A0:0000021CBAFEC0D8 0000021cb7f17200 0 MTA (Threadpool Worker)
在统计的信息中,可以看到总共有57个线程,其中49个已经dead,真正存活的有8个线程,其中后台线程有7个。需要注意的是上述ID是托管线程的唯一标识符,右CLR分配,OSID,是操作系统分配给线程的线程ID,通过这个ID,可以在Windows任务管理器或其他系统工具中识别该线程。
上面这个它看不出来具体的很多信息,所以需要使用专门的针对.NET的命令,首先就是!dumpheap
0:006> !dumpheap -stat -type Thread
Statistics:
MT Count TotalSize Class Name
00007ffc00d162d0 1 24 System.Threading.SynchronizationContext
00007ffc00d15d20 1 24 System.Threading.Overlapped
00007ffc00ce57c0 1 24 System.Threading.ThreadPoolWorkQueue+SparseArray`1[[System.Threading.ThreadPoolWorkQueue+WorkStealingQueue, mscorlib]]
00007ffc00ccd878 1 24 System.Threading.Overlapped+<>c
00007ffc00cdd130 1 32 System.Threading.TimerQueue+AppDomainTimerSafeHandle
00007ffc00ccdb20 1 32 System.Threading.Gen2GcCallback
00007ffc00ce55c0 1 40 System.Threading.ThreadPoolWorkQueue+QueueSegment
00007ffc00ce5550 1 40 System.Threading.ThreadPoolWorkQueue
00007ffbca60cba0 1 40 System.Net.TimerThread+TimerQueue
00007ffc00cdd0d8 1 48 System.Threading.TimerQueue
00007ffc00cdd058 2 48 System.Threading.TimerHolder
00007ffc00cd6a18 1 48 System.Threading._IOCompletionCallback
00007ffc00d14ed8 1 64 System.Threading.IOCompletionCallback
00007ffc00d148e8 1 64 System.Threading.WaitCallback
00007ffc00cfd958 2 64 System.Threading.Timer
00007ffbca666608 1 64 System.Net.TimerThread+Callback
00007ffbca60cc28 1 72 System.Net.TimerThread+TimerNode
00007ffc00cd5a40 1 80 System.Threading.Tasks.Task`1[[System.Int32, mscorlib]]
00007ffc00ccd8e8 1 80 System.Threading.PinnableBufferCache
00007ffc00cd6b30 3 120 System.Threading.ThreadHelper
00007ffc00cfd8a0 2 128 System.Threading.TimerCallback
00007ffc00cfd5e0 3 144 System.Threading.AutoResetEvent
00007ffc00ce5838 1 152 System.Threading.ThreadPoolWorkQueue+WorkStealingQueue[]
00007ffc00cc1a20 4 160 System.Threading.WaitHandle[]
00007ffc00cdcff8 2 176 System.Threading.TimerQueueTimer
00007ffc00d137e0 4 192 System.Threading.ManualResetEvent
00007ffc00ccda70 1 192 System.Threading.PinnableBufferCacheEventSource
00007ffc00cd69b8 2 224 System.Threading.OverlappedData
00007ffb4296ed00 1 232 System.Windows.Forms.Application+ThreadContext
00007ffb4171bef0 1 248 System.Threading.Tasks.Task`1[[System.Data.ProviderBase.DbConnectionInternal, System.Data]][]
00007ffc00d15a70 5 320 System.Threading.SendOrPostCallback
00007ffc00d00b50 2 320 System.Threading.ThreadAbortException
00007ffc00cf3e00 5 320 System.Threading.ReaderWriterLock
00007ffc00d13878 6 384 System.Threading.ThreadStart
00007ffc00d13c18 7 448 System.Threading.ContextCallback
00007ffc00ce5620 1 2072 System.Threading.IThreadPoolWorkItem[]
00007ffc00cfeb20 84 7392 System.Threading.ExecutionContext
00007ffb42972ac8 82 8528 System.Windows.Forms.Control+ThreadMethodEntry
00007ffc00d017e8 106 10176 System.Threading.Thread
Total 343 objects
找到倒数第二行,类型为System.Threading.Thread
的对象,可以看到有106个对象,找到前面的地址:00007ffc00d017e8
然后:
0:006> !DumpHeap /d -mt 00007ffc00d017e8
Address MT Size
0000021cb9d38610 00007ffc00d017e8 96
0000021cb9e125b8 00007ffc00d017e8 96
0000021cb9e12ed8 00007ffc00d017e8 96
0000021cb9e13080 00007ffc00d017e8 96
..................................省略其中的100个....................
0000021cbaf9e0f0 00007ffc00d017e8 96
0000021cbafea0f0 00007ffc00d017e8 96
Statistics:
MT Count TotalSize Class Name
00007ffc00d017e8 106 10176 System.Threading.Thread
Total 106 objects
这里面第一列就是所有的线程的地址了。
这里挑选第二个线程的地址查看:
0:006> !DumpObj /d 0000021cb9e125b8
Name: System.Threading.Thread
MethodTable: 00007ffc00d017e8
EEClass: 00007ffc00e27670
Size: 96(0x60) bytes
File: C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ffc00ce1510 40019e9 8 ....Contexts.Context 0 instance 0000000000000000 m_Context
00007ffc00cfeb20 40019ea 10 ....ExecutionContext 0 instance 0000000000000000 m_ExecutionContext
00007ffc00d007a0 40019eb 18 System.String 0 instance 0000021cb9e122d8 m_Name
00007ffc00d02eb8 40019ec 20 System.Delegate 0 instance 0000000000000000 m_Delegate
00007ffc00d04810 40019ed 28 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentCulture
00007ffc00d04810 40019ee 30 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentUICulture
00007ffc00d00bb0 40019ef 38 System.Object 0 instance 0000000000000000 m_ThreadStartArg
00007ffc00cff8c0 40019f0 40 System.IntPtr 1 instance 21cd258bb90 DONT_USE_InternalThread
00007ffc00d03368 40019f1 48 System.Int32 1 instance 2 m_Priority
00007ffc00d03368 40019f2 4c System.Int32 1 instance 4 m_ManagedThreadId
00007ffc00cfc638 40019f3 50 System.Boolean 1 instance 1 m_ExecutionContextBelongsToOuterScope
00007ffc00d10de0 40019f4 fa0 ...LocalDataStoreMgr 0 shared static s_LocalDataStoreMgr
>> Domain:Value 0000021cb7f17200:NotInit <<
00007ffc01f3ab08 40019f6 fa8 ...eInfo, mscorlib]] 0 shared static s_asyncLocalCurrentCulture
>> Domain:Value 0000021cb7f17200:NotInit <<
00007ffc01f3ab08 40019f7 fb0 ...eInfo, mscorlib]] 0 shared static s_asyncLocalCurrentUICulture
>> Domain:Value 0000021cb7f17200:NotInit <<
00007ffc00d112c0 40019f5 20 ...alDataStoreHolder 0 shared TLstatic s_LocalDataStore
>> Thread:Value <<
00007ffc00d03368 40019f8 d4c System.Int32 1 shared TLstatic t_currentProcessorIdCache
>> Thread:Value <<
可以看到,有m_Name
字段的值非空,前面的地址的值为0000021cb9e122d8
,它的offset
为18
可以直接dump这个值:
0:006> !DumpObj /d 0000021cb9e122d8
Name: System.String
MethodTable: 00007ffc00d007a0
EEClass: 00007ffc00cb4868
Size: 64(0x40) bytes
File: C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String: ProcessAlert Thread
Fields:
MT Field Offset Type VT Attr Value Name
00007ffc00d03368 4000283 8 System.Int32 1 instance 19 m_stringLength
00007ffc00d01610 4000284 c System.Char 1 instance 50 m_firstChar
00007ffc00d007a0 4000288 e0 System.String 0 shared static Empty
>> Domain:Value 0000021cb7f17200:NotInit <<
可以看到该字符的描述信息,也可以通过线程的基地址+18
来直接打印该位置的内容,比如:
0:008> !do poi(0000021cb9e125b8+18)
Name: System.String
MethodTable: 00007ffc00d007a0
EEClass: 00007ffc00cb4868
Size: 64(0x40) bytes
File: C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String: ProcessAlert Thread
Fields:
MT Field Offset Type VT Attr Value Name
00007ffc00d03368 4000283 8 System.Int32 1 instance 19 m_stringLength
00007ffc00d01610 4000284 c System.Char 1 instance 50 m_firstChar
00007ffc00d007a0 4000288 e0 System.String 0 shared static Empty
>> Domain:Value 0000021cb7f17200:NotInit <<
与上面的结果一致,该字符的长度为19
。偏移量为C
,于是使用:
0:006> du 0000021cb9e122d8+c
0000021c`b9e122e4 "ProcessAlert Thread"
地址加上第一个字符的偏移量,可以看到这个线程的名字叫”ProcessAlert Thread“的线程。
106个线程,逐个的查看比较麻烦。这时可以通过脚本来判断,移除所有名字为空的线程:
0:006> .foreach (address {!dumpheap -short -mt 00007ffc00d017e8}){.if (poi(${address}+18) != 0) {du poi(${address}+18)+c}}
0000021c`b9e122e4 "ProcessAlert Thread"
0000021c`b9e12e44 "client_processing_message_thread"
0000021c`b9e12e84 ""
0000021c`b9e13004 "WatchServiceTransport"
从输出结果可以看到,有设置名字的线程,总共有4个(这也提醒我们在使用Thread的时候,最好给它的Name赋一个有意义的值,这样方便调试)。程序里的那个名为“WatchServiceTransport”的线程是存在的,这表示Windows休眠之后并没有把这个线程杀死。
那么接下来查看一下当时程序的调用堆栈,首先还是回到!threads
命令,查看当前所有线程:
0:000> !threads
ThreadCount: 57
UnstartedThread: 0
BackgroundThread: 7
PendingThread: 0
DeadThread: 49
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 68f8 0000021cb7f40cf0 26020 Preemptive 0000021CBAFEF9E8:0000021CBAFF00D8 0000021cb7f17200 0 STA
2 2 75f0 0000021cb7f68d50 2b220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Finalizer)
3 3 7a78 0000021cd2510d10 102a220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Threadpool Worker)
6 4 78b4 0000021cd258bb90 202b220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA
7 5 5dbc 0000021cd2589520 202b220 Preemptive 0000021CBAE422E0:0000021CBAE440D8 0000021cb7f17200 0 MTA
8 6 5a60 0000021cd25a8100 2b220 Preemptive 0000021CBAE40658:0000021CBAE420D8 0000021cb7f17200 0 MTA
10 7 753c 0000021cda0737e0 8029220 Preemptive 0000000000000000:0000000000000000 0000021cb7f17200 0 MTA (Threadpool Completion Port)
上面只列出了8个存活的线程,从后面的描述来看,ID为5和6的很可能是我们自定义的线程。
于是切换到ID为6的线程。
0:008> ~~[5a60]s
ntdll!NtWaitForSingleObject+0x14:
00007ffc`3399c044 c3 ret
然后执行!clrstack
0:008> !clrstack
OS Thread Id: 0x5a60 (8)
Child SP IP Call Site
000000d83eefec98 00007ffc3399c044 [InlinedCallFrame: 000000d83eefec98] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefec98 00007ffbca943c02 [InlinedCallFrame: 000000d83eefec98] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefec70 00007ffbca943c02 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefed20 00007ffbca8d0017 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
000000d83eefedb0 00007ffbca8ce225 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)
000000d83eefee20 00007ffbc26001b6 TradeServerTransport.TCPClientTransport.ConnectServer() [D:\Dev_Work\Tools\交易日志分析\TradeServerTransport\TCPClientTransport.cs @ 59]
000000d83eefee90 00007ffc01211bf8 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d83eefef60 00007ffc01211ae5 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d83eefef90 00007ffc01211ab5 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000d83eefefe0 00007ffc01236445 System.Threading.ThreadHelper.ThreadStart()
000000d83eeff200 00007ffc21dd1673 [GCFrame: 000000d83eeff200]
000000d83eeff568 00007ffc21dd1673 [DebuggerU2MCatchHandlerFrame: 000000d83eeff568]
可以看到,当前线程卡在了NetworkStream.Read
方法那里了。所以可能的问题就是,当Windows休眠后本机网络断开了与远程服务器的连接,这里的Read
方法并没有得到连接断开的通知,一直卡在了Read
方法这里,从而导致接收不到数据。找到可能的问题之后修改的方法就比较简单了。
参考:
- https://stackoverflow.com/questions/31555264/can-windbg-display-thread-names
- https://mp.weixin.qq.com/s/demtsmWXuIoWD1Rzl1H9Yw
- https://zhuanlan.zhihu.com/p/512861073
- https://blog.csdn.net/weixin_30840573/article/details/96611509
- https://stackoverflow.com/questions/12258870/networkstream-read-still-blocks-even-when-network-goes-down
- https://stackoverflow.com/questions/6958255/what-are-some-reasons-networkstream-read-would-hang-block
- https://stackoverflow.com/questions/43112114/weird-networkstream-reading-issue