最近程序遇到了一个问题,写的一个C#客户端程序在Windows 11系统休眠之后再次唤醒,有一定的概率不能够正常地接收到服务端发送的数据。这个程序有一个专门用于连接服务端的C# 线程,名字为"WatchServiceTransport",它的内部是一个while循环,不停的从服务端使用SocketNetworkStream.Read读取数据。正常的时候,程序是能够接收数据的,但是当Windows休眠之后(此时网络连接可能断开),此后即使Windows唤醒仍然不能连接收取数据。起初怀疑是否是因为Windows 11休眠后,连接TCP的线程被系统强制杀死了?那如何验证是否是这个原因导致的呢?这就需要查看在出现问题时,名为“WatchServiceTransport”的线程是否还存在。

幸好今天这个问题又复现了,于是导出了DMP文件,用WinDbg分析了一下,顺便记录如何打印所有的C#线程。

首先就是使用!threads命令查看当前的所有线程:

0:008> !threads
ThreadCount:      57
UnstartedThread:  0
BackgroundThread: 7
PendingThread:    0
DeadThread:       49
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 68f8 0000021cb7f40cf0    26020 Preemptive  0000021CBAFEF9E8:0000021CBAFF00D8 0000021cb7f17200 0     STA 
   2    2 75f0 0000021cb7f68d50    2b220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Finalizer) 
   3    3 7a78 0000021cd2510d10  102a220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Threadpool Worker) 
   6    4 78b4 0000021cd258bb90  202b220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA 
   7    5 5dbc 0000021cd2589520  202b220 Preemptive  0000021CBAE422E0:0000021CBAE440D8 0000021cb7f17200 0     MTA 
   8    6 5a60 0000021cd25a8100    2b220 Preemptive  0000021CBAE40658:0000021CBAE420D8 0000021cb7f17200 0     MTA 
  10    7 753c 0000021cda0737e0  8029220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Threadpool Completion Port) 
XXXX   63    0 0000021cd43a7aa0  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
XXXX   24    0 0000021cd43a6330  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
XXXX   23    0 0000021cd43a6b00  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
-------------------------------------------------------------省略很多个类似-----------------------------------------------------------------------------------
XXXX   31    0 0000021cd4247600  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
XXXX   32    0 0000021cd42a84b0  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
XXXX   12    0 0000021cd42a5da0  1039820 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     Ukn (Threadpool Worker) 
  13   15 7c18 0000021cd42a8c80  1029220 Preemptive  0000021CBAFEA1A0:0000021CBAFEC0D8 0000021cb7f17200 0     MTA (Threadpool Worker) 

在统计的信息中,可以看到总共有57个线程,其中49个已经dead,真正存活的有8个线程,其中后台线程有7个。需要注意的是上述ID是托管线程的唯一标识符,右CLR分配,OSID,是操作系统分配给线程的线程ID,通过这个ID,可以在Windows任务管理器或其他系统工具中识别该线程。

上面这个它看不出来具体的很多信息,所以需要使用专门的针对.NET的命令,首先就是!dumpheap

0:006> !dumpheap -stat -type Thread
Statistics:
              MT    Count    TotalSize Class Name
00007ffc00d162d0        1           24 System.Threading.SynchronizationContext
00007ffc00d15d20        1           24 System.Threading.Overlapped
00007ffc00ce57c0        1           24 System.Threading.ThreadPoolWorkQueue+SparseArray`1[[System.Threading.ThreadPoolWorkQueue+WorkStealingQueue, mscorlib]]
00007ffc00ccd878        1           24 System.Threading.Overlapped+<>c
00007ffc00cdd130        1           32 System.Threading.TimerQueue+AppDomainTimerSafeHandle
00007ffc00ccdb20        1           32 System.Threading.Gen2GcCallback
00007ffc00ce55c0        1           40 System.Threading.ThreadPoolWorkQueue+QueueSegment
00007ffc00ce5550        1           40 System.Threading.ThreadPoolWorkQueue
00007ffbca60cba0        1           40 System.Net.TimerThread+TimerQueue
00007ffc00cdd0d8        1           48 System.Threading.TimerQueue
00007ffc00cdd058        2           48 System.Threading.TimerHolder
00007ffc00cd6a18        1           48 System.Threading._IOCompletionCallback
00007ffc00d14ed8        1           64 System.Threading.IOCompletionCallback
00007ffc00d148e8        1           64 System.Threading.WaitCallback
00007ffc00cfd958        2           64 System.Threading.Timer
00007ffbca666608        1           64 System.Net.TimerThread+Callback
00007ffbca60cc28        1           72 System.Net.TimerThread+TimerNode
00007ffc00cd5a40        1           80 System.Threading.Tasks.Task`1[[System.Int32, mscorlib]]
00007ffc00ccd8e8        1           80 System.Threading.PinnableBufferCache
00007ffc00cd6b30        3          120 System.Threading.ThreadHelper
00007ffc00cfd8a0        2          128 System.Threading.TimerCallback
00007ffc00cfd5e0        3          144 System.Threading.AutoResetEvent
00007ffc00ce5838        1          152 System.Threading.ThreadPoolWorkQueue+WorkStealingQueue[]
00007ffc00cc1a20        4          160 System.Threading.WaitHandle[]
00007ffc00cdcff8        2          176 System.Threading.TimerQueueTimer
00007ffc00d137e0        4          192 System.Threading.ManualResetEvent
00007ffc00ccda70        1          192 System.Threading.PinnableBufferCacheEventSource
00007ffc00cd69b8        2          224 System.Threading.OverlappedData
00007ffb4296ed00        1          232 System.Windows.Forms.Application+ThreadContext
00007ffb4171bef0        1          248 System.Threading.Tasks.Task`1[[System.Data.ProviderBase.DbConnectionInternal, System.Data]][]
00007ffc00d15a70        5          320 System.Threading.SendOrPostCallback
00007ffc00d00b50        2          320 System.Threading.ThreadAbortException
00007ffc00cf3e00        5          320 System.Threading.ReaderWriterLock
00007ffc00d13878        6          384 System.Threading.ThreadStart
00007ffc00d13c18        7          448 System.Threading.ContextCallback
00007ffc00ce5620        1         2072 System.Threading.IThreadPoolWorkItem[]
00007ffc00cfeb20       84         7392 System.Threading.ExecutionContext
00007ffb42972ac8       82         8528 System.Windows.Forms.Control+ThreadMethodEntry
00007ffc00d017e8      106        10176 System.Threading.Thread
Total 343 objects

找到倒数第二行,类型为System.Threading.Thread的对象,可以看到有106个对象,找到前面的地址:00007ffc00d017e8

然后:

0:006> !DumpHeap /d -mt 00007ffc00d017e8
         Address               MT     Size
0000021cb9d38610 00007ffc00d017e8       96     
0000021cb9e125b8 00007ffc00d017e8       96     
0000021cb9e12ed8 00007ffc00d017e8       96     
0000021cb9e13080 00007ffc00d017e8       96     
..................................省略其中的100个....................
0000021cbaf9e0f0 00007ffc00d017e8       96     
0000021cbafea0f0 00007ffc00d017e8       96     

Statistics:
              MT    Count    TotalSize Class Name
00007ffc00d017e8      106        10176 System.Threading.Thread
Total 106 objects

这里面第一列就是所有的线程的地址了。

这里挑选第二个线程的地址查看:

0:006> !DumpObj /d 0000021cb9e125b8
Name:        System.Threading.Thread
MethodTable: 00007ffc00d017e8
EEClass:     00007ffc00e27670
Size:        96(0x60) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffc00ce1510  40019e9        8 ....Contexts.Context  0 instance 0000000000000000 m_Context
00007ffc00cfeb20  40019ea       10 ....ExecutionContext  0 instance 0000000000000000 m_ExecutionContext
00007ffc00d007a0  40019eb       18        System.String  0 instance 0000021cb9e122d8 m_Name
00007ffc00d02eb8  40019ec       20      System.Delegate  0 instance 0000000000000000 m_Delegate
00007ffc00d04810  40019ed       28 ...ation.CultureInfo  0 instance 0000000000000000 m_CurrentCulture
00007ffc00d04810  40019ee       30 ...ation.CultureInfo  0 instance 0000000000000000 m_CurrentUICulture
00007ffc00d00bb0  40019ef       38        System.Object  0 instance 0000000000000000 m_ThreadStartArg
00007ffc00cff8c0  40019f0       40        System.IntPtr  1 instance      21cd258bb90 DONT_USE_InternalThread
00007ffc00d03368  40019f1       48         System.Int32  1 instance                2 m_Priority
00007ffc00d03368  40019f2       4c         System.Int32  1 instance                4 m_ManagedThreadId
00007ffc00cfc638  40019f3       50       System.Boolean  1 instance                1 m_ExecutionContextBelongsToOuterScope
00007ffc00d10de0  40019f4      fa0 ...LocalDataStoreMgr  0   shared           static s_LocalDataStoreMgr
                                 >> Domain:Value  0000021cb7f17200:NotInit  <<
00007ffc01f3ab08  40019f6      fa8 ...eInfo, mscorlib]]  0   shared           static s_asyncLocalCurrentCulture
                                 >> Domain:Value  0000021cb7f17200:NotInit  <<
00007ffc01f3ab08  40019f7      fb0 ...eInfo, mscorlib]]  0   shared           static s_asyncLocalCurrentUICulture
                                 >> Domain:Value  0000021cb7f17200:NotInit  <<
00007ffc00d112c0  40019f5       20 ...alDataStoreHolder  0   shared         TLstatic s_LocalDataStore
    >> Thread:Value <<
00007ffc00d03368  40019f8      d4c         System.Int32  1   shared         TLstatic t_currentProcessorIdCache
    >> Thread:Value <<

可以看到,有m_Name字段的值非空,前面的地址的值为0000021cb9e122d8,它的offset18

可以直接dump这个值:

0:006> !DumpObj /d 0000021cb9e122d8
Name:        System.String
MethodTable: 00007ffc00d007a0
EEClass:     00007ffc00cb4868
Size:        64(0x40) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      ProcessAlert Thread
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffc00d03368  4000283        8         System.Int32  1 instance               19 m_stringLength
00007ffc00d01610  4000284        c          System.Char  1 instance               50 m_firstChar
00007ffc00d007a0  4000288       e0        System.String  0   shared           static Empty
                                 >> Domain:Value  0000021cb7f17200:NotInit  <<

可以看到该字符的描述信息,也可以通过线程的基地址+18来直接打印该位置的内容,比如:

0:008> !do poi(0000021cb9e125b8+18)
Name:        System.String
MethodTable: 00007ffc00d007a0
EEClass:     00007ffc00cb4868
Size:        64(0x40) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      ProcessAlert Thread
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffc00d03368  4000283        8         System.Int32  1 instance               19 m_stringLength
00007ffc00d01610  4000284        c          System.Char  1 instance               50 m_firstChar
00007ffc00d007a0  4000288       e0        System.String  0   shared           static Empty
                                 >> Domain:Value  0000021cb7f17200:NotInit  <<

与上面的结果一致,该字符的长度为19。偏移量为C,于是使用:

0:006> du 0000021cb9e122d8+c
0000021c`b9e122e4  "ProcessAlert Thread"

地址加上第一个字符的偏移量,可以看到这个线程的名字叫”ProcessAlert Thread“的线程。

106个线程,逐个的查看比较麻烦。这时可以通过脚本来判断,移除所有名字为空的线程:

0:006> .foreach (address {!dumpheap -short -mt 00007ffc00d017e8}){.if (poi(${address}+18) != 0) {du poi(${address}+18)+c}}
0000021c`b9e122e4  "ProcessAlert Thread"
0000021c`b9e12e44  "client_processing_message_thread"
0000021c`b9e12e84  ""
0000021c`b9e13004  "WatchServiceTransport"

从输出结果可以看到,有设置名字的线程,总共有4个(这也提醒我们在使用Thread的时候,最好给它的Name赋一个有意义的值,这样方便调试)。程序里的那个名为“WatchServiceTransport”的线程是存在的,这表示Windows休眠之后并没有把这个线程杀死。

那么接下来查看一下当时程序的调用堆栈,首先还是回到!threads命令,查看当前所有线程:

0:000> !threads
ThreadCount:      57
UnstartedThread:  0
BackgroundThread: 7
PendingThread:    0
DeadThread:       49
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 68f8 0000021cb7f40cf0    26020 Preemptive  0000021CBAFEF9E8:0000021CBAFF00D8 0000021cb7f17200 0     STA 
   2    2 75f0 0000021cb7f68d50    2b220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Finalizer) 
   3    3 7a78 0000021cd2510d10  102a220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Threadpool Worker) 
   6    4 78b4 0000021cd258bb90  202b220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA 
   7    5 5dbc 0000021cd2589520  202b220 Preemptive  0000021CBAE422E0:0000021CBAE440D8 0000021cb7f17200 0     MTA 
   8    6 5a60 0000021cd25a8100    2b220 Preemptive  0000021CBAE40658:0000021CBAE420D8 0000021cb7f17200 0     MTA 
  10    7 753c 0000021cda0737e0  8029220 Preemptive  0000000000000000:0000000000000000 0000021cb7f17200 0     MTA (Threadpool Completion Port) 

上面只列出了8个存活的线程,从后面的描述来看,ID为5和6的很可能是我们自定义的线程。

于是切换到ID为6的线程。

0:008> ~~[5a60]s
ntdll!NtWaitForSingleObject+0x14:
00007ffc`3399c044 c3              ret

然后执行!clrstack

0:008> !clrstack
OS Thread Id: 0x5a60 (8)
        Child SP               IP Call Site
000000d83eefec98 00007ffc3399c044 [InlinedCallFrame: 000000d83eefec98] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefec98 00007ffbca943c02 [InlinedCallFrame: 000000d83eefec98] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefec70 00007ffbca943c02 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
000000d83eefed20 00007ffbca8d0017 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
000000d83eefedb0 00007ffbca8ce225 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)
000000d83eefee20 00007ffbc26001b6 TradeServerTransport.TCPClientTransport.ConnectServer() [D:\Dev_Work\Tools\交易日志分析\TradeServerTransport\TCPClientTransport.cs @ 59]
000000d83eefee90 00007ffc01211bf8 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d83eefef60 00007ffc01211ae5 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d83eefef90 00007ffc01211ab5 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000d83eefefe0 00007ffc01236445 System.Threading.ThreadHelper.ThreadStart()
000000d83eeff200 00007ffc21dd1673 [GCFrame: 000000d83eeff200] 
000000d83eeff568 00007ffc21dd1673 [DebuggerU2MCatchHandlerFrame: 000000d83eeff568] 

可以看到,当前线程卡在了NetworkStream.Read方法那里了。所以可能的问题就是,当Windows休眠后本机网络断开了与远程服务器的连接,这里的Read方法并没有得到连接断开的通知,一直卡在了Read方法这里,从而导致接收不到数据。找到可能的问题之后修改的方法就比较简单了。

 

参考: