0:005> !analyze -v CONTEXT: (.ecxr) eax=063ce258 ebx=07b90000 ecx=0063552e edx=0063552e esi=03070909 edi=03070909 eip=71954432 esp=063ce220 ebp=063ce23c iopl=0 nv up ei pl nz na pe nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206 clr!WKS::gc_heap::mark_object_simple+0x12: 71954432 8b0f mov ecx,dword ptr [edi] ds:002b:03070909=???????? Resetting default scope EXCEPTION_RECORD: (.exr -1) ExceptionAddress: 71954432 (clr!WKS::gc_heap::mark_object_simple+0x00000012) ExceptionCode: c0000005 (Access violation) ExceptionFlags: 00000001 NumberParameters: 2 Parameter[0]: 00000000 Parameter[1]: 03070909 Attempt to read from address 03070909 STACK_TEXT: 063ce23c 719543fc 063ce258 0a76cc88 71954260 clr!WKS::gc_heap::mark_object_simple+0x12 063ce25c 71950b62 0a76cc88 063cec88 00000000 clr!WKS::GCHeap::Promote+0xa8 ... 063cec28 71950fa3 71950da0 063cec40 00000500 clr!Thread::StackWalkFrames+0x9d 063cec4c 7195103e 063cec88 00000002 00000000 clr!standalone::ScanStackRoots+0x43 063cec68 71954038 0079cb88 063cec88 00080101 clr!GCToEEInterface::GcScanRoots+0xdb 063cecc0 71953225 00080101 00000000 00000001 clr!WKS::gc_heap::mark_phase+0x17e 063cece0 7195355b 71f75da0 00000000 00000001 clr!WKS::gc_heap::gc1+0xae 063cecf8 71953665 71f75fb4 71f75fb4 00000000 clr!WKS::gc_heap::garbage_collect+0x367 063ced18 7195376a 00000000 00000000 71f75fb4 clr!WKS::GCHeap::GarbageCollectGeneration+0x1bd ...从卦中信息看,当前执行流处于GC标记阶段,并且是在各个线程栈上寻找用户根,在寻找的过程中踩到了坏内存,接下来需要捋一下是什么逻辑踩到的,可以用 u 反汇编一下。
0:005> u WKS::gc_heap::mark_object_simple clr!WKS::gc_heap::mark_object_simple: 71954420 55 push ebp 71954421 8bec mov ebp,esp 71954423 83ec18 sub esp,18h 71954426 8b4508 mov eax,dword ptr [ebp+8] 71954429 57 push edi 7195442a 8b38 mov edi,dword ptr [eax] 7195442c 89bde8ffffff mov dword ptr [ebp-18h],edi 71954432 8b0f mov ecx,dword ptr [edi] ...从汇编逻辑看,这是将方法的第一个参数进行解引用,参考 coreclr 的源码。
void gc_heap::mark_object_simple(uint8_t** po THREAD_NUMBER_DCL) { uint8_t* o = *po; if (gc_mark1(o)) { ... } }结合C++代码,edi=03070909 就是上面的o,也就是需要标记的托管对象,但现在这个 o 是一个坏对象,那为什么会坏掉呢?
0:005> !verifyheap No heap corruption detected.从卦中看,我去,托管堆居然是好的,过往经验在这个dump里被击的粉碎,接下来要往哪里突破呢?可以观察下这个托管地址和当前的托管segment在空间距离上的特征,命令输出如下:
0:005> !address 03070909 // 堆代码 duidaima.com Usage: <unknown> Base Address: 02ca2000 End Address: 036f0000 Region Size: 00a4e000 ( 10.305 MB) State: 00002000 MEM_RESERVE Protect: <info not present at the target> Type: 00020000 MEM_PRIVATE Allocation Base: 026f0000 Allocation Protect: 00000004 PAGE_READWRITE 0:005> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x06ca7a7c generation 1 starts at 0x06b91000 generation 2 starts at 0x026f1000 ephemeral segment allocation context: none segment begin allocated size 026f0000 026f1000 02c98f8c 0x5a7f8c(5930892) 06b90000 06b91000 0732b3d0 0x79a3d0(7971792) Large object heap starts at 0x036f1000 segment begin allocated size 036f0000 036f1000 03c78da0 0x587da0(5799328) Total Size: Size: 0x12ca0fc (19702012) bytes. ------------------------------ GC Heap Size: Size: 0x12ca0fc (19702012) bytes. 0:005> !address BaseAddr EndAddr+1 RgnSize Type State Protect Usage ----------------------------------------------------------------------------------------------- ... + 26f0000 2ca2000 5b2000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE <unknown> [..........o.....] 2ca2000 36f0000 a4e000 MEM_PRIVATE MEM_RESERVE <unknown> ...说实话,有经验的朋友看到这卦中信息马上就知道是怎么回事了,步骤大概是这样的。
3.但不知为何,线程栈还保留了这个老地址 03070909,而不是新地址
void GCToEEInterface::GcScanRoots(promote_func* fn, int condemned, int max_gen, ScanContext* sc) { Thread* pThread = NULL; while ((pThread = ThreadStore::GetThreadList(pThread)) != NULL) { ScanStackRoots(pThread, fn, sc); } }接下来上 windbg 在崩溃的线程栈上实操一下。
0:005> kb 8 # ChildEBP RetAddr Args to Child 00 063ce23c 719543fc 063ce258 0a76cc88 71954260 clr!WKS::gc_heap::mark_object_simple+0x12 01 063ce25c 71950b62 0a76cc88 063cec88 00000000 clr!WKS::GCHeap::Promote+0xa8 02 063ce274 71951a35 063cec40 0a76cc88 00000000 clr!GcEnumObject+0x37 03 063ce5d8 71950e6f 063ce920 063ce870 00000000 clr!EECodeManager::EnumGcRefs+0x72b 04 063ce628 717bfaa4 063ce650 063cec40 71950da0 clr!GcStackCrawlCallBack+0x139 05 063ce8f4 717bfbaa 063ce920 71950da0 063cec40 clr!Thread::StackWalkFramesEx+0x92 06 063cec28 71950fa3 71950da0 063cec40 00000500 clr!Thread::StackWalkFrames+0x9d 07 063cec4c 7195103e 063cec88 00000002 00000000 clr!standalone::ScanStackRoots+0x43 0:005> dp 063cec88 L1 063cec88 08debbf8 0:005> !t ThreadCount: 30 UnstartedThread: 0 BackgroundThread: 29 PendingThread: 0 DeadThread: 0 Hosted Runtime: no Lock ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception ... 30 26 3e98 08debbf8 2b220 Preemptive 00000000:00000000 0079cb88 0 MTA ...从卦中看,30号线程就是我苦苦寻找的僵死线程,接下来赶紧切过去看看,果然发现了C++的函数xxx.Driver.xxx,由于私密性,我就模糊一下了哈。
0:030> ~30s eax=00000000 ebx=08debbf8 ecx=00000000 edx=00000000 esi=00000000 edi=00000244 eip=77872aac esp=0a76c9fc ebp=0a76ca6c iopl=0 nv up ei pl nz na pe nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000206 ntdll!NtWaitForSingleObject+0xc: 77872aac c20c00 ret 0Ch 0:030> !clrstack OS Thread Id: 0x3e98 (30) Child SP IP Call Site 0a76cc18 77872aac [InlinedCallFrame: 0a76cc18] 0a76cc0c 00aa8047 DomainBoundILStubClass.IL_STUB_PInvoke(UInt32, xxx ByRef) 0a76cc18 00aa6c67 [InlinedCallFrame: 0a76cc18] xxx.Driver.xxx(UInt32, xxx ByRef) 0a76ccc0 00aa6c67 xxx.Driver.xxxFault(UInt32, System.String) ...既然发现了C++方法,最后还剩一个疑问,就是此时的03070909真的在非托管层吗?这个可以通过搜索它的线程栈地址。
0:030> s-d poi(@$teb+0x8) poi(@$teb+0x4) 03070909 0a76cc88 03070909 728f5d01 68d8c642 5c654b42 .....].rB..hBKe\从代码中可以看到确实是在xxx.Driver.xxxFault方法里传给了C++,有了这些信息接下来就是告诉朋友,重点关注下这个方法,捋一下逻辑。