Debugging Utility Process Crashes

During the M55 kernel upgrade we noticed a large number of Utility process crashes. Utility process crashes used to account for less than 1% of the total crash volume, but they now make up more than 10%. The Utility process crash stacks are all very similar, as follows:

// Thread 0
 # ChildEBP RetAddr  
00 0021f5f0 77b065ec ntdll!KiFastSystemCallRet
01 0021f5f4 75a3179c ntdll!NtWaitForSingleObject+0xc
02 0021f660 76fec533 KERNELBASE!WaitForSingleObjectEx+0x98
03 0021f678 76fec4e2 kernel32!WaitForSingleObjectExImplementation+0x75
04 0021f68c 0139d1a6 kernel32!WaitForSingleObject+0x12
05 0021f6a8 0139c75f UCBrowser!sandbox::BrokerServicesBase::~BrokerServicesBase+0x4a
06 0021f6b0 013c1821 UCBrowser!sandbox::SingletonBase<sandbox::BrokerServicesBase>::OnExit+0x13
07 0021f6dc 013c16ff UCBrowser!<lambda_275893d493268fdec8709772e3fcec0e>::operator()+0x89
08 0021f710 013c174d UCBrowser!__crt_seh_guarded_call<int>::operator()<<lambda_995298e7d72eb4c2aab26c0585b3abe5>,<lambda_275893d493268fdec8709772e3fcec0e> &,<lambda_293819299cbf9a7022e18b56a874bb5c> >+0x27
09 0021f730 013c1a26 UCBrowser!__acrt_lock_and_call<<lambda_275893d493268fdec8709772e3fcec0e> >+0x24
0a 0021f744 013bf1ff UCBrowser!_execute_onexit_table+0x17
0b 0021f77c 013bf3ae UCBrowser!common_exit+0xa1
0c 0021f790 013b1e1f UCBrowser!exit+0x11
0d 0021f7d0 76feef8c UCBrowser!__scrt_common_main_seh+0x107
0e 0021f7dc 77b2367a kernel32!BaseThreadInitThunk+0xe
0f 0021f81c 77b2364d ntdll!__RtlUserThreadStart+0x70
10 0021f834 00000000 ntdll!_RtlUserThreadStart+0x1b

// Thread 1
 # ChildEBP RetAddr  
00 03b8f1e8 77b0539c ntdll!KiFastSystemCallRet
01 03b8f1ec 75a31876 ntdll!ZwDelayExecution+0xc
02 03b8f254 75a31818 KERNELBASE!SleepEx+0x65
03 03b8f264 62065f1d KERNELBASE!Sleep+0xf
04 03b8f328 770009b6 chrome_elf!`anonymous namespace'::UnhandledExceptionHandler+0x5b
05 03b8f3b0 77b92bad kernel32!UnhandledExceptionFilter+0x127
06 03b8f3c8 77b932fd ntdll!TppExceptionFilter+0x59
07 03b8f3dc 77b5b9b1 ntdll!TppWaiterpOuterExceptionFilter+0x11
08 03b8f3e4 77ade3a4 ntdll!TppWaiterpThread+0x708
09 03b8f3f8 77ade234 ntdll!_EH4_CallFilterFunc+0x12
0a 03b8f420 77b06d79 ntdll!_except_handler4+0x8e
0b 03b8f444 77b06d4b ntdll!ExecuteHandler2+0x26
0c 03b8f468 77adfa17 ntdll!ExecuteHandler+0x24
0d 03b8f4f4 77b06bd7 ntdll!RtlDispatchException+0x127
0e 03b8f4f4 77b5b8ce ntdll!KiUserExceptionDispatcher+0xf
0f 03b8f990 76feef8c ntdll!TppWaiterpThread+0x623
10 03b8f99c 77b2367a kernel32!BaseThreadInitThunk+0xe
11 03b8f9dc 77b2364d ntdll!__RtlUserThreadStart+0x70
12 03b8f9f4 00000000 ntdll!_RtlUserThreadStart+0x1b

// Thread 2
 # ChildEBP RetAddr  
00 03e6f8e8 77b065fc ntdll!KiFastSystemCallRet
01 03e6f8ec 77af1146 ntdll!NtWaitForWorkViaWorkerFactory+0xc
02 03e6fa3c 76feef8c ntdll!TppWorkerThread+0x209
03 03e6fa48 77b2367a kernel32!BaseThreadInitThunk+0xe
04 03e6fa88 77b2364d ntdll!__RtlUserThreadStart+0x70
05 03e6faa0 00000000 ntdll!_RtlUserThreadStart+0x1b

Look at thread 0 first, i.e. the main thread. This call stack has no wMain frame, yet it does contain exit, so the program is crashing while the process is exiting. The last frame belonging to our own code is inside BrokerServicesBase::~BrokerServicesBase, whose source is:

// The destructor should only be called when the Broker process is terminating.
// Since BrokerServicesBase is a singleton, this is called from the CRT
// termination handlers, if this code lives on a DLL it is called during
// DLL_PROCESS_DETACH in other words, holding the loader lock, so we cannot
// wait for threads here.
BrokerServicesBase::~BrokerServicesBase() {
  // If there is no port Init() was never called successfully.
  if (!job_port_.IsValid())
    return;

  // Closing the port causes, that no more Job notifications are delivered to
  // the worker thread and also causes the thread to exit. This is what we
  // want to do since we are going to close all outstanding Jobs and notifying
  // the policy objects ourselves.
  ::PostQueuedCompletionStatus(job_port_.Get(), 0, THREAD_CTRL_QUIT, FALSE);

  if (job_thread_.IsValid() &&
      WAIT_TIMEOUT == ::WaitForSingleObject(job_thread_.Get(), 1000)) {
    // Cannot clean broker services.
    NOTREACHED();
    return;
  }

  base::STLDeleteElements(&tracker_list_);
  delete thread_pool_;

  ::DeleteCriticalSection(&lock_);
}

I believe thread 0 is behaving normally. Now look at thread 1, the thread where the exception occurred and where the dump was generated. We can see that ntdll!TppWaiterpThread triggered the exception dispatched through ntdll!KiUserExceptionDispatcher. Strangely, the exception comes from a system module. Here is WinDbg's analysis of the dump:

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 77b5b8ce (ntdll!TppWaiterpThread+0x00000623)
   ExceptionCode: c000070a
  ExceptionFlags: 00000000
NumberParameters: 5
   Parameter[0]: c0000008
   Parameter[1]: 0000012c
   Parameter[2]: 00469d80
   Parameter[3]: 0046dc88
   Parameter[4]: 51f58e20
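
For reference, the two status codes involved here are defined in the Windows SDK's ntstatus.h as shown below; Parameter[0] (c0000008) appears to be the underlying NTSTATUS of the failed wait, which already hints at an invalid handle:

// Definitions from ntstatus.h (Windows SDK).
#define STATUS_INVALID_HANDLE               ((NTSTATUS)0xC0000008L)  // Parameter[0]
#define STATUS_THREADPOOL_HANDLE_EXCEPTION  ((NTSTATUS)0xC000070AL)  // ExceptionCode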

The ExceptionCode is c000070a, which ntstatus.h defines as STATUS_THREADPOOL_HANDLE_EXCEPTION: the thread pool hit an exception on a handle it was waiting on, and Parameter[1] tells us that handle is 0000012c. Next, use the !handle command to list the handles the process currently owns:

Handle 00000118
  Type          IoCompletion
Handle 0000011c
  Type          Event
Handle 00000120
  Type          Thread
Handle 0000013c
  Type          File

The process handle table contains no handle 0000012c at all. So my guess was that handle 0000012c had already been closed, while TppWaiterpThread was still using the now-invalid handle, which triggered the exception.

So I wanted to collect information about how handles were allocated and released, and then use the dump to see how the invalid handle had been created and closed. Fortunately Chromium has an ActiveVerifier mechanism for tracking handle usage: it tracks the creation of a subset of handles and hooks CloseHandle. I added a LogHandle function of my own to StartTracking and to the CloseHandle hook; it records the StackTrace of each handle allocation and release, together with the handle value, into an environment variable. Since the Utility process creates and closes only a few handles, keeping this debug information in an environment variable is convenient and appropriate, and it can be read back out of the crash dump.
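
The original LogHandle is not reproduced in this article, so here is a minimal sketch of the idea using plain Win32 APIs; the environment-variable name and the record encoding are illustrative, not the shipped code:

// Sketch of the LogHandle idea, not the shipped implementation. It appends
// "<flag>:<handle>=,<addr>,<addr>,..." records to an environment variable so
// that they can be read back out of a crash dump later.
#include <windows.h>
#include <stdio.h>
#include <string>

namespace {

const char kHandleLogVar[] = "UC_HANDLE_TRACE";  // hypothetical variable name

// |flag| distinguishes the call site (StartTracking vs. the CloseHandle hook);
// the exact encoding used in the real build is not shown here.
void LogHandle(int flag, HANDLE handle) {
  // Capture the raw return addresses of the current call stack.
  void* frames[16] = {};
  USHORT count = ::CaptureStackBackTrace(1, 16, frames, nullptr);

  char buf[32];
  snprintf(buf, sizeof(buf), "%d:%08X=", flag,
           static_cast<unsigned>(reinterpret_cast<ULONG_PTR>(handle)));
  std::string record = buf;
  for (USHORT i = 0; i < count; ++i) {
    snprintf(buf, sizeof(buf), ",%08X",
             static_cast<unsigned>(reinterpret_cast<ULONG_PTR>(frames[i])));
    record += buf;
  }
  record += "\n";

  // Append to whatever has been logged so far; the environment block lives in
  // process memory, so the records are visible in the dump.
  char existing[8192] = {};
  ::GetEnvironmentVariableA(kHandleLogVar, existing, sizeof(existing));
  ::SetEnvironmentVariableA(kHandleLogVar, (existing + record).c_str());
}

}  // namespace

StartTracking and the CloseHandle hook then simply call LogHandle with the handle they are about to track or close.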

We shipped this instrumentation in a staged rollout and collected the crashes. This time it did capture useful information. First, the dump analysis:

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 7759c7fd (ntdll!TppWaiterpThread+0x00000632)
   ExceptionCode: c000070a
  ExceptionFlags: 00000000
NumberParameters: 5
   Parameter[0]: c0000008
   Parameter[1]: 00000198
   Parameter[2]: 005e9888
   Parameter[3]: 00662b08
   Parameter[4]: 5f067ed0

This time the faulting handle value is 00000198. Now look at the debug records stored in the environment variable:

0:00000184=,5F07FDB7,5F03C4C6,5F07E9E6,5F07EC8C,5F3DDBF0,5F3DDBD7,6011A1ED,5F02222F,5F021549,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
0:00000194=,5F07FDB7,5F03C4C6,5F06001A,5F0977F4,5F063951,5F0171BB,5F02207B,5F021FE9,5F021540,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
0:00000198=,5F07FDB7,5F03C4C6,5F06001A,5F74B8D1,5F0171B0,5F02207B,5F021FE9,5F021540,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
0:0000019C=,5F07FDB7,5F03C4C6,5F06001A,5F06DCA8,5F74B8C9,5F0171B0,5F02207B,5F021FE9,5F021540,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
0:000001A0=,5F07FDB7,5F03C4C6,5F06001A,5F06DC71,5F74B8C9,5F0171B0,5F02207B,5F021FE9,5F021540,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
0:000001AC=,0097EAD7,00969DB6,00952C2E,0099D313,0099E115,5F017186,5F02207B,5F021FE9,5F021540,5E7E626E,009543C0,0095284B,009B2089,751F33AA,77539F72,77539F45
1:00000124=,0097EAD7,0096A330,0096A459,00952C8A,0096BE72,0096BC87,0096BB3A,0095492D,009545A0,009544F8,0095438B,0095284B,009B2089,751F33AA,77539F72,77539F45

Handle 00000198 had indeed been passed to CloseHandle. Resolving the addresses that follow it against the dump (for example with WinDbg's ln command, which maps an address to the nearest symbol) gives the closing call stack:

5F07FDB7  chrome_child!base::debug::StackTrace::StackTrace+0x17   stack_trace_win.cc(242)
5F03C4C6  chrome_child!base::win::HandleTraits::CloseHandle+0x16  scoped_handle.cc(323)
5F06001A  chrome_child!base::WaitableEvent::~WaitableEvent+0x2a   waitable_event_win.cc(36)
5F74B8D1  chrome_child!content::ChildProcess::~ChildProcess+0x73  child_process.cc(79)
5F0171B0  chrome_child!content::UtilityMain+0xe2                  main.cc(68)+0x13

The code shows that handle 00000198 was closed, and thus became invalid, during the destruction of ChildProcess; any code that keeps using this handle afterwards is in trouble.

Handle 00000198 is the shutdown_event_ member of ChildProcess. ChildThreadImpl passes this shutdown_event_ to the channel_ member of its IPC::SyncChannel, and SyncChannel's shutdown_watcher_ calls the system API RegisterWaitForSingleObject so that the thread pool waits on the Event behind handle 00000198.

So the root cause is that ChildProcess's destructor called CloseHandle on shutdown_event_ while SyncChannel's shutdown_watcher_, via RegisterWaitForSingleObject, was still waiting on that handle. MSDN's documentation for RegisterWaitForSingleObject says: "If this handle is closed while the wait is still pending, the function's behavior is undefined."
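
The hazard can be reduced to a few lines. The following is a standalone sketch, not browser code: it registers a thread-pool wait on an event and then closes the event while the wait is still pending, which is exactly the undefined behavior MSDN warns about and which, in our dumps, surfaced as STATUS_THREADPOOL_HANDLE_EXCEPTION inside ntdll!TppWaiterpThread:

// Standalone illustration of the undefined behavior described above.
#include <windows.h>
#include <stdio.h>

static VOID CALLBACK OnShutdownEvent(PVOID /*context*/, BOOLEAN /*timed_out*/) {
  // The event is never signaled in this sketch, so this should not run.
}

int main() {
  HANDLE shutdown_event = ::CreateEventW(nullptr, TRUE, FALSE, nullptr);

  // Roughly what SyncChannel's shutdown_watcher_ does with shutdown_event_.
  HANDLE wait = nullptr;
  ::RegisterWaitForSingleObject(&wait, shutdown_event, OnShutdownEvent, nullptr,
                                INFINITE, WT_EXECUTEDEFAULT);

  // Roughly what ChildProcess::~ChildProcess does while that wait is still
  // registered: the thread pool is now waiting on a stale handle value.
  ::CloseHandle(shutdown_event);

  // Give the thread-pool waiter a chance to touch the stale handle. Whether
  // and how this fails is undefined; it may also appear to work.
  ::Sleep(5000);
  printf("no crash this run; the behavior is undefined, not guaranteed\n");
  return 0;
}

The safe order is the opposite: tear the wait down first with UnregisterWaitEx (or UnregisterWait) and only then close the handle.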

Chromium finally fixed this bug in commit 54d831564ddf3b5f69adee4dfc35e75cb8c419d3 from May 2017; evidently it took them quite a while to get to the bottom of it as well.