Android WatchDog 工作原理

一、概述

Android系统中,有硬件 WatchDog 用于定时检测关键硬件是否正常工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件。WatchDog功能主要是分析系统核心服务和重要线程是否处于Blocked状态。

  • 监视reboot广播;
  • 监视mMonitors关键系统服务是否死锁。

二、WatchDog初始化

2.1 startOtherServices

[-> SystemServer.java]

  • private void startOtherServices() {
  • ...
  • //创建watchdog【见小节2.2】
  • final Watchdog watchdog = Watchdog.getInstance();
  • //注册reboot广播【见小节2.3】
  • watchdog.init(context, mActivityManagerService);
  • ...
  • mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY); //480
  • ...
  • mActivityManagerService.systemReady(new Runnable() {
  • public void run() {
  • mSystemServiceManager.startBootPhase(
  • SystemService.PHASE_ACTIVITY_MANAGER_READY);
  • ...
  • // watchdog启动【见小节3.1】
  • Watchdog.getInstance().start();
  • mSystemServiceManager.startBootPhase(
  • SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
  • }
  • }
  • }

system_server进程启动的过程中初始化WatchDog,主要有:

  • 创建watchdog对象,该对象本身继承于Thread;
  • 注册reboot广播;
  • 调用start()开始工作。

2.2 getInstance

[-> Watchdog.java]

  • public static Watchdog getInstance() {
  • if (sWatchdog == null) {
  • //单例模式,创建实例对象【见小节2.3 】
  • sWatchdog = new Watchdog();
  • }
  • return sWatchdog;
  • }

2.3 创建Watchdog

[-> Watchdog.java]

  • public class Watchdog extends Thread {
  • //所有的HandlerChecker对象组成的列表,HandlerChecker对象类型【见小节2.3.1】
  • final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
  • ...
  • private Watchdog() {
  • super("watchdog");
  • //将前台线程加入队列
  • mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
  • "foreground thread", DEFAULT_TIMEOUT);
  • mHandlerCheckers.add(mMonitorChecker);
  • //将主线程加入队列
  • mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
  • "main thread", DEFAULT_TIMEOUT));
  • //将ui线程加入队列
  • mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
  • "ui thread", DEFAULT_TIMEOUT));
  • //将i/o线程加入队列
  • mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
  • "i/o thread", DEFAULT_TIMEOUT));
  • //将display线程加入队列
  • mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
  • "display thread", DEFAULT_TIMEOUT));
  • //【见小节2.3.2】
  • addMonitor(new BinderThreadMonitor());
  • }
  • }

Watchdog继承于Thread,创建的线程名为"watchdog"。mHandlerCheckers队列包括、
主线程,fg, ui, io, display线程的HandlerChecker对象。

2.3.1 HandlerChecker

[-> Watchdog.java]

  • public final class HandlerChecker implements Runnable {
  • private final Handler mHandler; //Handler对象
  • private final String mName; //线程描述名
  • private final long mWaitMax; //最长等待时间
  • //记录着监控的服务
  • private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
  • private boolean mCompleted; //开始检查时先设置成false
  • private Monitor mCurrentMonitor;
  • private long mStartTime; //开始准备检查的时间点
  • HandlerChecker(Handler handler, String name, long waitMaxMillis) {
  • mHandler = handler;
  • mName = name;
  • mWaitMax = waitMaxMillis;
  • mCompleted = true;
  • }
  • }

2.3.2 addMonitor

  • public class Watchdog extends Thread {
  • public void addMonitor(Monitor monitor) {
  • synchronized (this) {
  • ...
  • //此处mMonitorChecker数据类型为HandlerChecker
  • mMonitorChecker.addMonitor(monitor);
  • }
  • }
  • public final class HandlerChecker implements Runnable {
  • private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
  • public void addMonitor(Monitor monitor) {
  • //将上面的BinderThreadMonitor添加到mMonitors队列
  • mMonitors.add(monitor);
  • }
  • ...
  • }
  • }

监控Binder线程, 将monitor添加到HandlerChecker的成员变量mMonitors列表中。
在这里是将BinderThreadMonitor对象加入该线程。

  • private static final class BinderThreadMonitor implements Watchdog.Monitor {
  • public void monitor() {
  • Binder.blockUntilThreadAvailable();
  • }
  • }
  • blockUntilThreadAvailable最终调用的是IPCThreadState,等待有空闲的binder线程
  • void IPCThreadState::blockUntilThreadAvailable()
  • {
  • pthread_mutex_lock(&mProcess->mThreadCountLock);
  • while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
  • //等待正在执行的binder线程小于进程最大binder线程上限(16个)
  • pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
  • }
  • pthread_mutex_unlock(&mProcess->mThreadCountLock);
  • }

可见addMonitor(new BinderThreadMonitor())是将Binder线程添加到android.fg线程的handler(mMonitorChecker)来检查是否工作正常。

2.3 init

[-> Watchdog.java]

  • public void init(Context context, ActivityManagerService activity) {
  • mResolver = context.getContentResolver();
  • mActivity = activity;
  • //注册reboot广播接收者【见小节2.3.1】
  • context.registerReceiver(new RebootRequestReceiver(),
  • new IntentFilter(Intent.ACTION_REBOOT),
  • android.Manifest.permission.REBOOT, null);
  • }

2.3.1 RebootRequestReceiver

[-> Watchdog.java]

  • final class RebootRequestReceiver extends BroadcastReceiver {
  • @Override
  • public void onReceive(Context c, Intent intent) {
  • if (intent.getIntExtra("nowait", 0) != 0) {
  • //【见小节2.3.2】
  • rebootSystem("Received ACTION_REBOOT broadcast");
  • return;
  • }
  • Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
  • }
  • }

2.3.2 rebootSystem

[-> Watchdog.java]

  • void rebootSystem(String reason) {
  • Slog.i(TAG, "Rebooting system because: " + reason);
  • IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
  • try {
  • //通过PowerManager执行reboot操作
  • pms.reboot(false, reason, false);
  • } catch (RemoteException ex) {
  • }
  • }

最终是通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述。

三、Watchdog检测机制

当调用Watchdog.getInstance().start()时,则进入线程“watchdog”的run()方法, 该方法分成两部分:

  • 前半部 [小节3.1] 用于监测是否触发超时;
  • 后半部 [小节4.1], 当触发超时则输出各种信息。

3.1 run

[-> Watchdog.java]

  • public void run() {
  • boolean waitedHalf = false;
  • while (true) {
  • final ArrayList<HandlerChecker> blockedCheckers;
  • final String subject;
  • final boolean allowRestart;
  • int debuggerWasConnected = 0;
  • synchronized (this) {
  • long timeout = CHECK_INTERVAL; //CHECK_INTERVAL=30s
  • for (int i=0; i<mHandlerCheckers.size(); i++) {
  • HandlerChecker hc = mHandlerCheckers.get(i);
  • //执行所有的Checker的监控方法, 每个Checker记录当前的mStartTime[见小节3.2]
  • hc.scheduleCheckLocked();
  • }
  • if (debuggerWasConnected > 0) {
  • debuggerWasConnected--;
  • }
  • long start = SystemClock.uptimeMillis();
  • //通过循环,保证执行30s才会继续往下执行
  • while (timeout > 0) {
  • if (Debug.isDebuggerConnected()) {
  • debuggerWasConnected = 2;
  • }
  • try {
  • wait(timeout); //触发中断,直接捕获异常,继续等待.
  • } catch (InterruptedException e) {
  • Log.wtf(TAG, e);
  • }
  • if (Debug.isDebuggerConnected()) {
  • debuggerWasConnected = 2;
  • }
  • timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
  • }
  • //评估Checker状态【见小节3.3】
  • final int waitState = evaluateCheckerCompletionLocked();
  • if (waitState == COMPLETED) {
  • waitedHalf = false;
  • continue;
  • } else if (waitState == WAITING) {
  • continue;
  • } else if (waitState == WAITED_HALF) {
  • if (!waitedHalf) {
  • //首次进入等待时间过半的状态
  • ArrayList<Integer> pids = new ArrayList<Integer>();
  • pids.add(Process.myPid());
  • //输出system_server和3个native进程的traces【见小节4.2】
  • ActivityManagerService.dumpStackTraces(true, pids, null, null,
  • NATIVE_STACKS_OF_INTEREST);
  • waitedHalf = true;
  • }
  • continue;
  • }
  • ... //进入这里,意味着Watchdog已超时【见小节4.1】
  • }
  • ...
  • }
  • }
  • public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
  • "/system/bin/mediaserver",
  • "/system/bin/sdcard",
  • "/system/bin/surfaceflinger"
  • };

该方法主要功能:

  1. 执行所有的Checker的监控方法scheduleCheckLocked()
    • 当mMonitor个数为0(除了android.fg线程之外都为0)且处于poll状态,则设置mCompleted = true;
    • 当上次check还没有完成, 则直接返回.
  2. 等待30s后, 再调用evaluateCheckerCompletionLocked来评估Checker状态;
  3. 根据waitState状态来执行不同的操作:
    • 当COMPLETED或WAITING,则相安无事;
    • 当WAITED_HALF(超过30s)且为首次, 则输出system_server和3个Native进程的traces;
    • 当OVERDUE, 则输出更多信息.

由此,可见当触发一次Watchdog, 则必然会调用两次AMS.dumpStackTraces, 也就是说system_server和3个Native进程的traces
的traces信息会输出两遍,且时间间隔超过30s.

3.2 scheduleCheckLocked

  • public final class HandlerChecker implements Runnable {
  • ...
  • public void scheduleCheckLocked() {
  • if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
  • mCompleted = true; //当目标looper正在轮询状态则返回。
  • return;
  • }
  • if (!mCompleted) {
  • return; //有一个check正在处理中,则无需重复发送
  • }
  • mCompleted = false;
  • mCurrentMonitor = null;
  • // 记录当下的时间
  • mStartTime = SystemClock.uptimeMillis();
  • //发送消息,插入消息队列最开头, 见下方的run()方法
  • mHandler.postAtFrontOfQueue(this);
  • }
  • public void run() {
  • final int size = mMonitors.size();
  • for (int i = 0 ; i < size ; i++) {
  • synchronized (Watchdog.this) {
  • mCurrentMonitor = mMonitors.get(i);
  • }
  • //回调具体服务的monitor方法
  • mCurrentMonitor.monitor();
  • }
  • synchronized (Watchdog.this) {
  • mCompleted = true;
  • mCurrentMonitor = null;
  • }
  • }
  • }

该方法主要功能: 向Watchdog的监控线程的Looper池的最头部执行该HandlerChecker.run()方法,在该方法中调用monitor(),执行完成后会设置mCompleted = true. 那么当handler消息池当前的消息,导致迟迟没有机会执行monitor()方法, 则会触发watchdog.

其中postAtFrontOfQueue(this),该方法输入参数为Runnable对象,根据消息机制,最终会回调HandlerChecker中的run方法,该方法会循环遍历所有的Monitor接口,具体的服务实现该接口的monitor()方法。

可能的问题,如果有其他消息不断地调用postAtFrontOfQueue()也可能导致watchdog没有机会执行;或者是每个monitor消耗一些时间,雷加起来超过1分钟造成的watchdog. 这些都是非常规的Watchdog.

3.3 evaluateCheckerCompletionLocked

  • private int evaluateCheckerCompletionLocked() {
  • int state = COMPLETED;
  • for (int i=0; i<mHandlerCheckers.size(); i++) {
  • HandlerChecker hc = mHandlerCheckers.get(i);
  • //【见小节3.4】
  • state = Math.max(state, hc.getCompletionStateLocked());
  • }
  • return state;
  • }

获取mHandlerCheckers列表中等待状态值最大的state.

3.4 getCompletionStateLocked

```java
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
// mWaitMax默认是60s
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;

top Created with Sketch.