Watchdogs

System watchdogs provide a mechanism for monitoring system health. Once a watchdog countdown timer is set up with an appropriate expiry period and service interval, software services the watchdog once every service interval so that it never expires. Servicing the watchdog restarts the countdown timer. When software cannot service the watchdog for some reason (for example, the system is hung), the timer counts down to 0 and triggers a system reset.
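To make the countdown-and-service cycle concrete, here is a minimal sketch in C that models the timer purely in software. The `wdt_sim` structure and all of its functions are hypothetical names made up for illustration; a real watchdog is a hardware block programmed through SoC-specific registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog countdown timer. */
struct wdt_sim {
    uint32_t expiry_ticks; /* reload value: ticks until expiry */
    uint32_t remaining;    /* current countdown value */
    bool     reset_fired;  /* latched once the countdown reaches 0 */
};

/* Arm the watchdog with the given expiry period. */
void wdt_sim_start(struct wdt_sim *w, uint32_t expiry_ticks)
{
    assert(expiry_ticks > 0);
    w->expiry_ticks = expiry_ticks;
    w->remaining = expiry_ticks;
    w->reset_fired = false;
}

/* Servicing the watchdog restarts the countdown. */
void wdt_sim_service(struct wdt_sim *w)
{
    if (!w->reset_fired)
        w->remaining = w->expiry_ticks;
}

/* Advance one tick; returns true once the reset would have fired. */
bool wdt_sim_tick(struct wdt_sim *w)
{
    if (w->reset_fired)
        return true;
    if (--w->remaining == 0)
        w->reset_fired = true;
    return w->reset_fired;
}

/*
 * Run for `ticks` ticks. Software services the watchdog every
 * `service_period` ticks, but stops doing so from tick `hang_at`
 * onwards, which models a hung system.
 */
bool wdt_sim_run(struct wdt_sim *w, uint32_t ticks,
                 uint32_t service_period, uint32_t hang_at)
{
    for (uint32_t t = 1; t <= ticks; t++) {
        if (wdt_sim_tick(w))
            return true;
        if (t < hang_at && t % service_period == 0)
            wdt_sim_service(w);
    }
    return false;
}
```

With an expiry of 10 ticks and a service period of 5, the model never resets as long as software keeps servicing it; once servicing stops, the reset fires one full expiry period after the last service.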

In hardware, the watchdog module has a line connected to the system reset, which it uses to trigger the reset.

This way, the system is restarted automatically when it has been hung or non-responsive for some time. This can be called a software-triggered watchdog (i.e., software failed to service the watchdog within the stipulated time).

In the case of hardware-triggered watchdogs, the watchdog reset is triggered when an unexpected hardware event or assert happens. One example is accessing a hardware module's register while that module's clock is off. For example, say the dynamic power management code has decided to turn off the PLLs feeding the USB module because of bus inactivity, and some driver tries to access USB registers right after that point. The access causes a memory bus assert, which triggers a watchdog reset.
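The clock-off access can be pictured with a tiny model. This is only a sketch under made-up names (`hw_module`, `module_read`, `BUS_FAULT`); on real hardware there is no return code to check, the bus assert simply happens and the watchdog reset follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a peripheral whose registers need a running clock. */
struct hw_module {
    bool     clock_on; /* false once power management has gated the clock */
    uint32_t reg;      /* a single register, for illustration */
};

enum bus_result { BUS_OK, BUS_FAULT };

/*
 * Model of a register read. With the clock off, the access cannot
 * complete; the model reports BUS_FAULT where real hardware would
 * raise a bus assert and trigger the watchdog reset.
 */
enum bus_result module_read(const struct hw_module *m, uint32_t *out)
{
    assert(m != NULL && out != NULL);
    if (!m->clock_on)
        return BUS_FAULT;
    *out = m->reg;
    return BUS_OK;
}
```

The point of the model is the ordering problem: the driver's read is perfectly valid code, and it only becomes fatal because power management gated the clock a moment earlier.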

Debugging the issues that cause watchdog resets is one of the more challenging stability problems. Software-triggered watchdogs, though necessary in a hung or non-responsive case, pose a stability issue of their own: they disrupt and inconvenience the user whenever they fire unpredictably.

With the Linux kernel's dynamic power management aggressively entering low-power idle states, together with the associated platform/SoC dynamic voltage and frequency scaling code, a lot of stability issues surface because of buggy driver and platform code that would otherwise work fine. Add flaky hardware components (did I say DDR chips?) to the equation.

Debugging and root-causing is a nightmare because watchdog occurrences are random and sporadic, and hard to reproduce with any test.

Instrumenting the code is futile because when the watchdog fires, all the traces are lost. Of course, if the watchdog reset is wired to a warm reset, life is relatively easier!
If tracing is possible, the ARM ETM/ETB (Embedded Trace Macrocell with its on-chip Embedded Trace Buffer) can be used; it stores the last instructions leading up to the watchdog, giving us a clue.
Also, scratch registers in the processor can be used for instrumentation. After a warm reset, the registers can be read back to see how far the code got before the watchdog fired.
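The scratch-register idea can be sketched as a simple checkpoint scheme. Everything here is an assumption for illustration: the checkpoint names, the `CKPT_MAGIC` tag, and above all `fake_scratch_reg`, which is an ordinary variable standing in for a real SoC scratch register that survives warm reset.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for a scratch register that survives a warm reset. On a
 * real SoC this would be a volatile pointer to a platform-specific
 * address; a plain variable is used so the sketch runs anywhere.
 */
static volatile uint32_t fake_scratch_reg;
#define SCRATCH_REG (&fake_scratch_reg)

/* Upper 16 bits tag the value so leftover garbage is not misread. */
#define CKPT_MAGIC 0xC0DE0000u

/* Hypothetical checkpoints along a suspected code path. */
enum checkpoint {
    CKPT_NONE = 0,
    CKPT_SUSPEND_ENTER = 1,
    CKPT_CLOCKS_OFF = 2,
    CKPT_RESUME_DONE = 3
};

/* Record how far the code has progressed. */
void checkpoint_set(enum checkpoint c)
{
    assert(((uint32_t)c & 0xFFFF0000u) == 0);
    *SCRATCH_REG = CKPT_MAGIC | (uint32_t)c;
}

/* Read back after reset; CKPT_NONE if no valid checkpoint is stored. */
enum checkpoint checkpoint_last(void)
{
    uint32_t v = *SCRATCH_REG;
    if ((v & 0xFFFF0000u) != CKPT_MAGIC)
        return CKPT_NONE;
    return (enum checkpoint)(v & 0xFFFFu);
}
```

Early boot code would call `checkpoint_last()` before anything overwrites the register; the value it returns is the last checkpoint reached before the watchdog fired.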

On the architecture I worked on, warm reset was not possible because of our PMIC and board design, which made watchdogs extremely hard to debug.

One trick that I used was to disable the watchdog reset and use GPIOs for instrumentation. A watchdog would then cause a hard hang instead of a reset, and probing the GPIO voltage levels at that point helps determine which code path executed leading up to the watchdog. This is useful once we have narrowed the problem down to a certain code path or function.
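A rough sketch of the GPIO trick: encode a small step number on a few spare pins, then read the pin levels with a probe after the hang. The pin assignment (three pins starting at GPIO 4) and the `fake_gpio_out` variable standing in for the GPIO data-out register are assumptions for illustration, not the actual board setup.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the SoC's GPIO data-out register (hypothetical). */
static volatile uint32_t fake_gpio_out;

/* Assumed wiring: three spare trace pins at GPIO 4, 5 and 6. */
#define TRACE_GPIO_SHIFT 4u
#define TRACE_GPIO_BITS  3u
#define TRACE_GPIO_MASK  (((1u << TRACE_GPIO_BITS) - 1u) << TRACE_GPIO_SHIFT)

/* Drive the trace pins with a step number, leaving other pins alone. */
void trace_gpio_mark(uint32_t step)
{
    assert(step < (1u << TRACE_GPIO_BITS));
    uint32_t v = fake_gpio_out;
    v &= ~TRACE_GPIO_MASK;
    v |= step << TRACE_GPIO_SHIFT;
    fake_gpio_out = v;
}

/* Turn probed pin levels back into the step number. */
uint32_t trace_gpio_decode(uint32_t pin_levels)
{
    return (pin_levels & TRACE_GPIO_MASK) >> TRACE_GPIO_SHIFT;
}
```

Calls such as `trace_gpio_mark(1)`, `trace_gpio_mark(2)` and so on are sprinkled along the suspected path. Since the watchdog reset is disabled, the pins hold their last value after the hard hang, and three pins are enough to tell eight steps apart.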

