Debugging kernel suspend-resume

Debugging suspend-resume in the Linux kernel is challenging. Suspending a core involves turning peripherals off or putting them in a low-power mode. That includes the UART, so printks and the console are ruled out as debugging aids beyond a certain point in the suspend path. This is akin to early bootloader or kernel debugging.

Linux kernel tools for debugging suspend

The Linux kernel provides some facilities for debugging PM (power management) modes, including suspend.

- Pass the command-line option no_console_suspend to the kernel

With this, the console is not suspended (when the tty layer suspends), so we can continue to see printks and debug logs from other drivers' suspend code until very late in the suspend code path.

- Test modes with the kernel sysfs entries /sys/power/pm_test and /sys/power/state

Various stages of suspend (like process freezing, device suspend, platform suspend and core) can be tested by writing the respective test level to /sys/power/pm_test and then starting a suspend through /sys/power/state. The suspend path is executed only up to that stage before 'resume' happens.
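As a rough sketch of how such a test cycle could be driven from user space: the kernel's basic-pm-debugging document describes writing a test level (freezer, devices, platform, processors, core, or none) to /sys/power/pm_test and then triggering the suspend through /sys/power/state. The helper names write_sysfs and pm_test_cycle below are made up for illustration, and the sketch assumes a kernel built with PM debugging support (CONFIG_PM_DEBUG) and a process running as root.

```c
#include <stdio.h>

/* Hypothetical helper: write a string to a sysfs-style file.
 * Returns 0 on success, -1 on failure. */
int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = (fputs(value, f) >= 0);

    /* Buffered data reaches the kernel on close, so a rejected value
     * may only show up as an fclose() error. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Hypothetical helper: run one suspend cycle that stops after the given
 * test level. The paths are parameters so the logic can be tried on
 * ordinary files first. */
int pm_test_cycle(const char *pm_test_path, const char *state_path,
                  const char *level, const char *state)
{
    if (write_sysfs(pm_test_path, level) != 0)
        return -1;
    return write_sysfs(state_path, state);
}
```

On a target, a call such as pm_test_cycle("/sys/power/pm_test", "/sys/power/state", "devices", "mem") would be expected to suspend only up to the device stage and then resume. Writing "none" to /sys/power/pm_test afterwards returns the kernel to normal suspend behaviour.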

More details can be found at www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Debugging using Hardware mechanisms

Most of the interesting suspend issues I faced were in either the late suspend or the early resume code path, and were caused mostly by hardware or processor-related issues rather than pure software bugs. These are typically the most challenging because the errors are sporadic, unpredictable and hard to reproduce.

I have found GPIOs and hardware debuggers like JTAG to be very useful for late-stage suspend, especially when all the drivers and the platform have been suspended and we enter code that runs outside of DDR (from the static-RAM-based IRAM in the iMX architecture). The IRAM code implements late suspend and early resume and essentially does the following:

- Put DDR in self refresh so that DDR controller can be put in low power mode/turned off.

- Switch ARM core PLL to run from a base Oscillator clock that stays on during suspend.

- Turn off/switch some SoC PLLs and put their regulators into bypass mode.

- Invalidate, flush and disable caches (from closest to furthest, i.e., L1 first, then L2).

- Finally, put the ARM core to sleep by executing the ARM WFI (wait for interrupt) instruction.

This is all assembly code (the code is hand-written and mapped into the small IRAM address space) with a lot of tight loops (e.g., waiting for a PLL change to take effect by looping until a bit is set). A hang here would cause the device not to suspend fully or, worse, not to wake up.

Good old GPIOs can be used to nail down the particular instruction, or JTAG can be used to single-step.
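To make the GPIO idea concrete, here is a minimal sketch in C of bracketing a polling loop with a GPIO marker. The real IRAM code is assembly, so this only mirrors the logic. The function names gpio_mark and wait_bit_marked are invented for illustration, and the register pointers stand in for whatever memory-mapped GPIO data register and PLL status register the SoC actually has.

```c
#include <stdint.h>

/* Drive one GPIO line high or low through its data register.
 * 'pin' must be in the range 0..31. */
void gpio_mark(volatile uint32_t *data_reg, unsigned pin, int level)
{
    if (level)
        *data_reg |= (UINT32_C(1) << pin);
    else
        *data_reg &= ~(UINT32_C(1) << pin);
}

/* Poll until any bit in 'mask' is set in the status register, raising a
 * GPIO before the wait and lowering it afterwards. If the loop hangs,
 * the pin stays high, which is visible on a scope or logic analyzer. */
void wait_bit_marked(volatile uint32_t *status_reg, uint32_t mask,
                     volatile uint32_t *gpio_data_reg, unsigned pin)
{
    gpio_mark(gpio_data_reg, pin, 1);   /* marker up: entering the wait */

    while ((*status_reg & mask) == 0)
        ;                               /* spin until the bit is set */

    gpio_mark(gpio_data_reg, pin, 0);   /* marker down: wait completed */
}
```

Moving the marker pair from one loop to the next (or using a different pin per loop) narrows a hang down to a single wait, and from there to a single instruction.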

Debugging early resume with JTAG is tricky because JTAG communication with the processor is lost while it is suspended. When a button press or some other mechanism triggers device wake-up/resume, by the time JTAG is reconnected we would already have executed the problematic code and hung!

My favorite trick is to add a simple tight loop just after the ARM WFI instruction, so the code resumes and spins there, allowing one to attach JTAG, break out of the loop by modifying the register that holds the loop condition, and then single-step. This also helps root-cause genuine hardware problems that prevent wake interrupts from arriving in the first place, so that the core never comes out of WFI.
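The shape of that trick, sketched in C rather than the hand-written assembly it would really live in: the core executes WFI, and on wake-up it parks in a loop until a debugger changes the loop condition. In this sketch the condition is a volatile word in memory instead of a register, and BARE_METAL_ARM is a made-up build macro so that the WFI is only emitted for the target, leaving the rest compilable on a host. The names resume_debug_hold and suspend_and_park are likewise illustrative.

```c
#include <stdint.h>

/* Nonzero means "park after wake-up"; a debugger writes 0 to continue. */
volatile uint32_t resume_debug_hold = 1;

/* Execute WFI on the target; a no-op elsewhere so the sketch still builds. */
static inline void cpu_wait_for_interrupt(void)
{
#if defined(BARE_METAL_ARM)            /* hypothetical build flag */
    __asm__ volatile("wfi" ::: "memory");
#endif
}

/* Sleep, then spin until *hold becomes zero. With JTAG attached, halt the
 * core inside the loop, write 0 to *hold, and single-step from there. */
void suspend_and_park(volatile uint32_t *hold)
{
    cpu_wait_for_interrupt();          /* core sleeps until a wake interrupt */

    while (*hold != 0)
        ;                              /* parked: waiting for the debugger */
}
```

If the debugger never finds the core in the loop, the wake interrupt most likely never reached it, which points at the hardware or interrupt configuration rather than the resume code.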

Got the idea? The same can be applied to early U-Boot or kernel debugging when the console is not yet set up.
