INFO: task $NAME:$PID blocked for more than 120 seconds

Sometimes the Linux kernel prints a message like this:

INFO: task $NAME:$PID blocked for more than 120 seconds.
      Not tainted 4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/vdt-8      D   0   390      2 0x00000000
 0000000000000046 ffff9d20ccfea800 0000000000000000 ffff9d20e9a3d100
 ffff9d20ffc98980 ffff9d20ed7a0040 ffffafc302b03b30 ffffffff8de17609
 0000000000000001 00ffffff8db05a84 ffff9d20ffc98980 ffff9d20e743b5e0
Call Trace:
 [<ffffffff8de17609>] ? __schedule+0x239/0x6f0
 [<ffffffff8de182c0>] ? bit_wait+0x50/0x50
 [<ffffffff8de17af2>] ? schedule+0x32/0x80
 [<ffffffff8de1ae8d>] ? schedule_timeout+0x1dd/0x380
 [<ffffffff8de1c4e5>] ? __switch_to_asm+0x35/0x70
 [<ffffffff8de1c4f1>] ? __switch_to_asm+0x41/0x70
 [<ffffffff8de1c4e5>] ? __switch_to_asm+0x35/0x70
 [<ffffffff8de1c4f1>] ? __switch_to_asm+0x41/0x70
 [<ffffffff8de1c4e5>] ? __switch_to_asm+0x35/0x70
 [<ffffffff8de1c4f1>] ? __switch_to_asm+0x41/0x70
 [<ffffffff8de1c4e5>] ? __switch_to_asm+0x35/0x70
 [<ffffffff8d8f19de>] ? ktime_get+0x3e/0xb0
 [<ffffffff8de182c0>] ? bit_wait+0x50/0x50
 [<ffffffff8de1736d>] ? io_schedule_timeout+0x9d/0x100
 [<ffffffff8d8bd737>] ? prepare_to_wait+0x57/0x80
 [<ffffffff8de182d7>] ? bit_wait_io+0x17/0x60
 [<ffffffff8de17e95>] ? __wait_on_bit+0x55/0x80
 [<ffffffff8de182c0>] ? bit_wait+0x50/0x50
 [<ffffffff8de17ffe>] ? out_of_line_wait_on_bit+0x7e/0xa0
 [<ffffffff8d8bdba0>] ? wake_atomic_t_function+0x60/0x60
 [<ffffffffc041afe9>] ? jbd2_journal_commit_transaction+0xf59/0x17c0 [jbd2]
 [<ffffffffc041fc72>] ? kjournald2+0xc2/0x260 [jbd2]
 [<ffffffff8d8bdb00>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffffc041fbb0>] ? commit_timeout+0x10/0x10 [jbd2]
 [<ffffffff8d89abc9>] ? kthread+0xd9/0xf0
 [<ffffffff8de1c4f1>] ? __switch_to_asm+0x41/0x70
 [<ffffffff8d89aaf0>] ? kthread_park+0x60/0x60
 [<ffffffff8de1c577>] ? ret_from_fork+0x57/0x70

If you see it: don’t panic! The Linux kernel did not either.

First of all, notice the INFO prefix: it is an informational message, not a WARNING, ERROR, or fatal OOPS. The Linux kernel informs you that something might be wrong, but it can continue with other tasks and might recover fully from this situation.
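To see how the hung-task check is configured on a machine, and which tasks it has already reported, something like the following may help. It is only a sketch: the helper name blocked_tasks is made up for this example, and it assumes a kernel built with hung-task detection, so that the files under /proc/sys/kernel exist.

```shell
# Show the hung-task settings, if this kernel exposes them.
for f in /proc/sys/kernel/hung_task_timeout_secs /proc/sys/kernel/hung_task_warnings; do
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$f" "$(cat "$f")"
    fi
done

# blocked_tasks (hypothetical helper): read kernel log text on stdin and
# print "name pid seconds" for every hung-task report found in it.
blocked_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Usage (reading the kernel log may require root):
#   dmesg | blocked_tasks
```

For the message shown at the top, this would print the task name jbd2/vdt-8, its PID 390, and the timeout of 120 seconds.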

Most of the time this happens because Input/Output (IO) is slow for some reason. Reasons include:

  1. You use NFS (Network File System) and the server is currently not responding in time due to a network issue.
  2. You use iSCSI or any other network-attached storage or network file system.
  3. This is a virtual machine and the IO system on the host is currently very slow (due to some other VM activity).
  4. You have a backup battery failure on your RAID controller: the controller then has to wait for the data to actually hit the disks instead of using its write cache.
  5. You write a very large file to a slow disk like a USB stick: the Linux kernel will cache the file in RAM and then has to wait a long time in the final sync() call until the data is fully written. Use dd oflag=direct bs=4M to bypass the kernel's page cache.
  6. You use a 32-bit Linux kernel on a system with more than 4 GiB of RAM. Even though the Physical Address Extension (PAE) allows the kernel to use more memory, this is not recommended, as all Linux 4.x and later kernels have a latent bug.
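For the slow-disk case (reason 5), the amount of data still waiting in RAM can be watched while a copy is running. This is a minimal sketch, assuming a Linux /proc/meminfo. The name dirty_kb is invented here, and the file names in the dd comment are placeholders.

```shell
# dirty_kb (hypothetical helper): sum the Dirty and Writeback counters
# (in kB) from meminfo-formatted text on stdin.
dirty_kb() {
    awk '/^(Dirty|Writeback):/ { sum += $2 } END { print sum + 0 }'
}

# Print the current value once, if /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    printf 'waiting to be written: %s kB\n' "$(dirty_kb < /proc/meminfo)"
fi

# Writing with direct IO keeps this number small, at the price of a
# slower-looking copy (image.iso and /dev/sdX are placeholders):
#   dd if=image.iso of=/dev/sdX oflag=direct bs=4M
```

If the number stays in the range of gigabytes while the target is a slow device, a long wait in the final sync() is to be expected.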

In the example above, this is confirmed by two lines of the stack trace:

… io_schedule_timeout…
… jbd2_journal_commit_transaction…

The first line shows that the task is waiting for IO. The last line shows that this IO belongs to a journal commit, that is, a write operation was scheduled and the Linux kernel now waits for its completion. When the blocking issue is resolved, the Linux kernel will continue normally.
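Tasks that are stuck like this sit in the uninterruptible sleep state, shown as D in the task line of the kernel message and in ps. The following sketch filters ps output for them. The helper name d_state_tasks is made up, and it assumes a ps that understands the state, pid and comm output columns, as procps does.

```shell
# d_state_tasks (hypothetical helper): read "state pid command" lines on
# stdin and print the PID and command of every task in state D.
d_state_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage:
#   ps -eo state,pid,comm | d_state_tasks
#
# On kernels that provide it, root can also look at the kernel stack of
# one of the listed PIDs (390 is the PID from the example message):
#   cat /proc/390/stack
```

A task that shows up in state D for a moment is normal. The same PID staying there for minutes matches the message above.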

What makes this situation more severe is that other IO operations will almost certainly block as well: in most cases you will no longer be able to log in to the affected server, as ssh and syslog schedule write operations of their own and also get stuck waiting for the same IO device.

Restarting the server (or the VM) itself will get you out of the immediate situation, but often you will find yourself back in the same situation again after a short time. So please take some time and find the real triggering cause.
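One place to start looking for the cause is whether the machine depends on network storage at all (reasons 1 and 2 above). The sketch below lists network file system mounts. The helper name network_mounts and the short list of file system types are assumptions of this example and are not exhaustive.

```shell
# network_mounts (hypothetical helper): read /proc/mounts-formatted text
# on stdin and print "fstype mountpoint" for NFS and CIFS mounts.
network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs)$/ { print $3, $2 }'
}

# Print the current network mounts, if /proc/mounts is available.
if [ -r /proc/mounts ]; then
    network_mounts < /proc/mounts
fi
```

If the blocked task from the kernel message works on one of the listed mount points, the server behind that mount is the first thing to check.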