This might be old school, and maybe even boring reading. But, if you concern about performance on Linux servers, at some point, you will have to have a look to the kernel messages.
When we run very stressful jobs running on large servers (large number of CPU’s and RAM memory), where IO activity is very high. It is pretty common to start seeing these messages on the ‘dmesg’ kernel output:
[24169.372862] INFO: task kswapd1:1140 blocked for more than 120 seconds. [24169.375623] Tainted: G E 4.9.51-10.52.amzn1.x86_64 #1 [24169.378445] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [24169.382533] kswapd1 D 0 1140 2 0x00000000 [24169.385066] ffff8811605c5a00 0000000000000000 ffff8823645dc900 ffff882362844900 [24169.389208] ffff882371357c00 ffffc9001a13ba08 ffffffff8153896c ffffc9001a13ba18 [24169.393329] ffff881163ac92d8 ffff88115e87f400 ffff882362844900 ffff88115e87f46c [24169.445313] Call Trace: [24169.446981] [<ffffffff8153896c>] ? __schedule+0x23c/0x680 [24169.449454] [<ffffffff81538de6>] schedule+0x36/0x80 [24169.451790] [<ffffffff8153907e>] schedule_preempt_disabled+0xe/0x10 [24169.454509] [<ffffffff8153a8d5>] __mutex_lock_slowpath+0x95/0x110 [24169.457156] [<ffffffff8153a967>] mutex_lock+0x17/0x27 [24169.459546] [<ffffffffa0966d00>] xfs_reclaim_inodes_ag+0x2b0/0x330 [xfs] [24169.462407] [<ffffffffa0967be3>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [24169.465203] [<ffffffffa0978f59>] xfs_fs_free_cached_objects+0x19/0x20 [xfs] ...
Why is this happening?
This indicates that the process requested a block device such as a disk/swap, and wasn’t able to be fulfilled for more than 120 seconds and subsequently abandoned by the kernel.
As mentioned before, the probability of observing this behavior will increase when we use instances with large number of vCores (e.g. 64+ vCores), given that the volume of IO requests could be higher, and, the kernel buffer queues not configured for such load.
This is also very common on clustered environments. Even though the framework will retry, on Hadoop/YARN environments, this will impact in performance, and, might also lead to application failures.
To solve this problem, we have to increase the ‘dirty_background_bytes‘ kernel setting to higher values to be able to accommodate the throughput.
As a base formula, we usually consider a value of dirty_background_bytes=10MB for 40MB/Sec throughput.
The goal is to have the page cache to allow the OS to write asynchronously to disk whenever possible. Otherwise, when the write becomes synchronous, it would involve a waiting time.
Additionally to the dirty_background_bytes kernel parameter, we can also set:
- dirty_background_ratio = 10 (represents the percentage of system memory which when dirty then system can start writing data to the disks. When the dirty pages exceed ratio 10, I/O starts, i.e they start getting flushed / written to the disk)
- dirty_ratio = 15 (is percentage of system memory which when dirty, the process doing writes would block and write out dirty pages to the disks. When total number of dirty pages exceed ratio 15, all writes get blocked until some of the dirty pages get written to disk. This is like an upper limit)
We try to keep these two values low to avoid I/O bottlenecks. We can experiment setting both down to zero, to force fast flushes on high stress environments.
Another parameter we can use is dirty_bytes, which represents the amount of dirty memory at which a process generating disk writes will itself start writeback. We should set dirty_ratio = 0 if we set dirty_bytes, as only one of these might be specified at a time.
dirty_bytes = 2147483648
More details on these parameters is available here.