Wednesday 12 December 2018

Linux I/O Schedulers

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance while also trying to ensure the I/O requests are "fairly" shared among the I/O consumers.  There are several I/O schedulers in Linux, each try to solve the I/O scheduling issues using different mechanisms/heuristics and each has their own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency.  However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time.   Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.

Balancing the fairness is also an interesting issue.  A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O.  Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all round I/O scheduler.  The defaults being used are chosen to be a good best choice for the general user, however, this may not match everyone's needs.   To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers.  Caveat emptor applies, these are just guidelines and should be used as a starting point to finding the best I/O scheduler for your particular need.

Saturday 1 December 2018

New features in Forkstat

Forkstat is a simple utility I wrote a while ago that can trace process activity using the rather useful Linux NETLINK_CONNECTOR API.   Recently I have added two extra features that may be of interest:

1.  Improved output using some UTF-8 glyphs.  These are used to show process parent/child relationships and various process events, such as termination, core dumping and renaming.   Use the new -g (glyph) option to enable this mode. For example:


In the above example, the program "wobble" was started and forks off a child process.  The parent then renames itself to wibble (indicated by a turning arrow). The child then segfaults and generates a core dump (indicated by a skull and crossbones), triggering apport to investigate the crash.  After this, we observe NetworkManager creating a thread that runs for a very short period of time.   This kind of activity is normally impossible to spot while running conventions tools such as ps or top.

2. By default, forkstat will show the process name using the contents of /proc/$PID/cmdline.  The new -c option allows one to instead use the 16 character task "comm" field, and this can be helpful for spotting process name changes on PROC_EVENT_COMM events.

These are small changes, but I think they make forkstat more useful.  The updated forkstat will be available in Ubuntu 19.04 "Disco Dingo".

Thursday 22 November 2018

High-level tracing with bpftrace

Bpftrace is a new high-level tracing language for Linux using the extended Berkeley packet filter (eBPF).  It is a very powerful and flexible tracing front-end that enables systems to be analyzed much like DTrace.

The bpftrace tool is now installable as a snap. From the command line one can install it and enable it to use system tracing as follows:

 sudo snap install bpftrace  
 sudo snap connect bpftrace:system-trace  

To illustrate the power of bpftrace, here are some simple one-liners:

 # trace openat() system calls
 sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s %s\n", pid, comm, str(args->filename)); }'   
 Attaching 1 probe...  
 1080 irqbalance /proc/interrupts  
 1080 irqbalance /proc/stat  
 2255 dmesg /etc/ld.so.cache  
 2255 dmesg /lib/x86_64-linux-gnu/libtinfo.so.5  
 2255 dmesg /lib/x86_64-linux-gnu/librt.so.1  
 2255 dmesg /lib/x86_64-linux-gnu/libc.so.6  
 2255 dmesg /lib/x86_64-linux-gnu/libpthread.so.0  
 2255 dmesg /usr/lib/locale/locale-archive  
 2255 dmesg /lib/terminfo/l/linux  
 2255 dmesg /home/king/.config/terminal-colors.d  
 2255 dmesg /etc/terminal-colors.d  
 2255 dmesg /dev/kmsg  
 2255 dmesg /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache  

 # count system calls using tracepoints:  
 sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'  
 @[tracepoint:syscalls:sys_enter_getsockname]: 1  
 @[tracepoint:syscalls:sys_enter_kill]: 1  
 @[tracepoint:syscalls:sys_enter_prctl]: 1  
 @[tracepoint:syscalls:sys_enter_epoll_wait]: 1  
 @[tracepoint:syscalls:sys_enter_signalfd4]: 2  
 @[tracepoint:syscalls:sys_enter_utimensat]: 2  
 @[tracepoint:syscalls:sys_enter_set_robust_list]: 2  
 @[tracepoint:syscalls:sys_enter_poll]: 2  
 @[tracepoint:syscalls:sys_enter_socket]: 3  
 @[tracepoint:syscalls:sys_enter_getrandom]: 3  
 @[tracepoint:syscalls:sys_enter_setsockopt]: 3  
 ...  

Note that it is recommended to use bpftrace with Linux 4.9 or higher.

The bpftrace github project page has an excellent README guide with some worked examples and is a very good place to start.  There is also a very useful reference guide and one-liner tutorial too.

If you have any useful btftrace one-liners, it would be great to share them. This is an amazingly powerful tool, and it would be interesting to see how it will be used.

Wednesday 3 October 2018

Static Analysis Trends on Linux Next

I've been running static analysis using CoverityScan on linux-next for 2 years with the aim to find bugs (and try to fix some) before they are merged into Linux.  I have also been gathering the defect count data and tracking the defect trends:
As one can see from above, CoverityScan has found a considerable amount of defects and these are being steadily fixed by the Linux developer community.  The encouraging fact is that the outstanding issues are reducing over time. Some of the spikes in the data are because of changes in the analysis that I'm running (e.g. getting more coverage), but even so, one can see a definite trend downwards in the total defects in the Kernel.

With static analysis, some of these reported defects are false positives or corner cases that are in fact impossible to occur in real life and I am slowly working through these and annotating them so they don't get reported in the defect count.

It must be also noted that over these two years the kernel has grown from around 14.6 million to 17.1 million lines of code so the defect count has dropped from 1 defect in every ~2100 lines to 1 defect in every ~3000 lines over the past 2 years.  All in all, it is a remarkable improvement for such a large and complex codebase that is growing in size at such rate.

Monday 16 July 2018

Comparing Latencies and Power consumption with various CPU schedulers

The low-latency kernel offering with Ubuntu provides a kernel tuned for low-latency environments using low-latency kernel configuration options.  The x86 kernels by default run with the Intel-Pstate CPU scheduler set to run with the powersave scaling governor biased towards power efficiency.

While power efficiency is fine for most use-cases, it can introduce latencies due to the fact that the CPU can be running at a low frequency to save power and also switching from a deep C state when idle to a higher C state when servicing an event can also increase on latencies.

In a somewhat contrived experiment, I rigged up an i7-3770 to collect latency timings of clock_nanosleep() wake-ups with timer event coalescing disabled (timer_slack set to zero) over 60 seconds across a range of CPU scheduler and governor settings on a 4.15 low-latency kernel.  This can be achieved using stress-ng, for example:

 sudo stress-ng --cyclic 1 --cyclic-dist 100 –cyclic-sleep=10000 --cpu 1 -l 0 -v \
--cyclic-policy rr --cyclic-method clock_ns --cpu 0 -t 60 --timer-slack 0  

..the above runs a cyclic measurement collecting latency counts in 100ns buckets with a clock_nanosecond wakeup interval of 10,000 nanoseconds with zero % load CPU stressor and timer slack set to 0 nanoseconds.  This dumps latency distribution stats that can be plotted to see where the modal latency points occur and the latency characteristics of the CPU scheduler.

I also used powerstat to measure the power consumed by the CPU package over a 60 second interval.  Measurements for the Intel-Pstate CPU scheduler [performance, powersave] and the ACPI CPU scheduler (intel_pstate=disabled) [performance, powersave, conservative and ondemand] were taken for 1,000,000 down to 10,000 nanosecond timer delays.

1,000,000 nanosecond timer delays (1 millisecond)

Strangely the powersave Intel-Pstate is using the most power (not what I expected).

The ACPI CPU scheduler in performance mode has the best latency distribution followed by the Intel-Pstate CPU scheduler also in performance mode.

100,000 nanosecond timer delays (100 microseconds)

Note that Intel-Pstate performance consumes the most power...
...and also has the most responsive low-latency distribution.

10,000 nanosecond timer delays (10 microseconds)

In this scenario, the ACPI CPU scheduler in performance mode was consuming the most power and had the best latency distribution.

It is clear that the best latency responses occur when a CPU scheduler is running in performance mode and this consumes a little more power than other CPU scheduler modes.  However, it is not clear which CPU scheduler (Intel-Pstate or ACPI) is best in specific use-cases.

The conclusion is rather obvious;  but needs to be stated.  For best low-latency response, set the CPU governor to the performance mode at the cost of higher power consumption.  Depending on the use-case, the extra power cost is probably worth the improved latency response.

As mentioned earlier, this is a somewhat contrived experiment, only one CPU was being exercised with a predictable timer wakeup.  A more interesting test would be with data handling, such as incoming packet handling over ethernet at different rates; I will probably experiment with that if and when I get more time.  Since this was a synthetic test using stress-ng, it does not represent real world low-latency scenarios, however, it may be worth exploring CPU scheduler settings to tune a low-latency configuration rather than relying on the default CPU scheduler setting.

Friday 23 March 2018

Kernel Commits with "Fixes" tag

Over the past 5 years there has been a steady increase in the number of kernel bug fix commits that use the "Fixes" tag.  Kernel developers use this annotation on a commit to reference an older commit that originally introduced the bug, which is obviously very useful for bug tracking purposes. What is interesting is that there has been a steady take-up of developers using this annotation:

With the 4.15 release, 1859 of the 16223 commits (11.5%) were tagged as "Fixes", so that's a fair amount of work going into bug fixing.  I suspect there are more commits that are bug fixes, but aren't using the "Fixes" tag, so it's hard to tell for certain how many commits are fixes without doing deeper analysis.  Probably over time this tag will be widely adopted for all bug fixes and the trend line will level out and we will have a better idea of the proportion of commits per release that are just devoted to fixing issues.  Let's see how this looks in another 5 years time,  I'll keep you posted!

Wednesday 21 February 2018

Linux Kernel Module Growth

The Linux kernel grows at an amazing pace, each kernel release adds more functionality, more drivers and hence more kernel modules.  I recently wondered what the trend was for kernel module growth per release, so I performed module builds on kernels v2.6.24 through to v4.16-rc2 for x86-64 to get a better idea of growth rates:


..as one can see, the rate of growth is relatively linear with about 89 modules being added to each kernel release, which is not surprising as the size of the kernel is growing at a fairly linear rate too.  It is interesting to see that the number of modules has easily more than tripled in the 10 years between v2.6.24 and v4.16-rc2,  with a rate of about 470 new modules per year. At this rate, Linux will see the 10,000th module land in around the year 2025.

Saturday 3 February 2018

stress-ng V0.09.15

It has been a while since my last post about stress-ng so I thought it would be useful to provide an update on the changes since V0.08.09.

I have been focusing on making stress-ng more portable so it can build with various versions of clang and gcc as well as run against a wide range of kernels.   The portability shims and config detection added to stress-ng allow it to build and run on a wide range of Linux systems, as well as GNU/HURD, Minix, Debian kFreeBSD, various BSD systems, OpenIndiana and OS X.

Enabling stress-ng to work on a wide range of architectures and kernels with a range of compiler versions has helped me to find and fix various corner case bugs.  Also, static analysis with a various set of tools has helped to drive up the code quality. As ever, I thoroughly recommend using static analysis tools on any project to find bugs.

Since V0.08.09 I've added the following stressors:
  • inode-flags  - (using the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctl, see ioctl_iflags(2) for more details.
  • sockdiag - exercise the Linux sock_diag netlink socket diagnostics
  • branch - exercise branch prediction
  • swap - exercise adding and removing variously sized swap partitions
  • ioport - exercise I/O port read/writes to try and cause CPU I/O bus delays
  • hrtimers - high resolution timer stressor
  • physpage - exercise the lookup of a physical page address and page count of a virtual page
  • mmapaddr - mmap pages to randomly unused VM addresses and exercise mincore and segfault handling
  • funccall - exercise function calling with a range of function arguments types and sizes, for benchmarking stack/CPU/cache and compiler.
  • tree - BSD tree (red/black and splay) stressor, good for exercising memory/cache
  • rawdev - exercise raw block device I/O reads
  • revio - reverse file offset random writes, causes lots of fragmentation and hence many file extents
  • mmap-fixed - stress fixed address mmaps, with a wide range of VM addresses
  • enosys - exercise a wide range of random system call numbers that are not wired up, hence generating ENOSYS errors
  • sigpipe - stress SIGPIPE signal generation and handling
  • vm-addr - exercise a wide range of VM addresses for fixed address mmaps with thorough address bit patterns stressing
Stress-ng has nearly 200 stressors and many of these have various stress methods than can be selected to perform specific stress testing.  These are all documented in the manual.  I've also updated the stress-ng project page with various links to academic papers and presentations that have used stress-ng in various ways to stress computer systems.  It is useful to find out how stress-ng is being used so that I can shape this tool in the future.

As ever, patches for fixes and improvements are always appreciated.  Keep on stressing!