Linux production notes

Non-Uniform Access Memory (NUMA) hardware

Historically, there have been a number of situations where the choice of NUMA memory management settings in the kernel would adversely affect the performance of q on systems using NUMA memory architectures. This resulted in higher-than-expected system-process usage for q, and lower memory performance. For this reason we made certain recommendations for the settings for memory interleave and transparent huge pages.

One of the performance issues seen by q in this context is the same as the “swap insanity” issue, as linked below. Essentially, when the Linux kernel decides to swap out dirty pages, due to memory exhaustion, it was observed to affect performance of q, significantly more than expected. A relief for this situation was achieved via setting NUMA interleaving options in the kernel. However, with the introduction of new Linux distributions based on newer kernel versions we now recommend different NUMA settings, depending on the version of the distribution being used. The use of the interleave feature should still be considered for those cases where your code drives the q processes to write to memory pages in excess of the physical memory capacity of the node.

  • For distributions based on kernels 3.x, please disable interleave, and enable zone_reclaim. For all situations where memory page demand is constrained to the physical memory space of the node, this should return a better overall performance.

  • For Linux distributions based on Linux kernel 2.6 or earlier (e.g RHEL 6.7 or CentoS 6.7 or earlier), we recommend to disable NUMA, and instead set an interleave memory policy, especially in the use-case described above.

In both cases, q is unaware of whether NUMA is enabled or not.

If possible, you should change the NUMA settings via a BIOS setting, if that is supported by your system. Otherwise use the technique below.

To fully disable NUMA and enable an interleave memory policy, start q with the numactl command as follows

$ numactl --interleave=all q

and disable zone-reclaim in the proc settings as follows

$ echo 0 > /proc/sys/vm/zone_reclaim_mode

The MySQL “swap insanity” problem and the effects of NUMA
Although the post is about the impact on MySQL, the issues are the same for other databases such as q.

To find out whether NUMA is enabled in your bios, use

$ dmesg | grep -i numa

And to see if NUMA is enabled on a process basis

$ numactl -s

Huge Pages and Transparent Huge Pages (THP)

A number of customers have been impacted by bugs in the Linux kernel with respect to Transparent Huge Pages. These issues manifest themselves as process crashes, stalls at 100% CPU usage, and sporadic performance degradation. Our recommendation for THP is similar to the recommendation for memory interleaving.

Linux kernel THP
2.6 or earlier disable
3.x enable

Other database vendors are also reporting similar issues with THP.

Note that changing Transparent Huge Pages isn’t possible via sysctl(8). Rather, it requires manually echoing settings into /sys/kernel at or after boot. In /etc/rc.local or by hand. To disable THP, do this:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi

if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi

Some distributions may require a slightly different path, e.g:

$ echo never >/sys/kernel/mm/redhat_transparent_hugepage/enabled

Another possibility to configure this is via grub

transparent_hugepage=never

To enable THP for Linux kernel 3.x, do this:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
fi

if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
  echo always > /sys/kernel/mm/transparent_hugepage/defrag
fi

Q must be restarted to pick up the new setting.

Monitoring free disk space

In addition to monitoring free disk space for the usual partitions you write to, ensure you also monitor free space of /tmp on Unix, since q uses this area for capturing the output from system commands, such as system "ls".

Compression

If you find that q is seg faulting (crashing) when accessing compressed files, try increasing the Linux kernel parameter vm.max_map_count. As root

$ sysctl vm.max_map_count=16777216

and/or make a suitable change for this parameter more permanent through /etc/sysctl.conf. As root

$ echo "vm.max_map_count = 16777216" | tee -a /etc/sysctl.conf
$ sysctl -p

You can check current settings with

$ more /proc/sys/vm/max_map_count

Assuming you are using 128Kb logical size blocks for your compressed files, a general guide is, at a minimum, set max_map_count to one map per 128Kb of memory, or 65530, whichever is higher.

If you are encountering a SIGBUS error, please check that the size of /dev/shm is large enough to accommodate the decompressed data. Typically, you should set the size of /dev/shm to be at least as large as a fully decompressed HDB partition.

Set ulimit to the higher of 4096 and 1024 plus the number of compressed columns which may be queried concurrently.

$ ulimit -n 4096

Timekeeping

Timekeeping on production servers is a complicated topic. These are just a few notes which can help.

If you are using any of local time functions .z.(TPNZD) q will use the localtime(3) system function to determine time offset from GMT. In some setups (GNU libc) this can cause excessive system calls to /etc/localtime.
chemie.fu-berlin.de, stackoverflow.com

Setting TZ environment helps this:

$ export TZ=America/New_York

or from q

q)setenv[`TZ;"Europe/London"]

One more way of getting excessive system calls when using .z.(pt…) is to have a slow clock source configured on your OS. Modern Linux distributions provide very low overhead functionality for getting current time. Use tsc clocksource to activate this.

$ echo tsc >/sys/devices/system/clocksource/clocksource0/current_clocksource
# list available clocksource on the system
$ cat /sys/devices/system/clocksource/clocksource*/available_clocksource

If you are using PTP for timekeeping, your PTP hardware vendor might provide their own implementation of time. Check that those utilize VDSO mechanism for exposing time to user space.