Tune the system for benchmarks¶
CPU pinning and CPU isolation¶
On Linux with a multicore CPU, isolating at least 1 core has a significant impact on the stability of benchmarks. The My journey to stable benchmark, part 1 (system) article explains how to tune Linux for this and shows the effect of CPU isolation and CPU pinning.
The Runner class automatically pins worker
processes to isolated CPUs (when isolated CPUs are detected). CPU pinning can
be checked in benchmark metadata: it is enabled if the
metadata is set.
os.sched_setaffinity() is used to pin processes.
Even if no CPU is isolated, CPU pinning makes benchmarks more stable: use the
--affinity command line option.
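As a minimal sketch of what pinning with os.sched_setaffinity() can look like (this is not pyperf's own implementation, and the helper name pin_current_process is hypothetical), the snippet below restricts the current process to the requested CPUs that it is already allowed to use:

```python
import os


def pin_current_process(cpus):
    """Pin the current process to `cpus` and return the resulting CPU set.

    Returns None on platforms without os.sched_setaffinity() (it is
    only available on some Unix platforms, Linux in particular).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only keep CPUs that the process is currently allowed to run on.
    allowed = os.sched_getaffinity(0)
    wanted = set(cpus) & allowed
    if not wanted:
        raise ValueError(f"none of {sorted(cpus)} is in the allowed set {sorted(allowed)}")
    # pid 0 means "the calling process".
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)
```

Child processes inherit the affinity mask, so pinning a parent before spawning workers also constrains the workers.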
Check the CPU topology for HyperThreading and NUMA for best performance.
On Windows, worker processes are set to the highest priority:
REALTIME_PRIORITY_CLASS. See the SetPriorityClass function.
Isolate CPUs on Linux¶
Identify physical CPU cores (required for Intel Hyper-Threading CPUs):
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
1   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
2   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
3   0    0      3    3:3:3:0       oui    5900,0000 1600,0000
4   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
5   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
6   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
7   0    0      3    3:3:3:0       oui    5900,0000 1600,0000
This machine has a single CPU on a single socket. We will isolate physical cores 2 and 3, which correspond to logical CPUs 2, 3, 6 and 7. Also be careful with NUMA: here all physical cores are on the same NUMA node (0).
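Reading the sibling logical CPUs off the lscpu table by hand is error-prone on larger machines. The following sketch groups logical CPUs by physical core from the text of lscpu --extended. The function name cpus_by_core is hypothetical, and the code assumes the CPU, SOCKET and CORE columns are present and numeric:

```python
def cpus_by_core(lscpu_output):
    """Map (socket, core) to the sorted list of logical CPUs on that core.

    `lscpu_output` is the text printed by `lscpu --extended`,
    starting with the header line.
    """
    rows = [line.split() for line in lscpu_output.strip().splitlines()]
    header, data = rows[0], rows[1:]
    cpu_col = header.index("CPU")
    socket_col = header.index("SOCKET")
    core_col = header.index("CORE")

    groups = {}
    for row in data:
        key = (int(row[socket_col]), int(row[core_col]))
        groups.setdefault(key, []).append(int(row[cpu_col]))
    return {key: sorted(cpus) for key, cpus in groups.items()}
```

Any core that maps to more than one logical CPU has Hyper-Threading siblings, and all of them should be isolated together.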
Reboot, enter GRUB and modify the Linux command line to add:
Check stability of a benchmark¶
Download the system_load.py script: it simulates a busy system by running dummy workers until the system load is higher than the minimum specified on the command line.
Prefix the benchmark command with
taskset -c 2,3,6,7 to run the benchmark on isolated CPUs
Run the benchmark on an idle system
Run the benchmark with
system_load.py 5 running in a different window
The two results must be close. Otherwise, CPU isolation doesn’t work.
You can also check the number of context switches by reading
nonvoluntary_ctxt_switches. It must be low on a CPU-bound benchmark.
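On Linux, these counters are exposed per process in /proc/<pid>/status (the text above does not name the file, so treat the path as an assumption of this sketch). The hypothetical helpers below extract the counters from that text:

```python
def read_ctxt_switches(status_text):
    """Return the *_ctxt_switches counters found in /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.endswith("ctxt_switches"):
            counts[key] = int(value)
    return counts


def ctxt_switches_of(pid="self"):
    """Read the counters of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return read_ctxt_switches(f.read())
```

Reading the counters before and after a benchmark run and comparing the difference shows how often the scheduler preempted the process.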
On Linux, pyperf adds a
runnable_threads metadata to runs: “number of
currently runnable kernel scheduling entities (processes, threads)” (the value
comes from the 4th field of
See also the Visualize the system noise using perf and CPU isolation article (by Victor Stinner, June 2016).
In 2017, high performance Intel and AMD CPUs can have multiple nodes of CPU cores where each node is assigned to a memory region. The latency for a memory region depends on the CPU node. This configuration is called Non-uniform memory access: NUMA.
Use the lscpu -a -e command to list CPUs and the NUMA node each one is assigned to.
CPU pinning is very important on NUMA systems to get the best performance.
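The same mapping is available from sysfs on Linux. This sketch (the function name numa_node_cpulists is hypothetical) reads the cpulist file of each NUMA node directory. The root parameter exists only so the function can be tried against a fake directory tree:

```python
import glob
import os


def numa_node_cpulists(root="/sys/devices/system/node"):
    """Map NUMA node number to its CPU list string, for example {0: "0-7"}."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = f.read().strip()
    return nodes
```

If the result has more than one node, pin the benchmark to CPUs of a single node so that memory accesses stay local.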
See also the
Features of Intel CPUs¶
Modern Intel CPUs have many dynamic features that affect performance:
HyperThreading: runs two threads per CPU core, which share the L1 caches
Turbo Boost: the CPU frequency is optimized for best performance depending on the number of “active” cores, CPU temperature, etc.
P-state and C-state: the frequency of a CPU core changes depending on its C-state and P-state, which are tuned by the operating system (by the kernel).
Tools to measure CPU frequency, P-state and C-state:
On Fedora, type
dnf install -y kernel-tools to install
Causes of Performance Swings Due to Code Placement in IA by Zia Ansari (Intel), November 2016.
Intel CPUs: P-state, C-state, Turbo Boost, CPU frequency, etc. by Victor Stinner, July 2016
Intel CPUs (part 2): Turbo Boost, temperature, frequency and Pstate C0 bug by Victor Stinner, September 2016
If the nohz_full kernel option is used, the CPU frequency must be fixed,
otherwise the CPU frequency will be unstable. See Bug 1378529: intel_pstate
driver doesn’t support NOHZ_FULL.
Skylake: 6th generation
Broadwell: 5th generation
Haswell: 4th generation
Ivy Bridge: 3rd generation
Sandy Bridge: 2nd generation
Operations and checks of the pyperf system command¶
The pyperf system command implements the following operations:
“CPU scaling governor (intel_pstate driver)”: Get/Set the CPU scaling governor.
tune sets the governor to
reset sets the governor to
“CPU Frequency”: Read/Write
scaling_min_freq to the maximum frequency,
scaling_min_freq to the minimum frequency.
“IRQ affinity”: Handle the state of the
tune stops the service,
reset starts the service. Read/Write the CPU affinity of interrupts:
/proc/irq/N/smp_affinity of all IRQs
“Perf event”: Use
/proc/sys/kernel/perf_event_max_sample_rate to set the maximum sample rate of perf event to
1 for tune, or
“Power supply”: check that the power cable is plugged in. If the power cable is unplugged (a laptop running only on battery), the CPU speed can change when the battery level becomes too low.
“Turbo Boost (MSR)”: use
/dev/cpu/N/msr to read/write the Turbo Boost mode of Intel CPUs
“Turbo Boost (intel_pstate driver)”: read from/write into
/sys/devices/system/cpu/intel_pstate/no_turbo to control the Turbo Boost mode of the Intel CPU using the
“Turbo Boost (intel_pstate driver)” is used automatically if the CPU 0 uses the
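To inspect some of this state by hand, the following read-only sketch looks at two sysfs files. It is not how pyperf system is implemented. The helper names are hypothetical, and the cpu0/cpufreq/scaling_governor path is the standard cpufreq location, assumed here because the text above does not spell it out:

```python
import os


def read_sysfs(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def cpu_tuning_state(cpu_root="/sys/devices/system/cpu"):
    """Report the CPU 0 scaling governor and the intel_pstate Turbo Boost state."""
    no_turbo = read_sysfs(os.path.join(cpu_root, "intel_pstate", "no_turbo"))
    return {
        "governor": read_sysfs(
            os.path.join(cpu_root, "cpu0", "cpufreq", "scaling_governor")
        ),
        # no_turbo == "1" means Turbo Boost is disabled; None means unknown.
        "turbo_boost": None if no_turbo is None else no_turbo == "0",
    }
```

Writing to these files requires root privileges, which is why this example only reads them.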
The pyperf system command implements the following checks:
“ASLR”: Check that Full randomization (
2) is enabled in
“Check nohz_full”: Make sure that the nohz_full kernel option is not used with the intel_pstate CPU driver. The intel_pstate driver is incompatible with nohz_full: see the bug report https://bugzilla.redhat.com/show_bug.cgi?id=1378529.
“Linux scheduler”: Check that CPUs are isolated using the
isolcpus=<cpu list> parameter of the Linux kernel. Check that the
rcu_nocbs=<cpu list> parameter is used so that RCU is not scheduled on isolated CPUs.
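The scheduler and nohz_full checks can be approximated by parsing the kernel command line. This is a sketch under assumptions, not pyperf's code: the function names are hypothetical, and the caller is expected to pass the content of /proc/cmdline and the name of the CPU scaling driver.

```python
def parse_kernel_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dictionary."""
    options = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        options[name] = value  # flags without "=" map to ""
    return options


def check_isolation(cmdline, scaling_driver=None):
    """Return a list of human-readable problems (empty if none found)."""
    options = parse_kernel_cmdline(cmdline)
    problems = []
    if not options.get("isolcpus"):
        problems.append("isolcpus=<cpu list> is not set")
    if not options.get("rcu_nocbs"):
        problems.append("rcu_nocbs=<cpu list> is not set")
    if "nohz_full" in options and scaling_driver == "intel_pstate":
        problems.append("nohz_full is used with the intel_pstate driver")
    return problems
```

On a live system, the first argument would come from reading /proc/cmdline.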
CPUFreq: CPU frequency and voltage scaling code in the Linux kernel
Power Management Quality Of Service Interface (PM QOS) (
CPU pinning, real-time:
Disable Turbo Boost of Intel CPUs:
Linux-RT: HOWTO: Build an RT-application
See also the Krun program which tunes Linux and OpenBSD to run benchmarks.
The following options were not tested by pyperf developers.
Disable HyperThreading in the BIOS
Disable Turbo Boost in the BIOS
for i in $(pgrep rcu); do taskset -pc 0 $i ; done (is it useful if rcu_nocbs is already used?)
nohz_full=cpu_list: be careful of P-state/C-state bug (see below)
intel_pstate=disable: force the usage of the ACPI CPU driver
Non-maskable interrupts (NMI): add
nmi_watchdog=0 nowatchdog nosoftlockup to the Linux kernel command line
processor.max_cstate=1 idle=poll: see https://access.redhat.com/articles/65410: “You can disable all c-states by booting with idle=poll or just the deep ones with "processor.max_cstate=1"”
/dev/cpu_dma_latency can be used to prevent the CPU from entering deep C-states. Open the device, write a 32-bit
0 to it, then keep it open while your tests run, and close it when you’re finished. See processor.max_cstate, intel_idle.max_cstate and /dev/cpu_dma_latency.
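The open, write, keep open, close sequence maps naturally onto a context manager. The sketch below is untested on real hardware, like the rest of these options, and the name hold_cpu_dma_latency is hypothetical. Opening the real device normally requires root privileges, so the path parameter lets the function be exercised against an ordinary file:

```python
import contextlib
import os
import struct


@contextlib.contextmanager
def hold_cpu_dma_latency(latency_us=0, path="/dev/cpu_dma_latency"):
    """Request a maximum CPU wake-up latency for the duration of the block."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The value is written as a binary 32-bit signed integer.
        os.write(fd, struct.pack("=i", latency_us))
        # The request only stays active while the descriptor is open.
        yield
    finally:
        os.close(fd)
```

The benchmark would then run inside a `with hold_cpu_dma_latency():` block, and the request is dropped as soon as the block exits.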
Misc (untested) Linux commands:
echo "Disable realtime bandwidth reservation"
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo "Reduce hung_task_check_count"
echo 1 > /proc/sys/kernel/hung_task_check_count
echo "Disable software watchdog"
echo -1 > /proc/sys/kernel/softlockup_thresh
echo "Reduce vmstat polling"
echo 20 > /proc/sys/vm/stat_interval
If your kernel supports it (CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y), you may also enable the tickless kernel on these CPUs. Add the following option to the command line:
Check that the Linux command line works:
$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
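These files use the kernel's CPU list format (comma-separated numbers and ranges). To compare them against the CPUs you intended to isolate, a small parser is enough. The function name parse_cpu_list is hypothetical:

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "2-3,6-7" into [2, 3, 6, 7]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPU is listed
        first, sep, last = part.partition("-")
        if sep:
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(first))
    return cpus
```

For the output shown above, parse_cpu_list("2-3,6-7") gives [2, 3, 6, 7], which matches the logical CPUs chosen for isolation.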
Be careful with nohz_full when using the intel_pstate CPU driver.
ASLR must not be disabled manually! (it’s enabled by default on Linux)