Analyze benchmark results

pyperf commands

To analyze benchmark results, write the output into a JSON file using the --output option (-o):

$ python3 -m pyperf timeit '[1,2]*1000' -o bench.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us

pyperf provides the following commands to analyze benchmark results:

  • pyperf show: single line summary, mean and standard deviation
  • pyperf check: check if benchmark results stability
  • pyperf metadata: display metadata collected during the benchmark
  • pyperf dump: see all values per run, including warmup values and the calibration run
  • pyperf stats: compute various statistics (min/max, mean, median, percentiles, etc.).
  • pyperf hist: render an histogram to see the shape of the distribution.
  • pyperf slowest: top 5 benchmarks which took the most time to be run.

Statistics

Outliers

If you run a benchmark without tuning the system, it’s likely that you will get outliers: a few values much slower than the average.

Example:

$ python3 -m pyperf timeit '[1,2]*1000' -o outliers.json
.....................
WARNING: the benchmark result may be unstable
* the maximum (6.02 us) is 39% greater than the mean (4.34 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python3 -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 4.34 us +- 0.31 us

Use the pyperf stats command to count the number of outliers (9 on this example):

$ python3 -m pyperf stats outliers.json -q
Total duration: 11.6 sec
Start date: 2017-03-16 16:30:01
End date: 2017-03-16 16:30:16
Raw value minimum: 135 ms
Raw value maximum: 197 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^15
Total number of values: 60

Minimum:         4.12 us
Median +- MAD:   4.25 us +- 0.05 us
Mean +- std dev: 4.34 us +- 0.31 us
Maximum:         6.02 us

  0th percentile: 4.12 us (-5% of the mean) -- minimum
  5th percentile: 4.15 us (-4% of the mean)
 25th percentile: 4.21 us (-3% of the mean) -- Q1
 50th percentile: 4.25 us (-2% of the mean) -- median
 75th percentile: 4.30 us (-1% of the mean) -- Q3
 95th percentile: 4.84 us (+12% of the mean)
100th percentile: 6.02 us (+39% of the mean) -- maximum

Number of outlier (out of 4.07 us..4.44 us): 9

Histogram:

$ python3 -m pyperf hist outliers.json -q
4.10 us: 15 ##############################
4.20 us: 29 ##########################################################
4.30 us:  6 ############
4.40 us:  3 ######
4.50 us:  2 ####
4.60 us:  1 ##
4.70 us:  0 |
4.80 us:  1 ##
4.90 us:  0 |
5.00 us:  0 |
5.10 us:  0 |
5.20 us:  2 ####
5.30 us:  0 |
5.40 us:  0 |
5.50 us:  0 |
5.60 us:  0 |
5.70 us:  0 |
5.80 us:  0 |
5.90 us:  0 |
6.00 us:  1 ##

Using an histogram, it’s easy to see that most values (57 values) are in the range [4.12 us; 4.84 us], but 3 values are in the range [5.17 us; 6.02 us]: 39% slower for the maximum (6.02 us).

See How to get reproductible benchmark results to avoid outliers.

If you cannot get stable benchmark results, another option is to use median and median absolute deviation (MAD) instead of mean and standard deviation. Median and MAD are robust statistics which ignore outliers.

Minimum VS average

Links:

Median and median absolute deviation VS mean and standard deviation

Median and median absolute deviation (MAD) are robust statistics which ignore outliers.

Probability distribution

The pyperf hist command renders an histogram of the distribution of all values.

See also:

Why is pyperf so slow?

--fast and --rigorous options indirectly have an impact on the total duration of benchmarks. The pyperf module is not optimized for the total duration but to produce reliable benchmarks.

The --fast is designed to be fast, but remain reliable enough to be sensitive. Using less worker processes and less values per worker would produce unstable results.

Compare benchmark results

Let’s use Python 3.6 and Python 3.8 to generate two different benchmark results:

$ python3.6 -m pyperf timeit '[1,2]*1000' -o py36.json
.....................
Mean +- std dev: 4.70 us +- 0.18 us

$ python3.8 -m pyperf timeit '[1,2]*1000' -o py38.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us

The pyperf compare_to command compares the second benchmark to the first benchmark:

$ python3 -m pyperf compare_to py36.json py38.json
Mean +- std dev: [py36] 4.70 us +- 0.18 us -> [py38] 4.22 us +- 0.08 us: 1.11x faster (-10%)

Python 3.8 is faster than Python 3.6 on this benchmark.

pyperf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equals to 0.95.

Render a table using --table option:

$ python3 -m pyperf compare_to py36.json py38.json --table
+-----------+---------+------------------------------+
| Benchmark | py36    | py38                         |
+===========+=========+==============================+
| timeit    | 4.70 us | 4.22 us: 1.11x faster (-10%) |
+-----------+---------+------------------------------+