Analyze benchmark results

pyperf commands

To analyze benchmark results, write the output into a JSON file using the --output option (-o):

$ python3 -m pyperf timeit '[1,2]*1000' -o bench.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us

pyperf provides the following commands to analyze benchmark results:

pyperf show: single-line summary, mean and standard deviation.
pyperf check: check whether benchmark results are stable.
pyperf metadata: display metadata collected during the benchmark.
pyperf dump: see all values per run, including warmup values and the calibration run.
pyperf stats: compute various statistics (min/max, mean, median, percentiles, etc.).
pyperf hist: render a histogram to see the shape of the distribution.
pyperf slowest: top 5 benchmarks which took the most time to run.

Statistics

Outliers

If you run a benchmark without tuning the system, it’s likely that you will get outliers: a few values much slower than the average.

Example:

$ python3 -m pyperf timeit '[1,2]*1000' -o outliers.json
.....................
WARNING: the benchmark result may be unstable
* the maximum (6.02 us) is 39% greater than the mean (4.34 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python3 -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 4.34 us +- 0.31 us

Use the pyperf stats command to count the number of outliers (9 in this example):

$ python3 -m pyperf stats outliers.json -q
Total duration: 11.6 sec
Start date: 2017-03-16 16:30:01
End date: 2017-03-16 16:30:16
Raw value minimum: 135 ms
Raw value maximum: 197 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^15
Total number of values: 60

Minimum:         4.12 us
Median +- MAD:   4.25 us +- 0.05 us
Mean +- std dev: 4.34 us +- 0.31 us
Maximum:         6.02 us

  0th percentile: 4.12 us (-5% of the mean) -- minimum
  5th percentile: 4.15 us (-4% of the mean)
 25th percentile: 4.21 us (-3% of the mean) -- Q1
 50th percentile: 4.25 us (-2% of the mean) -- median
 75th percentile: 4.30 us (-1% of the mean) -- Q3
 95th percentile: 4.84 us (+12% of the mean)
100th percentile: 6.02 us (+39% of the mean) -- maximum

Number of outlier (out of 4.07 us..4.44 us): 9
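
The outlier range printed above (4.07 us..4.44 us) is consistent with the common interquartile-range rule, where values more than 1.5 x IQR below Q1 or above Q3 count as outliers. The sketch below approximately reproduces that range from the rounded Q1 and Q3 values in the output. The function names and the sample list are hypothetical, and this illustrates the rule rather than pyperf's own code.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Return (low, high) fences using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3):
    """Count values falling outside the IQR fences."""
    low, high = outlier_bounds(q1, q3)
    return sum(1 for value in values if not (low <= value <= high))


# Q1 and Q3 (in microseconds) taken from the stats output above.
low, high = outlier_bounds(4.21, 4.30)
print(f"{low:.3f} us..{high:.3f} us")

# Made-up sample: only 4.84 and 6.02 fall outside the fences.
sample = [4.12, 4.21, 4.25, 4.30, 4.84, 6.02]
print(count_outliers(sample, 4.21, 4.30))
```

Because the fences are derived from quartiles rather than from the mean, a handful of slow values does not widen the range they are measured against.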

Histogram:

$ python3 -m pyperf hist outliers.json -q
4.10 us: 15 ##############################
4.20 us: 29 ##########################################################
4.30 us:  6 ############
4.40 us:  3 ######
4.50 us:  2 ####
4.60 us:  1 ##
4.70 us:  0 |
4.80 us:  1 ##
4.90 us:  0 |
5.00 us:  0 |
5.10 us:  0 |
5.20 us:  2 ####
5.30 us:  0 |
5.40 us:  0 |
5.50 us:  0 |
5.60 us:  0 |
5.70 us:  0 |
5.80 us:  0 |
5.90 us:  0 |
6.00 us:  1 ##

Using a histogram, it’s easy to see that most values (57 values) are in the range [4.12 us; 4.84 us], but 3 values are in the range [5.17 us; 6.02 us]: the maximum (6.02 us) is 39% slower than the mean (4.34 us).

See How to get reproducible benchmark results to avoid outliers.

If you cannot get stable benchmark results, another option is to use median and median absolute deviation (MAD) instead of mean and standard deviation. Median and MAD are robust statistics which ignore outliers.

Minimum VS average

Links:

Statistically Rigorous Java Performance Evaluation by Andy Georges, Dries Buytaert and Lieven Eeckhout, 2007
Benchmarking: minimum vs average (June 2016) by Kevin Modzelewski
My journey to stable benchmark, part 3 (average) (May 2016) by Victor Stinner
Median versus Mean: pyperf issue #1: Use a better measures than average and standard
timeit module of PyPy now uses average: change timeit to report the average +- standard deviation

Median and median absolute deviation VS mean and standard deviation

Median and median absolute deviation (MAD) are robust statistics which ignore outliers.
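
To make the robustness claim concrete, the following sketch compares both pairs of statistics on a small sample with and without a single slow value. The timings are made up for illustration, `median_abs_deviation` is a hypothetical helper, and the MAD here is the plain unscaled form. pyperf may compute it differently.

```python
import statistics


def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled MAD)."""
    center = statistics.median(values)
    return statistics.median(abs(value - center) for value in values)


# Hypothetical timings in microseconds.
stable = [4.20, 4.22, 4.25, 4.26, 4.28]
with_outlier = stable + [6.02]

for name, values in [("stable", stable), ("with outlier", with_outlier)]:
    print(name,
          f"mean={statistics.mean(values):.2f}",
          f"stdev={statistics.stdev(values):.2f}",
          f"median={statistics.median(values):.2f}",
          f"MAD={median_abs_deviation(values):.2f}")
```

Adding the single 6.02 us value shifts the mean and inflates the standard deviation, while the median and MAD barely move.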

Probability distribution

The pyperf hist command renders a histogram of the distribution of all values.
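
As a rough idea of how such a text histogram can be produced, the sketch below groups timings into fixed-width buckets and draws one bar per bucket. It is not pyperf's implementation: `text_histogram`, its parameters, and the sample timings are hypothetical, and values are kept as integer nanoseconds to avoid floating-point bucket boundaries.

```python
from collections import Counter


def text_histogram(values_ns, bucket_ns=100, bar_width=40):
    """Return one text line per bucket, from the lowest to the highest bucket."""
    counts = Counter(value // bucket_ns for value in values_ns)
    largest = max(counts.values())
    lines = []
    for bucket in range(min(counts), max(counts) + 1):
        count = counts.get(bucket, 0)
        # Empty buckets are kept so gaps in the distribution stay visible.
        bar = "#" * max(1, round(count * bar_width / largest)) if count else "|"
        lines.append(f"{bucket * bucket_ns / 1000:.2f} us: {count:2d} {bar}")
    return lines


# Hypothetical timings in nanoseconds: a tight cluster plus one slow value.
sample_ns = [4120, 4150, 4180, 4210, 4230, 4250, 4260, 4310, 4620]
print("\n".join(text_histogram(sample_ns)))
```

Keeping the empty buckets is what makes an isolated slow value stand out as a separate bar far from the main cluster.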

Why is pyperf so slow?

The --fast and --rigorous options indirectly affect the total duration of benchmarks. The pyperf module is optimized to produce reliable benchmarks, not to minimize the total duration.

The --fast option is designed to be fast while remaining reliable enough to be sensitive. Using fewer worker processes and fewer values per worker would produce unstable results.
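
To see why a benchmark of a microsecond-scale statement still takes seconds, one can multiply out the numbers reported by pyperf stats in the Outliers section: 20 runs with values, 1 warmup and 3 values per run, 2^15 loop iterations per value, and a mean of about 4.34 us per iteration. The sketch below is back-of-the-envelope arithmetic with a hypothetical function name. It assumes each warmup runs as many loops as a value, and it ignores the calibration run and process startup.

```python
def estimate_duration(runs, warmups, values, loops, seconds_per_loop):
    """Rough time spent in timed loops across all worker processes."""
    timed_loops_per_run = (warmups + values) * loops
    return runs * timed_loops_per_run * seconds_per_loop


# Numbers taken from the pyperf stats output shown in the Outliers section.
total = estimate_duration(runs=20, warmups=1, values=3,
                          loops=2 ** 15, seconds_per_loop=4.34e-6)
print(f"about {total:.1f} sec")
```

The estimate lands in the same range as the 11.6 sec total duration reported there, which shows that the time goes into the repeated measurements themselves rather than into overhead.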

Compare benchmark results

Let’s use Python 3.6 and Python 3.8 to generate two different benchmark results:

$ python3.6 -m pyperf timeit '[1,2]*1000' -o py36.json
.....................
Mean +- std dev: 4.70 us +- 0.18 us

$ python3.8 -m pyperf timeit '[1,2]*1000' -o py38.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us

The pyperf compare_to command compares the second benchmark to the first benchmark:

$ python3 -m pyperf compare_to py36.json py38.json
Mean +- std dev: [py36] 4.70 us +- 0.18 us -> [py38] 4.22 us +- 0.08 us: 1.11x faster (-10%)

Python 3.8 is faster than Python 3.6 on this benchmark.
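
The two figures at the end of the compare_to line can be checked by hand from the two means. This is a minimal sketch with a hypothetical helper name, not pyperf's formatting code: the ratio is the reference mean divided by the new mean, and the percentage is the relative change in time.

```python
def describe_change(ref_mean, changed_mean):
    """Return (speed ratio, percent change in time) of changed vs reference."""
    ratio = ref_mean / changed_mean
    percent = (changed_mean - ref_mean) / ref_mean * 100
    return ratio, percent


ratio, percent = describe_change(4.70, 4.22)  # means in microseconds
print(f"{ratio:.2f}x faster ({percent:+.0f}%)")  # prints "1.11x faster (-10%)"
```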

pyperf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equal to 0.95 (that is, a 95% confidence level).

Render a table using the --table option:
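
For readers unfamiliar with the test, here is a minimal sketch of a pooled-variance two-sample t-test using only the standard library. The function names and sample timings are hypothetical, the critical value must be supplied by the caller from a Student's t table, and pyperf's own implementation may differ in detail.

```python
import math
import statistics


def t_score(sample1, sample2):
    """Student's two-sample t statistic using a pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    pooled_variance = (
        (n1 - 1) * statistics.variance(sample1)
        + (n2 - 1) * statistics.variance(sample2)
    ) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (statistics.mean(sample1) - statistics.mean(sample2)) / std_error


def differs_significantly(sample1, sample2, critical_value):
    """Two-tailed test: compare |t| against the caller-supplied critical value."""
    return abs(t_score(sample1, sample2)) >= critical_value


# Hypothetical timings in microseconds, 5 values each (8 degrees of freedom).
py36 = [4.70, 4.68, 4.75, 4.66, 4.71]
py38 = [4.22, 4.20, 4.25, 4.19, 4.24]
# 2.306 is the two-tailed 95% critical value of Student's t for 8 degrees of freedom.
print(differs_significantly(py36, py38, critical_value=2.306))
```

With samples this far apart relative to their spread, the t-score is far above the critical value, so the difference is reported as significant.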

$ python3 -m pyperf compare_to mult_list_py36.json mult_list_py38.json --table
+----------------+----------------+-----------------------+
| Benchmark      | mult_list_py36 | mult_list_py38        |
+================+================+=======================+
| [1,2]*1000     | 3.70 us        | 3.18 us: 1.16x faster |
+----------------+----------------+-----------------------+
| [1,2,3]*1000   | 4.61 us        | 4.17 us: 1.11x faster |
+----------------+----------------+-----------------------+
| Geometric mean | (ref)          | 1.09x faster          |
+----------------+----------------+-----------------------+

Benchmark hidden because not significant (1): [1]*1000
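
The 1.09x geometric mean is lower than both visible rows, which suggests the hidden benchmark is included in it. The sketch below checks that reading with the ratios from the table. Treating the hidden benchmark as roughly 1.00x is an assumption, since its ratio is not shown. `statistics.geometric_mean` requires Python 3.8 or newer.

```python
import statistics

# Speed ratios from the table rows above.
shown = [1.16, 1.11]
# Assumption: the hidden, not-significant benchmark counts as roughly 1.00x.
hidden = [1.00]

print(f"shown only:  {statistics.geometric_mean(shown):.2f}x")
print(f"with hidden: {statistics.geometric_mean(shown + hidden):.2f}x")
```

Only the second figure rounds to 1.09x, so the summary row appears to cover every benchmark in the files, not just the rows displayed.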