Analyze benchmark results¶
pyperf commands¶
To analyze benchmark results, write the output into a JSON file using the --output option (-o):
$ python3 -m pyperf timeit '[1,2]*1000' -o bench.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us
pyperf provides the following commands to analyze benchmark results:
pyperf show: single line summary with the mean and standard deviation
pyperf check: check whether benchmark results seem stable
pyperf metadata: display metadata collected during the benchmark
pyperf dump: see all values per run, including warmup values and the calibration run
pyperf stats: compute various statistics (min/max, mean, median, percentiles, etc.)
pyperf hist: render a histogram to see the shape of the distribution
pyperf slowest: top 5 benchmarks which took the most time to run
Statistics¶
Outliers¶
If you run a benchmark without tuning the system, it’s likely that you will get outliers: a few values much slower than the average.
Example:
$ python3 -m pyperf timeit '[1,2]*1000' -o outliers.json
.....................
WARNING: the benchmark result may be unstable
* the maximum (6.02 us) is 39% greater than the mean (4.34 us)
Try to rerun the benchmark with more runs, values and/or loops.
Run 'python3 -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
Mean +- std dev: 4.34 us +- 0.31 us
Use the pyperf stats command to count the number of outliers (9 in this example):
$ python3 -m pyperf stats outliers.json -q
Total duration: 11.6 sec
Start date: 2017-03-16 16:30:01
End date: 2017-03-16 16:30:16
Raw value minimum: 135 ms
Raw value maximum: 197 ms
Number of calibration run: 1
Number of run with values: 20
Total number of run: 21
Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^15
Total number of values: 60
Minimum: 4.12 us
Median +- MAD: 4.25 us +- 0.05 us
Mean +- std dev: 4.34 us +- 0.31 us
Maximum: 6.02 us
0th percentile: 4.12 us (-5% of the mean) -- minimum
5th percentile: 4.15 us (-4% of the mean)
25th percentile: 4.21 us (-3% of the mean) -- Q1
50th percentile: 4.25 us (-2% of the mean) -- median
75th percentile: 4.30 us (-1% of the mean) -- Q3
95th percentile: 4.84 us (+12% of the mean)
100th percentile: 6.02 us (+39% of the mean) -- maximum
Number of outlier (out of 4.07 us..4.44 us): 9
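Two figures in this output can be checked by hand. Each raw value is the timing of a whole run of 2^15 loop iterations, so dividing by the loop count recovers the per-iteration values pyperf reports; and the outlier range 4.07 us..4.44 us is consistent with Tukey's fences (Q1 - 1.5×IQR to Q3 + 1.5×IQR) built from the quartiles above. This is a sketch of that arithmetic; pyperf's exact outlier rule may differ:

```python
# Raw values are whole-run timings; dividing by the loop count gives
# the per-iteration values (all values in seconds).
loops = 2 ** 15
print(0.135 / loops)   # raw minimum 135 ms -> ~4.12 us
print(0.197 / loops)   # raw maximum 197 ms -> ~6.01 us

# Tukey's fences built from the quartiles reported above
# reproduce the outlier range 4.07 us..4.44 us.
q1, q3 = 4.21e-6, 4.30e-6
iqr = q3 - q1
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```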
Histogram:
$ python3 -m pyperf hist outliers.json -q
4.10 us: 15 ##############################
4.20 us: 29 ##########################################################
4.30 us: 6 ############
4.40 us: 3 ######
4.50 us: 2 ####
4.60 us: 1 ##
4.70 us: 0 |
4.80 us: 1 ##
4.90 us: 0 |
5.00 us: 0 |
5.10 us: 0 |
5.20 us: 2 ####
5.30 us: 0 |
5.40 us: 0 |
5.50 us: 0 |
5.60 us: 0 |
5.70 us: 0 |
5.80 us: 0 |
5.90 us: 0 |
6.00 us: 1 ##
Using a histogram, it’s easy to see that most values (57 values) are in the range [4.12 us; 4.84 us], but 3 values are in the range [5.17 us; 6.02 us]: the maximum (6.02 us) is 39% greater than the mean.
See How to get reproducible benchmark results to avoid outliers.
If you cannot get stable benchmark results, another option is to use median and median absolute deviation (MAD) instead of mean and standard deviation. Median and MAD are robust statistics which ignore outliers.
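A minimal sketch of that difference, using Python’s statistics module on made-up timings: a single slow outlier drags the mean and standard deviation up, while the median and MAD barely move.

```python
import statistics

def mad(values):
    """Median absolute deviation: the median distance to the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

values = [4.1, 4.2, 4.2, 4.3, 4.3, 4.4]   # hypothetical timings (us)
with_outlier = values + [6.0]             # one slow outlier

print(statistics.mean(values), statistics.mean(with_outlier))      # mean jumps
print(statistics.stdev(values), statistics.stdev(with_outlier))    # stdev explodes
print(statistics.median(values), statistics.median(with_outlier))  # barely moves
print(mad(values), mad(with_outlier))                              # barely moves
```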
Minimum VS average¶
Links:
Statistically Rigorous Java Performance Evaluation by Andy Georges, Dries Buytaert and Lieven Eeckhout, 2007
Benchmarking: minimum vs average (June 2016) by Kevin Modzelewski
My journey to stable benchmark, part 3 (average) (May 2016) by Victor Stinner
Median versus Mean: pyperf issue #1: Use a better measure than average and standard deviation
timeit module of PyPy now uses average: change timeit to report the average +- standard deviation
Median and median absolute deviation VS mean and standard deviation¶
Median and median absolute deviation (MAD) are robust statistics which ignore outliers.
Probability distribution¶
The pyperf hist command renders a histogram of the distribution of all values.
See also:
Probability distribution (Wikipedia)
“How NOT to Measure Latency” by Gil Tene (video at Youtube)
HdrHistogram: A High Dynamic Range Histogram: “look at the entire percentile spectrum”
Why is pyperf so slow?¶
The --fast and --rigorous options indirectly have an impact on the total duration of benchmarks. The pyperf module is not optimized for total duration but to produce reliable benchmarks.
The --fast option is designed to be fast, but remain reliable enough to be sensitive. Using fewer worker processes and fewer values per worker would produce unstable results.
Compare benchmark results¶
Let’s use Python 3.6 and Python 3.8 to generate two different benchmark results:
$ python3.6 -m pyperf timeit '[1,2]*1000' -o py36.json
.....................
Mean +- std dev: 4.70 us +- 0.18 us
$ python3.8 -m pyperf timeit '[1,2]*1000' -o py38.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us
The pyperf compare_to command compares the second benchmark to the first benchmark:
$ python3 -m pyperf compare_to py36.json py38.json
Mean +- std dev: [py36] 4.70 us +- 0.18 us -> [py38] 4.22 us +- 0.08 us: 1.11x faster (-10%)
Python 3.8 is faster than Python 3.6 on this benchmark.
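The reported ratio and percentage follow directly from the two means:

```python
py36_mean = 4.70e-6  # seconds, from the output above
py38_mean = 4.22e-6
speedup = py36_mean / py38_mean                # ratio of the two means
change = (py38_mean - py36_mean) / py36_mean   # relative change
print(f"{speedup:.2f}x faster ({change:+.0%})")  # -> 1.11x faster (-10%)
```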
pyperf determines whether two samples differ significantly using a two-sample, two-tailed Student’s t-test at a 95% confidence level.
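The flavor of that test can be sketched in pure Python. This is an illustration with made-up samples, not pyperf’s actual implementation; the critical value assumes 10 degrees of freedom at the 5% two-tailed level:

```python
import math
import statistics

def t_statistic(a, b):
    """Pooled two-sample Student's t statistic."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

py36 = [4.70, 4.65, 4.90, 4.55, 4.75, 4.68]  # hypothetical values (us)
py38 = [4.22, 4.18, 4.30, 4.15, 4.25, 4.20]
t = t_statistic(py36, py38)
# With n1 + n2 - 2 = 10 degrees of freedom, |t| > ~2.23 means the
# difference is significant at the 5% level.
print(abs(t) > 2.23)  # -> True
```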
Render a table using the --table option:
$ python3 -m pyperf compare_to mult_list_py36.json mult_list_py38.json --table
+----------------+----------------+-----------------------+
| Benchmark | mult_list_py36 | mult_list_py38 |
+================+================+=======================+
| [1,2]*1000 | 3.70 us | 3.18 us: 1.16x faster |
+----------------+----------------+-----------------------+
| [1,2,3]*1000 | 4.61 us | 4.17 us: 1.11x faster |
+----------------+----------------+-----------------------+
| Geometric mean | (ref) | 1.09x faster |
+----------------+----------------+-----------------------+
Benchmark hidden because not significant (1): [1]*1000
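The geometric mean row can be reproduced from the table. It appears to also count the hidden, non-significant benchmark; assuming that benchmark’s ratio is close to 1.00x:

```python
import math

# Per-benchmark speedups from the table above; the hidden benchmark
# ([1]*1000) is not significant, so its ratio is assumed to be ~1.00.
ratios = [3.70 / 3.18, 4.61 / 4.17, 1.00]
geomean = math.prod(ratios) ** (1 / len(ratios))
print(f"{geomean:.2f}x faster")  # -> 1.09x faster
```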