Description
As suggested by @markshannon, it could be useful to know, for each benchmark, the mean number of times each specific instruction is executed, to better characterize the benchmarks. (Here "specific instruction" means a unique location within a code object, not a bytecode instruction type.)
The metric is:
- count the number of times each specific instruction is executed
- sum all of these numbers up and divide by the total number of specific instructions that were executed at least once
This, roughly speaking, gives an idea of how "loopy" code is.
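As a sketch (separate from the actual instrumentation below), the metric over a mapping from instruction locations to execution counts looks like this; the counts here are made up for illustration:

```python
from collections import defaultdict

def mean_instruction_count(counts):
    """Mean executions per specific instruction, over instructions
    executed at least once. `counts` maps (code, offset) -> executions."""
    executed = [c for c in counts.values() if c > 0]
    return sum(executed) // len(executed)  # integer division, as reported

# Hypothetical counts: three instruction locations in a loop body run
# 100 times each, plus four straight-line instructions run once each.
counts = defaultdict(int)
for offset in (0, 2, 4):
    counts[("loop_code", offset)] = 100
for offset in (0, 2, 4, 6):
    counts[("top_code", offset)] = 1

print(mean_instruction_count(counts))  # (300 + 4) // 7 == 43
```

A benchmark dominated by straight-line startup code pulls this number toward 1; a benchmark dominated by hot loops pushes it up.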
Methodology
This uses sys.monitoring for measurement, using a plugin for pyperf that turns on the measurement only around the actual benchmarking code. (The concept of a pyperf plugin exists only in a pull request at the moment).
The data was collected by running with pyperf's --debug-single-value, which runs the benchmark code exactly once (the inner loop only once, and the outer loop that spawns individual processes exactly once). This is partly to reveal which benchmarks have no loops at all, and partly because the instrumentation is very slow; running as little as possible keeps the total runtime manageable (about 3 hours on my laptop at the moment).
```python
import sys
from collections import defaultdict
from sys import monitoring


class instruction_count:
    def __init__(self):
        # Called at startup
        self.instructions = defaultdict(int)
        monitoring.use_tool_id(1, "instruction_count")
        monitoring.register_callback(
            1, monitoring.events.INSTRUCTION, self.event_handler
        )

    def collect_metadata(self, metadata):
        # Called at shutdown
        value = sum(self.instructions.values()) // len(self.instructions)
        metadata["mean_instr_count"] = value
        metadata["instr_count"] = len(self.instructions)

    def event_handler(self, code, instruction_offset):
        # The sys.monitoring INSTRUCTION event handler
        self.instructions[(code, instruction_offset)] += 1

    def __enter__(self):
        # Called before running benchmarking code
        monitoring.set_events(1, monitoring.events.INSTRUCTION)
        return self

    def __exit__(self, _exc_type, _exc_value, _traceback):
        # Called after running benchmarking code
        monitoring.set_events(1, 0)
```
Results
Results, sorted by mean instruction count
| name | mean instruction count | distinct instructions executed |
|---|---|---|
| deepcopy_reduce | 1 | 384 |
| logging_silent | 1 | 276 |
| pickle | 1 | 260 |
| pickle_dict | 1 | 164 |
| pickle_list | 1 | 192 |
| python_startup | 1 | 1197 |
| python_startup_no_site | 1 | 1197 |
| sqlite_synth | 1 | 322 |
| unpack_sequence | 1 | 4960 |
| unpickle | 1 | 261 |
| unpickle_list | 1 | 185 |
| aiohttp | 2 | 10437 |
| flaskblogging | 2 | 10443 |
| gunicorn | 2 | 10441 |
| djangocms | 3 | 12340 |
| logging_format | 8 | 969 |
| logging_simple | 8 | 921 |
| bench_thread_pool | 9 | 5776 |
| comprehensions | 12 | 418 |
| json_loads | 15 | 302 |
| asyncio_websockets | 16 | 20083 |
| bench_mp_pool | 18 | 5551 |
| deepcopy_memo | 27 | 333 |
| thrift | 29 | 1546 |
| regex_dna | 33 | 2042 |
| typing_runtime_protocols | 47 | 596 |
| regex_effbot | 51 | 3128 |
| regex_v8 | 55 | 15982 |
| sympy_integrate | 100 | 27549 |
| deepcopy | 105 | 558 |
| sqlalchemy_imperative | 117 | 6826 |
| sqlalchemy_declarative | 245 | 24749 |
| tornado_http | 245 | 18026 |
| create_gc_cycles | 277 | 220 |
| sympy_sum | 290 | 52200 |
| json | 304 | 294 |
| dask | 390 | 53408 |
| pathlib | 438 | 2580 |
| unpickle_pure_python | 467 | 1869 |
| asyncio_tcp_ssl | 473 | 12662 |
| coverage | 483 | 6329 |
| mako | 501 | 3412 |
| deltablue | 504 | 1394 |
| html5lib | 550 | 12352 |
| pylint | 551 | 106975 |
| genshi_text | 570 | 7044 |
| dulwich_log | 696 | 4295 |
| pickle_pure_python | 716 | 1305 |
| hexiom | 800 | 1718 |
| xml_etree_parse | 891 | 3230 |
| asyncio_tcp | 988 | 5015 |
| genshi_xml | 1051 | 7436 |
| sympy_str | 1105 | 19761 |
| telco | 1255 | 285 |
| async_generators | 1388 | 19193 |
| json_dumps | 1853 | 244 |
| chameleon | 1892 | 684 |
| docutils | 1994 | 64996 |
| sympy_expand | 2073 | 15964 |
| xml_etree_iterparse | 2096 | 3439 |
| xml_etree_process | 2186 | 3860 |
| xml_etree_generate | 2991 | 3171 |
| mypy2 | 3338 | 98633 |
| pidigits | 3499 | 262 |
| scimark_sparse_mat_mult | 4978 | 271 |
| regex_compile | 5028 | 3840 |
| django_template | 5268 | 864 |
| pycparser | 8694 | 12235 |
| richards | 9794 | 927 |
| richards_super | 10174 | 970 |
| crypto_pyaes | 10197 | 982 |
| chaos | 12955 | 888 |
| go | 18550 | 1511 |
| gc_traversal | 20066 | 175 |
| coroutines | 21939 | 166 |
| meteor_contest | 28833 | 303 |
| scimark_lu | 29979 | 646 |
| nqueens | 34722 | 359 |
| raytrace | 35010 | 1228 |
| scimark_monte_carlo | 40481 | 358 |
| float | 44642 | 280 |
| generators | 52287 | 222 |
| pyflate | 57937 | 1328 |
| scimark_fft | 64011 | 820 |
| nbody | 64806 | 458 |
| mdp | 70366 | 1743 |
| spectral_norm | 79688 | 273 |
| scimark_sor | 100342 | 293 |
| bpe_tokeniser | 162137 | 3128 |
| pprint_safe_repr | 268750 | 416 |
| fannkuch | 271393 | 272 |
| tomli_loads | 286966 | 1611 |
| pprint_pformat | 360345 | 638 |
Results, sorted by number of distinct instructions executed
| name | mean instruction count | distinct instructions executed |
|---|---|---|
| pickle_dict | 1 | 164 |
| coroutines | 21939 | 166 |
| gc_traversal | 20066 | 175 |
| unpickle_list | 1 | 185 |
| pickle_list | 1 | 192 |
| create_gc_cycles | 277 | 220 |
| generators | 52287 | 222 |
| json_dumps | 1853 | 244 |
| pickle | 1 | 260 |
| unpickle | 1 | 261 |
| pidigits | 3499 | 262 |
| scimark_sparse_mat_mult | 4978 | 271 |
| fannkuch | 271393 | 272 |
| spectral_norm | 79688 | 273 |
| logging_silent | 1 | 276 |
| float | 44642 | 280 |
| telco | 1255 | 285 |
| scimark_sor | 100342 | 293 |
| json | 304 | 294 |
| json_loads | 15 | 302 |
| meteor_contest | 28833 | 303 |
| sqlite_synth | 1 | 322 |
| deepcopy_memo | 27 | 333 |
| scimark_monte_carlo | 40481 | 358 |
| nqueens | 34722 | 359 |
| deepcopy_reduce | 1 | 384 |
| pprint_safe_repr | 268750 | 416 |
| comprehensions | 12 | 418 |
| nbody | 64806 | 458 |
| deepcopy | 105 | 558 |
| typing_runtime_protocols | 47 | 596 |
| pprint_pformat | 360345 | 638 |
| scimark_lu | 29979 | 646 |
| chameleon | 1892 | 684 |
| scimark_fft | 64011 | 820 |
| django_template | 5268 | 864 |
| chaos | 12955 | 888 |
| logging_simple | 8 | 921 |
| richards | 9794 | 927 |
| logging_format | 8 | 969 |
| richards_super | 10174 | 970 |
| crypto_pyaes | 10197 | 982 |
| python_startup | 1 | 1197 |
| python_startup_no_site | 1 | 1197 |
| raytrace | 35010 | 1228 |
| pickle_pure_python | 716 | 1305 |
| pyflate | 57937 | 1328 |
| deltablue | 504 | 1394 |
| go | 18550 | 1511 |
| thrift | 29 | 1546 |
| tomli_loads | 286966 | 1611 |
| hexiom | 800 | 1718 |
| mdp | 70366 | 1743 |
| unpickle_pure_python | 467 | 1869 |
| regex_dna | 33 | 2042 |
| pathlib | 438 | 2580 |
| regex_effbot | 51 | 3128 |
| bpe_tokeniser | 162137 | 3128 |
| xml_etree_generate | 2991 | 3171 |
| xml_etree_parse | 891 | 3230 |
| mako | 501 | 3412 |
| xml_etree_iterparse | 2096 | 3439 |
| regex_compile | 5028 | 3840 |
| xml_etree_process | 2186 | 3860 |
| dulwich_log | 696 | 4295 |
| unpack_sequence | 1 | 4960 |
| asyncio_tcp | 988 | 5015 |
| bench_mp_pool | 18 | 5551 |
| bench_thread_pool | 9 | 5776 |
| coverage | 483 | 6329 |
| sqlalchemy_imperative | 117 | 6826 |
| genshi_text | 570 | 7044 |
| genshi_xml | 1051 | 7436 |
| aiohttp | 2 | 10437 |
| gunicorn | 2 | 10441 |
| flaskblogging | 2 | 10443 |
| pycparser | 8694 | 12235 |
| djangocms | 3 | 12340 |
| html5lib | 550 | 12352 |
| asyncio_tcp_ssl | 473 | 12662 |
| sympy_expand | 2073 | 15964 |
| regex_v8 | 55 | 15982 |
| tornado_http | 245 | 18026 |
| async_generators | 1388 | 19193 |
| sympy_str | 1105 | 19761 |
| asyncio_websockets | 16 | 20083 |
| sqlalchemy_declarative | 245 | 24749 |
| sympy_integrate | 100 | 27549 |
| sympy_sum | 290 | 52200 |
| dask | 390 | 53408 |
| docutils | 1994 | 64996 |
| mypy2 | 3338 | 98633 |
| pylint | 551 | 106975 |
A few conclusions to draw from this:
There are a few benchmarks where the mean instruction count is 1, or very low, where there's not much an optimizer could do (short of multiple loops). Some of them, at least pickle*, unpickle* and sqlite_synth, aren't really CPython interpreter benchmarks at all and drop into C code almost immediately (the profiling results confirm this). flaskblogging, gunicorn and djangocms are at least ostensibly macrobenchmarks, so we should investigate why their mean instruction count is so low. We should consider excluding this whole class of benchmarks from the global number, at least for the purpose of optimizing the interpreter.
Benchmarks with both a high mean instruction count and a high number of distinct instructions feel like really robust examples of real-world code. These include pylint, mypy2, docutils, dask, sympy*, pycparser.
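As an illustration of the proposed exclusion, here is a sketch that partitions benchmarks by a hypothetical mean-count threshold, using a few rows copied from the tables above (the cutoff value of 10 is an arbitrary choice for the example, not a proposal):

```python
# A few (benchmark, mean instruction count) pairs from the results above.
means = {
    "pickle": 1,
    "unpickle": 1,
    "sqlite_synth": 1,
    "aiohttp": 2,
    "djangocms": 3,
    "json_loads": 15,
    "pylint": 551,
    "mypy2": 3338,
    "richards": 9794,
}

THRESHOLD = 10  # hypothetical cutoff for "loopy enough" benchmarks

kept = sorted(name for name, mean in means.items() if mean >= THRESHOLD)
excluded = sorted(name for name, mean in means.items() if mean < THRESHOLD)

print("kept:", kept)
print("excluded:", excluded)
```

With this cutoff the pickle*/unpickle*/sqlite_synth group and the suspicious macrobenchmarks land in the excluded set, while the loop-heavy benchmarks survive into the aggregate.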