Improve startup time. #134
Replies: 50 comments 1 reply
-
We also need to initialize some extension modules, sys and builtins.
-
If creating and/or destroying the new interpreter is what takes the time, then one possible way to speed it up is to freeze the object graph of a newly created interpreter. That would work roughly as follows:
We can now create the entire object graph for the interpreter by:
The above assumes that interpreters are fully independent.
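As a toy illustration of the build-time idea (all names below are invented, and real frozen objects would be full PyObject structs rather than plain C strings), one could walk a small object graph and emit static initializers for it, so that start-up only has to reference pre-built data:

```python
# Toy sketch: turn a small Python object graph into static C initializers.
# Real interpreter freezing must handle PyObject layouts, reference counts,
# interning, etc.; this only shows the "build the graph at build time" shape.
def freeze_graph(name, obj, out):
    if isinstance(obj, str):
        out.append(f'static const char {name}[] = "{obj}";')
        return name
    if isinstance(obj, tuple):
        parts = [freeze_graph(f"{name}_{i}", item, out) for i, item in enumerate(obj)]
        out.append(f"static const char *const {name}[] = {{{', '.join(parts)}}};")
        return name
    raise TypeError(f"cannot freeze {type(obj).__name__}")

lines = []
freeze_graph("startup_args", ("-S", "-c", "print('Hi')"), lines)
print("\n".join(lines))
```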
-
I've just remembered Running
-
I don't have any such info presently. I know there were investigations in the past that involved such profiling but I expect any resulting profiling data is outdated at this point. Regardless, it shouldn't take much effort to get at least some basic data. Furthermore, we can already get at least some insight by running
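One cheap way to get that kind of basic data is to time a minimal run and dump per-import costs. This is only a sketch using standard CPython options (`-S`, `-X importtime`), not necessarily the command referred to above:

```python
# Rough startup measurement: wall-clock time of a minimal run, plus the
# per-module import times that -X importtime writes to stderr.
import subprocess
import sys
import time

t0 = time.perf_counter()
subprocess.run([sys.executable, "-S", "-c", "print('Hi')"], check=True)
print(f"total: {(time.perf_counter() - t0) * 1000:.1f} ms")

subprocess.run([sys.executable, "-X", "importtime", "-S", "-c", "pass"], check=True)
```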
-
Regarding Mark's strategy, can you explain what we would have to do when we need to change the object graph? IIUC, currently we change some of the modules that are frozen (_frozen_importlib and friends), and then regenerate the frozen bytecode (for which we have tools). I am worried that your item (3) makes this process more complicated. Otherwise, it sounds like a decent strategy (many data structures are already static webs of pointers; the new thing is that we allow static webs of

And yes, we need to work on profiling startup.
-
FTR, facebook/instagram has been using a related strategy, which they described at the language summit a couple years back.
-
Please don't forget
-
Mark has a nascent proposal
-
This is only beneficial if the size of mapped files exceeds a certain threshold, which is fairly large (probably larger than 100 KiB by now). Below that, copying is faster. With small files, I expect a run-time penalty even after loading because the less dense packing of information (distributed across a larger number of pages) results in increased TLB misses. And by default, Linux will only allow 65,530 separate mappings per process.
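A quick way to probe the trade-off described here is a micro-benchmark along these lines (the file name is a placeholder; results depend heavily on file size, page-cache state and platform):

```python
# Compare copying a file into memory with mapping it and touching every page.
import mmap
import time

PATH = "some_module.pyc"  # placeholder: any existing, non-empty file will do

def load_copy(path):
    with open(path, "rb") as f:
        return f.read()  # one copy into process memory

def load_map(path):
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # zero-copy mapping

for loader in (load_copy, load_map):
    t0 = time.perf_counter()
    data = loader(PATH)
    # Touch every page so the mmap case actually faults the data in.
    touched = sum(data[i] for i in range(0, len(data), 4096))
    print(loader.__name__, f"{(time.perf_counter() - t0) * 1e6:.0f} us", touched)
```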
-
My intuition is that it will be hard to show that mmap is faster than copying here, so I would personally be happy if the first version just read (i.e., copied) the file into memory.
-
The point of the proposal is to avoid the overhead of unmarshalling; whether the file is copied or mapped isn't so important. The important thing is that the pyc format is position independent (no pointers), is immutable, and requires minimal decoding before being usable.
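To make "position independent, minimal decoding" concrete, here is a small sketch (the layout and field sizes are invented, not the actual proposal): every cross-reference is a file offset rather than a pointer, so the same buffer works whether it was read into memory or mapped:

```python
# Build and read a tiny position-independent "constants" buffer:
# a table of u32 offsets, each pointing at [u32 length][UTF-8 bytes].
import struct

def write_buffer(strings):
    offsets, blob, pos = [], b"", 4 * len(strings)
    for s in strings:
        data = s.encode("utf-8")
        offsets.append(pos)
        blob += struct.pack("<I", len(data)) + data
        pos += 4 + len(data)
    return struct.pack(f"<{len(strings)}I", *offsets) + blob

def read_string(buf, offset):
    # Decode a single entry on demand; nothing else in the buffer is touched.
    (length,) = struct.unpack_from("<I", buf, offset)
    return buf[offset + 4 : offset + 4 + length].decode("utf-8")

buf = write_buffer(["spam", "eggs"])
table = struct.unpack_from("<2I", buf, 0)
print([read_string(buf, off) for off in table])  # ['spam', 'eggs']
```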
-
Oh, that's really neat!
-
I just discovered
The biggest cumulative time (4.4 ms) is the
The next biggest cost is
Of course, without
Here's the raw data for that:
-
Some comments about this based on work I did at previous core sprints. Mark's ideas seem good to me and are along the same lines I was thinking.

The unmarshal step is fairly expensive and I think we could reduce that with a change to the code object layout. Currently, the VM expects code objects to be (mostly) composed of PyObjects. It doesn't need to be that way and we could create a code object memory layout that's much cheaper to load from disk. I suggested something similar in spirit to Cap’n Proto. We probably want to keep the on-disk files machine independent.

The importlib overhead has been optimized over the years so there is no big low-hanging fruit there. However, it seems wasteful how much work is done by importlib just to start up. I think we could dump all needed startup modules into a kind of optimized bundle (e.g. better optimized than a zipped package) and eliminate the importlib overhead. E.g. on startup, call a C function that unmarshals all the code in the bundle and then executes some bytecode to finish the startup. It doesn't need to be linked with the executable (as frozen modules are) but could be an external file, e.g. in

When Python starts, the interleaving of

Another idea: try to reduce the amount of work done when executing top-level module code. Quite a bit of work has been done over the years to try to reduce startup time and therefore what's done at the top level of modules imported at startup. E.g. do it the first time a function is called, rather than at import time. However, maybe we can do better. My lazy top-level code execution idea was one approach. The major killer of that was that metaclasses can basically execute anything. My AST walker couldn't find too much that could be lazily executed. I think Cinder has done something like this but I don't know details (StrictModule annotation?). Another issue: doing something lazily doesn't help performance if you end up doing it anyhow. I think I was running into such an issue.

A twist on the lazy code execution is to just defer the loading/unmarshal of code objects of functions until the function is actually called. It seems likely that quite a few functions in modules loaded at startup are never actually executed. So, does it help just to never load the code for them? I had a prototype of this idea but it didn't show much of a win. Might be worth revisiting. The original idea was inspired by something Larry did with PHP years ago.

Another idea: allow some code to be executed at module compile time, i.e. Python code that gets called by the compiler to compute objects that end up inside the .pyc. A simple example: assume my module needs to compute a dict that will become a constant global. It could be possible to do that work at compile time (rather than at import time) and then store the result of that computation in the .pyc. Something like:
Obviously there would be restrictions on what you can do at compile time. This would be an advanced feature and used carefully. Another place it could be used is to pre-compile regular expressions to something that can be quickly loaded at import time and still have good runtime performance. You can do something a bit like this by having code that generates a .py file and then compiling that. However, having it as a feature of the Python compiler would be nice (aside from the language complexity increase). Perhaps things like dataclasses could use the feature to do more work at compile time.
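Since the snippet after "Something like:" isn't preserved above, here is a runnable approximation of the same idea using the generate-a-.py-file workaround mentioned in this comment; the file and symbol names are invented for illustration:

```python
# Build step: do the expensive computation once, write it out as a module,
# and let the compiler turn it into a .pyc so importers just unmarshal it.
import pprint

def build_table():
    return {n: n * n for n in range(256)}  # stand-in for expensive work

with open("_generated_table.py", "w") as f:
    f.write("# Auto-generated; do not edit.\n")
    f.write("TABLE = " + pprint.pformat(build_table()) + "\n")

# Consumers then do `from _generated_table import TABLE`; the dict comes out
# of the .pyc at import time instead of being recomputed by top-level code.
```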
-
The
-
The biggest remaining design issue is the following. In my prototype, which I based on Mark's design, there is a single array of lazily-loadable constants, and (separately) a single array of names. These correspond to co_consts and co_names, but the arrays (tuples, really) are shared between all code objects loaded from the same PYC file. This doesn't scale so well -- in a large module you can easily have 1000s of constants or names, and whenever the size of either array exceeds 255, the corresponding instructions require prefixing with EXTENDED_ARG -- this makes the bytecode bulky and slow, and it also means we have to recompute all jump targets (in my prototype PYC writer I just give up when this happens, so I can only serialize toy modules). A secondary problem with this is that code objects are now contained in their own co_consts array, making them part of GC cycles (and throwing dis.py for a loop when it tries to recursively find all code objects nested in a module's code object).

The simplest solution I can think of is to give each serialized code object two extra arrays of indexes, which map the per-code co_consts and co_names arrays to per-file arrays. At runtime there would then have to be separate per-file arrays and per-code-object arrays. This is somewhat inefficient space-wise, but the implementation is straightforward. (If we were to skip the index arrays, we'd end up with the situation where if a name or constant is shared between many code objects, it would be hydrated many times, once per code object, and the resulting objects would not be shared unless interned.)
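A sketch of the "two extra arrays of indexes" idea, with invented names: each code object keeps a small per-code table mapping its local co_names slots onto the single per-file array, so opargs stay small while hydrated objects are still shared file-wide:

```python
# Per-file array shared by every code object loaded from the same PYC file.
per_file_names = ["eggs", "spam", "ham"]

class LazyCode:
    def __init__(self, name_map):
        self.name_map = name_map             # local slot -> per-file index
        self.names = [None] * len(name_map)  # hydrated lazily, per code object

    def load_name(self, oparg):
        if self.names[oparg] is None:
            self.names[oparg] = per_file_names[self.name_map[oparg]]  # hydrate once
        return self.names[oparg]

egg_sandwich = LazyCode(name_map=[0, 1])  # uses "eggs", "spam"
ham_sandwich = LazyCode(name_map=[2, 1])  # uses "ham", "spam"
print(egg_sandwich.load_name(1), ham_sandwich.load_name(1))  # spam spam
```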
-
Maybe this can reduce redirection to only the case of shared objects: each code object has a dedicated sub-section of the module-level array (and knows its first index). Builder.add returns (index, redirected), where redirected = 0 if the index entry has the actual const/name, and 1 if that entry has the index of the actual data. So if Builder.add added the item at a new index, redirected = 0. If it reused an old index, it needs to check whether the index is within this code object's section or not (so maybe it needs the first_index passed in). If it found an index from a previous code object, it adds that index to a new slot and returns this new slot's index with redirected = 1. Then there needs to be an opcode that resolves the redirection and feeds the MAKE_* with the correct index. Or something along those lines. Whether it's worth it depends on how often there won't be a need for redirection.
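A minimal sketch of the Builder.add contract described here (the class itself isn't shown in this thread, so the details are assumed): the first code object to add an item owns the real slot, and later code objects that reuse it get a fresh slot holding a redirect:

```python
class Builder:
    def __init__(self):
        self.entries = []   # ("value", obj) or ("redirect", owning_index)
        self.index_of = {}  # obj -> index of the slot that owns the actual data

    def add(self, obj, first_index):
        if obj not in self.index_of:
            self.index_of[obj] = len(self.entries)
            self.entries.append(("value", obj))
            return self.index_of[obj], 0          # new entry: redirected = 0
        owner = self.index_of[obj]
        if owner >= first_index:
            return owner, 0                       # reuse within this code object's section
        self.entries.append(("redirect", owner))  # slot pointing at an earlier code object's entry
        return len(self.entries) - 1, 1           # redirected = 1

b = Builder()
print(b.add("spam", first_index=0))  # (0, 0): owns the data
print(b.add("spam", first_index=1))  # (1, 1): redirect slot for a later code object
```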
-
Let's see, take this example:

```python
eggs = 0
ham = 1
spam = 2

def egg_sandwich():
    return eggs, spam

def ham_sandwich():
    return ham + spam + spam
```

There are three names here, "eggs", "spam" and "ham", and only "spam" is shared. The disassembly for the functions would be something like this (constant indices are relative here):
The strings section has four entries:
String entries 0, 1 and 2 have offsets directly into the binary data, where they find the length and encoded data for the strings. String entry 3 has a redirect instead of an offset into the binary data. How to represent redirection? We can encode this by setting the lowest bit of the offset value, if we ensure that offsets into binary data are always even. This wastes the occasional byte but that seems minor. (Alternatively, the offset could be shifted left by one.) Code to find the real offset would be something like this (compare to _PyHydra_UnicodeFromIndex()):

```c
PyObject *
_PyHydra_UnicodeFromIndex(struct lazy_pyc *pyc, int index)
{
    if (0 <= index && index < pyc->n_strings) {
        uint32_t offset = pyc->string_offsets[index];
        if (offset & 1) {
            /* Redirect: the payload is the index of the entry that owns the data. */
            index = offset >> 1;
            assert(0 <= index && index < pyc->n_strings);
            offset = pyc->string_offsets[index];
        }
        return _PyHydra_UnicodeFromOffset(pyc, offset);
    }
    PyErr_Format(PyExc_SystemError, "String index %d out of range", index);
    return NULL;
}
```

We then change the opcode for LOAD_NAME as follows:

```c
case TARGET(LOAD_NAME): {
    PyObject *name = GETITEM(names, oparg);
    if (name == NULL) {
        name = _PyHydrate_LoadName(co->co_pyc, co->co_strings_start + oparg);  // **Changed**
        if (name == NULL) {
            goto error;
        }
        Py_INCREF(name);                       // **New**
        PyTuple_SET_ITEM(names, oparg, name);  // **New**
    }
    ...  // Rest is unchanged
```

Where

Moreover we would need to update _PyHydra_LoadName() to first check if the string with the given index already exists in

```c
name = PyTuple_GET_ITEM(pyc->names, index);
if (name != NULL) {
    Py_INCREF(name);
    return name;
}
```

So as to avoid constructing multiple copies of the same string ("spam" in our example). Also, of course, when a code object is hydrated we have to set its
-
Either that, or (is this possible?) add an opcode "RESOLVE_REDIRECT index" which puts the real index on the stack, and then the oparg to the next opcode is something like -1, which means "pop it from the stack".
-
We could do that for the MAKE_STRING opcode, which is only used in the "mini-subroutines" called by LAZY_LOAD_CONSTANT, but for LOAD_NAME and friends there is an issue: inserting extra opcodes requires recomputing all jump targets, and also the line table and the exception table, which I'd rather not do (especially not in this prototype). We could define the MAKE_STRING opcode as always using the "absolute" index -- in fact, it has to be defined that way, since these mini-subroutines don't belong to any particular code object: if a constant is shared, the mini-subroutine could be invoked from any of the code objects that reference it, or even from another mini-subroutine. Since I generate the mini-subroutines from scratch and they don't have line or exception tables, having to use EXTENDED_ARG in these is not a problem.

The changes I suggested for co_names also need to be used for co_consts. We can use the same trick of requiring real offsets to be even. In fact, for code objects the offset has to be a multiple of 4, since we interpret the offset as a pointer to an array of

(Somewhat unrelated, the encoding for strings is currently a varint giving the size of the encoded bytes, followed by that many bytes comprising the UTF-8-encoded value. This favors space over time. But maybe we ought to favor time over space here? We could store the size in code units as a uint32_t shifted left by 2, encoding the character width in the bottom 2 bits, followed by raw data bytes, either 1, 2 or 4 bytes per code unit. We might even have a spare bit to distinguish between ASCII and Latin-1 in the case where it's one byte per character. E.g.

It's worth experimenting a bit here to see if this is actually faster, and we could scan the 100 or 4000 most popular PyPI packages to see what the total space wasted is compared to the current encoding. For a short ASCII string the wastage would be 3 bytes per string.)
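A sketch of that encoding (the details are guesses at one possible layout, not a spec): a little-endian uint32 header holding the size in code units shifted left by 2 and a character-width code in the bottom 2 bits, followed by raw code units:

```python
# Width codes: 0 = ASCII (1 byte/unit), 1 = Latin-1 (1 byte/unit),
# 2 = 2 bytes/unit, 3 = 4 bytes/unit.
import struct

CODECS = ("ascii", "latin-1", "utf-16-le", "utf-32-le")
WIDTHS = (1, 1, 2, 4)

def encode(s):
    top = max(map(ord, s), default=0)
    code = 0 if top < 0x80 else 1 if top < 0x100 else 2 if top < 0x10000 else 3
    return struct.pack("<I", (len(s) << 2) | code) + s.encode(CODECS[code])

def decode(buf):
    (header,) = struct.unpack_from("<I", buf, 0)
    size, code = header >> 2, header & 3
    return buf[4 : 4 + size * WIDTHS[code]].decode(CODECS[code])

for s in ("hi", "héllo", "ħello", "h😀llo"):
    assert decode(encode(s)) == s
```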
-
Wait -- if the 0, 1, 2 entries are just offsets into the binary data, then 3 might as well be that too, and just have the same offset as entry 1. No?
-
If you do it that way, egg_sandwich() and ham_sandwich() each end up with their own copy of "spam". Now, we should really intern these strings too, which would take care of that, so possibly this is okay. The marshal format shares interned strings. Any string used in LOAD_NAME etc. is worth hashing (these are identifiers that are almost certainly used as dict keys, either globals or builtins, or instance/class attributes). The sharing may be more important for MAKE_STRING, which is also used for "data" strings that aren't interned (things that don't look like identifiers) but are still kept shared both by the compiler and by marshal. (Marshal has a complex system for avoiding writing the same constant multiple times; see w_ref() and r_ref() and friends.) However, MAKE_STRING is only used from mini-subroutines, so we can (and must) set oparg to the absolute index. (Mini-subroutines don't have their own co_names and co_consts arrays.) So if there was a third function that had a string constant "spam", it would have a mini-subroutine invoked by LAZY_LOAD_CONSTANT, which would look like this:

We need to support redirects for LAZY_LOAD_CONSTANT too, and it can follow the same general principle (a per-code-object "relative" index and an "absolute" index). But we'll need two versions of that one! It's used both from regular code objects (where the bytecode rewriter substitutes it for LOAD_CONST, and the index needs to be relative to the code object) and from mini-subroutines, where we need to use the absolute index.
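A sketch of the two flavours, with invented names: the same hydration routine is reached either with a per-code-object relative oparg plus that code object's start offset, or with a file-wide absolute oparg from a mini-subroutine:

```python
per_file_consts = [None] * 8  # lazily hydrated constants, shared per PYC file

def hydrate(abs_index):
    if per_file_consts[abs_index] is None:
        per_file_consts[abs_index] = f"<constant {abs_index}>"  # stand-in for real hydration
    return per_file_consts[abs_index]

def lazy_load_constant_relative(co_consts_start, oparg):
    return hydrate(co_consts_start + oparg)  # used where LOAD_CONST was rewritten

def lazy_load_constant_absolute(oparg):
    return hydrate(oparg)                    # used inside mini-subroutines

# Both paths resolve to the same shared object for index 5.
print(lazy_load_constant_relative(4, 1) is lazy_load_constant_absolute(5))  # True
```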
-
Okay, so I am seeing something weird. When I run the "test_speed" subtest using the "test" package framework, it consistently reports a ~30% speedup, while when I run it using the unittest-based main(), it reports no speedup. Examples:
I see similar differences on macOS. At this point it's important to remind us how the test is run. The code is here. (Note that it seems that it always uses marshal.loads(). This is correct: the code that recognizes the new format lives at the top level in that function.) The big difference seems to be that the individual "classic" times reported are much lower when running using unittest.main() than when running using the test package. What could cause this? Is the test package maybe messing with GC configuration or some other system parameter?
-
Well, hm... In test/libregrtest/setup.py there's code that sets a dummy audit hook. If I comment that out, the classic running time using the test framework goes down to roughly what it is when using unittest.main(), and the "advantage" of the new code pretty much disappears. I consider this particular mystery solved.
-
I discussed the status of my experiment so far with Mark this morning, and we agreed to pause development and try some other experiments before we go further down this road.
Probably it would be better to start writing up Experiments B and C in more detail in new issues here. I will get started with that.
-
We now also have
-
Fun fact: PyOxidizer has support for importing .py/.pyc content from a memory-mapped file using 0-copy, and the run-time code implementing that functionality is available on PyPI (https://pypi.org/project/oxidized-importer) and doesn't require the use of PyOxidizer. This extension module exposes an API (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_api_reference.html). And it is even possible to create your own packed resources files from your own content (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html). Conceptually it is similar to the zip importer, except a bit faster. So if you wanted a quick way to eliminate the stdlib

I've also isolated unmarshaling as the perf hotspot when
-
Oh, here I thought PyOxidizer had solved my problem here already. But it's really closer to https://bugs.python.org/issue45020 -- it loads the marshal data in memory using 0-copy (not sure if that really doesn't copy anything ever, I am always wary of mmap, but I suspect it pays off if many processes share the same segment read-only). But this issue is about eliminating marshal altogether, at least for those modules that are (nearly) always imported at startup. This would have helped Mercurial a bit. And yes, the next step would be eliminating the execution of the code, going directly to the collection of objects in the module dict after execution. But that's a much harder problem, since the results may depend on lots of context (e.g. os.environ, sys.argv, hash randomization, network, threads, other loaded modules, you name it). See also Facebook's "Strict Modules" (https://github.com/facebookincubator/cinder#strict-modules) and/or "Static Python" (same link, next section).
-
What would be the concrete next steps for a contributor to help out with? I'm uncertain where we are with the desirability of the various proposals discussed here.
-
We're in the process of assessing where time is spent during startup and finalization and what to do about it. The frozen modules stuff described above helped but the rest of the story is fairly complex. Consequently, it will take some effort to reach the point where we have a clear picture of concrete next steps. You're welcome to help out with the ongoing investigation. 😄
-
Before we attempt this, we need to know where the time is spent.

@ericsnowcurrently Do you have any profiling information from running `python -S -c "print('Hi')"`, or a similar script? It takes about 10 ms on my machine for 3.9, which seems a long time to print "Hi". For comparison, 2.7 takes about 7 ms. I suspect there is a lot of incidental stuff going on that we can eliminate.

The tasks that have to be performed to run `python -S -c "print('Hi')"` are:

- print("Hi")
- print("Hi")

None of those tasks should take long, so what is slow?