Improve startup time. #134
Replies: 50 comments 1 reply
-
We also need to initialize some extension modules, sys and builtins.
-
If creating and/or destroying the new interpreter is what takes the time, then one possible way to speed it up is to freeze the object graph of a newly created interpreter. That would work roughly as follows:
We can now create the entire object graph for the interpreter by:
The above assumes that interpreters are fully independent.
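As a toy illustration of the build-time idea (all names below are invented, and real frozen objects would be full PyObject structs rather than plain C strings), one could walk a small object graph and emit static initializers for it, so that start-up only has to reference pre-built data:

```python
# Toy sketch: turn a small Python object graph into static C initializers.
# Real interpreter freezing must handle PyObject layouts, reference counts,
# interning, etc.; this only shows the "build the graph at build time" shape.
def freeze_graph(name, obj, out):
    if isinstance(obj, str):
        out.append(f'static const char {name}[] = "{obj}";')
        return name
    if isinstance(obj, tuple):
        parts = [freeze_graph(f"{name}_{i}", item, out) for i, item in enumerate(obj)]
        out.append(f"static const char *const {name}[] = {{{', '.join(parts)}}};")
        return name
    raise TypeError(f"cannot freeze {type(obj).__name__}")

lines = []
freeze_graph("startup_args", ("-S", "-c", "print('Hi')"), lines)
print("\n".join(lines))
```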
-
I've just remembered Running
-
I don't have any such info presently. I know there were investigations in the past that involved such profiling but I expect any resulting profiling data is outdated at this point. Regardless, it shouldn't take much effort to get at least some basic data. Furthermore, we can already get at least some insight by running
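One cheap way to get that kind of basic data is to time a minimal run and dump per-import costs. This is only a sketch using standard CPython options (`-S`, `-X importtime`), not necessarily the command referred to above:

```python
# Rough startup measurement: wall-clock time of a minimal run, plus the
# per-module import times that -X importtime writes to stderr.
import subprocess
import sys
import time

t0 = time.perf_counter()
subprocess.run([sys.executable, "-S", "-c", "print('Hi')"], check=True)
print(f"total: {(time.perf_counter() - t0) * 1000:.1f} ms")

subprocess.run([sys.executable, "-X", "importtime", "-S", "-c", "pass"], check=True)
```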
-
Regarding Mark's strategy, can you explain what we would have to do when we need to change the object graph? IIUC, currently we change some of the modules that are frozen (_frozen_importlib and friends), and then regenerate the frozen bytecode (for which we have tools). I am worried that your item (3) makes this process more complicated. Otherwise, it sounds like a decent strategy (many data structures are already static webs of pointers; the new thing is that we allow static webs of

And yes, we need to work on profiling startup.
-
FTR, facebook/instagram has been using a related strategy, which they described at the language summit a couple years back.
-
Please don't forget
-
Mark has a nascent proposal
-
This is only beneficial if the size of mapped files exceeds a certain threshold, which is fairly large (probably larger than 100 KiB by now). Below that, copying is faster. With small files, I expect a run-time penalty even after loading because the less dense packing of information (distributed across a larger number of pages) results in increased TLB misses. And by default, Linux will only allow 65,530 separate mappings per process.
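A quick way to probe the trade-off described here is a micro-benchmark along these lines (the file name is a placeholder; results depend heavily on file size, page-cache state and platform):

```python
# Compare copying a file into memory with mapping it and touching every page.
import mmap
import time

PATH = "some_module.pyc"  # placeholder: any existing, non-empty file will do

def load_copy(path):
    with open(path, "rb") as f:
        return f.read()  # one copy into process memory

def load_map(path):
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # zero-copy mapping

for loader in (load_copy, load_map):
    t0 = time.perf_counter()
    data = loader(PATH)
    # Touch every page so the mmap case actually faults the data in.
    touched = sum(data[i] for i in range(0, len(data), 4096))
    print(loader.__name__, f"{(time.perf_counter() - t0) * 1e6:.0f} us", touched)
```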
-
My intuition is that it will be hard to show that mmap is faster than copying here, so I would personally be happy if the first version just read (i.e., copied) the file into memory.
-
The point of the proposal is to avoid the overhead of unmarshalling; whether the file is copied or mapped isn't so important. The important thing is that the pyc format is position independent (no pointers), is immutable, and requires minimal decoding before being usable.
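To make "position independent, minimal decoding" concrete, here is a small sketch (the layout and field sizes are invented, not the actual proposal): every cross-reference is a file offset rather than a pointer, so the same buffer works whether it was read into memory or mapped:

```python
# Build and read a tiny position-independent "constants" buffer:
# a table of u32 offsets, each pointing at [u32 length][UTF-8 bytes].
import struct

def write_buffer(strings):
    offsets, blob, pos = [], b"", 4 * len(strings)
    for s in strings:
        data = s.encode("utf-8")
        offsets.append(pos)
        blob += struct.pack("<I", len(data)) + data
        pos += 4 + len(data)
    return struct.pack(f"<{len(strings)}I", *offsets) + blob

def read_string(buf, offset):
    # Decode a single entry on demand; nothing else in the buffer is touched.
    (length,) = struct.unpack_from("<I", buf, offset)
    return buf[offset + 4 : offset + 4 + length].decode("utf-8")

buf = write_buffer(["spam", "eggs"])
table = struct.unpack_from("<2I", buf, 0)
print([read_string(buf, off) for off in table])  # ['spam', 'eggs']
```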
-
Oh, that's really neat!
-
I just discovered
The biggest cumulative time (4.4 ms) is the
The next biggest cost is
Of course, without
Here's the raw data for that:
-
Some comments about this based on work I did at previous core sprints. Mark's ideas seem good to me and are along the same lines I was thinking.

The unmarshal step is fairly expensive and I think we could reduce that with a change to the code object layout. Currently, the VM expects code objects to be (mostly) composed of PyObjects. It doesn't need to be that way and we could create a code object memory layout that's much cheaper to load from disk. I suggested something similar in spirit to Cap’n Proto. We probably want to keep the on-disk files machine independent.

The importlib overhead has been optimized over the years so there is no big low-hanging fruit there. However, it seems wasteful how much work is done by importlib just to start up. I think we could dump all needed startup modules into a kind of optimized bundle (e.g. better optimized than a zipped package) and eliminate the importlib overhead. E.g. on startup, call a C function that unmarshals all the code in the bundle and then executes some bytecode to finish the startup. It doesn't need to be linked with the executable (as frozen modules are) but could be an external file, e.g. in

When Python starts, the interleaving of

Another idea: try to reduce the amount of work done when executing top-level module code. Quite a bit of work has been done over the years to try to reduce startup time and therefore what's done at the top level of modules imported at startup. E.g. do it the first time a function is called, rather than at import time. However, maybe we can do better. My lazy top-level code execution idea was one approach. The major killer of that was that metaclasses can basically execute anything. My AST walker couldn't find too much that could be lazily executed. I think Cinder has done something like this but I don't know details (StrictModule annotation?). Another issue: doing something lazily doesn't help performance if you end up doing it anyhow. I think I was running into such an issue.

A twist on the lazy code execution is to just defer the loading/unmarshal of code objects of functions until the function is actually called. It seems likely that quite a few functions in modules loaded at startup are never actually executed. So, does it help just to never load the code for them? I had a prototype of this idea but it didn't show much of a win. Might be worth revisiting. The original idea was inspired by something Larry did with PHP years ago.

Another idea: allow some code to be executed at module compile time, i.e. Python code that gets called by the compiler to compute objects that end up inside the .pyc. A simple example: assume my module needs to compute a dict that will become a constant global. It could be possible to do that work at compile time (rather than at import time) and then store the result of that computation in the .pyc. Something like:
Obviously there would be restrictions on what you can do at compile time. This would be an advanced feature and used carefully. Another place it could be used is to pre-compile regular expressions to something that can be quickly loaded at import time and still have good runtime performance. You can do something a bit like this by having code that generates a .py file and then compiling that. However, having it as a feature of the Python compiler would be nice (aside from the language complexity increase). Perhaps things like dataclasses could use the feature to do more work at compile time.
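Since the snippet after "Something like:" isn't preserved above, here is a runnable approximation of the same idea using the generate-a-.py-file workaround mentioned in this comment; the file and symbol names are invented for illustration:

```python
# Build step: do the expensive computation once, write it out as a module,
# and let the compiler turn it into a .pyc so importers just unmarshal it.
import pprint

def build_table():
    return {n: n * n for n in range(256)}  # stand-in for expensive work

with open("_generated_table.py", "w") as f:
    f.write("# Auto-generated; do not edit.\n")
    f.write("TABLE = " + pprint.pformat(build_table()) + "\n")

# Consumers then do `from _generated_table import TABLE`; the dict comes out
# of the .pyc at import time instead of being recomputed by top-level code.
```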
-
The
-
The biggest remaining design issue is the following. In my prototype, which I based on Mark's design, there is a single array of lazily-loadable constants, and (separately) a single array of names. These correspond to co_consts and co_names, but the arrays (tuples, really) are shared between all code objects loaded from the same PYC file. This doesn't scale so well -- in a large module you can easily have 1000s of constants or names, and whenever the size of either array exceeds 255, the corresponding instructions require prefixing with EXTENDED_ARG -- this makes the bytecode bulky and slow, and it also means we have to recompute all jump targets (in my prototype PYC writer I just give up when this happens, so I can only serialize toy modules). A secondary problem with this is that code objects are now contained in their own co_consts array, making them part of GC cycles (and throwing dis.py for a loop when it tries to recursively find all code objects nested in a module's code object).

The simplest solution I can think of is to give each serialized code object two extra arrays of indexes, which map the per-code co_consts and co_names arrays to per-file arrays. At runtime there would then have to be separate per-file arrays and per-code-object arrays. This is somewhat inefficient space-wise, but the implementation is straightforward. (If we were to skip the index arrays, we'd end up with the situation where if a name or constant is shared between many code objects, it would be hydrated many times, once per code object, and the resulting objects would not be shared unless interned.)
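A sketch of the "two extra arrays of indexes" idea, with invented names: each code object keeps a small per-code table mapping its local co_names slots onto the single per-file array, so opargs stay small while hydrated objects are still shared file-wide:

```python
# Per-file array shared by every code object loaded from the same PYC file.
per_file_names = ["eggs", "spam", "ham"]

class LazyCode:
    def __init__(self, name_map):
        self.name_map = name_map             # local slot -> per-file index
        self.names = [None] * len(name_map)  # hydrated lazily, per code object

    def load_name(self, oparg):
        if self.names[oparg] is None:
            self.names[oparg] = per_file_names[self.name_map[oparg]]  # hydrate once
        return self.names[oparg]

egg_sandwich = LazyCode(name_map=[0, 1])  # uses "eggs", "spam"
ham_sandwich = LazyCode(name_map=[2, 1])  # uses "ham", "spam"
print(egg_sandwich.load_name(1), ham_sandwich.load_name(1))  # spam spam
```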
-
Maybe this can reduce redirection to only the case of shared objects: each code object has a dedicated sub-section of the module-level array (and knows its first index). Builder.add returns (index, redirected), where redirected = 0 if the index entry has the actual const/name, and 1 if that entry has the index of the actual data. So if Builder.add added the item at a new index, redirected = 0. If it reused an old index, it needs to check whether the index is within this code object's section or not (so maybe it needs the first_index passed in). If it found an index from a previous code object, it adds that index to a new slot and returns this new slot's index with redirected = 1. Then there needs to be an opcode that resolves the redirection and feeds the MAKE_* with the correct index. Or something along those lines. Whether it's worth it depends on how often there won't be a need for redirection.
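A minimal sketch of the Builder.add contract described here (the class itself isn't shown in this thread, so the details are assumed): the first code object to add an item owns the real slot, and later code objects that reuse it get a fresh slot holding a redirect:

```python
class Builder:
    def __init__(self):
        self.entries = []   # ("value", obj) or ("redirect", owning_index)
        self.index_of = {}  # obj -> index of the slot that owns the actual data

    def add(self, obj, first_index):
        if obj not in self.index_of:
            self.index_of[obj] = len(self.entries)
            self.entries.append(("value", obj))
            return self.index_of[obj], 0          # new entry: redirected = 0
        owner = self.index_of[obj]
        if owner >= first_index:
            return owner, 0                       # reuse within this code object's section
        self.entries.append(("redirect", owner))  # slot pointing at an earlier code object's entry
        return len(self.entries) - 1, 1           # redirected = 1

b = Builder()
print(b.add("spam", first_index=0))  # (0, 0): owns the data
print(b.add("spam", first_index=1))  # (1, 1): redirect slot for a later code object
```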
-
Let's see, take this example:

```python
eggs = 0
ham = 1
spam = 2

def egg_sandwich():
    return eggs, spam

def ham_sandwich():
    return ham + spam + spam
```

There are three names here, "eggs", "spam" and "ham", and only "spam" is shared. The disassembly for the functions would be something like this (constant indices are relative here):
The strings section has four entries:
String entries 0, 1 and 2 have offsets directly into the binary data, where they find the length and encoded data for the strings. String entry 3 has a redirect instead of an offset into the binary data. How to represent redirection? We can encode this by setting the lowest bit of the offset value, if we ensure that offsets into binary data are always even. This wastes the occasional byte but that seems minor. (Alternatively, the offset could be shifted left by one.) Code to find the real offset would be something like this (compare to _PyHydra_UnicodeFromIndex()):

```c
PyObject *
_PyHydra_UnicodeFromIndex(struct lazy_pyc *pyc, int index)
{
    if (0 <= index && index < pyc->n_strings) {
        uint32_t offset = pyc->string_offsets[index];
        if (offset & 1) {
            /* Redirect: the payload is the index of the entry that owns the data. */
            index = offset >> 1;
            assert(0 <= index && index < pyc->n_strings);
            offset = pyc->string_offsets[index];
        }
        return _PyHydra_UnicodeFromOffset(pyc, offset);
    }
    PyErr_Format(PyExc_SystemError, "String index %d out of range", index);
    return NULL;
}
```

We then change the opcode for LOAD_NAME as follows:

```c
case TARGET(LOAD_NAME): {
    PyObject *name = GETITEM(names, oparg);
    if (name == NULL) {
        name = _PyHydrate_LoadName(co->co_pyc, co->co_strings_start + oparg);  // **Changed**
        if (name == NULL) {
            goto error;
        }
        Py_INCREF(name);                       // **New**
        PyTuple_SET_ITEM(names, oparg, name);  // **New**
    }
    ...  // Rest is unchanged
```

Where

Moreover we would need to update _PyHydra_LoadName() to first check if the string with the given index already exists in

```c
name = PyTuple_GET_ITEM(pyc->names, index);
if (name != NULL) {
    Py_INCREF(name);
    return name;
}
```

So as to avoid constructing multiple copies of the same string ("spam" in our example). Also, of course, when a code object is hydrated we have to set its
-
Either that, or (is this possible?) add an opcode "RESOLVE_REDIRECT index" which puts the real index on the stack, and then the oparg to the next opcode is something like -1, which means "pop it from the stack".
-
We could do that for the MAKE_STRING opcode, which is only used in the "mini-subroutines" called by LAZY_LOAD_CONSTANT, but for LOAD_NAME and friends there is an issue: inserting extra opcodes requires recomputing all jump targets, and also the line table and the exception table, which I'd rather not do (especially not in this prototype). We could define the MAKE_STRING opcode as always using the "absolute" index -- in fact, it has to be defined that way, since these mini-subroutines don't belong to any particular code object: if a constant is shared, the mini-subroutine could be invoked from any of the code objects that reference it, or even from another mini-subroutine. Since I generate the mini-subroutines from scratch and they don't have line or exception tables, having to use EXTENDED_ARG in these is not a problem.

The changes I suggested for co_names also need to be used for co_consts. We can use the same trick of requiring real offsets to be even. In fact, for code objects the offset has to be a multiple of 4, since we interpret the offset as a pointer to an array of

(Somewhat unrelated, the encoding for strings is currently a varint giving the size of the encoded bytes, followed by that many bytes comprising the UTF-8-encoded value. This favors space over time. But maybe we ought to favor time over space here? We could store the size in code units as a uint32_t shifted left by 2, encoding the character width in the bottom 2 bits, followed by raw data bytes, either 1, 2 or 4 bytes per code unit. We might even have a spare bit to distinguish between ASCII and Latin-1 in the case where it's one byte per character. E.g.

It's worth experimenting a bit here to see if this is actually faster, and we could scan the 100 or 4000 most popular PyPI packages to see what the total space wasted is compared to the current encoding. For a short ASCII string the wastage would be 3 bytes per string.)
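A sketch of that encoding (the details are guesses at one possible layout, not a spec): a little-endian uint32 header holding the size in code units shifted left by 2 and a character-width code in the bottom 2 bits, followed by raw code units:

```python
# Width codes: 0 = ASCII (1 byte/unit), 1 = Latin-1 (1 byte/unit),
# 2 = 2 bytes/unit, 3 = 4 bytes/unit.
import struct

CODECS = ("ascii", "latin-1", "utf-16-le", "utf-32-le")
WIDTHS = (1, 1, 2, 4)

def encode(s):
    top = max(map(ord, s), default=0)
    code = 0 if top < 0x80 else 1 if top < 0x100 else 2 if top < 0x10000 else 3
    return struct.pack("<I", (len(s) << 2) | code) + s.encode(CODECS[code])

def decode(buf):
    (header,) = struct.unpack_from("<I", buf, 0)
    size, code = header >> 2, header & 3
    return buf[4 : 4 + size * WIDTHS[code]].decode(CODECS[code])

for s in ("hi", "héllo", "ħello", "h😀llo"):
    assert decode(encode(s)) == s
```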
-
Wait -- if the 0, 1, 2 entries are just offsets into the binary data, then 3 might as well be that too, and just have the same offset as entry 1. No?
-
If you do it that way, egg_sandwich() and ham_sandwich() each end up with their own copy of "spam". Now, we should really intern these strings too, which would take care of that, so possibly this is okay. The marshal format shares interned strings. Any string used in LOAD_NAME etc. is worth hashing (these are identifiers that are almost certainly used as dict keys, either globals or builtins, or instance/class attributes). The sharing may be more important for MAKE_STRING, which is also used for "data" strings that aren't interned (things that don't look like identifiers) but are still kept shared both by the compiler and by marshal. (Marshal has a complex system for avoiding writing the same constant multiple times; see w_ref() and r_ref() and friends.) However, MAKE_STRING is only used from mini-subroutines, so we can (and must) set oparg to the absolute index. (Mini-subroutines don't have their own co_names and co_consts arrays.) So if there was a third function that had a string constant "spam", it would have a mini-subroutine invoked by LAZY_LOAD_CONSTANT, which would look like this:

We need to support redirects for LAZY_LOAD_CONSTANT too, and it can follow the same general principle (a per-code-object "relative" index and an "absolute" index). But we'll need two versions of that one! It's used both from regular code objects (where the bytecode rewriter substitutes it for LOAD_CONST, and the index needs to be relative to the code object) and from mini-subroutines, where we need to use the absolute index.
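A sketch of the two flavours, with invented names: the same hydration routine is reached either with a per-code-object relative oparg plus that code object's start offset, or with a file-wide absolute oparg from a mini-subroutine:

```python
per_file_consts = [None] * 8  # lazily hydrated constants, shared per PYC file

def hydrate(abs_index):
    if per_file_consts[abs_index] is None:
        per_file_consts[abs_index] = f"<constant {abs_index}>"  # stand-in for real hydration
    return per_file_consts[abs_index]

def lazy_load_constant_relative(co_consts_start, oparg):
    return hydrate(co_consts_start + oparg)  # used where LOAD_CONST was rewritten

def lazy_load_constant_absolute(oparg):
    return hydrate(oparg)                    # used inside mini-subroutines

# Both paths resolve to the same shared object for index 5.
print(lazy_load_constant_relative(4, 1) is lazy_load_constant_absolute(5))  # True
```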
-
Okay, so I am seeing something weird. When I run the "test_speed" subtest using the "test" package framework, it consistently reports a ~30% speedup, while when I run it using the unittest-based main(), it reports no speedup. Examples:
I see similar differences on macOS. At this point it's important to remind us how the test is run. The code is here. (Note that it seems that it always uses marshal.loads(). This is correct: the code that recognizes the new format lives at the top level in that function.) The big difference seems to be that the individual "classic" times reported are much lower when running using unittest.main() than when running using the test package. What could cause this? Is the test package maybe messing with GC configuration or some other system parameter?
-
Well, hm... In test/libregrtest/setup.py there's code that sets a dummy audit hook. If I comment that out, the classic running time using the test framework goes down to roughly what it is when using unittest.main(), and the "advantage" of the new code pretty much disappears. I consider this particular mystery solved.
-
I discussed the status of my experiment so far with Mark this morning, and we agreed to pause development and try some other experiments before we go further down this road.
Probably it would be better to start writing up Experiments B and C in more detail in new issues here. I will get started with that.
-
We now also have
-
Fun fact: PyOxidizer has support for importing .py/.pyc content from a memory-mapped file using 0-copy, and the run-time code implementing that functionality is available on PyPI (https://pypi.org/project/oxidized-importer) and doesn't require the use of PyOxidizer. This extension module exposes an API (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_api_reference.html). And it is even possible to create your own packed resources files from your own content (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html). Conceptually it is similar to the zip importer, except a bit faster. So if you wanted a quick way to eliminate the stdlib

I've also isolated unmarshaling as the perf hotspot when
-
Oh, here I thought PyOxidizer had solved my problem here already. But it's really closer to https://bugs.python.org/issue45020 -- it loads the marshal data in memory using 0-copy (not sure if that really doesn't copy anything ever, I am always wary of mmap, but I suspect it pays off if many processes share the same segment read-only). But this issue is about eliminating marshal altogether, at least for those modules that are (nearly) always imported at startup. This would have helped Mercurial a bit. And yes, the next step would be eliminating the execution of the code, going directly to the collection of objects in the module dict after execution. But that's a much harder problem, since the results may depend on lots of context (e.g. os.environ, sys.argv, hash randomization, network, threads, other loaded modules, you name it). See also Facebook's "Strict Modules" (https://github.com/facebookincubator/cinder#strict-modules) and/or "Static Python" (same link, next section).
-
What would be the concrete next steps for a contributor to help out with? I'm uncertain where we are with the desirability of the various proposals discussed here.
-
We're in the process of assessing where time is spent during startup and finalization and what to do about it. The frozen modules stuff described above helped but the rest of the story is fairly complex. Consequently, it will take some effort to reach the point where we have a clear picture of concrete next steps. You're welcome to help out with the ongoing investigation. 😄
-
Before we attempt this, we need to know where the time is spent.

@ericsnowcurrently Do you have any profiling information from running `python -S -c "print('Hi')"`, or a similar script? It takes about 10 ms on my machine for 3.9, which seems a long time to print "Hi". For comparison, 2.7 takes about 7 ms. I suspect there is a lot of incidental stuff going on that we can eliminate.

The tasks that have to be performed to run `python -S -c "print('Hi')"` are:

- print("Hi")
- print("Hi")

None of those tasks should take long, so what is slow?