Faster startup -- Share code objects from memory-mapped file #150

yuleil · 2021-09-19T16:20:10Z

yuleil
Sep 19, 2021

This is a CPython startup improvement approach proposed by the Alibaba Compiler Team.

We are working on ways to speed up Python application startup. The main idea is to share code objects from an mmap'ed file, which yields startup benefits similar to Experiment E with a simpler implementation.

Our design is inspired by the Application Class-Data Sharing (AppCDS) feature, introduced in OpenJDK. AppCDS allows a set of application classes to be pre-processed into a shared archive file, which can then be memory-mapped at runtime to reduce startup time and memory footprint.

Based on the above principle, we propose a Code-Data Sharing (CDS) approach, which allows a set of code objects to be deep-copied into a memory-mapped heap image file. During runtime:

use MAP_FIXED to map the heap image at its predetermined address, so that the pointers stored in it remain valid
One concern is that ASLR randomizes the address of the data section, so ob_type may point to the wrong address. The solution is to traverse every object in the heap image and patch ob_type with the correct address.
rehash the frozensets
get the code object directly from the heap image when importing packages
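One likely reason the frozensets need rehashing (an assumption on my part, the post does not spell it out): a frozenset places its elements by hash, and string hashes are randomized per process unless PYTHONHASHSEED is fixed. A minimal sketch that shows the same string hashing differently in two interpreter processes; the helper name is made up for illustration:

```python
import os
import subprocess
import sys


def string_hash_under_seed(text, seed):
    """Return hash(text) as computed by a fresh interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Same seed: same hash. Different seed: (almost certainly) a different hash,
# so a hash table laid out in one process is not valid in the other.
hash_a = string_hash_under_seed("spam", 1)
hash_a_again = string_hash_under_seed("spam", 1)
hash_b = string_hash_under_seed("spam", 2)
print(hash_a == hash_a_again, hash_a == hash_b)
```

A frozenset of strings that was laid out in the dumping process would therefore look corrupt to a process with a different seed unless it is rebuilt after mapping.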

Experiments

Env: Linux & Intel skylake

Running empty application

$ time for i in `seq 100`; do PYCDSMODE=0 python3 -c ''; done
real  0m1.486s
user  0m1.186s
sys   0m0.307s

$ PYCDSMODE=1 python3 -c '' # dump
$ time for i in `seq 100`; do PYCDSMODE=2 python3 -c ''; done
real  0m1.201s
user  0m0.934s
sys   0m0.273s

Startup time benefits: 19.18% reduction

WebServer (flask + requests + pymongo)

$ time PYCDSMODE=0 python3 -c 'import flask, requests, pymongo'
real  0m0.303s
user  0m0.278s
sys   0m0.025s

$ PYCDSMODE=1 python3 -c 'import flask, requests, pymongo' # dump
$ time PYCDSMODE=2 python3 -c 'import flask, requests, pymongo'
real  0m0.257s
user  0m0.232s
sys   0m0.024s

Startup time benefits: 15.18% reduction
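For reference, both percentages follow from the "real" timings above as (baseline - optimized) / baseline. A small check, with a helper name chosen here for illustration:

```python
def reduction_percent(baseline_s, optimized_s):
    """Relative wall-clock reduction, in percent of the baseline."""
    return 100.0 * (baseline_s - optimized_s) / baseline_s


# "real" times copied from the two experiments above.
empty_app = reduction_percent(1.486, 1.201)
web_server = reduction_percent(0.303, 0.257)
print(f"{empty_app:.2f}% {web_server:.2f}%")  # prints "19.18% 15.18%"
```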

Summary

Compared to the existing approaches, the main contributions of our CDS approach are:

CDS uses the heap objects directly, while the memory-mapped implementation in PyICE needs some deserialization
CDS doesn't need to generate C source code, so no C toolchain is required for compiling. This is essential for a production environment on the cloud

Considering that AppCDS has proved successful in OpenJDK 10, we believe our proposal can be a practical feature to enhance CPython startup performance, even though our overall design is still evolving.

gvanrossum · 2021-09-20T00:35:40Z

gvanrossum
Sep 20, 2021
Maintainer

Thanks! We'll try to study this in the coming weeks.

0 replies

gvanrossum · 2021-09-21T02:19:42Z

gvanrossum
Sep 21, 2021
Maintainer

To make comparison easier, here's a GitHub diff from v3.9.5, which is where this code was forked:

https://github.com/python/cpython/compare/v3.9.5...yuleil:cds?expand=1

0 replies

gvanrossum · 2021-09-21T05:34:08Z

gvanrossum
Sep 21, 2021
Maintainer

I haven't looked deeply in the implementation, but the idea looks decent enough: there's a "dump" mode that creates an mmap'ed segment with a snapshot of the heap in a file (or some part of it -- perhaps only code objects?), and a "use" mode that maps that file into memory at the original address. The beauty is that this supports arbitrary 3rd party modules.

The complexity is caused by the need to fix up the segment after it's been mmap'ed in, because

The segment contains references to objects that aren't in the segment (e.g. Py_None, PyTuple_Type), and those objects' addresses may vary; and
Frozen sets (which seem to complicate everything in this area, e.g. reproducible marshal output).

The solution is nicely general, but requires two new tp_* fields in all type objects. This seems less than ideal, since IIUC these are only populated for a handful of builtin objects. Also, it means that the mmap'ed segments cannot be shared between processes (so Instagram's "pre-forked servers" solution still wins for memory use).

Comparing with Experiment E (#84), the tooling is easier to use with 3rd party modules, although the dump/use mechanism is a bit clunky. I wonder if you could borrow an idea from #84 and generate a table of fix-ups for the mmap'ed segment that do things like patching references to standard types and singleton values, instead of adding new tp_* fields to all types.
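To make the fix-up table idea concrete, here is a toy model, not the fork's or #84's actual format: the "image" is a byte buffer with 8-byte pointer slots, and the table lists which slot must be patched with the address of which well-known object. All names and the layout are assumptions for illustration.

```python
import struct

# Well-known objects that live outside the image, addressed by symbolic name.
SYMBOLS = {"None": None, "tuple": tuple, "str": str}


def apply_fixups(image, fixups):
    """Patch each (offset, symbol) slot with the object's address in this process."""
    for offset, symbol in fixups:
        address = id(SYMBOLS[symbol])  # in CPython, id() is the object's address
        struct.pack_into("<Q", image, offset, address)


# Three pointer slots whose contents are only known at load time.
image = bytearray(24)
fixups = [(0, "tuple"), (8, "None"), (16, "str")]
apply_fixups(image, fixups)
slots = struct.unpack_from("<3Q", image, 0)
print(slots == (id(tuple), id(None), id(str)))
```

The table is data that travels with the image, so the loader needs no per-type hooks; it only needs a way to resolve each symbolic name in the current process.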

Ideal would be if you could package this as a 3rd party extension module that can be distributed via PyPI.

0 replies

gvanrossum · 2021-09-22T17:44:38Z

gvanrossum
Sep 22, 2021
Maintainer

Some more questions:

Have you compared this to an application packaging system like PyOxidizer?
Please briefly describe how you determine what is dumped, in dump mode? (I.e., how you capture all code objects and their dependencies but nothing else.)
How portable is this approach? Would it work on Mac? Windows?
Would it be possible to dump (a significant portion of) the stdlib and embed that in the CPython binary? (How would a user then go about dumping a collection of 3rd party libraries in addition?)

0 replies

yuleil · 2021-09-23T12:01:03Z

yuleil
Sep 23, 2021
Author

We are thankful for your timely feedback. Below are some explanations regarding your questions.

The solution is nicely general, but requires two new tp_* fields in all type objects. This seems less than ideal, since IIUC these are only populated for a handful of builtin objects.

I wonder if you could borrow an idea from #84 and generate a table of fix-ups for the mmap'ed segment that do things like patching references to standard types and singleton values, instead of adding new tp_* fields to all types.

The reason for the two new .tp_* fields is that I was planning to implement a generic shared heap that could hold objects of any type, and then share code objects on top of it. But that would take quite a long time to implement. From a practical perspective, we will focus on the types related to code objects and follow prior art to avoid adding new .tp_* fields.

Also, it means that the mmap'ed segments cannot be shared between processes (so Instagram's "pre-forked servers" solution still wins for memory use).

Our access to the data in mmap'ed segments is not read-only, considering stuff like the reference counting and patches to ob_type and Py_None. We might get page faults for the first write, then copy-on-write will be triggered, which will not save memory. We could adopt a similar idea from AppCDS feature in OpenJDK, which separates the existing archive (mmap'ed segments) into ro and rw sections. We will also check out Instagram's solution as you suggested.
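The reference-counting point can be seen from Python itself: merely taking another reference to an object changes a counter stored in the object's header, which is a write to whatever page the object lives on. A minimal illustration using an ordinary heap object:

```python
import sys

payload = object()
before = sys.getrefcount(payload)

# Binding a second name stores a new reference and increments ob_refcnt,
# a field inside the object itself.
alias = payload
after = sys.getrefcount(payload)

print(after - before)  # prints 1
```

If the object sat in a privately mapped file page, that increment would be the first write that triggers copy-on-write for the page.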

Ideal would be if you could package this as a 3rd party extension module that can be distributed via PyPI.

This matches perfectly with our planning. We also hope to distribute this via PyPI and will continue working in that direction.

more Q & A

Have you compared this to an application packaging system like PyOxidizer?

Not yet. We will take a close look at that.

Please briefly describe how you determine what is dumped, in dump mode? (I.e., how you capture all code objects and their dependencies but nothing else.)

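def patch_import_paths():
    # cds_mode is a flag added by the CDS fork; 1 corresponds to dump mode (PYCDSMODE=1 above).
    if sys.flags.cds_mode == 1:
        def patch_get_code(orig_get_code):
            # Wrap get_code so that every loaded code object is recorded by module name.
            def wrap_get_code(self, name):
                code = orig_get_code(self, name)
                SharedCodeWrap.set_module_code(name, code)
                return code
            return wrap_get_code
        # SourceFileLoader inherits get_code from SourceLoader.
        SourceFileLoader.get_code = patch_get_code(SourceLoader.get_code)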

We patch the SourceFileLoader.get_code method to record the loaded code objects. The recorded code objects will be deep-copied to mmap'ed segments when the process exits.
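The same recording trick can be tried on a stock interpreter through the public importlib.machinery API. This is only a sketch of the tracing half: the `recorded` dict stands in for the fork's SharedCodeWrap, and the module name `cds_demo_mod` is made up for the demo.

```python
import importlib
import importlib.machinery
import pathlib
import sys
import tempfile
import types

recorded = {}  # module name -> code object (stand-in for the real archive)


def install_recorder():
    """Wrap SourceFileLoader.get_code; return the original so it can be restored."""
    loader_cls = importlib.machinery.SourceFileLoader
    original = loader_cls.get_code

    def recording_get_code(self, fullname):
        code = original(self, fullname)
        if code is not None:
            recorded[fullname] = code
        return code

    loader_cls.get_code = recording_get_code
    return original


with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "cds_demo_mod.py").write_text("VALUE = 42\n")
    sys.path.insert(0, tmp)
    original = install_recorder()
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("cds_demo_mod")
    finally:
        # Undo the patch and the path change so the demo leaves no trace.
        importlib.machinery.SourceFileLoader.get_code = original
        sys.path.remove(tmp)

print(module.VALUE, isinstance(recorded["cds_demo_mod"], types.CodeType))
```

Only modules imported while the wrapper is installed are recorded, which is why the approach is trace-based: whatever the traced run did not import is not in the archive.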

One challenge here is that all objects referenced by code objects need to be deep-copied into the mmap'ed segment. This operation depends heavily on the internal implementation of each object type. Here we add a .tp_move_in field to record the deep-copy operation for each type.
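To give a feel for what "all objects referenced by code objects" means, the following sketch walks a code object from Python and collects what it reaches through a few of its fields. It is an illustration only: the real copy happens in C, covers more fields, and the function name here is hypothetical.

```python
import types


def collect_reachable(obj, found=None):
    """Collect objects reachable from a code object via constants and names."""
    if found is None:
        found = {}
    if id(obj) in found:
        return found
    found[id(obj)] = obj
    if isinstance(obj, types.CodeType):
        # Nested functions and classes appear as code objects inside co_consts.
        for group in (obj.co_consts, obj.co_names, obj.co_varnames):
            collect_reachable(group, found)
    elif isinstance(obj, (tuple, frozenset)):
        for item in obj:
            collect_reachable(item, found)
    return found


source = "def f():\n    def g():\n        return x in {1, 2, 3}\n    return g\n"
module_code = compile(source, "<demo>", "exec")
reachable = collect_reachable(module_code)
code_objects = [o for o in reachable.values() if isinstance(o, types.CodeType)]
print(len(code_objects))  # module, f and g: prints 3
```

Each type that shows up in such a walk (code, tuple, str, int, frozenset, and so on) needs its own copy logic, which is the per-type work the paragraph above refers to.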

How portable is this approach? Would it work on Mac? Windows?

This feature was developed on macOS on the M1 chip. It works on both macOS and Linux. It doesn’t work on Windows for now, because it currently relies on the mmap system call. We will work on extending portability to more operating systems.

Would it be possible to dump (a significant portion of) the stdlib and embed that in the CPython binary? (How would a user then go about dumping a collection of 3rd party libraries in addition?)

A practical reference is OpenJDK's JEP 341: Default CDS Archives, which generates a CDS archive of JDK internal classes at build time. When a user needs to dump 3rd party libraries, a new archive file is generated, with the stdlib and 3rd party libraries used by the program. So there’s no need to use the pre-determined archive.

We tested an empty Python program python3 -c pass. Dumping the stdlib will produce an archive file of about 700KB. In contrast, OpenJDK 17's built-in archive file lib/server/classes.jsa is 14MB in size. So it is generally feasible to embed a default archive in the CPython binary.

We sincerely appreciate the attention you are giving to this. We will continue working on ways to make our deliveries more efficient. I will keep you posted on our progress.

0 replies

gvanrossum · 2021-09-29T23:06:06Z

gvanrossum
Sep 29, 2021
Maintainer

Thanks for your answers. I hope you bring the project to maturity. I have one follow-up question:

We tested an empty Python program python3 -c pass. Dumping the stdlib will produce an archive file of about 700KB.

That number looks suspiciously low. Which modules are included in that? The PYC files for the stdlib total to at least 70 MB.

0 replies

yuleil · 2021-10-08T09:11:49Z

yuleil
Oct 8, 2021
Author

CDS uses a trace-based model. Since the test program is python3 -c pass, only the modules used by python3 -c pass (and loaded after _bootstrap_external._install(), where patch_import_paths() is called) are moved to the mmap'ed segment.

More specifically, these modules are:

codecs
encodings.latin_1
posixpath
os
site
_bootlocale
abc
io
stat
_sitebuiltins
genericpath
encodings.utf_8
encodings.aliases
encodings
_collections_abc
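A rough way to see what a trace of an empty program has to work with is to ask a fresh interpreter which modules it has loaded by the time user code runs. This sketch lists everything in sys.modules, including built-in and frozen modules, so it is a superset of what the patched SourceFileLoader would record; the exact list varies by Python version and platform.

```python
import subprocess
import sys


def modules_loaded_at_startup():
    """Return the sorted module names present in a fresh interpreter's sys.modules."""
    probe = "import sys; print('\\n'.join(sorted(sys.modules)))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


startup_modules = modules_loaded_at_startup()
print(len(startup_modules), "modules loaded before user code runs")
```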

0 replies

oraluben · 2021-10-26T09:23:03Z

oraluben
Oct 26, 2021

The previous branch is obsolete and we've rewritten the implementation (e.g. removed the extra fields in the type object and reduced the hard-coded GC logic), which can be found at python/cpython@54a4e1b...oraluben:cds/main.

This will be the new base for our future development.

0 replies

gvanrossum · 2021-10-26T21:34:02Z

gvanrossum
Oct 26, 2021
Maintainer

Thanks. I don't think anyone on our team will have time to review the new version before our meeting, so hopefully you can explain some of the differences when we talk tomorrow.

0 replies

oraluben · 2021-12-13T08:52:02Z

oraluben
Dec 13, 2021

Hi, while implementing a third-party version of this CDS approach, we ran into an issue on which we would like your advice.

As we’ve described, there are three roles in the CDS process, and we need to set the role of each Python instance. The CPython fork PoC reads the PYCDSMODE environment variable during Python startup and sets up the role accordingly.

The third-party CDS has APIs like cds.trace()/cds.share(), plus cds.init_from_env(), which reads the environment like the PoC does, but using them requires changes to application code.
We were thinking about PYTHONSTARTUP, but found that it only applies to interactive Python.

Is there any way we can inject a startup hook to achieve that?
I am also considering creating a separate script for each role, like the following. Do you think there would be a significant performance impact?

$ cat cdstrace
#!python3

import cds
cds.trace('<package list>')

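import sys
# exec, not eval: eval only accepts a single expression, not a whole script.
exec(compile(open(sys.argv[1]).read(), sys.argv[1], 'exec'))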

@gvanrossum

1 reply

gvanrossum Dec 13, 2021
Maintainer

Wrapper scripts like that are pretty common, you can look at the main() functions in cProfile.py and pdb.py in the stdlib for examples. I wouldn't worry about performance too much.
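A wrapper in that style can lean on the stdlib's runpy instead of reading and executing the file by hand. The sketch below is an assumption about how such a wrapper might look, not the cds package's code: `hook` stands in for a call such as cds.trace(...), and the demo script is generated on the fly.

```python
import pathlib
import runpy
import sys
import tempfile


def run_with_hook(script_path, script_args, hook):
    """Call hook(), then run script_path as __main__ with its own sys.argv."""
    hook()
    saved_argv = sys.argv
    sys.argv = [script_path, *script_args]
    try:
        # run_path returns the globals of the executed script.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        sys.argv = saved_argv


calls = []
with tempfile.TemporaryDirectory() as tmp:
    script = pathlib.Path(tmp, "app.py")
    script.write_text("import sys\nARGS = sys.argv[1:]\nANSWER = 6 * 7\n")
    result = run_with_hook(str(script), ["--flag"], lambda: calls.append("traced"))

print(calls, result["ARGS"], result["ANSWER"])
```

Because the hook runs before the target script is imported or executed, anything the script imports afterwards is visible to it, which is what a tracing role needs.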

oraluben · 2022-02-18T13:47:32Z

oraluben
Feb 18, 2022

Status update in python-ideas: https://mail.python.org/archives/list/python-ideas@python.org/thread/UKEBNHXYC3NPX36NS76LQZZYLRA4RVEJ/

2 replies

gvanrossum Feb 20, 2022
Maintainer

Great -- though I'd post to python-dev if you're looking for reviews. :-)

ericsnowcurrently Feb 22, 2022
Maintainer

https://mail.python.org/archives/list/python-dev@python.org/message/B77BQQFDSTPY4KA4HMHYXJEV3MOU7W3X/

oraluben · 2022-03-29T04:04:28Z

oraluben
Mar 29, 2022

Finally, we're excited to share the open-sourced third-party library at https://github.com/alibaba/code-data-share-for-python.
(Our lawyer isn't very happy about the short name "pycds", so here it is.)

We're currently working on detailed docs and infra setup (CI & releases based on GitHub Actions). The PyPI package is not available yet, but I think both will be ready very soon.

1 reply

gvanrossum Mar 29, 2022
Maintainer

Congrats on the milestone!
