
Modelling memory usage of arbitrary functions #749

@TomNicholas

Description


(Summarizing a conversation @tomwhite @alxmrs and I had yesterday, which was inspired by a question someone asked me at SciPy)

Some functions have unknown memory usage

One of Cubed's primary selling points is its model of an explicit upper bound on projected memory usage. For most functions in the Array API, this comes from knowing the shape and dtype of all intermediate data in advance. But (as noted in #73) this doesn't work for operations with data-dependent output shapes (such as np.unique), because the output shape isn't known until runtime (and in general not until that stage of the computation is actually reached).
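A minimal sketch of why np.unique's output shape is data-dependent: two inputs with identical shape and dtype can produce outputs of different shapes, so the shape cannot be known at graph construction time.

```python
import numpy as np

a = np.array([1, 1, 1, 1])  # shape (4,)
b = np.array([1, 2, 3, 4])  # shape (4,)

# Same input shape and dtype, different output shapes --
# unknowable before the values themselves are computed.
print(np.unique(a).shape)  # (1,)
print(np.unique(b).shape)  # (4,)
```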

Placing bounds

I believe the particular functions with data-dependent output shapes in the current version of the Array API all have the convenient property that their outputs are no larger than their inputs, so we could still place an upper bound on the memory usage at graph construction time (these have not been implemented yet).
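To make the bounding idea concrete, here's a sketch (the helper name is illustrative, not an existing Cubed function): np.unique can never return more elements than its input has, so a worst-case byte count is computable from the input's shape and dtype alone.

```python
import numpy as np

def unique_memory_bound(shape, dtype):
    """Worst-case bytes for np.unique's output on an input of the
    given shape and dtype: at most input.size unique values."""
    n_elements = int(np.prod(shape))
    return n_elements * np.dtype(dtype).itemsize

# A (1000, 1000) float64 input: the unique output is at most 8 MB,
# known at graph construction time with no knowledge of the values.
print(unique_memory_bound((1000, 1000), "float64"))  # 8000000
```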

General case

But in general users need to be able to apply custom computations (e.g. via xarray.apply_ufunc) and those could in theory produce any output shape whatsoever, even data-dependent ones that are arbitrarily bigger than the inputs!

The fully general case is Cubed beyond just arrays, which gives us an interesting definition of the furthest Cubed's computation model could (or should) ever be generalized to: any parallelizable computation whose plan has stages that can be described before runtime, including bounding their memory requirements (where parallelism is achieved via embarrassingly parallel maps, and communication through some abstracted KV storage layer).

User input required

How do we square that with users being able to pass in arbitrary functions that could blow up memory? We can't automatically determine the memory usage in advance (@alxmrs pointed out that's literally as hard as the halting problem), but we could explicitly ask users to estimate the maximum projected memory usage of such functions.

In general that's a lot to ask, but for array computations specifically it only means asking "what's the biggest shape this custom array->array operation might produce?". For np.unique we already know the answer, so we can do it automatically, but users should be able to provide this for custom functions, or at least guess it. If they truly have no idea then we should still make them explicitly pick an arbitrary number, because the finite size of their worker's VM sets an unavoidable upper bound in reality anyway.
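A hypothetical sketch of what asking the user might look like (the names `max_output_shape` and `projected_mem` here are illustrative, not Cubed's actual API): the user supplies a worst-case output shape, from which a projected memory bound follows mechanically.

```python
def projected_mem(max_output_shape, dtype_itemsize):
    """Projected memory bound (bytes) from a user-supplied worst-case
    output shape for a custom array->array operation."""
    n_elements = 1
    for dim in max_output_shape:
        n_elements *= dim
    return n_elements * dtype_itemsize

# The user guesses their custom op never produces more than a
# (500, 500) float64 array; the planner can bound memory from that.
print(projected_mem((500, 500), 8))  # 2000000
```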

Bad user experience

Let's imagine a user has a complex array function to apply, with data-dependent output shape and memory usage. Let's also imagine that the stage which applies this function is near the end of a very large computation plan.

This is pretty much the worst possible case:

  • We don't know what the output shape is,
  • We can't pre-compute what the output shape will be, because it depends on knowing the result of 90% of the not-yet-performed computation,
  • If we took a "fuck around and find out" approach by picking some arbitrary resource allocation and running it regardless, we won't find out whether that allocation was enough until 90% of the way through.

Using Dask would be especially painful in this case, because you still have to pick the per-worker resource limit somehow, but the relationship between your configuration parameters and the function's memory usage can be very opaque, and if it fails at the last stage you will have to start all over again.

Better user experience

I think we could do much better with Cubed. We could in theory tackle the general case like this:

  • Ask users to estimate the upper limit of the memory usage / output shape.
  • If their estimate was high enough, the computation will complete the first time, possibly with some over-provisioning of resources.
  • If their estimate was too low, we are automatically able to resume from the preceding stage, so the 90% of work done is not lost.
  • Now we prompt the user by saying "would you like to try re-running this stage with 2x as much RAM?" or similar, and still potentially provide an estimate of total resource usage.
  • Alternatively / in addition, we could provide some config allowing the user to specify in advance how many resources they are willing to spend, e.g. "first try with X resources; if that doesn't work, keep doubling up to a max limit of Y; if that still doesn't work, raise". This would codify what people often do by hand in the real world already.
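The steps above can be sketched as a retry loop (a hypothetical helper, not an existing Cubed API): rerun the failed stage with double the allocation each time, giving up once the user's configured ceiling is exceeded. Because Cubed could resume from the preceding stage, only the failing stage is repeated.

```python
def run_with_doubling(stage, start_mem, max_mem):
    """Run `stage` with `start_mem` bytes, doubling the allocation on
    each MemoryError, up to a hard ceiling of `max_mem` bytes."""
    mem = start_mem
    while mem <= max_mem:
        try:
            return stage(mem)
        except MemoryError:
            mem *= 2  # resume just this stage with twice the allocation
    raise RuntimeError(f"stage still failing at the {max_mem}-byte ceiling")

# Toy stage that secretly needs at least 3 GB:
def stage(mem):
    if mem < 3 * 2**30:
        raise MemoryError
    return "done"

# Starting at 1 GB with an 8 GB ceiling: fails at 1 GB and 2 GB,
# then succeeds at 4 GB.
print(run_with_doubling(stage, 2**30, 8 * 2**30))  # done
```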

Actions

I propose we:

  • Provide a way for users to say what they expect the output shape (and possibly also any additional memory usage) would be when applying a custom function.
  • Document this along with a brief discussion of the design logic above.
  • Use this system for the functions with data-dependent output shape in the array API, possibly defaulting to the worst case of the largest possible output shape.
  • Define an "unsafe" stage in the cubed plan for such computations, where memory usage is still projected but with far lower confidence (vaguely reminiscent of unsafe Rust).
  • Add some kind of retry logic around such a stage, with a configurable prompt for the user, so they can choose how much compute they want to authorize the deployment of.
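Putting the proposed pieces together, the configuration for such an "unsafe" stage might look something like this (all keys are illustrative, not an existing Cubed config schema):

```python
# Hypothetical config for an "unsafe" stage: a user-supplied memory
# estimate plus a bounded retry policy for when it proves too low.
unsafe_stage_config = {
    "projected_mem": 2 * 2**30,  # user's estimate for the custom function
    "retry": {
        "factor": 2,             # multiply the allocation on each failure
        "max_mem": 16 * 2**30,   # raise once this ceiling is exceeded
        "prompt": True,          # ask the user before authorizing more compute
    },
}
```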
