Don't store resources content in memory #386

@s0ph1e

Currently all pages are held in memory (the content of each resource is stored in Resource.text), which causes high memory consumption.
It would be nice to avoid storing Resource.text and save resources directly to the FS as soon as they are received.
We can probably use streams for that:

  • for html, css: Request -> update links/images/styles/etc. -> saveResource
  • all other types (no content modification needed): Request -> saveResource
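
A minimal sketch of the two flows above, using only Node.js built-ins. The helper names `saveBinaryResource` and `saveTextResource` and the `rewriteLinks` callback are hypothetical and not part of the current code base. It also assumes that html/css still has to be buffered one resource at a time, because link rewriting needs the whole document; the saving comes from releasing that buffer right after the write instead of keeping it in Resource.text.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: resources that need no rewriting (images, fonts, ...)
// are piped straight to disk, so the full body is never held in memory.
async function saveBinaryResource(bodyStream, filePath) {
  await fs.promises.mkdir(path.dirname(filePath), { recursive: true });
  await pipeline(bodyStream, fs.createWriteStream(filePath));
}

// Hypothetical helper: html/css must be rewritten before saving, so this one
// resource is buffered, rewritten, written out, and then released.
async function saveTextResource(bodyStream, filePath, rewriteLinks) {
  const chunks = [];
  for await (const chunk of bodyStream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  const updated = rewriteLinks(Buffer.concat(chunks).toString('utf8'));
  await fs.promises.mkdir(path.dirname(filePath), { recursive: true });
  await fs.promises.writeFile(filePath, updated, 'utf8');
}
```

With this split, peak memory would be bounded by the largest text resource in flight rather than by the sum of all downloaded resources.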

To do:

  • Update the Resource class: get rid of the text property and related functionality. Probably store a reference to the resource's stream instead
  • Update the scraper mechanism: rework the request/save functionality in scraper. Replace the requestQueue property with streamsQueue, replace requestedResourcePromises with requestResourceStreams (or remove it), and use streams instead of promises in the request file
  • Check and update all actions that use Resource class objects (at least afterResponse and saveResource)
  • Measure memory consumption of the current implementation and of the streams implementation
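
For the measurement item, one possible approach is to sample the resident set size while a scrape runs and compare the peaks of the two implementations. The sketch below relies only on Node's `process.memoryUsage()`; the `measurePeakMemory` helper and its `task` argument are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: runs an async task and records the highest RSS
// (resident set size, in bytes) observed while it was running.
async function measurePeakMemory(task, intervalMs = 50) {
  let peakRss = process.memoryUsage().rss;
  const sample = () => {
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
  };
  const timer = setInterval(sample, intervalMs);
  try {
    const result = await task();
    sample(); // take one last sample right after the task finishes
    return { result, peakRss };
  } finally {
    clearInterval(timer); // always stop sampling, even if the task throws
  }
}
```

Running the same scrape through this wrapper on both branches would give comparable peak figures; sampling is coarse, so short spikes between samples can be missed.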

Questions:

  • How do we handle links to pages that have not been downloaded yet? Can we set the reference in the parent before the child is loaded? (see the getReference action)
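
One possible answer, sketched under the assumption that a filename can be derived from the URL alone (that is, without waiting for the response's content type): reserve the child's filename when its link is first discovered, so the parent can be rewritten and saved immediately. `FilenameRegistry` is a hypothetical class and does not reflect how getReference currently works.

```javascript
const path = require('node:path');

// Hypothetical registry: hands out a stable, unique filename per URL at
// discovery time, before the resource itself has been requested.
class FilenameRegistry {
  constructor() {
    this.byUrl = new Map(); // url -> reserved filename
    this.used = new Set();  // filenames already handed out
  }

  reserve(url) {
    if (this.byUrl.has(url)) {
      return this.byUrl.get(url);
    }
    const { pathname } = new URL(url);
    const base = path.posix.basename(pathname) || 'index.html';
    const ext = path.posix.extname(base);
    const stem = base.slice(0, base.length - ext.length);
    let candidate = base;
    // On a name collision, append _1, _2, ... before the extension.
    for (let n = 1; this.used.has(candidate); n++) {
      candidate = `${stem}_${n}${ext}`;
    }
    this.byUrl.set(url, candidate);
    this.used.add(candidate);
    return candidate;
  }
}
```

The parent would call `reserve(childUrl)` while rewriting its links, and the child would later be saved under the same reserved name.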
