Deserialize hits on demand #8270


Closed
gao-artur opened this issue Jul 24, 2024 · 5 comments
Labels
8.x (Relates to a 8.x client version) · Category: Feature

Comments

@gao-artur

Hey. We need to process a large number of documents, sometimes tens of millions. For that, we retrieve the first batch, do some processing, then retrieve the next batch, and so on.

The problem is that these documents are pretty large. Multiply their size by the batch size, and you find that the deserialization process wastes a lot of memory.

Unfortunately, NEST deserializes Hits into an IReadOnlyCollection, which means all the documents are deserialized at once. You need to allocate a large array (most likely on the LOH) to hold all these objects, and they can't be GC'ed until you finish processing them and retrieve the next batch. Even during deserialization of the next batch, the previous batch won't be GC'ed unless you set the previous response reference to null before starting the next search.

We would like an API that deserializes documents one by one during the Hits enumeration. Changing Hits to IEnumerable/IAsyncEnumerable is probably too large a breaking change, but maybe introducing a new property for this purpose, something like LazyHits, could work? Hits could then read from LazyHits and cache the result in memory to avoid deserializing twice.

This approach will solve all the problems at once:

  • no need to allocate an array to hold deserialized documents
  • once we finish processing a document, we move on to the next one, and the previous document can be GC'ed.
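To illustrate the shape we have in mind, here is a minimal sketch; LazyHits and readNextHit are hypothetical names for illustration, not existing client API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LazyHitsSketch
{
    // Hypothetical lazy alternative to the eager Hits collection.
    // Each document is produced on demand during enumeration, so once
    // the consumer moves on, nothing keeps the previous document alive
    // and it becomes eligible for GC.
    public static async IAsyncEnumerable<TDocument> LazyHits<TDocument>(
        Func<int, Task<TDocument?>> readNextHit)
    {
        for (var i = 0; ; i++)
        {
            var doc = await readNextHit(i); // deserialize hit i only now
            if (doc is null)
                yield break;                // no more hits in this batch
            yield return doc;
        }
    }
}
```

No array of deserialized documents ever exists; the consumer drives deserialization one hit at a time via `await foreach`.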
@flobernd
Member

Hi @gao-artur, just to clarify: Are you using the new v8 client or NEST? NEST is deprecated and won't get any feature updates.

Otherwise, your feature request reminds me of this planned feature (Point in Time Search All Docs section):
#5149

That specific API could be implemented in a way that allows streaming the responses.

Usually it's hard to implement streaming with JSON responses as there is no guaranteed order of fields and System.Text.Json must buffer the whole response into memory to be able to finish deserialization. The above API does not allow access to individual fields of the SearchResponse which means we could implement custom JSON deserialization that skips everything except the hits field and yields the results 1 by 1.
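As a rough sketch of that idea (EnumerateHits is illustrative, not client API; it assumes the raw response bytes are at hand and that documents themselves contain no "_source" property — a real implementation would feed the reader incrementally from the network stream):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class StreamingHits
{
    // Yield each hit's "_source" one at a time: only the document
    // currently being enumerated exists as a CLR object, so the
    // previous one becomes eligible for GC immediately.
    public static IEnumerable<TDocument> EnumerateHits<TDocument>(byte[] utf8Response)
    {
        foreach (var (start, length) in FindSourceSpans(utf8Response))
            yield return DeserializeSegment<TDocument>(utf8Response, start, length);
    }

    // Utf8JsonReader is a ref struct and cannot be used inside an
    // iterator, so the byte range of every "_source" is located here.
    private static List<(int Start, int Length)> FindSourceSpans(byte[] utf8)
    {
        var spans = new List<(int, int)>();
        var reader = new Utf8JsonReader(utf8);
        while (reader.Read())
        {
            if (reader.TokenType != JsonTokenType.PropertyName || !reader.ValueTextEquals("_source"))
                continue;
            reader.Read();                           // move onto the value: '{' of the source object
            var start = (int)reader.TokenStartIndex; // byte offset of that '{'
            reader.Skip();                           // jump past the matching '}' without materializing anything
            spans.Add((start, (int)reader.BytesConsumed - start));
        }
        return spans;
    }

    private static TDocument DeserializeSegment<TDocument>(byte[] buffer, int start, int length)
        => JsonSerializer.Deserialize<TDocument>(buffer.AsSpan(start, length))!;
}
```

Everything outside the located "_source" byte ranges is skipped rather than turned into objects.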

Would this work for you?

@gao-artur
Author

Hey @flobernd,
I'm still using v7, but I explored v8 and found no change in this area. I saw #5149 and actually used it to rewrite scroll with PIT.

Usually it's hard to implement streaming with JSON responses as there is no guaranteed order of fields and System.Text.Json must buffer the whole response into memory to be able to finish deserialization.

This is fine and expected.

The above API does not allow access to individual fields of the SearchResponse which means we could implement custom JSON deserialization that skips everything except the hits field and yields the results 1 by 1.

This will work for this specific API, but in my case I also need the Total count returned by the first chunk. So, ideally, everything would be deserialized immediately, but hits only during enumeration.

@flobernd
Member

Hi @gao-artur,

So, ideally, everything would be deserialized immediately, but hits only during enumeration.

Sadly, this is not possible due to the mentioned "random" field order in the JSON response. If the total field is streamed after the hits array, we must buffer the hits in order to finish deserializing total and the other fields. Sure, we could defer deserialization and keep the hits in a decoupled buffer, but that would not solve the issue anyway, as it would just shift the heap allocation to a different place.

With PIT it should be possible to fire an initial query that returns the total count but no actual hits, followed by a series of chunk queries to fetch the actual documents.
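That flow could look roughly like this with the Elasticsearch query DSL (index name, keep-alive, and page size are illustrative):

```
// 1. Open a point in time against the target index
POST /my-index/_pit?keep_alive=1m

// 2. Total count only, no documents ("size": 0)
POST /_search
{
  "size": 0,
  "track_total_hits": true,
  "pit": { "id": "<pit-id from step 1>", "keep_alive": "1m" }
}

// 3. Chunk queries: page through the documents with search_after
//    (omit "search_after" on the first page)
POST /_search
{
  "size": 1000,
  "track_total_hits": false,
  "pit": { "id": "<pit-id>", "keep_alive": "1m" },
  "sort": [ { "_shard_doc": "asc" } ],
  "search_after": [ "<last sort value of the previous page>" ]
}
```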

@gao-artur
Author

I see, you are right. Then, your proposal to only stream hits works great for me.

@flobernd
Member

Going to close this issue in favor of the existing proposal and hope to find time to work on this soon!

@flobernd added the 8.x (Relates to a 8.x client version) label and removed the State: Awaiting Response label on Jul 29, 2024