Deserialize hits on demand #8270


Closed
gao-artur opened this issue Jul 24, 2024 · 5 comments
Labels
8.x (Relates to a 8.x client version) · Category: Feature

Comments

@gao-artur

Hey. We need to process a large number of documents, sometimes tens of millions. For that, we retrieve the first batch, do some processing, then retrieve the next batch, and so on.

The problem is that these documents are pretty large. Multiply their size by the batch size, and you find that the deserialization process wastes a lot of memory.

Unfortunately, NEST deserializes Hits into an IReadOnlyCollection, which means all the documents are deserialized at once. You need to allocate a large array (most likely on the LOH) to hold all these objects, and they can't be GC'ed until you finish processing them and retrieve the next batch. Even during deserialization of the next batch, the previous batch won't be GC'ed unless you set the previous response reference to null before starting the next search.

We would like an API that deserializes documents one by one during the Hits enumeration. Changing Hits to IEnumerable/IAsyncEnumerable is probably too large a breaking change, but maybe introducing a new property for this purpose, something like LazyHits, could work? Hits could then read from LazyHits and cache the result in memory to avoid deserializing twice.

This approach will solve all the problems at once:

  • no need to allocate an array to hold deserialized documents
  • once we finish processing a document, we move on to the next one, and the previous document can be GC'ed.
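To illustrate the shape we have in mind, here is a minimal sketch; LazyHits and readNextHit are hypothetical names for illustration, not existing client API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LazyHitsSketch
{
    // Hypothetical lazy alternative to the eager Hits collection.
    // Each document is produced on demand during enumeration, so once
    // the consumer moves on, nothing keeps the previous document alive
    // and it becomes eligible for GC.
    public static async IAsyncEnumerable<TDocument> LazyHits<TDocument>(
        Func<int, Task<TDocument?>> readNextHit)
    {
        for (var i = 0; ; i++)
        {
            var doc = await readNextHit(i); // deserialize hit i only now
            if (doc is null)
                yield break;                // no more hits in this batch
            yield return doc;
        }
    }
}
```

No array of deserialized documents ever exists; the consumer drives deserialization one hit at a time via `await foreach`.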
@flobernd
Member

Hi @gao-artur, just to clarify: Are you using the new v8 client or NEST? NEST is deprecated and won't get any feature updates.

Otherwise, your feature request reminds me of this planned feature (Point in Time Search All Docs section):
#5149

That specific API could be implemented in a way that allows streaming the responses.

Usually it's hard to implement streaming with JSON responses as there is no guaranteed order of fields and System.Text.Json must buffer the whole response into memory to be able to finish deserialization. The above API does not allow access to individual fields of the SearchResponse which means we could implement custom JSON deserialization that skips everything except the hits field and yields the results 1 by 1.
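As a rough sketch of that idea (EnumerateHits is illustrative, not client API; it assumes the raw response bytes are at hand and that documents themselves contain no "_source" property — a real implementation would feed the reader incrementally from the network stream):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class StreamingHits
{
    // Yield each hit's "_source" one at a time: only the document
    // currently being enumerated exists as a CLR object, so the
    // previous one becomes eligible for GC immediately.
    public static IEnumerable<TDocument> EnumerateHits<TDocument>(byte[] utf8Response)
    {
        foreach (var (start, length) in FindSourceSpans(utf8Response))
            yield return DeserializeSegment<TDocument>(utf8Response, start, length);
    }

    // Utf8JsonReader is a ref struct and cannot be used inside an
    // iterator, so the byte range of every "_source" is located here.
    private static List<(int Start, int Length)> FindSourceSpans(byte[] utf8)
    {
        var spans = new List<(int, int)>();
        var reader = new Utf8JsonReader(utf8);
        while (reader.Read())
        {
            if (reader.TokenType != JsonTokenType.PropertyName || !reader.ValueTextEquals("_source"))
                continue;
            reader.Read();                           // move onto the value: '{' of the source object
            var start = (int)reader.TokenStartIndex; // byte offset of that '{'
            reader.Skip();                           // jump past the matching '}' without materializing anything
            spans.Add((start, (int)reader.BytesConsumed - start));
        }
        return spans;
    }

    private static TDocument DeserializeSegment<TDocument>(byte[] buffer, int start, int length)
        => JsonSerializer.Deserialize<TDocument>(buffer.AsSpan(start, length))!;
}
```

Everything outside the located "_source" byte ranges is skipped rather than turned into objects.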

Would this work for you?

@gao-artur
Author

Hey @flobernd,
I'm still using v7, but I explored v8 and found no change in this area. I saw #5149 and actually used it to rewrite scroll with PIT.

Usually it's hard to implement streaming with JSON responses as there is no guaranteed order of fields and System.Text.Json must buffer the whole response into memory to be able to finish deserialization.

This is fine and expected.

The above API does not allow access to individual fields of the SearchResponse which means we could implement custom JSON deserialization that skips everything except the hits field and yields the results 1 by 1.

This will work for this specific API, but in my case I also need the Total count returned by the first chunk. So, ideally, everything would be deserialized immediately, but hits only during enumeration.

@flobernd
Member

Hi @gao-artur,

So, ideally, everything would be deserialized immediately, but hits only during enumeration.

Sadly, this is not possible due to the mentioned "random" field order in the JSON response. If the total field is streamed after the hits array, we must buffer the hits in order to finish deserializing total and the other fields. Sure, we could defer deserialization and keep the hits in a decoupled buffer, but that would not solve the issue anyway, as it would just shift the heap allocation to a different place.

With PIT it should be possible to fire an initial query that returns the total count but no actual hits, followed by a series of chunk queries to fetch the actual documents.
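That flow could look roughly like this with the Elasticsearch query DSL (index name, keep-alive, and page size are illustrative):

```
// 1. Open a point in time against the target index
POST /my-index/_pit?keep_alive=1m

// 2. Total count only, no documents ("size": 0)
POST /_search
{
  "size": 0,
  "track_total_hits": true,
  "pit": { "id": "<pit-id from step 1>", "keep_alive": "1m" }
}

// 3. Chunk queries: page through the documents with search_after
//    (omit "search_after" on the first page)
POST /_search
{
  "size": 1000,
  "track_total_hits": false,
  "pit": { "id": "<pit-id>", "keep_alive": "1m" },
  "sort": [ { "_shard_doc": "asc" } ],
  "search_after": [ "<last sort value of the previous page>" ]
}
```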

@gao-artur
Author

I see, you are right. Then, your proposal to only stream hits works great for me.

@flobernd
Member

Going to close this issue in favor of the existing proposal and hope to find time to work on this soon!

@flobernd added the 8.x (Relates to a 8.x client version) label and removed the State: Awaiting Response label on Jul 29, 2024