
Conversation

veloman-yunkan
Collaborator

@veloman-yunkan veloman-yunkan commented Mar 9, 2025

Fixes #947

This PR started as an effort to better test the enhancement of ConcurrentCache implemented in #956. It ended up superseding the latter after being broken up into smaller pieces (#974, #975, #982) that were merged separately, whereupon it was easier to merge the remaining portion of this PR than to rebase/update #956.

@mgautierfr
Contributor

(I have only had a global look, not a thorough review of each commit.)

Testing that something works correctly through logging seems a bit odd to me.
Logs should be there to get information about what happens, not to test that things happen in the correct order. At best we could call this tracing. It is a small semantic subtlety, but I had to express it.

I am also uncomfortable with the orchestration. I was about to ask how you handle the unpredictability of the output ordering caused by multithreading, until I realized that you orchestrate the threads based on the expected output. But in doing so, you "linearize" the code workflow. The log system locks threads to ensure that only one thread can log at a time. So the testing changes the behavior of what is being tested (mostly, the locks in the concurrent cache become somewhat useless), and we are not testing what will run in production.

@veloman-yunkan
Collaborator Author

I am also uncomfortable with the orchestration. I was about to ask how you handle the unpredictability of the output ordering caused by multithreading, until I realized that you orchestrate the threads based on the expected output. But in doing so, you "linearize" the code workflow.

That is exactly the trick. Each instantiation of a multithreaded execution is equivalent to some linearization of it (assuming that there are no race conditions, i.e. access to shared resources is properly protected). Then I can define in the tests the linearizations that the code should handle.

The log system locks threads to ensure that only one thread can log at a time. So the testing changes the behavior of what is being tested (mostly, the locks in the concurrent cache become somewhat useless), and we are not testing what will run in production.

Consider the situation that you currently have in ConcurrentCache: in a comment in ConcurrentCache::getOrPut() you talk at length about a rare situation that may happen during a tiny time window. My approach makes it possible to simulate that scenario: we only need to put a log/trace statement in that window and, based on our understanding of how the scenario should work, generate the expected log manually. The test's job will be to validate that the code indeed executes along that path.

More generally, tests for concurrent flows must be written so as to cover various linearizations. In a TDD approach, such tests can help produce the right code for concurrency management.

You are right that the approach lacks means for detecting the absence of correct synchronization between threads. Thanks for pointing this out! I was mostly focused on testing the logic governing the execution under different scenarios, assuming that all measures against race conditions are in place. It should still allow catching bugs that lead to deadlocks or livelocks. I will think about how this approach (possibly after some enhancement) can be utilized to catch potential race conditions.

@mgautierfr
Contributor

Consider the situation that you currently have in ConcurrentCache: in a comment in ConcurrentCache::getOrPut() you talk at length about a rare situation that may happen during a tiny time window. My approach makes it possible to simulate that scenario: we only need to put a log/trace statement in that window and, based on our understanding of how the scenario should work, generate the expected log manually. The test's job will be to validate that the code indeed executes along that path.

More generally, tests for concurrent flows must be written so as to cover various linearizations. In a TDD approach, such tests can help produce the right code for concurrency management.

Said like that, I agree with you. But then I would prefer a proper name (such as test_sync_barrier) instead of log, especially as this will be a no-op at runtime and the purpose is to modify the behavior when we are testing.

@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch 2 times, most recently from e229738 to e34bd71 Compare March 12, 2025 13:51
@veloman-yunkan
Collaborator Author

@mgautierfr I am done with the preliminary version of my contribution to your PR. I will have to clean it up significantly, but its second part (focusing on the testing of ConcurrentCache, starting from 6aa5aee) will be almost unaffected, so I invite you to look at it starting from there.

@veloman-yunkan
Collaborator Author

Said like that, I agree with you. But then I would prefer a proper name (such as test_sync_barrier) instead of log, especially as this will be a no-op at runtime and the purpose is to modify the behavior when we are testing.

Synchronization is only one of the uses of logging/tracing. Currently it is done with two logging statements (before and after the mutex locking operation) that must be kept together in the target log of the simulated flow. A test_sync_barrier() macro may help to avoid the pitfall of trying to achieve that effect with a single general-purpose logging statement; log_debug_raii_sync_statement() is what can be used for that purpose.

But logging also helps to visualize behavior that is otherwise difficult to observe (for example, the absence of double work, addressed by the last commit).

@kelson42
Contributor

@veloman-yunkan What is the status here? If I get it right, you have extended/improved the PR of @mgautierfr (or code around it). I have nothing against that, but it is important that we move forward, and this PR seems stuck.

@veloman-yunkan
Collaborator Author

@veloman-yunkan What is the status here? If I get it right, you have extended/improved the PR of @mgautierfr (or code around it). I have nothing against that, but it is important that we move forward, and this PR seems stuck.

@kelson42 I am waiting for @mgautierfr's feedback.

@kelson42
Contributor

@mgautierfr Can you please come back here so we can move forward?

@mgautierfr
Contributor

I haven't made a full commit-by-commit review, but I mostly agree with the proposal. The concurrent cache is better tested with this PR.

I reiterate that I am a bit uncomfortable with the log function/macro doing more than logging.
I'm afraid that in the near future we will start adding logs in other modules, or simply change the logs in the concurrent cache when we want to debug/test it manually. That's why I think a name such as log_and_testing_sync would be better.

I am done with the preliminary version of my contribution to your PR. I will have to clean it up significantly.

Please clean it up so I can make a full review.

@kelson42
Contributor

Goal is to merge this PR and #956 so we can finally conclude on this memory mgmt effort.

@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch from e34bd71 to 8efd416 Compare April 18, 2025 14:06
@veloman-yunkan
Collaborator Author

Rebased my branch (including the changes from @mgautierfr's memory_cache branch) onto main, resolving conflicts with #958. Now going to clean up the changes and split them into several PRs.

@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch from 8efd416 to f287547 Compare April 18, 2025 21:54
@veloman-yunkan veloman-yunkan changed the base branch from memory_cache to log_debug_utils April 18, 2025 21:55
@veloman-yunkan veloman-yunkan mentioned this pull request Apr 18, 2025
@veloman-yunkan
Collaborator Author

Extracted the log_debug*() utilities into a separate (cleaned-up) PR #975. However, the macros were not renamed, since that would require more time and would run contrary to @kelson42's objective of finishing this functionality ASAP.

@kelson42
Contributor

@mgautierfr @veloman-yunkan Once #975 is merged, are we ready to merge this one?

  • It doesn't seem to be rebased on the Log debug utils #975 feature branch
  • Maybe we have the time to rename the macros? Shouldn't be that long.
  • I would prefer to have a formal review

@veloman-yunkan veloman-yunkan force-pushed the log_debug_utils branch 4 times, most recently from c115739 to 3cf4e08 Compare April 28, 2025 12:25
@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch from f287547 to 1eff7e1 Compare April 28, 2025 13:03
@veloman-yunkan
Collaborator Author

veloman-yunkan commented Apr 28, 2025

@mgautierfr @veloman-yunkan Once #975 is merged, are we ready to merge this one?

No, this PR has to be cleaned up and split.

Yes, it was not rebased: after you rebased the log_debug_utils branch following the merge of #976, this PR fell behind. I have now rebased it again.

  • Maybe we have the time to rename the macros? Shouldn't be that long.

I've thought about that too, but can't come up with good names.

  • I would prefer to have a formal review

This PR should be reviewed after clean up/split.


codecov bot commented Apr 28, 2025

Codecov Report

❌ Patch coverage is 73.17073% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.69%. Comparing base (30bac7d) to head (f4b6139).
⚠️ Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/fileimpl.cpp | 63.63% | 1 Missing and 11 partials ⚠️ |
| src/concurrent_cache.h | 40.00% | 0 Missing and 3 partials ⚠️ |
| src/buffer_reader.cpp | 0.00% | 2 Missing ⚠️ |
| src/compression.cpp | 66.66% | 1 Missing and 1 partial ⚠️ |
| src/lrucache.h | 77.77% | 0 Missing and 2 partials ⚠️ |
| src/cluster.cpp | 85.71% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #960      +/-   ##
==========================================
+ Coverage   57.65%   57.69%   +0.04%
==========================================
  Files         103      103
  Lines        4926     4976      +50
  Branches     2068     2083      +15
==========================================
+ Hits         2840     2871      +31
- Misses        706      713       +7
- Partials     1380     1392      +12
```

☔ View full report in Codecov by Sentry.


@veloman-yunkan veloman-yunkan changed the base branch from main to user_defined_cost_for_caches May 3, 2025 13:58
@veloman-yunkan veloman-yunkan changed the title [WIP] Testing of ConcurrentCache via logging [WIP] Singleton global cluster cache limited by memory usage May 3, 2025
@veloman-yunkan
Collaborator Author

The work done in this PR (a clean and properly tested enhancement of ConcurrentCache with support for user-defined item cost) has been extracted into #982, which should be merged before #956 or this very PR.

Base automatically changed from user_defined_cost_for_caches to main May 4, 2025 08:07
@kelson42 kelson42 force-pushed the memory_cache_veloman_version branch from a17352a to a1eca35 Compare May 4, 2025 18:00
@kelson42
Contributor

kelson42 commented May 4, 2025

@veloman-yunkan Not sure if #956 should be merged before this one or not, but I have rebased the branch on main HEAD and I guess this would be ready to merge?

@kelson42 kelson42 self-requested a review May 4, 2025 18:02
@kelson42 kelson42 added this to the 9.4.0 milestone May 4, 2025
@veloman-yunkan
Collaborator Author

@veloman-yunkan Not sure if #956 should be merged before this one or not, but I have rebased the branch on main HEAD and I guess this would be ready to merge?

This PR (together with the already merged pieces extracted from it) supersedes #956. It requires some clean-up and a CI fix. It will be ready today.

mgautierfr and others added 8 commits May 5, 2025 16:17
We need to know the memory consumption of our input readers (especially
for compressed clusters, where we need to know the size of the
decompressor's state).
Note that the changed meson option CLUSTER_CACHE_SIZE isn't
automatically updated when rebuilding/recompiling the project in an
existing build directory. The old small value of that setting, preserved
in the meson configuration, will lead to long runtimes of some unit tests
(because of the effectively disabled caching of clusters).

Run the following command to reconfigure your build directory:

```
meson configure -D CLUSTER_CACHE_SIZE=67108864
```
This is useless now as the cache is destroyed right afterwards but
will make a difference when the cache becomes global.
The cluster cache will become global, so we cannot test content against a
reference archive, as getting the data from it will also be affected by the
cache configuration.

So we "preload" all the content first and then run the test.
This commit breaks the ZimArchive.validate unit test because an
exception thrown from the FileImpl constructor after a cluster has been
loaded (e.g. by `FileImpl::getTitleAccessorV1()` or
`FileImpl::getDirectAccessInformation()` when preloading the Xapian DB)
leaves orphaned clusters in the global cluster cache, which may result in
false cache hits for a new FileImpl object constructed at the same
address as the deceased one.

Also note that the changed meson option CLUSTER_CACHE_SIZE isn't
automatically updated when rebuilding/recompiling the project in an
existing build directory. You can update it by running the following
command:

```
meson configure -D CLUSTER_CACHE_SIZE=536870912
```
This fixes the broken unit test ZimArchive.validate.
@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch from a1eca35 to ba1489b Compare May 5, 2025 16:31
@veloman-yunkan veloman-yunkan changed the title [WIP] Singleton global cluster cache limited by memory usage Singleton global cluster cache limited by memory usage May 5, 2025
@veloman-yunkan veloman-yunkan marked this pull request as ready for review May 5, 2025 16:54
... so that it can be included in fileimpl.h (CI on certain platforms
was broken because of that).

Also slightly reformatted the moved code and proof-read some comments.
@veloman-yunkan veloman-yunkan force-pushed the memory_cache_veloman_version branch from f0f3927 to f4b6139 Compare May 5, 2025 17:01
@veloman-yunkan
Collaborator Author

This PR is ready to be merged.


Successfully merging this pull request may close these issues.

About cache limited by memory