
Include usage of filter_size for importing logs, user, computer #6889


Merged: 1 commit merged into main from fix/sqlite-too-large-query-parameters on Jun 6, 2025

Conversation

@agoscinski (Contributor) commented May 21, 2025

SQLite only accepts a limited number of query parameters, so we split large queries into chunks. The parameter filter_size is already provided for this purpose, but it is not used in every code path.

Building the query for the matching nodes from the backend can surpass SQLite's limit on the number of allowed query parameters and thus cause the archive import to fail. In this PR we use filter_size, which limits the number of parameters per query, when importing logs, users, and computers.

Fixes #6876 (I think both reported failures are due only to _import_logs not using filter_size).

Since this does not cover the usage of filter_size in all queries, but already seems to fix #6876, I opened a new issue #6907 to keep track of the remaining cases.
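To make the failure mode concrete, here is a minimal self-contained sketch of the chunking idea in plain sqlite3 (not the actual AiiDA code; FILTER_SIZE and batch_iter below are simplified stand-ins for the real query_params.filter_size and helper):

import sqlite3
from itertools import islice

FILTER_SIZE = 999  # stay at or below SQLite's historical default limit

def batch_iter(iterable, size):
    # Simplified stand-in for the chunking helper: yield chunks of at most `size` items.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (uuid TEXT PRIMARY KEY)")
uuids = [f"uuid-{i}" for i in range(5000)]
conn.executemany("INSERT INTO node VALUES (?)", [(u,) for u in uuids])

# A single IN (...) query with 5000 placeholders can exceed the variable
# limit; issuing one query per chunk keeps every statement within it.
found = []
for chunk in batch_iter(uuids, FILTER_SIZE):
    placeholders = ",".join("?" * len(chunk))
    rows = conn.execute(f"SELECT uuid FROM node WHERE uuid IN ({placeholders})", chunk)
    found.extend(row[0] for row in rows)
print(len(found))  # 5000
conn.close()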

codecov bot commented May 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.59%. Comparing base (8dd0948) to head (6ce00ca).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6889      +/-   ##
==========================================
+ Coverage   78.58%   78.59%   +0.01%     
==========================================
  Files         564      564              
  Lines       43130    43128       -2     
==========================================
+ Hits        33890    33891       +1     
+ Misses       9240     9237       -3     

☔ View full report in Codecov by Sentry.

@giovannipizzi (Member) commented:

I suggest using a batch size of ~950 (in any case < 999); see my discussion here (it may be good to copy that discussion here as well):

https://github.com/aiidateam/disk-objectstore/blob/73f14e3953c276268b7f7395ad385e693798e247/disk_objectstore/container.py#L97-L107
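(For context: SQLite's SQLITE_MAX_VARIABLE_NUMBER limit defaults to 999 before SQLite 3.32.0 and to 32766 from 3.32.0 onwards, which is why staying below 999 is the portable choice.)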

@agoscinski (Contributor, Author) commented May 22, 2025

I suggest to put a batch of ~950 (in any case < 999), see my discussion here (good to maybe copy the discussion also here):

This was a quick-and-dirty attempt. My idea was to determine the limit beforehand and then use it. I got the following script from ChatGPT but have not verified it yet.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Probe: keep adding placeholders until SQLite rejects the statement
max_vars = 1
try:
    while True:
        placeholders = ",".join(["?"] * max_vars)
        query = f"SELECT {placeholders}"
        cur.execute(query, [None] * max_vars)
        max_vars += 1
except sqlite3.OperationalError:
    print(f"Maximum number of variables supported: {max_vars - 1}")
finally:
    conn.close()

EDIT: The script does not seem very nice. I think the documented global constant that @giovannipizzi posted is the nicer solution.

EDIT2: There is already a filter_size: int = 999 parameter in our code implementing this kind of logic. We need to check whether the two can be unified.
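As an aside, on Python 3.11+ the sqlite3 module exposes SQLite's limit API directly, so the limit can simply be read instead of probed as in the script above; a minimal sketch:

import sqlite3

conn = sqlite3.connect(":memory:")
# Connection.getlimit() is available from Python 3.11 onwards
print(conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))
conn.close()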

@agoscinski force-pushed the fix/sqlite-too-large-query-parameters branch 4 times, most recently from df574cb to 462e01d on June 5, 2025 at 14:50
@agoscinski (Contributor, Author) commented:

The usage of filter_size also needs to be implemented in _merge_node_extras and all the other _import_* functions.

@agoscinski force-pushed the fix/sqlite-too-large-query-parameters branch from 462e01d to 4105c61 on June 5, 2025 at 17:01
.append(orm.User, filters={'email': {'in': chunk}}, project=['email', 'id'])
.all(batch_size=query_params.batch_size)
)
for _, chunk in batch_iter(set(input_id_email.values()), query_params.filter_size)
@agoscinski (Contributor, Author) commented on Jun 5, 2025:

Note that I use set to get only unique values, so that the query and thus the results will not contain duplicates. I don't think this will ever happen, but I wanted to be on the safe side here. I also guess the subsequent code can deal with duplicates, but I am not sure.

for _, chunk in batch_iter(set(input_id_uuid.values()), query_params.filter_size)
]
for key, value in query_results.items()
}
@agoscinski (Contributor, Author) commented:

For importing nodes, this is just a rewrite; a list comprehension should be faster.
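The diff fragments above suggest the general pattern: deduplicate the input values, query them in chunks of filter_size, and collect the rows into a mapping in a single comprehension. A hypothetical self-contained reconstruction (batch_iter and query_chunk here are stand-ins, not the actual AiiDA helpers):

from itertools import islice

def batch_iter(iterable, size):
    # Stand-in for the chunking helper: yield (index, chunk) pairs.
    it = iter(iterable)
    index = 0
    while chunk := list(islice(it, size)):
        yield index, chunk
        index += 1

def lookup_pks(values, filter_size, query_chunk):
    # Map each unique value (e.g. an email or uuid) to its database pk.
    # query_chunk(chunk) stands for one backend query, such as the
    # QueryBuilder call in the diff, returning (value, pk) rows.
    return {
        value: pk
        for _, chunk in batch_iter(set(values), filter_size)
        for value, pk in query_chunk(chunk)
    }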

@agoscinski changed the title from "Reduce query parameter size for sqlite" to "Include usage of filter_size for importing logs, user, computer" on Jun 5, 2025
@agoscinski marked this pull request as ready for review on June 5, 2025 at 17:08
@agoscinski requested a review from eimrek on June 5, 2025 at 17:12
@eimrek (Member) left a comment:

Thanks a lot @agoscinski. I tested this on my end as well, and it works well. 👍

Building the query for the matching nodes from the backend can
surpass the sqlite limit of allowed query parameters and thus result in
failure when importing the archive. In this PR we implement the usage of
the `filter_size` which limits the number of parameters in the query for
importing logs, user and computers.
@agoscinski force-pushed the fix/sqlite-too-large-query-parameters branch from 4105c61 to 6ce00ca on June 6, 2025 at 18:11
@agoscinski merged commit ee87c79 into main on Jun 6, 2025
18 checks passed
@agoscinski deleted the fix/sqlite-too-large-query-parameters branch on June 6, 2025 at 19:24
Successfully merging this pull request may close these issues:

Archive import fails with sqlite backend (too many SQL variables)