Allow a custom requests.Session for provider-list scraping #2380

Open
vasa-develop wants to merge 2 commits into Materials-Consortia:main from vasa-develop:fix/scraping-session-2275

@vasa-develop

Closes #2275.

Problem

get_all_databases() and its helpers (get_providers(), get_child_database_links()) in optimade/utils.py call the module-level requests.get(...) directly, so there is no way to supply custom HTTP configuration (e.g. a proxy) when scraping the provider list. As reported in #2274, this makes it impossible to use the client behind a proxy without manually specifying base URLs.

Fix

Add an optional session: requests.Session | None = None argument to the three scraping functions. When supplied, the request is routed through session.get(...); otherwise the behaviour is unchanged (module-level requests, which is what the reference server relies on at initialisation, as noted in the issue). get_all_databases threads the session down to both helpers.
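The fallback described above can be sketched as follows. This is a minimal illustration and not the code in optimade/utils.py: the function name `fetch_provider_list`, the `timeout` default, and the JSON return shape are assumptions made for the example. It relies only on the fact that the `requests` module and a `requests.Session` both expose a compatible `get`.

```python
from __future__ import annotations

import requests


def fetch_provider_list(
    url: str,
    session: requests.Session | None = None,
    timeout: float = 10.0,
) -> dict:
    """Hypothetical helper: GET `url` and return the decoded JSON body."""
    # Use the caller's session when one is given; otherwise fall back to the
    # module-level `requests` API, so callers that pass nothing see no change.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Because the session is only consulted when it is not None, existing call sites that pass no argument keep their current behaviour.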

This is intentionally scoped to the optimade.utils scrapers, matching the issue's suggestion to "refactor get_all_databases (and related functions) to use a custom HTTP client/session, but ... default to doing their own thing". Wiring this into OptimadeClient itself can follow separately if desired — happy to do that in a follow-up.

Example:

import requests
from optimade.utils import get_all_databases

session = requests.Session()
session.proxies = {"https": "socks5h://127.0.0.1:16667"}
databases = list(get_all_databases(session=session))

Test plan

  • New regression tests in tests/server/routers/test_utils.py:
    • test_get_providers_uses_provided_session — a supplied session is used and the global requests.get is not.
    • test_get_child_database_links_uses_provided_session — same for the child-link request.
    • test_get_all_databases_threads_session — the session is forwarded to both helpers.
  • All three fail on main (TypeError: unexpected keyword argument 'session') and pass with this change.
  • Existing test_utils.py tests still pass; ruff check, ruff format --check, and mypy are clean.
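
A regression test of the kind listed above might look roughly like this. It is a sketch under stated assumptions: `get_links` is a hypothetical stand-in for one of the scraping helpers, and the URL and response payload are invented. The test passes a mocked session and patches the global `requests.get` so that any fallback to it is detected.

```python
from unittest import mock

import requests


def get_links(url, session=None):
    """Hypothetical stand-in for one of the scraping helpers."""
    http = session if session is not None else requests
    return http.get(url, timeout=10).json()["data"]


def test_uses_provided_session():
    # A mock restricted to the requests.Session interface.
    session = mock.Mock(spec=requests.Session)
    session.get.return_value.json.return_value = {"data": ["a", "b"]}

    # Patch the module-level function so an accidental fallback is visible.
    with mock.patch("requests.get") as global_get:
        links = get_links("https://example.org/links", session=session)

    assert links == ["a", "b"]
    session.get.assert_called_once()
    global_get.assert_not_called()
```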

Thanks to @hongyi-zhao for the original report in #2274.

get_providers, get_child_database_links and get_all_databases now accept
an optional `session` argument so custom HTTP configuration (e.g. proxies)
can be used when scraping the provider list. The default behaviour is
unchanged (module-level `requests`).
@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 90.56%. Comparing base (d53161f) to head (f212cc4).

Files with missing lines    Patch %    Lines
optimade/utils.py           88.88%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2380      +/-   ##
==========================================
- Coverage   90.58%   90.56%   -0.02%     
==========================================
  Files          75       75              
  Lines        5034     5035       +1     
==========================================
  Hits         4560     4560              
- Misses        474      475       +1     
Flag         Coverage Δ
project      90.56% <88.88%> (-0.02%) ⬇️
validator    90.56% <88.88%> (-0.02%) ⬇️


The SSL-error retry path in get_child_database_links (verify=False) was the
one line of the session change left uncovered. Add a regression test that
raises SSLError on the first call and asserts the skip_ssl retry also goes
through the provided session with verify disabled.
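
The shape of such a test can be sketched as below. This is illustrative only: `get_links_with_ssl_retry` is a hypothetical stand-in, not the real get_child_database_links, and the URL and payload are invented. The mocked session raises `requests.exceptions.SSLError` on the first call and returns a response on the second, so the assertions can check that the retry reused the session with `verify=False`.

```python
from unittest import mock

import requests


def get_links_with_ssl_retry(url, session=None):
    """Hypothetical stand-in for a helper that retries without verification."""
    http = session if session is not None else requests
    try:
        response = http.get(url, timeout=10)
    except requests.exceptions.SSLError:
        # Retry once through the same client with verification disabled.
        response = http.get(url, timeout=10, verify=False)
    return response.json()["data"]


def test_ssl_retry_uses_provided_session():
    ok = mock.Mock()
    ok.json.return_value = {"data": ["child"]}
    session = mock.Mock(spec=requests.Session)
    # First call raises, second call returns the mocked response.
    session.get.side_effect = [requests.exceptions.SSLError("bad cert"), ok]

    with mock.patch("requests.get") as global_get:
        links = get_links_with_ssl_retry("https://example.org/links", session=session)

    assert links == ["child"]
    assert session.get.call_count == 2
    # The last (retry) call must have disabled certificate verification.
    assert session.get.call_args.kwargs["verify"] is False
    global_get.assert_not_called()
```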
Development

Successfully merging this pull request may close these issues.

Client does not use configured HTTP session for scraping provider list
