
Conversation

alaniwi (Contributor) commented Apr 24, 2025

Adds an `--export` option to `esgpull update` in order to extract lists of the files that would be added to the download queue (before the point where it asks whether to add them, or exits if there are no new files).

This gives the user an opportunity to inspect which URLs the query is matching before committing to adding the files to the queue.

Two lists of dictionaries are written in JSON format (after prompting for the filenames, with reasonable defaults):

  • all files (before filtering to remove existing SHAs)
  • new files that will be added

(The PR also simplifies a loop with a list comprehension; this is not directly related to the new option.)
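For illustration, a minimal sketch (not part of the PR) of inspecting an exported list before adding anything to the queue; the filename and the "url" key are assumptions about the exported dictionaries, not taken from esgpull:

    # Hedged sketch: load the exported "new files" list and print the URLs it
    # contains, before re-running `esgpull update` and confirming the prompt.
    # The filename and the "url" key are assumptions, not taken from esgpull.
    import json

    with open("new_files.json") as fin:
        new_files = json.load(fin)

    for entry in new_files:
        print(entry.get("url", "<no url field>"))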

svenrdz (Collaborator) commented May 7, 2025

FYI, the `new_files` variable name is misleading: it corresponds to files that are new relative to the specific query they are added to, but those same files might already exist in the database, and might even have been downloaded already via another query. In that case they would still appear in new_files.json with status New, which might not be helpful at all.

If your intention is to know which files are new relative to the database, you can use this snippet to filter the list:

            if export:
                # strip any "<>" wrapping from the query name for use in filenames
                hash = qf.query.name.strip("<>")
                # keep only the files that are not already tracked in the database
                files_not_in_db = [file for file in new_files if file not in esg.db]
                for descrip, file_list, dfl_fn in [
                    ("all files", qf.files, f"{hash}_all_files.json"),
                    ("new files", files_not_in_db, f"{hash}_new_files.json"),
                ]:
                    # prompt for an output filename, offering a sensible default
                    fn = esg.ui.prompt(f"Filename to export {descrip} list:", dfl_fn)
                    with open(fn, "w") as fout:
                        json.dump([file.asdict() for file in file_list], fout, indent=4)
                    print(f"{len(file_list)} files - written to {fn}")

I'm not sure whether the all_files.json is useful, but I don't have a strong opinion against having it.

In any case, could you run `ruff format` on the update.py file? I'm fine without it, since the file will be formatted the next time I commit in it anyway, but then the formatting changes would be attributed to me instead.

I can merge this as it is now if you don't mind my comments; let me know what you decide so I can click the button.

alaniwi and others added 27 commits May 7, 2025 19:48
… queries to filter by status instead of looping over files (ESGF#68)

Co-authored-by: CEDA support <support@ceda.ac.uk>
* feat: make distrib=true the new default

* test: explicit distrib=false to keep previous expectation
…and using `--after`/`--before` (ESGF#69)

* fix: remove black formatter from alembic (post write hook)

* feat: add created_at column (datetime) to Query table

* feat: add migration for new column with backfill

* feat: update rich output with added date; de/serialize field

* refactor: improve date formatting utilities and implementation

* fix: avoid incorrect db state during dev with unbound migrations

* test: fix dict equality checks, ignore timestamps

* feat: add Query.updated_at column (datetime)

This column should get updated every time a file is added to a query. It
cannot rely on automatic database features and needs custom (although
simple) logic inside the cli update command.
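A generic, hedged sketch of that kind of logic (names are illustrative, not esgpull's actual models):

    # Hedged, generic SQLAlchemy-style sketch, not esgpull's code: bump
    # updated_at explicitly whenever files are attached to a query, since
    # no automatic database feature handles it here.
    from datetime import datetime, timezone

    def attach_files(session, query, files):
        query.files.extend(files)                      # add files to the query
        query.updated_at = datetime.now(timezone.utc)  # record when it changed
        session.commit()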

* chore: add migration for Query.updated_at

* fix(test): ignore at the correct depth

* test: add cli-based test on the logic for updating the updated_at column

* feat: implement the logic for updating the updated_at column

* feat: improve show output for added/updated timestamps

* chore: rename column created_at -> added_at for consistency

* feat(cli.show): add --before/--after filters

* test(cli.show): check logic of --before/--after

* chore: use more specific query to update faster

* chore: update version to dev (required for migrations)
* wip

* fix: rework the no_require rich prints

* lint

* optimize dataset stats gathering, reduce number of single sql queries

* remove cli datasets command (now in show)

* fix: handle __contains__ for non-sha tables (+better sql query)

* feat: add a handy debug mode to show full traceback

* fix: deduplication logs to debug instead of warning

* fix: better handling of dataset update

* better dataset query

* improve update message, more explicit

* remove dataset backfilling, improve message for update needed

It's probably better to not have the dataset at all than to have it with
potentially wrong data. This also speeds up the migration and keeps the
check for whether a dataset is "new" simple (otherwise we need to get it
from the db and do `if total_files == 0`).

* fix

* improve message

* fix: improve orphaned dataset detection for accurate update warnings

* remove useless test

* test: correct dataset info in show output

* test: don't use institution_id with distrib false (most likely to get zero results)

* fix: don't fetch files during dataset repr

* chore: move sql statement to sql module

* fix: hash dataset on dataset_id

* chore: improve typing for Result class

* feat: add dataset.is_complete statement + index on (dataset_id, status)
* feat: add plugin system based on events + cli commands

* tests: add plugin & cli tests

* doc: add plugin system documentation

* fix: show all signature with no event provided

* doc: add test command documentation and fix the create syntax

* fix: log plugin error stacktrace

* fix: check version compatibility before validating plugin signature

* feat: add tracing with execution time for all plugins at INFO level

* tests: shorter and more focused plugin tests

* tests: lint

* fix: add config.paths.plugins and use this value instead of hardcoded

* doc: improve plugins documentation

* fix: put back create plugin folder if plugins enabled

* fix: remove signature command, create makes it obsolete

* fix: remove signature command test

* docs: clarify list/dict usage in Config

* docs: fix example

* remove timeout for now

* remove async emit and executor code

* rename file_downloaded -> file_complete

* rename download_failure -> file_error

* remove query_updated, add dataset_complete

* move emit calls from iter_results to download method

* add emit for dataset_complete

* tests: fix missing folder

* add destination arg on file_complete event spec

* fix mypy error

* add start_time/end_time to file_complete event spec
repo: remove conda installation instructions
doc: add citation to docs index
Rename `Paths.__iter__` to `Paths.values`, use Config.paths in Filesystem
fix: updating re-added queries works (+test)
svenrdz and others added 9 commits August 8, 2025 10:41
`tomlkit` uses its own types, which pydantic does not know about.
They seem to work for ints, but strings are problematic.
Other toml libraries don't preserve comments when writing, so the
solution is simply to unwrap the "raw" config into plain Python types.
Also store the whole config, instead of only the "plugins" sub-tree, in
the "raw" property, to preserve the full file state upon writes.
fix(plugins): call .unwrap() on loaded config from tomlkit
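For illustration, a generic tomlkit example of the unwrap step (not esgpull's actual config code):

    # Hedged sketch: tomlkit parses into its own wrapper types, which preserve
    # comments and formatting but can trip up pydantic validation; .unwrap()
    # converts the document into plain built-in Python types.
    import tomlkit

    doc = tomlkit.parse('name = "esgpull"  # a comment\nretries = 3\n')
    print(type(doc["name"]))    # a tomlkit wrapper type (tomlkit.items.String)
    plain = doc.unwrap()        # plain dict of built-in Python types
    print(type(plain["name"]))  # <class 'str'>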
The bridge API does not support `facets=*` in query parameters. When the
index is detected as a bridge index, esgpull instead lists all keys from
the first file result, assuming files are homogeneous.

This assumption does not hold across all projects, but it works in most cases.
Fix bridge api extra parameters