
Conversation

alaniwi (Contributor) commented Apr 24, 2025

Adds an `--export` option to `esgpull update` in order to extract lists of the files that would be added to the download queue (before the point where it asks whether to add them, or exits if there are no new files).

This gives the user an opportunity to inspect which URLs the query is matching before committing to adding the files to the queue.

Two lists of dictionaries are written in JSON format (after prompting for the filenames, with reasonable defaults):

  • all files (before filtering to remove existing SHAs)
  • new files that will be added

(The PR also simplifies a loop with a list comprehension; this is not directly related to the new option.)
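For illustration, a minimal sketch (not part of the PR) of inspecting an exported list before adding anything to the queue; the filename and the "url" key are assumptions about the exported dictionaries, not taken from esgpull:

    # Hedged sketch: load the exported "new files" list and print the URLs it
    # contains, before re-running `esgpull update` and confirming the prompt.
    # The filename and the "url" key are assumptions, not taken from esgpull.
    import json

    with open("new_files.json") as fin:
        new_files = json.load(fin)

    for entry in new_files:
        print(entry.get("url", "<no url field>"))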

svenrdz (Collaborator) commented May 7, 2025

FYI, the `new_files` variable name is misleading: it corresponds to files that are new relative to the specific query they are added to, but those same files might already exist in the database, and might even have been downloaded already via another query. In that case they would still appear in new_files.json with status New, which might not be helpful at all.

If your intention is to know which files are new relative to the database, you can use this snippet to filter the list:

            if export:
                # strip any "<>" wrapping from the query name for use in filenames
                hash = qf.query.name.strip("<>")
                # keep only the files that are not already tracked in the database
                files_not_in_db = [file for file in new_files if file not in esg.db]
                for descrip, file_list, dfl_fn in [
                    ("all files", qf.files, f"{hash}_all_files.json"),
                    ("new files", files_not_in_db, f"{hash}_new_files.json"),
                ]:
                    # prompt for an output filename, offering a sensible default
                    fn = esg.ui.prompt(f"Filename to export {descrip} list:", dfl_fn)
                    with open(fn, "w") as fout:
                        json.dump([file.asdict() for file in file_list], fout, indent=4)
                    print(f"{len(file_list)} files - written to {fn}")

I'm not sure whether the all_files.json is useful, but I don't have a strong opinion against having it.

In any case, could you run `ruff format` on the update.py file? I'm fine without it, since the file will be formatted the next time I commit in it anyway, but then the formatting changes would be attributed to me instead.

I can merge this as it is now if you don't mind my comments; let me know what you decide so I can click the button.

alaniwi and others added 27 commits May 7, 2025 19:48
… queries to filter by status instead of looping over files (ESGF#68)

Co-authored-by: CEDA support <support@ceda.ac.uk>
* feat: make distrib=true the new default

* test: explicit distrib=false to keep previous expectation
…and using `--after`/`--before` (ESGF#69)

* fix: remove black formatter from alembic (post write hook)

* feat: add created_at column (datetime) to Query table

* feat: add migration for new column with backfill

* feat: update rich output with added date; de/serialize field

* refactor: improve date formatting utilities and implementation

* fix: avoid incorrect db state during dev with unbound migrations

* test: fix dict equality checks, ignore timestamps

* feat: add Query.updated_at column (datetime)

This column should get updated every time a file is added to a query. It
cannot rely on automatic database features and needs custom (although
simple) logic inside the cli update command.
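A generic, hedged sketch of that kind of logic (names are illustrative, not esgpull's actual models):

    # Hedged, generic SQLAlchemy-style sketch, not esgpull's code: bump
    # updated_at explicitly whenever files are attached to a query, since
    # no automatic database feature handles it here.
    from datetime import datetime, timezone

    def attach_files(session, query, files):
        query.files.extend(files)                      # add files to the query
        query.updated_at = datetime.now(timezone.utc)  # record when it changed
        session.commit()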

* chore: add migration for Query.updated_at

* fix(test): ignore at the correct depth

* test: add cli-based test on the logic for updating the updated_at column

* feat: implement the logic for updating the updated_at column

* feat: improve show output for added/updated timestamps

* chore: rename column created_at -> added_at for consistency

* feat(cli.show): add --before/--after filters

* test(cli.show): check logic of --before/--after

* chore: use more specific query to update faster

* chore: update version to dev (required for migrations)
* wip

* fix: rework the no_require rich prints

* lint

* optimize dataset stats gathering, reduce number of single sql queries

* remove cli datasets command (now in show)

* fix: handle __contains__ for non-sha tables (+better sql query)

* feat: add a handy debug mode to show full traceback

* fix: deduplication logs to debug instead of warning

* fix: better handling of dataset update

* better dataset query

* improve update message, more explicit

* remove dataset backfilling, improve message for update needed

It's probably better to not have the dataset at all than to have it with
potentially wrong data. This also speeds up the migration and keeps the
check for whether a dataset is "new" simple (otherwise we need to get it
from the db and do `if total_files == 0`).

* fix

* improve message

* fix: improve orphaned dataset detection for accurate update warnings

* remove useless test

* test: correct dataset info in show output

* test: don't use institution_id with distrib false (most likely to get zero results)

* fix: don't fetch files during dataset repr

* chore: move sql statement to sql module

* fix: hash dataset on dataset_id

* chore: improve typing for Result class

* feat: add dataset.is_complete statement + index on (dataset_id, status)
* feat: add plugin system based on events + cli commands

* tests: add plugin & cli tests

* doc: add plugin system documentation

* fix: show all signature with no event provided

* doc: add test command documentation and fix the create syntax

* fix: log plugin error stacktrace

* fix: check version compatibility before validating plugin signature

* feat: add tracing with execution time for all plugins at INFO level

* tests: shorter and more focused plugin tests

* tests: lint

* fix: add config.paths.plugins and use this value instead of hardcoded

* doc: improve plugins documentation

* fix: put back create plugin folder if plugins enabled

* fix: remove signature command, create makes it obsolete

* fix: remove signature command test

* docs: clarify list/dict usage in Config

* docs: fix example

* remove timeout for now

* remove async emit and executor code

* rename file_downloaded -> file_complete

* rename download_failure -> file_error

* remove query_updated, add dataset_complete

* move emit calls from iter_results to download method

* add emit for dataset_complete

* tests: fix missing folder

* add destination arg on file_complete event spec

* fix mypy error

* add start_time/end_time to file_complete event spec
repo: remove conda installation instructions
doc: add citation to docs index
Rename `Paths.__iter__` to `Paths.values`, use Config.paths in Filesystem
fix: updating re-added queries works (+test)
svenrdz and others added 9 commits August 8, 2025 10:41
`tomlkit` uses its own types, which pydantic does not know about.
They seem to work for ints, but strings are problematic.
Other toml libraries don't preserve comments when writing, so the
solution is simply to unwrap the "raw" config into plain Python types.
Also store the whole config, instead of only the "plugins" sub-tree, in
the "raw" property, to preserve the full file state upon writes.
fix(plugins): call .unwrap() on loaded config from tomlkit
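For illustration, a generic tomlkit example of the unwrap step (not esgpull's actual config code):

    # Hedged sketch: tomlkit parses into its own wrapper types, which preserve
    # comments and formatting but can trip up pydantic validation; .unwrap()
    # converts the document into plain built-in Python types.
    import tomlkit

    doc = tomlkit.parse('name = "esgpull"  # a comment\nretries = 3\n')
    print(type(doc["name"]))    # a tomlkit wrapper type (tomlkit.items.String)
    plain = doc.unwrap()        # plain dict of built-in Python types
    print(type(plain["name"]))  # <class 'str'>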
The bridge API does not support `facets=*` in query parameters. When the
index is detected as a bridge index, esgpull instead lists all keys from
the first file result, assuming files are homogeneous.

This assumption does not hold across all projects, but it works in most cases.
Fix bridge api extra parameters