@moreaupascal56 (Collaborator) commented Feb 3, 2025

Context

The original code fetched data only for the year 2024; the goal is to fetch data from 2016 to 2024 and parametrize the year.

Changes

build_database.py

  • Fetches the yearly dataset metadata (the id and the name are obtained via the URL) in the get_yearly_dataset_infos() function.
  • Adds a check_table_existence() function to check whether a table exists in DuckDB.
  • Parametrizes the process_sise_eaux_dataset_2024() function (renamed to download_extract_insert_yearly_SISE_data()) by adding a year parameter.
    To avoid duplicates, it checks whether the table exists and, if so, inserts the data instead of creating the table.
  • Adds a process_sise_eaux_dataset() function as a controller for download_extract_insert_yearly_SISE_data(), with a refresh_type parameter.

_common.py

  • Recreates the cache folder, since shutil.rmtree removes the folder and all its children; sometimes we just want to empty the folder.
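A minimal sketch of that change, assuming clear_cache takes the folder path and a recreate flag (the real signature in _common.py may differ):

```python
import shutil
from pathlib import Path

def clear_cache(cache_folder: Path, recreate_folder: bool = True) -> None:
    """Empty the cache. shutil.rmtree removes the folder itself along with
    its children, so recreate the (now empty) folder by default."""
    shutil.rmtree(cache_folder, ignore_errors=True)
    if recreate_folder:
        cache_folder.mkdir(parents=True, exist_ok=True)
```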

I will add explanations directly on the code below if needed.

Results

I have run this for the years 2016 to 2024:

pascal:~/path_to_project/13_pollution_eau$ uv run pipelines/run.py run build_database
2025-02-03 01:26:34,327 - root - INFO - Starting task build_database...
2025-02-03 01:26:34,328 - tasks.build_database - INFO - Launching processing of SISE-Eaux dataset for years: ['2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024']
2025-02-03 01:26:34,328 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2016...
2025-02-03 01:27:18,227 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:28:14,023 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:29:28,808 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:29:29,981 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2017...
2025-02-03 01:30:17,229 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:31:12,358 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:32:48,811 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:32:50,153 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2018...
2025-02-03 01:33:39,203 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:34:36,361 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:35:58,385 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:35:59,584 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2019...
2025-02-03 01:36:57,496 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:37:53,370 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:39:07,474 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:39:08,617 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2020...
2025-02-03 01:40:06,480 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:41:03,738 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:42:49,059 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:42:50,637 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2021...
2025-02-03 01:43:50,716 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:44:54,239 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:46:13,206 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:46:14,492 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2022...
2025-02-03 01:47:10,268 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:48:14,867 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:49:54,366 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:49:55,577 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2023...
2025-02-03 01:50:47,723 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:51:51,457 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:53:26,641 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:53:27,838 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2024...
2025-02-03 01:54:20,804 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:55:20,821 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:56:46,998 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:56:48,688 - tasks.build_database - INFO - Cleaning up cache...
2025-02-03 01:56:48,691 - root - INFO - Task build_database completed.

I have run some checks:

uv run pipelines/run.py run build_database

        SELECT COUNT(*)
        FROM sise_prelevements;
        
   count_star()
0       3684554

        SELECT annee_prelevement, COUNT(*)
        FROM sise_prelevements
        group by annee_prelevement;
        
   annee_prelevement  count_star()
0               2016        398366
1               2017        402015
2               2018        403663
3               2019        409081
4               2020        406865
5               2021        414561
6               2022        421756
7               2023        424035
8               2024        404212

Ideas

I also have some ideas, but I'd like your opinion on them:

  • Move check_table_existence to _common.py
  • Split download_extract_insert_yearly_SISE_data into download_extract_SISE_data and insert_SISE_data_to_duckdb for better readability
  • I think refresh_type = "all" in process_sise_eaux_dataset() should drop the tables entirely instead of deleting each year's data.
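For context, a possible shape for the refresh_type controller discussed above. The function names follow the PR, but the body is a sketch: AVAILABLE_YEARS, the default value, and the print-only stub are assumptions, not the project's code.

```python
AVAILABLE_YEARS = [str(y) for y in range(2016, 2025)]  # assumed year list

def download_extract_insert_yearly_SISE_data(year: str) -> None:
    # Stub standing in for the real download/extract/insert step.
    print(f"processing {year}")

def process_sise_eaux_dataset(refresh_type: str = "all") -> None:
    """Controller: pick which years to (re)process."""
    if refresh_type == "all":
        years_to_update = AVAILABLE_YEARS
    elif refresh_type == "last":
        # The slice [-1:] keeps a list; a plain [-1] would yield a single
        # string, which is the type error flagged later in review.
        years_to_update = AVAILABLE_YEARS[-1:]
    else:
        raise ValueError(f"refresh_type must be 'all' or 'last', got {refresh_type!r}")
    for year in years_to_update:
        download_extract_insert_yearly_SISE_data(year)
```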

@jaouena (Contributor) commented Feb 3, 2025

I've run the import over all the years without any issues. I understand your questions, particularly regarding resetting tables when launching with 'all' and the need to select the 'refresh_type' via the CLI. Depending on what others think and your availability @moreaupascal56, I could contribute the few changes you suggested if needed. Nothing more to add from my side.

@jereze (Contributor) left a comment

Hello Pascal,

Thanks for your quality contribution!

I'm going to ask for a few changes, notably around naming; we might as well make breaking changes now, at the start of the project.

  • Let's rename the tables... I had called this SISE-Eaux, but that was too approximate. I'll let you read the wiki/outline for the full context, but I think it would be good to name this source "Eau distribuée par commune" and prefix the table names with edc_. I'm open to other name/prefix proposals, but let's not keep my name.
  • Let's rename the "annee_prelevement" column: it's a column created by us data engineers, not by the data source, and you never know, there may be 2023 samples in the 2024 source... So let's call it "partition" or "pipeline_year_run" or something else; again, I'm open.
  • Small type error on years_to_update = available_years[-1]
  • It fails if the database was already built with the previous version. Maybe we don't fix that...
  • You raise the current limitations of your work yourself, but I'm fine with merging as is. Can you list new tasks for later? (update the CLI to allow passing a param, really dropping the tables on full refresh). We'll add these tasks to our backlog. If you want to do it directly in NocoDB, I can create access for you (write to me on Slack); otherwise list them in a comment and I'll copy-paste them.

@moreaupascal56 (Collaborator, Author) commented Feb 3, 2025

So, in the new changes:

  • Fixed the type when refresh_type = "last"
  • Renamed the references to SISE to edc
  • Renamed annee_prelevement to de_partition (de for data_engineering)
  • Handled an error that had slipped under the radar 👓

I created two tickets in the Kanban to improve this script:

build_database.py / Add an option to drop the tables
build_database.py / Add a way to parametrize process_edc_datasets() via the CLI

@moreaupascal56 requested a review from jereze February 3, 2025 17:36
@simhub75 (Contributor) commented Feb 3, 2025

Hello, great code Pascal.
After the requests.get, we could add a response.status_code == 200 check and, if it fails, log the error code.
(We could also add an automatic retry when requests.get fails, using urllib3.util.retry; to be seen whether that's necessary for this project.)
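Both suggestions could look like the sketch below: a status-code check with error logging, plus urllib3's Retry mounted on a requests Session for transient failures. download_with_retry and all its parameter values are illustrative, not the project's code.

```python
import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)

def download_with_retry(url: str, path: str) -> bool:
    """Download url to path, retrying transient server errors automatically."""
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    response = session.get(url)
    if response.status_code != 200:
        # Log the error code instead of silently writing a bad payload.
        logger.error("Download failed with HTTP %s for %s", response.status_code, url)
        return False
    with open(path, "wb") as f:
        f.write(response.content)
    return True
```

The Retry object only re-issues the request for the listed status codes and for connection errors; a 404, for instance, fails fast and is logged.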

A Contributor commented:

Typo in the filename, and the file is empty; add code or remove the notebook.

@moreaupascal56 (Collaborator, Author) replied Feb 5, 2025:
@jereze can I delete this notebook? I believe you added it, but there must have been an error during the import or the commit, I'm not sure (commit)

@moreaupascal56 (Collaborator, Author) commented Feb 4, 2025

Hello @NicolasDuchenne,

  • I added a config file, _config_edc.py; I don't know if that's how you envisioned it (or in YAML)
  • Changed the default refresh_type to last
  • Added de_ingestion_date to the SQL tables (to test it, you need to drop the tables or the db, because the table schema changes).

I wonder whether we should create a folder for the edc dataset, which would give:

tasks
├── _common.py
├── build_database.py
├── processing_edc
│ ├── _edc_config.py
│ └── processing_edc.py

because build_database.py could in theory build the whole db, not only EDC, so the logic for this dataset shouldn't live in that file.
The other option is simply to rename build_database.py to build_edc_database.py.

For the README.md, if that works for everyone, I'll update it once the code is ready.
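Such a config module could be a plain dict. The communes entry below mirrors the config excerpt quoted later in this thread; yearly_filename is a hypothetical helper, and placing the year right after the prefix in the file name is an assumption.

```python
# _config_edc.py sketch: one entry per file family in the yearly archives.
EDC_CONFIG = {
    "communes": {
        "file_name_prefix": "DIS_COM_UDI_",
        "file_extension": ".txt",
        "table_name": "edc_communes",
    },
}

def yearly_filename(dataset: str, year: str) -> str:
    """Build the expected file name for a dataset family and a year."""
    cfg = EDC_CONFIG[dataset]
    return f"{cfg['file_name_prefix']}{year}{cfg['file_extension']}"

print(yearly_filename("communes", "2024"))  # DIS_COM_UDI_2024.txt
```

Keeping the prefixes, extensions, and table names in one dict means the download/extract/insert code stays generic and new file families only touch the config.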

NicolasDuchenne previously approved these changes Feb 4, 2025

@NicolasDuchenne (Contributor) left a comment:

Looks good to me; we can restructure the code into subfolders if we add more sources in the future.

jereze previously approved these changes Feb 4, 2025
@jaouena mentioned this pull request Feb 4, 2025
Review thread on a config excerpt:

        "communes": {
            "file_name_prefix": "DIS_COM_UDI_",
            "file_extension": ".txt",
            "table_name": "edc_communes",
A Contributor commented:

Since you're renaming the tables, the test notebook (analytics/notebooks/exemple.ipynb) will no longer work. The table names in the notebook need to be updated.

@moreaupascal56 (Collaborator, Author) replied:

Fixed in commit 5e901f8.

@moreaupascal56 dismissed stale reviews from jereze and NicolasDuchenne via 5e901f8 February 5, 2025 10:10
@NicolasDuchenne merged commit 2bc4074 into dataforgoodfr:main Feb 5, 2025
1 check passed
tomboulier referenced this pull request in tomboulier/dataforgood_saison13_pollution_eau-backup Feb 8, 2025
* add fct get_yearly_dataset_infos

* parametrize process_sise_eaux_dataset_2024

* recreate cache folder by default in clear_cache()

* add check_table_existence function

* parametrize process_sise_eaux_dataset_2024()

* format _common.py

* add process_sise_eaux_dataset() controller and rename process_sise_eaux_dataset_2024 to download_extract_insert_yearly_SISE_data

* upd docstrings, formatting

* upd logs and add a clear_cache() in process_sise_eaux_dataset

* reorganize file

* add notebook to preview data

* fix Incompatible types in assignment

* rename SISE to EDC

* rename annee_prelevement to de_partition

* catch and raise error if refresh_type not in the allowed values

* format

* fix typo

* add _config_edc.py and use it accordingly

* add de_ingestion_date

* make refresh_type="last" as default

* fix example notebooks with new sql table names

* delete test_plusierus_annees.ipynb

---------

Co-authored-by: Jeremy Greze <jereze@users.noreply.github.com>