@moreaupascal56 (Collaborator) commented Feb 3, 2025

Context

The original code fetched data only for the year 2024; the goal is to fetch data from 2016 to 2024 and parametrize the year.

Changes

build_database.py

  • Fetches the yearly dataset metadata (the id and the name are obtained via the URL) in the get_yearly_dataset_infos() function.
  • Adds a check_table_existence() function to check whether a table exists in DuckDB.
  • Parametrizes the process_sise_eaux_dataset_2024() function (renamed to download_extract_insert_yearly_SISE_data()) by adding a year parameter.
    To avoid duplicates, it checks whether the table exists and, if so, inserts the data instead of creating the table.
  • Adds a process_sise_eaux_dataset() function as a controller for download_extract_insert_yearly_SISE_data(), with a refresh_type parameter.

_common.py

  • Recreates the cache folder, since shutil.rmtree removes the folder and all its children; sometimes we just want to empty the folder.
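A minimal sketch of that change, assuming clear_cache takes the folder path and a recreate flag (the real signature in _common.py may differ):

```python
import shutil
from pathlib import Path

def clear_cache(cache_folder: Path, recreate_folder: bool = True) -> None:
    """Empty the cache. shutil.rmtree removes the folder itself along with
    its children, so recreate the (now empty) folder by default."""
    shutil.rmtree(cache_folder, ignore_errors=True)
    if recreate_folder:
        cache_folder.mkdir(parents=True, exist_ok=True)
```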

I will add explanations directly on the code below if needed.

Results

I have run this for the years 2016 to 2024:

pascal:~/path_to_project/13_pollution_eau$ uv run pipelines/run.py run build_database
2025-02-03 01:26:34,327 - root - INFO - Starting task build_database...
2025-02-03 01:26:34,328 - tasks.build_database - INFO - Launching processing of SISE-Eaux dataset for years: ['2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024']
2025-02-03 01:26:34,328 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2016...
2025-02-03 01:27:18,227 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:28:14,023 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:29:28,808 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:29:29,981 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2017...
2025-02-03 01:30:17,229 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:31:12,358 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:32:48,811 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:32:50,153 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2018...
2025-02-03 01:33:39,203 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:34:36,361 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:35:58,385 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:35:59,584 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2019...
2025-02-03 01:36:57,496 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:37:53,370 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:39:07,474 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:39:08,617 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2020...
2025-02-03 01:40:06,480 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:41:03,738 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:42:49,059 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:42:50,637 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2021...
2025-02-03 01:43:50,716 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:44:54,239 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:46:13,206 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:46:14,492 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2022...
2025-02-03 01:47:10,268 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:48:14,867 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:49:54,366 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:49:55,577 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2023...
2025-02-03 01:50:47,723 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:51:51,457 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:53:26,641 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:53:27,838 - tasks.build_database - INFO - Processing SISE-Eaux dataset for 2024...
2025-02-03 01:54:20,804 - tasks.build_database - INFO -    Extracting files...
2025-02-03 01:55:20,821 - tasks.build_database - INFO -    Creating or updating tables in the database...
2025-02-03 01:56:46,998 - tasks.build_database - INFO -    Cleaning up cache...
2025-02-03 01:56:48,688 - tasks.build_database - INFO - Cleaning up cache...
2025-02-03 01:56:48,691 - root - INFO - Task build_database completed.

I have run some checks:

uv run pipelines/run.py run build_database

        SELECT COUNT(*)
        FROM sise_prelevements;
        
   count_star()
0       3684554

        SELECT annee_prelevement, COUNT(*)
        FROM sise_prelevements
        group by annee_prelevement;
        
   annee_prelevement  count_star()
0               2016        398366
1               2017        402015
2               2018        403663
3               2019        409081
4               2020        406865
5               2021        414561
6               2022        421756
7               2023        424035
8               2024        404212

Ideas

I also have some ideas, but I'd like your opinion on them:

  • Move check_table_existence to _common.py
  • Split download_extract_insert_yearly_SISE_data into download_extract_SISE_data and insert_SISE_data_to_duckdb for better readability
  • I think refresh_type = "all" in process_sise_eaux_dataset() should drop the tables entirely instead of deleting each year's data.
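For context, a possible shape for the refresh_type controller discussed above. The function names follow the PR, but the body is a sketch: AVAILABLE_YEARS, the default value, and the print-only stub are assumptions, not the project's code.

```python
AVAILABLE_YEARS = [str(y) for y in range(2016, 2025)]  # assumed year list

def download_extract_insert_yearly_SISE_data(year: str) -> None:
    # Stub standing in for the real download/extract/insert step.
    print(f"processing {year}")

def process_sise_eaux_dataset(refresh_type: str = "all") -> None:
    """Controller: pick which years to (re)process."""
    if refresh_type == "all":
        years_to_update = AVAILABLE_YEARS
    elif refresh_type == "last":
        # The slice [-1:] keeps a list; a plain [-1] would yield a single
        # string, which is the type error flagged later in review.
        years_to_update = AVAILABLE_YEARS[-1:]
    else:
        raise ValueError(f"refresh_type must be 'all' or 'last', got {refresh_type!r}")
    for year in years_to_update:
        download_extract_insert_yearly_SISE_data(year)
```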

@jaouena (Contributor) commented Feb 3, 2025

I've run the import over all the years without any issues. I understand your questions, particularly regarding resetting tables when launching with 'all' and the need to select the 'refresh_type' via the CLI. Depending on what others think and your availability @moreaupascal56, I could contribute the few changes you suggested if needed. Nothing more to add from my side.

@jereze (Contributor) left a comment

Hello Pascal,

Thanks for your quality contribution!

I'm going to ask for a few changes, notably around naming; we might as well make breaking changes now, at the start of the project.

  • Let's rename the tables... I had called this SISE-Eaux, but that was too approximate. I'll let you read the wiki/outline for the full context, but I think it would be good to name this source "Eau distribuée par commune" and prefix the table names with edc_. I'm open to other name/prefix proposals, but let's not keep my name.
  • Let's rename the "annee_prelevement" column: it's a column created by us data engineers, not by the data source, and you never know, there may be 2023 samples in the 2024 source... So let's call it "partition" or "pipeline_year_run" or something else; again, I'm open.
  • Small type error on years_to_update = available_years[-1]
  • It fails if the database was already built with the previous version. Maybe we don't fix that...
  • You raise the current limitations of your work yourself, but I'm fine with merging as is. Can you list new tasks for later? (update the CLI to allow passing a param, really dropping the tables on full refresh). We'll add these tasks to our backlog. If you want to do it directly in NocoDB, I can create access for you (write to me on Slack); otherwise list them in a comment and I'll copy-paste them.

@moreaupascal56 (Collaborator, Author) commented Feb 3, 2025

So, in the new changes:

  • Fixed the type when refresh_type = "last"
  • Renamed the references to SISE to edc
  • Renamed annee_prelevement to de_partition (de for data_engineering)
  • Handled an error that had slipped under the radar 👓

I created two tickets in the Kanban to improve this script:

build_database.py / Add an option to drop the tables
build_database.py / Add a way to parametrize process_edc_datasets() via the CLI

@moreaupascal56 requested a review from jereze February 3, 2025 17:36
@simhub75 (Contributor) commented Feb 3, 2025

Hello, great code Pascal.
After the requests.get, we could add a response.status_code == 200 check and, if it fails, log the error code.
(We could also add an automatic retry when requests.get fails, using urllib3.util.retry; to be seen whether that's necessary for this project.)
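Both suggestions could look like the sketch below: a status-code check with error logging, plus urllib3's Retry mounted on a requests Session for transient failures. download_with_retry and all its parameter values are illustrative, not the project's code.

```python
import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)

def download_with_retry(url: str, path: str) -> bool:
    """Download url to path, retrying transient server errors automatically."""
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    response = session.get(url)
    if response.status_code != 200:
        # Log the error code instead of silently writing a bad payload.
        logger.error("Download failed with HTTP %s for %s", response.status_code, url)
        return False
    with open(path, "wb") as f:
        f.write(response.content)
    return True
```

The Retry object only re-issues the request for the listed status codes and for connection errors; a 404, for instance, fails fast and is logged.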

A Contributor commented:

Typo in the filename, and the file is empty; add code or remove the notebook.

@moreaupascal56 (Collaborator, Author) replied Feb 5, 2025:
@jereze can I delete this notebook? I believe you added it, but there must have been an error during the import or the commit, I'm not sure (commit)

@moreaupascal56 (Collaborator, Author) commented Feb 4, 2025

Hello @NicolasDuchenne,

  • I added a config file, _config_edc.py; I don't know if that's how you envisioned it (or in YAML)
  • Changed the default refresh_type to last
  • Added de_ingestion_date to the SQL tables (to test it, you need to drop the tables or the db, because the table schema changes).

I wonder whether we should create a folder for the edc dataset, which would give:

tasks
├── _common.py
├── build_database.py
├── processing_edc
│ ├── _edc_config.py
│ └── processing_edc.py

because build_database.py could in theory build the whole db, not only EDC, so the logic for this dataset shouldn't live in that file.
The other option is simply to rename build_database.py to build_edc_database.py.

For the README.md, if that works for everyone, I'll update it once the code is ready.
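Such a config module could be a plain dict. The communes entry below mirrors the config excerpt quoted later in this thread; yearly_filename is a hypothetical helper, and placing the year right after the prefix in the file name is an assumption.

```python
# _config_edc.py sketch: one entry per file family in the yearly archives.
EDC_CONFIG = {
    "communes": {
        "file_name_prefix": "DIS_COM_UDI_",
        "file_extension": ".txt",
        "table_name": "edc_communes",
    },
}

def yearly_filename(dataset: str, year: str) -> str:
    """Build the expected file name for a dataset family and a year."""
    cfg = EDC_CONFIG[dataset]
    return f"{cfg['file_name_prefix']}{year}{cfg['file_extension']}"

print(yearly_filename("communes", "2024"))  # DIS_COM_UDI_2024.txt
```

Keeping the prefixes, extensions, and table names in one dict means the download/extract/insert code stays generic and new file families only touch the config.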

NicolasDuchenne previously approved these changes Feb 4, 2025

@NicolasDuchenne (Contributor) left a comment:

Looks good to me; we can restructure the code into subfolders if we add more sources in the future.

jereze previously approved these changes Feb 4, 2025
@jaouena mentioned this pull request Feb 4, 2025
Review thread on a config excerpt:

        "communes": {
            "file_name_prefix": "DIS_COM_UDI_",
            "file_extension": ".txt",
            "table_name": "edc_communes",
A Contributor commented:

Since you're renaming the tables, the test notebook (analytics/notebooks/exemple.ipynb) will no longer work. The table names in the notebook need to be updated.

@moreaupascal56 (Collaborator, Author) replied:

Fixed in commit 5e901f8.

@moreaupascal56 dismissed stale reviews from jereze and NicolasDuchenne via 5e901f8 February 5, 2025 10:10
@NicolasDuchenne merged commit 2bc4074 into dataforgoodfr:main Feb 5, 2025
1 check passed
tomboulier referenced this pull request in tomboulier/dataforgood_saison13_pollution_eau-backup Feb 8, 2025
* add fct get_yearly_dataset_infos

* parametrize process_sise_eaux_dataset_2024

* recreate cache folder by default in clear_cache()

* add check_table_existence function

* parametrize process_sise_eaux_dataset_2024()

* format _common.py

* add process_sise_eaux_dataset() controller and rename process_sise_eaux_dataset_2024 to download_extract_insert_yearly_SISE_data

* upd docstrings, formatting

* upd logs and add a clear_cache() in process_sise_eaux_dataset

* reorganize file

* add notebook to preview data

* fix Incompatible types in assignment

* rename SISE to EDC

* rename annee_prelevement to de_partition

* catch and raise error if refresh_type not in the allowed values

* format

* fix typo

* add _config_edc.py and use it accordingly

* add de_ingestion_date

* make refresh_type="last" as default

* fix example notebooks with new sql table names

* delete test_plusierus_annees.ipynb

---------

Co-authored-by: Jeremy Greze <jereze@users.noreply.github.com>