Commit dd44683

History fix (#65)
* Updates and refactoring (#54)
* Cleaned up yamls, notebooks, and began support for multiprocessing during aggregation, along with bug fixes
* Updated notebook
* Added additional field
* General debugging, performance improvements, and better logging
* Bug fixes and further aggregation multiprocessing development
* Added better support for GRACE datasets
* Bugfix
* Added support for multiprocessing in aggregation
* Added support for logging when using multiprocessing
* Bug Fixes and Improvements (#55)
* Added grids directory readme
* Added safety check to ensure global_settings.py is set up
* First draft of READMEs in code
* Added config validation which runs automatically as part of pipeline init
* Removed unused aggregated parameter and some bug fixes
* Runs ds yaml validator
* Added jsonschema and some version numbers
* Refactored to make use of dataset and field classes, where the dataset class contains ds config information used across all pipeline stages
* CMR harvesting uses unique concept id
* Bug fixes
* Added support for NSIDC HTTP data access
* Changed to HTTP data access
* Moved land masking to pre-transformation function
* Added missing fields
* Added 30 second fallback if CMR query fails
* Added support for preloading factors and grids along with some minor bug fixes
* Hot fix for missing log_filename variable
* Bug fix
* Fixed units bug and set GRACE datasets to be converted to cm
* Fixed multiprocessing logging bug and pass logging filename to functions
* Renamed TPOSE grid file
* Removed conversion to cm
* Bug fix when applying pre-transformation functions and downgraded associated noisy logging
* Added support for AVHRR ice removal
* Hotfix for single-processor aggregation
* Log overhaul (#58)
* Reworked logging to better handle multiprocessing logging
* Overhauled all logging and removed preloading of factors and grids as it was causing a process lock
* Reworked logic for determining transformations for a given granule; reduces calls to Solr
* Added support for CATDS HTTP harvesting and updated L3_DEBIAS_LOCEAN datasets
* Bugfix for logic finding transformations for a given granule
* Harvesting work (#59)
* Added NRT to list of file formats to ignore
* Added support for web-scraping HTML THREDDS harvesting for OSISAF datasets
* Bug fix for monthly url paths
* Added enumeration readme
* Set parser used by soup to xml
* Updated data paths to support OSI 408
* Bug fixes
* Removed unused size field from Granule in preparation for further refactoring
* Removed unused FTP harvester code
* Bug fix
* Suppress nan mean on empty slice warning
* Removed unused field
* Fixed logging issues
* Removed unused ftp harvester
* Updated GRACE mascon harvester to avoid redundant work
* Fixed CMRGranule mod_time dtype
* Added logging debug statements and suppressed nan mean empty slice warnings
* Improved logging
* Updated ATL daily harvesting to reduce redundant work
* Small updates
* Further refactoring (#60)
* Removed unused config fields
* Refactored utils directory
* Bug fixes and updates to reflect transformation and aggregation refactoring
* Added unittest for CMR querying
* Converted CMR querying to make use of python-cmr package
* Further refactoring to harvesters, transformations, and aggregations
* Removed changelog for now
* Fixed pre-transformation func name
* Refactored pre- and post-transformation functions
* Cleaned up imports
* Renamed notebooks directory to quicklook_notebooks
* Added preprocessing function logging info
* Improved transformation logging
* Overhauled dataset config readme. Still a work in progress
* Increased documentation verbiage
* Removed unused filename_filter field
* Expanded descriptions of some projection fields
* Bug fix
* Added context opener for writing files
* Added support for ATL21
* Added support for CMR harvesting unittest
* Fixed readme
1 parent 6b67651 commit dd44683

File tree

1 file changed (+5, -105 lines)


ecco_pipeline/conf/README.md

Lines changed: 5 additions & 105 deletions
@@ -1,109 +1,9 @@
-# Generating a dataset config file
+# Configuration files
 
-The recommended approach to adding support for a new dataset is to start with an existing config. The best way to obtain the information for these fields is through a combination of looking at a sample data granule and the dataset's documentation. Here we'll walk through the config for `AMSR-2_OSI-408`, looking at the various sections.
+## Dataset configs
 
-## Dataset
+Each dataset has a config file containing all the information needed to run it through the pipeline, from harvesting through aggregation. See the README in `ecco_pipeline/conf/ds_configs` for more information. `ecco_pipeline/conf/ds_configs/deprecated` contains configs for datasets that are no longer supported, typically because they have been supplanted by a newer version.
 
-```
-ds_name: AMSR-2_OSI-408 # Name for dataset
-start: "19800101T00:00:01Z" # yyyymmddThh:mm:ssZ
-end: "NOW" # yyyymmddThh:mm:ssZ for a specific date or "NOW" for...now
-```
+## global_settings.py
 
-- `ds_name` is the internal name for the dataset. We recommend using the dataset's shortname, or something similar if a shortname is not available.
-- `start` and `end` are the isoformatted (with Z!) date ranges that the pipeline should try to process. The `end` field can also be the string "NOW", which sets the end to the current datetime at runtime.
-
-## Harvester
-This section contains fields that dictate which harvester the data should be pulled from. Different fields are required depending on the harvester. These fields are consistent across all harvesters:
-```
-harvester_type: osisaf
-filename_date_fmt: "%Y%m%d"
-filename_date_regex: '\d{8}'
-```
-- `harvester_type` can be one of `cmr`, `osisaf`, `nsidc`, or `catds`.
-- `filename_date_fmt` is the string format of the date in the filename.
-- `filename_date_regex` is the regex format of the date in the filename.
-
-When `cmr` is the harvester_type, the section includes the following:
-```
-cmr_concept_id: C2491756442-POCLOUD
-provider: "archive.podaac"
-```
-- `cmr_concept_id` is the unique concept_id identifier in CMR for the dataset.
-- `provider` is the provider of the data, used to select the download URLs. Typically this is set to "archive.podaac", although it will be different for NSIDC datasets.
-
-When `osisaf` or `catds` is the harvester_type, the section includes:
-```
-ddir: "ice/amsr2_conc"
-```
-- `ddir` is the subdirectory where the specific data can be found. This is only required for non-CMR harvesters, and is not required for the non-CMR NSIDC harvester.
-
-## Metadata
-This section includes details specific to the data.
-```
-data_time_scale: "daily" # daily or monthly
-hemi_pattern:
-  north: "_nh_"
-  south: "_sh_"
-fields:
-  - name: ice_conc
-    long_name: Sea ice concentration
-    standard_name: sea_ice_area_fraction
-    units: " "
-    pre_transformations: []
-    post_transformations: ["seaice_concentration_to_fraction"]
-original_dataset_title: "Global Sea Ice Concentration (AMSR-2)"
-original_dataset_short_name: "Global Sea Ice Concentration (AMSR-2)"
-original_dataset_url: "https://osi-saf.eumetsat.int/products/osi-408"
-original_dataset_reference: "https://osisaf-hl.met.no/sites/osisaf-hl.met.no/files/user_manuals/osisaf_cdop2_ss2_pum_amsr2-ice-conc_v1p1.pdf"
-original_dataset_doi: "OSI-408"
-```
-- `data_time_scale` is the time scale of the data, either daily or monthly. Monthly data is considered data averaged per month; all other data is considered daily.
-- `hemi_pattern` sets the filename pattern for data split by hemisphere. This section can be omitted for datasets that don't do this.
-- `fields` is the list of data variables that should be transformed as part of the pipeline. You need to manually provide the field's `name`, `long_name`, `standard_name`, and `units`. The `pre_transformations` and `post_transformations` are the names of functions to be applied to the specific data field; examples include unit conversion and data masking. Functions are defined in `ecco_pipeline/utils/processing_utils/ds_functions.py`.
-- The five `original_*` fields are dataset-level metadata that will be included in transformed file metadata.
-
-## Transformation
-This section contains fields required for transformation, including information on data resolution. For hemispherical data, `area_extent`, `dims`, and `proj_info` must be defined for each hemisphere, as below. It is not unusual for the values in this section to be determined iteratively. The testing notebooks in `tests/quicklook_notebooks/` are a useful tool for determining the validity of the values provided.
-```
-t_version: 2.0 # Update this value if any changes are made to this file
-data_res: 10/111 # Resolution of dataset
-
-# Values for non split datasets (for datasets split into nh/sh, append '_nh'/'_sh')
-area_extent_nh: [-3845000, -5345000, 3745000, 5845000]
-area_extent_sh: [-3950000, -3950000, 3945000, 4340000]
-dims_nh: [760, 1120]
-dims_sh: [790, 830]
-proj_info_nh:
-  area_id: "3411"
-  area_name: "polar_stereographic"
-  proj_id: "3411"
-  proj4_args: "+init=EPSG:3411"
-proj_info_sh:
-  area_id: "3412"
-  area_name: "polar_stereographic"
-  proj_id: "3412"
-  proj4_args: "+init=EPSG:3412"
-
-notes: ""
-```
-- `t_version` is a metadata field used internally in the pipeline. Modifying the value will trigger retransformation.
-- `data_res` is the spatial resolution of the dataset in degrees.
-- `area_extent` is the area extent specific to this data, in the form: lower_left_x, lower_left_y, upper_right_x, upper_right_y.
-- `dims` is the grid size: the number of points along the longitude (or x) coordinate and the latitude (or y) coordinate.
-- `proj_info` contains projection information used by pyresample.
-- `notes` is an optional string to include in the global metadata of output files.
-
-## Aggregation
-This section contains fields required for aggregating data.
-```
-a_version: 1.3 # Update this value if any changes are made to this file
-remove_nan_days_from_data: False # Remove empty days from data when aggregating
-do_monthly_aggregation: True
-skipna_in_mean: True # Controls skipna when calculating monthly mean
-```
-- `a_version` is a metadata field used internally in the pipeline. Modifying the value will trigger reaggregation.
-- `remove_nan_days_from_data` will remove nan days from aggregated outputs.
-- `do_monthly_aggregation` will also compute monthly averages when aggregating annual files.
-- `skipna_in_mean` is used when calculating the monthly mean.
+Script that contains some settings used globally throughout the pipeline, such as the location of the output directory. This file must be manually set up after cloning the repo.
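The new README leaves `global_settings.py` undocumented beyond "set it up after cloning". A minimal hypothetical sketch of such a module is below; the variable names `OUTPUT_DIR` and `GRIDS_DIR` are assumptions for illustration, not taken from the ecco_pipeline source.

```python
# Hypothetical sketch of global_settings.py -- variable names are
# assumptions, not from the ecco_pipeline repo. Edit after cloning.
from pathlib import Path

# Root directory for all pipeline output (harvested granules,
# transformed files, aggregated products).
OUTPUT_DIR = Path("/data/ecco_pipeline_output")

# Directory containing the model grid files used during transformation.
GRIDS_DIR = Path("ecco_pipeline") / "grids"
```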

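The harvester section removed in this commit pairs `filename_date_regex` (to locate the date substring) with `filename_date_fmt` (to parse it). A sketch of how a harvester might combine the two; the function name and sample filename are illustrative, not taken from the pipeline:

```python
import re
from datetime import datetime

def date_from_filename(filename: str,
                       date_regex: str = r"\d{8}",
                       date_fmt: str = "%Y%m%d") -> datetime:
    """Find the date substring with filename_date_regex, then parse it
    with filename_date_fmt (strptime format)."""
    match = re.search(date_regex, filename)
    if match is None:
        raise ValueError(f"no date matching {date_regex!r} in {filename!r}")
    return datetime.strptime(match.group(), date_fmt)

# Illustrative OSISAF-style filename (not a real granule name):
print(date_from_filename("ice_conc_nh_polstere-100_amsr2_20210315.nc"))
# -> 2021-03-15 00:00:00
```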
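The removed transformation section notes that `area_extent`, `dims`, and `data_res` are often determined iteratively. One quick sanity check (a sketch, not pipeline code; the helper name is invented) is to confirm that the cell size implied by `area_extent` and `dims` roughly matches `data_res` (10/111 degrees, about 10 km):

```python
# Sanity-check sketch using the NH values from the removed
# AMSR-2_OSI-408 config; cell_size_km is not a pipeline function.
def cell_size_km(area_extent, dims):
    """Cell size in km implied by area_extent (meters) and dims [x, y]."""
    llx, lly, urx, ury = area_extent
    width, height = dims
    return (urx - llx) / width / 1000.0, (ury - lly) / height / 1000.0

dx, dy = cell_size_km([-3845000, -5345000, 3745000, 5845000], [760, 1120])
print(f"NH cell size: {dx:.1f} x {dy:.1f} km")  # ~10 x 10 km, consistent with data_res 10/111 deg
```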