Adding a metadata formats section to the guide #176

@abarciauskas-bgse

Description

Cloud-native data must be accompanied by metadata. There are at least two reasonable options for which metadata model to use, and even more options when it comes to storage format, extensions, and query engine. Metadata can be used both to find data to analyze [or visualize] and to compute on or represent the actual data.

The metadata models we consistently propose for cloud-native geospatial data are Zarr and STAC. A new section of the guide will briefly describe and link to the metadata models for both Zarr and STAC.

Outline

Motivation

See description above

Zarr and STAC Metadata Models

[A bit more about Zarr and STAC metadata models]

This section should describe what those metadata models have traditionally been used for and how they can be extended to support a consistent workflow from search to analytics. The reasons to do this are to (a) ensure metadata is consistent with the data it represents and (b) minimize or avoid the pre-processing steps required to get to the analysis-ready data the user needs.

Beyond Search: Integrating Search and Analysis

Most commonly, metadata has been used simply to discover data (usually stored in files) to process and analyze or visualize. However, metadata can also store information that is useful in data analytics itself, for example summary statistics or qualitative and quantitative values like cloud cover. These metadata fields, often optional, can be used to filter data and to produce analytics on their own.
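As a minimal sketch of metadata-only analytics (the item records and their values are hypothetical; in practice these fields would come from STAC items or a STAC-geoparquet table):

```python
# Hypothetical STAC-like item records carrying the optional eo:cloud_cover
# field, filtered and summarized without ever touching the raw data.
items = [
    {"id": "scene-001", "properties": {"datetime": "2024-06-01", "eo:cloud_cover": 5.0}},
    {"id": "scene-002", "properties": {"datetime": "2024-06-02", "eo:cloud_cover": 62.0}},
    {"id": "scene-003", "properties": {"datetime": "2024-06-03", "eo:cloud_cover": 11.5}},
]

# Filter: keep only mostly clear scenes (cloud cover below 20%).
clear = [i for i in items if i["properties"]["eo:cloud_cover"] < 20.0]

# Analytics on the metadata itself: mean cloud cover of the clear scenes.
mean_cover = sum(i["properties"]["eo:cloud_cover"] for i in clear) / len(clear)
print([i["id"] for i in clear], mean_cover)
```

The same filter-then-aggregate pattern scales up with pandas or duckdb over a STAC-geoparquet table.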

Metadata can also be used to store mappings between logical data coordinates and file byte offsets. This can be used to create a "virtual layer" between the raw data files and a data structure which can be used for data analysis. Another way to think about this is that you create an inventory of all the raw data in a way that supports analysis without loading metadata from individual files. (I like the Amazon warehouse inventory analogy: the bots or humans know exactly where to fetch an item based on an inventory, they don't have to go searching the whole warehouse or even entire rows.)
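A minimal sketch of such an inventory, in the spirit of a virtualizarr-style chunk manifest (the paths, offsets, and lengths are made up): each logical chunk key maps to a byte range in a raw file, so a reader can fetch exactly the bytes it needs.

```python
# Hypothetical chunk manifest: logical chunk indices -> byte ranges in raw files.
# This is the "warehouse inventory": readers do ranged reads against raw files
# instead of opening every file to parse its embedded metadata.
manifest = {
    "0.0": {"path": "s3://bucket/t2m_2024-01.nc", "offset": 4096, "length": 1048576},
    "0.1": {"path": "s3://bucket/t2m_2024-01.nc", "offset": 1052672, "length": 1048576},
    "1.0": {"path": "s3://bucket/t2m_2024-02.nc", "offset": 4096, "length": 1048576},
}

def locate_chunk(manifest, chunk_key):
    """Resolve a logical chunk key to (path, offset, length) for a ranged read."""
    entry = manifest[chunk_key]
    return entry["path"], entry["offset"], entry["length"]

print(locate_chunk(manifest, "1.0"))
```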

Search with Zarr

The Zarr spec declares Zarr as a format for both data and metadata. Zarr libraries are typically used to model data in Zarr's data model for analytics, not for search. However, Zarr metadata can be stored in formats that can back a query engine, and thus serve as a metadata model to support discovery as well. One very nice thing about this model is that Zarr metadata is usually managed at the same time as the data itself, so it is unlikely to get out of sync with the data. Additional metadata can be stored in chunk manifests as a Zarr "virtual layer", which can then be used to load data from native Zarr storage or archival formats like NetCDF.
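To make the idea concrete, here is a sketch (with made-up array names and attributes) of treating consolidated Zarr metadata as a searchable index, the kind of structure a query engine could sit on top of:

```python
import json

# Hypothetical consolidated Zarr metadata, kept alongside the data itself
# so it cannot drift out of sync with the arrays it describes.
consolidated = json.loads("""
{
  "metadata": {
    "t2m/zarr.json": {"attributes": {"standard_name": "air_temperature", "units": "K"}},
    "tp/zarr.json":  {"attributes": {"standard_name": "precipitation_amount", "units": "kg m-2"}}
  }
}
""")

def search_arrays(consolidated, standard_name):
    """Discovery over metadata only: find arrays matching a CF standard_name."""
    return [
        key.rsplit("/", 1)[0]
        for key, meta in consolidated["metadata"].items()
        if meta["attributes"].get("standard_name") == standard_name
    ]

print(search_arrays(consolidated, "air_temperature"))
```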

Analysis with STAC

STAC specifies a metadata model, not a data model. It typically stores metadata used to search for STAC items and their associated assets, which are subsequently used for analysis. STAC can be used to catalogue Zarr as a collection- or item-level asset. The current STAC Zarr documentation details a convention for cataloguing Zarr in STAC.
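For illustration, a sketch of a STAC item cataloguing a Zarr store as an item-level asset (the identifier, href, and media type here are hypothetical; see the STAC Zarr documentation for the actual convention):

```python
# Hypothetical STAC item with a Zarr store as an item-level asset.
# The media type and role are assumptions for this sketch, not the
# authoritative convention from the STAC Zarr documentation.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "reanalysis-2024",
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "geometry": None,
    "assets": {
        "zarr": {
            "href": "s3://bucket/reanalysis-2024.zarr",
            "type": "application/vnd+zarr",
            "roles": ["data"],
        }
    },
}

# Search metadata in STAC, then hand the Zarr entry point to an analysis library.
zarr_href = item["assets"]["zarr"]["href"]
print(zarr_href)
```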

Right now, the options to go from data search to analysis are:

  1. Traditional cloud-native workflow, STAC for search only: use a STAC query engine (such as pystac-client + a STAC API) to discover assets ➡️ use data I/O and format libraries (e.g. xarray) to read, parse, and analyze the discovered assets
  2. STAC for metadata analytics: store additional metadata in STAC and use it to deliver analytics directly (e.g. STAC-geoparquet + pandas/duckdb); analytics on the raw data are not available
  3. STAC for data analytics (3 options):
    i. Use wrapper libraries to read file metadata into the xarray data model (e.g. stackstac or odc-stac).
    ii. Store chunk manifests in STAC and load them into the xarray/Zarr data model (using virtualizarr). This workflow is not recommended at this time; it is too clunky, and ideally you just write a virtual Zarr store (see the next option).
    iii. Store Zarr entry points in STAC (recommended)
  4. Zarr for search and analytics: store Zarr metadata in a format that works with a query engine (e.g. DataFusion with a Zarr Table Provider) to directly search and analyze data stored in files and represented as Zarr. Use this when you don't need to query data in STAC.

cc @gadomski @wildintellect @jsignell
