
Conversation

@danscales
Collaborator

GTC-3375 Add "copy_solo_tiles" option to optimize creation of int-dist alerts raster

When we merge integrated alerts and dist alerts rasters, much of the globe has only dist alerts. So, we want to optimize the process by copying the dist alerts tiles directly to the final raster when there is no corresponding integrated alerts tile. We provide the copy_solo_tiles option to do this, which runs the "copy_solo_tiles.sh" script after the main raster is created with "union_bands = False". The script also correctly updates extent.geojson and tiles.geojson.

This is a fairly specialized option, but we already have a similarly specialized option, unify_projections.
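For illustration, a minimal sketch of how the option might be set in a creation-options payload. Only "copy_solo_tiles" and "union_bands" are option names taken from this PR; the surrounding fields and URIs are assumptions:

# Hypothetical creation-options payload; only "copy_solo_tiles" and
# "union_bands" are real option names from this PR, the rest is illustrative.
creation_options = {
    "source_uri": [
        "s3://example-bucket/integrated_alerts/v1/geotiff/tiles.geojson",
        "s3://example-bucket/dist_alerts/v1/geotiff/tiles.geojson",
    ],
    # Create the merged raster only where tiles from the sources overlap.
    "union_bands": False,
    # Afterwards, run copy_solo_tiles.sh to copy tiles present in just one
    # source directly to the final raster, updating extent.geojson and
    # tiles.geojson as well.
    "copy_solo_tiles": True,
}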

"auxiliary_assets is ignored if source_uri is set (for creating new versions)",
Member

It would of course be better to raise an error immediately, rather than just ignoring it. I know that's not in scope for this PR, but since you're already adding the doc could you add a TODO to enforce this?

Collaborator Author

OK, I added an extra assert right below this. Feel free to comment on that - the tests pass, so at least no test request triggered that validation error.
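For illustration, such a validation might look like the following sketch, assuming a pydantic v1 model; the class name and field details are assumptions, not code from this PR:

from typing import List, Optional

from pydantic import BaseModel, root_validator


class CreationOptions(BaseModel):
    # Hypothetical field subset; names follow the docstring quoted above.
    source_uri: Optional[List[str]] = None
    auxiliary_assets: Optional[List[str]] = None

    @root_validator(pre=True)
    def mutually_exclusive(cls, values):
        # Raise immediately instead of silently ignoring auxiliary_assets
        # when source_uri is also provided (for creating new versions).
        if values.get("source_uri") and values.get("auxiliary_assets"):
            raise ValueError(
                "auxiliary_assets must not be set when source_uri is provided"
            )
        return values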

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 63.63636% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.26%. Comparing base (6563a73) to head (28209c2).

Files with missing lines | Patch % | Lines
...s/raster_tile_set_assets/raster_tile_set_assets.py | 66.66% | 2 Missing ⚠️
app/tasks/raster_tile_set_assets/utils.py | 33.33% | 2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #710      +/-   ##
===========================================
- Coverage    76.29%   76.26%   -0.03%     
===========================================
  Files          143      143              
  Lines         6740     6749       +9     
===========================================
+ Hits          5142     5147       +5     
- Misses        1598     1602       +4     
Flag | Coverage Δ
unittests | 76.26% <63.63%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@dmannarino
Member

I think this is a good idea, but I'm not crazy about how special-cased the implementation is. I also know you're trying to get this done quickly, so I feel bad suggesting generalizing it a little (for example, taking unique tiles from each source for any number of sources), particularly given my impression of how advanced some of that shell scripting already is. I felt SO bad, in fact, that I wrote some suggested code for a Python script that does the work your shell script does, but is a little more general. I hope I'm not being presumptuous, and I apologize if I am. I humbly offer this (untested, but almost complete) Python code if you want to use it:

#!/usr/bin/env python3
"""
Copy tiles that are unique to a single source location to a common destination.

Usage:
    copy_unique.py --sources s3://bucket1/path1 --sources s3://bucket2/path2 --dest s3://dest/path
"""

from collections import Counter
from pathlib import PurePosixPath
from typing import List, Tuple, Dict, Annotated

import typer

from aws_utils import (
    get_aws_files,
    get_s3_path_parts,
    upload_s3,
)


app = typer.Typer(help="Copy tiles unique to a single source to a common destination")


def parse_s3_uri(uri: str) -> Tuple[str, PurePosixPath]:
    """Parse S3 URI into bucket and path."""
    raise NotImplementedError


def validate_uri_format(uri: str) -> None:
    """Validate that URI has the expected format."""
    raise NotImplementedError


def get_base_path(uri: str) -> str:
    """Get base path without the geotiff/{tile_id}.tif part."""
    raise NotImplementedError


@app.command()
def main(
    sources: Annotated[List[str], typer.Option(help="Source S3 URI (can be specified multiple times)")],
    dest: Annotated[str, typer.Option(help="Destination S3 URI")]
):
    """
    Copy tiles that are unique to a single source location to a common destination.

    Only tiles that appear in exactly one source will be copied.
    """
    if not sources:
        typer.echo("Error: At least one source must be specified", err=True)
        raise typer.Exit(1)

    typer.echo(f"Processing {len(sources)} source locations...")

    # Validate all URIs
    for uri in sources + [dest]:
        try:
            validate_uri_format(uri)
        except ValueError as e:
            typer.echo(f"Error: {e}", err=True)
            raise typer.Exit(1)

    # Get base paths and list tiles for each source
    source_paths = [get_base_path(uri) for uri in sources]

    source_tile_paths: List[List[str]] = []
    for path in source_paths:
        typer.echo(f"Getting list of tiles from {path}...")
        bucket, prefix = get_s3_path_parts(path)
        tile_paths = [
            tile_path for tile_path in
            get_aws_files(bucket, prefix)
        ]
        typer.echo(f"  Found {len(tile_paths)} tiles")
        source_tile_paths.append(tile_paths)

    # Count the number of each tile amongst the sources
    tile_counter = Counter()
    tile_to_source: Dict[str, int] = {}

    for i, tile_paths in enumerate(source_tile_paths):
        for tile_path in tile_paths:
            file_name = tile_path.rsplit("/", maxsplit=1)[1]
            tile_counter[file_name] += 1
            if file_name not in tile_to_source:
                tile_to_source[file_name] = i

    # Filter for tiles that appear in exactly one source
    unique_tiles: List[Tuple[str, int]] = [
        (file_name, tile_to_source[file_name])
        for file_name, count in tile_counter.items()
        if count == 1
    ]

    if not unique_tiles:
        typer.echo("No tiles found that are unique to a single source")
        return

    typer.echo(f"\nFound {len(unique_tiles)} tiles unique to single sources")

    dest_base_path = get_base_path(dest)

    # Copy each unique tile to the destination
    for i, (file_name, source_idx) in enumerate(unique_tiles):
        typer.echo(f"Copying {i+1}/{len(unique_tiles)}: {file_name} (from source {source_idx})")
        try:
            upload_s3(file_name, source_paths[source_idx], dest_base_path)
        except Exception:
            typer.echo(f"Failed to copy {file_name}, continuing...", err=True)
            continue

    typer.echo("\nAll unique tiles copied to destination")

if __name__ == '__main__':
    app()

@danscales
Collaborator Author

Below is my filled-in version of Daniel's suggested copy_unique.py script. Along with filling in the stubs and adding validation, etc., I actually needed a copy_s3() function, not upload_s3(). Daniel suggested an approach where we use the script to copy the unique tiles first. However, doing that and then doing the normal raster call with union_bands = True results in a much longer run (3 hours, rather than 1:25). If I set union_bands = False after doing the copy, the run was faster, but tiles.geojson/extent.geojson were not right. So, I am going to stay with the current, less general batch script for now, but I am recording the more general Python script here for later:

#!/usr/bin/env python3
"""
Copy tiles that are unique to a single source location to a common destination.

Usage:
    copy_unique.py --sources s3://bucket1/path1 --sources s3://bucket2/path2 --dest s3://dest/path
"""

from collections import Counter
from typing import List, Tuple, Dict, Annotated

import typer
import os

from aws_utils import (
    get_aws_files,
    get_s3_path_parts,
    get_s3_client
)
from urllib.parse import urlparse


app = typer.Typer(help="Copy tiles unique to a single source to a common destination")


def get_base_path(uri: str) -> str:
    """Validate path (next to last path component is "geotiff") and return with last path
    component removed."""

    parsed_uri = urlparse(uri)
    path = parsed_uri.path.strip('/')  # Remove leading/trailing slashes for consistent splitting

    path_components = path.split('/')

    if len(path_components) < 2 or path_components[-2] != 'geotiff':
        raise ValueError(
            f"Expected 'geotiff' as the next-to-last path component in the URI path. "
            f"Found: {'/'.join(path_components[-2:]) if len(path_components) >= 2 else path}"
        )

    base_path_components = path_components[:-1]

    base_uri_prefix = f"{parsed_uri.scheme}://{parsed_uri.netloc}/"
    base_path = base_uri_prefix + '/'.join(base_path_components)
    return base_path


def replace_last_component(uri: str, new_component: str):
    head = os.path.dirname(uri)
    return os.path.join(head, new_component)


def copy_s3(client, file: str, source: str, dest: str):
    sbucket, skey = get_s3_path_parts(f"{source}/{file}")
    dbucket, dkey = get_s3_path_parts(f"{dest}/{file}")
    copy_source = {
        "Bucket": sbucket,
        "Key": skey
    }
    try:
        # Note: copy_object handles objects up to 5 GB; larger objects
        # would need a multipart copy.
        client.copy_object(CopySource=copy_source, Bucket=dbucket, Key=dkey)
    except Exception as e:
        print(f"Error copying object: {e}")
        # Re-raise so the caller's skip-and-continue handling actually runs.
        raise


@app.command()
def main(
    sources: Annotated[List[str], typer.Option(help="Source S3 URI (can be specified multiple times)")],
    dest: Annotated[str, typer.Option(help="Destination S3 URI")]
):
    """
    Copy tiles that are unique to a single source location to a common destination.

    Only tiles that appear in exactly one source will be copied.
    """
    if not sources:
        typer.echo("Error: At least one source must be specified", err=True)
        raise typer.Exit(1)

    typer.echo(f"Processing {len(sources)} source locations...")

    # Get base paths and list tiles for each source
    source_paths = [get_base_path(uri) for uri in sources]

    source_tile_paths: List[List[str]] = []
    for path in source_paths:
        typer.echo(f"Getting list of tiles from {path}...")
        bucket, prefix = get_s3_path_parts(path)
        tile_paths = [
            tile_path for tile_path in
            get_aws_files(bucket, prefix)
        ]
        typer.echo(f"  Found {len(tile_paths)} tiles")
        source_tile_paths.append(tile_paths)

    # Count the number of each tile amongst the sources
    tile_counter: Counter = Counter()
    tile_to_source: Dict[str, int] = {}

    for i, tile_paths in enumerate(source_tile_paths):
        for tile_path in tile_paths:
            file_name = tile_path.rsplit("/", maxsplit=1)[1]
            tile_counter[file_name] += 1
            if file_name not in tile_to_source:
                tile_to_source[file_name] = i

    # Filter for tiles that appear in exactly one source
    unique_tiles: List[Tuple[str, int]] = [
        (file_name, tile_to_source[file_name])
        for file_name, count in tile_counter.items()
        if count == 1
    ]

    if not unique_tiles:
        typer.echo("No tiles found that are unique to a single source")
        return

    typer.echo(f"\nFound {len(unique_tiles)} tiles unique to single sources")

    dest_base_path = get_base_path(dest)

    # Copy each unique tile to the destination
    client = get_s3_client()
    for i, (file_name, source_idx) in enumerate(unique_tiles):
        typer.echo(f"Copying {i + 1}/{len(unique_tiles)}: {file_name} (from source {source_idx})")
        try:
            copy_s3(client, file_name, source_paths[source_idx], dest_base_path)
            copy_s3(client, file_name,
                    replace_last_component(source_paths[source_idx], "gdal-geotiff"),
                    replace_last_component(dest_base_path, "gdal-geotiff"))
        except Exception:
            typer.echo(f"Failed to copy {file_name}, continuing...", err=True)
            continue

    typer.echo("\nAll unique tiles copied to destination")


if __name__ == '__main__':
    app()
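
For reference, a hypothetical invocation (bucket names and paths are placeholders; get_base_path expects "geotiff" as the next-to-last path component, so the URIs below point at files under a geotiff/ prefix):

python copy_unique.py \
    --sources s3://bucket1/integrated_alerts/v1/geotiff/tiles.geojson \
    --sources s3://bucket2/dist_alerts/v1/geotiff/tiles.geojson \
    --dest s3://dest-bucket/merged/v1/geotiff/tiles.geojson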

@danscales merged commit 010c777 into develop on Oct 28, 2025. 1 of 2 checks passed.
@danscales deleted the copy_on_solo branch on October 28, 2025 at 18:16.
danscales added a commit that referenced this pull request on Oct 28, 2025:

GTC-3375 Add "copy_solo_tiles" option to optimize creation of int-dist alerts raster (#710)

* GTC-3375 Add "copy_solo_tiles" option to optimize creation of int-dist alerts
* Validate that auxiliary_asset is not provided if source_uri is provided.
