
Querying STAC Catalogs Programmatically: A Production Workflow for Python Remote Sensing Pipelines

Modern remote sensing infrastructure has shifted from monolithic FTP archives to distributed, API-driven ecosystems. Querying STAC catalogs programmatically is now a foundational skill for environmental data engineers and Python GIS developers building scalable raster processing pipelines. The SpatioTemporal Asset Catalog (STAC) specification standardizes how geospatial assets are described, discovered, and accessed across cloud providers, government repositories, and commercial platforms.

This guide provides a tested, production-ready workflow for interacting with STAC APIs in Python. You will learn how to construct spatiotemporal filters, handle pagination safely, validate asset readiness, and prepare results for downstream raster operations. For broader architectural context on how metadata maps to actual raster I/O, see Core Raster Fundamentals & STAC Mapping.

Prerequisites & Environment Setup

Before implementing programmatic queries, ensure your environment meets the following baseline requirements:

  • Python 3.10+: Required for the X | None type-hint syntax used in the examples below; also covers zoneinfo support and async-compatible HTTP clients.
  • Core Libraries: pystac-client (≥0.7.0), shapely (≥2.0), pyproj, requests, and rasterio (for downstream validation).
  • Network Access: Unrestricted outbound HTTPS to public STAC endpoints (e.g., earth-search.aws.element84.com, planetarycomputer.microsoft.com/api/stac/v1). Private catalogs may require API keys or OAuth2 tokens.
  • Conceptual Baseline: Familiarity with the STAC Specification and how JSON metadata maps to cloud-optimized raster formats.

Install dependencies via pip:

pip install pystac-client shapely pyproj rasterio requests

Step-by-Step Query Workflow

1. Initialize the STAC Client

A STAC client acts as the programmatic entry point to a catalog’s API. Initialization requires a valid endpoint URL and accepts optional configuration for authentication, timeouts, and retry logic. On open, the client reads the root catalog (the API landing page) and the conformance classes it advertises, which tell pystac-client which capabilities, such as item search or the Query Extension, the server claims to support. Collections are fetched when you request them.

import pystac_client
from pystac_client.stac_api_io import StacApiIO
from urllib3.util.retry import Retry

# Production-safe client initialization with retry strategy
def get_stac_client(url: str, timeout: int = 30) -> pystac_client.Client:
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    # pystac-client sends requests through its own StacApiIO session, so the
    # timeout and retry policy must be configured there to take effect
    # (timeout and max_retries are available in pystac-client >= 0.7.0).
    stac_io = StacApiIO(timeout=timeout, max_retries=retry_strategy)

    return pystac_client.Client.open(url, stac_io=stac_io)

2. Define Spatiotemporal & Metadata Constraints

STAC queries are constructed around three primary dimensions. Proper constraint definition prevents over-fetching and reduces downstream compute costs.

Spatial Constraints

You can filter by bounding box (bbox) or GeoJSON geometry (intersects); a single search accepts one or the other, not both. STAC API search geometries are expressed in WGS84 longitude/latitude (EPSG:4326), regardless of the projection of the underlying rasters, so reproject any projected geometries to EPSG:4326 before querying and keep decimal precision consistent. For detailed guidance on avoiding floating-point drift and snapping artifacts, review Handling coordinate precision in STAC spatial queries. If your workflow requires projecting geometries before querying, consult Mastering CRS Transformations in Rasterio.

from shapely.geometry import box, mapping

# Example: bounding box in WGS84, ordered [min_lon, min_lat, max_lon, max_lat]
bbox = [-122.5, 37.5, -122.0, 38.0]

# Example: Polygon intersection
geometry = box(-122.45, 37.7, -122.35, 37.8)
geojson_filter = mapping(geometry)
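
A swapped latitude/longitude pair or inconsistent precision is easy to miss until a query silently returns nothing. The following is a minimal sketch of a pre-flight check, assuming a WGS84 bounding box; normalize_bbox is a hypothetical helper name, and the six-decimal default is an illustrative choice rather than a STAC requirement.

```python
from typing import Sequence, Tuple


def normalize_bbox(
    bbox: Sequence[float], precision: int = 6
) -> Tuple[float, float, float, float]:
    """Validate a WGS84 bbox and round it to a fixed decimal precision.

    Hypothetical helper: expects [min_lon, min_lat, max_lon, max_lat].
    """
    if len(bbox) != 4:
        raise ValueError("bbox must be [min_lon, min_lat, max_lon, max_lat]")

    min_lon, min_lat, max_lon, max_lat = (round(float(v), precision) for v in bbox)

    if not (-180.0 <= min_lon <= 180.0 and -180.0 <= max_lon <= 180.0):
        raise ValueError("longitude must be within [-180, 180]")
    if not (-90.0 <= min_lat <= 90.0 and -90.0 <= max_lat <= 90.0):
        raise ValueError("latitude must be within [-90, 90]")
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    # GeoJSON allows min_lon > max_lon for boxes crossing the antimeridian;
    # this sketch rejects that case to keep the check simple.
    if min_lon > max_lon:
        raise ValueError("min_lon must not exceed max_lon")

    return (min_lon, min_lat, max_lon, max_lat)
```

Running the box through this check before passing it to a search catches the common mistake of supplying coordinates in latitude-first order.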

Temporal Constraints

Temporal filters use ISO 8601 (RFC 3339) datetime strings, either a single instant or a start/end interval separated by a slash. Open-ended ranges are expressed with two dots (..) in place of the missing bound. When targeting specific satellite missions, align your date windows with acquisition schedules and revisit periods. For practical examples of temporal filtering against multi-sensor archives, see Using pystac-client to filter Sentinel-2 imagery by date.

# Closed range
datetime_range = "2023-06-01T00:00:00Z/2023-06-30T23:59:59Z"

# Open-ended range (from June 2023 to present)
open_range = "2023-06-01T00:00:00Z/.."
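
Hand-written interval strings are a frequent source of off-by-timezone errors. As a sketch, the interval can be built from timezone-aware datetime objects instead; to_rfc3339 and build_datetime_range are hypothetical helper names, and the code assumes second-level precision is sufficient.

```python
from datetime import datetime, timezone
from typing import Optional


def to_rfc3339(dt: datetime) -> str:
    """Format an aware datetime as a UTC timestamp with a trailing Z."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime; attach a timezone first")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_datetime_range(start: Optional[datetime], end: Optional[datetime]) -> str:
    """Build a start/end interval; None on either side becomes an open bound (..)."""
    if start is None and end is None:
        raise ValueError("at least one bound is required")
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be later than end")

    left = to_rfc3339(start) if start is not None else ".."
    right = to_rfc3339(end) if end is not None else ".."
    return f"{left}/{right}"
```

Rejecting naive datetimes is a deliberate choice here: it forces the caller to state the timezone rather than letting local time leak into a UTC query.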

Metadata & Query Extensions

Beyond space and time, STAC supports the Query Extension for filtering on properties like eo:cloud_cover, platform, or custom vendor fields. Always validate that the target catalog implements the extension before relying on it.
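
The sketch below shows one way to check for the extension and assemble a query dictionary. It assumes you have the conformsTo list from the API landing page, and that the server names its Query conformance class in the STAC API 1.0.0 style (a URI containing item-search and ending in #query); both function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def supports_query_extension(conforms_to: Iterable[str]) -> bool:
    """Return True if an item-search Query conformance URI is advertised.

    Assumption: the URI follows the STAC API naming pattern
    .../item-search#query; older or non-standard servers may differ.
    """
    return any("item-search" in uri and uri.endswith("#query") for uri in conforms_to)


def build_query(
    max_cloud_cover: Optional[float] = None,
    platform: Optional[str] = None,
) -> Dict[str, Dict[str, Any]]:
    """Assemble a Query Extension dict from optional property constraints."""
    query: Dict[str, Dict[str, Any]] = {}
    if max_cloud_cover is not None:
        # "lt" is the Query Extension's less-than operator
        query["eo:cloud_cover"] = {"lt": max_cloud_cover}
    if platform is not None:
        query["platform"] = {"eq": platform}
    return query
```

The resulting dictionary can be passed as the query argument of a search; if supports_query_extension returns False, filter on those properties client-side after the search instead.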

3. Execute Query & Handle Pagination

Large catalogs return paginated results. Production pipelines must iterate through pages safely, respecting next links and avoiding memory exhaustion. The pystac-client search() method returns an ItemSearch object rather than a list of results; its items() and items_as_dicts() generators follow next links page by page as you iterate.

def execute_stac_search(
    client: pystac_client.Client,
    bbox: list[float] | tuple[float, float, float, float] | None = None,
    datetime_str: str | None = None,
    collections: list[str] | None = None,
    query: dict | None = None,
    max_items: int | None = None
) -> list[dict]:
    """Execute a paginated STAC search and return the matching items as dicts."""
    search = client.search(
        bbox=bbox,
        datetime=datetime_str,
        collections=collections,
        query=query,
        max_items=max_items  # the client stops requesting pages once the cap is reached
    )

    # items_as_dicts() follows "next" links lazily, one page at a time
    return list(search.items_as_dicts())

Production Notes:

  • Use items_as_dicts() instead of items() when you only need JSON payloads. It avoids instantiating full pystac.Item objects, reducing memory overhead (roughly 40% in some workloads; the exact saving depends on item size).
  • Implement a hard max_items cap in automated workflows to prevent runaway requests during API outages or misconfigured filters.
  • Log pagination cursors and request durations for observability.
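
The cap and the timing log from the notes above can also be applied on the consumer side of any item generator. This is a minimal sketch, assuming the items arrive from a lazy iterator such as the one returned by items_as_dicts(); collect_capped and the logger name are hypothetical.

```python
import itertools
import logging
import time
from typing import Any, Dict, Iterable, List

logger = logging.getLogger("stac.search")


def collect_capped(items: Iterable[Dict[str, Any]], max_items: int) -> List[Dict[str, Any]]:
    """Collect at most max_items from a lazy item iterator and log the duration."""
    started = time.monotonic()
    # islice stops pulling from the iterator once the cap is reached,
    # so a paginating generator is never asked for items beyond it.
    results = list(itertools.islice(items, max_items))
    logger.info(
        "collected %d items (cap %d) in %.2fs",
        len(results),
        max_items,
        time.monotonic() - started,
    )
    return results
```

Because islice never advances the generator past the cap, a misconfigured filter that matches millions of items costs only as many page requests as the cap requires.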

4. Extract Assets & Validate Raster Readiness

Each matching item contains an assets dictionary mapping asset keys (e.g., B04, visual, data-mask) to URLs, MIME types, and metadata. Before passing these URLs to raster I/O libraries, validate that:

  1. The asset exists and is accessible (HTTP 200).
  2. The MIME type matches your expected format (e.g., image/tiff; application=geotiff; profile=cloud-optimized).
  3. The file structure supports byte-range requests.

Understanding how cloud-native formats enable partial reads is critical for efficient pipelines. Review Understanding Cloud-Optimized GeoTIFF Structure to grasp how internal tiling and overviews impact query performance.

import rasterio
from rasterio.errors import RasterioIOError
import requests

def validate_asset_url(url: str, timeout: int = 10) -> bool:
    """Check that an asset is reachable, supports range reads, and opens as a raster."""
    try:
        # Quick HEAD request to check status and headers
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code != 200:
            return False

        # Verify byte-range support (required for COG streaming)
        if "bytes" not in resp.headers.get("Accept-Ranges", ""):
            return False

        # Lightweight rasterio check: the file opens and has at least one band.
        # This does not prove the file is a valid COG.
        with rasterio.open(url) as src:
            if src.count == 0:
                return False
        return True
    # Catch only network and raster I/O failures so programming errors still surface
    except (requests.RequestException, RasterioIOError):
        return False
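
Before any URL is validated, the candidate assets have to be picked out of each item. The sketch below filters an item dictionary by declared media type; select_cog_assets is a hypothetical helper, and it assumes the catalog populates the asset type field, which not every provider does.

```python
from typing import Any, Dict

COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def _normalize_media_type(value: str) -> str:
    """Lowercase and strip whitespace so spacing differences do not matter."""
    return "".join(value.lower().split())


def select_cog_assets(item: Dict[str, Any]) -> Dict[str, str]:
    """Map asset key to href for assets whose media type declares a COG."""
    wanted = _normalize_media_type(COG_MEDIA_TYPE)
    selected: Dict[str, str] = {}
    for key, asset in item.get("assets", {}).items():
        declared = _normalize_media_type(asset.get("type", ""))
        if declared == wanted and "href" in asset:
            selected[key] = asset["href"]
    return selected
```

The declared media type is only metadata, so the hrefs returned here are candidates; the accessibility and range-request checks still apply to each one.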

5. Prepare for Downstream Processing

Once assets pass validation, they can be streamed directly into rasterio, xarray, or rioxarray workflows. Avoid downloading entire files to disk unless necessary for heavy pixel-wise transformations. Instead, leverage HTTP range requests to read only the bounding windows required for your analysis.
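
Libraries such as rasterio issue range requests for you through GDAL, so you rarely write them by hand. Purely to illustrate the mechanism, here is a standard-library sketch of a single range request; build_range_request and fetch_range are hypothetical names, and the example URL in the usage note is a placeholder.

```python
import urllib.request


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build a GET request for `length` bytes starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1
    last_byte = offset + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


def fetch_range(request: urllib.request.Request, timeout: float = 10.0) -> bytes:
    """Send the request and return the body, insisting on a partial response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # 206 Partial Content means the server honored the Range header;
        # a 200 would mean it ignored the header and sent the whole file.
        if resp.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return resp.read()
```

For example, build_range_request("https://example.com/scene.tif", 0, 16384) asks for the first 16 KiB of a file, which is typically where a COG keeps its header and tile index.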

When working with heterogeneous catalogs, STAC items often contain nested JSON structures, custom extensions, or vendor-specific metadata. For robust schema enforcement and type-safe extraction, consider Parsing complex STAC item properties with pydantic to standardize your ingestion layer before passing data to numerical computing stacks.

Production Best Practices

Rate Limiting & Concurrency

Public STAC endpoints often enforce rate limits. Wrap your query loops with exponential backoff and respect Retry-After headers. For high-throughput pipelines, use asyncio with aiohttp or httpx, but ensure you do not exceed the endpoint's concurrent connection limits.
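
One way to compute the wait between attempts is sketched below. It assumes Retry-After arrives as a number of seconds (the HTTP-date form is not parsed and falls back to backoff); retry_delay and its default base and cap values are illustrative, not taken from any particular STAC provider.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = True,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # A server-provided delay takes priority over our own schedule
            return max(float(retry_after), 0.0)
        except ValueError:
            pass  # HTTP-date form: not handled in this sketch, use backoff

    # Exponential growth: base, 2*base, 4*base, ... bounded by cap
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter spreads retries out so clients do not retry in lockstep
        delay = random.uniform(0.0, delay)
    return delay
```

A retry loop would call time.sleep(retry_delay(attempt, response.headers.get("Retry-After"))) after a 429 or 5xx response and give up after a fixed number of attempts.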

Caching & Idempotency

STAC item IDs are only expected to be unique within a collection, so cache query results keyed on the collection ID plus the item ID to avoid redundant API calls. When building scheduled pipelines, store the last successful query timestamp and use it as the lower bound for subsequent runs.
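
A minimal sketch of both ideas follows, assuming a small JSON state file on local disk and an in-memory set of seen keys. The function names and the last_success field are hypothetical, and the cache key pairs the collection with the item ID because the STAC specification only asks for IDs to be unique within a collection.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Iterable, List, Set, Tuple


def filter_new_items(
    items: Iterable[Dict[str, Any]], seen: Set[Tuple[str, str]]
) -> List[Dict[str, Any]]:
    """Return items not seen before, recording their (collection, id) keys."""
    fresh = []
    for item in items:
        key = (item.get("collection", ""), item["id"])
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


def load_lower_bound(state_path: Path, default: str) -> str:
    """Read the last successful run timestamp, or fall back to a default."""
    if state_path.exists():
        return json.loads(state_path.read_text())["last_success"]
    return default


def save_success(state_path: Path, when: datetime) -> None:
    """Persist the timestamp of a successful run as a UTC string."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    state_path.write_text(json.dumps({"last_success": stamp}))
```

The next run's interval is then f"{load_lower_bound(path, default)}/..". Items published late with an older acquisition datetime fall outside that window, so overlapping the window slightly and relying on filter_new_items to drop repeats is a safer pattern.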

Error Handling & Logging

Never assume network stability or catalog consistency. Implement structured logging that captures:

  • Request payloads and response status codes
  • Pagination depth and total item counts
  • Asset validation failures with explicit URLs
  • CRS mismatches or geometry validation errors
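
The fields listed above are easiest to search later if each log record is emitted as one JSON object. The following is a sketch using only the standard logging module; JsonFormatter, get_pipeline_logger, and the context key passed through extra are hypothetical conventions, not part of pystac-client.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with optional context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the output
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, default=str, sort_keys=True)


def get_pipeline_logger(name: str = "stac.pipeline") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr, configured once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A failed asset check could then be recorded with logger.warning("asset validation failed", extra={"context": {"url": url, "status": 403}}), which keeps the URL and status code as separate, queryable fields.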

Security & Authentication

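For private catalogs, inject API keys via environment variables rather than hardcoding them. For example, pass headers={"Authorization": f"Bearer {os.environ['STAC_TOKEN']}"} to pystac_client.Client.open(). Indexing os.environ (after import os) raises KeyError when the variable is unset, whereas os.getenv would silently send the literal string "Bearer None". Rotate credentials regularly and restrict IAM permissions to read-only access where possible.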
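
A small helper keeps the token lookup in one place and fails loudly when the variable is missing. This is a sketch assuming bearer-token authentication and the STAC_TOKEN variable name used above; auth_headers is a hypothetical function.

```python
import os
from typing import Dict


def auth_headers(env_var: str = "STAC_TOKEN") -> Dict[str, str]:
    """Build an Authorization header from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail before any request is made instead of sending an empty token
        raise RuntimeError(f"{env_var} is not set; refusing to send an empty bearer token")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed directly as the headers argument of pystac_client.Client.open().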

Conclusion

Querying STAC catalogs programmatically transforms geospatial data discovery from a manual, error-prone process into a reproducible, scalable engineering workflow. By combining robust client initialization, precise spatiotemporal filtering, safe pagination, and asset validation, your Python pipelines can reliably ingest petabytes of cloud-native raster data. As the ecosystem matures, adopting standardized extensions, leveraging COG streaming, and enforcing strict schema validation will keep your infrastructure resilient and performant.
