Querying STAC Catalogs Programmatically

Modern remote sensing infrastructure has shifted from monolithic FTP archives to distributed, API-driven ecosystems. Discovering, filtering, and retrieving satellite assets programmatically is now the bottleneck that determines whether a raster pipeline is reproducible at scale — or fragile by design. For the full architectural context of how STAC metadata maps to raster I/O, see Core Raster Fundamentals & STAC Mapping.

The SpatioTemporal Asset Catalog (STAC) specification standardises how geospatial assets are described, discovered, and accessed across cloud providers, government repositories, and commercial platforms. This page walks through a tested, production-ready workflow for interacting with STAC APIs in Python: client initialisation, spatiotemporal filtering, safe pagination, asset validation, and handoff to downstream raster tools.


STAC Query Pipeline Diagram showing the flow from a Python pystac-client instance through the STAC API search endpoint, into paginated item results, through asset validation, and finally to COG byte-range reads in rasterio or xarray. pystac_client .Client.open() STAC API /search (POST) Paginated Item Results Asset Validation COG Byte-range Read rasterio / xarray search() next page assets{} URL Filters: bbox · datetime · collections · eo:cloud_cover (Query Extension / CQL2)

Prerequisites

Install the required libraries before working through the steps below:

pip install "pystac-client>=0.7.0" "shapely>=2.0" pyproj "rasterio>=1.3" requests
Library Min version Why required
pystac-client 0.7.0 STAC API client with pagination and conformance detection
shapely 2.0 Geometry construction for intersects filters
pyproj 3.5 CRS validation before submitting spatial filters
rasterio 1.3 Asset validation and downstream window reads
requests 2.31 Session-level retry strategy for the HTTP transport

Conceptual prerequisites: Understanding how the STAC item hierarchy (Catalog → Collection → Item → Asset) maps to files on object storage is assumed. If understanding Cloud-Optimized GeoTIFF structure is new to you, read that page first — knowing how internal tiling and overviews work explains why byte-range validation matters in step 4.


Step-by-Step Query Workflow

Step 1 — Initialise the STAC client

A pystac_client.Client is the programmatic entry point to a catalog’s /search endpoint. On open, it fetches the root catalog, discovers available collections, and validates the API’s conformance classes against the OGC API – Features standard. Supply a retry-equipped requests.Session so transient HTTP 429 and 5xx errors do not abort long-running pipeline runs.

import os
import pystac_client
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_stac_client(url: str, token: str | None = None) -> pystac_client.Client:
    """Open a STAC client with retry logic and optional bearer-token auth."""
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "POST"],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)

    headers: dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    return pystac_client.Client.open(url, headers=headers)


# Public catalogs — no auth required
EARTH_SEARCH = "https://earth-search.aws.element84.com/v1"
PC = "https://planetarycomputer.microsoft.com/api/stac/v1"

client = get_stac_client(EARTH_SEARCH)

For private catalogs, load the token from an environment variable rather than hardcoding it:

client = get_stac_client(PC, token=os.getenv("PC_SDK_SUBSCRIPTION_KEY"))

Step 2 — Define spatiotemporal and metadata constraints

STAC queries are built from three dimensions. Precise constraint definition avoids over-fetching and reduces the downstream compute cost of filtering large Item collections in memory.

Spatial constraints

Filter by bounding box (bbox) or GeoJSON geometry (intersects). Coordinates must be in WGS84 (EPSG:4326). If your area of interest is in a projected CRS, reproject it before querying — misaligned CRS inputs silently return empty results. For worked examples of projecting geometries, see mastering CRS transformations in rasterio.

from shapely.geometry import box, mapping

# Bounding box: [west, south, east, north] in WGS84
bbox = [-122.5, 37.5, -122.0, 38.0]

# GeoJSON polygon for non-rectangular AOIs
aoi_geom = box(-122.45, 37.70, -122.35, 37.80)
geojson_filter = mapping(aoi_geom)

Temporal constraints

Temporal filters use ISO 8601 datetime strings. Use .. for open-ended ranges. Align date windows with acquisition schedules: Sentinel-2 revisits every 5 days (two satellites) and Landsat-8/9 revisits every 8 days.

# Closed range
datetime_range = "2025-06-01T00:00:00Z/2025-06-30T23:59:59Z"

# Open-ended: everything from June 2025 to now
open_range = "2025-06-01T00:00:00Z/.."

# Single acquisition date (inclusive of the full day)
single_day = "2025-07-15T00:00:00Z/2025-07-15T23:59:59Z"

Property filters (Query Extension)

Beyond space and time, many STAC APIs support property filtering via the Query Extension or CQL2. This enables server-side cloud-cover thresholds, platform selection, and custom vendor fields — keeping the result set small before it reaches your process.

# Query Extension syntax (older APIs: earth-search v0, USGS)
query_ext = {
    "eo:cloud_cover": {"lte": 20},
    "platform": {"eq": "sentinel-2b"},
}

# CQL2-JSON syntax (newer APIs: earth-search v1, Planetary Computer)
# Pass via filter= and filter-lang="cql2-json"
cql2_filter = {
    "op": "and",
    "args": [
        {"op": "<=", "args": [{"property": "eo:cloud_cover"}, 20]},
        {"op": "=", "args": [{"property": "platform"}, "sentinel-2b"]},
    ],
}

Always introspect the catalog’s conformance classes before relying on these extensions:

conformance = client.conformance_classes()
supports_query = any("stac-api-query" in c for c in conformance)
supports_cql2  = any("cql2" in c for c in conformance)

Step 3 — Execute the search and handle pagination

Large catalogs return paginated results, typically capping at 250–500 items per page. The pystac-client search() iterator follows next links transparently when consumed as a generator. Always impose a hard max_items cap in automated pipelines to prevent runaway iteration during API outages or misconfigured filters.

def execute_stac_search(
    client: pystac_client.Client,
    collections: list[str],
    bbox: list[float] | None = None,
    datetime_str: str | None = None,
    query: dict | None = None,
    max_items: int = 500,
) -> list[dict]:
    """Execute a paginated STAC search and return items as plain dicts.

    Uses items_as_dicts() to avoid instantiating full pystac.Item objects,
    which reduces memory overhead for large result sets.
    """
    search = client.search(
        collections=collections,
        bbox=bbox,
        datetime=datetime_str,
        query=query,
        max_items=max_items,
    )

    items: list[dict] = []
    for item in search.items_as_dicts():
        items.append(item)
        if len(items) >= max_items:
            break

    return items


# Example: Sentinel-2 L2A over San Francisco, June 2025, ≤20 % cloud cover
items = execute_stac_search(
    client=client,
    collections=["sentinel-2-l2a"],
    bbox=[-122.5, 37.5, -122.0, 38.0],
    datetime_str="2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
    query={"eo:cloud_cover": {"lte": 20}},
    max_items=50,
)

print(f"Found {len(items)} items")
for item in items[:3]:
    print(item["id"], item["properties"].get("eo:cloud_cover"))

Memory note: items_as_dicts() streams results without buffering the full pystac.ItemCollection in RAM. For million-item catalogs, process items in the loop rather than collecting them all into a list.


Step 4 — Extract assets and validate raster readiness

Each Item contains an assets dictionary mapping asset keys (e.g., B04, visual, SCL, data-mask) to href URLs, MIME types, and roles. Before passing URLs to raster I/O libraries, validate three properties: HTTP reachability, byte-range support (required for COG streaming), and parseable raster metadata.

Understanding why byte-range support matters requires familiarity with how cloud-optimized GeoTIFF structure enables partial reads — the internal tiling and overview layout determine which byte offsets the client must fetch.

import rasterio
from rasterio.errors import RasterioIOError


def validate_asset_url(url: str, timeout: int = 10) -> bool:
    """Return True if the asset is reachable, byte-range capable, and parseable."""
    try:
        # HEAD request: check reachability and byte-range support
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code != 200:
            return False
        if "bytes" not in resp.headers.get("Accept-Ranges", ""):
            return False

        # Open header-only with rasterio to confirm the file is a valid raster
        with rasterio.open(url) as src:
            return src.count > 0 and src.width > 0

    except (requests.RequestException, RasterioIOError, Exception):
        return False


def extract_validated_assets(
    items: list[dict],
    asset_keys: list[str],
) -> list[dict]:
    """Return a list of validated asset records for each requested asset key."""
    validated: list[dict] = []
    for item in items:
        for key in asset_keys:
            asset = item.get("assets", {}).get(key)
            if asset is None:
                continue
            href = asset.get("href", "")
            if validate_asset_url(href):
                validated.append(
                    {
                        "item_id": item["id"],
                        "asset_key": key,
                        "href": href,
                        "media_type": asset.get("type", ""),
                        "datetime": item["properties"].get("datetime"),
                        "cloud_cover": item["properties"].get("eo:cloud_cover"),
                    }
                )
    return validated


# Validate the red and NIR bands for NDVI
validated_assets = extract_validated_assets(
    items,
    asset_keys=["B04", "B08"],   # Sentinel-2 Red and NIR
)

Step 5 — Prepare results for downstream processing

Validated asset URLs can be streamed directly into rasterio, xarray, or rioxarray workflows without downloading full files. HTTP range requests read only the bounding window your analysis needs, which is where COG tiling pays off at scale.

For band math operations with xarray — such as computing NDVI from these Sentinel-2 assets — pass the hrefs as lazy-loaded arrays. For windowed reads with fine-grained spatial control, the handling pixel resolution and scaling page covers how to align window geometry with the native geotransform before reading.

import rasterio
from rasterio.windows import from_bounds
from rasterio.crs import CRS
from pyproj import Transformer

def read_asset_window(
    href: str,
    bounds_wgs84: tuple[float, float, float, float],
) -> tuple:
    """Read a spatial window from a COG asset, reprojecting the AOI if needed.

    Args:
        href: Validated COG asset URL.
        bounds_wgs84: AOI as (west, south, east, north) in WGS84.

    Returns:
        Tuple of (numpy array, rasterio transform, CRS string).
    """
    with rasterio.open(href) as src:
        # Reproject the WGS84 AOI into the raster's native CRS
        transformer = Transformer.from_crs(
            "EPSG:4326", src.crs.to_epsg(), always_xy=True
        )
        west, south = transformer.transform(bounds_wgs84[0], bounds_wgs84[1])
        east, north = transformer.transform(bounds_wgs84[2], bounds_wgs84[3])

        window = from_bounds(west, south, east, north, src.transform)
        data = src.read(1, window=window)
        window_transform = src.window_transform(window)

    return data, window_transform, src.crs.to_string()


# Read the Red band window for the first validated asset
red_assets = [a for a in validated_assets if a["asset_key"] == "B04"]
if red_assets:
    data, transform, crs = read_asset_window(
        red_assets[0]["href"],
        bounds_wgs84=(-122.45, 37.70, -122.35, 37.80),
    )
    print(f"Read array shape: {data.shape}, CRS: {crs}")

Parameter Reference

Key parameters for pystac_client.Client.search():

Parameter Type Default Usage note
collections list[str] None Restrict to named collections; omitting searches the entire catalog
bbox list[float] None [west, south, east, north] in WGS84; mutually exclusive with intersects
intersects dict (GeoJSON) None Polygon or multi-polygon for non-rectangular AOIs
datetime str None ISO 8601 single datetime or range start/end; use .. for open ends
query dict None Query Extension property filters (lte, eq, gte, etc.)
filter dict None CQL2-JSON filter expression for newer APIs
filter_lang str None "cql2-json" or "cql2-text" (required when filter is set)
max_items int None Hard cap on returned items; None iterates all pages
limit int None Page size hint; the API may ignore or override this
fields dict None {"include": [...], "exclude": [...]} to reduce payload size

Verification and Testing

After a search, confirm correct results before investing in downstream compute:

import json
from datetime import datetime, timezone


def verify_search_results(
    items: list[dict],
    bbox: list[float],
    datetime_range: str,
    max_cloud: float | None = None,
) -> dict:
    """Run assertions on a STAC search result set and return a summary."""
    issues: list[str] = []
    start_str, end_str = datetime_range.split("/")
    dt_start = datetime.fromisoformat(start_str.replace("Z", "+00:00"))
    dt_end = (
        datetime.now(tz=timezone.utc)
        if end_str == ".."
        else datetime.fromisoformat(end_str.replace("Z", "+00:00"))
    )

    for item in items:
        item_id = item["id"]
        props = item.get("properties", {})

        # Check temporal bounds
        item_dt_str = props.get("datetime", "")
        if item_dt_str:
            item_dt = datetime.fromisoformat(item_dt_str.replace("Z", "+00:00"))
            if not (dt_start <= item_dt <= dt_end):
                issues.append(f"{item_id}: datetime {item_dt_str} out of range")

        # Check cloud cover threshold
        if max_cloud is not None:
            cloud = props.get("eo:cloud_cover")
            if cloud is not None and cloud > max_cloud:
                issues.append(f"{item_id}: cloud_cover {cloud} exceeds {max_cloud}")

        # Check that requested assets exist
        assets = item.get("assets", {})
        if not assets:
            issues.append(f"{item_id}: no assets found")

    return {
        "total_items": len(items),
        "issues": issues,
        "passed": len(issues) == 0,
    }


report = verify_search_results(
    items,
    bbox=[-122.5, 37.5, -122.0, 38.0],
    datetime_range="2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
    max_cloud=20.0,
)
print(json.dumps(report, indent=2))

Expected output for a clean result set:

{
  "total_items": 12,
  "issues": [],
  "passed": true
}

Troubleshooting

Empty result set despite known data existing

Cause: The bbox coordinates are inverted (lat/lon swapped) or the geometry is not in WGS84.
Fix: Confirm order is [west, south, east, north] with values in the range [-180, -90, 180, 90]. Reproject non-WGS84 geometries before the call; see fixing EPSG mismatches in rasterio.open for a worked example.

ConformanceError or query filter silently ignored

Cause: The target catalog does not implement the Query Extension or CQL2.
Fix: Introspect client.conformance_classes() before constructing property filters. Fall back to in-memory filtering with a list comprehension over items_as_dicts().

Pagination stops early — result count much lower than expected

Cause: The limit parameter is treated as a maximum per-page hint and the API may honour a lower server-side cap. Or max_items was set too low.
Fix: Remove the limit argument and let items_as_dicts() follow next links automatically. Raise max_items or set it to None with caution and a log counter.

Asset URL returns 403 or 401

Cause: The asset is in a requester-pays S3 bucket, or the catalog requires per-asset signed URLs.
Fix: For Earth Search (AWS), configure your AWS credentials via boto3 / environment variables so rasterio can sign range requests. For Planetary Computer, use the planetary-computer Python package to sign asset hrefs before passing them to rasterio.

rasterio.open() hangs indefinitely on an asset URL

Cause: The asset is a classic TIFF (not COG) — it lacks internal tiling, so rasterio must fetch the entire file header before proceeding. Or the object-storage bucket blocks range requests.
Fix: Validate Accept-Ranges: bytes in the HEAD response before opening. Confirm the file is a valid COG with rio cogeo validate <url>. Review cloud-optimized GeoTIFF structure to understand why non-COG files cause this.


Frequently Asked Questions

Which STAC API extension controls property filtering like cloud cover?

The Query Extension (conformance class stac-api-query) allows filtering on item properties using operators like lte, eq, and gte. Newer APIs implement CQL2 (conformance class cql2), which supports richer logical expressions. Always check client.conformance_classes() before relying on either.

Why does pystac-client return fewer items than expected?

Most catalogs cap results per page (often 250–500 items). Use items_as_dicts() as a generator — it follows next-page links automatically. A hard max_items guard prevents runaway iteration, so check that value is not set too low.

How do I authenticate with a private STAC catalog?

Pass an Authorization header via Client.open(url, headers={"Authorization": "Bearer TOKEN"}). Load the token from an environment variable (os.getenv("STAC_TOKEN")) rather than hardcoding it.


Deep-Dive Articles