Querying STAC Catalogs Programmatically
Modern remote sensing infrastructure has shifted from monolithic FTP archives to distributed, API-driven ecosystems. Discovering, filtering, and retrieving satellite assets programmatically is now the bottleneck that determines whether a raster pipeline is reproducible at scale — or fragile by design. For the full architectural context of how STAC metadata maps to raster I/O, see Core Raster Fundamentals & STAC Mapping.
The SpatioTemporal Asset Catalog (STAC) specification standardises how geospatial assets are described, discovered, and accessed across cloud providers, government repositories, and commercial platforms. This page walks through a tested, production-ready workflow for interacting with STAC APIs in Python: client initialisation, spatiotemporal filtering, safe pagination, asset validation, and handoff to downstream raster tools.
Prerequisites
Install the required libraries before working through the steps below:
pip install "pystac-client>=0.7.0" "shapely>=2.0" pyproj "rasterio>=1.3" requests
| Library | Min version | Why required |
|---|---|---|
pystac-client |
0.7.0 | STAC API client with pagination and conformance detection |
shapely |
2.0 | Geometry construction for intersects filters |
pyproj |
3.5 | CRS validation before submitting spatial filters |
rasterio |
1.3 | Asset validation and downstream window reads |
requests |
2.31 | Session-level retry strategy for the HTTP transport |
Conceptual prerequisites: Understanding how the STAC item hierarchy (Catalog → Collection → Item → Asset) maps to files on object storage is assumed. If understanding Cloud-Optimized GeoTIFF structure is new to you, read that page first — knowing how internal tiling and overviews work explains why byte-range validation matters in step 4.
Step-by-Step Query Workflow
Step 1 — Initialise the STAC client
A pystac_client.Client is the programmatic entry point to a catalog’s /search endpoint. On open, it fetches the root catalog, discovers available collections, and validates the API’s conformance classes against the OGC API – Features standard. Supply a retry-equipped requests.Session so transient HTTP 429 and 5xx errors do not abort long-running pipeline runs.
import os
import pystac_client
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def get_stac_client(url: str, token: str | None = None) -> pystac_client.Client:
"""Open a STAC client with retry logic and optional bearer-token auth."""
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["HEAD", "GET", "POST"],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)
headers: dict[str, str] = {}
if token:
headers["Authorization"] = f"Bearer {token}"
return pystac_client.Client.open(url, headers=headers)
# Public catalogs — no auth required
EARTH_SEARCH = "https://earth-search.aws.element84.com/v1"
PC = "https://planetarycomputer.microsoft.com/api/stac/v1"
client = get_stac_client(EARTH_SEARCH)
For private catalogs, load the token from an environment variable rather than hardcoding it:
client = get_stac_client(PC, token=os.getenv("PC_SDK_SUBSCRIPTION_KEY"))
Step 2 — Define spatiotemporal and metadata constraints
STAC queries are built from three dimensions. Precise constraint definition avoids over-fetching and reduces the downstream compute cost of filtering large Item collections in memory.
Spatial constraints
Filter by bounding box (bbox) or GeoJSON geometry (intersects). Coordinates must be in WGS84 (EPSG:4326). If your area of interest is in a projected CRS, reproject it before querying — misaligned CRS inputs silently return empty results. For worked examples of projecting geometries, see mastering CRS transformations in rasterio.
from shapely.geometry import box, mapping
# Bounding box: [west, south, east, north] in WGS84
bbox = [-122.5, 37.5, -122.0, 38.0]
# GeoJSON polygon for non-rectangular AOIs
aoi_geom = box(-122.45, 37.70, -122.35, 37.80)
geojson_filter = mapping(aoi_geom)
Temporal constraints
Temporal filters use ISO 8601 datetime strings. Use .. for open-ended ranges. Align date windows with acquisition schedules: Sentinel-2 revisits every 5 days (two satellites) and Landsat-8/9 revisits every 8 days.
# Closed range
datetime_range = "2025-06-01T00:00:00Z/2025-06-30T23:59:59Z"
# Open-ended: everything from June 2025 to now
open_range = "2025-06-01T00:00:00Z/.."
# Single acquisition date (inclusive of the full day)
single_day = "2025-07-15T00:00:00Z/2025-07-15T23:59:59Z"
Property filters (Query Extension)
Beyond space and time, many STAC APIs support property filtering via the Query Extension or CQL2. This enables server-side cloud-cover thresholds, platform selection, and custom vendor fields — keeping the result set small before it reaches your process.
# Query Extension syntax (older APIs: earth-search v0, USGS)
query_ext = {
"eo:cloud_cover": {"lte": 20},
"platform": {"eq": "sentinel-2b"},
}
# CQL2-JSON syntax (newer APIs: earth-search v1, Planetary Computer)
# Pass via filter= and filter-lang="cql2-json"
cql2_filter = {
"op": "and",
"args": [
{"op": "<=", "args": [{"property": "eo:cloud_cover"}, 20]},
{"op": "=", "args": [{"property": "platform"}, "sentinel-2b"]},
],
}
Always introspect the catalog’s conformance classes before relying on these extensions:
conformance = client.conformance_classes()
supports_query = any("stac-api-query" in c for c in conformance)
supports_cql2 = any("cql2" in c for c in conformance)
Step 3 — Execute the search and handle pagination
Large catalogs return paginated results, typically capping at 250–500 items per page. The pystac-client search() iterator follows next links transparently when consumed as a generator. Always impose a hard max_items cap in automated pipelines to prevent runaway iteration during API outages or misconfigured filters.
def execute_stac_search(
client: pystac_client.Client,
collections: list[str],
bbox: list[float] | None = None,
datetime_str: str | None = None,
query: dict | None = None,
max_items: int = 500,
) -> list[dict]:
"""Execute a paginated STAC search and return items as plain dicts.
Uses items_as_dicts() to avoid instantiating full pystac.Item objects,
which reduces memory overhead for large result sets.
"""
search = client.search(
collections=collections,
bbox=bbox,
datetime=datetime_str,
query=query,
max_items=max_items,
)
items: list[dict] = []
for item in search.items_as_dicts():
items.append(item)
if len(items) >= max_items:
break
return items
# Example: Sentinel-2 L2A over San Francisco, June 2025, ≤20 % cloud cover
items = execute_stac_search(
client=client,
collections=["sentinel-2-l2a"],
bbox=[-122.5, 37.5, -122.0, 38.0],
datetime_str="2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
query={"eo:cloud_cover": {"lte": 20}},
max_items=50,
)
print(f"Found {len(items)} items")
for item in items[:3]:
print(item["id"], item["properties"].get("eo:cloud_cover"))
Memory note: items_as_dicts() streams results without buffering the full pystac.ItemCollection in RAM. For million-item catalogs, process items in the loop rather than collecting them all into a list.
Step 4 — Extract assets and validate raster readiness
Each Item contains an assets dictionary mapping asset keys (e.g., B04, visual, SCL, data-mask) to href URLs, MIME types, and roles. Before passing URLs to raster I/O libraries, validate three properties: HTTP reachability, byte-range support (required for COG streaming), and parseable raster metadata.
Understanding why byte-range support matters requires familiarity with how cloud-optimized GeoTIFF structure enables partial reads — the internal tiling and overview layout determine which byte offsets the client must fetch.
import rasterio
from rasterio.errors import RasterioIOError
def validate_asset_url(url: str, timeout: int = 10) -> bool:
"""Return True if the asset is reachable, byte-range capable, and parseable."""
try:
# HEAD request: check reachability and byte-range support
resp = requests.head(url, timeout=timeout, allow_redirects=True)
if resp.status_code != 200:
return False
if "bytes" not in resp.headers.get("Accept-Ranges", ""):
return False
# Open header-only with rasterio to confirm the file is a valid raster
with rasterio.open(url) as src:
return src.count > 0 and src.width > 0
except (requests.RequestException, RasterioIOError, Exception):
return False
def extract_validated_assets(
items: list[dict],
asset_keys: list[str],
) -> list[dict]:
"""Return a list of validated asset records for each requested asset key."""
validated: list[dict] = []
for item in items:
for key in asset_keys:
asset = item.get("assets", {}).get(key)
if asset is None:
continue
href = asset.get("href", "")
if validate_asset_url(href):
validated.append(
{
"item_id": item["id"],
"asset_key": key,
"href": href,
"media_type": asset.get("type", ""),
"datetime": item["properties"].get("datetime"),
"cloud_cover": item["properties"].get("eo:cloud_cover"),
}
)
return validated
# Validate the red and NIR bands for NDVI
validated_assets = extract_validated_assets(
items,
asset_keys=["B04", "B08"], # Sentinel-2 Red and NIR
)
Step 5 — Prepare results for downstream processing
Validated asset URLs can be streamed directly into rasterio, xarray, or rioxarray workflows without downloading full files. HTTP range requests read only the bounding window your analysis needs, which is where COG tiling pays off at scale.
For band math operations with xarray — such as computing NDVI from these Sentinel-2 assets — pass the hrefs as lazy-loaded arrays. For windowed reads with fine-grained spatial control, the handling pixel resolution and scaling page covers how to align window geometry with the native geotransform before reading.
import rasterio
from rasterio.windows import from_bounds
from rasterio.crs import CRS
from pyproj import Transformer
def read_asset_window(
href: str,
bounds_wgs84: tuple[float, float, float, float],
) -> tuple:
"""Read a spatial window from a COG asset, reprojecting the AOI if needed.
Args:
href: Validated COG asset URL.
bounds_wgs84: AOI as (west, south, east, north) in WGS84.
Returns:
Tuple of (numpy array, rasterio transform, CRS string).
"""
with rasterio.open(href) as src:
# Reproject the WGS84 AOI into the raster's native CRS
transformer = Transformer.from_crs(
"EPSG:4326", src.crs.to_epsg(), always_xy=True
)
west, south = transformer.transform(bounds_wgs84[0], bounds_wgs84[1])
east, north = transformer.transform(bounds_wgs84[2], bounds_wgs84[3])
window = from_bounds(west, south, east, north, src.transform)
data = src.read(1, window=window)
window_transform = src.window_transform(window)
return data, window_transform, src.crs.to_string()
# Read the Red band window for the first validated asset
red_assets = [a for a in validated_assets if a["asset_key"] == "B04"]
if red_assets:
data, transform, crs = read_asset_window(
red_assets[0]["href"],
bounds_wgs84=(-122.45, 37.70, -122.35, 37.80),
)
print(f"Read array shape: {data.shape}, CRS: {crs}")
Parameter Reference
Key parameters for pystac_client.Client.search():
| Parameter | Type | Default | Usage note |
|---|---|---|---|
collections |
list[str] |
None |
Restrict to named collections; omitting searches the entire catalog |
bbox |
list[float] |
None |
[west, south, east, north] in WGS84; mutually exclusive with intersects |
intersects |
dict (GeoJSON) |
None |
Polygon or multi-polygon for non-rectangular AOIs |
datetime |
str |
None |
ISO 8601 single datetime or range start/end; use .. for open ends |
query |
dict |
None |
Query Extension property filters (lte, eq, gte, etc.) |
filter |
dict |
None |
CQL2-JSON filter expression for newer APIs |
filter_lang |
str |
None |
"cql2-json" or "cql2-text" (required when filter is set) |
max_items |
int |
None |
Hard cap on returned items; None iterates all pages |
limit |
int |
None |
Page size hint; the API may ignore or override this |
fields |
dict |
None |
{"include": [...], "exclude": [...]} to reduce payload size |
Verification and Testing
After a search, confirm correct results before investing in downstream compute:
import json
from datetime import datetime, timezone
def verify_search_results(
items: list[dict],
bbox: list[float],
datetime_range: str,
max_cloud: float | None = None,
) -> dict:
"""Run assertions on a STAC search result set and return a summary."""
issues: list[str] = []
start_str, end_str = datetime_range.split("/")
dt_start = datetime.fromisoformat(start_str.replace("Z", "+00:00"))
dt_end = (
datetime.now(tz=timezone.utc)
if end_str == ".."
else datetime.fromisoformat(end_str.replace("Z", "+00:00"))
)
for item in items:
item_id = item["id"]
props = item.get("properties", {})
# Check temporal bounds
item_dt_str = props.get("datetime", "")
if item_dt_str:
item_dt = datetime.fromisoformat(item_dt_str.replace("Z", "+00:00"))
if not (dt_start <= item_dt <= dt_end):
issues.append(f"{item_id}: datetime {item_dt_str} out of range")
# Check cloud cover threshold
if max_cloud is not None:
cloud = props.get("eo:cloud_cover")
if cloud is not None and cloud > max_cloud:
issues.append(f"{item_id}: cloud_cover {cloud} exceeds {max_cloud}")
# Check that requested assets exist
assets = item.get("assets", {})
if not assets:
issues.append(f"{item_id}: no assets found")
return {
"total_items": len(items),
"issues": issues,
"passed": len(issues) == 0,
}
report = verify_search_results(
items,
bbox=[-122.5, 37.5, -122.0, 38.0],
datetime_range="2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
max_cloud=20.0,
)
print(json.dumps(report, indent=2))
Expected output for a clean result set:
{
"total_items": 12,
"issues": [],
"passed": true
}
Troubleshooting
Empty result set despite known data existing
Cause: The bbox coordinates are inverted (lat/lon swapped) or the geometry is not in WGS84.
Fix: Confirm order is [west, south, east, north] with values in the range [-180, -90, 180, 90]. Reproject non-WGS84 geometries before the call; see fixing EPSG mismatches in rasterio.open for a worked example.
ConformanceError or query filter silently ignored
Cause: The target catalog does not implement the Query Extension or CQL2.
Fix: Introspect client.conformance_classes() before constructing property filters. Fall back to in-memory filtering with a list comprehension over items_as_dicts().
Pagination stops early — result count much lower than expected
Cause: The limit parameter is treated as a maximum per-page hint and the API may honour a lower server-side cap. Or max_items was set too low.
Fix: Remove the limit argument and let items_as_dicts() follow next links automatically. Raise max_items or set it to None with caution and a log counter.
Asset URL returns 403 or 401
Cause: The asset is in a requester-pays S3 bucket, or the catalog requires per-asset signed URLs.
Fix: For Earth Search (AWS), configure your AWS credentials via boto3 / environment variables so rasterio can sign range requests. For Planetary Computer, use the planetary-computer Python package to sign asset hrefs before passing them to rasterio.
rasterio.open() hangs indefinitely on an asset URL
Cause: The asset is a classic TIFF (not COG) — it lacks internal tiling, so rasterio must fetch the entire file header before proceeding. Or the object-storage bucket blocks range requests.
Fix: Validate Accept-Ranges: bytes in the HEAD response before opening. Confirm the file is a valid COG with rio cogeo validate <url>. Review cloud-optimized GeoTIFF structure to understand why non-COG files cause this.
Frequently Asked Questions
Which STAC API extension controls property filtering like cloud cover?
The Query Extension (conformance class stac-api-query) allows filtering on item properties using operators like lte, eq, and gte. Newer APIs implement CQL2 (conformance class cql2), which supports richer logical expressions. Always check client.conformance_classes() before relying on either.
Why does pystac-client return fewer items than expected?
Most catalogs cap results per page (often 250–500 items). Use items_as_dicts() as a generator — it follows next-page links automatically. A hard max_items guard prevents runaway iteration, so check that value is not set too low.
How do I authenticate with a private STAC catalog?
Pass an Authorization header via Client.open(url, headers={"Authorization": "Bearer TOKEN"}). Load the token from an environment variable (os.getenv("STAC_TOKEN")) rather than hardcoding it.
Related
- Using pystac-client to filter Sentinel-2 imagery by date — Targeted walkthrough of date-range and scene-selection patterns for Sentinel-2 archives.
- Understanding Cloud-Optimized GeoTIFF Structure — How internal tiling and overviews determine which byte ranges STAC asset reads must fetch.
- How to read COG headers without downloading full files — Efficient header inspection before committing to a full window read.
- Mastering CRS Transformations in rasterio — Ensures spatial filters and raster windows share a consistent coordinate reference system.
- Band Math Operations with xarray — What to do with STAC asset URLs once they are validated and ready for spectral computation.