Core Raster Fundamentals & STAC Mapping
Modern remote sensing workflows have shifted from monolithic, file-based processing to distributed, cloud-native architectures. At the centre of this transition is a dual requirement: a rigorous understanding of low-level raster data structures and the ability to navigate spatiotemporal catalogs efficiently. This section covers every layer of that stack — from pixel geometry and COG byte layout through STAC-driven discovery and lazy xarray evaluation — giving Python GIS developers, remote sensing analysts, and environmental data engineers a resilient architectural baseline for scalable, reproducible pipelines.
Conceptual Foundation: The Raster Data Model
A raster dataset represents continuous spatial phenomena as a two-dimensional grid of cells. Each cell carries a numeric value and is anchored in space by four interdependent components that every production pipeline must handle explicitly:
Geotransform. The six-parameter affine transformation matrix maps pixel coordinates (row, col) to geographic coordinates (x, y). In rasterio this is exposed as an Affine object. It encodes the origin, pixel width and height, and any rotation or skew. Misalignment during concatenation — even by a fraction of a pixel — can silently shift features by hundreds of metres. Correct affine handling underpins the detailed workflows in Handling Pixel Resolution and Scaling.
Data type and bit depth. uint8 is standard for classified land cover; float32 or int16 is required for surface reflectance or digital elevation models. Choosing the wrong type before arithmetic causes silent overflow or precision loss that is extremely difficult to detect downstream.
Band structure. Multispectral and hyperspectral datasets stack channels along a third dimension. Consistent band ordering and spectral response alignment are non-negotiable during index calculations — details covered in Band Math Operations with Xarray.
Nodata and quality masks. Missing data is represented as NaN, 0, or explicit bitmasks depending on the product. Improper handling — particularly assuming 0 means empty — introduces statistical contamination that propagates silently through aggregations. Correct nodata propagation is foundational before any step of Extracting and Parsing Raster Metadata.
The diagram below shows how these four components interact from sensor pixel to georeferenced value, and where the cloud-native layer inserts itself.
CRS: The Hidden Dependency
Coordinate Reference Systems (CRS) add another layer of complexity. Raster grids are stored in projected systems (e.g., UTM) for accurate distance calculations, but discovery interfaces default to geographic coordinates (WGS84). Before stacking multi-source data, every grid must share an identical CRS — Mastering CRS Transformations in Rasterio covers the reprojection patterns that avoid introducing resampling artifacts or losing geospatial precision.
Key Components Table
| Component | Class / Format | Role in the pipeline |
|---|---|---|
rasterio.DatasetReader |
Python object | Low-level raster I/O: reads pixels, geotransform, CRS, nodata, and band count from any GDAL-supported format |
Affine (affine library) |
Python object | Six-parameter matrix that converts (row, col) ↔ (x, y) in the dataset’s CRS |
rasterio.CRS |
Python object | Encapsulates EPSG codes, PROJ strings, and WKT; used in reproject and warp operations |
xarray.DataArray |
Python object | N-dimensional labelled array; stores raster stacks as (time, band, y, x) with coordinate metadata |
xarray.Dataset |
Python object | Dict-like container of DataArray variables sharing coordinates; used for multi-band or multi-variable products |
| Cloud-Optimized GeoTIFF (COG) | File format | Internal-tiled, overview-embedded GeoTIFF that supports HTTP Range reads for partial spatial access |
| STAC Item | JSON document | Single spatiotemporal record linking to one or more COG assets; carries bbox, datetime, and extension properties |
pystac_client.Client |
Python object | STAC API client; issues search queries by bbox, datetime, collection, and property filters |
stackstac.stack() |
Python function | Converts a list of STAC Items into a Dask-backed xarray.DataArray by auto-resolving CRS, resolution, and band alignment |
odc.stac.load() |
Python function | Alternative STAC-to-array loader; preserves native projections and provides tighter grouping control |
Production Patterns
The four patterns below underpin every workflow in this section. Each one is a prerequisite for the more specific workflows linked from it.
Pattern 1 — Metadata-First Inspection
Before reading a single pixel, inspect the dataset’s spatial context. This pattern prevents the silent data corruption that arises when scaling factors, nodata values, or CRS strings are assumed rather than read.
import rasterio
from rasterio.crs import CRS
with rasterio.open("s3://sentinel-cogs/sentinel-s2-l2a-cogs/32/T/NM/2025/6/S2B_32TNM_20250615_0_L2A/B04.tif") as src:
print("CRS: ", src.crs)
print("Transform: ", src.transform)
print("Shape: ", src.height, src.width)
print("Dtype: ", src.dtypes)
print("Nodata: ", src.nodata)
print("Scales: ", src.scales)
print("Offsets: ", src.offsets)
print("Tags: ", src.tags())
The full extraction workflow — including GDAL subdataset handling and multi-band metadata normalisation — is covered in Extracting and Parsing Raster Metadata.
Pattern 2 — COG-Aware Windowed Read
Reading a spatial subset from a COG without downloading the full file requires constructing a rasterio.windows.Window from a bounding box in the dataset’s CRS. The library translates this into one or more HTTP Range requests targeting the appropriate internal tiles.
import rasterio
from rasterio.windows import from_bounds
from rasterio.crs import CRS
import rasterio.warp
cog_url = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/32/T/NM/2025/6/S2B_32TNM_20250615_0_L2A/B04.tif"
# AOI in WGS84 — must be reprojected to dataset CRS before windowing
aoi_wgs84 = (-105.5, 39.5, -104.5, 40.5) # (left, bottom, right, top)
with rasterio.open(cog_url) as src:
# Reproject AOI bounds to dataset CRS
aoi_native = rasterio.warp.transform_bounds(
CRS.from_epsg(4326), src.crs, *aoi_wgs84
)
window = from_bounds(*aoi_native, transform=src.transform)
data = src.read(window=window) # shape: (bands, rows, cols)
window_transform = src.window_transform(window)
print("Read shape:", data.shape)
print("Window transform:", window_transform)
The structure that makes this possible — internal tiling, overview placement, and IFD layout — is explained in detail at Understanding Cloud-Optimized GeoTIFF Structure.
Pattern 3 — STAC Search to Lazy Stack
This is the entry point for any pipeline that processes multiple scenes or time steps. The STAC query defines what data to load; stackstac translates the results into a Dask-backed array without transferring pixel data until .compute() is called.
import stackstac
from pystac_client import Client
client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
collections=["sentinel-2-l2a"],
bbox=[-105.5, 39.5, -104.5, 40.5],
datetime="2025-06-01/2025-06-30",
query={"eo:cloud_cover": {"lt": 10}},
)
items = list(search.item_collection())
print(f"Found {len(items)} items")
stack = stackstac.stack(
items,
assets=["B04", "B08", "SCL"],
rescale=False, # keep raw int16; apply scaling factors explicitly
epsg=32613,
resolution=10,
)
# stack is a Dask-backed DataArray: (time, band, y, x) — nothing downloaded yet
print(stack)
The full catalog search workflow, including pagination and property-filter syntax, lives at Querying STAC Catalogs Programmatically.
Pattern 4 — Lazy Index Calculation with Quality Masking
Once a lazy stack exists, radiometric operations and quality masking compose cleanly before the single .compute() call that triggers actual I/O.
import numpy as np
# Apply Sentinel-2 Collection 1 scaling: reflectance = DN * 0.0001 + (−0.1)
nir = stack.sel(band="B08").astype("float32") * 0.0001 - 0.1
red = stack.sel(band="B04").astype("float32") * 0.0001 - 0.1
# SCL classes 4 (vegetation), 5 (bare soil), 6 (water) = valid pixels
scl = stack.sel(band="SCL")
valid = scl.isin([4, 5, 6])
ndvi = (nir - red) / (nir + red)
ndvi_masked = ndvi.where(valid)
# Temporal mean — still lazy
monthly_mean = ndvi_masked.mean(dim="time")
# Compute only at write time
result = monthly_mean.compute()
print("Result shape:", result.shape, "| dtype:", result.dtype)
Composing spectral formulas, temporal aggregations, and multi-index pipelines at scale is covered in Band Math Operations with Xarray.
Common Pitfalls and Failure Modes
These are the six failure modes that most frequently corrupt production raster pipelines. Each one is silent — the code runs without error but produces wrong outputs.
1. Misaligned Geotransform After Reprojection
Symptom: Pixel values from two reprojected scenes are offset by 1–5 pixels when overlaid.
Cause: rasterio.warp.reproject() defaults to aligning the output grid to the destination CRS origin, not to a shared reference grid. Two scenes reprojected independently rarely land on the same pixel boundaries.
Fix: Always specify a common destination array or pass explicit transform, width, and height derived from a reference dataset. The patterns in Mastering CRS Transformations in Rasterio show how to pin outputs to a shared grid.
2. Missing COG Overviews
Symptom: Reading a COG at low zoom (small window relative to full scene) is dramatically slower than expected, or downloads the entire file.
Cause: The file was written as a plain GeoTIFF or COG without embedded overviews. GDAL falls back to reading and downsampling the full-resolution data.
Fix: Regenerate with gdal_translate -of COG -co OVERVIEWS=IGNORE_EXISTING or validate with rio cogeo validate. The COG Specification and the full internals walkthrough at Understanding Cloud-Optimized GeoTIFF Structure cover the correct overview levels.
3. STAC Pagination Truncation
Symptom: Search returns fewer items than expected; a temporal stack has gaps.
Cause: STAC APIs cap results per page (often 100 or 250). Calling .items() without handling pagination silently drops items beyond the first page.
Fix: Use search.item_collection() which pystac-client automatically paginates, or iterate with search.pages() when memory is constrained. Validate item counts against the context object returned by supporting APIs.
4. Scaling Factor Not Applied
Symptom: NDVI values outside [-1, 1]; reflectance values in the thousands.
Cause: Sentinel-2 L2A (Collection 1 onwards) stores reflectance as int16 with scale 0.0001 and offset -0.1. Reading raw int16 and computing indices without applying these factors yields mathematically invalid results.
Fix: Read src.scales and src.offsets from the dataset metadata, or apply the product-specification factors explicitly before arithmetic. See the radiometric handling steps in Handling Pixel Resolution and Scaling.
5. Silent Nodata Inclusion in Aggregations
Symptom: Statistical aggregations (mean, percentile, max) are contaminated; the output raster contains stripe artefacts at scene boundaries.
Cause: A dataset’s nodata value (0, -9999, 32767) is read into a plain NumPy array where it is treated as a valid numeric value.
Fix: Always mask nodata before aggregation: xarray.where(data != src.nodata) or read via rasterio.read(masked=True) which returns a numpy.ma.MaskedArray. The batch extraction workflow at Automating Metadata Extraction for Batch Raster Jobs shows how to automate this check across many files.
6. Chunk Boundary Resampling Artefacts
Symptom: Faint grid lines appear in the output raster at Dask chunk boundaries; pixel values at chunk edges differ from interior pixels.
Cause: Resampling or convolution operations (e.g., bilinear interpolation, Gaussian smoothing) require context pixels beyond each chunk’s boundary. Without explicit overlap, Dask processes each chunk in isolation and the boundary pixels are computed from edge-padded data.
Fix: Add overlap with da.map_overlap(..., depth=kernel_radius, boundary='reflect'). For COG reads, align chunk sizes to the COG’s internal tile size (commonly 256 or 512) to eliminate partial tile fetches.
Performance and Scale Considerations
Memory Layout and Chunking Strategy
The single most impactful choice in a cloud raster pipeline is chunk size. Chunks that are too small create Dask task graph overhead that dwarfs computation time; chunks that are too large trigger out-of-memory errors on worker nodes.
A reliable starting point: match chunks to the COG’s internal tile size in the spatial dimensions, and set the time dimension to a small integer (4–16) based on the number of bands and the worker’s RAM.
# Rechunk to align with COG 512×512 internal tiles
# Assumes workers have ~8 GB RAM; adjust time chunk accordingly
stack_chunked = stack.chunk({"time": 8, "band": -1, "y": 512, "x": 512})
For continental-scale mosaics, use rasterio.vrt.WarpedVRT to virtualise the reprojection without materialising intermediate files. When reading many small windows from COGs on S3, set GDAL_HTTP_MAX_RETRY=5 and GDAL_HTTP_RETRY_DELAY=0.5 as environment variables to handle transient 503 responses without explicit retry logic in Python.
Parallelism Model
stackstac and odc-stac both produce Dask task graphs. The parallelism model depends on the scheduler:
- Local machine:
dask.compute()with the threaded scheduler is safe for I/O-bound work (most raster pipelines). Usenthreadsequal to the number of vCPUs. - Dask distributed cluster: Each worker must have GDAL and the necessary cloud credentials. Pass
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIRto suppress redundant directory scans on object storage. - Coiled or AWS Batch: Prefer pre-built Docker images that include
rasterio[s3]andproj-datato avoid slow PROJ datum download on worker startup.
I/O Cost Reduction
- Request fewer assets. Only include the STAC asset keys you need (e.g.,
["B04", "B08", "SCL"]rather than all 13 Sentinel-2 bands). - Use overviews for exploration. Read from
src.overview_levelfor quick visual checks before committing to full-resolution computation. - Prefer ZSTD compression for COGs you generate: it achieves near-DEFLATE compression ratios at 3–5× faster decode speed, reducing CPU bottlenecks on high-throughput workers.
FAQ
Q: Do I need to download a COG before reading it with rasterio?
No. When rasterio opens a URL prefixed with /vsicurl/ or an s3:// path (with rasterio[s3] installed), GDAL issues HTTP Range requests automatically. You only download the tiles and overview levels required for your read window.
Q: What is the difference between stackstac and odc-stac?
stackstac reprojects all items to a common target CRS and resolution automatically, which simplifies multi-scene stacking at the cost of an implicit warp step. odc-stac (odc.stac.load()) preserves native projections by default and provides finer grouping control, making it preferable when you want to avoid an intermediate resampling step or when items share a consistent native CRS.
Q: Why does pystac-client return fewer items than the STAC API web interface shows?
The API enforces a per-page limit. search.items() only retrieves the first page. Use search.item_collection() to transparently paginate, or check search.matched() before iterating to detect the discrepancy.
Q: How do I apply Sentinel-2 Collection 1 scaling factors correctly?
Multiply the raw int16 value by 0.0001 and add -0.1 to get surface reflectance. Cast to float32 first to avoid integer overflow: reflectance = band.astype("float32") * 0.0001 - 0.1. The full parameter table including per-band offsets for other missions is in Handling Pixel Resolution and Scaling.
Related
- Extracting and Parsing Raster Metadata — Read CRS, geotransform, nodata, and scaling factors before any pixel operation; the essential first step in every pipeline.
- Understanding Cloud-Optimized GeoTIFF Structure — Deep dive into internal tiling, overview layout, and byte-range I/O patterns that make COG reads efficient.
- Mastering CRS Transformations in Rasterio — Reproject rasters and bounding boxes without introducing misalignment or resampling artefacts.
- Querying STAC Catalogs Programmatically — Use
pystac-clientto filter spatiotemporal assets by bbox, datetime, and cloud cover before loading any data. - Band Math Operations with Xarray — Compose spectral index calculations and temporal aggregations as lazy Dask graphs over STAC-sourced stacks.
- Handling Pixel Resolution and Scaling — Align multi-sensor grids, apply radiometric scaling, and validate output dynamic ranges.
- Satellite Processing Workflows & Index Pipelines — The sibling section covering end-to-end spectral index pipelines, cloud masking, mosaicking, and temporal aggregation.