Core Raster Fundamentals & STAC Mapping

Modern remote sensing workflows have shifted from monolithic, file-based processing to distributed, cloud-native architectures. At the centre of this transition is a dual requirement: a rigorous understanding of low-level raster data structures and the ability to navigate spatiotemporal catalogs efficiently. This section covers every layer of that stack — from pixel geometry and COG byte layout through STAC-driven discovery and lazy xarray evaluation — giving Python GIS developers, remote sensing analysts, and environmental data engineers a resilient architectural baseline for scalable, reproducible pipelines.


Conceptual Foundation: The Raster Data Model

A raster dataset represents continuous spatial phenomena as a two-dimensional grid of cells. Each cell carries a numeric value and is anchored in space by four interdependent components that every production pipeline must handle explicitly:

Geotransform. The six-parameter affine transformation matrix maps pixel coordinates (row, col) to geographic coordinates (x, y). In rasterio this is exposed as an Affine object. It encodes the origin, pixel width and height, and any rotation or skew. Misalignment during concatenation — even by a fraction of a pixel — can silently shift features by hundreds of metres. Correct affine handling underpins the detailed workflows in Handling Pixel Resolution and Scaling.

Data type and bit depth. uint8 is standard for classified land cover; float32 or int16 is required for surface reflectance or digital elevation models. Choosing the wrong type before arithmetic causes silent overflow or precision loss that is extremely difficult to detect downstream.

Band structure. Multispectral and hyperspectral datasets stack channels along a third dimension. Consistent band ordering and spectral response alignment are non-negotiable during index calculations — details covered in Band Math Operations with Xarray.

Nodata and quality masks. Missing data is represented as NaN, 0, or explicit bitmasks depending on the product. Improper handling — particularly assuming 0 means empty — introduces statistical contamination that propagates silently through aggregations. Correct nodata propagation is foundational before any step of Extracting and Parsing Raster Metadata.

The diagram below shows how these four components interact from sensor pixel to georeferenced value, and where the cloud-native layer inserts itself.

Raster Data Model and Cloud-Native Pipeline Diagram showing the flow from sensor pixels through the raster data model components (geotransform, data type, band structure, nodata masks) into COG storage and STAC catalog discovery, terminating in a lazy xarray DataArray. Sensor Pixel Grid (row, col) → DN value Raster Data Model Affine geotransform Data type / bit depth Band structure Nodata + quality masks CRS (projection) rasterio.DatasetReader COG Storage Internal tiling (256×256) Embedded overviews Header-first IFD HTTP Range reads S3 / GCS / Azure Blob STAC Catalog Catalog (grouping) Collection (metadata) Item (scene/tile/time) Asset href → COG bbox / datetime filter pystac_client.Client stackstac / odc-stac Lazy xarray DataArray Dask-backed chunks No data transferred until .compute() dims: (time, band, y, x) direct rasterio.open() path

CRS: The Hidden Dependency

Coordinate Reference Systems (CRS) add another layer of complexity. Raster grids are stored in projected systems (e.g., UTM) for accurate distance calculations, but discovery interfaces default to geographic coordinates (WGS84). Before stacking multi-source data, every grid must share an identical CRS — Mastering CRS Transformations in Rasterio covers the reprojection patterns that avoid introducing resampling artifacts or losing geospatial precision.


Key Components Table

Component Class / Format Role in the pipeline
rasterio.DatasetReader Python object Low-level raster I/O: reads pixels, geotransform, CRS, nodata, and band count from any GDAL-supported format
Affine (affine library) Python object Six-parameter matrix that converts (row, col)(x, y) in the dataset’s CRS
rasterio.CRS Python object Encapsulates EPSG codes, PROJ strings, and WKT; used in reproject and warp operations
xarray.DataArray Python object N-dimensional labelled array; stores raster stacks as (time, band, y, x) with coordinate metadata
xarray.Dataset Python object Dict-like container of DataArray variables sharing coordinates; used for multi-band or multi-variable products
Cloud-Optimized GeoTIFF (COG) File format Internal-tiled, overview-embedded GeoTIFF that supports HTTP Range reads for partial spatial access
STAC Item JSON document Single spatiotemporal record linking to one or more COG assets; carries bbox, datetime, and extension properties
pystac_client.Client Python object STAC API client; issues search queries by bbox, datetime, collection, and property filters
stackstac.stack() Python function Converts a list of STAC Items into a Dask-backed xarray.DataArray by auto-resolving CRS, resolution, and band alignment
odc.stac.load() Python function Alternative STAC-to-array loader; preserves native projections and provides tighter grouping control

Production Patterns

The four patterns below underpin every workflow in this section. Each one is a prerequisite for the more specific workflows linked from it.

Pattern 1 — Metadata-First Inspection

Before reading a single pixel, inspect the dataset’s spatial context. This pattern prevents the silent data corruption that arises when scaling factors, nodata values, or CRS strings are assumed rather than read.

import rasterio
from rasterio.crs import CRS

with rasterio.open("s3://sentinel-cogs/sentinel-s2-l2a-cogs/32/T/NM/2025/6/S2B_32TNM_20250615_0_L2A/B04.tif") as src:
    print("CRS:        ", src.crs)
    print("Transform:  ", src.transform)
    print("Shape:      ", src.height, src.width)
    print("Dtype:      ", src.dtypes)
    print("Nodata:     ", src.nodata)
    print("Scales:     ", src.scales)
    print("Offsets:    ", src.offsets)
    print("Tags:       ", src.tags())

The full extraction workflow — including GDAL subdataset handling and multi-band metadata normalisation — is covered in Extracting and Parsing Raster Metadata.

Pattern 2 — COG-Aware Windowed Read

Reading a spatial subset from a COG without downloading the full file requires constructing a rasterio.windows.Window from a bounding box in the dataset’s CRS. The library translates this into one or more HTTP Range requests targeting the appropriate internal tiles.

import rasterio
from rasterio.windows import from_bounds
from rasterio.crs import CRS
import rasterio.warp

cog_url = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/32/T/NM/2025/6/S2B_32TNM_20250615_0_L2A/B04.tif"

# AOI in WGS84 — must be reprojected to dataset CRS before windowing
aoi_wgs84 = (-105.5, 39.5, -104.5, 40.5)  # (left, bottom, right, top)

with rasterio.open(cog_url) as src:
    # Reproject AOI bounds to dataset CRS
    aoi_native = rasterio.warp.transform_bounds(
        CRS.from_epsg(4326), src.crs, *aoi_wgs84
    )
    window = from_bounds(*aoi_native, transform=src.transform)
    data = src.read(window=window)          # shape: (bands, rows, cols)
    window_transform = src.window_transform(window)

print("Read shape:", data.shape)
print("Window transform:", window_transform)

The structure that makes this possible — internal tiling, overview placement, and IFD layout — is explained in detail at Understanding Cloud-Optimized GeoTIFF Structure.

Pattern 3 — STAC Search to Lazy Stack

This is the entry point for any pipeline that processes multiple scenes or time steps. The STAC query defines what data to load; stackstac translates the results into a Dask-backed array without transferring pixel data until .compute() is called.

import stackstac
from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.5, 39.5, -104.5, 40.5],
    datetime="2025-06-01/2025-06-30",
    query={"eo:cloud_cover": {"lt": 10}},
)
items = list(search.item_collection())
print(f"Found {len(items)} items")

stack = stackstac.stack(
    items,
    assets=["B04", "B08", "SCL"],
    rescale=False,   # keep raw int16; apply scaling factors explicitly
    epsg=32613,
    resolution=10,
)
# stack is a Dask-backed DataArray: (time, band, y, x) — nothing downloaded yet
print(stack)

The full catalog search workflow, including pagination and property-filter syntax, lives at Querying STAC Catalogs Programmatically.

Pattern 4 — Lazy Index Calculation with Quality Masking

Once a lazy stack exists, radiometric operations and quality masking compose cleanly before the single .compute() call that triggers actual I/O.

import numpy as np

# Apply Sentinel-2 Collection 1 scaling: reflectance = DN * 0.0001 + (−0.1)
nir = stack.sel(band="B08").astype("float32") * 0.0001 - 0.1
red = stack.sel(band="B04").astype("float32") * 0.0001 - 0.1

# SCL classes 4 (vegetation), 5 (bare soil), 6 (water) = valid pixels
scl = stack.sel(band="SCL")
valid = scl.isin([4, 5, 6])

ndvi = (nir - red) / (nir + red)
ndvi_masked = ndvi.where(valid)

# Temporal mean — still lazy
monthly_mean = ndvi_masked.mean(dim="time")

# Compute only at write time
result = monthly_mean.compute()
print("Result shape:", result.shape, "| dtype:", result.dtype)

Composing spectral formulas, temporal aggregations, and multi-index pipelines at scale is covered in Band Math Operations with Xarray.


Common Pitfalls and Failure Modes

These are the six failure modes that most frequently corrupt production raster pipelines. Each one is silent — the code runs without error but produces wrong outputs.

1. Misaligned Geotransform After Reprojection

Symptom: Pixel values from two reprojected scenes are offset by 1–5 pixels when overlaid. Cause: rasterio.warp.reproject() defaults to aligning the output grid to the destination CRS origin, not to a shared reference grid. Two scenes reprojected independently rarely land on the same pixel boundaries. Fix: Always specify a common destination array or pass explicit transform, width, and height derived from a reference dataset. The patterns in Mastering CRS Transformations in Rasterio show how to pin outputs to a shared grid.

2. Missing COG Overviews

Symptom: Reading a COG at low zoom (small window relative to full scene) is dramatically slower than expected, or downloads the entire file. Cause: The file was written as a plain GeoTIFF or COG without embedded overviews. GDAL falls back to reading and downsampling the full-resolution data. Fix: Regenerate with gdal_translate -of COG -co OVERVIEWS=IGNORE_EXISTING or validate with rio cogeo validate. The COG Specification and the full internals walkthrough at Understanding Cloud-Optimized GeoTIFF Structure cover the correct overview levels.

3. STAC Pagination Truncation

Symptom: Search returns fewer items than expected; a temporal stack has gaps. Cause: STAC APIs cap results per page (often 100 or 250). Calling .items() without handling pagination silently drops items beyond the first page. Fix: Use search.item_collection() which pystac-client automatically paginates, or iterate with search.pages() when memory is constrained. Validate item counts against the context object returned by supporting APIs.

4. Scaling Factor Not Applied

Symptom: NDVI values outside [-1, 1]; reflectance values in the thousands. Cause: Sentinel-2 L2A (Collection 1 onwards) stores reflectance as int16 with scale 0.0001 and offset -0.1. Reading raw int16 and computing indices without applying these factors yields mathematically invalid results. Fix: Read src.scales and src.offsets from the dataset metadata, or apply the product-specification factors explicitly before arithmetic. See the radiometric handling steps in Handling Pixel Resolution and Scaling.

5. Silent Nodata Inclusion in Aggregations

Symptom: Statistical aggregations (mean, percentile, max) are contaminated; the output raster contains stripe artefacts at scene boundaries. Cause: A dataset’s nodata value (0, -9999, 32767) is read into a plain NumPy array where it is treated as a valid numeric value. Fix: Always mask nodata before aggregation: xarray.where(data != src.nodata) or read via rasterio.read(masked=True) which returns a numpy.ma.MaskedArray. The batch extraction workflow at Automating Metadata Extraction for Batch Raster Jobs shows how to automate this check across many files.

6. Chunk Boundary Resampling Artefacts

Symptom: Faint grid lines appear in the output raster at Dask chunk boundaries; pixel values at chunk edges differ from interior pixels. Cause: Resampling or convolution operations (e.g., bilinear interpolation, Gaussian smoothing) require context pixels beyond each chunk’s boundary. Without explicit overlap, Dask processes each chunk in isolation and the boundary pixels are computed from edge-padded data. Fix: Add overlap with da.map_overlap(..., depth=kernel_radius, boundary='reflect'). For COG reads, align chunk sizes to the COG’s internal tile size (commonly 256 or 512) to eliminate partial tile fetches.


Performance and Scale Considerations

Memory Layout and Chunking Strategy

The single most impactful choice in a cloud raster pipeline is chunk size. Chunks that are too small create Dask task graph overhead that dwarfs computation time; chunks that are too large trigger out-of-memory errors on worker nodes.

A reliable starting point: match chunks to the COG’s internal tile size in the spatial dimensions, and set the time dimension to a small integer (4–16) based on the number of bands and the worker’s RAM.

# Rechunk to align with COG 512×512 internal tiles
# Assumes workers have ~8 GB RAM; adjust time chunk accordingly
stack_chunked = stack.chunk({"time": 8, "band": -1, "y": 512, "x": 512})

For continental-scale mosaics, use rasterio.vrt.WarpedVRT to virtualise the reprojection without materialising intermediate files. When reading many small windows from COGs on S3, set GDAL_HTTP_MAX_RETRY=5 and GDAL_HTTP_RETRY_DELAY=0.5 as environment variables to handle transient 503 responses without explicit retry logic in Python.

Parallelism Model

stackstac and odc-stac both produce Dask task graphs. The parallelism model depends on the scheduler:

  • Local machine: dask.compute() with the threaded scheduler is safe for I/O-bound work (most raster pipelines). Use nthreads equal to the number of vCPUs.
  • Dask distributed cluster: Each worker must have GDAL and the necessary cloud credentials. Pass GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR to suppress redundant directory scans on object storage.
  • Coiled or AWS Batch: Prefer pre-built Docker images that include rasterio[s3] and proj-data to avoid slow PROJ datum download on worker startup.

I/O Cost Reduction

  • Request fewer assets. Only include the STAC asset keys you need (e.g., ["B04", "B08", "SCL"] rather than all 13 Sentinel-2 bands).
  • Use overviews for exploration. Read from src.overview_level for quick visual checks before committing to full-resolution computation.
  • Prefer ZSTD compression for COGs you generate: it achieves near-DEFLATE compression ratios at 3–5× faster decode speed, reducing CPU bottlenecks on high-throughput workers.

FAQ

Q: Do I need to download a COG before reading it with rasterio? No. When rasterio opens a URL prefixed with /vsicurl/ or an s3:// path (with rasterio[s3] installed), GDAL issues HTTP Range requests automatically. You only download the tiles and overview levels required for your read window.

Q: What is the difference between stackstac and odc-stac? stackstac reprojects all items to a common target CRS and resolution automatically, which simplifies multi-scene stacking at the cost of an implicit warp step. odc-stac (odc.stac.load()) preserves native projections by default and provides finer grouping control, making it preferable when you want to avoid an intermediate resampling step or when items share a consistent native CRS.

Q: Why does pystac-client return fewer items than the STAC API web interface shows? The API enforces a per-page limit. search.items() only retrieves the first page. Use search.item_collection() to transparently paginate, or check search.matched() before iterating to detect the discrepancy.

Q: How do I apply Sentinel-2 Collection 1 scaling factors correctly? Multiply the raw int16 value by 0.0001 and add -0.1 to get surface reflectance. Cast to float32 first to avoid integer overflow: reflectance = band.astype("float32") * 0.0001 - 0.1. The full parameter table including per-band offsets for other missions is in Handling Pixel Resolution and Scaling.


Topics