Satellite Processing Workflows & Index Pipelines

Q: What Python libraries are essential for satellite processing pipelines?

rasterio handles low-level raster I/O and CRS operations; xarray provides labelled N-dimensional arrays for multi-band, multi-date cubes; dask enables out-of-core parallel computation; rioxarray bridges rasterio's spatial metadata into xarray; and pystac-client queries STAC catalogues to discover scenes programmatically.

Q: How do I handle cloud contamination in time-series analysis?

Integrate QA band bit-flag parsing or an ML cloud detector (s2cloudless, Fmask) to produce a binary mask array, set contaminated pixels to NaN, then propagate the mask through all downstream transformations so statistical aggregations exclude bad observations automatically.

Modern Earth observation programmes generate petabytes of multispectral, hyperspectral, and SAR data every day. Turning raw satellite imagery into analysis-ready products that support precision agriculture, deforestation detection, and urban heat island quantification demands a reproducible, scalable engineering architecture — not one-off scripts. This section covers every stage of that architecture, from STAC-based scene discovery through spectral index derivation to temporal compositing, implemented in Python using rasterio, xarray, and dask. For the raster data model and coordinate reference fundamentals that underpin every stage described here, see Core Raster Fundamentals & STAC Mapping.

Pipeline Architecture: the Four-Tier DAG Model

A production satellite processing pipeline is a directed acyclic graph (DAG) in which each node represents a deterministic, idempotent transformation. Organising the pipeline as a DAG — rather than a sequential script — makes individual stages independently testable, restartable on failure, and parallelisable across workers.

The standard four-tier architecture is:

Ingestion & cataloguing — Automated scene discovery via SpatioTemporal Asset Catalog (STAC) queries, secure retrieval of Level-1/Level-2 products, and local or object-store staging.
Preprocessing & harmonisation — Geometric correction, radiometric calibration, spatial subsetting, and atmospheric quality filtering.
Analytical processing — Spectral index computation, change detection, classification, and feature extraction.
Output & archival — Export to analysis-ready formats (Cloud-Optimized GeoTIFF, Zarr, NetCDF), metadata enrichment, and version-controlled storage.

The diagram below illustrates how data flows through these tiers and where each processing stage fits.

Key Components

The table below lists the primary classes, objects, and formats that appear throughout this pipeline, with one-line descriptions of their role.

Component	Library	Role in the pipeline
`rasterio.DatasetReader`	rasterio	Opens GeoTIFF / COG scenes; exposes bands, transform, CRS, and windowed reads
`rasterio.mask.mask()`	rasterio	Clips raster arrays to vector geometries, preserving geotransform
`rasterio.merge.merge()`	rasterio	Stitches adjacent or overlapping scenes into a single mosaic
`xarray.Dataset`	xarray	Labelled N-dimensional container for multi-band, multi-date raster cubes
`xarray.DataArray`	xarray	Single-variable array with named coordinates (x, y, time, band)
`rioxarray` extension	rioxarray	Attaches CRS and spatial methods (`.rio.reproject()`, `.rio.clip()`) to xarray objects
`dask.array.Array`	dask	Lazy, chunked array that defers computation until `.compute()` or `.to_zarr()` is called
`pystac_client.Client`	pystac-client	Queries STAC endpoints with bbox/time/property filters; returns `ItemCollection`
`stackstac.stack()`	stackstac	Converts a STAC `ItemCollection` into a labelled `xarray.DataArray` with dask backing
Cloud-Optimized GeoTIFF	rasterio / GDAL	Internally tiled and overviewed GeoTIFF enabling HTTP range-request reads
Zarr store	zarr	Chunked, compressed N-D array format optimised for object-store I/O with xarray/dask

Production Patterns

The four patterns below underpin every workflow in this section. Each links to the dedicated page that elaborates it in full.

Pattern 1 — Lazy scene stack from STAC

Rather than downloading and reading scenes eagerly, build a lazy xarray.DataArray backed by dask chunks. This defers all I/O until an explicit .compute() call and allows the entire pipeline to be expressed as a graph.

import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-3.2, 51.4, -2.9, 51.6],      # Cardiff, Wales
    datetime="2023-06-01/2023-09-30",
    query={"eo:cloud_cover": {"lt": 20}},
).item_collection()

# Stack into a labelled 4-D array: (time, band, y, x)
stack = stackstac.stack(
    items,
    assets=["B04", "B08", "B11"],        # Red, NIR, SWIR
    resolution=10,
    chunksize={"x": 2048, "y": 2048},
)

Querying STAC Catalogs Programmatically covers filter syntax, authentication, and pagination in detail.

Pattern 2 — Vectorised index computation with NaN masking

Spectral index formulas should be expressed as xarray arithmetic so that cloud-masked NaN values propagate automatically and computation remains lazy.

import numpy as np

red  = stack.sel(band="B04").astype("float32") / 10_000
nir  = stack.sel(band="B08").astype("float32") / 10_000

# Avoid division-by-zero with a small epsilon
ndvi = (nir - red) / (nir + red + 1e-10)
ndvi = ndvi.clip(-1.0, 1.0)             # guard against floating-point overshoot
ndvi.name = "NDVI"

Spectral Index Calculation Pipelines details declarative index registries and multi-index batch computation.

Pattern 3 — Cloud-masked temporal median composite

Compositing multiple acquisitions into a single clean image relies on good cloud masks. Pixels flagged as cloud or shadow must be set to NaN before the reduction so they are excluded from the median.

# Assume `cloud_mask` is a boolean DataArray aligned to `ndvi` (True = cloudy)
ndvi_clean = ndvi.where(~cloud_mask)    # NaN where cloudy

# Median composite over time — ignores NaN automatically
ndvi_median = ndvi_clean.median(dim="time")

Cloud and Shadow Masking Strategies explains QA-band bit-flag parsing and ML-based detectors such as s2cloudless.

Pattern 4 — Chunked export to Cloud-Optimized GeoTIFF

Materialising a dask-backed result to disk should use tiled writes so the output is COG-compliant from the start, avoiding a separate gdal_translate step.

import rasterio
from rasterio.transform import from_bounds

profile = {
    "driver": "GTiff",
    "dtype": "float32",
    "width": ndvi_median.sizes["x"],
    "height": ndvi_median.sizes["y"],
    "count": 1,
    "crs": ndvi_median.rio.crs,
    "transform": ndvi_median.rio.transform(),
    "compress": "zstd",
    "tiled": True,
    "blockxsize": 512,
    "blockysize": 512,
    "interleave": "band",
}

with rasterio.open("ndvi_median_2023.tif", "w", **profile) as dst:
    dst.write(ndvi_median.compute().values, 1)

Understanding Cloud-Optimized GeoTIFF Structure explains internal tiling, overview levels, and how HTTP range requests exploit COG byte order.

Processing Stages in Depth

Spatial Subsetting and Region-of-Interest Extraction

Raw satellite scenes routinely cover 100 × 100 km footprints. Clipping to an area of interest before any computation reduces I/O by 90 % or more on regional analyses. Automated Image Clipping and Cropping covers rasterio.mask for vector-masked extraction, bounding-box windowed reads, and the CRS alignment steps that prevent sub-pixel misregistration when the vector boundary uses a different projection than the raster. The key implementation requirements are: reproject the vector to the raster CRS before masking, preserve the geotransform in the output profile, and validate that output array dimensions match the requested footprint using rasterio.transform.array_bounds.

Radiometric Calibration and Atmospheric Correction

Level-1 data stores digital numbers that must be scaled to top-of-atmosphere reflectance using coefficients in the scene metadata. For multi-temporal consistency, either standardise on pre-corrected Level-2 products or apply atmospheric correction via Sen2Cor, LaSRC, or a look-up-table approach such as 6S. Production pipelines should load and cache calibration coefficients once at startup, then apply them as vectorised numpy or xarray broadcasting operations — never recalculate per-pixel in a Python loop. The affine geotransform that governs pixel resolution and scaling must be preserved unmodified through this stage.

Cloud and Shadow Masking

Unmasked clouds and their shadows introduce severe positive and negative biases into spectral indices and temporal composites. Cloud and Shadow Masking Strategies details how to parse sensor-specific QA-band bit-packing formats (Sentinel-2 SCL, Landsat C2 QA_PIXEL), apply morphological dilation to catch cloud shadow edges, and integrate probability-based ML detectors. The critical engineering requirement is that masked pixels are represented as NaN — not zero or a fill value — so that downstream reductions (mean, median, percentile) naturally exclude them without conditional logic.

Seamless Mosaicking and Edge Blending

Continental-scale analysis requires stitching dozens or hundreds of individual scenes. Seamless Mosaicking and Edge Blending covers the rasterio.merge priority-stack approach, feathering via alpha-weighted blending in overlap regions, and histogram matching to remove radiometric discontinuities between scenes acquired on different dates. When inputs come from multiple sensors with different spectral response functions, band-specific scaling factors must be applied before blending to prevent artificial gradients at sensor boundaries.

Advanced Resampling and Upscaling

Aligning datasets for multi-source fusion demands careful interpolation choices. Advanced Resampling and Upscaling Techniques classifies the correct kernel by data type: nearest for categorical land-cover labels, bilinear or cubic for continuous reflectance, and average (area-weighted) for downscaling to a coarser grid. Pipelines should expose the resampling algorithm as a configurable parameter rather than a hardcoded constant, validate output pixel alignment against a reference grid using rasterio.transform.array_bounds, and cache resampled tiles to avoid repeated computation during iterative analysis.

Spectral Index Derivation

Spectral Index Calculation Pipelines describes how to move beyond hardcoded NDVI scripts toward declarative YAML-driven index registries that map formula definitions to sensor band names. At runtime an execution engine compiles these definitions into optimised xarray expressions. The three non-negotiable implementation requirements are: ensure all input arrays share identical CRS, spatial resolution, and chunk structure before combining; apply an epsilon offset or where condition to prevent division-by-zero NaN propagation; and validate index outputs against known reference scenes using xarray.testing.assert_allclose.

Temporal Aggregation and Time-Series Analysis

Single-date indices capture a snapshot; environmental monitoring requires tracking phenological cycles, disturbance events, and multi-year trends. Temporal Aggregation and Time-Series Analysis covers xarray.resample() for monthly or seasonal composites, xarray.rolling() for smoothing splines over irregular observation sequences, and harmonic regression to model annual cycles in vegetation indices. When band math operations with xarray feed into temporal stacks, aligning all acquisitions to a uniform temporal grid before reduction is essential for preventing aliasing artefacts in the aggregated signal.

Common Pitfalls and Failure Modes

Misaligned geotransforms between source scenes

Symptom: Band arithmetic or mosaic operations produce spatially shifted or striped outputs.
Cause: Two scenes share the same nominal CRS but have geotransforms snapped to different pixel grids — common when mixing Sentinel-2 tiles at 10 m and 20 m resolutions, or scenes re-projected by different tools.
Fix: Before any multi-source operation, reproject all inputs to a canonical reference grid using rioxarray.reproject_match(), passing the reference dataset as the target. Verify alignment with assert ds_a.rio.transform() == ds_b.rio.transform().

NaN contamination from unmasked fill values

Symptom: Spectral index images contain large black regions over ocean or outside the scene footprint, and temporal medians are unexpectedly low.
Cause: Rasterio or stackstac reads nodata values as 0 or a sentinel integer rather than NaN because the nodata field was not propagated to the float array.
Fix: After reading, apply ds.where(ds != nodata_value) to convert fill integers to NaN, or pass fill_value=np.nan to stackstac.stack().

Missing COG overviews causing slow spatial subsetting

Symptom: Reading a small spatial window from a “COG” takes as long as reading the entire file.
Cause: The file was written as a standard tiled GeoTIFF without internal overviews; rasterio cannot exploit HTTP range requests without the overview structure. Understanding Cloud-Optimized GeoTIFF Structure details why overview placement matters.
Fix: Rebuild the file with rio cogeo create or pass overview_level="auto" when writing with rasterio.

Out-of-memory spikes when computing large dask graphs

Symptom: Workers exhaust RAM mid-computation and either crash or swap catastrophically.
Cause: Chunk sizes were set without accounting for the full task graph width — many intermediate arrays may be alive simultaneously.
Fix: Reduce chunk spatial dimensions (try 1024 × 1024 instead of 4096 × 4096), insert explicit ds.persist() calls at natural pipeline boundaries to materialise and garbage-collect intermediate results, and monitor the dask dashboard to identify memory hotspots.

STAC pagination silently truncating scene lists

Symptom: A multi-month query returns only 100 items regardless of actual archive size.
Cause: The STAC endpoint applies a default limit and the client is not iterating pages.
Fix: Use pystac_client with .pages() to iterate all result pages, or explicitly pass max_items=None to catalog.search(). Validate the item count against the server’s numberMatched field in the first response.

CRS mismatch between raster and vector inputs

Symptom: rasterio.mask.mask() returns an all-nodata array, or raises a PROJ error.
Cause: The vector geometry is in WGS-84 (EPSG:4326) while the raster is in a projected UTM CRS, and no reprojection step was applied.
Fix: Reproject the vector to the raster CRS using geopandas.GeoDataFrame.to_crs(src.crs) before passing shapes to rasterio.mask.mask(). Mastering CRS Transformations in rasterio covers PROJ environment requirements and common EPSG lookup failures.

Performance and Scale Considerations

Memory layout and chunking strategy

xarray + dask processes raster cubes as a grid of chunks. Chunk selection directly controls the trade-off between task-graph overhead and peak memory. For remote I/O against object-store COGs, 256–512 MB per chunk (after decompression) is typically optimal — smaller chunks cause scheduler overhead to dominate; larger chunks cause memory spikes when multiple tasks run in parallel on the same worker.

Chunk along spatial axes first. For a (time, band, y, x) cube reading Sentinel-2 at 10 m, {"time": 1, "band": -1, "y": 2048, "x": 2048} keeps all bands per spatial chunk so a single range request retrieves a complete 2048 × 2048 multi-band tile. Chunking time to 1 means temporal reductions (median) schedule naturally without loading the entire archive.

Lazy evaluation and `.compute()` placement

The entire pipeline — STAC query, masking, index computation, resampling — should remain lazy until a single terminal .compute() or .to_zarr() call. Avoid intermediate .compute() calls in the middle of the graph; each one materialises the array, breaks dask’s ability to fuse operations, and prevents garbage collection of upstream chunks. Insert .persist() only at genuine checkpoints (e.g., after cloud masking) where the cost of recomputation on failure exceeds the memory cost of persistence.

Parallelism model

For local development, dask.distributed.LocalCluster with 4–8 threads per worker handles most single-scene workflows. For continental or multi-decadal archives, use dask.distributed against a cloud cluster (Coiled on AWS, Vertex AI Workbench, or a Kubernetes deployment). The key tuning knobs are:

n_workers — match to the number of physical cores; dask is I/O-bound on COG reads, so more workers help up to the point where network bandwidth saturates.
threads_per_worker — keep at 1 or 2 for CPU-intensive operations (numpy, scipy) to avoid GIL contention; raise to 4–8 for pure I/O tasks.
memory_limit per worker — set to 80 % of available RAM per core to leave headroom for OS and inter-process buffers.

Cloud-native storage formats

Traditional GeoTIFFs without internal tiling suffer from slow random access. Production pipelines should export to COG for single-file spatial products or Zarr for multi-temporal analysis cubes stored in object storage. COGs enable sub-second spatial subsetting via HTTP range requests and preserve overview pyramids for rapid visualisation. Zarr stores chunk metadata in a lightweight JSON manifest and compress each chunk independently, enabling parallel reads and writes without file locking — critical for distributed dask workers writing results concurrently.

Frequently Asked Questions

What Python libraries are essential for satellite processing pipelines?

rasterio handles low-level raster I/O and CRS operations; xarray provides labelled N-dimensional arrays for multi-band, multi-date cubes; dask enables out-of-core parallel computation; rioxarray bridges rasterio’s spatial metadata into xarray; and pystac-client queries STAC catalogues to discover scenes programmatically.

How do I handle cloud contamination in time-series analysis?

Integrate QA-band bit-flag parsing or an ML cloud detector (s2cloudless, Fmask) to produce a boolean mask array, set contaminated pixels to NaN with .where(~cloud_mask), then propagate the mask through all downstream transformations so statistical aggregations exclude bad observations automatically.

What chunk size should I use with dask for raster processing?

Target 256–512 MB per chunk for remote I/O workloads. For local SSD reads you can go up to 1 GB. Chunk along spatial axes first (x, y) and keep the full band dimension per chunk to avoid repeated reads of the same tile offset.

How do I write a COG without a separate gdal_translate step?

Write directly from rasterio with "tiled": True, "blockxsize": 512, "blockysize": 512, and "compress": "zstd" in the profile, then add overviews with dst.build_overviews([2, 4, 8, 16, 32], resampling=rasterio.enums.Resampling.average) and call dst.update_tags(ns="rio_overview", resampling="average"). Alternatively, use rio cogeo create from the rio-cogeo CLI.

Should I use stackstac or odc-stac to build xarray cubes from STAC?

Both convert a STAC ItemCollection to a dask-backed xarray.DataArray. stackstac has simpler defaults and handles mixed-resolution assets well; odc-stac offers finer control over projection, oversampling, and grouping. For most Sentinel-2 workflows, stackstac is the faster starting point.

Core Raster Fundamentals & STAC Mapping — the raster data model, COG byte layout, and STAC hierarchy that this pipeline is built on top of.
Automated Image Clipping and Cropping — vector-masked and bounding-box subsetting with rasterio.mask, including CRS alignment and edge-artefact prevention.
Cloud and Shadow Masking Strategies — QA-band bit-flag parsing, s2cloudless integration, and morphological mask refinement.
Seamless Mosaicking and Edge Blending — feathering, histogram matching, and priority-stack compositing for multi-scene mosaics.
Spectral Index Calculation Pipelines — declarative YAML index registries, NaN-safe band math, and validation against reference scenes.
Temporal Aggregation and Time-Series Analysis — monthly and seasonal compositing, rolling-window smoothing, and harmonic regression over multi-year archives.
Band Math Operations with xarray — vectorised arithmetic on multi-band DataArrays, including the calculating-ndvi-directly-from-xarray-dataarrays worked example.