Band Math Operations with Xarray
Remote sensing analysis routinely requires mathematical transformations across spectral bands to derive biophysical indices, surface temperatures, or atmospheric corrections. Unlike traditional desktop GIS band calculators, xarray integrates directly with Dask for out-of-core computation, preserves spatial metadata throughout the processing chain, and enables reproducible, code-driven pipelines. This page covers the complete workflow — from aligned ingestion through optimized export — within the broader Core Raster Fundamentals & STAC Mapping context.
The diagram below shows where band math sits in a typical satellite analysis pipeline:
Prerequisites
Install the required libraries into an isolated virtual environment before running any code on this page:
pip install xarray rioxarray rasterio "dask[complete]" numpy
| Library | Min version | Why required |
|---|---|---|
xarray |
2023.6 | Labeled multi-dimensional array operations and .where() expressions |
rioxarray |
0.15 | Spatial I/O, CRS management, and .rio.to_raster() export |
rasterio |
1.3 | Backend COG reads; affine transform and nodata handling |
dask[complete] |
2023.10 | Lazy evaluation and parallel graph execution for large scenes |
numpy |
1.25 | Universal functions (ufuncs) applied across DataArrays |
Conceptual prerequisites: familiarity with raster grids, affine transforms, and CRS definitions is assumed. If you need a primer, see Mastering CRS Transformations in Rasterio for projection concepts, and Handling Pixel Resolution and Scaling for the relationship between pixel size and affine parameters.
Step-by-Step Workflow
1. Ingest and Align Multi-Band Rasters
Band math requires strict spatial and dimensional alignment. Mismatched resolutions, extents, or CRS definitions produce either silent pixel misalignment or hard broadcast failures. Always open files with chunking enabled so Dask can evaluate lazily.
import xarray as xr
import rioxarray # noqa: F401 — registers .rio accessor on DataArray
import numpy as np
def load_aligned_rasters(
red_path: str,
nir_path: str,
chunk_size: int = 512,
) -> tuple[xr.DataArray, xr.DataArray]:
"""
Open two spectral band files, apply chunking, and return spatially
aligned DataArrays clipped to their shared extent.
Args:
red_path: Path or VSICURL URI to the Red band (COG recommended).
nir_path: Path or VSICURL URI to the NIR band.
chunk_size: Chunk edge length in pixels; align with COG tile size.
Returns:
Tuple of (red, nir) DataArrays on the same spatial grid.
"""
open_kwargs = {
"chunks": {"x": chunk_size, "y": chunk_size},
"mask_and_scale": True, # applies scale_factor / add_offset attrs
"lock": False, # allow concurrent Dask reads
}
red = rioxarray.open_rasterio(red_path, **open_kwargs)
nir = rioxarray.open_rasterio(nir_path, **open_kwargs)
# Drop the redundant band dimension from single-band COG files
red = red.squeeze("band", drop=True)
nir = nir.squeeze("band", drop=True)
# Reproject NIR to Red's CRS/grid if sensors differ
if red.rio.crs != nir.rio.crs:
nir = nir.rio.reproject_match(red)
# Restrict to overlapping pixels; "inner" join drops boundary padding
red_aligned, nir_aligned = xr.align(red, nir, join="inner")
return red_aligned, nir_aligned
The join="inner" strategy in xr.align prevents NaN propagation at tile edges — critical when compositing scenes from multi-date STAC catalog queries. For multi-sensor workflows where the Red and NIR bands originate from different satellites, call nir.rio.reproject_match(red) before alignment to harmonize both grids in one resampling pass.
2. Execute Core Band Math Expressions
Once aligned, mathematical operations use NumPy universal functions applied element-wise across DataArrays. Xarray automatically broadcasts matching dimensions, preserves coordinate labels, and propagates NaN masks from mask_and_scale. The key discipline here is guarding zero-division before it silently corrupts output pixels.
def compute_ndvi(red: xr.DataArray, nir: xr.DataArray) -> xr.DataArray:
"""
Calculate NDVI = (NIR - Red) / (NIR + Red) with explicit zero-division
protection and CF-compliant metadata attachment.
Returns a lazy DataArray; call .compute() or .rio.to_raster() to execute.
"""
# Cast to float32 early to prevent silent float64 promotion later
red = red.astype(np.float32)
nir = nir.astype(np.float32)
denominator = nir + red
# xr.where preserves the Dask graph — np.where would collapse lazy chunks
ndvi = xr.where(
denominator != 0,
(nir - red) / denominator,
np.nan,
)
ndvi.attrs.update({
"long_name": "Normalized Difference Vegetation Index",
"units": "dimensionless",
"valid_range": [-1.0, 1.0],
"formula": "(NIR - Red) / (NIR + Red)",
})
# Carry the spatial metadata forward explicitly
ndvi = ndvi.rio.write_crs(red.rio.crs)
ndvi = ndvi.rio.write_transform(red.rio.transform())
ndvi = ndvi.rio.write_nodata(np.nan, encoded=True)
return ndvi
The same pattern extends to any normalized difference index. For NDWI (water), substitute the Green band for Red; for NDSI (snow), use SWIR in place of NIR. Detailed masking patterns and atmospheric correction integration are covered in Calculating NDVI Directly from xarray DataArrays.
If you are running a production spectral index calculation pipeline across many scenes, wrap the above in a parameterized function that accepts band names and formula strings, then dispatch via Dask’s delayed interface.
3. Optimize with Dask and Chunking Strategies
Xarray’s Dask integration queues mathematical expressions as a task graph rather than executing them immediately. This means you can chain dozens of band math operations before paying any computation cost. The tradeoff is that chunk size selection directly governs both memory headroom and task-scheduling overhead.
def rechunk_for_export(ndvi: xr.DataArray, tile_size: int = 512) -> xr.DataArray:
"""
Rechunk to COG tile boundaries before triggering any computation.
COG internal tiles are typically 256 or 512 pixels square.
"""
return ndvi.chunk({"x": tile_size, "y": tile_size})
def compute_and_write(
ndvi: xr.DataArray,
output_path: str,
compress: str = "LZW",
) -> None:
"""
Trigger Dask graph execution and write a tiled, compressed COG.
The .rio.to_raster() call computes and streams chunk-by-chunk;
it does NOT load the full array into RAM first.
"""
ndvi_tiled = rechunk_for_export(ndvi)
ndvi_tiled.rio.to_raster(
output_path,
driver="GTiff",
tiled=True,
blockxsize=512,
blockysize=512,
compress=compress,
dtype="float32",
lock=True, # single-writer lock prevents GeoTIFF corruption
)
Chunk size guidelines:
| Scene size | Recommended chunk | Rationale |
|---|---|---|
| < 5 000 × 5 000 px | 512 × 512 | Matches common COG internal tile; low overhead |
| 5 000–20 000 px | 1 024 × 1 024 | Fewer tasks; keeps Dask scheduler lean |
| > 20 000 px (Sentinel-2 full) | 2 048 × 2 048 | Avoids excessive task count; RAM allows it |
Oversized chunks exhaust RAM during intermediate steps; undersized chunks create scheduling overhead that can dominate compute time. The Cloud-Optimized GeoTIFF structure page explains how internal tile sizes map to efficient HTTP range reads for remote data.
4. Export and Preserve Spatial Metadata
Geospatial outputs must retain CRS, affine transform, and nodata definitions — without them, downstream tools (QGIS, rasterio, stacking pipelines) silently misinterpret the data. rioxarray propagates these attributes automatically through .rio.to_raster(), but explicit validation before export prevents metadata loss during complex pipeline handoffs.
def verify_spatial_metadata(arr: xr.DataArray) -> None:
"""
Assert critical geospatial attributes exist and are consistent before
writing. Raises AssertionError with a descriptive message on failure.
"""
assert arr.rio.crs is not None, (
"Missing CRS — call arr.rio.write_crs(...) before export"
)
assert arr.rio.transform() is not None, (
"Missing affine transform — check that spatial coords were preserved"
)
has_nodata = arr.rio.nodata is not None or "nodata" in arr.attrs
assert has_nodata, (
"Missing nodata — call arr.rio.write_nodata(np.nan, encoded=True)"
)
assert arr.dtype in (np.float32, np.float64), (
f"Unexpected dtype {arr.dtype!r} — cast to float32 before writing NDVI"
)
When producing multiple index outputs from a single scene (NDVI, NDWI, EVI), bundle them into an xr.Dataset and call .rio.to_raster() once. This eliminates redundant reads of the underlying COG bands and guarantees pixel-perfect alignment across all outputs.
def write_index_stack(
red: xr.DataArray,
nir: xr.DataArray,
green: xr.DataArray,
output_path: str,
) -> None:
"""Write NDVI and NDWI as a two-band GeoTIFF in a single pass."""
ndvi = compute_ndvi(red, nir)
green = green.astype(np.float32)
nir32 = nir.astype(np.float32)
denom_ndwi = green + nir32
ndwi = xr.where(denom_ndwi != 0, (green - nir32) / denom_ndwi, np.nan)
ndwi = ndwi.rio.write_crs(red.rio.crs)
ds = xr.Dataset({"NDVI": ndvi, "NDWI": ndwi})
ds.rio.to_raster(output_path, driver="GTiff", compress="LZW", dtype="float32")
Parameter Reference
| Parameter | API location | Type | Default | Usage note |
|---|---|---|---|---|
chunks |
rioxarray.open_rasterio |
dict | None (eager) |
Set {"x": N, "y": N} to enable Dask lazy loading; align N with COG tile size |
mask_and_scale |
rioxarray.open_rasterio |
bool | False |
Set True to auto-apply scale_factor/add_offset and mask _FillValue pixels |
lock |
rioxarray.open_rasterio |
bool/Lock | thread lock | Set False for parallel multi-thread reads; set True for multi-process GeoTIFF writes |
join |
xr.align |
str | "inner" |
"inner" clips to shared extent; "outer" pads with NaN; "exact" raises on mismatch |
skipna |
.max(), .mean(), .reduce() |
bool | True |
Keeps NaN-masked pixels from contaminating temporal composites |
keep_attrs |
most reductions | bool | False |
Pass True to propagate long_name, units, and custom metadata through reductions |
compress |
.rio.to_raster |
str | None |
"LZW" for floating-point; "DEFLATE" for better ratio at higher CPU cost |
dtype |
.rio.to_raster |
str | source dtype | Always pass "float32" explicitly for NDVI/NDWI to avoid writing float64 GeoTIFFs |
Verification and Testing
After running the full pipeline, confirm the output meets geospatial and analytical expectations before passing it downstream.
import rasterio
import numpy as np
def verify_ndvi_output(output_path: str) -> None:
"""Run assertions against a written NDVI GeoTIFF."""
with rasterio.open(output_path) as src:
data = src.read(1, masked=True)
# Spatial metadata checks
assert src.crs is not None, "Output has no CRS"
assert src.transform != rasterio.transform.IDENTITY, "Transform is identity"
assert src.nodata is not None, "Nodata not encoded"
assert src.dtypes[0] == "float32", f"Expected float32, got {src.dtypes[0]}"
# Value range checks for a vegetation scene
valid = data.compressed()
assert valid.min() >= -1.0, f"NDVI below -1: {valid.min():.4f}"
assert valid.max() <= 1.0, f"NDVI above +1: {valid.max():.4f}"
# Sanity: meaningful data should exist
assert len(valid) > 0, "All pixels are masked — check nodata masking"
print(
f"NDVI OK — shape: {src.shape}, "
f"range: [{valid.min():.3f}, {valid.max():.3f}], "
f"valid px: {len(valid):,}"
)
Visual verification: load the output in QGIS using a diverging colour ramp (red → white → green, anchored at −1/0/+1) and compare to a true-colour composite. Vegetation should show strong positive values; water and bare soil should be negative or near zero.
For CI/CD integration, run verify_ndvi_output as a pytest fixture against a small synthetic raster with known pixel values, asserting exact NDVI values to four decimal places.
Troubleshooting
Silent NaN spread across entire output
Cause: mask_and_scale=True masked pixels where the denominator is zero, but xr.where was applied after masking rather than before, letting NaN propagate through the division.
Fix: Apply xr.where(denominator != 0, ...) before any division, and confirm your _FillValue is correctly encoded in the source COG metadata.
ValueError: cannot reindex with a non-monotonic or duplicate index
Cause: xr.align cannot reconcile x/y coordinate arrays that are not strictly monotonic, which happens when COG tiles have been manually subset or concatenated incorrectly.
Fix: Call arr.sortby(["y", "x"]) on each DataArray before aligning, or reproject both inputs from scratch with rio.reproject_match.
Chunk size OOM during .rio.to_raster()
Cause: The rechunked DataArray fits in RAM individually, but Dask materialises multiple adjacent chunks concurrently, multiplying memory use.
Fix: Reduce chunk size to 256 × 256, or set DASK_SCHEDULER=synchronous during debugging to force single-threaded execution and isolate the memory-heavy step.
CRSError: CRS is not set after arithmetic
Cause: Arithmetic operations on xr.DataArray objects drop the .rio spatial attributes because they live in encoding, not attrs.
Fix: Always call result.rio.write_crs(source.rio.crs) and result.rio.write_transform(source.rio.transform()) after any operation that could strip metadata.
dask.array graph grows unbounded across a time loop
Cause: Calling .where() or arithmetic inside a Python for loop accumulates graph nodes without executing; by iteration 50 the graph is gigabytes in size.
Fix: Call .compute() or .persist() at the end of each loop iteration, or restructure using xr.concat followed by a single vectorised operation across the time dimension.
Advanced Patterns
Temporal Maximum Value Composites
Environmental workflows routinely require reducing a time-stacked DataArray to a seasonal composite, an anomaly raster, or a growing-season maximum. Xarray’s .reduce() and .groupby() handle these natively over a time dimension added during STAC-to-array loading.
def seasonal_max_ndvi(time_series: xr.DataArray) -> xr.DataArray:
"""
Compute the maximum NDVI per pixel across a time dimension,
ignoring cloud-masked NaN pixels.
Expects time_series to have dims ("time", "y", "x").
"""
composite = time_series.max(dim="time", skipna=True, keep_attrs=True)
composite.attrs["composite_method"] = "max_value_composite"
return composite
For complex statistical reductions using apply_ufunc with Dask, always set dask="parallelized" and verify your custom function is serializable (no lambda, no local closures referencing mutable state). The output_dtypes argument is required when Dask cannot infer the output type.
Applying Band Math via apply_ufunc
When a transformation cannot be expressed as pure xarray arithmetic — for example, a lookup-table correction or a physics-based reflectance model — use apply_ufunc to wrap a NumPy function while preserving the Dask graph:
def toa_to_boa_correction(toa: xr.DataArray, gain: float, offset: float) -> xr.DataArray:
"""
Apply a linear bottom-of-atmosphere correction: BOA = gain * TOA + offset.
Works lazily with Dask-backed DataArrays.
"""
return xr.apply_ufunc(
lambda arr: gain * arr + offset,
toa,
dask="parallelized",
output_dtypes=[np.float32],
keep_attrs=True,
)
Validation Against Reference Datasets
Automated band math pipelines need statistical validation to catch spectral drift, resampling artifacts, or scaling errors across sensor generations:
def validate_against_reference(
predicted: xr.DataArray,
reference: xr.DataArray,
) -> dict[str, float]:
"""Compute RMSE, bias, and R² over co-located valid pixels."""
p, r = xr.align(predicted, reference, join="inner")
mask = ~(np.isnan(p) | np.isnan(r))
p_valid = p.values[mask.values].astype(np.float64)
r_valid = r.values[mask.values].astype(np.float64)
rmse = float(np.sqrt(np.mean((p_valid - r_valid) ** 2)))
bias = float(np.mean(p_valid - r_valid))
r2 = float(np.corrcoef(p_valid, r_valid)[0, 1] ** 2)
return {"rmse": rmse, "bias": bias, "r2": r2}
Embed this as a pytest fixture that runs against a held-out reference scene on every data version bump to catch silent regressions in upstream COG generation.
Related
- Calculating NDVI Directly from xarray DataArrays — optimized masking, multi-temporal stacking, and atmospheric correction integration for the NDVI workflow specifically
- Mastering CRS Transformations in Rasterio — projection harmonization before multi-sensor band alignment
- Handling Pixel Resolution and Scaling — affine transform mechanics and resolution matching when bands come from different sensor channels
- Spectral Index Calculation Pipelines — production architecture for running NDVI, EVI, NDWI, and NDBI at scale across large STAC catalogs
- Understanding Cloud-Optimized GeoTIFF Structure — how COG internal tiling interacts with Dask chunk size and HTTP range reads for remote band data