layout: article.njk
Calculating NDVI directly from xarray DataArrays
Calculating NDVI directly from xarray DataArrays transforms traditional raster band math into a labeled, coordinate-aware operation. Instead of managing raw NumPy indices and manual axis alignment, you apply the standard formula (NIR - Red) / (NIR + Red) using declarative xarray operators. The library automatically handles spatial broadcasting, lazy execution via Dask, and metadata propagation. This eliminates silent misalignment bugs, preserves CF-compliant metadata, and scales seamlessly from single-scene processing to multi-temporal continental stacks.
Core Implementation
import xarray as xr
import numpy as np
def calculate_ndvi_xarray(data: xr.DataArray | xr.Dataset) -> xr.DataArray:
"""Compute NDVI from xarray objects with safe division and metadata preservation."""
# Extract bands regardless of input structure
if isinstance(data, xr.Dataset):
red = data["red"]
nir = data["nir"]
else:
# Assumes a 'band' coordinate exists; fallback to positional indexing if needed
red = data.sel(band="red")
nir = data.sel(band="nir")
# Prevent integer overflow/wrap-around by casting to float32
red = red.astype("float32")
nir = nir.astype("float32")
# Core NDVI calculation
denominator = nir + red
ndvi = (nir - red) / denominator
# Handle division-by-zero (e.g., black borders, nodata) safely
ndvi = ndvi.where(denominator != 0, np.nan)
ndvi = ndvi.clip(-1, 1)
# Attach standard metadata
ndvi.attrs.update({
"long_name": "Normalized Difference Vegetation Index",
"standard_name": "NDVI",
"units": "1",
"formula": "(NIR - RED) / (NIR + RED)",
"valid_range": [-1.0, 1.0]
})
return ndvi
Step-by-Step Breakdown
1. Structure-Agnostic Band Extraction
The function accepts either an xr.Dataset (where bands are separate variables) or an xr.DataArray (where bands are stacked along a coordinate dimension). Using .sel(band="red") leverages xarray’s label-based indexing, removing the need to hardcode array positions like data[0] or data[1]. This is critical when processing multi-sensor archives where band ordering varies.
2. Safe Type Casting
Raw satellite products (Sentinel-2 L2A, Landsat 9 SR) typically ship as uint16. Performing subtraction on unsigned integers causes wrap-around underflow (e.g., 50 - 100 becomes 65486). Casting to float32 before arithmetic prevents this while keeping memory overhead minimal. float64 is unnecessary for NDVI’s -1 to 1 range and doubles RAM consumption during lazy graph construction.
3. Division-by-Zero Handling
Satellite scenes often contain padding, scan gaps, or 0 values in both NIR and Red channels. Direct division produces NaN or Inf, which can break downstream aggregations. Using ndvi.where(denominator != 0, np.nan) explicitly masks invalid pixels while preserving the original coordinate grid. The subsequent .clip(-1, 1) enforces physical bounds, catching any floating-point drift or calibration artifacts.
4. Metadata Preservation
Attaching CF-compliant attributes ensures interoperability with GIS software, NetCDF exporters, and visualization libraries. The standard_name and valid_range keys are recognized by tools like rioxarray, MetPy, and Panoply, enabling automatic colormap scaling and unit-aware plotting without manual overrides.
Performance & Production Considerations
Dask Chunking Strategy
NDVI is compute-bound but memory-light. When working with large archives, chunk along spatial dimensions to maximize cache locality and minimize inter-worker communication. Optimal chunk sizes typically follow (1, 1024, 1024) for (time, y, x) layouts. Avoid chunking along the band dimension if you plan to compute multiple indices (EVI, SAVI, NDWI), as it fragments the task graph and increases scheduling overhead. Refer to the official xarray Dask integration guide for chunking diagnostics and memory profiling workflows.
Coordinate Alignment & Lazy Execution
xarray aligns arrays by coordinate labels, not positional order. If your red and NIR bands originate from separate files or different processing pipelines, xarray automatically reindexes them to a common grid before arithmetic. This prevents subtle georeferencing offsets that plague NumPy-based workflows. Because operations are lazy, calling calculate_ndvi_xarray() only builds a computational graph. The actual calculation triggers only when you call .compute(), .to_netcdf(), or .plot().
Memory Footprint & Data Types
For continental-scale stacks, monitor peak memory during the .compute() phase. If OOM errors occur, reduce spatial chunk sizes or enable xr.set_options(enable_cftimeindex=False) if temporal coordinates are strictly Gregorian. Always write outputs to float32 NetCDF or Zarr to maintain precision without inflating storage costs.
Pipeline Integration
This pattern integrates cleanly into broader Band Math Operations with Xarray workflows. Because xarray tracks spatial coordinates (x, y, crs) and temporal dimensions natively, you can chain NDVI computation directly after STAC catalog ingestion without manual array slicing or coordinate reordering.
A typical production pipeline looks like this:
- Query a STAC API using
pystac-clientto fetch item URIs - Open with
stackstacorodc-stacto build a lazyxr.DataArray - Apply
calculate_ndvi_xarray()to the stacked cube - Compute temporal aggregations (
.mean(dim="time"),.quantile()) - Export to cloud-optimized Zarr or GeoTIFF
By keeping operations lazy until the final export step, you avoid materializing intermediate arrays in RAM. This approach scales to petabyte-scale archives when paired with distributed Dask clusters or cloud-native storage backends. For foundational concepts on coordinate systems, raster I/O, and STAC metadata mapping, review the Core Raster Fundamentals & STAC Mapping reference guide.