layout: article.njk

Calculating NDVI directly from xarray DataArrays

Calculating NDVI directly from xarray DataArrays transforms traditional raster band math into a labeled, coordinate-aware operation. Instead of managing raw NumPy indices and manual axis alignment, you apply the standard formula (NIR - Red) / (NIR + Red) using declarative xarray operators. The library automatically handles spatial broadcasting, lazy execution via Dask, and metadata propagation. This eliminates silent misalignment bugs, preserves CF-compliant metadata, and scales seamlessly from single-scene processing to multi-temporal continental stacks.

Core Implementation

import xarray as xr
import numpy as np

def calculate_ndvi_xarray(data: xr.DataArray | xr.Dataset) -> xr.DataArray:
    """Compute NDVI from xarray objects with safe division and metadata preservation."""
    # Extract bands regardless of input structure
    if isinstance(data, xr.Dataset):
        red = data["red"]
        nir = data["nir"]
    else:
        # Assumes a 'band' coordinate exists; fallback to positional indexing if needed
        red = data.sel(band="red")
        nir = data.sel(band="nir")
        
    # Prevent integer overflow/wrap-around by casting to float32
    red = red.astype("float32")
    nir = nir.astype("float32")
    
    # Core NDVI calculation
    denominator = nir + red
    ndvi = (nir - red) / denominator
    
    # Handle division-by-zero (e.g., black borders, nodata) safely
    ndvi = ndvi.where(denominator != 0, np.nan)
    ndvi = ndvi.clip(-1, 1)
    
    # Attach standard metadata
    ndvi.attrs.update({
        "long_name": "Normalized Difference Vegetation Index",
        "standard_name": "NDVI",
        "units": "1",
        "formula": "(NIR - RED) / (NIR + RED)",
        "valid_range": [-1.0, 1.0]
    })
    return ndvi

Step-by-Step Breakdown

1. Structure-Agnostic Band Extraction

The function accepts either an xr.Dataset (where bands are separate variables) or an xr.DataArray (where bands are stacked along a coordinate dimension). Using .sel(band="red") leverages xarray’s label-based indexing, removing the need to hardcode array positions like data[0] or data[1]. This is critical when processing multi-sensor archives where band ordering varies.

2. Safe Type Casting

Raw satellite products (Sentinel-2 L2A, Landsat 9 SR) typically ship as uint16. Performing subtraction on unsigned integers causes wrap-around underflow (e.g., 50 - 100 becomes 65486). Casting to float32 before arithmetic prevents this while keeping memory overhead minimal. float64 is unnecessary for NDVI’s -1 to 1 range and doubles RAM consumption during lazy graph construction.

3. Division-by-Zero Handling

Satellite scenes often contain padding, scan gaps, or 0 values in both NIR and Red channels. Direct division produces NaN or Inf, which can break downstream aggregations. Using ndvi.where(denominator != 0, np.nan) explicitly masks invalid pixels while preserving the original coordinate grid. The subsequent .clip(-1, 1) enforces physical bounds, catching any floating-point drift or calibration artifacts.

4. Metadata Preservation

Attaching CF-compliant attributes ensures interoperability with GIS software, NetCDF exporters, and visualization libraries. The standard_name and valid_range keys are recognized by tools like rioxarray, MetPy, and Panoply, enabling automatic colormap scaling and unit-aware plotting without manual overrides.

Performance & Production Considerations

Dask Chunking Strategy

NDVI is compute-bound but memory-light. When working with large archives, chunk along spatial dimensions to maximize cache locality and minimize inter-worker communication. Optimal chunk sizes typically follow (1, 1024, 1024) for (time, y, x) layouts. Avoid chunking along the band dimension if you plan to compute multiple indices (EVI, SAVI, NDWI), as it fragments the task graph and increases scheduling overhead. Refer to the official xarray Dask integration guide for chunking diagnostics and memory profiling workflows.

Coordinate Alignment & Lazy Execution

xarray aligns arrays by coordinate labels, not positional order. If your red and NIR bands originate from separate files or different processing pipelines, xarray automatically reindexes them to a common grid before arithmetic. This prevents subtle georeferencing offsets that plague NumPy-based workflows. Because operations are lazy, calling calculate_ndvi_xarray() only builds a computational graph. The actual calculation triggers only when you call .compute(), .to_netcdf(), or .plot().

Memory Footprint & Data Types

For continental-scale stacks, monitor peak memory during the .compute() phase. If OOM errors occur, reduce spatial chunk sizes or enable xr.set_options(enable_cftimeindex=False) if temporal coordinates are strictly Gregorian. Always write outputs to float32 NetCDF or Zarr to maintain precision without inflating storage costs.

Pipeline Integration

This pattern integrates cleanly into broader Band Math Operations with Xarray workflows. Because xarray tracks spatial coordinates (x, y, crs) and temporal dimensions natively, you can chain NDVI computation directly after STAC catalog ingestion without manual array slicing or coordinate reordering.

A typical production pipeline looks like this:

Query a STAC API using pystac-client to fetch item URIs
Open with stackstac or odc-stac to build a lazy xr.DataArray
Apply calculate_ndvi_xarray() to the stacked cube
Compute temporal aggregations (.mean(dim="time"), .quantile())
Export to cloud-optimized Zarr or GeoTIFF

By keeping operations lazy until the final export step, you avoid materializing intermediate arrays in RAM. This approach scales to petabyte-scale archives when paired with distributed Dask clusters or cloud-native storage backends. For foundational concepts on coordinate systems, raster I/O, and STAC metadata mapping, review the Core Raster Fundamentals & STAC Mapping reference guide.