Calculating NDVI directly from xarray DataArrays

Apply (NIR - Red) / (NIR + Red) with label-based band selection and safe division, and the result retains full spatial coordinate metadata:

import xarray as xr
import numpy as np

def ndvi(data: xr.DataArray | xr.Dataset) -> xr.DataArray:
    nir = (data["nir"] if isinstance(data, xr.Dataset) else data.sel(band="nir")).astype("float32")
    red = (data["red"] if isinstance(data, xr.Dataset) else data.sel(band="red")).astype("float32")
    denom = nir + red
    result = (nir - red) / denom.where(denom != 0)
    return result.clip(-1, 1)

That four-liner is the core. The rest of this page explains why each step is necessary and shows production-ready variants.


Why this case arises in remote sensing workflows

NDVI is the most widely used spectral index in land-cover monitoring, phenology tracking, and agricultural yield modelling. But computing it correctly from raw satellite data involves several failure modes that do not exist in simple NumPy array math: unsigned-integer underflow, division by zero at scene borders, coordinate misalignment between separately-downloaded band files, and loss of the CRS and geotransform during array arithmetic.

This page belongs to the Band Math Operations with Xarray workflow. For the broader data-loading and coordinate management context, see Core Raster Fundamentals & STAC Mapping.


Environment & setup

Package Minimum version Role on this page
xarray 2023.6 Labeled n-D arrays, .sel(), .where(), .clip()
numpy 1.24 np.nan sentinel value
rioxarray 0.15 Spatial extension: .rio.to_raster(), CRS propagation
dask[complete] 2023.9 Lazy computation for large scenes (optional but recommended)
pip install "xarray>=2023.6" "numpy>=1.24" "rioxarray>=0.15" "dask[complete]>=2023.9"

How xarray NDVI calculation works — data-flow diagram

The SVG below shows the transformation path from a raw multi-band DataArray to a masked, CF-annotated NDVI output.

xarray NDVI calculation data flow A left-to-right pipeline: raw uint16 DataArray → band selection → float32 cast → arithmetic → nodata mask → clip → CF metadata → NDVI output. Raw DataArray (uint16, band dim) Band selection .sel(band="nir") .sel(band="red") or Dataset["nir"] Cast float32 .astype("float32") Arithmetic (nir − red) / (nir + red) .where(denom≠0) NDVI output float32 [−1, 1] CRS preserved CF attrs attached All steps are lazy when backed by Dask — execution defers to .compute() or file export

Complete working example

import xarray as xr
import numpy as np
import rioxarray  # noqa: F401 — activates the .rio accessor on DataArrays


def calculate_ndvi(data: xr.DataArray | xr.Dataset) -> xr.DataArray:
    """
    Compute NDVI from a labeled xarray object.

    Parameters
    ----------
    data : xr.DataArray with a 'band' coordinate containing 'nir' and 'red',
           OR an xr.Dataset with variables named 'nir' and 'red'.

    Returns
    -------
    xr.DataArray : float32 NDVI, range [-1, 1], nodata pixels set to NaN,
                   spatial coordinates and CRS inherited from input.
    """
    # --- Band extraction (structure-agnostic) ---
    if isinstance(data, xr.Dataset):
        # Dataset: bands are separate variables; direct key access preserves attrs
        nir = data["nir"]
        red = data["red"]
    else:
        # DataArray: bands stacked on a 'band' coordinate with string labels.
        # .sel() is label-based — immune to band-order differences between sensors.
        nir = data.sel(band="nir")
        red = data.sel(band="red")

    # --- float32 cast (MUST precede arithmetic on uint16 inputs) ---
    # Sentinel-2 L2A and Landsat Collection 2 SR are uint16.
    # uint16 subtraction wraps: 50 - 100 → 65486, not -50.
    # float32 gives ~7 significant figures — ample for NDVI's [-1, 1] range.
    nir = nir.astype("float32")
    red = red.astype("float32")

    # --- Core formula with safe division ---
    denominator = nir + red
    # Dividing by NaN propagates NaN rather than raising ZeroDivisionError
    ndvi = (nir - red) / denominator.where(denominator != 0)

    # --- Enforce physical bounds ---
    # Floating-point drift or miscalibrated pixels can produce values outside [-1, 1].
    # .clip() is cheap and prevents downstream colormap / histogram anomalies.
    ndvi = ndvi.clip(-1.0, 1.0)

    # --- CF-compliant metadata ---
    # Tools like rioxarray, Panoply, and MetPy read these keys automatically
    # for unit-aware plotting and auto-scaling.
    ndvi.attrs.update({
        "long_name": "Normalized Difference Vegetation Index",
        "standard_name": "normalized_difference_vegetation_index",
        "units": "1",          # dimensionless — CF convention for pure ratios
        "formula": "(NIR - Red) / (NIR + Red)",
        "valid_range": [-1.0, 1.0],
    })
    return ndvi


# --- Minimal usage example ---
if __name__ == "__main__":
    import rasterio

    # Open a multi-band GeoTIFF where band order is: 1=Red, 2=NIR (e.g. Landsat 8 B4, B5)
    with rasterio.open("landsat_scene.tif") as src:
        da = xr.open_rasterio(src).assign_coords(
            band=["red", "nir"]   # label the positional bands
        )

    ndvi_result = calculate_ndvi(da)
    print(ndvi_result)
    # Output: <xarray.DataArray (y: ..., x: ...)>
    # Coordinates: x, y, spatial_ref

    # Write to cloud-optimized GeoTIFF
    ndvi_result.rio.to_raster("ndvi_output.tif", driver="GTiff",
                               compress="deflate", dtype="float32")

Variant patterns

Variant 1 — Dask-enabled large-scene processing

For multi-temporal stacks or continental extents, open the data with Dask chunking before calling calculate_ndvi. The function is fully lazy: all operations compose into a single task graph that only materialises when you write to disk.

import xarray as xr
import rioxarray  # noqa: F401

# stackstac or odc-stac produce chunked DataArrays directly from STAC items.
# chunk=(1, 1024, 1024) keeps each task ~4 MB in float32 — a good baseline.
da = xr.open_dataset(
    "sentinel2_stack.zarr",
    engine="zarr",
    chunks={"time": 1, "y": 1024, "x": 1024},
)["nir_red_stack"]

# calculate_ndvi returns a lazy graph — no RAM spike yet
ndvi_lazy = calculate_ndvi(da)

# Write directly to Zarr; Dask executes chunk-by-chunk
ndvi_lazy.to_zarr("ndvi_timeseries.zarr", mode="w")

Avoid chunking along the band dimension when you plan to compute multiple indices (EVI, NDWI, SAVI) from the same stack — it fragments the task graph and increases scheduler overhead.

Variant 2 — Separate single-band GeoTIFF files

Sentinel-2 data downloaded from the Copernicus Hub arrives as one GeoTIFF per band. Use xr.align to reconcile any minor extent differences before applying the formula.

import xarray as xr
import rioxarray  # noqa: F401

nir = xr.open_dataarray("T32TNT_20240601_B08.tif", engine="rasterio").squeeze("band")
red = xr.open_dataarray("T32TNT_20240601_B04.tif", engine="rasterio").squeeze("band")

# align() reindexes both arrays to a common coordinate grid.
# join="inner" clips to the shared extent; "outer" pads with NaN.
nir_aligned, red_aligned = xr.align(nir, red, join="inner")

nir_f = nir_aligned.astype("float32")
red_f = red_aligned.astype("float32")
denom = nir_f + red_f
ndvi = ((nir_f - red_f) / denom.where(denom != 0)).clip(-1, 1)

ndvi.rio.to_raster("ndvi_sentinel2.tif", driver="GTiff", compress="deflate")

Variant 3 — Nodata-aware masking from raster metadata

Raw COG files encode a nodata value (often 0 or 65535). Apply an explicit nodata mask before the NDVI formula so fill pixels do not produce spurious low-NDVI values that contaminate aggregations.

import xarray as xr
import rioxarray  # noqa: F401
import numpy as np

da = xr.open_dataarray("s2_b4_b8.tif", engine="rasterio")
# .rio.nodata reads the nodata value recorded in the GeoTIFF metadata
nodata_val = da.rio.nodata  # e.g. 0 for Sentinel-2 L2A

# Replace nodata with NaN before any arithmetic
da = da.where(da != nodata_val).astype("float32")

nir = da.sel(band=1).rename({"band": None})  # single-band squeeze
red = da.sel(band=0).rename({"band": None})

denom = nir + red
ndvi = ((nir - red) / denom.where(denom != 0)).clip(-1, 1)

Common errors

FloatingPointError: invalid value encountered in true_divide This appears when numpy’s floating-point error handling is set to raise (e.g. via np.seterr(invalid='raise')). Fix: either reset to default with np.seterr(invalid='ignore') before the calculation, or use the .where(denominator != 0) pattern shown above to pre-mask zeros.

ValueError: cannot select a dimension that does not exist: 'band' Your DataArray uses a different dimension name — common variants are 'bands', 'wavelength', or a positional integer index. Print data.dims to inspect, then pass the correct name to .sel(), or rename the dimension first: data = data.rename({"bands": "band"}).

xr.MergeError: cannot broadcast arrays when combining separate band files The NIR and Red GeoTIFFs have slightly different extents or pixel grids (common with Sentinel-2 10 m vs. 20 m bands). Fix: resample to a common grid using rioxarray’s .rio.reproject_match(reference_da) before calling the formula. See Handling Pixel Resolution and Scaling for a full reprojection workflow.