Spectral Index Calculation Pipelines
Spectral indices distill raw multispectral reflectance into quantifiable environmental metrics, enabling consistent monitoring of vegetation health, water extent, soil moisture, and urban heat. Building a robust pipeline requires more than simple band arithmetic: it demands memory-aware I/O, deterministic error handling, and reproducible coordinate transformations. For the full architectural context of where index computation sits inside an end-to-end geospatial workflow, see Satellite Processing Workflows & Index Pipelines.
The patterns below prioritize scalability, numerical stability, and seamless integration with cloud-native data stacks. They are aimed at remote sensing analysts, environmental data engineers, and research teams deploying automated geospatial pipelines.
Pipeline architecture overview
The diagram below shows the linear, stateless execution model used throughout this guide. Each stage is isolated so it can be tested and replaced independently.
Prerequisites
Install the core stack before starting. Pin major versions to prevent silent coordinate drift from GDAL/PROJ updates:
pip install "rasterio>=1.3" "numpy>=1.24" "xarray>=2023.1" \
"rioxarray>=0.15" "dask[array]>=2023.1" "pyproj>=3.6"
| Library | Min version | Why required |
|---|---|---|
rasterio |
1.3 | Windowed I/O, CRS parsing, affine transforms |
numpy |
1.24 | Vectorized arithmetic, errstate context manager |
xarray |
2023.1 | Labeled multidimensional arrays with time dimensions |
rioxarray |
0.15 | CRS-aware .clip(), .reproject(), COG writes via xarray |
dask[array] |
2023.1 | Out-of-core chunked execution for large scenes |
pyproj |
3.6 | CRS validation, from_epsg() standardization |
Conceptual prerequisites: You should be comfortable with raster data models and affine geotransforms (covered in Extracting and Parsing Raster Metadata) and understand how band math operations with xarray work before applying the patterns here. Input data must be surface reflectance products — Landsat Collection 2 SR or Sentinel-2 L2A — with documented scale factors.
Step-by-step workflow
Step 1 — Install dependencies and validate environment
Before opening any raster, confirm the library versions and GDAL/PROJ environment are consistent. Silent version mismatches between rasterio, pyproj, and the underlying GDAL shared library cause coordinate errors that are hard to trace:
import rasterio
import numpy as np
import xarray as xr
import rioxarray # registers .rio accessor on xr.DataArray
from pyproj import CRS
# Verify GDAL and PROJ versions at import time
print(f"rasterio {rasterio.__version__}, GDAL {rasterio.__gdal_version__}")
print(f"numpy {np.__version__}, xarray {xr.__version__}")
# Confirm PROJ data directory is accessible
from pyproj.exceptions import ProjError
try:
_ = CRS.from_epsg(4326)
print("PROJ data directory: OK")
except ProjError as exc:
raise EnvironmentError(
"PROJ cannot locate its data files. "
"Set PROJ_DATA or reinstall pyproj."
) from exc
If PROJ_DATA is unset in a containerised environment the CRS lookups silently fall back to approximate transformations, producing sub-pixel coordinate errors in reprojected outputs.
Step 2 — Validate inputs and extract raster metadata
The pipeline must fail fast when metadata is inconsistent. Parse raster headers to verify band count, data type, CRS, and affine transform before loading any pixels. Reject datasets with missing NODATA definitions or corrupted geotags — these will produce geometric artifacts that propagate through every downstream calculation:
from pathlib import Path
import rasterio
def validate_raster_inputs(
nir_path: Path,
red_path: Path,
expected_crs_epsg: int = 32636,
) -> dict:
"""
Validate that two band rasters are spatially consistent.
Returns a metadata dict used by all downstream stages.
Raises ValueError on any inconsistency.
"""
meta = {}
with rasterio.open(nir_path) as nir_src, rasterio.open(red_path) as red_src:
# CRS check
for name, src in [("NIR", nir_src), ("Red", red_src)]:
if src.crs is None:
raise ValueError(f"{name} raster has no CRS defined.")
if src.crs.to_epsg() != expected_crs_epsg:
raise ValueError(
f"{name} raster CRS is EPSG:{src.crs.to_epsg()}, "
f"expected EPSG:{expected_crs_epsg}."
)
# Grid alignment: same transform, shape, and NODATA
if nir_src.transform != red_src.transform:
raise ValueError("NIR and Red rasters have different affine transforms.")
if nir_src.shape != red_src.shape:
raise ValueError(
f"Shape mismatch: NIR {nir_src.shape} vs Red {red_src.shape}."
)
if nir_src.nodata is None or red_src.nodata is None:
raise ValueError("Both rasters must define a NODATA value.")
meta = {
"crs": nir_src.crs,
"transform": nir_src.transform,
"width": nir_src.width,
"height": nir_src.height,
"dtype": nir_src.dtypes[0],
"nodata_nir": nir_src.nodata,
"nodata_red": red_src.nodata,
"count": nir_src.count,
}
return meta
For batch automation across many scenes, see Automating Metadata Extraction for Batch Raster Jobs for patterns that extend this approach across entire archives.
Step 3 — Apply spatial subsetting and ROI clipping
Raw satellite scenes often exceed the area of interest by an order of magnitude. Subsetting early reduces I/O overhead and computational load for every subsequent stage. Integrating automated image clipping and cropping before index computation avoids loading unnecessary border pixels:
import geopandas as gpd
import rioxarray
import xarray as xr
from pathlib import Path
def load_and_clip_bands(
nir_path: Path,
red_path: Path,
roi_vector: Path,
chunk_size: int = 1024,
) -> tuple[xr.DataArray, xr.DataArray]:
"""
Load NIR and Red bands as Dask-backed DataArrays, then clip to ROI.
chunk_size controls the tile size passed to Dask; tune for available RAM.
"""
roi = gpd.read_file(roi_vector).to_crs("EPSG:32636")
nir = xr.open_dataarray(
nir_path, engine="rasterio",
chunks={"x": chunk_size, "y": chunk_size},
)
red = xr.open_dataarray(
red_path, engine="rasterio",
chunks={"x": chunk_size, "y": chunk_size},
)
# rioxarray.clip expects geometries in the same CRS as the raster
nir_clipped = nir.rio.clip(roi.geometry, roi.crs, drop=True, all_touched=False)
red_clipped = red.rio.clip(roi.geometry, roi.crs, drop=True, all_touched=False)
return nir_clipped, red_clipped
Set all_touched=False (the default) so only pixels whose centres fall inside the polygon contribute to the clip. Switching to all_touched=True inflates edge pixel counts and can introduce border noise in narrow-feature analysis.
Step 4 — Apply radiometric scaling to surface reflectance
Surface reflectance products are stored as scaled integers to reduce file size — typically int16 with a scale factor of 0.0001 and an additive offset of 0.0. Apply these before any index computation. Never perform arithmetic on raw digital numbers:
import numpy as np
import xarray as xr
# Landsat Collection 2 SR scale factors (verify in the product guide)
SCALE = 2.75e-05
OFFSET = -0.2
def apply_radiometric_scaling(
band: xr.DataArray,
nodata_val: float,
scale: float = SCALE,
offset: float = OFFSET,
) -> xr.DataArray:
"""
Convert scaled-integer DN to fractional surface reflectance [0, 1].
NODATA pixels are set to NaN so they propagate cleanly through index math.
"""
# Cast before arithmetic to avoid int16 overflow
band_float = band.astype(np.float32)
# Mask NODATA before scaling so fill values don't produce junk reflectances
band_float = band_float.where(band_float != nodata_val, other=np.nan)
return band_float * scale + offset
Sentinel-2 L2A uses a scale of 0.0001 with no offset. Always consult the official product documentation — do not hard-code values without verification — because scale factors differ between processing baselines.
Step 5 — Compute index with numerical stability guards
Index formulas such as NDVI, EVI, NDWI, and NDBI all involve normalized ratios, so division-by-zero and near-zero denominators must be handled explicitly. Using numpy.errstate suppresses IEEE 754 warnings while the explicit mask replaces invalid results with NaN rather than ±Inf:
import numpy as np
import xarray as xr
def compute_ndvi(
nir: xr.DataArray,
red: xr.DataArray,
) -> xr.DataArray:
"""
Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
Inputs must already be scaled to fractional reflectance.
"""
denom = nir + red
with np.errstate(divide="ignore", invalid="ignore"):
ndvi = (nir - red) / denom
# Replace zero-denominator results; NaN inputs propagate automatically
ndvi = ndvi.where(denom != 0, other=np.nan)
return ndvi.rename("NDVI")
def compute_evi(
nir: xr.DataArray,
red: xr.DataArray,
blue: xr.DataArray,
G: float = 2.5,
C1: float = 6.0,
C2: float = 7.5,
L: float = 1.0,
) -> xr.DataArray:
"""
Enhanced Vegetation Index — less saturated than NDVI in dense canopies.
Requires a Blue band (Landsat Band 2, Sentinel-2 Band 2).
"""
denom = nir + C1 * red - C2 * blue + L
with np.errstate(divide="ignore", invalid="ignore"):
evi = G * (nir - red) / denom
evi = evi.where(denom != 0, other=np.nan)
return evi.rename("EVI")
def compute_ndwi(
green: xr.DataArray,
nir: xr.DataArray,
) -> xr.DataArray:
"""
Normalized Difference Water Index (McFeeters 1996): (Green - NIR) / (Green + NIR).
Positive values indicate open water; negative values indicate non-water.
"""
denom = green + nir
with np.errstate(divide="ignore", invalid="ignore"):
ndwi = (green - nir) / denom
ndwi = ndwi.where(denom != 0, other=np.nan)
return ndwi.rename("NDWI")
When working with large scenes, these functions execute lazily inside Dask chunks — no pixels are read until .compute() is called. Keep chunk sizes at 512–1024 pixels per side for most 10-m to 30-m resolution products; larger tiles improve Dask scheduling efficiency but increase memory pressure per worker.
Step 6 — Serialize outputs with embedded provenance
Write index rasters with explicit metadata: CRS, affine transform, data type (float32), and a descriptive NODATA value. Embedding processing provenance in TIFF tags prevents audit failures in environmental compliance workflows and allows any downstream consumer to reconstruct the exact computation:
import json
import hashlib
from pathlib import Path
import numpy as np
import rasterio
from rasterio.crs import CRS
def write_index_cog(
index_array: np.ndarray,
output_path: Path,
meta_template: dict,
provenance: dict,
nodata_val: float = -9999.0,
) -> None:
"""
Write a float32 index raster as a Cloud-Optimized GeoTIFF.
provenance dict is serialized into the TIFF's DESCRIPTION tag.
"""
profile = {
**meta_template,
"driver": "GTiff",
"dtype": "float32",
"nodata": nodata_val,
"count": 1,
"compress": "zstd",
"tiled": True,
"blockxsize": 512,
"blockysize": 512,
"interleave": "band",
}
# Replace NaN with the explicit NODATA fill before writing
index_filled = np.where(np.isnan(index_array), nodata_val, index_array).astype(np.float32)
tmp_path = output_path.with_suffix(".tmp.tif")
with rasterio.open(tmp_path, "w", **profile) as dst:
dst.write(index_filled, 1)
dst.update_tags(
DESCRIPTION=json.dumps(provenance),
)
# Atomic rename prevents partial files on interruption
tmp_path.rename(output_path)
For downstream streaming via web maps or cloud analytics, convert to Cloud-Optimized GeoTIFF format by adding copy_src_overviews=True and building internal overviews with rasterio.open(...).build_overviews() before the COG copy step.
Step 7 — Verify outputs and run post-computation assertions
Verification is not optional. A pipeline that runs to completion without explicit checks can silently produce out-of-range values, misaligned grids, or near-empty valid-pixel counts that only surface weeks later in time-series analysis:
import numpy as np
import rasterio
from pathlib import Path
def verify_index_output(
output_path: Path,
index_name: str,
expected_crs_epsg: int,
valid_min: float = -1.0,
valid_max: float = 1.0,
) -> None:
"""
Assert that a written index file is numerically and spatially sane.
Raises AssertionError on any failure; designed to run in CI pipelines.
"""
with rasterio.open(output_path) as src:
assert src.crs.to_epsg() == expected_crs_epsg, (
f"{index_name}: CRS is EPSG:{src.crs.to_epsg()}, "
f"expected EPSG:{expected_crs_epsg}."
)
assert src.dtypes[0] == "float32", (
f"{index_name}: dtype is {src.dtypes[0]}, expected float32."
)
data = src.read(1, masked=True)
# Range check (NDVI, EVI, NDWI all span -1 to 1 physically)
valid = data.compressed() # exclude masked/NODATA pixels
assert valid.size > 0, f"{index_name}: output contains no valid pixels."
assert float(valid.min()) >= valid_min, (
f"{index_name}: minimum value {valid.min():.4f} below {valid_min}."
)
assert float(valid.max()) <= valid_max, (
f"{index_name}: maximum value {valid.max():.4f} above {valid_max}."
)
# Warn if NaN density is suspiciously high (>50% likely indicates a masking bug)
nan_frac = 1.0 - valid.size / data.size
if nan_frac > 0.5:
import warnings
warnings.warn(
f"{index_name}: {nan_frac:.1%} of pixels are NODATA. "
"Check cloud/shadow masking and ROI alignment.",
stacklevel=2,
)
Parameter reference
| Parameter / argument | Type | Default | Usage note |
|---|---|---|---|
chunk_size (load step) |
int |
1024 |
Dask tile edge length in pixels. Reduce to 512 for 10-m Sentinel-2 on machines with < 16 GB RAM |
all_touched (clip step) |
bool |
False |
Include edge pixels whose centre lies outside the polygon; only set True for thin linear features |
scale, offset (scaling step) |
float |
product-specific | Landsat C2: scale=2.75e-05, offset=-0.2. Sentinel-2 L2A: scale=1e-04, offset=0.0 |
nodata_val (write step) |
float |
-9999.0 |
Must match the nodata field in the rasterio profile; np.nan is not a valid NODATA sentinel for integer-compatible readers |
G, C1, C2, L (EVI) |
float |
2.5, 6.0, 7.5, 1.0 |
MODIS-standard EVI coefficients; do not alter for cross-sensor comparability |
blockxsize, blockysize |
int |
512 |
COG internal tile size; 512×512 is the de-facto standard for efficient HTTP range reads |
compress |
str |
"zstd" |
zstd provides better ratios than lzw for float32 data; use deflate for GDAL < 2.2 compat |
Handling multi-sensor band gaps and spectral harmonization
Cross-platform analysis frequently encounters missing bands due to sensor degradation, acquisition gaps, or differing instrument specifications. Implement graceful degradation strategies rather than silently skipping indices:
Band substitution: Map equivalent wavelengths across sensors using documented spectral response functions — for example, Landsat Collection 2 Band 5 at 865 nm maps to Sentinel-2 Band 8A at 865 nm. Simple band-number substitution is insufficient; verify spectral overlap before treating bands as equivalent.
Fallback indices: If a primary band is unavailable, compute a proxy index with adjusted coefficients — for example, fall back from NDVI to SAVI when soil brightness dominates an arid-land scene.
Mask propagation: Ensure missing-band masks propagate through all downstream calculations to prevent false-positive environmental signals. Integrating cloud and shadow masking strategies before index computation eliminates the largest source of contaminated pixels in optical time series.
Never interpolate missing bands without explicit documentation. Spectral harmonization requires radiometric consistency, not just spatial alignment.
Scaling to continental extents and tile stitching
Global or regional deployments require distributed execution and robust tile management. Partition large extents into overlapping geospatial tiles to prevent edge artifacts during resampling. Process each tile independently using Dask delayed functions, then merge outputs while preserving spatial continuity.
When adjacent tiles exhibit radiometric drift or seam lines, apply seamless mosaicking and edge blending techniques during the final assembly stage. Feathered blending, histogram matching, and priority-based compositing ensure that index values transition smoothly across tile boundaries.
For multi-date stacks, use temporal aggregation and time-series analysis to stack index outputs into xarray.Dataset objects with time dimensions, then apply rolling windows, seasonal composites, or anomaly detection algorithms.
When adjacent tiles require different resolutions to be aligned, apply advanced resampling and upscaling techniques to harmonize grid spacings before stitching.
Troubleshooting
ValueError: Input shapes do not match — rioxarray .clip() fails when the ROI vector and raster are in different CRS. Always reproject the vector to the raster CRS with gdf.to_crs(raster.rio.crs) before clipping. Alternatively enable rioxarray’s auto_crs option if you are working interactively, but avoid it in production scripts where silent reprojections can change areal statistics.
All-NaN output raster — the most common cause is failing to cast integer bands to float32 before division. Scaled-integer values in int16 produce integer division truncated to zero; 0/0 then yields NaN via errstate. Confirm band.dtype is float32 after the scaling step.
Index values outside [-1, 1] — indicates that the scale/offset was not applied or the wrong constants were used. Print np.nanmin(ndvi) and np.nanmax(ndvi) immediately after computation. If values exceed 1 by a factor of 10 000, the 0.0001 scale factor was missed.
MemoryError during Dask compute — chunk size is too large for available workers. Reduce chunk_size from 1024 to 512 or 256, or increase Dask worker memory limits. Use dask.diagnostics.ProgressBar to monitor which task graph stage is exhausting memory.
Seam lines between tiles — if processing a scene broken into tiles, ensure a 64-pixel overlap between adjacent tiles and crop the overlap away after index computation, before stitching. Without overlap, edge resampling in the scaling step produces one-pixel discontinuities that show up as visible lines in false-colour composites.
CRS not defined on output raster — rasterio.open(..., "w", ...) requires an explicit crs key in the profile. If meta_template came from a raster opened without CRS, the profile’s crs field will be None. Always validate profile["crs"] is not None before writing.
FAQ
Why does NDVI computation return NaN across the entire output raster?
The most common cause is failing to cast integer bands to float32 before division. Scaled-integer surface reflectance values (e.g., int16) produce integer division truncated to zero, then NaN propagates from 0/0. Cast both bands with .astype(np.float32) and apply the scale/offset before any arithmetic.
Can I compute NDVI directly from raw digital numbers?
You can compute the ratio, but the results are not physically meaningful and cannot be compared across dates or sensors. Surface reflectance products (Landsat Collection 2 SR, Sentinel-2 L2A) must have their scale factor and offset applied first so that pixel values represent fractional reflectance between 0 and 1.
What is the correct NDVI range and how do I detect sensor artifacts?
Physically valid NDVI spans -1 to 1. Values outside this range indicate corrupted pixels, cloud/shadow contamination, or missing-data fill values. After computation, assert np.nanmin(ndvi) >= -1.0 and np.nanmax(ndvi) <= 1.0; flag or mask pixels that fail these bounds before downstream use.
Related
- Satellite Processing Workflows & Index Pipelines — parent section covering the full pipeline from raw acquisition to analysis-ready outputs
- Band Math Operations with xarray — foundational patterns for labeled multidimensional arithmetic that underpin every index formula
- Automated Image Clipping and Cropping — align inputs to administrative boundaries or watershed polygons before computation
- Cloud and Shadow Masking Strategies — remove contaminated pixels before index computation to prevent skewed statistics
- Seamless Mosaicking and Edge Blending — stitch tile outputs into continuous index surfaces without seam artifacts
- Temporal Aggregation and Time-Series Analysis — stack index outputs across dates for phenological and drought monitoring