Why do monthly composites contain NaN values even when some valid observations exist?

xarray's resample().mean() propagates NaN if skipna is not explicitly set to True. Add skipna=True and guard with a count() threshold to retain only pixels with enough valid observations.

How do I deduplicate timestamps when stacking multiple satellite scenes?

After concatenation, filter the time index with da.sel(time=~da.indexes['time'].duplicated()) before any aggregation to avoid double-counting overlapping orbits.

What chunk size should I use for temporal aggregation with Dask?

A spatial chunk of 2048×2048 pixels per time slice works well for Sentinel-2 10 m data on a 16 GB machine. Set the time chunk to 1 (one scene per chunk) to keep memory predictable during resample operations.

Temporal Aggregation and Time-Series Analysis

Raw satellite archives are sequences of snapshots, each carrying cloud contamination, orbital gaps, and sensor noise. Converting those snapshots into statistically robust composites — monthly medians, phenological envelopes, interannual trend slopes — is the central challenge of satellite time-series work. This page walks through a production-grade Python pipeline for stacking, masking, aggregating, and validating multi-temporal raster data using xarray and dask. For the full architectural context of where temporal aggregation fits, see Satellite Processing Workflows & Index Pipelines.

Prerequisites

Install the required libraries before running any code on this page:

pip install "xarray>=2023.1.0" "rioxarray>=0.15.0" "dask[array]>=2023.1.0" \
            "rasterio>=1.3.0" "numpy>=1.24" "pandas>=2.0" "netcdf4>=1.6"

Version table

Library	Minimum version	Why required
`xarray`	2023.1.0	`resample()` / `groupby()` with named time frequencies; DataArray coordinate handling
`rioxarray`	0.15.0	Registers the `.rio` accessor for CRS-aware clipping and write-back
`dask[array]`	2023.1.0	Lazy chunked arrays that defer computation until `.compute()` is called
`rasterio`	1.3.0	Low-level GeoTIFF I/O and coordinate transforms
`numpy`	1.24	Array math; required by xarray and dask backends
`pandas`	2.0	`DatetimeIndex` and timestamp parsing utilities
`netcdf4`	1.6	Backend engine for `xr.Dataset.to_netcdf()`

Conceptual prerequisites

You should understand how Cloud-Optimized GeoTIFF structure enables byte-range reads, which is what makes lazy dask-backed stacking efficient.
Familiarity with how band math operations with xarray work will help you understand how spectral indices are computed before stacking.
Input scenes should already be clipped to your area of interest (see Automated Image Clipping and Cropping) and cloud-masked (see Cloud and Shadow Masking Strategies).

All input rasters must share an identical CRS, spatial resolution, and grid origin. Temporal metadata must use UTC-normalised datetime objects. Unmasked cloud pixels introduce outliers that corrupt every percentile-based composite you produce.

Step-by-Step Workflow

Step 1 — Ingest rasters and parse acquisition timestamps

Load files lazily, extracting ISO-formatted dates from filenames. Attach each date as a time coordinate so xarray can later aggregate along that axis.

import xarray as xr
import rioxarray  # registers the .rio accessor
import pandas as pd
import numpy as np
from pathlib import Path
import glob

def load_scene(path: str, date: pd.Timestamp) -> xr.DataArray:
    """Open a single GeoTIFF as a 1-date DataArray backed by dask."""
    da = rioxarray.open_rasterio(
        path,
        masked=True,           # NoData → NaN automatically
        chunks={"x": 2048, "y": 2048},
    )
    da = da.squeeze("band", drop=True)   # drop the single-band axis
    da = da.expand_dims(time=[date])
    return da

file_paths = sorted(glob.glob("data/ndvi_*.tif"))

# Extract the date from the trailing YYYY-MM-DD in the stem, e.g. ndvi_2023-06-15.tif
dates = [pd.Timestamp(Path(p).stem.split("_")[-1]) for p in file_paths]

arrays = [load_scene(p, d) for p, d in zip(file_paths, dates)]
da_stack = xr.concat(arrays, dim="time")
# da_stack is now (time, y, x) — fully dask-backed, nothing computed yet
print(da_stack)

Keep chunks at {"x": 2048, "y": 2048, "time": 1}. Setting time: 1 means each task in the Dask graph handles exactly one scene, so memory usage stays flat regardless of archive depth.

Step 2 — Deduplicate the time axis

Multi-scene archives often contain overlapping acquisitions from adjacent orbital paths. Aggregating duplicate timestamps inflates pixel counts and biases means.

time_index = da_stack.indexes["time"]
da_stack = da_stack.sel(time=~time_index.duplicated(keep="first"))

If you need to combine overlapping scenes rather than drop duplicates (for example, to fill spatial gaps at swath edges), apply Seamless Mosaicking and Edge Blending before stacking so the per-scene rasters are already spatially merged.

Step 3 — Clip to area of interest and apply quality masks

Aggregating full-tile extents wastes memory and compute time. Clip first, then mask.

import geopandas as gpd

aoi = gpd.read_file("aoi/study_region.gpkg")

# Clip preserves the CRS and geotransform — requires rioxarray >= 0.15
da_clipped = da_stack.rio.clip(
    aoi.geometry.values,
    aoi.crs,
    drop=True,
    all_touched=False,
)

# Apply a pre-computed boolean quality mask (True = invalid pixel)
# qa_mask must share the same spatial grid; time axis broadcast automatically
# da_clipped = da_clipped.where(~qa_mask)

If your quality layer comes from Fmask or s2cloudless, the Cloud and Shadow Masking Strategies page explains how to build a boolean DataArray that aligns with your stacked scenes.

Step 4 — Validate the valid-pixel distribution before aggregation

Aggregating sparse time series without checking coverage produces silent statistical failures: a “monthly median” built from one clear pixel is as misleading as no composite at all.

# Count valid (non-NaN) observations per pixel per month
valid_monthly = da_clipped.resample(time="MS").count()

# Flag pixels where fewer than 3 valid scenes contributed
insufficient_coverage = valid_monthly < 3

# Quick summary to the console — examine before committing to aggregation
pct_covered = float((~insufficient_coverage).mean()) * 100
print(f"Pixels with ≥3 valid monthly observations: {pct_covered:.1f}%")

A coverage below ~70 % in a given month is a signal to either widen the compositing window or flag that month as unreliable in downstream models.

Step 5 — Temporal aggregation and metric extraction

Execute statistical reductions along the time dimension. Use skipna=True universally, then mask composites with the coverage threshold computed in Step 4.

# Monthly composites
monthly_mean   = da_clipped.resample(time="MS").mean(skipna=True)
monthly_median = da_clipped.resample(time="MS").median(skipna=True)
monthly_p90    = da_clipped.resample(time="MS").quantile(0.90, skipna=True)

# Guard with coverage threshold — pixels below threshold become NaN
monthly_mean   = monthly_mean.where(valid_monthly >= 3)
monthly_median = monthly_median.where(valid_monthly >= 3)

# Annual variability
annual_std = da_clipped.groupby("time.year").std(skipna=True)

# Seasonal grouping (DJF, MAM, JJA, SON)
seasonal_mean = da_clipped.resample(time="QS-DEC").mean(skipna=True)

For trend extraction, use scipy.stats.theilslopes applied pixel-wise via xr.apply_ufunc. Theil-Sen is robust to cloud-induced outliers in a way that ordinary least squares is not.

from scipy.stats import theilslopes
import xarray as xr

def theil_sen_slope(values: np.ndarray) -> float:
    """Return Theil-Sen slope for a 1-D time series; NaN if insufficient data."""
    mask = ~np.isnan(values)
    if mask.sum() < 4:
        return np.nan
    result = theilslopes(values[mask])
    return float(result.slope)

# Apply pixel-wise — dask will parallelise across spatial chunks
slope_map = xr.apply_ufunc(
    theil_sen_slope,
    da_clipped,
    input_core_dims=[["time"]],
    vectorize=True,
    dask="parallelized",
    output_dtypes=[float],
)

Step 6 — Write outputs and validate

Trigger computation and persist to CF-compliant NetCDF. Include compression to keep file sizes manageable.

encoding = {
    "ndvi": {"zlib": True, "complevel": 4, "dtype": "float32"}
}

# Build a Dataset so metadata travels with the arrays
ds_out = xr.Dataset(
    {
        "ndvi_monthly_mean":   monthly_mean.rename("ndvi"),
        "ndvi_monthly_median": monthly_median.rename("ndvi"),
        "valid_count":         valid_monthly,
    }
)

ds_out.to_netcdf("output/ndvi_monthly_composites.nc", encoding=encoding)
print("Written:", ds_out)

Parameter Reference

Parameter / option	Type	Default	Usage note
`chunks={"x": N, "y": N}`	`dict`	`None` (eager)	Passed to `open_rasterio`; set to 2048 for 10 m Sentinel-2 on 16 GB RAM
`masked=True`	`bool`	`False`	Converts NoData to `NaN`; mandatory for `skipna` aggregations to work correctly
`resample(time="MS")`	`str` freq	—	`"MS"` = month start; `"QS-DEC"` = seasonal; `"YS"` = annual
`skipna=True`	`bool`	`False`	Must be explicit in `mean()`, `median()`, `std()`, `quantile()` — omitting it silently returns NaN for any window containing masked pixels
`drop=True` in `rio.clip()`	`bool`	`False`	Trims the array to the bounding box of the AOI, reducing downstream array size
`all_touched=False` in `rio.clip()`	`bool`	`False`	Only pixels whose centres fall inside the polygon are retained; set `True` for thin linear features
`keep="first"` in `duplicated()`	`str`	`"first"`	Retains the first scene when timestamps collide; switch to `"last"` for reprocessed tiles
`dask="parallelized"` in `apply_ufunc`	`str`	`"forbidden"`	Enables Dask parallelism for pixel-wise functions; requires `output_dtypes`

Verification and Testing

After writing the output file, confirm the composite dimensions, coordinate ranges, and value distribution before handing the file to any downstream model.

import xarray as xr
import numpy as np

ds = xr.open_dataset("output/ndvi_monthly_composites.nc")

# 1. Dimension check — time should equal number of months in source archive
assert "time" in ds.dims, "Missing time dimension"
assert ds["ndvi_monthly_mean"].dims == ("time", "y", "x")

# 2. Coordinate sanity
time_delta = ds.time.diff("time").values
assert all(td > np.timedelta64(0) for td in time_delta), "Time axis not monotonically increasing"

# 3. Value range for NDVI (−1 to 1); adapt for EVI, NBR, etc.
vmin = float(ds["ndvi_monthly_mean"].min(skipna=True))
vmax = float(ds["ndvi_monthly_mean"].max(skipna=True))
assert -1.05 <= vmin, f"NDVI below physical minimum: {vmin}"
assert vmax <= 1.05,  f"NDVI above physical maximum: {vmax}"

# 4. Coverage check — warn if any month has less than 30% valid pixels
coverage = (ds["valid_count"] >= 3).mean(dim=["y", "x"])
low_coverage_months = ds.time.where(coverage < 0.30, drop=True)
if low_coverage_months.size > 0:
    print("Low-coverage months (< 30% valid pixels):", low_coverage_months.values)

print("Verification passed.")

For visual inspection, plot a spatial map of valid_count for the worst month — this immediately identifies persistent cloud-cover zones or missing tiles that need gap-filling before the composite enters any analysis.

Troubleshooting

Memory exhaustion during .compute() Cause: chunk size too large for available RAM, or .load() called on the full stack before aggregation. Fix: Reduce chunks to {"x": 1024, "y": 1024}, avoid .load() on intermediate arrays, and call .persist() only when an intermediate result is referenced multiple times. For archives exceeding ~500 scenes, switch to a Zarr store pre-built with xarray_beam or odc-stac to eliminate the per-file task graph.

NaN-only composites when pixels should be valid Cause: skipna not set, or the quality mask was applied with zero encoding (0 = invalid) rather than NaN, so xarray treats 0 as a valid observation. Fix: Confirm masking was applied with da.where(~qa_mask) and not da * (1 - qa_mask). Print da_clipped.isnull().mean() to verify NaN fraction before aggregating.

Duplicate timestamps causing incorrect pixel counts Cause: Two scenes from adjacent orbital paths acquired seconds apart share the same date string after truncation to day precision. Fix: Apply the deduplication step in Step 2 before any aggregation. If both scenes are needed for gap-filling, mosaic them first (see Seamless Mosaicking and Edge Blending) so the stack holds one representative scene per date.

ValueError: cannot reindex or align along dimension 'time' when concatenating Cause: One or more files have an incompatible spatial grid (different resolution or origin after resampling). xr.concat along time requires matching x and y coordinates. Fix: Run rasterio.transform checks on each file before loading, or standardise grids with Advanced Resampling and Upscaling Techniques before stacking.

Theil-Sen slope returns NaN for most pixels Cause: Fewer than 4 valid observations per pixel after cloud masking — the threshold in the theil_sen_slope function. Fix: Either extend the time window to include more scenes, or lower the guard threshold (but document it), or use harmonic regression (numpy.polyfit with seasonal terms) which tolerates fewer observations.

Satellite Processing Workflows & Index Pipelines — parent section covering the full pipeline from ingestion to index delivery.
Cloud and Shadow Masking Strategies — produce the quality masks that temporal aggregation depends on; covers Fmask and s2cloudless integration.
Spectral Index Calculation Pipelines — compute NDVI, EVI, NBR, and similar indices before stacking so band-ratio values, not raw reflectances, enter the time series.
Seamless Mosaicking and Edge Blending — merge overlapping scenes along orbital boundaries before temporal stacking to eliminate step-function radiometric artefacts.
Advanced Resampling and Upscaling Techniques — harmonise spatial resolution across sensors before concatenating multi-sensor archives into a single time series.
Band Math Operations with xarray — the xarray DataArray model that underpins every operation on this page, with broadcasting rules and arithmetic examples.