Temporal Aggregation and Time-Series Analysis

Raw satellite archives are sequences of snapshots, each carrying cloud contamination, orbital gaps, and sensor noise. Converting those snapshots into statistically robust composites — monthly medians, phenological envelopes, interannual trend slopes — is the central challenge of satellite time-series work. This page walks through a production-grade Python pipeline for stacking, masking, aggregating, and validating multi-temporal raster data using xarray and dask. For the full architectural context of where temporal aggregation fits, see Satellite Processing Workflows & Index Pipelines.

Temporal aggregation pipeline: raw scenes to monthly composites A left-to-right data-flow diagram showing five stages: Raw Scenes, Clip and Mask, Stack DataArray, Resample and Aggregate, and Output Composites. Arrows connect each stage in sequence. Raw Scenes GeoTIFFs / COGs STAC assets Clip + Mask AOI crop cloud / QA mask Stack DataArray xr.concat along time dask chunks Resample + Aggregate mean / median / p90 Theil-Sen slope Output NetCDF / Zarr monthly composites Step 1 Steps 2–3 Step 4 Step 5 Step 6

Prerequisites

Install the required libraries before running any code on this page:

pip install "xarray>=2023.1.0" "rioxarray>=0.15.0" "dask[array]>=2023.1.0" \
            "rasterio>=1.3.0" "numpy>=1.24" "pandas>=2.0" "netcdf4>=1.6"

Version table

Library Minimum version Why required
xarray 2023.1.0 resample() / groupby() with named time frequencies; DataArray coordinate handling
rioxarray 0.15.0 Registers the .rio accessor for CRS-aware clipping and write-back
dask[array] 2023.1.0 Lazy chunked arrays that defer computation until .compute() is called
rasterio 1.3.0 Low-level GeoTIFF I/O and coordinate transforms
numpy 1.24 Array math; required by xarray and dask backends
pandas 2.0 DatetimeIndex and timestamp parsing utilities
netcdf4 1.6 Backend engine for xr.Dataset.to_netcdf()

Conceptual prerequisites

All input rasters must share an identical CRS, spatial resolution, and grid origin. Temporal metadata must use UTC-normalised datetime objects. Unmasked cloud pixels introduce outliers that corrupt every percentile-based composite you produce.


Step-by-Step Workflow

Step 1 — Ingest rasters and parse acquisition timestamps

Load files lazily, extracting ISO-formatted dates from filenames. Attach each date as a time coordinate so xarray can later aggregate along that axis.

import xarray as xr
import rioxarray  # registers the .rio accessor
import pandas as pd
import numpy as np
from pathlib import Path
import glob

def load_scene(path: str, date: pd.Timestamp) -> xr.DataArray:
    """Open a single GeoTIFF as a 1-date DataArray backed by dask."""
    da = rioxarray.open_rasterio(
        path,
        masked=True,           # NoData → NaN automatically
        chunks={"x": 2048, "y": 2048},
    )
    da = da.squeeze("band", drop=True)   # drop the single-band axis
    da = da.expand_dims(time=[date])
    return da

file_paths = sorted(glob.glob("data/ndvi_*.tif"))

# Extract the date from the trailing YYYY-MM-DD in the stem, e.g. ndvi_2023-06-15.tif
dates = [pd.Timestamp(Path(p).stem.split("_")[-1]) for p in file_paths]

arrays = [load_scene(p, d) for p, d in zip(file_paths, dates)]
da_stack = xr.concat(arrays, dim="time")
# da_stack is now (time, y, x) — fully dask-backed, nothing computed yet
print(da_stack)

Keep chunks at {"x": 2048, "y": 2048, "time": 1}. Setting time: 1 means each task in the Dask graph handles exactly one scene, so memory usage stays flat regardless of archive depth.

Step 2 — Deduplicate the time axis

Multi-scene archives often contain overlapping acquisitions from adjacent orbital paths. Aggregating duplicate timestamps inflates pixel counts and biases means.

time_index = da_stack.indexes["time"]
da_stack = da_stack.sel(time=~time_index.duplicated(keep="first"))

If you need to combine overlapping scenes rather than drop duplicates (for example, to fill spatial gaps at swath edges), apply Seamless Mosaicking and Edge Blending before stacking so the per-scene rasters are already spatially merged.

Step 3 — Clip to area of interest and apply quality masks

Aggregating full-tile extents wastes memory and compute time. Clip first, then mask.

import geopandas as gpd

aoi = gpd.read_file("aoi/study_region.gpkg")

# Clip preserves the CRS and geotransform — requires rioxarray >= 0.15
da_clipped = da_stack.rio.clip(
    aoi.geometry.values,
    aoi.crs,
    drop=True,
    all_touched=False,
)

# Apply a pre-computed boolean quality mask (True = invalid pixel)
# qa_mask must share the same spatial grid; time axis broadcast automatically
# da_clipped = da_clipped.where(~qa_mask)

If your quality layer comes from Fmask or s2cloudless, the Cloud and Shadow Masking Strategies page explains how to build a boolean DataArray that aligns with your stacked scenes.

Step 4 — Validate the valid-pixel distribution before aggregation

Aggregating sparse time series without checking coverage produces silent statistical failures: a “monthly median” built from one clear pixel is as misleading as no composite at all.

# Count valid (non-NaN) observations per pixel per month
valid_monthly = da_clipped.resample(time="MS").count()

# Flag pixels where fewer than 3 valid scenes contributed
insufficient_coverage = valid_monthly < 3

# Quick summary to the console — examine before committing to aggregation
pct_covered = float((~insufficient_coverage).mean()) * 100
print(f"Pixels with ≥3 valid monthly observations: {pct_covered:.1f}%")

A coverage below ~70 % in a given month is a signal to either widen the compositing window or flag that month as unreliable in downstream models.

Step 5 — Temporal aggregation and metric extraction

Execute statistical reductions along the time dimension. Use skipna=True universally, then mask composites with the coverage threshold computed in Step 4.

# Monthly composites
monthly_mean   = da_clipped.resample(time="MS").mean(skipna=True)
monthly_median = da_clipped.resample(time="MS").median(skipna=True)
monthly_p90    = da_clipped.resample(time="MS").quantile(0.90, skipna=True)

# Guard with coverage threshold — pixels below threshold become NaN
monthly_mean   = monthly_mean.where(valid_monthly >= 3)
monthly_median = monthly_median.where(valid_monthly >= 3)

# Annual variability
annual_std = da_clipped.groupby("time.year").std(skipna=True)

# Seasonal grouping (DJF, MAM, JJA, SON)
seasonal_mean = da_clipped.resample(time="QS-DEC").mean(skipna=True)

For trend extraction, use scipy.stats.theilslopes applied pixel-wise via xr.apply_ufunc. Theil-Sen is robust to cloud-induced outliers in a way that ordinary least squares is not.

from scipy.stats import theilslopes
import xarray as xr

def theil_sen_slope(values: np.ndarray) -> float:
    """Return Theil-Sen slope for a 1-D time series; NaN if insufficient data."""
    mask = ~np.isnan(values)
    if mask.sum() < 4:
        return np.nan
    result = theilslopes(values[mask])
    return float(result.slope)

# Apply pixel-wise — dask will parallelise across spatial chunks
slope_map = xr.apply_ufunc(
    theil_sen_slope,
    da_clipped,
    input_core_dims=[["time"]],
    vectorize=True,
    dask="parallelized",
    output_dtypes=[float],
)

Step 6 — Write outputs and validate

Trigger computation and persist to CF-compliant NetCDF. Include compression to keep file sizes manageable.

encoding = {
    "ndvi": {"zlib": True, "complevel": 4, "dtype": "float32"}
}

# Build a Dataset so metadata travels with the arrays
ds_out = xr.Dataset(
    {
        "ndvi_monthly_mean":   monthly_mean.rename("ndvi"),
        "ndvi_monthly_median": monthly_median.rename("ndvi"),
        "valid_count":         valid_monthly,
    }
)

ds_out.to_netcdf("output/ndvi_monthly_composites.nc", encoding=encoding)
print("Written:", ds_out)

Parameter Reference

Parameter / option Type Default Usage note
chunks={"x": N, "y": N} dict None (eager) Passed to open_rasterio; set to 2048 for 10 m Sentinel-2 on 16 GB RAM
masked=True bool False Converts NoData to NaN; mandatory for skipna aggregations to work correctly
resample(time="MS") str freq "MS" = month start; "QS-DEC" = seasonal; "YS" = annual
skipna=True bool False Must be explicit in mean(), median(), std(), quantile() — omitting it silently returns NaN for any window containing masked pixels
drop=True in rio.clip() bool False Trims the array to the bounding box of the AOI, reducing downstream array size
all_touched=False in rio.clip() bool False Only pixels whose centres fall inside the polygon are retained; set True for thin linear features
keep="first" in duplicated() str "first" Retains the first scene when timestamps collide; switch to "last" for reprocessed tiles
dask="parallelized" in apply_ufunc str "forbidden" Enables Dask parallelism for pixel-wise functions; requires output_dtypes

Verification and Testing

After writing the output file, confirm the composite dimensions, coordinate ranges, and value distribution before handing the file to any downstream model.

import xarray as xr
import numpy as np

ds = xr.open_dataset("output/ndvi_monthly_composites.nc")

# 1. Dimension check — time should equal number of months in source archive
assert "time" in ds.dims, "Missing time dimension"
assert ds["ndvi_monthly_mean"].dims == ("time", "y", "x")

# 2. Coordinate sanity
time_delta = ds.time.diff("time").values
assert all(td > np.timedelta64(0) for td in time_delta), "Time axis not monotonically increasing"

# 3. Value range for NDVI (−1 to 1); adapt for EVI, NBR, etc.
vmin = float(ds["ndvi_monthly_mean"].min(skipna=True))
vmax = float(ds["ndvi_monthly_mean"].max(skipna=True))
assert -1.05 <= vmin, f"NDVI below physical minimum: {vmin}"
assert vmax <= 1.05,  f"NDVI above physical maximum: {vmax}"

# 4. Coverage check — warn if any month has less than 30% valid pixels
coverage = (ds["valid_count"] >= 3).mean(dim=["y", "x"])
low_coverage_months = ds.time.where(coverage < 0.30, drop=True)
if low_coverage_months.size > 0:
    print("Low-coverage months (< 30% valid pixels):", low_coverage_months.values)

print("Verification passed.")

For visual inspection, plot a spatial map of valid_count for the worst month — this immediately identifies persistent cloud-cover zones or missing tiles that need gap-filling before the composite enters any analysis.


Troubleshooting

Memory exhaustion during .compute() Cause: chunk size too large for available RAM, or .load() called on the full stack before aggregation. Fix: Reduce chunks to {"x": 1024, "y": 1024}, avoid .load() on intermediate arrays, and call .persist() only when an intermediate result is referenced multiple times. For archives exceeding ~500 scenes, switch to a Zarr store pre-built with xarray_beam or odc-stac to eliminate the per-file task graph.

NaN-only composites when pixels should be valid Cause: skipna not set, or the quality mask was applied with zero encoding (0 = invalid) rather than NaN, so xarray treats 0 as a valid observation. Fix: Confirm masking was applied with da.where(~qa_mask) and not da * (1 - qa_mask). Print da_clipped.isnull().mean() to verify NaN fraction before aggregating.

Duplicate timestamps causing incorrect pixel counts Cause: Two scenes from adjacent orbital paths acquired seconds apart share the same date string after truncation to day precision. Fix: Apply the deduplication step in Step 2 before any aggregation. If both scenes are needed for gap-filling, mosaic them first (see Seamless Mosaicking and Edge Blending) so the stack holds one representative scene per date.

ValueError: cannot reindex or align along dimension 'time' when concatenating Cause: One or more files have an incompatible spatial grid (different resolution or origin after resampling). xr.concat along time requires matching x and y coordinates. Fix: Run rasterio.transform checks on each file before loading, or standardise grids with Advanced Resampling and Upscaling Techniques before stacking.

Theil-Sen slope returns NaN for most pixels Cause: Fewer than 4 valid observations per pixel after cloud masking — the threshold in the theil_sen_slope function. Fix: Either extend the time window to include more scenes, or lower the guard threshold (but document it), or use harmonic regression (numpy.polyfit with seasonal terms) which tolerates fewer observations.