Cloud and Shadow Masking Strategies

Atmospheric and shadow contamination is one of the most pervasive quality problems in optical satellite time series. Even a small fraction of unmasked cloud pixels propagates spectral errors into spectral index calculation pipelines, corrupts land-cover classifications, and produces spurious change signals in temporal analysis. This guide walks through a production-tested Python workflow — QA bit decoding, spectral thresholding, geometric shadow projection, and morphological cleanup — that produces reliable boolean masks for Sentinel-2 and Landsat scenes. For the broader architectural context, see Satellite Processing Workflows & Index Pipelines.


Cloud and Shadow Masking Pipeline Five sequential stages: Ingest & Align, QA Bit Decode, Spectral Threshold, Shadow Projection, Morphological Refine, then masked output. Arrows connect each stage left to right. Ingest & Align Bands rasterio QA Bit Decode numpy bitwise Spectral Threshold SWIR / blue ratio Shadow Projection scipy.ndimage.shift Morphological Refinement binary open/close Bool Mask Cloud & Shadow Masking Pipeline Steps 2 & 3 produce independent boolean arrays — union them before shadow projection

Prerequisites

Install the required packages:

pip install rasterio numpy scipy scikit-image pyproj
Library Min version Why required
rasterio 1.3.0 Raster I/O, affine transforms, CRS handling
numpy 1.24 Bitwise operations, array broadcasting
scipy 1.10 Morphological filters, geometric shift
scikit-image 0.20 disk structuring element for morphology
pyproj 3.4 CRS conversion for shadow azimuth math

Conceptual prerequisites. You should be comfortable reading surface reflectance products (L2A for Sentinel-2, Collection 2 Level-2 for Landsat) and understand how affine transforms map pixel indices to geographic coordinates — see Handling Pixel Resolution and Scaling if you need a refresher. Cloud masking also benefits from applying Automated Image Clipping and Cropping early in the pipeline: cropping to your area of interest before masking reduces array size and cuts bitwise operation time significantly.


Step-by-Step Workflow

Step 1 — Ingest and Spatially Align Bands

Load surface reflectance bands and the provider QA layer into a common NumPy array. Verify that all inputs share the same affine transform and CRS before any mask operation; misaligned QA layers are the leading source of edge-artifact failures.

import rasterio
import numpy as np
from pathlib import Path


def load_aligned_raster(
    band_paths: list[Path],
    qa_path: Path,
) -> tuple[np.ndarray, np.ndarray, dict, rasterio.Affine]:
    """Load reflectance bands and a QA layer, asserting spatial alignment.

    Returns
    -------
    reflectance : ndarray, shape (B, H, W), float32
    qa_band     : ndarray, shape (H, W), uint16 or uint8
    meta        : rasterio profile dict for the first band
    transform   : affine geotransform
    """
    with rasterio.open(band_paths[0]) as src:
        meta = src.meta.copy()
        ref_transform = src.transform
        ref_crs = src.crs
        ref_shape = (src.height, src.width)

    bands: list[np.ndarray] = []
    for p in band_paths:
        with rasterio.open(p) as src:
            assert src.transform == ref_transform, f"Transform mismatch: {p}"
            assert src.crs == ref_crs, f"CRS mismatch: {p}"
            bands.append(src.read(1).astype(np.float32))

    reflectance = np.stack(bands)  # (B, H, W)

    with rasterio.open(qa_path) as qa_src:
        assert qa_src.shape == ref_shape, "QA layer shape mismatch"
        qa_band = qa_src.read(1)

    return reflectance, qa_band, meta, ref_transform

Step 2 — Decode Provider QA Bit Flags

Sentinel-2 L2A ships a dedicated Scene Classification Layer (SCL) with integer class codes. Landsat Collection 2 uses the QA_PIXEL band with bitwise-packed flags. Both strategies extract cloud and shadow pixels without any additional spectral computation.

def decode_sentinel2_scl(scl: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Decode the Sentinel-2 SCL band into cloud and shadow boolean masks.

    SCL class codes:
      3  = cloud shadow
      8  = medium-probability cloud
      9  = high-probability cloud
      10 = thin cirrus
    """
    cloud_mask = np.isin(scl, [8, 9, 10])
    shadow_mask = (scl == 3)
    return cloud_mask.astype(bool), shadow_mask.astype(bool)


def decode_landsat_qa_pixel(
    qa_band: np.ndarray,
    dilated_cloud_bit: int = 1,
    cloud_bit: int = 3,
    shadow_bit: int = 4,
) -> tuple[np.ndarray, np.ndarray]:
    """Extract cloud and shadow masks from the Landsat Collection 2 QA_PIXEL band.

    Bit positions follow the USGS Collection 2 specification:
      bit 1 = dilated cloud
      bit 3 = cloud
      bit 4 = cloud shadow
    """
    cloud_mask = (
        ((qa_band >> dilated_cloud_bit) & 1).astype(bool)
        | ((qa_band >> cloud_bit) & 1).astype(bool)
    )
    shadow_mask = ((qa_band >> shadow_bit) & 1).astype(bool)
    return cloud_mask, shadow_mask

Step 3 — Supplement with Spectral Thresholds

Provider QA flags occasionally miss thin cirrus or hazy scenes, particularly in tropical or high-latitude environments. A lightweight spectral filter based on the SWIR-to-blue ratio catches the residual contamination. Union the spectral result with the QA-decoded mask so neither source is authoritative alone.

def spectral_cloud_filter(
    reflectance: np.ndarray,
    swir_idx: int = 4,
    blue_idx: int = 0,
    nir_idx: int = 3,
    red_idx: int = 2,
    cirrus_ratio: float = 0.8,
    dense_blue_thresh: float = 0.2,
    dense_ndvi_thresh: float = 1.1,
) -> np.ndarray:
    """Flag pixels not caught by QA bands using empirical spectral ratios.

    Tuned for Sentinel-2 and Landsat L2A surface reflectance (0–1 scale).
    Returns a boolean array of the same spatial shape.
    """
    eps = 1e-9
    blue = reflectance[blue_idx].astype(np.float32) + eps
    red  = reflectance[red_idx].astype(np.float32) + eps
    nir  = reflectance[nir_idx].astype(np.float32)
    swir = reflectance[swir_idx].astype(np.float32)

    # Thin cirrus: elevated SWIR reflectance relative to blue
    thin_cirrus = (swir / blue) > cirrus_ratio
    # Dense cloud: high blue brightness and suppressed vegetation signal
    dense_cloud = (reflectance[blue_idx] > dense_blue_thresh) & ((nir / red) < dense_ndvi_thresh)

    return thin_cirrus | dense_cloud


def combine_cloud_masks(
    qa_cloud: np.ndarray,
    qa_shadow: np.ndarray,
    spectral_cloud: np.ndarray,
) -> np.ndarray:
    """Union all cloud flags into a single boolean contamination mask."""
    return qa_cloud | qa_shadow | spectral_cloud

For scenes with persistent haze or complex topography, the dedicated probabilistic approaches in Implementing Fmask and s2cloudless in Python reduce commission errors further, at the cost of heavier dependencies.

Step 4 — Project Cloud Mask to Shadow Footprint

Cloud shadows cannot be reliably detected through spectral thresholding alone over dark water or dense forest. The standard approach shifts the cloud mask in the direction of the solar azimuth by a distance proportional to the cloud base height and local terrain elevation.

The pixel displacement distance D (pixels) is:

$$D = \frac{H_c - H_{dem}}{\tan(\theta_{elev}) \times \text{pixel_size}}$$

where $H_c$ is estimated cloud base height (m), $H_{dem}$ is terrain elevation (m), and $\theta_{elev}$ is solar elevation (degrees).

from scipy.ndimage import shift as ndimage_shift


def project_shadow_mask(
    cloud_mask: np.ndarray,
    solar_elev_deg: float,
    solar_azimuth_deg: float,
    pixel_size_m: float = 10.0,
    cloud_height_m: float = 2000.0,
    terrain_elev_m: float = 0.0,
) -> np.ndarray:
    """Shift the cloud mask along the solar azimuth to approximate shadow position.

    Parameters
    ----------
    cloud_mask       : bool array (H, W)
    solar_elev_deg   : solar elevation angle in degrees (from horizon)
    solar_azimuth_deg: solar azimuth in degrees (clockwise from north)
    pixel_size_m     : ground sampling distance in metres
    cloud_height_m   : estimated cloud base height above sea level (m)
    terrain_elev_m   : mean terrain elevation in the scene (m)

    Returns
    -------
    Boolean array of the same shape with projected shadow footprint.
    """
    elev_rad = np.deg2rad(solar_elev_deg)
    az_rad   = np.deg2rad(solar_azimuth_deg)

    net_height = max(cloud_height_m - terrain_elev_m, 0.0)
    distance_px = net_height / (np.tan(elev_rad) * pixel_size_m)

    # Row index increases downward; north-facing shift is negative row offset
    dy = -distance_px * np.cos(az_rad)
    dx =  distance_px * np.sin(az_rad)

    shadow_proj = ndimage_shift(
        cloud_mask.astype(np.float32),
        shift=(dy, dx),
        order=0,
        mode="constant",
        cval=0.0,
    )
    return shadow_proj > 0.5

Step 5 — Morphological Refinement and Final Mask Assembly

Raw spectral and geometric masks contain salt-and-pepper noise and fragmented boundaries. Binary opening removes isolated flag pixels; binary closing fills small interior holes. Union the morphologically cleaned cloud and projected shadow arrays to produce the final contamination mask.

from scipy.ndimage import binary_opening, binary_closing
from skimage.morphology import disk


def refine_mask(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Remove noise and close interior gaps with symmetric open→close operations."""
    struct = disk(radius)
    opened = binary_opening(mask, structure=struct)
    closed = binary_closing(opened, structure=struct)
    return closed.astype(bool)


def build_final_mask(
    cloud_mask: np.ndarray,
    shadow_mask: np.ndarray,
    projected_shadow: np.ndarray,
    morph_radius: int = 2,
) -> np.ndarray:
    """Combine and refine all contamination sources into one boolean mask.

    True  = contaminated (cloud or shadow), should be excluded from analysis.
    False = clear, suitable for downstream analytics.
    """
    raw = cloud_mask | shadow_mask | projected_shadow
    return refine_mask(raw, radius=morph_radius)

Parameter Reference

Parameter Type Default Notes
cloud_height_m float 2000.0 Cloud base height above sea level (m). Increase to 4000–6000 for deep convective clouds in tropics.
terrain_elev_m float 0.0 Scene mean elevation (m). Use a DEM median for high-relief terrain.
pixel_size_m float 10.0 Ground sampling distance (m). 10 m for Sentinel-2 10 m bands; 30 m for Landsat.
morph_radius int 2 Disk radius in pixels for open/close. Raise to 3–4 for scenes with broken cloud edges.
cirrus_ratio float 0.8 SWIR/blue ratio above which a pixel is flagged as thin cirrus. Lower to 0.6 in humid tropics.
dense_blue_thresh float 0.2 Blue-band reflectance threshold for dense cloud detection (0–1 scale).

Verification and Testing

After building the mask, validate coverage statistics before writing the output:

def validate_mask(
    final_mask: np.ndarray,
    scene_id: str,
    warn_threshold: float = 0.80,
) -> dict:
    """Compute per-scene mask statistics and warn on anomalous coverage.

    Returns a dict with 'scene_id', 'total_pixels', 'masked_pixels',
    'masked_fraction', and 'status'.
    """
    total   = final_mask.size
    masked  = int(final_mask.sum())
    frac    = masked / total

    status = "warn" if frac > warn_threshold else "ok"
    if status == "warn":
        import warnings
        warnings.warn(
            f"{scene_id}: {frac:.1%} of pixels masked — check solar angle metadata "
            "or QA bit positions for this sensor/processor version.",
            stacklevel=2,
        )

    return {
        "scene_id": scene_id,
        "total_pixels": total,
        "masked_pixels": masked,
        "masked_fraction": round(frac, 4),
        "status": status,
    }

Write the boolean mask as a separate single-band GeoTIFF alongside the reflectance data — never overwrite the original pixels:

def write_mask(
    mask: np.ndarray,
    meta: dict,
    output_path: Path,
) -> None:
    """Write a boolean mask as a uint8 GeoTIFF (0 = clear, 1 = masked)."""
    mask_meta = meta.copy()
    mask_meta.update({"count": 1, "dtype": "uint8", "nodata": None})

    with rasterio.open(output_path, "w", **mask_meta) as dst:
        dst.write(mask.astype(np.uint8), 1)

Visual check. Overlay the boolean mask on a true-colour RGB composite using any GIS tool or matplotlib.imshow. Cloud boundaries should coincide with bright white regions; shadow boundaries should fall geometrically downwind of their cloud source in the solar azimuth direction.


Troubleshooting

Shadow footprint is completely absent or wildly displaced. The most common cause is a solar azimuth expressed as meteorological convention (0° = south) rather than geographic convention (0° = north). Confirm the angle source: Sentinel-2 metadata uses geographic north; some older processors use a different convention. A 180° offset shifts shadows to the opposite side of the cloud.

QA mask covers large continuous areas that appear cloud-free in the RGB. This usually indicates a wrong bit position. Landsat Collection 1 and Collection 2 differ: Collection 2 moved the cloud-shadow bit from position 3 to 4. Always verify bit definitions against the product’s MTL.txt or STAC item properties rather than hardcoding values.

MemoryError on large scenes. Process the scene in spatial tiles using rasterio.windows. Read each window’s reflectance and QA slice, build the mask tile, and write it back before moving to the next. For guidance on efficient windowed reads, see Optimizing rasterio Window Reads for Memory Efficiency.

Mask edges show one-pixel-wide artefacts at tile boundaries. When splitting a scene into processing tiles, add an overlap buffer of at least morph_radius + 1 pixels on each edge and crop the buffer after morphological refinement. Alternatively, apply Seamless Mosaicking and Edge Blending at the mosaicking stage to suppress boundary artefacts globally.

High omission errors over thin cirrus in high-altitude scenes. Reduce cirrus_ratio from 0.8 to 0.5–0.6 and cross-validate against the SCL cirrus class (code 10 for Sentinel-2). If omission errors remain significant, replace the spectral filter with the probabilistic model from Implementing Fmask and s2cloudless in Python.


Frequently Asked Questions

Which QA bits flag clouds in Landsat Collection 2 QA_PIXEL? Bit 1 marks dilated cloud, bit 3 marks cloud, and bit 4 marks cloud shadow in the Landsat Collection 2 QA_PIXEL band. Decode with (qa >> bit) & 1 to extract individual flags without interfering with adjacent bits.

How accurate is geometric shadow projection compared to s2cloudless? Geometric projection is fast and deterministic but requires a reliable cloud-height estimate and a DEM for steep terrain. s2cloudless adds a machine-learning probability layer that typically reduces omission errors over mixed or hazy scenes at the cost of additional dependencies.

Should I store the mask as a separate file or apply it in-place? Always store the boolean mask as a separate GeoTIFF alongside the reflectance data. Applying it in-place destroys the original pixel values and prevents reprocessing with updated thresholds or a different algorithm.


Deep-Dive Articles