Implementing Fmask and s2cloudless in Python

Combine a physics-based Fmask QA band with the s2cloudless machine-learning probability map using a logical OR merge: a pixel is masked if either detector flags it. This hybrid approach outperforms either method alone, especially for semi-transparent cirrus and topographic shadow. For broader context on when to use this strategy versus simpler SCL-flag approaches, see Cloud and Shadow Masking Strategies.

from s2cloudless import S2PixelCloudDetector
import numpy as np, rasterio

def hybrid_mask(band_paths, fmask_path, threshold=0.45):
    stack = np.stack([
        rasterio.open(p).read(1).astype(np.float32) / 10000.0
        for p in band_paths
    ], axis=0)                                    # (10, H, W)
    inp = stack.transpose(1, 2, 0)[np.newaxis]   # (1, H, W, 10)
    prob = S2PixelCloudDetector(threshold=threshold, all_bands=True
        ).get_cloud_probability_maps(inp)[0]      # (H, W)
    with rasterio.open(fmask_path) as src:
        qa = src.read(1)
    return (prob >= threshold) | (qa == 4) | (qa == 2)

Why this combination arises in remote sensing pipelines

Single-model cloud detection fails in predictable ways. Fmask relies on physical spectral thresholds — brightness temperature, NDVI ratios — which are reliable for thick convective cloud and geometric shadow projection but miss thin cirrus that lacks a thermal signal. s2cloudless uses a LightGBM classifier trained on Sentinel-2 top-of-atmosphere (TOA) reflectance and handles cirrus well, but can underdetect shadows and small opaque clouds where the spectral signature is ambiguous.

This pairing is therefore standard inside Satellite Processing Workflows & Index Pipelines: run both detectors in parallel, union their boolean outputs, and pass the result downstream before any spectral index calculation or temporal aggregation so cloud-contaminated pixels never reach the analytics layer.

The data-flow below shows how the two detectors feed into the fusion step:

Hybrid cloud masking data flow Sentinel-2 L1C bands are split into two parallel detection paths — s2cloudless ML probability and Fmask QA categorical — which are combined by a logical OR into a single boolean cloud-shadow mask. Sentinel-2 L1C bands DN ÷ 10000 → [0, 1] s2cloudless LightGBM probability map threshold ≥ 0.45 → boolean Fmask (CLI) QA categorical band values 2 (shadow) + 4 (cloud) Logical OR fusion Boolean mask → uint8 COG

Environment and setup

Package Minimum version Purpose
s2cloudless 1.5.0 ML cloud probability from Sentinel-2 L1C
rasterio 1.3.0 Band I/O and COG output
numpy 1.24 Array operations and boolean fusion
lightgbm 3.3 Installed automatically by s2cloudless
pip install s2cloudless>=1.5.0 rasterio>=1.3.0 numpy>=1.24

Band order is strict. s2cloudless with all_bands=True requires exactly these 10 bands in order: B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12. Passing bands in the wrong order silently degrades accuracy without raising an error.

Fmask is a CLI tool. The official Python bindings are deprecated and unstable outside conda-managed environments. Invoke Fmask via subprocess before running this script, then point fmask_path at the generated *_fmask.tif QA file.

Complete working example

The function below processes a full Sentinel-2 scene using chunked rasterio windows to keep memory bounded. All band files are read once at the start; Fmask QA is decoded in a single pass; s2cloudless inference runs tile by tile.

import numpy as np
import rasterio
from rasterio.windows import Window
from s2cloudless import S2PixelCloudDetector
from pathlib import Path


def load_and_scale_bands(band_paths: list[str]) -> tuple[np.ndarray, dict]:
    """
    Load Sentinel-2 L1C bands into a float32 stack normalized to [0, 1].

    band_paths must follow s2cloudless band order:
    B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12
    """
    with rasterio.open(band_paths[0]) as src:
        profile = src.profile.copy()
        height, width = src.height, src.width

    # Sentinel-2 L1C quantification value: DN / 10000 = TOA reflectance
    stack = np.zeros((len(band_paths), height, width), dtype=np.float32)
    for i, path in enumerate(band_paths):
        with rasterio.open(path) as src:
            stack[i] = np.clip(src.read(1, out_dtype=np.float32) / 10000.0, 0.0, 1.0)

    return stack, profile


def decode_fmask_qa(qa_path: str) -> np.ndarray:
    """
    Return a boolean array where True means cloud or cloud shadow.

    Fmask 4.x categorical values:
      0 = clear land, 1 = water, 2 = cloud shadow,
      3 = snow/ice,   4 = cloud, 255 = fill/nodata
    """
    with rasterio.open(qa_path) as src:
        qa = src.read(1)
    return (qa == 4) | (qa == 2)


def generate_hybrid_mask(
    band_paths: list[str],
    fmask_qa_path: str,
    output_path: str,
    cloud_threshold: float = 0.45,
    chunk_size: int = 1024,
) -> None:
    """
    Write a uint8 boolean cloud-shadow mask by merging s2cloudless
    probability and Fmask QA via logical OR.

    Parameters
    ----------
    band_paths       : 10 Sentinel-2 L1C band files in s2cloudless order
    fmask_qa_path    : path to pre-generated Fmask *_fmask.tif
    output_path      : destination COG path (uint8, 0=clear, 1=masked)
    cloud_threshold  : s2cloudless probability cut-off (default 0.45)
    chunk_size       : tile side length in pixels for chunked inference
    """
    stack, profile = load_and_scale_bands(band_paths)
    height, width = stack.shape[1], stack.shape[2]

    # Instantiate detector once — the LightGBM model loads at construction
    detector = S2PixelCloudDetector(threshold=cloud_threshold, all_bands=True)

    # Load Fmask for the full scene in one read (fast sequential access)
    fmask_full = decode_fmask_qa(fmask_qa_path)

    # Output is uint8 boolean; deflate compression saves ~70 % vs uncompressed
    out_profile = profile.copy()
    out_profile.update(dtype="uint8", count=1, compress="deflate", nodata=None)

    with rasterio.open(output_path, "w", **out_profile) as dst:
        for row in range(0, height, chunk_size):
            for col in range(0, width, chunk_size):
                win = Window(
                    col, row,
                    min(chunk_size, width - col),
                    min(chunk_size, height - row),
                )

                # ── s2cloudless inference ────────────────────────────────────
                # Slice shape: (bands, H, W) → transpose → (H, W, bands)
                # Add batch dimension: (1, H, W, bands) as expected by the API
                chunk = stack[
                    :,
                    win.row_off : win.row_off + win.height,
                    win.col_off : win.col_off + win.width,
                ].transpose(1, 2, 0)[np.newaxis]
                # get_cloud_probability_maps returns (n_samples, H, W)
                prob = detector.get_cloud_probability_maps(chunk)[0]
                ml_mask = prob >= cloud_threshold

                # ── Fmask QA slice ───────────────────────────────────────────
                fmask_chunk = fmask_full[
                    win.row_off : win.row_off + win.height,
                    win.col_off : win.col_off + win.width,
                ]

                # ── Fusion: flag pixel if EITHER detector marks it ───────────
                hybrid = (ml_mask | fmask_chunk).astype(np.uint8)
                dst.write(hybrid, 1, window=win)

Variant patterns

Fmask 5.x bit-packed QA

Fmask 5.x switched from categorical integers to a bit-packed uint8 for compact storage. Replace the equality checks with bitwise masks:

def decode_fmask5_qa(qa_path: str) -> np.ndarray:
    """Decode Fmask 5.x bit-packed QA band (check sensor-specific docs for bit definitions)."""
    with rasterio.open(qa_path) as src:
        qa = src.read(1)
    # Bit 0 = cloud, bit 1 = cloud shadow (verify against your Fmask 5.x release notes)
    fmask_cloud  = (qa & 0b00000001) > 0
    fmask_shadow = (qa & 0b00000010) > 0
    return fmask_cloud | fmask_shadow

Always cross-check the bit positions against the Fmask algorithm reference for your sensor version — bit definitions differ between Landsat and Sentinel-2 outputs.

Dask-enabled variant for large AOIs

When the scene footprint exceeds available RAM, replacing the explicit chunk loop with rioxarray and Dask parallelizes tile processing across CPU cores without manual window management. This pattern integrates naturally with Advanced Resampling and Upscaling Techniques when spatial resolution harmonization is required before masking.

import rioxarray
import xarray as xr
import dask.array as da

def dask_hybrid_mask(
    band_paths: list[str],
    fmask_qa_path: str,
    cloud_threshold: float = 0.45,
    chunks: dict = {"x": 1024, "y": 1024},
) -> xr.DataArray:
    """
    Build the hybrid mask lazily with Dask.
    Call .compute() to materialise or .rio.to_raster(...) to write.
    """
    # Load bands as a lazy Dask-backed DataArray
    arrays = [
        rioxarray.open_rasterio(p, chunks=chunks).squeeze("band", drop=True) / 10000.0
        for p in band_paths
    ]
    stack = xr.concat(arrays, dim="band")          # (band, y, x)

    # Fmask (small enough to load eagerly)
    with rasterio.open(fmask_qa_path) as src:
        fmask_np = src.read(1)
    fmask_da = da.from_array(fmask_np, chunks=(chunks["y"], chunks["x"]))
    fmask_mask = (fmask_da == 4) | (fmask_da == 2)

    # s2cloudless does not accept Dask arrays directly — map_blocks bridges the gap
    def _apply_detector(block: np.ndarray) -> np.ndarray:
        # block shape: (bands, H, W)
        inp = block.transpose(1, 2, 0)[np.newaxis]
        prob = S2PixelCloudDetector(
            threshold=cloud_threshold, all_bands=True
        ).get_cloud_probability_maps(inp)[0]
        return (prob >= cloud_threshold).astype(np.uint8)

    ml_mask = da.map_blocks(
        _apply_detector,
        stack.data,                        # (bands, y, x) Dask array
        dtype=np.uint8,
        drop_axis=0,                       # output drops the band dimension
    )

    hybrid = da.logical_or(ml_mask, fmask_mask).astype(np.uint8)
    return xr.DataArray(hybrid, coords={"y": stack.y, "x": stack.x})

Applying the mask to downstream index calculations

Once the boolean mask is written, apply it before computing any vegetation or water index so contaminated pixels are excluded from statistics. This step connects directly to Spectral Index Calculation Pipelines:

import numpy as np
import rasterio

def apply_cloud_mask_to_band(
    band_path: str,
    mask_path: str,
    nodata_value: float = np.nan,
) -> tuple[np.ndarray, dict]:
    """Replace cloud/shadow pixels with nodata before index arithmetic."""
    with rasterio.open(band_path) as src:
        band = src.read(1, out_dtype=np.float32)
        profile = src.profile.copy()
    with rasterio.open(mask_path) as src:
        mask = src.read(1).astype(bool)
    band[mask] = nodata_value
    profile.update(dtype="float32", nodata=nodata_value)
    return band, profile

Common errors

ValueError: Input data has incorrect shapes2cloudless received an array with the wrong number of bands or a shape that is not (n, H, W, 10). Verify that band_paths contains exactly 10 files in the documented order and that the transpose step produces (H, W, 10) before the np.newaxis batch dimension is added.

rasterio.errors.CRSError: Unable to open EPSG support file — PROJ environment variables are not set. Resolve with export PROJ_LIB=$(python -c "import pyproj; print(pyproj.datadir.get_data_dir())"), or install the pyproj wheel which bundles PROJ data automatically.

MemoryError during full-scene band load — The default approach loads all 10 bands into RAM before chunking. For 60 m/20 m mixed-resolution scenes at full swath width (110 km), reduce chunk_size to 512 or switch to the Dask variant above.

Frequently asked questions

Which Sentinel-2 bands does s2cloudless require? s2cloudless expects exactly 10 bands in this order: B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12, all normalized to [0, 1] TOA reflectance. Omitting B10 (cirrus) in particular removes the detector’s primary cirrus signal.

What do Fmask categorical values mean? In Fmask 4.x: 0 = clear land, 1 = water, 2 = cloud shadow, 3 = snow/ice, 4 = cloud, 255 = fill/nodata. Fmask 5.x switches to bit-packed encoding where individual bits represent each class.

Can I call Fmask directly from Python? The official Python bindings are deprecated and unstable outside conda-managed environments. Production workflows invoke Fmask via subprocess, then read the resulting *_fmask.tif QA band with rasterio. Direct in-memory execution via ctypes or pybind11 is not supported in modern environments.

How should I tune the cloud threshold? The default 0.45 balances precision and recall for mid-latitude Sentinel-2 scenes. In tropical regions or areas with persistent haze, lower the threshold to 0.35 and validate against a stratified random sample of manually labelled pixels. A well-tuned hybrid mask typically achieves F1 ≥ 0.92 for clouds and F1 ≥ 0.88 for shadows across diverse biomes.