Extracting and Parsing Raster Metadata

Reliable remote sensing pipelines begin with accurate geospatial metadata. Before any pixel-level computation, band math, or spatial join can occur, your system must correctly interpret coordinate reference systems, affine transforms, data types, nodata sentinels, and sensor-specific tags. This guide provides a production-tested workflow for programmatically reading, validating, and structuring raster headers using rasterio — a foundational step within the broader Core Raster Fundamentals & STAC Mapping architecture.


Pipeline position

Metadata extraction sits at the entry point of every ingestion and analysis pipeline. It runs before CRS transformation, before pixel resolution normalisation, and before any data is read from disk or cloud storage. Parsed metadata drives all subsequent decisions: which resampling kernel to apply, whether to reproject, which nodata mask to propagate, and how to align multi-source grids.

The diagram below shows where metadata extraction sits in a typical pipeline:

[Figure: Raster metadata extraction pipeline position. Metadata extraction is the first step; it feeds a CRS validity check, nodata normalisation, and resolution alignment, which all converge into pixel-level operations. Flow: Open raster (rasterio.open) → Extract metadata (CRS · transform · nodata) → CRS validity check / Nodata normalisation / Resolution alignment → Pixel-level operations.]

Prerequisites

pip install "rasterio>=1.3" "pyproj>=3.4" "numpy>=1.24" "pydantic>=2.0"

For GDAL binary compatibility, prefer the conda-forge build:

conda install -c conda-forge rasterio pyproj numpy

| Library | Min version | Why required |
| --- | --- | --- |
| rasterio | 1.3 | Dataset reader, affine transform, tag access |
| pyproj | 3.4 | CRS string normalisation, authority lookups |
| numpy | 1.24 | Dtype introspection, nodata range checks |
| pydantic | 2.0 | Optional; strict model validation and JSON schema |


Step-by-step workflow

Step 1 — Open the raster safely

Always open files inside a with block. This guarantees that the underlying GDAL dataset handle is released immediately after parsing, which prevents file descriptor leaks in long-running workers or multiprocessing pools.

import rasterio

uri = "s3://my-bucket/scene.tif"   # an s3:// URI, a /vsis3/ path, or a local path all work

with rasterio.open(uri) as src:
    print(src.driver)       # GTiff, HDF5, JP2OpenJPEG, …
    print(src.count)        # number of bands
    print(src.width, src.height)

For cloud-hosted files, configure GDAL environment variables before the first open call to enable HTTP range requests and avoid full-file downloads:

import os
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif,.tiff"
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"   # for public buckets

Step 2 — Read core spatial attributes

The six attributes below (CRS, bounds, transform, resolution, dtype, and nodata) form the geometric and radiometric contract for the raster. Every downstream operation — reprojection, window reads, spatial joins — depends on them being correct.

with rasterio.open(uri) as src:
    # CRS: normalise to an authority string (e.g. "EPSG:32632")
    # to_authority() returns a tuple such as ("EPSG", "32632"), or None if no authority matches
    crs_auth = src.crs.to_authority() if src.crs else None
    crs_str = ":".join(crs_auth) if crs_auth else None

    # Bounding box in native CRS units (left, bottom, right, top)
    bounds = src.bounds           # BoundingBox namedtuple

    # Affine transform: maps pixel (col, row) → (x, y) in CRS units
    # src.transform[0]  = pixel width (positive)
    # src.transform[4]  = pixel height (negative for north-up rasters)
    transform = src.transform

    # Pixel dimensions — always positive regardless of image orientation
    res_x, res_y = abs(src.res[0]), abs(src.res[1])

    # Native data type and nodata sentinel
    dtype   = src.dtypes[0]   # uniform dtype assumed; check src.dtypes for heterogeneous bands
    nodata  = src.nodata      # may be None — handle below
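
The relationship between the affine coefficients, the bounds, and the pixel size can be checked by hand. The following is a minimal pure-Python sketch, assuming a north-up raster with no rotation; the function names `pixel_to_world` and `bounds_from_transform` and the sample coefficients are illustrative and not part of rasterio.

```python
def pixel_to_world(coeffs, col, row):
    """Map a pixel corner (col, row) to (x, y) using affine coefficients.

    coeffs is (a, b, c, d, e, f), the same order as affine.Affine:
    x = a*col + b*row + c,  y = d*col + e*row + f
    """
    a, b, c, d, e, f = coeffs
    return (a * col + b * row + c, d * col + e * row + f)


def bounds_from_transform(coeffs, width, height):
    """Return (left, bottom, right, top); assumes no rotation (b == d == 0)."""
    x0, y0 = pixel_to_world(coeffs, 0, 0)           # upper-left corner
    x1, y1 = pixel_to_world(coeffs, width, height)  # lower-right corner
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Illustrative values: a 10 m grid of 10980 x 10980 pixels
coeffs = (10.0, 0.0, 300000.0, 0.0, -10.0, 5800020.0)
print(bounds_from_transform(coeffs, 10980, 10980))
# → (300000.0, 5690220.0, 409800.0, 5800020.0)
```

Comparing a hand-computed extent like this against src.bounds is a quick way to confirm that the transform and the raster dimensions agree.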

Step 3 — Extract driver-specific tags

rasterio exposes GDAL’s metadata namespace hierarchy through src.tags(). Dataset-level tags are under the default namespace; sensor calibration, solar geometry, and acquisition timestamps often live in named namespaces like IMAGERY or RPC.

with rasterio.open(uri) as src:
    # Dataset-level tags (TIFF GDAL metadata domain)
    dataset_tags = src.tags()

    # Named namespace — e.g. IMAGERY for Landsat/Sentinel scene metadata
    imagery_tags = src.tags(ns="IMAGERY")

    # Per-band tags (band indices are 1-based in rasterio)
    band1_tags = src.tags(1)

    # GDAL Metadata Domain — driver-specific (e.g. HDF5 GROUP metadata)
    # src.tags(ns="HDF5_GLOBAL") works for HDF/NetCDF drivers

    print(dataset_tags.get("AREA_OR_POINT"))    # "Area" or "Point"
    print(imagery_tags.get("ACQUISITIONDATETIME"))
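
Tag values always arrive as strings, so timestamps need explicit parsing. The following is a minimal sketch, assuming the acquisition time is stored under ACQUISITIONDATETIME in an ISO 8601-like form and that naive timestamps are UTC; the helper name `parse_acquisition_time` is hypothetical, and real products may use other formats.

```python
from datetime import datetime, timezone


def parse_acquisition_time(tags):
    """Return a timezone-aware datetime from a tags dict, or None if absent or unparseable."""
    raw = tags.get("ACQUISITIONDATETIME")
    if not raw:
        return None
    # Older Python versions reject a trailing "Z" in fromisoformat
    text = raw.strip().replace("Z", "+00:00")
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    # Assumption: timestamps without an offset are UTC
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


print(parse_acquisition_time({"ACQUISITIONDATETIME": "2021-06-01T10:30:00Z"}))
```

Returning None instead of raising keeps a missing or malformed tag from aborting the whole extraction; the caller can decide whether an absent timestamp is an error.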

Step 4 — Validate parsed values

Raw header extraction is only half the work. Before values propagate to downstream computations, verify them against your pipeline’s expectations.

from pyproj import CRS
import numpy as np

def validate_metadata(crs_str, res_x, res_y, nodata, dtype):
    issues = []

    # 1. CRS must be parseable and not unknown
    if crs_str is None:
        issues.append("CRS is missing — assign one explicitly before processing")
    else:
        try:
            crs_obj = CRS.from_user_input(crs_str)
            if crs_obj.is_geographic and res_x > 1.0:
                issues.append(
                    f"CRS appears geographic but pixel size {res_x:.6f} looks like metres"
                )
        except Exception as e:
            issues.append(f"CRS unparseable: {e}")

    # 2. Resolution must be positive and plausible
    if res_x <= 0 or res_y <= 0:
        issues.append(f"Non-positive resolution: x={res_x}, y={res_y}")

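    # 3. Nodata must fit within the dtype's representable range
    if nodata is not None:
        np_dtype = np.dtype(dtype)
        is_int = np.issubdtype(np_dtype, np.integer)
        if isinstance(nodata, float) and np.isnan(nodata):
            # NaN is a valid sentinel for floats but cannot be stored in integers
            if is_int:
                issues.append(f"NaN nodata is not representable in {dtype}")
        else:
            info = np.iinfo(np_dtype) if is_int else np.finfo(np_dtype)
            if not (info.min <= nodata <= info.max):
                issues.append(f"Nodata {nodata} is outside {dtype} range [{info.min}, {info.max}]")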

    return issues

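Nodata inference fallback: when src.nodata is None, fall back to a conventional default for the dtype before proceeding. These defaults are conventions rather than guarantees (0 can be a valid measurement in some products), so keep the inferred value separate from the value read from the file: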

NODATA_DEFAULTS = {
    "uint8":   0,
    "uint16":  0,
    "int16":   -32768,
    "float32": float("nan"),
    "float64": float("nan"),
}

effective_nodata = nodata if nodata is not None else NODATA_DEFAULTS.get(dtype)

Step 5 — Serialise into a structured object

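Wrap parsed values in a frozen dataclass so that downstream code receives a type-annotated, immutable record. The frozen flag prevents accidental attribute reassignment in parallel workers; note that it does not freeze the contents of the tags dict, and that dataclasses do not enforce the annotations at runtime.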

import rasterio
from rasterio.transform import Affine
from dataclasses import dataclass, field
from typing import Any
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

@dataclass(frozen=True)
class RasterMetadata:
    source_uri:   str
    crs:          str | None
    bounds:       tuple[float, float, float, float]   # (left, bottom, right, top)
    resolution:   tuple[float, float]                  # (x_res, y_res) — always positive
    transform:    tuple[float, ...]                    # 6-element GDAL geotransform
    dtype:        str
    nodata:       float | None
    effective_nodata: float | None                     # inferred if nodata is None
    band_count:   int
    width:        int
    height:       int
    tags:         dict[str, Any] = field(default_factory=dict)
    extracted_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

NODATA_DEFAULTS: dict[str, float] = {
    "uint8": 0, "uint16": 0, "int16": -32768,
    "float32": float("nan"), "float64": float("nan"),
}

def extract_raster_metadata(uri: str) -> RasterMetadata:
    """
    Open a raster, extract all spatial and radiometric header values,
    and return a frozen metadata record (validate it separately, e.g.
    with validate_metadata from Step 4).

    Args:
        uri: Local file path or GDAL virtual filesystem URI (e.g. /vsis3/…)

    Returns:
        RasterMetadata dataclass instance

    Raises:
        rasterio.errors.RasterioError: On I/O or GDAL-level failure
    """
    try:
        with rasterio.open(uri) as src:
            # Read every attribute inside the with block, while the dataset is still open
            crs_auth   = src.crs.to_authority() if src.crs else None
            crs_str    = ":".join(crs_auth) if crs_auth else (src.crs.to_string() if src.crs else None)
            bounds     = src.bounds
            res        = src.res
            transform  = tuple(src.transform.to_gdal())   # (c, a, b, f, d, e) GDAL order
            dtype      = src.dtypes[0]
            nodata     = src.nodata
            tags       = src.tags()
            band_count = src.count
            width      = src.width
            height     = src.height

        eff_nodata = nodata if nodata is not None else NODATA_DEFAULTS.get(dtype)

        return RasterMetadata(
            source_uri       = uri,
            crs              = crs_str,
            bounds           = (bounds.left, bounds.bottom, bounds.right, bounds.top),
            resolution       = (abs(res[0]), abs(res[1])),
            transform        = transform,
            dtype            = dtype,
            nodata           = nodata,
            effective_nodata = eff_nodata,
            band_count       = band_count,
            width            = width,
            height           = height,
            tags             = tags,
        )
    except rasterio.errors.RasterioError:
        logger.exception("GDAL-level failure reading %s", uri)
        raise
    except Exception:
        logger.exception("Unexpected error parsing metadata from %s", uri)
        raise

Step 6 — Attach provenance and log a structured event

Provenance tracking — source URI, parser version, library versions, extraction timestamp — is required for scientific reproducibility and for diagnosing silent regressions when library versions change.

import rasterio
import sys
import json

def extraction_event(meta: RasterMetadata) -> dict:
    """Build a structured log record suitable for JSON logging or STAC item properties."""
    return {
        "source_uri":    meta.source_uri,
        "extracted_at":  meta.extracted_at,
        "crs":           meta.crs,
        "resolution":    meta.resolution,
        "dtype":         meta.dtype,
        "nodata":        meta.nodata,
        "rasterio_ver":  rasterio.__version__,
        "gdal_ver":      rasterio.__gdal_version__,
        "python_ver":    sys.version,
        "status":        "ok",
    }

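# Emit to the structured logger (logger, RasterMetadata and
# extract_raster_metadata come from Step 5)
meta = extract_raster_metadata(uri)
logger.info(json.dumps(extraction_event(meta)))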
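
One caveat when serialising the event: Python's json.dumps writes float("nan") as the bare token NaN, which strict JSON parsers reject, and float rasters commonly use NaN as nodata. The following is a minimal sketch of a sanitiser; `json_safe` is an illustrative helper, not part of any library.

```python
import json
import math


def json_safe(value):
    """Recursively replace NaN/inf floats with None and tuples with lists."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {str(k): json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(v) for v in value]
    return value


event = {"dtype": "float32", "nodata": float("nan"), "resolution": (10.0, 10.0)}
# allow_nan=False makes json.dumps raise instead of silently emitting NaN
print(json.dumps(json_safe(event), allow_nan=False))
# → {"dtype": "float32", "nodata": null, "resolution": [10.0, 10.0]}
```

Mapping NaN to null loses the distinction between "no nodata tag" and "nodata is NaN"; if that matters downstream, a string such as "nan" is an alternative encoding.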

Parameter reference

| Parameter / attribute | Type | Default | Usage note |
| --- | --- | --- | --- |
| src.crs | rasterio.crs.CRS or None | | Call .to_authority() for a canonical ("EPSG", "xxxx") tuple; fall back to .to_string() for non-EPSG CRS |
| src.transform | affine.Affine | | Index [0] = x pixel size, [4] = y pixel size (negative for north-up). Use .to_gdal() for a JSON-serialisable 6-tuple |
| src.res | tuple[float, float] | | (x pixel size, y pixel size); normally positive, but the code above applies abs() defensively in case of unusual orientations |
| src.bounds | BoundingBox | | .left, .bottom, .right, .top in native CRS units |
| src.nodata | float or None | None | None means no nodata tag was written; implement a dtype-based fallback |
| src.dtypes | tuple[str, ...] | | One entry per band; typically uniform but check for mixed-type files |
| src.count | int | | Total band count including alpha; check src.colorinterp to identify alpha bands |
| src.tags(ns=None) | dict[str, str] | {} | ns=None returns the default GDAL domain; pass ns="IMAGERY" or ns="RPC" for named domains |
| src.meta | dict | | Full header dict ready to pass to rasterio.open(dst, "w", **src.meta) |

Verification and testing

After extraction, assert these invariants before the pipeline proceeds:

import math

def assert_metadata_valid(meta: RasterMetadata) -> None:
    assert meta.crs is not None, "CRS must not be None"
    assert all(r > 0 for r in meta.resolution), "Pixel dimensions must be positive"
    assert meta.width > 0 and meta.height > 0, "Raster must have non-zero extent"
    assert meta.band_count >= 1, "At least one band required"
    assert len(meta.transform) == 6, "Transform must contain exactly 6 GDAL coefficients"

    # In GDAL order (c, a, b, f, d, e), index 1 is the x pixel size (must match resolution[0])
    assert math.isclose(abs(meta.transform[1]), meta.resolution[0], rel_tol=1e-6), \
        "Transform pixel width must match reported resolution"

    # Nodata should not be NaN for integer types (NaN is not representable)
    if meta.dtype in ("uint8", "uint16", "int16", "int32"):
        assert meta.effective_nodata is None or not math.isnan(float(meta.effective_nodata)), \
            f"NaN nodata is invalid for integer dtype {meta.dtype}"

# Quick smoke test against a known fixture
if __name__ == "__main__":
    meta = extract_raster_metadata("tests/fixtures/sentinel2_b04.tif")
    assert_metadata_valid(meta)
    print("All metadata assertions passed")
    print(f"  CRS:        {meta.crs}")
    print(f"  Resolution: {meta.resolution[0]:.1f} m")
    print(f"  Bounds:     {meta.bounds}")
    print(f"  Nodata:     {meta.effective_nodata}")

Expected console output for a 10 m Sentinel-2 band:

All metadata assertions passed
  CRS:        EPSG:32632
  Resolution: 10.0 m
  Bounds:     (300000.0, 5690220.0, 409800.0, 5800020.0)
  Nodata:     0

Troubleshooting

CPLE_NotSupported: The PROJ library has not been built with network support

Cause: PROJ_NETWORK=ON was set but the PROJ binary was compiled without libcurl. This surfaces when transforming obscure datums that require online grid lookups.

Fix: Unset PROJ_NETWORK or install proj-data to obtain the full grid shift dataset locally:

conda install -c conda-forge proj-data

src.crs is None on a file that has projection information

Cause: The CRS is embedded in a non-standard metadata tag (common in legacy ENVI .hdr files or HDF products) rather than the standard GeoTIFF GeoKeyDirectory tag.

Fix: Read the authority string from src.tags() and assign it manually:

with rasterio.open(uri) as src:
    if src.crs is None:
        epsg = src.tags().get("EPSG") or src.tags(ns="ENVI").get("coordinate_system")
        # Build a CRS from the recovered value (e.g. rasterio.crs.CRS.from_user_input)
        # and pass it explicitly to downstream calls; the dataset opened here is read-only

ValueError: transform is degenerate (scale component is zero)

Cause: The affine transform was written with zero pixel size — a common artefact of converting rasters through GIS desktop tools that drop the resolution metadata.

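Fix: Inspect src.transform. Note that src.bounds is itself computed from the transform, so it is degenerate as well; rebuild the transform from an extent you trust (for example a sidecar file or catalogue record) together with width/height:

from rasterio.transform import from_bounds

# trusted_bounds is a placeholder, not defined here: (left, bottom, right, top)
# obtained from a trusted external source rather than from src.bounds
with rasterio.open(uri) as src:
    if src.transform.a == 0.0 or src.transform.e == 0.0:
        corrected = from_bounds(*trusted_bounds, src.width, src.height)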

Nodata pixels not being masked in downstream operations

Cause: src.nodata is None, so src.read(masked=True) has no sentinel to mask on and, absent a mask band, returns a fully unmasked array.

Fix: Apply the inferred nodata yourself after reading:

import numpy as np

with rasterio.open(uri) as src:
    data = src.read(1)
    eff_nodata = src.nodata if src.nodata is not None else 0
    masked = np.ma.masked_equal(data, eff_nodata)
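
np.ma.masked_equal cannot match a NaN sentinel, because NaN never compares equal to itself. The following is a minimal sketch of a mask helper that covers both cases; `mask_nodata` is a hypothetical name, and treating infinities as invalid alongside NaN is an assumption of this sketch.

```python
import math

import numpy as np


def mask_nodata(data, nodata):
    """Return a masked array hiding nodata pixels.

    NaN sentinels are handled with masked_invalid, which masks NaN and +/-inf.
    """
    if nodata is None:
        return np.ma.masked_array(data, mask=False)
    if isinstance(nodata, float) and math.isnan(nodata):
        return np.ma.masked_invalid(data)
    return np.ma.masked_equal(data, nodata)


floats = np.array([1.0, np.nan, 3.0], dtype="float32")
ints = np.array([0, 5, 0, 7], dtype="uint16")
print(mask_nodata(floats, float("nan")).count())  # valid pixels → 2
print(mask_nodata(ints, 0).count())               # valid pixels → 2
```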

Batch extraction silently skips corrupt files

Cause: A bare except Exception: continue pattern in a loop suppresses all errors.

Fix: Separate I/O errors (retryable) from parse errors (non-retryable) and log both. See Automating metadata extraction for batch raster jobs for a full error-handling and retry strategy.
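
The separation described above can be sketched as follows. This is a minimal illustration, not the linked article's implementation: `extract_batch` is a hypothetical helper, `extract` is any callable that takes a URI, and treating OSError as the retryable class is an assumption you should adapt to the exceptions your storage layer actually raises.

```python
import logging

logger = logging.getLogger(__name__)


def extract_batch(uris, extract, max_retries=2):
    """Run `extract` over uris and return (results, failures).

    Assumption: OSError signals a retryable I/O problem; any other
    exception is treated as a non-retryable parse error.
    """
    results, failures = {}, {}
    for uri in uris:
        for attempt in range(1, max_retries + 2):
            try:
                results[uri] = extract(uri)
                break
            except OSError as exc:
                logger.warning("I/O error on %s (attempt %d): %s", uri, attempt, exc)
                if attempt > max_retries:
                    failures[uri] = f"io: {exc}"
            except Exception as exc:
                logger.error("Parse error on %s: %s", uri, exc)
                failures[uri] = f"parse: {exc}"
                break
    return results, failures
```

Every failure ends up in the returned dict with its category, so nothing is skipped silently and the caller can decide whether a non-empty failures map should fail the job.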


Scaling to batch and cloud workloads

Single-file parsing is straightforward. Ingesting a full Sentinel-2 archive or a STAC collection adds three considerations:

Avoid redundant downloads. Cloud-Optimized GeoTIFF files store metadata in the first IFD, typically within the first 16–32 KB. Configure GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif so rasterio fetches only the header bytes via HTTP range requests.

Cache parsed results. Metadata is stable across pipeline runs for immutable source files. Write RasterMetadata records to a Parquet or SQLite catalogue keyed on source_uri + file checksum. Skip re-parsing any URI whose checksum has not changed.

Parallelise without GDAL thread conflicts. Prefer concurrent.futures.ProcessPoolExecutor over threads: GDAL dataset handles must not be shared between threads, and separate processes sidestep the problem because each worker gets its own GDAL context. For the full batch orchestration pattern — including STAC-aligned output, retry logic, and Parquet caching — refer to Automating metadata extraction for batch raster jobs.
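
The caching consideration above can be sketched with the standard library alone. This is a minimal illustration using SQLite; the table layout and the helper names `open_cache`, `get_or_parse`, and `checksum_bytes` are assumptions made for the example, and `parse` stands in for any function that returns a JSON-serialisable dict for a URI.

```python
import hashlib
import json
import sqlite3


def open_cache(path=":memory:"):
    """Create (if needed) a table keyed on source URI plus checksum."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raster_meta ("
        "uri TEXT, checksum TEXT, payload TEXT, PRIMARY KEY (uri, checksum))"
    )
    return conn


def get_or_parse(conn, uri, checksum, parse):
    """Return cached metadata for (uri, checksum), calling `parse` only on a miss."""
    row = conn.execute(
        "SELECT payload FROM raster_meta WHERE uri = ? AND checksum = ?",
        (uri, checksum),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = parse(uri)  # must return a JSON-serialisable dict
    conn.execute(
        "INSERT INTO raster_meta (uri, checksum, payload) VALUES (?, ?, ?)",
        (uri, checksum, json.dumps(record)),
    )
    conn.commit()
    return record


def checksum_bytes(data):
    """SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()
```

Because the checksum is part of the key, a changed file produces a cache miss and is re-parsed, while unchanged files are served from the catalogue without touching storage.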


FAQ

Why is src.nodata None even though the raster has a nodata value set?

GDAL reads nodata from the TIFFTAG_GDAL_NODATA tag or from per-band metadata. If the file was written without setting that tag (common with older ENVI or HDF workflows), rasterio reports None. Inspect src.tags() and src.tags(1) for band-level NODATA keys, or run gdalinfo to confirm what is stored in the file header.

How do I extract metadata from a COG over S3 without downloading the whole file?

Prefix the S3 URI with /vsis3/ (e.g., rasterio.open('/vsis3/bucket/key.tif')) and set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif. rasterio will fetch only the IFD header bytes via HTTP range requests, typically under 32 KB for a well-formed COG.

What is the difference between src.transform and src.meta['transform']?

Both hold the same affine transform. src.transform is a direct attribute access, while src.meta['transform'] is part of the full metadata dictionary that also includes driver, dtype, nodata, width, height, count, and crs. Use src.meta when forwarding the complete header to rasterio.open() in write mode.

