Extracting and Parsing Raster Metadata
Reliable remote sensing pipelines begin with accurate geospatial metadata. Before any pixel-level computation, band math, or spatial join can occur, your system must correctly interpret coordinate reference systems, affine transforms, data types, nodata sentinels, and sensor-specific tags. This guide provides a production-tested workflow for programmatically reading, validating, and structuring raster headers using rasterio — a foundational step within the broader Core Raster Fundamentals & STAC Mapping architecture.
Pipeline position
Metadata extraction sits at the entry point of every ingestion and analysis pipeline. It runs before CRS transformation, before pixel resolution normalisation, and before any data is read from disk or cloud storage. Parsed metadata drives all subsequent decisions: which resampling kernel to apply, whether to reproject, which nodata mask to propagate, and how to align multi-source grids.
The diagram below shows where metadata extraction sits in a typical pipeline:
Prerequisites
pip install "rasterio>=1.3" "pyproj>=3.4" "numpy>=1.24" "pydantic>=2.0"
For GDAL binary compatibility, prefer the conda-forge build:
conda install -c conda-forge rasterio pyproj numpy
| Library | Min version | Why required |
|---|---|---|
| rasterio | 1.3 | Dataset reader, affine transform, tag access |
| pyproj | 3.4 | CRS string normalisation, authority lookups |
| numpy | 1.24 | Dtype introspection, nodata range checks |
| pydantic | 2.0 | Optional: strict model validation and JSON schema |
Conceptual prerequisites:
- Understanding of the TIFF Image File Directory (IFD) structure — see Understanding Cloud-Optimized GeoTIFF Structure for how COG headers are laid out in the file
- Familiarity with EPSG codes and authority-based CRS definitions — see Mastering CRS Transformations in Rasterio
Step-by-step workflow
Step 1 — Open the raster safely
Always open files inside a with block. This guarantees that the underlying GDAL dataset handle is released immediately after parsing, which prevents file descriptor leaks in long-running workers or multiprocessing pools.
import rasterio
uri = "s3://my-bucket/scene.tif" # local path or /vsis3/ URI both work
with rasterio.open(uri) as src:
    print(src.driver)  # GTiff, HDF5, JP2OpenJPEG, …
    print(src.count)   # number of bands
    print(src.width, src.height)
For cloud-hosted files, configure GDAL environment variables before the first open call to enable HTTP range requests and avoid full-file downloads:
import os
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif,.tiff"
os.environ["AWS_NO_SIGN_REQUEST"] = "YES" # for public buckets
Step 2 — Read core spatial attributes
The five attributes below form the geometric contract for the raster. Every downstream operation — reprojection, window reads, spatial joins — depends on them being correct.
with rasterio.open(uri) as src:
    # CRS: normalise to an authority string (e.g. "EPSG:32632")
    # to_authority() returns a tuple such as ("EPSG", "32632"), or None when no authority matches
    crs_auth = src.crs.to_authority() if src.crs else None
    crs_str = ":".join(crs_auth) if crs_auth else (src.crs.to_string() if src.crs else None)
    # Bounding box in native CRS units (left, bottom, right, top)
    bounds = src.bounds  # BoundingBox namedtuple
    # Affine transform: maps pixel (col, row) → (x, y) in CRS units
    # src.transform[0] = pixel width (positive)
    # src.transform[4] = pixel height (negative for north-up rasters)
    transform = src.transform
    # Pixel dimensions: always positive regardless of image orientation
    res_x, res_y = abs(src.res[0]), abs(src.res[1])
    # Native data type and nodata sentinel
    dtype = src.dtypes[0]  # uniform dtype assumed; check src.dtypes for heterogeneous bands
    nodata = src.nodata    # may be None; handled in Step 4
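To make the affine contract concrete, the following pure-Python sketch applies the six coefficients by hand. The coefficient values are assumed for illustration (a 10 m north-up grid, consistent with the Sentinel-2 example used later in this guide), and `pixel_to_xy` and `bounds_from_transform` are hypothetical helper names, not rasterio functions.

```python
# Affine coefficients in rasterio order (a, b, c, d, e, f):
#   x = a * col + b * row + c
#   y = d * col + e * row + f
# Assumed values: 10 m pixels, north-up, upper-left corner at (300000, 5800020).
A, B, C, D, E, F = 10.0, 0.0, 300000.0, 0.0, -10.0, 5800020.0


def pixel_to_xy(col: float, row: float) -> tuple[float, float]:
    """Map a pixel corner (col, row) to CRS coordinates (x, y)."""
    return (A * col + B * row + C, D * col + E * row + F)


def bounds_from_transform(width: int, height: int) -> tuple[float, float, float, float]:
    """Derive (left, bottom, right, top) for a north-up raster."""
    left, top = pixel_to_xy(0, 0)
    right, bottom = pixel_to_xy(width, height)
    return (left, bottom, right, top)


print(pixel_to_xy(0, 0))                    # upper-left corner
print(bounds_from_transform(10980, 10980))  # full extent of a 10980 x 10980 grid
```

Because the bounds follow directly from the transform plus width and height, an error in any one coefficient shifts or distorts every derived extent.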
Step 3 — Extract driver-specific tags
rasterio exposes GDAL’s metadata namespace hierarchy through src.tags(). Dataset-level tags are under the default namespace; sensor calibration, solar geometry, and acquisition timestamps often live in named namespaces like IMAGERY or RPC.
with rasterio.open(uri) as src:
    # Dataset-level tags (TIFF GDAL metadata domain)
    dataset_tags = src.tags()
    # Named namespace, e.g. IMAGERY for Landsat/Sentinel scene metadata
    imagery_tags = src.tags(ns="IMAGERY")
    # Per-band tags (band indices are 1-based in rasterio)
    band1_tags = src.tags(1)
    # GDAL metadata domain, driver-specific (e.g. HDF5 GROUP metadata)
    # src.tags(ns="HDF5_GLOBAL") works for HDF/NetCDF drivers
    print(dataset_tags.get("AREA_OR_POINT"))  # "Area" or "Point"
    print(imagery_tags.get("ACQUISITIONDATETIME"))
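Tag values come back as strings, so numeric and temporal fields need parsing before use. The sketch below shows one best-effort approach. `coerce_tag` is a hypothetical helper, the sample dictionary is invented, and the ISO 8601 timestamp format is an assumption: real products vary and may need a sensor-specific parser.

```python
from datetime import datetime


def coerce_tag(value: str):
    """Best-effort conversion of a tag string to int, float, or datetime."""
    text = value.strip()
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            pass
    try:
        # Assumes ISO 8601; a trailing "Z" is rewritten for older Python versions
        return datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        return text  # anything unrecognised stays a (stripped) string


# Hypothetical tag dictionary, shaped like the output of src.tags()
sample_tags = {
    "AREA_OR_POINT": "Area",
    "CLOUDCOVER": "12.5",
    "ACQUISITIONDATETIME": "2023-06-01T10:20:31Z",
}
typed_tags = {key: coerce_tag(val) for key, val in sample_tags.items()}
print(typed_tags)
```

Keeping the raw strings alongside the typed values is a reasonable precaution, since a permissive parser like this one can misread identifiers that happen to look numeric.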
Step 4 — Validate parsed values
Raw header extraction is only half the work. Before values propagate to downstream computations, verify them against your pipeline’s expectations.
from pyproj import CRS
import numpy as np
def validate_metadata(crs_str, res_x, res_y, nodata, dtype):
    issues = []
    # 1. CRS must be parseable and not unknown
    if crs_str is None:
        issues.append("CRS is missing: assign one explicitly before processing")
    else:
        try:
            crs_obj = CRS.from_user_input(crs_str)
            if crs_obj.is_geographic and res_x > 1.0:
                issues.append(
                    f"CRS appears geographic but pixel size {res_x:.6f} looks like metres"
                )
        except Exception as e:
            issues.append(f"CRS unparseable: {e}")
    # 2. Resolution must be positive and plausible
    if res_x <= 0 or res_y <= 0:
        issues.append(f"Non-positive resolution: x={res_x}, y={res_y}")
    # 3. Nodata must fit within the dtype's representable range
    if nodata is not None:
        is_int = np.issubdtype(np.dtype(dtype), np.integer)
        if np.isnan(nodata):
            # NaN is a valid sentinel for float dtypes but cannot be stored in integers;
            # it must be handled first because every range comparison with NaN is False
            if is_int:
                issues.append(f"NaN nodata is invalid for integer dtype {dtype}")
        else:
            info = np.iinfo(dtype) if is_int else np.finfo(dtype)
            if not (info.min <= nodata <= info.max):
                issues.append(f"Nodata {nodata} is outside {dtype} range [{info.min}, {info.max}]")
    return issues
Nodata inference fallback: when src.nodata is None, choose a safe default based on dtype before proceeding:
NODATA_DEFAULTS = {
    "uint8": 0,
    "uint16": 0,
    "int16": -32768,
    "float32": float("nan"),
    "float64": float("nan"),
}
effective_nodata = nodata if nodata is not None else NODATA_DEFAULTS.get(dtype)
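One consequence of using NaN as the float default is that equality tests never match it, because NaN compares unequal to everything, including itself. The sketch below shows a scalar comparison that accounts for this. `is_nodata` is a hypothetical helper name introduced for illustration.

```python
import math
from typing import Optional


def is_nodata(value: float, nodata: Optional[float]) -> bool:
    """Return True when value should be treated as nodata."""
    if nodata is None:
        return False  # no sentinel defined, so nothing is masked
    if isinstance(nodata, float) and math.isnan(nodata):
        # NaN sentinel: an == comparison would always be False
        return isinstance(value, float) and math.isnan(value)
    return value == nodata


print(is_nodata(float("nan"), float("nan")))  # NaN sentinel matches NaN pixel
print(is_nodata(0, 0))                        # integer sentinel matches by equality
print(is_nodata(5.0, float("nan")))           # valid pixel is not nodata
```

For whole arrays the same rule applies: use numpy.isnan to build the mask for a NaN sentinel rather than an equality comparison.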
Step 5 — Serialise into a structured object
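Wrap parsed values in a frozen dataclass so that downstream code receives an immutable record with declared field types. The frozen flag blocks attribute reassignment, which prevents accidental mutation in parallel workers. Dataclasses do not enforce their annotations at runtime, so reach for the optional pydantic model when you need strict validation.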
import rasterio
from rasterio.transform import Affine
from dataclasses import dataclass, field
from typing import Any
import logging
from datetime import datetime, timezone
logger = logging.getLogger(__name__)
@dataclass(frozen=True)
class RasterMetadata:
    source_uri: str
    crs: str | None
    bounds: tuple[float, float, float, float]  # (left, bottom, right, top)
    resolution: tuple[float, float]            # (x_res, y_res), always positive
    transform: tuple[float, ...]               # 6-element GDAL geotransform
    dtype: str
    nodata: float | None
    effective_nodata: float | None             # inferred if nodata is None
    band_count: int
    width: int
    height: int
    tags: dict[str, Any] = field(default_factory=dict)
    extracted_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
NODATA_DEFAULTS: dict[str, float] = {
    "uint8": 0, "uint16": 0, "int16": -32768,
    "float32": float("nan"), "float64": float("nan"),
}
def extract_raster_metadata(uri: str) -> RasterMetadata:
    """
    Open a raster, extract all spatial and radiometric header values,
    and return a frozen metadata record. Validation is a separate step
    (see validate_metadata in Step 4).
    Args:
        uri: Local file path or GDAL virtual filesystem URI (e.g. /vsis3/…)
    Returns:
        RasterMetadata dataclass instance
    Raises:
        rasterio.errors.RasterioError: On I/O or GDAL-level failure
    """
    try:
        with rasterio.open(uri) as src:
            crs_auth = src.crs.to_authority() if src.crs else None
            crs_str = ":".join(crs_auth) if crs_auth else (src.crs.to_string() if src.crs else None)
            bounds = src.bounds
            res = src.res
            transform = tuple(src.transform.to_gdal())  # (c, a, b, f, d, e) GDAL order
            dtype = src.dtypes[0]
            nodata = src.nodata
            tags = src.tags()
            eff_nodata = nodata if nodata is not None else NODATA_DEFAULTS.get(dtype)
            # Build the record while the dataset is still open
            return RasterMetadata(
                source_uri=uri,
                crs=crs_str,
                bounds=(bounds.left, bounds.bottom, bounds.right, bounds.top),
                resolution=(abs(res[0]), abs(res[1])),
                transform=transform,
                dtype=dtype,
                nodata=nodata,
                effective_nodata=eff_nodata,
                band_count=src.count,
                width=src.width,
                height=src.height,
                tags=tags,
            )
    except rasterio.errors.RasterioError:
        logger.exception("GDAL-level failure reading %s", uri)
        raise
    except Exception:
        logger.exception("Unexpected error parsing metadata from %s", uri)
        raise
Step 6 — Attach provenance and log a structured event
Provenance tracking — source URI, parser version, library versions, extraction timestamp — is required for scientific reproducibility and for diagnosing silent regressions when library versions change.
import rasterio
import sys
import json
def extraction_event(meta: RasterMetadata) -> dict:
    """Build a structured log record suitable for JSON logging or STAC item properties."""
    return {
        "source_uri": meta.source_uri,
        "extracted_at": meta.extracted_at,
        "crs": meta.crs,
        "resolution": meta.resolution,
        "dtype": meta.dtype,
        "nodata": meta.nodata,
        "rasterio_ver": rasterio.__version__,
        "python_ver": sys.version,
        "status": "ok",
    }
# Emit to structured logger
logger.info(json.dumps(extraction_event(meta)))
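One caveat when logging float rasters: by default json.dumps writes a NaN nodata value as the bare token NaN, which strict JSON parsers reject. The sketch below shows one possible workaround that maps non-finite floats to null before serialising. `json_safe` is a hypothetical helper, and the event dictionary is an invented stand-in shaped like the record built by extraction_event().

```python
import json
import math


def json_safe(value):
    """Recursively replace NaN/inf floats with None so the output is strict JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(item) for item in value]
    return value


# Hypothetical event record with a NaN nodata value
event = {"dtype": "float32", "nodata": float("nan"), "resolution": (10.0, 10.0)}

# allow_nan=False makes json.dumps raise if a non-finite float slipped through
line = json.dumps(json_safe(event), allow_nan=False)
print(line)
```

Mapping NaN to null loses the distinction between "nodata is NaN" and "no nodata defined", so a pipeline that cares about that difference may prefer to log the string "nan" instead.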
Parameter reference
| Parameter / attribute | Type | Default | Usage note |
|---|---|---|---|
| src.crs | rasterio.crs.CRS or None | n/a | Call .to_authority() for a canonical ("EPSG", "xxxx") tuple; fall back to .to_string() for non-EPSG CRS |
| src.transform | affine.Affine | n/a | Index [0] = x pixel size, [4] = y pixel size (negative for north-up). Use .to_gdal() for JSON-serialisable 6-tuple |
| src.res | tuple[float, float] | n/a | (col_spacing, row_spacing), both positive regardless of orientation |
| src.bounds | BoundingBox | n/a | .left, .bottom, .right, .top in native CRS units |
| src.nodata | float or None | None | None means no nodata tag was written; implement a dtype-based fallback |
| src.dtypes | tuple[str, ...] | n/a | One entry per band; typically uniform but check for mixed-type files |
| src.count | int | n/a | Total band count including alpha; check src.colorinterp to identify alpha bands |
| src.tags(ns=None) | dict[str, str] | {} | ns=None returns default GDAL domain; pass ns="IMAGERY" or ns="RPC" for named domains |
| src.meta | dict | n/a | Full header dict ready to pass to rasterio.open(dst, "w", **src.meta) |
Verification and testing
After extraction, assert these invariants before the pipeline proceeds:
import math
def assert_metadata_valid(meta: RasterMetadata) -> None:
    assert meta.crs is not None, "CRS must not be None"
    assert all(r > 0 for r in meta.resolution), "Pixel dimensions must be positive"
    assert meta.width > 0 and meta.height > 0, "Raster must have non-zero extent"
    assert meta.band_count >= 1, "At least one band required"
    assert len(meta.transform) == 6, "Transform must contain exactly 6 GDAL coefficients"
    # meta.transform is stored in GDAL order (c, a, b, f, d, e),
    # so index [1] holds the x pixel size and must match resolution[0]
    assert math.isclose(abs(meta.transform[1]), meta.resolution[0], rel_tol=1e-6), \
        "Transform pixel width must match reported resolution"
    # Nodata should not be NaN for integer types (NaN is not representable)
    if meta.dtype in ("uint8", "uint16", "int16", "int32"):
        assert meta.effective_nodata is None or not math.isnan(float(meta.effective_nodata)), \
            f"NaN nodata is invalid for integer dtype {meta.dtype}"
# Quick smoke test against a known fixture
if __name__ == "__main__":
    meta = extract_raster_metadata("tests/fixtures/sentinel2_b04.tif")
    assert_metadata_valid(meta)
    print("All metadata assertions passed")
    print(f" CRS: {meta.crs}")
    print(f" Resolution: {meta.resolution[0]:.1f} m")
    print(f" Bounds: {meta.bounds}")
    print(f" Nodata: {meta.effective_nodata}")
Expected console output for a 10 m Sentinel-2 band:
All metadata assertions passed
CRS: EPSG:32632
Resolution: 10.0 m
Bounds: (300000.0, 5690220.0, 409800.0, 5800020.0)
Nodata: 0
Troubleshooting
CPLE_NotSupported: The PROJ library has not been built with network support
Cause: PROJ_NETWORK=ON was set but the PROJ binary was compiled without libcurl. This surfaces when transforming obscure datums that require online grid lookups.
Fix: Unset PROJ_NETWORK or install proj-data to obtain the full grid shift dataset locally:
conda install -c conda-forge proj-data
src.crs is None on a file that has projection information
Cause: The CRS is embedded in a non-standard metadata tag (common in legacy ENVI .hdr files or HDF products) rather than the standard GeoTIFF GeoKeyDirectory tag.
Fix: Read the authority string from src.tags() and assign it manually:
with rasterio.open(uri) as src:
    if src.crs is None:
        epsg = src.tags().get("EPSG") or src.tags(ns="ENVI").get("coordinate_system")
        # Assign via pyproj and reproject to a known CRS before proceeding
ValueError: transform is degenerate (scale component is zero)
Cause: The affine transform was written with zero pixel size — a common artefact of converting rasters through GIS desktop tools that drop the resolution metadata.
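Fix: Inspect src.transform and reconstruct it from known bounds and width/height. Because src.bounds is itself computed from the transform, a degenerate transform yields degenerate bounds, so the extent must come from an independent source such as a sidecar file or catalogue entry:
from rasterio.transform import from_bounds
# known_bounds is a placeholder for (left, bottom, right, top) obtained independently of the file
with rasterio.open(uri) as src:
    if src.transform[0] == 0.0:
        corrected = from_bounds(*known_bounds, src.width, src.height)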
Nodata pixels not being masked in downstream operations
Cause: src.nodata is None, so src.read(masked=True) has no sentinel to mask against and returns a fully unmasked array.
Fix: Apply the inferred nodata value yourself after reading:
import numpy as np
with rasterio.open(uri) as src:
    data = src.read(1)
    eff_nodata = src.nodata if src.nodata is not None else 0
    masked = np.ma.masked_equal(data, eff_nodata)
Batch extraction silently skips corrupt files
Cause: A bare except Exception: continue pattern in a loop suppresses all errors.
Fix: Separate I/O errors (retryable) from parse errors (non-retryable) and log both. See Automating metadata extraction for batch raster jobs for a full error-handling and retry strategy.
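A minimal sketch of that separation, under stated assumptions: OSError and its subclasses are treated as retryable I/O failures, and every other exception as a non-retryable parse failure. `extract_batch` is a hypothetical name, and the `extract` argument stands in for a function such as extract_raster_metadata. A real pipeline would also add backoff between attempts.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def extract_batch(uris: list[str], extract: Callable[[str], object], max_attempts: int = 3):
    """Run extract() over uris; return (results, failures) without hiding errors."""
    results, failures = {}, {}
    for uri in uris:
        for attempt in range(1, max_attempts + 1):
            try:
                results[uri] = extract(uri)
                break
            except OSError as exc:  # assumed retryable: I/O and network failures
                logger.warning("I/O error on %s (attempt %d): %s", uri, attempt, exc)
                if attempt == max_attempts:
                    failures[uri] = f"io: {exc}"
            except Exception as exc:  # assumed non-retryable: malformed header
                logger.error("Parse error on %s: %s", uri, exc)
                failures[uri] = f"parse: {exc}"
                break
    return results, failures
```

Returning the failures alongside the results means a corrupt file shows up in the batch report instead of disappearing from it.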
Scaling to batch and cloud workloads
Single-file parsing is straightforward. Ingesting a full Sentinel-2 archive or a STAC collection adds three considerations:
Avoid redundant downloads. Cloud-Optimized GeoTIFF files store metadata in the first IFD, typically within the first 16–32 KB. Configure GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif so rasterio fetches only the header bytes via HTTP range requests.
Cache parsed results. Metadata is stable across pipeline runs for immutable source files. Write RasterMetadata records to a Parquet or SQLite catalogue keyed on source_uri + file checksum. Skip re-parsing any URI whose checksum has not changed.
Parallelise without GDAL thread conflicts. Use concurrent.futures.ProcessPoolExecutor rather than threads, since GDAL’s global state is not thread-safe. Each worker process gets its own GDAL context. For the full batch orchestration pattern — including STAC-aligned output, retry logic, and Parquet caching — refer to Automating metadata extraction for batch raster jobs.
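The caching consideration above can be sketched with the standard-library sqlite3 module. This is an illustration under assumptions, not a prescribed schema: the table name, `open_cache`, and `get_or_parse` are hypothetical, the checksum (for example a file hash) is supplied by the caller, and `parse` is expected to return a JSON-serialisable dict.

```python
import json
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Create the catalogue table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raster_meta ("
        "source_uri TEXT, checksum TEXT, payload TEXT, "
        "PRIMARY KEY (source_uri, checksum))"
    )
    return conn


def get_or_parse(conn, uri: str, checksum: str, parse):
    """Return cached metadata for (uri, checksum), calling parse(uri) only on a miss."""
    row = conn.execute(
        "SELECT payload FROM raster_meta WHERE source_uri = ? AND checksum = ?",
        (uri, checksum),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = parse(uri)  # cache miss: parse once, then store
    conn.execute(
        "INSERT INTO raster_meta VALUES (?, ?, ?)", (uri, checksum, json.dumps(record))
    )
    conn.commit()
    return record
```

Keying on both the URI and the checksum means a changed file produces a new row rather than silently reusing stale metadata.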
FAQ
Why is src.nodata None even though the raster has a nodata value set?
GDAL reads nodata from the TIFFTAG_GDAL_NODATA tag or from per-band metadata. If the file was written without setting that tag (common with older ENVI or HDF workflows), rasterio reports None. Inspect src.tags() and src.tags(1) for band-level NODATA keys, or run gdalinfo to confirm what is stored in the file header.
How do I extract metadata from a COG over S3 without downloading the whole file?
Prefix the S3 URI with /vsis3/ (e.g., rasterio.open('/vsis3/bucket/key.tif')) and set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif. rasterio will fetch only the IFD header bytes via HTTP range requests, typically under 32 KB for a well-formed COG.
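If your catalogue stores s3:// URIs, the prefix rewrite can be done with the standard library. The helper below is a small sketch, and `to_vsis3` is a hypothetical name rather than part of rasterio or GDAL.

```python
from urllib.parse import urlparse


def to_vsis3(uri: str) -> str:
    """Convert s3://bucket/key to /vsis3/bucket/key; pass other paths through."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        return uri
    return f"/vsis3/{parsed.netloc}{parsed.path}"


print(to_vsis3("s3://my-bucket/scenes/scene.tif"))
```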
What is the difference between src.transform and src.meta['transform']?
They return the same Affine object. src.transform is a direct attribute access, while src.meta['transform'] is part of the full metadata dictionary that also includes driver, dtype, nodata, width, height, count, and crs. Use src.meta when forwarding the complete header to rasterio.open() in write mode.
Related
- Core Raster Fundamentals & STAC Mapping — parent section covering the full raster data model, STAC discovery, and cloud-native access patterns
- Automating metadata extraction for batch raster jobs — multiprocessing pool, Parquet caching, and STAC-aligned output for large-scale ingestion
- Understanding Cloud-Optimized GeoTIFF Structure — IFD layout, internal overviews, and HTTP range request mechanics that determine what metadata is available without a full download
- Mastering CRS Transformations in Rasterio — deterministic reprojection patterns that depend on correctly parsed CRS and affine transform
- Handling Pixel Resolution and Scaling — resolution normalisation and window-read strategies that consume the resolution values extracted here