How to Read COG Headers Without Downloading Full Files
Open a remote Cloud-Optimized GeoTIFF URL with rasterio and read .crs, .bounds, .res, and .overviews() directly. GDAL’s /vsicurl/ backend issues HTTP Range requests for only the start of the file (typically the first 16–32 KB, which holds the Image File Directory), so the pixel data stays on the server:
import rasterio

url = "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/N/YF/2023/6/S2A_36NYF_20230615_0_L2A/B04.tif"
with rasterio.open(url) as src:
    print(src.crs, src.bounds, src.res)
This pattern is the entry point for metadata-first validation in any pipeline that consumes assets from Understanding Cloud-Optimized GeoTIFF Structure.
Why This Arises in Remote Sensing Workflows
Satellite archives on AWS Open Data, Microsoft Planetary Computer, and Google Cloud Storage routinely hold files that range from several hundred megabytes to several gigabytes each. When you need to validate spatial alignment, confirm the CRS, or check that overview pyramids are present before scheduling a batch job, downloading every file is prohibitively expensive in both time and egress cost.
COGs are structured so that the structural metadata (tile offsets, CRS definitions, band descriptions, and overview level pointers) sits at the head of the file. HTTP Range requests exploit this layout: a partial-content response of 16–32 KB usually carries everything needed to populate rasterio’s dataset object, although files with very large tile indexes may need a follow-up request. Pixel arrays remain on the remote server until you explicitly call .read().
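To make the layout concrete, the sketch below decodes the fixed-size header that opens every TIFF: the byte-order mark, the magic number (42 for classic TIFF, 43 for BigTIFF), and the offset of the first IFD. The helper names fetch_head and parse_tiff_header are illustrative and not part of rasterio or GDAL, and the 16 KB default simply mirrors the figure quoted above. Treat it as a minimal illustration of a header-only read, not as a substitute for GDAL’s parser.

```python
import struct
import urllib.request


def fetch_head(url: str, nbytes: int = 16_384) -> bytes:
    """Request only the first `nbytes` of a remote file via an HTTP Range header."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Servers that honour Range reply 206 Partial Content; a 200 means
        # the header was ignored and the whole file is being sent.
        return resp.read(nbytes)


def parse_tiff_header(head: bytes) -> dict:
    """Decode byte order, TIFF flavour, and the first IFD offset."""
    if len(head) < 8:
        raise ValueError("need at least 8 bytes")
    if head[:2] == b"II":
        endian = "<"  # little-endian ("Intel")
    elif head[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF: unknown byte-order mark")

    (magic,) = struct.unpack(endian + "H", head[2:4])
    if magic == 42:  # classic TIFF: 4-byte IFD offset at byte 4
        (ifd_offset,) = struct.unpack(endian + "I", head[4:8])
        bigtiff = False
    elif magic == 43:  # BigTIFF: 8-byte IFD offset at byte 8
        if len(head) < 16:
            raise ValueError("BigTIFF header needs 16 bytes")
        (ifd_offset,) = struct.unpack(endian + "Q", head[8:16])
        bigtiff = True
    else:
        raise ValueError(f"not a TIFF: magic number {magic}")

    return {
        "little_endian": endian == "<",
        "bigtiff": bigtiff,
        "first_ifd_offset": ifd_offset,
    }


if __name__ == "__main__":
    # Synthetic classic-TIFF header: little-endian, magic 42, IFD at byte 8
    sample = b"II" + struct.pack("<HI", 42, 8)
    print(parse_tiff_header(sample))
```

In practice you would pass the result of fetch_head(url) to parse_tiff_header; the example uses synthetic bytes so that it runs without network access.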
This workflow sits inside Understanding Cloud-Optimized GeoTIFF Structure and feeds directly into asset-level filtering in Querying STAC Catalogs Programmatically: fetch headers to reject misaligned or incomplete assets before constructing read windows.
For the broader architectural context of how this fits into catalog-driven ingestion, see Core Raster Fundamentals & STAC Mapping.
How GDAL Routes a Remote URL
The diagram below shows the request sequence when you call rasterio.open(url) on an https:// URI.
rasterio translates the https:// URI into a /vsicurl/ path before any I/O occurs. GDAL’s virtual filesystem layer then serves reads on that path as HTTP Range requests and caches the responses, so the caller receives a fully populated DatasetReader object. This is the mechanism that makes the metadata extraction patterns described elsewhere on this site possible at scale.
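The URI-to-virtual-filesystem mapping can be sketched in a few lines. The helper to_vsi_path below is hypothetical and is not rasterio’s internal implementation; it only illustrates the naming convention GDAL uses for /vsicurl/, /vsis3/, and /vsigs/ paths.

```python
from urllib.parse import urlsplit

# Object-store schemes and the GDAL virtual filesystem prefix that serves them
_OBJECT_STORE_PREFIX = {"s3": "/vsis3/", "gs": "/vsigs/"}


def to_vsi_path(uri: str) -> str:
    """Map a remote URI to the GDAL virtual filesystem path that would serve it."""
    if uri.startswith("/vsi"):
        return uri  # already a GDAL virtual path

    parts = urlsplit(uri)
    scheme = parts.scheme.lower()

    if scheme in ("http", "https"):
        # /vsicurl/ keeps the complete URL, scheme included
        return "/vsicurl/" + uri
    if scheme in _OBJECT_STORE_PREFIX:
        # Object-store handlers take bucket/key without the scheme
        return f"{_OBJECT_STORE_PREFIX[scheme]}{parts.netloc}{parts.path}"

    raise ValueError(f"unsupported scheme: {scheme!r}")


if __name__ == "__main__":
    for example in (
        "https://example.com/data/B04.tif",
        "s3://my-bucket/scenes/B04.tif",
        "gs://my-bucket/scenes/B04.tif",
    ):
        print(example, "->", to_vsi_path(example))
```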
Environment & Setup
| Package | Minimum version | Why required |
|---|---|---|
| rasterio | 1.3.0 | Exposes GDAL’s /vsicurl/ through its Python API |
| GDAL (C library) | 3.4.0 | Implements HTTP Range, /vsicurl/, /vsis3/, /vsigs/ |
| libcurl | 7.68 | Handles the underlying HTTP/HTTPS transport for GDAL |
pip install "rasterio>=1.3.0"
Verify GDAL can reach remote files before running production code:
import rasterio
from rasterio.env import Env

with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR"):
    with rasterio.open("https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/N/YF/2023/6/S2A_36NYF_20230615_0_L2A/B04.tif") as src:
        print(src.driver)  # "GTiff" means the remote header was fetched and parsed
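The minimum versions in the table above can also be checked programmatically before a job starts. The sketch below compares dotted version strings using only the standard library; version_tuple and meets_minimum are illustrative helpers, and the comparison ignores pre-release suffixes, which is an assumption that may be too lenient for some deployments.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Keep the leading numeric components: '3.4.0' -> (3, 4, 0), '1.4a1' -> (1, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` is at least `minimum`, padding short versions with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        import rasterio
    except ImportError:
        print("rasterio is not installed")
    else:
        # rasterio reports its own version and the GDAL version it was built against
        print("rasterio OK:", meets_minimum(rasterio.__version__, "1.3.0"))
        print("GDAL OK:", meets_minimum(rasterio.__gdal_version__, "3.4.0"))
```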
Complete Working Example
The function below extracts a full metadata dictionary from a remote COG. Every field comes from the header — .read() is never called.
import rasterio
from rasterio.env import Env
from rasterio.errors import RasterioIOError
from typing import Any


def read_cog_header(url: str, *, aws_unsigned: bool = False) -> dict[str, Any]:
    """
    Fetch COG metadata via HTTP Range request; no pixel arrays are read.

    Parameters
    ----------
    url : str
        Remote COG path. Accepts https://, s3://, gs://, or /vsicurl/ prefixes.
    aws_unsigned : bool
        Set True for public AWS Open Data buckets that do not require signing.

    Returns
    -------
    dict[str, Any]
        Spatial metadata parsed from the IFD. Pixel arrays are not included.
    """
    gdal_env: dict[str, str] = {
        # Suppress directory-listing requests that add latency and cost
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        # Merge adjacent byte ranges into a single HTTP request
        "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
        # Enable HTTP/2 multiplexing for concurrent range requests
        "GDAL_HTTP_MULTIPLEX": "YES",
        "GDAL_HTTP_VERSION": "2",
    }
    if aws_unsigned:
        # Skip AWS Signature V4 so GDAL does not look for credentials
        # (relevant to s3:// and /vsis3/ paths)
        gdal_env["AWS_NO_SIGN_REQUEST"] = "YES"

    try:
        with Env(**gdal_env):
            with rasterio.open(url) as src:
                return {
                    "driver": src.driver,        # Should be "GTiff" for COGs
                    "dtype": src.dtypes[0],      # Data type of band 1
                    "count": src.count,          # Number of spectral bands
                    "width": src.width,          # Columns (pixels)
                    "height": src.height,        # Rows (pixels)
                    "crs": src.crs.to_epsg() if src.crs else None,  # EPSG code or None
                    "transform": src.transform,  # Affine geotransform (origin, pixel size)
                    "nodata": src.nodata,        # Fill / no-data sentinel value
                    "resolution": src.res,       # (x_res, y_res) in CRS units
                    "bounds": src.bounds,        # BoundingBox(left, bottom, right, top)
                    # Overview decimation factors per band; an empty list means no overviews
                    "overviews": [src.overviews(i) for i in range(1, src.count + 1)],
                    # Internal block (tile) size as (height, width), one tuple per band
                    "block_shapes": list(src.block_shapes),
                    "compression": src.compression.value if src.compression else None,
                }
    except RasterioIOError as exc:
        raise RuntimeError(f"Could not read COG header from {url!r}: {exc}") from exc


# Example: Sentinel-2 L2A red band from AWS Open Data (public bucket)
if __name__ == "__main__":
    s2_url = (
        "https://sentinel-cogs.s3.us-west-2.amazonaws.com"
        "/sentinel-s2-l2a-cogs/36/N/YF/2023/6"
        "/S2A_36NYF_20230615_0_L2A/B04.tif"
    )
    meta = read_cog_header(s2_url, aws_unsigned=True)
    print(f"CRS: EPSG:{meta['crs']}")
    print(f"Resolution: {meta['resolution']} m")
    print(f"Overview levels (band 1): {meta['overviews'][0]}")
    print(f"Block shape: {meta['block_shapes'][0]}")
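One use of the resolution and overviews fields is choosing how coarse a read can be before any pixels are requested. The sketch below assumes the overview list holds integer decimation factors such as [2, 4, 8, 16], which is the form rasterio returns; pick_overview is a hypothetical helper, and the 10 m base resolution in the example is only an illustrative value.

```python
def pick_overview(base_res: float, factors: list[int], target_res: float) -> int:
    """
    Return the largest decimation factor whose pixel size does not exceed
    `target_res`. A return value of 1 means "use full resolution".
    """
    best = 1
    for factor in sorted(factors):
        if base_res * factor <= target_res:
            best = factor
    return best


if __name__ == "__main__":
    # A 10 m band with four overview levels, requested at roughly 60 m
    factor = pick_overview(base_res=10.0, factors=[2, 4, 8, 16], target_res=60.0)
    print(f"decimation factor: {factor}, effective pixel size: {10.0 * factor} m")
```

With a header dictionary from read_cog_header, the call would be pick_overview(meta["resolution"][0], meta["overviews"][0], target_res). Passing a correspondingly reduced out_shape to .read() is one way to let GDAL serve the request from an overview rather than from full-resolution tiles.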
What to inspect in the returned dictionary
| Field | Pipeline use |
|---|---|
| crs | Spatial indexing, STAC item validation, CRS mismatch rejection |
| bounds | Bounding-box intersection checks before windowed reads |
| resolution | Dynamic overview selection, avoiding unnecessary high-res fetches |
| overviews | Confirm pyramid presence; absent overviews are a COG validation failure |
| dtype & nodata | Memory allocation sizing, masking strategy, type-cast decisions |
| block_shapes | Confirm internal tiling (256×256 or 512×512) vs. strip layout |
| compression | Identify DEFLATE, ZSTD, LZW; affects decompression overhead per worker |
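Several of these checks can be combined into one pre-flight function that works on the header dictionary alone. The sketch below is one possible shape for such a gate: preflight and boxes_intersect are hypothetical names, the area of interest is assumed to be expressed in the same CRS as the asset, and the sample values are made up for illustration.

```python
from typing import Any, Sequence


def boxes_intersect(a: Sequence[float], b: Sequence[float]) -> bool:
    """Both boxes are (left, bottom, right, top) in the same CRS."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def preflight(
    meta: dict[str, Any], *, expected_epsg: int, aoi: Sequence[float]
) -> list[str]:
    """Return human-readable problems; an empty list means the asset passes."""
    problems: list[str] = []
    if meta.get("crs") != expected_epsg:
        problems.append(f"CRS is EPSG:{meta.get('crs')}, expected EPSG:{expected_epsg}")
    if not boxes_intersect(tuple(meta["bounds"]), aoi):
        problems.append("asset bounds do not intersect the area of interest")
    # "overviews" holds one list per band; all lists empty means no pyramid
    if not any(meta.get("overviews", [])):
        problems.append("no overview levels found")
    return problems


if __name__ == "__main__":
    sample = {
        "crs": 32636,
        "bounds": (600000.0, 0.0, 709800.0, 109800.0),
        "overviews": [[2, 4, 8, 16]],
    }
    aoi = (650000.0, 50000.0, 660000.0, 60000.0)
    print(preflight(sample, expected_epsg=32636, aoi=aoi))
```

Returning a list of problems rather than a boolean keeps the rejection reason available for logging.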
Variant Patterns
1. Reading from a private S3 bucket
For buckets that require IAM authentication, configure credentials via environment variables rather than embedding tokens in the URI. GDAL’s /vsis3/ backend inherits the active session.
import os
import rasterio
from rasterio.env import Env
from typing import Any


def read_private_s3_cog(s3_uri: str) -> dict[str, Any]:
    """
    Read a COG header from a private S3 bucket.

    Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    environment variables, a ~/.aws/credentials profile, or an IAM role.
    """
    with Env(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
        AWS_REGION=os.getenv("AWS_REGION", "us-west-2"),
    ):
        with rasterio.open(s3_uri) as src:
            return {
                "crs": src.crs.to_epsg() if src.crs else None,
                "bounds": src.bounds,
                "overviews": [src.overviews(i) for i in range(1, src.count + 1)],
            }
2. Batch header validation across a STAC item collection
When you have already retrieved a list of asset URLs from Querying STAC Catalogs Programmatically, you can validate all headers concurrently without downloading any imagery:
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any

import rasterio
from rasterio.env import Env
from rasterio.errors import RasterioIOError

GDAL_ENV = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
    "AWS_NO_SIGN_REQUEST": "YES",
}


def _fetch_single(url: str) -> dict[str, Any] | None:
    try:
        with Env(**GDAL_ENV):
            with rasterio.open(url) as src:
                return {
                    "url": url,
                    "crs": src.crs.to_epsg() if src.crs else None,
                    "resolution": src.res,
                    "has_overviews": bool(src.overviews(1)),
                }
    except RasterioIOError:
        return None  # Signals a failed read to the caller


def validate_asset_headers(urls: list[str], max_workers: int = 8) -> list[dict[str, Any]]:
    """Return header metadata for every readable URL, in input order; failed reads are dropped."""
    results: list[dict[str, Any] | None] = [None] * len(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(_fetch_single, url): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [r for r in results if r is not None]
ThreadPoolExecutor is appropriate here because the work is I/O-bound — each thread blocks on a network round-trip of ~16 KB. For CPU-bound raster processing after validation, switch to ProcessPoolExecutor or a Dask cluster.
3. Confirming a file is a valid COG before processing
A file named .tif may not be internally tiled or have overviews. Use block_shapes and overviews to gate downstream work:
import rasterio
from rasterio.env import Env


def is_valid_cog(url: str) -> bool:
    """
    Return True only if the remote file appears to use internal tiling
    (block height and width both > 1) and has at least one overview level
    on band 1.
    """
    with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR", AWS_NO_SIGN_REQUEST="YES"):
        with rasterio.open(url) as src:
            block_h, block_w = src.block_shapes[0]
            # Heuristic: single-row strips report block_h == 1. Multi-row strips
            # span the full raster width and would still pass this test.
            is_tiled = block_h > 1 and block_w > 1
            has_overviews = len(src.overviews(1)) > 0
            return is_tiled and has_overviews
Pairing this check with the pattern in Automating Metadata Extraction for Batch Raster Jobs lets you build a pre-flight filter that rejects non-conformant files before they reach expensive processing stages.
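A block-height test alone cannot tell tiles from strips that hold several rows each. The sketch below is a stricter heuristic that also compares the block width with the raster width. It rests on two assumptions worth verifying against your own data: that GDAL reports strips as blocks spanning the full raster width, and that TIFF tile dimensions are multiples of 16. The name classify_layout is hypothetical.

```python
def classify_layout(block_shape: tuple[int, int], width: int) -> str:
    """
    Classify a (block_height, block_width) pair as "tiled", "striped",
    or "ambiguous" for a raster that is `width` pixels wide.
    """
    block_h, block_w = block_shape
    if block_w != width:
        # Strips always span the full width, so anything else is a tile
        return "tiled"
    if block_h % 16 != 0 or block_w % 16 != 0:
        # Full-width block that cannot be a legal tile size
        return "striped"
    # Full-width block with tile-compatible dimensions: could be either
    return "ambiguous"


if __name__ == "__main__":
    print(classify_layout((512, 512), 10980))   # typical tiled COG
    print(classify_layout((1, 10980), 10980))   # one row per strip
    print(classify_layout((8, 10980), 10980))   # multi-row strips
    print(classify_layout((256, 256), 256))     # single tile or strip
```

With an open dataset, the call would be classify_layout(src.block_shapes[0], src.width).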
Common Errors
CPLE_HttpResponse exception: HTTP response code: 403
The bucket requires authentication. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or add AWS_NO_SIGN_REQUEST=YES for publicly accessible buckets that reject signed requests.
CPLE_OpenFailed: /vsicurl/ unable to open ... timeout
GDAL cannot reach the server. Check that libcurl was compiled into your GDAL build (gdal-config --formats | grep HTTP), that outbound HTTPS is not blocked by a firewall, and that CURL_CA_BUNDLE points to a valid CA bundle.
RasterioIOError: file.tif is not a supported file format
The remote path returns HTML or XML (e.g., an S3 permission error page) rather than binary TIFF data. Run curl -I <url> to inspect the HTTP status and response headers before debugging GDAL further.
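When the status code alone is inconclusive, looking at the first few bytes of the response body separates a real TIFF from an error page. The sketch below checks for the four TIFF and BigTIFF signatures and for a leading angle bracket; sniff_payload is a hypothetical helper, and the sample bytes are synthetic.

```python
# Byte-order mark plus magic number: 42 ("*") for TIFF, 43 ("+") for BigTIFF
_TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def sniff_payload(first_bytes: bytes) -> str:
    """Classify the start of a response body as 'tiff', 'markup', or 'unknown'."""
    if first_bytes[:4] in _TIFF_SIGNATURES:
        return "tiff"
    if first_bytes.lstrip().startswith(b"<"):
        # HTML or XML error documents begin with a tag or an XML declaration
        return "markup"
    return "unknown"


if __name__ == "__main__":
    print(sniff_payload(b"II*\x00\x08\x00\x00\x00"))
    print(sniff_payload(b'<?xml version="1.0"?><Error><Code>AccessDenied</Code></Error>'))
```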
Related
- Understanding Cloud-Optimized GeoTIFF Structure — the parent page covering IFD layout, internal tiling, and overview pyramids in full detail.
- Querying STAC Catalogs Programmatically — retrieve asset URLs from a STAC API, then validate headers before committing to downloads.
- Extracting and Parsing Raster Metadata — systematic approaches to harvesting CRS, transform, and band metadata across large raster collections.
- Automating Metadata Extraction for Batch Raster Jobs — build a pipeline that runs header reads across thousands of files and persists results to a metadata store.
- Optimizing Rasterio Window Reads for Memory Efficiency — once headers confirm alignment, use Window reads to fetch only the spatial subset you need.