How to Read COG Headers Without Downloading Full Files

Open a remote Cloud-Optimized GeoTIFF URL with rasterio and read .crs, .bounds, .res, and .overviews() directly. GDAL’s /vsicurl/ backend issues an HTTP Range request for the start of the file (16 KB by default, with follow-up requests if the Image File Directory is larger), so no pixel data is transferred:

import rasterio

url = "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/N/YF/2023/6/S2A_36NYF_20230615_0_L2A/B04.tif"

with rasterio.open(url) as src:
    print(src.crs, src.bounds, src.res)
    print(src.overviews(1))  # overview decimation factors for band 1

This pattern is the entry point for metadata-first validation in any pipeline that consumes the assets described in Understanding Cloud-Optimized GeoTIFF Structure.


Why This Arises in Remote Sensing Workflows

Satellite archives on AWS Open Data, Microsoft Planetary Computer, and Google Earth Engine Storage routinely hold files that are several hundred megabytes to several gigabytes each. When you need to validate spatial alignment, confirm CRS, or check that overview pyramids are present before scheduling a batch job, downloading every file is prohibitively expensive in both time and egress cost.

COGs are structured so that the structural metadata (tile offsets, CRS definitions, band descriptions, and overview level pointers) sits at the head of the file. The HTTP Range request mechanism exploits this layout: for most files, a single partial-content response of 16–32 KB carries everything needed to populate rasterio’s dataset object, and GDAL issues further range requests only when the header is larger than that. Pixel arrays remain on the remote server until you explicitly call .read().
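To make the layout concrete, the sketch below decodes the fixed-size TIFF header at byte 0 of a file, which is where the offset of the first IFD is stored. It uses only the standard library. The names parse_tiff_header and fetch_head are hypothetical helpers, not part of rasterio or GDAL, and the 16 KB default is an assumption that mirrors the first-request size discussed above.

```python
import struct
import urllib.request


def fetch_head(url: str, nbytes: int = 16384) -> bytes:
    """Request only the first `nbytes` of a remote file (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        # A server that honours Range answers 206; a 200 means it ignored the header.
        return resp.read(nbytes)


def parse_tiff_header(data: bytes) -> dict:
    """Decode byte order, TIFF flavour, and the offset of the first IFD."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: unknown byte-order mark")
    (magic,) = struct.unpack(order + "H", data[2:4])
    if magic == 42:
        # Classic TIFF: 4-byte offset of the first IFD at byte 4
        (ifd_offset,) = struct.unpack(order + "I", data[4:8])
        return {"byte_order": order, "bigtiff": False, "first_ifd_offset": ifd_offset}
    if magic == 43:
        # BigTIFF: 8-byte offset of the first IFD at byte 8
        if len(data) < 16:
            raise ValueError("need at least 16 bytes for BigTIFF")
        (ifd_offset,) = struct.unpack(order + "Q", data[8:16])
        return {"byte_order": order, "bigtiff": True, "first_ifd_offset": ifd_offset}
    raise ValueError(f"not a TIFF: magic number {magic}")
```

Calling parse_tiff_header(fetch_head(url)) on a public COG should report a small first_ifd_offset, which is what lets a reader find the metadata without touching the rest of the file. GDAL does considerably more than this (it walks every IFD and its tags), so treat the sketch as an illustration of the layout only.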

This workflow builds on Understanding Cloud-Optimized GeoTIFF Structure and feeds directly into asset-level filtering in Querying STAC Catalogs Programmatically: fetch headers to reject misaligned or incomplete assets before constructing read windows.

For the broader architectural context of how this fits into catalog-driven ingestion, see Core Raster Fundamentals & STAC Mapping.


How GDAL Routes a Remote URL

The diagram below shows the request sequence when you call rasterio.open(url) on an https:// URI.

Figure: GDAL /vsicurl/ HTTP Range request sequence for COG header reads. When rasterio.open() receives a remote URL, GDAL intercepts the call via /vsicurl/, issues a Range: bytes=0-16383 request, receives a 206 Partial Content response containing the IFD and tags, caches the metadata in memory, and returns a DatasetReader to the caller. Pixel data is not transferred at this stage; it is fetched only on a .read() call.

GDAL’s virtual filesystem layer (/vsicurl/) intercepts the https:// URI before any I/O occurs, rewrites the open call as an HTTP Range request, and caches the response. The caller receives a fully populated DatasetReader object. This is the mechanism that makes the metadata extraction patterns described elsewhere in this site possible at scale.
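The rewriting step can be pictured as a prefix mapping from URI scheme to GDAL virtual filesystem path. The sketch below is an illustration of that path convention under simplifying assumptions, not rasterio's actual implementation; to_vsi_path is a hypothetical name and only the three schemes mentioned in this article are handled.

```python
from urllib.parse import urlparse

# URI scheme -> GDAL virtual filesystem prefix (illustrative subset only)
_VSI_PREFIXES = {"s3": "/vsis3/", "gs": "/vsigs/"}


def to_vsi_path(uri: str) -> str:
    """Map a URI to the /vsi.../ path GDAL would open (hypothetical helper)."""
    if uri.startswith("/vsi"):
        return uri  # already a GDAL virtual path
    parsed = urlparse(uri)
    if parsed.scheme == "":
        return uri  # plain local path, nothing to rewrite
    if parsed.scheme in ("http", "https"):
        # /vsicurl/ keeps the full URL, scheme included
        return "/vsicurl/" + uri
    prefix = _VSI_PREFIXES.get(parsed.scheme)
    if prefix is None:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    # Cloud schemes drop the scheme and keep bucket + key
    return f"{prefix}{parsed.netloc}{parsed.path}"
```

For example, to_vsi_path("s3://bucket/dir/a.tif") yields "/vsis3/bucket/dir/a.tif". Passing the /vsicurl/ form of an https:// URL to rasterio.open() should behave the same as passing the URL itself.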


Environment & Setup

Package          | Minimum version | Why required
rasterio         | 1.3.0           | Exposes GDAL’s /vsicurl/ through its Python API
GDAL (C library) | 3.4.0           | Implements HTTP Range, /vsicurl/, /vsis3/, /vsigs/
libcurl          | 7.68            | Handles the underlying HTTP/HTTPS transport for GDAL

pip install "rasterio>=1.3.0"

Verify GDAL can reach remote files before running production code:

import rasterio
from rasterio.env import Env

with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR"):
    with rasterio.open("https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/N/YF/2023/6/S2A_36NYF_20230615_0_L2A/B04.tif") as src:
        print(src.driver)  # GTiff — confirms a successful header-only open
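The minimum versions in the table can also be checked programmatically before any network call. The sketch below is one way to do it, assuming rasterio exposes its own version as rasterio.__version__ and the linked GDAL version as rasterio.__gdal_version__; version_tuple, meets_minimum, and check_environment are hypothetical helper names. libcurl is not checked because rasterio does not report its version directly.

```python
def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '3.4.1' or '1.4.0rc1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict[str, bool]:
    """Report whether the installed rasterio and GDAL meet the table's minimums."""
    import rasterio  # imported lazily so the helpers above work without it

    return {
        "rasterio": meets_minimum(rasterio.__version__, "1.3.0"),
        "gdal": meets_minimum(rasterio.__gdal_version__, "3.4.0"),
    }
```

Running check_environment() at pipeline start-up gives a clear failure before the first remote open, rather than an opaque GDAL error later.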

Complete Working Example

The function below extracts a full metadata dictionary from a remote COG. Every field comes from the header — .read() is never called.

from typing import Any

import rasterio
from rasterio.env import Env
from rasterio.errors import RasterioIOError

def read_cog_header(url: str, *, aws_unsigned: bool = False) -> dict[str, Any]:
    """
    Fetch COG metadata via HTTP Range request — no pixel data is transferred.

    Parameters
    ----------
    url : str
        Remote COG path. Accepts https://, s3://, gs://, or /vsicurl/ prefixes.
    aws_unsigned : bool
        Set True for public AWS Open Data buckets that do not require signing.

    Returns
    -------
    dict[str, Any]
        Spatial metadata parsed from the IFD. Pixel arrays are not included.
    """
    gdal_env: dict[str, str] = {
        # Suppress directory-listing requests that add latency and cost
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        # Merge adjacent byte ranges into a single HTTP request
        "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
        # Enable HTTP/2 multiplexing for concurrent range requests
        "GDAL_HTTP_MULTIPLEX": "YES",
        "GDAL_HTTP_VERSION": "2",
    }

    if aws_unsigned:
        # Public data needs no credentials, so skip AWS Signature V4 signing.
        # This matters for s3:// paths; plain https:// URLs are not signed.
        gdal_env["AWS_NO_SIGN_REQUEST"] = "YES"

    try:
        with Env(**gdal_env):
            with rasterio.open(url) as src:
                return {
                    "driver": src.driver,           # Should be "GTiff" for COGs
                    "dtype": src.dtypes[0],          # Data type of band 1
                    "count": src.count,              # Number of spectral bands
                    "width": src.width,              # Columns (pixels)
                    "height": src.height,            # Rows (pixels)
                    "crs": src.crs.to_epsg() if src.crs else None,  # EPSG code or None
                    "transform": src.transform,      # Affine geotransform (origin, pixel size)
                    "nodata": src.nodata,            # Fill / no-data sentinel value
                    "resolution": src.res,           # (x_res, y_res) in CRS units
                    "bounds": src.bounds,            # BoundingBox(left, bottom, right, top)
                    # List of overview (pyramid) levels per band — empty means no overviews
                    "overviews": [src.overviews(i) for i in range(1, src.count + 1)],
                    # Internal block (tile) size — (height, width) tuple
                    "block_shapes": list(src.block_shapes),
                    "compression": src.compression.value if src.compression else None,
                }
    except RasterioIOError as exc:
        raise RuntimeError(f"Could not read COG header from {url!r}: {exc}") from exc


# Example — Sentinel-2 L2A red band from AWS Open Data (public bucket)
if __name__ == "__main__":
    s2_url = (
        "https://sentinel-cogs.s3.us-west-2.amazonaws.com"
        "/sentinel-s2-l2a-cogs/36/N/YF/2023/6"
        "/S2A_36NYF_20230615_0_L2A/B04.tif"
    )
    meta = read_cog_header(s2_url, aws_unsigned=True)
    print(f"CRS: EPSG:{meta['crs']}")
    print(f"Resolution: {meta['resolution']} m")
    print(f"Overview levels (band 1): {meta['overviews'][0]}")
    print(f"Block shape: {meta['block_shapes'][0]}")

What to inspect in the returned dictionary

Field          | Pipeline use
crs            | Spatial indexing, STAC item validation, CRS mismatch rejection
bounds         | Bounding-box intersection checks before windowed reads
resolution     | Dynamic overview selection, avoiding unnecessary high-res fetches
overviews      | Confirm pyramid presence; absent overviews are a COG validation failure
dtype & nodata | Memory allocation sizing, masking strategy, type-cast decisions
block_shapes   | Confirm internal tiling (256×256 or 512×512) vs. strip layout
compression    | Identify DEFLATE, ZSTD, LZW; affects decompression overhead per worker
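A few of these checks can be sketched as plain functions over the dictionary returned by read_cog_header. The helper names (bounds_intersect, pick_overview, header_problems) and the choice of rules are assumptions made for illustration. The bounding boxes are assumed to be (left, bottom, right, top) tuples in the same CRS, which is the order rasterio's BoundingBox uses.

```python
from typing import Any


def bounds_intersect(a, b) -> bool:
    """True if two (left, bottom, right, top) boxes in the same CRS overlap."""
    a_left, a_bottom, a_right, a_top = a
    b_left, b_bottom, b_right, b_top = b
    return a_left < b_right and b_left < a_right and a_bottom < b_top and b_bottom < a_top


def pick_overview(native_res: float, factors: list[int], target_res: float) -> int:
    """Largest decimation factor whose pixel size stays <= target_res (1 = full resolution)."""
    usable = [f for f in factors if native_res * f <= target_res]
    return max(usable, default=1)


def header_problems(meta: dict[str, Any], *, expected_epsg: int, aoi) -> list[str]:
    """Return human-readable reasons to reject an asset; an empty list means it passes."""
    problems = []
    if meta["crs"] != expected_epsg:
        problems.append(f"CRS is EPSG:{meta['crs']}, expected EPSG:{expected_epsg}")
    if not bounds_intersect(tuple(meta["bounds"]), aoi):
        problems.append("asset does not intersect the area of interest")
    if not meta["overviews"] or not meta["overviews"][0]:
        problems.append("band 1 has no overviews")
    return problems
```

Returning a list of reasons rather than a bare boolean makes it easier to log why an asset was rejected. pick_overview(10.0, [2, 4, 8, 16], 45.0) selects factor 4, since 40 m is the coarsest level that still satisfies a 45 m target.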

Variant Patterns

1. Reading from a private S3 bucket

For buckets that require IAM authentication, configure credentials via environment variables rather than embedding tokens in the URI. GDAL’s /vsis3/ backend picks them up from the environment, a shared credentials file, or an attached IAM role.

import os
import rasterio
from rasterio.env import Env
from typing import Any

def read_private_s3_cog(s3_uri: str) -> dict[str, Any]:
    """
    Read a COG header from a private S3 bucket.
    Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    environment variables, a ~/.aws/credentials profile, or an IAM role.
    """
    with Env(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
        AWS_REGION=os.getenv("AWS_REGION", "us-west-2"),
    ):
        with rasterio.open(s3_uri) as src:
            return {
                "crs": src.crs.to_epsg() if src.crs else None,
                "bounds": src.bounds,
                "overviews": [src.overviews(i) for i in range(1, src.count + 1)],
            }

2. Batch header validation across a STAC item collection

When you have already retrieved a list of asset URLs from Querying STAC Catalogs Programmatically, you can validate all headers concurrently without downloading any imagery:

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any
import rasterio
from rasterio.env import Env
from rasterio.errors import RasterioIOError

GDAL_ENV = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
    "AWS_NO_SIGN_REQUEST": "YES",
}

def _fetch_single(url: str) -> dict[str, Any] | None:
    try:
        with Env(**GDAL_ENV):
            with rasterio.open(url) as src:
                return {
                    "url": url,
                    "crs": src.crs.to_epsg() if src.crs else None,
                    "resolution": src.res,
                    "has_overviews": any(src.overviews(1)),
                }
    except RasterioIOError:
        return None  # Caller decides how to handle missing assets

def validate_asset_headers(urls: list[str], max_workers: int = 8) -> list[dict[str, Any]]:
    """Return header metadata for every URL that could be read; failed reads are dropped."""
    results: list[dict[str, Any] | None] = [None] * len(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(_fetch_single, url): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [r for r in results if r is not None]

ThreadPoolExecutor is appropriate here because the work is I/O-bound — each thread blocks on a network round-trip of ~16 KB. For CPU-bound raster processing after validation, switch to ProcessPoolExecutor or a Dask cluster.
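Because validate_asset_headers drops failed reads, the caller cannot tell which URLs failed. A variant that keeps both sides might look like the sketch below. partition_headers is a hypothetical name, and the fetch function is injected so the concurrency logic can be exercised without network access; in the pipeline above you would pass _fetch_single.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Optional

# A fetcher takes a URL and returns header metadata, or None on failure
Fetcher = Callable[[str], Optional[dict[str, Any]]]


def partition_headers(
    urls: list[str], fetch: Fetcher, max_workers: int = 8
) -> tuple[dict[str, dict[str, Any]], list[str]]:
    """Return (metadata keyed by URL, sorted list of URLs that failed)."""
    ok: dict[str, dict[str, Any]] = {}
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                result = future.result()
            except Exception:
                # Treat any exception raised by the fetcher as a failed read
                result = None
            if result is None:
                failed.append(url)
            else:
                ok[url] = result
    return ok, sorted(failed)
```

Keeping the failed URLs lets a later stage retry them or report them against the originating STAC items instead of silently shrinking the batch.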

3. Confirming a file is a valid COG before processing

A file with a .tif extension is not necessarily internally tiled, and it may lack overviews. Use block_shapes and overviews to gate downstream work:

import rasterio
from rasterio.env import Env

def is_valid_cog(url: str) -> bool:
    """
    Return True only if the remote file uses internal tiling (blocks narrower
    than the raster) and has at least one overview level on band 1.
    """
    with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR", AWS_NO_SIGN_REQUEST="YES"):
        with rasterio.open(url) as src:
            _, block_w = src.block_shapes[0]
            # Strips span the full raster width and may be several rows tall,
            # so compare block width with raster width. A raster no wider than
            # one tile is reported as untiled by this heuristic.
            is_tiled = block_w < src.width
            has_overviews = len(src.overviews(1)) > 0
            return is_tiled and has_overviews

Pairing this check with the pattern for automating metadata extraction for batch raster jobs lets you build a pre-flight filter that rejects non-conformant files before they reach expensive processing stages.


Common Errors

CPLE_HttpResponse exception: HTTP response code: 403
The server refused the request. For a private bucket, set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For a publicly accessible bucket, add AWS_NO_SIGN_REQUEST=YES so that GDAL sends unsigned requests.

CPLE_OpenFailed: /vsicurl/ unable to open ... timeout
GDAL cannot reach the server. Check that libcurl was compiled into your GDAL build (gdal-config --formats | grep HTTP), that outbound HTTPS is not blocked by a firewall, and that CURL_CA_BUNDLE points to a valid CA bundle.

RasterioIOError: file.tif is not a supported file format
The remote path returns HTML or XML (e.g., an S3 permission error page) rather than binary TIFF data. Inspect the response headers with curl -I <url> to see the actual HTTP status before debugging GDAL further.
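The last case can also be detected from Python by looking at the first few bytes of the response. The sketch below classifies a payload by its leading bytes; sniff_payload is a hypothetical helper, and the set of markup prefixes it recognises is an assumption rather than an exhaustive list.

```python
def sniff_payload(head: bytes) -> str:
    """Classify the first bytes of a response as 'tiff', 'bigtiff', 'xml', 'html', or 'unknown'."""
    # TIFF magic: byte-order mark followed by 42 ('*') for classic TIFF
    if head[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    # BigTIFF uses 43 ('+') in the same position
    if head[:4] in (b"II+\x00", b"MM\x00+"):
        return "bigtiff"
    text = head.lstrip()[:64].lower()
    if text.startswith((b"<?xml", b"<error")):
        return "xml"  # e.g. an XML error document from an object store
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

A result of "xml" or "html" for a URL that should be a COG points to an access or routing problem rather than a corrupt file.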