Understanding Cloud-Optimized GeoTIFF Structure
The transition from monolithic raster archives to cloud-native data architectures has reshaped how remote sensing pipelines ingest, validate, and stream geospatial imagery. The Cloud-Optimized GeoTIFF (COG) specification enables efficient HTTP range requests, parallelized tile streaming, and on-the-fly spatial subsetting — all without rewriting analysis code. For environmental data engineers and Python GIS developers, understanding a COG’s internal layout is a prerequisite for building scalable, cost-effective workflows that avoid unnecessary egress and memory bottlenecks. For the full architectural context of where COGs sit in a catalog-driven pipeline, see Core Raster Fundamentals & STAC Mapping.
This guide dissects the internal anatomy of COGs step by step, provides a production-ready validation workflow, and documents the structural pitfalls most frequently encountered in Python pipelines.
Prerequisites
pip install "rasterio>=1.3.0" "numpy>=1.24"
| Library | Min version | Why required |
|---|---|---|
| rasterio | 1.3.0 | Block shape inspection, overview enumeration, profile access |
| numpy | 1.24 | Array dtype introspection in validation helpers |
| GDAL (system) | 3.4 | libcurl support for /vsicurl/ remote reads; ZSTD codec |
Conceptual prerequisites:
- TIFF Image File Directory (IFD) structure and tag semantics
- HTTP range requests and how object-storage servers expose byte ranges
- Affine geotransforms — covered in Handling Pixel Resolution and Scaling
- Coordinate reference systems — covered in Mastering CRS Transformations in Rasterio
COG Internal Architecture
A conventional GeoTIFF often stores pixel data in horizontal strips, so reading a small window over HTTP can mean fetching most of the file before any meaningful processing can occur. A COG reorganizes this layout around four structural requirements, governed by the OGC GeoTIFF Standard and the COG specification (adopted as an official OGC standard in 2023).
Step 1 — Tile-Based Internal Layout
Instead of horizontal strips, COGs partition imagery into fixed-size square tiles, typically 256×256 or 512×512 pixels. Each tile is stored contiguously, so clients can request only the spatial extent required for analysis or rendering. Each IFD carries TileOffsets and TileByteCounts tags that map every tile index to its byte position and length in the file. This layout is essential for parallelized reads: multiple workers can fetch disjoint byte ranges simultaneously without contention, which reduces I/O wait times in distributed compute environments.
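The tile arithmetic behind this layout can be sketched in a few lines. The example is illustrative only: `tiles_for_window` is a hypothetical helper, and the 10980×10980 raster with 512-pixel tiles is an assumed example rather than a value read from a real file.

```python
from math import ceil


def tiles_for_window(width, height, tile, col_off, row_off, win_w, win_h):
    """Return (tiles_across, tiles_down, hits) for a pixel window.

    hits is a list of (tile_row, tile_col) pairs that intersect the window.
    """
    tiles_across = ceil(width / tile)
    tiles_down = ceil(height / tile)
    # Clamp the window to the raster extent
    col_end = min(col_off + win_w, width)
    row_end = min(row_off + win_h, height)
    first_col, last_col = col_off // tile, (col_end - 1) // tile
    first_row, last_row = row_off // tile, (row_end - 1) // tile
    hits = [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]
    return tiles_across, tiles_down, hits


across, down, hits = tiles_for_window(10980, 10980, 512, 1000, 2000, 600, 300)
print(f"{len(hits)} of {across * down} tiles needed")
# For a single-plane image, a tile's position in TileOffsets is row * tiles_across + col
print([r * across + c for r, c in hits])
```

Only the tiles in `hits` need to be fetched, and each one corresponds to a single entry in the TileOffsets and TileByteCounts arrays.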
Step 2 — Internal Overviews (Image Pyramids)
COGs embed lower-resolution copies of the base imagery as additional IFDs within the same file. Each IFD references a progressively downsampled tile grid, typically at factors of 2, 4, 8, and 16. When a client requests a regional overview or a web map tile, the reader fetches tiles from the overview IFD closest to the requested resolution instead of reading and resampling full-resolution data. Properly ordered overviews prevent the over-fetching penalty that plagues visualization layers and dashboard queries. Overview tile data must be ordered coarsest-first and stored physically before the full-resolution pixel data.
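As a rough sketch of how overview levels relate to the base image, the snippet below computes overview dimensions and picks a level for a requested decimation. Both helpers are hypothetical, the 10980-pixel raster is an assumed example, and the rounding-up of dimensions is an assumption about how the overview sizes were generated.

```python
from math import ceil


def overview_shapes(width, height, factors=(2, 4, 8, 16)):
    """Return (factor, width, height) for each overview level, rounding up."""
    return [(f, ceil(width / f), ceil(height / f)) for f in factors]


def pick_overview(factors, requested_scale):
    """Pick the largest factor that does not exceed the requested decimation.

    Returns 1 (full resolution) when no overview is coarse enough to use
    without upsampling.
    """
    usable = [f for f in sorted(factors) if f <= requested_scale]
    return usable[-1] if usable else 1


shapes = overview_shapes(10980, 10980)
for factor, w, h in shapes:
    print(f"1/{factor}: {w} x {h}")

print(pick_overview((2, 4, 8, 16), 6))    # a 6x zoom-out reads the 1/4 level
print(pick_overview((2, 4, 8, 16), 1.5))  # falls back to full resolution
```

Real readers apply their own selection rule, so treat `pick_overview` as one reasonable policy rather than the behavior of any specific library.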
Step 3 — Compression and Predictor Alignment
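COGs are normally compressed, and analytical pipelines should use a lossless codec to preserve pixel values; lossy codecs such as JPEG or WebP are valid in a COG but are best reserved for visualization imagery. Common lossless algorithms include DEFLATE, LZW, and ZSTD. These codecs are usually paired with a TIFF predictor: Predictor=2 (horizontal differencing) for integer data and Predictor=3 (floating-point predictor) for floating-point data. Predictors store the difference between adjacent pixel values rather than raw values, which usually improves compression ratios substantially for continuous raster data such as elevation models or multispectral reflectance. A poorly matched predictor mainly costs compression ratio and read throughput; Predictor=3 is defined only for floating-point samples, so applying it to integer data is an error rather than a tuning choice.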
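The effect of horizontal differencing can be demonstrated with the standard library alone. This is a simplified, single-band sketch of the idea behind Predictor=2, not the libtiff implementation; the synthetic ramp data and all function names are assumptions made for the illustration.

```python
import struct
import zlib

WIDTH, HEIGHT = 256, 64


def synthetic_row(y):
    # Smooth ramp with a small periodic ripple, well inside the uint16 range
    return [1000 + 3 * x + 2 * y + ((x * x + y) % 5) for x in range(WIDTH)]


def encode_horizontal(row):
    # Keep the first sample, then store differences modulo 2**16
    return [row[0]] + [(row[i] - row[i - 1]) % 65536 for i in range(1, len(row))]


def decode_horizontal(deltas):
    # Running sum modulo 2**16 restores the original samples exactly
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append((out[-1] + d) % 65536)
    return out


def pack(rows):
    # Little-endian uint16 samples, row after row
    return b"".join(struct.pack(f"<{WIDTH}H", *r) for r in rows)


rows = [synthetic_row(y) for y in range(HEIGHT)]
raw = zlib.compress(pack(rows), 6)
predicted = zlib.compress(pack([encode_horizontal(r) for r in rows]), 6)
print("DEFLATE only:      ", len(raw), "bytes")
print("DEFLATE + predictor:", len(predicted), "bytes")
```

Because the differences are small and repetitive, DEFLATE sees far more redundancy in the predicted stream than in the raw samples, and decoding is exact.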
Step 4 — Image File Directory (IFD) Architecture
The TIFF specification organizes metadata and data pointers into IFDs. A valid COG contains a primary IFD for the base resolution, followed by sequential overview IFDs stored in the file before the full-resolution pixel data. Each IFD stores critical tags: TileWidth, TileLength, Compression, PhotometricInterpretation, and spatial reference identifiers. Coordinate reference system definitions are embedded directly within these directory entries as GeoKey tags. Understanding how IFDs chain together is essential when working through Mastering CRS Transformations in Rasterio, where CRS tags must match the declared projection.
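To make the IFD chain concrete, the sketch below walks the directories of a tiny hand-built classic TIFF header. It is deliberately simplified: it assumes every entry is an inline LONG value, ignores BigTIFF, and uses a synthetic two-IFD byte string (`demo`) rather than a real file. The helper names are hypothetical.

```python
import struct

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 322: "TileWidth", 323: "TileLength"}


def walk_ifds(data: bytes):
    """Follow the IFD chain of a classic TIFF and return one tag dict per IFD."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]  # byte-order mark
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF uses magic 43)")
    ifds = []
    while offset:  # a next-IFD offset of 0 ends the chain
        (count,) = struct.unpack_from(order + "H", data, offset)
        tags = {}
        for i in range(count):
            # Each entry is 12 bytes: tag, type, value count, value-or-offset
            tag, _type, _n, value = struct.unpack_from(
                order + "HHII", data, offset + 2 + 12 * i
            )
            tags[TAG_NAMES.get(tag, tag)] = value
        ifds.append(tags)
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return ifds


def make_ifd(entries, next_offset):
    # Synthetic IFD: every entry is type 4 (LONG) with a single inline value
    body = struct.pack("<H", len(entries))
    for tag, value in entries:
        body += struct.pack("<HHII", tag, 4, 1, value)
    return body + struct.pack("<I", next_offset)


full_res = [(256, 1024), (257, 1024), (322, 512), (323, 512)]
overview = [(256, 512), (257, 512), (322, 512), (323, 512)]
ifd0_len = 2 + 12 * len(full_res) + 4
demo = (
    b"II" + struct.pack("<HI", 42, 8)
    + make_ifd(full_res, 8 + ifd0_len)
    + make_ifd(overview, 0)
)

for level, ifd in enumerate(walk_ifds(demo)):
    print(level, ifd)
```

The first directory describes the base resolution and the second the half-resolution overview, which mirrors the IFD order described above.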
Step-by-Step Validation Workflow
Step 1 — Inspect the Remote Header Without Downloading the File
Before downloading gigabytes of imagery, you can inspect the file header and verify HTTP range request support. The mechanics of reading COG headers without downloading full files cover the exact GDAL environment variables and /vsicurl/ prefix configuration needed to minimize network overhead during catalog crawling or pre-flight checks.
import os

import rasterio

# Configure GDAL for remote reads: skip directory listings on open and
# restrict /vsicurl/ requests to TIFF extensions
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif,.tiff"

remote_url = "https://example.com/data/scene.tif"  # replace with your COG URL

with rasterio.open(f"/vsicurl/{remote_url}") as src:
    print("Driver:      ", src.driver)
    print("CRS:         ", src.crs)
    print("Dimensions:  ", src.width, "×", src.height)
    print("Block shapes:", src.block_shapes)
    print("Overviews:   ", src.overviews(1))
    print("Compression: ", src.compression)
Expected output for a valid COG: block_shapes entries smaller than the raster dimensions (e.g., (512, 512)), and at least two overview levels.
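Under the hood, a header inspection like this comes down to an HTTP request with a Range header. The sketch below only builds such a request with the standard library; `header_request` is a hypothetical helper, the URL is the same placeholder as above, and the 16 KiB size is an assumed illustration value rather than a requirement.

```python
import urllib.request


def header_request(url: str, nbytes: int = 16384) -> urllib.request.Request:
    """Build a GET request for the first nbytes of a remote file."""
    # HTTP byte ranges are inclusive on both ends, hence nbytes - 1
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})


req = header_request("https://example.com/data/scene.tif")
print(req.get_header("Range"))

# Sending it is left commented out because the URL is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)  # 206 Partial Content means the range was honored
```

A server that ignores the header answers with status 200 and the whole body, which is the situation the tiled layout is meant to avoid.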
Step 2 — Automated Structural Validation
The function below validates all four COG structural requirements. It is designed to be integrated into CI/CD pipelines, data lake ingestion hooks, or automated STAC catalog validators.
import rasterio
from rasterio.enums import Compression
from typing import Any


def validate_cog_structure(filepath: str) -> dict[str, Any]:
    """
    Validates core COG structural requirements for a local or remote path.
    Returns a results dict with 'valid', 'checks', and 'warnings' keys.
    """
    results: dict[str, Any] = {"valid": True, "checks": {}, "warnings": []}
    try:
        with rasterio.open(filepath) as src:
            block_shapes = src.block_shapes

            # 1. Tiling check: a strip layout has block width == raster width.
            #    A tiled layout has block dims smaller than the full image.
            #    (A raster smaller than one tile also fails this check.)
            is_tiled = all(
                h < src.height and w < src.width
                for h, w in block_shapes
            )
            results["checks"]["is_tiled"] = is_tiled
            if not is_tiled:
                results["valid"] = False
                results["warnings"].append(
                    "Strip layout detected; not a valid COG. "
                    "Re-create with gdal_translate -of COG."
                )

            # 2. Verify recommended tile dimensions
            if is_tiled:
                tile_h, tile_w = block_shapes[0]
                results["checks"]["tile_size"] = f"{tile_w}×{tile_h}"
                if tile_w not in (256, 512) or tile_h not in (256, 512):
                    results["warnings"].append(
                        f"Non-standard tile size {tile_w}×{tile_h}. "
                        "Use 256 or 512 for best HTTP range efficiency."
                    )

            # 3. Compression: DEFLATE, LZW, or ZSTD recommended
            comp = src.compression
            results["checks"]["compression"] = comp.name if comp else "None"
            if comp not in (Compression.deflate, Compression.lzw, Compression.zstd):
                results["warnings"].append(
                    "Non-standard compression. Use DEFLATE, LZW, or ZSTD."
                )

            # 4. Internal overviews: must exist and be non-empty
            overviews = src.overviews(1)
            results["checks"]["overview_levels"] = len(overviews)
            if len(overviews) == 0:
                results["valid"] = False
                results["warnings"].append(
                    "No internal overviews. Add with: "
                    "gdaladdo -r average file.tif 2 4 8 16"
                )

            # 5. Predictor: match to data dtype
            if comp:
                # The rasterio profile does not carry the predictor. Recent
                # GDAL versions report it in the IMAGE_STRUCTURE metadata
                # domain; a missing key is treated as 1 (no prediction).
                structure = src.tags(ns="IMAGE_STRUCTURE")
                predictor = int(structure.get("PREDICTOR", 1))
                results["checks"]["predictor"] = predictor
                dtype = src.dtypes[0]
                expected = 3 if "float" in dtype else 2
                if predictor != expected:
                    results["warnings"].append(
                        f"Predictor {predictor} may not match dtype '{dtype}'. "
                        f"Expected Predictor={expected}."
                    )
    except Exception as exc:
        # Any open or read failure marks the file invalid and records the reason
        results["valid"] = False
        results["error"] = str(exc)
    return results


# Usage
if __name__ == "__main__":
    import json

    report = validate_cog_structure("/vsicurl/https://example.com/data/scene.tif")
    print(json.dumps(report, indent=2))
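For CI/CD use, the report dict usually has to be reduced to a process exit code. The sketch below shows one possible mapping; `ci_exit_code`, its `strict` flag, and the sample reports are hypothetical and only assume the 'valid', 'warnings', and 'error' keys described above.

```python
def ci_exit_code(report: dict, strict: bool = False) -> int:
    """Map a validation report to an exit code: 0 pass, 1 fail, 2 warnings in strict mode."""
    if "error" in report or not report.get("valid", False):
        return 1
    if strict and report.get("warnings"):
        return 2
    return 0


# Hand-written sample reports in the same shape as the validator's output
clean = {"valid": True, "checks": {"is_tiled": True}, "warnings": []}
noisy = {"valid": True, "checks": {"is_tiled": True}, "warnings": ["Non-standard tile size"]}
broken = {"valid": False, "checks": {}, "warnings": [], "error": "file not found"}

for name, rep in (("clean", clean), ("noisy", noisy), ("broken", broken)):
    print(name, ci_exit_code(rep), ci_exit_code(rep, strict=True))
```

In a pipeline script the result would typically be passed to `sys.exit`, so that warnings block ingestion only when strict mode is requested.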
Step 3 — Parameter Tuning During COG Creation
When creating COGs from raw imagery, the GDAL COG driver flags directly control all four structural properties:
import subprocess


def convert_to_cog(
    input_path: str,
    output_path: str,
    compression: str = "ZSTD",
    tile_size: int = 512,
) -> None:
    """
    Converts a raster to a COG using the GDAL COG driver (GDAL >= 3.1).
    The driver builds internal overviews and arranges IFDs and tile data in
    COG order itself, so no separate gdaladdo pass or level list is needed.
    """
    cmd = [
        "gdal_translate",
        "-of", "COG",
        "-co", f"COMPRESS={compression}",
        "-co", f"BLOCKSIZE={tile_size}",
        # YES lets the driver choose predictor 2 or 3 from the data type
        # (meaningful for LZW, DEFLATE, and ZSTD)
        "-co", "PREDICTOR=YES",
        "-co", "OVERVIEW_RESAMPLING=AVERAGE",
        # Rebuild overviews instead of reusing any that ship with the source
        "-co", "OVERVIEWS=IGNORE_EXISTING",
        input_path,
        output_path,
    ]
    # check=True raises CalledProcessError if gdal_translate exits non-zero
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)


convert_to_cog("raw_scene.tif", "output.cog.tif")
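Creation options often depend on the data being converted. The helper below is a hypothetical sketch of how a pipeline might derive COG driver options from a dtype string and a categorical flag; the function name and its defaults are assumptions, while the option names are those of the GDAL COG driver.

```python
def cog_creation_options(
    dtype: str,
    categorical: bool = False,
    compression: str = "ZSTD",
    blocksize: int = 512,
) -> list[str]:
    """Return gdal_translate '-co' arguments suited to the data type."""
    # Floating-point predictor for float rasters, horizontal differencing otherwise
    predictor = "FLOATING_POINT" if dtype.startswith("float") else "STANDARD"
    # MODE keeps class codes intact; AVERAGE suits continuous measurements
    resampling = "MODE" if categorical else "AVERAGE"
    options = {
        "COMPRESS": compression,
        "BLOCKSIZE": str(blocksize),
        "PREDICTOR": predictor,
        "OVERVIEW_RESAMPLING": resampling,
    }
    args: list[str] = []
    for key, value in options.items():
        args += ["-co", f"{key}={value}"]
    return args


print(cog_creation_options("float32"))
print(cog_creation_options("uint8", categorical=True))
```

The returned list can be spliced into a `gdal_translate -of COG` command between the driver flag and the input path.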
Step 4 — Output Validation and Assertions
After creation or ingestion, run programmatic assertions before accepting a file into a catalog:
import rasterio
from rasterio.enums import Compression


def assert_cog_ready(filepath: str, min_overview_levels: int = 2) -> None:
    """
    Raises AssertionError with a descriptive message if the COG fails validation.
    Drop this into an ingestion pipeline or pytest fixture.
    """
    with rasterio.open(filepath) as src:
        # Tiled layout: the first band's block is smaller than the raster
        tile_h, tile_w = src.block_shapes[0]
        assert tile_h < src.height and tile_w < src.width, (
            f"File is not tiled. block_shapes={src.block_shapes}"
        )
        assert tile_w in (256, 512), (
            f"Non-standard tile width {tile_w}. Expected 256 or 512."
        )

        # Enough internal overview levels on band 1
        overviews = src.overviews(1)
        assert len(overviews) >= min_overview_levels, (
            f"Only {len(overviews)} overview levels found; "
            f"need at least {min_overview_levels}."
        )

        # Lossless codec from the recommended set
        assert src.compression in (
            Compression.deflate,
            Compression.lzw,
            Compression.zstd,
        ), f"Unsupported compression: {src.compression}"

    print(f"PASS: {filepath} meets COG structural requirements.")
Parameter Reference
| Parameter | Type | Default | Usage note |
|---|---|---|---|
| TILED | GTiff creation option (bool) | NO | Must be YES when writing with the plain GTiff driver; the COG driver always writes tiles and does not take this option |
| BLOCKSIZE | COG creation option (int) | 512 | Tile edge length in pixels; 512 reduces round-trips for large queries (GTiff uses BLOCKXSIZE/BLOCKYSIZE, default 256) |
| COMPRESS | GDAL creation option (str) | NONE (GTiff); LZW (COG driver, GDAL 3.4+) | ZSTD often gives a good ratio and speed trade-off for float data; DEFLATE for broad compatibility |
| PREDICTOR | GDAL creation option | 1 (GTiff); NO (COG driver) | GTiff: 2 for integer dtypes, 3 for float dtypes, 1 disables prediction. COG driver: YES, STANDARD, or FLOATING_POINT |
| OVERVIEW_RESAMPLING | COG creation option (str) | CUBIC (NEAREST for paletted images) | AVERAGE for continuous data; MODE for categorical/classification |
| src.block_shapes | rasterio property | n/a | List of (height, width) tuples per band; use to detect strip vs tile layout |
| src.overviews(band) | rasterio method | n/a | Returns list of overview scale factors for band; empty list means no overviews |
| src.compression | rasterio property | None | rasterio.enums.Compression enum value; None if uncompressed |
Verification & Testing
After running assert_cog_ready, verify the file visually and structurally:
# GDAL command-line summary: confirms overview presence per band
gdalinfo -json output.cog.tif | python3 -m json.tool | grep -A5 "overviews"
# COG-specific validator from rio-cogeo (pip install rio-cogeo)
python3 -c "
from rio_cogeo.cogeo import cog_validate
is_valid, errors, warnings = cog_validate('output.cog.tif')
print('Valid:', is_valid)
print('Errors:', errors)
print('Warnings:', warnings)
"
For files written by the GDAL COG driver, gdalinfo lists LAYOUT=COG under Image Structure Metadata, and each band reports its overview sizes. gdalinfo does not print IFD byte offsets; the rio_cogeo validator is the tool that checks tiling, overview presence, and the ordering of IFD and tile-data offsets in a single call.
For pipelines that process validated COGs into analytical arrays, pairing this step with band math operations with xarray ensures only structurally sound rasters enter memory.
Troubleshooting
CPLE_AppDefined: File is not a GeoTIFF
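Cause: GDAL could not open the file with its GeoTIFF driver, usually because the file is another format (for example HDF5 or an arbitrary binary blob) carrying a .tif extension.
Fix: run gdalinfo <file> to identify the actual driver. BigTIFF files are read by the same GTiff driver with no extra flags; -co BIGTIFF=YES is only a creation option for writing outputs larger than 4 GB.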
block_shapes returns [(1, raster_width)]
Cause: the file uses strip layout instead of tiles — it is not a COG.
Fix: re-create with gdal_translate -of COG ... using the GDAL COG driver, which always writes tiles and enforces correct byte ordering automatically. TILED=YES applies only to the plain GTiff driver.
overviews is empty after conversion
Cause: the source file had external .ovr sidecar overviews that were not internalized, or the -co OVERVIEWS=IGNORE_EXISTING flag was not set correctly.
Fix: run gdaladdo -r average <file> 2 4 8 16 to rebuild internal overviews (omit -ro, which opens the file read-only and writes an external .ovr instead), then reconvert with gdal_translate -of COG.
Remote URL opens locally but fails with /vsicurl/
Cause: the object-storage server does not support HTTP range requests (common with certain CDN configurations or pre-signed URL restrictions).
Fix: set CPL_CURL_VERBOSE=YES to log headers, then confirm the server returns Accept-Ranges: bytes. If not, pre-download and validate locally, or switch to a range-request-capable storage provider.
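Once the response headers are logged, the decision can be automated. The sketch below classifies a response from its status code and headers; `range_support` and its three labels are hypothetical, and the sample header dicts are hand-written rather than captured from a real server.

```python
def range_support(status: int, headers: dict) -> str:
    """Classify range-request support from an HTTP status code and headers."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if status == 206 and "content-range" in h:
        return "honored"      # the server returned exactly the requested slice
    if h.get("accept-ranges", "").lower() == "bytes":
        return "advertised"   # support is declared, but this response was not partial
    return "unsupported"


print(range_support(206, {"Content-Range": "bytes 0-16383/52428800"}))
print(range_support(200, {"Accept-Ranges": "bytes"}))
print(range_support(200, {"Accept-Ranges": "none"}))
```

A result of "unsupported" corresponds to the pre-download fallback described above.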
Pixel values are corrupted after decompression
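Cause: a mismatched PREDICTOR tag, most commonly Predictor=3 (floating-point) recorded for integer data. GDAL-based writers generally reject that combination at creation time, so such files tend to come from other tools or from manual tag edits.
Fix: read the predictor with src.tags(ns='IMAGE_STRUCTURE').get('PREDICTOR') (the rasterio profile does not include it) and verify that it matches the band dtype: 2 for integer, 3 for float. Reconvert if there is a mismatch.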
Integrating COG Validation into Broader Pipelines
Structurally sound COGs are only valuable when integrated into a full geospatial workflow. Modern pipelines follow a three-tier pattern:
- Discovery: querying STAC catalogs programmatically to locate assets by bounding box, date range, and cloud cover — without scanning storage buckets directly.
- Streaming and subsetting: HTTP range requests fetch only the tiles intersecting a target geometry. The handling pixel resolution and scaling guide covers window-based reads that leverage COG tile boundaries directly.
- Analytical processing: validated tiles are loaded into memory arrays, reprojected to a common CRS, and processed using vectorized operations. Band math operations with xarray and libraries like rioxarray natively respect COG tile boundaries, allowing scale-out to distributed clusters without rewriting I/O logic.
When designing these workflows, prioritize lazy evaluation and chunk-aware processing. Always validate spatial alignment early — mismatched projections or pixel resolutions compound errors during mosaicking or temporal aggregation.
Frequently Asked Questions
What tile size should a COG use?
256×256 or 512×512 pixels. 512×512 reduces HTTP round-trips for large spatial queries and map viewports; 256×256 transfers less unused data when the area of interest is small.
Do COG overviews need to be in a specific order?
Yes. Overview tile data must be stored coarsest first (lowest resolution first) and physically before the full-resolution pixel data in the file. Out-of-order layouts force clients to issue extra range requests to reach the overview tiles, which shows up as latency spikes in web tile servers. The GDAL COG driver handles this automatically; manual gdaladdo workflows do not guarantee correct byte ordering unless followed by a gdal_translate -of COG step.
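The ordering rule can be expressed as a small check. In this sketch, `data_order_ok` is a hypothetical helper and the byte offsets are made up; in practice the first tile offset of each level would come from that IFD's TileOffsets tag.

```python
def data_order_ok(levels):
    """Check that tile data runs coarsest overview first, full resolution last.

    levels: list of (decimation_factor, first_tile_offset); factor 1 is full resolution.
    """
    by_offset = sorted(levels, key=lambda level: level[1])
    factors = [factor for factor, _ in by_offset]
    # Walking the file front to back, factors should strictly decrease down to 1
    return factors == sorted(factors, reverse=True)


cog_like = [(1, 900_000), (2, 300_000), (4, 100_000), (8, 20_000)]
appended = [(1, 1_000), (2, 500_000), (4, 650_000)]  # overviews added after the fact

print(data_order_ok(cog_like))
print(data_order_ok(appended))
```

The second layout is what an in-place gdaladdo run tends to produce, since new overview data is appended after the existing full-resolution tiles.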
Which compression predictor should I use for float rasters?
Use Predictor=3 for floating-point data (reflectance, elevation, temperature anomalies) and Predictor=2 for integer data (DN values, classification maps). A poor match mainly degrades compression ratios without raising any error at read time; Predictor=3 is defined only for floating-point samples, so it should never be recorded for integer rasters.
Related
- How to read COG headers without downloading full files — minimal byte-range techniques for pre-flight metadata inspection
- Querying STAC Catalogs Programmatically — discover COG assets by spatial and temporal filter before streaming them
- Handling Pixel Resolution and Scaling — window reads and resampling aligned to COG tile boundaries
- Mastering CRS Transformations in Rasterio — reproject validated COG tiles to a common CRS for multi-scene analysis
- Band Math Operations with Xarray — compute spectral indices on lazily loaded COG arrays with chunk-aware I/O