Creating Nested Catalogs
Portolan supports hierarchical catalog structures where directories automatically become subcatalogs. This is useful for organizing large datasets by theme, region, or time period.
Quick Start
# Organize your data into themed directories
mkdir -p my-catalog/{climate,environment,housing}
cp climate-data/*.parquet my-catalog/climate/
cp env-data/*.parquet my-catalog/environment/
# Initialize and add everything
cd my-catalog
portolan init --auto --title "My Regional Data"
portolan add . --workers 4
# Add metadata and generate documentation
portolan metadata init
# Edit .portolan/metadata.yaml with your info
portolan readme --recursive
How Directory Structure Maps to STAC
Portolan infers the catalog hierarchy from your directory layout:
my-catalog/                      # Root catalog (catalog.json)
├── climate/                     # Subcatalog (climate/catalog.json)
│   ├── temperature/             # Collection (climate/temperature/collection.json)
│   │   └── temperature.parquet
│   └── precipitation/           # Collection
│       └── precipitation.parquet
└── demographics/                # Subcatalog
    └── census-2020/             # Collection
        └── census.parquet
When you run portolan add ., Portolan:
- Creates catalog.json at the root with links to subcatalogs
- Creates catalog.json in each intermediate directory (subcatalogs)
- Creates collection.json + item metadata in leaf directories (collections)
- Generates versions.json for tracking at each level
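The mapping above can be illustrated with a small Python sketch. It is not Portolan's implementation; it assumes a simplified rule (the root is the catalog, any directory that directly holds data files is a collection, everything else is a subcatalog), and the `classify` helper and `DATA_SUFFIXES` set are hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path

# Assumed set of data file extensions; Portolan's real detection may differ.
DATA_SUFFIXES = {".parquet", ".tif", ".tiff"}


def classify(root: Path) -> dict[str, str]:
    """Map each directory (relative to root) to an assumed STAC role."""
    roles = {}
    directories = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for directory in directories:
        rel = directory.relative_to(root)
        # Skip hidden directories such as .portolan/
        if any(part.startswith(".") for part in rel.parts):
            continue
        has_data = any(
            f.is_file() and f.suffix in DATA_SUFFIXES for f in directory.iterdir()
        )
        if directory == root:
            roles[rel.as_posix()] = "catalog"
        elif has_data:
            roles[rel.as_posix()] = "collection"
        else:
            roles[rel.as_posix()] = "subcatalog"
    return roles


def build_example(root: Path) -> None:
    """Recreate the example layout from this page with empty files."""
    for rel in (
        "climate/temperature/temperature.parquet",
        "climate/precipitation/precipitation.parquet",
        "demographics/census-2020/census.parquet",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


with tempfile.TemporaryDirectory() as tmp:
    example_root = Path(tmp)
    build_example(example_root)
    roles = classify(example_root)
    for rel, role in roles.items():
        print(f"{rel}: {role}")
```

Running a check like this before `portolan add .` can help spot directories that would end up at an unexpected level of the hierarchy.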
Bulk Adding Files
Process many files efficiently with parallel workers:
portolan add . --workers 4 --verbose
The --verbose flag shows progress for each file. Without it, only changed/added files appear.
Metadata and READMEs
Setting Up Metadata
portolan metadata init
This creates .portolan/metadata.yaml with required fields (contact, license) and optional fields (citation, keywords, source URL, known issues).
Example:
contact:
  name: "Data Team"
  email: "data@example.org"
license: "CC-BY-4.0"
license_url: "https://creativecommons.org/licenses/by/4.0/"
keywords:
  - climate
  - regional data
  - open data
source_url: "https://data.example.org/"
processing_notes: "Converted from Shapefile to GeoParquet with Hilbert sorting."
known_issues: "Temporal extent not specified for most datasets."
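As a rough pre-flight check, the required fields can be verified before generating READMEs. The sketch below works on an already-parsed metadata dictionary (for example, loaded from the YAML file with a YAML parser of your choice). The `missing_required` helper is hypothetical, and it only checks the two fields this page names as required (contact, license); Portolan's own validation may be stricter.

```python
# Fields this page lists as required; any further rules are not covered here.
REQUIRED_FIELDS = ("contact", "license")


def missing_required(metadata: dict) -> list:
    """Return the required top-level fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Parsed equivalent of the example metadata above.
example_metadata = {
    "contact": {"name": "Data Team", "email": "data@example.org"},
    "license": "CC-BY-4.0",
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["climate", "regional data", "open data"],
}

print(missing_required(example_metadata))
print(missing_required({"license": "CC-BY-4.0"}))
```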
Generating READMEs
portolan readme --recursive
This generates README.md files at every level: root catalog, subcatalogs, and collections. Metadata from the root cascades down, so you only need to edit one metadata.yaml for consistent attribution across all READMEs.
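One way to picture the cascade is as a merge in which root-level values act as defaults and any collection-level values win. The sketch below assumes a shallow, key-by-key merge; how Portolan actually combines nested fields is not specified here, and `effective_metadata` is a hypothetical helper.

```python
from typing import Optional


def effective_metadata(root: dict, override: Optional[dict] = None) -> dict:
    """Combine root metadata with optional collection-level overrides.

    Assumption: a shallow merge where an override replaces the whole
    value of a key, including nested values such as `contact`.
    """
    merged = dict(root)  # copy so the root metadata is left untouched
    merged.update(override or {})
    return merged


root_metadata = {
    "contact": {"name": "Data Team", "email": "data@example.org"},
    "license": "CC-BY-4.0",
}

# A collection with no metadata.yaml of its own inherits everything.
inherited = effective_metadata(root_metadata)

# A collection that overrides only the license keeps the root contact.
overridden = effective_metadata(root_metadata, {"license": "CC0-1.0"})

print(inherited["license"], overridden["license"])
```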
To preview without writing:
portolan readme --stdout
Validation
Check the catalog structure and data formats:
portolan check --verbose
This validates:
- STAC metadata completeness
- Cloud-native format compliance (GeoParquet, COG)
- Provisional datetime warnings (items without explicit dates)
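To make the datetime warning more concrete, the sketch below flags STAC items that carry no explicit date. It follows the STAC item convention that `properties.datetime` may be null only when `start_datetime` and `end_datetime` are both given. Exactly what Portolan treats as "provisional" is not described on this page, so the rule and the `lacks_explicit_datetime` helper are assumptions for illustration only.

```python
def lacks_explicit_datetime(item: dict) -> bool:
    """Return True if a STAC item has neither a datetime nor a full date range."""
    properties = item.get("properties", {})
    if properties.get("datetime"):
        return False
    has_range = properties.get("start_datetime") and properties.get("end_datetime")
    return not has_range


# Hypothetical items, trimmed to the fields this check looks at.
dated_item = {"id": "temperature", "properties": {"datetime": "2020-01-01T00:00:00Z"}}
ranged_item = {
    "id": "census",
    "properties": {
        "datetime": None,
        "start_datetime": "2020-01-01T00:00:00Z",
        "end_datetime": "2020-12-31T23:59:59Z",
    },
}
undated_item = {"id": "precipitation", "properties": {"datetime": None}}

flagged = [
    item["id"]
    for item in (dated_item, ranged_item, undated_item)
    if lacks_explicit_datetime(item)
]
print(flagged)
```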
Example: The Hague Open Data
A real-world example with 6 thematic subcatalogs and 23 collections:
den-haag/
├── catalog.json
├── climate/ # 3 collections: heat maps, climate scores
├── environment/ # 7 collections: air quality, noise, soil
├── housing/ # 1 collection: energy labels
├── infrastructure/ # 3 collections: waste, zones, storage
├── nature/ # 7 collections: species, habitats, trees
└── water/ # 2 collections: gauges, water bodies
Created with:
portolan init --auto --title "The Hague Open Data" \
--description "Municipal open data from Den Haag, Netherlands"
portolan add . --workers 4
portolan metadata init
# Edit .portolan/metadata.yaml
portolan readme --recursive
portolan check
Tips
Start flat, restructure later. You can reorganize directories and then re-run portolan add . to regenerate the STAC hierarchy from the current structure.
One metadata.yaml for consistency. Root-level metadata cascades to all READMEs. Only create collection-level metadata.yaml files when you need overrides.
Use --workers for large catalogs. Parallel processing significantly speeds up metadata extraction for catalogs with many files.