Data Preparation¶
Building an EPM model follows four phases — structural decisions, skeleton, country data, scenarios. The order matters: a wrong structural choice made late forces most of the data collection to be redone.
Overview¶
Horizon · Tech set
First EPM run
Hydro · CAPEX
Scenario variants
Phase 0 — Structural decisions¶
These four decisions must be locked in before any data collection. They shape the dimensions of almost every CSV.
Zones¶
How many zones, and which ones. Drives the z dimension of nearly every CSV.
Method: define a floor (minimum to capture real physics) and a ceiling (computation + data constraints), test 3–4 levels on a simplified run, stop when total system cost varies less than ~2% between two consecutive levels.
Drivers for more zones: official bidding zones, documented grid congestion, RE capacity factor spread > 25%, country size > 500 000 km², hydro far from load centres.
| Tool | When to use |
|---|---|
| gridflow (ESMAP) | Partitions a region into N zones using population, load, and RE rasters |
| PyPSA-Eur clustering | Partitions using OSM substations weighted by load — best for regions with good OSM/ENTSO-E coverage |
Representative time-slices¶
Number of representative days, hourly resolution, extreme days. Drives pHours.csv and every hourly profile shape.
| Guideline | |
|---|---|
| Minimum | 4 days (seasons) · 8+ if RE penetration > 20% · 12+ if storage is significant |
| Extreme days | Always add 2–3: winter peak, RE drought |
| Maximum | Beyond 24–30 days, investment decisions rarely change |
| Validation | NRMSE on load duration curve < 3% · NRMSE on RE curves < 5% |
This is decided here but computed in Phase 1 once time series are available — the tool is integrated in the repo.
Planning horizon¶
Drives y.csv. Base year: most recent year with complete data (typically 1–2 years before study start). Planning years: every 5 years is standard (2030, 2035, 2040, 2045, 2050). End year: 2050 for carbon neutrality studies.
Technology set¶
Drives tech.csv, fuel.csv, pTechFuel.csv. Use the same set across all countries — country-specific availability is controlled via max capacity = 0, not a different tech list. Include candidate technologies (offshore wind, green hydrogen) even if not yet deployed. Avoid editing the list after Phase 1: it propagates through many CSVs.
Phase 1 — Skeleton¶
Fill the CSVs that depend only on Phase 0 decisions, then run EPM with dummy data. The goal is not useful results — it is to verify the structure is sound before any real data collection.
| # | CSV | Content | How |
|---|---|---|---|
| 1 | zcmap.csv |
Zone → country mapping | Manual |
| 2 | y.csv |
Planning years | Manual |
| 3 | tech.csv, fuel.csv |
Technology and fuel lists | Manual |
| 4 | pTechFuel.csv |
Tech → fuel mapping | Manual |
| 5 | pSettings.csv |
VOLL, discount rate, features | Copy from data_test |
| 6 | pHours.csv |
Representative hours + weights | Snakemake pipeline ↓ |
The Snakemake pipeline¶
pHours.csv and the associated hourly profiles are generated by the pipeline in pre-analysis/. Configure it once, run it, and the EPM-ready files land in output_workflow/epm_export/.
flowchart LR
CFG(["open_data_config.yaml"])
CL["ERA5 climate\nclimate_pipeline.py"]
VR["VRE profiles\nvre_pipeline.py"]
LD["Load profiles\nload_pipeline.py"]
GM["Generation fleet\ngenerators_pipeline.py"]
RD["Representative days\nPoncelet MILP"]
OUT(["epm_export/\npHours · pVREProfile\npDemandProfile · pGenDataInput"])
CFG --> CL & VR & LD & GM
VR & LD --> RD
CL -.-> RD
RD & GM --> OUT
classDef default fill:#FFFBF0,stroke:#D97706,color:#333333
classDef out fill:#F0F4F8,stroke:#4A6FA5,color:#1a1a1a
class OUT out
Setup:
conda env create -f pre-analysis/open_data_env.yml -n epm-open-data
conda activate epm-open-data
# API keys — both free accounts
cp pre-analysis/config/api_tokens.example.ini pre-analysis/config/api_tokens.ini
# renewables.ninja token → renewables.ninja/profile
# CDS API key → cds.climate.copernicus.eu
Configure pre-analysis/config/open_data_config.yaml (countries, years, number of representative days), then run:
End-of-Phase-1 test — fill all remaining CSVs with dummy zeros, then:
The model must complete without errors. Failures here are structural issues, not data quality problems.
Phase 2 — Country data¶
Fill CSVs in dependency order: demand first, then supply, VRE, hydro, CAPEX. Sizing generation before knowing the load leads to a fleet that doesn't match — everything has to be redone.
Start with the country where data is most available. After each country, run --diagnostic with that country populated and the rest as stubs.
flowchart LR
D["1 · Demand"]
S["2 · Supply fleet"]
V["3 · VRE profiles"]
H["4 · Hydro & storage"]
C["5 · CAPEX"]
R["EPM run"]
D --> S --> V --> H --> C --> R
classDef default fill:#FFFBF0,stroke:#D97706,color:#333333
classDef run fill:#F0F4F8,stroke:#4A6FA5,color:#1a1a1a
class R run
1. Demand¶
| CSV | Content | Source |
|---|---|---|
pDemandForecast |
Annual peak + energy per zone/year | National utilities · IEA WEO · IRENA Planning Dashboard — manual |
pDemandProfile |
Hourly shape (normalized) | Snakemake load_pipeline.py → Toktarova et al. 2019 · ENTSO-E for Europe |
pDemandForecast is always manual — no open database provides consistent country-level forecasts at the required granularity.
2. Supply fleet¶
| CSV | Content | Source |
|---|---|---|
pGenDataInput |
All existing + candidate plants | Snakemake generators_pipeline.py → Global Energy Monitor exports a draft. Always review before use — GEM lags on recent retirements and sub-national locations. |
pFuelPrice |
Fuel cost per zone/year | IEA WEO · World Bank Commodity Forecasts — manual |
pAvailabilityCustom |
Plant-level availability overrides | Start from pAvailabilityDefault.csv; add rows only for plants that deviate |
3. VRE profiles¶
| CSV | Content | Source |
|---|---|---|
pVREProfile |
Hourly capacity factors per zone/tech (representative days) | Snakemake vre_pipeline.py → Renewables.ninja API + IRENA MSR, fed through the representative days optimizer |
The pipeline chains this automatically: raw VRE time series → Poncelet optimizer → pVREProfile.csv.
4. Hydro and storage¶
Hydropower availability cannot be automated — it requires matching plant locations to river discharge observations. The hydro notebooks in pre-analysis/notebooks/ must be run manually in order:
hydro_inflow.ipynb— loads GRDC river discharge, links to HydroRIVERS + plant locations, exports cleaned inflow profileshydro_basins.ipynb— visualizes catchment polygons to verify which GRDC stations link to which plantshydro_atlas_comparison.ipynb— QA: compares utility capacity factors against the African Hydropower Atlashydro_capacity_factors.ipynb(WIP) — merges Atlas + Global Hydropower Tracker into a consolidated catalog
Data to download before running (place in pre-analysis/dataset/):
| Dataset | Source |
|---|---|
| GRDC monthly discharge | grdc.bafg.de — manual request |
| HydroRIVERS shapefiles | hydrosheds.org |
| African Hydropower Atlas v2 | dataset/African_Hydropower_Atlas_v2-0.xlsx |
| Global Hydropower Tracker | globalenergymonitor.org |
Outputs: pAvailabilityCustom.csv (reservoir monthly factors) and pVREgenProfile.csv (run-of-river profiles).
5. CAPEX trajectories¶
| CSV | Content | Source |
|---|---|---|
pCapexTrajectories |
Cost evolution per technology and year | IRENA Renewable Power Generation Costs · IEA WEO technology assumptions — manual |
CAPEX is typically regional rather than country-specific — one table can cover the entire study area.
Phase 3 — Regional layer and scenarios¶
Transmission and trade¶
| CSV | Content | Source |
|---|---|---|
pTransferLimit |
Cross-zone capacity per year (existing + candidate) | Regional TSOs, national plans |
pTradePrice |
Buy/sell prices on external borders | Energy ministries, IEA |
pExtTransferLimit |
Capacities to/from external zones | Same |
pLossesTransmission |
Line losses per link | Utility data, or ~2–3% as default |
Scenarios¶
Scenarios overlay variant CSVs on top of the reference deployment — keep the reference clean.
paramNames,HighDemand,LowFuel,NoNewTransmission
pDemandForecast,demand/high_demand.csv,,
pFuelPrice,,supply/fuel_low.csv,
pTransferLimit,,,trade/no_expansion.csv
| Scenario type | CSVs to override |
|---|---|
| Demand growth | pDemandForecast |
| Fuel price | pFuelPrice |
| Carbon policy | pCarbonPrice, pEmissionsLimit |
| Technology costs | pCapexTrajectories |
| Transmission | pTransferLimit |
See Input Setup for the full scenarios.csv syntax.
Common pitfalls¶
- Supply before demand. Sizing generation before knowing the load means a fleet that doesn't match — everything has to be redone.
- Skipping the Phase 1 dummy run. Fill 15 CSVs, run EPM, get 40 tangled errors. Test the structure with dummy zeros first.
- Trusting GEM data as-is. The pipeline's
pGenDataInput_gap.csvis a draft — always cross-check against utility data. - Scenarios during collection. Keep the reference deployment clean. Scenarios are variants applied on top, last.
- Re-zoning mid-project. Changing zonation mid-collection cascades through every CSV in the model.