
Data Preparation

Building an EPM model follows four phases — structural decisions, skeleton, country data, scenarios. The order matters: a wrong structural choice made late forces most of the data collection to be redone.


Overview

| Phase | Step | Covers |
| --- | --- | --- |
| Phase 0 | Structural decisions | Zones · Time slices · Horizon · Tech set |
| Phase 1 | Skeleton | Dimension CSVs · First EPM run |
| Phase 2 | Country data | Demand · Supply · VRE · Hydro · CAPEX |
| Phase 3 | Regional & scenarios | Transmission · Trade · Scenario variants |

Phase 0 — Structural decisions

These four decisions must be locked in before any data collection. They shape the dimensions of almost every CSV.

Zones

How many zones, and which ones. Drives the z dimension of nearly every CSV.

Method: define a floor (minimum to capture real physics) and a ceiling (computation + data constraints), test 3–4 levels on a simplified run, stop when total system cost varies less than ~2% between two consecutive levels.
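
The stopping rule in the method above can be sketched as follows. This is a minimal illustration, not part of EPM: `pick_zone_level` is a hypothetical helper, the costs are made-up numbers, and returning the coarser of the two converged levels is an assumed design choice (refining further no longer changes the cost materially).

```python
def pick_zone_level(cost_by_zone_count, tolerance=0.02):
    """Return the coarsest zone count whose refinement changes cost by < tolerance.

    cost_by_zone_count maps a zone count to the total system cost
    obtained from a simplified run at that level (hypothetical inputs).
    """
    levels = sorted(cost_by_zone_count)
    for coarse, fine in zip(levels, levels[1:]):
        base = cost_by_zone_count[coarse]
        change = abs(cost_by_zone_count[fine] - base) / abs(base)
        if change < tolerance:
            return coarse
    # No consecutive pair converged: fall back to the finest level tested.
    return levels[-1]


# Made-up total system costs from simplified runs at 2, 4, 6 and 8 zones.
costs = {2: 1000.0, 4: 1080.0, 6: 1095.0, 8: 1100.0}
chosen = pick_zone_level(costs)
print(chosen)
```

Here the cost moves by 8% between 2 and 4 zones, then by about 1.4% between 4 and 6, so the sketch settles on 4 zones.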

Drivers for more zones: official bidding zones, documented grid congestion, RE capacity factor spread > 25%, country size > 500 000 km², hydro far from load centres.

| Tool | When to use |
| --- | --- |
| gridflow (ESMAP) | Partitions a region into N zones using population, load, and RE rasters |
| PyPSA-Eur clustering | Partitions using OSM substations weighted by load; best for regions with good OSM/ENTSO-E coverage |

Representative time-slices

Number of representative days, hourly resolution, extreme days. Drives pHours.csv and every hourly profile shape.

| Criterion | Guideline |
| --- | --- |
| Minimum | 4 days (seasons) · 8+ if RE penetration > 20% · 12+ if storage is significant |
| Extreme days | Always add 2–3: winter peak, RE drought |
| Maximum | Beyond 24–30 days, investment decisions rarely change |
| Validation | NRMSE on load duration curve < 3% · NRMSE on RE curves < 5% |

The number of days is decided here, but the representative days themselves are computed in Phase 1 once time series are available; the selection tool is integrated in the repo.
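
The validation criterion on the load duration curve can be illustrated with a small sketch. It assumes integer day weights and normalizes the RMSE by the range of the reference curve; the repo's tool may use a different normalization (mean or peak, for example). The helper names and the toy 3-hour "days" are hypothetical.

```python
import math


def duration_curve(series):
    """Sort hourly values from highest to lowest."""
    return sorted(series, reverse=True)


def expand(rep_days, weights):
    """Repeat each representative day's hourly values `weight` times."""
    expanded = []
    for day, weight in zip(rep_days, weights):
        expanded.extend(day * weight)
    return expanded


def nrmse(reference, approximation):
    """RMSE between two equal-length curves, normalized by the reference range."""
    if len(reference) != len(approximation):
        raise ValueError("curves must have the same length")
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)
    return math.sqrt(mse) / (max(reference) - min(reference))


# Toy example: four 3-hour "days" approximated by two representative days.
full_series = [10, 20, 30, 12, 22, 32, 50, 60, 70, 48, 58, 68]
rep_days = [[10, 20, 30], [50, 60, 70]]
weights = [2, 2]

error = nrmse(duration_curve(full_series), duration_curve(expand(rep_days, weights)))
print(f"NRMSE on load duration curve: {error:.4f}")
```

In this toy case the error is about 2.4%, which would pass the 3% threshold for the load duration curve.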

Planning horizon

Drives y.csv. Base year: most recent year with complete data (typically 1–2 years before study start). Planning years: every 5 years is standard (2030, 2035, 2040, 2045, 2050). End year: 2050 for carbon neutrality studies.
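
As a small illustration, the standard 5-year grid can be generated rather than typed by hand. `planning_years` is a hypothetical helper; whether the base year also belongs in y.csv, and how that file is laid out, is not covered here.

```python
def planning_years(first_year, end_year, step=5):
    """Evenly spaced planning years; the end year is included when it falls on the grid."""
    return list(range(first_year, end_year + 1, step))


years = planning_years(2030, 2050)
print(years)
```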

Technology set

Drives tech.csv, fuel.csv, pTechFuel.csv. Use the same set across all countries — country-specific availability is controlled via max capacity = 0, not a different tech list. Include candidate technologies (offshore wind, green hydrogen) even if not yet deployed. Avoid editing the list after Phase 1: it propagates through many CSVs.
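
The "same list everywhere, zero capacity where unavailable" convention could look like the sketch below. The technology names, country names, and the `max_capacity_rows` helper are all hypothetical, and the actual EPM column that carries the maximum capacity is not shown in this guide.

```python
# One shared technology list for the whole study area (illustrative names).
TECHS = ["Solar PV", "Onshore wind", "Offshore wind", "Hydro", "CCGT"]

# Technologies that a given country cannot build (hypothetical example data).
UNAVAILABLE = {
    "CountryA": {"Offshore wind"},
    "CountryB": {"Hydro", "Offshore wind"},
}


def max_capacity_rows(countries, techs, unavailable, default_cap_mw):
    """Emit one row per (country, tech); unavailable techs get a cap of 0 MW."""
    rows = []
    for country in countries:
        blocked = unavailable.get(country, set())
        for tech in techs:
            cap = 0.0 if tech in blocked else default_cap_mw
            rows.append((country, tech, cap))
    return rows


rows = max_capacity_rows(["CountryA", "CountryB"], TECHS, UNAVAILABLE, 10_000.0)
for row in rows:
    print(row)
```

Every country keeps the full set of rows, so the dimensions of downstream CSVs stay identical across countries.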


Phase 1 — Skeleton

Fill the CSVs that depend only on Phase 0 decisions, then run EPM with dummy data. The goal is not useful results — it is to verify the structure is sound before any real data collection.

| # | CSV | Content | How |
| --- | --- | --- | --- |
| 1 | zcmap.csv | Zone → country mapping | Manual |
| 2 | y.csv | Planning years | Manual |
| 3 | tech.csv, fuel.csv | Technology and fuel lists | Manual |
| 4 | pTechFuel.csv | Tech → fuel mapping | Manual |
| 5 | pSettings.csv | VOLL, discount rate, features | Copy from data_test |
| 6 | pHours.csv | Representative hours + weights | Snakemake pipeline ↓ |
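
A quick way to see which skeleton files are still missing is sketched below. The file names come from the table above; the recursive search is an assumption so that the check works whether or not the input folder uses subfolders. `missing_skeleton_files` is a hypothetical helper, not an EPM command.

```python
import tempfile
from pathlib import Path

# Skeleton files listed in the Phase 1 table.
SKELETON_FILES = [
    "zcmap.csv", "y.csv", "tech.csv", "fuel.csv",
    "pTechFuel.csv", "pSettings.csv", "pHours.csv",
]


def missing_skeleton_files(folder):
    """Return the skeleton CSVs not found anywhere under `folder`."""
    found = {path.name for path in Path(folder).rglob("*.csv")}
    return [name for name in SKELETON_FILES if name not in found]


# Demo on a throwaway folder that only contains the first three files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["zcmap.csv", "y.csv", "tech.csv"]:
        (Path(tmp) / name).write_text("")
    missing = missing_skeleton_files(tmp)

print(missing)
```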

The Snakemake pipeline

pHours.csv and the associated hourly profiles are generated by the pipeline in pre-analysis/. Configure it once, run it, and the EPM-ready files land in output_workflow/epm_export/.

flowchart LR
    CFG(["open_data_config.yaml"])
    CL["ERA5 climate\nclimate_pipeline.py"]
    VR["VRE profiles\nvre_pipeline.py"]
    LD["Load profiles\nload_pipeline.py"]
    GM["Generation fleet\ngenerators_pipeline.py"]
    RD["Representative days\nPoncelet MILP"]
    OUT(["epm_export/\npHours · pVREProfile\npDemandProfile · pGenDataInput"])

    CFG --> CL & VR & LD & GM
    VR & LD --> RD
    CL -.-> RD
    RD & GM --> OUT

    classDef default fill:#FFFBF0,stroke:#D97706,color:#333333
    classDef out fill:#F0F4F8,stroke:#4A6FA5,color:#1a1a1a
    class OUT out

Setup:

conda env create -f pre-analysis/open_data_env.yml -n epm-open-data
conda activate epm-open-data

# API keys — both free accounts
cp pre-analysis/config/api_tokens.example.ini pre-analysis/config/api_tokens.ini
# renewables.ninja token  →  renewables.ninja/profile
# CDS API key             →  cds.climate.copernicus.eu

Configure pre-analysis/config/open_data_config.yaml (countries, years, number of representative days), then run:

cd pre-analysis
snakemake --snakefile Snakefile --cores 4

End-of-Phase-1 test — fill all remaining CSVs with dummy zeros, then:

python epm.py --folder_input data_<region> --diagnostic

The model must complete without errors. Failures here are structural issues, not data quality problems.
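
Dummy-zero tables can be generated from the Phase 0 dimensions instead of being typed by hand. The sketch below assumes a purely illustrative layout (one row per zone, one column per year); each real EPM input has its own header and index columns, so adapt the layout to the matching file in data_test.

```python
import csv
import io


def dummy_zero_table(zones, years):
    """Build CSV text with one row per zone and a zero for every year (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["zone"] + [str(year) for year in years])
    for zone in zones:
        writer.writerow([zone] + [0] * len(years))
    return buffer.getvalue()


text = dummy_zero_table(["ZoneA", "ZoneB"], [2030, 2035])
print(text)
```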


Phase 2 — Country data

Fill CSVs in dependency order: demand first, then supply, VRE, hydro, CAPEX. Sizing generation before the load is known produces a fleet that does not match demand, and everything has to be redone.

Start with the country where data is most available. After each country, run --diagnostic with that country populated and the rest as stubs.

flowchart LR
    D["1 · Demand"]
    S["2 · Supply fleet"]
    V["3 · VRE profiles"]
    H["4 · Hydro & storage"]
    C["5 · CAPEX"]
    R["EPM run"]

    D --> S --> V --> H --> C --> R

    classDef default fill:#FFFBF0,stroke:#D97706,color:#333333
    classDef run fill:#F0F4F8,stroke:#4A6FA5,color:#1a1a1a
    class R run

1. Demand

| CSV | Content | Source |
| --- | --- | --- |
| pDemandForecast | Annual peak + energy per zone/year | National utilities · IEA WEO · IRENA Planning Dashboard (manual) |
| pDemandProfile | Hourly shape (normalized) | Snakemake load_pipeline.py (Toktarova et al. 2019 · ENTSO-E for Europe) |

pDemandForecast is always manual — no open database provides consistent country-level forecasts at the required granularity.
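
Because the forecast is entered by hand, a sanity check on the implied load factor can catch unit or typing errors early. The sketch assumes energy in GWh and peak in MW; the plausibility bounds (0.3 to 0.9) are arbitrary illustrative values, and the rows are made-up data.

```python
HOURS_PER_YEAR = 8760


def load_factor(energy_gwh, peak_mw):
    """Implied load factor: annual energy divided by peak running all year."""
    return energy_gwh * 1000.0 / (peak_mw * HOURS_PER_YEAR)


def implausible_rows(rows, low=0.3, high=0.9):
    """Return (zone, year) pairs whose load factor falls outside [low, high]."""
    flagged = []
    for zone, year, energy_gwh, peak_mw in rows:
        if not low <= load_factor(energy_gwh, peak_mw) <= high:
            flagged.append((zone, year))
    return flagged


# Made-up forecast rows: (zone, year, energy in GWh, peak in MW).
forecast = [
    ("ZoneA", 2030, 5256.0, 1000.0),   # load factor 0.60
    ("ZoneA", 2035, 9000.0, 1000.0),   # load factor above 1: energy and peak disagree
]
flagged = implausible_rows(forecast)
print(flagged)
```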

2. Supply fleet

| CSV | Content | Source |
| --- | --- | --- |
| pGenDataInput | All existing + candidate plants | Snakemake generators_pipeline.py exports a draft from Global Energy Monitor (GEM). Always review before use: GEM lags on recent retirements and sub-national locations. |
| pFuelPrice | Fuel cost per zone/year | IEA WEO · World Bank Commodity Forecasts (manual) |
| pAvailabilityCustom | Plant-level availability overrides | Start from pAvailabilityDefault.csv; add rows only for plants that deviate |
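
Keeping pAvailabilityCustom limited to the plants that deviate could be sketched as below. Plant names, technologies, and availability values are made up, and `custom_overrides` is a hypothetical helper; the real files carry more columns than this.

```python
def custom_overrides(plant_availability, plant_tech, tech_defaults, tol=1e-9):
    """Keep only plants whose availability differs from their technology default."""
    return {
        plant: availability
        for plant, availability in plant_availability.items()
        if abs(availability - tech_defaults[plant_tech[plant]]) > tol
    }


# Made-up example data.
tech_defaults = {"CCGT": 0.90, "Coal": 0.85}
plant_tech = {"PlantA": "CCGT", "PlantB": "CCGT", "PlantC": "Coal"}
plant_availability = {"PlantA": 0.90, "PlantB": 0.70, "PlantC": 0.85}

overrides = custom_overrides(plant_availability, plant_tech, tech_defaults)
print(overrides)
```

Only PlantB ends up in the override table; the other two inherit the technology default.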

3. VRE profiles

| CSV | Content | Source |
| --- | --- | --- |
| pVREProfile | Hourly capacity factors per zone/tech (representative days) | Snakemake vre_pipeline.py → Renewables.ninja API + IRENA MSR, fed through the representative days optimizer |

The pipeline chains this automatically: raw VRE time series → Poncelet optimizer → pVREProfile.csv.

4. Hydro and storage

Hydropower availability cannot be automated — it requires matching plant locations to river discharge observations. The hydro notebooks in pre-analysis/notebooks/ must be run manually in order:

  1. hydro_inflow.ipynb — loads GRDC river discharge, links to HydroRIVERS + plant locations, exports cleaned inflow profiles
  2. hydro_basins.ipynb — visualizes catchment polygons to verify which GRDC stations link to which plants
  3. hydro_atlas_comparison.ipynb — QA: compares utility capacity factors against the African Hydropower Atlas
  4. hydro_capacity_factors.ipynb (WIP) — merges Atlas + Global Hydropower Tracker into a consolidated catalog

Data to download before running (place in pre-analysis/dataset/):

| Dataset | Source |
| --- | --- |
| GRDC monthly discharge | grdc.bafg.de (manual request) |
| HydroRIVERS shapefiles | hydrosheds.org |
| African Hydropower Atlas v2 | dataset/African_Hydropower_Atlas_v2-0.xlsx |
| Global Hydropower Tracker | globalenergymonitor.org |

Outputs: pAvailabilityCustom.csv (reservoir monthly factors) and pVREgenProfile.csv (run-of-river profiles).

5. CAPEX trajectories

| CSV | Content | Source |
| --- | --- | --- |
| pCapexTrajectories | Cost evolution per technology and year | IRENA Renewable Power Generation Costs · IEA WEO technology assumptions (manual) |

CAPEX is typically regional rather than country-specific — one table can cover the entire study area.
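
Published cost outlooks usually give values for a few anchor years only, so the planning years in between need to be filled. One simple option, assumed here and not prescribed by EPM, is linear interpolation. The anchor costs below are made-up numbers and `interpolate_capex` is a hypothetical helper.

```python
def interpolate_capex(anchors, years):
    """Linearly interpolate costs onto `years`; hold flat outside the anchor range."""
    points = sorted(anchors.items())
    result = {}
    for year in years:
        if year <= points[0][0]:
            result[year] = points[0][1]
        elif year >= points[-1][0]:
            result[year] = points[-1][1]
        else:
            for (y0, c0), (y1, c1) in zip(points, points[1:]):
                if y0 <= year <= y1:
                    result[year] = c0 + (c1 - c0) * (year - y0) / (y1 - y0)
                    break
    return result


# Made-up anchor costs (currency per kW) for one technology.
anchors = {2030: 800.0, 2050: 600.0}
trajectory = interpolate_capex(anchors, [2030, 2035, 2040, 2045, 2050])
print(trajectory)
```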


Phase 3 — Regional layer and scenarios

Transmission and trade

| CSV | Content | Source |
| --- | --- | --- |
| pTransferLimit | Cross-zone capacity per year (existing + candidate) | Regional TSOs, national plans |
| pTradePrice | Buy/sell prices on external borders | Energy ministries, IEA |
| pExtTransferLimit | Capacities to/from external zones | Same |
| pLossesTransmission | Line losses per link | Utility data, or ~2–3% as default |
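
Applying the default loss rate only where utility data is missing could look like this sketch. The 2.5% default is simply the midpoint of the ~2–3% range above, and the links, values, and `fill_losses` helper are hypothetical.

```python
DEFAULT_LOSS = 0.025  # assumed midpoint of the ~2-3% default range


def fill_losses(links, known_losses, default=DEFAULT_LOSS):
    """Use utility data where available, otherwise the default loss rate."""
    return {link: known_losses.get(link, default) for link in links}


# Made-up links between zones; only one has utility data.
links = [("ZoneA", "ZoneB"), ("ZoneB", "ZoneC")]
known = {("ZoneA", "ZoneB"): 0.04}

losses = fill_losses(links, known)
print(losses)
```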

Scenarios

Scenarios overlay variant CSVs on top of the reference deployment — keep the reference clean.

paramNames,HighDemand,LowFuel,NoNewTransmission
pDemandForecast,demand/high_demand.csv,,
pFuelPrice,,supply/fuel_low.csv,
pTransferLimit,,,trade/no_expansion.csv

| Scenario type | CSVs to override |
| --- | --- |
| Demand growth | pDemandForecast |
| Fuel price | pFuelPrice |
| Carbon policy | pCarbonPrice, pEmissionsLimit |
| Technology costs | pCapexTrajectories |
| Transmission | pTransferLimit |

See Input Setup for the full scenarios.csv syntax.
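
To make the overlay logic concrete, the sketch below reads the scenarios.csv example above and lists which files each scenario overrides. It is an illustration of the file's structure using Python's standard csv module, not EPM's own loader; `scenario_overrides` is a hypothetical helper.

```python
import csv
import io

SCENARIOS_CSV = """paramNames,HighDemand,LowFuel,NoNewTransmission
pDemandForecast,demand/high_demand.csv,,
pFuelPrice,,supply/fuel_low.csv,
pTransferLimit,,,trade/no_expansion.csv
"""


def scenario_overrides(text):
    """Map each scenario name to {parameter: override file}; empty cells mean 'use reference'."""
    reader = csv.DictReader(io.StringIO(text))
    overrides = {name: {} for name in reader.fieldnames if name != "paramNames"}
    for row in reader:
        for scenario, path in row.items():
            if scenario != "paramNames" and path:
                overrides[scenario][row["paramNames"]] = path
    return overrides


overrides = scenario_overrides(SCENARIOS_CSV)
for scenario, files in overrides.items():
    print(scenario, files)
```

Each scenario touches a single parameter here, which keeps the difference from the reference deployment easy to audit.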


Common pitfalls

  • Supply before demand. Sizing generation before knowing the load means a fleet that doesn't match — everything has to be redone.
  • Skipping the Phase 1 dummy run. Fill 15 CSVs, run EPM, get 40 tangled errors. Test the structure with dummy zeros first.
  • Trusting GEM data as-is. The pipeline's pGenDataInput_gap.csv is a draft — always cross-check against utility data.
  • Scenarios during collection. Keep the reference deployment clean. Scenarios are variants applied on top, last.
  • Re-zoning mid-project. Changing zonation mid-collection cascades through every CSV in the model.