Representative Days for EPM#
Objective
Automate the selection of representative and special days for the EPM model by clustering multi-year load and VRE profiles and formatting the outputs for the Poncelet algorithm.
Data requirements (user-provided) and method
Data requirements: An `input/` folder containing `irena/data_capp_solar.csv`, `irena/data_capp_wind.csv`, and `load_full_year.csv`, plus helper utilities (`utils_reprdays.py`) and a user-defined season mapping.
Method: Configure season definitions and exclusions, group the historical data by season, format it for the Poncelet algorithm, run the clustering/selection routine (including special days), and export the resulting `pHours` and `pVREgen` CSVs alongside diagnostics.
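For reference, the expected layout looks like this (matching the file names above; `utils_reprdays.py` sits next to this notebook):
input/
├── irena/
│   ├── data_capp_solar.csv
│   └── data_capp_wind.csv
└── load_full_year.csv
utils_reprdays.py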
Overview of steps
Step 1 - Configure seasons, file names, and optional filters.
Step 2 - Create the input/output folders and verify the source files.
Step 3 - Process and group the data by season.
Step 4 - Format the datasets required by the Poncelet algorithm.
Step 5 - Generate representative and special days plus EPM-formatted outputs.
Step 6 - Plot and export diagnostics for validation.
import os
import pandas as pd
import matplotlib.pyplot as plt
# Helper functions for clustering, the Poncelet optimization wrappers, and EPM formatting
from utils_reprdays import *
Step 1 - Configure user parameters and season definitions#
`seasons_dict`: This dictionary defines the mapping of months to seasons. The keys are month numbers (1-12) and the values are season numbers (1-4). For example, seasons can be defined as follows:
Season 1: May, June, July, August, September
Season 2: January, February, March, April, October, November, December
`filenames_input`: This dictionary contains the filenames of the input data files. The keys are technology types (e.g., 'PV', 'Wind', 'Load') and the values are the corresponding file paths relative to the `input` folder, where the files must be placed. Users are responsible for ensuring that the data is formatted correctly; reference examples are available in the `data_test` folder. The required columns are `zone`, `month` (or `season`), `day`, and `hour`.
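For illustration, a minimal input frame with the required layout might look like this (hypothetical zone and values; the hourly value sits in a `value` column, as assumed by the processing step below):
import pandas as pd

# Hypothetical example of the expected layout: one row per zone/month/day/hour
example = pd.DataFrame({
    'zone':  ['Angola'] * 3,
    'month': [1, 1, 1],
    'day':   [1, 1, 1],
    'hour':  [0, 1, 2],
    'value': [0.71, 0.68, 0.66],   # capacity factor or normalized load
})
# The processing step below checks for exactly these columns
assert {'zone', 'month', 'day', 'hour'}.issubset(example.columns)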
# The definition of seasons is based on the month number.
seasons_dict = {
1: 2,
2: 2,
3: 2,
4: 2,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 2,
11: 2,
12: 2
}
# The names of the files that must be in the input folder (paths relative to that folder)
filenames_input = {'PV': 'irena/data_capp_solar.csv',
                   'Wind': 'irena/data_capp_wind.csv',
                   'Load': 'load_full_year.csv'
                   }
# Zones to exclude from the analysis
zones_to_remove = ['STP']
Step 2 - Create folder structure and validate inputs#
folder_input = 'input'
# Create the input folder if it does not exist
os.makedirs(folder_input, exist_ok=True)
print(f'Input folder: {folder_input}')
folder_output = 'output'
# Create the output folder if it does not exist
os.makedirs(folder_output, exist_ok=True)
print(f'Output folder: {folder_output}')
Step 3 - Process data and group by season#
Renewables Ninja data is processed to group months into seasons, to be used as input to EPM. This step may be skipped if one wishes to keep the seasonal definition at the monthly scale, or adapted to whichever seasonal grouping makes the most sense for the case study at hand.
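If the monthly resolution should be preserved instead, a pass-through mapping does the trick (a minimal sketch; `seasons_dict_monthly` is an illustrative name):
# Identity mapping: each month becomes its own "season", keeping the monthly resolution
seasons_dict_monthly = {month: month for month in range(1, 13)}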
def month_to_season(data, seasons_dict, other_columns=None):
    """Convert month numbers to season numbers and renumber days within each season."""
    if other_columns is None:
        other_columns = []
    # Files that label months in a 'season' column are handled as 'month'
    data = data.rename(columns={'season': 'month'})
    data['season'] = data.apply(lambda row: seasons_dict[row['month']], axis=1)
    data = data.sort_values(by=['season', 'month', 'day', 'hour'])
    # Drop February 29 so every year has 8760 hours
    data = data[~((data['month'] == 2) & (data['day'] == 29))]
    # Renumber days sequentially within each season (24 hourly rows per day)
    data['season_day'] = data.groupby(other_columns + ['season']).cumcount() // 24 + 1
    data = data.drop(columns=['day']).rename(columns={'season_day': 'day'})
    # Reorder columns and drop the now-redundant month column
    data = data.set_index(other_columns + ['season', 'day', 'hour']).reset_index().drop(columns=['month'])
    data = data.sort_values(by=other_columns + ['season', 'day', 'hour'])
    return data
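As a quick sanity check (toy data with a hypothetical zone name), the function maps two months to their seasons and renumbers the days within each season:
import itertools

# Toy frame: one zone, 24 hourly rows each for January (season 2) and May (season 1)
toy = pd.DataFrame(
    [(z, m, d, h, 0.5) for z, m, d, h in itertools.product(['ZoneA'], [1, 5], [1], range(24))],
    columns=['zone', 'month', 'day', 'hour', 'value'],
)
toy_season = month_to_season(toy, seasons_dict, other_columns=['zone'])
print(toy_season.groupby('season').size())  # 24 hourly rows per season, each renumbered as day 1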
filenames = {key: os.path.join(folder_input, filename) for key, filename in filenames_input.items()}
display('WARNING: Ensure that the zones in the data are consistent across all input files.')
# Process each file and save the results
for key, filename in filenames.items():
    print(f'Processing {key} data from {filename}')
    if not os.path.exists(filename):
        raise FileNotFoundError(f'File {os.path.abspath(filename)} not found. Please check the input folder.')
    # Load the data
    data = pd.read_csv(filename, index_col=False)
    # Check if the required columns are present
    required_columns = ['zone', 'month', 'day', 'hour']
    if not all(col in data.columns for col in required_columns):
        raise ValueError(f'Missing required columns in {filename}: {", ".join([col for col in required_columns if col not in data.columns])}')
    # Display zones in the data
    display(f'Number of zones in {key}: {data["zone"].nunique()}')
    display(f'Zones in {key}: {data["zone"].unique()}')
    # Hours should start at 0, not 1; shift if the file is 1-indexed
    if data['hour'].min() == 1:
        data['hour'] = data['hour'] - 1
    # Rename the value column to the representative year (2018)
    data = data.rename(columns={'value': 2018})
    # Remove zones that are not needed
    data = data[~data['zone'].isin(zones_to_remove)]
    # Convert month to season
    data = month_to_season(data, seasons_dict, other_columns=['zone'])
    # Save the data with a _season suffix
    name, ext = os.path.splitext(filename)
    filename = f'{name}_season{ext}'
    data.to_csv(filename, float_format='%.4f', index=False)
    print(f'Data saved to {filename}')
'WARNING: Ensure that the zones in the data are consistent across all input files.'
Processing PV data from input/irena/data_capp_solar.csv
'Number of zones in PV: 10'
"Zones in PV: ['Angola' 'Burundi' 'Cameroon' 'CAR' 'Chad' 'Congo' 'DRC'\n 'EquatorialGuinea' 'Gabon' 'Rwanda']"
Data saved to input/irena/data_capp_solar_season.csv
Processing Wind data from input/irena/data_capp_wind.csv
'Number of zones in Wind: 6'
"Zones in Wind: ['Angola' 'Cameroon' 'CAR' 'Chad' 'Congo' 'DRC']"
Data saved to input/irena/data_capp_wind_season.csv
Processing Load data from input/load_full_year.csv
'Number of zones in Load: 11'
"Zones in Load: ['Angola' 'Burundi' 'Cameroon' 'CAR' 'Congo' 'Gabon' 'EquatorialGuinea'\n 'DRC' 'Rwanda' 'STP' 'Chad']"
Data saved to input/load_full_year_season.csv
Step 4 - Format data for the Poncelet algorithm#
filenames = {}
for tech, filename in filenames_input.items():
    name, ext = os.path.splitext(filename)
    filename = f'{name}_season{ext}'
    filename = os.path.join(folder_input, filename)
    if not os.path.exists(filename):
        raise FileNotFoundError(f'File {filename} not found. Please check the input folder.')
    filenames.update({tech: filename})
# Combine the seasonal files into a single hourly dataset across technologies and zones
df_energy = format_data_energy(filenames)
# Drop columns with all NaN values (e.g., zones with no wind data)
df_energy = df_energy.dropna(axis=1, how='all')
# Display df_energy
display(df_energy.head())
if len(df_energy) != 8760:
    print('Warning: The data does not contain a full year of data. Please check the input files.')
else:
    print('The data contains a full year of data.')
Representative year 2018
Warning: NaN values in the DataFrame
'Annual capacity factor (%):'
| tech | zone | PV | Wind | Load |
|---|---|---|---|---|
| 0 | Angola | 0.224599 | 0.397303 | 0.775620 |
| 1 | Burundi | 0.198113 | NaN | 0.742856 |
| 2 | CAR | 0.202921 | 0.390992 | 0.803250 |
| 3 | Cameroon | 0.207991 | 0.415764 | 0.740909 |
| 4 | Chad | 0.208414 | 0.642518 | 0.685850 |
| 5 | Congo | 0.183137 | 0.377975 | 0.720891 |
| 6 | DRC | 0.193951 | 0.439868 | 0.880314 |
| 7 | EquatorialGuinea | 0.173794 | NaN | 0.805217 |
| 8 | Gabon | 0.173166 | NaN | 0.783922 |
| 9 | Rwanda | 0.192586 | NaN | 0.757781 |
| season | day | hour | Load_Angola | Load_Burundi | Load_CAR | Load_Cameroon | Load_Chad | Load_Congo | Load_DRC | ... | PV_DRC | PV_EquatorialGuinea | PV_Gabon | PV_Rwanda | Wind_Angola | Wind_CAR | Wind_Cameroon | Wind_Chad | Wind_Congo | Wind_DRC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0.7989 | 0.602 | 0.7552 | 0.6838 | 0.6875 | 0.7073 | 0.8866 | ... | 0.0000 | 0.0 | 0.0 | 0.0000 | 0.4575 | 0.6473 | 0.6783 | 0.2343 | 0.000 | 0.7204 |
| 1 | 1 | 1 | 1 | 0.7687 | 0.569 | 0.7433 | 0.6589 | 0.6625 | 0.6663 | 0.8704 | ... | 0.0000 | 0.0 | 0.0 | 0.0000 | 0.4164 | 0.8103 | 0.8117 | 0.2442 | 0.000 | 0.7449 |
| 2 | 1 | 1 | 2 | 0.7579 | 0.569 | 0.7415 | 0.6388 | 0.6422 | 0.6432 | 0.8606 | ... | 0.0000 | 0.0 | 0.0 | 0.0000 | 0.4093 | 0.8637 | 0.8557 | 0.2553 | 0.005 | 0.7827 |
| 3 | 1 | 1 | 3 | 0.7365 | 0.571 | 0.7314 | 0.6272 | 0.6306 | 0.6389 | 0.8520 | ... | 0.0000 | 0.0 | 0.0 | 0.0000 | 0.4160 | 0.8929 | 0.8690 | 0.2407 | 0.043 | 0.7954 |
| 4 | 1 | 1 | 4 | 0.7297 | 0.580 | 0.7360 | 0.6359 | 0.6393 | 0.6338 | 0.8509 | ... | 0.0121 | 0.0 | 0.0 | 0.0001 | 0.4371 | 0.8781 | 0.8862 | 0.2157 | 0.068 | 0.7979 |
5 rows × 29 columns
The data contains a full year of data.
Step 5 - Generate representative and special days#
User-defined parameters
The user needs to set the following parameters in the next cell:
`n_rep_days` - Defines the number of representative days used in the Poncelet algorithm (passed to the optimization as `nbr_days`).
`n_clusters` - Specifies the number of clusters (i.e., representative days) to generate, which are then used to extract special days representing extreme clusters. More clusters capture more extreme conditions, but also increase computational complexity.
`nbr_bins` - A higher number of zones or time series increases numerical complexity. Start with `nbr_bins=10` (fewer bins) for easier computation, and increase the number of bins progressively if the problem remains easy to solve.
`n_features_selection` (optional) - Enables automatic feature selection to reduce the number of time series used in the Poncelet algorithm. This is useful when modeling many zones or countries, or when the number of pairwise correlations becomes large, increasing computational time. Use this parameter if the Poncelet algorithm takes more than a few minutes to run.
# Defines the number of representative days used in the Poncelet algorithm.
n_rep_days = 2
# Clustering the data to extract clusters corresponding to extreme conditions
n_clusters = 20
nbr_bins = 10
# Feature selection is optional but recommended if you are working with a large number of zones or time series. This reduces the number of pairwise correlations and helps avoid high computational complexity in the optimization step.
n_features_selection = 30
df_energy_cluster, df_closest_days, centroids_df = cluster_data_new(df_energy, n_clusters=n_clusters)
# Extracting special days as centroids of the extreme clusters
special_days, df_energy_no_special = get_special_days_clustering(df_closest_days,
df_energy_cluster, threshold=0.1)
print('Number of hours in the year:', len(df_energy_no_special))
print('Removed days:', (len(df_energy) - len(df_energy_no_special)) / 24)
# Format the data (including correlation calculation) and save it in a .csv file
_, path_data_file = format_optim_repr_days(df_energy_no_special, folder_output)
selected_series, df, path_data_file_selection = (
select_representative_series_hierarchical(path_data_file, n=n_features_selection, method='ward', metric='euclidean', scale=True, scale_method='standard'))
# Launch the optimization to find the representative days.
# Choose one of the two inputs below:
# path_data = path_data_file  # use all features to identify representative days
path_data = path_data_file_selection  # use only the reduced set of features (recommended for many zones/time series)
launch_optim_repr_days(path_data, folder_output, nbr_days=n_rep_days,
main_file='OptimizationModelZone.gms',
nbr_bins=nbr_bins)
# Get the results
repr_days = parse_repr_days(folder_output, special_days)
# Format the data to be used in EPM
format_epm_phours(repr_days, folder_output)
format_epm_pvreprofile(df_energy, repr_days, folder_output)
# only activate when load data is provided
format_epm_demandprofile(df_energy, repr_days, folder_output)
# Export in .csv format
repr_days.to_csv(os.path.join(folder_output, 'repr_days.csv'), index=False)
df_energy.to_csv(os.path.join(folder_output, 'df_energy.csv'), index=False)
Number of hours in the year: 7728
Removed days: 43.0
File saved at: output/data_formatted_optim.csv
File saved at /Users/lucas/Documents/World Bank/Projects/EPM_APPLIED/EPM_CAPP/pre-analysis/representative_days/output/data_formatted_optim_selection.csv
File saved to: gams/bins_settings_10.csv
Launch GAMS code
End GAMS code
Number of days: 10
Total weight: 365
season
Q1 153.0
Q2 212.0
Name: weight, dtype: float64
pHours file saved at: output/pHours.csv
Number of hours: 365
VRE Profile file saved at: output/pVREProfile.csv
pDemandProfile file saved at: output/pDemandProfile.csv
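Before moving to the plots, a minimal sanity check on the exported files can catch formatting issues early (a sketch; `repr_days_check` and `pHours_check` are illustrative names, and per the diagnostics above the pHours entries should total 365):
# Quick sanity check on the exported files (assumes the layouts produced above)
repr_days_check = pd.read_csv(os.path.join(folder_output, 'repr_days.csv'))
pHours_check = pd.read_csv(os.path.join(folder_output, 'pHours.csv'), index_col=[0, 1])
print('Rows in repr_days.csv:', len(repr_days_check))
print('Sum of pHours entries:', pHours_check.to_numpy().sum())  # diagnostics above report a total weight of 365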
Step 6 - Plot results and export diagnostics#
Optional cell to plot the results. It plots the load, wind, and solar data for the representative year.
# Get data
input_file = pd.read_csv(os.path.join(folder_output, 'data_formatted_optim.csv'), index_col=[0,1,2])
input_file.index.names = ['season', 'day', 'hour']
VREProfile = pd.read_csv(os.path.join(folder_output, 'pVREProfile.csv'), index_col=[0,1,2,3])
pHours = pd.read_csv(os.path.join(folder_output, 'pHours.csv'), index_col=[0,1])
# Checking the representative days
# === SETTINGS ===
season_colors = {
'Q1': 'darkred',
'Q2': 'dimgrey',
'Q3': 'steelblue',
'Q4': 'seagreen'}
# Total renewable production over all zones
plot_vre_repdays(input_file=input_file, vre_profile=VREProfile, pHours=pHours,
season_colors=season_colors, min_alpha=0.5, max_alpha=1, path=os.path.join(folder_output, 'plot_vre_repdays.png'))
# Checking the representative days
# === SETTINGS ===
season_colors = {
'Q1': 'darkred',
'Q2': 'dimgrey',
'Q3': 'steelblue',
'Q4': 'seagreen'}
# Representative days per country
country = ['Angola']
plot_vre_repdays(input_file=input_file, vre_profile=VREProfile, pHours=pHours,
season_colors=season_colors, countries=country, min_alpha=0.5, max_alpha=1, path=os.path.join(folder_output, 'plot_vre_repdays_angola.png'))