
This post goes through the underpinnings of Exceedance Probabilities and why solar energy developers and enthusiasts alike benefit from knowing the basics.

Exceedance Probabilities in the context of solar energy are also referred to as P50/P90 analysis or simply P values. Among banks and investment firms, it’s the staple statistical method for determining the economic risk associated with solar resource uncertainty.

In short, a P50/P90 analysis determines the likelihood that a solar plant will yield a specific amount of energy (i.e. dollars) during any given year of its life. For this reason, exceedance probabilities are paramount for a solar project to a) secure competitive financing and b) manage operational costs and debt obligations.

So, what exactly are they?

P as in Percentile


P values refer to the probability that a certain value will be exceeded. For example, a P90 value of 100 means there is a 90% probability of exceeding 100.

In our context, a P90 of 100 is the likelihood that a solar plant will yield more than 100 units of energy.

Pro-tip: Notice P90 is NOT a 90% probability of producing exactly 100 units, but of exceeding 100 units (a subtle yet very different meaning, and a common misunderstanding).
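Put formally, the exceedance probability of a production level x is EP(x) = P(production > x) = 1 - F(x), where F is the cumulative distribution function we will build below. A Pxx value is therefore the x for which EP(x) = xx%, i.e. the (100 - xx)th percentile of the production distribution: P90 is the 10th percentile, P50 the median, and P10 the 90th percentile.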

Methodologies to calculate P values

Dealing with ‘uncertainty’ and ‘statistical methods’ may sound intimidating, but once we clear up a few concepts it should be straightforward to grasp.

To calculate statistically robust P values of energy estimates we need two inputs:

1) A long-term historical weather dataset: Using multi-year data ensures that potential worst-case scenarios that could affect a project’s financial terms are taken into account.

2) System performance modeling: An hourly simulation of the system’s performance for every year in the dataset provides detailed expectations of its output.

All else equal, yearly outputs are mainly driven by the weather conditions at the project’s site. Let’s not forget, the goal is to understand that variability and determine the asset’s expected output on a year-to-year basis.

From a statistics standpoint, we do this by fitting the historical dataset to a function we understand in order to make inferences from it.

In other words, we calculate P values through a two step process:

1) Fit the previously simulated yearly plant outputs to a distribution function

2) Calculate the desired P value from the function’s properties.

Consider a fictional 10 MW utility-scale plant somewhere in California


I’ve simulated yearly energy outputs for a 1998-2015 dataset at the site; the results are shown in the following table:

Year    Energy Output [GWh]
1998    16.6
1999    17.8
2000    17.6
2001    17.9
2002    17.9
2003    17.4
2004    17.9
2005    17.4
2006    17.6
2007    18.1
2008    18.1
2009    17.8
2010    17.2
2011    17.7
2012    18.2
2013    18.3
2014    18.3
2015    18.0
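If you want to follow along without the CSV read in the next snippet, here is a minimal sketch that builds an equivalent DataFrame straight from the table above, assuming a single 'Energy Output [GWh]' column indexed by year (the values are as listed in the table, so results will differ slightly from the full-precision CSV):

#Build the dataset directly from the table above.
import pandas as pd

years = list(range(1998, 2016))
energy_gwh = [16.6, 17.8, 17.6, 17.9, 17.9, 17.4, 17.9, 17.4, 17.6,
              18.1, 18.1, 17.8, 17.2, 17.7, 18.2, 18.3, 18.3, 18.0]
df = pd.DataFrame({'Energy Output [GWh]': energy_gwh},
                  index=pd.Index(years, name='Year'))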

Plotting a histogram will show how the energy outputs are distributed, or, the density of our sample.

#I'm using pandas and matplotlib for this one.
import pandas as pd
import matplotlib.pyplot as plt

#Read csv with simulated data.
df = pd.read_csv('../data/Timeseries_Fictional_Plant.csv')
#Create figure and axis.
fig_hist, ax_hist = plt.subplots(1,1)
#Figure size.
fig_hist.set_size_inches(10,6)
#Plot histogram of the yearly energy outputs.
df.hist(column='Energy Output [GWh]', bins = 7, ax = ax_hist)
#Create labels and some styling.
plt.ylabel('Frequency')
plt.xlabel('Energy Output [GWh]')
plt.title('Simulated Energy Output [GWh] 1998 - 2015')
plt.ylim(0,6)
plt.xlim(16, 19)

Fig. 1: Simulated energy output distribution, which looks close to normally distributed

The most typical path going forward is to assume the data follows a Normal Probability Distribution and to calculate its cumulative form (the integral). This would be fine if the data were truly normally distributed; however, that’s arguably not the case with the solar resource.

Across the 20-30 year life of a solar project, outlier events such as cyclic weather patterns or volcanic eruptions may skew the data (in our example we potentially have one!).

An alternative approach is to not assume any particular distribution and instead build one directly from the data. In particular, we want to build an Empirical Cumulative Distribution Function.

For illustration purposes, I will calculate the Exceedance Probabilities with both methods and show why understanding the difference matters.

Let’s start by assuming a normal distribution.

Normal Distribution Approach


This approach is simple since we know what the function looks like (Wikipedia). Note it takes two parameters: the mean and standard deviation of the dataset.

#Compute mean of the simulated yearly outputs.
forecast_mean = df['Energy Output [GWh]'].mean()
#Compute standard deviation of the sample.
forecast_std = df['Energy Output [GWh]'].std()

Of course, we can use Python’s scipy.stats module to beautifully and painlessly plot our function:

#I'm using pandas, numpy, matplotlib, scipy.stats
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sts

#Create Figure and Axis.
fig, ax = plt.subplots(1,1)
#Compute range values, let's do +/- 4 std. deviations.
x_min = forecast_mean - forecast_std * 4
x_max = forecast_mean + forecast_std * 4
#Create range array
x_range = np.linspace(x_min, x_max, 200)
#Define mean and std as input parameters for function.
loc = forecast_mean
scale = forecast_std
#Fit function to parameters
normal = sts.norm(loc = loc, scale = scale)
#Evaluate function with X-range
y_pdf = normal.pdf(x_range)
#Plot Gaussian Probability Density Function
ax.plot(x_range, y_pdf, label = 'Gaussian PDF')
#Let's plot a 95% Confidence Interval for kicks!
tolerance = 0.05
low_bound = normal.ppf(tolerance/2)
upper_bound = normal.ppf(1 - (tolerance/2))

That’s about it in terms of math - now just adding style!

#Plot mean as vertical curve
ax.vlines(x = loc, ymin=0, ymax=normal.pdf(loc), linestyles='--', colors='orange', label = 'Mean: {0:.2f} [GWh]'.format(loc))
#Let's draw the intervals.
ax.vlines(x = low_bound, ymin=0, ymax=normal.pdf(low_bound), linestyles='--', colors='red', label = 'Confidence Intervals: 95%')
ax.vlines(x = upper_bound, ymin=0, ymax=normal.pdf(upper_bound), linestyles='--', colors='red')
#Define the xrange to show for formatting.
plt.xlim(x_min, x_max)
#Finally some titles, labels and legends!
plt.title('Gaussian Probability Density Function (PDF)')
plt.xlabel('Solar Plant Energy Output [GWh]')
plt.ylabel('Density')
plt.legend(frameon=True)

And …

Fig. 2: Fitted Normal Distribution

Actually, we don’t need this form in our analysis. We want to calculate the probability that a forecasted production will be surpassed, meaning the integral of this curve is what interests us. It’s called a CDF, short for Cumulative Distribution Function (and the exceedance probability is simply its complement, 1 - CDF)!
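For the curious, here is a bare-bones sketch of how such a curve can be generated from the already-fitted `normal` object (the styled version lives in the repo):

#Evaluate the CDF over the same x-range used for the PDF.
y_cdf = normal.cdf(x_range)
#The exceedance probability is the complement of the CDF.
y_exceedance = 1 - y_cdf
#Plot both curves.
fig_cdf, ax_cdf = plt.subplots(1, 1)
ax_cdf.plot(x_range, y_cdf, label='Gaussian CDF')
ax_cdf.plot(x_range, y_exceedance, label='Exceedance probability (1 - CDF)')
plt.xlabel('Solar Plant Energy Output [GWh]')
plt.ylabel('Probability')
plt.legend(frameon=True)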

I’ll try not to clutter this post with too much code; visit this project’s repo if you are interested in how I plot this next one.

Fig. 3: Cumulative Distribution Function

So, how to make sense of this plot?

Take the P10 value; it reads:

‘The likelihood that the plant will yield more than 18.4 GWh is 10%’.

Easy ride from here! We can use the function to get the proportion of the population (the probability) that exceeds any given value, or the production value associated with any desired P level, as sketched below.
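With the fitted `normal` object this is a one-liner per P level, since a Pxx value sits at the (100 - xx)th percentile; a minimal sketch:

#A Pxx value is the output exceeded with xx% probability,
#i.e. the (100 - xx)th percentile of the fitted distribution.
p90 = normal.ppf(0.10)   #exceeded with 90% probability
p50 = normal.ppf(0.50)   #exceeded with 50% probability
p10 = normal.ppf(0.90)   #exceeded with 10% probability
print('P90: {0:.2f} GWh | P50: {1:.2f} GWh | P10: {2:.2f} GWh'.format(p90, p50, p10))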

Now let’s step back for a second and think about this further.

If you were to invest in a solar plant with minimal risk exposure, would you want to calculate your return on investment using a high or low P value?

To answer that, notice a P10 value means that the proportion of yearly simulations whose outcome exceeds that value is only 10%. Another way to think about it: the higher the P level, the lower the production value it references.

For example, a high probability of exceedance, i.e. P90, references a relatively low production yield, and the reverse is true for low P values. That is why financial institutions and plant owners, each in their own best interest, plan around at least the P50 (the median, which under the normal assumption equals the mean): the latter make sure they can service debt obligations and manage operational costs, while the former reduce the risk of borrower default.

Empirical Approach


Now, let’s take a look at the empirical procedure. For our hypothetical plant and dataset there are 18 years, hence 18 production values. Each value contributes an equal share of the total probability, or 1/18. Since the distribution is cumulative, we sort the values (lowest to highest) and take a cumulative sum of those contributions at each consecutive data point.

The procedure is shown below:

### Calculate cumulative function
df_empirical = df.copy()

#Sort by value, lowest to highest.
df_empirical.sort_values(by='Energy Output [GWh]', inplace=True)

#Assign equal probability to each event.
df_empirical['Prob'] = 1. / df_empirical.shape[0]

#Calculate cumulative probability.
df_empirical['cumsum'] = df_empirical.Prob.cumsum()

#Define variables for plot.
x_empirical = df_empirical['Energy Output [GWh]'].values
y_empirical = df_empirical['cumsum'].values
Year Energy Output [GWh] Prob cumsum
1998 16.616828 0.055556 0.055556
2010 17.286320 0.055556 0.111111
2005 17.406996 0.055556 0.166667
2003 17.486454 0.055556 0.222222
2006 17.597609 0.055556 0.277778
2000 17.663284 0.055556 0.333333
2011 17.712516 0.055556 0.388889
2009 17.866143 0.055556 0.444444
1999 17.874383 0.055556 0.500000
2004 17.941267 0.055556 0.555556
2001 17.954113 0.055556 0.611111
2002 17.962099 0.055556 0.666667
2015 18.029019 0.055556 0.722222
2007 18.122649 0.055556 0.777778
2008 18.148823 0.055556 0.833333
2012 18.281073 0.055556 0.888889
2013 18.383961 0.055556 0.944444
2014 18.388794 0.055556 1.000000

With that, we are ready to plot.

Fig. 4: Empirical Distribution

In contrast to the first method, we obtain specific P values by linear interpolation. For example, the P90 value can be computed by interpolating between the table values around a cumulative probability of 0.1. This is crucial since it means we need as many data points as possible to establish a representative cumulative curve and, by consequence, reliable exceedance probabilities.
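A minimal sketch of that interpolation using the `x_empirical` and `y_empirical` arrays defined above (np.interp expects an increasing x-coordinate, which here is the cumulative probability):

#Interpolate the empirical curve at the cumulative (non-exceedance)
#probabilities of interest: 0.10 for P90, 0.50 for P50.
p90_emp = np.interp(0.10, y_empirical, x_empirical)
p50_emp = np.interp(0.50, y_empirical, x_empirical)
print('Empirical P90: {0:.2f} GWh | P50: {1:.2f} GWh'.format(p90_emp, p50_emp))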

So, let’s plot both distributions together, since they appear to tell slightly different stories:

Fig. 5: Empirical vs. Gaussian

The plot shows that the empirical distribution of the 18-year dataset deviates in various sections from the fitted normal distribution.

For this example, the P90 and P50 values calculated from the normal CDF differ from the empirical ones by an optimistic 0.6% and by -0.3%, respectively.
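Assuming the P values from the two sketches above (`p90` and `p50` from the normal fit, `p90_emp` and `p50_emp` from the empirical curve), that kind of comparison boils down to:

#Relative deviation of the normal-fit estimates with respect to the empirical ones.
dev_p90 = (p90 - p90_emp) / p90_emp * 100
dev_p50 = (p50 - p50_emp) / p50_emp * 100
print('P90 deviation: {0:+.1f}% | P50 deviation: {1:+.1f}%'.format(dev_p90, dev_p50))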

Take particular note of the deviations at the tails of the distributions. Assuming a normal distribution may lead to seemingly irrational conclusions; in this example, it would suggest that the simulated production for year 1998 has a likelihood of occurring only once in 330 years!

What we’ve learned so far:


1) To evaluate a project’s financial risk we use P50 and P90 exceedance probabilities based on a multi-year historical dataset.

2) Weather, much like stock prices and other natural phenomena, tends to follow a fat-tailed distribution that the normal distribution does not fit very closely.

3) Because of 2), the empirical methodology yields more reliable results and more realistic estimates for managing plant expectations.

4) When data allows, empirical distributions win!

Now dare I say it, HAPPY SOLAR ENERGY INVESTING!


Pablo Felgueres

