Andy Richling
Christopher Kadow, Sebastian Illing
Institut für Meteorologie, Freie Universität Berlin

Version from March 10, 2017

This is a brief documentation about the Freva Specs Plugin and is currently still under construction. Comments or any kind of feedback is highly appreciated. Please send an e-mail to the authors.

1 Introduction

A common way to determine the quality of forecasts and further to compare them to other forecasts is often done by calculating scalar summary measures. For probabilistic forecasts there are a few verification scores which are generally used [e.g. Wilks2011]. The Brier Score (BS) can be chosen for binary events (e.g. rain/no rain) whereas analyzing discrete multiple-category events (e.g. below normal/normal/above normal precipitation) the Ranked Probability Score (RPS) is preferably used. In the continuous case (infinite number of predictand classes of infinitesimal width) the RPS is extended to the Continuous Ranked Probability Score (CRPS). In addition to the mentioned verification scores, the Plugin is able to calculate the associated skill scores comparing two forecasts or comparing a forecast to the climatological reference. Since the scores are biased to finite ensemble size, additionally ensemble-size-corrected ”Fair” (skill) scores [Ferro2013] are also calculated. All available ( skill) scores used in this Plugin are based on the SpecsVerification R-Package from Siegert [2014].

In section 2, the methods of pre-processing and score calculations are described. Sections 3 and 4 explain the input respectively the output of the Specs Plugin.

2 Methods

2.1 Pre-Processing

Before a calculation of verification scores can be done, the input data have to be pre-processed beforehand. The general pre-processing procedure includes a remapping to a chosen regular longitude-latitude grid and a yearly, monthly or seasonally averaging of every year to finally get yearly time series. In case of decadal experiments, time series will be created according to the Leadtime-Selector Plugin. Figure 1 additionally shows the schematic processing procedure of the Specs Plugin. Note that every of the following score calculation procedures is individually done for every month (season) and/or lead time.

For the calculation of BSS and RPSS metrics a definition of nth threshold(s) is necessary to classify the data into discrete categories. The data will be divided by thresholds Xth into nth + 1 categories Y following the rule Y th < Xthfor: th = 1 Xth1 Y th < Xthfor: 2 th nth 1 Y th+1 Xthfor: th = nth.

Hence, e.g. assume that two thresholds are selected; category Y 1 will contain data below the 1st threshold X1, category Y 2 will contain data equal or greater than X1 and below X2 and category Y 3 will contain data equal or greater than the 2nd threshold X2. In case of percentile-based categorization threshold values will be individually received for observation and forecast by separate percentile calculations beforehand.


Figure 1: Schematic processing procedure of the Specs Plugin.

2.2 Brier Skill Score (BSS-Metric)

For binary events (e.g. rain/no rain separated by a threshold) the Brier Score (BS) essentially gives information about the mean squared error of the probabilistic forecast with ensemble size M and is defined as

BS = 1 n t=1T (y t ot)2, (1)

where o stands for the observed event (if o = 1 then the event occur, if o = 0 the event does not occur) and y for the related forecast probability of the t-th forecast-event pair. In context of ensemble forecast the probabilistic forecast y is based on the relative number e M of ensemble members e predicting that the event occurs. The BS is bounded between 0 and 1 with a perfect forecast BS = 0. To evaluate the performance of one forecast against a reference forecast, the calculation of skill scores SS is a good choice. The associated skill score for binary events is the Brier Skill Score

BSS = 1 BSfc BSref, (2)

where BSfc is the Brier Score of the forecast and BSref the Brier Score of the reference. If no reference forecast experiment is selected in the Plugin, the climatology is used as the reference instead. Since scores are biased for finite ensemble size, the Brier Score will be adjusted following Ferro [2013] to be fair with comparing ensemble forecasts of different ensemble sizes. Thus, the Fair Brier Score FairBS for one forecast-event pair t is defined as

FairBSt = e M o2 e(M e) M2(M 1). (3)

According to the ensemble-size adjustment of the Bier Score, the Fair Brier Skill Score can also be calculated, whereas the Brier Scores of forecast (BSfc) and reference (BSref) will be calculated related to equation 3. Note that in the Fair-Score-background of this Plugin the climatology will get a hypothetical ensemble size which is equal to the number of observations.

2.3 Ranked Probability Skill Score (RPSS-Metric)

In case you want to divide data into more than 2 groups having discrete events (e.g. dry/near-normal/wet precipitation conditions) the Ranked Probability Score (RPS) can be chosen. The RPS is basically an extension of the Brier Score to multiple-event situations and is based on squared errors regarding to the cumulative probabilities in forecasts Y and observations O. The RPS is defined as

RPS = 1 n t=1T k=1K(Y k(t) Ok(t))2 , (4)

where K is the number of discrete categories. According to the Brier Skill Score, the Ranked Probability Skill Score

RPSS = 1 RPSfc RPSref, (5)

can be derived. With respect to ensemble-size-adjustment (see 2.2) the Fair Ranked Probability Score FairRPS for one forecast-event pair t follows

FairRPSt = k=1K Ek M Ok 2 Ek(M Ek) M2(M 1) , (6)

where E is the cumulative number of ensemble members. In this context the calculation of the Fair Ranked Probability Skill Score is similar to the computation of FairBSS in section 2.2.

2.4 Continuous Ranked Probability Skill Score (CRPSS-Metric)

The Continuous Ranked Probability Score (CRPS) is essentially the extension of the RPS to the continuous case. The number of predictand classes will be replaced by an infinite number of infinitesimal width, so that the summations in eq. 4 is substitute with the integrals.

CRPS = 1 n t=1T F Y (t)) FO(Y (t))2dx, (7)

where FY is the cumulative distribution function (CDF) from the forecast and FO is the Heaviside function jumping from 0 to 1 when the forecast variable Y is equal to the observation. Following Hersbach [2000] and Siegert [2014] respectively the CRPS derived from ensemble prediction systems can be written as

CRPS = 1 n t=1T 1 M i=1M|y i(t) ot| 1 M2 1i<jM|yi(t) yj(t)|. (8)

The corresponding Continuous Ranked Probability Skill Score is defined as

CRPSS = 1 CRPSfc CRPSref. (9)

According to the ensemble-size-adjustment, the Fair Continuous Ranked Probability Score FairCRPS follows

CRPS = 1 n t=1T 1 M i=1M|y i(t) ot| 1 M(M 1) 1i<jM|yi(t) yj(t)|. (10)

The calculation of the Fair Continuous Ranked Probability Skill Score is analogous to equation 9.

pict pict pict

Figure 2: Visualization of BS (left), RPS (center) and CRPS (right) conceptual examples. BS: For a specific date, 7 out of 10 ensemble members of a forecast system predict a frost day (green). In case, a frost day is observed (red solid) the BS for this event is given by (0.7 1)2. In case a temperature above 0C is observed (red dashed), the BS is given by (0.7 0)2. RPS: For a specific date, 3 out of 10 ensemble members predict a temperature, which is classified as category “below normal”, 5 out of 10 members predict temperature category “normal” and 2 out of 10 the category “above normal”. In this example a temperature classified as “normal” is observed, so that the cumulative probability of the observation jumps from 0 to 1 (red). Considering the cumulative probability of forecast (green), RPS is calculated by (0.3 0)2 + (0.8 1)2 + (1 1)2. Note that the last term is always zero. CRPS: For a specific date, temperature values predicted from ensemble members of a forecast system follow a hypothetical Gaussian distribution (green). Assuming that the observed temperature is 0C (red), the squared area between the cumulative probabilities of observation and forcast represents the CRPS. As we can see in the figure, the (squared) area and therby the CRPS depend on the distribution of the ensemble forecast.

2.5 Significance

The information about the signifiance of Skill Scores is based on the block-bootsrap method. For this, time series of length n is divided into (n-l+1) overlapping blocks of length l (set to five), where the k-th block contains the time series values xi (i goes from i = k to k + l 1). After that, a random selection with replacing is applied to generate a new time series of forecast-reference-observation cases and further to calculate the skill score. This procedure is done e.g. 1000 times to build up a sampling distribution of the considered skill score (e.g. Mason [2008]). The positive (negative) skill score is significantly different from zero if the 5%-percentile (95%-percentile) is greater (less) than zero.

3 Input Parameters

At first, you have to specify your output (Outputdir) and cache (Cachedir) directories. The data of FORECAST (”1”) and REFERENCE (”2”) can be selected via Project, Product, Institute, Model, Experiment and Ensemble. Leave all entries of REFERENCE empty if you want to compare calculated scores to the climatology. In case the FORECAST and/or REFERENCE is a decadal experiment set Leadtimesel True. Note in case you only have one decadal experiment (e.g. comparison between decadal experiment and historical) you have to input the decadal experiment as FORECAST. In Observation you can choose the observation to compare.

Next, you have to specify Metrics you want to calculate. In case of RPSS/BSS you have to configure the Thresholds, otherwise leave emtpy. Calculation of significance levels can be enabled by entering a Bootstrap number. Choose Variable and if available Level you want to analyze. In case of decadal experiments you have to specify Leadtimes and Decadals you want to analzye. Otherwise leave empty and insert years you want to process into the Year range parameter. Further, select the Time frequency of your input data and define the temporal resolution of output (Temp res). Additionally, you can choose specific months (Mon) which will be processed. In combination with the Timemean operator it is possible to define extended seasons with months of interest. In Grid you have to specify a regular lon-lat grid to remap all of the input data to the same grid. A selection of a specific Region is also possible as well as the option to subtract a field mean of a region from the time series (Region trend). Instead of a gridwise, a field-mean-based analysis can be chosen with the areamean operator.

Finally, you have the option to remove the cache directories (Cacheclear) and to specify the number of parallel-processing tasks (Ntask).

Table 1: Input options for the Specs Plugin.
Outputdir Output directory.
mandatory default: /miklip-work/b/user/evaluation_system/output/specs/date

Cachedir Cache directory
mandatory default: /miklip-work/b/user/evaluation_system/cache/specs/date

Project1 Choose project of FORECAST, e.g. cmip5, baseline1, baseline0.

Product1 Choose product of FORECAST, e.g. output.

Institute1 Choose institute of FORECAST, e.g. MPI-M.

Model1 Choose model of FORECAST, e.g. MPI-ESM-LR.

Experiment1 Choose experiment prefix of FORECAST, e.g. decs4e, dffg2e, dffs4e, historical.

Ensemble1 Choose ensemble of FORECAST, e.g. r1i1p1, r2i1p1 or ”*” for all members.

Leadtimesel1 Option switch to use the leadtimeselector for the FORECAST. Set ”True” for decadal experiments.
mandatory default: False

Project2 Choose project of REFERENCE, e.g. cmip5, baseline1, baseline0.

Product2 Choose product of REFERENCE, e.g. output.

Institute2 Choose institute of REFERENCE, e.g. MPI-M.

Model2 Choose model of REFERENCE, e.g. MPI-ESM-LR.

Experiment2 Choose experiment prefix of REFERENCE, e.g. decs4e, dffg2e, dffs4e, historical.

Ensemble2 Choose ensemble of REFERENCE, e.g. r1i1p1, r2i1p1 or ”*” for all members.

Leadtimesel2 Option switch to use the leadtimeselector for the REFERENCE. Set ”True” for decadal experiments.
mandatory default: False

Observation Choose observation to compare, e.g. merra, hadcrut3v, eraint. You can also specify an observation file.

Leadtimes Leadtimes to analyze if a decadal experiment is selected. Use ”1,3” for separate lead times or ”1-3” for averaging.

Decadals Comma separated lists of decadal experiments if a decadal experiment is selected.

Metrics Specify the scores you want to calculate (RPSS: RPS/RPSS/FairRPS/FairRPSS; BSS: BS/BSS/FairBS/FairBSS; CRPSS: CRPS/CRPSS/FairCRPS/FairCRPSS).
mandatory default: RPSS

Threshold Threshold(s) for separating categories following the format OPTION_THRESHOLD1,THRESHOLD2. Absolute values (OPTION=const) or percentile values (OPTION=perc) are possible. E.g. perc_1/3,2/3 to separate data into 3 categories based on the percentiles 33.3% and 66.7%. Use 1 threshold for BS(S) and at least 2 for RPS(S) calculation. Leave empty for CRPSS metric.

Variable The name of the variable you want to analyze (CMOR).

Year range Choose time period to be processed (comma separated; e.g. 1980,2010). Only available if no decadal experiment is chosen.

Time frequency Choose time frequency of input data. Comma seperated if you want to chose individual time frequencies for FORECAST and REFERENCE.
mandatory default: mon

Level Choose level [in Pa], e.g. 50000. Only available for files containing levels.

Grid Specify a regular lon-lat grid (gridfile or grid description like r72x36).
mandatory default: r72x36

Region Choose a region box to be processed. The format have to be W,E,S,N (e.g. -180,180,0,90 for NH).

default: -180,180,-90,90

Region trend Choose optional region for an anomaly calculation: The time series (field mean) from this box will be subtracted.

Area mean Option to calculate the field mean of the region box above instead of a grid-wise processing.
mandatory default: False

Temp res Temporal resolution of the output time series you want to analyze. Valid options are ”year”, ”seas”, and ”mon” (e.g. if you choose ”mon”, your selected months will be individually analyzed). Note in case of selecting ”seas” with decadal experiments, the winter season DJF only contains January and February!
mandatory default: year

Mon Select months to be processed. You also can set ”timemean” below ”True” if you want to calculate the yearly average over these months (e.g. To get time series of yearly averaged extended summer season MJJAS insert 5,6,7,8,9). Only available if ”temp_res” is set ”mon”.

Timemean Option to calculate the yearly mean of selected month above to get the time series. Only available if you insert some months in ”mon” and temp_res is set to ”mon”.
mandatory default: False

Bootstrap number Number of bootstrap runs for calculation of significance levels. Leave empty if you do not want to calculate significance levels. WARNING: This could take a lot of time (up to day/s) when activated!.

Cacheclear Option switch to NOT clear the cache.
mandatory default: True

Ntask Number of tasks
mandatory default: 24

4 Output

The processed files can be found in the selected Outputdir. Pre-processed NetCDF files of FORECAST/REFEERENCE/OBSERVATION time series are stored in fc/ref/obs directories containing the ”Variable” with dimension of [LON; LAT; DATE; ENSEMBLE]. NetCDF files [LON; LAT; DATE] of calculated scores can be found in a separated directory depending on your chosen Metrics. Output figures showing the gridwise Fair Skill Score of your chosen Metrics will be saved in the plot directory (only available if Areamean is set False).


Figure 3: Fair RPSS of Near-Surface temperature for the analysis baseline1 (FORECAST) against historical (REFERENCE) for lead time 2-5. As observation data, ERA-Interim is used.


   C. A. T. Ferro. Fair scores for ensemble forecasts. Q.R.J. Meteorol. Soc., 140: 1917–1923, 2013. doi: 10.1002/qj.2270.

   H. Hersbach. Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems. Wea. Forecasting, 15:559–570, 2000. doi:;2.

   S. J. Mason. Understanding forecast verification statistics. Meteorol. Appl., 15: 31–40, 2008. doi: 10.1002/met.51.

   S. Siegert. SpecsVerification: Forecast verification routines for the SPECS FP7 project, 2014. URL R package version 0.3-0.

   D. S. Wilks. Statistical methods in the atmospheric sciences. Academic Press, San Diego, CA, 3rd edition, 2011.