Specs Plugin

Andy Richling

Christopher Kadow, Sebastian Illing

Institut für Meteorologie, Freie Universität Berlin

Version from March 10, 2017

This is a brief documentation of the Freva Specs Plugin and is currently still under construction. Comments or any kind of feedback are highly appreciated. Please send an e-mail to the authors.

### 1 Introduction

A common way to determine the quality of forecasts, and to compare them with other forecasts, is to calculate scalar summary measures. For probabilistic forecasts a few verification scores are in general use [e.g. Wilks, 2011]. The Brier Score (BS) can be chosen for binary events (e.g. rain/no rain), whereas for discrete multiple-category events (e.g. below normal/normal/above normal precipitation) the Ranked Probability Score (RPS) is preferred. In the continuous case (an infinite number of predictand classes of infinitesimal width) the RPS is extended to the Continuous Ranked Probability Score (CRPS). In addition to these verification scores, the Plugin can calculate the associated skill scores, comparing two forecasts or comparing a forecast to the climatological reference. Since the scores are biased for finite ensemble size, ensemble-size-corrected "Fair" (skill) scores [Ferro, 2013] are also calculated. All (skill) scores used in this Plugin are based on the SpecsVerification R-package by Siegert [2014].

Section 2 describes the pre-processing and the score calculations. Sections 3 and 4 explain the input and output of the Specs Plugin, respectively.

### 2 Methods

#### 2.1 Pre-Processing

Before verification scores can be calculated, the input data have to be pre-processed. The general pre-processing procedure includes remapping to a chosen regular longitude-latitude grid and yearly, monthly, or seasonal averaging of every year, to finally obtain yearly time series. In the case of decadal experiments, time series are created according to the Leadtime-Selector Plugin. Figure 1 additionally shows the schematic processing procedure of the Specs Plugin. Note that each of the following score calculations is done individually for every month (season) and/or lead time.

For the calculation of the BSS and RPSS metrics, ${n}_{th}$ threshold(s) must be defined to classify the data into discrete categories. The thresholds ${X}_{th}$ divide the data into ${n}_{th}+1$ categories $Y$ following the rule $$\begin{array}{ll} x < X_{1} & \text{for category } Y_{1}, \\ X_{th-1} \le x < X_{th} & \text{for categories } Y_{th},\; 2 \le th \le n_{th}, \\ x \ge X_{n_{th}} & \text{for category } Y_{n_{th}+1}. \end{array}$$

For example, assume that two thresholds are selected: category ${Y}_{1}$ contains the data below the first threshold ${X}_{1}$, category ${Y}_{2}$ contains the data equal to or greater than ${X}_{1}$ and below ${X}_{2}$, and category ${Y}_{3}$ contains the data equal to or greater than the second threshold ${X}_{2}$. In the case of percentile-based categorization, the threshold values are obtained separately for observation and forecast by percentile calculations beforehand.
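The categorization rule above can be sketched in a few lines of numpy; the `categorize` helper is illustrative and not part of the Plugin:

```python
import numpy as np

def categorize(values, thresholds):
    """Assign each value to one of n_th + 1 categories delimited by the
    given ascending thresholds: category 1 holds values below the first
    threshold, category n_th + 1 holds values at or above the last one."""
    # np.digitize returns 0 for x < thresholds[0] and len(thresholds) for
    # x >= thresholds[-1]; shifting by 1 gives 1-based category indices.
    return np.digitize(values, thresholds) + 1

# Two thresholds -> three categories, as in the example in the text.
data = np.array([0.2, 0.5, 0.9])
print(categorize(data, [0.4, 0.8]))  # [1 2 3]
```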

#### 2.2 Brier Skill Score (BSS-Metric)

For binary events (e.g. rain/no rain separated by a threshold) the Brier Score (BS) essentially measures the mean squared error of a probabilistic forecast with ensemble size $M$ and is defined as

$$BS=\frac{1}{n}\sum _{t=1}^{n}{\left({y}_{t}-{o}_{t}\right)}^{2},$$ | (1) |

where $o$ stands for the observed event ($o=1$ if the event occurs, $o=0$ if it does not) and $y$ for the forecast probability of the $t$-th forecast-event pair. In the context of ensemble forecasts, the forecast probability $y$ is the relative number $\frac{e}{M}$ of ensemble members $e$ predicting the event. The $BS$ is bounded between 0 and 1, with $BS=0$ for a perfect forecast. To evaluate the performance of one forecast against a reference forecast, skill scores $SS$ are a good choice. The associated skill score for binary events is the Brier Skill Score

$$BSS=1-\frac{B{S}_{fc}}{B{S}_{ref}},$$ | (2) |

where $B{S}_{fc}$ is the Brier Score of the forecast and $B{S}_{ref}$ the Brier Score of the reference. If no reference forecast experiment is selected in the Plugin, the climatology is used as the reference instead. Since scores are biased for finite ensemble size, the Brier Score is adjusted following Ferro [2013] to allow a fair comparison of ensemble forecasts with different ensemble sizes. Thus, the Fair Brier Score $FairBS$ for one forecast-event pair $t$ is defined as
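Equations 1 and 2, including the climatological reference, can be sketched as follows; the helper names and the synthetic data are illustrative, not part of the Plugin:

```python
import numpy as np

def brier_score(ens, obs, threshold):
    """Brier Score per eq. (1): ens has shape (n, M), obs has shape (n,).
    The binary event is 'value exceeds threshold'."""
    y = (ens > threshold).mean(axis=1)   # forecast probability e/M
    o = (obs > threshold).astype(float)  # observed event (0 or 1)
    return np.mean((y - o) ** 2)

def brier_skill_score(bs_fc, bs_ref):
    """Brier Skill Score per eq. (2)."""
    return 1.0 - bs_fc / bs_ref

rng = np.random.default_rng(0)
ens = rng.normal(size=(50, 10))   # 50 forecast times, 10 ensemble members
obs = rng.normal(size=50)
bs = brier_score(ens, obs, threshold=0.0)

# Climatological reference: constant probability = observed event frequency.
o = (obs > 0.0).astype(float)
bs_clim = np.mean((o.mean() - o) ** 2)
print(brier_skill_score(bs, bs_clim))
```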

$$FairB{S}_{t}={\left(\frac{e}{M}-o\right)}^{2}-\frac{e\left(M-e\right)}{{M}^{2}\left(M-1\right)}.$$ | (3) |

In line with the ensemble-size adjustment of the Brier Score, the Fair Brier Skill Score can also be calculated, where the Brier Scores of forecast ($B{S}_{fc}$) and reference ($B{S}_{ref}$) are computed according to equation 3. Note that in the Fair-Score context of this Plugin the climatology is given a hypothetical ensemble size equal to the number of observations.
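A direct transcription of eq. (3) for a single forecast-event pair (the function name is illustrative):

```python
def fair_brier_score(e, M, o):
    """Fair Brier Score of one forecast-event pair per eq. (3):
    e = number of members predicting the event, M = ensemble size,
    o = observed event (0 or 1)."""
    return (e / M - o) ** 2 - e * (M - e) / (M ** 2 * (M - 1))

# The correction term vanishes when the ensemble is unanimous (e = 0 or e = M):
print(fair_brier_score(10, 10, 1))  # 0.0
# Otherwise it reduces the raw squared error, here (0.4 - 1)^2 = 0.36:
print(fair_brier_score(4, 10, 1))
```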

#### 2.3 Ranked Probability Skill Score (RPSS-Metric)

If the data are to be divided into more than two discrete categories (e.g. dry/near-normal/wet precipitation conditions), the Ranked Probability Score (RPS) can be chosen. The RPS is basically an extension of the Brier Score to multiple-event situations and is based on the squared errors of the cumulative probabilities of forecasts $Y$ and observations $O$. The RPS is defined as

$$RPS=\frac{1}{n}\sum _{t=1}^{n}\left[\sum _{k=1}^{K}{\left({Y}_{k}\left(t\right)-{O}_{k}\left(t\right)\right)}^{2}\right],$$ | (4) |

where $K$ is the number of discrete categories. Analogous to the Brier Skill Score, the Ranked Probability Skill Score

$$RPSS=1-\frac{RP{S}_{fc}}{RP{S}_{ref}},$$ | (5) |

can be derived. With respect to the ensemble-size adjustment (see 2.2), the Fair Ranked Probability Score $FairRPS$ for one forecast-event pair $t$ follows

$$FairRP{S}_{t}=\sum _{k=1}^{K}\left[{\left(\frac{{E}_{k}}{M}-{O}_{k}\right)}^{2}-\frac{{E}_{k}\left(M-{E}_{k}\right)}{{M}^{2}\left(M-1\right)}\right],$$ | (6) |

where ${E}_{k}$ is the cumulative number of ensemble members predicting up to category $k$. The calculation of the Fair Ranked Probability Skill Score is then analogous to the computation of $FairBSS$ in section 2.2.
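To make the cumulative-probability construction in eq. (4) concrete, here is a minimal numpy sketch; the `rps` helper and the toy category data are illustrative, not part of the Plugin:

```python
import numpy as np

def rps(ens_cat, obs_cat, K):
    """RPS per eq. (4): ens_cat has shape (n, M) holding 1-based category
    indices per ensemble member, obs_cat has shape (n,)."""
    n, M = ens_cat.shape
    total = 0.0
    for t in range(n):
        # Cumulative forecast probabilities Y_k and observation steps O_k.
        Y = np.array([(ens_cat[t] <= k).mean() for k in range(1, K + 1)])
        O = np.array([float(obs_cat[t] <= k) for k in range(1, K + 1)])
        total += np.sum((Y - O) ** 2)
    return total / n

# Two forecast times, four members, three categories.
ens_cat = np.array([[1, 2, 2, 3], [3, 3, 2, 3]])
obs_cat = np.array([2, 3])
print(rps(ens_cat, obs_cat, K=3))  # 0.09375
```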

#### 2.4 Continuous Ranked Probability Skill Score (CRPSS-Metric)

The Continuous Ranked Probability Score (CRPS) is essentially the extension of the RPS to the continuous case. The number of predictand classes is replaced by an infinite number of classes of infinitesimal width, so that the summation in eq. 4 is replaced by an integral:

$$CRPS=\frac{1}{n}\sum _{t=1}^{n}\left[{\int }_{-\infty }^{\infty }{\left({F}_{Y}\left(t\right)-{F}_{O}\left(t\right)\right)}^{2}\,dx\right],$$ | (7) |

where ${F}_{Y}$ is the cumulative distribution function (CDF) of the forecast and ${F}_{O}$ is the Heaviside function jumping from 0 to 1 where the forecast variable $Y$ equals the observation. Following Hersbach [2000] and Siegert [2014], the CRPS derived from ensemble prediction systems can be written as

$$CRPS=\frac{1}{n}\sum _{t=1}^{n}\left[\frac{1}{M}\sum _{i=1}^{M}|{y}_{i}\left(t\right)-{o}_{t}|-\frac{1}{{M}^{2}}\sum _{1\le i<j\le M}|{y}_{i}\left(t\right)-{y}_{j}\left(t\right)|\right].$$ | (8) |
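The ensemble form of the CRPS in eq. (8) can be sketched as follows; `crps_ensemble` is an illustrative helper, not part of the Plugin:

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS per eq. (8): ens has shape (n, M), obs has shape (n,)."""
    n, M = ens.shape
    total = 0.0
    for t in range(n):
        y, o = ens[t], obs[t]
        term1 = np.abs(y - o).mean()
        # Sum of member spreads over pairs 1 <= i < j <= M.
        diffs = np.abs(y[:, None] - y[None, :])
        term2 = diffs[np.triu_indices(M, k=1)].sum() / M ** 2
        total += term1 - term2
    return total / n

# Two members at 0 and 1, observation at 0.5: 0.5 - 1/4 = 0.25.
print(crps_ensemble(np.array([[0.0, 1.0]]), np.array([0.5])))  # 0.25
```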

The corresponding Continuous Ranked Probability Skill Score is defined as

$$CRPSS=1-\frac{CRP{S}_{fc}}{CRP{S}_{ref}}.$$ | (9) |

In line with the ensemble-size adjustment, the Fair Continuous Ranked Probability Score $FairCRPS$ for one forecast-event pair $t$ follows Ferro [2013]:

$$FairCRP{S}_{t}=\frac{1}{M}\sum _{i=1}^{M}|{y}_{i}\left(t\right)-{o}_{t}|-\frac{1}{M\left(M-1\right)}\sum _{1\le i<j\le M}|{y}_{i}\left(t\right)-{y}_{j}\left(t\right)|.$$ | (10) |

The calculation of the Fair Continuous Ranked Probability Skill Score is analogous to equation 9.
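The ensemble-size adjustment of the CRPS in Ferro [2013] replaces the $\frac{1}{{M}^{2}}$ weight of the pair-spread term in eq. 8 by $\frac{1}{M\left(M-1\right)}$. A minimal sketch under that assumption (the helper name is illustrative):

```python
import numpy as np

def fair_crps_ensemble(ens, obs):
    """Fair CRPS following Ferro [2013]: like eq. (8), but with the
    pair-spread term weighted by 1/(M(M-1)) instead of 1/M^2."""
    n, M = ens.shape
    total = 0.0
    for t in range(n):
        y, o = ens[t], obs[t]
        term1 = np.abs(y - o).mean()
        diffs = np.abs(y[:, None] - y[None, :])
        term2 = diffs[np.triu_indices(M, k=1)].sum() / (M * (M - 1))
        total += term1 - term2
    return total / n

# Same toy case as before: the stronger spread penalty removes the bias,
# giving 0.5 - 0.5 = 0.0 here.
print(fair_crps_ensemble(np.array([[0.0, 1.0]]), np.array([0.5])))  # 0.0
```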

#### 2.5 Significance

The significance of the skill scores is assessed with the block-bootstrap method. The time series of length $n$ is divided into $n-l+1$ overlapping blocks of length $l$ (set to five), where the $k$-th block contains the time series values ${x}_{i}$ with $i$ running from $k$ to $k+l-1$. Then a random selection with replacement generates a new time series of forecast-reference-observation cases, from which the skill score is recalculated. This procedure is repeated e.g. 1000 times to build up a sampling distribution of the considered skill score (e.g. Mason [2008]). A positive (negative) skill score is significantly different from zero if the 5%-percentile (95%-percentile) of this distribution is greater (less) than zero.
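The resampling step can be sketched as follows; the index helper and the stand-in "skill" samples are illustrative, not the Plugin's actual implementation:

```python
import numpy as np

def block_bootstrap_indices(n, l, rng):
    """One block-bootstrap resample: the series of length n is split into
    n - l + 1 overlapping blocks of length l, and blocks are drawn with
    replacement until n time indices are collected."""
    starts = rng.integers(0, n - l + 1, size=-(-n // l))  # ceil(n / l) blocks
    idx = np.concatenate([np.arange(s, s + l) for s in starts])
    return idx[:n]

rng = np.random.default_rng(0)
n = 40
fc_minus_ref = rng.normal(loc=0.1, size=n)  # stand-in per-case "skill" values

# Sampling distribution of the (toy) skill score from 1000 resamples.
dist = np.array([fc_minus_ref[block_bootstrap_indices(n, 5, rng)].mean()
                 for _ in range(1000)])
lo, hi = np.percentile(dist, [5, 95])
print(lo > 0 or hi < 0)  # significantly different from zero?
```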

### 3 Input Parameters

First, you have to specify your output (Outputdir) and cache (Cachedir) directories. The data of FORECAST ("1") and REFERENCE ("2") can be selected via Project, Product, Institute, Model, Experiment and Ensemble. Leave all entries of REFERENCE empty if you want to compare the calculated scores to the climatology. If the FORECAST and/or REFERENCE is a decadal experiment, set Leadtimesel to True. Note that if you only have one decadal experiment (e.g. a comparison between a decadal and a historical experiment), you have to enter the decadal experiment as FORECAST. In Observation you can choose the observational data set to compare against.

Next, you have to specify the Metrics you want to calculate. In the case of RPSS/BSS you have to configure the Thresholds; otherwise leave them empty. The calculation of significance levels can be enabled by entering a Bootstrap number. Choose the Variable and, if available, the Level you want to analyze. In the case of decadal experiments you have to specify the Leadtimes and Decadals you want to analyze. Otherwise leave them empty and insert the years you want to process into the Year range parameter. Further, select the Time frequency of your input data and define the temporal resolution of the output (Temp res). Additionally, you can choose specific months (Mon) to be processed. In combination with the Timemean operator it is possible to define extended seasons from the months of interest. In Grid you have to specify a regular lon-lat grid to remap all input data to the same grid. A specific Region can also be selected, as well as the option to subtract the field mean of a region from the time series (Region trend). Instead of a gridwise analysis, a field-mean-based analysis can be chosen with the Areamean operator.

Finally, you have the option to remove the cache directories (Cacheclear) and to specify the number of parallel-processing tasks (Ntask).

### 4 Output

The processed files can be found in the selected Outputdir. Pre-processed NetCDF files of the FORECAST/REFERENCE/OBSERVATION time series are stored in the fc/ref/obs directories, containing the "Variable" with dimensions [LON; LAT; DATE; ENSEMBLE]. NetCDF files [LON; LAT; DATE] of the calculated scores can be found in a separate directory depending on your chosen Metrics. Output figures showing the gridwise Fair Skill Score of your chosen Metrics are saved in the plot directory (only available if Areamean is set to False).

### References

C. A. T. Ferro. Fair scores for ensemble forecasts. Q. J. R. Meteorol. Soc., 140: 1917–1923, 2013. doi: 10.1002/qj.2270.

H. Hersbach. Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems. Wea. Forecasting, 15:559–570, 2000. doi: 10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

S. J. Mason. Understanding forecast verification statistics. Meteorol. Appl., 15: 31–40, 2008. doi: 10.1002/met.51.

S. Siegert. SpecsVerification: Forecast verification routines for the SPECS FP7 project, 2014. URL http://CRAN.R-project.org/package=SpecsVerification. R package version 0.3-0.

D. S. Wilks. Statistical methods in the atmospheric sciences. Academic Press, San Diego, CA, 3rd edition, 2011.