Statistical Methods in Forecasting

Figure 1: Example of the mean error distance (MED) applied to North American Regional Climate Change Assessment Program (NARCCAP) model output for the large-scale severe weather indicator variable WmSh (maximum updraft velocity multiplied by convective available potential energy, m s-1; cf. Gilleland et al. 2016). Values above the one-to-one dashed line indicate a propensity for false-alarm areas (a tendency to forecast in areas without observations), and values below the line indicate a propensity for not forecasting in areas where observations exist.

Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score provide information that is useful for monitoring changes in individual aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts, or that can be helpful for making decisions. Moreover, it is possible for high-quality forecasts (such as high-resolution forecasts) to have very poor scores when evaluated using these standard metrics, even when they provide useful information, while poorer-quality forecasts may score better. In response to these limitations, RAL scientists develop improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including new spatial approaches that can provide more meaningful information about forecast performance for forecast developers as well as forecast users.
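For reference, the categorical scores named above are simple functions of a 2x2 contingency table of forecast and observed event occurrences. The short sketch below (Python, with hypothetical array inputs and a hypothetical threshold argument) illustrates the standard textbook definitions of the Probability of Detection and the Equitable Threat Score; it is illustrative only and is not tied to any particular RAL tool.

    import numpy as np

    def categorical_scores(forecast, observed, threshold):
        """Compute POD and ETS from a 2x2 contingency table of
        threshold exceedances (standard textbook definitions)."""
        f = np.asarray(forecast) >= threshold
        o = np.asarray(observed) >= threshold

        hits = np.sum(f & o)
        misses = np.sum(~f & o)
        false_alarms = np.sum(f & ~o)
        correct_negatives = np.sum(~f & ~o)
        total = hits + misses + false_alarms + correct_negatives

        # Probability of Detection: fraction of observed events that were forecast.
        pod = hits / (hits + misses)

        # Hits expected by chance, which makes the threat score "equitable".
        hits_random = (hits + misses) * (hits + false_alarms) / total
        ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)

        return pod, ets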

FY2017 Accomplishments

In the last year, a new spatial pattern-based characterization was derived by Gilleland (2017) that enables a user to easily determine whether a forecast has a problem, on average, with areas of false alarms or misses.  The technique exploits the asymmetry of the mean error distance (MED), a measure that was introduced in, and subsequently shunned by, the image analysis research community.  The measure was considered unsuitable for image analysis because of its asymmetry; in the context of forecast evaluation, however, that asymmetry provides meaningful diagnostic information.  Figure 1 shows an example of the MED from a recent study evaluating regional climate models for large-scale severe storm indicator variables (Gilleland et al. 2016).
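As a concrete illustration of the measure, the sketch below computes MED between two binary event fields using a Euclidean distance transform. The array names and example fields are hypothetical, and the mapping of the two orientations onto false-alarm versus miss behavior follows the general description above; Gilleland (2017) gives the precise definitions and conventions.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def mean_error_distance(from_mask, to_mask):
        """MED(from, to): average distance from each event grid point in
        `from_mask` to the nearest event grid point in `to_mask`.
        The measure is asymmetric: swapping the arguments generally
        changes the value, which is what makes it diagnostic here."""
        if not from_mask.any() or not to_mask.any():
            return np.nan  # undefined if either field has no events
        # Distance from every grid point to the nearest event in `to_mask`.
        dist_to_events = distance_transform_edt(~to_mask)
        return dist_to_events[from_mask].mean()

    # Hypothetical binary event fields (e.g., indicator-variable exceedances on a grid).
    obs = np.zeros((100, 100), dtype=bool)
    fcst = np.zeros((100, 100), dtype=bool)
    obs[40:60, 40:60] = True
    fcst[45:70, 50:80] = True

    # Comparing the two orientations indicates whether the forecast tends,
    # on average, toward false-alarm areas or toward missed areas.
    med_fcst_to_obs = mean_error_distance(fcst, obs)  # large when forecast events lie far from any observed event
    med_obs_to_fcst = mean_error_distance(obs, fcst)  # large when observed events lie far from any forecast event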

Figure 2: Diagram explaining the tiered system of test cases in MesoVICT.  The idea is to ensure that every method is tested on at least the one core case (middle).  The experiments become increasingly complex as they radiate out from the core, with the understanding that not all participants will have the means or funding to continue all the way out to the last tier.

Work also continues on the Mesoscale Verification Inter-Comparison over Complex Terrain (MesoVICT) project.  The project is the second phase of the spatial forecast verification inter-comparison project (ICP) and is intended to further inform potential users about the newer spatial methods and, in particular, how they can be used in more realistic meteorological situations, including complex terrain, ensembles of forecasts, ensembles of observations, and additional variables besides precipitation, such as wind vectors.  To facilitate a comprehensive comparison in which every method is tested on at least the same minimal set of cases, a tiered approach to the cases has been established (cf. Figure 2).  The project is ramping up, and results are beginning to be published.  A new special collection of papers for Monthly Weather Review has been proposed, and a project overview paper has been submitted to the Bulletin of the American Meteorological Society (Dorninger et al. 2017).  In the fall of 2016, a meeting was held in Bologna, Italy, that included researchers from around Europe, and initial results were presented.

Another recent study involves testing different statistical hypothesis testing procedures in a scenario that compares verification results for competing forecast models (Gilleland et al. 2017).  The study is a follow-on to one conducted by Hering and Genton (2011), who also introduced a newer procedure that accounts for dependence in the underlying series.  The present work focuses only on a pair of single-time-series forecasts verified against a single observation series, and is conducted in collaboration with Amanda S. Hering at Baylor University.  Simulations are used to test the usual paired and non-paired normal-approximation hypothesis tests, versions with a variance inflation factor added (to account for dependence), the corresponding t-tests, and various bootstrap procedures.  Two aims are to determine how each method handles dependence in time and dependence between the competing forecasts; at present, many of the techniques fail in the latter case.
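To make the testing scenario concrete, the sketch below implements one simple member of this family of tests: a normal-approximation test on the mean loss differential between two competing forecast series, with the variance of the mean estimated by a Bartlett-weighted sum of lagged covariances to account for temporal dependence. It is an illustrative construction under those assumptions, not the specific procedure of Hering and Genton (2011) or of Gilleland et al. (2017); the series names and the absolute-error loss are hypothetical choices.

    import numpy as np
    from scipy import stats

    def loss_differential_test(f1, f2, obs, max_lag=None):
        """Test H0: the two forecasts have equal expected absolute error,
        using the loss differential d_t = |f1_t - o_t| - |f2_t - o_t|.
        The variance of the mean is estimated with a Bartlett-weighted
        sum of lagged covariances, a simple way to account for temporal
        dependence in the series (illustrative only)."""
        f1, f2, obs = map(np.asarray, (f1, f2, obs))
        d = np.abs(f1 - obs) - np.abs(f2 - obs)
        n = d.size
        if max_lag is None:
            max_lag = int(np.floor(np.sqrt(n)))

        d_centered = d - d.mean()
        # Long-run variance estimate: lag-0 variance plus weighted lagged covariances.
        long_run_var = np.mean(d_centered ** 2)
        for k in range(1, max_lag + 1):
            gamma_k = np.mean(d_centered[:-k] * d_centered[k:])
            long_run_var += 2.0 * (1.0 - k / (max_lag + 1)) * gamma_k

        # Normal-approximation test statistic and two-sided p-value.
        z = d.mean() / np.sqrt(long_run_var / n)
        p_value = 2.0 * stats.norm.sf(abs(z))
        return z, p_value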

FY2018 Goals

  • Begin writing a paper to establish a statistical hypothesis testing framework for the newer spatial verification methods.
  • Continue explorations of the applicability and usefulness of distance metrics and image warping approaches for evaluation of spatial forecasts.
  • Continue working on the MesoVICT project including both its organization and writing papers evaluating one or more spatial verification techniques (e.g., MODE).


References 

Dorninger, M., E. Gilleland, B. Casati, M. P. Mittermaier, E. E. Ebert, B. G. Brown, and L. J. Wilson, 2017: Mesoscale Verification Inter-Comparison over Complex Terrain. Submitted to Bull. Amer. Meteorol. Soc. on 6 September 2017.

Gilleland, E., 2017: A new characterization in the spatial verification framework for false alarms, misses, and overall patterns. Weather Forecast., 32 (1), 187-198, DOI: 10.1175/WAF-D-16-0134.1.

Gilleland, E., M. Bukovsky, C. L. Williams, S. McGinnis, C. M. Ammann, B. G. Brown, and L. O. Mearns, 2016: Evaluating NARCCAP model performance for frequencies of severe-storm environments. Advances in Statistical Climatology, Meteorology and Oceanography, 2 (2), 137-153, DOI: 10.5194/ascmo-2-137-2016.

Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2017: Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Submitted to Mon. Wea. Rev. on 17 October 2017.

Hering, A. S. and M. G. Genton, 2011: Comparing spatial predictions. Technometrics, 53 (4), 414-425, DOI: 10.1198/TECH.2011.10136.