Statistical Methods in Forecasting

Background

Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score are useful for monitoring changes in individual aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts or to support decision making. Moreover, high quality forecasts (such as high-resolution forecasts) can score very poorly on these standard metrics even when they provide useful information, while poorer quality forecasts may score better. In response to these limitations, WSAP scientists develop improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including new spatial approaches that can provide more meaningful information about forecast performance for forecast developers as well as forecast users.
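As a concrete illustration, the following is a minimal Python sketch (an assumed example, not WSAP code) of two of the categorical metrics named above, computed from the standard 2x2 contingency table of hits (a), false alarms (b), misses (c), and correct negatives (d):

    def pod(a, b, c, d):
        """Probability of Detection: fraction of observed events that were forecast."""
        return a / (a + c)

    def ets(a, b, c, d):
        """Equitable Threat Score: threat score adjusted for hits expected by chance."""
        n = a + b + c + d
        a_random = (a + b) * (a + c) / n  # hits expected from a random forecast
        return (a - a_random) / (a + b + c - a_random)

    # Example: 50 hits, 20 false alarms, 30 misses, 900 correct negatives.
    print(pod(50, 20, 30, 900))  # 0.625
    print(ets(50, 20, 30, 900))  # ~0.47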

FY2018 Accomplishments

Figure 1: Results from Gilleland et al. (2018; reproduced from their Figure 3) demonstrating the empirical size of hypothesis tests at the 10% level for comparing the absolute-error loss between two competing forecasts, say F1 and F2.  Top left shows results for no contemporaneous correlation (i.e., when F1 and F2 are not correlated with each other), top right for moderate contemporaneous correlation, and bottom left for high contemporaneous correlation.  Each case has moderately strong temporal dependence.  An accurate test at the 10% level should have an empirical size of approximately 10% (horizontal dotted light-gray line).  In each case, the HG test of Hering and Genton (2011) is the most accurate.

In the last year, several traditional hypothesis testing procedures (including IID and block bootstrap methods) for determining which of two competing forecasts, say F1 and F2, is better in the sense of average loss differential were evaluated for accuracy in size, and those found to be accurate were also evaluated for power.  The impacts of both temporal dependence and contemporaneous correlation (i.e., when F1 and F2 are correlated with each other) were analyzed.  The most commonly used tests in this setting (the z-test, and the z-test with a variance inflation factor, VIF, applied to account for temporal dependence) were both found to be strongly affected by contemporaneous correlation, and were generally not as accurate as the test of Hering and Genton (HG, 2011).
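For illustration, the following minimal Python sketch (an assumed example, not the code used in the study) applies the basic z-test to the absolute-error loss differential between F1 and F2, optionally inflating the variance with a simple lag-1 autocorrelation-based VIF to account for temporal dependence; the HG test itself uses a more careful estimate of the variance of the mean loss differential and is not reproduced here.

    import numpy as np
    from scipy import stats

    def loss_differential_z_test(f1, f2, obs, use_vif=False):
        """Two-sided z-test of H0: F1 and F2 have equal mean absolute-error loss."""
        d = np.abs(f1 - obs) - np.abs(f2 - obs)  # loss-differential series
        n = len(d)
        var_mean = d.var(ddof=1) / n             # naive variance of the mean
        if use_vif:
            # AR(1)-style inflation from the lag-1 autocorrelation (an assumed simple form)
            r1 = np.corrcoef(d[:-1], d[1:])[0, 1]
            var_mean *= (1 + r1) / (1 - r1)
        z = d.mean() / np.sqrt(var_mean)
        p_value = 2 * stats.norm.sf(abs(z))
        return z, p_value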

Another commonly used procedure is the bootstrap, a resampling method that does not assume a particular distribution for the statistic under study, but that nevertheless still makes some assumptions about it.  Two common variants are the independent and identically distributed (IID) bootstrap and the block bootstrap.  The IID bootstrap assumes that the underlying data are temporally independent, while the block bootstrap assumes that the length of temporal dependence is considerably shorter than the length of the data.  The block bootstrap fared generally well in the tests, as it was robust to contemporaneous correlation where other tests were strongly affected by it.  It was, however, generally oversized; that is, it rejected the null hypothesis too often when the null hypothesis was true.  It also requires much larger sample sizes than a parametric test.
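A minimal Python sketch of the two bootstrap variants applied to the mean loss differential follows (illustrative only; the block length and number of resamples are assumed tuning choices, not those of the study):

    import numpy as np

    rng = np.random.default_rng(0)

    def iid_bootstrap_means(d, n_boot=1000):
        """IID bootstrap: resample individual loss differentials, assuming independence."""
        n = len(d)
        return np.array([rng.choice(d, size=n, replace=True).mean() for _ in range(n_boot)])

    def block_bootstrap_means(d, block_len=10, n_boot=1000):
        """Circular block bootstrap: resample contiguous blocks to retain temporal dependence."""
        n = len(d)
        n_blocks = int(np.ceil(n / block_len))
        means = np.empty(n_boot)
        for b in range(n_boot):
            starts = rng.integers(0, n, size=n_blocks)
            idx = np.concatenate([(s + np.arange(block_len)) % n for s in starts])[:n]
            means[b] = d[idx].mean()
        return means

A two-sided bootstrap p-value for the hypothesis of zero mean loss differential can then be obtained by centering the resampled means at zero and comparing them with the observed mean.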

Results from the study were published in Gilleland et al. (2018).  Figure 1, which reproduces Figure 3 of that paper, displays the results of the testing for the case of moderately strong temporal dependence and three levels of contemporaneous correlation.  An accurate test should have an empirical type I error rate near the significance level of the test, which for this figure is 10%.
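The notion of empirical size can be illustrated with a small Monte Carlo sketch (assumed parameters, not the experiment of the paper): simulate many series under a null hypothesis of equal predictive ability, apply a test that returns a p-value (for example, a wrapper around the z-test sketched above) at the 10% level, and record the fraction of rejections.

    import numpy as np

    rng = np.random.default_rng(1)

    def empirical_size(p_value_test, n=100, n_sims=2000, alpha=0.10, rho=0.5):
        """Fraction of rejections when the null hypothesis is true (should be near alpha)."""
        rejections = 0
        for _ in range(n_sims):
            obs = np.zeros(n)
            # AR(1) forecast errors with the same distribution for F1 and F2 (null is true)
            e1, e2 = np.zeros(n), np.zeros(n)
            for t in range(1, n):
                e1[t] = rho * e1[t - 1] + rng.normal()
                e2[t] = rho * e2[t - 1] + rng.normal()
            if p_value_test(obs + e1, obs + e2, obs) < alpha:
                rejections += 1
        return rejections / n_sims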

Figure 2: Diagram explaining the tiered system for test cases in MesoVICT.  The aim is to ensure that every method is tested on at least the one core case (middle).  The experiments become increasingly complex as they radiate out from the core, with the understanding that not all participants will have the funding or other means to continue all the way out to the last tier.

Work also continues on the Mesoscale Verification Intercomparison in Complex Terrain (MesoVICT) project.  The project is the second phase of the spatial forecast verification inter-comparison project (ICP) and is intended to further inform potential users about the newer spatial methods, in particular how they can be used in more realistic meteorological situations, including complex terrain, ensembles of forecasts, ensembles of observations, and additional variables besides precipitation, such as wind vectors.  To facilitate a comprehensive comparison in which all methods are tested on at least the same minimal set of cases, a tiered approach to the cases has been established (cf. Figure 2).  The project is ramping up as results are beginning to be published.  A new special collection of papers for Monthly Weather Review has been started, and a project overview paper has been published in the Bulletin of the American Meteorological Society (Dorninger et al. 2018).  Several webinars among the international participants were held during the year.  A series of new geometric test cases, designed to challenge spatial displacement-focused methods, was also developed over the course of the year.

In addition to the MesoVICT project, Abatan et al. (2018) published a paper applying the feature-based spatial method called MODE (developed here at NCAR) to a climate application concerned with analyzing multi-year droughts.

FY2019 Goals

  • Publish a paper describing the new geometric cases with initial results for several of the displacement-based measures.

  • Continue working on the MesoVICT project including both its organization and writing papers evaluating one or more spatial verification techniques (e.g., MODE).

References

Abatan, A. A., W. J. Gutowski, Jr., C. M. Ammann, L. Kaatz, B. G. Brown, L. Buja, R. G. Bullock, T. L. Fowler, E. Gilleland and J. Halley Gotway, 2018. Statistics of Multi-year Droughts from the Method for Object-Based Diagnostic Evaluation (MODE). International Journal of Climatology, 38 (8), 3405 - 3420, DOI: 10.1002/joc.5512.

Dorninger, M., E. Gilleland, B. Casati, M. P. Mittermaier, E. E. Ebert, B. G. Brown, and L. J. Wilson, 2018. Mesoscale Verification Inter-Comparison over Complex Terrain. Bull. Amer. Meteorol. Soc., 99 (9), 1887 - 1906, DOI: 10.1175/BAMS-D-17-0164.1.

Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018. Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Mon. Wea. Rev., 146 (6), 1685 - 1703, DOI: 10.1175/MWR-D-17-0295.1.

Hering, A. S. and M. G. Genton, 2011. Comparing spatial predictions. Technometrics, 53 (4), 414 - 425, DOI: 10.1198/TECH.2011.10136.