Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score provide information that is useful for monitoring changes in single aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts, or that can be helpful for making decisions. Moreover, high quality forecasts (such as high-resolution forecasts) can receive very poor scores from these standard metrics even when they provide useful information, while poorer quality forecasts may score better. In response to these limitations, WSAP scientists develop improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including new spatial approaches that can provide more meaningful information (for forecast developers as well as forecast users) about forecast performance.
In the last year, several traditional hypothesis testing procedures (including IID and block bootstrap methods) for testing which of two competing forecasts, say F_{1} and F_{2}, is better in the sense of average loss differential were evaluated for accuracy in size, and those found to be accurate were also tested for power. The impacts of both temporal dependence and contemporaneous correlation (i.e., when F_{1} and F_{2} are correlated with each other) were analyzed. It was found that the most commonly used tests in this setting (the z-test, and the z-test with a variance inflation factor, VIF, applied to account for temporal dependence) were both strongly affected by contemporaneous correlation, and were generally not as accurate as the Hering and Genton (HG, 2011) test.
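The general idea behind these tests can be illustrated with a minimal sketch. The code below, which is an assumption-laden illustration and not the exact procedure used in the study, forms the loss differential series between two forecasts, computes a z statistic on its mean, and optionally inflates the variance with a simple AR(1)-style factor based on the lag-1 autocorrelation (one common form of VIF); note that it makes no adjustment for contemporaneous correlation, which is precisely the sensitivity examined in the study.

```python
import numpy as np

def z_test_loss_differential(loss1, loss2, use_vif=True):
    """Test H0: E[loss1 - loss2] = 0 with a z statistic on the mean
    loss differential.  If use_vif is True, inflate the variance with
    an AR(1)-style factor from the lag-1 autocorrelation to account
    for temporal dependence.  Illustrative sketch only."""
    d = np.asarray(loss1) - np.asarray(loss2)   # loss differential series
    n = d.size
    var_d = d.var(ddof=1)
    if use_vif and n > 1:
        r1 = np.corrcoef(d[:-1], d[1:])[0, 1]   # lag-1 autocorrelation
        r1 = min(max(r1, 0.0), 0.99)            # clip for stability
        var_d *= (1 + r1) / (1 - r1)            # variance inflation factor
    return d.mean() / np.sqrt(var_d / n)

# Synthetic example: two forecasts of the same observations that share
# an error component, giving contemporaneous correlation between them.
rng = np.random.default_rng(0)
obs = rng.normal(size=500)
shared = rng.normal(size=500)
f1 = obs + 0.5 * shared + 0.3 * rng.normal(size=500)
f2 = obs + 0.5 * shared + 0.4 * rng.normal(size=500)
z = z_test_loss_differential((f1 - obs) ** 2, (f2 - obs) ** 2)
```

At the chosen significance level, |z| would be compared against the corresponding standard normal quantile; the study's point is that the size of such a test degrades when the two forecasts are correlated with each other.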
Another commonly used routine is the bootstrap, a resampling method that does not assume a particular distribution for the statistic of study, but nevertheless still makes some assumptions about the statistic. Two common bootstrap methods are the independent and identically distributed (IID) bootstrap and the block bootstrap. The IID bootstrap assumes that the underlying data are temporally independent, while the block bootstrap assumes that the length of temporal dependence is considerably less than the length of the data. The block bootstrap fared generally well in the tests, as it was robust to contemporaneous correlation where other tests were strongly affected by it. Generally, however, the approach was oversized; that is, it rejected the null hypothesis too often when the null hypothesis was true. It also requires much larger sample sizes than a parametric test.
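The difference between the two bootstrap variants is easiest to see in code. The sketch below (an illustration under simplifying assumptions, not the study's implementation) builds a confidence interval for the mean of a loss differential series two ways: the IID bootstrap resamples individual values, while a circular block bootstrap resamples contiguous blocks so that short-range temporal dependence is preserved; the block length of 20 is an arbitrary choice for illustration.

```python
import numpy as np

def iid_bootstrap_ci(d, n_boot=2000, alpha=0.10, rng=None):
    """IID bootstrap CI for the mean: resample individual values,
    which implicitly assumes temporal independence."""
    rng = rng or np.random.default_rng(0)
    means = [rng.choice(d, size=d.size, replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def block_bootstrap_ci(d, block_len=20, n_boot=2000, alpha=0.10, rng=None):
    """Circular block bootstrap CI for the mean: resample contiguous
    blocks, preserving dependence within each block."""
    rng = rng or np.random.default_rng(0)
    n = d.size
    n_blocks = int(np.ceil(n / block_len))
    means = []
    for _ in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)) % n  # wrap around
        means.append(d[idx.ravel()[:n]].mean())
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# AR(1) series with zero mean: positive temporal dependence inflates
# the variance of the sample mean, so the block bootstrap interval
# should be wider than the (too narrow) IID interval.
rng = np.random.default_rng(1)
d = np.empty(500)
d[0] = rng.normal()
for t in range(1, 500):
    d[t] = 0.6 * d[t - 1] + rng.normal()
lo_iid, hi_iid = iid_bootstrap_ci(d)
lo_blk, hi_blk = block_bootstrap_ci(d)
```

Under positive temporal dependence the IID interval is too narrow, which is one route to an oversized test: intervals that are too short reject a true null hypothesis more often than the nominal rate.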
Results from the study were published in Gilleland et al. (2018). Figure 1, which is the same as Figure 3 in Gilleland et al. (2018), displays the results of the testing for the case of moderately strong temporal dependence and three levels of contemporaneous correlation. An accurate test should have an empirical type I error rate at around the significance level of the test, which for this figure is 10%.
Work also continues on the Mesoscale Verification Intercomparison in Complex Terrain (MesoVICT) project. The project is the second phase of the spatial forecast verification intercomparison project (ICP), and is intended to further inform potential users about the newer spatial methods and, in particular, how they can be used in more realistic meteorological situations, including complex terrain, ensembles of forecasts, ensembles of observations, and additional variables besides precipitation, such as wind vectors. In order to facilitate a comprehensive comparison that enables all methods to be minimally tested with the same set of cases, a tiered approach to the cases has been established (cf. Figure 2). The project is ramping up as results are beginning to be published. A new special collection of papers for Monthly Weather Review has been started, and a project overview paper has been published in the Bulletin of the American Meteorological Society (Dorninger et al. 2018). Several webinars among the international participants were held during the year. A series of new geometric test cases, designed to challenge spatial displacement-focused methods, has also been developed over the course of the year.
In addition to the MesoVICT project, Abatan et al. (2018) published a paper applying the feature-based spatial method called MODE (developed here at NCAR) to a climate application concerned with analyzing multi-year droughts.
Publish a paper describing the new geometric cases with initial results for several of the displacement-based measures.
Continue working on the MesoVICT project including both its organization and writing papers evaluating one or more spatial verification techniques (e.g., MODE).
Abatan, A. A., W. J. Gutowski, Jr., C. M. Ammann, L. Kaatz, B. G. Brown, L. Buja, R. G. Bullock, T. L. Fowler, E. Gilleland and J. Halley Gotway, 2018. Statistics of Multi-year Droughts from the Method for Object-Based Diagnostic Evaluation (MODE). International Journal of Climatology, 38 (8), 3405 - 3420, DOI: 10.1002/joc.5512.
Dorninger, M., E. Gilleland, B. Casati, M. P. Mittermaier, E. E. Ebert, B. G. Brown, and L. J. Wilson, 2018. Mesoscale Verification Inter-Comparison over Complex Terrain. Bull. Amer. Meteorol. Soc., 99 (9), 1887 - 1906, DOI: 10.1175/BAMS-D-17-0164.1.
Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018. Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Mon. Wea. Rev., 146 (6), 1685 - 1703, DOI: 10.1175/MWR-D-17-0295.1.
Hering, A. S. and M. G. Genton, 2011. Comparing spatial predictions. Technometrics, 53 (4), 414 - 425, DOI: 10.1198/TECH.2011.10136.