Statistical Methods in Forecasting

Background

Spatial forecast verification methods were introduced rapidly to address inconsistencies between a forecaster’s subjective assessment of a forecast and traditional verification’s assessment of the same forecast, which tended to favor coarser-resolution models over their newer high-resolution counterparts.  Subsequently, much effort has gone into evaluating the utility of these new methods: determining their reliability in terms of repeatability (if a researcher applies the method several times to the same cases, are the conclusions always the same?) and reproducibility (if different users apply the same verification method, do they reach the same conclusions as each other?), and determining whether the methods yield sensible information about forecast performance (i.e., do they measure physically meaningful errors, or do they give erroneous information about performance?).  Such work continues with the development of a new set of test cases and their evaluation by several distance-based spatial verification measures; the findings have been submitted in Gilleland et al. (2019).

Figure 1: Consider the binary set A, represented by the light blue circle in the center of each panel, which could represent a forecast area where a variable of interest exceeds a certain threshold.  Now consider two other event sets, B and C, where B is a large ring centered at the same point as A and C is a circle identical in size to A but translated slightly to the right.  Which of B and C is the better forecast of A?  Different users may have different opinions, but they should choose a verification measure that agrees with their opinion.  For example, the centroid distance favors B (in fact, it gives a perfect score of zero), whereas for C it equals the translation distance, which in this case is a worse score than for B.
Figure 2: Circle cases overlaid on top of each other.  Each case is a single field, and different comparisons are proposed in Gilleland et al. (2019).  For example, comparing 1 with 2, 2 with 3, and 2 with 4 tests each method’s reliability in terms of repeatability; that is, each pair is identical but placed in different parts of the domain and/or oriented north-south instead of east-west.

An example of one of the proposed cases is shown in Figure 1.  The centroid distance is a mathematical metric (meaning it satisfies three generally desirable properties of a distance measure) that measures the distance between the centroids of two fields (or of individual features within a field).  The observation, A, is an area where a variable exceeds a certain threshold, and B and C represent two different “forecasts” of this area.  The centroid distance favors B, giving it a perfect score, because its centroid is exactly the same as that of A.  Therefore, if the centroid is the most important feature to a user, then centroid distance is a valuable measure.  On the other hand, if it is more important to get the overall area correct, even if it is displaced slightly in space, then centroid distance may not be ideal, at least in the sense of not being reliable as defined above.
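As a rough, non-authoritative illustration of how a centroid distance can be computed for binary fields, the following Python sketch builds a circle and a slightly translated copy (loosely resembling sets A and C in Figure 1) and reports the distance between their centroids.  The grid size, radii, and offset are hypothetical choices for illustration, not values from the paper.

    import numpy as np

    def centroid(binary_field):
        """Centroid (row, col) of the non-zero pixels of a binary field."""
        rows, cols = np.nonzero(binary_field)
        if rows.size == 0:
            return None  # undefined for an empty field
        return np.array([rows.mean(), cols.mean()])

    def centroid_distance(obs, fcst):
        """Euclidean distance between the centroids of two binary fields."""
        c_obs, c_fcst = centroid(obs), centroid(fcst)
        if c_obs is None or c_fcst is None:
            return np.nan  # undefined if either field is empty
        return float(np.linalg.norm(c_obs - c_fcst))

    # Hypothetical example resembling Figure 1: A is a circle of radius 20,
    # C is the same circle translated 20 grid points to the right.
    ny, nx = 200, 200
    y, x = np.mgrid[0:ny, 0:nx]
    A = (x - 100) ** 2 + (y - 100) ** 2 <= 20 ** 2
    C = (x - 120) ** 2 + (y - 100) ** 2 <= 20 ** 2

    print(centroid_distance(A, C))  # approximately 20, the imposed translation

For the translated circle the score equals the imposed shift, whereas a ring centered on A (like set B in Figure 1) would score zero, consistent with the behavior described above.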

FY2019 Accomplishments

Over fifty new (binary) geometric cases were proposed for testing spatial verification methods, particularly those aimed primarily at location errors, in order to assess the reliability of verification measures, determine what properties each measure has, and identify how each might fail.  Several distance-based measures were also applied to these cases, and the results are reported in Gilleland et al. (2019).  In particular, a common situation in weather forecasting is that nothing is forecast (e.g., no rain anywhere in the domain).  If both fields are empty (zero-valued everywhere), then the forecast should be considered perfect.  If one field has just a few non-zero values, then perhaps it is still an excellent forecast.  It turns out that many methods are either undefined for this situation or, when they are defined, are highly sensitive to the addition of one or more non-zero-valued points, leading to spurious results.  The position in space of these non-zero values can also greatly affect several of the measures.
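To make the empty-field issue concrete, here is a minimal sketch, assuming NumPy and SciPy are available, of one simple distance-based summary: the mean distance from each event point in one field to the nearest event point in the other.  The function name, grid size, and point placements are illustrative assumptions and this is not the code used in the paper.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def mean_event_distance(A, B):
        """Mean distance from event points of A to the nearest event point of B.
        Returned as NaN (undefined) if either field contains no events."""
        if not A.any() or not B.any():
            return np.nan
        # distance from every grid point to the nearest event point of B
        dist_to_B = distance_transform_edt(~B)
        return float(dist_to_B[A].mean())

    # Hypothetical 100 x 100 grids: both empty, then single-event fields.
    empty = np.zeros((100, 100), dtype=bool)
    one_point = empty.copy()
    one_point[10, 10] = True   # a single event near one corner
    far_point = empty.copy()
    far_point[90, 90] = True   # a single event in the opposite corner

    print(mean_event_distance(empty, empty))          # NaN: undefined
    print(mean_event_distance(one_point, far_point))  # large value from position alone

The first call returns NaN because the summary is simply undefined when a field has no events, and the second shows how the position of a single event point can dominate the score, mirroring the sensitivities noted above.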

Many of the cases involve simple circles, some of which had previously been used in Gilleland (2017).  These cases each test how methods inform about errors in specific challenging situations and are summarized in Figure 2.  Other cases include ovals that mainly contain one or more of three types of error (size bias, location error, and orientation error), as well as cases involving random placement of event areas within different envelopes and additional sensitivity cases in which noise is added to other cases.
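For readers who wish to experiment with similarly constructed fields, the sketch below generates binary circular and elliptical event areas on a grid.  The domain size, centers, semi-axes, and rotation angle are arbitrary illustrative choices, and the function is an assumption of this report, not the published test-case generator.

    import numpy as np

    def ellipse_field(shape, center, semi_axes, angle_deg=0.0):
        """Binary field that is True inside an ellipse (a circle if the semi-axes are equal).
        center is (x, y) in grid coordinates; semi_axes is (a, b)."""
        ny, nx = shape
        y, x = np.mgrid[0:ny, 0:nx]
        dx, dy = x - center[0], y - center[1]
        theta = np.deg2rad(angle_deg)
        # rotate coordinates so the ellipse axes align with the grid axes
        u = dx * np.cos(theta) + dy * np.sin(theta)
        v = -dx * np.sin(theta) + dy * np.cos(theta)
        a, b = semi_axes
        return (u / a) ** 2 + (v / b) ** 2 <= 1.0

    # Hypothetical examples: identical circles placed in different parts of the
    # domain (a repeatability-style comparison) and an oval with an orientation error.
    circle_west = ellipse_field((200, 200), center=(60, 100), semi_axes=(20, 20))
    circle_east = ellipse_field((200, 200), center=(140, 100), semi_axes=(20, 20))
    oval_rotated = ellipse_field((200, 200), center=(100, 100), semi_axes=(40, 15), angle_deg=45)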

Additionally, the MesoVICT project held its final workshop in Vienna, Austria, to summarize what had been learned about verification in situations specific to complex terrain.

FY2020 Goals

  • Investigate bootstrap properties under realistic situations for forecast verification measures
  • Submit a paper on bootstrapping for forecast verification

References

Gilleland, E., 2017. A new characterization in the spatial verification framework for false alarms, misses, and overall patterns. Weather Forecast., 32 (1), 187-198, doi: 10.1175/WAF-D-16-0134.1.

Gilleland, E., G. Skok, B. G. Brown, B. Casati, M. Dorninger, M. P. Mittermaier, N. Roberts, and L. J. Wilson, 2019. A novel set of verification test fields with application to distance measures. Submitted to Monthly Weather Review on 3 August 2019.