Advanced Verification Techniques and Tools

BACKGROUND

Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score are useful for monitoring changes in individual aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts or to support decision-making. Moreover, high quality forecasts, such as high-resolution forecasts, can score very poorly on these standard metrics, while poorer quality forecasts may score higher. In response to these limitations, the RAL Verification Group develops improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including feature-based evaluation of precipitation and convective forecasts, and distribution-based approaches that provide more meaningful information (for forecast developers as well as forecast users) about forecast performance. In addition, the RAL Verification Group develops forecast evaluation tools that are available for use by members of the operational, model development, and research communities. Development and dissemination of new forecast verification approaches requires research and application in several areas, including statistical methods, exploratory data analysis, statistical inference, pattern recognition, and evaluation of user needs.
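For readers less familiar with the standard metrics named above, the following minimal Python sketch shows how the Probability of Detection and Equitable Threat Score are computed from a 2x2 contingency table, and how the Root Mean Squared Error is computed from paired forecast and observed values; the counts and variable names are purely illustrative.

```python
import numpy as np

def contingency_scores(hits, misses, false_alarms, correct_negatives):
    """Categorical scores from a 2x2 contingency table."""
    total = hits + misses + false_alarms + correct_negatives
    pod = hits / (hits + misses)  # Probability of Detection
    # Hits expected by chance, used by the Equitable Threat Score
    hits_random = (hits + misses) * (hits + false_alarms) / total
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    return pod, ets

def rmse(forecast, observed):
    """Root Mean Squared Error for continuous forecasts."""
    forecast, observed = np.asarray(forecast), np.asarray(observed)
    return np.sqrt(np.mean((forecast - observed) ** 2))

# Illustrative counts: 40 hits, 10 misses, 20 false alarms, 130 correct negatives
print(contingency_scores(40, 10, 20, 130))
print(rmse([1.2, 0.8, 2.5], [1.0, 1.1, 2.0]))
```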

FY2017 ACCOMPLISHMENTS

Spatial verification methods and the spatial method intercomparison project

The initial forecast verification methods intercomparison project focused on comparing the capabilities of newly developed spatial forecast verification methods. That project was completed in 2011 and resulted in a special collection of articles in the journal Weather and Forecasting. A second intercomparison project, developed in partnership with international collaborators, has been implemented and is known as the Mesoscale Verification Intercomparison in Complex Terrain (MesoVICT; http://www.ral.ucar.edu/projects/icp/). Detailed MesoVICT planning took place at the European Meteorological Society annual meetings in September 2013, October 2014 (Vienna, Austria), September 2015 (Sofia, Bulgaria), and most recently September 2016 (Bologna, Italy). These meetings were well attended by key researchers and operational forecasters from centers and institutions across Europe, as well as Russia and China. The cases for this project involve more complex terrain and include wind verification. Most of the test cases are already available and are described, along with the goals of the project, in NCAR Technical Note TN-505+STR (Dorninger et al., 2013).

To simplify the use of many of the spatial verification methods for MesoVICT and other efforts, the RAL verification group has developed a spatial verification methods package in the R programming language (SpatialVx; http://www.ral.ucar.edu/projects/icp/SpatialVx/), which continues to be developed. The package currently includes considerable functionality for features-based verification, neighborhood methods, kernel smoothers, and many other statistical and image-based verification approaches. Many improvements to SpatialVx were made based on feedback provided by MesoVICT participants, who have been using the software, at the MesoVICT workshop in Bologna. Initial results for the MesoVICT cases have been produced, in part because of the availability of SpatialVx. One paper, which provides an initial analysis of the MesoVICT Tier I cases, is now in press in the journal Weather and Forecasting (Gilleland, 2017, DOI: 10.1175/WAF-D-16-0134.1).
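As a conceptual illustration of one widely used neighborhood method of the kind implemented in SpatialVx (the package itself is written in R), the Python sketch below computes a Fractions Skill Score for a thresholded field; the field sizes, threshold, and neighborhood window are placeholder values, and this is not the SpatialVx implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score: compare event fractions within neighborhoods.
    forecast, observed: 2-D grids; threshold: event threshold; window: box size in grid points."""
    fbin = (forecast >= threshold).astype(float)
    obin = (observed >= threshold).astype(float)
    # Fraction of event grid points within each neighborhood window
    ffrac = uniform_filter(fbin, size=window, mode="constant")
    ofrac = uniform_filter(obin, size=window, mode="constant")
    mse = np.mean((ffrac - ofrac) ** 2)
    mse_ref = np.mean(ffrac ** 2) + np.mean(ofrac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

# Toy example on random fields; in practice the inputs are model and analysis grids.
rng = np.random.default_rng(0)
f = rng.random((100, 100))
o = rng.random((100, 100))
print(fss(f, o, threshold=0.8, window=11))
```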

NCAR staff continued to support several packages for the R project for statistical computing, including the distillery, extRemes, ismev, smoothie, SpatialVx, and verification packages.

Verification of Total Cloud Fraction

The DTC continued to work with the Air Force 557th Weather Wing on exploring verification approaches for clouds. This evaluation included numerous NWP model forecasts utilized by the Air Force. The DTC's MET (http://www.dtcenter.org/met/users/) software package was used for the verification analyses in this study. Initial investigation of cloud verification metrics was undertaken during FY2016 (see http://www.dtcenter.org/verification/reports/DTC_AF_Cloud_Verification_Report_FY15.pdf and http://www.dtcenter.org/verification/reports/Report_on_New_Cloud_Verific...). For FY2017, the DTC continued to explore the total cloud fraction (TCDC) field and expanded the demonstration of several methods to the global scale, exploring several thresholds and convolution radii to identify the best configuration for MODE, obtaining additional summer data, and adding measures to MET. New methods added to MET as part of this project (day/night masks, land/sea masks, satellite grouping masks, and regions defined by latitude bands) were used during the evaluation. As part of defining an optimal MODE configuration, multiple MODE convolution radius and threshold pairs were investigated using the new “quilt” configuration method introduced in METv5.2. Scores were computed for cloudy (TCDC > 80%) and clear (TCDC < 80%) conditions.
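The cloudy/clear stratification can be illustrated with a short Python sketch that thresholds forecast and analysis grids of TCDC and accumulates the 2x2 contingency counts from which PODY and the other categorical scores follow; in the actual evaluation these counts are produced by MET, and the synthetic fields and grid sizes here are placeholders.

```python
import numpy as np

def cloudy_contingency(forecast_tcdc, analysis_tcdc, threshold=80.0):
    """2x2 contingency counts for the 'cloudy' event (TCDC > threshold).
    The complementary 'clear' event simply swaps the event definition."""
    f_event = forecast_tcdc > threshold
    o_event = analysis_tcdc > threshold
    hits = int(np.sum(f_event & o_event))
    misses = int(np.sum(~f_event & o_event))
    false_alarms = int(np.sum(f_event & ~o_event))
    correct_negatives = int(np.sum(~f_event & ~o_event))
    return hits, misses, false_alarms, correct_negatives

# Synthetic global grids of cloud fraction (percent) for illustration only
rng = np.random.default_rng(1)
fcst = rng.uniform(0, 100, size=(181, 360))
anal = rng.uniform(0, 100, size=(181, 360))
print(cloudy_contingency(fcst, anal))
```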

Figure 1.  Box plots of Probability of Detection (PODY) computed against WWMCA-R for ADVCLD (red), GFSDCF (yellow), GFSRAW (green), UMDCF (light blue), UMRAW (dark blue) and WWMCA operational analysis (purple).  Scores shown are for cloudy (left) and clear (right) conditions.

Box plots shown in Figure 1 indicate that the probability of detecting an event (PODY) for all TCDC forecasts drops off markedly within the first six hours of lead time and generally plateaus out to the 78-hour lead time (beyond the range of these plots). The exception is the continued reduction in skill of the Advect Cloud model (red) for clear conditions (right panel). Additionally, the raw model fields (RAW) tend to have a very low PODY for clear conditions. Also included in Figure 1 is a comparison between WWMCA and WWMCA-R (the Air Force satellite-based global cloud analysis, where WWMCA-R is a reanalysis that includes more data). Interestingly, the score of approximately 0.9 for WWMCA (purple) when compared against WWMCA-R indicates there is approximately a 10-15% change in the cloud field once the reanalysis is performed. Finally, even though there are only four weeks in the sample, the height of the boxes and the lack of outliers (circles) indicate that the distribution of scores is fairly tight. This narrow distribution results from the large quantity of grid points over the global domain.

Figure 2.  Performance diagrams for categorical results stratified by regions in the Northern (left) and Southern (right) Hemispheres.  High latitudes are shown as circles, mid-latitudes as triangles, and tropics as squares.  Diagrams are for clear conditions.  The color scheme is the same as in Figure 1.

Performance diagrams combine a set of related categorical scores to give a complete picture of model performance in a single diagram (Roebber, 2009). An example of a performance diagram is shown in Figure 2. In this diagram, points that fall in the upper right corner indicate a perfect forecast, whereas those in the lower left corner indicate a forecast with no skill. Performance diagrams plot PODY versus success ratio (1 - False Alarm Ratio). The straight lines emanating from the origin are Frequency Bias (1 is perfect, values greater than 1 indicate over-forecasting, and values less than 1 indicate under-forecasting), and the curved lines are Critical Success Index (CSI) values. Figures 1 and 2 indicate that the scores for clear conditions from the RAW fields tend to be lower than those for their companion Diagnostic Cloud Forecast (DCF) fields. In Figure 2, scores for high-latitude regions (above 50 degrees) are shown as circles, mid-latitude regions (30-50 degrees) as triangles, and tropical regions (0-30 degrees) as squares. In the Northern Hemisphere (left), the scores increase from the poles to the tropics. Interestingly, in the Southern Hemisphere (right), scores are lowest in the mid-latitudes. Because WWMCA is an analysis and ADVCLD only provides forecasts to 9 hours, neither is included in these performance diagrams for the 24-hour forecasts.
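The geometry of the diagram follows directly from these definitions (Frequency Bias = PODY / success ratio; CSI = 1 / (1/success ratio + 1/PODY - 1)). The Python sketch below draws the bias lines and CSI isolines and places a single hypothetical forecast point (PODY = 0.7, success ratio = 0.6) purely for illustration; it is not the plotting code used in this study.

```python
import numpy as np
import matplotlib.pyplot as plt

# Axes of the diagram: success ratio (x) and PODY (y)
sr = np.linspace(0.01, 1, 200)
pod = np.linspace(0.01, 1, 200)
SR, POD = np.meshgrid(sr, pod)

bias = POD / SR                           # straight lines emanating from the origin
csi = 1.0 / (1.0 / SR + 1.0 / POD - 1.0)  # curved CSI isolines

fig, ax = plt.subplots(figsize=(5, 5))
cs = ax.contour(SR, POD, bias, levels=[0.5, 1, 2, 4], colors="grey", linestyles="dashed")
ax.clabel(cs, fmt="%.1f")
cf = ax.contourf(SR, POD, csi, levels=np.arange(0.1, 1.0, 0.1), cmap="Blues", alpha=0.5)
fig.colorbar(cf, label="CSI")
ax.plot(0.6, 0.7, "ro")                   # hypothetical model score
ax.set_xlabel("Success Ratio (1 - FAR)")
ax.set_ylabel("PODY")
plt.show()
```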

Figure 3.  6-hour forecast (upper left) and analysis (upper right) fields for 5 August 2016, valid at 12 UTC.  Resulting MODE objects for the cloudy (lower left) and clear (lower right) configurations.  Thresholds of >=80 and <=20 were applied for cloudy and clear, respectively.  Forecast objects are shaded and analysis objects are outlined.

Several MODE configurations were tested on the North American G212 domain during AOP 2015.  The resulting configurations were the starting point for the MODE testing performed during AOP 2016.  It was found that the regional configurations were not optimal for use over the global domain for several reasons (e.g., too fine a convolution radius resulted in objects that were more structured than necessary).  Also, the centroid distance and area ratio settings needed to be modified to decrease their influence on matching and merging.  Figure 3 shows the forecast (upper left) and analysis (upper right) for one case, along with the resulting objects for the cloudy (lower left) and clear (lower right) configurations.  The configuration using a convolution radius of 30 grid points and thresholds of >=80 and <=20, for cloudy and clear respectively, produced object representations that most closely matched subjective impressions of what defines the cloudy or clear areas.  This configuration was applied to all four seasonal cases and will be summarized in the final report, which will be posted on the DTC website.
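The convolution-plus-threshold step that MODE uses to define objects can be sketched conceptually as follows. This illustration substitutes a simple box filter for MODE's circular convolution filter and uses a synthetic field, so it is an approximation of the idea rather than the MODE algorithm itself.

```python
import numpy as np
from scipy.ndimage import uniform_filter, label

def find_objects(field, conv_radius, threshold, cloudy=True):
    """MODE-style object identification sketch: smooth the raw field,
    threshold it, and label contiguous regions as objects.
    conv_radius is in grid points; threshold is in percent cloud fraction."""
    smoothed = uniform_filter(field, size=2 * conv_radius + 1, mode="nearest")
    mask = smoothed >= threshold if cloudy else smoothed <= threshold
    objects, n_objects = label(mask)
    return objects, n_objects

# Synthetic TCDC field for illustration; the radius/threshold mirror the
# cloudy configuration described above (30 grid points, >=80).
rng = np.random.default_rng(2)
tcdc = rng.uniform(0, 100, size=(200, 300))
labels, n = find_objects(tcdc, conv_radius=30, threshold=80, cloudy=True)
print(n, "cloudy objects identified")
```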

The Model Evaluation Tools (MET)

The Model Evaluation Tools (MET; http://www.dtcenter.org/met/users/) is a freely available software package for forecast evaluation that was developed and is supported by RAL/JNT staff. METv6.0 was released to the community in April 2017.  It includes numerous enhancements to the already extensive capability, as well as smaller bug fixes.  The total number of registered users now exceeds 3,400; the user base is predominantly university researchers, both international and US-based. The METViewer database and user interface software were updated to accommodate additional capabilities released in METv6.0, as well as to continue to provide a more streamlined and intuitive interface.  METv6.0 is available for download at http://www.dtcenter.org/met/users/downloads/index.php.  A list of the new capabilities can be found in the METv6.0 release notes.

MET+

RAL staff continued to work with the National Centers for Environmental Prediction (NCEP)/Environmental Modeling Center (EMC) to unify the verification systems of the two organizations using MET and METViewer.  This work has focused on addressing the requirements document released in September 2016.  With funding from NGGPS, the RAL verification team is using this information to develop a unified verification system called MET+.  Briefly, MET+ is a set of Python wrappers that simplify setting up and running MET, allow researchers to leverage their own unique algorithms, and systematically plot the fields and results.
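A minimal sketch of what such a wrapper might look like is given below. The function, paths, and executable name are hypothetical and do not represent the actual MET+ interface; the command-line form (forecast file, observation file, configuration file, optional -outdir) follows the general usage of MET's Grid-Stat tool but should be checked against the MET User's Guide.

```python
import subprocess
from pathlib import Path

def run_grid_stat(fcst_file, obs_file, config_file, out_dir, grid_stat_exe="grid_stat"):
    """Hypothetical wrapper in the spirit of MET+: build the Grid-Stat command
    for one forecast/observation pair and run it, writing output to out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = [grid_stat_exe, str(fcst_file), str(obs_file), str(config_file),
           "-outdir", str(out_dir)]
    subprocess.run(cmd, check=True)

# A driver script would simply call the wrapper once per lead time or file pair.
```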

FY2018 PLANS

There will be at least one major release of MET in FY2018.  Future releases of MET will include enhancements necessary for the unification efforts described above, as well as capabilities needed for testing and evaluation activities within the DTC.  Also, the database within METViewer will undergo refinements to optimize it for “big data” and larger numbers of users.  Additional spatial verification methods will be explored through the MesoVICT project and added to SpatialVx and, if applicable, MET.