Advanced Verification Techniques and Tools


Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score provide information that is useful for monitoring changes in single aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts or that is helpful for making decisions. Moreover, high-quality forecasts (such as high-resolution forecasts) can receive very poor scores when evaluated using these standard metrics, while poorer-quality forecasts may score higher. In response to these limitations, the RAL Verification Group develops improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including feature-based evaluation of precipitation and convective forecasts, and distribution-based approaches that can provide more meaningful information (for forecast developers as well as forecast users) about forecast performance. In addition, the RAL Verification Group develops forecast evaluation tools that are available for use by members of the operational, model development, and research communities. Development and dissemination of new forecast verification approaches requires research and application in several areas, including statistical methods, exploratory data analysis, statistical inference, pattern recognition, and evaluation of user needs.


Spatial verification methods and the spatial method intercomparison project

The initial forecast verification methods intercomparison project focused on comparing the capabilities of newly developed spatial forecast verification methods. That project was completed in 2011 and resulted in a special collection of articles in the journal Weather and Forecasting. A second intercomparison project, developed in partnership with international collaborators, has been implemented and is known as the Mesoscale Verification Intercomparison in Complex Terrain (MesoVICT). Detailed MesoVICT planning took place at the European Meteorological Society annual meetings in September 2013, October 2014 (Vienna, Austria), September 2015 (Sofia, Bulgaria), and most recently September 2016 (Bologna, Italy). The meetings were well attended by key researchers and operational forecasters from centers and institutions across Europe, as well as Russia and China. The cases for this project include more complex terrain and wind verification. Most of the test cases are already available and are described, along with the goals of the project, in NCAR Technical Note TN-505+STR (Dorninger et al., 2013).

To simplify the use of many of the spatial verification methods for MesoVICT and other efforts, the RAL verification group has developed a spatial verification methods package in the R programming language (SpatialVx), which continues to be developed. The package currently includes considerable functionality for feature-based verification, neighborhood methods, kernel smoothers, and many other statistical and image-based verification approaches. Many improvements to SpatialVx were made at the MesoVICT workshop in Bologna based on feedback from MesoVICT participants, who have been using the software. Initial results for the MesoVICT cases have been obtained in part because of the availability of SpatialVx. NCAR staff continued to support several packages for the R project for statistical computing, including the distillery, extRemes, ismev, smoothie, SpatialVx, and verification packages.
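SpatialVx itself is an R package; as a language-neutral illustration of one neighborhood method it covers, the sketch below computes the fractions skill score (FSS) of Roberts and Lean (2008) in Python. The function name and thresholding choices are illustrative, not SpatialVx's API.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fractions_skill_score(fcst, obs, threshold, n):
    """Fractions skill score for an n x n neighborhood (illustrative sketch).

    fcst, obs : 2-D arrays of the same shape (e.g., precipitation fields)
    threshold : event threshold applied to both fields
    n         : neighborhood width in grid points (odd integer)
    """
    # Convert each field to a binary event field, then to neighborhood
    # event fractions via a moving-window average.
    pf = uniform_filter((fcst >= threshold).astype(float), size=n, mode="constant")
    po = uniform_filter((obs >= threshold).astype(float), size=n, mode="constant")
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    # FSS = 1 for a perfect match, 0 for no overlap of neighborhood fractions.
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```

A perfect forecast scores 1 at any neighborhood size, while spatially displaced but otherwise good forecasts score higher as the neighborhood widens, which is the diagnostic value of the approach.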

Several papers related to fundamental verification research were accepted for publication or published during FY18. Abatan et al. (2018) discuss the use of the Method for Object-based Diagnostic Evaluation (MODE) for climate prediction of multidecadal droughts; Dorninger et al. (2018) discuss outcomes from MesoVICT. Coelho et al. (2018) is a chapter on forecast verification in a book describing the key factors in sub-seasonal to seasonal prediction. Ebert et al. (2018) highlight work towards developing new forecast verification metrics through the World Meteorological Organization (WMO).

2018 DTC Community Unified Forecast System Test Plan and Metrics Workshop

The 2018 Developmental Testbed Center (DTC) Community Unified Forecast System (UFS) Test Plan and Metrics Workshop was held at National Oceanic and Atmospheric Administration (NOAA)’s National Center for Weather and Climate Prediction on July 30 - August 1, 2018. The major goal of this workshop was to work towards a community test plan with common validation and verification metrics for the emerging UFS. The plan will serve as a guide for the weather and earth system prediction community for testing and evaluating new developments for the UFS models and components. Through standardized hierarchical testing, comparison of both historical and real-time forecasts with observations and analyses will be conducted.

The workshop had a mix of presentations, discussion periods, and working sessions in which participants contributed to three topic-based breakout sessions: test plans, metrics, and hierarchical testing. The final activity of the workshop was a summary of the working sessions' discussions by their leads, presented to workshop participants and members of the UFS Strategic Implementation Plan (SIP) meeting. The report has been published on the DTC website, which also includes the presentations linked from the agenda page.

Figure 1. Photos of many of the workshop attendees.

The Model Evaluation Tools (MET) enhanced verification package (METplus)

The Model Evaluation Tools (MET) is a freely available software package for forecast evaluation that was developed and is supported by RAL/JNT staff via the DTC. During FY18, several MET tools were wrapped with Python, and use cases (examples) were developed to help users set up a systematic evaluation capability more easily. The wrappers and examples constitute an extension of MET to what is now called METplus. MET and METviewer are now considered core components of METplus, and all three are supported for the community via the DTC.

Two tutorials were given during FY18 to train the community on the use of METplus, including one October 23-26, 2017 at the National Center for Weather and Climate Prediction (NCWCP) and the other January 31-February 2, 2018 at NCAR.


RAL staff continued to work with the NOAA Environmental Modeling Center (EMC) to unify the verification systems of the two organizations using MET and METviewer. The goal is to provide this capability to the community as well, to help with research-to-operations transitions. This work has focused on addressing the requirements document released in September 2016. Much of the work during FY18 focused on standardizing tool wrappers and configuration files to make the tools easier to use. Use cases demonstrating how to set up METplus to evaluate gridded forecasts using gridded analyses (grid-to-grid) and gridded forecasts using point observations (grid-to-obs) were established to replicate the verification capability at EMC. The Quantitative Precipitation Forecast (QPF), Tropical Cyclone Track and Intensity, and Feature Relative use cases established in FY17 were augmented as coding standards and repository conformity were established. The current METplus Python scripts, along with the three releases made available during FY18, can be found online: the METplus beta was released in October 2017, version 1 in July 2018, and version 2 in September 2018.


MET was first released in January 2008. During a decade of community support, there have been 12 community releases, three of them in FY18: METv6.1 in December 2017, METv7.0 in March 2018, and METv8.0 in September 2018. Many bug fixes, enhancements, and optimizations were made to MET during the past year. The most notable include: 1) support for many more grid projections and for the use of shapefiles to generate masks that focus evaluation on regions of interest (Figure 2 shows an example of the definition of several shapefiles); 2) the ability to perform simple conversions (e.g., Kelvin to Celsius or Fahrenheit) as well as to compute many thermodynamic fields from multivariate observations; 3) support for flexible definition of a climatology, or reference field, to compute global continuous statistics and ensemble-related scores; 4) computation of WMO statistics for gridded continuous scores; 5) addition of several new interpolation methods; 6) improvements to the configurability of MET's object-based methods; and 7) the ability to call a Python script from MET tools. The last enhancement will allow the community to extend MET, and hence METplus, in a multitude of ways, including reading data formats not currently supported, deriving fields within the tools, and testing promising new verification techniques. METv8.0 is available for download; a list of the new capabilities can be found in the METv6.1, METv7.0, and METv8.0 release notes.
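As an illustration of capability 7, a MET "Python embedding" data script conventionally exposes a 2-D array named `met_data` and an `attrs` metadata dictionary, which MET reads in place of a gridded data file. The sketch below follows that convention, but the exact attribute keys and grid-specification fields should be checked against the MET User's Guide for the release in use; the data values and file name here are placeholders.

```python
# read_example_grid.py -- illustrative MET "Python embedding" data script.
# MET (v8.0+) can execute a user script instead of reading a gridded file;
# the script is expected to define `met_data` and `attrs` as below.
import numpy as np

# Hypothetical stand-in for a real data source: a 10 x 10 field of zeros.
met_data = np.zeros((10, 10), dtype=np.float64)

attrs = {
    "valid": "20180701_120000",    # valid time (YYYYMMDD_HHMMSS)
    "init": "20180701_000000",     # initialization time
    "lead": "120000",              # lead time (HHMMSS)
    "accum": "010000",             # accumulation interval
    "name": "APCP",                # variable name
    "long_name": "precipitation",  # descriptive name
    "level": "Surface",            # vertical level
    "units": "mm",                 # units of met_data
    "grid": {                      # grid definition (lat/lon example)
        "type": "LatLon",
        "name": "example",
        "lat_ll": 30.0, "lon_ll": -110.0,
        "delta_lat": 0.5, "delta_lon": 0.5,
        "Nlat": 10, "Nlon": 10,
    },
}
```

A MET tool configured for Python input would then run this script and treat `met_data` exactly as it would a field read from GRIB or NetCDF.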


METviewer is the companion database and display system for MET output and was first used by RAL in 2009 for work with the Hazardous Weather Testbed. During nearly a decade of development, it has become the essential data analysis tool for users of MET. Plotting capabilities include: 1) time series, boxplots, and histograms of summarized (mean, median, or accumulated) or aggregated statistics; 2) plots of ensemble-definition statistics such as rank histograms; and 3) synthesis diagrams such as Taylor and performance diagrams, scorecards, and 2-D contour plots of statistics. METviewer also includes the ability to apply bootstrapped or normal confidence intervals and to assess statistical significance. During FY18, six releases of METviewer (v2.2 through v2.8) were made public to the community, driven by necessary enhancements, bug fixes, and support for MET releases. Figure 2 shows example contour plots generated from current projects: the frequency bias (FBIAS) of updraft helicity predictions at multiple thresholds from the High Resolution Rapid Refresh (HRRR; operational) and the NOAA Geophysical Fluid Dynamics Laboratory (GFDL) finite-volume cubed-sphere (FV3; research) convection-allowing models, evaluated during the 2018 Hazardous Weather Testbed. Both forecasts appear to under-forecast updraft helicity, but GFDL FV3 generally predicts higher values. FBIAS ranges from 0 to infinity with an optimal score of 1. In this figure, white indicates FBIAS greater than 0.1.

Figure 2. Example contour plots of frequency bias for two CAM models (HRRR – left and GFDL FV3 – right) using METviewer.
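As a concrete reference for the score plotted in Figure 2, the following minimal Python sketch computes frequency bias for a dichotomous (thresholded) forecast; the function name and event definition are illustrative, not MET's implementation.

```python
import numpy as np

def frequency_bias(fcst, obs, threshold):
    """Frequency bias: (hits + false alarms) / (hits + misses).

    Ranges from 0 to infinity; 1 is unbiased. Values below 1 indicate
    under-forecasting of events at the given threshold, values above 1
    indicate over-forecasting.
    """
    f = fcst >= threshold   # forecast event field
    o = obs >= threshold    # observed event field
    hits = np.sum(f & o)
    false_alarms = np.sum(f & ~o)
    misses = np.sum(~f & o)
    denom = hits + misses
    return (hits + false_alarms) / denom if denom > 0 else np.nan
```

Because it compares only event counts, not locations, frequency bias is a measure of calibration rather than accuracy, which is why it is typically read alongside skill scores in a scorecard or contour plot.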

Air Force Verification and Validation

During FY18, the JNT continued a number of verification and validation exercises in partnership with the United States Air Force (AF). The AF is currently undertaking major upgrades to various components of its operational Global Air-Land Weather Exploitation Model (GALWEM) forecast system, including the land information system, data assimilation system, global deterministic and ensemble systems, high-resolution regional modeling system, and post-processing software. To assist the AF with validating these implementations and verifying that new implementations produce quality forecast products that are as skillful as or more skillful than current operational products, the DTC has been tasked with helping design and carry out test plans that clearly articulate the targets needed to demonstrate the improvement required for implementation. The design and execution of the test plans heavily leverage the JNT's expertise in cutting-edge and advanced verification methods as well as the Model Evaluation Tools (MET). Work on the various implementations is currently underway and is expected to continue into FY19.

As noted above, the AF recently performed substantial upgrades to its post-processing software, including algorithm changes and a consolidation from products based on output from a variety of models [the NCEP Global Forecast System (GFS), AF configurations of the Weather Research and Forecasting (WRF) model, and GALWEM] to products based solely on GALWEM output. To assess the readiness of the GALWEM Post-Processor (GPP) implementation, the JNT compared post-processed products from GPP against the AF's operational capabilities. The AF provided the JNT with post-processed files from the various systems for the validation exercise, focusing on 00 and 12 UTC initializations from 1-30 June 2017 with lead times at six-hour intervals out to 96 hours.

GPP was used as the baseline since the AF was targeting it for operational implementation. The validation included an extensive number of variables and levels to ensure the new system was producing acceptable results. MET was used extensively in the validation for calculating bulk statistics as well as spatial differences between the different outputs. Differences between the new and old capabilities were anticipated for the following reasons, none related to the validity of the implementation: 1) the underlying models or model versions in each comparison differ, 2) the algorithms used for certain variables differ, and 3) the grid resolutions and/or grid specifications differ.

To highlight a key finding from the validation, results are presented for 2-m density altitude (m), a field critical in aviation for assessing an aircraft's aerodynamic performance under given meteorological conditions. In the original set of delivered data, differences between GPP and WRF in the Northern Hemisphere, Southern Hemisphere, and Tropical domains were often in excess of +/-50 m, the designated threshold for flagging notable differences. Further investigation revealed a discrepancy in the field level being output in the GPP and WRF data. Based on the JNT findings, the GPP system was modified so that density altitude is output at the correct 2-m level. A new data set was delivered, and the modification mitigated the differences between the density altitude products output by GPP and the WRF-based system. Examples of the plot types used in the validation are shown in Figures 3 and 4.

Figure 3. Time series plots of 2-m density between WRF and GPP for all 00 UTC initializations (dark green) and 12 UTC initializations (light green) for the a) Northern Hemisphere, b) Southern Hemisphere, and c) Tropics.


Figure 4. Spatial differences of GCWRF - GPP for 2-m density altitude over all 00 UTC initializations for the 96-h forecast lead time for a) the Northern Hemisphere, b) the Southern Hemisphere, and c) the Tropics. Blues indicate GPP values are greater than GCWRF values. White shaded areas have values between +/-50 m.
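For readers unfamiliar with the validated field, density altitude is the altitude in the International Standard Atmosphere (ISA) at which the standard density equals the observed air density. A minimal dry-air Python sketch is below; the AF's GPP algorithm may differ (e.g., by including humidity corrections), so this is illustrative only.

```python
# Dry-air density altitude from the ISA troposphere relations.
T0 = 288.15    # sea-level standard temperature (K)
P0 = 101325.0  # sea-level standard pressure (Pa)
L = 0.0065     # tropospheric lapse rate (K/m)
R = 287.058    # specific gas constant for dry air (J/(kg K))
G = 9.80665    # gravitational acceleration (m/s^2)

def density_altitude(pressure_pa, temperature_k):
    """Density altitude (m): invert the ISA density profile
    rho(h) = rho0 * (1 - L*h/T0)**(G/(L*R) - 1) for h."""
    rho = pressure_pa / (R * temperature_k)  # observed dry-air density
    rho0 = P0 / (R * T0)                     # standard sea-level density
    exponent = (L * R) / (G - L * R)         # approximately 0.234969
    return (T0 / L) * (1.0 - (rho / rho0) ** exponent)
```

Under standard sea-level conditions the result is 0 m; air warmer (less dense) than standard yields a positive density altitude, degrading aircraft performance, which is why +/-50 m differences between systems were treated as significant.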

Verification of Weather Hazards Prediction

RAL staff continued to work with NOAA on evaluating the prediction of severe weather hazards, including heavy rain and snow, strong winds, hail, and tornadoes. The projects were collaborations among RAL, the NOAA Earth System Research Laboratory (ESRL), the NOAA National Severe Storms Laboratory (NSSL), and NOAA centers including EMC, the Storm Prediction Center (SPC), and the Weather Prediction Center (WPC). Two NOAA testbeds were targeted for this work: the Hazardous Weather Testbed (HWT) and the Hydrometeorology Testbed (HMT).

The overall goals of the scorecard project with the Hazardous Weather Testbed are to (a) identify accepted measures that should be integrated into METplus; (b) explore the applicability to convection-allowing models (CAMs) of a new feature-relative methodology being transitioned to the HMT this year; (c) explore the use of forecast-consistency measures that are being added to METplus; (d) develop a flexible scorecard that allows the community to define a useful one for CAM systems; (e) work with the HWT to evaluate CAMs retrospectively to demonstrate the usefulness of the metrics; and (f) work with the HWT to assess different ensemble configurations of the Community Leveraged Unified Ensemble (CLUE). Emphasis was placed on evaluating deterministic and probabilistic products derived from storm-attribute fields and assessing their skill at predicting severe events (e.g., tornadoes, hail, wind). Particular focus in FY18 was placed on the skill of "proxy" products (e.g., updraft helicity) and their usefulness in identifying severe weather hazards. In Figure 5, the left two panels show examples of a scorecard comparison focused on the skill of three deterministic models at predicting the potential for updraft helicity within a pre-defined circular neighborhood of 40 km (~13 grid squares) to exceed thresholds between 50 and 125 m2/s2, evaluated over the daily domain of interest and the continental US (CONUS) domain. The right panel compares two CAM ensembles: the High Resolution Ensemble Forecast v2, developed at the Storm Prediction Center, and the High Resolution Rapid Refresh Ensemble, developed at the NOAA ESRL Global Systems Division (GSD). In all three panels, green or upward-pointing arrows indicate that model 1 is out-performing model 2; red or downward-pointing arrows indicate the opposite. The degree of statistical significance is indicated by the size of the arrow. Results from the testing of CLUE ensemble configurations may be found in the Testing and Evaluation section.

Figure 5.  Example scorecards developed for the HWT and focused on the severe weather indictor, updraft helicity.  See text for description.
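The circular-neighborhood exceedance used in the scorecards above can be sketched as a neighborhood-maximum operation: a grid point counts as an event if updraft helicity meets the threshold anywhere within the radius. The following Python sketch is illustrative only; the function name, boundary handling, and footprint definition are assumptions, and MET's documentation should be consulted for the operational definition.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def neighborhood_exceedance(field, threshold, radius_gridpoints):
    """Binary field: 1 where `field` meets/exceeds `threshold` anywhere
    within a circular neighborhood of the given radius (grid points)."""
    r = radius_gridpoints
    # Build a circular footprint of offsets within the radius.
    y, x = np.ogrid[-r:r + 1, -r:r + 1]
    footprint = x**2 + y**2 <= r**2
    # Take the maximum over each point's neighborhood, then threshold.
    neighborhood_max = maximum_filter(field, footprint=footprint, mode="nearest")
    return (neighborhood_max >= threshold).astype(int)
```

Applying such an operator to an updraft helicity field before scoring rewards near misses in space, which is why neighborhood products are preferred over point matches when evaluating CAM storm-attribute forecasts.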

The overall goal of the project with the HMT is to improve forecasts of the extreme quantitative precipitation that leads to flash flooding by integrating verification research with social science research conducted with National Weather Service (NWS) forecasters. Specifically, the aim is to develop verification metrics and tools that improve forecasters' interpretation and use of deterministic and ensemble-based model guidance, with a focus on convection-allowing guidance. The project leverages work performed by the USWRP Research to Operations (R2O) project on Ensemble Hazard Prediction, which is a collaboration between RAL and MMM led by NOAA's ESRL.

For the Ensemble Hazard project, scientists in MMM conducted 31 semi-structured interviews with NWS forecasters from 12 Weather Forecast Offices (WFOs) about their current use (and non-use) of model guidance, both synoptic- and mesoscale and both deterministic and ensemble-based, along with their subjective approaches to verifying the guidance, their use of objective verification information, and their needs for additional objective verification. During FY18, a manuscript reporting this work was drafted by MMM and RAL scientists and is under final editing.

The work performed in the Ensemble Hazard project informed the decision by WPC, RAL, and MMM to focus the ongoing project's efforts on developing two verification metrics. The first metric will evaluate run-to-run model consistency and trends for QPF and related fields at different output frequencies (e.g., hourly output, 3-hourly output) using the Method for Object-based Diagnostic Evaluation extended to the Time Domain (MODE-TD); for example, an "event onset" metric of this kind can be evaluated using the MET MODE-Time Domain tool. The second metric will compare valid model output against currently available observations as a way to evaluate recent or current guidance. All METplus tools will be explored for developing QPF-related objective verification for this purpose, with initial focus on the MODE tool. Figure 6 shows an example result from a forecaster interview.


Figure 6.  Sample of forecaster interview regarding forecast products and evaluation metrics that would build confidence in National Weather Service forecasters.


There will be at least two major releases of METplus and its components in FY2019. Future releases of MET will include enhancements necessary for the unification efforts described above, as well as capabilities needed for testing and evaluation activities within the DTC. Spatial verification methods explored through the MesoVICT project will be added to MET through a DTC visitor project. Support for calling Python algorithms from MET will be expanded to support xarray and pandas more fully. Also, the METviewer computation layer will undergo refactoring to optimize for "big data" and larger numbers of users. The number of METplus use cases will expand significantly to include at least examples of how to set up MET for aerosols, air quality, hail, marine applications, sea ice, space weather, subseasonal-to-seasonal prediction, tropical cyclone genesis, and the tropical environment. Air Force and hazards-assessment verification and validation work will continue with both the HWT and HMT, including continued collaboration with NCAR/MMM on social science aspects of the problem.