Advanced Verification Techniques and Tools


Forecast verification and evaluation activities typically are based on relatively simple metrics that measure the meteorological performance of forecasts and forecasting systems. Metrics such as the Probability of Detection, Root Mean Squared Error, and Equitable Threat Score provide information that is useful for monitoring changes in individual aspects of forecast performance over time. However, they generally do not provide information that can be used to improve forecasts, or that can be helpful for making decisions. Moreover, it is possible for high-quality forecasts, such as high-resolution forecasts, to have very poor scores when evaluated using these standard metrics, while poorer quality forecasts may score higher. In response to these limitations, the RAL Verification Group develops improved verification approaches and tools that provide more meaningful and relevant information about forecast performance. The focus of this effort is on diagnostic, statistically valid approaches, including feature-based evaluation of precipitation and convective forecasts, and distribution-based approaches that can provide more meaningful information (for forecast developers as well as forecast users) about forecast performance. In addition, the RAL Verification Group develops forecast evaluation tools that are available for use by members of the operational, model development, and research communities. Development and dissemination of new forecast verification approaches requires research and application in several areas, including statistical methods, exploratory data analysis, statistical inference, pattern recognition, and evaluation of user needs.
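To make the limitation concrete, these standard metrics reduce to simple arithmetic on matched forecast/observation pairs. The following minimal Python sketch is purely illustrative (it is not part of any RAL software, and the counts and values are invented) and shows the Equitable Threat Score computed from a 2x2 contingency table, plus the Root Mean Squared Error for a continuous variable:

```python
import math

def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS: the threat score adjusted for hits expected by random chance."""
    total = hits + misses + false_alarms + correct_negatives
    # Number of hits a random forecast with the same event frequencies would get
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

def rmse(forecasts, observations):
    """Root mean squared error for paired continuous forecasts/observations."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts)
    )

ets = equitable_threat_score(hits=50, misses=10, false_alarms=20, correct_negatives=920)
err = rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

Because such scores match forecasts and observations point by point, a forecast feature displaced by even one grid cell loses hits and gains false alarms simultaneously (the "double penalty"), which is part of the motivation for the spatial methods described in the next section.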


Spatial verification methods and the spatial method intercomparison project

The initial forecast verification methods intercomparison project focused on comparing the capabilities of newly developed spatial forecast verification methods. That project was completed in 2011 and resulted in a special collection of articles in the journal Weather and Forecasting. A second intercomparison project, developed in partnership with international collaborators, has been implemented and is known as the Mesoscale Verification Intercomparison in Complex Terrain (MesoVICT). Detailed MesoVICT planning took place at the European Meteorological Society annual meetings in September 2013, October 2014 (Vienna, Austria), September 2015 (Sofia, Bulgaria), and most recently September 2016 (Bologna, Italy). These meetings were well attended by key researchers and operational forecasters from various centers and institutions in Europe, as well as Russia and China. The cases for this project include more complex terrain and wind verification. Most of the test cases are already available and are described, along with the goals of the project, in NCAR Technical Note TN-505+STR (Dorninger et al., 2013).

To simplify the use of many of the spatial verification methods for MesoVICT and other efforts, the RAL verification group has developed a spatial verification methods package in the R programming language (SpatialVx), which continues to be developed. The package currently includes considerable functionality for features-based verification, neighborhood methods, kernel smoothers, and many other statistical and image-based verification approaches. Many improvements to SpatialVx were made at the MesoVICT workshop in Bologna, based on feedback from participants who have been using the software. Initial results for the MesoVICT cases have been produced in part because of the availability of SpatialVx. NCAR staff continued to support several packages for the R project for statistical computing, including the distillery, extRemes, ismev, smoothie, SpatialVx, and verification packages.
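SpatialVx implements these methods in R; purely for illustration, the following Python sketch (all names are hypothetical, and this is not the SpatialVx API) shows the idea behind one widely used neighborhood method, the Fractions Skill Score, which compares event fractions within neighborhoods rather than matching points one-to-one:

```python
def neighborhood_fractions(field, n):
    """Fraction of event points (field already binarized to 0/1) within
    an n x n neighborhood centered on each grid point, clipped at edges."""
    rows, cols = len(field), len(field[0])
    half = n // 2
    fractions = []
    for i in range(rows):
        row = []
        for j in range(cols):
            count, size = 0, 0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        count += field[ii][jj]
                        size += 1
            row.append(count / size)
        fractions.append(row)
    return fractions

def fss(forecast, observed, n):
    """Fractions Skill Score: 1 = perfect, 0 = no skill at this scale."""
    pf = neighborhood_fractions(forecast, n)
    po = neighborhood_fractions(observed, n)
    num = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    den = sum(f * f for rf in pf for f in rf) + sum(o * o for ro in po for o in ro)
    return 1.0 - num / den if den > 0 else 0.0

# Invented toy case: a single observed event and a forecast displaced one cell.
obs = [[0] * 5 for _ in range(5)]
fcst = [[0] * 5 for _ in range(5)]
obs[2][2] = 1
fcst[2][3] = 1
```

At neighborhood size 1 the score reduces to a point-wise comparison and the displaced forecast scores zero; at size 3 the same forecast earns partial credit, which is exactly the behavior neighborhood methods are designed to capture.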

Several papers related to fundamental verification research were accepted for publication or published during FY18. Abatan et al. (2018) discusses the use of the Method for Object-based Diagnostic Evaluation (MODE) for climate prediction of multidecadal droughts; Dorninger et al. (2018) discusses outcomes from MesoVICT. Coelho et al. (2018) is a chapter on forecast verification in a book describing the key factors in sub-seasonal to seasonal prediction. Ebert et al. (2018) highlights work towards developing new forecast verification metrics through the World Meteorological Organization (WMO).

The Model Evaluation Tools (MET) Enhanced Verification Package (METplus)

The Model Evaluation Tools (MET) is a freely available software package for forecast evaluation that was developed and is supported by RAL/JNT staff via the DTC. During FY19, additional MET tools were wrapped with Python, and use-cases, or examples, were developed to help users set up systematic evaluation capability more easily. The wrappers and examples constitute an extension of MET to what is now called METplus. MET and METviewer are now considered core components of METplus, and all three are now supported for the community via the DTC.

A tutorial was given at NCAR February 4-6, 2019, to train the community on the use of METplus. Additionally, a tutorial was given at the Naval Research Laboratory (NRL) in Monterey, CA, July 30-August 1, 2019. The NRL tutorial was recorded to provide "internal training tools" and will be turned into scripts for online video tutorials in FY20.


RAL staff continued to work with the NOAA/Environmental Modeling Center (EMC) to unify the verification system between the two organizations using MET and METviewer. The goal is to provide this capability to the community as well, to help with research-to-operations transitions. This work has focused on addressing the requirements document released in September 2016. Much of the work during FY18 focused on standardizing tool wrappers and configuration files to make the tools easier to use. Use-cases demonstrating how to set up METplus to evaluate gridded forecasts using gridded analyses (grid-to-grid) and gridded forecasts using point observations (grid-to-obs) were established to replicate the verification capability at EMC. The Quantitative Precipitation Forecast (QPF), Tropical Cyclone Track and Intensity, and Feature Relative use cases established in FY17 were augmented as coding standards and repository conformity were established. The current METplus Python scripts, along with the two releases made available during FY19, can be found in the METplus code repository: METplus version 2.1 in January 2019, and version 2.2 in May 2019.


MET was first released in January 2008.  During a decade of community support, there have been 12 community releases.  In FY19, there was one community release: METv8.1, released to the community in May 2019.  Many bug fixes, enhancements, and optimizations to MET have been included during the past year.  The most notable include:

1)   Compliance with Fortify, a static code analyzer tool which identifies potential security vulnerabilities.

2)   Added support for defining thresholds as percentiles of data (see the user's guide for details and examples).

3)   Added support for the Gaussian interpolation and regridding method, including the sigma option to define the shape.

4)   Enhanced the NetCDF library support.

5)   Added the "-derive" functionality to PCP-Combine tool to compute the sum, min, max, range, mean, standard deviation, or valid data count of data from a series of input files.

6)   Updated support for computing GOES-16/17 pixel locations from metadata of the input files.

7)   Standardized configuration files to compute time summaries for all point-data formats.

8)   Added support for land/sea masking, including "nearest" land or sea interpolation.

METv8.1 is available for download from the MET website; a list of the new capabilities can be found in the METv8.1 release notes.


METviewer is the companion database and display system for MET output and was first used by RAL in 2009 for work with the Hazardous Weather Testbed. During nearly a decade of development, it has become an essential data analysis tool for users of MET.  Plotting capabilities include: 1) time series, boxplots, and histograms of summarized (mean, median, or accumulated) statistics or aggregated statistics; 2) plots of ensemble definition statistics, such as rank histograms; and 3) synthesis diagrams, such as Taylor and performance diagrams, scorecards, and 2-d contour plots of statistics.  METviewer includes the ability to apply bootstrapped or normal confidence intervals and assess statistical significance.  During FY18, three releases of METviewer were made public to the community, METviewer v2.9 through v2.11.  Releases were driven by necessary enhancements, bug fixes, and support for MET releases.
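The bootstrapped confidence intervals mentioned above follow a standard recipe: resample the verification statistics with replacement, recompute the summary statistic on each resample, and take percentiles of the replicates. The following sketch is illustrative only (it is not METviewer code, and the error values are invented):

```python
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_boot=2000, alpha=0.05, seed=42):
    """Percentile-method bootstrap confidence interval for a statistic
    (the mean by default) of a sample of verification values."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    replicates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = replicates[int((alpha / 2) * n_boot)]
    hi = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented sample of, e.g., daily absolute forecast errors
errors = [0.8, 1.2, 0.5, 1.9, 1.1, 0.7, 1.4, 0.9, 1.3, 1.0]
lo, hi = bootstrap_ci(errors)
```

The percentile method makes no distributional assumption, which is why it is a common companion to normal-approximation intervals when sample sizes are small or the statistic's distribution is skewed.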

New features in METviewer include:

1)   Support for MariaDB and Aurora (for Amazon Web Services) databases.

2)   Support for several new statistics line-types, including those to comply with METv8.1 line types.

3)   Event Equalization logic is available for hist, roc, rely, ens_ss plots.

4)   Substantial expansion of scorecard capability to allow users to:

  • Put forecast/observed threshold in columns and forecast lead in rows, which differs from a traditional scorecard definition.
  • Combine several vx_masks and/or fcst_leads together into one column.
  • Display values for DIFF and symbols for statistical significance.
  • Use multiple time periods.
  • Use multiple databases.


Air Force Verification and Validation

During FY18, the JNT continued a number of verification and validation exercises in partnership with the United States Air Force (AF). The AF is currently in the process of undertaking major upgrades to various components of their operational Global Air-Land Weather Exploitation Model (GALWEM) forecast system, including the land information system, data assimilation system, global deterministic and ensemble systems, high-resolution regional modeling system, and post-processing software system. To assist the AF with validating their implementations and verifying that new implementations produce quality forecast products that are as skillful as, or more skillful than, current operational products, the DTC has been tasked with helping design and carry out test plans that clearly articulate the targets needed to show the improvement necessary for implementation. The design and execution of the test plans heavily leverage the JNT’s expertise in cutting-edge and advanced verification methods, as well as the Model Evaluation Tools (MET). Progress on the various implementations is currently underway; it is expected that work will continue into FY19.

Verification of Weather Hazards Prediction

RAL staff continued to work with NOAA on evaluating the prediction of severe weather hazards, including heavy rain and snow, strong winds, hail, and tornadoes.  The projects were collaborations between RAL, the NOAA National Severe Storms Laboratory (NSSL), and NOAA Centers, including EMC, the Storm Prediction Center (SPC), and the Weather Prediction Center (WPC).  Two NOAA testbeds were targeted for this work: the Hazardous Weather Testbed and the Hydrometeorology Testbed.

The overall goals of the scorecard project with the Hazardous Weather Testbed are to (a) identify accepted measures that should be integrated into METplus, (b) explore the applicability to Convection Allowing Models (CAMs) of a new feature-relative methodology being transitioned to HMT this year, (c) explore the use of forecast consistency measures that are being added to METplus, (d) develop a flexible scorecard to allow the community to define a useful one for CAM systems, (e) work with HWT to evaluate CAMs retrospectively to demonstrate the usefulness of the metrics, and (f) work with HWT to assess different ensemble configurations of the Community Leveraged Unified Ensemble (CLUE). Emphasis was placed on evaluating deterministic and probabilistic products derived from storm-attribute fields and assessing their skill at predicting severe events (e.g., tornadoes, hail, wind). Particular focus in FY19 was placed on refining the CAM scorecard and adding support in METplus to compute the Surrogate Severe field.  Figure 1 shows the final visual representation of the scorecard.  Colors were chosen to be distinguishable by color-blind readers.  Green or upward-pointing arrows indicate model 1 is outperforming model 2; purple or downward-pointing arrows indicate the opposite. The degree of statistical significance is indicated by the size of the arrow.

Figure 1.  Example scorecards developed for the HWT and focused on the severe weather indicator, updraft helicity.  See text for description.

NCAR also collaborated with AER, Inc. on another project focused on the HWT experiment.  This project included the evaluation of the AER HailCast product.  Verification was performed on the prototype hail products and summarized in performance diagrams (see Figure 2). The Performance Diagram (Roebber, 2009) provides a better understanding of model performance by combining four metrics on one plot: 1) Probability of Detection (POD), conditioned on observed events; 2) False Alarm Ratio (FAR), conditioned on forecast events; 3) Critical Success Index (CSI), a function of POD and FAR; and 4) Frequency Bias (FBIAS), which gives the ratio of forecast events to observed events.  A forecast is considered to be performing optimally when its symbol is closest to the upper right-hand corner.
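All four metrics are functions of the same contingency-table counts, which is why a single point can summarize them on the diagram. A short illustrative sketch (the counts are invented; this is not AER or RAL code):

```python
def performance_diagram_point(hits, misses, false_alarms):
    """Metrics summarized by one point on a performance diagram.
    The point is plotted at (success ratio, POD), where success
    ratio = 1 - FAR; CSI contours and frequency-bias lines are
    functions of those same two coordinates."""
    pod = hits / (hits + misses)                     # probability of detection
    far = false_alarms / (hits + false_alarms)       # false alarm ratio
    csi = hits / (hits + misses + false_alarms)      # critical success index
    fbias = (hits + false_alarms) / (hits + misses)  # frequency bias
    return pod, far, csi, fbias

pod, far, csi, fbias = performance_diagram_point(hits=40, misses=20, false_alarms=10)
# The plotted point is (1 - far, pod); a perfect forecast sits at (1, 1).
```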

Figure 2. Example of a Performance Diagram generated for the AER HailCast product.

NCAR RAL and MMM wrapped up a United States Weather Research Program (USWRP) project in collaboration with the NOAA/Earth System Research Laboratory (ESRL) by submitting a journal article to Monthly Weather Review.  Work continued between RAL, MMM, and HMT to improve extreme quantitative precipitation forecasts (QPF) for events that lead to flash flooding, by integrating verification research with social science research conducted with National Weather Service (NWS) forecasters. This work focused on the run-to-run model consistency of forecasts. This was initially explored by looking at the trends for QPF and related fields at different output frequencies (e.g., hourly output, 3-hourly output) using the Method for Object-based Diagnostic Evaluation extended to the Time Domain (MODE-TD), shown in Figure 3. NCAR also worked with HMT on developing METplus use cases to validate model output against currently available observations as soon as possible.

Figure 3.  Revision series accumulation box plots to visualize forecast consistency for object areas for probability of 1-hour precipitation accumulation > 12.7mm.


There will be at least two major releases of METplus and its components in FY2020.  Future releases of MET will include enhancements necessary for the unification efforts described above, as well as capabilities needed for testing and evaluation activities within the DTC. Enhancements will be focused on the following application areas: Aerosols, Air Quality, Hail, Marine, Sea-Ice, Space Weather, Subseasonal-to-Seasonal, Tropical Cyclone Genesis, and Tropical Environment. Air Force and hazards assessment verification and validation work will continue, as will work with both HWT and HMT. Finally, work towards unifying NCAR verification and validation capability through NCAR’s unified modeling framework project, the System for Integrated Modeling of the Atmosphere (SIMA), will start gaining momentum.