Futureproofing High School Science Students through Data Analytics

One of NCAR’s Strategic Imperatives is the training and preparation of the next generation of scientists to continue NCAR’s work. We believe that this training needs to start early, and one of the key elements is preparing students still in high school for the challenges of working with the huge amounts of data required to make progress in our field. Science of all flavors is destined to be based on ever-larger data sets as our technologies and methods improve, so programming skills, the ability to manipulate these data sets proficiently, and a strong understanding of statistics (collectively known as “data analytics”) are all vital for tomorrow’s scientists.

This past summer, NCAR’s Computational and Information Systems Laboratory (CISL) ran the “Data Analytics Bootcamp for High School Students,” an opportunity for 10 Boulder Valley School District high school sophomores and juniors to gain a hands-on introduction to being a data scientist. The five-day workshop was held 22–26 June 2015 at NCAR’s Mesa Lab facility in Boulder.

Dorit Hammerling (standing), lead organizer and developer of the workshop, stated, “Data analytics might sound scary at first to young people, but by approaching the field using data sets and problems that are both interesting and practical, this effort is designed to attract new talent to the discipline by providing hands-on experience for young people interested in pursuing careers involving scientific data analysis.” Photo by Brian Bevirt, CISL

Demand for data scientists continues to increase as the Big Data era produces data in varieties and volumes far exceeding anything scientists and engineers have ever had to manage before. Effective data analysis – using data to answer practical questions – underpins decision making in many fields and is the power behind many of the most successful web enterprises, including Google, Facebook, Amazon, and Orbitz. For NCAR researchers, effective data analysis also promises to unlock more scientific information from observations and numerical simulations in the geosciences. This bootcamp introduced data analysis concepts through exercises that applied real data to real-life situations. Some of the examples covered concepts in climate, but others were just fun, such as analyzing the performances of basketball players and pricing used cars.

The workshop was sponsored by CISL’s Institute for Mathematics Applied to the Geosciences (IMAGe) and was provided at no cost to the students. Organizer Dorit Hammerling (IMAGe Project Scientist II) and sponsor/co-organizer Doug Nychka (IMAGe Director) designed the curriculum to be a hands-on and engaging experience for the students. Supported by a team of instructors and programming coaches from NCAR, UCAR, CU Boulder, Colorado School of Mines, and Columbia University, the students were presented with a sequence of 15-minute lessons: 5 minutes of teaching followed by 10 minutes of hands-on exercises for students to apply their new knowledge. Students used a research-level software package called R to carry out the data analysis, while covering six fundamental concepts in data analytics:

  • Fundamentals of statistics and data types.
  • Exploratory data analysis and visualization.
  • Multivariate linear regression.
  • Categorical data analysis.
  • Data collection and survey analysis.
  • High performance computing and its role in data analysis.


Each pair of students received guidance from one expert during the workshop exercises. The support staff shown in this photo includes, from left to right, Colette Smirniotis, Dorit Hammerling, Lee Richardson, and Nathan Lenssen. The 10-minute exercise being conducted here followed five minutes of instruction in a new concept. This format was designed to sustain student interest during the intensive training and ensure that each participant had immediate, supported practice applying their new skills. —Photo by Brian Bevirt, CISL

By connecting with self-motivated young people as early in their lives as possible, Hammerling and Nychka aim to stimulate their interest and build their skills in using data analysis to solve real problems and prepare them for future careers in science. Hammerling summarized this new workshop’s outcome: “All the students learned about data analysis and developed skills using the R statistical programming environment to solve problems. They left the workshop with R skills that they can readily apply in internships or other employment opportunities. And IMAGe hopes to hire some of these freshly trained people as student assistants to help advance our current research projects.”