November 21, 2022
Research Highlight

Optimizing Random Forest Machine Learning Models to Predict Water Quality

Data-driven experiments determine best practices for Random Forest machine learning to predict water quality


Photograph taken at the research field study site located in the coastal terrestrial-aquatic interface of the Old Woman Creek estuary, a tributary to Lake Erie.

(Photo by Peter Regier | Pacific Northwest National Laboratory)

The Science

Machine learning (ML) is a rapidly growing field that uses the power of computers to develop new ways to analyze complex datasets. ML is increasingly used in the Earth sciences, but many research areas, including water quality, lack clear and relevant guidance for setting up ML models correctly. In this study, we explored approximately 1300 different model configurations for Random Forest (a popular ML algorithm) using long-term data from two estuaries. We examined how well different models predicted nitrate, an important nutrient in estuaries, to understand the best way to set up Random Forest models to assess water quality changes.

The Impact

Our findings enabled us to create an ordered list of the most important factors to consider when building Random Forest models for water quality predictions. We found that it is important to account for the influence of time in models. Additionally, how the model is designed affects both how well it performs and how we interpret its results. While we focused on water quality, our findings will be useful across the broader aquatic sciences and have general relevance to any dataset containing temporal or spatial dependence.

Summary

ML is increasingly used across the Earth sciences to understand how natural systems function, but we currently lack domain-relevant guidance on best practices. We explored how six different model factors affected the ability of Random Forest (RF) models to predict nitrate in two long-term estuarine datasets. We examined how well models predicted nitrate over time and found that accounting specifically for time was important to avoid overestimating the model’s ability, particularly for future predictions of nitrate. Modeling decisions such as how the data are split, which predictor variables are selected, and which hyperparameters are used all significantly affect model performance. We further explored how these decisions influenced the interpretation of the model, including which predictors were the most important and how those predictors related to nitrate.
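To illustrate why the way data are split matters for time series, here is a minimal sketch in Python that contrasts a shuffled split with a chronological split. It is not the authors' code. The function names, the 20% test fraction, and the 100-sample series are assumptions made for this example. It only produces index lists and does not train a model.

```python
import random


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices and hold out a random subset (ignores time order)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def chronological_split(n_samples, test_fraction=0.2):
    """Hold out the most recent block so every test point follows all training points."""
    n_test = int(round(n_samples * test_fraction))
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


def train_overlaps_test_period(train_idx, test_idx):
    """True if any training observation is later than the earliest test observation."""
    return max(train_idx) > min(test_idx)


# Hypothetical series of 100 time-ordered observations (index = time step).
n = 100
rand_train, rand_test = random_split(n)
chron_train, chron_test = chronological_split(n)

print("random split, training overlaps test period:",
      train_overlaps_test_period(rand_train, rand_test))
print("chronological split, training overlaps test period:",
      train_overlaps_test_period(chron_train, chron_test))
```

With a shuffled split, training observations surround the held-out ones in time, so a model can score well by interpolating between neighbors. That is one plausible route to the over-optimistic performance estimates described above. Either pair of index lists could then be passed to a Random Forest implementation, such as scikit-learn's `RandomForestRegressor`, to compare test scores.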

PNNL Contact

Vanessa Bailey, Earth Scientist, Pacific Northwest National Laboratory, vanessa.bailey@pnnl.gov.

Funding

This research is based on work supported by COMPASS-Field, Measurements, and Experiments, a multi-institutional effort funded by the Earth and Environmental System Science Division of the Department of Energy’s Office of Science. The Pacific Northwest National Laboratory is operated for DOE by Battelle Memorial Institute under contract DE-AC05-76RL01830.

Published: November 21, 2022

Regier, P., M. Duggan, A. Myers-Pigg, N. Ward. “Effects of Random Forest modeling decisions on biogeochemical time-series predictions”, Limnology and Oceanography Methods. [DOI: 10.1002/lom3.10523]