The quality of natural waters is of critical importance for public health, welfare, sustainable development, and ecological systems. Unfortunately, Lake Erie has faced a persistent crisis of harmful algal blooms (HABs) since the 1960s. Each year, HABs in Lake Erie occur predominantly in the western region between May and August. HABs are characterized by the excessive growth of cyanobacteria (i.e., blue-green algae), which often produce toxins such as microcystins. The goal of our research is to identify key factors affecting HABs in western Lake Erie from the perspective of machine learning, which is a powerful tool for predicting an output from input features. Specifically, chlorophyll-a is commonly used as a direct indicator of the severity of HABs, and we use it as the output target in machine learning. We collected data from both on-site measurements and remote sensing. We found that several remote sensing features were not reliable and thus used only on-site data as input features. After comparing 12 popular machine learning algorithms, we found that the random forest model performed best in predicting the value of chlorophyll-a, with an R² score of 0.84. Moreover, with the help of machine learning, we could quantify the relative importance of the input features and found that particulate organic carbon is the factor most strongly correlated with chlorophyll-a. Furthermore, we showed that the machine learning model was location dependent and that different model structures should be applied to different locations to predict HABs.
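For readers less familiar with the reported metric, the R^2 score (coefficient of determination) measures the fraction of variance in the observed target that a model's predictions explain, so a score of 0.84 means roughly 84% of the variance in chlorophyll-a is captured. The following is a minimal sketch of how such a score can be computed. The function name `r2_score` and the numeric values are illustrative assumptions, not code or data from the study.

```python
from statistics import fmean


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = fmean(observed)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the observed mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when observed values are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical chlorophyll-a values (illustrative only, not from the study).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
score = r2_score(observed, predicted)
```

A perfect model scores 1, while a model that always predicts the observed mean scores 0, which is why R^2 is a convenient common yardstick when comparing several regression algorithms on the same target.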