For my Unit 2 portfolio project in the Data Science program at Lambda School, I decided to move further out of my comfort zone. I chose to work with a Kaggle dataset compiled by Christoph Minixhofer. The aim of the Predict Droughts using Weather & Soil Data dataset is to predict six levels of drought across the continental US using meteorological and soil data. The underlying data was gathered by the experts at the US Drought Monitor and the UN harmonized soil database.
To avoid data leakage, the dataset author had already split it into Training (Jan. 1, 2000 to Dec. 31, 2016), Validation (Jan. 1, 2017 to Dec. 31, 2018), and Testing (Jan. 1, 2019 to Dec. 31, 2020) subsets.
The full training dataset contained over 19 million separate meteorological observations (data rows), which were too taxing for my puny laptop. Consequently, I chose to focus on data points related to my home state (Pennsylvania). This brought the datasets down to much more manageable sizes (Train: 416,070 observations; Validation: 48,910; Test: 48,977).
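For anyone replicating the state filter: Pennsylvania's FIPS state code is 42, so its county codes fall between 42000 and 42999. A minimal sketch (the tiny frame below is a hypothetical stand-in for the full Kaggle CSV):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full training CSV.
df = pd.DataFrame({
    'fips': [42001, 42003, 36001, 6037],   # two PA counties, one NY, one CA
    'PRECTOT': [0.0, 1.2, 0.5, 0.0],
})

# Pennsylvania's FIPS state code is 42, so keep county codes 42000-42999.
pa = df[df['fips'].between(42000, 42999)].copy()
```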
Summary of Findings
After performing some exploratory data analysis (EDA), feature engineering, model selection, and optimization, the Random Forest Classification Model performed the best. The Baseline Balanced Accuracy Score was 0.25 and the final model’s Test Balanced Accuracy Score was 0.30. Because the classes to be predicted were skewed (approximately 68% of the observations showed None as the level of drought), the final model, when run on the Test dataset, was much better at predicting instances where the drought level was None than at predicting any other level of drought.
For those interested in the more technical statistical analysis and the visualizations demonstrating the model performance, read on below:
EDA and Feature Engineering
After reducing the dataset to Pennsylvania, the EDA showed that the target (the level of drought) had a lot of NaN (empty) values. This was an intended limitation of the dataset: the meteorological data was gathered daily, while the drought data was gathered weekly.¹ To impute sensible fill-in values for the empty cells, I used the pandas Series.interpolate function. The default interpolation method is linear, which I felt was consistent with how drought would have been recorded had there been a daily drought observation.²
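As a small sketch of this interpolation step (toy numbers, not the real drought scores), pandas fills the six missing days on a straight line between two weekly observations:

```python
import numpy as np
import pandas as pd

# One weekly drought score, six NaN daily rows, then the next weekly score,
# mirroring the daily-weather vs. weekly-drought cadence (toy values).
score = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0])

# The default 'linear' method fills each missing day on a straight line
# between the two surrounding weekly observations.
filled = score.interpolate(method='linear')
```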
To analyze the problem as a classification problem, I then wrote a function that transformed the interpolated float values in the target into 6 categories as follows:
- ‘None’ (score = 0.0) (68% of Training Target)
- ‘Abnormal (D0)’ (0.0 < score <= 1.0) (25% of Training Target)
- ‘Moderate (D1)’ (1.0 < score <= 2.0) (5.4% of Training Target)
- ‘Severe (D2)’ (2.0 < score <= 3.0) (1% of Training Target)
- ‘Extreme (D3)’ (3.0 < score <= 4.0) (0.4% of Training Target)
- ‘Exceptional (D4)’ (4.0 < score <= 5.0) (0.0012% of Training Target)
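A minimal sketch of that transformation, assuming the interpolated scores live in a pandas Series: pd.cut's right-closed intervals match the "<=" bounds above, with a bin reserved so that a score of exactly 0.0 maps to 'None'.

```python
import pandas as pd

# Hypothetical interpolated drought scores; one value per category.
scores = pd.Series([0.0, 0.5, 1.5, 2.5, 3.5, 4.5])

labels = ['None', 'Abnormal (D0)', 'Moderate (D1)', 'Severe (D2)',
          'Extreme (D3)', 'Exceptional (D4)']

# Intervals are right-closed: (-0.1, 0.0] catches exactly 0.0 as 'None',
# (0.0, 1.0] is Abnormal, and so on, matching the thresholds above.
categories = pd.cut(scores,
                    bins=[-0.1, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                    labels=labels)
```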
Next, I created four new features related to precipitation, as my intuition suggested that precipitation would likely have the most effect on the level of drought on any given day. The features accounted for weekly, bi-weekly, tri-weekly, and monthly precipitation. To create them, I grouped the data by county (‘fips’, or Federal Information Processing Standard county codes) and summed the daily precipitation totals with a rolling window (via a lambda) over 7, 14, 21, and 30 days, shifting the result down one row so that each day’s feature uses only past observations, as illustrated below:
df['monthly_prec'] = df.groupby('fips')['PRECTOT'].transform(lambda x: x.rolling(30, min_periods=1).sum().shift())
Evaluation Metric and Baseline Score
From the above EDA, it is clear that the target classes are very imbalanced (with ‘None’ comprising 68%). Consequently, accuracy_score would not be a good evaluation metric because it could be misleading. Another metric, balanced_accuracy_score, is better suited to multiclass classification problems with an imbalanced target. The balanced_accuracy_score is defined as the average of the recall obtained on each class.
The difference between accuracy_score and balanced_accuracy_score is easily demonstrated using Scikit-Learn’s dummy classifier (which here always predicts the most_frequent class in the training target). The results for the Validation dataset are as follows:
Baseline Accuracy of the dummy classifier: 0.780
Baseline Balanced Accuracy of the dummy classifier: 0.25
It is clear that accuracy_score is misleading given that a guess of the most_frequent class is accurate close to 80% of the time.
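A hedged sketch of that comparison on toy data: the class proportions below are chosen to echo the real imbalance (and the four classes present in the Validation set), with a dummy model that always predicts 'None'.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy 4-class target skewed toward 'None', echoing the real imbalance.
y_train = np.array(['None'] * 78 + ['D0'] * 12 + ['D1'] * 6 + ['D2'] * 4)
y_val = np.array(['None'] * 78 + ['D0'] * 12 + ['D1'] * 6 + ['D2'] * 4)
X_train = np.zeros((len(y_train), 1))   # features are ignored by the dummy
X_val = np.zeros((len(y_val), 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
y_pred = dummy.predict(X_val)

acc = accuracy_score(y_val, y_pred)                # share of 'None' rows
bal_acc = balanced_accuracy_score(y_val, y_pred)   # recall averaged over classes
```

Only the majority class has nonzero recall, so balanced accuracy collapses to 1 divided by the number of classes, while plain accuracy still looks deceptively high.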
Classification Supervised Learning Models
To solve this classification problem, I focused on two supervised learning models: Random Forest and XGBoost. I started with a Random Forest model with 15 trees and used Scikit-Learn’s compute_class_weight to account for the imbalanced target classes.
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_train),
                               y=y_train)
# Use zip() to convert the numpy array of class weight values into a dictionary
class_weights = dict(zip(np.unique(y_train), weights))
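The resulting dictionary is then passed to the classifier's class_weight parameter. A self-contained sketch on toy data (the real X_train and y_train come from the drought features and labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target standing in for the real drought labels.
y_train = np.array(['None'] * 90 + ['D0'] * 10)
X_train = np.random.RandomState(0).rand(100, 3)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_train),
                               y=y_train)
class_weights = dict(zip(np.unique(y_train), weights))

# 15 trees as in the first model; the weight dict up-weights rare classes
# (weight = n_samples / (n_classes * class_count)).
model = RandomForestClassifier(n_estimators=15,
                               class_weight=class_weights,
                               random_state=42).fit(X_train, y_train)
```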
Initially, I added a scaler³ to determine whether it improved the models’ performance. The conclusion I reached was that scalers only marginally improved (by 0.0001) the Balanced Accuracy for the Validation dataset when compared with the base Random Forest model. As such, I did not test adding a scaler to the XGBoost model.
During my testing, I found that hyperparameter optimization of the models on the full training dataset took a significant amount of time due to the size of the dataset. Neither GridSearchCV nor RandomizedSearchCV was able to run to completion (either on Google Colab or locally). I was thus forced to perform manual hyperparameter optimization.
At the suggestion of a colleague (h/t James Caldwell, DSPT10), I created a random subset of the Training dataset, keeping the target distribution the same as in the full set. I then used the smaller dataset to tune the models’ hyperparameters.
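One way to build such a subset, assuming the target lives in a 'target' column, is to sample the same fraction from each class with pandas' group-wise sampling (toy data below stands in for the Pennsylvania training frame):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# Toy training frame; 'target' mimics the imbalanced drought classes.
train = pd.DataFrame({
    'target': ['None'] * 80 + ['D0'] * 15 + ['D1'] * 5,
    'PRECTOT': rng.rand(100),
})

# Sampling 20% of each class keeps the subset's class mix identical
# (up to rounding) to the full Training set's distribution.
subset = train.groupby('target').sample(frac=0.2, random_state=42)
```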
Optimizing Based on balanced_accuracy for the Validation and the Test Datasets
During the hyperparameter optimization, I noticed something interesting: when I tested the parameters of the Random Forest and XGBoost models on both the Validation and the Test datasets, the best hyperparameters for the Validation dataset were not the best for the Test dataset, and vice versa. In other words, if I focused on the models’ performance on the Test dataset, I tended to pick hyperparameters that were not optimal for the Validation dataset. In the end, given that in the real world one should be blind to the test dataset to prevent any data leakage, I took the more conservative approach and chose the hyperparameters that optimized the models for the Validation dataset.
Using SelectKBest to Find the Best Features to Include in Final Model
Lastly, I used Scikit-Learn’s SelectKBest to find the best features to include in the Final Model. I used a for loop to fit the hyperparameter-optimized Random Forest Model for each value of k, calculate the balanced_accuracy for the Training and Validation datasets, and store the values in a dataframe. Somewhat surprisingly, the highest Validation balanced_accuracy came from keeping all the features in the model, so I did not eliminate any of them.
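A sketch of that loop on synthetic data (make_classification is a stand-in for the drought features and target; the real loop used the tuned Random Forest hyperparameters):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score

# Synthetic stand-in for the drought features/target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

rows = []
for k in range(1, X.shape[1] + 1):
    # Keep the k features with the highest ANOVA F-scores.
    selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
    model = RandomForestClassifier(n_estimators=15, random_state=42)
    model.fit(selector.transform(X_train), y_train)
    rows.append({
        'k': k,
        'train_bal_acc': balanced_accuracy_score(
            y_train, model.predict(selector.transform(X_train))),
        'val_bal_acc': balanced_accuracy_score(
            y_val, model.predict(selector.transform(X_val))),
    })

results = pd.DataFrame(rows)   # one row per candidate k
```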
Model Performance and Interpretation
The Final Model selected is a Random Forest with 10 Trees, 139 estimators, a minimum of 10 samples required to split an internal node, balanced class weights, and no bootstrapping. The features in the Final Model have the following importances:
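In scikit-learn terms, and reading "10 Trees" as a maximum tree depth of 10 (an assumption on my part, since n_estimators=139 already fixes the number of trees), the configuration would look roughly like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Hedged reconstruction of the final configuration; max_depth=10 is an
# interpretation of "10 Trees", not a confirmed hyperparameter.
final_model = RandomForestClassifier(n_estimators=139,
                                     max_depth=10,
                                     min_samples_split=10,
                                     class_weight='balanced',
                                     bootstrap=False,
                                     random_state=42)
```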
It is very satisfying when a model supports one’s initial intuition: that cumulative precipitation has a large effect on drought level. In this case, as the above graph illustrates, 3 of the features I engineered (monthly_prec, triweekly_prec, biweekly_prec) account for a combined feature importance of approximately 0.19, or about 19 percent of the Final Model’s predictive power.
The confusion matrix and classification report for the Final Model are as follows:
Both the confusion matrix and the classification report demonstrate that, due to the imbalanced Training target and the fact that the Test dataset contained no true instances of Extreme or Exceptional drought, the Final Model is best at predicting when the drought level is ‘None’ (high Precision but medium Recall). The Final Model is limited in its predictions of the ‘Abnormal’, ‘Moderate’, and ‘Severe’ drought levels.
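Those diagnostics come from scikit-learn's confusion_matrix and classification_report run on the Test predictions. The toy values below only illustrate the pattern described above ('None' predicted well, rarer classes poorly); they are not the real results.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative-only labels: most rows are 'None', and the model over-predicts it.
y_test = np.array(['None'] * 6 + ['Abnormal (D0)'] * 3 + ['Moderate (D1)'] * 1)
y_pred = np.array(['None'] * 5 + ['Abnormal (D0)'] * 1 +
                  ['None'] * 2 + ['Abnormal (D0)'] * 1 + ['None'] * 1)

labels = ['None', 'Abnormal (D0)', 'Moderate (D1)']
cm = confusion_matrix(y_test, y_pred, labels=labels)   # rows = true class
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)
```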
As climate change worsens, our ability to predict drought based on data as complex as meteorological and soil observations will become more and more important.
The Final Model was able to predict Pennsylvania droughts with high Precision when the drought level was ‘None’ but had significant failures when it attempted to predict the ‘Abnormal’, ‘Moderate’, and ‘Severe’ drought levels. Given that these drought levels are much more important and need to be accurately forecasted so that proper precautions can be taken, the Final Model cannot be used in the real world. Further study and improved models are necessary.
Overall, this was a rewarding (if somewhat frustrating) dataset. Working with this dataset has showcased how real world data is never perfectly clean, balanced, correlated, or easily interpreted. It also truly fit into the following meme (h/t Han Lee):
For those interested, this is the project’s Google Colab Notebook.
In order to access the data from Kaggle’s API, you need to sign up for an account and have an API authentication token. From Kaggle’s GitHub Readme: “To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the ‘Account’ tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials.” Once you have the kaggle.json file, running the first cell in the notebook will allow you to upload the kaggle.json file to the correct directory for the Kaggle API to properly authenticate you and allow the download of the dataset in the next cell.
This meant that the target column had 6 empty values, then a float value for the weekly drought level; this pattern then repeated.
In other words, I believe that if there had been daily drought observations, the values assigned to each day would be based on the previous week’s drought observation and would increase or decrease linearly to reach the next week’s drought observation (e.g. if the drought level increased from 1.0 to 2.0 over a week, then Day 2 ≈ 1.14, Day 3 ≈ 1.29, . . . Day 7 ≈ 1.86, reaching 2.0 on Day 1 of the next week).
 I used Scikit-Learn’s MinMaxScaler, RobustScaler, and StandardScaler.