🚱 Water Shortage Prediction at Hi!ckathon 2024

🔍 Overview

This repository contains our team's work for the Hi!ckathon, a competition on AI and sustainability organized by Hi! PARIS, the Center on Data Analytics and Artificial Intelligence for Science, Business and Society, created by Institut Polytechnique de Paris and HEC Paris and joined by Centre Inria de Saclay. The goal of our project was to build an AI model that predicts groundwater levels at French piezometric stations, with a special emphasis on the summer months. The model combines several data sources: piezometric data, weather patterns, hydrology, water withdrawal, and economic data.

In addition to model development, we were tasked with considering the real-world application of our solution and projecting how it could be used in the market to address water shortages 🌍💧

🚀 Objective

The project has four objectives:

Build a predictive model for forecasting groundwater levels at French piezometric stations.
Focus specifically on the summer months, as they are crucial for water resource management.
Leverage multiple data sources, including weather, hydrology, water withdrawal, and economic data, to improve prediction accuracy.
Explore and design a real-world application of the model to address water shortage issues.

👥 Our Team

🎯 Our Approach

The target variable is categorical, with five balanced classes representing groundwater levels: very low, low, average, high, and very high. Because the classes are balanced, no imbalance-handling techniques were needed, and the task was treated as a standard multiclass classification problem.

Preprocessing consisted of dropping columns with more than 80% missing values and imputing the remaining missing values with either the median or the mode. Feature engineering followed, as detailed in the Feature Engineering section. All numeric features were then scaled, and the target variable was encoded as integers from 0 to 4.
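The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the code in `scripts/preprocess_data.py`: the function name `preprocess`, the label strings in `LEVELS`, the choice of median for numeric columns and mode for the others, and the use of `StandardScaler` are all assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical label strings; the real labels in the dataset may differ.
LEVELS = ["very low", "low", "average", "high", "very high"]


def preprocess(X: pd.DataFrame, y: pd.Series, max_missing: float = 0.8):
    """Drop sparse columns, impute, scale numeric features, encode the target."""
    # 1. Keep only columns whose share of missing values is at most the threshold.
    X = X.loc[:, X.isna().mean() <= max_missing].copy()

    # 2. Impute: median for numeric columns, mode for all other columns (assumption).
    numeric_cols = X.select_dtypes(include="number").columns
    other_cols = X.columns.difference(numeric_cols)
    X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].median())
    for col in other_cols:
        X[col] = X[col].fillna(X[col].mode().iloc[0])

    # 3. Scale numeric features (standardisation is an assumption).
    X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])

    # 4. Encode the ordered target labels as integers 0..4.
    y_encoded = y.map({label: i for i, label in enumerate(LEVELS)})
    return X, y_encoded
```

In a real pipeline the medians, modes, and scaler would be fitted on the training data only and reused on the test data, so that no test-set statistics leak into training.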

Five models were then trained and evaluated using 3-fold cross-validation; the scores are reported in the Results section. The best-performing model was a random forest, whose hyperparameters were tuned with a grid search. On the test set this model reached an F1 score of 58.36%, below its cross-validated F1 of 0.7212, which placed the team 15th out of 60.
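As an illustration of the tuning step, the sketch below runs a grid search over a random forest with 3-fold cross-validation in scikit-learn. It uses synthetic data in place of the competition dataset. The parameter grid and the macro-averaged F1 scoring are assumptions, since the search space and averaging scheme actually used are not documented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed data: 5 roughly balanced classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Hypothetical search space, for illustration only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # assumption: macro-averaged F1
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

# Best hyperparameters and their mean cross-validated score.
print(search.best_params_, round(search.best_score_, 4))
```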

Feature Engineering

Feature	Description
`day`	Extracted day from the `meteo_date`. Represents the day of the month.
`month`	Extracted month from the `meteo_date`. Represents the month of the year.
`quarter`	Extracted quarter from the `meteo_date`. Represents which quarter of the year (1 to 4).
`year`	Extracted year from the `meteo_date`. Represents the year of the data point.
`day_sin`	Sin transformation of the `day` feature. Converts the day of the month into a periodic value for modeling cyclical behavior.
`day_cos`	Cos transformation of the `day` feature. This, alongside `day_sin`, captures the periodicity of the day of the month.
`month_sin`	Sin transformation of the `month` feature. Converts the month into a periodic value to model cyclical patterns (seasons, etc.).
`month_cos`	Cos transformation of the `month` feature. Works together with `month_sin` to capture the cyclical nature of months.
`quarter_sin`	Sin transformation of the `quarter` feature. Captures the cyclic behavior of the four seasons in a year.
`quarter_cos`	Cos transformation of the `quarter` feature. Works together with `quarter_sin` to capture the periodic nature of quarters.
`meteo_temperature_avg_lag_1`	Lag feature representing the average temperature from the previous year. This helps capture long-term temperature trends.
`meteo_rain_height_lag_1`	Lag feature representing the rainfall from the previous year. Similar to temperature lag, this captures long-term precipitation trends.
`meteo_temperature_avg_rolling_mean_7`	Rolling mean of the average temperature over a 7-day window. This smooths out short-term fluctuations and helps capture medium-term temperature trends.
`meteo_rain_height_rolling_sum_7`	Rolling sum of the rainfall over a 7-day window. Helps to capture cumulative rainfall over a short period.
`temperature_wind_interaction`	Interaction feature between average temperature and wind speed. Helps to capture the joint effect of temperature and wind on environmental conditions.
`humidity_rain_interaction`	Interaction feature between humidity and rainfall. Helps to understand how the two variables interact and affect the environment together.
`temperature_range`	Difference between the maximum and minimum temperature. Captures the temperature variability within a day or over time.
`evapotranspiration_to_rain_ratio`	Ratio of evapotranspiration to rainfall. Helps understand how the amount of water evaporated compares to the rainfall, influencing soil moisture.
`altitude_difference`	Difference between the piezo station altitude and the meteorological station altitude. Helps to capture geographic effects on environmental conditions.
`cumulative_rainfall_30_days`	Rolling sum of rainfall over a 30-day window. Captures long-term trends in precipitation.
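
A few of the features in the table can be sketched with pandas as follows. This is a hedged illustration rather than the repository's implementation: the function name `add_features` is hypothetical, the input column names `meteo_temperature_avg` and `meteo_rain_height` are inferred from the feature names, the fixed period of 31 for the day encoding is an assumption, and the rolling windows assume one row per day for a single station.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, cyclical, and rolling features to one station's daily rows."""
    df = df.copy()
    df["meteo_date"] = pd.to_datetime(df["meteo_date"])
    df = df.sort_values("meteo_date")
    date = df["meteo_date"]

    # Calendar parts extracted from the date.
    df["day"] = date.dt.day
    df["month"] = date.dt.month
    df["quarter"] = date.dt.quarter
    df["year"] = date.dt.year

    # Cyclical encoding: map each part onto the unit circle so that,
    # for example, December and January end up close together.
    for col, period in [("day", 31), ("month", 12), ("quarter", 4)]:
        angle = 2 * np.pi * df[col] / period
        df[f"{col}_sin"] = np.sin(angle)
        df[f"{col}_cos"] = np.cos(angle)

    # Rolling statistics over the last 7 and 30 rows (one row per day assumed).
    temp = df["meteo_temperature_avg"]
    rain = df["meteo_rain_height"]
    df["meteo_temperature_avg_rolling_mean_7"] = temp.rolling(7, min_periods=1).mean()
    df["meteo_rain_height_rolling_sum_7"] = rain.rolling(7, min_periods=1).sum()
    df["cumulative_rainfall_30_days"] = rain.rolling(30, min_periods=1).sum()
    return df
```

With several stations in one table, the rolling features would need to be computed per station (for example inside a `groupby`) so that windows do not span two stations.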

📊 Results

Model	Accuracy	F1 Score	Precision	Recall	AUC-ROC
Random Forest	0.7149 ± 0.0004	0.7212 ± 0.0005	0.7231 ± 0.0010	0.7199 ± 0.0001	0.9222 ± 0.0003
XGBoost	0.6261 ± 0.0014	0.6349 ± 0.0014	0.6352 ± 0.0013	0.6349 ± 0.0016	0.8821 ± 0.0007
LightGBM	0.5851 ± 0.0014	0.5928 ± 0.0015	0.5925 ± 0.0015	0.5940 ± 0.0015	0.8592 ± 0.0010
CNN	0.5223 ± 0.0048	0.5227 ± 0.0049	0.5241 ± 0.0047	0.5223 ± 0.0048	0.8342 ± 0.0027
AdaBoost	0.3390 ± 0.0020	0.3411 ± 0.0025	0.3432 ± 0.0032	0.3408 ± 0.0018	0.6756 ± 0.0007
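
A table of this shape (a mean and a spread per metric) can be produced with scikit-learn's `cross_validate`. The sketch below uses synthetic data and a single model. The averaging choices (macro-averaged F1, precision, and recall, and one-vs-rest AUC-ROC) are assumptions, since the scheme behind the reported numbers is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed data: 5 roughly balanced classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Metric name -> scikit-learn scorer string (averaging is an assumption).
scoring = {
    "accuracy": "accuracy",
    "f1": "f1_macro",
    "precision": "precision_macro",
    "recall": "recall_macro",
    "auc_roc": "roc_auc_ovr",
}

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=3, scoring=scoring,
)

# Mean and standard deviation of each metric across the 3 folds.
summary = {
    name: (scores[f"test_{name}"].mean(), scores[f"test_{name}"].std())
    for name in scoring
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```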

🖥️ Run the code

Set up

First, clone the repository and navigate to the project folder:

git clone git@github.com:zhukovanadezhda/water-scarcity.git
cd water-scarcity

To set up the environment and install the required dependencies, use the following commands:

conda env create -f environment.yml
conda activate water-scarcity

Preprocessing

Place the data in the data folder (contact us to obtain it). Then run the following command once per file to produce the train and test datasets:

python scripts/preprocess_data.py --path <data_file_path> [--is_train]

    --path        Path to the CSV data file (training or test).
    --is_train    Flag to indicate training data (optional).
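
To make the behaviour of the two options concrete, here is a minimal `argparse` sketch of a command-line interface with the same flags. It is an assumption about how such a parser could look, not the actual contents of `scripts/preprocess_data.py`. The helper `build_parser` and the file names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two documented options."""
    parser = argparse.ArgumentParser(description="Preprocess a raw CSV file.")
    parser.add_argument(
        "--path", required=True,
        help="Path to the CSV data file (training or test).",
    )
    parser.add_argument(
        "--is_train", action="store_true",
        help="Flag to indicate training data (optional).",
    )
    return parser


# Hypothetical file names, for illustration only.
train_args = build_parser().parse_args(["--path", "data/train.csv", "--is_train"])
test_args = build_parser().parse_args(["--path", "data/test.csv"])
print(train_args.is_train, test_args.is_train)  # True False
```

Under this reading, the flag is passed when preprocessing the training file and omitted for the test file.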

Model training and evaluation

Once preprocessing is complete, run either `train_cnn.py` or `train_models.py` to train and evaluate the corresponding models. For example, for the CNN:

python scripts/train_cnn.py --X_path data/X_train.csv --y_path data/y_train.csv

    --X_path      Path to the CSV file containing the training features.
    --y_path      Path to the CSV file containing the training labels.

🤝 Acknowledgments

Hi! PARIS for organizing the Hi!ckathon and providing the opportunity to work on impactful sustainability challenges 🎉
The participants, mentors, and organizers for their valuable feedback and support during the competition.
