📅 Revealing Representative Day-Types Using Clustering

📚 Course Information

Course: AH2179
Module: 5
Project: Representative Day-Types through Clustering Analysis

🎯 Project Overview

This project implements and compares four clustering methods for identifying representative day-types: k-means, agglomerative clustering, DBSCAN, and a graph neural network (GNN). The focus is on tuning each model's parameters and evaluating the resulting clusters with internal metrics (silhouette, Davies-Bouldin, Calinski-Harabasz) and external metrics (MAE, MAPE).

📊 Model Performance Comparison

Model	Silhouette Score	Davies-Bouldin Score	Calinski-Harabasz Score	MAE	MAPE	Evaluation Metric
k-means	0.26924	1.23140	118.0547	129.77361	8.4901	9.2775
Agglomerative	0.26377	1.35879	159.13421	59.98866	3.31836	10.7863
DBSCAN	0.28358	2.22541	9.93803	60.67715	3.3573	0.3044
GNN	0.31493	1.07859	112.19560	128.16948	8.40844	12.0161
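The README does not say how the MAE and MAPE columns were computed. One common approach for day-type clustering, shown here purely as an assumption, is to treat each cluster centroid as the predicted flow profile for every day in that cluster and measure the error against the observed profiles. The data, the variable names, and the use of k-means below are all hypothetical and are not taken from the project notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical data: 60 days x 24 hourly flow values with two day-types.
rng = np.random.default_rng(0)
weekday_like = 1000 + 50 * rng.standard_normal((40, 24))
weekend_like = 600 + 50 * rng.standard_normal((20, 24))
flows = np.vstack([weekday_like, weekend_like])

# Cluster the days, then use each cluster's mean profile as its "prediction".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flows)
centroids = np.vstack([flows[labels == c].mean(axis=0) for c in range(2)])
predicted = centroids[labels]  # one centroid profile per day

# Assumed external metrics: error between observed and centroid profiles.
mae = mean_absolute_error(flows, predicted)
# scikit-learn returns MAPE as a fraction, so scale it to percent.
mape = 100 * mean_absolute_percentage_error(flows, predicted)

print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```

Under this assumption, a lower MAE or MAPE means the centroids describe the individual days in their clusters more closely.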

🏗️ Model Architecture

Data Preprocessing

Scaling: Features standardized with scikit-learn's StandardScaler
Feature Selection: All columns relevant to clustering day-types
Parameter Tuning: Iterative testing over the number of clusters and, for DBSCAN, the epsilon value
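A minimal sketch of the scaling step, assuming the input is a matrix with one row per day and one column per feature. The synthetic `daily_profiles` array is a stand-in for illustration and is not the course dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical input: 60 days x 24 hourly flow values.
rng = np.random.default_rng(0)
daily_profiles = rng.uniform(100, 2000, size=(60, 24))

# Standardize each column to zero mean and unit variance so that
# no single feature dominates the distance computations.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(daily_profiles)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6)[:3], col_stds.round(6)[:3])
```

Scaling matters here because k-means, agglomerative clustering, and DBSCAN all rely on distances between points.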

🤖 Clustering Methods Implemented

k-means
Agglomerative Clustering
DBSCAN
GNN (Graph Neural Network)
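The three scikit-learn methods can be fitted through the same `fit_predict` interface. The sketch below uses synthetic blobs and illustrative parameter values (`n_clusters=3`, `eps=0.8`, `min_samples=5`), none of which come from the project. The GNN is left out because the README does not describe its implementation.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 150 points in 3 groups.
X, _ = make_blobs(n_samples=150, centers=3, n_features=4,
                  cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)

# Illustrative parameter choices, not the tuned values from the project.
models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
}

labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    # DBSCAN marks noise points with the label -1; exclude it from the count.
    n_clusters = len(set(lab)) - (1 if -1 in lab else 0)
    print(f"{name}: {n_clusters} clusters")
```

Unlike the other two methods, DBSCAN does not take a cluster count. The number of clusters follows from `eps` and `min_samples`, which is why epsilon needs separate calibration.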

⭐ Best Model Configuration

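The GNN method emerged as the best performer on the internal metrics, with the highest silhouette score (0.31493), the lowest Davies-Bouldin score (1.07859), and the highest custom evaluation metric (12.0161). Its best configuration used 3 clusters. On the external metrics, Agglomerative and DBSCAN achieved lower MAE and MAPE, so the ranking of GNN rests on the internal scores.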

📈 Evaluation Metric Formula

To assess internal performance, a custom evaluation metric was created:

evaluation_metric = silhouette_score * exp(-davies_bouldin_score) * calinski_harabasz_score 🚀

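The metric rewards a high silhouette score and a high Calinski-Harabasz score and penalizes a high Davies-Bouldin score, so a higher value indicates better-separated clusters. It provides a single number for comparing the clustering models and their configurations.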
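As a quick check of the formula, the sketch below recomputes the Evaluation Metric column for three rows of the comparison table from their internal scores. The function name `evaluation_metric` and the `rows` dictionary are illustrative. The numbers are copied from the table.

```python
import math

def evaluation_metric(silhouette, davies_bouldin, calinski_harabasz):
    """Combine the three internal scores into one value (higher is better)."""
    return silhouette * math.exp(-davies_bouldin) * calinski_harabasz

# (silhouette, Davies-Bouldin, Calinski-Harabasz) taken from the table above.
rows = {
    "Agglomerative": (0.26377, 1.35879, 159.13421),
    "DBSCAN": (0.28358, 2.22541, 9.93803),
    "GNN": (0.31493, 1.07859, 112.19560),
}

metrics = {name: evaluation_metric(*scores) for name, scores in rows.items()}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the table to the reported precision. Because the Calinski-Harabasz score is unbounded while the silhouette score lies between -1 and 1, the product is most sensitive to the Calinski-Harabasz and Davies-Bouldin terms.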

💡 Key Findings & Detailed Reflections

🔍 Model Performance Analysis

Clustering Method and Cluster Selection
- GNN was the top-performing method, showing distinct patterns for warmer/cooler months and weekdays/weekends in the calendar graph.
- Agglomerative clustering performed well at 5 clusters, though its performance decreased beyond this point.
Parameter Calibration Observations
- k-means: Performance dipped between 3 and 5 clusters, marked by a rise in the Davies-Bouldin Score.
- Agglomerative: Best performance at around 5 clusters, as indicated by the Davies-Bouldin Score.
- DBSCAN: Epsilon value calibration was essential, with optimal performance at epsilon < 1400.
- GNN: Optimal performance observed with 3 clusters, with performance dropping as clusters increased.
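The calibration described above amounts to a parameter sweep. The sketch below shows one way to run it, on synthetic data and with illustrative ranges (2 to 8 clusters, `eps` from 0.3 to 2.0). These ranges, and the choice to score DBSCAN on non-noise points only, are assumptions and not the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), standardized before clustering.
X, _ = make_blobs(n_samples=200, centers=4, n_features=6, random_state=1)
X = StandardScaler().fit_transform(X)

# Sweep the number of clusters for k-means; lower Davies-Bouldin is better.
db_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db_by_k[k] = davies_bouldin_score(X, labels)
best_k = min(db_by_k, key=db_by_k.get)

# Sweep epsilon for DBSCAN; score only settings that yield >= 2 clusters.
eps_results = {}
for eps in np.linspace(0.3, 2.0, 8):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    mask = labels != -1  # drop noise points before scoring
    n_clusters = len(set(labels[mask]))
    if n_clusters >= 2:
        eps_results[float(eps)] = silhouette_score(X[mask], labels[mask])

print("Davies-Bouldin by k:", {k: round(v, 3) for k, v in db_by_k.items()})
print("Best k:", best_k)
print("Silhouette by eps:", {round(e, 2): round(s, 3) for e, s in eps_results.items()})
```

Plotting `db_by_k` against the number of clusters gives the kind of curve from which the dips and optima listed above can be read.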

🎓 Technical Lessons Learned

Challenges in Clustering
- Interpreting clustering scores and visual results was subjective, especially with calendar graphs showing diverse patterns.
- Selecting cluster numbers and tuning epsilon required trial and error, adding time to the analysis process.
Importance of Clustering in AI
- Clustering provides valuable insights into large datasets by identifying patterns that aid decision-making and strategy formulation.
Domain-Specific Applications
- This clustering approach could aid in train delay predictions and optimizing scheduling based on high-flow periods.

🌅 Visual Results

🗓️ Calendar Graph

📈 Flow vs Time of Day

🔮 Future Applications & Transportation Problems

📋 Potential Applications

Train Delay Prediction
- Cluster-based analysis to predict high-delay periods
- Optimization of train schedules to meet demand
Traffic Flow Optimization
- Identification of peak and off-peak hours for improved scheduling
- Application in real-time traffic management
Urban Infrastructure Planning
- Analyzing utilization patterns to aid in infrastructure planning and maintenance scheduling

🛠️ Recommended Techniques

Enhanced Clustering Techniques
- Experimentation with additional clustering methods (e.g., spectral clustering)
- Hybrid models combining clustering with predictive analytics
Advanced Evaluation Metrics
- Incorporating cross-validation with clustering to refine parameter selection
- Additional external validation metrics for robust performance comparison
Data Enrichment for Improved Insights
- Integrating temporal and spatial data for enhanced clustering performance
- Utilizing domain-specific knowledge to refine feature selection

🛠️ Dependencies

Scikit-learn
Pandas
NumPy
Matplotlib
Seaborn
NetworkX (for GNN implementation)

Note: This project was completed as part of the AH2179 course, Module 5. 🎓
