This project aims to predict the prices of apartments in Buenos Aires, Argentina, using a machine-learning model. The focus is on properties costing less than $400,000. Accurate price predictions matter to buyers, sellers, real estate agents, and policymakers. The objective is to develop a reliable model even though the data contains no temporal indicators.
The goal is to identify the features that most accurately predict apartment prices in Buenos Aires and to achieve a Mean Absolute Error (MAE) below 50% of the baseline model's MAE.
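To make the MAE target concrete, here is a minimal sketch of how it could be checked. It assumes the baseline is a model that always predicts the mean price, which this README does not specify. The function names and the prices are hypothetical and are not project data.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted prices."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mae_ratio(y_true, y_pred):
    """Model MAE as a fraction of the mean-baseline MAE (lower is better)."""
    # Assumption: the baseline always predicts the mean of the actual prices.
    baseline_pred = [mean(y_true)] * len(y_true)
    return mean_absolute_error(y_true, y_pred) / mean_absolute_error(y_true, baseline_pred)


# Hypothetical prices in USD, for illustration only.
actual = [100_000, 150_000, 200_000, 250_000]
predicted = [110_000, 140_000, 210_000, 240_000]

ratio = mae_ratio(actual, predicted)
meets_target = ratio < 0.5  # target: model MAE below 50% of the baseline MAE
print(f"MAE ratio: {ratio:.2f}, target met: {meets_target}")
```

In a real evaluation the baseline mean would come from the training set and both MAEs would be measured on held-out data.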
We followed a prescriptive methodology to guide our model development:
- Data Collection: Scraped 12,000+ apartment listings from real estate websites.
- Data Preprocessing: Cleaned the dataset by handling missing values, converting data types, removing duplicates, and normalizing features.
- Exploratory Data Analysis (EDA): Examined feature distributions and the relationships between features.
- Feature Engineering: Built and compared candidate features to identify the most significant predictors.
- Modeling: Iteratively experimented with various models, including multiple versions of the Ordinary Least Squares (OLS) model.
- Handling Heteroskedasticity: Identified and addressed this issue to improve model accuracy.
- Final Model: Selected a Gradient Boosting Regressor, which achieved the best performance of the models tested.
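The modeling step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the column names (`surface_covered_m2`, `lat`, `lon`, `price_usd`) are hypothetical, the baseline is assumed to be a mean predictor, and the regressor uses scikit-learn's default hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scraped listings; column names are hypothetical.
rng = np.random.default_rng(42)
n = 2_000
df = pd.DataFrame({
    "surface_covered_m2": rng.uniform(30, 150, n),
    "lat": rng.uniform(-34.70, -34.55, n),
    "lon": rng.uniform(-58.53, -58.35, n),
})
df["price_usd"] = 2_000 * df["surface_covered_m2"] + rng.normal(0, 15_000, n)

# Keep only listings under the project's $400,000 cap.
df = df[df["price_usd"] < 400_000]

X = df.drop(columns="price_usd")
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed baseline: always predict the mean training price.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_model = mean_absolute_error(y_test, model.predict(X_test))
print(f"Baseline MAE: ${mae_baseline:,.0f}")
print(f"Model MAE:    ${mae_model:,.0f}")
```

Comparing both models on the same held-out split keeps the MAE improvement figure meaningful.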
The final model produced the following results:
- Optimal High Leverage Threshold: 0.0003
- Optimal High Residual Threshold: 3
- Mean Absolute Error (MAE): $23,809.9879
- R²: 0.7863
- MAE Improvement: 60% lower than the baseline MAE
The final model successfully handled heteroskedasticity and outperformed previous iterations.
The project uses the following technologies:
- Python: Core language used for development.
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations.
- Matplotlib: For data visualization.
- Seaborn: For statistical data visualization.
- Plotly: For interactive graphs.
- Dash: For web-based application development.
- Scikit-learn: For machine learning modeling.
- Statsmodels: For statistical modeling.
Make sure to install the following dependencies before running the project:
pip install pandas numpy matplotlib seaborn plotly dash scikit-learn statsmodels