I have used a Retail store data to clean, preprocess and get insights on the business. I leverage tools like Pandas for data manipulation, MySQL for data warehousing, and Power BI for data visualization.
- Programming Language: Python
- Database: MySQL
- Data Visualization: Power BI
- Data Manipulation: Pandas
- APIs: Kaggle API
Retail Sales Data: This dataset includes fields capturing sales transactions, product categories, regions, and more. The data is sourced from Kaggle.
The data is extracted using the Kaggle API. The script connects to Kaggle, downloads the dataset, and saves it locally.
The cleaned and transformed data is loaded into a MySQL database for further analysis.
- Script:
Mysqlscript.sql
Also written some complex queries to calculate:
- Find top 10 highest revenue generating products
- Top 5 selling products in each regions
- MONTH OVER MONTH GROWTH COMPARISION FOR 2022 AND 2023 SALES eg: JAN 2022 V/S 2023
- MONTH OVER MONTH GROWTH COMPARISION FOR 2022 AND 2023 SALES eg: JAN 2022 V/S 2023
- FOR EACH CATEGORY WHICH MONTH HAD HIGHEST SALES
- Which subcategory had highest growth by profit in 2023 compared to 2022
Pandas is used to clean the data, add new features, and prepare it for loading into MySQL.
- Script:
retail-orders-notebook.py
Power BI and DAX are used for data analysis and visualization, providing insights into various metrics such as the highest selling products category-wise, region-wise, and sub-category-wise. **
Feel free to contribute to this project by forking the repository and submitting a pull request.
For more detailed information about the project and data, refer to the following resources: