Welcome to this take-home assignment and congrats on making it here. 🙌
You're presented with a data set with the following features:
- Customer ID
- Product ID
- Timestamp
Our goal is to glean insights into our customers' behavior and the dynamics that arise as they make transactions. To this end, we are going to apply data processing techniques along with machine learning models to better understand these dynamics.
You're free to pick any architecture and any data model that you deem suitable. However, please make sure to explain your rationale for picking the model. We're interested in models that deliver reasonably high accuracy and are ideally flexible and powerful enough to accommodate modifications such as adding more features. Here are a few basic questions (a starter sketch for the first three follows the list):
- Create a plot of the total number of transactions per customer, ordered (descending) from the most to the least active customer. 📉
- Given any product ID, create a plot to show its transaction frequency per month for the year 2018. 📊
- For any given point in time, what are the top 5 products that drove the highest sales over the preceding six months? 💸
- Build a model to predict the total number of expected transactions over the next month for any customer. Split your data appropriately and validate your model's performance, for example by comparing the predicted results against the true values, or however you see fit. Keep in mind that sometimes a carefully crafted plot is worth a thousand words. 💹
- What other interesting questions can be asked about this data? For example, do you see a seasonality effect? ♾️
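To make the scope concrete, here is a minimal pandas/matplotlib sketch of one way the first three questions could be approached. The file name `transactions.csv` and the column names `customer_id`, `product_id`, and `timestamp` are assumptions about how you load the data, and since the data set carries no price column, "sales" is read here as transaction volume.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed schema: one row per transaction, with hypothetical column names.
df = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Q1: total transactions per customer, most to least active.
per_customer = df["customer_id"].value_counts()  # sorted descending by default
per_customer.plot(kind="bar", title="Transactions per customer")
plt.show()

# Q2: monthly transaction frequency of a given product in 2018.
def monthly_frequency(df: pd.DataFrame, product_id, year: int = 2018) -> pd.Series:
    sub = df[(df["product_id"] == product_id) & (df["timestamp"].dt.year == year)]
    return (sub.groupby(sub["timestamp"].dt.month).size()
               .reindex(range(1, 13), fill_value=0))  # keep empty months as 0

# Q3: top 5 products over the six months preceding `as_of`.
# With no price column, "sales" is approximated by transaction volume.
def top_products(df: pd.DataFrame, as_of: pd.Timestamp, n: int = 5) -> pd.Series:
    window = df[(df["timestamp"] > as_of - pd.DateOffset(months=6))
                & (df["timestamp"] <= as_of)]
    return window["product_id"].value_counts().head(n)
```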
Attention to good coding practices and style will be noted. Also, feel free to implement more than one model and compare their performances or create an ensemble.
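To illustrate the kind of time-based split and baseline comparison we mean (a sketch, not a required solution), here is one rough approach built on per-customer monthly counts. It reuses the assumed `df` from the sketch above and presumes at least eight months of history.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Per-customer monthly transaction counts (rows: customers, columns: months).
monthly = (df.groupby(["customer_id", df["timestamp"].dt.to_period("M")])
             .size()
             .unstack(fill_value=0))

# Time-based split: learn to predict month T-1 from the six months before it,
# then validate by predicting the final month T the same way.
X_train, y_train = monthly.iloc[:, -8:-2].to_numpy(), monthly.iloc[:, -2]
X_val, y_val = monthly.iloc[:, -7:-1].to_numpy(), monthly.iloc[:, -1]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print(f"model MAE: {mean_absolute_error(y_val, model.predict(X_val)):.2f}")

# A naive baseline to beat: "next month looks like last month".
print(f"naive MAE: {mean_absolute_error(y_val, monthly.iloc[:, -2]):.2f}")
```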
The purpose of this test is to assess your problem-solving skills, creativity, and experience in tackling a data science question, as well as your style of communicating the results. It is completely up to you which data processing framework, machine learning library, visualization tool, etc. you use. However, please explain why you decided to go with your chosen tools for this mini project.
We don't expect extensive/fancy documentation or diagrams. A neat README.md along with clear inline comments is enough. Moreover, you don't need to finish the assignment 100% or perfect your code before posting it. Partial solutions will be reviewed as well. That said, if you're completely happy with the foundations of your work and want to show off your engineering skills, feel free to design an API and even build a web app around it (bonus).
- If using Python, feel free to use Jupyter notebooks as a dev environment, but also make sure to include production-ready code as .py files along with the model serialization artifacts.
- Make sure that your code is easily reproducible. For Python, we prefer to manage dependencies with Poetry.
- Create a separate file for loading the saved model and evaluating the results locally. Include clear inline instructions (a minimal sketch follows this list).
- Code must be forked and (temporarily) pushed to your personal GitHub.
- (Optional) Create live model logs with wandb and share with us.
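For orientation, the evaluation file can be as small as the sketch below. The artifact name `model.joblib`, the `holdout.csv` file, and its `target` column are placeholders for whatever your pipeline actually produces.

```python
# evaluate.py: load the serialized model and score it locally.
# Run with: poetry run python evaluate.py
import joblib
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Placeholder artifact names; point them at whatever your training
# pipeline produces (e.g. joblib.dump(model, "model.joblib")).
model = joblib.load("model.joblib")
holdout = pd.read_csv("holdout.csv")  # assumed: feature columns + a `target` column
X, y = holdout.drop(columns="target"), holdout["target"]
print("MAE on holdout:", mean_absolute_error(y, model.predict(X)))
```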
This mini-assignment is carefully designed to stretch your skills beyond everyday data science tasks. The reason: first and foremost, we are looking for great software crafts(wo)men. At DataChef, you need to be ready for challenges beyond boring data plumbing tasks, ready to make sense of (sometimes vague) requirements and find your way through the complexities.
We hope that you'll enjoy this challenge. Good luck! 👩‍💻✌️👨‍💻