Reddit is a social news aggregation and discussion website. Reddit's registered community members can submit content such as text posts or direct links.
Content entries are organized by areas of interest called "Subreddits". Subreddit topics include news, science, gaming, movies, music, books, fitness, food, image-sharing and many others.
Provided here is a dataset of 25,000 Reddit users, their interactions with subreddits, and the timestamps of those interactions.
The goal of this assignment is to recommend subreddits that each user should subscribe to. Submit the output in CSV format, with one row per username and the subreddits in the columns. Document your approaches and how you arrived at the final solution.
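A first look at the data could start with a few size counts. The sketch below assumes each record is a `(username, subreddit, timestamp)` tuple; that layout, the sample rows, and the `summarize` helper are illustrative assumptions, not the actual file schema.

```python
from collections import Counter

# Hypothetical record layout: (username, subreddit, unix_timestamp).
# These rows are made up for illustration only.
sample_log = [
    ("alice", "aww", 1482748800),
    ("alice", "science", 1482752400),
    ("bob", "aww", 1482756000),
    ("carol", "gaming", 1482759600),
    ("carol", "aww", 1482763200),
]

def summarize(log):
    """Count interactions, distinct users, distinct subreddits, and the busiest subreddit."""
    users = {user for user, _, _ in log}
    subs = Counter(sub for _, sub, _ in log)
    return {
        "n_interactions": len(log),
        "n_users": len(users),
        "n_subreddits": len(subs),
        "most_active": subs.most_common(1)[0],
    }

summary = summarize(sample_log)
print(summary)
```

Counts like these give the total number of data points and show how unevenly activity is spread across subreddits.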
For example, the subreddits recommended for the user "kabanossi" include "AnimalGIFs", "AnimalBehavior", "tinyanimalsonfingers", and "whatisthisanimal".
Do not use an RNN (Recurrent Neural Network) language model.
You will be evaluated on the accuracy of the output, documentation, code quality, code documentation, and your explanation of how you reached the solution.
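One possible non-RNN baseline is item-based collaborative filtering. The sketch below is a minimal illustration, not a prescribed solution: it treats interactions as binary, scores unseen subreddits by cosine similarity to the ones a user already visits, and writes one row per user. The `recommend` and `to_csv` helpers, the `rec_1`, `rec_2` column names, and the sample data are all assumptions; the required CSV layout may differ.

```python
import csv
import io
import math
from collections import defaultdict

def recommend(interactions, k=2):
    """Item-based collaborative filtering on binary user/subreddit data.

    interactions: iterable of (username, subreddit) pairs.
    Returns {username: [up to k subreddits the user has not visited]}.
    """
    user_subs = defaultdict(set)
    sub_users = defaultdict(set)
    for user, sub in interactions:
        user_subs[user].add(sub)
        sub_users[sub].add(user)

    def cosine(a, b):
        # Cosine similarity of two binary vectors stored as sets of users.
        shared = len(sub_users[a] & sub_users[b])
        return shared / math.sqrt(len(sub_users[a]) * len(sub_users[b]))

    recs = {}
    for user, seen in user_subs.items():
        scores = {
            cand: sum(cosine(cand, s) for s in seen)
            for cand in sub_users if cand not in seen
        }
        # Sort by score descending, then by name so ties are deterministic.
        ranked = sorted(scores, key=lambda s: (-scores[s], s))
        recs[user] = [s for s in ranked if scores[s] > 0][:k]
    return recs

def to_csv(recs, k=2):
    """Render one row per user; the columns hold the recommended subreddits."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["username"] + [f"rec_{i + 1}" for i in range(k)])
    for user in sorted(recs):
        padded = recs[user] + [""] * (k - len(recs[user]))
        writer.writerow([user] + padded)
    return buf.getvalue()

# Made-up sample interactions for illustration.
sample = [
    ("alice", "aww"), ("alice", "AnimalGIFs"),
    ("bob", "aww"), ("bob", "AnimalGIFs"), ("bob", "whatisthisanimal"),
    ("carol", "aww"), ("carol", "gaming"),
]
recs = recommend(sample)
csv_text = to_csv(recs)
print(csv_text)
```

On the real data, the pairwise similarity loop would likely need a sparse-matrix implementation to scale, and interaction counts or recency from the timestamps could replace the binary signal.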
Student response addresses the most important characteristics of the dataset and uses these characteristics to inform their analysis. Important characteristics include:
- Total number of data points
- Allocation across classes (POI/non-POI)
- Number of features used
- Whether any features have many missing values, etc.
- Student response identifies outlier(s) in the data, and explains how they are removed or otherwise handled.
- At least one new feature is implemented. Justification for that feature is provided in the written response. The effect of that feature on final algorithm performance is tested or its strength is compared to other features in feature selection. The student is not required to include their new feature in their final feature set.
- Univariate or recursive feature selection is deployed, or features are selected by hand (different combinations of features are attempted, and the performance is documented for each one). Features that are selected are reported and the number of features selected is justified. For an algorithm that supports getting the feature importances (e.g. decision tree) or feature scores (e.g. SelectKBest), those are documented as well.
- If the algorithm calls for scaled features, feature scaling is deployed.
- At least two different algorithms are attempted and their performance is compared, with the best performing one used in the final analysis.
- Response addresses what it means to perform parameter tuning and why it is important.
At least one important parameter is tuned, with at least 3 settings investigated systematically, or any of the following are true:
- GridSearchCV used for parameter tuning
- Several parameters tuned
- Parameter tuning incorporated into algorithm selection (i.e. parameters tuned for more than one algorithm, and best algorithm-tune combination selected for final analysis).
- At least two appropriate metrics are used to evaluate algorithm performance (e.g. precision and recall), and the student articulates what those metrics measure in the context of the project task.
- Response addresses what validation is and why it is important.
- Performance of the final algorithm selected is assessed by splitting the data into training and testing sets or through the use of cross-validation, noting the specific type of validation performed.
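For a recommendation task, the metric and validation points above could be applied as precision and recall over the top-k recommendations, validated on a time-based holdout. The following is a minimal sketch under assumptions: the `(username, subreddit, timestamp)` row layout, the cutoff value, and the model output list are all hypothetical.

```python
def temporal_split(log, cutoff):
    """Split (username, subreddit, timestamp) rows into train/test at a timestamp cutoff."""
    train = [row for row in log if row[2] < cutoff]
    test = [row for row in log if row[2] >= cutoff]
    return train, test

def precision_recall_at_k(recommended, held_out, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of subreddits produced by a model.
    held_out: set of subreddits the user actually visited in the test period.
    """
    top_k = recommended[:k]
    hits = sum(1 for sub in top_k if sub in held_out)
    precision = hits / k if k else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

# Made-up interaction log for one user.
log = [
    ("alice", "aww", 100), ("alice", "science", 200),
    ("alice", "AnimalGIFs", 300), ("alice", "gaming", 400),
]
train, test = temporal_split(log, cutoff=250)
held_out = {sub for _, sub, _ in test}

# Hypothetical ranked output of a model trained only on `train`.
model_output = ["AnimalGIFs", "books", "movies", "gaming"]
precision, recall = precision_recall_at_k(model_output, held_out, k=2)
print(precision, recall)
```

Here precision@k is the share of the top-k recommendations that the user later visited, and recall@k is the share of the user's later subreddits that appeared in the top k. Splitting by time rather than at random avoids training on interactions that happened after the ones being predicted.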
- Harishkandan
- Nikhil Borkar
- Pankaj Meher
- Nikita Goel
- Mudassir Khan
- Biraj Parikh
- Nikhil Singh
- Ronak Talreja
- Varun Panicker
- Bhavesh Bhatt
- Gowtham Dongiri
- Pramod Bhalerao
- Saahil Sharma
- Arunabh Singh
- Ramkrishna Sahu
- Sagar Ambalam
- Bijit Deka