Hiring Hackathon : Webpage Classification
Website: https://datahack.analyticsvidhya.com/contest/innoplexus-online-hiring-hackathon-ai-challenge/
Notes to self: This is my 2nd hackathon on AV. I couldn't even submit in my 1st one, the McKinsey hackathon, where (I think) the insurance renewal commission had to be predicted. I couldn't understand the problem statement :(
Got rank 133 (I missed marking a "final submission") with a prediction score as poor as 0.3040729980. Of the total X registrations, 188 submitted. A simple hack worked. Time was up before I could submit a better solution; my poor time management is to blame.
The highest score on the leaderboard was 0.9229874124.
The lowest score on the leaderboard was 0 :)
The zipped dataset is about 1.3x GB. It is stored in your Microsoft OneDrive, which you can access with your Gmail ID.
Classification
Classification of Web page content is vital to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification compared to traditional text classification; however, the interconnected nature of hypertext also provides features that can assist the process.
Here the task is to classify each web page into the class it belongs to, in a single-label classification setup (each web page can belong to only 1 class).
Basically, given the complete HTML and URL, predict the tag a web page belongs to out of the 9 predefined tags given below:
- People profile
- Conferences/Congress
- Forums
- News article
- Clinical trials
- Publication
- Thesis
- Guidelines
- Others
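Since the input is raw HTML, one possible first step is to reduce each page to its visible text before building features. The block below is only a minimal sketch using Python's standard `html.parser`; the class name `TextExtractor` and the helper `html_to_text` are hypothetical names chosen for illustration, and this is not necessarily what any contest solution did.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page, only for illustration.
sample = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Trial</h1><p>Phase 2</p></body></html>"
)
text = html_to_text(sample)
```

The resulting text (together with tokens from the URL) could then be fed to any text classifier of your choice.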
train.zip contains 2 CSVs:

- train.csv: Train set

Variable | Definition |
---|---|
Webpage_id | Unique ID for the Web page |
Domain | Domain |
Url | Complete Url |
Tag | (Target) Tag (Class) of the Web page |

- html_data.csv: Contains web page data in HTML for both train and test web pages

Variable | Definition |
---|---|
Webpage_id | Unique ID for the Web page |
Html | Web page data in HTML |

test.csv: Test Set
Variable | Definition |
---|---|
Webpage_id | Unique ID for the Web page |
Domain | Domain |
Url | Complete Url |
sample_submission.csv: Submission format
Variable | Definition |
---|---|
Webpage_id | Unique ID for the Web page |
Tag | (Target) Tag (Class) of the Web page |
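
Because the HTML lives in a separate file, the metadata rows have to be joined with html_data.csv on `Webpage_id` before modelling. The following is a rough sketch with the standard `csv` module; the helper name `join_html` and the in-memory sample rows are made up for illustration, and the real files would be opened from disk instead.

```python
import csv
import io


def join_html(meta_rows, html_rows):
    """Attach the Html column to each metadata row, matching on Webpage_id."""
    html_by_id = {row["Webpage_id"]: row["Html"] for row in html_rows}
    # Rows without a matching HTML entry get an empty string.
    return [dict(row, Html=html_by_id.get(row["Webpage_id"], "")) for row in meta_rows]


# Tiny made-up stand-ins for train.csv and html_data.csv (not contest data).
train_csv = io.StringIO(
    "Webpage_id,Domain,Url,Tag\n"
    "1,example.org,http://example.org/a,News article\n"
    "2,example.com,http://example.com/b,Forums\n"
)
html_csv = io.StringIO(
    "Webpage_id,Html\n"
    "1,<html>a</html>\n"
    "2,<html>b</html>\n"
    "3,<html>c</html>\n"
)

train = join_html(csv.DictReader(train_csv), csv.DictReader(html_csv))
```

With the real html_data.csv, individual HTML fields may well exceed the `csv` module's default field size limit, so a call to `csv.field_size_limit` with a larger value would probably be needed, and holding every page in one dictionary may not fit in memory.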
The train and test data split is done based on the Domain-Tag combination. For example, suppose we want to split the following sample of 16 URLs into train and test sets.
- First, the overall dataset is split into subsets by Tag as shown below:
- Now, for each subset (Tag), we store all unique domains and randomly shuffle them, so in this case let's say we have:
- Next, every third domain (3rd, 6th, 9th and so on) in the all-domain sequence is assigned to test, and the rest (1st, 2nd, 4th, 5th, 7th and so on) are assigned to train, as shown in the following table:
- Final train and test set would be:
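
The splitting steps above can be sketched roughly as follows. This is only an illustration, not the organisers' code: the function name `split_by_domain_tag`, the `seed` parameter, and the sample rows are all assumptions, and the sketch assumes that the "every third domain" count restarts within each Tag (the description could also be read as counting over one concatenated sequence).

```python
import random
from collections import defaultdict


def split_by_domain_tag(rows, seed=0):
    """Split rows so that every third shuffled domain of each Tag goes to test."""
    rng = random.Random(seed)

    # Step 1 and 2: per Tag, collect unique domains (dict keys keep first-seen order).
    domains_by_tag = defaultdict(dict)
    for row in rows:
        domains_by_tag[row["Tag"]][row["Domain"]] = None

    # Step 3: shuffle each Tag's domains and send the 3rd, 6th, 9th, ... to test.
    test_pairs = set()
    for tag, domain_keys in domains_by_tag.items():
        domains = list(domain_keys)
        rng.shuffle(domains)
        for position, domain in enumerate(domains, start=1):
            if position % 3 == 0:
                test_pairs.add((domain, tag))

    # Step 4: route every URL according to its (Domain, Tag) pair.
    train = [r for r in rows if (r["Domain"], r["Tag"]) not in test_pairs]
    test = [r for r in rows if (r["Domain"], r["Tag"]) in test_pairs]
    return train, test


# Made-up sample: 6 domains under one Tag, 2 URLs per domain.
rows = [
    {"Domain": f"d{i}.org", "Url": f"http://d{i}.org/{j}", "Tag": "Forums"}
    for i in range(6)
    for j in range(2)
]
train_rows, test_rows = split_by_domain_tag(rows)
```

One consequence of this scheme is that, for a given Tag, a domain seen in test never appears in train, so a model that merely memorises domains will not generalise to the test set.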
The evaluation metric for this competition is weighted F1 score.
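
For local validation, weighted F1 can be computed as the per-class F1 averaged with weights proportional to each class's count in the true labels. The sketch below is a plain-Python version under that definition; the function name `weighted_f1` and the toy labels are illustrative, and the contest's exact scoring code is not known.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)


# Toy labels, only for illustration.
y_true = ["News article", "News article", "Forums", "Thesis"]
y_pred = ["News article", "Forums", "Forums", "Thesis"]
score = weighted_f1(y_true, y_pred)  # 0.75 for this toy example
```

Classes that appear only in the predictions have zero true count and therefore contribute nothing to the average, although they still lower the F1 of the classes they were confused with.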
Test data is further randomly divided into Public (40%) and Private (60%) data.
- Your initial responses will be checked and scored on the Public data.
- The final rankings would be based on your private score which will be published once the competition is over.
- Entries submitted after the contest is closed will not be considered.
- Since this is a hiring hack, you are expected to solve the problem on your own.
- Use of external dataset is strictly prohibited.
- Use of Webpage_id as a feature is not allowed.
- Participation is free-of-charge.
- Participant must update their profile details and upload their latest CV.
The profile of the user as updated at time of registering for the contest along with their CV and Analytics Vidhya profile will be shared with the sponsor of the hackathon for purposes of hiring.
- You are free to use solution checker as many times as you want.
- Adding comments is mandatory for the use of solution checker
- Comments will help you to refer to a particular solution at a later point in time.
- Setting final submission is mandatory. Without a final submission, your entry will not be considered.
- Code file is mandatory while sending final submission. For GUI based tools, please upload a zip file of snapshots of steps taken by you, else upload code file.
- The code file uploaded should be pertaining to your final submission.
- Throughout the hackathon, you are expected to respect fellow hackers and act with high integrity.
- Slack Live Chat admins hold the right to block any participant found to use foul / disrespectful language.
- Analytics Vidhya and Innoplexus hold the right to disqualify any participant at any stage of competition, if participant(s) are deemed to be acting fraudulently.
Note: The datasets in this competition are meant to be used solely for this competition. You cannot use them for any other purpose.