Authors: F. Abel, D. Kohlsdorf, R. Pálovics
In the challenge, the task of the participants will be the following: given a XING user, the recommender should predict those job postings (items) that the user will interact with in the next week.
The training dataset is intended for experimenting and training your models. You can split the
interaction data into training and test data for the purpose of evaluating your algorithms
during development. For example: you can leave out the last complete week (of the year) from
the interaction data and then try to predict whether
a given user will positively interact with an item within that week. Relevant items
are those items on which a user clicked, bookmarked or replied (interaction_type
= 1, 2 or 3).
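The hold-out split described above could be sketched as follows. This is a minimal illustration, not the official evaluation procedure. It assumes the interactions have already been loaded as (user_id, item_id, interaction_type, created_at) tuples, and it approximates "the last week" as the final seven days before the newest timestamp rather than a calendar week. The name split_last_week is hypothetical.

```python
from collections import defaultdict

WEEK_SECONDS = 7 * 24 * 3600
POSITIVE_TYPES = {1, 2, 3}  # click, bookmark, reply


def split_last_week(interactions):
    """Split (user_id, item_id, interaction_type, created_at) tuples.

    Returns (train, relevant): train holds every interaction up to the
    cutoff; relevant maps each user to the set of items they positively
    interacted with during the held-out final seven days.
    """
    # Assumption: "last week" = the seven days before the newest timestamp.
    cutoff = max(ts for _, _, _, ts in interactions) - WEEK_SECONDS
    train = [row for row in interactions if row[3] <= cutoff]
    relevant = defaultdict(set)
    for user_id, item_id, interaction_type, ts in interactions:
        if ts > cutoff and interaction_type in POSITIVE_TYPES:
            relevant[user_id].add(item_id)
    return train, dict(relevant)
```

A recommender trained on train can then be scored per user against relevant. Users without positive interactions in the held-out window do not appear in relevant.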
The easiest way to test how your algorithm is performing is to submit your solution via the
submission system.
The training dataset is a semi-synthetic sample of XING data. The dataset is designed to retain information that is useful for you in creating effective algorithms that address the challenge, while at the same time protecting the privacy of XING users. The dataset is "semi-synthetic" in that it is enriched with artificial users whose presence contributes to the anonymization.
- the dataset contains artificial users
- the dataset contains only a fraction of XING users and job postings
- IDs are used instead of raw text for almost all attribute values (pseudonymization)
- some attributes of the users may have been removed or flipped to NULL / unknown.
- not all interactions of a user are contained in the dataset
- some of the interactions are artificial (= have actually not been performed by the user)
- timestamps have been shifted (but the order of interactions is kept)
Attempting to identify users, or to reveal any private information about the users or about the business from which the data comes, is forbidden (cf. Rules).
Your algorithm should not attempt to identify artificial users or to reconstruct flipped values. The training set and the test methodology are designed so that such approaches would not offer you an advantage. In fact, artificial users and interactions are also part of the ground truth against which your solution will be evaluated.
Which items were shown by the existing XING job recommender to which user in which week of the year. Only a subset of the impressions that were generated by XING's job recommender are considered: a fraction of the impressions on the Web (start-page and xing.com/jobs), some for mobile, none for emails. For those impressions there is no guarantee that the item was in the viewport of the user. Fields:
- user_id: ID of the user (points to users.id)
- year: the year in which the items were shown
- week: the week of that year
- items: comma-separated list (a list, not a set) of the items that were displayed to the user (points to items.id)
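Because the items field is a list rather than a set, the same item may appear more than once in a record. The sketch below shows one way to parse the field while keeping duplicates. The helper names parse_items and impression_counts are hypothetical, and the code assumes the field has already been extracted from the file as a string.

```python
from collections import Counter


def parse_items(field):
    """Parse the comma-separated items field into a list of item IDs.

    Duplicates are kept on purpose, because the field is a list, not a set.
    """
    return [int(token) for token in field.split(",") if token.strip()]


def impression_counts(field):
    """Count how often each item was displayed within one record."""
    return Counter(parse_items(field))


print(parse_items("5,7,5"))           # [5, 7, 5]
print(impression_counts("5,7,5")[5])  # 2
```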
Interactions that the user performed on the job posting items. Fields:
- user_id: ID of the user who performed the interaction (points to users.id)
- item_id: ID of the item on which the interaction was performed (points to items.id)
- interaction_type: the type of interaction that was performed on the item:
  - 1 = the user clicked on the item
  - 2 = the user bookmarked the item on XING
  - 3 = the user clicked on the reply button or application form button that is shown on some job postings
  - 4 = the user deleted a recommendation from his/her list of recommendations (by clicking on "x"), with the effect that the recommendation is no longer shown to the user and a new recommendation item is loaded and displayed to the user
- created_at: a Unix timestamp representing the time when the interaction was created
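To relate interactions to the (year, week) pairs used in the impressions data, the created_at timestamp can be converted into a year and week number. The sketch below assumes timestamps are in seconds, interprets them in UTC, and uses ISO 8601 week numbering. None of this is stated in the dataset description, so treat the result as an approximation. The function name year_and_week is hypothetical.

```python
from datetime import datetime, timezone


def year_and_week(created_at):
    """Return (ISO year, ISO week) for a Unix timestamp in seconds (UTC)."""
    iso = datetime.fromtimestamp(created_at, tz=timezone.utc).isocalendar()
    return iso[0], iso[1]


print(year_and_week(1420070400))  # 2015-01-01 00:00 UTC -> (2015, 1)
```

Since the timestamps have been shifted, only relative comparisons between weeks are meaningful.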
Details about those users who appear in the above datasets. Fields:
- id: anonymized ID of the user (referenced as user_id in the other datasets above)
- jobroles: comma-separated list of job role terms (numeric IDs) that were extracted from the user's current job title. 0 means that no known job role was detected for the user.
- career_level: career level ID (e.g. beginner, experienced, manager):
  - 0 = unknown
  - 1 = Student/Intern
  - 2 = Entry Level (Beginner)
  - 3 = Professional/Experienced
  - 4 = Manager (Manager/Supervisor)
  - 5 = Executive (VP, SVP, etc.)
  - 6 = Senior Executive (CEO, CFO, President)
- discipline_id: anonymized IDs represent disciplines such as "Consulting", "HR", etc.
- industry_id: anonymized IDs represent industries such as "Internet", "Automotive", "Finance", etc.
- country: describes the country in which the user is currently working:
  - de = Germany
  - at = Austria
  - ch = Switzerland
  - non_dach = none of the above countries
- region: is specified for some users who have de as country. Meaning of the regions: see below.
- experience_n_entries_class: identifies the number of CV entries that the user has listed as work experiences:
  - 0 = no entries
  - 1 = 1-2 entries
  - 2 = 3-4 entries
  - 3 = 5 or more entries
- experience_years_experience: the estimated number of years of work experience that the user has:
  - 0 = unknown
  - 1 = less than 1 year
  - 2 = 1-3 years
  - 3 = 3-5 years
  - 4 = 5-10 years
  - 5 = 10-15 years
  - 6 = 16-20 years
  - 7 = more than 20 years
- experience_years_in_current: the estimated number of years that the user has already been working in her current job. Meaning of numbers: same as experience_years_experience.
- edu_degree: estimated university degree of the user:
  - 0 or NULL = unknown
  - 1 = bachelor
  - 2 = master
  - 3 = phd
- edu_fieldofstudies: comma-separated fields of studies that the user studied. 0 means "unknown"; values greater than 0 refer to broad fields of study such as Engineering, Economics and Legal, ...
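The user attributes above mix comma-separated ID lists with small categorical codes in which 0 (or NULL) stands for "unknown". A minimal decoding sketch follows. It assumes each attribute is available as a string, and that a missing degree appears as an empty string or the literal text NULL, which the description does not specify. The helper names are hypothetical.

```python
CAREER_LEVELS = {
    0: "unknown",
    1: "Student/Intern",
    2: "Entry Level (Beginner)",
    3: "Professional/Experienced",
    4: "Manager (Manager/Supervisor)",
    5: "Executive (VP, SVP, etc.)",
    6: "Senior Executive (CEO, CFO, President)",
}


def parse_id_list(field):
    """Parse a comma-separated ID field, dropping 0 (= unknown / none detected)."""
    ids = [int(token) for token in field.split(",") if token.strip()]
    return [i for i in ids if i != 0]


def parse_edu_degree(field):
    """Map edu_degree to an int, treating NULL or an empty value as 0 (unknown)."""
    # Assumption: NULL is serialized as an empty string or the text "NULL".
    if field is None or field.strip() in ("", "NULL"):
        return 0
    return int(field)
```

For example, parse_id_list("12,0,7") yields [12, 7], and parse_edu_degree("NULL") yields 0. The same parse_id_list helper would apply to jobroles and edu_fieldofstudies.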
Details about the job postings that were and should be recommended to the users.
- id: anonymized ID of the item (referenced as item_id in the other datasets above)
- title: concepts that have been extracted from the job title of the job posting (numeric IDs)
- career_level: career level ID (e.g. beginner, experienced, manager):
  - 0 = unknown
  - 1 = Student/Intern
  - 2 = Entry Level (Beginner)
  - 3 = Professional/Experienced
  - 4 = Manager (Manager/Supervisor)
  - 5 = Executive (VP, SVP, etc.)
  - 6 = Senior Executive (CEO, CFO, President)
- discipline_id: anonymized IDs represent disciplines such as "Consulting", "HR", etc.
- industry_id: anonymized IDs represent industries such as "Internet", "Automotive", "Finance", etc.
- country: code of the country in which the job is offered
- region: is specified for some items that have de as country. Meaning of the regions: see below.
- latitude: latitude information (rounded to ca. 10km)
- longitude: longitude information (rounded to ca. 10km)
- employment: the type of employment:
  - 0 = unknown
  - 1 = full-time
  - 2 = part-time
  - 3 = freelancer
  - 4 = intern
  - 5 = voluntary
- tags: concepts that have been extracted from the tags, skills or company name
- created_at: a Unix timestamp representing the time when the item was created
- active_during_test: is 1 if the item is still active (= recommendable) during the test period and 0 if the item is no longer active in the test period (= not recommendable)
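Since only items with active_during_test = 1 are recommendable in the test period, a candidate set can be restricted to them before ranking. The sketch below assumes the items have been loaded as dictionaries keyed by the field names above (for example via csv.DictReader). The function name recommendable_items is hypothetical.

```python
def recommendable_items(items):
    """Return the IDs of items flagged as active during the test period."""
    return {
        int(item["id"])
        for item in items
        if int(item["active_during_test"]) == 1
    }


items = [
    {"id": "101", "active_during_test": "1"},
    {"id": "102", "active_during_test": "0"},
]
print(recommendable_items(items))  # {101}
```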
ID | Name |
---|---|
0 | not specified |
1 | Baden-Württemberg |
2 | Bavaria |
3 | Berlin |
4 | Brandenburg |
5 | Bremen |
6 | Hamburg |
7 | Hesse |
8 | Mecklenburg-Vorpommern |
9 | Lower Saxony |
10 | North Rhine-Westphalia |
11 | Rhineland-Palatinate |
12 | Saarland |
13 | Saxony |
14 | Saxony-Anhalt |
15 | Schleswig-Holstein |
16 | Thuringia |
The file target_users.csv contains those user IDs for which you finally need to submit solutions. The file lists one user ID per line (in total, 150,000 user IDs). All those target users are also contained in the training data (see User).
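Reading the target users could look like the sketch below. It takes any iterable of lines, such as an open file handle on target_users.csv. Skipping non-numeric lines is an assumption made here in case the file carries a header row, which the description does not state. The function name read_target_users is hypothetical.

```python
def read_target_users(lines):
    """Read one user ID per line, skipping blank and non-numeric lines."""
    users = []
    for line in lines:
        token = line.strip()
        if token.isdigit():
            users.append(int(token))
    return users


print(read_target_users(["user_id\n", "42\n", "\n", "7\n"]))  # [42, 7]
```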
The file solution_file_example.tgz is an example solution file that was generated by a simple content-based baseline algorithm.