Feature/two_stage #209
base: experimental/two_stage
Conversation
Two-stage draft
Co-authored-by: Daria <[email protected]>
This reverts commit aa12b6f.
This reverts commit 4064cbd.
Feature/reranker
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@           Coverage Diff            @@
##    experimental/two_stage    #209   +/-   ##
=========================================
  Coverage          ?   99.57%
=========================================
  Files             ?       58
  Lines             ?     3986
  Branches          ?        0
=========================================
  Hits              ?     3969
  Misses            ?       17
  Partials          ?        0
=========================================

☔ View full report in Codecov by Sentry.
{file = "catboost-1.2.7.tar.gz", hash = "sha256:3ed1658bd22c250a12f9c55cf238d654d7a87d9b45f063ec39965a8884a7e9d3"}, | ||
] | ||
|
||
[package.dependencies] |
I know I said we could easily put catboost into dependencies, but I didn't realize it brings in so many dependencies.
graphviz, matplotlib and plotly are not really needed for rectools, and the numpy<2.0 restriction isn't great either, since we're planning to update numpy.
So I'm thinking it may be a good idea to move catboost to extras. wdyt?
@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    def fit(self, *args: tp.Any, **kwargs: tp.Any) -> None: ...
- Models usually return themselves after fit
- Let's make x and y explicit arguments, otherwise we can't check them ourselves. Same for the other methods.
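A minimal sketch of the suggested signature, returning self and taking explicit x/y (np.ndarray here is an assumption, the PR may want a broader type):

import typing as tp

import numpy as np

@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    # Explicit x/y lets type checkers verify call sites; returning self matches the sklearn/CatBoost convention
    def fit(self, x: np.ndarray, y: np.ndarray) -> "ClassifierBase": ...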
class CatBoostReranker(Reranker):
    def __init__(
        self,
        model: tp.Union[CatBoostClassifier, CatBoostRanker] = CatBoostRanker(verbose=False),
It's weird that linters are silent here, but setting a mutable object as a default value is a really bad idea.
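A common fix is a None default resolved inside the constructor; a sketch, assuming the constructor only needs to store the model:

def __init__(
    self,
    model: tp.Optional[tp.Union[CatBoostClassifier, CatBoostRanker]] = None,
) -> None:
    # Instantiate the default per instance instead of sharing one mutable CatBoostRanker between all rerankers
    self.model = model if model is not None else CatBoostRanker(verbose=False)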
        self.model.fit(**fit_kwargs)

    def rerank(self, candidates: pd.DataFrame) -> pd.DataFrame:
        reco = candidates[Columns.UserItem].copy()
            reco[Columns.Score] = self.model.predict(x_full)
        else:
            raise ValueError("Got unexpected model_type")
        reco.sort_values(by=[Columns.User, Columns.Score], ascending=False, inplace=True)
I think it'd be good to keep the order of users. At least we keep it in other models afaik
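One way to sort scores within each user while preserving the original order of users, sketched with a hypothetical temporary column:

order = reco.groupby(Columns.User, sort=False).ngroup()  # group number in order of first appearance
reco = (
    reco.assign(__user_order=order)
    .sort_values(["__user_order", Columns.Score], ascending=[True, False])
    .drop(columns="__user_order")
)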
group_ids = candidates_with_target[Columns.User].values
candidates_with_target = candidates_with_target.drop(columns=Columns.UserItem)
pool_kwargs = {
    "data": candidates_with_target.drop(columns=Columns.Target),
same here
res = useritem
res = res.merge(user_features, on=Columns.User, how="left")
res = res.merge(item_features, on=Columns.Item, how="left")
res = res.merge(useritem_features, on=Columns.UserItem, how="left")
It's better to use a chain of operations, otherwise by this point you have 4 dataframes instead of 1, all of them in memory.
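The chained version could look like this (the same merges as above, just without keeping intermediate names alive):

res = (
    useritem
    .merge(user_features, on=Columns.User, how="left")
    .merge(item_features, on=Columns.Item, how="left")
    .merge(useritem_features, on=Columns.UserItem, how="left")
)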
def sample_negatives(self, train: pd.DataFrame) -> pd.DataFrame:
    # train: user_id, item_id, scores, ranks, target(1/0)

    negative_mask = train["target"] == 0
Columns.Target?
sampling_mask = train[Columns.User].isin(num_negatives[num_negatives > self.num_neg_samples].index)

neg_for_sample = train[sampling_mask & negative_mask]
neg = neg_for_sample.groupby([Columns.User], sort=False).apply(
I see it's experimental, but I don't see a TODO, so I'll just remind you that apply is super slow and we're not using it. I'm ok with a TODO comment here if you don't want to fix it now.
A possible solution (maybe not the best one, but without using apply; a sketch follows below):
- After calculating num_negatives, also calculate the sampling ratio as minimum(self.num_neg_samples / num_negatives, 1)
- Convert it to a series with user_id as index and map it onto the neg dataset
- Add a column with random values in [0, 1)
- Filter out rows with random values > sampling ratio

You also don't need a sampling mask here. And you can avoid splitting into pos and neg parts by simply putting 1 as the sampling ratio for positives. This also means you don't need to do shuffling at the end. And I really don't like this shuffling, because:
- It's a heavy operation
- You have a sample method in the Sampler class and it's quite not obvious that you shuffle the data inside
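A minimal sketch of this approach (a replacement body for sample_negatives above; Columns.Target and self.num_neg_samples follow the PR's code, the random generator choice is an assumption):

import numpy as np
import pandas as pd

from rectools import Columns

def sample_negatives(self, train: pd.DataFrame) -> pd.DataFrame:
    is_negative = train[Columns.Target] == 0
    # Number of negatives per user
    num_negatives = train[is_negative].groupby(Columns.User, sort=False).size()
    # Keep all negatives for users that have <= num_neg_samples of them, sample the rest down
    ratio = np.minimum(self.num_neg_samples / num_negatives, 1.0)
    # Positives always get ratio 1, so no pos/neg split and no final shuffling are needed
    row_ratio = train[Columns.User].map(ratio).where(is_negative, 1.0)
    rng = np.random.default_rng()
    return train[rng.random(len(train)) < row_ratio.to_numpy()]

Note that this keeps approximately (not exactly) num_neg_samples negatives per user, which also changes what the tests below can assert.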
@attr.s(auto_attribs=True)
class PerUserNegativeSampler(NegativeSamplerBase):
    num_neg_samples: int = 3
n_neg_samples?
for user_id in sampled_df[Columns.User].unique():
    user_data = sampled_df[sampled_df[Columns.User] == user_id]
    num_negatives = len(user_data[user_data[Columns.Target] == 0])
    assert num_negatives == num_neg_samples
Suggested change:
n_negatives_per_user = sampled_df.groupby(Columns.User)[Columns.Target].agg(lambda target: (target == 0).sum())
assert (n_negatives_per_user == num_neg_samples).all()
It's not about speed optimisation, of course. I suggest it because:
- Having multiple same-kind asserts in one test is not a good idea, since it's harder to debug. For such cases we usually use parametrize or subtests. Here it can easily be replaced with 1 assert.
- It's easier to read.
assert set(sampled_df.columns) == set(sample_data.columns)

# Check if the number of negatives per user is correct
for user_id in sampled_df[Columns.User].unique():
Same here, but instead of checking .all() you can do assert n_negatives_per_user.tolist() == [2, 3, 3]
generator = CandidateGenerator(model, 2, False, False)

with pytest.raises(NotFittedError):
    generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
Please do it with parametrize or subtests.
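A possible shape, assuming the not-fitted check should fire for both for_train modes (the fixture names model, users, dataset and the test name are hypothetical):

import pytest

@pytest.mark.parametrize("for_train", (True, False))
def test_generate_candidates_raises_when_not_fitted(model, users, dataset, for_train: bool) -> None:
    generator = CandidateGenerator(model, 2, False, False)
    with pytest.raises(NotFittedError):
        generator.generate_candidates(users, dataset, filter_viewed=True, for_train=for_train)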
generator.fit(dataset, for_train=True)

generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
sometimes it makes sense to split the test into multiple ones
) -> pd.DataFrame:

    if for_train and not self.is_fitted_for_train:
        raise NotFittedError(self.model.__class__.__name__)
Let's add to the error message that this model isn't fitted for train/recommend.
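Something along these lines, assuming NotFittedError accepts (or is extended to accept) a free-form description; a sketch only:

if for_train and not self.is_fitted_for_train:
    # Say explicitly which stage the model is missing
    raise NotFittedError(f"{self.model.__class__.__name__} (not fitted for train candidate generation)")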
model_count = {}
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0
Suggested change:
model_counts = defaultdict(int)
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0
    model_count[model_name] += 1
    identifier = f"{model_name}_{model_count[model_name]}"
I don't yet know exactly how you plan to use the identifiers, but do you think it makes sense to generate some sort of hash for the models?
Or, if they are used as column names (= features), maybe it's better to ask the user to provide the names? In that case it would be easier to explore feature importance, for example.
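Putting the defaultdict suggestion together with the identifier code above, the loop could look roughly like this (the final cand_gen_dict assignment is an assumption about how the identifiers are used):

from collections import defaultdict

model_counts: tp.DefaultDict[str, int] = defaultdict(int)
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    model_counts[model_name] += 1
    # e.g. "SomeModel_1", "SomeModel_2" for two generators wrapping the same model class
    identifier = f"{model_name}_{model_counts[model_name]}"
    cand_gen_dict[identifier] = candgen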
candidates.rename(
    columns={Columns.Rank: rank_col_name, Columns.Score: score_col_name},
    inplace=True,
    errors="ignore",
Please add a comment here explaining why we ignore errors.
    for_train=False,
    on_unsupported_targets=on_unsupported_targets,
)
except NotFittedError:
Why do we consider 2 possible valid scenarios here (models are fitted and not fitted)?
candidates = self._get_candidates_from_first_stage(
    users=users,
    dataset=dataset,
    filter_viewed=filter_viewed,
    items_to_recommend=items_to_recommend,
    for_train=False,
    on_unsupported_targets=on_unsupported_targets,
)
Anyway let's avoid code duplication
smth like:
if not all(candidate.is_fitted for candidate in self.candidates):
    self._fit...
candidates = self._get...
CandidateRankingModel and tutorial
TODO: docstrings, tests