
Feature/two_stage #209

Open
wants to merge 35 commits into experimental/two_stage

Conversation

blondered (Collaborator)

CandidateRankingModel and tutorial

TODO: docstrings, tests

blondered self-assigned this Nov 13, 2024
rectools/model_selection/splitter.py (outdated review thread, resolved)
rectools/models/candidate_ranking.py (outdated review thread, resolved)

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 91.70732% with 17 lines in your changes missing coverage. Please review.

Please upload report for BASE (experimental/two_stage@b403ea0). Learn more about missing BASE report.

Files with missing lines Patch % Lines
rectools/models/candidate_ranking.py 91.66% 17 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                    @@
##             experimental/two_stage     #209   +/-   ##
=========================================================
  Coverage                          ?   99.57%           
=========================================================
  Files                             ?       58           
  Lines                             ?     3986           
  Branches                          ?        0           
=========================================================
  Hits                              ?     3969           
  Misses                            ?       17           
  Partials                          ?        0           


{file = "catboost-1.2.7.tar.gz", hash = "sha256:3ed1658bd22c250a12f9c55cf238d654d7a87d9b45f063ec39965a8884a7e9d3"},
]

[package.dependencies]
Collaborator:

I know I said we could easily add catboost to the dependencies, but I didn't realize it pulls in so many dependencies of its own.

graphviz, matplotlib and plotly are not really needed for rectools. And the numpy<2.0 restriction isn't great either, since we're planning to upgrade numpy.

So I'm thinking it may be a good idea to move catboost to extras. wdyt?
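If catboost goes to extras, the import inside rectools would then need a guard, something like this (just a sketch):

try:
    from catboost import CatBoostClassifier, CatBoostRanker
except ImportError:  # catboost is an optional extra in this sketch
    CatBoostClassifier = CatBoostRanker = None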


@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    def fit(self, *args: tp.Any, **kwargs: tp.Any) -> None: ...
Collaborator:

  1. Models usually return themselves after fit.
  2. Let's make x and y explicit arguments, otherwise we can't check them ourselves. Same for the other methods (see the sketch below).
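Roughly what I mean, as a sketch (the exact annotations are up to you, and predict_proba here is just an example of another method):

import typing as tp

import pandas as pd


@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    def fit(self, x: pd.DataFrame, y: pd.Series) -> "ClassifierBase": ...

    def predict_proba(self, x: pd.DataFrame) -> tp.Any: ...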

class CatBoostReranker(Reranker):
    def __init__(
        self,
        model: tp.Union[CatBoostClassifier, CatBoostRanker] = CatBoostRanker(verbose=False),
Collaborator:

It's weird that the linters are silent here, but setting a mutable object as a default argument value is a really bad idea.
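The usual fix is a None default, e.g. (a sketch, keeping the current attribute name):

class CatBoostReranker(Reranker):
    def __init__(
        self,
        model: tp.Optional[tp.Union[CatBoostClassifier, CatBoostRanker]] = None,
    ) -> None:
        self.model = model if model is not None else CatBoostRanker(verbose=False)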

        self.model.fit(**fit_kwargs)

    def rerank(self, candidates: pd.DataFrame) -> pd.DataFrame:
        reco = candidates[Columns.UserItem].copy()
Collaborator:

(screenshot attached in the original review comment)

            reco[Columns.Score] = self.model.predict(x_full)
        else:
            raise ValueError("Got unexpected model_type")
        reco.sort_values(by=[Columns.User, Columns.Score], ascending=False, inplace=True)
Collaborator:

I think it'd be good to keep the original order of users. At least we keep it in the other models, afaik.
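One way to do it (a sketch; the helper column name is made up): keep users in their original order while still sorting scores within each user.

user_order = {user: pos for pos, user in enumerate(reco[Columns.User].unique())}
reco["_user_pos"] = reco[Columns.User].map(user_order)
reco.sort_values(by=["_user_pos", Columns.Score], ascending=[True, False], inplace=True)
reco.drop(columns="_user_pos", inplace=True)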

group_ids = candidates_with_target[Columns.User].values
candidates_with_target = candidates_with_target.drop(columns=Columns.UserItem)
pool_kwargs = {
    "data": candidates_with_target.drop(columns=Columns.Target),
Collaborator:

same here

res = useritem
res = res.merge(user_features, on=Columns.User, how="left")
res = res.merge(item_features, on=Columns.Item, how="left")
res = res.merge(useritem_features, on=Columns.UserItem, how="left")
Collaborator:

It's better to chain these operations, otherwise by this point you will have 4 dataframes instead of 1, all held in memory at once.
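E.g. a chained version (a sketch; the behaviour should be identical):

res = (
    useritem
    .merge(user_features, on=Columns.User, how="left")
    .merge(item_features, on=Columns.Item, how="left")
    .merge(useritem_features, on=Columns.UserItem, how="left")
)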

def sample_negatives(self, train: pd.DataFrame) -> pd.DataFrame:
    # train: user_id, item_id, scores, ranks, target(1/0)

    negative_mask = train["target"] == 0
Collaborator:

Columns.Target?

sampling_mask = train[Columns.User].isin(num_negatives[num_negatives > self.num_neg_samples].index)

neg_for_sample = train[sampling_mask & negative_mask]
neg = neg_for_sample.groupby([Columns.User], sort=False).apply(
Collaborator:

I see it's experimental, but I don't see a TODO, so I'll just remind you that apply is extremely slow and we don't use it.
I'm OK with a TODO comment here if you don't want to fix it now.

Collaborator:

A possible solution (maybe not the best one, but without using apply); a rough sketch is shown below:

  1. After calculating num_negatives, also calculate the sampling ratio as minimum(self.num_neg_samples / num_negatives, 1)
  2. Convert it to a series with user_id as the index and map it onto the neg dataset
  3. Add a column with random values in [0, 1)
  4. Filter out rows with random values > sampling ratio

You also don't need a sampling mask here.
And you can avoid splitting into pos and neg parts by simply setting the sampling ratio to 1 for positives. This also means you don't need to shuffle at the end. And I really don't like this shuffling, because:

  1. It's a heavy operation
  2. You have a sample method in the Sampler class, and it's not at all obvious that it shuffles the data inside
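A rough sketch of the apply-free version (the standalone signature is an assumption, and it keeps roughly num_neg_samples negatives per user in expectation rather than exactly):

import numpy as np
import pandas as pd

from rectools import Columns


def sample_negatives(train: pd.DataFrame, num_neg_samples: int, random_state: int = 32) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    negative_mask = train[Columns.Target] == 0

    # Per-user sampling ratio: min(num_neg_samples / num_negatives, 1)
    num_negatives = train.loc[negative_mask].groupby(Columns.User).size()
    ratio = (num_neg_samples / num_negatives).clip(upper=1.0)

    # Map the per-user ratio onto rows; positives get ratio 1 so they are always kept
    row_ratio = train[Columns.User].map(ratio).where(negative_mask, 1.0)

    # Keep a row if its random draw falls below the user's sampling ratio
    keep = rng.random(len(train)) < row_ratio.to_numpy()
    return train[keep]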


@attr.s(auto_attribs=True)
class PerUserNegativeSampler(NegativeSamplerBase):
    num_neg_samples: int = 3
Collaborator:

n_neg_samples?

Comment on lines +42 to +45
for user_id in sampled_df[Columns.User].unique():
    user_data = sampled_df[sampled_df[Columns.User] == user_id]
    num_negatives = len(user_data[user_data[Columns.Target] == 0])
    assert num_negatives == num_neg_samples
Collaborator:

Suggested change: replace

for user_id in sampled_df[Columns.User].unique():
    user_data = sampled_df[sampled_df[Columns.User] == user_id]
    num_negatives = len(user_data[user_data[Columns.Target] == 0])
    assert num_negatives == num_neg_samples

with

n_negatives_per_user = sampled_df.groupby(Columns.User)[Columns.Target].agg(lambda target: (target == 0).sum())
assert (n_negatives_per_user == num_neg_samples).all()

Collaborator:

It's not about speed optimisation, of course. I suggest it because:

  1. Having multiple same-kind asserts in one test is not a good idea, since it makes debugging harder. For such cases we usually use parametrize or subtests; here it can easily be replaced with a single assert.
  2. It's easier to read.

assert set(sampled_df.columns) == set(sample_data.columns)

# Check if the number of negatives per user is correct
for user_id in sampled_df[Columns.User].unique():
Collaborator:

Same here, but instead of checking .all() you can do assert n_negatives_per_user.tolist() == [2, 3, 3]

generator = CandidateGenerator(model, 2, False, False)

with pytest.raises(NotFittedError):
    generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
Collaborator:

Please do it with parametrize or subtests.
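A sketch of what it could look like (the module path, the fixtures and the parametrized flag are assumptions):

import pytest

from rectools.exceptions import NotFittedError
from rectools.models import PopularModel
from rectools.models.candidate_ranking import CandidateGenerator


@pytest.mark.parametrize("for_train", (True, False))
def test_generate_candidates_raises_when_not_fitted(dataset, users, for_train: bool) -> None:
    generator = CandidateGenerator(PopularModel(), 2, False, False)
    with pytest.raises(NotFittedError):
        generator.generate_candidates(users, dataset, filter_viewed=True, for_train=for_train)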


generator.fit(dataset, for_train=True)

generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
Collaborator:

Sometimes it makes sense to split the test into multiple ones.

) -> pd.DataFrame:

    if for_train and not self.is_fitted_for_train:
        raise NotFittedError(self.model.__class__.__name__)
Collaborator:

Let's add to the error message whether this model isn't fitted for train or for recommend.
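Something like this, as a sketch (the is_fitted_for_recommend flag name is an assumption):

if for_train and not self.is_fitted_for_train:
    raise NotFittedError(f"{self.model.__class__.__name__} (for train)")
if not for_train and not self.is_fitted_for_recommend:
    raise NotFittedError(f"{self.model.__class__.__name__} (for recommend)")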

Comment on lines +382 to +387
model_count = {}
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0
Collaborator:

Suggested change: replace

model_count = {}
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0

with

model_count = defaultdict(int)  # needs: from collections import defaultdict
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__

    if model_name not in model_count:
        model_count[model_name] = 0
    model_count[model_name] += 1
    identifier = f"{model_name}_{model_count[model_name]}"
Collaborator:

I don't yet know exactly how you plan to use the identifiers, but do you think it makes sense to generate some sort of hash for the models?
Or, if they are used for column names (i.e. features), maybe it's better to ask the user to provide the names? In that case it would be easier to explore feature importance, for example.
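For illustration, a hypothetical variant where the user provides the names (this dict-based API is just an idea, not code from this PR):

from implicit.als import AlternatingLeastSquares

from rectools.models import ImplicitALSWrapperModel, PopularModel

candidate_generators = {
    "als": CandidateGenerator(ImplicitALSWrapperModel(AlternatingLeastSquares(factors=32)), 100, False, False),
    "popular": CandidateGenerator(PopularModel(), 100, False, False),
}
# feature columns could then look like "als_score", "als_rank", "popular_score", "popular_rank"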

candidates.rename(
    columns={Columns.Rank: rank_col_name, Columns.Score: score_col_name},
    inplace=True,
    errors="ignore",
Collaborator:

Please add a comment here explaining why we ignore errors.

for_train=False,
on_unsupported_targets=on_unsupported_targets,
)
except NotFittedError:
Collaborator:

Why do we consider 2 possible valid scenarios here (models are fitted and not fitted)?

Comment on lines +620 to +627
candidates = self._get_candidates_from_first_stage(
    users=users,
    dataset=dataset,
    filter_viewed=filter_viewed,
    items_to_recommend=items_to_recommend,
    for_train=False,
    on_unsupported_targets=on_unsupported_targets,
)
Collaborator:

Anyway, let's avoid code duplication.

Collaborator:

Something like:

if not all(candidate.is_fitted for candidate in self.candidates):
    self._fit...
candidates = self._get...
