Feature/two_stage #209
base: experimental/two_stage
Conversation
Two-stage draft
Co-authored-by: Daria <[email protected]>
This reverts commit aa12b6f.
This reverts commit 4064cbd.
Feature/reranker
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@           Coverage Diff            @@
##    experimental/two_stage    #209   +/-   ##
=========================================
  Coverage          ?   99.57%
=========================================
  Files             ?       58
  Lines             ?     3986
  Branches          ?        0
=========================================
  Hits              ?     3969
  Misses            ?       17
  Partials          ?        0
=========================================

☔ View full report in Codecov by Sentry.
{file = "catboost-1.2.7.tar.gz", hash = "sha256:3ed1658bd22c250a12f9c55cf238d654d7a87d9b45f063ec39965a8884a7e9d3"}, | ||
] | ||
|
||
[package.dependencies] |
I know I said we could easily put catboost into dependencies, but I didn't realize it brings in so many dependencies.
graphviz, matplotlib and plotly are not really needed for rectools, and the numpy<2.0 restriction isn't great either, since we're planning to update numpy.
So I'm thinking it may be a good idea to move catboost to extras. wdyt?
@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    def fit(self, *args: tp.Any, **kwargs: tp.Any) -> None: ...
- Models usually return themselves after fit
- Let's make x and y explicit arguments, otherwise we can't check them ourselves. Same for the other methods.
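A minimal sketch of the suggested signature, returning self and taking explicit x/y (np.ndarray here is an assumption, the PR may want a broader type):

import typing as tp

import numpy as np

@tp.runtime_checkable
class ClassifierBase(tp.Protocol):
    # Explicit x/y lets type checkers verify call sites; returning self matches the sklearn/CatBoost convention
    def fit(self, x: np.ndarray, y: np.ndarray) -> "ClassifierBase": ...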
class CatBoostReranker(Reranker):
    def __init__(
        self,
        model: tp.Union[CatBoostClassifier, CatBoostRanker] = CatBoostRanker(verbose=False),
It's weird that linters are silent here, but setting a mutable object as a default value is a really bad idea.
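A common fix is a None default resolved inside the constructor; a sketch, assuming the constructor only needs to store the model:

def __init__(
    self,
    model: tp.Optional[tp.Union[CatBoostClassifier, CatBoostRanker]] = None,
) -> None:
    # Instantiate the default per instance instead of sharing one mutable CatBoostRanker between all rerankers
    self.model = model if model is not None else CatBoostRanker(verbose=False)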
        self.model.fit(**fit_kwargs)

    def rerank(self, candidates: pd.DataFrame) -> pd.DataFrame:
        reco = candidates[Columns.UserItem].copy()
            reco[Columns.Score] = self.model.predict(x_full)
        else:
            raise ValueError("Got unexpected model_type")
        reco.sort_values(by=[Columns.User, Columns.Score], ascending=False, inplace=True)
I think it'd be good to keep the order of users. At least we keep it in other models afaik
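One way to sort scores within each user while preserving the original order of users, sketched with a hypothetical temporary column:

order = reco.groupby(Columns.User, sort=False).ngroup()  # group number in order of first appearance
reco = (
    reco.assign(__user_order=order)
    .sort_values(["__user_order", Columns.Score], ascending=[True, False])
    .drop(columns="__user_order")
)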
group_ids = candidates_with_target[Columns.User].values
candidates_with_target = candidates_with_target.drop(columns=Columns.UserItem)
pool_kwargs = {
    "data": candidates_with_target.drop(columns=Columns.Target),
same here
res = useritem
res = res.merge(user_features, on=Columns.User, how="left")
res = res.merge(item_features, on=Columns.Item, how="left")
res = res.merge(useritem_features, on=Columns.UserItem, how="left")
It's better to use a chain of operations, otherwise by this point you have 4 dataframes instead of 1, all of them in memory.
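The chained version could look like this (the same merges as above, just without keeping intermediate names alive):

res = (
    useritem
    .merge(user_features, on=Columns.User, how="left")
    .merge(item_features, on=Columns.Item, how="left")
    .merge(useritem_features, on=Columns.UserItem, how="left")
)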
def sample_negatives(self, train: pd.DataFrame) -> pd.DataFrame:
    # train: user_id, item_id, scores, ranks, target(1/0)

    negative_mask = train["target"] == 0
Columns.Target?
sampling_mask = train[Columns.User].isin(num_negatives[num_negatives > self.num_neg_samples].index)

neg_for_sample = train[sampling_mask & negative_mask]
neg = neg_for_sample.groupby([Columns.User], sort=False).apply(
I see it's experimental, but I don't see a TODO, so I'll just remind you that apply is super slow and we're not using it. I'm ok with a TODO comment here if you don't want to fix it now.
A possible solution (maybe not the best one, but without using apply; a sketch follows below):
- After calculating num_negatives, also calculate the sampling ratio as minimum(self.num_neg_samples / num_negatives, 1)
- Convert it to a series with user_id as index and map it onto the neg dataset
- Add a column with random values in [0, 1)
- Filter out rows with random values > sampling ratio

You also don't need a sampling mask here. And you can avoid splitting into pos and neg parts by simply putting 1 as the sampling ratio for positives. This also means you don't need to do shuffling at the end. And I really don't like this shuffling, because:
- It's a heavy operation
- You have a sample method in the Sampler class and it's quite not obvious that you shuffle the data inside
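A minimal sketch of this approach (a replacement body for sample_negatives above; Columns.Target and self.num_neg_samples follow the PR's code, the random generator choice is an assumption):

import numpy as np
import pandas as pd

from rectools import Columns

def sample_negatives(self, train: pd.DataFrame) -> pd.DataFrame:
    is_negative = train[Columns.Target] == 0
    # Number of negatives per user
    num_negatives = train[is_negative].groupby(Columns.User, sort=False).size()
    # Keep all negatives for users that have <= num_neg_samples of them, sample the rest down
    ratio = np.minimum(self.num_neg_samples / num_negatives, 1.0)
    # Positives always get ratio 1, so no pos/neg split and no final shuffling are needed
    row_ratio = train[Columns.User].map(ratio).where(is_negative, 1.0)
    rng = np.random.default_rng()
    return train[rng.random(len(train)) < row_ratio.to_numpy()]

Note that this keeps approximately (not exactly) num_neg_samples negatives per user, which also changes what the tests below can assert.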
@attr.s(auto_attribs=True)
class PerUserNegativeSampler(NegativeSamplerBase):
    num_neg_samples: int = 3
n_neg_samples?
for user_id in sampled_df[Columns.User].unique():
    user_data = sampled_df[sampled_df[Columns.User] == user_id]
    num_negatives = len(user_data[user_data[Columns.Target] == 0])
    assert num_negatives == num_neg_samples
Suggested change:
n_negatives_per_user = sampled_df.groupby(Columns.User)[Columns.Target].agg(lambda target: (target == 0).sum())
assert (n_negatives_per_user == num_neg_samples).all()
It's not about speed optimisation, of course. I suggest it because:
- Having multiple same-kind asserts in one test is not a good idea, since it's harder to debug. For such cases we usually use parametrize or subtests. Here it can easily be replaced with 1 assert.
- It's easier to read.
assert set(sampled_df.columns) == set(sample_data.columns)

# Check if the number of negatives per user is correct
for user_id in sampled_df[Columns.User].unique():
Same here, but instead of checking .all() you can do assert n_negatives_per_user.tolist() == [2, 3, 3]
generator = CandidateGenerator(model, 2, False, False)

with pytest.raises(NotFittedError):
    generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
Please do it with parametrize or subtests.
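A possible shape, assuming the not-fitted check should fire for both for_train modes (the fixture names model, users, dataset and the test name are hypothetical):

import pytest

@pytest.mark.parametrize("for_train", (True, False))
def test_generate_candidates_raises_when_not_fitted(model, users, dataset, for_train: bool) -> None:
    generator = CandidateGenerator(model, 2, False, False)
    with pytest.raises(NotFittedError):
        generator.generate_candidates(users, dataset, filter_viewed=True, for_train=for_train)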
generator.fit(dataset, for_train=True)

generator.generate_candidates(users, dataset, filter_viewed=True, for_train=True)
sometimes it makes sense to split the test into multiple ones
) -> pd.DataFrame:

    if for_train and not self.is_fitted_for_train:
        raise NotFittedError(self.model.__class__.__name__)
Let's add to the error message that this model isn't fitted for train/recommend.
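Something along these lines, assuming NotFittedError accepts (or is extended to accept) a free-form description; a sketch only:

if for_train and not self.is_fitted_for_train:
    # Say explicitly which stage the model is missing
    raise NotFittedError(f"{self.model.__class__.__name__} (not fitted for train candidate generation)")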
model_count = {}
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0
Suggested change:
model_counts = defaultdict(int)
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    if model_name not in model_count:
        model_count[model_name] = 0
    model_count[model_name] += 1
    identifier = f"{model_name}_{model_count[model_name]}"
I don't yet know exactly how you plan to use the identifiers, but do you think it makes sense to generate some sort of hash for the models?
Or, if they are used as column names (= features), maybe it's better to ask the user to provide the names? In that case it would be easier to explore feature importance, for example.
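Putting the defaultdict suggestion together with the identifier code above, the loop could look roughly like this (the final cand_gen_dict assignment is an assumption about how the identifiers are used):

from collections import defaultdict

model_counts: tp.DefaultDict[str, int] = defaultdict(int)
cand_gen_dict = {}
for candgen in candidate_generators:
    model_name = candgen.model.__class__.__name__
    model_counts[model_name] += 1
    # e.g. "SomeModel_1", "SomeModel_2" for two generators wrapping the same model class
    identifier = f"{model_name}_{model_counts[model_name]}"
    cand_gen_dict[identifier] = candgen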
candidates.rename(
    columns={Columns.Rank: rank_col_name, Columns.Score: score_col_name},
    inplace=True,
    errors="ignore",
Please add a comment here explaining why we ignore errors.
    for_train=False,
    on_unsupported_targets=on_unsupported_targets,
)
except NotFittedError:
Why do we consider 2 possible valid scenarios here (models are fitted and not fitted)?
candidates = self._get_candidates_from_first_stage(
    users=users,
    dataset=dataset,
    filter_viewed=filter_viewed,
    items_to_recommend=items_to_recommend,
    for_train=False,
    on_unsupported_targets=on_unsupported_targets,
)
Anyway let's avoid code duplication
smth like:
if not all(candidate.is_fitted for candidate in self.candidates):
    self._fit...
candidates = self._get...
CandidateRankingModel and tutorial
TODO: docstrings, tests