Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` (#179)

Commits: 0caaf56, 144c6c5, ea33c05, 47db226, 0b04ed9, e8929c5, 117c5c7, 4c86904, b75dd62, cd9ec52, 43deafd, dd84f10
```diff
@@ -21,6 +21,7 @@
 from scipy import sparse

 from rectools import Columns
+from rectools.types import InternalIdsArray

 from .features import AbsentIdError, DenseFeatures, Features, SparseFeatures
 from .identifiers import IdMap
```
```diff
@@ -91,6 +92,14 @@ def get_hot_item_features(self) -> tp.Optional[Features]:
             return None
         return self.item_features.take(range(self.n_hot_items))

+    def get_hot_users(self) -> InternalIdsArray:
+        """Return internal ids of hot users."""
+        return self.interactions.df[Columns.User].unique()
+
+    def get_hot_items(self) -> InternalIdsArray:
+        """Return internal ids of hot items."""
+        return self.interactions.df[Columns.Item].unique()
+
     @classmethod
     def construct(
         cls,
```
```diff
@@ -138,9 +147,7 @@ def construct(
         Dataset
             Container with all input data, converted to `rectools` structures.
         """
-        for col in (Columns.User, Columns.Item):
-            if col not in interactions_df:
-                raise KeyError(f"Column '{col}' must be present in `interactions_df`")
+        cls._check_columns_present(interactions_df)
         user_id_map = IdMap.from_values(interactions_df[Columns.User].values)
         item_id_map = IdMap.from_values(interactions_df[Columns.Item].values)
         interactions = Interactions.from_raw(interactions_df, user_id_map, item_id_map)
```
```diff
@@ -194,6 +201,12 @@ def _make_features(
         except Exception as e:  # pragma: no cover
             raise RuntimeError(f"An error has occurred while constructing {feature_type} features: {e!r}")

+    @staticmethod
+    def _check_columns_present(interactions_df: pd.DataFrame) -> None:
+        for col in (Columns.User, Columns.Item):
+            if col not in interactions_df:
+                raise KeyError(f"Column '{col}' must be present in `interactions_df`")
+
     def get_user_item_matrix(
         self,
         include_weights: bool = True,
```
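The extracted helper lets both `construct` and the new method share the same validation. A standalone sketch of the pattern (column names are stand-ins for `Columns.User` / `Columns.Item`):

```python
import pandas as pd

REQUIRED = ("user_id", "item_id")  # stand-ins for Columns.User / Columns.Item

def check_columns_present(interactions_df: pd.DataFrame) -> None:
    # Same pattern as the extracted Dataset._check_columns_present helper.
    for col in REQUIRED:
        if col not in interactions_df:
            raise KeyError(f"Column '{col}' must be present in `interactions_df`")

check_columns_present(pd.DataFrame({"user_id": [1], "item_id": [2]}))  # passes

try:
    check_columns_present(pd.DataFrame({"user_id": [1]}))
except KeyError as e:
    print(e)  # reports the missing "item_id" column
```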
```diff
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo
         pd.DataFrame
         """
         return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
+
+    def construct_new_datasets(
+        self,
+        interactions_df: pd.DataFrame,
+        user_features_df: tp.Optional[pd.DataFrame] = None,
+        cat_user_features: tp.Iterable[str] = (),
+        make_dense_user_features: bool = False,
+        item_features_df: tp.Optional[pd.DataFrame] = None,
+        cat_item_features: tp.Iterable[str] = (),
+        make_dense_item_features: bool = False,
+    ) -> "Dataset":
+        """
+        Create new dataset by merging user_id_map and item_id_map.
+        This function is useful when you want to use fit_partial.
+
+        Parameters
+        ----------
+        interactions_df : pd.DataFrame
+            New interactions table.
+            The same structure as in `construct` method.
+        user_features_df, item_features_df : pd.DataFrame, optional
+            New user (item) explicit features table.
+            The same structure as in `construct` method.
+        cat_user_features, cat_item_features : tp.Iterable[str], default ``()``
+            List of categorical user (item) feature names for
+            `SparseFeatures.from_flatten` method.
+            Used only if `make_dense_user_features` (`make_dense_item_features`)
+            flag is ``False`` and `user_features_df` (`item_features_df`) is not ``None``.
+        make_dense_user_features, make_dense_item_features : bool, default ``False``
+            Create user (item) features as dense or sparse.
+            Used only if `user_features_df` (`item_features_df`) is not ``None``.
+            - if ``False``, `SparseFeatures.from_flatten` method will be used;
+            - if ``True``, `DenseFeatures.from_dataframe` method will be used.
+
+        Returns
+        -------
+        Dataset
+            New dataset with added data.
+        """
+        self._check_columns_present(interactions_df)
+
+        new_user_id_map = self.user_id_map.add_ids(interactions_df[Columns.User].values, raise_if_already_present=False)
+        new_item_id_map = self.item_id_map.add_ids(interactions_df[Columns.Item].values, raise_if_already_present=False)
+        new_interactions = Interactions.from_raw(interactions_df, new_user_id_map, new_item_id_map)
+
+        new_user_features, new_user_id_map = self._make_features(
+            user_features_df,
+            cat_user_features,
+            make_dense_user_features,
+            new_user_id_map,
+            Columns.User,
+            "user",
+        )
+        new_item_features, new_item_id_map = self._make_features(
+            item_features_df,
+            cat_item_features,
+            make_dense_item_features,
+            new_item_id_map,
+            Columns.Item,
+            "item",
+        )
+
+        return Dataset(
+            new_user_id_map,
+            new_item_id_map,
+            new_interactions,
+            new_user_features,
+            new_item_features,
+        )
```

Review thread on `construct_new_datasets`:

> **Reviewer:** Probably …
>
> **Reviewer:** I'd call it … Right now it's not working properly with the conceptions that we use. We have a rule that all hot users (users that have interactions) always go before warm users (users that have only features), and we use this assumption in some of our important internal logic. The new method can break this rule, because new users with interactions will get ids after the warm users from the previous dataset state. So we really can't make this a public method in the library. Options are: …
>
> **Author:** Thanks for pointing it out. I wasn't aware of the warm users assumption. And yeah, I expected this to be a controversial change, which is why I haven't written the test yet 😅 Originally, I expected RecTools' … I'm okay with making it private, or I can rebuild the maps:
>
> ```python
> old_hot_user_id_map = IdMap.from_dict(
>     {e: i for e, i in zip(ds.user_id_map.convert_to_external(ds.get_hot_users()), ds.get_hot_users())}
> )
> old_hot_item_id_map = ...
> new_user_id = old_hot_user_id_map.add_ids(...
> ```
>
> I'm not confident it's okay to drop the previous warm users yet, though.
>
> **Author:** Hmm, I attempted to drop the old warm ids, but I hit an error (RecTools/rectools/dataset/features.py, line 121 in f73d054). Maybe storing hot/warm marks like … would help.
>
> **Reviewer:** The overall conception of ids in rectools: … As for keeping warm user ids in the dataset, it is very simple: if those users don't have features in the user features matrix of the final dataset, then they need to be dropped from `user_id_map`, because the only application of those warm ids is to get features for them (during …).
>
> **Reviewer:** As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
>
> **Reviewer:** Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from the dataset? For … Now about corner cases: I think that for incremental training we need to require exactly the same number of features, feature names and feature types as were present in the original dataset, at least in this PR, or we'll dive too deep. So all these checks for correct data in the features are needed. Items follow the same logic as users. Wow, that's a tough story :) What do you think?
>
> **Author:** Okay, let me revive my dropping code and see what happens on CI. I'll share it later.
>
> **Author:** Fixed to drop old warm ids: e8929c5
>
> **Author:** @blondered I think I completed implementing the "Rewrite logic" option, but let me know if you find anything else.
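The "hot before warm" invariant the reviewers discuss can be illustrated with a toy id map (pure Python, hypothetical names; not rectools' actual `IdMap`): naively extending an existing map with a new batch of interactions assigns the new hot user an id after the old warm ids.

```python
# Toy illustration; internal ids are assigned by insertion order:
# hot users (with interactions) first, then warm users (features only).
id_map = {"u1": 0, "u2": 1}  # hot users from the first fit
id_map["u_warm"] = 2         # warm user: features only, no interactions

# A later batch brings interactions for a brand-new user "u3".
# Naively extending the map gives "u3" an id AFTER the warm user...
id_map["u3"] = 3

hot = {"u1", "u2", "u3"}
hot_ids = sorted(v for k, v in id_map.items() if k in hot)
warm_ids = sorted(v for k, v in id_map.items() if k not in hot)

# ...so the "all hot ids < all warm ids" invariant is now broken:
print(max(hot_ids) < min(warm_ids))  # False
```

This is why the thread converges on either making the method private or rewriting the logic to re-derive the maps (the "Rewrite logic" option that was eventually implemented).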
```diff
@@ -90,6 +90,25 @@ def _fit(self, dataset: Dataset) -> None:  # type: ignore
             self.verbose,
         )

+    def _fit_partial(self, dataset: Dataset) -> None:
+        # deepcopy does not copy model.item_factors and model.user_factors.
+        # That causes issues with partial fit.
+        users = dataset.get_hot_users()
+        items = dataset.get_hot_items()
+
+        ui_csr = dataset.get_user_item_matrix(
+            include_weights=True, include_warm_users=True, include_warm_items=True
+        ).astype(np.float32)
+        iu_csr = ui_csr[:, items].T.tocsr(copy=False)
+
+        # TODO: implement partial fit for explicit features
+        if dataset.get_hot_item_features() or dataset.get_hot_user_features():
+            raise NotImplementedError("fit_partial with explicit features is not implemented")
+
+        for _ in range(self.model.iterations):
+            self.model.partial_fit_users(users, ui_csr[users])
+            self.model.partial_fit_items(items, iu_csr)
+
     def _get_users_factors(self, dataset: Dataset) -> Factors:
         return Factors(get_users_vectors(self.model))
```

Review thread on `_fit_partial` (marked resolved):

> **Reviewer:** We've discussed it a lot. Let's not do ALS support for …
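The matrix plumbing in `_fit_partial` above — per-user rows of the user-item CSR for `partial_fit_users`, and the transposed item-user CSR for `partial_fit_items` — can be sketched with scipy alone, on toy data and without the `implicit` dependency:

```python
import numpy as np
from scipy import sparse

# Toy user-item interaction matrix (3 users x 4 items), float32 as in the patch.
ui_csr = sparse.csr_matrix(
    np.array(
        [
            [1, 0, 2, 0],
            [0, 3, 0, 0],
            [0, 0, 0, 4],
        ],
        dtype=np.float32,
    )
)

users = np.array([0, 1, 2])     # hot user ids
items = np.array([0, 1, 2, 3])  # hot item ids

# Rows that would be passed to model.partial_fit_users(users, ...):
user_rows = ui_csr[users]

# Item-user matrix that would be passed to model.partial_fit_items(items, ...):
iu_csr = ui_csr[:, items].T.tocsr(copy=False)

print(user_rows.shape)          # (3, 4)
print(iu_csr.shape)             # (4, 3)
print(iu_csr.toarray()[3, 2])   # 4.0 -- item 3 was interacted with by user 2
```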
> **blondered:** I'm very excited about the new feature. Thank you for the PR!
>
> Let's discuss what we're trying to achieve with `fit_partial`.
>
> Implicit ALS has `partial_fit_users` and `partial_fit_items` methods that allow updating weights only for specific subsets of users/items, and they also support feeding new users and items to an already fitted model.
>
> LightFM has different logic: it has a `fit_partial` method. From the docs: "Unlike fit, repeated calls to this method will cause training to resume from the current model state." This method does not support adding new users/items, and it is not meant to update only parts of the embeddings.
>
> What exactly do we want to achieve with `fit_partial` in rectools? I would say the first step would be to follow the LightFM logic. We are missing the ability to resume training; our `fit` always refits models from the beginning, just like LightFM's `fit`.
>
> As for `partial_fit_users` and `partial_fit_items`, these are very model-specific ALS methods. I wouldn't call them `fit_partial`; I would move them to a separate issue and PR.
>
> @chezou @feldlime what do you think?

> **chezou:** Good point.
>
> My motivation for adding `fit_partial` is to enable incremental training, e.g. weekly batches for full training and daily incremental training to adapt to new and existing user transactions. I don't seriously care about preserving old users/items; rather, shortening training time is what matters.
>
> Actually, LightFM has a known workaround for adding a new user/item: lyst/lightfm#347 (comment). I included that workaround in the patch: ea33c05
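The workaround referenced above is, roughly, to enlarge the model's latent matrices so that new ids get freshly initialized rows while existing rows are preserved. A numpy sketch of that idea (hypothetical function and names; LightFM's real internal attributes differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_embeddings(emb: np.ndarray, n_new: int, scale: float = 0.01) -> np.ndarray:
    """Append freshly initialized rows for new users/items.

    Toy version of the "resize the embedding matrices" workaround; it keeps
    already-trained rows intact and adds small random rows for new ids.
    """
    extra = rng.normal(0.0, scale, size=(n_new, emb.shape[1])).astype(emb.dtype)
    return np.vstack([emb, extra])

user_emb = rng.normal(size=(3, 4)).astype(np.float32)  # 3 known users, 4 factors
old_rows = user_emb.copy()

user_emb = grow_embeddings(user_emb, n_new=2)          # 2 new users arrive

print(user_emb.shape)  # (5, 4)
```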
> **blondered:** Great feature for LightFM, thank you!
>
> So there are two major usage cases for `fit_partial`:
>
> 1. Incremental training on new data (possibly with new users/items).
> 2. Epochal training that resumes from the current model state, as in `Lightfm.fit_partial`.
>
> Right now in LightFM you implemented support for both cases, but in ALS only for one of them (incremental). Could you please add support for ALS epochal training (with `NotImplementedError` if features are present)? Otherwise users would be very confused by different behaviour of the same method. And could you please add an optional `epochs` argument to `fit_partial` of ALS?
>
> We would also need tests for both models: fit a model for 10 epochs with `fit`, fit another model 10 times with `fit_partial` for 1 epoch each, and the final embeddings should be equal.
>
> How do you feel about the whole story?
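The suggested equivalence test can be sketched with a trivial stand-in model whose training step is deterministic (hypothetical `TinyModel`, not a rectools class): full `fit` for 10 epochs must produce the same state as 10 calls to `fit_partial` with 1 epoch each.

```python
import numpy as np

class TinyModel:
    """Stand-in model: one deterministic update per epoch (not a real recommender)."""

    def __init__(self) -> None:
        self.w = np.zeros(4)

    def fit(self, x: np.ndarray, epochs: int) -> None:
        self.w = np.zeros(4)  # full refit starts from scratch, like `fit`
        self._run(x, epochs)

    def fit_partial(self, x: np.ndarray, epochs: int = 1) -> None:
        self._run(x, epochs)  # resumes from the current model state

    def _run(self, x: np.ndarray, epochs: int) -> None:
        for _ in range(epochs):
            self.w += 0.1 * (x - self.w)  # deterministic training step

x = np.array([1.0, 2.0, 3.0, 4.0])

full = TinyModel()
full.fit(x, epochs=10)

incremental = TinyModel()
for _ in range(10):
    incremental.fit_partial(x, epochs=1)

print(np.allclose(full.w, incremental.w))  # True
```

For real models the same pattern applies, usually with a fixed random seed so both training runs are reproducible.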
> **chezou:** Sounds good. Let me add tests for re-training with the same dataset.
>
> **chezou:** Added the tests: 117c5c7
>
> **chezou:** @blondered I think 117c5c7 shows how it works for use case 2. Please let me know if you'd like to add anything else.

> **blondered:** Thanks! Sorry for the delay. It's summer and it was vacation time :)