
[BUG] ModelCheckpoint Callback results in an error due to unsupported overwrite=True parameter #1242

Open
hkristof03 opened this issue Jun 4, 2024 · 7 comments
Labels: bug (Something isn't working), status/needs-triage


@hkristof03

Bug description

The ModelCheckpoint callback calls the model's save method, which is overridden by Merlin's BaseModel implementation. The callback passes overwrite=True to save, which the override does not support, so an error is raised.

Steps/Code to reproduce bug

  1. Create a ModelCheckpoint callback.
  2. Pass it to the model.fit() method.
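
A minimal sketch of the repro (the DLRM model and synthetic data follow the merlin-models examples and are just placeholders; any model built on Merlin's BaseModel should hit the same error):

import tensorflow as tf
import merlin.models.tf as mm
from merlin.datasets.synthetic import generate_data
from merlin.schema import Tags

# Small fake dataset, as in the merlin-models examples
train, valid = generate_data("e-commerce", num_rows=1000, set_sizes=(0.8, 0.2))
target = train.schema.select_by_tag(Tags.TARGET).column_names[0]

model = mm.DLRMModel(
    train.schema,
    embedding_dim=16,
    bottom_block=mm.MLPBlock([32, 16]),
    top_block=mm.MLPBlock([32, 16]),
    prediction_tasks=mm.BinaryClassificationTask(target),
)
model.compile(optimizer="adam")

# Keras' ModelCheckpoint internally calls model.save(filepath, overwrite=True, ...),
# but Merlin's BaseModel.save does not accept `overwrite`, hence the error.
checkpoint = tf.keras.callbacks.ModelCheckpoint("checkpoints/model")
model.fit(train, batch_size=256, epochs=1, callbacks=[checkpoint])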
hkristof03 added the bug (Something isn't working) and status/needs-triage labels on Jun 4, 2024

CarloNicolini commented Jun 7, 2024

The same problem happens to me on the latest Merlin TensorFlow container (23.08).

I believe the solution is to create a new class, MerlinCheckpoint, inheriting from ModelCheckpoint and overriding its ._save_model method. I am working on that and will keep you informed.


hkristof03 commented Jun 19, 2024

I implemented a custom callback to handle this issue. However, I have since discovered that the save() method of the Model(BaseModel) class was, well... not given much thought, I guess. I have a model that takes up most of the GPU memory. When I call save(), it calls the parent's save method, and after that the get_output_schema() function is called, which actually tries to load the previously saved model back from disk. (Isn't the model already in memory? Why is it assumed that the model can fit into the GPU twice? Why not pass the model to the function?) @rnyak @bschifferer

@CarloNicolini


What is the status of this issue? Is there any update on the code?

@hkristof03

@CarloNicolini I implemented a custom callback in which I call the BaseModel's .save() method. After training, I had to exit the model's scope to be able to fully delete the model and call this utility, thanks to the magic of TensorFlow. At the same time, after raising a couple of issues and writing the n-th workaround, it seems that this project is inactive, judging by the lack of feedback.
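
I can't share the exact code, but the idea boils down to something like this (a hypothetical sketch, all names illustrative; it assumes Merlin's BaseModel.save only needs the export path):

import tensorflow as tf

class SimpleMerlinCheckpoint(tf.keras.callbacks.Callback):
    """Minimal checkpoint callback that avoids the unsupported kwargs."""

    def __init__(self, filepath):
        super().__init__()
        self.filepath = filepath

    def on_epoch_end(self, epoch, logs=None):
        # Call the model's own save() (Merlin's BaseModel override) with
        # only the export path, i.e. no overwrite=/options= arguments.
        self.model.save(f"{self.filepath}/epoch_{epoch + 1}")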


CarloNicolini commented Jun 27, 2024


Agreed, there are many issues with no response for weeks. I also never managed to make .to_topk_recommender work without an IndexError, and in the end implemented my own functions for recommending the top-k items.
I wonder why the folks at NVIDIA have more or less abandoned this project. It is really nice to work with; dealing with schemas etc. is very convenient (apart from a number of bugs).
I believe the problem setting is well designed, but there is no thriving community around it.
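
For reference, what I did boils down to something like this, assuming precomputed user and item embedding matrices (an illustrative sketch, not Merlin API):

import tensorflow as tf

def recommend_topk(user_embeddings, item_embeddings, k=10):
    """Score all items for each user and return the top-k item ids and scores."""
    # (n_users, dim) @ (n_items, dim)^T -> (n_users, n_items) score matrix
    scores = tf.matmul(user_embeddings, item_embeddings, transpose_b=True)
    top = tf.math.top_k(scores, k=k)
    return top.indices, top.values  # both of shape (n_users, k)

# Example with random embeddings:
ids, vals = recommend_topk(tf.random.normal((4, 32)), tf.random.normal((100, 32)), k=5)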

By the way, could you please share your custom callback?
I've subclassed the Keras ModelCheckpoint and overridden its _save_model method, commenting out the overwrite=True option as well as the options parameter.
Here is the code:

import logging
import os

import tensorflow as tf
from keras.callbacks import ModelCheckpoint
from keras.utils import io_utils, tf_utils

logger = logging.getLogger(__name__)

class MerlinCheckpoint(ModelCheckpoint):
    # Custom override of ModelCheckpoint._save_model, adapted to operate on a
    # Merlin BaseModel rather than a standard Keras model.
    def _save_model(self, epoch, batch, logs):
        """Saves the model.

        Args:
            epoch: the epoch this iteration is in.
            batch: the batch this iteration is in. `None` if the `save_freq`
              is set to `epoch`.
            logs: the `logs` dict passed in to `on_batch_end` or `on_epoch_end`.
        """
        logs = logs or {}

        if (
            isinstance(self.save_freq, int)
            or self.epochs_since_last_save >= self.period
        ):
            # Block only when saving interval is reached.
            logs = tf_utils.sync_to_numpy_or_python_type(logs)
            self.epochs_since_last_save = 0
            filepath = self._get_file_path(epoch, batch, logs)

            # Create host directory if it doesn't exist.
            dirname = os.path.dirname(filepath)  # noqa: PTH120
            if dirname and not tf.io.gfile.exists(dirname):
                tf.io.gfile.makedirs(dirname)

            try:
                if self.save_best_only:
                    current = logs.get(self.monitor)
                    if current is None:
                        logger.warning(
                            "Can save best model only with %s available, " "skipping.",
                            self.monitor,
                        )
                    elif self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            io_utils.print_msg(
                                f"\nEpoch {epoch + 1}: {self.monitor} "
                                "improved "
                                f"from {self.best:.5f} to {current:.5f}, "
                                f"saving model to {filepath}"
                            )
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(
                                filepath,
                                # overwrite=True,
                                # options=self._options,
                            )
                        else:
                            self.model.save(
                                filepath,
                                # overwrite=True,
                                # options=self._options,
                            )
                    elif self.verbose > 0:
                        io_utils.print_msg(
                            f"\nEpoch {epoch + 1}: "
                            f"{self.monitor} did not improve "
                            f"from {self.best:.5f}"
                        )
                else:
                    if self.verbose > 0:
                        io_utils.print_msg(
                            f"\nEpoch {epoch + 1}: saving model to {filepath}"
                        )
                    if self.save_weights_only:
                        self.model.save_weights(
                            filepath,
                            # overwrite=True,
                            # options=self._options,
                        )
                    else:
                        self.model.save(
                            filepath,
                            # overwrite=True,
                            # options=self._options,
                        )

                self._maybe_remove_file()
            except IsADirectoryError:  # h5py 3.x
                msg = (
                    "Please specify a non-directory filepath for "
                    "ModelCheckpoint. Filepath used is an existing "
                    f"directory: {filepath}"
                )
                raise OSError(msg)  # noqa: B904
            except IOError as e:  # h5py 2.x
                # `e.errno` appears to be `None` so checking the content of
                # `e.args[0]`.
                if "is a directory" in str(e.args[0]).lower():
                    msg = (
                        "Please specify a non-directory filepath for "
                        "ModelCheckpoint. Filepath used is an existing "
                        f"directory: f{filepath}"
                    )
                    raise OSError(msg)  # noqa: B904
                # Re-throw the error for any other causes.
                raise e
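
Usage is the same as for the stock ModelCheckpoint, e.g. (model and datasets are placeholders):

checkpoint = MerlinCheckpoint(
    filepath="checkpoints/merlin_model",
    monitor="val_loss",
    save_best_only=True,
    verbose=1,
)
model.fit(train, validation_data=valid, epochs=3, callbacks=[checkpoint])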

@hkristof03

@CarloNicolini sorry, I can't share the code, as I developed it during my work and it is not my property. However, I asked ChatGPT 3.5 for a basic TensorFlow model-saving template, which I then modified a bit for my own use case. Compared to the official ModelCheckpoint callback, I threw out about 60% of the code.

rnyak (Contributor) commented Jun 28, 2024

@CarloNicolini @hkristof03 can you please try our nightly container, nvcr.io/nvidia/merlin/merlin-tensorflow:nightly? Over the last 3 months there have been some fixes in the merlin-models lib for saving models.

Beyond that, we always need a proper small minimal reproducible example to repro the errors. You can generate a small fake dataset and share your entire code to repro.

Thanks.
