
[BUG] ModelCheckpoint Callback results in an error due to unsupported overwrite=True parameter #1242

Open
hkristof03 opened this issue Jun 4, 2024 · 7 comments
Labels: bug (Something isn't working), status/needs-triage


@hkristof03

Bug description

The ModelCheckpoint callback calls the model's save method, which is overridden by Merlin's BaseModel implementation. The callback passes overwrite=True to save, which the override does not support, so an error is raised.

Steps/Code to reproduce bug

  1. Create a ModelCheckpoint callback.
  2. Pass it to the model.fit() method.
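
A minimal sketch of the repro (the DLRM model and synthetic data follow the merlin-models examples and are just placeholders; any model built on Merlin's BaseModel should hit the same error):

import tensorflow as tf
import merlin.models.tf as mm
from merlin.datasets.synthetic import generate_data
from merlin.schema import Tags

# Small fake dataset, as in the merlin-models examples
train, valid = generate_data("e-commerce", num_rows=1000, set_sizes=(0.8, 0.2))
target = train.schema.select_by_tag(Tags.TARGET).column_names[0]

model = mm.DLRMModel(
    train.schema,
    embedding_dim=16,
    bottom_block=mm.MLPBlock([32, 16]),
    top_block=mm.MLPBlock([32, 16]),
    prediction_tasks=mm.BinaryClassificationTask(target),
)
model.compile(optimizer="adam")

# Keras' ModelCheckpoint internally calls model.save(filepath, overwrite=True, ...),
# but Merlin's BaseModel.save does not accept `overwrite`, hence the error.
checkpoint = tf.keras.callbacks.ModelCheckpoint("checkpoints/model")
model.fit(train, batch_size=256, epochs=1, callbacks=[checkpoint])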
hkristof03 added the bug (Something isn't working) and status/needs-triage labels on Jun 4, 2024

CarloNicolini commented Jun 7, 2024

The same problem happens to me on the latest Merlin TensorFlow container (23.08).

I believe the solution is to create a new class, MerlinCheckpoint, inheriting from ModelCheckpoint and overriding its ._save_model method. I am working on that and will keep you informed.


hkristof03 commented Jun 19, 2024

I implemented a custom callback to handle this issue. However, I have since discovered that the save() method of the Model(BaseModel) class was, well... not given much thought, I guess. I have a model that takes up most of the GPU memory. When I call save(), it calls the parent's save method, and after that the get_output_schema() function is called, which actually tries to load the previously saved model back from disk. (Isn't the model already in memory? Why is it assumed that the model can fit into the GPU twice? Why not pass the model to the function?) @rnyak @bschifferer

@CarloNicolini


What is the status of this issue? Is there any update on the code?

@hkristof03

@CarloNicolini I implemented a custom callback in which I call the BaseModel's .save() method. After training, I had to exit the model's scope to be able to fully delete the model and call this utility, thanks to the magic of TensorFlow. At the same time, after raising a couple of issues and writing the n-th workaround, it seems that this project is inactive, judging by the lack of feedback.
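
I can't share the exact code, but the idea boils down to something like this (a hypothetical sketch, all names illustrative; it assumes Merlin's BaseModel.save only needs the export path):

import tensorflow as tf

class SimpleMerlinCheckpoint(tf.keras.callbacks.Callback):
    """Minimal checkpoint callback that avoids the unsupported kwargs."""

    def __init__(self, filepath):
        super().__init__()
        self.filepath = filepath

    def on_epoch_end(self, epoch, logs=None):
        # Call the model's own save() (Merlin's BaseModel override) with
        # only the export path, i.e. no overwrite=/options= arguments.
        self.model.save(f"{self.filepath}/epoch_{epoch + 1}")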


CarloNicolini commented Jun 27, 2024


Agreed, there are many issues with no response for weeks. I also never managed to make .to_topk_recommender work without an IndexError, and in the end implemented my own functions for recommending the top-k items.
I wonder why the folks at NVIDIA have more or less abandoned this project. It is really nice to work with; dealing with schemas etc. is very convenient (apart from a number of bugs).
I believe the problem setting is well designed, but there is no thriving community around it.
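
For reference, what I did boils down to something like this, assuming precomputed user and item embedding matrices (an illustrative sketch, not Merlin API):

import tensorflow as tf

def recommend_topk(user_embeddings, item_embeddings, k=10):
    """Score all items for each user and return the top-k item ids and scores."""
    # (n_users, dim) @ (n_items, dim)^T -> (n_users, n_items) score matrix
    scores = tf.matmul(user_embeddings, item_embeddings, transpose_b=True)
    top = tf.math.top_k(scores, k=k)
    return top.indices, top.values  # both of shape (n_users, k)

# Example with random embeddings:
ids, vals = recommend_topk(tf.random.normal((4, 32)), tf.random.normal((100, 32)), k=5)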

By the way, could you please share your custom callback?
I've subclassed the Keras ModelCheckpoint and overridden its _save_model method, commenting out the overwrite=True option as well as the options parameter.
Here is the code:

import logging
import os

import tensorflow as tf
from keras.callbacks import ModelCheckpoint
from keras.utils import io_utils, tf_utils

logger = logging.getLogger(__name__)

class MerlinCheckpoint(ModelCheckpoint):
    # Custom override of ModelCheckpoint._save_model, adapted to operate on a
    # Merlin BaseModel rather than a standard Keras model.
    def _save_model(self, epoch, batch, logs):
        """Saves the model.

        Args:
            epoch: the epoch this iteration is in.
            batch: the batch this iteration is in. `None` if the `save_freq`
              is set to `epoch`.
            logs: the `logs` dict passed in to `on_batch_end` or `on_epoch_end`.
        """
        logs = logs or {}

        if (
            isinstance(self.save_freq, int)
            or self.epochs_since_last_save >= self.period
        ):
            # Block only when saving interval is reached.
            logs = tf_utils.sync_to_numpy_or_python_type(logs)
            self.epochs_since_last_save = 0
            filepath = self._get_file_path(epoch, batch, logs)

            # Create host directory if it doesn't exist.
            dirname = os.path.dirname(filepath)  # noqa: PTH120
            if dirname and not tf.io.gfile.exists(dirname):
                tf.io.gfile.makedirs(dirname)

            try:
                if self.save_best_only:
                    current = logs.get(self.monitor)
                    if current is None:
                        logger.warning(
                            "Can save best model only with %s available, " "skipping.",
                            self.monitor,
                        )
                    elif self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            io_utils.print_msg(
                                f"\nEpoch {epoch + 1}: {self.monitor} "
                                "improved "
                                f"from {self.best:.5f} to {current:.5f}, "
                                f"saving model to {filepath}"
                            )
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(
                                filepath,
                                # overwrite=True,
                                # options=self._options,
                            )
                        else:
                            self.model.save(
                                filepath,
                                # overwrite=True,
                                # options=self._options,
                            )
                    elif self.verbose > 0:
                        io_utils.print_msg(
                            f"\nEpoch {epoch + 1}: "
                            f"{self.monitor} did not improve "
                            f"from {self.best:.5f}"
                        )
                else:
                    if self.verbose > 0:
                        io_utils.print_msg(
                            f"\nEpoch {epoch + 1}: saving model to {filepath}"
                        )
                    if self.save_weights_only:
                        self.model.save_weights(
                            filepath,
                            # overwrite=True,
                            # options=self._options,
                        )
                    else:
                        self.model.save(
                            filepath,
                            # overwrite=True,
                            # options=self._options,
                        )

                self._maybe_remove_file()
            except IsADirectoryError:  # h5py 3.x
                msg = (
                    "Please specify a non-directory filepath for "
                    "ModelCheckpoint. Filepath used is an existing "
                    f"directory: {filepath}"
                )
                raise OSError(msg)  # noqa: B904
            except IOError as e:  # h5py 2.x
                # `e.errno` appears to be `None` so checking the content of
                # `e.args[0]`.
                if "is a directory" in str(e.args[0]).lower():
                    msg = (
                        "Please specify a non-directory filepath for "
                        "ModelCheckpoint. Filepath used is an existing "
                        f"directory: f{filepath}"
                    )
                    raise OSError(msg)  # noqa: B904
                # Re-throw the error for any other causes.
                raise e
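
Usage is the same as for the stock ModelCheckpoint, e.g. (model and datasets are placeholders):

checkpoint = MerlinCheckpoint(
    filepath="checkpoints/merlin_model",
    monitor="val_loss",
    save_best_only=True,
    verbose=1,
)
model.fit(train, validation_data=valid, epochs=3, callbacks=[checkpoint])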

@hkristof03

@CarloNicolini sorry, I can't share the code, as I developed it during my work and it is not my property. However, I asked ChatGPT 3.5 for a basic TensorFlow model-saving template, which I then modified a bit for my own use case. Compared to the official ModelCheckpoint callback, I threw out about 60% of the code.

rnyak (Contributor) commented Jun 28, 2024

@CarloNicolini @hkristof03 can you please try our nightly container, nvcr.io/nvidia/merlin/merlin-tensorflow:nightly? Over the last 3 months there have been some fixes in the merlin-models lib for saving models.

Beyond that, we always need a proper small minimal reproducible example to repro the errors. You can generate a small fake dataset and share your entire code to repro.

Thanks.
