an example of a word-level language model using FluxML #357

Open
jaffourt wants to merge 3 commits into master

Conversation

jaffourt

No description provided.

Comment on lines +80 to +82
grad_x = clamp!(gradient[x], -args.clip, args.clip)
# manual gradient-descent update with the clipped gradient
x .-= lr .* grad_x
ToucheSir (Member)

In the spirit of promoting best practices, we have https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.ClipValue and the rest of https://fluxml.ai/Flux.jl/stable/training/optimisers/ for this. I imagine you'd want to use something more sophisticated than plain SGD anyhow.
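
For reference, a minimal sketch of that suggestion, reusing names from this PR (model, data_loader, loss, args.clip); the learning-rate field args.lr is an assumption:

using Flux
using Flux.Optimise: Optimiser, ClipValue, ADAM

# compose element-wise gradient clipping with an adaptive optimiser,
# replacing the manual clamp! + SGD step
opt = Optimiser(ClipValue(args.clip), ADAM(args.lr))
ps = Flux.params(model)

for (x, y) in data_loader
    gs = gradient(() -> loss(x, y), ps)
    Flux.Optimise.update!(opt, ps, gs)  # clipping runs before the ADAM step
end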

jaffourt (Author)

@ToucheSir I opted not to use the library methods because this explicit calculation gave better performance. However, in the interest of promoting best practices, I agree and can switch to the library methods.


# logit cross entropy loss function
function loss(x, y)
Flux.reset!(model)
ToucheSir (Member)

Would highly recommend moving this outside of the loss function (i.e. into the training loop).

jaffourt (Author)

@ToucheSir Can you elaborate on this? My understanding is that the loss function is called for each batch in a single training loop, thus for a sequential language model such as this one we want to reset the hidden state after each batch.

Member

I presume the suggestion is to keep it inside the for batch in data_loader loop, but outside the gradient call.
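
Concretely, something like the following, keeping the PR's names (model, loss, data_loader; opt and ps stand in for whatever optimiser and parameter collection the script uses):

ps = Flux.params(model)
for (x, y) in data_loader
    Flux.reset!(model)                   # reset the recurrent state once per batch,
    gs = gradient(() -> loss(x, y), ps)  # outside of the gradient call
    Flux.Optimise.update!(opt, ps, gs)
end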

hold_out = zip(x_train[end-5:end], y_train[end-5:end])

# used for updating hyperparameters
best_val_loss = nothing
Member

Suggested change
- best_val_loss = nothing
+ local best_val_loss

Co-authored-by: Brian Chen <[email protected]>
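
For context, a hedged sketch of how the declared-but-unassigned variable could be used; the enclosing function, epoch loop, and mean over hold_out are assumptions rather than code from this PR:

using Statistics: mean

function train(args)
    # ... model and data setup ...
    local best_val_loss                  # declared, but given no value until one exists
    for epoch in 1:args.epochs
        # ... one epoch of training ...
        val_loss = mean(loss(x, y) for (x, y) in hold_out)
        if !(@isdefined best_val_loss) || val_loss < best_val_loss
            best_val_loss = val_loss
        end
    end
end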
@mcabbott mcabbott added the new model new arrival in the zoo label Dec 11, 2022