For train.crl and train.mtcl, I don't understand what kind of result counts as a finished run? #15

Closed
YYrgb opened this issue Dec 26, 2024 · 9 comments

Comments

@YYrgb

YYrgb commented Dec 26, 2024

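When I run train.crl, max-epoch is 500. I have read the corresponding paper, but I still don't quite understand when it is appropriate to stop the run.
Likewise, when I run train.mtcl, it takes a long time. The top line shows fold 1 epoch 0, and below it a progress bar shows the accuracy, which keeps rising slowly. I don't know whether what I am doing is correct?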

@SeongjuLee
Contributor

Thank you for your interest in our work!
In the CRL stage, early stopping with a patience of 20 is applied. This means the training is terminated if there is no decrease in validation loss within 20 epochs.
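The stopping rule described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function name `run_with_early_stopping` and the idea of feeding in a precomputed list of validation losses are assumptions made purely for the example.

```python
def run_with_early_stopping(val_losses, patience=20):
    """Return (last_epoch_run, best_loss) under a patience-based stopping rule.

    Hypothetical helper: `val_losses` stands in for the validation loss
    measured after each epoch. Training stops once `patience` consecutive
    epochs pass without a new best (lower) validation loss.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs: stop early
    return last_epoch, best_loss


# Loss improves for three epochs, then plateaus.
losses = [1.0, 0.9, 0.8] + [0.85] * 30
print(run_with_early_stopping(losses, patience=20))
```

Under such a rule, max-epoch acts only as an upper bound; a fold would normally finish well before epoch 500 once the validation loss stops improving.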

Regarding your second question, I anticipate the following possible situations:

You might have conducted MTCL before finishing the CRL stage. The MTCL stage should only be conducted after the CRL stage is completely finished (i.e., after training all folds).
The training may be running on a CPU rather than a GPU.
Please share some screenshots; they will be helpful for identifying and resolving the issues.

@YYrgb
Author

YYrgb commented Dec 27, 2024

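First of all, thank you for your reply. You are right, it is running on a CPU. For CRL, I only ran until the result of the fold-1 training was saved, and then I ran MTCL, also only until the network parameters for fold 1 were saved. Probably because I am using a CPU and I halved batch_size, completing fold 1 has already taken at least 24 hours. The images below show the results of my MTCL run. The stty call in utils does not work on Windows, so I replaced that code with the tqdm library to display progress. The first image shows some information printed when MTCL had just started. The second image shows validation starting after the fold-1 training finished. The third shows the model being saved at the end of the validation stage. After that, training started again. I don't know whether this is the start of the second epoch?
7bf906de726947d80378b8f0995dd7b
Thank you again for replying to me in the first place. Thank you!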

@YYrgb
Author

YYrgb commented Dec 27, 2024

Uploading b76d3b12c565fc95eb36dd30b918700.png…

@YYrgb
Author

YYrgb commented Dec 27, 2024

Uploading d0ee14c7359a4e82ebfa53ac4bb813f.png…

@SeongjuLee
Contributor

image
The second and third images are broken. Can you re-upload the second and third screenshots?

@YYrgb
Author

YYrgb commented Dec 30, 2024

Thank you again for your reply. I am re-uploading the second and third images.
b76d3b12c565fc95eb36dd30b918700
d0ee14c7359a4e82ebfa53ac4bb813f

@SeongjuLee
Contributor

SeongjuLee commented Dec 31, 2024

In our work, we use an iteration-based training loop, which means that validation is run every N training iterations (not every epoch). Please refer to our paper and the "val_period" variable in the config.
In your case, "val_period" is set to 500 and the batch size is 16. Therefore, the validation step runs at the 8000th training iteration (500*16).
Anyway, there seems to be no issue with training or validation.
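The arithmetic above (500*16 = 8000) can be illustrated with a small sketch of an iteration-based loop. It rests on an assumption that the thread does not confirm: that the progress counter advances by the batch size on every training step, so validation fires each time the counter reaches a multiple of val_period times the batch size. The function `validation_points` and its parameters are hypothetical and are not taken from the repository.

```python
def validation_points(num_steps, val_period, batch_size, count_samples=True):
    """Return the counter values at which validation would be triggered.

    Assumption (hypothetical, not the repository's code): with
    count_samples=True the counter grows by `batch_size` per training step,
    so validation fires every val_period * batch_size counted units.
    With count_samples=False the counter grows by 1 per step instead.
    """
    increment = batch_size if count_samples else 1
    trigger = val_period * increment
    counter = 0
    points = []
    for _ in range(num_steps):
        # A real loop would run one training step on one batch here.
        counter += increment
        if counter % trigger == 0:
            # A real loop would run validation and possibly save a checkpoint
            # here, then carry on training inside the same epoch.
            points.append(counter)
    return points


print(validation_points(num_steps=1200, val_period=500, batch_size=16))
```

Either way of counting leads to the same practical reading of the screenshots: training that resumes after a validation pass and a saved model is the same epoch continuing, not necessarily the start of a new epoch.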

@YYrgb
Author

YYrgb commented Dec 31, 2024

OK, thank you very much for your reply! I wish you all the best!

@SeongjuLee
Contributor

Thanks. Happy new year!
