-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Sparse Bug] Test and sparse_remote_update can not co-exsit, crash trainer if necessary #891
Conversation
这个补丁能早期检测出sparse模型配置的异常,避免 #660 问题的出现。 原理:
本patch,检测模型是否sparse,并提前报错。 同时, 部分用户使用test逻辑进行predict的case也有潜在风险,本patch都能避免此类问题发生。另外用户使用py_paddle进行预测的case,不能被此patch捕获。 |
另外,从这个patch的定位过程总结来看,算法上test、predict、train三个阶段,如果能做到顶层代码的分离,尽量分离,底层模块尽量功能单一、简洁,顶层通过组装形式share底层逻辑最好,否则对理解问题比较困难。 当前整个sparse的设计,也尤其重构后的sparse逻辑,为了优化分开逻辑,将ids耦合到底层通信protobuf逻辑,也带来了一些问题,算法和通信过于耦合,有些顶层配置的异常直接暴露到最底层逻辑,对定位和理解问题避免的相对困难。 |
直接log(FATAL)不太好吧。我们应该可以改一下trainer来支持这个东西? |
即便要改,我觉得也应该是另一个pr来解决。 |
LOG(FATAL) << "It's prohibited to set sparse_remote_update " | ||
<< "in some layers if testing will be under going " | ||
<< "in the middle of training. You can do testing " | ||
<< "within separate process."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以用grammarly检查一下语法。 不过,在这里面报错,会不会在不同进程里也会报错呢?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test过程是要求不能用sparse配置的,所以test出错就是错误配置了。所以这种改法应该没有问题。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.
follow you comments
* fix unified transformer dtype problem * fix win dtype bug * Fix plato-2 and plato-mini dtype bug * Fix plato-2 tokenization * Refine some doc * Add general k support for topk sampling * fix seed * minor fix * Fix unitransformer readme * topk kernel optimization * add unimo model and fix generate api * add 3 datasets for unimo-text Co-authored-by: Jiaqi Liu <[email protected]> Co-authored-by: liu zhengxi <[email protected]>
fix #660
新接口已经不允许disable test datprovider,所以本patch 还要一些改动,
关联问题:
predict和sparse_remote_updater也不能共存,本patch暂时不解决这个预测问题。