
[Sparse Bug] Test and sparse_remote_update cannot co-exist; crash trainer if necessary #891

Merged

merged 4 commits into PaddlePaddle:develop on Dec 16, 2016

Conversation

backyes
Contributor

@backyes backyes commented Dec 14, 2016

fix #660

The new interface no longer allows disabling the test dataprovider, so this patch still needs some further changes.

Related issue:

predict and sparse_remote_updater also cannot coexist; this patch does not try to solve that prediction problem for now.

@backyes
Contributor Author

backyes commented Dec 15, 2016

This patch detects a misconfigured sparse model early and avoids the problem described in #660.

How it works:

  • In cluster sparse mode there is a prefetch step: based on the data currently being processed, it selects the appropriate training parameters to pull from the pserver, which optimizes communication and overall update efficiency (see the sketch below). Testing, however, is a forward-only pass with no backward or update phase. Because testing and training currently share one code path to reduce GPU memory consumption, the test forward pass also shares part of the prefetch logic and data, so a misconfigured cluster sparse training job is not caught until the forward pass runs.
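
As a rough illustration only (the names below are hypothetical, not the actual Paddle code), remote sparse prefetch boils down to collecting the ids seen in the current batch and pulling just those parameter rows from the pserver:

#include <cstdint>
#include <set>
#include <vector>

// Hypothetical sketch of the prefetch idea: gather the (deduplicated) ids that
// occur in the current batch, then request only those rows of the sparse
// parameter from the pserver instead of transferring the whole table.
std::set<int64_t> rowsToPrefetch(const std::vector<int64_t>& batchIds) {
  return std::set<int64_t>(batchIds.begin(), batchIds.end());
}

// The rows returned above would be fetched before forward/backward runs, so
// update traffic only covers the rows this batch actually touches.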

This patch checks whether the model is configured with sparse (remote) updates and reports the error ahead of time.

It also guards the case where some users run prediction through the test logic, which carries the same latent risk; this patch prevents those failures as well. However, the case where users predict through py_paddle cannot be caught by this patch.
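
For concreteness, here is a minimal sketch of the kind of early check described above; ParamConf, checkSparseRemoteUpdate and testingInSameProcess are illustrative names under assumed types, not the actual Paddle API:

#include <glog/logging.h>
#include <string>
#include <vector>

// Hypothetical, simplified view of one parameter's configuration; in Paddle the
// per-parameter config carries a sparse_remote_update flag.
struct ParamConf {
  std::string name;
  bool sparseRemoteUpdate;
};

// Fail fast at configuration time instead of letting the shared prefetch path
// blow up later inside the test forward pass.
void checkSparseRemoteUpdate(bool testingInSameProcess,
                             const std::vector<ParamConf>& params) {
  if (!testingInSameProcess) return;
  for (const auto& p : params) {
    if (p.sparseRemoteUpdate) {
      LOG(FATAL) << "Parameter " << p.name << " uses sparse_remote_update, "
                 << "which cannot be combined with testing in the same process.";
    }
  }
}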

@backyes
Contributor Author

backyes commented Dec 15, 2016

Also, looking back at how this patch was diagnosed: if the test, predict, and train stages can be separated in the top-level code, they should be separated as much as possible; low-level modules should stay single-purpose and simple, and the top level should share the low-level logic by composition. Otherwise problems become hard to understand. In the current sparse design, and especially in the refactored sparse logic, the ids are coupled into the low-level communication protobuf logic in order to optimize the separated code paths. This brings its own problems: the algorithm and communication layers are too tightly coupled, and some top-level misconfigurations surface directly in the lowest-level logic, which makes them relatively hard to locate and understand.

@reyoung
Collaborator

reyoung commented Dec 15, 2016

A bare LOG(FATAL) does not seem ideal. Shouldn't we be able to change the trainer to support this?

@backyes
Contributor Author

backyes commented Dec 15, 2016

@reyoung

  • The root cause of this error comes from inside the gradient code, so it does not look easy to fix at the top level of the trainer.

  • If we only warn and ignore it:
    * Automatically rewriting the sparse configuration (e.g. disabling it) would have a very large impact: performance gets much slower.
    * Automatically disabling the test phase would diverge a lot from the interface the user expects.

  • Going straight to FATAL actually feels acceptable for now; only users with a remote sparse configuration are affected.

Even if we do change it, I think that should be handled in a separate PR.

LOG(FATAL) << "It's prohibited to set sparse_remote_update "
<< "in some layers if testing will be under going "
<< "in the middle of training. You can do testing "
<< "within separate process.";
Collaborator


It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Collaborator


You could check the grammar with Grammarly. But if we report the error here, won't it also fire when the jobs run in different processes?

Contributor Author


The test phase is required not to use the sparse configuration, so an error during test means the configuration itself is wrong. So this change should be fine.

Contributor Author


It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Followed your comments.

@reyoung reyoung merged commit 04fb1fc into PaddlePaddle:develop Dec 16, 2016
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
Successfully merging this pull request may close these issues.

sparse training on a cluster fails after pass 0