This repository has been archived by the owner on Oct 15, 2019. It is now read-only.

Possible improvements about the wrapper #160

Open
wddabc opened this issue Mar 13, 2017 · 11 comments


@wddabc

wddabc commented Mar 13, 2017

While minpy is a great tool for building dynamic nets, I found the wrapper (Solver, ModelBase, ...) a little hard to use, and I had to do a lot of hacks to make it work. Here are some thoughts on possible improvements at the current stage:

  1. Lots of things should be passed into Solver.train rather than the constructor, such as the data iterator, the optimizer, etc. In fact, Solver.train should support more, such as whether to do cross-validation, and callback functions. I think a good template is the .fit function in Keras.
  2. As pointed out in Why solver._step() shouldn't be called manually? #147, users should have the flexibility to write their own training loop, so I think a public step is necessary.
  3. The definition of the final training loss is inflexible.
 def loss_func(*params):  # pylint: disable=unused-argument
     """Loss function that calculates the loss."""
     # It seems that params are not used in the forward function, but since we
     # pass model.params as arguments, we are OK here.
     predict = self.model.forward_batch(batch, mode='train')
     return self.model.loss_batch(batch, predict)

Assuming the data is x and the response is y, it is not necessarily the case that the final loss decomposes into loss(model.predict(x), y). In some cases it is just a single loss model.loss(x, y), meaning no prediction is made at training time. This is a common situation in structured prediction problems, such as conditional random fields.
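To make the point concrete, here is a minimal sketch (all names hypothetical, plain numpy rather than the minpy API) of a model that exposes only a joint loss, and a loss_func wrapper that delegates to it without ever calling predict:

```python
import numpy as np

class JointLossModel:
    """A model (e.g. a CRF) whose training loss is not predict-then-compare."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def loss_batch(self, batch):
        # The loss couples x and y directly; no standalone prediction exists.
        x, y = batch
        scores = x @ self.w
        # A toy squared-error objective standing in for a structured loss.
        return float(np.mean((scores - y) ** 2))

def make_loss_func(model, batch):
    # The proposed wrapper: delegate entirely to the model.
    def loss_func(*params):
        return model.loss_batch(batch)
    return loss_func

model = JointLossModel(dim=3)
batch = (np.ones((4, 3)), np.zeros(4))
print(make_loss_func(model, batch)())  # 0.0 since w is zero-initialized
```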

@sneakerkg
Member

@wddabc Thanks a lot for the suggestion.
We are going to refine the nn package to make it more convenient for users.
For the loss_func part, what do you think about pre-defining some loss or criterion templates for users?

@wddabc
Author

wddabc commented Mar 13, 2017

I think directly returning self.model.loss_batch(batch) should be fine. Something looks like:

def loss_func(*params):
    return self.model.loss_batch(batch)

Users can define whatever they like as the implementation of model.loss_batch(batch), for example loss(model.predict(x), y). model.predict is optional; users can define it or not depending on whether they want to make predictions. For example, a user might just use the model to pretrain some word embeddings, in which case model.predict is unnecessary.

One thing to be careful about is that a wrapper on top of loss_batch is still necessary to control whether the dropout/regularizer is switched on. Might it be defined inside loss_func? I'm not quite sure.

The dropout/regularizer should be declared by the user as arguments of .add_param() and applied before calling self.params[name]. In fact, I'd suggest changing self.params[name] to self.params(name), where self.params() decides whether to apply the dropout based on whether it is in the training stage.
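A minimal sketch of this proposal (all names hypothetical; plain numpy, not the actual minpy API): a callable parameter store that applies inverted dropout only while training.

```python
import numpy as np

class ParamStore:
    """Sketch: self.params(name) applies dropout only in training mode."""
    def __init__(self):
        self._params = {}
        self._dropout = {}
        self.training = True

    def add_param(self, name, value, dropout=0.0):
        self._params[name] = value
        self._dropout[name] = dropout
        return self  # allow chained declarations, as in ModelBase

    def __call__(self, name):
        w = self._params[name]
        p = self._dropout[name]
        if self.training and p > 0:
            # Inverted dropout: scale kept entries so the expectation is unchanged.
            mask = (np.random.rand(*w.shape) >= p) / (1.0 - p)
            return w * mask
        return w

params = ParamStore().add_param('Wx', np.ones((2, 2)), dropout=0.5)
params.training = False
print(params('Wx'))  # evaluation mode: dropout is a no-op
```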

Here are some other issues that should be considered in the new design (I think I'm basically proposing something Keras-like):

  1. ModelBase looks like a standalone module that doesn't support calling other models. That is, you have to define one huge graph within a single ModelBase.

  2. Also, jointly training different models with shared parameters is not supported in the current design. For this case, I think the right design is for the parameters of each sub-module to be globally transparent to one another.

@sneakerkg
Member

@wddabc Thanks a lot for the suggestions. We totally agree that a flag like "isTrain" is needed when calling loss_batch;
And:

  1. For the ModelBase problem, we now have the concept of a Module in https://github.com/dmlc/minpy/blob/master/minpy/nn/model_builder.py . We can have multiple sub-modules in a whole model;
  2. Thanks for pointing out the shared-parameter issue. Could I bother you to give us some suggestions on the interface for shared-parameter functionality?
  3. I don't quite understand why we need to apply dropout to the params. Is it a common trick in NLP tasks? For vision and audio tasks, dropout is usually applied to the activations.

@wddabc
Author

wddabc commented Mar 17, 2017 via email

@wddabc
Author

wddabc commented Mar 17, 2017

I think there is still a flaw in the dry-run idea: I forgot that users might want to build the graph themselves, and they shouldn't have to implement this. This looks like a hard problem, as it needs static analysis of a dynamic graph. But the lazy-init approach still seems viable. Or we could just let it go and let users take care of the shapes.
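The lazy-init approach mentioned above could look something like the following sketch (hypothetical names, plain numpy): the layer declares only its output size, and the weight shape is inferred from the first batch it sees.

```python
import numpy as np

class LazyLinear:
    """Sketch of lazy init: only out_features is declared up front;
    in_features (and hence the weight shape) is inferred from data."""
    def __init__(self, out_features):
        self.out_features = out_features
        self.weight = None

    def forward(self, x):
        if self.weight is None:
            # First call: infer in_features from the incoming batch.
            in_features = x.shape[1]
            self.weight = np.random.randn(in_features, self.out_features) * 0.01
        return x @ self.weight

layer = LazyLinear(out_features=8)
y = layer.forward(np.zeros((4, 16)))
print(layer.weight.shape, y.shape)  # (16, 8) (4, 8)
```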

@GaiYu0
Collaborator

GaiYu0 commented Mar 17, 2017

@wddabc Thank you very much for your suggestion! (I should have participated in discussion earlier)

The module Parallel has not been implemented because, in the framework of model_builder, it is difficult to provide a flexible interface for users to specify an operation on the multiple outputs of Parallel. The difficulty arises because a graph created by model_builder is essentially a "list", as you said. I agree that model_builder should support more general graph composition, and once general graph composition is implemented, it will be simple for users to manipulate the outputs of Parallel.

@GaiYu0
Collaborator

GaiYu0 commented Mar 17, 2017

Resolving parameter sharing might be easier. I am implementing an initializer that enables users to index initialization configuration by parameter. For example, initializer[convolution0.weight].std = 0.1 specifies that the weight of convolution layer 0 should be sampled from a distribution with a standard deviation of 0.1. On this basis it is possible to implement an interface of the form initializer.share(parameters), which binds those parameters at initialization. No other modification is required for forward functions. What do you think?
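The interface described above might be sketched roughly as follows (hypothetical names, plain numpy; not the actual implementation): per-parameter config via indexing, plus a share() that binds parameters to the same underlying array.

```python
import numpy as np

class InitConfig:
    """Per-parameter initialization configuration."""
    def __init__(self):
        self.std = 0.01  # default standard deviation

class Initializer:
    """Sketch of indexable init configuration plus share()."""
    def __init__(self):
        self._configs = {}
        self._shared = []

    def __getitem__(self, param_name):
        # initializer['convolution0.weight'].std = 0.1
        return self._configs.setdefault(param_name, InitConfig())

    def share(self, param_names):
        # Bind these parameters together at initialization time.
        self._shared.append(list(param_names))

    def initialize(self, shapes):
        arrays = {}
        for name, shape in shapes.items():
            arrays[name] = np.random.randn(*shape) * self[name].std
        for group in self._shared:
            first = arrays[group[0]]
            for name in group[1:]:
                arrays[name] = first  # shared: same underlying array
        return arrays

init = Initializer()
init['convolution0.weight'].std = 0.1
init.share(['convolution0.weight', 'convolution1.weight'])
arrays = init.initialize({'convolution0.weight': (3, 3),
                          'convolution1.weight': (3, 3)})
print(arrays['convolution0.weight'] is arrays['convolution1.weight'])  # True
```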

@GaiYu0
Collaborator

GaiYu0 commented Mar 17, 2017

Also, model_builder only supports static graph composition for the time being, i.e. statically declaring a graph and using a function to convert the graph into a model. Dry-running is going to be an issue when model_builder supports dynamic graph composition in the future, as PyTorch does. PyTorch forces users to declare in advance all layers that are going to be used in dynamic graph composition, so that PyTorch can register related info, e.g. parameters. I am not sure whether PyTorch figures out parameter shapes dynamically; it appears that the information provided in a PyTorch layer declaration is sufficient to figure out shapes in advance. For example, the fully-connected layer in PyTorch is:

[screenshot: PyTorch's fully-connected (Linear) layer declaration]

The shape of weight is (in_features, out_features), which can be inferred from the layer declaration alone. model_builder only requires out_features, so the shape of weight must be inferred from data. Perhaps PyTorch-style dynamic graph composition is more feasible in terms of implementation difficulty?
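The contrast with the lazy approach can be sketched in a few lines (hypothetical names, plain numpy): with a PyTorch-style declaration, both dimensions are given up front, so the weight exists before any data arrives.

```python
import numpy as np

class Linear:
    """PyTorch-style declaration: both in_features and out_features are
    given up front, so the weight shape is known before any data arrives."""
    def __init__(self, in_features, out_features):
        self.weight = np.random.randn(in_features, out_features) * 0.01

    def forward(self, x):
        return x @ self.weight

fc = Linear(in_features=16, out_features=8)
print(fc.weight.shape)  # (16, 8): known from the declaration alone
```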

@wddabc
Author

wddabc commented Mar 18, 2017

I agree, as I said in the last email. Maybe it is fine to just let users take care of the shapes, as PyTorch does. The dynamic graph by itself is easy enough.

Regarding the shared parameters, I also agree that it is easy to fix. For the initialization, did you mean you want a separate initializer to take care of initialization? I think specifying the initialization method as an argument of the parameter declaration is good enough, just as the current implementation does. As a user, I'm thinking about something like this:

class RNNNet(ModelBase):
    def __init__(self,
                 batch_size=100,
                 input_size=2,
                 hidden_size=64,
                 num_classes=1):
        super(RNNNet, self).__init__()
        # 'Wx': when using a random init, the shape must be given.
        # 'Wh': when an actual np ndarray is given, the shape can be omitted.
        # 'b': a shared parameter; the shape can also be omitted. 'b' becomes a
        #      local alias of 'convolution0:W', which is a global name.
        #      (global_params is a hypothetical global registry; 'global' itself
        #      is a reserved word in Python.)
        (self.add_param(name='Wx', shape=(input_size, hidden_size), init='uniform')
             .add_param(name='Wh', init=np.zeros((hidden_size, hidden_size)))
             .add_param(name='b', share=global_params.parameter('convolution0:W')))

This is just from a user's perspective.

@zzhang-cn
Collaborator

zzhang-cn commented Mar 18, 2017 via email

@wddabc
Author

wddabc commented Mar 19, 2017

Agreed. I think one important thing is figuring out what functionality a Module should have (for example, load(), save(), fit(), predict(), ...) and how these could be implemented in the context of dynamic graphs. Starting from a sequential network is a good way to begin.
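A rough sketch of what such a Module surface might look like (entirely hypothetical; the methods are the ones named above, with pickle standing in for a real serialization format):

```python
import os
import pickle
import tempfile

class Module:
    """Sketch of the discussed Module interface: load/save/fit/predict."""
    def __init__(self):
        self.params = {}

    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump(self.params, f)

    def load(self, path):
        with open(path, 'rb') as f:
            self.params = pickle.load(f)

    def fit(self, data_iter, optimizer, epochs=1):
        # A training loop built on a public step(), per the discussion above.
        raise NotImplementedError

    def predict(self, x):
        # Optional: some models (e.g. embedding pretraining) never predict.
        raise NotImplementedError

# Save/load round trip.
m = Module()
m.params = {'w': 1.0}
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
m.save(path)
m2 = Module()
m2.load(path)
print(m2.params)  # {'w': 1.0}
```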
