Can anyone tell me what the model structure of MiDaS looks like? #272
Comments
The structure of the MiDaS model is described in their preprint paper, including a diagram (figure 1) on page 3.
Yes, I tried to read the paper. Is it based on a ResNet encoder, plus a series of loss functions, to make predictions? What I really want to know is the model structure: what are its convolutional layers?
I've also looked at depthanything before, but that one was too difficult for me to understand, so I tried to understand the earlier versions of MiDaS instead. I just noticed that you uploaded muggled_dpt, which also piqued my interest. Maybe you can clear up my confusion, if you have time.
All of the newer MiDaS models (versions 3 and 3.1) switched to using a 'vision transformer' instead of ResNet to encode input images, though the rest of the DPT structure still uses convolutions. The DPT model consists of 4 parts:

1. The patch embedding and vision transformer, which generate lists of vectors (also called tokens) from the input image.
2. The reassembly model, which takes the lists of vectors and reshapes them back into image-like data (like a grid of pixels).
3. The fusion model, which combines the reassembled image data and also performs convolutions on the results.
4. The head, which just does more convolution to generate the final depth output.

Each of the parts also includes scaling/resizing steps, but these are hard-coded into the model (they're not something the model needs to learn). The original MiDaS preprint actually has figures in the appendix (on page 12) showing the convolutional steps: figure (a) shows the 'Residual Convolutional Unit' used inside the fusion part of the model, and figure (b) shows the convolutions performed inside the head. There are a lot of pieces to the DPT model, but I think if you try to understand each one individually (i.e. just the reassembly model on its own, or the fusion model on its own), it's much easier to make sense of the whole thing.
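To make those four parts concrete, here's a toy, self-contained PyTorch sketch of the pipeline. The channel counts, single transformer layer, and single fusion level are my own simplifications for illustration; the real MiDaS/DPT models tap the transformer at multiple depths and fuse several resolutions:

```python
import torch
import torch.nn as nn

class ResidualConvUnit(nn.Module):
    # Sketch of the 'Residual Convolutional Unit' used in the fusion
    # stage (figure (a), page 12 of the preprint appendix).
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.act(self.conv1(self.act(x))))
        return x + out  # residual skip connection

class TinyDPT(nn.Module):
    # Toy single-level DPT: 16x16 patches, one transformer layer,
    # one reassembly/fusion stage. Sizes are illustrative only.
    def __init__(self, ch=64, patch=16):
        super().__init__()
        # Part 1: patch embedding + vision transformer (token generator)
        self.patch_embed = nn.Conv2d(3, ch, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=ch, nhead=4, batch_first=True)
        # Part 2: reassembly (tokens -> image-like grid)
        self.reassemble = nn.Conv2d(ch, ch, kernel_size=1)
        # Part 3: fusion (residual conv unit + upsampling)
        self.fusion = ResidualConvUnit(ch)
        # Part 4: head (final convolutions -> 1-channel depth map)
        self.head = nn.Sequential(
            nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(ch // 2, 1, kernel_size=1),
        )

    def forward(self, img):
        h, w = img.shape[-2:]
        # Part 1: image -> list of tokens
        grid = self.patch_embed(img)                    # (B, C, H/16, W/16)
        b, c, gh, gw = grid.shape
        tokens = self.transformer(grid.flatten(2).transpose(1, 2))
        # Part 2: tokens -> image-like grid
        grid = self.reassemble(tokens.transpose(1, 2).reshape(b, c, gh, gw))
        # Part 3: fusion, then hard-coded resize back to input resolution
        grid = self.fusion(grid)
        grid = nn.functional.interpolate(
            grid, size=(h, w), mode="bilinear", align_corners=False)
        # Part 4: head -> depth prediction
        return self.head(grid)

depth = TinyDPT()(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 224, 224])
```

The real models repeat parts 2 and 3 at four different transformer depths and merge them coarsest-to-finest, but the data flow per level is the same as in this sketch.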
I actually think the depthanything implementation is simpler than any of the MiDaS models (though they all share the same DPT structure). On the muggled_dpt repo, there is separate code for each of the model components: patch embedding, vision transformer, reassembly model, fusion model and head model. If you're comfortable looking through code, that may be the easiest way to see how each piece works.
How do I train my own model based on that?
In theory, any 'typical' training loop should work on these DPT models. However, doing a good job of training is generally difficult to get right, and there are entire research papers devoted to it; it's basically a PhD thesis topic at the moment. For example, the depth-anything paper is like this: it's almost entirely focused on how to do a better job of training these models rather than on the model structure, so it can be very difficult to understand! There's surprisingly little example code available for training these types of models (at least, I haven't found much). The only ones I know of are for the original ZoeDepth models and the related depth-anything v1 metric depth and v2 metric depth models. So I'd recommend starting with that code to get an idea of how to handle training, as well as reading the original MiDaS paper, which describes the training procedure (starting on page 5), and the first depth-anything paper, which describes a similar procedure (starting on page 3). Alternatively, the Marigold repo (very accurate, but slower than DPT models) released training code, which you might want to check out as well (if you don't specifically need a DPT model).
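For a rough idea of what a 'typical' loop looks like, here's a minimal sketch using a scale-and-shift-invariant alignment in the spirit of the loss described in the MiDaS paper. The stand-in model, fake data, and the exact loss form are illustrative assumptions, not the actual training code from any of those repos (the paper's full loss also includes a multi-scale gradient-matching term, omitted here):

```python
import torch
import torch.nn as nn

def ssi_loss(pred, target, eps=1e-8):
    # Scale-and-shift-invariant loss (MiDaS-style): least-squares align
    # each predicted map to the ground truth with a per-image scale s and
    # shift t, then penalize the remaining error.
    p, g = pred.flatten(1), target.flatten(1)
    var = p.pow(2).mean(1) - p.mean(1).pow(2)
    s = ((p * g).mean(1) - p.mean(1) * g.mean(1)) / (var + eps)
    t = g.mean(1) - s * p.mean(1)
    aligned = s.unsqueeze(1) * p + t.unsqueeze(1)
    return (aligned - g).abs().mean()

# Stand-in model and data; swap in a real DPT model and a real
# depth dataset / dataloader here.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):  # real training would iterate over a dataloader
    images = torch.randn(4, 3, 224, 224)   # fake RGB batch
    gt_depth = torch.rand(4, 224, 224)     # fake inverse-depth labels
    pred = model(images).squeeze(1)
    loss = ssi_loss(pred, gt_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

The hard parts that the papers focus on (dataset mixing, pseudo-labeling, augmentation schedules) all live outside this loop, which is why I'd still recommend reading the training sections of the MiDaS and depth-anything papers.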
I was just wondering if it's possible to use the metric-depth code in depthanything to train a model of your own with that.