DinoV2 & Depth Anything V2: Bigger Models #2288
Conversation
Disclaimer: I had some extra time, so I just proceeded in this direction. Happy to receive any feedback and adapt, of course! I've added base, large and giant. The latter only for Dino, as the DepthAnything giant model is still pending review. I also removed the awkward config object handling and added the model to the README.
There still seems to be a bug in there, comparing this large implementation with the web-based model output: this output has jitter and less pronounced depth. Will investigate. //edit I think I found it. The way they compute the patch dimensions and then use the forward method to pass those along still gives me some headaches.
Just noticed that load_image returns (channels, height, width) but load_image_and_resize returns (channels, width, height). I'll open an issue for that one. //edit
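For illustration, a minimal guard of this sort could normalise the two layouts before feeding the model; the helper and the expected (height, width) argument are hypothetical, not part of the PR:

```rust
use candle_core::{Result, Tensor};

/// Ensure an image tensor is laid out as (channels, height, width).
/// `expected_hw` is whatever (height, width) the caller resized to.
fn ensure_chw(img: Tensor, expected_hw: (usize, usize)) -> Result<Tensor> {
    let (_c, d1, d2) = img.dims3()?;
    if (d1, d2) == expected_hw {
        // Already (channels, height, width).
        Ok(img)
    } else {
        // Came back as (channels, width, height): swap the last two dims.
        img.permute((0, 2, 1))
    }
}
```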
Small update: Originally I thought the difference in output quality was due to model size, but there were still some bugs in there. Most seem to be fixed now. It's still quite hard to get the input conditions the same. For example, they convert from BGR to RGB, but then they resize with OpenCV and they're back to BGR (I verified this in the Python code by setting individual channels to zero).
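For illustration, matching the BGR/RGB convention on the Candle side is just an index_select over the channel dimension; a minimal sketch (the helper name is made up), assuming a (channels, height, width) tensor:

```rust
use candle_core::{Result, Tensor};

/// Swap BGR <-> RGB on a (channels, height, width) image tensor.
fn flip_channels(img: &Tensor) -> Result<Tensor> {
    // Reorder the channel dimension: [2, 1, 0] turns BGR into RGB and vice versa.
    let idx = Tensor::new(&[2u32, 1, 0], img.device())?;
    img.index_select(&idx, 0)
}
```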
@LaurentMazare I uploaded a new depth image of the bike picture. In your experience, do you think this could be caused by the difference between UpsampleNearest2D and F.interpolate(img, height, width, mode="bilinear", align_corners=True)? //edit GPT-4o suggests it may be a source of discrepancies:
I'll experiment a bit with an implementation that does do bilinear filtering and aligns the corners, like in PyTorch.
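For context, a rough sketch of what nearest-neighbour upsampling looks like in Candle today, where PyTorch would use bilinear interpolation with align_corners=True; the shapes and helper are hypothetical:

```rust
use candle_core::{Device, Result, Tensor};

fn upscale_depth(depth: &Tensor, target_h: usize, target_w: usize) -> Result<Tensor> {
    // Candle's built-in 2D upsampling is nearest-neighbour; there is no direct
    // equivalent of PyTorch's bilinear interpolation with align_corners=True,
    // which is one candidate for the blocky/jittery output discussed above.
    depth.upsample_nearest2d(target_h, target_w)
}

fn main() -> Result<()> {
    // Hypothetical (N, C, H, W) depth map, upscaled back to the input resolution.
    let depth = Tensor::zeros((1, 1, 266, 364), candle_core::DType::F32, &Device::Cpu)?;
    let up = upscale_depth(&depth, 532, 728)?;
    println!("{:?}", up.dims());
    Ok(())
}
```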
It's hard to say, but I wouldn't expect it to make that much of a difference. I'm assuming the image you uploaded is the Python output, which looks great, vs the Rust output, which is not that good. Usually, for the bits that are not supported in candle, I just modify the Python code to have the candle behavior, e.g. I would try without the bilinear on the Python side here to check that the Python output is still meaningful.
What I uploaded is the Rust output of the large model. The Python output is smoother. That's why I'm suspicious of the interpolation (which is also used deep inside the feature fusion blocks). Let me investigate, and apologies for all the noise :)
@LaurentMazare I need a bit of guidance here. The pushed code follows the flow of the Python implementation fairly closely; the logic is the same. However, the results are still not as good as they should be. Python small: [image] Rust small (this branch): [image] This is still not great. So I went further and opened another branch (not pushed) where I added:
The OpenCV loading and prep brings the inputs fairly close together (still a -5.86% difference in means, though), but then, already inside the DinoV2 model, things start to diverge rather badly. Now, I'm not expecting you to look at or follow the analysis, but there are differences of over 1400% between the means there. My main questions would be: How well do other ports of PyTorch models to Candle follow the original numerically? How much tolerance is there for numerical deviation? I would argue that in principle there shouldn't be any deviations at all and these statistics should match rather closely. And how chaotic can a model behave, i.e. could a 5.86% difference in the input snowball into this kind of deviation?
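For illustration, this is the kind of per-tensor statistic that can be dumped on the Rust side (and mirrored in the Python code) to see where the means start to drift; a rough debugging sketch, not part of the PR:

```rust
use candle_core::{Result, Tensor};

/// Print the mean and a rough spread of a tensor so it can be diffed
/// against the same statistic printed from the Python implementation.
fn dump_stats(label: &str, t: &Tensor) -> Result<()> {
    let t = t.to_dtype(candle_core::DType::F32)?;
    let mean = t.mean_all()?.to_scalar::<f32>()?;
    let mean_sq = t.sqr()?.mean_all()?.to_scalar::<f32>()?;
    let std = (mean_sq - mean * mean).max(0.0).sqrt();
    println!("{label}: mean={mean:.6} std={std:.6} shape={:?}", t.dims());
    Ok(())
}
```

Calling something like this after each DinoV2 block, with matching labels on the Python side, makes it easy to spot the first block where the statistics diverge.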
Submitting this PR for now, as it is definitely an improvement over #2279. There are still unexplained numerical deviations in there that already start when the DinoV2 blocks are applied. Whether this is something fundamental or something obvious, I was not able to discover, but I have stared at this problem long enough :)
Thanks for looking into this, the blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway, the PR looks good, but would it be possible not to make changes on the dinov2 side? This model code might already be used, so it's better not to break compatibility and to try to reduce the diff as much as possible.
I wouldn't mind doing that, but that was a point of the first post: the weights Facebook put on HF have the query, key and value tensors separate. However, I believe the Depth Anything weights (which contain a copy of the Dino weights) are packed in the old/different format, so let me try with that. Regarding interpolation: I tried bilinear on the candle side (as mentioned above, CPU only), and I tried nearest-neighbor interpolation inside the PyTorch implementation's interpolate_pos_encoding(), as per your comment here. According to my rudimentary analysis, however, the mean divergence already starts while getting the intermediate blocks (so the forward method of the Dino blocks produces different results). Anyway, happy to reduce the diff. Will work on that tomorrow :)
@LaurentMazare Just to be clear: these new prefixes follow the Facebook naming, and they also cover the bigger models (base, large, giant). If we keep the old prefixes, then we need to compact the Q, K, V layers into a single layer, add the head weights, and export the three models to safetensors. Is that really what you want? (I mean, Candle uses versioning, although I do understand these are breaking changes.)
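For reference, compacting the separate query/key/value tensors back into a single QKV layer and re-exporting would look roughly like the sketch below; the tensor names are assumptions and depend on the actual checkpoint layout:

```rust
use std::collections::HashMap;
use candle_core::{Device, Result, Tensor};

fn repack_qkv(src: &str, dst: &str, n_blocks: usize) -> Result<()> {
    let mut tensors: HashMap<String, Tensor> =
        candle_core::safetensors::load(src, &Device::Cpu)?;
    for i in 0..n_blocks {
        for suffix in ["weight", "bias"] {
            // Tensor names are illustrative; they depend on the actual checkpoint.
            let q = tensors.remove(&format!("blocks.{i}.attn.q.{suffix}"));
            let k = tensors.remove(&format!("blocks.{i}.attn.k.{suffix}"));
            let v = tensors.remove(&format!("blocks.{i}.attn.v.{suffix}"));
            if let (Some(q), Some(k), Some(v)) = (q, k, v) {
                // The packed layer expects q, k, v stacked along the output dimension.
                let qkv = Tensor::cat(&[&q, &k, &v], 0)?;
                tensors.insert(format!("blocks.{i}.attn.qkv.{suffix}"), qkv);
            }
        }
    }
    candle_core::safetensors::save(&tensors, dst)
}
```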
Back to the old prefixes. Oddly, using their weights instead of the Facebook weights yields a worse result: [image] Bilinear upsampling does improve it a bit, but I removed it because:
For posterity, this is with Facebook weights and CPU bilinear upscaling:
Sorry to open a draft again, but while adding more models I found out why the DinoV2 example loads its safetensors from a different location than Facebook's own HF space: Facebook's safetensors have the query, key and value layers separate, whereas the original Python code, and thus the Candle implementation, uses a single QKV layer. Furthermore, the original model file relies on separate head weights (guessing the head isn't even necessary for DepthAnything, so it might be removed from the example). So what I've done here:
Question: Do you agree with this approach?
Happy to adapt to other ideas, but if you agree, I will continue adding the base, large and giant models.
PS: I saw you also did some formatting on my previous PR and fixed some Clippy warnings. Maybe it would be an idea to have a Contributor page that details the formatting requirements and other conventions?