DinoV2 & Depth Anything V2: Bigger Models #2288
Conversation
Disclaimer: I had some extra time, so I just proceeded in this direction. Happy to receive any feedback and adapt, of course! I've added base, large and giant. The latter only for Dino, as the DepthAnything giant model is still pending review. I also removed the awkward config object handling and added the model to the README.
There still seems to be a bug in there, comparing this large implementation with the web-based model output: this output has jitter and less pronounced depth. Will investigate. //edit I think I found it. The way they compute the patch dimensions and then use the forward method to pass those along still gives me some headaches.
Just noticed that load_image returns (channels, height, width) but load_image_and_resize returns (channels, width, height). I'll open an issue for that one. //edit
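For illustration, a minimal guard of this sort could normalise the two layouts before feeding the model; the helper and the expected (height, width) argument are hypothetical, not part of the PR:

```rust
use candle_core::{Result, Tensor};

/// Ensure an image tensor is laid out as (channels, height, width).
/// `expected_hw` is whatever (height, width) the caller resized to.
fn ensure_chw(img: Tensor, expected_hw: (usize, usize)) -> Result<Tensor> {
    let (_c, d1, d2) = img.dims3()?;
    if (d1, d2) == expected_hw {
        // Already (channels, height, width).
        Ok(img)
    } else {
        // Came back as (channels, width, height): swap the last two dims.
        img.permute((0, 2, 1))
    }
}
```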
Small update: Originally I thought the difference in output quality was due to model size, but there were still some bugs in there. Most seem to be fixed now. It's still quite hard to get the input conditions the same. For example, they convert from BGR to RGB, but then they resize with OpenCV and they're back to BGR (I verified this in the Python code by setting individual channels to zero).
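For illustration, matching the BGR/RGB convention on the Candle side is just an index_select over the channel dimension; a minimal sketch (the helper name is made up), assuming a (channels, height, width) tensor:

```rust
use candle_core::{Result, Tensor};

/// Swap BGR <-> RGB on a (channels, height, width) image tensor.
fn flip_channels(img: &Tensor) -> Result<Tensor> {
    // Reorder the channel dimension: [2, 1, 0] turns BGR into RGB and vice versa.
    let idx = Tensor::new(&[2u32, 1, 0], img.device())?;
    img.index_select(&idx, 0)
}
```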
@LaurentMazare I uploaded a new depth image of the bike picture. In your experience, do you think this could be caused by the difference between UpsampleNearest2D and F.interpolate(img, height, width, mode="bilinear", align_corners=True)? //edit GPT-4o suggests it may be a source of discrepancies:
I'll experiment a bit with an implementation that does do bilinear filtering and aligns the corners, like in PyTorch.
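For context, a rough sketch of what nearest-neighbour upsampling looks like in Candle today, where PyTorch would use bilinear interpolation with align_corners=True; the shapes and helper are hypothetical:

```rust
use candle_core::{Device, Result, Tensor};

fn upscale_depth(depth: &Tensor, target_h: usize, target_w: usize) -> Result<Tensor> {
    // Candle's built-in 2D upsampling is nearest-neighbour; there is no direct
    // equivalent of PyTorch's bilinear interpolation with align_corners=True,
    // which is one candidate for the blocky/jittery output discussed above.
    depth.upsample_nearest2d(target_h, target_w)
}

fn main() -> Result<()> {
    // Hypothetical (N, C, H, W) depth map, upscaled back to the input resolution.
    let depth = Tensor::zeros((1, 1, 266, 364), candle_core::DType::F32, &Device::Cpu)?;
    let up = upscale_depth(&depth, 532, 728)?;
    println!("{:?}", up.dims());
    Ok(())
}
```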
It's hard to say, but I wouldn't expect it to make that much of a difference. I'm assuming the image you uploaded is the Python output, which looks great, vs the Rust output, which is not that good. Usually, for the bits that are not supported in candle, I just modify the Python code to have the candle behavior, e.g. I would try without the bilinear on the Python side here to check that the Python output is still meaningful.
What I uploaded is the Rust output of the large model. The Python output is smoother. That's why I'm suspicious of the interpolation (which is also used deep inside the feature fusion blocks). Let me investigate, and apologies for all the noise :)
@LaurentMazare I need a bit of guidance here. The pushed code follows the flow of the Python implementation fairly closely; the logic is the same. However, the results are still not as good as they should be. Python small: [image] Rust small (this branch): [image] This is still not great. So I went further and opened another branch (not pushed) where I added:
The OpenCV loading and prep brings the inputs fairly close together (still a -5.86% difference in means, though), but then, already inside the DinoV2 model, things start to diverge rather badly. Now, I'm not expecting you to look at or follow the analysis, but there are differences of over 1400% between the means there. My main questions would be: How well do other ports of PyTorch models to Candle follow the original numerically? How much tolerance is there for numerical deviation? I would argue that in principle there shouldn't be any deviations at all and these statistics should match rather closely. And how chaotic can a model behave, i.e. could a 5.86% difference in the input snowball into this kind of deviation?
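For illustration, this is the kind of per-tensor statistic that can be dumped on the Rust side (and mirrored in the Python code) to see where the means start to drift; a rough debugging sketch, not part of the PR:

```rust
use candle_core::{Result, Tensor};

/// Print the mean and a rough spread of a tensor so it can be diffed
/// against the same statistic printed from the Python implementation.
fn dump_stats(label: &str, t: &Tensor) -> Result<()> {
    let t = t.to_dtype(candle_core::DType::F32)?;
    let mean = t.mean_all()?.to_scalar::<f32>()?;
    let mean_sq = t.sqr()?.mean_all()?.to_scalar::<f32>()?;
    let std = (mean_sq - mean * mean).max(0.0).sqrt();
    println!("{label}: mean={mean:.6} std={std:.6} shape={:?}", t.dims());
    Ok(())
}
```

Calling something like this after each DinoV2 block, with matching labels on the Python side, makes it easy to spot the first block where the statistics diverge.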
Submitting this PR for now, as it is definitely an improvement over #2279. There are still unexplained numerical deviations in there that already start when the DinoV2 blocks are applied. Whether this is something fundamental or something obvious, I was not able to discover, but I have stared at this problem long enough :)
Thanks for looking into this, the blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway, the PR looks good, but would it be possible not to make changes on the dinov2 side? This model code might already be used, so it's better not to break compatibility and to try to reduce the diff as much as possible.
I wouldn't mind doing that, but that was a point of the first post: the weights Facebook put on HF have the query, key and value tensors separate. However, I believe the Depth Anything weights (which contain a copy of the Dino weights) are packed in the old/different format, so let me try with that. Regarding interpolation: I tried bilinear on the candle side (as mentioned above, CPU only), and I tried nearest-neighbor interpolation inside the PyTorch implementation's interpolate_pos_encoding(), as per your comment here. According to my rudimentary analysis, however, the mean divergence already starts while getting the intermediate blocks (so the forward method of the Dino blocks produces different results). Anyway, happy to reduce the diff. Will work on that tomorrow :)
@LaurentMazare Just to be clear: these new prefixes follow the Facebook naming, and they also cover the bigger models (base, large, giant). If we keep the old prefixes, then we need to compact the Q, K, V layers into a single layer, add the head weights, and export the three models to safetensors. Is that really what you want? (I mean, Candle uses versioning, although I do understand these are breaking changes.)
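For reference, compacting the separate query/key/value tensors back into a single QKV layer and re-exporting would look roughly like the sketch below; the tensor names are assumptions and depend on the actual checkpoint layout:

```rust
use std::collections::HashMap;
use candle_core::{Device, Result, Tensor};

fn repack_qkv(src: &str, dst: &str, n_blocks: usize) -> Result<()> {
    let mut tensors: HashMap<String, Tensor> =
        candle_core::safetensors::load(src, &Device::Cpu)?;
    for i in 0..n_blocks {
        for suffix in ["weight", "bias"] {
            // Tensor names are illustrative; they depend on the actual checkpoint.
            let q = tensors.remove(&format!("blocks.{i}.attn.q.{suffix}"));
            let k = tensors.remove(&format!("blocks.{i}.attn.k.{suffix}"));
            let v = tensors.remove(&format!("blocks.{i}.attn.v.{suffix}"));
            if let (Some(q), Some(k), Some(v)) = (q, k, v) {
                // The packed layer expects q, k, v stacked along the output dimension.
                let qkv = Tensor::cat(&[&q, &k, &v], 0)?;
                tensors.insert(format!("blocks.{i}.attn.qkv.{suffix}"), qkv);
            }
        }
    }
    candle_core::safetensors::save(&tensors, dst)
}
```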
Back to the old prefixes. Oddly, using their weights instead of the Facebook weights yields a worse result: [image] Bilinear upsampling does improve it a bit, but I removed it because:
For posterity, this is with Facebook weights and CPU bilinear upscaling:
Sorry to open a draft again, but while adding more models I found out why the DinoV2 example loads its safetensors from a different location than Facebook's own HF space: Facebook's safetensors have the query, key and value layers separate, whereas the original Python code, and thus the Candle implementation, uses a single QKV layer. Furthermore, the original model file relies on separate head weights (guessing the head isn't even necessary for DepthAnything, so it might be removed from the example). So what I've done here:
Question: Do you agree with this approach?
Happy to adapt to other ideas, but if you agree, I will continue adding the base, large and giant models.
PS: I saw you also did some formatting on my previous PR and fixed some Clippy warnings. Maybe it would be an idea to have a Contributor page that details the formatting requirements and other conventions?