
How do camera viewpoints work at training and inference? #24

Open
QiuhongAnnaWei opened this issue Oct 26, 2023 · 3 comments

Comments

@QiuhongAnnaWei

Thanks for releasing the code!

I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:

  1. How exactly does the model take in the camera viewpoint? Is it the same as the zero123 conditional latent diffusion architecture, where the input view x and a relative viewpoint transformation (R, T) are used as conditioning information? If so, are you using the same conditioning-information encoder as zero123?

  2. The report says Zero123++ uses a fixed set of 6 poses (relative azimuth and absolute elevation angles) as the prediction target.
    a. zero123 trains on a dataset of paired images and their relative camera extrinsics {(x, x_(R,T), R, T)}. Is the equivalent notation for zero123++ {(x, x_(tiled 6 images), R_{1..6}, T_{1..6})}?
    b. Tying back to Q1, does this mean that instead of taking in x and (R, T) as conditional input, zero123++ takes in x_{1..6} and (R_{1..6}, T_{1..6}) as conditional input?

  3. I hope to explicitly pass in a randomly sampled camera viewpoint at inference time; is that possible? I couldn't find the part of the code that would allow this (for reference, my current inference call is sketched after this list).
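
For reference, here is roughly how I currently invoke the model, following the README's diffusers custom pipeline (the model ID and arguments below are just what I have locally, and "input.png" is a placeholder); as far as I can tell there is no camera argument anywhere:

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline

# Load Zero123++ through the custom diffusers pipeline from the README.
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
).to("cuda")

# One conditioning image in; a single 3x2 tiled grid of the 6 fixed views out.
# There is no (R, T) / azimuth / elevation argument here.
cond = Image.open("input.png")
result = pipeline(cond, num_inference_steps=75).images[0]
result.save("output_grid.png")
```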

@avaer

avaer commented Oct 26, 2023

If the answer to 3) is that it's not possible -- is the plan to release the training code + dataset so folks can train their own view set (e.g. a 360 orbit)?

@eliphatfs
Collaborator

eliphatfs commented Oct 26, 2023

We do not explicitly use any camera pose input during training or inference. It is just that the designed output views have no ambiguity given the input image, so we do not need any pose conditioning. Sampling twice with different camera parameters as input would not give consistent results, so we previously did not think it would be helpful.

See #10 for comments on training code and camera pose conditioning.
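
For concreteness, the target views come from a fixed pose table rather than from a pose input to the network. A rough sketch of that table follows; the exact angle values and sign conventions live in the repo, and the numbers below may not match them precisely:

```python
# Rough description of the fixed output poses (numbers may not exactly match
# the repo; azimuths are relative to the input view, elevations are absolute).
REL_AZIMUTHS_DEG = [30, 90, 150, 210, 270, 330]   # start at 30 deg, step 60 deg
ABS_ELEVATIONS_DEG = [30, -20, 30, -20, 30, -20]  # interleaved

def target_poses(input_azimuth_deg: float):
    """Absolute (azimuth, elevation) of the 6 output views for a given input view.

    Because this table is fixed, the view identity is implied by each tile's
    position in the 3x2 output grid, and the network needs no pose input.
    """
    return [
        ((input_azimuth_deg + rel) % 360.0, elev)
        for rel, elev in zip(REL_AZIMUTHS_DEG, ABS_ELEVATIONS_DEG)
    ]

print(target_poses(input_azimuth_deg=0.0))
```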

@moonryul

eliphatfs said:

We do not explicitly use any camera pose input during training or inference. It is just that the designed output views have no ambiguity given the input image, so we do not need any pose conditioning.

=> By "not explicitly use any camera pose input during training", do you mean that, as a training pair, you use

(cond_image_i, target_grid), i = 1, ..., 12, for each mesh?

Here target_grid consists of 6 images obtained by rendering a given mesh from 6 camera positions with fixed absolute elevation angles and relative azimuth angles, and cond_image_i is the image obtained by rendering the mesh from the i-th randomly chosen camera position. The number of cond_images, 12, is arbitrary.
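
In other words, something like the sketch below, where render_view is a hypothetical placeholder renderer (not from the repo) and the angle values are placeholders; since the azimuths are relative to the input view, the target grid is re-rendered per condition image here:

```python
from PIL import Image

# Placeholder angle table (values may not match the repo).
REL_AZIMUTHS_DEG = [30, 90, 150, 210, 270, 330]
ABS_ELEVATIONS_DEG = [30, -20, 30, -20, 30, -20]

def make_target_grid(mesh, input_azimuth_deg, render_view, tile=(320, 320)):
    """Render the 6 fixed views and tile them into a 3x2 grid (3 rows, 2 cols).

    render_view(mesh, azimuth_deg, elevation_deg) -> PIL.Image is a hypothetical
    renderer, not a function from the repo.
    """
    grid = Image.new("RGB", (2 * tile[0], 3 * tile[1]))
    for k, (rel_az, elev) in enumerate(zip(REL_AZIMUTHS_DEG, ABS_ELEVATIONS_DEG)):
        view = render_view(mesh, (input_azimuth_deg + rel_az) % 360.0, elev)
        grid.paste(view.resize(tile), ((k % 2) * tile[0], (k // 2) * tile[1]))
    return grid

def make_training_pairs(mesh, cond_poses, render_view):
    """One (cond_image_i, target_grid_i) pair per randomly chosen condition pose,
    e.g. 12 sampled (azimuth, elevation) pairs per mesh."""
    pairs = []
    for az, elev in cond_poses:
        cond_image = render_view(mesh, az, elev)
        target_grid = make_target_grid(mesh, az, render_view)
        pairs.append((cond_image, target_grid))
    return pairs
```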
