3D Generation: InstantMesh Training, Inference, and Eval Scripts #681

Open · wants to merge 26 commits into base: master

Changes from all commits (26 commits):
- `d10d44d` instantmesh inference & stage 1 training, also the eval script is pro… (HaFred, Oct 1, 2024)
- `57b66d7` update readme (HaFred, Oct 1, 2024)
- `b4054ed` putting on renderer utils (HaFred, Oct 2, 2024)
- `03ece27` fixes about fmt and f-string issue in precommit check and mindone im… (HaFred, Oct 8, 2024)
- `152f310` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Oct 8, 2024)
- `d5a58e2` Merge branch 'itmh_oct1' of https://github.com/HaFred/mindone into it… (HaFred, Oct 8, 2024)
- `35d27c3` supporting cosine_annealing_warm_restarts_lr and top_k saving ckptcal… (HaFred, Oct 8, 2024)
- `1fc4214` housekeeping (HaFred, Oct 15, 2024)
- `37d961f` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Oct 15, 2024)
- `70663da` revert to f-string while meeting flake8 constraints (HaFred, Oct 15, 2024)
- `7073b97` Update README.md (HaFred, Oct 18, 2024)
- `f939579` lpips loss alignment (HaFred, Oct 25, 2024)
- `f5ea787` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Oct 25, 2024)
- `0eb7641` put on mindcv version (HaFred, Oct 25, 2024)
- `4206d0d` Merge branch 'itmh_oct1' of https://github.com/HaFred/mindone into it… (HaFred, Oct 25, 2024)
- `efdd423` update the (HaFred, Oct 25, 2024)
- `9bf2358` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Oct 29, 2024)
- `f48983c` eval output to the same path as the loaded ckpt, also some housekeepi… (HaFred, Oct 29, 2024)
- `0dfbf99` fix the ckpt saving path cfg (HaFred, Oct 29, 2024)
- `53e2216` update cfg (HaFred, Oct 29, 2024)
- `e2c29e1` update arch to support loading vanilla stage 1 ckpt and ms-trained st… (HaFred, Oct 30, 2024)
- `6ac74a9` swtich ops to mint AMAP (HaFred, Nov 1, 2024)
- `71b70f3` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Nov 2, 2024)
- `59697a8` upload the safetensor conversion snippet mentioned in the readme (HaFred, Nov 13, 2024)
- `88a3e9d` Merge branch 'mindspore-lab:master' into itmh_oct1 (HaFred, Nov 13, 2024)
- `968a03b` update link (HaFred, Nov 13, 2024)
119 changes: 119 additions & 0 deletions examples/instantmesh/README.md
@@ -0,0 +1,119 @@
# InstantMesh: 3D Mesh Generation from Multiview Images

We support [InstantMesh](https://github.com/TencentARC/InstantMesh) for 3D mesh generation from the multiview images extracted by [the sv3d pipeline](https://github.com/mindspore-lab/mindone/pull/574).
<p align="center" width="100%">
<img width="746" alt="Capture" src="https://github.com/user-attachments/assets/be5cf033-8f89-4cad-97dc-2bf76c1b7a4d">
</p>

The model consists of a Dino-ViT feature extractor, a triplane feature extraction transformer, and a triplane-to-NeRF synthesizer that also performs rendering.
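A rough sketch of that data flow is below; the class and attribute names are illustrative only and do not match the repository's actual modules.

```python
# Illustrative sketch of the data flow only; names here are hypothetical
# and do not match the actual classes under models/.
import mindspore as ms
from mindspore import nn


class InstantMeshSketch(nn.Cell):
    """Multiview images -> triplane features -> NeRF synthesis/rendering."""

    def __init__(self, encoder: nn.Cell, transformer: nn.Cell, synthesizer: nn.Cell):
        super().__init__()
        self.encoder = encoder          # Dino-ViT image feature extractor
        self.transformer = transformer  # image tokens -> triplane features
        self.synthesizer = synthesizer  # triplane -> NeRF -> rgb(a) / sdf

    def construct(self, images: ms.Tensor, cameras: ms.Tensor) -> ms.Tensor:
        tokens = self.encoder(images)        # (B, N_views * L, C) patch tokens
        planes = self.transformer(tokens)    # (B, 3, C_tri, H, W) triplane features
        return self.synthesizer(planes, cameras)  # rendered novel views
```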

A walk-through of the file structure is provided below.

<details>
<summary>Files Tree
</summary>

```bash
├── models
│ ├── decoder # triplane feature transformer decoder
│ │ └── transformer.py
│ ├── encoder # dino vit encoder to extract img feat
│ │ ├── dino_wrapper.py
│ │ └── dino.py
│ ├── renderer # a wrapper that synthesizes sdf/texture from triplane feat
│ │ ├── synthesizer_mesh.py # triplane synthesizer, the triplane feat is decoded thru nerf to predict texture rgb & 3D sdf
│ │ ├── synthesizer.py # triplane synthesizer, the triplane feat is decoded thru nerf to predict novel view rgba
│ │ └── utils
│ │ └── renderer.py
│ ├── geometry # use Flexicubes to extract isosurface
│ │ ├── rep_3d
│ │ │ ├── flexicubes_geometry.py
│ │ │ ├── tables.py
│ │ │ └── flexicubes.py
│ │ └── camera
│ │ └── perspective_camera.py
│ ├── lrm_mesh.py # model arch for the instantmesh inference
│ └── lrm.py # model arch for the instantmesh stage 1 training
├── utils
│ ├── camera_util.py
│ ├── train_util.py
│ ├── eval_util.py
│ ├── loss_util.py
│ ├── ms_callback_util.py
│ └── mesh_util.py
├── data
│ └── objaverse.py # training dataset definition and batchify
├── configs
│ └── instant-mesh-large.yaml
├── inference.py # instantmesh inference
├── train.py # instantmesh stage 1 training
├── eval.py # instantmesh stage 1 evaluation, mview imgs to novel view synthesis
└── model_stage1.py # model arch for the stage 1 training
```

</details>

## Introduction

InstantMesh [[1]](#acknowledgements) synergizes the strengths of a multiview diffusion model and a sparse-view reconstruction model based on the LRM [[2]](#acknowledgements) architecture. It also adopts FlexiCubes [[3]](#acknowledgements) isosurface extraction for a smoother and more elegant mesh extraction.

Using the multiview images produced by [the sv3d pipeline](../sv3d/simple_video_sample.py) as input, we extracted the 3D meshes shown below. Please find the corresponding input images illustrated at the sv3d pipeline link above.

| <p align="center"> akun </p> | <p align="center"> anya </p> |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <div class="sketchfab-embed-wrapper"><iframe title="akun_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/c8b5b475529d48589b85746aab638d2b/embed"></iframe></div> | <div class="sketchfab-embed-wrapper"><iframe title="anya_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/180fd247ba2f4437ac665114a4cd4dca/embed"></iframe></div> |

The illustrations here are better viewed in viewers with HTML support (e.g., the VS Code built-in viewer).

## Environment Requirements

1. Install the requirements:

```bash
pip install -r requirements.txt
```

2. Inference is tested on a machine with the following specs, using 1x NPU:

| mindspore | ascend driver | firmware | cann toolkit/kernel |
| :--- | :--- | :--- | :--- |
| 2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |
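
After installing, the MindSpore setup can be sanity-checked with MindSpore's built-in self-check:

```bash
# verifies the mindspore installation and the backend it can reach
python -c "import mindspore; mindspore.run_check()"
```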

## Pretrained Models
### ViT Pretrained Checkpoint
To better accommodate the mindone transformers codebase, we provide an out-of-the-box [checkpoint conversion script](./tools/convert_dinovit_bin2st.py) that works seamlessly with the MindSpore version of transformers.

The image features are extracted with dino-vit, which depends on HuggingFace's transformers package. We reuse [the MindSpore implementation](https://github.com/mindspore-lab/mindone/blob/master/mindone/transformers/modeling_utils.py#L499); the only remaining challenge is that the `.bin` checkpoint of [dino-vit](https://huggingface.co/facebook/dino-vitb16/tree/main) is not supported by MindSpore off-the-shelf. The conversion script above handles this conversion and keeps dino-vit based on `MSPreTrainedModel`.
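
For reference, the core of such a conversion is re-serializing the torch `.bin` state dict to `.safetensors`; a minimal sketch is below (file names are placeholders, and the bundled script may differ in details):

```python
# Minimal sketch of a .bin -> .safetensors conversion;
# tools/convert_dinovit_bin2st.py may differ in details.
import torch
from safetensors.torch import save_file

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
# ensure tensors are contiguous for safetensors serialization
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")
```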

### InstantMesh Checkpoint
To convert the InstantMesh checkpoint, we provide the following snippet.
```bash
python tools/convert_pt2ms.py --trgt PATH_TO_CKPT
```
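
Conceptually, such a conversion loads the torch checkpoint and re-saves each tensor in MindSpore's checkpoint format; a minimal sketch is below (file names are placeholders, and the actual script additionally remaps parameter names):

```python
# Minimal sketch of a torch -> MindSpore checkpoint conversion;
# the actual tools/convert_pt2ms.py also remaps parameter names.
import torch
import mindspore as ms

pt_state = torch.load("instant_mesh_large.pt", map_location="cpu")
params = [
    {"name": name, "data": ms.Tensor(tensor.detach().cpu().numpy())}
    for name, tensor in pt_state.items()
]
ms.save_checkpoint(params, "instant_mesh_large.ckpt")
```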

## Inference

```shell
python inference.py --ckpt PATH_TO_CKPT \
    --input_vid PATH_TO_INPUT_MULTIVIEW_VID
```

## Training
```shell
python train.py --base configs/YOUR_CFG
```
To align with the torchmetrics implementation of the LPIPS loss, one needs to patch L62 of `mindcv.models.vgg` to enable the conv kernel bias (MindSpore's `nn.Conv2d` defaults to `has_bias=False`, while the VGG features used by the torch LPIPS implementation contain biased convolutions):
```diff
- conv2d = nn.Conv2d(in_channels, v, kernel_size=3, pad_mode="pad", padding=1)
+ conv2d = nn.Conv2d(in_channels, v, kernel_size=3, pad_mode="pad", padding=1, has_bias=True)
```

### Data Curation
We used Blender to render multiview frames of a 3D object in `.obj` format for training.
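
A minimal `bpy` sketch of such a rendering loop is below; the object path, view count, and camera setup are placeholder assumptions, and the actual curation script may differ.

```python
# Run with: blender --background --python render_views.py
# Hypothetical sketch: orbit the scene camera around an imported .obj
# and render N views. Assumes the camera has a track-to constraint
# aimed at the object so it stays pointed at the origin.
import math
import bpy

N_VIEWS, RADIUS, HEIGHT = 16, 2.5, 0.8
bpy.ops.wm.obj_import(filepath="object.obj")  # newer Blender; older versions use bpy.ops.import_scene.obj
cam = bpy.context.scene.camera

for i in range(N_VIEWS):
    theta = 2 * math.pi * i / N_VIEWS
    cam.location = (RADIUS * math.cos(theta), RADIUS * math.sin(theta), HEIGHT)
    bpy.context.scene.render.filepath = f"frames/{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```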

## Acknowledgements

1. Xu, Jiale, et al. "Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models." arXiv preprint arXiv:2404.07191 (2024).
2. Hong, Yicong, et al. "Lrm: Large reconstruction model for single image to 3d." arXiv preprint arXiv:2311.04400 (2023).
3. Shen, Tianchang, et al. "Flexible Isosurface Extraction for Gradient-Based Mesh Optimization." ACM Trans. Graph. 42.4 (2023): 37-1.
4. Lorensen, William E., and Harvey E. Cline. "Marching cubes: A high resolution 3D surface construction algorithm." Seminal graphics: pioneering efforts that shaped the field. 1998. 347-353.
17 changes: 17 additions & 0 deletions examples/instantmesh/configs/instant-mesh-large.yaml
@@ -0,0 +1,17 @@
model:
  encoder_model_name: 'facebook/dino-vitb16'
  target: models.instantmesh.models3d.lrm_mesh.InstantMesh
  params:
    encoder_model_name: 'YOUR_PATH_HF/models--facebook--dino-vitb16/snapshots/f205d5d8e640a89a2b8ef0369670dfc37cc07fc2'  # an absolute path is used as a workaround to enforce the is_local flag (with pretrained_model_name_or_path as a local dir)
    triplane_low_res: 32
    triplane_high_res: 64
    triplane_dim: 80
    rendering_samples_per_ray: 128
    grid_res: 128
    grid_scale: 2.1

infer:
  # model_path: ckpts/instant_mesh_large.ckpt  # as in torch, the model is loaded from hf by default; here the conversion is done in situ
  texture_resolution: 1024
  render_resolution: 512
51 changes: 51 additions & 0 deletions examples/instantmesh/configs/instant-nerf-large-train.yaml
@@ -0,0 +1,51 @@
model:
  base_learning_rate: 4.0e-04
  scheduler: cosine_annealing_warm_restarts_lr
  optimizer: adamw
  weight_decay: 0.01
  target: model_stage1.InstantMeshStage1WithLoss
  params:
    input_size: 320
    render_size: 192
    lrm_generator_config:
      openlrm_ckpt: 'YOUR_PATH/openlrm.ckpt'
      target: models.lrm.InstantNeRF
      params:
        encoder_feat_dim: 768
        encoder_freeze: false
        encoder_model_name: 'YOUR_PATH_HF/models--facebook--dino-vitb16/snapshots/f205d5d8e640a89a2b8ef0369670dfc37cc07fc2'  # an absolute path is used as a workaround to enforce the is_local flag (with pretrained_model_name_or_path as a local dir)
        transformer_dim: 1024
        transformer_layers: 16
        transformer_heads: 16
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 80
        rendering_samples_per_ray: 64  # the vanilla ckpt uses 128; if loading a pretrained ckpt, make sure this is 128
        use_recompute: true

eval_render_size: 96  # larger values may lead to OOM in eval.py

data:
  batch_size: 1
  num_workers: 4
  train:
    target: data.objaverse.ObjaverseDataset
    params:
      root_dir: YOUR_PATH_DATA  # for the overfitting exp
      meta_fname: uid_set.pkl
      input_image_dir: input
      target_image_dir: input
      input_view_num: 3
      target_view_num: 2
      input_size: 320
      render_size: 192
      total_view_n: 16
      fov: 50
      camera_rotation: true
  val:
    target: data.objaverse.ValidationDataset
    params:
      root_dir: YOUR_PATH_DATA/target
      input_view_num: 6
      input_image_size: 320
      fov: 30
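
The `target`/`params` keys in both configs follow the common string-import instantiation pattern; a minimal sketch of how such a config typically resolves to a model instance is below (this assumes that convention; the exact loader used by train.py may differ):

```python
# Minimal sketch of the target/params instantiation convention;
# the exact loader used by train.py may differ.
import importlib

from omegaconf import OmegaConf


def instantiate_from_config(config: dict):
    # split "models.lrm.InstantNeRF" into module path and class name
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))


cfg = OmegaConf.load("configs/instant-nerf-large-train.yaml")
model_cfg = OmegaConf.to_container(cfg.model, resolve=True)
model = instantiate_from_config(model_cfg)
```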