Harmonizing Visual Text Comprehension and Generation

Environment

step 1: set up the environment

git clone https://github.com/bytedance/TextHarmony
cd TextHarmony
pip install -r requirements.txt
# install `MultiScaleDeformableAttention` module
cd TextHarmony/models/utils/ops
python setup.py install

Some packages listed in requirements.txt, such as mmcv and flash_attn, may need to be installed manually.
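Before moving on, it can help to confirm which of these packages Python can actually locate. The sketch below is only an illustration: the module names are assumptions taken from the note above (`mmcv`, `flash_attn`) and from the build step (`MultiScaleDeformableAttention`), and the helper `missing_modules` is hypothetical, not part of this repository.

```python
import importlib.util

# Assumed import names; adjust them if the installed packages expose others.
REQUIRED_MODULES = ("mmcv", "flash_attn", "MultiScaleDeformableAttention")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located on the current Python path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat lookup errors as "not installed" for reporting purposes.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    print("missing:", ", ".join(absent) if absent else "none")
```

Any name reported as missing is a candidate for manual installation.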

step 2: download pretraining weights

cd TextHarmony
python TextHarmony/scripts/download_hf_models.py

step 3: download the model weights of TextHarmony

# concatenate the model files
cat pytorch_model.binaa pytorch_model.binab pytorch_model.binac > pytorch_model.bin
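Where `cat` is not available (for example on Windows), the same concatenation can be sketched in Python. This is a minimal sketch, assuming the parts sit in one directory and follow the `pytorch_model.binaa`, `pytorch_model.binab`, ... naming shown above; the helper name `join_parts` is hypothetical.

```python
import shutil
from pathlib import Path


def join_parts(directory, output_name="pytorch_model.bin"):
    """Concatenate `<output_name>aa`, `<output_name>ab`, ... into one file."""
    directory = Path(directory)
    # Sorting by name reproduces the aa, ab, ac order passed to `cat`.
    parts = sorted(p for p in directory.glob(output_name + "??") if p.is_file())
    if not parts:
        raise FileNotFoundError(f"no parts matching {output_name}?? in {directory}")
    target = directory / output_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so large weight files are never fully in memory.
                shutil.copyfileobj(src, out)
    return target
```

Calling `join_parts(".")` in the download directory writes `pytorch_model.bin` next to the parts.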

Inference

step 1: modify 'load_from', 'llm_model_path', 'encoder_model_path' and 'pretrained_model_name_or_path' in example_inference.yaml
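The four keys can be edited by hand. If the change has to be scripted, one possible approach is sketched below. It assumes nothing about where the keys sit in the file: `set_keys` (a hypothetical helper) walks the whole structure and overwrites every key with a matching name. `patch_yaml` additionally assumes PyYAML is installed, and rewriting the file with it discards comments.

```python
def set_keys(node, updates):
    """Overwrite every key in `node` that is named in `updates`; return the hit count."""
    hits = 0
    if isinstance(node, dict):
        for key, value in list(node.items()):
            if key in updates:
                node[key] = updates[key]
                hits += 1
            else:
                hits += set_keys(value, updates)
    elif isinstance(node, list):
        for item in node:
            hits += set_keys(item, updates)
    return hits


def patch_yaml(path, updates):
    """Load a YAML file, apply `set_keys`, and write it back (comments are lost)."""
    import yaml  # PyYAML, assumed to be installed

    with open(path, "r", encoding="utf-8") as handle:
        config = yaml.safe_load(handle)
    hits = set_keys(config, updates)
    with open(path, "w", encoding="utf-8") as handle:
        yaml.safe_dump(config, handle, sort_keys=False)
    return hits
```

A call such as `patch_yaml(config_path, {"load_from": weights_dir})` returns how many keys were replaced, so a result of 0 signals that the name was not found. Both arguments here are placeholders for local paths.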

step 2: run the following command:

torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 inference.py --config_file=TextHarmony/TextHarmony/configs/release/example_inference.yaml

Evaluation

image comprehension

step 1: modify 'data_root' and 'data_path' in 896-moe-eval.yaml. The file that 'data_path' points to should be structured as follows:

[
    {
        "image": image_path,
        "question": question,
        "answer": answer
    },
]
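As an illustration of that layout, the sketch below builds one record, serializes it as JSON, and checks that every record carries the three documented fields. The sample path and text are placeholders, the helper `validate_records` is hypothetical, and how 'image' is resolved against 'data_root' is not stated here, so the relative path is an assumption.

```python
import json

REQUIRED_FIELDS = ("image", "question", "answer")


def validate_records(records):
    """Raise ValueError unless `records` is a list of dicts with all required fields."""
    if not isinstance(records, list):
        raise ValueError("top-level value must be a list")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing {missing}")
    return len(records)


# Placeholder sample; replace with real image paths, questions and answers.
sample = [
    {
        "image": "images/0001.jpg",
        "question": "What is written on the sign?",
        "answer": "EXIT",
    }
]
serialized = json.dumps(sample, indent=4, ensure_ascii=False)
record_count = validate_records(json.loads(serialized))
```

Writing `serialized` to a file and pointing 'data_path' at it would give a file in the structure shown above.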

step 2: run the following command:

torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 evaluate.py --config_file=TextHarmony/TextHarmony/configs/release/896-moe-eval.yaml

image generation

step 1: download AnyText-Benchmark

step 2: generate the target images

torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 inference.py --config_file=TextHarmony/TextHarmony/configs/release/896-moe-inference.yaml

step 3: calculate the results

python TextHarmony/image_eval/eval_dgocr.py
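The output of eval_dgocr.py is not described here. AnyText-Benchmark scores rendered text with sentence accuracy and normalized edit distance, so, assuming the script reports those two quantities, the sketch below shows how they can be computed from recognized and reference strings. It is an illustration only, not the script's implementation; the function names are hypothetical, and the similarity form `1 - distance / max_length` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_scores(predictions, references):
    """Return (sentence accuracy, mean normalized edit similarity)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predictions, references))
    exact = sum(pred == ref for pred, ref in pairs)
    similarity = sum(
        1 - edit_distance(pred, ref) / max(len(pred), len(ref), 1)
        for pred, ref in pairs
    )
    return exact / len(pairs), similarity / len(pairs)
```

For example, `text_scores(["EXIT", "OPEN"], ["EXIT", "OPEM"])` gives an accuracy of 0.5, because only one string matches exactly, and a higher similarity score, because the second string differs by a single character.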

Training

TODO

Acknowledgment

We thank the authors of MM-Interleaved, TextDiffuser, AnyText and LoRAMoE for their great work.

Citation

@article{zhao2024harmonizing,
  title={Harmonizing Visual Text Comprehension and Generation},
  author={Zhao, Zhen and Tang, Jingqun and Wu, Binghong and Lin, Chunhui and Wei, Shu and Liu, Hao and Tan, Xin and Zhang, Zhizhong and Huang, Can and Xie, Yuan},
  journal={arXiv preprint arXiv:2407.16364},
  year={2024}
}
