We just want to read the scrolls!!
Location of the segments within the full scroll. The IDs correspond to the scroll segments found here: http://dl.ash2txt.org/full-scrolls/Scroll1.volpkg/paths/
- 1 TB drive
- 1 Nvidia GPU with high VRAM (I tested with 40 GB)
Set up a Linux-based system with CUDA 12.1.
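Before starting, it can help to confirm that the machine roughly matches the requirements above. The following is a minimal sketch using only the Python standard library. The `preflight` helper and its 1 TB default are illustrative and not part of this repository. It checks free space as a stand-in for drive size, and it only checks that `nvidia-smi` is on the `PATH`, not the CUDA version or the amount of VRAM.

```python
import shutil

TB = 10**12  # one terabyte in bytes (decimal, as drives are marketed)


def preflight(path=".", min_free_bytes=TB):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of this repository.
    """
    problems = []
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(
            f"only {free / TB:.2f} TB free at {path!r}, "
            f"wanted {min_free_bytes / TB:.2f} TB"
        )
    # nvidia-smi ships with the NVIDIA driver; if it is missing,
    # the GPU is probably not usable from this environment.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH")
    return problems
```

Calling `preflight()` from the directory where the data will be stored returns the list of problems found, which can be printed as warnings.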
- Change directory into the folder.
- Run `conda env create -f environment.yml`.
- Run `conda activate Vesuvius-Challenge`.
- Run `python data_downloader.py`.
- Run `python data_setup.py`.
- Place the downloaded [model.ckpt](https://drive.google.com/file/d/1rh0xGOPhznqPT6QqcK6tbnq86eAM9XiI/view?usp=drive_link) into `./models`.
- Run `accelerate launch inference_unetr_pp.py`.

Results will be saved in the `results/` folder.
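A common mistake is to launch inference before the checkpoint is in place. The sketch below checks for the files the steps above mention. The `missing_inputs` helper is hypothetical, and it assumes the checkpoint keeps the name `model.ckpt` inside `./models`, which the inference script may or may not require.

```python
from pathlib import Path


def missing_inputs(repo_root=".", checkpoint="models/model.ckpt"):
    """Return the required paths that do not exist under repo_root.

    Hypothetical helper; the checkpoint file name is an assumption.
    """
    root = Path(repo_root)
    required = [root / checkpoint, root / "inference_unetr_pp.py"]
    return [str(path) for path in required if not path.exists()]
```

An empty list from `missing_inputs()` means both files were found and `accelerate launch inference_unetr_pp.py` can be started.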
After following the above steps to set up and activate the Conda environment:
- Run `python data_downloader.py`.
- Run `python data_setup.py`.
- Run `python training_unetr_pp.py`.

Trained models will be saved in the `training/` folder.
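If you prefer to run the training steps unattended, they can be chained so that a failure in one step stops the rest. This is a minimal sketch: the commands are the ones listed above, while `TRAINING_STEPS` and `run_steps` are illustrative names that do not exist in this repository.

```python
import shlex
import subprocess

# The three training commands from the steps above, in order.
TRAINING_STEPS = [
    "python data_downloader.py",
    "python data_setup.py",
    "python training_unetr_pp.py",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    Hypothetical helper. `runner` is injectable so the function
    can be exercised without launching real processes.
    """
    for step in steps:
        print(f"$ {step}")
        # check=True makes subprocess.run raise CalledProcessError
        # when the command exits with a non-zero status.
        runner(shlex.split(step), check=True)
```

Calling `run_steps(TRAINING_STEPS)` from the repository root, with the Conda environment active, runs the download, setup, and training scripts in sequence.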
Hallucinations were mitigated in four ways:
- Labeled data was created using only models with a 64x64 pixel window. The model with a 256x256 pixel window was used only in the last two iterations, to generate cleaner and more legible results.
- Including more negative ink labels. The patch extraction technique described below in the technical details extracts more negative labels than positive ones, which reduces bias towards positive labels. Hence, the model is less likely to hallucinate positive ink labels. The risk of misinterpretation caused by hallucinated negative ink labels is far lower.
- Strong data augmentation. Distorting augmentations such as optical, grid, and elastic deformations were used during training, greatly reducing the possibility that the models memorize the shapes of Greek letters instead of learning true ink signals.
- Results were generated with a 32 pixel stride instead of the 64 pixel stride used during training. Therefore, results are generated on crops of characters the model has not seen, even where that part of the scroll was used during training.
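The effect of the smaller inference stride can be illustrated along one axis. In this sketch the axis length of 512 and the window size of 256 are arbitrary illustrative values, and `window_origins` is a hypothetical helper rather than code from this repository. It shows that halving the stride adds window positions that never occurred as training crops.

```python
def window_origins(length, window, stride):
    """Start offsets of all windows of size `window` along one axis."""
    return list(range(0, length - window + 1, stride))


train = set(window_origins(512, 256, 64))  # stride used during training
infer = window_origins(512, 256, 32)       # stride used for results
# Inference windows whose start offset was never a training offset.
unseen = [origin for origin in infer if origin not in train]
print(unseen)  # [32, 96, 160, 224]
```

Every second inference window is offset by 32 pixels from the nearest training window, so the model sees the same ink under a framing it was not trained on.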
I use a custom adaptation of the state-of-the-art UNETR++ model, a transformer-based U-Net derivative from medical imaging, as a 3D feature extractor. Its output is max-pooled over the depth layers and then passed to a final feature extractor based on SegFormer-B5.
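The depth pooling step that joins the 3D and 2D stages can be sketched with NumPy. This only illustrates how the tensor shape changes. The `(channels, depth, height, width)` axis order, the sizes, and the `pool_over_depth` name are assumptions, and random numbers stand in for the real 3D features.

```python
import numpy as np


def pool_over_depth(features):
    """Collapse the depth axis of a (C, D, H, W) volume by max pooling."""
    return features.max(axis=1)


# Stand-in for the 3D extractor's output:
# 8 channels, 16 depth layers, 64x64 pixels (all sizes are illustrative).
rng = np.random.default_rng(0)
features_3d = rng.standard_normal((8, 16, 64, 64))

# The 2D stage then receives one feature map per channel.
features_2d = pool_over_depth(features_3d)
print(features_2d.shape)  # (8, 64, 64)
```

Max pooling keeps, for each pixel and channel, the strongest response found in any depth layer, so the 2D stage does not need to know at which depth the ink signal appeared.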
I ran detections exclusively on PHerc Paris 3 (Scroll 1), using an ink detection window of 256x256 pixels, which corresponds to roughly 2.02496 mm, with a stride of 32 pixels to ensure sufficient training data. Since this is larger than the recommended 64x64 pixel detection window, the ways in which I mitigated hallucinations are described elsewhere in this document.
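The physical window size quoted above follows from multiplying the window width in pixels by the size of one pixel. In this sketch the 7.91 µm default is an assumption: it is inferred by dividing the stated 2.02496 mm by 256 pixels and is not given in the text. The `window_size_mm` name is illustrative.

```python
def window_size_mm(pixels, microns_per_pixel=7.91):
    """Physical edge length of a square window, in millimetres.

    The 7.91 micron default is inferred from the figures in the text.
    """
    return pixels * microns_per_pixel / 1000.0


# Compare the recommended 64 pixel window with the 256 pixel one.
for pixels in (64, 256):
    print(f"{pixels} px -> {window_size_mm(pixels):.5f} mm")
```

Under the same assumption, the recommended 64x64 pixel window covers about a quarter of that edge length, which is why the larger window sees whole characters rather than fragments.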
I propose a patch extraction technique that works well for larger window sizes (<=512 pixels) and gives the model sufficient examples of both positive and negative ink labels. This technique is especially important for learning characters where negative ink labels (negative space) are crucial, for example in distinguishing the characters ο, ϲ, and θ, which have very similar ink structures, especially when the data is noisy.
The patch extractor works by first identifying all the areas in the manually annotated ink label ground truth that contain ink. It then passes on only the ink area and the surrounding non-ink area that is critical to identifying the character. This includes the non-ink labels inside the character itself, which are crucial in distinguishing the aforementioned ο, ϲ, and θ.
Example patch extraction from manually annotated ink labels on PHerc Paris 3 segment 20231012184423. The green boxes denote the areas of the scroll that the model will be trained on. Note that the stride is 64, hence there are many overlapping boxes.
You can run and visualize the patch extraction algorithm yourself using `window_visualizer.py`.
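The idea behind the patch extractor can be sketched in a few lines of plain Python. This is a simplified illustration and not the repository's implementation: it keeps every window that contains at least one annotated ink pixel, and the `ink_windows` name, the toy 8x8 mask, and the window and stride values are all assumptions made for the example.

```python
def ink_windows(mask, window, stride):
    """Return (row, col) origins of windows containing at least one ink pixel.

    `mask` is a list of rows of 0/1 values, where 1 marks annotated ink.
    """
    height, width = len(mask), len(mask[0])
    origins = []
    for r in range(0, height - window + 1, stride):
        for c in range(0, width - window + 1, stride):
            rows = mask[r:r + window]
            if any(any(row[c:c + window]) for row in rows):
                origins.append((r, c))
    return origins


# Toy 8x8 label: a small ring of ink, like an "o" with a hollow centre.
mask = [[0] * 8 for _ in range(8)]
for r, c in [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)]:
    mask[r][c] = 1

kept = ink_windows(mask, window=4, stride=2)
print(kept)  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Windows far from any ink are dropped, while the kept windows still include the hollow centre at `mask[2][2]` and the empty border around the ring, which is the negative space the model needs to tell similar characters apart.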