Dense Index of MARCO V2 and KILT #974
-
I have been trying to build a dense index from both the MARCO V2 and KILT collections (about 17M documents) on an AWS EC2 instance running Ubuntu with 8 vCPUs and 32GB memory. Each time I run:
The process gets killed after about 3 minutes, probably because it uses too much memory. When I try sharding, using:
I still get the same outcome. However, I successfully created a dense index from a subset of the collection (~200K documents) with the command above. Is there anything I'm missing in terms of optimizing the process, or do I need to try again on a "beefier" machine? Thanks!
Replies: 1 comment 3 replies
-
Hi @paulowoicho,
The OOM here is probably caused by loading the entire corpus into memory at the beginning (before sharding).
I would suggest splitting the raw collection in advance.
For example, use the Linux CLI tool split to break the converted_collection into multiple smaller files, then run the encoding with:
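For the splitting step only (this is not the encoding command), here is a minimal Python sketch that does roughly what split -l would do. It assumes the converted collection is a text file with one document per line (e.g. JSONL), which may not match your actual format. The function name split_collection, the shard_XXXX.jsonl naming, and the toy file in the demo are all hypothetical and just for illustration.

```python
import os
import tempfile


def split_collection(src_path, out_dir, lines_per_shard):
    """Stream src_path line by line and write shards of at most
    lines_per_shard lines each, so the corpus is never fully in memory.

    Assumption: one document per line (e.g. JSONL).
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new shard file every lines_per_shard lines.
            if i % lines_per_shard == 0:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: shard_0000.jsonl, shard_0001.jsonl, ...
                path = os.path.join(out_dir, f"shard_{len(shard_paths):04d}.jsonl")
                out = open(path, "w", encoding="utf-8")
                shard_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths


if __name__ == "__main__":
    # Toy demo on a 10-line file; replace with the real collection path.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "converted_collection.jsonl")
        with open(src, "w", encoding="utf-8") as f:
            for n in range(10):
                f.write('{"id": "%d", "contents": "doc %d"}\n' % (n, n))
        shards = split_collection(src, os.path.join(tmp, "shards"), 4)
        print(f"wrote {len(shards)} shard files")
```

Each resulting shard can then be encoded in its own run, so only one shard's worth of documents needs to fit in memory at a time.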