This is my favorite and most used tool #735
clearsitedesigns started this conversation in Show and tell
@clearsitedesigns this looks really useful. We can integrate it into the main code.
Hi all, I have to fork off the main repo. I use this tool a lot in my research, and I am building up some complexity. Since I am ingesting a ton of material, I wrote a custom_ingest.py that gives me a bit more insight into what is going on and a bit more control over token ingestion. It tells me which files are being processed and gives a count of how much time is left. If there is interest, I could merge this. It's not perfect. I usually throw a hundred files or so at it at a time, but large PDFs can be problematic.
For example, I used to sit here and wonder how many hours something was going to take. Now I can see how long it will take to ingest these 195 documents, or at least get a better idea of how far along we are. I did have to update matlib + charset and a few other libs to get this to work.
2024-02-05 16:14:26,396 - WARNING - text_splitter.py:176 - Created a chunk of size 1713, which is longer than the specified 1000
2024-02-05 16:14:26,970 - INFO - SentenceTransformer.py:66 - Load pre trained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2024-02-05 16:14:34,784 - INFO - custom_ingest.py:151 - Ingested batch 1/195, 0.51% complete
2024-02-05 16:14:40,216 - INFO - custom_ingest.py:151 - Ingested batch 2/195, 1.03% complete
2024-02-05 16:14:45,999 - INFO - custom_ingest.py:151 - Ingested batch 3/195, 1.54% complete
2024-02-05 16:14:52,932 - INFO - custom_ingest.py:151 - Ingested batch 4/195, 2.05% complete
2024-02-05 16:14:58,689 - INFO - custom_ingest.py:151 - Ingested batch 5/195, 2.56% complete
2024-02-05 16:15:04,510 - INFO - custom_ingest.py:151 - Ingested batch 6/195, 3.08% complete
2024-02-05 16:15:10,455 - INFO - custom_ingest.py:151 - Ingested batch 7/195, 3.59% complete
2024-02-05 16:15:16,187 - INFO - custom_ingest.py:151 - Ingested batch 8/195, 4.10% complete
2024-02-05 16:15:21,968 - INFO - custom_ingest.py:151 - Ingested batch 9/195, 4.62% complete
2024-02-05 16:15:28,333 - INFO - custom_ingest.py:151 - Ingested batch 10/195, 5.13% complete
2024-02-05 16:15:33,562 - INFO - custom_ingest.py:151 - Ingested batch 11/195, 5.64% complete
2024-02-05 16:17:34,903 - INFO - custom_ingest.py:151 - Ingested batch 31/195, 15.90% complete
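For anyone curious how this kind of progress reporting can be done, here is a minimal sketch. It is not the actual custom_ingest.py, which is not shown in this post. The names `format_eta`, `progress_message`, and `ingest_in_batches` are hypothetical, and the time-left estimate is a naive assumption that the remaining batches take the average time of the batches completed so far.

```python
import logging
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical logger name; the real script's logging setup is not shown.
logger = logging.getLogger("custom_ingest_sketch")


def format_eta(seconds: float) -> str:
    """Render a duration in seconds as H:MM:SS."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def progress_message(done: int, total: int, elapsed: float) -> str:
    """Build a log line with batch count, percent complete, and a naive ETA."""
    percent = 100.0 * done / total
    # Assumption: remaining batches take the average time per batch so far.
    remaining = (elapsed / done) * (total - done)
    return (
        f"Ingested batch {done}/{total}, {percent:.2f}% complete, "
        f"about {format_eta(remaining)} left"
    )


def ingest_in_batches(
    batches: Sequence[T],
    ingest_batch: Callable[[T], object],
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Run ingest_batch on each batch and log progress after each one."""
    total = len(batches)
    start = clock()
    for done, batch in enumerate(batches, start=1):
        ingest_batch(batch)  # stand-in for the real embedding/ingest step
        logger.info(progress_message(done, total, clock() - start))
```

The estimate gets steadier as more batches finish, but a few very large PDFs late in the run can still throw it off, so treat it as a rough guide.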