fix: handle too large / long pdfs with hf embeddings
percevalw committed Feb 9, 2024
1 parent 0de3307 commit a652a00
Showing 1 changed file with 14 additions and 4 deletions.
18 changes: 14 additions & 4 deletions edspdf/pipes/embeddings/huggingface_embedding.py
@@ -165,14 +165,24 @@ def preprocess(self, doc: PDFDoc):

         for page in doc.pages:
             # Preprocess it using LayoutLMv3
+            width = page.width
+            height = page.height
+
+            if width > 1000:  # LayoutLMv3 expects box coordinates in 0-1000
+                height *= 1000 / width  # rescale before overwriting width
+                width = 1000
+            if height >= 1000:
+                width *= 1000 / height  # rescale before overwriting height
+                height = 1000
+
             prep = self.tokenizer(
                 text=[line.text for line in page.text_boxes],
                 boxes=[
                     (
-                        int(line.x0 * line.page.width),
-                        int(line.y0 * line.page.height),
-                        int(line.x1 * line.page.width),
-                        int(line.y1 * line.page.height),
+                        int(line.x0 * width),
+                        int(line.y0 * height),
+                        int(line.x1 * width),
+                        int(line.y1 * height),
                     )
                     for line in page.text_boxes
                 ],
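
The intent of the hunk above, shrinking an oversized page so that both sides fit within 1000 while keeping the aspect ratio, can be sketched as a standalone helper. This is only an illustrative sketch, not code from the repository: `fit_page_size` and `to_pixel_box` are hypothetical names, and it assumes, as the multiplication by the page size suggests, that `x0`, `y0`, `x1`, `y1` are relative coordinates in [0, 1].

```python
from typing import Tuple


def fit_page_size(
    width: float, height: float, limit: float = 1000.0
) -> Tuple[float, float]:
    """Shrink (width, height) so that neither side exceeds `limit`.

    Hypothetical helper: a single scale factor keeps the aspect ratio,
    and pages that already fit are returned unchanged (scale == 1.0).
    """
    scale = min(1.0, limit / width, limit / height)
    return width * scale, height * scale


def to_pixel_box(
    x0: float, y0: float, x1: float, y1: float, width: float, height: float
) -> Tuple[int, int, int, int]:
    """Convert a relative box (assumed to lie in [0, 1]) to integer coordinates."""
    return (int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height))


# A 2000 x 4000 page is scaled by 0.25, so its longest side becomes 1000.
width, height = fit_page_size(2000, 4000)
box = to_pixel_box(0.5, 0.25, 1.0, 1.0, width, height)
```

Taking the minimum of the two ratios handles the "too large" and "too long" cases in one step, so the order of the width and height checks no longer matters.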
