PDF Text extraction trial #19

WolfgangFahl · 2020-06-10T13:36:54Z

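#!/bin/bash
# WF 2020-06-10
# get text from pdf
# check that pdftotext is available
if ! command -v pdftotext > /dev/null
then
  echo "you might want to install pdftotext e.g. with sudo apt-get install poppler-utils" 1>&2
  echo "see https://en.wikipedia.org/wiki/Pdftotext" 1>&2
  exit 1
else
  log=/tmp/pdf2text$$.log
  # create the log up front so the final grep works even if no pdf is found
  : > "$log"
  limit=10000
  # read one path per line so that file names containing spaces survive
  find . -name '*.pdf' | head -n "$limit" | while IFS= read -r f
  do
    b=$(basename "$f" .pdf)
    d=$(dirname "$f")
    txt="$d/$b-content.txt"
    echo "extracting text from $f to $txt ..."
    echo "extracting text from $f to $txt ..." >> "$log"
    # -layout keeps the original physical layout of the text
    pdftotext -layout "$f" "$txt" 2>>"$log"
  done
  echo "done. "
  echo "See log results below ..."
  # show only unexpected messages: hide progress lines and "Bad annotation" warnings
  grep -v "Bad annotation" "$log" | grep -v "extracting"
  echo "... end of log"
fi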

WolfgangFahl · 2020-06-10T13:37:34Z

This could be the basis for a full-text search feature and also for content negotiation in "txt" format.
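
As a rough illustration of the full-text search idea, the following sketch lists the PDFs whose extracted text contains a given term. It assumes the "-content.txt" naming used by the script in this issue. The function name search_pdf_text is hypothetical and not part of any existing tool, and a real search feature would more likely use an index than a plain grep.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# list the PDFs whose extracted text contains a term.
# usage: search_pdf_text <term> [root-dir]
search_pdf_text() {
  local term="$1"
  local root="${2:-.}"
  # -F: treat the term as a fixed string, -i: ignore case,
  # -l: print only the names of files that contain a match
  find "$root" -name '*-content.txt' -exec grep -F -i -l -- "$term" {} + \
    | sed 's/-content\.txt$/.pdf/' \
    | sort
}
```

For example, search_pdf_text "ontology" . would print one PDF path per line for every text file below the current directory that mentions the term.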

WolfgangFahl · 2020-06-10T13:46:07Z

On my server about 1000 PDFs are converted to text per minute, so the whole conversion may take about 45 minutes.
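
The estimate above is a simple division of the number of PDFs by the conversion rate. A minimal sketch of that arithmetic follows. The function names estimate_minutes and count_pdfs are hypothetical, and the default rate of 1000 per minute is just the figure quoted for this one server.

```shell
#!/bin/bash
# Hypothetical helper: estimate the conversion time in whole minutes (rounded up).
# usage: estimate_minutes <pdf-count> [pdfs-per-minute]
estimate_minutes() {
  local count="$1"
  local rate="${2:-1000}"   # assumption: rate quoted in the comment above
  echo $(( (count + rate - 1) / rate ))
}

# Hypothetical helper: count the PDFs below a directory (default: current directory).
# Counts one line per path, so names containing newlines would be miscounted.
count_pdfs() {
  find "${1:-.}" -name '*.pdf' | wc -l | tr -d ' '
}
```

Calling estimate_minutes "$(count_pdfs .)" gives a rough figure before starting a run. A 45 minute run at this rate would correspond to roughly 45000 PDFs, which is only what the two quoted numbers imply and is not a count stated in this issue.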

WolfgangFahl added the enhancement label Jun 10, 2020

WolfgangFahl self-assigned this Jun 10, 2020
