PDF Text extraction trial #19

Open
WolfgangFahl opened this issue Jun 10, 2020 · 2 comments

@WolfgangFahl
Contributor

#!/bin/bash
# WF 2020-06-10
# get text from pdf
# check that pdftotext is available before doing any work
if ! command -v pdftotext > /dev/null
then
  echo "you might want to install pdftotext e.g. with sudo apt-get install poppler-utils" 1>&2
  echo "see https://en.wikipedia.org/wiki/Pdftotext" 1>&2
  exit 1
else
  log=/tmp/pdf2text$$.log
  limit=10000
  # read one path per line so that file names containing spaces stay intact
  find . -name '*.pdf' | head -n "$limit" | while IFS= read -r f
  do
    b=$(basename "$f" .pdf)
    d=$(dirname "$f")
    # the text is written next to the pdf as <name>-content.txt
    txt="$d/$b-content.txt"
    echo "extracting text from $f to $txt ..."
    echo "extracting text from $f to $txt ..." >> "$log"
    pdftotext -layout "$f" "$txt" 2>>"$log"
  done
  echo "done. "
  echo "See log results below ..."
  # show only the relevant log lines: skip progress lines and "Bad annotation" warnings
  grep -v "Bad annotation" "$log" | grep -v "extracting"
  echo "... end of log"
fi
@WolfgangFahl
Contributor Author

This could be the basis for a full-text search feature and also for content negotiation in "txt" format.
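
As a rough illustration of the full-text search idea, the extracted `*-content.txt` files could be searched with plain `grep`. This is only a sketch: the function name `search_content` is hypothetical, and it assumes a `grep` that supports `--include` (GNU and BSD grep do). A real search feature would more likely use an index.

```shell
#!/bin/bash
# hypothetical helper, not part of the extraction script above:
# list the extracted text files below a directory that contain a search term
search_content() {
  local term="$1"
  local root="${2:-.}"
  # -r recurse, -i ignore case, -l print matching file names only,
  # -F treat the term as a fixed string instead of a regular expression
  grep -r -i -l -F --include='*-content.txt' -- "$term" "$root"
}
```

Calling `search_content "semantic web" .` would then print the paths of all `-content.txt` files that mention the term, one per line.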

@WolfgangFahl
Contributor Author

WolfgangFahl commented Jun 10, 2020

On my server about 1000 PDFs are converted to text per minute, so the whole conversion may take about 45 minutes.
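
For other machines or collections the same back-of-the-envelope estimate can be scripted. The sketch below is an assumption-laden illustration: `estimate_minutes` is a hypothetical name, and the default rate of 1000 PDFs per minute is simply the figure observed above, so it will differ elsewhere.

```shell
#!/bin/bash
# hypothetical helper: estimate the conversion time in minutes, rounded up
# $1: number of pdf files
# $2: files converted per minute (defaults to the 1000 observed above)
estimate_minutes() {
  local count="$1"
  local per_minute="${2:-1000}"
  # integer ceiling division
  echo $(( (count + per_minute - 1) / per_minute ))
}
```

For example, `estimate_minutes "$(find . -name '*.pdf' | wc -l)"` prints the estimate for all PDFs below the current directory.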
