bakame-php/pdftotext

Extract text from a pdf

This package provides a class to extract text from a pdf.

This is a fork of Spatie/pdftotext.

<?php

use Bakame\Pdftotext\Pdftotext;

$pdftotext = Pdftotext::fromUnix();
$text = $pdftotext->extract('/path/to/file.pdf');

Requirements

You need PHP >= 7.2, but the latest stable version of PHP is recommended.

Behind the scenes this package leverages the pdftotext binary. You can verify whether the binary is installed on your system by issuing this command:

which pdftotext

If it is installed, the command prints the path to the binary.
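The same check can be done from PHP, for example before deciding which path to hand to the constructor. This is only a minimal sketch: locateBinary is a hypothetical helper, not part of this package, and it assumes a POSIX shell where command -v is available.

```php
<?php

/**
 * Hypothetical helper (not part of bakame/pdftotext): returns the path of a
 * binary found on the PATH, or null when the binary cannot be found.
 */
function locateBinary(string $name): ?string
{
    $output = [];
    $exitCode = 1;

    // `command -v` is the POSIX shell counterpart of `which`.
    exec('command -v ' . escapeshellarg($name) . ' 2>/dev/null', $output, $exitCode);

    if ($exitCode !== 0 || !isset($output[0]) || $output[0] === '') {
        return null;
    }

    return $output[0];
}

$path = locateBinary('pdftotext');
echo $path ?? 'pdftotext was not found, install poppler first', PHP_EOL;
```

When the helper returns null, the binary still has to be installed before the package can be used.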

To install the binary you can use:

  • On apt-based systems:

apt-get install poppler-utils

  • On yum-based systems:

yum install poppler-utils

  • On macOS:

brew install poppler

Installation

You can install the package via composer:

composer require bakame/pdftotext

Usage

Extracting text from a PDF is easy. You only need to specify:

  • the path to the pdftotext binary.
  • the path to the PDF file to extract.

<?php

use Bakame\Pdftotext\Pdftotext;

$text = (new Pdftotext('/path/to/pdftotext'))
    ->extract('/path/to/file.pdf')
;

If you are on a Linux-based system, you can use the fromUnix named constructor, which tries to locate the pdftotext executable and returns an instance that uses its path.

<?php

use Bakame\Pdftotext\Pdftotext;

$text = Pdftotext::fromUnix()->extract('/path/to/file.pdf');

Sometimes you may want to use pdftotext options. You can pass them, written without the leading dash, as the second argument of the extract method, as shown below:

<?php

use Bakame\Pdftotext\Pdftotext;

$text = Pdftotext::fromUnix()->extract('table.pdf', ['layout', 'r 96']);

If you need default options, you can set them once, either with the setDefaultOptions method or through the class constructor, so that they apply to every extraction call:

<?php

use Bakame\Pdftotext\Pdftotext;

$text = (new Pdftotext('/path/to/pdftotext', ['layout', 'r 96']))
    ->extract('table.pdf', ['f 1'])
;
// will return the same data as

$text = Pdftotext::fromUnix(['layout', 'r 96'])->extract('table.pdf', ['f 1']);

// will return the same data as

$pdftotext = new Pdftotext('/path/to/pdftotext');
$pdftotext->setDefaultOptions(['layout', 'r 96']);
$text = $pdftotext->extract('table.pdf', ['f 1']);

Default options are merged with the individual options passed when calling the extract method.
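To make the merge concrete, here is a rough sketch of how default and per-call options could end up as command-line flags. buildFlags is a hypothetical helper written for illustration. The package's actual merge strategy (for instance how it resolves an option given in both lists) is not described here, so treat this as an assumption rather than the real implementation.

```php
<?php

/**
 * Hypothetical helper (not the package's code): merges default options with
 * per-call options, then adds the leading dash that pdftotext flags need.
 */
function buildFlags(array $defaults, array $perCall): array
{
    $flags = [];

    // Defaults come first, per-call options are appended after them.
    foreach (array_merge($defaults, $perCall) as $option) {
        // Strip any dash the caller already typed, then add exactly one.
        $flags[] = '-' . ltrim(trim($option), '-');
    }

    return $flags;
}

print_r(buildFlags(['layout', 'r 96'], ['f 1']));
// The resulting list is ['-layout', '-r 96', '-f 1'].
```

Under these assumptions, the three equivalent snippets above would all run pdftotext with the layout, resolution and first-page flags together.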

You can even directly save your text extraction to a file using the save method. This method takes the same arguments as the extract method but requires a destination file as its second argument.

<?php

use Bakame\Pdftotext\Pdftotext;

$bytes = Pdftotext::fromUnix(['layout', 'r 96'])->save('table.pdf', 'table.txt', ['f 1']);

The returned $bytes is the number of bytes written to the file.

Advanced usage

You can set a timeout with the setTimeout method if you are dealing with large PDF files. By default, the timeout is set to 60 seconds.

<?php

use Bakame\Pdftotext\Pdftotext;

$pdftotext = new Pdftotext('/path/to/pdftotext', ['layout', 'r 96']);
$pdftotext->setTimeout(120); // the extraction will time out after 2 minutes.
$bytes = $pdftotext->save('table.pdf', 'table.txt', ['f 1']);

Testing

The package has:

  • a coding style compliance test suite using PHP CS Fixer.
  • a code analysis compliance test suite using PHPStan.
  • a PHPUnit test suite.

To run the tests, run the following command from the project folder.

$ composer test

Contributing

Contributions are welcome and will be fully credited. Please see CONTRIBUTING and CONDUCT for details.

Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

Changelog

Please see CHANGELOG for more information on what has changed recently.

License

The MIT License (MIT). Please see License File for more information.

Credits
