Skip to content

TheoGicquel/L3-IrisaParser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

IrisaParser

Python script to parse pdf files into text files.

Setup

You can get the latest stable version by cloning the master branch of our repository with :

git clone https://gitlab.com/inf1603/irisaparser.git

For other versions please see the release section on GitLab, and download the source archive corresponding to the desired version.

Make sure that Python and Pip are installed on your computer.

Software Requirements:

  • Python >3.10
  • Pip 22.0
  • Java >7

Pip packages (optional):

This step is necessary only if you do not want to build the package yourself, if you do not wisth to install manually these dependencies, jump to the Build section.

Once all software requirements are satisfied, please install the following pip packages : To install PiPy packages make sure that pip is installed and use :

pip install <packagename>

The following dependencies are required :

You can also install every dependency in one go with :

pip install pdfplumber tika spacy colorama

Usage

/!\ At first usage, the module will download tika java package, this may take a lot of time. /!\

Note that you must have the dependencies of the Pip package section to use irisaparser this way, if it's not the case please see the Build section or consult the Pip package section.

Make sure you are placed above the irisaparser directory, or for more convenience build the package.

To parse a pdf file use the command:

python3 -m irisaparser <args>

For example to parse a file named f1.pdf located in /home/user/pdfs/ and to output results in /tmp, use:

python3 -m irisaparser /home/user/pdfs/f1.pdf -o /tmp

To parse all the files located in /home/user/pdfs, use:

python3 -m irisaparser -d /home/user/pdfs

Note that if no output directory is provided using -o, text files will be placed in current directory.

For more information about how to use the module, type the following command:

python3 -m irisaparser --help

Build

Make sure that you have setuptools and wheel on your system, you can install these dependencies with the command:

pip install setuptools wheel

To build the package make sure that you are above the irisaparser directory inside the file hierarchy, and use the following command:

python3 setup.py bdist_wheel

for more information about this command consult the documentation

To install your package, use the following command with the .whl file you just built:

pip install <some .whl file>

Package usage

/!\ At first usage, the module will download tika java package, this may take a lot of time. /!\

Once you have installed the package, you can use it with the following command everywhere:

python3 -m irisaparser <args>

You can also import the package in a python script like this :

import irisaparser

# parse a file and put the output in the specified directory:
irisaparser.parseFile("filename","./out")

# parse a file and put the xml output in the current directory:
irisaparser.parseFile("filename",xml=True)

# parse multiple files provided in the list parameter files:
filesnames = ["file1","file2"]
irisaparser.parseFiles(filenames)

# do the equivalent of the command line usage, just pass the arguments as a list of strings:
args = ["-d","./Corpus_2021","-o","./out"]
irisaparser.parseArgs(args)

Documentation

For more information about the different options and behaviors of the program consult our documentation page (or the DOC.md file if you downloaded the source code).