Utilities used to process data from PDF files:
- Tabula - a tool for liberating data tables locked inside PDF files.
- Visual Studio 2015 - IDE used to develop the .NET application.
- Python Tools for Visual Studio - completely free Python support within Visual Studio. Adds functionality to VS for working with Python and creating projects in this language.
- Python - Python language runtime and main libraries. Installing it through Anaconda is recommended.
- Anaconda - data science toolbox that uses Python to work with data. Includes Python and many libraries and packages for a quick start.
The solution was created in Visual Studio and has two types of projects: a C# application to process data and Python scripts to work with the CSV files extracted from the PDFs with Tabula.
The application contains two parts:
- Python scripts that process the CSV files generated by Tabula from the PDF documents. These scripts create JSON files with the data for each of the 6 tables.
- C# console application that reads the generated JSON files and does additional processing (reads and saves them as text files or exports them to Excel documents).
A simple description of how the files are processed:
PDF Files (source)[input] -> 1) Tabula (manual processing) -> CSV files[output->input] -> 2) Python scripts (auto) -> JSON files[output->input] -> 3) C# App (auto) -> ...
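Step 2 of the pipeline above can be illustrated with a minimal sketch. It is not the repository's actual script: the function name `csv_rows_to_records`, the three-column layout, and the sample values are assumptions made for illustration only. It assumes that Tabula output mixes header or title rows with numeric rows, and keeps only the rows that convert cleanly.

```python
import csv
import io
import json


def csv_rows_to_records(csv_text, fields):
    """Convert numeric CSV rows to dicts; skip rows that do not parse.

    fields is a list of (name, converter) pairs, one per expected column.
    """
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != len(fields):
            continue  # unexpected column count, e.g. a page title
        try:
            record = {
                name: convert(cell.strip())
                for (name, convert), cell in zip(fields, row)
            }
        except ValueError:
            continue  # header or other non-numeric row
        records.append(record)
    return records


# Hypothetical column layout and made-up values (not taken from the IRS tables).
FIELDS = [("Age", int), ("AdjustedPayoutRate", float), ("RemainderFactor", float)]
sample_csv = "Age,Rate,Factor\n0,4.2,0.5\n1,4.2,0.25\n"

records = csv_rows_to_records(sample_csv, FIELDS)
print(json.dumps(records, indent=2))
```

Passing converters per column keeps integer fields such as Age as integers in the JSON output, which matches the C# field types listed in this README.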
All data files are already in the repository and already processed, so there is no need to run the processing again. All that is needed is to start using the JSON files.
To start working with the data, copy all JSON files to a working folder on the local machine and change the value of the constant in the class Constants (\DataProcessingApp.Core\Constants.cs) - public const string BaseDatadir = "i:\\Working Data";
so that it points to that folder. After this the application can be built and run. As a result, a set of new files is created in the working folder (text files, JSON with combined data for the U(2) and R(2) tables, Excel files).
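Before building the C# application, it can help to confirm that the working folder really contains every JSON file. The following is a minimal sketch of such a check, not part of the repository: the helper `missing_json_files` is hypothetical, and the expected names are copied from the JSON file list in this README.

```python
import tempfile
from pathlib import Path

PARTS = range(1, 6)
# File names as listed in the JSON section of this README.
EXPECTED_JSON_FILES = (
    ["tabula-TableC-1990-processed.json"]
    + [f"tabula-TableR(2)-p{n}-1990-processed.json" for n in PARTS]
    + ["tabula-TableU1-1990-processed.json"]
    + [f"tabula-TableU(2)-p{n}-1990-processed.json" for n in PARTS]
    + ["tabula-TableH-1990-processed.json", "tabula-TableS-1990-processed.json"]
)


def missing_json_files(base_data_dir):
    """Return the expected JSON file names not found in base_data_dir."""
    folder = Path(base_data_dir)
    return [name for name in EXPECTED_JSON_FILES if not (folder / name).is_file()]


# Demo on a temporary folder instead of a real working folder.
with tempfile.TemporaryDirectory() as tmp:
    missing_in_empty = missing_json_files(tmp)
    (Path(tmp) / "tabula-TableH-1990-processed.json").write_text("[]", encoding="utf-8")
    missing_after_copy = missing_json_files(tmp)

print(len(missing_in_empty), len(missing_after_copy))
```

In real use the argument would be the same folder that BaseDatadir points to, for example `missing_json_files(r"i:\Working Data")`.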
The C# application can be extended if needed to add further processing or more export file formats.
Note that there are two very big tables, U(2) and R(2), with 615 pages each. To simplify processing, the source PDF files for these tables were split into 5 parts. The split was also necessary to accommodate all the data, because the information for each interest rate cannot be placed in one big table.
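The arithmetic of the split is simple: 615 pages divided into 5 parts gives 123 pages per part. The sketch below computes the page ranges under the assumption that the parts are consecutive and equal in size; the README does not state which pages went into which part, and `split_pages` is a hypothetical helper.

```python
def split_pages(total_pages, parts):
    """Return inclusive 1-based (first, last) page ranges of equal size."""
    if total_pages % parts != 0:
        raise ValueError("total_pages must divide evenly into parts")
    size = total_pages // parts
    return [(i * size + 1, (i + 1) * size) for i in range(parts)]


ranges = split_pages(615, 5)
for number, (first, last) in enumerate(ranges, start=1):
    print(f"part {number}: pages {first}-{last}")
```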
List of tables that can be processed:
- Table C - Factors for Reducing Assurances – Based on Life Table 90CM.
- Table R(2) - Based on Life Table 90CM.
- Table U(1) - Based on Life Table 90CM.
- Table U(2) - Based on Life Table 90CM.
- Table H - Commutation Factors Based on Life Table 90CM.
- Table S - Single Life Factors Based on Life Table 90CM.
These tables contain the actuarial factors for the 1990 base year from official IRS documents.
Each table has a set of data fields. The fields are listed below with their C# data types.
- Table C - Factors for Reducing Assurances – Based on Life Table 90CM.
  - int MortalityTable
  - double Rate
  - int Age
  - double RemainderFactor
  - double RFactor
  - double DFactor
- Table R(2) - Based on Life Table 90CM.
  - int MortalityTable
  - int Age1
  - int Age2
  - double AdjustedPayoutRate
  - double RemainderFactor
- Table U(1) - Based on Life Table 90CM.
  - int MortalityTable
  - int Age
  - double AdjustedPayoutRate
  - double RemainderFactor
- Table U(2) - Based on Life Table 90CM.
  - int MortalityTable
  - int Age1
  - int Age2
  - double AdjustedPayoutRate
  - double RemainderFactor
- Table H - Commutation Factors Based on Life Table 90CM.
  - int MortalityTable
  - double InterestRate
  - int Age
  - double DFactor
  - double NFactor
  - double MFactor
- Table S - Single Life Factors Based on Life Table 90CM.
  - int MortalityTable
  - double InterestRate
  - int Age
  - double PvAnnuity
  - double PvLifeEstate
  - double PvReminderInterest
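For consumers outside the C# application, the field lists above map naturally onto typed records. The following sketch mirrors the Table S fields in a Python dataclass. It assumes that a processed JSON file is an array of objects keyed by exactly these field names, which this README does not confirm; the sample values are placeholders, not real factors.

```python
import json
from dataclasses import dataclass


@dataclass
class TableSRecord:
    """One row of Table S; field names follow the C# field list."""
    MortalityTable: int
    InterestRate: float
    Age: int
    PvAnnuity: float
    PvLifeEstate: float
    PvReminderInterest: float


def parse_table_s(json_text):
    """Parse a JSON array of objects into TableSRecord instances."""
    return [TableSRecord(**item) for item in json.loads(json_text)]


# Placeholder values for illustration only (not taken from the IRS tables).
sample_json = json.dumps([
    {
        "MortalityTable": 90,
        "InterestRate": 4.2,
        "Age": 35,
        "PvAnnuity": 1.0,
        "PvLifeEstate": 0.5,
        "PvReminderInterest": 0.5,
    }
])

rows = parse_table_s(sample_json)
print(rows[0])
```

If the real files use different key names, `TableSRecord(**item)` raises a TypeError, which makes a mismatch visible immediately.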
There are three categories of files:
- Source files (PDF).
- Files with extracted data (CSV).
- Files with processed data (JSON).
Finally, all files that the C# application can create (Excel, text files, etc.) also count as processed data.
The source files (PDF) are included in the repository and placed in the folder DataFiles.
Data files with source tables:
- Table C - Factors for Reducing Assurances – Based on Life Table 90CM.
  - TableC-1990.pdf - 100 pages.
- Table R(2) - Based on Life Table 90CM.
  - TableR(2)-1990.pdf - full table - 615 pages.
  - TableR(2)-p1-1990.pdf - part 1 - 123 pages.
  - TableR(2)-p2-1990.pdf - part 2 - 123 pages.
  - TableR(2)-p3-1990.pdf - part 3 - 123 pages.
  - TableR(2)-p4-1990.pdf - part 4 - 123 pages.
  - TableR(2)-p5-1990.pdf - part 5 - 123 pages.
- Table U(1) - Based on Life Table 90CM.
  - TableU(1)-1990.pdf - 18 pages.
- Table U(2) - Based on Life Table 90CM.
  - TableU(2)-1990.pdf - full table - 615 pages.
  - TableU(2)-p1-1990.pdf - part 1 - 123 pages.
  - TableU(2)-p2-1990.pdf - part 2 - 123 pages.
  - TableU(2)-p3-1990.pdf - part 3 - 123 pages.
  - TableU(2)-p4-1990.pdf - part 4 - 123 pages.
  - TableU(2)-p5-1990.pdf - part 5 - 123 pages.
- Table H - Commutation Factors Based on Life Table 90CM.
  - TableH-1990.pdf - 100 pages.
- Table S - Single Life Factors Based on Life Table 90CM.
  - TableS-1990.pdf - 100 pages.
All extracted files (CSV) are also included in the repository and placed in the folder PythonDataApp\DataFiles\. This folder is also included in the solution for the Python project.
- Table C - Factors for Reducing Assurances – Based on Life Table 90CM.
  - tabula-TableC-1990-2.csv
- Table R(2) - Based on Life Table 90CM.
  - tabula-TableR(2)-p1-1990.csv
  - tabula-TableR(2)-p2-1990.csv
  - tabula-TableR(2)-p3-1990.csv
  - tabula-TableR(2)-p4-1990.csv
  - tabula-TableR(2)-p5-1990.csv
- Table U(1) - Based on Life Table 90CM.
  - tabula-TableU(1)-1990-2.csv
- Table U(2) - Based on Life Table 90CM.
  - tabula-TableU(2)-p1-1990.csv
  - tabula-TableU(2)-p2-1990.csv
  - tabula-TableU(2)-p3-1990.csv
  - tabula-TableU(2)-p4-1990.csv
  - tabula-TableU(2)-p5-1990.csv
- Table H - Commutation Factors Based on Life Table 90CM.
  - tabula-TableH-1990.csv
- Table S - Single Life Factors Based on Life Table 90CM.
  - tabula-TableS-1990.csv
The processed files (JSON) hold the information extracted from the CSV files. They contain the correct information from the tables in the PDF files and can be used in different applications.
These files are placed in the folder JSONFiles.
- Table C - Factors for Reducing Assurances – Based on Life Table 90CM.
  - tabula-TableC-1990-processed.json
- Table R(2) - Based on Life Table 90CM.
  - tabula-TableR(2)-p1-1990-processed.json
  - tabula-TableR(2)-p2-1990-processed.json
  - tabula-TableR(2)-p3-1990-processed.json
  - tabula-TableR(2)-p4-1990-processed.json
  - tabula-TableR(2)-p5-1990-processed.json
- Table U(1) - Based on Life Table 90CM.
  - tabula-TableU1-1990-processed.json
- Table U(2) - Based on Life Table 90CM.
  - tabula-TableU(2)-p1-1990-processed.json
  - tabula-TableU(2)-p2-1990-processed.json
  - tabula-TableU(2)-p3-1990-processed.json
  - tabula-TableU(2)-p4-1990-processed.json
  - tabula-TableU(2)-p5-1990-processed.json
- Table H - Commutation Factors Based on Life Table 90CM.
  - tabula-TableH-1990-processed.json
- Table S - Single Life Factors Based on Life Table 90CM.
  - tabula-TableS-1990-processed.json
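Because R(2) and U(2) are stored as five part files each, a consumer usually wants them merged into one list. The sketch below shows one way to do that in Python. It is not the C# application's implementation: `combine_parts` is a hypothetical helper, and it assumes each part file holds a JSON array of records. The demo writes placeholder records to a temporary folder rather than reading real table data.

```python
import json
import tempfile
from pathlib import Path


def combine_parts(folder, table_name, parts=5):
    """Concatenate the records of parts 1..parts of a table, in order."""
    combined = []
    for part in range(1, parts + 1):
        path = Path(folder) / f"tabula-{table_name}-p{part}-1990-processed.json"
        combined.extend(json.loads(path.read_text(encoding="utf-8")))
    return combined


# Demo with placeholder records (not real table data) in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for part in range(1, 6):
        name = f"tabula-TableR(2)-p{part}-1990-processed.json"
        (Path(tmp) / name).write_text(json.dumps([{"part": part}]), encoding="utf-8")
    combined = combine_parts(tmp, "TableR(2)")

print(len(combined))
```

Reading the parts in numeric order keeps the combined list in the same order as the pages of the full source table, assuming the parts were cut consecutively.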
- To use all the logic and scripts, install all the apps and libraries listed above.
- The section with data files contains each table name and the list of data files for that table.
Author: Andrey Kukharenko. Created on: December 2016.