1950 Census Textract Script

A custom script that extracts name data from 1950 Census record images. It uses the Amazon Textract service to extract the text and the Serverless Framework to spin up hundreds of AWS Lambda processes that handle the images in parallel.
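As a rough illustration of the text-extraction step, the sketch below calls Textract on one image stored in S3 and keeps only the detected text lines. It is not taken from this repository: the function names and the `min_confidence` parameter are hypothetical, and it assumes the synchronous `detect_document_text` API, which may differ from what handler.py actually calls.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of every LINE block in a Textract response dict."""
    lines = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and WORD blocks; only LINE blocks are kept.
        if block.get("BlockType") != "LINE":
            continue
        # Confidence is reported on a 0-100 scale.
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block.get("Text", ""))
    return lines


def detect_lines_in_s3_image(bucket, key, region):
    """Run Textract on one S3 image (hypothetical helper, needs AWS access)."""
    import boto3  # imported here so extract_lines works without boto3 installed

    client = boto3.client("textract", region_name=region)
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)
```

Keeping the response parsing separate from the AWS call makes it possible to test the parsing on saved responses without network access.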

Requirements

Node.js and Python 3 must be installed on the development server, along with the following npm and Python packages.

npm install serverless

pip3 install sagemaker ipython scikit-build opencv-python matplotlib editdistance fuzzywuzzy python-Levenshtein

Set the AWS credential environment variables so that the scripts can access the required AWS resources:

export AWS_ACCESS_KEY_ID=AWS_ACCESS_ID

export AWS_SECRET_ACCESS_KEY=AWS_ACCESS_SECRET_KEY

export AWS_DEFAULT_REGION=us-gov-east-1

Test Runs

Set the S3 bucket and region environment variables:

export BUCKET_SRC=source_s3_bucket
export BUCKET_DST=output_s3_bucket
export REGION=us-gov-east-1
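Before a test run it can help to confirm that every variable mentioned so far is actually set. The following is a minimal sketch of such a check; the helper name `missing_variables` is hypothetical and is not part of this repository.

```python
import os

# Variables named in this README: AWS credentials plus bucket settings.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "BUCKET_SRC",
    "BUCKET_DST",
    "REGION",
)


def missing_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_variables()` with no argument inspects the current process environment; an empty list means all six variables are present.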

Run "./scripts/test.sh --debug --s3 1950census/43290879-Kansas/43290879-Kansas-045836/43290879-Kansas-045836-0009.jpg" to extract text from the image stored under that key in the source S3 bucket.
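The example key appears to follow a regular naming pattern. The sketch below splits such a key into its parts. The pattern is inferred from this single example only, and the field names (`record_id`, `state`, `group_number`, `image_number`) are guesses for illustration, not terms used by the repository.

```python
from pathlib import PurePosixPath


def split_image_key(key):
    """Split an S3 key shaped like the example above into labeled parts.

    Raises ValueError if the file name has fewer than two hyphens.
    """
    path = PurePosixPath(key)
    # File name without extension, e.g. "43290879-Kansas-045836-0009".
    head, group_number, image_number = path.stem.rsplit("-", 2)
    # Everything before the first hyphen is treated as an identifier,
    # the rest as the state name (which may itself contain hyphens).
    record_id, _, state = head.partition("-")
    return {
        "prefix": path.parts[0],
        "record_id": record_id,
        "state": state,
        "group_number": group_number,
        "image_number": image_number,
        "extension": path.suffix,
    }
```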

Cloud Deployment

Update the source and destination S3 bucket names in the serverless.yml file so that they point to the correct buckets. Also update the securityGroupIds and subnetIds under the VPC section.

Run ./scripts/build-layer.sh to create the Lambda layer dependencies.zip file

Run "serverless deploy" to deploy the Lambda functions to AWS

Run "python3 ./scripts/addtosqs.py path/to/full/path/image/list/file.txt" to add the image list to the Lambda function's processing queue
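To illustrate the queueing step, here is a hedged sketch of how an image list could be pushed to SQS and read back inside a Lambda function. It does not reproduce addtosqs.py or handler.py: the function names are hypothetical, and it assumes one image key per line in the list file and a message body that is the bare key.

```python
def read_image_keys(list_path):
    """Read one image key per line, skipping blank lines."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_batches(keys, batch_size=10):
    """Group keys into SQS batch entries (SQS accepts at most 10 per call)."""
    batches = []
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        batches.append(
            [
                # Ids only need to be unique within a single batch.
                {"Id": str(start + offset), "MessageBody": key}
                for offset, key in enumerate(chunk)
            ]
        )
    return batches


def enqueue(queue_url, keys, region):
    """Send all keys to the queue (hypothetical helper, needs AWS access)."""
    import boto3  # imported here so the pure helpers work without boto3

    sqs = boto3.client("sqs", region_name=region)
    for entries in build_batches(keys):
        # The response may list failed entries; a real script should retry them.
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)


def keys_from_sqs_event(event):
    """Extract message bodies from the event an SQS-triggered Lambda receives."""
    return [record["body"] for record in event.get("Records", [])]
```

Each Lambda invocation receives a small group of messages, which is how a long image list can fan out across many functions running in parallel.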

usnationalarchives/1950-Census-Textract-Code

Folders and files

Latest commit

History

Repository files navigation

1950 Census Textract Script

Requirements

Test Runs

Cloud Deployment

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages