aws-open-data-registry-neural-search

Semantic search of AWS Open Data Registry datasets using Weaviate.

Project Structure

.
├── Makefile                # Makefile for deployment and teardown
├── README.md
├── (aws-open-data-registry-neural-search-key-pair.pem)
├── bin                     # CDK app
├── cdk.context.json
├── cdk.json
├── config.json             # Environment variables for CDK app
├── frontend                # Streamlit app
├── lib                     # CDK stacks
├── notebooks
├── package.json
├── requirements.txt
├── scripts                 # Bash scripts for Weaviate
├── src                     # EC2 instance user data
├── tasks                   # Fargate task
└── tsconfig.json

Description

Deploy a Weaviate instance and load it with AWS Open Data Registry datasets. Use semantic search queries in the Streamlit app to find datasets, tutorials, publications, and tools & applications.

Architecture

The application deploys the following resources.

  • EC2 instance
  • VPC (optional)
  • Fargate task (run on AWS Batch)

Deployment using Makefile

Once the environment is configured, the application can be deployed with a single command, make app. This creates the infrastructure, sets up and exposes the Weaviate instance, loads the data, and starts the Streamlit server once loading has finished.

SSH Key

The SSH key is created and deleted outside of the CDK app using Makefile targets, which means that you can run cdk deploy and cdk destroy without having to create or delete the SSH key. If the key's .pem file is detected in the project, it is assigned to the database instance. If no key is found, one must be created before the AWS resources are deployed. The best way to do this is to run make app or make deploy.
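The key detection described above might look roughly like the following sketch. It assumes the .pem file sits in the project root and is named after the ssh_key_name value in config.json; the function name is hypothetical, and the Makefile's actual logic may differ.

```python
from pathlib import Path
from typing import Optional

# Key name taken from the ssh_key_name value shown in config.json.
KEY_NAME = "aws-open-data-registry-neural-search-key-pair"


def find_key_file(project_root: str, key_name: str = KEY_NAME) -> Optional[Path]:
    """Return the path to <key_name>.pem in the project root, or None if absent."""
    candidate = Path(project_root) / f"{key_name}.pem"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    key = find_key_file(".")
    if key is None:
        print("No key file found; create one with `make app` or `make deploy`.")
    else:
        print(f"Using existing key: {key}")
```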

Streamlit Frontend

The Streamlit app is where you can interact with Weaviate by browsing, searching, and executing GraphQL queries.

Each dataset has several expandable sections containing information about it.

Custom queries can be run against Weaviate using the GraphQL API.
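As an illustration, a semantic query can also be sent to Weaviate's GraphQL endpoint (/v1/graphql) from Python. This is a minimal sketch using only the standard library; the class name Dataset and the fields name and description are hypothetical placeholders, so substitute the names from the deployed schema.

```python
import json
import urllib.request


def build_near_text_query(class_name: str, fields: list[str], concept: str, limit: int = 5) -> str:
    """Build a Weaviate GraphQL Get query that ranks objects by similarity to `concept`."""
    # json.dumps yields a double-quoted, escaped string list, which is also valid GraphQL.
    concepts = json.dumps([concept])
    return "{ Get { %s(nearText: {concepts: %s}, limit: %d) { %s } } }" % (
        class_name,
        concepts,
        limit,
        " ".join(fields),
    )


def run_query(endpoint: str, query: str) -> dict:
    """POST the query to Weaviate's GraphQL endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{endpoint}/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # "Dataset", "name" and "description" are hypothetical; use the names from your schema.
    print(build_near_text_query("Dataset", ["name", "description"], "satellite imagery"))
```

To execute the query, pass the endpoint reported by make weaviate.get.endpoint to run_query.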

Quickstart

  1. Configure your AWS credentials.
  2. Add environment variables to .env.
  3. Update the tags field in config.json if desired.
  4. Run npm install to install the Node.js dependencies.
  5. Run make app to deploy, load data and start the Streamlit server.

Installation

Follow these steps to configure the deployment environment.

Prerequisites

  • Node.js >= 18.0.0
  • TypeScript >= 4.4.3
  • AWS CDK >= 2.53.0
  • AWS CLI
  • Docker
  • Python 3.10
  • jq
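A quick way to confirm that the command-line prerequisites are on the PATH is sketched below. The executable names are assumptions about a typical installation, and the check does not verify versions. TypeScript is left out because it may be installed locally by npm install.

```python
import shutil
from typing import Callable, Optional

# Executable names are assumptions about how each prerequisite is installed.
REQUIRED_TOOLS = ["node", "npm", "cdk", "aws", "docker", "python3.10", "jq"]


def missing_tools(
    tools: list[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```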

Environment Variables

Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.

Create a .env file in the project root.

CDK_DEFAULT_ACCOUNT=<account_id>
CDK_DEFAULT_REGION=<region>

Important: Always use a .env file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, even during development. AWS can quarantine an account whose credentials are accidentally exposed, and recovering from that is disruptive.

Make sure that .env is listed in .gitignore.
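Before deploying, it can help to confirm that the .env file defines both variables. The following is a minimal sketch of such a check using a simple KEY=VALUE parser; it is not part of the project, and a library such as python-dotenv handles more syntax.

```python
from pathlib import Path

REQUIRED_VARS = ("CDK_DEFAULT_ACCOUNT", "CDK_DEFAULT_REGION")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(values: dict[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.is_file() else {}
    print("Missing variables:", missing_vars(values) or "none")
```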

CDK Application Configuration

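The CDK application configuration is stored in config.json. This file contains values for the database layer, the data ingestion layer, and tags. Before deploying, update the tags to your own values and restrict ssh_cidr to your own IP address in CIDR notation; the default 0.0.0.0/0 allows SSH connections from any address. The // comments in the example are annotations only; JSON does not allow comments, so leave them out of the actual file.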

{
    "layers": {
        "data_ingestion": {
            "env": {
                "repo_url": "https://github.com/awslabs/open-data-registry",
                "target_data_dir": "datasets"
            }
        },
        "vector_database": {
            "env": {
                "ssh_cidr": "0.0.0.0/0", // Update to your IP
                "ssh_key_name": "aws-open-data-registry-neural-search-key-pair"
            }
        }
    },
    "tags": {
        "org": "my-organization", // Update to your organization
        "app": "aws-open-data-registry-neural-search"
    }
}
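A small pre-deployment check of these settings could look like the sketch below. It is an illustration rather than part of the CDK app, and the function name is hypothetical. Note that json.load rejects // comments, so it only works on a comment-free config.json.

```python
import ipaddress
import json


def check_config(config: dict) -> list[str]:
    """Return warnings about the SSH settings and tags in a parsed config.json."""
    warnings = []
    env = config["layers"]["vector_database"]["env"]
    # ip_network raises ValueError if ssh_cidr is not valid CIDR notation.
    network = ipaddress.ip_network(env["ssh_cidr"], strict=False)
    if network.prefixlen == 0:
        warnings.append("ssh_cidr allows SSH from any address; restrict it to your IP")
    if not env.get("ssh_key_name"):
        warnings.append("ssh_key_name is empty")
    if not config.get("tags", {}).get("org"):
        warnings.append("tags.org is empty")
    return warnings


def check_config_file(path: str = "config.json") -> list[str]:
    """Load config.json from disk and run the checks above."""
    with open(path) as handle:
        return check_config(json.load(handle))
```

Calling check_config_file() from the project root returns an empty list when nothing needs attention.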

AWS Credentials

Valid AWS credentials must be available to the AWS CLI. The easiest way to provide them is to run aws configure, or to add them to ~/.aws/credentials and export the AWS_PROFILE variable to the environment.

For more information, see the AWS CLI documentation page "Configuration and credential file settings".

Python Development

Create a virtual environment for Python development.

# Create a virtual environment
python3.10 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Weaviate Configuration

Configuring Weaviate requires 3 steps:

  1. Download the Docker Compose file.
  2. Update the Docker Compose file to configure Weaviate to persist data and automatically restart on reboot.
  3. Run the Docker Compose file.

Download the Docker Compose File

Run the following command to download a Docker Compose file for Weaviate.

curl -o docker-compose.yaml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?generative_cohere=false&generative_openai=false&generative_palm=false&gpu_support=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&runtime=docker-compose&spellcheck_module=false&sum_module=false&text_module=text2vec-transformers&transformers_model=sentence-transformers-multi-qa-MiniLM-L6-cos-v1&weaviate_version=v1.19.8"

Update the Docker Compose File

Next, run the following command to configure Weaviate to persist data and automatically restart on reboot. It also disables Weaviate's auto-schema feature.

awk '
  /^  weaviate:$/ {
    print
    print "    restart: always"
    print "    volumes:"
    print "      - /data/weaviate:/var/lib/weaviate"
    while(getline && $0 !~ /^  /);
    if ($0 ~ /^  /) {
      print
    }
    next
  }
  /^  t2v-transformers:$/ {
    print
    print "    restart: always"
    while(getline && $0 !~ /^  /);
    if ($0 ~ /^  /) {
      print
    }
    next
  }
  /CLUSTER_HOSTNAME: '\''node1'\''/ {
    print
    print "      AUTOSCHEMA_ENABLED: '\''false'\''"
    next
  }
  /restart: on-failure:0/ {
    next
  }
  1' docker-compose.yaml > docker-compose-temp.yaml && mv docker-compose-temp.yaml docker-compose.yaml
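For readers less familiar with awk, the same edits can be approximated in Python. This sketch is an interpretation of the script above, not a drop-in replacement: it works line by line on the text, assumes the two-space service indentation used by the generated file, and applies every rule to every line.

```python
def patch_compose(text: str) -> str:
    """Apply the persistence, restart, and auto-schema edits to a Compose file's text."""
    out = []
    for line in text.splitlines():
        # Drop the generated restart policy; "restart: always" is added instead.
        if "restart: on-failure:0" in line:
            continue
        out.append(line)
        if line == "  weaviate:":
            # Restart on reboot and persist data on the host under /data/weaviate.
            out += [
                "    restart: always",
                "    volumes:",
                "      - /data/weaviate:/var/lib/weaviate",
            ]
        elif line == "  t2v-transformers:":
            out.append("    restart: always")
        elif "CLUSTER_HOSTNAME: 'node1'" in line:
            # Disable auto-schema so the schema must be created explicitly.
            out.append("      AUTOSCHEMA_ENABLED: 'false'")
    return "\n".join(out) + "\n"
```

Reading docker-compose.yaml, passing its contents through patch_compose, and writing the result back mirrors the redirect-and-move at the end of the awk command.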

Run the Docker Compose File

Finally, run the command to start Weaviate.

docker-compose up -d
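The containers can take a while to come up, particularly the transformers model. One way to wait for Weaviate is to poll its readiness endpoint, /v1/.well-known/ready, as in this sketch. The base URL http://localhost:8080 used in the note below assumes the default port mapping of the generated Compose file, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_ready(base_url: str) -> bool:
    """Return True if Weaviate's readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/.well-known/ready", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    base_url: str,
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], bool] = is_ready,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the probe up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, wait_until_ready("http://localhost:8080") returns True once Weaviate accepts requests, or False after about a minute with the default settings.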

Usage

Makefile

# Deploy and run the app
make app

# Deploy AWS resources
make deploy

# Destroy the application
make destroy

# Run the Batch job to load the database. Saves output to job.json
make job.run

# Reads job.json and checks the job status
make job.status

# Get the status of Weaviate
make weaviate.status

# Stop Weaviate
make weaviate.stop

# Start Weaviate
make weaviate.start

# Restart Weaviate
make weaviate.restart

# Get the endpoint for Weaviate
make weaviate.get.endpoint

# Create the Weaviate schema
make weaviate.schema.create

# Delete the Weaviate schema
make weaviate.schema.delete

# Run the Streamlit frontend
make streamlit.run
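As a rough idea of what a status check on job.json involves, the sketch below assumes that job.json holds the JSON returned by aws batch submit-job (which includes a jobId) and that the status is read with aws batch describe-jobs. The Makefile targets may work differently; the helper names here are hypothetical.

```python
import json

# AWS Batch jobs stop changing state once they reach one of these statuses.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def describe_jobs_command(job_file_text: str) -> list[str]:
    """Build the AWS CLI command that describes the job recorded in job.json."""
    job = json.loads(job_file_text)
    return ["aws", "batch", "describe-jobs", "--jobs", job["jobId"]]


def job_status(describe_output: str) -> str:
    """Extract the status field from `aws batch describe-jobs` JSON output."""
    return json.loads(describe_output)["jobs"][0]["status"]


def is_finished(status: str) -> bool:
    """Return True once the job has either succeeded or failed."""
    return status in TERMINAL_STATES
```

The command list can be passed to subprocess.run with capture_output=True, and the captured standard output handed to job_status.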

Docker

Build the application.

cd tasks/load_odr
docker build -t load_odr:latest .

Run the application.

# The .env file lives in the project root, two levels above tasks/load_odr
docker run -d --env-file ../../.env load_odr:latest

Weaviate

Create the Schema

make weaviate.schema.create

Delete the Schema

make weaviate.schema.delete
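Because the Compose file sets AUTOSCHEMA_ENABLED to false, a class must exist before data is loaded into it. The sketch below shows the kind of REST requests involved, using Weaviate's /v1/schema endpoint. The class and property names in the usage note are hypothetical, and the project's own scripts may define the schema differently.

```python
import json
import urllib.request


def build_class(class_name: str, text_properties: list[str]) -> dict:
    """Build a class definition whose text properties are vectorized by text2vec-transformers."""
    return {
        "class": class_name,
        "vectorizer": "text2vec-transformers",
        "properties": [{"name": name, "dataType": ["text"]} for name in text_properties],
    }


def create_class_request(endpoint: str, class_definition: dict) -> urllib.request.Request:
    """Prepare the POST request that adds a class to the schema."""
    return urllib.request.Request(
        f"{endpoint}/v1/schema",
        data=json.dumps(class_definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_class_request(endpoint: str, class_name: str) -> urllib.request.Request:
    """Prepare the DELETE request that removes a class and all of its objects."""
    return urllib.request.Request(f"{endpoint}/v1/schema/{class_name}", method="DELETE")
```

For example, create_class_request(endpoint, build_class("Dataset", ["name", "description"])) can be sent with urllib.request.urlopen, where endpoint is the value reported by make weaviate.get.endpoint.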

CDK Commands

  • npm run build: compile TypeScript to JavaScript
  • npm run watch: watch for changes and compile
  • npm run test: run the Jest unit tests
  • cdk deploy: deploy this stack to your default AWS account/region
  • cdk diff: compare the deployed stack with the current state
  • cdk synth: emit the synthesized CloudFormation template

Troubleshooting

  • Check your AWS credentials in ~/.aws/credentials
  • Check that the environment variables are available to the services that need them
  • Check that the correct environment or interpreter is being used for Python

Authors

Primary Contact: @chrisammon3000