aws-open-data-registry-neural-search

Semantic search of AWS Open Data Registry datasets using Weaviate.

Project Structure

.
├── Makefile                # Makefile for deployment and teardown
├── README.md
├── (aws-open-data-registry-neural-search-key-pair.pem)
├── bin                     # CDK app
├── cdk.context.json
├── cdk.json
├── config.json             # Environment variables for CDK app
├── frontend                # Streamlit app
├── lib                     # CDK stacks
├── notebooks
├── package.json
├── requirements.txt
├── scripts                 # Bash scripts for Weaviate
├── src                     # EC2 instance user data
├── tasks                   # Fargate task
└── tsconfig.json

Description

Deploy a Weaviate instance and load it with AWS Open Data Registry datasets. Use the Streamlit app to find datasets, tutorials, publications, and tools & applications with semantic search queries.

Architecture

The application deploys the following resources.

EC2 instance
VPC (optional)
Fargate task (run on AWS Batch)

Deployment using Makefile

Once the environment is configured, the application can be deployed with a single command: make app. This creates the infrastructure, sets up and exposes the Weaviate instance, loads the data, and starts the Streamlit server once loading has finished.

SSH Key

The SSH key is created and deleted outside of the CDK app using Makefile targets, which means that you can run cdk deploy and cdk destroy without having to create or delete the SSH key. If the key's .pem file is detected in the project, it is assigned to the database instance. If no key is found, one must be created before the AWS resources are deployed. The simplest way to do this is to run make app or make deploy.

Streamlit Frontend

The Streamlit app is where you can interact with Weaviate by browsing, searching, and executing GraphQL queries.

Each dataset has several expandable sections containing information about it.

Custom queries can be run against Weaviate using the GraphQL API.
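As a rough illustration of what a custom query might look like, the sketch below builds a nearText GraphQL query and posts it to Weaviate's /v1/graphql endpoint using only the Python standard library. The class name Dataset and the properties name and description are assumptions for illustration; use the class and property names defined in the project's schema.

```python
import json
import urllib.request


def build_near_text_query(class_name, properties, concept, limit=5):
    """Return a GraphQL Get query that ranks objects by similarity to `concept`."""
    fields = " ".join(properties)
    # json.dumps quotes and escapes the search text so it is a valid string literal
    concepts = json.dumps([concept])
    return (
        "{ Get { %s(nearText: {concepts: %s}, limit: %d) { %s } } }"
        % (class_name, concepts, limit, fields)
    )


def run_query(endpoint, query):
    """POST the query to Weaviate's GraphQL endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + "/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical class and property names; replace with those in your schema
    print(build_near_text_query("Dataset", ["name", "description"], "satellite imagery"))
```

Passing the printed query to run_query together with the Weaviate endpoint (for example, the value returned by make weaviate.get.endpoint) should return the matching objects as JSON, assuming the class and properties exist.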

Quickstart

Configure your AWS credentials.
Add environment variables to .env.
Update the tags field in config.json if desired.
Run npm install to install dependencies.
Run make app to deploy, load data and start the Streamlit server.

Installation

Follow the steps below to configure the deployment environment.

Prerequisites

Node.js >= 18.0.0
TypeScript >= 4.4.3
AWS CDK >= 2.53.0
AWS CLI
Docker
Python 3.10
jq

Environment Variables

Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.

Create a .env file in the project root.

CDK_DEFAULT_ACCOUNT=<account_id>
CDK_DEFAULT_REGION=<region>

Important: Always use a .env file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, even during development. AWS may quarantine an account if its credentials are accidentally exposed, which can interrupt access to your resources.

Make sure that .env is listed in .gitignore.
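Before deploying, it can help to confirm that the .env file actually defines the variables listed above. The following is a minimal sketch, not part of the project: it parses simple KEY=VALUE lines and reports which of the two documented variables are missing or empty. The function names are hypothetical.

```python
from pathlib import Path

REQUIRED = ("CDK_DEFAULT_ACCOUNT", "CDK_DEFAULT_REGION")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_variables(values, required=REQUIRED):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env(env_file.read_text()) if env_file.exists() else {}
    missing = missing_variables(values)
    print("Missing variables:", ", ".join(missing) if missing else "none")
```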

CDK Application Configuration

The CDK application configuration is stored in config.json. This file contains values for the database layer, the data ingestion layer, and tags. Before deploying, set ssh_cidr to your own IP range (the default 0.0.0.0/0 allows SSH from any address) and update the tags to your own values.

{
    "layers": {
        "data_ingestion": {
            "env": {
                "repo_url": "https://github.com/awslabs/open-data-registry",
                "target_data_dir": "datasets"
            }
        },
        "vector_database": {
            "env": {
                "ssh_cidr": "0.0.0.0/0",
                "ssh_key_name": "aws-open-data-registry-neural-search-key-pair"
            }
        }
    },
    "tags": {
        "org": "my-organization",
        "app": "aws-open-data-registry-neural-search"
    }
}
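A quick sanity check of these values before deployment might look like the sketch below. It is an illustration only, not part of the CDK app: it takes a parsed configuration with the structure shown above and warns when ssh_cidr is open to every address or when a field is empty. The function name check_config is hypothetical.

```python
import ipaddress
import json

# Sample mirroring the structure of config.json, without comments
SAMPLE = """
{
    "layers": {
        "vector_database": {
            "env": {
                "ssh_cidr": "0.0.0.0/0",
                "ssh_key_name": "aws-open-data-registry-neural-search-key-pair"
            }
        }
    },
    "tags": {"org": "my-organization", "app": "aws-open-data-registry-neural-search"}
}
"""


def check_config(config):
    """Return a list of warnings for a parsed config.json dictionary."""
    warnings = []
    env = config["layers"]["vector_database"]["env"]
    network = ipaddress.ip_network(env["ssh_cidr"], strict=False)
    # A prefix length of 0 matches every address, so SSH would be open to the world
    if network.prefixlen == 0:
        warnings.append("ssh_cidr allows SSH from any address")
    if not env.get("ssh_key_name"):
        warnings.append("ssh_key_name is empty")
    if not config.get("tags", {}).get("org"):
        warnings.append("tags.org is empty")
    return warnings


if __name__ == "__main__":
    for warning in check_config(json.loads(SAMPLE)):
        print("warning:", warning)
```

Restricting ssh_cidr to a single address, such as 203.0.113.7/32 (a documentation-only example address), clears the first warning.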

AWS Credentials

Valid AWS credentials must be available to the AWS CLI. The easiest way to do this is to run aws configure, or to add them to ~/.aws/credentials and export the AWS_PROFILE variable to the environment.

For more information visit the documentation page: Configuration and credential file settings

Python Development

Create a virtual environment for Python development.

# Create a virtual environment
python3.10 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Weaviate Configuration

Configuring Weaviate requires 3 steps:

Download the Docker Compose file.
Update the Docker Compose file to configure Weaviate to persist data and automatically restart on reboot.
Run the Docker Compose file.

Download the Docker Compose File

Run the following command to download a Docker Compose file for Weaviate.

curl -o docker-compose.yaml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?generative_cohere=false&generative_openai=false&generative_palm=false&gpu_support=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&runtime=docker-compose&spellcheck_module=false&sum_module=false&text_module=text2vec-transformers&transformers_model=sentence-transformers-multi-qa-MiniLM-L6-cos-v1&weaviate_version=v1.19.8"

Update the Docker Compose File

Next, run the command to configure Weaviate to persist data and automatically restart on reboot.

awk '
  /^  weaviate:$/ {
    print
    print "    restart: always"
    print "    volumes:"
    print "      - /data/weaviate:/var/lib/weaviate"
    while(getline && $0 !~ /^  /);
    if ($0 ~ /^  /) {
      print
    }
    next
  }
  /^  t2v-transformers:$/ {
    print
    print "    restart: always"
    while(getline && $0 !~ /^  /);
    if ($0 ~ /^  /) {
      print
    }
    next
  }
  /CLUSTER_HOSTNAME: '\''node1'\''/ {
    print
    print "      AUTOSCHEMA_ENABLED: '\''false'\''"
    next
  }
  /restart: on-failure:0/ {
    next
  }
  1' docker-compose.yaml > docker-compose-temp.yaml && mv docker-compose-temp.yaml docker-compose.yaml
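For readers less familiar with awk, the sketch below applies roughly the same edits in Python: it adds restart: always to both services, mounts /data/weaviate as the Weaviate data volume, disables auto-schema after the CLUSTER_HOSTNAME line, and drops the default restart: on-failure:0 policy. It is an approximation under the assumption that the downloaded file uses the two-space service indentation the awk script matches, and it is not the project's own tooling.

```python
def patch_compose(text):
    """Apply roughly the same edits as the awk script to Docker Compose YAML text."""
    output = []
    for line in text.splitlines():
        # Drop the default restart policy so it does not conflict with "always"
        if "restart: on-failure:0" in line:
            continue
        output.append(line)
        if line == "  weaviate:":
            # Restart on reboot and persist data on the host
            output.append("    restart: always")
            output.append("    volumes:")
            output.append("      - /data/weaviate:/var/lib/weaviate")
        elif line == "  t2v-transformers:":
            output.append("    restart: always")
        elif "CLUSTER_HOSTNAME: 'node1'" in line:
            output.append("      AUTOSCHEMA_ENABLED: 'false'")
    return "\n".join(output) + "\n"


if __name__ == "__main__":
    # Tiny stand-in for the downloaded file, for demonstration only
    sample = (
        "services:\n"
        "  weaviate:\n"
        "    image: example\n"
        "    restart: on-failure:0\n"
        "    environment:\n"
        "      CLUSTER_HOSTNAME: 'node1'\n"
        "  t2v-transformers:\n"
        "    image: example\n"
    )
    print(patch_compose(sample), end="")
```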

Run the Docker Compose File

Finally, run the command to start Weaviate.

docker-compose up -d
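Weaviate can take a little while to become available after the containers start. One way to wait for it is to poll its readiness endpoint, /v1/.well-known/ready, as in this sketch. The base URL, retry count, and delay are assumptions to adjust for your deployment.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url, opener=urllib.request.urlopen):
    """Return True if Weaviate's readiness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with opener(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet"
        return False


def wait_until_ready(base_url, attempts=30, delay=2.0, check=is_ready):
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    for attempt in range(attempts):
        if check(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For a local instance, wait_until_ready("http://localhost:8080") returns True once Weaviate responds, assuming the Docker Compose file publishes Weaviate on port 8080.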

Usage

Makefile

# Deploy and run the app
make app

# Deploy AWS resources
make deploy

# Destroy the application
make destroy

# Run the Batch job to load the database. Saves output to job.json
make job.run

# Reads job.json and checks the job status
make job.status

# Get the status of Weaviate
make weaviate.status

# Stop Weaviate
make weaviate.stop

# Start Weaviate
make weaviate.start

# Restart Weaviate
make weaviate.restart

# Get the endpoint for Weaviate
make weaviate.get.endpoint

# Create the Weaviate schema
make weaviate.schema.create

# Delete the Weaviate schema
make weaviate.schema.delete

# Run the Streamlit frontend
make streamlit.run

Docker

Build the application.

cd tasks/load_odr
docker build -t load_odr:latest .

Run the application.

docker run -d --env-file ../../.env load_odr:latest

Weaviate

Create the Schema

make weaviate.schema.create

Delete the Schema

make weaviate.schema.delete

CDK Commands

npm run build: compile TypeScript to JavaScript
npm run watch: watch for changes and compile
npm run test: run the Jest unit tests
cdk deploy: deploy this stack to your default AWS account/region
cdk diff: compare the deployed stack with the current state
cdk synth: emit the synthesized CloudFormation template

Troubleshooting

Check your AWS credentials in ~/.aws/credentials
Check that the environment variables are available to the services that need them
Check that the correct environment or interpreter is being used for Python

References & Links

Authors

Primary Contact: @chrisammon3000
