Semantic search of AWS Open Data Registry datasets using Weaviate.
- Table of Contents
- Project Structure
- Description
- Quickstart
- Installation
- Usage
- Troubleshooting
- References & Links
- Authors
.
├── Makefile # Makefile for deployment and teardown
├── README.md
├── (aws-open-data-registry-neural-search-key-pair.pem)
├── bin # CDK app
├── cdk.context.json
├── cdk.json
├── config.json # Environment variables for CDK app
├── frontend # Streamlit app
├── lib # CDK stacks
├── notebooks
├── package.json
├── requirements.txt
├── scripts # Bash scripts for Weaviate
├── src # EC2 instance user data
├── tasks # Fargate task
└── tsconfig.json
Deploy and load a Weaviate instance with AWS Open Data Registry datasets. Find datasets, tutorials, publications, and tools & applications with semantic search queries in the Streamlit app.
The application deploys the following resources.
- EC2 instance
- VPC (optional)
- Fargate task (run on AWS Batch)
Once the environment is configured, the application can be deployed with a single command, make app. This creates the infrastructure, including setting up and exposing the Weaviate instance, loads the data, and starts the Streamlit server once loading is finished.
The SSH key is created and deleted outside of the CDK app using Makefile targets, which means that you can run cdk deploy and cdk destroy without having to create or delete the SSH key. If the key's .pem file is detected in the project, it will be assigned to the database instance. If no key is found, it must be created before the AWS resources are deployed. The best way to do this is by running make app or make deploy.
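The key detection described above can be pictured with a short Python sketch. This is not the project's Makefile or CDK logic. It assumes the key file is named after the `ssh_key_name` value in config.json and sits in the project root, and the helper names `find_key_file` and `next_step` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Key name as it appears in config.json in this README.
# Assumption: the .pem file uses this name and lives in the project root.
KEY_NAME = "aws-open-data-registry-neural-search-key-pair"


def find_key_file(project_root: Path, key_name: str = KEY_NAME) -> Optional[Path]:
    """Return the path to <key_name>.pem in the project root, or None if absent."""
    candidate = project_root / f"{key_name}.pem"
    return candidate if candidate.is_file() else None


def next_step(project_root: Path) -> str:
    """Suggest a command based on whether the key file already exists."""
    if find_key_file(project_root) is not None:
        return "cdk deploy"  # key present: the CDK app can be deployed directly
    return "make deploy"     # key missing: the Makefile target creates it first


if __name__ == "__main__":
    print(next_step(Path.cwd()))
```

Running the script from the project root prints the command that fits the current state of the checkout.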
The Streamlit app is where you can interact with Weaviate by browsing, searching, and executing GraphQL queries.
Each dataset has several expandable sections containing information about it.
Custom queries can be run against Weaviate using the GraphQL API.
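As an illustration of a custom query, the sketch below builds a `nearText` GraphQL query and posts it to Weaviate's `/v1/graphql` endpoint using only the Python standard library. The class name `Dataset`, the field names, and the endpoint URL are assumptions for illustration. The actual schema is defined by the project and may differ.

```python
import json
import urllib.request


def build_near_text_query(class_name, concepts, fields, limit=5):
    """Build a Weaviate GraphQL Get query that ranks objects by semantic similarity."""
    concepts_json = json.dumps(list(concepts))  # JSON string lists are valid GraphQL lists
    field_block = " ".join(fields)
    return "{ Get { %s(nearText: {concepts: %s}, limit: %d) { %s } } }" % (
        class_name,
        concepts_json,
        limit,
        field_block,
    )


def run_query(endpoint, query):
    """POST the query to <endpoint>/v1/graphql and return the parsed JSON response."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + "/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # "Dataset", "name" and "description" are hypothetical schema names.
    query = build_near_text_query("Dataset", ["sea surface temperature"], ["name", "description"])
    print(query)
    # result = run_query("http://localhost:8080", query)  # requires a running Weaviate instance
```

The query string can also be pasted into the Streamlit app's GraphQL view once the class and field names are adjusted to the deployed schema.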
- Configure your AWS credentials.
- Add environment variables to .env.
- Update the tags field in config.json if desired.
- Run npm install to install.
- Run make app to deploy, load data and start the Streamlit server.
Follow the steps to configure the deployment environment.
- Node.js >= 18.0.0
- TypeScript >= 4.4.3
- AWS CDK >= 2.53.0
- AWS CLI
- Docker
- Python 3.10
- jq
Sensitive environment variables containing secrets like passwords and API keys must be exported to the environment first.
Create a .env file in the project root.
CDK_DEFAULT_ACCOUNT=<account_id>
CDK_DEFAULT_REGION=<region>
Important: Always use a .env file, AWS SSM Parameter Store, or Secrets Manager for sensitive variables like credentials and API keys. Never hard-code them, including during development. AWS will quarantine an account if its credentials are accidentally exposed, which will cause problems.
Make sure that .env is listed in .gitignore.
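A quick way to sanity-check this setup is sketched below. It is an illustrative helper, not part of the project: `parse_env` and `is_ignored` are hypothetical names, the parser handles only simple `KEY=VALUE` lines, and the `.gitignore` check looks for an exact entry without evaluating glob patterns.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def is_ignored(gitignore_text, name=".env"):
    """True if .gitignore has an exact entry for `name` (glob patterns are not evaluated)."""
    entries = {line.strip() for line in gitignore_text.splitlines()}
    return name in entries or "/" + name in entries


if __name__ == "__main__":
    root = Path.cwd()
    env_file, ignore_file = root / ".env", root / ".gitignore"
    if env_file.is_file():
        print("variables defined:", sorted(parse_env(env_file.read_text())))
    if ignore_file.is_file() and not is_ignored(ignore_file.read_text()):
        print("warning: .env is not listed in .gitignore")
```

Only the variable names are printed, so the check does not echo secret values to the terminal.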
The CDK application configuration is stored in config.json. This file contains values for the database layer, the data ingestion layer, and tags. You can update the tags and SSH IP to your own values before deploying.
{
  "layers": {
    "data_ingestion": {
      "env": {
        "repo_url": "https://github.com/awslabs/open-data-registry",
        "target_data_dir": "datasets"
      }
    },
    "vector_database": {
      "env": {
        "ssh_cidr": "0.0.0.0/0", // Update to your IP
        "ssh_key_name": "aws-open-data-registry-neural-search-key-pair"
      }
    }
  },
  "tags": {
    "org": "my-organization", // Update to your organization
    "app": "aws-open-data-registry-neural-search"
  }
}
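Before deploying, it can help to confirm that the placeholder values were replaced. The sketch below is a hypothetical pre-deploy check, not something the project ships. It assumes the real config.json has the structure shown above and contains no `//` comments, since those are not valid JSON and `json.loads` would reject them.

```python
import ipaddress
import json

# Sample mirroring the structure shown in this README, without the // comments.
SAMPLE = """
{
  "layers": {
    "data_ingestion": {"env": {"repo_url": "https://github.com/awslabs/open-data-registry",
                               "target_data_dir": "datasets"}},
    "vector_database": {"env": {"ssh_cidr": "0.0.0.0/0",
                                "ssh_key_name": "aws-open-data-registry-neural-search-key-pair"}}
  },
  "tags": {"org": "my-organization", "app": "aws-open-data-registry-neural-search"}
}
"""


def check_config(config):
    """Return a list of warnings for a parsed config.json dictionary."""
    warnings = []
    env = config["layers"]["vector_database"]["env"]
    network = ipaddress.ip_network(env["ssh_cidr"], strict=False)
    if network.prefixlen == 0:
        warnings.append("ssh_cidr allows SSH from any address; restrict it to your IP")
    if config["tags"].get("org") == "my-organization":
        warnings.append("tags.org still has the placeholder value")
    return warnings


if __name__ == "__main__":
    for warning in check_config(json.loads(SAMPLE)):
        print("warning:", warning)
```

A CIDR such as `203.0.113.7/32` (a documentation-range address used here only as an example) limits SSH access to a single host.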
Valid AWS credentials must be available to the AWS CLI. The easiest way to do this is by running aws configure, or by adding them to ~/.aws/credentials and exporting the AWS_PROFILE variable to the environment.
For more information visit the documentation page: Configuration and credential file settings
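To confirm that the profile named by `AWS_PROFILE` actually exists in the credentials file, a small standard-library check along these lines may be useful. It only inspects the INI section names and never reads or prints the keys themselves. The helper names are hypothetical, and the check does not cover profiles defined solely in ~/.aws/config or credentials supplied through other mechanisms.

```python
import configparser
import os
from pathlib import Path


def list_profiles(credentials_path):
    """Return the profile names (INI section headers) found in a credentials file."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file simply yields no sections
    return parser.sections()


def active_profile(environ=os.environ):
    """Return the profile selected by AWS_PROFILE, falling back to 'default'."""
    return environ.get("AWS_PROFILE", "default")


def profile_is_configured(credentials_path, environ=os.environ):
    """True if the active profile has a section in the credentials file."""
    return active_profile(environ) in list_profiles(credentials_path)


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    print("profiles:", list_profiles(path))
    print("active profile configured:", profile_is_configured(path))
```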
Create a virtual environment for Python development.
# Create a virtual environment
python3.10 -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install dependencies
pip install -r requirements.txt
Configuring Weaviate requires 3 steps:
- Download the Docker Compose file.
- Update the Docker Compose file to configure Weaviate to persist data and automatically restart on reboot.
- Run the Docker Compose file.
Run the command to download a Docker Compose file for Weaviate (source).
curl -o docker-compose.yaml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?generative_cohere=false&generative_openai=false&generative_palm=false&gpu_support=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&runtime=docker-compose&spellcheck_module=false&sum_module=false&text_module=text2vec-transformers&transformers_model=sentence-transformers-multi-qa-MiniLM-L6-cos-v1&weaviate_version=v1.19.8"
Next, run the command to configure Weaviate to persist data and automatically restart on reboot.
awk '
/^  weaviate:$/ {
    print
    print "    restart: always"
    print "    volumes:"
    print "      - /data/weaviate:/var/lib/weaviate"
    while(getline && $0 !~ /^ /);
    if ($0 ~ /^ /) {
        print
    }
    next
}
/^  t2v-transformers:$/ {
    print
    print "    restart: always"
    while(getline && $0 !~ /^ /);
    if ($0 ~ /^ /) {
        print
    }
    next
}
/CLUSTER_HOSTNAME: '\''node1'\''/ {
    print
    print "      AUTOSCHEMA_ENABLED: '\''false'\''"
    next
}
/restart: on-failure:0/ {
    next
}
1' docker-compose.yaml > docker-compose-temp.yaml && mv docker-compose-temp.yaml docker-compose.yaml
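For readers who find the awk script hard to follow, the same edits can be approximated in Python. This is a sketch that roughly mirrors the awk rules, not a drop-in replacement. It assumes the downloaded file indents service names by two spaces and their properties by four, and the function name `patch_compose` is hypothetical.

```python
from pathlib import Path


def patch_compose(text):
    """Add restart policies, a data volume, and AUTOSCHEMA_ENABLED to a compose file."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "restart: on-failure:0":
            continue  # drop the original restart policy
        out.append(line)
        if line == "  weaviate:":
            # Persist data on the host and restart the service after a reboot.
            out += [
                "    restart: always",
                "    volumes:",
                "      - /data/weaviate:/var/lib/weaviate",
            ]
        elif line == "  t2v-transformers:":
            out.append("    restart: always")
        elif stripped == "CLUSTER_HOSTNAME: 'node1'":
            # Reuse the indentation of the line being extended.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "AUTOSCHEMA_ENABLED: 'false'")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    path = Path("docker-compose.yaml")
    path.write_text(patch_compose(path.read_text()))
```

Either approach should leave both services with `restart: always` and the Weaviate service with a volume mounted at /data/weaviate. Inspecting the resulting file before starting the containers is a sensible precaution.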
Finally, run the command to start Weaviate.
docker-compose up -d
# Deploy and run the app
make app
# Deploy AWS resources
make deploy
# Destroy the application
make destroy
# Run the Batch job to load the database. Saves output to job.json
make job.run
# Reads job.json and checks the job status
make job.status
# Get the status of Weaviate
make weaviate.status
# Stop Weaviate
make weaviate.stop
# Start Weaviate
make weaviate.start
# Restart Weaviate
make weaviate.restart
# Get the endpoint for Weaviate
make weaviate.get.endpoint
# Create the Weaviate schema
make weaviate.schema.create
# Delete the Weaviate schema
make weaviate.schema.delete
# Run the Streamlit frontend
make streamlit.run
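Outside of the Makefile targets, Weaviate's readiness can also be probed over HTTP through its `/v1/.well-known/ready` endpoint. The sketch below assumes the endpoint value is a base URL such as `http://localhost:8080`. The exact output format of `make weaviate.get.endpoint` is not shown in this README, so adjust accordingly.

```python
import urllib.error
import urllib.request


def ready_url(endpoint):
    """Build the readiness-probe URL for a Weaviate base endpoint."""
    return endpoint.rstrip("/") + "/v1/.well-known/ready"


def is_ready(endpoint, timeout=5.0):
    """Return True if Weaviate answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(endpoint), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # unreachable, refused, timed out, or non-2xx response


if __name__ == "__main__":
    # Hypothetical local endpoint; replace with the deployed instance's address.
    print("ready:", is_ready("http://localhost:8080"))
```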
Build the application.
cd tasks/load_odr
docker build -t load_odr:latest .
Run the application.
# The .env file lives in the project root, two levels up from tasks/load_odr
docker run -d --env-file ../../.env load_odr:latest
- npm run build: compile TypeScript to JS
- npm run watch: watch for changes and compile
- npm run test: perform the Jest unit tests
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk synth: emit the synthesized CloudFormation template
- Check your AWS credentials in ~/.aws/credentials
- Check that the environment variables are available to the services that need them
- Check that the correct environment or interpreter is being used for Python
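Two of these checks can be automated with a few lines of Python. The sketch assumes the required variables are the two listed in the .env example in this README and that the expected interpreter is Python 3.10. Both helper names are hypothetical.

```python
import os
import sys

# Variables taken from the .env example in this README.
REQUIRED_VARS = ("CDK_DEFAULT_ACCOUNT", "CDK_DEFAULT_REGION")


def missing_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def python_matches(version_info=sys.version_info, expected=(3, 10)):
    """True if the running interpreter has the expected major.minor version."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("expected Python version:", python_matches())
    print("missing variables:", missing_vars())
```

Printing `sys.executable` shows whether the virtual environment's interpreter or a system-wide one is being used.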
Primary Contact: @chrisammon3000