This repo contains a generative AI demo that allows you to query and explore your data lake using natural language. It leverages the power of Amazon Bedrock. It comes with some example data in SQLite databases, but you can connect it to your data lake through Amazon Athena, JDBC, or other connectors.
Data Genie comes in two flavors:
- a simplified version that helps beginners understand the workflow
- a more advanced version with more features
The simplified version contains only the original workflow:
- Break user question into sub-questions
- Run a vector search to find the 3 tables that best match each of these sub-questions, plus 3 more for the original user question verbatim
- Determine the database to use based on the single table that best matches the original user question
- Sample 5 random rows from each of these matching tables
- Send the metadata descriptions of the matching tables and their sample data to the LLM to form the SQL query, and execute it
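To make this workflow concrete, here is a minimal sketch of what the pipeline looks like in Python. The helper names (`split_into_subquestions`, `top_k_tables`, `sample_rows`, etc.) are hypothetical stand-ins for illustration, not the actual functions in this repository.

```python
# Hypothetical sketch of the simplified workflow; helper names are illustrative,
# not the actual functions used in this repository.

def answer_question(question, llm, vector_index, datalake):
    # 1. Break the user question into sub-questions
    sub_questions = llm.split_into_subquestions(question)

    # 2. Vector search: 3 best-matching tables per sub-question,
    #    plus 3 for the original question verbatim
    candidate_tables = set()
    for q in sub_questions + [question]:
        candidate_tables.update(vector_index.top_k_tables(q, k=3))

    # 3. Use the database of the single table that best matches the original question
    database = vector_index.top_k_tables(question, k=1)[0].database

    # 4. Sample 5 random rows from each matching table in that database
    tables = [t for t in candidate_tables if t.database == database]
    samples = {t.name: datalake.sample_rows(database, t.name, n=5) for t in tables}

    # 5. Ask the LLM to compose the SQL query from the table metadata and samples,
    #    then execute it
    sql = llm.generate_sql(question, tables, samples)
    return datalake.execute(database, sql)
```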
To run the simplified version, check out the SIMPLE_DEMO_V1 tag:
git checkout SIMPLE_DEMO_V1
For the more advanced version, check out the HEAD of the main branch:
git checkout main
This code has been tested on Cloud9 on an m5.large
instance with Amazon Linux 2023.
The disk size of the instance should be increased to 20GB.
You need to enable third-party models in your AWS account in the Amazon Bedrock service,
in the us-west-2 region. If needed, you can change the Bedrock region in the
src/utils/llm.py
file.
The models used in this demo are Titan Embeddings G1 and Anthropic Claude 3 Sonnet.
You can also enable other models like Anthropic Claude 3.5 Sonnet and then select them via the LLM settings in the config.py file.
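For example, the model selection might look roughly like the snippet below. The constant names are assumptions for illustration and may not match the actual contents of config.py, but the model IDs are the standard Amazon Bedrock identifiers for the models mentioned above.

```python
# Illustrative only -- the actual constant names in config.py may differ.
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v1"           # Titan Embeddings G1
LLM_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"    # Claude 3 Sonnet (default)
# Once enabled in the Bedrock console, you could switch to Claude 3.5 Sonnet:
# LLM_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
```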
This code has been tested with Python 3.8 and Python 3.9.
Go to the src
folder:
cd src
Create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
- Follow https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html to set up your EC2 instance.
- Configure your new EC2 instance's security group to allow access via port 8080 from the laptop/desktop you'll use to test the web UI: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/changing-security-group.html
- Follow the Cloud9 instructions below, using a web browser to access the Streamlit app via https://<your EC2 instance address>:8080
- From the src folder, start the Streamlit server:
streamlit run streamlit_app.py --server.port 8080
- Click on the Preview/Preview running application button in Cloud9, and click on the button to Pop out the browser in a new window. If the new window does not display the app, you may need to configure your browser to accept cross-site tracking cookies.
- You can now interact with the Data Genie application through the Streamlit web interface. Start by pressing the Index Business Data Catalog button. This needs to be done only once.
- Ask a question to Data Genie, for example "List 5 products and the name of their supplier".
Example questions that you can ask:
- For Northwind database:
- List 5 products and the name of their supplier
- List 5 customers who bought products from suppliers in France. Give their names, which product and qty ordered
- What are the top three categories in terms of number of orders?
- For Chinook database:
- What albums contain songs whose title contains the word "free"? Give the album name, the song name and the singer.
- Which are the top 3 songs, present in the greatest number of playlists?
- What is the shortest song? What is its duration? Who is the artist?
- Index the business data catalog (this needs to be done only once):
python3 run_me_to_index_catalog.py
- Run a query:
python3 run.py
This repo contains an AWS CDK template that automates the deployment of Data Genie, making it easily accessible from the Internet.
Here is the architecture that the CDK template deploys:
Instructions for deployment:
- Edit src/config_stack.py, choose a STACK_NAME and a CUSTOM_HEADER_VALUE.
- If you want to enable authentication to your application (highly recommended as it will be exposed to the Internet), edit src/logic/config.py and set ENABLE_AUTH to True.
- Install AWS CDK. From the root of the repo directory:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
- Deploy the CDK template:
cdk bootstrap
cdk deploy
The deployment takes 5 to 10 minutes.
- Note the output, which will include the CloudFront distribution URL and the Cognito user pool ID.
- From your AWS Console, create a user in the Cognito User Pool.
- With your browser, connect to the CloudFront distribution URL.
- Log in to the Streamlit app with the user you created, index the business data catalog and start asking questions!
Data Genie is a work in progress, and its algorithms can be improved to fit the specificities of each data lake. By testing and evaluating Data Genie, you can identify areas for improvement and compare the performance of different algorithms or approaches.
To test and evaluate Data Genie, follow these steps:
- Update the corresponding JSON file in the tests/test_sets directory with your test questions and expected answers. (The tester will auto-detect files in this directory and test each one.)
- Navigate to the src directory.
- Export your temporary AWS credentials or ensure the AWS CLI is permissioned to access the Bedrock service in your AWS account.
- Ensure the data is indexed by running run_me_to_index_catalog.py (see the instructions above).
- Run run_tests.py. (Optionally, you can add a database name to limit testing to that database, or both a database name and a question number to limit testing to that single question, e.g. run_tests.py Chinook 5.)
The test suite will execute the test questions and compare the results to the expected answers. This will help you evaluate the performance of Data Genie and identify areas for improvement.
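The schema of the test-set files is defined by the existing files in tests/test_sets. Conceptually, each entry pairs a question with its expected answer, roughly as in the hypothetical sketch below; the field names are assumptions, so check the existing files for the real structure.

```python
# Hypothetical illustration of a test-set entry and how the runner uses it;
# field names are assumptions -- see the files in tests/test_sets for the real schema.
test_case = {
    "question": "Which are the top 3 songs, present in the greatest number of playlists?",
    "expected_answer": "...",  # the answer the generated SQL should produce
}

# Conceptually, run_tests.py asks Data Genie each question and compares the
# result of the generated SQL with the expected answer:
# result = data_genie.answer(test_case["question"])
# passed = (result == test_case["expected_answer"])
```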
This demo comes with these sample databases:
- Chinook
- Northwind (We added a few more rows to this database to be able to answer additional questions.)
- MySports (a database containing 87 tables of sports statistics that we generated to test how well SQL generation scales to very large schemas)
For more information about the schema of these databases, we have included the Data Definition Language (DDL) files in the example_data/datalake/data/DDL folder. The populated databases themselves can be found in the example_data/datalake/data folder, and can be queried by running sqlite3 with the path to one of these database files on the command line.
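For instance, a quick way to inspect one of the sample databases from Python rather than the sqlite3 command line (the database file name below is an assumption; check the example_data/datalake/data folder for the actual names):

```python
# List the tables of one of the sample SQLite databases.
# The file name is an assumption -- check example_data/datalake/data for the actual names.
import sqlite3

conn = sqlite3.connect("example_data/datalake/data/Chinook.db")
for (table_name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(table_name)
conn.close()
```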
Some sample questions that you can ask:
- List 5 products with their categories
- List all song titles which contain the word "free". Please also tell me who is the singer.
- How many orders contain products from suppliers from France?
If you want to use your own data instead of the sample databases:
- Remove the sample files from src/example_data/metadata and src/example_data/data.
- Put your table description files (business data catalog) in the src/example_data/metadata folder.
- Put your databases in the src/example_data/data folder, in SQLite format.
Optional:
- Put graphs of table join relationships for your database in the src/example_data/metadata_advanced/graphs folder.
- Put histogram descriptions of the data in each column of your database in the src/example_data/metadata_advanced/histrograms folder.
- Put the complete information about each table that you want the SQL generator to have access to, including the DDL, comments and histograms, into the src/example_data/metadata_advanced/DDL-annotated folder.
Alternatively, you can edit the src/utils/database_connections.py file to connect to Amazon Athena, PostgreSQL databases, and more.
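As an illustration of what such a connection could look like, here is a sketch of an Amazon Athena connection. This is not the actual structure of database_connections.py, and it assumes the third-party PyAthena library rather than whatever driver the repo uses.

```python
# Sketch only -- not the actual contents of src/utils/database_connections.py.
# Assumes the PyAthena library (pip install pyathena) as one way to query Athena.
from pyathena import connect

def get_athena_connection():
    # Region, workgroup and results bucket are placeholders to adapt to your account.
    return connect(
        region_name="us-west-2",
        work_group="primary",
        s3_staging_dir="s3://YOUR-ATHENA-RESULTS-BUCKET/prefix/",
    )

def run_athena_query(sql):
    conn = get_athena_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()
```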
Let's explore how Data Genie works and how you can optimize its performance for your use case. The application offers two modes of operation: Standard and Advanced.
In standard mode, Data Genie breaks the question into several variants, then uses vector search to identify the 3 most semantically relevant tables from your data lake for each of these sub-questions. It then samples their data to generate SQL queries. While effective for straightforward queries, you might notice limitations with more complex questions that require multiple joins, deeper semantic understanding, or joins between distant tables.
Let's start with the simpler workflow:
- Configure Data Genie for standard mode by setting the following in config.py:
ENABLE_ADVANCED_MODE = False
- Try these example questions:
- "List all products in the Beverages category" (works well with direct table relationships)
- "Show me sales performance by region and category" (may be less accurate due to complex joins)
The advanced mode introduces several enhancements:
- Graph search helps find valid and optimal join paths between tables
- Enhanced vector search with LLM reranking improves table selection by avoiding both artificially over-filtering the table list and leaving it unnecessarily large
- Entity recognition helps break down complex queries
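As a rough illustration of the graph search idea (not the actual code in this repository), the join-path lookup can be thought of as a shortest-path search over a graph whose nodes are tables and whose edges are known join relationships; the table names below are examples only.

```python
# Rough illustration of graph search for join paths; not the actual code in this
# repository. Requires networkx (pip install networkx). Table names are examples.
import networkx as nx

# Nodes are tables; edges carry the join condition that links two tables.
join_graph = nx.Graph()
join_graph.add_edge("Customers", "Orders", on="Customers.CustomerID = Orders.CustomerID")
join_graph.add_edge("Orders", "OrderDetails", on="Orders.OrderID = OrderDetails.OrderID")
join_graph.add_edge("OrderDetails", "Products", on="OrderDetails.ProductID = Products.ProductID")

def join_path(table_a, table_b):
    """Return the chain of tables and the join conditions needed to connect them."""
    path = nx.shortest_path(join_graph, table_a, table_b)
    conditions = [join_graph[u][v]["on"] for u, v in zip(path, path[1:])]
    return path, conditions

# Even if the question only mentions customers and products, graph search surfaces
# Orders and OrderDetails as the intermediate tables needed for a valid join path.
print(join_path("Customers", "Products"))
```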
Let's see how advanced mode improves query accuracy:
- Enable advanced mode and its features in config.py:
ENABLE_ADVANCED_MODE = True
USE_GRAPH_SEARCH_TO_FIND_JOIN_PATHS = True
USE_ADVANCED_VECTOR_SEARCH_INSTEAD_OF_DEFAULT_TOP_3 = True
- Try the same questions again and notice how the results improve, especially for complex queries requiring multiple joins.
- Test these advanced scenarios:
- "What's the revenue trend for each sales region over time?" (demonstrates graph search capabilities to connect tables via other tables the quesiton doesn't mention)
- "Find customers who purchased organic products" (shows semantic understanding)
For detailed configuration options, see the Configuration Options section below.
"Standard Mode" demonstrates the use of vector search to enable narrowing a large universe of tables down to a much smaller number of tables whose descriptions are synonymous with the user's question, and uses those tables along with a sample of data from each table to form the final SQL query. To activate standard mode, update the config.py file to set:
ENABLE_ADVANCED_MODE = False
"Advanced Mode" demonstrates several additional tricks we can use to improve query accuracy, controlled by the options listed below, which can be toggled in the config.py file:
ENABLE_ADVANCED_MODE = True # to activate Advanced mode
USE_LLM_INSTEAD_OF_VECTOR_SEARCH_TO_IDENTIFY_DATABASE = True # to let the LLM decide which database matches the question the best, or set to False to use vector search to use the database of the best matching table instead
SEARCH_FOR_ENTITIES_INSTEAD_OF_SUBQUESTIONS = True # to split the question into multiple entities (like "the Chicago Bulls basketball team"), or set to False to split it into multiple related sub-questions (like "Which basketball team is called The Chicago Bulls?")
USE_GRAPH_SEARCH_TO_FIND_JOIN_PATHS = True # to enable graph search to provide a list of valid join paths to the final SQL composition prompt
USE_ADVANCED_VECTOR_SEARCH_INSTEAD_OF_DEFAULT_TOP_3 = True # to do a wider vector search and then re-rank the results using an LLM instead of a hard-coded cut-off of 3 vector search results
USE_LLM_TABLE_FILTER = True # to use an extra LLM prompt to remove tables that came back from the vector search but don't seem to relate to the user question
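To illustrate the last two options, the LLM re-ranking/filtering step can be pictured as one extra prompt that asks the model to keep only the relevant tables. The snippet below is a hypothetical sketch using the Bedrock Converse API, not the actual prompt or parsing used in this repo.

```python
# Hypothetical sketch of the LLM table filter / re-ranking step; the actual prompt
# and response parsing in this repository may differ.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def filter_tables_with_llm(question, candidate_tables):
    # candidate_tables: {table_name: business description from the data catalog}
    prompt = (
        "A user asked the following question about a data lake:\n"
        f"{question}\n\n"
        "Here are candidate tables returned by a vector search, with their descriptions:\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in candidate_tables.items())
        + "\n\nReturn a JSON list containing only the names of the tables that are "
        "relevant to answering the question, ordered from most to least relevant."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return json.loads(answer)  # assumes the model replied with a bare JSON list
```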
- Input Validation and Sanitization: Data Genie accepts natural language queries from users, which are then translated into SQL queries to retrieve data from the underlying data sources. It is essential to implement proper input validation and sanitization mechanisms to prevent SQL injection attacks and other code injection vulnerabilities. These mechanisms should apply to both user input and the generated SQL queries. Their implementation depends on the use case; they are not implemented in the provided demo.
- Access Controls and Authentication: We recommend that you implement proper access controls and authentication mechanisms to restrict access to the underlying data sources (e.g., databases, Amazon Athena) only to authorized users or services. For instance, without being exhaustive, the Data Genie user should not have write permissions on the databases. You could also leverage services like AWS Lake Formation and Amazon DataZone to govern your Data Lake and manage access control.
- Logging and Monitoring: We recommend that you implement comprehensive logging and monitoring mechanisms to detect and respond to potential security incidents or suspicious activities. Consider using AWS services like Amazon CloudWatch and AWS CloudTrail for centralized logging and monitoring.
- Compliance and Regulatory Requirements: If your application handles sensitive data or operates in regulated industries, ensure that you comply with relevant security standards, regulations, and industry best practices. Consult with security experts or compliance professionals to understand and implement the necessary security controls and measures.
- The connection between CloudFront and the ALB uses HTTP and is not SSL encrypted, which means traffic between CloudFront and the ALB is unencrypted. It is strongly recommended to configure HTTPS by bringing your own domain name and SSL/TLS certificate to the ALB.
- The provided code is intended as a demo and starting point, not production ready. The Python app relies on third party libraries like Streamlit and streamlit-cognito-auth. As the developer, it is your responsibility to properly vet, maintain, and test all third party dependencies. The authentication and authorization mechanisms in particular should be thoroughly evaluated. More generally, you should perform security reviews and testing before incorporating this demo code in a production application or with sensitive data.
- In this demo, Amazon Cognito is in a simple configuration. Note that Amazon Cognito user pools can be configured to enforce strong password policies, enable multi-factor authentication, and set the AdvancedSecurityMode to ENFORCED to enable the system to detect and act upon malicious sign-in attempts.
- AWS provides various services, not implemented in this demo, that can improve the security of this application. Network security services like network ACLs and AWS WAF can control access to resources. You could also use AWS Shield for DDoS protection and Amazon GuardDuty for threat detection. Amazon Inspector performs security assessments. There are many more AWS services and best practices that can enhance security - refer to the AWS Shared Responsibility Model and security best practices guidance for additional recommendations. The developer is responsible for properly implementing and configuring these services to meet their specific security requirements.
- Regular rotation of secrets is recommended, not implemented in this demo.
The architecture for AWS deployment is based on the deploy-streamlit-app repository:
The sample databases come from the repositories below:
See CONTRIBUTING for more information.
This application is licensed under the MIT-0 License. See the LICENSE file.