This repository contains the DataSharingClient class, which allows you to interact with data stored in S3 and perform queries using DuckDB. This guide will help you set up your environment, configure your credentials, and use the functionality provided by the DataSharingClient.
LLMs can be a helpful partner when working with this repository. You can copy the contents of LLMPartner.txt and paste them into a chat assistant such as ChatGPT, Claude, Gemini, or any other provider you prefer. Your LLM partner can help with SQL query syntax, provide guidance on using DuckDB within the DataSharingClient, and answer general questions about the code and your analysis.
- Python 3.7 or higher
- pip (Python package installer)
- Virtual environment (venv) module
- Open Command Prompt and navigate to your project directory:
    cd path\to\your\project
- Create a virtual environment with a custom name (e.g., myenv):
    python -m venv myenv
- Activate the virtual environment (Command Prompt uses the Scripts folder rather than source):
    myenv\Scripts\activate
- Install the required dependencies:
    pip install -r requirements.txt
- Open Terminal and navigate to your project directory:
    cd path/to/your/project
- Create a virtual environment with a custom name (e.g., myenv):
    python3 -m venv myenv
- Activate the virtual environment:
    source myenv/bin/activate
- Install the required dependencies:
    pip install -r requirements.txt
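Before installing dependencies, it can help to confirm that the interpreter meets the Python 3.7 prerequisite and that the virtual environment is actually active. The following is a minimal sketch using only the standard library; the helper names `python_version_ok` and `in_virtualenv` are hypothetical and not part of this repository.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version listed in the prerequisites


def python_version_ok(version_info=sys.version_info):
    """Return True if the given version meets the minimum requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def in_virtualenv():
    """Return True if the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python version OK:", python_version_ok())
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints False after activation, the shell is probably still using the system interpreter.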
- Copy the example environment file and create a new .env file:
    cp .env.example .env
- Open the .env file and input your credentials:
    OCEAN_USERNAME=your_username
    OCEAN_PASSWORD=your_password
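A quick way to catch a missing or empty credential before initializing the client is to parse the .env file and check for the two keys named above. This is a sketch under stated assumptions: `parse_env_text`, `read_env_file`, and `missing_credentials` are hypothetical helpers, the parser handles only simple KEY=value lines, and the client itself may load the file differently.

```python
from pathlib import Path

# Key names taken from the .env example in this guide.
REQUIRED_KEYS = ("OCEAN_USERNAME", "OCEAN_PASSWORD")


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def read_env_file(path=".env"):
    """Read and parse an env file from disk."""
    return parse_env_text(Path(path).read_text(encoding="utf-8"))


def missing_credentials(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if Path(".env").exists():
    missing = missing_credentials(read_env_file(".env"))
    print("Missing credentials:", missing or "none")
```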
- Run the first code block to set all imports and initialize the client:
    # Initialize the client using credentials from .env file
    client = DataSharingClient()
- For VSCode users: You can work directly in the .ipynb file without running the command line by selecting your virtual environment after clicking Select Kernel in the top right corner.
- Default Initialization:
    client = DataSharingClient()
- Custom Initialization with DuckDB Parameters:
    duckdb_path = "path/to/file/nameofyourduckdbfile.duckdb"
    client = DataSharingClient(duckdb_region="us-east-1", duckdb_path=duckdb_path)
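When pointing the client at a custom DuckDB file, one common stumbling block is a parent folder that does not exist yet: DuckDB creates the database file itself but typically expects the directory to be there already. The sketch below creates the folder first. `prepare_duckdb_path` is a hypothetical helper, and whether DataSharingClient already does this internally is not stated in this guide.

```python
import tempfile
from pathlib import Path


def prepare_duckdb_path(path):
    """Create the parent directory for a DuckDB file and return the path as a string.

    The database file itself is not created here.
    """
    db_path = Path(path).expanduser()
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return str(db_path)


# Demo in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    duckdb_path = prepare_duckdb_path(Path(tmp) / "data" / "analysis.duckdb")
    print("Parent directory ready:", Path(duckdb_path).parent.is_dir())
    # The returned string can then be passed on, for example:
    # client = DataSharingClient(duckdb_region="us-east-1", duckdb_path=duckdb_path)
```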
- Creating a View from S3 URI:
    # Example: Creating a view from a Parquet file in S3
    s3_uri = "s3://your-bucket-name/path/to/yourfile.parquet"
    view_name = "your_view_name"
    client.create_view(s3_uri, view_name)
- Creating a View from Local Path:
    # Example: Creating a view from a Parquet file in local storage
    local_path = "path/to/local/file/yourfile.parquet"
    view_name = "your_view_name"
    client.create_view(local_path, view_name)
- Querying the View to Count the Records:
    # Example: Querying the view to count the records
    query = "SELECT COUNT(*) FROM your_view_name;"
    result_df = client.query(query)
    print(result_df)
- Creating a New Table from a Query:
    # Example: Creating a new table from a query
    query = "SELECT * FROM your_view_name WHERE your_column > some_value;"
    new_table_name = "new_table_name"
    client.query(query, new_table_name)
- Listing All Tables and Views:
    # Example: Listing all tables and views
    tables = client.list_tables()
    print(tables)
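To make the view and table operations above more concrete, here is a sketch of the kind of DuckDB SQL that creating a view over a Parquet file, or a table from a query, usually comes down to. The client's internals are not shown in this guide, so this is an assumption about the general technique, not its actual implementation. `view_sql` and `table_sql` are hypothetical helpers; `read_parquet`, `CREATE OR REPLACE VIEW`, and `CREATE OR REPLACE TABLE ... AS` are standard DuckDB SQL.

```python
import re

# Only plain identifiers are accepted, to avoid building unsafe SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_name(name):
    """Return the name unchanged if it is a simple SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a simple SQL identifier: {name!r}")
    return name


def view_sql(source, view_name):
    """Build a CREATE VIEW statement over a Parquet file (S3 URI or local path)."""
    escaped = source.replace("'", "''")  # escape quotes for a SQL string literal
    return (
        f"CREATE OR REPLACE VIEW {_check_name(view_name)} AS "
        f"SELECT * FROM read_parquet('{escaped}')"
    )


def table_sql(query, table_name):
    """Wrap a SELECT query in CREATE TABLE ... AS, dropping a trailing semicolon."""
    body = query.strip().rstrip(";").strip()
    return f"CREATE OR REPLACE TABLE {_check_name(table_name)} AS {body}"


print(view_sql("s3://your-bucket-name/path/to/yourfile.parquet", "your_view_name"))
print(table_sql("SELECT * FROM your_view_name;", "new_table_name"))
```

Because a view only stores the query, it re-reads the Parquet file each time it is queried, while a table created from a query stores a copy of the result in the DuckDB file.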