This project provides a user-friendly interface using Gradio to manage datasets in the CyVerse Discovery Environment (DE). The tool facilitates:
- Migration of datasets from CyVerse to CKAN (a data management system).
- Conversion of metadata into DCAT and Croissant JSON-LD formats.
- CSV to Parquet conversion for efficient storage.
- Uploading metadata files directly to CKAN.
Using this application, users can seamlessly move datasets between platforms, validate metadata, and ensure compliance with DCAT and Croissant standards.
- Migrate datasets from CyVerse DE to CKAN with metadata.
- Generate DCAT or Croissant metadata files for datasets.
- Upload and manage datasets with ease using CKAN's API.
- Convert CSV files to Parquet for efficient data handling.
You have two ways to launch the app:
- Option 1: Use the prebuilt Docker image available at tdewangan63/my-gradio-app.
- Option 2: Clone the repository and build the Docker image locally.
- Pull the Docker image from Docker Hub:
  docker pull tdewangan63/my-gradio-app
- Run the Docker container:
  docker run -p 7860:7860 tdewangan63/my-gradio-app
- Access the app:
  - Open a browser and go to http://localhost:7860.
- Docker installed on your system. Install Docker if you don't have it.
- Git installed to clone the repository.
- Clone the repository and change into its directory:
  git clone https://github.com/cyverse/data-commons
  cd data-commons
- Build the Docker image:
  docker build -t cyverse-gradio-app .
- Run the Docker container:
  docker run -p 7860:7860 cyverse-gradio-app
- Access the app:
  - Open a browser and go to http://localhost:7860.
- Purpose: Defines the Gradio-based user interface and its various tabs.
- Features:
- Provides tabs for dataset migration, metadata generation (Croissant and DCAT), and file uploads.
- Calls helper functions from other modules to handle migration and metadata operations.
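The tab layout described above can be sketched roughly as follows. This is a minimal illustration that assumes Gradio's `Blocks` and `Tab` components; the handler `migrate_dataset`, the function `build_ui`, and the tab labels are hypothetical placeholders, not the project's actual names.

```python
def migrate_dataset(de_path: str) -> str:
    """Hypothetical handler: the real app would call its migration module here."""
    de_path = de_path.strip()
    if not de_path:
        return "Please enter a Discovery Environment path."
    return f"Migration requested for: {de_path}"


def build_ui():
    """Build a two-tab interface (requires the third-party `gradio` package)."""
    import gradio as gr  # imported lazily so the handler can be tested without Gradio

    with gr.Blocks(title="CyVerse to CKAN") as demo:
        with gr.Tab("Migrate dataset"):
            path_box = gr.Textbox(label="DE path")
            status_box = gr.Textbox(label="Status")
            # Wire the button to the handler: one text input, one text output.
            gr.Button("Migrate").click(migrate_dataset, inputs=path_box, outputs=status_box)
        with gr.Tab("Generate metadata"):
            gr.Markdown("Croissant and DCAT generation controls would go here.")
    return demo
```

Calling `build_ui().launch(server_name="0.0.0.0", server_port=7860)` would serve the interface on the same port that the Docker commands publish.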
- Purpose: Handles interactions with the CKAN API.
- Functions:
- Create datasets, upload files, and update metadata in CKAN.
- Manage datasets and resources (e.g., adding or deleting datasets).
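As an illustration of what creating a dataset involves, the sketch below prepares a request for CKAN's `package_create` action using only the standard library. The helper name `build_package_create_request`, the URL, the token, and the dataset fields are placeholders chosen for this example; the project's own module may be organized differently.

```python
import json
import urllib.request


def build_package_create_request(ckan_url: str, api_token: str, dataset: dict) -> urllib.request.Request:
    """Prepare (but do not send) a CKAN `package_create` request."""
    endpoint = ckan_url.rstrip("/") + "/api/3/action/package_create"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(dataset).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: replace with a real CKAN instance, token, and organization.
request = build_package_create_request(
    "https://ckan.example.org/",
    "my-api-token",
    {"name": "example-dataset", "title": "Example dataset", "owner_org": "example-org"},
)
```

Passing the request to `urllib.request.urlopen` would send it; CKAN replies with a JSON object whose `success` flag indicates whether the dataset was created.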
- Purpose: Manages communication with the CyVerse Discovery Environment (DE).
- Functions:
- Retrieve metadata and datasets from DE.
- Authenticate users and fetch files or directories using the DE API.
- Purpose: Orchestrates the migration process from DE to CKAN.
- Functions:
- Prepares datasets by cleaning and validating metadata.
- Ensures that datasets and files are correctly transferred.
- Purpose: Generates Croissant JSON-LD metadata for datasets.
- Functions:
- Converts metadata to Croissant format with fields like title, description, and author.
- Adds files or resources as distributions in the metadata.
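A simplified picture of the resulting document is sketched below. The function `build_croissant` is hypothetical, the `@context` is abbreviated (the Croissant specification defines a longer one), and required details such as file checksums are left out for brevity.

```python
import json


def build_croissant(title, description, author, files):
    """Return a simplified Croissant-style JSON-LD dict.

    `files` is a list of (name, url, encoding_format) tuples.
    """
    return {
        # Abbreviated context; the Croissant specification defines a longer one.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "sc:Dataset",
        "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": title,
        "description": description,
        "creator": {"@type": "sc:Person", "name": author},
        # Each file or resource becomes one distribution entry.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": name,
                "name": name,
                "contentUrl": url,
                "encodingFormat": fmt,
            }
            for name, url, fmt in files
        ],
    }
```

The returned dict can be written out with `json.dumps(doc, indent=2)` to produce the JSON-LD file.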
- Purpose: Creates DCAT-compliant JSON-LD files for datasets.
- Functions:
- Converts metadata into the DCAT format for interoperability.
- Adds distributions (e.g., CSV, Parquet files) with unique hashes.
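The sketch below shows one way such a document could be assembled. It assumes the hashes are SHA-256 content checksums recorded with the SPDX vocabulary, which is a guess about the project's approach; `build_dcat` is a hypothetical name.

```python
import hashlib


def build_dcat(title, description, files):
    """Return a minimal DCAT-style JSON-LD dict.

    `files` is a list of (download_url, media_type, content_bytes) tuples.
    """
    distributions = []
    for url, media_type, content in files:
        # Assumption: the hash is a SHA-256 digest of the file content.
        digest = hashlib.sha256(content).hexdigest()
        distributions.append({
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": url},
            "dcat:mediaType": media_type,
            "spdx:checksum": {
                "@type": "spdx:Checksum",
                "spdx:algorithm": {"@id": "spdx:checksumAlgorithm_sha256"},
                "spdx:checksumValue": digest,
            },
        })
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "spdx": "http://spdx.org/rdf/terms#",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": distributions,
    }
```

Because the digest depends only on the bytes, two distributions with identical content receive the same checksum regardless of file name or format.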
- Purpose: Provides utility functions for file and metadata handling.
- Functions:
- Extract metadata from JSON files.
- Generate Croissant and DCAT metadata.
- Convert CSV files to Parquet for optimized storage.
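For the CSV-to-Parquet step, a conversion along these lines is a reasonable guess at the approach. It assumes `pandas` with a Parquet engine (`pyarrow` or `fastparquet`) is installed; both function names are hypothetical.

```python
from pathlib import Path


def parquet_path_for(csv_path: str) -> Path:
    """Return the output path: same location and stem, `.parquet` suffix."""
    return Path(csv_path).with_suffix(".parquet")


def csv_to_parquet(csv_path: str) -> Path:
    """Convert one CSV file to Parquet and return the new file's path."""
    import pandas as pd  # lazy import: only needed when a conversion actually runs

    out_path = parquet_path_for(csv_path)
    # index=False keeps the pandas row index out of the Parquet file.
    pd.read_csv(csv_path).to_parquet(out_path, index=False)
    return out_path
```

Parquet stores data by column and compresses it, which is why the converted file is typically smaller and faster to query than the original CSV.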
- Purpose: Captures and stores logs from the application.
- Functions:
- Uses a StringIO logging handler to keep logs in memory.
- Parses logs to separate errors and warnings during validation.
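The in-memory capture described above can be sketched with the standard `logging` and `io` modules. The function names, the logger name, and the `LEVEL:message` line format are assumptions made for this example.

```python
import io
import logging


def make_memory_logger(name: str = "validation"):
    """Return a logger that writes to an in-memory buffer, plus the buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger = logging.getLogger(name)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.setLevel(logging.DEBUG)
    logger.propagate = False     # keep records out of the root logger
    logger.addHandler(handler)
    return logger, buffer


def split_errors_and_warnings(log_text: str):
    """Separate captured log lines into error and warning messages."""
    errors, warnings = [], []
    for line in log_text.splitlines():
        level, _, message = line.partition(":")
        if level == "ERROR":
            errors.append(message)
        elif level == "WARNING":
            warnings.append(message)
    return errors, warnings


logger, buffer = make_memory_logger()
logger.warning("license field is empty")
logger.error("title is missing")
errors, warnings = split_errors_and_warnings(buffer.getvalue())
```

This simple parser assumes one log record per line; multi-line messages would need a more careful format.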
- Purpose: Validates DCAT JSON against a schema.
- Functions:
- Ensures that the DCAT metadata complies with the required structure before uploading to CKAN.
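To show the kind of check involved, here is a hand-rolled required-field test. It is a simplified stand-in for full JSON Schema validation, and the field list is an assumption for illustration, not the project's actual schema.

```python
# Assumed required fields; the project's real schema may differ.
REQUIRED_FIELDS = ("@type", "dct:title", "dcat:distribution")


def find_dcat_problems(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in doc]
    if doc.get("@type") not in (None, "dcat:Dataset"):
        problems.append("@type must be dcat:Dataset")
    if "dcat:distribution" in doc and not isinstance(doc["dcat:distribution"], list):
        problems.append("dcat:distribution must be a list")
    return problems
```

An upload step could refuse to proceed whenever the returned list is non-empty and show the messages to the user instead.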
- Purpose: Provides helper functions to clean and structure metadata for migration.
- Functions:
- Handles licenses, tags, and dataset descriptions.
- Checks if datasets or files need to be re-uploaded or updated in CKAN.
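One plausible way to decide whether a file needs re-uploading is to compare a digest of the local content with the `hash` field stored on the CKAN resource. This is an assumption about the strategy (the project might compare modification dates instead), and `needs_upload` is a hypothetical name.

```python
import hashlib
from typing import Optional


def needs_upload(content: bytes, existing_resource: Optional[dict]) -> bool:
    """Return True if the file is new or its content differs from the CKAN copy."""
    if existing_resource is None:
        return True  # nothing in CKAN yet, so upload
    # Re-upload only when the stored hash does not match the local content.
    return existing_resource.get("hash") != hashlib.sha256(content).hexdigest()
```

Skipping unchanged files this way keeps repeated migrations of the same dataset from creating redundant uploads.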
- Purpose: Defines the Docker configuration for the project.
- Details:
- Uses Python 3.11 slim as the base image.
- Copies only the .py files to the container for a smaller image size.
- Installs dependencies from requirements.txt and exposes port 7860 for the Gradio UI.