CORTX S3 - Label Studio Integration

CORTX integration with Label Studio, one of the most widely used open-source data annotation tools. It lets you run heavy data annotation workloads on top of CORTX S3 data storage.

Watch Integration Presentation Video

Change the way the world does data by connecting CORTX™, Seagate's open-source object storage software, with the tools and platforms that underpin the data revolution.

Storing and managing data has never been easy, and with the rise of AI and deep learning we now generate enormous amounts of data, commonly called Big Data. Because unstructured data is made up of files like audio, video, pictures and even social media data, it's easy to see why volume is a challenge. The value of the data can get lost in the shuffle when working with so much of it. There is value to be found in unstructured data, but harnessing that information can be difficult.

The ideal big data storage system would allow storage of a virtually unlimited amount of data, cope both with high rates of random write and read access, flexibly and efficiently deal with a range of different data models, support both structured and unstructured data, and for privacy reasons, only work on encrypted data. Obviously, all these needs cannot be fully satisfied.

Background / Overview of the project

The Problem

What do we want to solve?

The problem statement comes from my personal story. I was working on a wildlife conservation project that used biometric sensors to monitor animals' vital signs and protect them from poaching. Collecting the data took tremendous effort: I had to meet park rangers every day to get it. My negligence then ruined all that hard work when the HDD holding the data crashed fatally. The loss of data and time was so large that the project is still not complete. This time I want to ensure that the data generated by the data revolution can be harnessed to its full potential using CORTX S3 data services, which are scalable, reliable and efficient.

Hypothesis

We integrate Label Studio, an open-source data annotation tool used by AI teams worldwide, with CORTX S3 services for storing results and retrieving data for analytics and other data-related tasks. The latest version of Label Studio supports cloud storage, which makes it a good fit for integration with CORTX S3.

What is CORTX? CORTX is a distributed object storage system designed for great efficiency, massive capacity, and high HDD-utilization. CORTX is 100% Open Source.

What does our integration do?

Our integration is simple: it lets you import bulk data from your S3 bucket, and once you are done with the data annotations, you can export them in the most widely used annotation formats: JSON, CSV, COCO, PASCAL VOC. You can also download the exported data into any machine learning environment using this integration.

Why is this integration important?

Building an AI or ML model that acts like a human requires large volumes of training data. For a model to make decisions and take action, it must be trained to understand specific information. Data annotation is the categorization and labeling of data for AI applications. However, no company relies on a single AI dataset or model for its long-term operation; it uses large amounts of data across a variety of use cases. A connected system that stores both the data and its annotations requires massive scalability and efficiency, and ideally is completely open source too. With our integration, new startups and AI companies can store and fetch huge loads of big data from CORTX S3. Self-driving cars, nano surgical bots and Mars rovers all depend on this kind of data.

Integration walkthrough

You can clone the project directory, open it in any code editor such as PyCharm or Visual Studio Code, and install everything listed in requirements.txt. After starting label-studio on localhost, run the app with streamlit run app.py.

Step 1: Download requirements

  • We are integrating S3 storage with Label Studio, an open-source annotation tool. Install it using pip:
 $ pip install -U label-studio
  • Start Label Studio to verify the installation. It runs on localhost:
 $ label-studio 
  • Once the tool is running, go to the account settings and copy the Access Token. The Python methods use it to connect to Label Studio and automate tasks through REST API calls.
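As a minimal sketch of how the access token is used, the snippet below builds (but does not send) an authenticated request to the local Label Studio REST API using only the Python standard library. The helper name `build_projects_request`, the placeholder token and the default `http://localhost:8080` base URL are assumptions for illustration; the `Authorization: Token ...` header and the `/api/projects/` path are the ones used by the `LabelStudioAPI` class in Step 4.

```python
import urllib.request


def build_projects_request(token, base_url="http://localhost:8080"):
    """Return an unsent GET request for the Label Studio project list.

    `token` is the Access Token copied from the account settings page.
    """
    url = base_url.rstrip("/") + "/api/projects/"
    headers = {"Authorization": "Token " + token}
    return urllib.request.Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    # "YOUR_ACCESS_TOKEN" is a placeholder, not a real token.
    req = build_projects_request("YOUR_ACCESS_TOKEN")
    print(req.get_method(), req.full_url)
    # To actually call the API while label-studio is running:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
```

Building the request separately from sending it makes it easy to check the URL and header before Label Studio is even running.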

Step 2: CORTX Cloudshare VM lab setup

Follow this guide from the CORTX team: CORTX-Cloudshare-Setup-for-April-Hackathon-2021

  • Once the CORTX VM is ready, run this command and note the external address, which will serve as our S3 endpoint URL
 $ sudo route add default gw 192.168.2.1 ens33
 
 
  • Start Windows Server 2019 Edition and open the CORTX S3 dashboard at the address below. By default, an S3 user and a test bucket are already created for you, and a txt file on the Desktop contains all the default S3 credentials
 https://192.168.2.102:28100/#/dashboard

Step 3: Connect to the S3 data endpoint. The S3DataEndpoint class provides methods for uploading and downloading data between an S3 bucket and any other location. Our S3DataEndpoint file link: S3DataEndpoint Python Class

import boto3
from botocore.client import Config


class S3DataEndpoint:
    def __init__(self, end_url, accessKey, secretKey):
        self.end_url = end_url
        self.accessKey = accessKey
        self.secretKey = secretKey

        # Resource interface: object-oriented access to buckets and objects
        self.s3_resource = boto3.resource('s3', endpoint_url=self.end_url,
                                          aws_access_key_id=self.accessKey,
                                          aws_secret_access_key=self.secretKey,
                                          config=Config(signature_version='s3v4'),
                                          region_name='US')

        # Client interface: low-level access to the S3 API
        self.s3_client = boto3.client('s3', endpoint_url=self.end_url,
                                      aws_access_key_id=self.accessKey,
                                      aws_secret_access_key=self.secretKey,
                                      config=Config(signature_version='s3v4'),
                                      region_name='US')

    # Functions for bucket operations
    def create_bucket_op(self, bucket_name, region=None):
        if region is None:
            self.s3_client.create_bucket(Bucket=bucket_name)
        else:
            location = {'LocationConstraint': region}
            self.s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)


# dummy test (replace the endpoint and keys with your own values)
def main():
    END_POINT_URL = 'http://uvo100ebn7cuuq50c0t.vm.cld.sr'
    A_KEY = 'AKIAtEpiGWUcQIelPRlD1Pi6xQ'
    S_KEY = 'YNV6xS8lXnCTGSy1x2vGkmGnmdJbZSapNXaSaRhK'

    s3 = S3DataEndpoint(end_url=END_POINT_URL, accessKey=A_KEY, secretKey=S_KEY)
    s3.create_bucket_op('newbucketworks')


if __name__ == "__main__":
    main()

If all the credentials are correct, a new bucket is created, and you can see it using Cyberduck on the CloudShare Windows Server 2019 Edition VM. In the image below, our new bucket has been created.
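Besides checking in Cyberduck, the bucket can also be verified from Python. The following is a minimal sketch, not part of the original project: the helper names `list_bucket_names` and `bucket_exists` are hypothetical, and the code only assumes that the client object behaves like a boto3 S3 client, whose `list_buckets()` call returns a dictionary with a `Buckets` list of entries carrying a `Name` key.

```python
def list_bucket_names(s3_client):
    """Return the names of all buckets visible to this S3 client."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response.get("Buckets", [])]


def bucket_exists(s3_client, bucket_name):
    """Return True if a bucket with this exact name is listed."""
    return bucket_name in list_bucket_names(s3_client)


# Usage with the S3DataEndpoint class from Step 3
# (requires boto3 and a reachable CORTX S3 endpoint):
#   s3 = S3DataEndpoint(end_url=END_POINT_URL, accessKey=A_KEY, secretKey=S_KEY)
#   print(bucket_exists(s3.s3_client, 'newbucketworks'))
```

Passing the client in as an argument keeps the helpers independent of how the connection was created.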

Step 4: Set up the programming environment and the main Label Studio API calls for S3 storage.

import requests


class LabelStudioAPI:
    def __init__(self, token, debug=True):
        self.token = token
        self.debug = debug

    def Project(self, title='Demo', labelXML="""""", description='New Task', action='create', projID='1'):
        if action == 'update':
            # PATCH must target a specific project, so include its ID in the URL
            url = 'http://localhost:8080/api/projects/' + projID + '/'
            headers = {'Authorization': 'Token ' + self.token}
            payload = {'title': title, 'description': description, 'label_config': labelXML}
            res = requests.patch(url, data=payload, headers=headers)
            print(res.status_code, res.content)
            
        elif action == 'create':
            url = 'http://localhost:8080/api/projects/'
            headers = {'Authorization': 'Token ' + self.token}
            payload = {'title': title, 'description': description, 'label_config': labelXML}
            res = requests.post(url, data=payload, headers=headers)
            print(res.status_code, res.content)
            if res.status_code == 201:
                return 'Successfully created NEW Project ' + title
            else:
                return 'Could not create NEW project'

        elif action == 'delete':
            url = 'http://localhost:8080/api/projects/' + projID + '/'
            headers = {'Authorization': 'Token ' + self.token}
            res = requests.delete(url, headers=headers)
            print(res.status_code, res.content)
            if res.status_code == 204:
                return 'Successfully deleted Project ' + projID
            else:
                return 'Could not delete'
        else:
            print("Not valid action")
            
    # connect S3 storage to sync all data to be annotated and results to be stored
    def connectS3Storage(self, projID="1", title="S3", bucket_name="", region_name="US", accessKey="",
                        secretKey="", s3_url=""):
        url = 'http://localhost:8080/api/storages/s3'
        headers = {'Authorization': 'Token ' + self.token}
        payload = {"project": projID, "title": title, "bucket": bucket_name, "region_name": region_name,
                   "s3_endpoint": s3_url, "aws_access_key_id": accessKey, "aws_secret_access_key": secretKey,
                   "use_blob_urls": True, "presign_ttl": "1"}
        res = requests.post(url, data=payload, headers=headers)
        print(res.content)
        print(res.status_code)
        print("Sync S3 bucket to see all your data in label studio")

    def syncS3Bucket(self, storageID='1'):
        url = 'http://localhost:8080/api/storages/s3/'+storageID+'/sync'
        headers = {'Authorization': 'Token ' + self.token}
        res = requests.post(url, headers=headers)
        # report the outcome so a failed sync does not go unnoticed
        print(res.status_code, res.content)

See the full source code in the project repository: https://github.com/vilaksh01/cortx/tree/main/doc/integrations/label-studioAPI

We use Streamlit as the frontend for our integration's user interface. The video below shows what an annotation project looks like through our integration. The label_creator XML file we uploaded defines the labels to be used for the provided data. (There are a variety of label_creator XML templates for every kind of data annotation project. Decide which one to use based on your use case and create an XML file for it to use with our integration platform: https://labelstud.io/templates/ )
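To make the labeling XML more concrete, here is a small sketch of an image classification config and a well-formedness check you can run before passing it as `labelXML`. The wildlife label names and the helper `choice_labels` are made up for illustration; the `View`, `Image`, `Choices` and `Choice` tags follow the image classification template published on the Label Studio templates page.

```python
import xml.etree.ElementTree as ET

# Example labeling config; the label values are illustrative only.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="Elephant"/>
    <Choice value="Rhino"/>
    <Choice value="Other"/>
  </Choices>
</View>
"""


def choice_labels(label_config):
    """Parse the config and return its Choice values in document order.

    Raises xml.etree.ElementTree.ParseError if the XML is not well formed.
    """
    root = ET.fromstring(label_config.strip())
    return [choice.get("value") for choice in root.iter("Choice")]


if __name__ == "__main__":
    print(choice_labels(LABEL_CONFIG))
```

Parsing the config locally catches unbalanced tags before the project creation request is sent to Label Studio.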

Watch how our CORTX S3 integration works with Label Studio, the data annotation tool


Accomplishments:

  1. Successfully connected to CORTX S3, created a data bucket and connected it to a Label Studio project
  2. Used the API to upload bulk files to CORTX S3 and sync them to our label-studio project
  3. One-button file export to a CORTX S3 bucket for different data annotation formats: JSON, CSV, COCO, PASCAL VOC
  4. Ability to download any file from the S3 bucket anywhere, allowing flexible use in multiple places and use cases, such as downloading data annotation results into a Jupyter Notebook or the TensorFlow framework to build AI and data apps
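Once an exported JSON file has been downloaded into a notebook, it can be summarized with a few lines of Python. This is a hedged sketch rather than the project's own code: it assumes the export is a list of tasks whose `annotations` each hold a `result` list with `value.choices`, which is how Label Studio's JSON export represents classification labels, and the sample data and the name `count_choices` are invented for the example.

```python
import json
from collections import Counter


def count_choices(tasks):
    """Count how often each choice label appears across all annotations."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                for label in item.get("value", {}).get("choices", []):
                    counts[label] += 1
    return counts


# Tiny hand-written sample in the assumed export layout.
SAMPLE_EXPORT = json.loads("""
[
  {"id": 1, "data": {"image": "s3://bucket/a.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Elephant"]}}]}]},
  {"id": 2, "data": {"image": "s3://bucket/b.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Rhino"]}}]}]},
  {"id": 3, "data": {"image": "s3://bucket/c.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Elephant"]}}]}]}
]
""")

if __name__ == "__main__":
    # With a real export, load it via json.load(open(path)) instead.
    print(count_choices(SAMPLE_EXPORT))
```

A label count like this is a quick sanity check on class balance before training a model on the annotations.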

Problems that you may run into

  1. Troubleshoot CORS and access problems: After syncing the imports from the S3 bucket, your files may not load in Label Studio because cross-origin resource sharing (CORS) is not currently supported by CORTX S3. As a workaround, manually upload your data for labeling instead of syncing from S3. Here is what the problem looks like:

  2. Troubleshoot data not being exported or downloaded from the S3 bucket: If you are on a CORTX CloudShare instance, make sure it is active and not suspended. If it was suspended, reconnect, run the command below on the CORTX VM, and then ping the external address from your local system (not from CloudShare)
$ sudo route add default gw 192.168.2.1 ens33
# ping it on your local system to check if S3 connection is okay or not
$ ping uvo100ebn7cuuq50c0t.vm.cld.sr
  3. Troubleshoot the app.py file not running: Our app is made with streamlit.io, so you need to start it with the command below
$ streamlit run app.py
  4. Troubleshoot download and upload paths: When using the methods to download or upload files from S3, make sure you give the correct file name and its location or path.
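One way to catch path mistakes early is to validate the local path before calling S3. The helpers below are an illustrative sketch, not the project's own methods: `upload_if_exists` and `download_to_dir` are hypothetical names, and the only assumption about the client is that it offers boto3's `upload_file(Filename, Bucket, Key)` and `download_file(Bucket, Key, Filename)` calls.

```python
import os


def upload_if_exists(s3_client, local_path, bucket_name, object_key=None):
    """Upload local_path to the bucket, failing early if the file is missing."""
    if not os.path.isfile(local_path):
        raise FileNotFoundError("No such local file: " + os.path.abspath(local_path))
    # Default the object key to the bare file name.
    key = object_key or os.path.basename(local_path)
    s3_client.upload_file(local_path, bucket_name, key)
    return key


def download_to_dir(s3_client, bucket_name, object_key, target_dir="."):
    """Download object_key into target_dir and return the local file path."""
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, os.path.basename(object_key))
    s3_client.download_file(bucket_name, object_key, local_path)
    return local_path


# Usage with the client from the S3DataEndpoint class in Step 3:
#   upload_if_exists(s3.s3_client, 'data/image1.jpg', 'newbucketworks')
#   download_to_dir(s3.s3_client, 'newbucketworks', 'image1.jpg', 'downloads')
```

Raising `FileNotFoundError` with the absolute path makes it obvious which directory the script was actually looking in.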

What's next for Cortx Label Studio Integration

  1. Integrating backend ML operations for auto-labeling tasks in Label Studio and storing results and metrics on S3
  2. Multi-level user access for better project control and role management
  3. Integrating other services such as a data encoder, a decoder and direct training on GPUs with major ML frameworks
  4. A faster query system on CORTX S3 using the Motr layer