
iFood Data Engineer test

Introduction

This project proposes a solution for an iFood order data processing environment. More information: iFood Test

The solution architecture was designed with a focus on scalability.

image

The data source consists of three files, corresponding to the following entities:

  • Order: data on the orders placed by customers
  • Consumer: customer information
  • Restaurant: establishments registered on the iFood platform

We use Delta Lake as a layer on top of our Data Lake (Amazon S3), allowing us to work with ACID transactions and better performance for frequent interactions with our Data Lake. It also gives us a unified batch and streaming experience. A minimal write/read sketch follows the layer list below.

Delta Lake layers:

  • Bronze: data in its native, raw format
  • Silver: cleaned and transformed data
  • Gold: refined data
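
For illustration, here is a minimal PySpark sketch of writing to and reading from a Delta table on S3. The bucket name and paths are placeholders, not the project's actual values, and spark is the session provided by a Databricks notebook:

    # Minimal sketch; bucket name and paths are illustrative assumptions.
    bronze_path = "s3://YOUR_BUCKET_NAME/delta_lake/bronze/order/"

    # Write the raw order data to the Bronze layer as a Delta table.
    raw_orders = spark.read.json("s3://YOUR_BUCKET_NAME/raw/order/")
    raw_orders.write.format("delta").mode("overwrite").save(bronze_path)

    # Read it back: Delta adds ACID guarantees and versioning on top of S3.
    bronze_orders = spark.read.format("delta").load(bronze_path)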

Data Flow

Before ingestion into Delta Lake's Bronze layer, the data is pseudonymized, and a new table (lookup table) is generated to store the personal data separately. The information is then saved in its raw state in the Bronze layer. After some transformations, two new tables are generated:

  • Order Item: the items of a particular order
  • Order Garnish Items: additional (garnish) items of an order (a hedged derivation sketch follows this list)
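
A hedged sketch of how the Order Item split could look in PySpark, assuming the raw order records carry an items array column (the real schema may differ; paths are placeholders):

    from pyspark.sql.functions import col, explode

    # Assumption: each Bronze order row has an `items` array column.
    orders = spark.read.format("delta").load("s3://YOUR_BUCKET_NAME/delta_lake/bronze/order/")

    # Order Item: one row per item of an order.
    order_item = (
        orders
        .select("order_id", explode("items").alias("item"))
        .select(
            "order_id",
            col("item.name").alias("item_name"),
            col("item.quantity").alias("quantity"),
        )
    )

    # Order Garnish Items would be derived analogously from the items' garnish field.
    order_item.write.format("delta").mode("overwrite") \
        .save("s3://YOUR_BUCKET_NAME/delta_lake/silver/order_item/")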

Finally, all order-related information is loaded into the Gold layer to be consumed by different areas.

image

image


Privacy, Compliance and LGPD/GDPR

Pseudonymization

The LGPD defines pseudonymization as: "the treatment by means of which data loses the possibility of direct or indirect association with an individual, except through the use of additional information kept separately by the controller in a controlled and secure environment."

image

First, we identified that the database has personal data attributes (CPF and telephone), which, according to the LGPD, are "all information related to an identified or identifiable natural person". As a result, a three-step policy was created to pseudonymize this data, keeping it maintained and treated separately so that it cannot be attributed to an identifiable person:

  1. Serialization: We create a serialized key from the attributes CPF, name, telephone, and DDD; this helps to avoid comparisons against hash lists. We apply a hash function (SHA), which uses mathematical algorithms to map input strings to fixed-size bit strings (hashes). Such functions are one-way (they cannot be computationally reversed).
  2. Cryptography: Hashing alone is insufficient, since hashes can be matched against rainbow tables (large dictionaries of pre-computed hashes). Therefore, for each generated hash, we apply a robust encryption algorithm (AES-128) to mitigate the risk of attacks via hash tables.
  3. Lookup Tables: Finally, the result of the operations above becomes a new pseudonymized attribute (pseudonymous_id), which is stored separately in a mapping table alongside the corresponding personal data. The lookup table is the only place holding personal data that can identify a particular person, so all subsequent data manipulation uses pseudonyms only. A minimal sketch of this flow follows the list.
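
A minimal Python sketch of the three steps, using the pycrypto library installed later in this README. The key, the SHA-256 choice, and the AES mode are illustrative assumptions, not the project's actual values:

    import hashlib
    from Crypto.Cipher import AES  # pycrypto 2.6.1

    # Illustrative 16-byte key; in practice it must live in a secure secret store.
    SECRET_KEY = b"0123456789abcdef"

    def pseudonymize(cpf, name, ddd, phone):
        # 1. Serialization: concatenate the personal attributes and hash them (SHA).
        serialized = f"{cpf}|{name}|{ddd}|{phone}".encode("utf-8")
        digest = hashlib.sha256(serialized).hexdigest()  # 64 hex characters

        # 2. Cryptography: encrypt the hash so pre-computed hash tables are useless.
        #    ECB is used here only because it is deterministic, so the same person
        #    always maps to the same pseudonymous_id.
        cipher = AES.new(SECRET_KEY, AES.MODE_ECB)
        pseudonymous_id = cipher.encrypt(digest.encode("utf-8")).hex()

        # 3. Lookup table: pseudonymous_id is stored together with the personal data
        #    in a separate, access-controlled table; all other tables keep only the id.
        return pseudonymous_id

    print(pseudonymize("12345678900", "Jane Doe", "11", "99999-0000"))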

The GDPR encourages us to pseudonymize data as soon as possible, so we chose to perform pseudonymization immediately before data ingestion into Delta Lake.

  • Thus, the data that will be manipulated together with the order information contains no personal data. The "consumer" example is shown below:

image

  • In turn, the Mapping Table can be stored in isolation.

image

Data Governance

  • Once personal data is isolated, we can easily create access policies, for example blocking users' access to the directory where personal data is saved (Delta_lake/Lookup/Consumer_order/), as well as implementing access controls for the tables (which can be linked to identity providers).

  • Delta Lake helps us with retention policies and with forget requests (one of the rights of the data subject). In traditional data lake environments, it is challenging to map data entities completely. In Delta Lake, after a deletion, commands such as VACUUM can be used to remove all files that are no longer referenced by the respective Delta table. In addition, Z-Ordering gives a performance boost when scanning the data during the deletion process. The combination of Delta Lake + pseudonymous_id + Glue Catalog facilitates data mapping and control. A hedged sketch of a forget request is shown below.
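
A hedged sketch of handling a forget request with Spark SQL on Databricks; the table path, retention window, and placeholder id are assumptions:

    # Remove the person from the lookup table, then purge unreferenced files.
    lookup_path = "s3://YOUR_BUCKET_NAME/delta_lake/lookup/consumer_order/"  # assumed path

    spark.sql(f"""
        DELETE FROM delta.`{lookup_path}`
        WHERE pseudonymous_id = '<PSEUDONYMOUS_ID>'
    """)

    # VACUUM physically removes files no longer referenced by the Delta table.
    # The default retention is 7 days; shorter windows require extra configuration.
    spark.sql(f"VACUUM delta.`{lookup_path}` RETAIN 168 HOURS")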


ETL Pipeline

Requirements

Creating the Environment

Amazon AWS

We will use CloudFormation to provision the environment on AWS. All template files are in iac/cloudformation/.

  • Comments:

    • The commands below are meant to be executed from the root of this project. We will use the us-west-2 region.
    • Heads up! The environment creation procedures do not take security best practices into account. They were created only to automate provisioning as much as possible and to be easy to run in test environments. Adjust the policies if necessary.
  • Edit iac/cloudformation/stack_data.yaml, replacing YOUR_ACCOUNT_ID with your AWS account ID:

    image

  • Create a user with the following policies attached:

    • AmazonEC2FullAccess
    • IAMFullAccess
    • AmazonS3FullAccess
    • AmazonRedshiftFullAccess
    • AWSCloudFormationFullAccess
  • Set your user's credentials by running aws configure.

    aws configure
  • Create the necessary roles and policies.

    aws cloudformation create-stack --stack-name stack-access --template-body file://iac/cloudformation/stack_access.yaml --output text --capabilities CAPABILITY_NAMED_IAM
  • Wait for the stack provisioning to finish (status: CREATE_COMPLETE). You can follow the status by running:

    aws cloudformation describe-stacks --stack-name stack-access --query "Stacks[0].StackStatus" --output text
  • Now that the necessary policies have been created, attach the "Databricks" policy to your user. image

  • In the Databricks account management console, enter your user's credentials.

    image

  • Create an S3 bucket (in the us-west-2 region) on AWS. This bucket will be used by the ETL pipeline and by Redshift Spectrum.

  • Run CloudFormation to create the data stack.

    aws cloudformation create-stack --stack-name stack-data --template-body file://iac/cloudformation/stack_data.yaml --output text --capabilities CAPABILITY_NAMED_IAM
  • Wait for the stack provisioning to finish (status: CREATE_COMPLETE). You can follow the status by running:

    aws cloudformation describe-stacks --stack-name stack-data --query "Stacks[0].StackStatus" --output text
  • After it finishes, run the command below to get the Redshift cluster endpoint.

    aws cloudformation describe-stacks --stack-name stack-data --query "Stacks[0].Outputs[?OutputKey=='ClusterEndpoint'].OutputValue" --output text

Databricks cluster

  • Create the Instance Profile:

    image

    arn:aws:iam::YOUR_ACCOUNT_ID:instance-profile/Databricks_Glue_S3

  • Create the Spark cluster:

    • Configurations:

      image

    • Select an Instance Profile:

      image

    • Enable the Glue Catalog:

      image

      spark.databricks.hive.metastore.glueCatalog.enabled true

      Note: Using an external metastore can considerably increase data latency. To avoid this problem, we can enable the Glue Catalog cache by adding:

      spark.hadoop.aws.glue.cache.table.enable true
      spark.hadoop.aws.glue.cache.table.size 1000
      spark.hadoop.aws.glue.cache.table.ttl-mins 30
    • Install the libraries: image

      pycrypto==2.6.1

  • Install the Python dependencies.

    sudo apt update
    sudo apt install python3-pip
    sudo apt-get install libpq-dev
    pip3 install psycopg2==2.8.6
  • Create the Redshift tables (a hedged connectivity check is sketched after this step):

    python iac/redshift/app.py --endpoint-redshift "YOUR_REDSHIFT_ENDPOINT" --aws-account-id "YOUR_ACCOUNT_ID" --aws-s3-bucket "YOUR_BUCKET_NAME"

    Example: python iac/redshift/app.py --endpoint-redshift "stack-data-redshiftcluster-1n18jdnczvt.crmu3ndu7d.us-west-2.redshift.amazonaws.com" --aws-account-id "123456789123" --aws-s3-bucket "bucket-xyz2021"
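
    To confirm the tables were created, a quick check can be run with psycopg2 (a sketch only; the database name, user, and password below are placeholders to be replaced with the values configured in the stack):

    import psycopg2

    # Connection parameters are illustrative; match them to your stack's configuration.
    conn = psycopg2.connect(
        host="YOUR_REDSHIFT_ENDPOINT",
        port=5439,
        dbname="YOUR_DATABASE",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
    )

    with conn.cursor() as cur:
        # List the tables visible in the public schema.
        cur.execute("SELECT DISTINCT tablename FROM pg_table_def WHERE schemaname = 'public';")
        print(cur.fetchall())

    conn.close()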

✅ Running

After provisioning the environment, the notebook containing the ETL pipeline is ready to run on Databricks:

👉 Notebook Pipeline ETL

In the notebook, be sure to set the name of the bucket you created.

image

After the ETL pipeline runs, the data is available in Redshift for SQL queries and for consumption through the API.


API

  • Pull the Docker image:

    docker pull bfamorim/api-restaurant:1.0
  • Run the image, replacing <REDSHIFT_ENDPOINT> with the Redshift endpoint.

    docker run -p 8082:8082 -e REDSHIFT_ENDPOINT=<REDSHIFT_ENDPOINT> bfamorim/api-restaurant:1.0

    Example: docker run -p 8082:8082 -e REDSHIFT_ENDPOINT=stack-data-redshiftcluster-1n1k153xc0czvt.crdmkp74267d.us-west-2.redshift.amazonaws.com bfamorim/api-restaurant:1.0

  • Requests (a minimal Python client example follows this list):

    • Order count per day for the state of São Paulo:
    http://localhost:8082/order/count-per-day-for-city-state?orderDate=2018-12-08&state=SP
    • Order count per day for the state of São Paulo and the city of São Paulo:
    http://localhost:8082/order/count-per-day-for-city-state?orderDate=2018-12-08&state=SP&city=SAO PAULO
    • Top 10 most-consumed restaurants for a given customer:
    http://localhost:8082/order/top-10-consumed-restaurants-per-customer?customerId=2a68e52c-929a-4801-8128-86afb8cacb58
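
    A minimal Python example of calling the first endpoint (assumes the requests package is installed and that the API returns JSON):

    import requests

    # Order count per day for the state of São Paulo.
    resp = requests.get(
        "http://localhost:8082/order/count-per-day-for-city-state",
        params={"orderDate": "2018-12-08", "state": "SP"},
    )
    resp.raise_for_status()
    print(resp.json())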

"# ifood-pedidos"