Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
All example applications and instructions here are based on release emr-6.14.0.
In this tutorial, you use EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that lets you read and write regular files to Amazon S3.
Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster.
Buckets and folders that you use with Amazon EMR have the following limitations:
- Names can consist of lowercase letters, numbers, periods (.), and hyphens (-).
- Names cannot end in numbers.
- A bucket name must be unique across all AWS accounts.
- An output folder must be empty.
Create your bucket with the following settings:
- Bucket name: your-emr-bucket-name, for example: msba2024-emr-integration
- AWS Region: US East (N. Virginia) us-east-1
- ACLs disabled (recommended)
- Block all public access
- Bucket Versioning: Disable
- Encryption type: Server-side encryption with Amazon S3 managed keys (SSE-S3)
- Bucket Key: Enable
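If you prefer the command line, here is a minimal AWS CLI sketch, assuming the example bucket name above (for new buckets, disabled ACLs, blocked public access, and SSE-S3 encryption are currently the S3 defaults, so they need no extra flags):
aws s3 mb s3://msba2024-emr-integration --region us-east-1
aws s3api get-bucket-location --bucket msba2024-emr-integration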
The most common way to prepare an application for Amazon EMR is to upload the application and its input data to Amazon S3. Then, when you submit work to your cluster you specify the Amazon S3 locations for your script and data.
Copy health_violations.py and save it to your local computer as health_violations.py.
Download food_establishment_data.csv
Upload the PySpark script health_violations.py and food_establishment_data.csv to the bucket you created, for example: msba2024-emr-integration.
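Equivalently, a quick AWS CLI sketch for the upload, assuming both files sit in your current directory and the bucket name matches yours:
aws s3 cp health_violations.py s3://msba2024-emr-integration/
aws s3 cp food_establishment_data.csv s3://msba2024-emr-integration/
aws s3 ls s3://msba2024-emr-integration/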
- Search for EMR in the search box.
- Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster.
- On the Create Cluster page, note the default values for Release, Instance type, Number of instances, and Permissions. These fields automatically populate with values that work for general-purpose clusters.
- Cluster name: your-emr-cluster-name, for example, annie-emr-cluster
- Under Applications, choose the Spark option to install Spark on your cluster. Note: Choose the applications you want on your Amazon EMR cluster before you launch the cluster. You can't add or remove applications from a cluster after launch.
- Under Cluster logs, select the Publish cluster-specific logs to Amazon S3 check box. Replace the Amazon S3 location value with the Amazon S3 bucket you created, followed by /logs. For example, s3://msba2024-emr-integration/logs. Adding /logs creates a new folder called 'logs' in your bucket, where Amazon EMR can copy the log files of your cluster.
- Under Security configuration and permissions, choose your EC2 key pair. In the same section, select the Service role for Amazon EMR dropdown menu and choose EMR_DefaultRole. Then, select the IAM role for instance profile dropdown menu and choose EMR_EC2_DefaultRole.
- Click Create cluster to launch the cluster and open the cluster details page.
- Find the cluster Status next to the cluster name. The status changes from Starting to Running to Waiting as Amazon EMR provisions the cluster. You may need to choose the refresh icon on the right or refresh your browser to see status updates.
Your cluster status changes to Waiting when the cluster is up, running, and ready to accept work.
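For reference, a comparable launch can be sketched with the AWS CLI. The instance type, instance count, and key pair name below are assumptions for illustration, not values prescribed by this tutorial:
aws emr create-cluster --name annie-emr-cluster --release-label emr-6.14.0 --applications Name=Spark --instance-type m5.xlarge --instance-count 3 --use-default-roles --ec2-attributes KeyName=integration_key --log-uri s3://msba2024-emr-integration/logs/
The command returns a ClusterId (j-XXXXXXXXXXXXX), which you can poll with aws emr describe-cluster --cluster-id <your-cluster-id> until the cluster reaches the WAITING state. The --use-default-roles flag applies the same EMR_DefaultRole and EMR_EC2_DefaultRole roles chosen in the console steps above.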
After you launch a cluster, you can submit work to the running cluster to process and analyze data. You submit work to an Amazon EMR cluster as a step. A step is a unit of work made up of one or more actions.
- Under EMR on EC2 in the left navigation pane, choose Clusters, and then select the cluster where you want to submit work. The cluster state must be Waiting.
- Choose the Steps tab, and then choose Add step.
- Configure the step according to the following guidelines:
- For Type, choose Spark application. You should see additional fields for Deploy mode, Application location, and Spark-submit options.
- For Name, enter a new name. If you have many steps in a cluster, naming each step helps you keep track of them.
- For Deploy mode, leave the default value Cluster mode.
- For Application location, enter the location of your health_violations.py script in Amazon S3, such as s3://msba2024-emr-integration/health_violations.py.
- Leave the Spark-submit options field empty.
- In the Arguments field, enter the following arguments and values:
--data_source s3://your-EMR-bucket-name/food_establishment_data.csv --output_uri s3://your-EMR-bucket-name/emr-output
for example:
--data_source s3://msba2024-emr-integration/food_establishment_data.csv --output_uri s3://msba2024-emr-integration/emr-output
- For Action if step fails, accept the default option Continue. This way, if the step fails, the cluster continues to run.
- Click Add step to submit the step. The step should appear in the console with a status of Pending.
- Monitor the step status. It should change from Pending to Running to Completed. To refresh the status in the console, click the refresh icon to the right of Filter. The script takes about one minute to run. When the status changes to Completed, the step has completed successfully.
- If the step fails, click stderr to see the details.
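The same step can also be submitted from the AWS CLI. A sketch, assuming the example bucket name and a placeholder cluster ID:
aws emr add-steps --cluster-id <your-cluster-id> --steps Type=Spark,Name=health_violations,ActionOnFailure=CONTINUE,Args=[s3://msba2024-emr-integration/health_violations.py,--data_source,s3://msba2024-emr-integration/food_establishment_data.csv,--output_uri,s3://msba2024-emr-integration/emr-output]
You can then watch the step move from PENDING to COMPLETED with aws emr list-steps --cluster-id <your-cluster-id>.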
After a step runs successfully, you can view its output results in your Amazon S3 output folder.
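A quick way to inspect those results from the command line, assuming the example output folder (the part-file name is a placeholder you read off the listing):
aws s3 ls s3://msba2024-emr-integration/emr-output/
aws s3 cp s3://msba2024-emr-integration/emr-output/<part-file-name> -
The trailing - streams the object to standard output instead of saving it to a file.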
When you use Amazon EMR, you may want to connect to a running cluster to read log files, debug the cluster, or use CLI tools like the Spark shell.
Before you connect to your cluster, you need to modify your cluster security groups to authorize inbound SSH connections.
Amazon EC2 security groups act as virtual firewalls to control inbound and outbound traffic to your cluster.
Modify Inbound Rules
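As a sketch of what that rule change looks like from the AWS CLI (the security group ID and IP address below are placeholders; in the console you would add the same SSH rule to the primary node's security group):
aws ec2 authorize-security-group-ingress --group-id <sg-primary-node-id> --protocol tcp --port 22 --cidr <your-ip-address>/32
Restricting the source to your own IP address with /32 is safer than opening port 22 to 0.0.0.0/0.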
Regardless of your operating system, you can create an SSH connection to your cluster.
Do one of the following to connect to your cluster:
- Collect and configure your AWS Academy Lab credentials.
- Find your Cluster ID: go to EMR on EC2 in the left navigation pane, choose Clusters, and then choose the cluster that you want to update. The Cluster ID is under the Summary section.
- Use the following command to open an SSH connection to your cluster. Replace <mykeypair.key> with the full path and file name of your key pair file. For example, C:\Users\<username>\.ssh\mykeypair.pem.
aws emr ssh --cluster-id <j-2AL4XXXXXX5T9> --key-pair-file <~/mykeypair.key>
for example:
aws emr ssh --cluster-id j-3PSGOP0RLHKL1 --key-pair-file ./integration_key.pem
- Under the Summary section, get the Primary node public DNS.
- Windows: use PuTTY to connect with the user hadoop.
- Linux, macOS, or Windows (with OpenSSH installed): run the command:
ssh -i integration_key.pem hadoop@<primaryNodePublicDNS>
Replace <primaryNodePublicDNS> with your cluster's Primary node public DNS from the Summary section, for example:
ssh -i integration_key.pem [email protected]
or, with sudo:
sudo ssh -i integration_key.pem [email protected]
Once you connect to the cluster, you can navigate to /mnt/var/log/spark to access the Spark logs on your cluster's primary node, then view the files in that location.
cd /mnt/var/log/spark
ls
There are two remaining options for accessing web interfaces on the primary node that provide full browser functionality. Choose one of the following:
- Option 1 - Local Port Forwarding (recommended for more technical users): Use an SSH client to connect to the primary node, configure SSH tunneling with local port forwarding, and use an Internet browser to open web interfaces hosted on the primary node. This method allows you to configure web interface access without using a SOCKS proxy.
- Option 2 - Dynamic Port Forwarding (recommended for new users): Use an SSH client to connect to the primary node, configure SSH tunneling with dynamic port forwarding, and configure your Internet browser to use an add-on such as FoxyProxy for Firefox or SwitchyOmega for Chrome to manage your SOCKS proxy settings. This method lets you automatically filter URLs based on text patterns and limit the proxy settings to domains that match the form of the primary node's DNS name.
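As a concrete sketch of the two options, assuming the example key file and a placeholder DNS name (8088 is the standard YARN ResourceManager port on the primary node, and 8157 is an arbitrary free local port):
Option 1, local port forwarding:
ssh -i integration_key.pem -N -L 8157:<primaryNodePublicDNS>:8088 hadoop@<primaryNodePublicDNS>
then open http://localhost:8157/ in your local browser.
Option 2, dynamic port forwarding (SOCKS proxy):
ssh -i integration_key.pem -N -D 8157 hadoop@<primaryNodePublicDNS>
then point your browser's SOCKS proxy (for example, via FoxyProxy or SwitchyOmega) at localhost:8157.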
This Learner Lab provides a sandbox environment for ad-hoc exploration of AWS services, with limitations. See AWS Academy Environment Restrictions.
- Python3 is installed
- OpenSSH is installed
- The private key pair file (*.pem) generated by AWS is located in this folder.
If you run the following scripts on Windows, you still need to generate the *.pem file. If you want to use PuTTY to connect instead of the scripts, you can convert the *.pem file to a *.ppk file.
- Copy ssh_ec2.py and save it to your local computer with the same name
- Copy start_emr_web_access.py and save it to your local computer with the same name
- Copy your private key pair file *.pem generated by AWS to the same directory
- Open a terminal
- PowerShell on Windows. Install PowerShell on Windows
- iTerm2 on macOS. Install iTerm2 on macOS
The script ssh_ec2.py is used to access running AWS EC2 instances. Open a terminal (Windows/macOS/Linux) and run:
python ssh_ec2.py
or
py ssh_ec2.py
It prompts you to enter:
- the private key pair file name (*.pem)
- the EC2 instance public DNS name
- the EC2 instance user
- yes to confirm the connection
The script start_emr_web_access.py establishes port forwarding to allow you to access the web interfaces on a running EMR cluster.
Important note: Some web interfaces are accessible only after the corresponding packages are installed or enabled in the EMR cluster.
Open a terminal (Windows/macOS/Linux) and run:
python start_emr_web_access.py
or
py start_emr_web_access.py
It prompts you to enter:
- the private key pair file name (*.pem)
- the EMR cluster primary node public DNS name
- a number from the list of web interfaces you would like to access
After establishing the port forwarding, it displays the URL for you to open in your local browser. For example:
- choose 1 for ResourceManager, and it shows: To access EMR ResourceManager web interface, type the link in your local browser: http://localhost:8151/
- choose 2 for JupyterHub, and it shows: To access EMR JupyterHub web interface, type the link in your local browser: https://localhost:8152/ with userName=jovyan, password=jupyter
Press CTRL + C to terminate the session.
Web Interface and Local URL
ResourceManager : http://localhost:8151/
JupyterHub : https://localhost:8152/ with userName=jovyan, password=jupyter
SparkHistoryServer : http://localhost:8153/
Zeppelin : http://localhost:8154/
HBase : http://localhost:8155/
Hue : http://localhost:8156/
Tez : http://localhost:8157/
Exit