This project analyzes the "Independent Medical Reviews" dataset on Hadoop/MapReduce, using Apache Pig for data processing, Apache Hive for data storage, and Tableau for data visualization.
- Cloudera Quickstart Docker Image
- Apache Pig and Hive (included in the Cloudera Docker image)
- Tableau Desktop with a Hive ODBC driver installed
- Access to the "Independent Medical Reviews" dataset
- Copy the Pig script (diagnosis_correlation.pig) and the "Independent Medical Reviews" dataset to the Docker container.
Once the pipeline is set up, you can open the Tableau workbook to view the dashboard. You will need to edit the connection to use your own IP address.
Username - hive
Password - cloudera
- Start the Cloudera Docker container:
docker run --hostname=quickstart.cloudera --privileged=true -t -i --publish-all=true -p 8888:8888 -p 7180:7180 -p 80:80 cloudera/quickstart /usr/bin/docker-quickstart
Access the Cloudera environment through a web browser or terminal.
- Load the dataset into HDFS, creating the target directory first if it does not exist (hdfs dfs -mkdir -p /medical_reviews):
hdfs dfs -put Independent_Medical_Reviews_Filled.csv /medical_reviews/
- Execute the Pig script for data processing:
pig diagnosis_correlation.pig
- Access Hive and create a table for the processed data:
CREATE TABLE medical_data (
treatment_category STRING,
diagnosis_category STRING,
case_count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
- Load the data into the Hive table:
LOAD DATA INPATH '/output/treatment_diagnosis_correlation' INTO TABLE medical_data;
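To make the shape of the data that flows from the Pig step into the medical_data table more concrete, here is a minimal Python sketch of the kind of aggregation that produces (treatment_category, diagnosis_category, case_count) rows. It is not the contents of diagnosis_correlation.pig, which is not shown in this README. The column names "Treatment Category" and "Diagnosis Category", the sample rows, and the helper names `count_cases` and `to_hive_lines` are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real dataset's column names and values may differ.
SAMPLE_CSV = """\
Treatment Category,Diagnosis Category
Pharmacy,Cancer
Pharmacy,Cancer
Orthopedic Proc,Orthopedic
Pharmacy,Mental
"""


def count_cases(csv_text,
                treatment_col="Treatment Category",
                diagnosis_col="Diagnosis Category"):
    """Count how many cases share each (treatment, diagnosis) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[treatment_col], row[diagnosis_col]) for row in reader)


def to_hive_lines(counts):
    """Render counts as comma-delimited lines matching the medical_data columns."""
    return [f"{t},{d},{n}" for (t, d), n in sorted(counts.items())]


if __name__ == "__main__":
    for line in to_hive_lines(count_cases(SAMPLE_CSV)):
        print(line)
```

Because the Hive table splits fields on commas, a category value that itself contains a comma would be split across columns. If the real data has such values, a different delimiter may be a safer choice.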
- Set up an ODBC connection in Tableau to the Hive server.
- Import data from the medical_data Hive table using the Cloudera Hadoop connection in Tableau.
- Log in using the username "hive" and password "cloudera".
- Create the heatmap visualizations.
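A heatmap of this table is a treatment-by-diagnosis grid in which each cell is shaded by case_count. Tableau handles this reshaping itself, so the following Python sketch is only meant to illustrate the idea. The function name `pivot_counts` and the sample rows are hypothetical and not part of the project.

```python
def pivot_counts(rows):
    """Pivot (treatment, diagnosis, case_count) rows into a grid.

    Returns (treatments, diagnoses, matrix), where matrix[i][j] is the
    case count for treatments[i] and diagnoses[j], or 0 if the pair is absent.
    """
    rows = list(rows)  # allow any iterable, since we pass over it several times
    treatments = sorted({t for t, _, _ in rows})
    diagnoses = sorted({d for _, d, _ in rows})
    lookup = {(t, d): n for t, d, n in rows}
    matrix = [[lookup.get((t, d), 0) for d in diagnoses] for t in treatments]
    return treatments, diagnoses, matrix


if __name__ == "__main__":
    # Made-up rows in the same shape as the medical_data table.
    sample = [
        ("Pharmacy", "Cancer", 2),
        ("Pharmacy", "Mental", 1),
        ("Orthopedic Proc", "Orthopedic", 1),
    ]
    treatments, diagnoses, matrix = pivot_counts(sample)
    print(diagnoses)
    for name, row in zip(treatments, matrix):
        print(name, row)
```

Pairs that never occur together are filled with 0 here; in Tableau such cells would typically just appear empty.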