Now that the new data is sitting in our "landing" bucket, we want to combine it with our existing FDA Product Labels data. For that, we can return to our EMR cluster to run that MapReduce task.
On the EMR cluster, we'll load both our FDA Product Labels data and our indications extraction data. Using the shared ID field as the key, we'll join the two datasets and save the result to the "curated" bucket, ready for consumption.
- Connect to your EMR Cluster (as described in 02_EMR_Cluster)
- Run `pyspark`
- Open `fda.indications.py` in an editor
- Update the values for `BUCKET_LANDING` and `BUCKET_CURATED` with the appropriate values
- Copy the code and paste it into the `pyspark` shell
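
The contents of `fda.indications.py` are not reproduced here, but the join described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function name, the path prefixes, the Parquet default, and the choice of a left join are all assumptions. Only the `id` key and the `extracted_text` column come from the view defined later in this section.

```python
# Sketch only. In the pyspark shell, `spark` (a SparkSession) already exists,
# so nothing needs to be imported here.
BUCKET_LANDING = "s3://<YOUR_LANDING_BUCKET>"  # placeholder, fill in your own
BUCKET_CURATED = "s3://<YOUR_CURATED_BUCKET>"  # placeholder, fill in your own


def join_labels_with_indications(spark, labels_path, indications_path,
                                 output_path, fmt="parquet"):
    """Join product labels with extracted indications on the shared `id`.

    `fmt` is an assumption: use whatever format your data is stored in.
    """
    labels = spark.read.format(fmt).load(labels_path)
    indications = spark.read.format(fmt).load(indications_path)

    # A left join (an assumption) keeps labels that have no extracted
    # indications; their `extracted_text` column will simply be null.
    joined = labels.join(indications, on="id", how="left")

    # Overwrite so the step can be rerun safely.
    joined.write.mode("overwrite").format(fmt).save(output_path)
    return joined


# Hypothetical usage from the pyspark shell (all prefixes are made up):
# join_labels_with_indications(
#     spark,
#     f"{BUCKET_CURATED}/product_labels",
#     f"{BUCKET_LANDING}/indications",
#     f"{BUCKET_CURATED}/product_labels_indications",
# )
```

Writing to a different prefix than the one being read avoids overwriting the input while Spark is still reading from it.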
Because we have just added a new column to the FDA Product Labels data, its schema has changed. We can update our Athena tables by rerunning our Glue Crawler.
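
If you would rather script the crawler run than click through the console, something like the following should work. It is a hedged sketch: the helper name and the polling defaults are assumptions, and `glue` is expected to be a boto3 Glue client (`boto3.client("glue")`), whose `start_crawler` and `get_crawler` calls are used as documented.

```python
import time


def run_crawler_and_wait(glue, crawler_name, poll_seconds=15,
                         timeout_seconds=1800):
    """Start a Glue crawler and block until it returns to the READY state.

    Returns the status of the last crawl (for example "SUCCEEDED"),
    or None if the response does not include one.
    """
    glue.start_crawler(Name=crawler_name)

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        # A crawler moves through RUNNING and STOPPING before it is READY again.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Crawler {crawler_name!r} did not finish in time")


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# print(run_crawler_and_wait(boto3.client("glue"), "<YOUR_CRAWLER>"))
```

Passing the client in as an argument keeps the helper free of hard-coded regions or credentials.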
- From the AWS Glue Crawler Dashboard
- Click on your newly created Crawler
- Click "Run crawler"
- Go to the AWS Athena Dashboard
- Run each query below (remember to replace `<YOUR_DATABASE>` and `<YOUR_TABLE>`)
Product Indications:
CREATE OR REPLACE VIEW product_labels_indications AS
SELECT id, LOWER(indications) as indications, effective_date FROM "<YOUR_DATABASE>"."<YOUR_TABLE>"
CROSS JOIN UNNEST(extracted_text) as t(indications);
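
To make the effect of `CROSS JOIN UNNEST` concrete, here is a small plain-Python illustration of what the view does to each row: every element of the `extracted_text` array becomes its own row, lowercased, alongside the row's `id` and `effective_date`. The function name and the sample rows are invented for illustration and are not taken from the FDA data.

```python
def unnest_indications(rows):
    """Mimic the view: one output row per element of `extracted_text`."""
    result = []
    for row in rows:
        # Rows with an empty or missing array produce no output rows,
        # which matches the behaviour of CROSS JOIN UNNEST.
        for text in row.get("extracted_text") or []:
            result.append({
                "id": row["id"],
                "indications": text.lower(),
                "effective_date": row["effective_date"],
            })
    return result


# Made-up sample rows, for illustration only.
sample = [
    {"id": "a1", "effective_date": "20200101",
     "extracted_text": ["Hypertension", "Heart Failure"]},
    {"id": "b2", "effective_date": "20200202", "extracted_text": []},
]

for flat_row in unnest_indications(sample):
    print(flat_row)
```

Note that the second sample row disappears from the output because its array is empty, so labels without any extracted indications will not appear in the view.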