In this workshop you will learn how to process big data in a serverless fashion, using Amazon S3, AWS Glue, Amazon Athena and Amazon QuickSight to:
- upload a dataset to your central data lake,
- automate the creation of the data catalog,
- schedule ETL processes that aggregate data from multiple tables and convert it into a compressed columnar format, which speeds up your queries and reduces their cost,
- query the data using standard SQL,
- create and share rich web-based visualizations.
All without having to manage clusters. Even more, without having to spin up a single instance.
Welcome to the serverless age!
For this workshop we are going to use data made available by the New York City Taxi and Limousine Commission (TLC).
Raw CSV-formatted trip record data can be downloaded from the TLC website at www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. On this page, TLC provides monthly extracts for yellow cabs, green boro taxis, and for-hire vehicles (FHV).
The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The FHV trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location identifier.
You can use data from any available month. In this particular example we are going to use data from November and December 2017.
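
As a rough sketch of how the downloaded files could be laid out in the data lake, the Python snippet below plans one S3 object key per dataset and month, then uploads the files with boto3. The file name pattern (`yellow_tripdata_2017-11.csv`), the `raw/` prefix, the folder-per-dataset layout and the bucket name are all assumptions made for illustration; adjust them to match the files you actually downloaded and your own bucket.

```python
import os

# Assumed values for illustration only: change them to match your setup.
DATASETS = ("yellow", "green", "fhv")
MONTHS = ("2017-11", "2017-12")


def s3_key(dataset, month, prefix="raw"):
    """Return an object key such as raw/yellow/yellow_tripdata_2017-11.csv."""
    filename = f"{dataset}_tripdata_{month}.csv"
    return f"{prefix}/{dataset}/{filename}"


def planned_uploads(datasets=DATASETS, months=MONTHS):
    """List (local file name, S3 key) pairs for every dataset/month pair."""
    return [
        (f"{d}_tripdata_{m}.csv", s3_key(d, m))
        for d in datasets
        for m in months
    ]


def upload_all(bucket, local_dir="."):
    """Upload the planned files to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the planning helpers work without it

    s3 = boto3.client("s3")
    for filename, key in planned_uploads():
        s3.upload_file(os.path.join(local_dir, filename), bucket, key)


if __name__ == "__main__":
    # Dry run: print the plan without touching AWS.
    for filename, key in planned_uploads():
        print(f"{filename} -> {key}")
```

Running the script only prints the planned mapping; calling `upload_all("my-datalake-bucket")` (a hypothetical bucket name) performs the actual upload. Keeping each dataset under its own folder is one possible choice that tends to work well when a crawler later infers one table per folder.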