The World Development Indicators (WDI) is the World Bank's most comprehensive collection of cross-country development data. Its website provides access to the data, along with information about data coverage, curation, and methodologies, and lets users discover which indicators are available.
- Databricks
- Apache Spark
- Scala
- Country.csv
  - 247 rows representing the countries.
  - 31 columns describing various attributes of the countries.
- Indicators.csv
  - 5,656,458 rows representing data instances.
  - 6 columns describing indicators of the countries.
Indicators.csv is about 550 MB, which is why the analysis uses Apache Spark through its Scala API on Databricks. This combination provides a scalable framework for processing large datasets efficiently.
Note: This link will be valid till 01-06-2024.
- Create a free Databricks Community Edition account.
- Create a new cluster and wait until it is active and running.
- Upload the World Development Indicators.dbc Notebook to Databricks and connect it to the above cluster.
- Upload the data (CSV files) to Databricks after downloading it from the source.
- Run the cells, view and analyse the data as desired.
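%scala
// Read Indicators.csv into a DataFrame. The first row holds the column names,
// and inferSchema asks Spark to detect column types from the data.
val Indicators = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/Indicators.csv")

// Render the DataFrame as a table in the notebook.
display(Indicators)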
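With `inferSchema` enabled, Spark makes an extra pass over the file to work out the column types, which is noticeable on a file of this size. A possible alternative is to declare the schema up front. The sketch below is only an illustration: the helper `loadIndicators` and the value `indicatorsSchema` are hypothetical names, and the column names `CountryCode` and `IndicatorName`, the column order, and the chosen types are assumptions that should be checked against the header of the actual file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Assumed layout of Indicators.csv (6 columns). Only CountryName, IndicatorCode,
// Year and Value are confirmed by the query in this README; the rest is an assumption.
val indicatorsSchema = StructType(Seq(
  StructField("CountryName", StringType),
  StructField("CountryCode", StringType),
  StructField("IndicatorName", StringType),
  StructField("IndicatorCode", StringType),
  StructField("Year", IntegerType),
  StructField("Value", DoubleType)
))

// Hypothetical helper: load the CSV with the declared schema instead of inferring it.
def loadIndicators(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")   // skip the header row
    .schema(indicatorsSchema)   // apply the declared types by column position
    .csv(path)
```

In a Databricks notebook, where `spark` is already defined, this could be called as `val IndicatorsTyped = loadIndicators(spark, "/FileStore/tables/Indicators.csv")`.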
A temporary view lets you run SQL queries against the DataFrame as if it were a SQL table.
%scala
Indicators.createOrReplaceTempView("Indicators")
%sql
-- NY.GNP.PCAP.CD: GNI per capita, Atlas method (current US$)
SELECT CountryName, Value, Year
FROM Indicators
WHERE IndicatorCode = 'NY.GNP.PCAP.CD'
  AND Year = 1962
  AND CountryName IN ('Japan', 'China', 'France', 'United States')
ORDER BY Value ASC;
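The same query can also be written with the DataFrame API instead of SQL, which may be convenient when the year or the country list needs to vary. This is a minimal sketch rather than part of the original notebook; the function name `gniPerCapita` and its parameters are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: GNI per capita for the given countries in one year,
// sorted from lowest to highest value.
def gniPerCapita(indicators: DataFrame, year: Int, countries: Seq[String]): DataFrame =
  indicators
    .filter(col("IndicatorCode") === "NY.GNP.PCAP.CD")
    .filter(col("Year") === year)
    .filter(col("CountryName").isin(countries: _*))
    .select("CountryName", "Value", "Year")
    .orderBy(col("Value").asc)
```

In the notebook this could be used as `display(gniPerCapita(Indicators, 1962, Seq("Japan", "China", "France", "United States")))`.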