The goal of this project is to convert Wikipedia tables into the CSV format. Essentially, we want a function that takes the URL of a Wikipedia page and saves all relevant tables as CSV. The application offers two interfaces to its converter: it can be packaged into a jar file and added as a dependency to another Java project, and it is also a Spring Boot application whose REST API returns the CSV as a list of strings.
This converter uses Maven, so you can test, build, and run it easily.
To do so, open a terminal at the root of the project.
- Testing:
  ```
  cd wikimatrix
  mvn test
  ```
- Building:
  ```
  cd wikimatrix
  mvn clean package
  ```
- Running:
  ```
  cd wikimatrix
  mvn spring-boot:run
  ```
After running the Spring Boot application, you can access the REST API at this link.
The REST API allows you to get Wikipedia tables in an array-like form convertible to CSV.
- Go to `/table_index?name=(Wikipedia page name)&index=(index of the table among wikitables)` to get a specific table from the Wikipedia page. A list of lists is returned, representing the rows and columns.
- Go to `/table_all?name=(Wikipedia page name)` to get all the tables from the Wikipedia page. A 3D list is returned, representing the tables, the rows, and the columns.
- You can append `/reformat` to `/table_index` or `/table_all` to get an HTML view of the transformed table.
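For a quick check, here is a hypothetical example of calling the API with curl. It assumes the application runs on Spring Boot's default port 8080; the page name and index are only examples, so adjust them to your setup.

```
curl "http://localhost:8080/table_all?name=Comparison_of_programming_languages"
curl "http://localhost:8080/table_index?name=Comparison_of_programming_languages&index=0"
```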
Converting Wikipedia tables from HTML to the CSV format is not an easy task, so choices and compromises had to be made.
In order to access the HTML content of Wikipedia pages, we use Jsoup. This library provides an easy way to navigate the DOM. However, HTML tables are far more versatile than what the CSV format can express. See the example below.
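Before that example, here is a minimal, illustrative sketch (not the project's actual code) of how Jsoup can fetch a Wikipedia page and select its wikitables; the URL is only an example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupFetchSketch {
    public static void main(String[] args) throws Exception {
        // Download the page and parse it into a DOM
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Help:Table").get();
        // Wikipedia marks its data tables with the "wikitable" CSS class
        Elements tables = doc.select("table.wikitable");
        System.out.println("Found " + tables.size() + " wikitables");
    }
}
```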
HTML table whose cell "e" spans two columns, converted into CSV:

```
"a","b","c"
"d","e","e"
```
As you can see, the cell named "e" is duplicated in the CSV output. A cell spanning several columns cannot be represented in CSV, so we chose to simply duplicate its value. HTML tables can be very tricky, as the Help:Table page of Wikipedia shows: cells can span multiple rows and/or columns, and tables can be nested arbitrarily deep, so one cell can contain a whole table while also spanning several cells. This situation cannot be represented in CSV as it is in HTML, so we chose to duplicate rows and columns in order to have the space to fit the data while keeping the links between rows and columns. An example with a nested table is given below:
HTML table with spanning cells and a nested table inside one cell, converted into CSV:

```
"a","b","b","c","c"
"e","f","f","g","g"
"d","a1","a1","b1","b1"
"d","c1","c1","c1","c1"
```
So we flatten tables in order to convert them into CSV. This flattening is performed from the bottom up so that no matter how many nested tables there are, all data will be present in the right place.
Moreover, as you can see on Help:Table, tables can contain different types of data: images, videos, links, or any other HTML content. In this project, we focused on making a modular application that can evolve to handle more kinds of tables and data. With this in mind, we already implemented support for "text", "links", and "images". These types need to be converted to text somehow in order to fit into CSV files: for "links" we take the "href" attribute and for "images" the "src" attribute, which we append to the cell's text. This behavior can easily be changed.
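As an illustration only (the class and method names here are hypothetical, not the project's), this sketch shows one way a cell could be turned into text with Jsoup, appending link and image URLs to the visible text.

```java
import org.jsoup.nodes.Element;

public class CellTextSketch {
    // Convert a <td>/<th> element to a single CSV-friendly string:
    // visible text, followed by the href of each link and the src of each image.
    static String cellToText(Element cell) {
        StringBuilder sb = new StringBuilder(cell.text());
        for (Element a : cell.select("a[href]")) {
            sb.append(' ').append(a.attr("abs:href"));   // absolute link target
        }
        for (Element img : cell.select("img[src]")) {
            sb.append(' ').append(img.attr("abs:src"));  // absolute image source
        }
        return sb.toString().trim();
    }
}
```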
The architecture of this app consists of four main parts (an end-to-end sketch follows the list):
- The EXTRACTOR takes the URL of a Wikipedia page and outputs table elements with all their respective children. Several options let the user get a single table by its index or return all the tables on the page. NB: in the second case, the returned tables do not contain tables that are nested inside other ones.
- The CONVERTOR takes a table or a list of tables and outputs them as tables of strings.
- The SERIALIZER saves string tables into CSV files at the default or chosen location.
- The WEB part creates the web interface, thanks to Spring Boot, which uses the first two parts to convert HTML to CSV.
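To make the flow concrete, here is a minimal, hypothetical sketch of how the three core steps could be chained with Jsoup and plain Java. None of this is the project's actual API, and the naive cell grid below ignores colspan/rowspan and nested tables, which the real CONVERTOR handles.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // EXTRACTOR step: fetch an example page and keep its wikitables
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Help:Table").get();
        Elements tables = doc.select("table.wikitable");

        Files.createDirectories(Path.of("output"));
        for (int i = 0; i < tables.size(); i++) {
            // CONVERTOR step (naive stand-in): one string per cell, row by row
            List<List<String>> grid = tables.get(i).select("tr").stream()
                    .map(tr -> tr.select("th, td").stream()
                            .map(Element::text)
                            .collect(Collectors.toList()))
                    .collect(Collectors.toList());

            // SERIALIZER step: quote each cell and write the grid as a CSV file
            String csv = grid.stream()
                    .map(row -> row.stream()
                            .map(cell -> "\"" + cell.replace("\"", "\"\"") + "\"")
                            .collect(Collectors.joining(",")))
                    .collect(Collectors.joining("\n"));
            Files.writeString(Path.of("output", "table_" + i + ".csv"), csv);
        }
    }
}
```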
Here is the class diagram of the core project; the test and web parts are not included.
The first three parts of the core appear at the top right, but the conversion is done mostly in the `WikipediaHTMLConvertorPlus` class.
The first idea is to reproduce the HTML tree structure in Java. This is done by the `Balise` class, which stores other balises as children or parent. For each type of HTML balise (tag) there is a `Balise` subclass that mimics the corresponding HTML behavior. To map an HTML tag to its `Balise` class, the `Controller` uses a predefined `HashMap`.
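For illustration, here is a hedged sketch of what such a tag-to-class mapping could look like; the subclass names (`TableBalise`, `RowBalise`, `CellBalise`) and the factory signature are assumptions, not the project's actual declarations.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: map each HTML tag name to a factory for the matching Balise subclass.
class BaliseRegistrySketch {
    static final Map<String, Supplier<Balise>> TAGS = new HashMap<>();
    static {
        TAGS.put("table", TableBalise::new);
        TAGS.put("tr", RowBalise::new);
        TAGS.put("td", CellBalise::new);
        TAGS.put("th", CellBalise::new);
    }

    static Balise create(String tagName) {
        // Fall back to a generic Balise for tags we do not model explicitly
        return TAGS.getOrDefault(tagName, Balise::new).get();
    }
}

// Minimal placeholder classes so the sketch compiles on its own
class Balise {}
class TableBalise extends Balise {}
class RowBalise extends Balise {}
class CellBalise extends Balise {}
```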
To create and then explore this tree, a visitor pattern is used. To convert from HTML to CSV, we first create the tree with the `CreateVisitor`. Then we compute the grid of child balises (except table ones) with the `GridVisitor`. After that, we compute the grids of the tables themselves with the `TableVisitor`. Finally, the first table in the tree is expanded (if there are nested tables) to produce the final grid of balises, which is then transformed into the grid of strings that is returned.
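To show the shape of that visitor pattern, here is a simplified, hedged sketch; the node structure and the accept/visit signatures are illustrative assumptions, not the project's real interfaces (only the visitor names above come from the project).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified visitor-pattern sketch over a Balise-like tree (illustrative only).
interface BaliseVisitor {
    void visit(BaliseNode node);
}

class BaliseNode {
    final String tag;
    final String text;
    final List<BaliseNode> children = new ArrayList<>();

    BaliseNode(String tag, String text) { this.tag = tag; this.text = text; }

    // Each node lets a visitor process it, then its children (depth-first traversal)
    void accept(BaliseVisitor visitor) {
        visitor.visit(this);
        for (BaliseNode child : children) child.accept(visitor);
    }
}

// Example visitor: collect the text of every cell-like node into a flat list
class CellTextCollector implements BaliseVisitor {
    final List<String> cells = new ArrayList<>();

    @Override
    public void visit(BaliseNode node) {
        if ("td".equals(node.tag) || "th".equals(node.tag)) {
            cells.add(node.text);
        }
    }
}
```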
We provide 300+ Wikipedia URLs, and the challenge is to:
- integrate the extractors' code (HTML and Wikitext)
- extract as many relevant tables as possible
- serialize the results into CSV files (within `output/html` and `output/wikitext`)

More details can be found in `BenchTest.java`. We expect to launch `mvn test`, and the results will be in the `output` folder.
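Assuming the layout described above, running the benchmark and inspecting its results would look roughly like this (the exact per-file naming is up to `BenchTest.java`):

```
cd wikimatrix
mvn test
ls output/html output/wikitext
```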