---
title: "Trafilatura"
date: 2023-10-10
categories: "information_retrieval"
---

**What is trafilatura**

I was trying to scrape and process Google search results, but encountered the following problem: *how to effectively capture the main content of a website and separate it from everything else on the page, such as ads, links to other content, etc.* For a human this is straightforward, as the main content typically sits in the middle of the page and is visually dominant, but for an automated process that takes raw HTML as input, it is not easy at all. After some research I found [trafilatura][trafilatura_github], a Python package that does just that. The package can be installed with pip, has a simple API, and appears to be actively maintained at the time of this post.

**API**

The API of trafilatura is quite simple. Say we want to extract the main content of the following website: [https://github.blog/2019-03-29-leader-spotlight-erin-spiceland](https://github.blog/2019-03-29-leader-spotlight-erin-spiceland). We can simply do:

{% highlight python %}
import trafilatura

downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
main_text = trafilatura.extract(downloaded)  # main_text is the main content of the website as plain text
{% endhighlight %}

We can also ask trafilatura for other output formats:

{% highlight python %}
...
main_text_xml = trafilatura.extract(downloaded, output_format='xml')  # main content in XML format
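# Per the trafilatura docs, extract() also accepts other output_format values,
# e.g. 'json' and 'csv' (a sketch; reuses `downloaded` from above):
main_text_json = trafilatura.extract(downloaded, output_format='json')  # main content plus metadata as a JSON string
main_text_csv = trafilatura.extract(downloaded, output_format='csv')    # the same fields as a single CSV record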
{% endhighlight %}

Apart from `extract`, several other functions are useful as well:

* *bare_extraction* returns a dictionary containing both the main content and the metadata. An example of the returned dictionary is:

  ```
  {'title': 'Leader spotlight: Erin Spiceland',
   'author': 'Jessica Rudder',
   'url': 'https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/',
   'hostname': 'github.blog',
   'description': 'We’re spending Women’s History Month with women leaders who are making history every day in the tech community.',
   'sitename': 'The GitHub Blog',
   'date': '2019-03-29',
   'categories': [],
   'tags': [],
   'fingerprint': None,
   'id': None,
   'license': None,
   'body': None,
   'comments': '',
   'commentsbody': None,
   'raw_text': None,
   'text': '... (main content)',
   'language': None,
   'image': 'https://github.blog/wp-content/uploads/2019/03/Erin_FB.png?fit=4801%2C2521',
   'pagetype': 'article'}
  ```

* *extract_metadata* returns a metadata object, which can be converted to a dictionary. An example is:

  ```
  {'title': 'Leader spotlight: Erin Spiceland',
   'author': 'Jessica Rudder',
   'url': 'https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/',
   'hostname': 'github.blog',
   'description': 'We’re spending Women’s History Month with women leaders who are making history every day in the tech community.',
   'sitename': 'The GitHub Blog',
   'date': '2019-03-29',
   'categories': [],
   'tags': [],
   'fingerprint': None,
   'id': None,
   'license': None,
   'body': None,
   'comments': None,
   'commentsbody': None,
   'raw_text': None,
   'text': None,
   'language': None,
   'image': 'https://github.blog/wp-content/uploads/2019/03/Erin_FB.png?fit=4801%2C2521',
   'pagetype': 'article'}
  ```

* *baseline* returns a 3-tuple `(lxml_object, text, length)`, where `lxml_object` is nothing but `text` wrapped inside a pair of
`<body>` and `<p>` tags, and `length` is the length of `text`. It is used as a fallback when *extract* and trafilatura's wrappers around the fallback libraries *readability-lxml* and *jusText* fail, and its purpose is to return all textual content aggregated into a single string.


**How does it work**

The inner workings of trafilatura are described by its author in a [2021 paper][trafilatura_paper]. In a nutshell, the software operates on the raw HTML through XPath expressions and performs actions from the following two perspectives:
1. **negative perspective:** *"it excludes unwanted parts of the HTML code (e.g.