Talmud Illuminated scraper

About the project

The code here scrapes the content of the blog Talmud Illuminated.
The goal of the Talmud Illuminated Project is to bring the benefits of study to everyone who wants it. These benefits are many, and more can be added.
This scraping project was started on Tisha B'Av 2021. Tisha B'Av is the saddest day in Jewish history. It is a fast day on which Torah and Talmud study are prohibited. However, this project was perfect for Tisha B'Av. In scraping, one is not learning Torah, just creating a structure for the data. So, it may be a mitzvah, but a mitzvah is not prohibited on Tisha B'Av. At the same time, it is not business (which is also not encouraged on this day).

The code iterates through every Talmud volume (masechet) by name and, within each masechet, through every page up to the number of pages it contains.
For each page, it runs a search through the Google Blogger API, gets a response in JSON, and parses that response. The search may bring back several posts, and the parser finds the one it is looking for by its title.
For example, the code may be looking for the string "bava kamma 48". Among the JSON results, it picks the one whose title field is "bava kamma 48".
All pages are stored in this project under the content folder and committed to GitHub.
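
The loop and the title lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `CrawlSketch`, the `Post` holder, the helper names, the page counts, and the `BLOG_ID` and `API_KEY` placeholders are all assumptions made for the example. The URL follows the Blogger API v3 `posts/search` endpoint, and the sketch assumes the JSON response has already been parsed into `Post` objects. Comparing titles case-insensitively is also an assumption.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class CrawlSketch {

    /** Hypothetical holder for one search result after the JSON has been parsed. */
    static final class Post {
        final String title;
        final String content;

        Post(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Builds the title to search for, e.g. "bava kamma 48". */
    static String pageTitle(String masechet, int page) {
        return masechet.toLowerCase(Locale.ROOT) + " " + page;
    }

    /** Builds a Blogger v3 posts/search URL; blogId and apiKey are placeholders. */
    static String searchUrl(String blogId, String apiKey, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "https://www.googleapis.com/blogger/v3/blogs/" + blogId
                + "/posts/search?q=" + q + "&key=" + apiKey;
    }

    /** Returns the first result whose title matches the wanted title (assumed case-insensitive). */
    static Optional<Post> findByTitle(List<Post> results, String wanted) {
        return results.stream()
                .filter(p -> p.title != null && p.title.trim().equalsIgnoreCase(wanted))
                .findFirst();
    }

    public static void main(String[] args) {
        // Illustrative page counts only; the real values belong to the project.
        Map<String, Integer> pagesPerMasechet = new LinkedHashMap<>();
        pagesPerMasechet.put("bava kamma", 3);
        pagesPerMasechet.put("bava metzia", 2);

        for (Map.Entry<String, Integer> entry : pagesPerMasechet.entrySet()) {
            for (int page = 1; page <= entry.getValue(); page++) {
                String title = pageTitle(entry.getKey(), page);
                // A real crawler would fetch this URL and parse the JSON "items" array.
                System.out.println(title + " -> " + searchUrl("BLOG_ID", "API_KEY", title));
            }
        }
    }
}
```

Matching on the exact title, rather than taking the first search hit, matters because a query such as "bava kamma 48" can also return neighboring or related posts.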

The scraping project was basically completed in six days, from Tisha B'Av till Tu B'Av, 2021. That year, Tu B'Av fell out on Shabbat. The Talmud Illuminated project was started on the previous Tu B'Av that fell out on Shabbat, in 2008. Here is the start. This part of the project thus took exactly 13 years on the Jewish calendar.

Crawl (run Full Crawl configuration)
QA (run QA configuration)
MakeSite (run MakeSite configuration)
Copy the site to TalmudIlluminatedContent repo
- cp -r site/* ../TalmudIlluminatedContent/
Deploy from there (open TalmudIlluminatedContent project and run deploy.sh)
