Crawler

Open source crawler for Persian websites. Websites crawled so far:

Asriran

asriran/run_asriran.sh

You can change some parameters of this crawler. See run_asriran.sh.

Fa-Wikipedia

Because of some problems in crawling, I split this job into two stages: first crawl all index pages, then use those pages for the actual crawl.

wikipedia/run_wikipedia.sh
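
The two-stage idea can be sketched roughly as below. This is only an illustration, not the code behind run_wikipedia.sh: the names `LinkCollector`, `collect_article_links`, and `crawl_articles` are hypothetical, and `fetch` is an assumed callable that returns the HTML of a URL, so the sketch can run without network access.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_article_links(index_urls: Iterable[str],
                          fetch: Callable[[str], str]) -> List[str]:
    """Stage 1: visit each index page and gather links, skipping duplicates."""
    seen: List[str] = []
    for url in index_urls:
        parser = LinkCollector()
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.append(link)
    return seen


def crawl_articles(article_links: Iterable[str],
                   fetch: Callable[[str], str]) -> Dict[str, str]:
    """Stage 2: download every page whose link was found in stage 1."""
    return {link: fetch(link) for link in article_links}


# Usage with an in-memory stand-in for the network (hypothetical data).
fake_site = {
    "index-1": '<a href="/wiki/A">A</a><a href="/wiki/B">B</a>',
    "index-2": '<a href="/wiki/B">B</a><a href="/wiki/C">C</a>',
    "/wiki/A": "page A",
    "/wiki/B": "page B",
    "/wiki/C": "page C",
}
links = collect_article_links(["index-1", "index-2"], fake_site.__getitem__)
pages = crawl_articles(links, fake_site.__getitem__)
```

Keeping the stages separate means the list of links can be saved after stage 1, so a failure in stage 2 does not require walking the index pages again.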

Tasnim News

This crawler saves Tasnim news pages by category. This is appropriate for text classification tasks, as the data is relatively balanced across all categories: I selected an equal number of pages per category.

The Number_of_pages parameter in tasnim.py controls how many pages are crawled in each category.

tasnim/run_tasnim.sh
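
As a rough illustration of how a single per-category limit keeps the data balanced, here is a minimal sketch. It is not taken from tasnim.py: the helper `plan_crawl`, the category names, the example value of `Number_of_pages`, and the assumption that pages are numbered from 1 are all hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# The name comes from tasnim.py; the value here is an arbitrary example.
Number_of_pages = 3


def plan_crawl(categories: List[str], number_of_pages: int) -> List[Tuple[str, int]]:
    """Return (category, page_number) pairs, the same count for every category."""
    return [
        (category, page)
        for category in categories
        for page in range(1, number_of_pages + 1)  # assumes pages start at 1
    ]


# Hypothetical category names, for illustration only.
plan = plan_crawl(["politics", "sports", "economy"], Number_of_pages)
per_category = Counter(category for category, _ in plan)
```

Because every category gets exactly `Number_of_pages` entries in the plan, no category can dominate the resulting dataset.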

Datasets are all available for download at Kaggle.

CSS selectors were mostly extracted with Copy Css Selector.