Tax_agency_web_scraper

A Python web scraper, built on the requests library, that collects companies' financial statements from pb.nalog.ru, the website of the Russian Tax Agency. The entry point is the "Nologi_comp_info.py" script. It drives the whole scraping process: it opens a source xlsx file containing company names and writes a final xlsx file with all of the results found.
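The read, look up, write flow described above can be sketched roughly as follows. This is only an illustration and not the actual code of "Nologi_comp_info.py": the function names, the assumption that company names sit in the first column, and the "inn" and "revenue" fields are all hypothetical, and the xlsx handling assumes the third-party openpyxl package.

```python
from typing import Callable, Iterable, List, Optional


def read_company_names(path: str) -> List[str]:
    """Read company names from the first column of the active sheet (assumed layout)."""
    import openpyxl  # third-party; imported lazily so the pure logic below has no dependency

    workbook = openpyxl.load_workbook(path, read_only=True)
    sheet = workbook.active
    names = [
        row[0]
        for row in sheet.iter_rows(min_col=1, max_col=1, values_only=True)
        if row and row[0]
    ]
    workbook.close()
    return names


def collect_results(
    names: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
) -> List[list]:
    """Call the lookup for every company; fields that were not found stay None."""
    rows = []
    for name in names:
        info = lookup(name) or {}
        # "inn" and "revenue" are hypothetical field names used for illustration.
        rows.append([name, info.get("inn"), info.get("revenue")])
    return rows


def write_results(path: str, rows: Iterable[list]) -> None:
    """Write a header row followed by one row per company."""
    import openpyxl

    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.append(["company", "inn", "revenue"])
    for row in rows:
        sheet.append(row)
    workbook.save(path)
```

Here `lookup` stands in for whatever the search script returns for one company. Keeping it as a parameter means the collection step can be tried out with a stub function before any request is sent.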

"Nologi_search.py" looks up a specific company on pb.nalog.ru and passes the result back to "Nologi_comp_info.py". "Nologi_GZ_sbis.py" searches for a company's public procurements on a separate, independent website.

The whole scraper uses proxy rotation. The "proxy_.py" script searches for currently available free proxies and stores them in an SQL database. The tax agency's website is very strict about proxy connections, so scraping takes quite a long time; however, the system is stable and eventually gets the job done.
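As a rough illustration of the idea, proxy rotation backed by an SQL table might look like the sketch below. It is not the code of "proxy_.py": it assumes SQLite as the database, and the table layout, the function names and the `max_failures` threshold are all hypothetical. The actual HTTP call is passed in as `fetch` so the rotation logic can be read and tested on its own.

```python
import sqlite3
from typing import Callable, Iterable, Optional


def init_pool(conn: sqlite3.Connection) -> None:
    """Create the proxy table if it does not exist yet (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxies ("
        "address TEXT PRIMARY KEY, "
        "failures INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def add_proxies(conn: sqlite3.Connection, addresses: Iterable[str]) -> None:
    """Store newly found proxies, silently skipping ones already in the table."""
    conn.executemany(
        "INSERT OR IGNORE INTO proxies (address) VALUES (?)",
        [(address,) for address in addresses],
    )
    conn.commit()


def fetch_with_rotation(
    conn: sqlite3.Connection,
    url: str,
    fetch: Callable[[str, str], str],
    max_failures: int = 3,
) -> Optional[str]:
    """Try stored proxies, least-failed first, until one of them succeeds."""
    rows = conn.execute(
        "SELECT address FROM proxies WHERE failures < ? "
        "ORDER BY failures, address",
        (max_failures,),
    ).fetchall()
    for (address,) in rows:
        try:
            return fetch(url, address)
        except Exception:
            # Remember the failure so this proxy is tried later, or dropped
            # from rotation once it reaches max_failures.
            conn.execute(
                "UPDATE proxies SET failures = failures + 1 WHERE address = ?",
                (address,),
            )
            conn.commit()
    return None  # every stored proxy failed
```

With the requests library, `fetch` could be a small wrapper that calls `requests.get(url, proxies={"http": f"http://{address}", "https": f"http://{address}"}, timeout=10)` and returns the response text. Since free proxies fail often, counting failures per proxy is one simple way to stop retrying dead ones.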

This repository also contains two Excel files: "sites Белгород.xlsx", with the names of companies in the city of Belgorod, and "test_comp_info.xlsx", containing a sample of the data scraped for that company list. Many cells contain "none"; these belong to companies that had already been closed at the time of scraping.

Obezyan0941/Tax_agency_web_scraper
