SOI
Contains code to scrape and parse data from the Survey of India.

Open Series Maps

Scraping

Uses Google's Tesseract OCR to break the captchas: https://tesseract-ocr.github.io/tessdoc/Home.html
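
Captcha breaking of this kind usually amounts to a little image cleanup followed by a Tesseract call. A minimal sketch using pytesseract; the exact preprocessing and options in scrape.py may well differ:

# Illustrative captcha solver -- scrape.py's actual preprocessing may differ.
from PIL import Image
import pytesseract

def solve_captcha(path):
    img = Image.open(path).convert('L')               # grayscale
    img = img.point(lambda p: 255 if p > 128 else 0)  # crude binarisation
    # --psm 7 treats the image as a single line of text
    config = '--psm 7 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz0123456789'
    return pytesseract.image_to_string(img, config=config).strip()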

On macOS, install Tesseract using Homebrew: brew install tesseract

Tested with Python 3.9.

Python requirements are in requirements.txt. Run pip install -r requirements.txt to install them.

Requires usernames and passwords in data/users.json.

Example users.json:

[
    {
        "name": "<username_1>",
        "phone_num": "<10_digit_phone_num_1>",
        "password": "<password_1>",
        "email_id": "<email_id_1>",
        "first_login": false
    },
    {
        "name": "<username_2>",
        "phone_num": "<10_digit_phone_num_2>",
        "password": "<password_2>",
        "email_id": "<email_id_2>",
        "first_login": false
    }
]

Run the following command to pull data: python scrape.py

PDFs for the sheets end up at data/raw/.

Parsing

This involves the following steps:

  • converting the PDFs to images
  • dissecting each image into the map area, legend area, compilation index area and notes area
  • further processing of the map area
    • locate the corners and, from them, the actual mapbox
    • use the corners and the index file to georeference the mapbox, reproject it from UTM to EPSG:3857, and crop the image further using the index file
      • use an alpha channel for nodata to keep things simple at this stage
    • the alpha channel makes for big files, so convert it to an internal nodata mask and apply stronger compression; the nodata information then stays accessible without the alpha channel (see the sketch after this list)
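
These steps map onto standard GDAL tooling. A rough sketch of the sequence, with made-up paths, corner coordinates and UTM zone (the real logic lives in parse.py):

# Illustrative sketch of the GDAL steps -- paths, corner GCPs, UTM zone and
# creation options are placeholders, not the exact ones parse.py uses.
import subprocess

src = 'data/inter/SHEET/mapbox.jpg'

# 1. attach ground control points at the four located corners
#    (pixel/line -> easting/northing taken from the index file)
subprocess.run(['gdal_translate', '-a_srs', 'EPSG:32643',   # e.g. UTM zone 43N
                '-gcp', '0', '0', '500000', '3200000',
                '-gcp', '8000', '0', '530000', '3200000',
                '-gcp', '8000', '6000', '530000', '3170000',
                '-gcp', '0', '6000', '500000', '3170000',
                src, 'gcp.tif'], check=True)

# 2. reproject to EPSG:3857, crop to the sheet polygon from the index file,
#    keeping nodata in an alpha band to stay simple at this stage
subprocess.run(['gdalwarp', '-t_srs', 'EPSG:3857',
                '-cutline', 'index.geojson', '-crop_to_cutline',
                '-dstalpha', 'gcp.tif', 'warped.tif'], check=True)

# 3. fold the alpha band into an internal nodata mask and compress the colour bands
subprocess.run(['gdal_translate', '-b', '1', '-b', '2', '-b', '3', '-mask', '4',
                '--config', 'GDAL_TIFF_INTERNAL_MASK', 'YES',
                '-co', 'COMPRESS=JPEG', '-co', 'PHOTOMETRIC=YCBCR',
                'warped.tif', 'export/gtiffs/SHEET.tif'], check=True)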

The final compressed, clipped, georeferenced file ends up at export/gtiffs/<sheet_no>.tif

All intermediate files are at data/inter/<sheet_no>/

Depends on MuPDF for PDF to JPEG conversion and GDAL for georeferencing; the Python dependencies are listed in requirements.parse.txt.

All of this can be triggered using python parse.py

By default it runs through all available files in data/raw/; files that had problems are recorded in data/errors.txt.

Parallelism is exploited where possible - assumes 8 available CPUs (my laptop config :))

Reruns ignore already processed files.
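
A rough sketch of the driver loop described above (illustrative; the helper is a stand-in, and the real parse.py may be organized differently):

# Illustrative driver loop -- the real parse.py may be organized differently.
from multiprocessing import Pool
from pathlib import Path

def convert_and_georeference(pdf):
    """Stand-in for the real per-sheet pipeline (pdf -> images -> georeferenced tif)."""

def process_sheet(pdf):
    out = Path('export/gtiffs') / (pdf.stem + '.tif')
    if out.exists():                      # reruns ignore already processed files
        return None
    try:
        convert_and_georeference(pdf)
        return None
    except Exception as e:                # collect failures instead of aborting
        return f'{pdf.name}: {e}'

if __name__ == '__main__':
    pdfs = sorted(Path('data/raw').glob('*.pdf'))
    with Pool(8) as pool:                 # 8 CPUs, per the note above
        errors = [e for e in pool.map(process_sheet, pdfs) if e]
    Path('data/errors.txt').write_text('\n'.join(errors))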

Environment variables can be used to change behavior:

  • SHOW_IMG=1 displays the image processing steps in the terminal using imgcat
  • ONLY_CONVERT=1 can be used to only do the PDF to image conversion, in parallel, for better CPU saturation
  • FROM_LIST=<filename> can be used to only parse files listed in <filename>; this mode also errors out on the first failure
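
These switches boil down to simple environment lookups, roughly like this (illustrative, not the literal parse.py code):

# Illustrative: how the switches above might be read inside parse.py.
import os

SHOW_IMG = os.environ.get('SHOW_IMG') == '1'
ONLY_CONVERT = os.environ.get('ONLY_CONVERT') == '1'
FROM_LIST = os.environ.get('FROM_LIST')   # a filename, or None

if FROM_LIST:
    with open(FROM_LIST) as f:
        sheets = [line.strip() for line in f if line.strip()]
    # in FROM_LIST mode the pipeline stops at the first failure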

Tiling

Run python tile.py to tile.

Tiles get created at export/tiles. This also generates an auxiliary export/prev_files_to_tile.json, which contains the list of sheet files that went in and their modification times.

Rerunning python tile.py picks up any new/modified sheet files in export/gtiffs and does an incremental retiling operation.
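
The incremental check is essentially a modification-time diff against that JSON file. A sketch, assuming a simple {filename: mtime} layout (the actual structure of prev_files_to_tile.json may differ):

# Illustrative: find new/modified sheets since the last tiling run.
# Assumes prev_files_to_tile.json maps filename -> mtime; the real layout may differ.
import json
from pathlib import Path

prev_file = Path('export/prev_files_to_tile.json')
prev = json.loads(prev_file.read_text()) if prev_file.exists() else {}

current = {p.name: p.stat().st_mtime for p in Path('export/gtiffs').glob('*.tif')}
changed = [name for name in current if prev.get(name) != current[name]]

# ...retile only the tiles covering `changed`, then persist the new state
prev_file.write_text(json.dumps(current, indent=2))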

This is still slow and requires all the previous tiling data to be present, so a newer retiling script was created: it pulls just the requisite tiles from the original tile set into a staging area, retiles, and pushes back only the affected tiles into the main area.

Invoke with python retile.py. For now it expects a retile.txt file containing just the sheet names to retile.

Update Jobs

Update jobs are run with GitHub Actions on a weekly basis. Data ends up in the soi_data bucket on Google Cloud Storage.

Village boundaries

Scraping

Most directions/software requirements are the same as for Open Series Maps, except for the final commands.

Run the following command to pull data: python scrape_villages.py

Shapefile zips end up at data/raw/villages/.

To recover from intermittent errors, run FORCE=1 python scrape_villages.py

To force rechecking previously missing Districts, run FORCE=1 FORCE_UNAVAILABLE=1 python scrape_villages.py

To enter the captcha manually rather than depending on automatic captcha breaking, run CAPTCHA_MANUAL=1 python scrape_bounds.py

This uses imgcat to display the captcha, which may only work on some terminals.

The Python requirements for the scraping part are captured in requirements.txt.

To convert the resulting shapefiles to OSM XML files that can be used directly in JOSM, run python convert_villages.py

This requires an extra Python dependency, pytopojson==1.1.2, and an extra software dependency on GDAL.

The resulting per-tehsil .osm files end up in the corresponding district folders in data/raw/villages/.
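
At its core, such a conversion reads each polygon from the shapefile and writes it back out as OSM XML with negative ids (JOSM's convention for new objects). A rough sketch using OGR; the real convert_villages.py also goes through pytopojson and handles shared boundaries and tags properly, and the NAME field and EPSG:4326 coordinates here are assumptions:

# Illustrative shapefile -> OSM XML conversion. The real convert_villages.py
# is more involved (pytopojson, shared boundaries, proper attribute mapping).
import xml.etree.ElementTree as ET
from osgeo import ogr

def shapefile_to_osm(shp_path, osm_path):
    ds = ogr.Open(shp_path)
    layer = ds.GetLayer()
    root = ET.Element('osm', version='0.6', generator='sketch')
    next_id = -1                                   # negative ids = new objects in JOSM
    for feature in layer:
        ring = feature.GetGeometryRef().GetGeometryRef(0)  # outer ring of the polygon
        node_ids = []
        # the last ring point repeats the first; drop it and close the way by ref instead
        for i in range(ring.GetPointCount() - 1):
            # assumes coordinates are already lon/lat (EPSG:4326)
            ET.SubElement(root, 'node', id=str(next_id),
                          lon=str(ring.GetX(i)), lat=str(ring.GetY(i)))
            node_ids.append(next_id)
            next_id -= 1
        way = ET.SubElement(root, 'way', id=str(next_id))
        next_id -= 1
        for nid in node_ids + [node_ids[0]]:
            ET.SubElement(way, 'nd', ref=str(nid))
        # 'NAME' is a hypothetical attribute field on the shapefile
        ET.SubElement(way, 'tag', k='name', v=str(feature.GetField('NAME')))
    ET.ElementTree(root).write(osm_path, encoding='utf-8', xml_declaration=True)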