Incorporate data from OpenStreetMap #30

Open
wants to merge 44 commits into
base: master
23931e3
Implement `03_osm.go`
daohoangson Apr 19, 2023
fe0f98c
Update tests
daohoangson Apr 19, 2023
6a9970b
Normalize spaces in regex
daohoangson Apr 19, 2023
4fa292a
Normalize history entity name before comparison
daohoangson Apr 19, 2023
d021c59
Prioritize active entity if all things equal
daohoangson Apr 19, 2023
c7a894d
Implement `generateVariations` method to support typos and friends
daohoangson Apr 19, 2023
0bdc0ea
`npm run format`
daohoangson Apr 19, 2023
741b272
Merge remote-tracking branch 'origin/demo/parser/20230419' into featu…
daohoangson Apr 19, 2023
394e359
Include tags in downloader output
daohoangson Apr 20, 2023
72f48ac
Include parent in downloader output
daohoangson Apr 20, 2023
dee7947
Implement first version of OSM split transformer in PHP
daohoangson Apr 20, 2023
5cbd532
Consider en-dash as space character
daohoangson Apr 20, 2023
70f4017
Merge remote-tracking branch 'origin/demo/parser/20230419' into featu…
daohoangson Apr 20, 2023
9afdb7d
Include similarity score
daohoangson Apr 20, 2023
cd950cd
Accept initial matches for first level only, or has previous match
daohoangson Apr 20, 2023
31d96ec
Merge remote-tracking branch 'origin/demo/parser/20230419' into featu…
daohoangson Apr 20, 2023
21198db
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Apr 20, 2023
cd59e8c
Use `working.json` to keep track of in progress state
daohoangson Apr 24, 2023
9f63e01
Fix bug incorrectly mark level=1 item as having bad name
daohoangson Apr 24, 2023
762479b
Fix bad relation without `name` tag
daohoangson Sep 15, 2023
df9039e
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Sep 16, 2023
4e3c3e6
Switch to use Bun's binary to parse full names
daohoangson Sep 16, 2023
9b67285
Print statistics at the end of osm split
daohoangson Sep 16, 2023
bf88920
Add support for local pbf file
daohoangson Sep 17, 2023
d9dce47
Optimize `parentIds` tracking:
daohoangson Sep 17, 2023
644a91e
Replace strict point equality with approximation
daohoangson Sep 17, 2023
8c50e09
Download missing ways, nodes for entities near the borders
daohoangson Sep 17, 2023
0ee159e
Improve name extraction from OSM tags
daohoangson Sep 17, 2023
8668aa2
Clean up PR
daohoangson Sep 17, 2023
ba3f344
Generate report as `osm.csv`
daohoangson Sep 17, 2023
5824e4c
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Sep 17, 2023
3855cb5
Update osm.csv with +4 successes
daohoangson Sep 17, 2023
5cf214d
Fix incorrect `$path` variable usage
daohoangson Sep 18, 2023
b68ccb3
Clean up PR
daohoangson Sep 19, 2023
1eb3164
`Việt Nam: 63+703+3439 of 11366 = 37.00%`
daohoangson Sep 20, 2023
c824dad
Add osm workflow
daohoangson Sep 21, 2023
4310f10
Make `printDot` slower
daohoangson Sep 21, 2023
6202ac7
Drop `parent` field from downloader output JSON files
daohoangson Sep 21, 2023
9dfb62e
Post GitHub comment with statistics
daohoangson Sep 21, 2023
d0f9238
Re-use parser process to improve split performance
daohoangson Sep 21, 2023
8e89929
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Sep 28, 2023
7c8ee93
Upload artifacts
daohoangson Oct 6, 2023
9eb811d
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Nov 21, 2024
59e2f81
Merge remote-tracking branch 'origin/master' into feature/osm
daohoangson Nov 21, 2024
49 changes: 49 additions & 0 deletions .github/workflows/osm.yml
@@ -0,0 +1,49 @@
name: OpenStreetMap

on:
  push:
    branches:
      - feature/osm

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@3df4ab11eba7bda6032a0b82a6bb43b11571feac # v4.0.0
        with:
          submodules: recursive

      - uses: actions/setup-go@93397bea11091df50f3d7e59dc26a7711a8bcfbe # v4.1.0
        with:
          go-version: ^1.21.0
      - run: go run ./downloader/03_osm.go ./downloader/osm

      - name: Setup Bun
        uses: oven-sh/setup-bun@a1800f471a0bc25cddac36bb13e6f436ddf341d7 # v1
      - name: Build demo/parser CLI
        run: bun install && npm run bun:build
        working-directory: demo/parser
      - run: php transformers/osm/split.php | tee split.log

      - name: Prepare GitHub comment
        id: comment
        run: printf 'BODY<<EOF\n%s\nEOF\n' "$(cat split.log)" >> $GITHUB_OUTPUT
      - name: Post GitHub comment
        uses: daohoangson/comment-on-github@35b21121fdbadf807678bec8210cdd7f22a934fe # v2.2.2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          body: |
            ${{ github.sha }}

            ```
            ${{ steps.comment.outputs.BODY }}
            ```
          fingerprint: "### Statistics"
          replace: please

      - uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32 # v3.1.3
        with:
          path: |
            data/osm.csv
            downloader/osm/working.json
2 changes: 1 addition & 1 deletion data/.gitignore
@@ -1,2 +1,2 @@
# file too big, keep the .gz version only
/osm/
/gis.json
11,366 changes: 11,366 additions & 0 deletions data/osm.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions downloader/.gitignore
@@ -0,0 +1 @@
/osm/
315 changes: 315 additions & 0 deletions downloader/03_osm.go
@@ -0,0 +1,315 @@
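package main

import (
	"context"
	"encoding/json"
	"fmt"
	"github.com/paulmach/orb"
	"github.com/paulmach/osm"
	"github.com/paulmach/osm/osmapi"
	"github.com/paulmach/osm/osmpbf"
	"io"
	"math"
	"net/http"
	"os"
	"path"
	"runtime"
	"slices"
	"strconv"
)

func main() {
	var reader io.Reader

	osmPbfPath := os.Getenv("OSM_PBF_PATH")
	if len(osmPbfPath) > 0 {
		// use a local dump when OSM_PBF_PATH is set
		file, openError := os.Open(osmPbfPath)
		if openError != nil {
			panic(fmt.Errorf("os.Open(%s): %w", osmPbfPath, openError))
		}
		defer file.Close()
		reader = file
	} else {
		url := "https://download.geofabrik.de/asia/vietnam-latest.osm.pbf"
		fmt.Printf("Downloading %s...\n", url)
		// this will take some time, there are about 40m points, 4m ways and a few thousand relations
		httpResponse, httpError := http.Get(url)
		if httpError != nil {
			panic(fmt.Errorf("http.Get(%s): %w", url, httpError))
		}
		defer httpResponse.Body.Close()
		reader = httpResponse.Body
	}
	// a failed scan leaves the maps incomplete, do not continue with partial data
	if scanError := pbf.Scan(reader); scanError != nil {
		panic(fmt.Errorf("pbf.Scan: %w", scanError))
	}

	// the output directory is the last command line argument
	writeDir := os.Args[len(os.Args)-1]
	for _, relation := range pbf.relations {
		writeError := writeRelation(writeDir, relation)
		if writeError != nil {
			fmt.Println(fmt.Errorf("relation.id=%d name=%s: %w", relation.ID, getTagValue(relation.Tags, "name"), writeError))
		}
	}
}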

func buildRelationCoordinates(relation *osm.Relation, shouldUseApi bool) (*orb.MultiPolygon, error) {
	var lines []orb.LineString
	var wayIds []osm.WayID
	for _, member := range relation.Members {
		if member.Type == osm.TypeWay && member.Role == "outer" {
			wayId := osm.WayID(member.Ref)
			if way, ok := pbf.ways[wayId]; ok {
				line := make(orb.LineString, len(way.Nodes))
				for i, wayNode := range way.Nodes {
					if wayNodePoint, ok := pbf.points[wayNode.ID]; ok {
						line[i] = wayNodePoint
					} else {
						// if the way exists in the data dump, its nodes must exist too
						panic(fmt.Errorf("relation.id=%d: way.id=%d points[%d] does not exist", relation.ID, way.ID, wayNode.ID))
					}
				}
				lines = append(lines, line)
				wayIds = append(wayIds, way.ID)
			} else {
				if !shouldUseApi {
					return nil, fmt.Errorf("relation.id=%d: ways[%d] does not exist", relation.ID, wayId)
				}

				line, apiError := buildRelationCoordinatesByWayId(relation, wayId)
				if apiError != nil {
					return nil, fmt.Errorf("relation.id=%d: way.id=%d %w", relation.ID, wayId, apiError)
				}
				lines = append(lines, *line)
				wayIds = append(wayIds, wayId)
			}
		}
	}
	return (&multiPolygonBuilder{}).loop(lines, wayIds)
}

func buildRelationCoordinatesByWayId(relation *osm.Relation, wayId osm.WayID) (*orb.LineString, error) {
	ctx := context.Background()
	fmt.Printf("Downloading way#%d for relation#%d...\n", wayId, relation.ID)
	way, wayError := osmapi.Way(ctx, wayId)
	if wayError != nil {
		return nil, fmt.Errorf("osmapi.Way(%d): %w", wayId, wayError)
	}

	nodeIds := make([]osm.NodeID, len(way.Nodes))
	for i, wayNode := range way.Nodes {
		nodeIds[i] = wayNode.ID
	}
	line := make(orb.LineString, len(nodeIds))
	lineError := buildRelationCoordinatesByNodeIds(ctx, line, nodeIds, 0)
	if lineError != nil {
		return nil, lineError
	}
	return &line, nil
}

func buildRelationCoordinatesByNodeIds(ctx context.Context, line orb.LineString, nodeIds []osm.NodeID, offset int) error {
	// maximum URI length is about 2k, node ids are about 10 characters on average
	// that means we can fit about 200 ids per request, let's stay below that to be safe
	const maxNodeIds = 150
	if len(nodeIds) > maxNodeIds {
		firstError := buildRelationCoordinatesByNodeIds(ctx, line, nodeIds[:maxNodeIds], offset)
		if firstError != nil {
			return firstError
		}
		secondError := buildRelationCoordinatesByNodeIds(ctx, line, nodeIds[maxNodeIds:], offset+maxNodeIds)
		if secondError != nil {
			return secondError
		}
		return nil
	}

	nodes, nodesError := osmapi.Nodes(ctx, nodeIds)
	if nodesError != nil {
		return fmt.Errorf("osmapi.Nodes(%v): %w", nodeIds, nodesError)
	}

	for i, nodeId := range nodeIds {
		// nested loop because the API doesn't return nodes in the requested order
		for _, node := range nodes {
			if node.ID == nodeId {
				line[offset+i] = node.Point()
			}
		}
	}

	return nil
}

func getParentsAndSelf(relationId osm.RelationID) (result []string) {
	if parentId, ok := pbf.parentIds[int64(relationId)]; ok {
		result = append(result, getParentsAndSelf(parentId)...)
	}
	return append(result, strconv.FormatInt(int64(relationId), 10))
}

func getTagValue(tags osm.Tags, key string) string {
	tag := tags.FindTag(key)
	if tag == nil {
		return ""
	}
	return tag.Value
}

func pointsApproxEquals(a orb.Point, b orb.Point) bool {
	// we have two data sources: pbf file and live API
	// because of floating point inconsistency, strict equality may fail unexpectedly
	const epsilon = .00000001
	if math.Abs(a.Lat()-b.Lat()) > epsilon {
		return false
	}
	if math.Abs(a.Lon()-b.Lon()) > epsilon {
		return false
	}
	return true
}

func writeRelation(dir string, relation *osm.Relation) error {
	ids := getParentsAndSelf(relation.ID)
	outputPath := fmt.Sprintf("%s.json", path.Join(dir, path.Join(ids...)))
	_, statError := os.Stat(outputPath)
	if statError == nil {
		// file already exists
		return nil
	}

	vietnamId := "49915"
	isPartOfVietnam := slices.Index(ids, vietnamId) > -1
	isVietnam := ids[len(ids)-1] == vietnamId
	shouldUseApi := isPartOfVietnam && !isVietnam // do not fetch the country's coordinates, it's huge
	coordinates, buildError := buildRelationCoordinates(relation, shouldUseApi)
	if buildError != nil {
		return buildError
	}

	bbox := coordinates.Bound()
	output := map[string]interface{}{
		"bbox":        []float64{bbox.Left(), bbox.Bottom(), bbox.Right(), bbox.Top()},
		"coordinates": coordinates,
		"id":          relation.ID,
		"tags":        relation.Tags,
		"type":        coordinates.GeoJSONType(),
	}

	if mkdirError := os.MkdirAll(path.Dir(outputPath), 0755); mkdirError != nil {
		return fmt.Errorf("os.MkdirAll(%s): %w", path.Dir(outputPath), mkdirError)
	}

	outputBytes, marshalError := json.Marshal(output)
	if marshalError != nil {
		return fmt.Errorf("json.Marshal: %w", marshalError)
	}
	writeError := os.WriteFile(outputPath, outputBytes, 0644)
	if writeError != nil {
		panic(fmt.Errorf("os.WriteFile(%s): %w", outputPath, writeError)) // probably something serious, quit asap
	}
	return nil
}

var pbf = &pbfScanner{
	points:    make(map[osm.NodeID]orb.Point, 40000000),
	relations: make(map[osm.RelationID]*osm.Relation, 1000),
	ways:      make(map[osm.WayID]*osm.Way, 4000000),

	parentIds: make(map[int64]osm.RelationID),
}

type pbfScanner struct {
	points    map[osm.NodeID]orb.Point
	relations map[osm.RelationID]*osm.Relation
	ways      map[osm.WayID]*osm.Way

	parentIds map[int64]osm.RelationID
	i         int
}

func (s *pbfScanner) Scan(r io.Reader) error {
	scanner := osmpbf.New(context.Background(), r, runtime.NumCPU())

	for scanner.Scan() {
		s.i++
		o := scanner.Object()
		if node, ok := o.(*osm.Node); ok {
			s.points[node.ID] = node.Point()
			s.printDot("len(points)=%d", len(s.points))
		} else if relation, ok := o.(*osm.Relation); ok {
			s.appendRelationIfTypeBoundaryAdministrative(relation)
			s.printDot("len(relations)=%d", len(s.relations))
		} else if way, ok := o.(*osm.Way); ok {
			s.ways[way.ID] = way
			s.printDot("len(ways)=%d", len(s.ways))
		}
	}

	return scanner.Err()
}

func (s *pbfScanner) appendRelationIfTypeBoundaryAdministrative(relation *osm.Relation) {
	for _, member := range relation.Members {
		if member.Type == osm.TypeRelation {
			s.parentIds[member.Ref] = relation.ID
		}
	}

	if getTagValue(relation.Tags, "type") == "boundary" &&
		getTagValue(relation.Tags, "boundary") == "administrative" {
		s.relations[relation.ID] = relation
	}
}

func (s *pbfScanner) printDot(format string, a ...any) {
	if s.i%50000 == 0 {
		fmt.Printf("%d: %s\n", s.i, fmt.Sprintf(format, a...))
	} else if s.i%1000 == 0 {
		fmt.Print(".")
	}
}

type multiPolygonBuilder struct {
	lines    int
	position [][2]int // position[lineId] = [ringId, positionInRing]
	rings    int
}

func (b *multiPolygonBuilder) loop(lines []orb.LineString, wayIds []osm.WayID) (*orb.MultiPolygon, error) {
	lineCount := len(lines)
	b.position = make([][2]int, lineCount)
	b.lines = 0

	var output orb.MultiPolygon
	var ring orb.Ring
	b.rings = 0

	for b.lines < lineCount {
		linesBefore := b.lines
		for lineId, line := range lines {
			// positionInRing is 0 only for lines that have not been used yet
			if b.position[lineId][1] == 0 {
				if ring == nil {
					ring = orb.Ring(line)
					b.rings++
					b.success(lineId)
				} else {
					ringLast := ring[len(ring)-1]
					if pointsApproxEquals(line[0], ringLast) {
						ring = append(ring, line...) // good ordering, just append the points
						b.success(lineId)
					} else if pointsApproxEquals(line[len(line)-1], ringLast) {
						for j := len(line) - 1; j >= 0; j-- {
							ring = append(ring, line[j]) // out of order, reverse before appending...
						}
						b.success(lineId)
					}
				}
			}
		}

		// a full pass without using any line means the remaining lines cannot be connected
		if b.lines == linesBefore {
			return nil, fmt.Errorf("dead loop detected wayIds=%v position=%v", wayIds, b.position)
		}

		if pointsApproxEquals(ring[0], ring[len(ring)-1]) {
			output = append(output, orb.Polygon{ring}) // ring is closed, that's great!
			ring = nil
		}
	}

	if ring != nil {
		return nil, fmt.Errorf("ring is not closed wayIds=%v position=%v", wayIds, b.position)
	}

	return &output, nil
}

func (b *multiPolygonBuilder) success(lineId int) {
	b.lines++
	b.position[lineId] = [2]int{b.rings, b.lines}
}
7 changes: 7 additions & 0 deletions downloader/download.sh
@@ -3,10 +3,12 @@
set -e

_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
cd "$_dir/.."

_gsoPath='data/dvhcvn.json'
_gsoDatePath='data/date.txt'
_gisPath='data/gis.json'
_osmPath='downloader/osm'

# 1256/NQ-UBTVQH15
_date=01/01/2025
@@ -21,3 +23,8 @@ if [ ! -f $_gisPath ]; then
echo "Generating $_gisPath..."
php "$_dir/02_gis.chinhphu.vn.php" <$_gsoPath >$_gisPath
fi

if [ ! -d $_osmPath ]; then
echo "Generating $_osmPath..."
go run "$_dir/03_osm.go" $_osmPath
fi