Some thoughts about the future of this project #53

giggls · 2020-05-25T21:36:47Z

While this project has worked reasonably well for producing raster tiles on the German tile server in recent years, and has also been used by OpenMapTiles, the current approach has recently shown its limitations.

1st Issue

The project is currently using three FOSS transcription libraries. These are:

ICU for general purpose transcription, http://site.icu-project.org/, written in C++
Kakasi for Kanji transcription, http://kakasi.namazu.org/, written in C
tltk for Thai language, written in Python

Back in November, issue #35 was filed, and another library, pinyin_jyutping_sentence (written in Python), was found for this purpose.

As we had already noticed with tltk, a general problem of Python stored procedures seems to be that imported modules do not persist between connections in PostgreSQL. Thus imports which take a long time (15 seconds for pinyin_jyutping_sentence on my desktop computer and 5 seconds for tltk) are repeated every time a new connection is made to the database, not just once.

2nd Issue

Also back in November, @chatelao added a couple more regular expressions for languages for which we did not yet have code to generate street abbreviations. Unfortunately, this code also slowed down the execution time of the functions by an order of magnitude, as @otbutz reported in issue #40.
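As a rough illustration of what such abbreviation rules look like outside of PostgreSQL, here is a minimal Python sketch. The rule table and the function name `abbreviate` are hypothetical and not taken from the project's code; the sketch assumes the language of a name is known, so that only the rules for that language are applied instead of every rule for every language.

```python
import re

# Hypothetical rules for illustration only; not the project's actual rule set.
# Each pattern is compiled once at import time, not once per name.
ABBREVIATIONS = {
    "de": [
        (re.compile(r"Straße\b"), "Str."),
        (re.compile(r"straße\b"), "str."),
        (re.compile(r"Platz\b"), "Pl."),
    ],
    "en": [
        (re.compile(r"\bStreet\b"), "St."),
        (re.compile(r"\bAvenue\b"), "Ave."),
    ],
}


def abbreviate(name, lang):
    """Apply only the rules registered for `lang`; unknown languages pass through."""
    for pattern, replacement in ABBREVIATIONS.get(lang, []):
        name = pattern.sub(replacement, name)
    return name


if __name__ == "__main__":
    print(abbreviate("Kaiserstraße", "de"))
    print(abbreviate("Baker Street", "en"))
```

Whether restricting the rules per language is enough to recover the lost order of magnitude has not been measured here; the sketch only shows the shape of the data.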

Conclusion

Given that importing OSM data into PostgreSQL is in practice done only by osm2pgsql or imposm, those tools might be a better place to do localization. Currently (AFAIK) only osm2pgsql can do tag transformations at the import stage, but until recently it had a very static table setup for the target database. Fortunately, this just changed with the advent of its new flex backend.

This is basically why I think we should move the l10n step to the import stage instead of keeping it as stored procedures in PostgreSQL.

Another thing is the actual implementation of Latin transcription. I have a proof-of-concept implementation, available at https://github.com/giggls/osml10n, which does this while keeping the approach of the current project for now.

The idea is to have an external daemon which does the actual transcription while either a script run during import time from osm2pgsql or a replacement for the current osml10n_cc_translit function will connect to this daemon to get a transliterated version of a given string.

This will at least resolve the problem of slow transcription procedures written in Python. However, the regular expression problem might only be resolvable by moving this stuff into a (Lua) script run by osm2pgsql.
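To make the daemon idea more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the HTTP/JSON protocol, the names `toy_transcribe`, `start_daemon` and `transcribe`, and above all the "transcription" itself, which merely strips combining marks and stands in for the real ICU, Kakasi or tltk calls. It is not the protocol used by the proof of concept.

```python
import http.client
import json
import threading
import unicodedata
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_transcribe(text):
    """Stand-in for real transcription: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and answer with the transcribed string.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length).decode("utf-8"))
        body = json.dumps({"result": toy_transcribe(payload["text"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_daemon(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), TranscribeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def transcribe(text, port, host="127.0.0.1"):
    """Client side: what an import script or SQL wrapper would call."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(
            "POST",
            "/",
            body=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read().decode("utf-8"))["result"]
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_daemon()
    print(transcribe("Crème brûlée", server.server_address[1]))  # Creme brulee
    server.shutdown()
    server.server_close()
```

The point of the design is that the expensive imports happen once, when the daemon starts, and every database connection or import run only pays for a short local request.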

I am looking forward to your comments.

Badg · 2020-05-28T02:29:20Z

More fuel to the fire: you can't add custom extensions with C code to RDS, so if you want to have AWS manage your database, you're SOL when it comes to OpenMapTiles. For the raw SQL functions, it's likely possible to just add them manually (I haven't tried this yet), but not everything here is raw SQL.

I think it's a fantastic idea to move as much import and ETL logic as possible out of postgres and into the ETL pipeline! That gives you much, much more flexibility when actually trying to deploy something, as opposed to local development, research, etc.

chatelao · 2020-05-28T05:02:50Z

Sounds reasonable to precalculate the results:

In which fields would you store the long/short names?
In which language exactly is the "flex" backend written?

A long time ago, I intended to write the rules themselves language-independently as regexps and transpose them with https://en.m.wikipedia.org/wiki/Mustache_(template_system) to any regexp-capable language (yes, I know the practical issues you found later 😳)

https://github.com/chatelao/osm-abbrev#about-the--build

giggls · 2020-05-28T07:01:30Z

@chatelao Unfortunately, the code has to be rewritten in Lua to be able to move it into osm2pgsql.

chatelao · 2020-05-29T13:56:04Z

So let's learn a new language 😊

chatelao · 2020-05-29T14:01:06Z

Oops, no (normal) regexp support:

giggls · 2020-05-29T14:05:44Z

There are lrexlib and Lua-cURL. I think we will need both.

chatelao · 2020-05-29T14:44:13Z

Okay, why do we need curl?

giggls · 2020-05-29T14:50:57Z

See https://github.com/giggls/osml10n
I currently think that we will have an external daemon for transcription written in Python. I already have an implementation of this part which is even usable from the existing code.

While in theory this could also be done in Lua, I am not even thinking about re-implementing the Kakasi, Cantonese and Thai transcription in Lua for now.

otbutz · 2020-06-03T10:02:40Z

As we had already noticed with tltk, a general problem of Python stored procedures seems to be that imported modules do not persist between connections in PostgreSQL. Thus imports which take a long time (15 seconds for pinyin_jyutping_sentence on my desktop computer and 5 seconds for tltk) are repeated every time a new connection is made to the database, not just once.

Isn't it possible to defer module loading in Python? PL/Python allows sharing resources via the global SD and GD dictionaries. Maybe this can be used to work around the costly module initialization?

The idea is to have an external daemon which does the actual transcription while either a script run during import time from osm2pgsql or a replacement for the current osml10n_cc_translit function will connect to this daemon to get a transliterated version of a given string.

An external daemon sounds like a good idea too.

giggls · 2020-06-03T10:18:42Z

According to the PostgreSQL documentation, even GD is only available within a session:

The global dictionary GD is public data, that is available to all Python functions within a session

Try this code:

CREATE OR REPLACE FUNCTION pyfunc() RETURNS float AS $$
  import time

  # Measure only the time spent importing the heavy module
  start = time.time()
  import pinyin_jyutping_sentence
  end = time.time()
  return end - start
$$ LANGUAGE plpython3u STABLE;

sven=# select pyfunc();
      pyfunc      
------------------
 19.4291229248047
(1 row)

sven=# select pyfunc();
        pyfunc        
----------------------
 4.76837158203125e-06
(1 row)

sven=#

If psql is terminated, the first call in the next session will be slow again and subsequent calls fast, even without using SD or GD.
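This matches how Python itself caches imports: a module that has been imported once stays in `sys.modules` for the lifetime of the interpreter, and each PostgreSQL connection is served by its own backend process with its own interpreter. The following standalone sketch shows the cache at work outside the database; `timed_import` is a hypothetical helper, and the standard-library module `decimal` merely stands in for a heavy module such as pinyin_jyutping_sentence.

```python
import importlib
import sys
import time


def timed_import(name):
    """Import `name`; return (seconds taken, whether it was already cached)."""
    cached = name in sys.modules
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start, cached


# Stand-in for a heavy module; any module illustrates the cache.
MODULE = "decimal"

if __name__ == "__main__":
    first = timed_import(MODULE)
    second = timed_import(MODULE)
    print("first call: ", first)
    print("second call:", second)
```

On the second call the module is found in `sys.modules` and the import returns almost immediately. A new process starts with an empty cache, which is why a new connection pays the full import cost again.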

giggls · 2020-07-16T15:20:01Z

To follow up on this discussion: a proof-of-concept implementation of the new approach is now available at https://github.com/giggls/osml10n. Unfortunately, there is a missing feature in the osm2pgsql flex backend: it does not give me bounding boxes of relations.
I consider this a showstopper for now because it prevents country-specific transcriptions on those objects.

otbutz · 2020-07-17T07:18:49Z

Looks good. Some early benchmark numbers?

otbutz · 2022-04-25T06:31:13Z

@giggls bbox support for relations got merged into osm2pgsql. See osm2pgsql-dev/osm2pgsql#1284

giggls mentioned this issue Jul 16, 2020

Performace of osml10n_get_name_without_brackets_from_tags #54

Closed

otbutz mentioned this issue Jul 28, 2020

Replace normalization module with frontend-side normalization via libICU osm-search/Nominatim#1892

Closed

otbutz mentioned this issue Oct 7, 2020

Flex backend: Support get_bbox() for relations osm2pgsql-dev/osm2pgsql#1284

Closed
