Querying a predicate based on phrase_from_file is slow #1403

infogulch · 2022-04-07T06:01:22Z

infogulch
Apr 7, 2022

In A Tour of Prolog triska demos prolog with a neat example of the list of unicode codepoints (1.8MB file). The video demonstrates editing the csv file with emacs to create a text prolog database that can be loaded directly, but I was curious if you could get the same results by leaving the file unedited and defining a general predicate that is true when it unifies with the data in the file after parsing.

code_name(Codepoint, Codename) :-
    phrase_from_file(parse_csv(frame(_,Rows), [with_header(false), token_separator(';')]), "UnicodeData.txt"),
    member(Row, Rows),
    Row = [Codepoint, Codename | _].

It's correct enough to query, but unfortunately it is very slow. For example, code_name(20, Name). takes several seconds to complete, and the general query code_name(Code, Name). doesn't show any results at all after several minutes. Tabling didn't seem to work. I wouldn't mind an initial load time, but I'd like to load/cache all the results into the internal database as its parsed the first time -- as if I typed out the database and loaded it. I'm likely missing something big; thoughts?

Another perhaps separate issue is that it parses numbers by default so the hex codes of the first field are a mix between 4-digit hex strings and hex numbers that happen to only have decimal digits interpreted as base-10 integers. Maybe we could add an option to parse_csv to disable automatic number parsing?

infogulch · 2022-04-07T07:03:11Z

infogulch
Apr 7, 2022
Author

A version for SWI does pretty good.

code_name(Codepoint, Codename) :-
    phrase_from_file(csv(HeaderRows, [separator(0';), convert(false)]), "UnicodeData.txt"),
    HeaderRows = [_Header | Rows],
    member(Row, Rows),
    Row =.. [_, Codepoint, Codename | _].

?- time(code_name('8989983',B)).
% 8,112,090 inferences, 0.739 CPU in 0.739 seconds (100% CPU, 10980436 Lips)
false.
?- time(code_name(A,B)).
% 8,042,839 inferences, 0.789 CPU in 0.789 seconds (100% CPU, 10196906 Lips)
A = '0001',
B = '<control>' .
?- time(code_name('0020',B)).
% 8,042,901 inferences, 0.737 CPU in 0.737 seconds (100% CPU, 10914329 Lips)
B = 'SPACE' .
?- time(code_name('10020',B)).
% 8,076,679 inferences, 0.794 CPU in 0.794 seconds (100% CPU, 10170453 Lips)
B = 'LINEAR B SYLLABLE B039 PI' .

For comparison, a modified version of code_name/2 above for scryer with cut (!) after phrase_from_file does complete:

?- time(code_name(P, N)).
   % CPU time: 18.855s
   P = 0, N = "<control>"
?- time(code_name(20, N)).
   % CPU time: 19.379s
   N = "SPACE"
?- time(code_name("10020",B)).
   % CPU time: 19.082s
   false.

2 replies

UWN Apr 7, 2022

Please try to narrow down the source of the difference. (And keep in mind that just the time to get pio into SWI was longer than the time Scryer exists...)

infogulch Apr 7, 2022
Author

I tried with a file with only one row and both are virtually instant, so the difference (unsurprisingly) comes from phrase. I'm not sure how else to debug DCGs but I'll keep looking; I noticed that time/1 doesn't output inference counts. Tested on versions 01a9fd9e (3 days ago) and 4a6ffb6b (3 hours ago) with the same results for both.

Yes it's great that SWI is a worthy competitor. I hope that both prolog implementations and others can grow and mutually refine each other.

infogulch · 2022-04-10T07:23:38Z

infogulch
Apr 10, 2022
Author

This is a somewhat minimal DCG that can parse this particular "csv" file exactly once:

% :- Csv = Rows, phrase(rows(Rows), "a;b\nc;d"), Rows = [["a","b"],["c","d"]].

% util
peek(T), [T] --> [T].

% eof is only true at the end of the string, but does not consume anything.
% A sprinkle of state may be needed to avoid matching multiple times.
eof_([], []).
eof --> call(eof_).

% csv rules
rows([])       --> rows_end.
rows([R | Rs]) --> row(R),rows(Rs).

rows_end --> [eof], eof. % --> [eof-the-atom], eof-the-dcgrule

row([F])      --> field(F),row_end.
row([F | Fs]) --> field(F),row(Fs).

row_end        --> [eol], peek(_).
row_end, [eof] --> [eol], eof.

field([])       --> field_end.
field([C | Cs]) --> [C], {maplist(dif(C), [eof,eol | "\n;"])}, field(Cs).

field_end        --> ";".
field_end, [eol] --> "\n".
field_end, [eol] --> eof.

To use, query phrase_from_file(rows(Rows), "UnicodeData.txt"). which takes about 120s in scryer-prolog on my machine. The file has 34k lines and is 1.8MB, a good candidate for initial benchmarking I think. (To orient with the computational limits here, xsv can compute stats on each column with xsv stats -nd ';' UnicodeData.txt in 30ms.)

There are lots of csv features that are missing from this (custom separator, header options, quoted fields, fixed column count, whitespace trimming, etc.), but this particular file doesn't need them so they're omitted.

I tried to define each DCG predicate so that there cannot be more than one valid result and thus could be executed without any backtracking. I think the necessary rules are there because I haven't found a case where requesting the next case succeeds, but it definitely backtracks because requesting the next case when parsing the whole file spins up the cpu for 20s before returning false. TL;DR: It's nondet, but imo it could be det theoretically.

?- time(phrase_from_file(rows(Rows), "UnicodeData.txt")).
   % CPU time: 112.953s
   Rows = [["0000","<control>","Cc","0","BN",[],[],[],[],"N","NULL",[],[],[],[]],...]
;  % CPU time: 21.033s
   false.

I haven't tuned conjunction or predicate order for performance at all. But if I turn this into a predicate and table it:

:- table data/1.
data(Rows) :- phrase_from_file(rows(Rows), "UnicodeData.txt").

... then querying the first time takes longer but checking for alternate cases is much faster and subsequent queries only take 10s:

?- time(data(Rows)).
   % CPU time: 181.038s
   Rows = [["0000","<control>","Cc","0","BN",[],[],[],[],"N","NULL",[],[],[],[]],...]
;  % CPU time: 0.253s
   false.
?- time(data(Rows)).
   % CPU time: 9.696s
   Rows = [["0000","<control>","Cc","0","BN",[],[],[],[],"N","NULL",[],[],[],[]],...]
;  % CPU time: 0.240s
   false.

Is there a better way to debug DCGs? Maybe need #321 to get more detailed stats.

I marked all the DCGs with dynamic, so I can get the expanded prolog terms with listing/1, which are interesting:

rows([],A,B) :-
   A=[eof|C],
   eof(C,B).
rows([A|B],C,D) :-
   row(A,C,E),
   rows(B,E,D).
   true.

row([A],B,C) :-
   field(A,B,D),
   row_end(D,C).
row([A|B],C,D) :-
   field(A,C,E),
   row(B,E,D).
   true.

row_end(A,B) :-
   A=[eol|C],
   peek(D,C,B).
row_end(A,B) :-
   A=[eol|C],
   eof(C,D),
   B=[eof|D].
   true.

field([],A,B) :-
   field_end(A,B).
field([A|B],C,D) :-
   C=[A|E],
   atom_length(A,1),
   dif(A,'\n'),
   dif(A,;),
   E=F,
   field(B,F,D).
   true.

field_end(A,B) :-
   A=[;|B].
field_end(A,B) :-
   A=['\n'|C],
   B=[eol|C].
field_end(A,B) :-
   eof(A,C),
   B=[eol|C].
   true.

peek(A,B,C) :-
   B=[A|D],
   C=[A|D].
   true.

eof(A,B) :-
   call(user:eof_,A,B).
   true.

8 replies

UWN Apr 13, 2022

Not that I am in any circumstance endorse the usage of codes but to make such code a bit more portable between chars (recommended) and codes, one can write [C] = "\n" in place of C = 0'\n or C = '\n'.

infogulch Apr 13, 2022
Author

Hm, I tried it like field_end --> {[C] = ";"},[C]. and field([C | Cs]) --> [C], {maplist(dif(C), [eof,eol|"\n;"])}, field(Cs). which I think matches your suggestion, but it doesn't parse correctly in swi with chars after using either of these forms.

Separately, I was trying to add 'tests' for edge cases by adding statements like: :- phrase(rows([["a"],["b"]]), "a\nb"). but this throws error(syntax_error(inconsistent_entry),load/1) when I load the module. Is this invalid? Is there some other way to do it?

UWN Apr 13, 2022

doesn't parse correctly in swi with chars

Right, SWI only supports codes.

infogulch Apr 13, 2022
Author

Oh I see, you're saying if I use those forms then I have to set :- set_prolog_flag(double_quotes, codes). (or leave it default). Updated.

infogulch Apr 13, 2022
Author

Thoughts on :- top_level_test(...).?

triska · 2022-04-10T16:35:46Z

triska
Apr 10, 2022

I only have a very tiny comment, independent of the actual main issue, regarding the line:

field([C | Cs]) --> [C], {atom_length(C,1),dif(C, '\n'),dif(C, ';')}, field(Cs). % this is awkward

This can be written more compactly using double quotes as:

field([C|Cs]) --> [C], { maplist(dif(C), "\n;") }, field(Cs).

Note that atom_length/2 is not necessary, because C is a character when using the DCG for parsing text from a file.

3 replies

infogulch Apr 10, 2022
Author

Thanks that looks better. atom_length is there to prevent parsing the eol/eof states which are intended to force it to make progress.

triska Apr 10, 2022

Thank you, I see. Maybe eof and eol need not be kept around explicitly? It suffices to detect this case when parsing maybe.

infogulch Apr 10, 2022
Author

I tried without the explicit states, but if row_end and field_end simply match at eof then they can always match again, thus losing monotonicity and producing an unbounded suffix of empty rows. I'm not sure how else to prevent this.

infogulch · 2022-05-16T18:49:04Z

infogulch
May 16, 2022
Author

With the recent changes to streams and dcgs I thought I'd test this again. To my surprise the "minimial DCG" I posted doesn't parse UnicodeData.txt at all anymore. In particular it can no longer parse this line (line 42 in the original file):

phrase(rows(R), "0029;RIGHT PARENTHESIS;Pe;0;ON;;;;;Y;CLOSING PARENTHESIS;;;;").

It seems there are other problematic lines as well. I'll try to reduce the test case further.

1 reply

triska May 16, 2022

Thank you a lot for tracking this down further, your test cases already led to very important corrections!

infogulch · 2022-07-22T03:25:53Z

infogulch
Jul 22, 2022
Author

I tried again with rebis-dev mark 3 (built from commit f5ca015), and phrase_from_file(rows(Rows), "UnicodeData.txt"). now runs in ~18s on my machine. A 7x improvement over the previous 120s, wow! Also the previous issue of being unable to parse some rows has disappeared.

I expect that there's a lot of remaining performance that can still be squeezed out of the scryer engine given its strings implementation, but I'm impressed by the progress!

Testing methodology

Commands run:

$ cargo --version
cargo 1.62.1 (a748cf5a3 2022-06-08)
$ cargo install --git https://github.com/mthom/scryer-prolog --branch rebis-dev
$ scryer-prolog -v
"f5ca0156"
$ scryer-prolog -f -g 'use_module(unicode),time(phrase_from_file(rows(Rows), "UnicodeData.txt")),halt.'
% CPU time: 18.939s

File unicode.pl:

:- use_module(library(dcgs)).
:- use_module(library(pio)).
:- use_module(library(lists)).
:- use_module(library(time)).
:- use_module(library(dif)).

peek(T), [T] --> [T].

% eof describes end of the string, and does not consume anything.
% Therefore, a sprinkle of state may be needed to avoid matching multiple times.
eof_([], []).
eof --> call(eof_).

% csv rules
rows([])       --> rows_end.
rows([R | Rs]) --> row(R),rows(Rs).

rows_end --> [eof], eof. % --> [eof-the-atom], eof-the-dcgrule

row([F])      --> field(F),row_end.
row([F | Fs]) --> field(F),row(Fs).

row_end        --> [eol], peek(_).
row_end, [eof] --> [eol], eof.

field([])       --> field_end.
field([C | Cs]) --> [C], {maplist(dif(C), [eof,eol | "\n;"])}, field(Cs).

field_end        --> ";".
field_end, [eol] --> "\n".
field_end, [eol] --> eof.

File UnicodeData.txt downloaded from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

1 reply

infogulch Jul 22, 2022
Author

After looking at this again, it would be super nice to have a dcg built-in that matches at the end of the file exactly once, along with a counterpart that only matches when its not the end of the file. If these existed, I could strip out eof//0, eof_/1, peek//1 and all the nonsense with the made up eol/eof atoms.

Explicitly matching the end of the string is an important enough use-case to justify built-ins imo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Querying a predicate based on phrase_from_file is slow #1403

{{title}}

Replies: 5 comments 15 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Querying a predicate based on phrase_from_file is slow #1403

infogulch Apr 7, 2022

Replies: 5 comments · 15 replies

infogulch Apr 7, 2022 Author

UWN Apr 7, 2022

infogulch Apr 7, 2022 Author

infogulch Apr 10, 2022 Author

UWN Apr 13, 2022

infogulch Apr 13, 2022 Author

UWN Apr 13, 2022

infogulch Apr 13, 2022 Author

infogulch Apr 13, 2022 Author

triska Apr 10, 2022

infogulch Apr 10, 2022 Author

triska Apr 10, 2022

infogulch Apr 10, 2022 Author

infogulch May 16, 2022 Author

triska May 16, 2022

infogulch Jul 22, 2022 Author

infogulch Jul 22, 2022 Author

infogulch
Apr 7, 2022

Replies: 5 comments 15 replies

infogulch
Apr 7, 2022
Author

infogulch Apr 7, 2022
Author

infogulch
Apr 10, 2022
Author

infogulch Apr 13, 2022
Author

infogulch Apr 13, 2022
Author

infogulch Apr 13, 2022
Author

triska
Apr 10, 2022

infogulch Apr 10, 2022
Author

infogulch Apr 10, 2022
Author

infogulch
May 16, 2022
Author

infogulch
Jul 22, 2022
Author

infogulch Jul 22, 2022
Author