Replies: 5 comments 15 replies
-
A version for SWI does pretty good:

```prolog
code_name(Codepoint, Codename) :-
    phrase_from_file(csv(HeaderRows, [separator(0';), convert(false)]), "UnicodeData.txt"),
    HeaderRows = [_Header | Rows],
    member(Row, Rows),
    Row =.. [_, Codepoint, Codename | _].
```
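For example (my own illustrative query, assuming `UnicodeData.txt` is in the working directory; with `convert(false)` the fields come back as atoms):

```prolog
?- code_name('0041', Codename).
```

This should bind `Codename` to the name atom for that row, something like `'LATIN CAPITAL LETTER A'`.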
For comparison, a modified version of
-
This is a somewhat minimal DCG that can parse this particular "csv" file exactly once:

```prolog
% :- Csv = Rows, phrase(rows(Rows), "a;b\nc;d"), Rows = [["a","b"],["c","d"]].

% util
peek(T), [T] --> [T].

% eof is only true at the end of the string, but does not consume anything.
% A sprinkle of state may be needed to avoid matching multiple times.
eof_([], []).
eof --> call(eof_).

% csv rules
rows([]) --> rows_end.
rows([R | Rs]) --> row(R), rows(Rs).
rows_end --> [eof], eof. % --> [eof-the-atom], eof-the-dcg-rule

row([F]) --> field(F), row_end.
row([F | Fs]) --> field(F), row(Fs).
row_end --> [eol], peek(_).
row_end, [eof] --> [eol], eof.

field([]) --> field_end.
field([C | Cs]) --> [C], {maplist(dif(C), [eof,eol | "\n;"])}, field(Cs).
field_end --> ";".
field_end, [eol] --> "\n".
field_end, [eol] --> eof.
```

To use, query `rows//1` with `phrase_from_file/2`.

There are lots of csv features that are missing from this (custom separator, header options, quoted fields, fixed column count, whitespace trimming, etc.), but this particular file doesn't need them, so they're omitted.

I tried to define each DCG predicate so that there cannot be more than one valid result, and thus it could be executed without any backtracking. I think the necessary rules are there, because I haven't found a case where requesting the next solution succeeds; but it definitely backtracks, because requesting the next solution after parsing the whole file spins up the CPU for 20s before returning false.

TL;DR: it's nondet, but IMO it could theoretically be det.
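As an aside, the semicontext ("pushback") notation used by `peek//1`, `row_end//0`, and `field_end//0` above can be seen in isolation. This is just the `peek//1` rule from the code with a small query of my own (assuming the default `double_quotes` value of `chars`):

```prolog
% Lookahead without consumption: consume T, then push it back onto the stream.
peek(T), [T] --> [T].

% ?- phrase(peek(C), "abc", Rest).
%    C = a, Rest = "abc".
```

The terminal after the comma in the head is re-inserted into the input, which is how `field_end` can hand an `eol` token to `row_end` without either rule consuming the other's input.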
I haven't tuned conjunction or predicate order for performance at all. But if I turn this into a predicate and table it:

...

then querying the first time takes longer, but checking for alternate cases is much faster, and subsequent queries take only 10s:
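The tabled predicate itself is elided above; one plausible shape (hypothetical — the predicate name `file_rows/1` and the use of Scryer's `library(tabling)` are my assumptions, not the poster's actual code) would be:

```prolog
:- use_module(library(tabling)).  % assumption: SLG tabling is available

:- table file_rows/1.

% Only the first call pays the parsing cost; later calls reuse the table.
file_rows(Rows) :-
    phrase_from_file(rows(Rows), "UnicodeData.txt").
```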
Is there a better way to debug DCGs? Maybe #321 is needed to get more detailed stats. I marked all the DCGs with `dynamic`, so I can get the expanded Prolog terms with `listing/1`, which are interesting:

```prolog
rows([],A,B) :-
   A=[eof|C],
   eof(C,B).
rows([A|B],C,D) :-
   row(A,C,E),
   rows(B,E,D).

row([A],B,C) :-
   field(A,B,D),
   row_end(D,C).
row([A|B],C,D) :-
   field(A,C,E),
   row(B,E,D).

row_end(A,B) :-
   A=[eol|C],
   peek(D,C,B).
row_end(A,B) :-
   A=[eol|C],
   eof(C,D),
   B=[eof|D].

field([],A,B) :-
   field_end(A,B).
field([A|B],C,D) :-
   C=[A|E],
   atom_length(A,1),
   dif(A,'\n'),
   dif(A,;),
   E=F,
   field(B,F,D).

field_end(A,B) :-
   A=[;|B].
field_end(A,B) :-
   A=['\n'|C],
   B=[eol|C].
field_end(A,B) :-
   eof(A,C),
   B=[eol|C].

peek(A,B,C) :-
   B=[A|D],
   C=[A|D].

eof(A,B) :-
   call(user:eof_,A,B).
```
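For readers unfamiliar with the expansion shown in the listing: each `-->` rule gains two extra arguments that thread the remaining input through as a difference list. A minimal sketch (my own toy example, not taken from the listing):

```prolog
% greeting --> "hi".
% expands, roughly, to a plain predicate over the input list:
greeting(S0, S) :- S0 = [h, i | S].

% ?- greeting([h,i,'!'], Rest).
%    Rest = ['!'].
```

Pushback rules like `row_end, [eof] --> [eol], eof.` differ only in that the head's output list is built by re-consing the pushed-back terminal, as seen in the second `row_end/2` clause above.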
-
I only have a very tiny comment, independent of the actual main issue, regarding the line:

```prolog
field([C | Cs]) --> [C], {atom_length(C,1), dif(C, '\n'), dif(C, ';')}, field(Cs). % this is awkward
```

This can be written more compactly using double quotes as:

```prolog
field([C|Cs]) --> [C], { maplist(dif(C), "\n;") }, field(Cs).
```

Note that
-
With the recent changes to streams and dcgs I thought I'd test this again. To my surprise, the "minimal DCG" I posted doesn't parse
It seems there are other problematic lines as well. I'll try to reduce the test case further.
-
I tried again with rebis-dev mark 3 (built from commit f5ca015). I expect there's a lot of remaining performance that can still be squeezed out of the scryer engine given its strings implementation, but I'm impressed by the progress!

**Testing methodology**

Commands run:

```
$ cargo --version
cargo 1.62.1 (a748cf5a3 2022-06-08)
$ cargo install --git https://github.com/mthom/scryer-prolog --branch rebis-dev
$ scryer-prolog -v
"f5ca0156"
$ scryer-prolog -f -g 'use_module(unicode),time(phrase_from_file(rows(Rows), "UnicodeData.txt")),halt.'
% CPU time: 18.939s
```

File:

```prolog
:- use_module(library(dcgs)).
:- use_module(library(pio)).
:- use_module(library(lists)).
:- use_module(library(time)).
:- use_module(library(dif)).

peek(T), [T] --> [T].

% eof describes end of the string, and does not consume anything.
% Therefore, a sprinkle of state may be needed to avoid matching multiple times.
eof_([], []).
eof --> call(eof_).

% csv rules
rows([]) --> rows_end.
rows([R | Rs]) --> row(R), rows(Rs).
rows_end --> [eof], eof. % --> [eof-the-atom], eof-the-dcg-rule

row([F]) --> field(F), row_end.
row([F | Fs]) --> field(F), row(Fs).
row_end --> [eol], peek(_).
row_end, [eof] --> [eol], eof.

field([]) --> field_end.
field([C | Cs]) --> [C], {maplist(dif(C), [eof,eol | "\n;"])}, field(Cs).
field_end --> ";".
field_end, [eol] --> "\n".
field_end, [eol] --> eof.
```
-
In A Tour of Prolog, triska demos Prolog with a neat example using the list of Unicode codepoints (a 1.8MB file). The video demonstrates editing the csv file with emacs to create a text Prolog database that can be loaded directly, but I was curious whether you could get the same results by leaving the file unedited and defining a general predicate that is true when it unifies with the data in the file after parsing.

It's correct enough to query, but unfortunately it is very slow. For example,

```prolog
code_name(20, Name).
```

takes several seconds to complete, and the general query

```prolog
code_name(Code, Name).
```

doesn't show any results at all after several minutes. Tabling didn't seem to work. I wouldn't mind an initial load time, but I'd like to load/cache all the results into the internal database as they're parsed the first time -- as if I had typed out the database and loaded it. I'm likely missing something big; thoughts?

Another, perhaps separate, issue is that it parses numbers by default, so the hex codes of the first field are a mix of 4-digit hex strings and hex numbers that happen to contain only decimal digits, interpreted as base-10 integers. Maybe we could add an option to parse_csv to disable automatic number parsing?
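Not an answer from the thread, but one way to get the "load once into the internal database" behavior is to assert the parsed rows as facts on first use. A hedged sketch, building on `rows//1` from the replies (the names `code_name_cached/2`, `ensure_codes_loaded/0`, and `assert_row/1` are my inventions):

```prolog
:- use_module(library(pio)).
:- use_module(library(lists)).

:- dynamic(code_name_cached/2).

% Parse the file once and cache every row as a fact; later calls hit the cache.
ensure_codes_loaded :-
    (   code_name_cached(_, _) -> true
    ;   phrase_from_file(rows(Rows), "UnicodeData.txt"),
        maplist(assert_row, Rows)
    ).

assert_row([Code, Name | _]) :-
    assertz(code_name_cached(Code, Name)).

code_name(Code, Name) :-
    ensure_codes_loaded,
    code_name_cached(Code, Name).
```

First-argument indexing on the cached facts should then make `code_name('0041', Name)`-style lookups fast after the initial parse.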