Replies: 3 comments 34 replies
-
Thank you a lot for sharing this interesting benchmark! Here is the permalink of the raw source file that can be tried immediately: One thing I noticed by looking at the sample queries included in the file is this query and its timings:

```
%?- time(ceiling_vertex_t([3/3,3/4,3/5,4/6,4/7,4/8,5/9,5/10,6/11,6/12], V, true)).
%@ % CPU time: 0.426 seconds
%@ V = 3/5
%@ ;  % CPU time: 0.962 seconds
%@ V = 4/8
%@ ;  % CPU time: 1.450 seconds
%@ V = 5/10
%@ ;  % CPU time: 1.972 seconds
%@ V = 6/12
%@ ;  % CPU time: 1.982 seconds
%@ false.
```

Maybe there is a way to speed this up? Often, major performance improvements come from removing unnecessary work.
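As a generic illustration of what I mean, here is a small sketch that is unrelated to ceiling_vertex_t/3 (the predicates reverse_naive/2 and reverse_acc/2 are invented for this example): the first version re-does work on every step, the second removes it.

```
:- use_module(library(lists)).
:- use_module(library(time)).

% Reversing with append/3 re-traverses the growing result on every
% step, doing quadratic work in total.
reverse_naive([], []).
reverse_naive([X|Xs], Rs) :-
    reverse_naive(Xs, Rs0),
    append(Rs0, [X], Rs).

% Carrying an accumulator does constant work per element, linear in total.
reverse_acc(Xs, Rs) :-
    reverse_acc_(Xs, [], Rs).

reverse_acc_([], Rs, Rs).
reverse_acc_([X|Xs], Acc, Rs) :-
    reverse_acc_(Xs, [X|Acc], Rs).

%?- length(Ls, 2000), maplist(=(a), Ls), time(reverse_naive(Ls, _)).
%?- length(Ls, 2000), maplist(=(a), Ls), time(reverse_acc(Ls, _)).
```

Both queries compute the same result; the second simply avoids the repeated traversals.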
-
Instead, use … This makes the code more determinate, that is, with fewer leftover choice points à la …
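For illustration only (the predicates label_naive/2 and label_reif/2 are invented here, not taken from ccd.pl), one common way to keep code determinate without losing generality is if_/3 with the reified (=)/3 from library(reif), which ships with Scryer:

```
:- use_module(library(reif)).
:- use_module(library(dif)).

% Plain clauses: label_naive(a, L) yields L = first and then a leftover
% choice point, because first-argument indexing cannot rule out the
% second clause (its head argument is a variable).
label_naive(a, first).
label_naive(X, other) :- dif(X, a).

% With if_/3 and (=)/3, the same relation succeeds deterministically
% for ground arguments and still works in all directions.
label_reif(X, L) :-
    if_(X = a, L = first, L = other).

%?- label_naive(a, L).
%?- label_reif(a, L).
```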
-
Work on compacting the representation of terms in the heap is ongoing, but once it's finished, it should yield a substantial speed-up across all of Scryer. I enthusiastically agree that inlining …
-
Background: Having played a 'supporting role' in this arXiv paper, I have come to see complete path enumeration (CPE) as a means for definitive treatment of oncology dose-escalation trial designs. I expect in fact that CPE will enable a unification of what are now thought to be disparate classes of dose-finding design. To this end, as part of R package precautionary, I am implementing in ccd.pl (GitHub permalink to current revision) a class of 'cumulative cohort designs' (Ivanova, Flournoy & Chung (2007)) with certain extensions suggested by Liu & Yuan (2015).

Problem: Whereas my previous DCG for the elementary but widely used '3+3' design (see file esc.pl) managed to enumerate all 'reasonably-sized' 3+3 trials in about 20 minutes, my ccd.pl code does not enumerate even one trial of practical size in an hour! Achieving my aims with this code probably requires at least a 100x speedup.

Question: What are recommended approaches for performance tuning in Prolog? How could I learn where the time is being spent in the above calls? Customary profiling approaches recently helped me to achieve some very nice speedups (see this Twitter thread) of a CPE implemented in R and Rust; does Prolog admit instrumentation for nested attribution of execution time to individual goals and the goals they call? (Presumably, meta-interpretation could help with this? A rough sketch of what I mean follows below.) With Scryer being under such active development, might profiling also point usefully to underlying implementation (or library) issues worth reporting? Are there any intersections with Rust and its tooling that we Scryer Prolog users could exploit for performance tuning?
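Here is the kind of meta-interpreter I have in mind, purely as a sketch: all predicate names (mi_count/1, bump_count/1, report_counts/0, counted/2, my_goal) are invented for illustration, it assumes predicate_property/2 reports built_in as in most systems, and control constructs beyond conjunction would need additional clauses.

```
:- use_module(library(format)).
:- use_module(library(lists)).

:- dynamic(counted/2).

% Count how often each user-defined predicate is called while solving a goal.
mi_count(true) :- !.
mi_count((A, B)) :- !, mi_count(A), mi_count(B).
mi_count(Goal) :-
    (   predicate_property(Goal, built_in)
    ->  call(Goal)                      % do not descend into built-ins
    ;   functor(Goal, F, N),
        bump_count(F/N),
        clause(Goal, Body),             % predicates under study may need to
        mi_count(Body)                  % be declared dynamic for clause/2
    ).

bump_count(PI) :-
    (   retract(counted(PI, C0)) -> C is C0 + 1 ; C = 1 ),
    assertz(counted(PI, C)).

report_counts :-
    findall(C-PI, counted(PI, C), Pairs0),
    sort(Pairs0, Pairs),
    reverse(Pairs, Sorted),
    maplist(report_pair, Sorted).

report_pair(C-PI) :-
    format("~w: ~d calls~n", [PI, C]).

%?- mi_count(my_goal), false ; true.   % exhaust all answers (my_goal is a placeholder)
%?- report_counts.
```

Extending something like this to accumulate wall-clock or CPU time per predicate, rather than call counts, is what I am hoping experienced Prolog users can advise on.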
Edit: Although the particular performance problem that motivated this post is now logged as #975, the underlying question (seeking general advice on Prolog performance tuning) remains.