Introduction
This post describes a reasonably performant Prolog to C compiler prototype written in Prolog. It only compiles a subset of Prolog, but it does include unification and backtracking, which is enough to demonstrate:
- N-queens, which is good for testing the performance of backtracking.
- the append predicate, which is an excellent example for demonstrating unification, backtracking and code that can run in more than one direction.
Context: A few months ago, I applied for a job that involved developing code in Prolog. I hadn’t used Prolog for some time, so I was looking for a project to brush up on my Prolog skills. It occurred to me that an effective way of refreshing my Prolog skills and deepening my understanding of the internals of Prolog at the same time would be to write a compiler that would compile a subset of Prolog to a low level language. I was tempted to target x86-64 machine code (having written code generators targeting this before), but I only had a bit over a week to put something together to demonstrate, so I chose to target C++ instead, later changing it to C as I wasn’t using any features that required C++.
Note: this would not be a good candidate for developing into a full Prolog compiler:
- C is a poor choice of target language, as it is hard to get good information about the stack (necessary for the way that I do backtracking), and C compilers may not always perform tail call optimization - more on this here.
- Because of the compressed development time, the compiler is largely a set of `write` statements from the Prolog parse tree to C, rather than going through more intermediate stages. This is enough to demonstrate the concept, but is not maintainable - more intermediate stages would be needed for maintainability and optimization.
- The modern implementations of Prolog all support CLP(FD) (Constraint Logic Programming over Finite Domains), which provides a fast and powerful way of solving certain types of integer and floating point constraint problems, and combinatorial problems. I have not looked at what would be involved in supporting that.
The concepts demonstrated in this project (particularly fast backtracking) could be used in a more complete and maintainable implementation.
Contents
- Introduction
- Contents
- Design
- Backtracking
- Unification
- Other optimizations
- Debugging/Tracing
- Example problems
- Comparison with other implementations
- Type/Mode/Determinism declarations
- Speed results
- Why use machine code or MLIR (LLVM)
- Future directions
- Conclusion
Design
Because the Prolog to C compiler is only a prototype, I only added the functionality necessary to run the tests that I wanted, so it only handles integers and lists. There is nothing in the design that would prevent it being extended.
It is written in SWI-Prolog and has two files: `parser.pl` and `compiler.pl` (there is also an `interpreter.pl`, which is just a simple Prolog interpreter - but a Prolog interpreter written in Prolog is not what this post is about).
`parser.pl` uses the SWI-Prolog package plammar to parse a Prolog file, then strips out all the symbols, as described in Other optimizations.
Backtracking
One of the biggest stumbling blocks to writing an efficient Prolog implementation is backtracking. Backtracking can occur where there is a “choice point” - more than one clause that might satisfy the goal - so each clause is called in turn, and if it fails, the state of the machine is reset to the choice point and the next clause is tried. Saving the state at the choice point can be expensive, as a lot of information might need to be saved. Restoring to a choice point can be similarly expensive.
Creating a choice point
When a choice point is reached, it is necessary to save sufficient state to restore the state of the machine to the place where the choice point happened (with the clause to be evaluated incremented by one). The naive way of doing this would be to create a snapshot of the state of the running process, but this would be unfeasibly expensive. The three parts of a process that would be involved in such a snapshot are the registers, the stack and the heap, and the heap is the most expensive.
The Trail
In Prolog, the state of the machine is the variable bindings, so the best way of storing this is to create a stack of variable bindings separate from the machine stack that can be unwound. Such a stack is called a “trail” as discussed in Warren’s Abstract Machine: A Tutorial Reconstruction by Hassan Aït-Kaci (1999).
The reason for a separate stack rather than the machine stack is because of the way that variables may be bound by called clauses:
f(X) :-
g(X),
h(X).
In this example, `X` could already be bound when `f(X)` is called, it could be bound by `g(X)` or by `h(X)`, or it could remain unbound, and this often can't be determined at compile time. If, for instance, `g(X)` were to bind `X`, that binding needs to be stored for later unwinding, and if it were saved on the machine stack, that information would be lost when `g(X)` returned and its stack frame was popped.
Implementation
My implementation has two stacks that are used to store choice point information. The first is a stack of variable bindings - each time a variable is bound in a way that may need to be unwound, a reference to it is saved on a stack, which in my implementation is stored in the `Prolog` struct:
UWORD* unwind_stack_decouple;
UWORD top_unwind_stack_decouple;
The other stack is a stack of choice point frames, stored in the `Prolog` struct as:
FrameStore* frames;
uint32_t frame_top;
Where `FrameStore` is (simplified and annotated):
typedef struct {
Due to the rules about caller/callee register saving, only some registers need to be saved:
uint8_t* store_bx;
uint8_t* store_sp;
uint8_t* store_bp;
uint8_t* store_12;
uint8_t* store_13;
uint8_t* store_14;
uint8_t* store_15;
In my implementation I save part of the machine stack. I need to store the location the stack is saved from (`live_stack`), the place it is stored to (`store_stack`), and the sizes for saving and loading (which may be the same, but I want to allow for the possibility that they aren't):
uint8_t* live_stack;
uint8_t* store_stack;
uint32_t save_size;
uint32_t load_size;
The next field, `clause_index`, is used to move the clause along by one every time a choice point is returned to, and `clause_count` indicates how many clauses there are:
int32_t clause_index;
int32_t clause_count;
The next field indicates how far to unwind the trail to if backtracking to a choice point:
UWORD unwind_stack_decouple_mark;
`call_depth` is currently only used to display the depth if trace mode is turned on, but could also be used to impose a depth limit:
uint32_t call_depth;
} FrameStore;
There is some glue code, required for saving and restoring choice points, that needs to be written in assembler. It was not feasible to write this in C, as the C compiler kept stepping on registers I needed to preserve.
C (and C++) do not make it easy to deal with the stack pointer directly - it is possible to get access to the stack pointer using `"mov %%rsp, %0"` in inline assembler, or via `__builtin_frame_address` provided by `gcc` and `clang`, but the compiler can use values above or below the stack pointer because x86-64 uses a Red Zone. Working out which part of the stack to save and restore is therefore not an exact science, and in practice I had to widen the saved address range a little to account for this. This is discussed more below.
Choice point elimination
When writing efficient code, it is generally best to only pay the cost for what is being used, and in many cases the overhead of dealing with choice points can be avoided - this is known as Choice Point Elimination. Backtracking will only happen if more than one clause can be called, so if it can be determined that only one clause can be called (such as the trivial case where there is only one clause), the choice point code can be left out altogether.
Even where there is a possibility of backtracking, the store state operation can be left until a clause is being called that is not the last clause - as that is the only case where backtracking can occur. The pseudocode for this, in the case of two clauses, is:
function
first clause:
check if first clause can be called
SAVE STATE
if this point was reached via the save (rather than via a restore) call first clause
second clause:
check if second clause can be called, then call
This can be refined by testing earlier whether a later clause can be called:
function
first clause:
check if first clause can be called
if the second clause might be called
SAVE STATE
if this point was reached via the save (rather than via a restore) call first clause
second clause:
check if second clause can be called, then call
An example of this can be found in the section Append code.
Unification
Unification of two terms is a somewhat expensive operation if it is overused, but this is where a compiler really shines:
-
If one term is known at compile time not to be bound yet, such as the first time a variable occurs on the left hand side of a clause, the unification can be replaced by aliasing.
-
If part of a tree is known at compile time, unification with that part of the tree can be replaced by matching, which is cheaper.
In practice, that means that full unification is only required when the same variable occurs more than once on the left hand side of a clause.
For instance, in:
append([X|Xs],Ys,[X|Zs]) :- append(Xs,Ys,Zs).
there will only need to be one unification, to check that the `X` at the head of the first and third parameters is the same value.
This fits very well with the principle of only paying for features that are used - if code doesn’t have repeated variables on the left hand side, there will be no calls required to the general unification code.
There is something to be careful about when combining unification and backtracking - it needs to be possible to undo part of a unification, which often means leaving structures in an inefficient form.
For instance, if at a particular point in execution we have `X=[1,2|Y]`, and `Y` is then bound to `[3]`, the resulting data structure might be:
It might be tempting to eliminate the step through `Y`, but this can't be safely done in the case where backtracking might change what `Y` is unified with, without backtracking over the unification of `X`:
An optimization that I don't perform would be to detect when it is safe to do this because there are no choice points between the binding of `X` and the binding of `Y`.
In practice, depending on how this point was reached, there might be even more indirections:
Other optimizations
I’m not sure if this counts as an optimization or just common sense, but one of the first things I do after parsing the Prolog is strip out all the symbols and replace them by a position (leaving the symbol tables for possible use by debugging later, but they no longer need to be used for compiling). For the N-queens example, the parse tree after this step looks like the following, where `v( )` represents a variable, `i( )` is an integer and `eol` is end of list. I also use an explicit `list/2` predicate, so `[1,X,2]` would be represented as `list(i(1),list(v(0),list(i(2),eol)))`.
[f(range,3),f(selectx,3),f(queens,2),f(queens_aux,3),f(not_attack,3)]
[[ clause(['M'],
[v(0),v(0),list(v(0),eol)],
[]),
clause(['M','N','Ns','M1'],
[v(0),v(1),list(v(0),v(2))],
[function(test_neq,v(0),v(1)),
function(assign,v(3),function(add,v(0),i(1))),
fcall(0,[v(3),v(1),v(2)])])],
[ clause(['X','Xs'],
[v(0),list(v(0),v(1)),v(1)],
[]),
clause(['X','Y','Ys','Zs'],
[v(0),list(v(1),v(2)),list(v(1),v(3))],
[fcall(1,[v(0),v(2),v(3)])])],
[ clause(['N','Qs','Ns'],
[v(0),v(1)],
[fcall(0,[i(1),v(0),v(2)]),
fcall(3,[v(2),eol,v(1)])])],
[ clause(['UnplacedQs','SafeQs','Qs','Q','UnplacedQs1'],
[v(0),v(1),v(2)],
[fcall(1,[v(3),v(0),v(4)]),
fcall(4,[v(3),i(1),v(1)]),
fcall(3,[v(4),list(v(3),v(1)),v(2)])]),
clause(['Qs'],
[eol,v(0),v(0)],
[])],
[ [clause(['_A1','_A2'],
[v(0),v(1),eol],
[]),
clause(['Q0','D0','Q','Qs','D1'],
[v(0),v(1),list(v(2),v(3))],
[function(test_neq,v(0),function(add,v(1),v(2))),
function(test_neq,v(2),function(add,v(1),v(0))),
function(assign,v(4),function(add,v(1),i(1))),
fcall(4,[v(0),v(4),v(3)])])]]
When I compile this, I can treat the variable identifiers (the bit inside the `v( )`) as an offset into the variable list (actually, I can do a little better and eliminate some of them).
Binding of variables and assigning of list cells is performed as late as possible inside a clause, preferably after it has been determined that the clause will be called; that way there is less need to unwind the trail and deallocate list cells.
I do limited garbage collection - when I backtrack to a choice point, any allocations that happened after that choice point are cleaned up - if code is run to completion, there are no memory leaks. What would be a problem is if there was a recursive call used as a long loop (say a REPL) - there would need to be a stack and/or trail cleanup to prevent unlimited memory use.
Other optimizations are at the bit level:
-
I store 3 tag bits, and I store them in the least significant bits rather than the most significant, as it makes masking easier; also, integers can be stored in the upper bits, and an arithmetic shift right by 3 will preserve the sign.
-
The tag for a reference is `TAG_VREF=0b000`, rather than any non-zero value, as that makes chasing through the indirections (as shown above) a tiny bit faster - it saves one machine code instruction, but every little helps.
Here is my code for “pointer_chasing”:
static inline void pointer_chase_notag(Prolog* p, UWORD* val) {
UWORD v;
uint8_t tag;
loop:
tag=(*val&TAG_MASK);
if(((tag&TAG_MASK)==TAG_VREF) && (v=p->variables[(*val>>TAG_WIDTH)])!=TAG_VAR) {
*val=v;
goto loop;
}
}
Debugging/tracing
When developing this, it was necessary to develop some debugging tools. I went with a trace mode similar to SWI-Prolog's `leash(all),trace.` that prints out the execution trace of the code; then I can spot where the problem is and put in appropriate breakpoints (I spent a lot of time using gdb when developing this).
To turn this on, just set `trace_mode.` at about line 60 of `compiler.pl`, and the compiler will include the appropriate `printf` statements in the generated code.
It is instructive, for `nqueens(2,Q).`, to put my trace (left) and SWI-Prolog's trace (right) side by side:
Call: (12) nqueens(2, _50522)
2:>range,2,2,_2 c=1 Call: (13) range(1, 2, _51822)
Call: (14) 1=\=2
Exit: (14) 1=\=2
Call: (14) _54270 is 1+1
Exit: (14) 2 is 1+1
2:>range,2,2,_2 c=1 Call: (14) range(2, 2, _52642)
=== saved choice point 1
2:<range,2,2,[2] c=0 Exit: (14) range(2, 2, [2])
2:<range,1,2,[1,2] c=0 Exit: (13) range(1, 2, [1, 2])
3:>nqueens_aux,[1,2],[],_0 c=0 Call: (13) nqueens_aux([1, 2], [], _50522)
3:>selectx,_4,[1,2],_5 c=1 Call: (14) selectx(_59154, [1, 2], _59156)
=== saved choice point 2
3:<selectx,1,[1,2],[2] c=0 Exit: (14) selectx(1, [1, 2], [2])
4:>not_attack,1,1,[] c=0 Call: (14) not_attack(1, 1, [])
4:<not_attack,1,1,[] c=-1 Exit: (14) not_attack(1, 1, [])
4:>nqueens_aux,[2],[1],_0 c=0 Call: (14) nqueens_aux([2], [1], _50522)
4:>selectx,_6,[2],_7 c=1 Call: (15) selectx(_63234, [2], _63236)
=== saved choice point 3
4:<selectx,2,[2],[] c=0 Exit: (15) selectx(2, [2], [])
5:>not_attack,2,1,[1] c=0 Call: (15) not_attack(2, 1, [1])
Call: (16) 2=\=1+1
Fail: (16) 2=\=1+1
5:<not_attack:FAIL Fail: (15) not_attack(2, 1, [1])
=== loaded choice point 3 Redo: (15) selectx(_63234, [2], _63236)
4:>selectx,_6,[],_10 c=2 Call: (16) selectx(_63234, [], _68938)
4:<selectx:FAIL Fail: (16) selectx(_63234, [], _68938)
Fail: (15) selectx(_63234, [2], _63236)
Fail: (14) nqueens_aux([2], [1], _50522)
Redo: (14) not_attack(1, 1, [])
Fail: (14) not_attack(1, 1, [])
=== loaded choice point 2 Redo: (14) selectx(_59154, [1, 2], _59156)
3:>selectx,_4,[2],_8 c=2 Call: (15) selectx(_59154, [2], _74642)
=== saved choice point 2
3:<selectx,2,[2],[] c=0 Exit: (15) selectx(2, [2], [])
3:<selectx,2,[1,2],[1] c=0 Exit: (14) selectx(2, [1, 2], [1])
4:>not_attack,2,1,[] c=3 Call: (14) not_attack(2, 1, [])
4:<not_attack,2,1,[] c=2 Exit: (14) not_attack(2, 1, [])
4:>nqueens_aux,[1],[2],_0 c=3 Call: (14) nqueens_aux([1], [2], _50522)
4:>selectx,_9,[1],_10 c=4 Call: (15) selectx(_79534, [1], _79536)
=== saved choice point 3
4:<selectx,1,[1],[] c=0 Exit: (15) selectx(1, [1], [])
5:>not_attack,1,1,[2] c=1 Call: (15) not_attack(1, 1, [2])
Call: (16) 1=\=1+2
Exit: (16) 1=\=1+2
Call: (16) 2=\=1+1
Fail: (16) 2=\=1+1
5:<not_attack:FAIL Fail: (15) not_attack(1, 1, [2])
=== loaded choice point 3 Redo: (15) selectx(_79534, [1], _79536)
4:>selectx,_9,[],_13 c=2 Call: (16) selectx(_79534, [], _86864)
4:<selectx:FAIL Fail: (16) selectx(_79534, [], _86864)
Fail: (15) selectx(_79534, [1], _79536)
Fail: (14) nqueens_aux([1], [2], _50522)
Redo: (14) not_attack(2, 1, [])
Fail: (14) not_attack(2, 1, [])
=== loaded choice point 2 Redo: (15) selectx(_59154, [2], _74642)
3:>selectx,_4,[],_11 c=2 Call: (16) selectx(_59154, [], _92568)
3:<selectx:FAIL Fail: (16) selectx(_59154, [], _92568)
Fail: (15) selectx(_59154, [2], _74642)
Fail: (14) selectx(_59154, [1, 2], _59156)
Fail: (13) nqueens_aux([1, 2], [], _50522)
=== loaded choice point 1 Redo: (14) range(2, 2, _52642)
Call: (15) 2=\=2
Fail: (15) 2=\=2
2:<range:FAIL Fail: (14) range(2, 2, _52642)
=== loaded choice point 1
2:<range:FAIL Fail: (13) range(1, 2, _51822)
=== loaded choice point 0
0:<nqueens:FAIL Fail: (12) nqueens(2, _50522)
false. false.
They show a similar execution path in that they make the same predicate calls in the same order, but there are gaps in my version - I don't show the trace through the arithmetic (I inline it), and in my version the choice points in `not_attack` have been eliminated.
Normally I like to use valgrind to assist with finding uninitialized variables and squashing memory leaks, however it gets most upset when backtracking happens - it gets really confused by entering one function, having the stack rewritten underneath it, and exiting another function. Valgrind also doesn’t like that the range of addresses for copying from and to the stack needs to be expanded a little for safety.
Example problems
I tried four example problems, each one designed to test a particular aspect of the system:
-
Append is used to test that the same code works with many different choices of which parameters are input and which are output.
-
N-queens tests the performance of the engine in the face of lots of backtracking. N-queens is particularly good for getting an idea of performance as larger N will result in longer execution time.
-
Four Clauses is a simple test to check that the compiler works on predicates with more than two clauses, as my other examples didn't have such cases.
-
Pass Through is a pass-through function. The test here is that it shouldn’t generate much code.
I also tested Fibonacci and factorial functions, but there is nothing interesting to show from that.
Append
One of the major features of Prolog that is provided by unification and backtracking is that code can run backwards. Let’s test the compiler by trying various versions of append where append is defined by:
append([],Ys,Ys).
append([X|Xs],Ys,[X|Zs]) :- append(Xs,Ys,Zs).
Here is the generated C source code, cleaned up a little, with some explanation added.
Append code
First store some values that might be needed later.
uint8_t append_3(Prolog* p, UWORD arg0, UWORD arg1, UWORD arg2, UWORD voffset, UWORD* voffset_new) {
FrameStore* fs=NULL;
UWORD unwind_stack_decouple_mark=p->top_unwind_stack_decouple;
UWORD unwind_stack_gc_mark=p->top_unwind_stack_gc;
uint32_t function_frame_top=p->frame_top;
First clause: append([],Ys,Ys).
{
uint8_t tag_arg0;
UWORD voffset_next=voffset+0;
tag_arg0=(arg0&TAG_MASK);
If `arg0` is not either [] or a variable, skip straight to the second clause.
if(tag_arg0==TAG_EOL) {goto s_label_c0_arg0;}
if(tag_arg0!=TAG_VREF) {goto fail_label_c0_no_unwind;}
s_label_c0_arg0:;
s_label_c0_arg1:;
If `arg1` can't be unified with `arg2`, then go to the second clause, but first unwind anything added to the trail.
if(!unify(p,arg1,arg2)) {goto fail_label_c0;}
s_label_c0_arg2:;
This logic gets a bit trickier. `arg0` is known to be [] or a variable. If it is [], the second clause would fail as it needs a non-empty list, so there is no need to store a continuation - the `&& ((arg0&TAG_MASK)==TAG_VREF)` is the optimization that does the second clause check. Otherwise, store the state so this can be backtracked to.
if(true && ((arg0&TAG_MASK)==TAG_VREF)) {
fs=&p->frames[++p->frame_top];
fs->clause_index=0;
function_frame_top=p->frame_top;
fs->clause_count=2;
fs=process_stack_state_load_save(p,0);
This point can be reached in one of two ways: just after saving the state, in which case the `clause_index` will be incremented from 0 to 1, or just after a state restore caused by a backtrack, so the `clause_index` will be incremented from 1 to 2.
fs->clause_index++;
if(fs->clause_index!=1) {
This is just after a state restore, so unwind all the changes that were made from the original entry to the function onwards, then go to the second clause.
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
goto next_label_c0;
} else {
This is just after a state save, so if arg0 was a variable, we need to save it to the trail, then return.
if(tag_arg0!=TAG_EOL)var_set_add_to_unwind_stack(p,arg0>>TAG_WIDTH,TAG_EOL);
}
}
*voffset_new=voffset_next;
return true;
}
The two ways of failing the first clause - unwind the trail if necessary.
fail_label_c0:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c0_no_unwind:;
if(fs!=NULL)fs->clause_index++;
next_label_c0:;
Second clause: append([X|Xs],Ys,[X|Zs]) :- append(Xs,Ys,Zs).
{
uint8_t tag_arg2,tag_arg0;
There are going to need to be 3 internal variables, so increase the base of the variable window by 3 for the next call.
UWORD voffset_next=voffset+3;
UWORD var0, var1, var2;
tag_arg0=(arg0&TAG_MASK);
UWORD arg0lc, arg0h, arg0t;
`arg0` is either a list or a variable. Deal with either case; anything else is a failure.
if(tag_arg0>=TAG_LIST) {
If it is a list, extract the head and tail.
List* arg0l=&p->list_values[arg0>>TAG_WIDTH];
arg0h=arg0l->head;
pointer_chase_notag(p,&arg0h);
var0=arg0h;
s_label_c1_arg0h:;
arg0t=arg0l->tail;
pointer_chase_notag(p,&arg0t);
var1=arg0t;
s_label_c1_arg0t:;
} else if(tag_arg0==TAG_VREF) {
If it is a variable, create the head and tail as variables. Don’t create the actual list yet, as that is just extra garbage to delete if this clause fails.
var0=(0<<TAG_WIDTH)+TAG_VREF+(voffset<<TAG_WIDTH);
p->variables[0+voffset]=TAG_VAR;
var1=(1<<TAG_WIDTH)+TAG_VREF+(voffset<<TAG_WIDTH);
p->variables[1+voffset]=TAG_VAR;
} else {goto fail_label_c1_no_unwind;}
s_label_c1_arg0:;
s_label_c1_arg1:;
tag_arg2=(arg2&TAG_MASK);
UWORD arg2lc, arg2h, arg2t;
`arg2` is treated the same way as `arg0`, with one difference: the head of each list (`X`) must unify, so if this is a list we test for this unification, and if this is a variable only the tail needs to be created, not the head.
if(tag_arg2>=TAG_LIST) {
List* arg2l=&p->list_values[arg2>>TAG_WIDTH];
arg2h=arg2l->head;
pointer_chase_notag(p,&arg2h);
if(!unify(p,var0,arg2h)) {goto fail_label_c1;}
s_label_c1_arg2h:;
arg2t=arg2l->tail;
pointer_chase_notag(p,&arg2t);
var2=arg2t;
s_label_c1_arg2t:;
} else if(tag_arg2==TAG_VREF) {
var2=(2<<TAG_WIDTH)+TAG_VREF+(voffset<<TAG_WIDTH);
p->variables[2+voffset]=TAG_VAR;
} else {goto fail_label_c1;}
At this point, there is no chance of failure any more, so add any variables to the trail and create any necessary lists. It would be possible to express this more efficiently (two `if(...) {...} else {...}` blocks), but one of the downsides of generated code is that it will frequently not be as tidy as hand-written code.
s_label_c1_arg2:;
if(tag_arg0!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,0+voffset,arg0h);
if(tag_arg0!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,1+voffset,arg0t);
if(tag_arg0==TAG_VREF)arg0lc=plcreate_list(p,var0,var1);
if(tag_arg0==TAG_VREF)var_set_add_to_unwind_stack(p,arg0>>TAG_WIDTH,arg0lc);
if(tag_arg2!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,2+voffset,arg2t);
if(tag_arg2==TAG_VREF)arg2lc=plcreate_list(p,var0,var2);
if(tag_arg2==TAG_VREF)var_set_add_to_unwind_stack(p,arg2>>TAG_WIDTH,arg2lc);
{
Recursively call append. Ideally tail recursion would be used here, but that is only safe if `arg0` is a list; if `arg0` is a variable, the stack frame might be needed for backtracking.
uint32_t local_frame_top=p->frame_top;
bool found=append_3(p, var1, arg1, var2, voffset_next, &voffset_next);
set_stack_low_water_mark(p);
pop_frame_stack(p);
If the call failed and there is a backtracking possibility, do that here - the call to `process_stack_state_load_save` is a load this time.
if(p->frame_top>=local_frame_top && !found) {
set_stack_low_water_mark(p);
process_stack_state_load_save(p,local_frame_top);
}
if(!found) {goto fail_label_c1;}
}
*voffset_new=voffset_next;
return true;
}
The call failed, and there was no internal backtracking to be done, so clean up and exit.
fail_label_c1:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c1_no_unwind:;
next_label_c1:;
*voffset_new=voffset;
return false;
}
Append examples
For each of the following examples, the first line is the input (preceded by a ?
), the rest of the lines are the output.
The simple case. Note that this does not involve saving or loading any choice points:
? append([1,2],[3,4],X).
X = [1,2,3,4]
Solve for the second of the two sublists. Again, note that this does not involve saving or loading any choice points:
? append([1,2],X,[1,2,3,4]).
X = [3,4]
Solve for the first of the two sublists. This requires backtracking, so returns an extra false
at the end to indicate no more solutions:
? append(X,[3,4],[1,2,3,4]).
X = [1,2]
false.
All the different ways to construct a list:
? append(X,Y,[1,2,3,4]).
X = []
Y = [1,2,3,4]
X = [1]
Y = [2,3,4]
X = [1,2]
Y = [3,4]
X = [1,2,3]
Y = [4]
X = [1,2,3,4]
Y = []
false.
Extracting the last element of a list:
? append(X,[Y],[1,2,3,4]).
X = [1,2,3]
Y = 4
false.
This doesn’t have any solutions:
? append([1],X,[2,3,4]).
false.
Neither does this:
? append(X,[1],[2,3,4]).
false.
We can have variables left in the result, denoted by a number preceded by an `_` (`_0` could be any list, including the empty list):
? append([1,2],X,Y).
X = _0
Y = [1,2|_0]
This example has an infinite list of solutions, and will continue until it is stopped or runs out of resources:
? append(X,[1,2],Y).
X = []
Y = [1,2]
X = [_2]
Y = [_2,1,2]
X = [_2,_5]
Y = [_2,_5,1,2]
X = [_2,_5,_8]
Y = [_2,_5,_8,1,2]
X = [_2,_5,_8,_11]
Y = [_2,_5,_8,_11,1,2]
X = [_2,_5,_8,_11,_14]
Y = [_2,_5,_8,_11,_14,1,2]
X = [_2,_5,_8,_11,_14,_17]
Y = [_2,_5,_8,_11,_14,_17,1,2]
...
This all very closely matches what SWI-Prolog would output.
N-queens
The code I use for N-queens is:
range(M,M,[M]).
range(M,N,[M|Ns]) :- M =\= N, M1 is M+1,range(M1,N,Ns).
% 'select' is part of SWI-Prolog, but a version is provided here so that the compiler has access to it
selectx(X,[X|Xs],Xs).
selectx(X,[Y|Ys],[Y|Zs]) :- selectx(X,Ys,Zs).
nqueens(N,Qs) :- range(1,N,Ns),nqueens_aux(Ns,[],Qs).
nqueens_aux([],Qs,Qs).
nqueens_aux(UnplacedQs,SafeQs,Qs) :-
selectx(Q,UnplacedQs,UnplacedQs1),
not_attack(Q,1,SafeQs),
nqueens_aux(UnplacedQs1,[Q|SafeQs],Qs).
not_attack(_A1,_A2,[]).
not_attack(Q0,D0,[Q|Qs]) :-
Q0 =\= D0+Q,
Q =\= D0+Q0,
D1 is D0+1,
not_attack(Q0,D1,Qs).
Running this using `nqueens(8,Q).` returns (there are 92 results; some of them have been omitted):
Q = [4,2,7,3,6,8,5,1]
Q = [5,2,4,7,3,8,6,1]
Q = [3,5,2,8,6,4,7,1]
...
Q = [6,4,7,1,3,5,2,8]
Q = [4,7,5,2,6,1,3,8]
Q = [5,7,2,6,3,1,4,8]
false.
The way to interpret this is that each entry in the list represents which column the queen in the corresponding row is in, so for instance the list `[4,2,7,3,6,8,5,1]` represents:
It is worth looking through the code to see how much backtracking overhead there is.
`range` technically doesn't need to backtrack if the first two values are bound to integers (they will be in this example), but to test this, the compiler would need to look at the tail (the bit after the `:-`) of the second clause, and infer that if `arg0` and `arg1` are bound, then `unify(arg0,arg1)` and `arg0=\=arg1` are mutually exclusive. As a result of this not being detected, backtracking state will be stored but never loaded for this predicate.
range(M,M,[M]).
range(M,N,[M|Ns]) :- M =\= N, M1 is M+1,range(M1,N,Ns).
`selectx` genuinely does require backtracking, and is where all the backtracking actually occurs in this example.
% 'select' is part of SWI-Prolog, but a version is provided here so that the compiler has access to it
selectx(X,[X|Xs],Xs).
selectx(X,[Y|Ys],[Y|Zs]) :- selectx(X,Ys,Zs).
`nqueens` has one clause, so no backtracking.
nqueens(N,Qs) :- range(1,N,Ns),nqueens_aux(Ns,[],Qs).
For `nqueens_aux`, no backtracking is needed, because if `arg0` is `[]`, `selectx` will return no results in the second clause. However, that would require a significant amount of inference machinery to detect, including being able to determine that `selectx([],_,_)` returns no results. As a result, a choice point will be stored but never loaded for this predicate.
nqueens_aux([],Qs,Qs).
nqueens_aux(UnplacedQs,SafeQs,Qs) :-
selectx(Q,UnplacedQs,UnplacedQs1),
not_attack(Q,1,SafeQs),
nqueens_aux(UnplacedQs1,[Q|SafeQs],Qs).
For `not_attack`, if the last argument (`arg2`) is not a variable, the clauses are mutually exclusive, as an empty list will trigger the first clause and a non-empty list will trigger the second. The compiler will detect that, so this will have no backtracking overhead. This is important, because N-queens spends most of its time in this predicate.
not_attack(_A1,_A2,[]).
not_attack(Q0,D0,[Q|Qs]) :-
Q0 =\= D0+Q,
Q =\= D0+Q0,
D1 is D0+1,
not_attack(Q0,D1,Qs).
Digging a little deeper, here is the code (a little cleaned up) for `not_attack`:
uint8_t not_attack_3(Prolog* p, UWORD arg0, UWORD arg1, UWORD arg2, UWORD voffset, UWORD* voffset_new) {
FrameStore* fs=NULL;
UWORD unwind_stack_decouple_mark=p->top_unwind_stack_decouple;
UWORD unwind_stack_gc_mark=p->top_unwind_stack_gc;
uint32_t function_frame_top=p->frame_top;
{
uint8_t tag_arg2;
UWORD voffset_next=voffset+0;
s_label_c0_arg0:;
s_label_c0_arg1:;
tag_arg2=(arg2&TAG_MASK);
if(tag_arg2==TAG_EOL) {goto s_label_c0_arg2;}
if(tag_arg2!=TAG_VREF) {goto fail_label_c0_no_unwind;}
s_label_c0_arg2:;
if(true && ((arg2&TAG_MASK)==TAG_VREF)) {
fs=&p->frames[++p->frame_top];
fs->clause_index=0;
function_frame_top=p->frame_top;
fs->clause_count=2;
fs=process_stack_state_load_save(p,0);
fs->clause_index++;
if(fs->clause_index!=1) {
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,
function_frame_top);
goto next_label_c0;
} else {
if(tag_arg2!=TAG_EOL)var_set_add_to_unwind_stack(p,arg2>>TAG_WIDTH,TAG_EOL);
}
}
*voffset_new=voffset_next;
return true;
}
fail_label_c0:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c0_no_unwind:;
if(fs!=NULL)fs->clause_index++;
next_label_c0:;
{
uint8_t tag_arg1,tag_var0,tag_arg0,tag_arg2;
UWORD voffset_next=voffset+3;
UWORD var0, var1, var2;
s_label_c1_arg0:;
s_label_c1_arg1:;
tag_arg2=(arg2&TAG_MASK);
UWORD arg2lc, arg2h, arg2t;
if(tag_arg2>=TAG_LIST) {
List* arg2l=&p->list_values[arg2>>TAG_WIDTH];
arg2h=arg2l->head;
pointer_chase_notag(p,&arg2h);
var0=arg2h;
s_label_c1_arg2h:;
arg2t=arg2l->tail;
pointer_chase_notag(p,&arg2t);
var1=arg2t;
s_label_c1_arg2t:;
} else if(tag_arg2==TAG_VREF) {
var0=(0<<TAG_WIDTH)+TAG_VREF+(voffset<<TAG_WIDTH);
p->variables[0+voffset]=TAG_VAR;
var1=(1<<TAG_WIDTH)+TAG_VREF+(voffset<<TAG_WIDTH);
p->variables[1+voffset]=TAG_VAR;
} else {goto fail_label_c1_no_unwind;}
s_label_c1_arg2:;
tag_arg0=(arg0&TAG_MASK);
if(tag_arg0!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
tag_arg1=(arg1&TAG_MASK);
if(tag_arg1!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
tag_var0=(var0&TAG_MASK);
if(tag_var0!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
if(arg0==(arg1+var0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
if(var0==(arg1+arg0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
var2=(arg1+((1<<TAG_WIDTH)+TAG_INTEGER)-TAG_INTEGER);
var_set_add_to_unwind_stack(p,2+voffset,var2);
if(tag_arg2!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,0+voffset,arg2h);
if(tag_arg2!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,1+voffset,arg2t);
if(tag_arg2==TAG_VREF)arg2lc=plcreate_list(p,var0,var1);
if(tag_arg2==TAG_VREF)var_set_add_to_unwind_stack(p,arg2>>TAG_WIDTH,arg2lc);
{
uint32_t local_frame_top=p->frame_top;
bool found=not_attack_3(p, arg0, var2, var1, voffset_next, &voffset_next);
set_stack_low_water_mark(p);
pop_frame_stack(p);
if(p->frame_top>=local_frame_top && !found) {
set_stack_low_water_mark(p);
process_stack_state_load_save(p,local_frame_top);
}
if(!found) {goto fail_label_c1;}
}
*voffset_new=voffset_next;
return true;
}
fail_label_c1:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c1_no_unwind:;
next_label_c1:;
*voffset_new=voffset;
return false;
}
All the arguments will be inputs for this predicate, so after removing the code that is never called, it can be seen that the generated code is reasonably efficient:
uint8_t not_attack_3(Prolog* p, UWORD arg0, UWORD arg1, UWORD arg2, UWORD voffset, UWORD* voffset_new) {
FrameStore* fs=NULL;
UWORD unwind_stack_decouple_mark=p->top_unwind_stack_decouple;
UWORD unwind_stack_gc_mark=p->top_unwind_stack_gc;
uint32_t function_frame_top=p->frame_top;
{
uint8_t tag_arg2;
UWORD voffset_next=voffset+0;
s_label_c0_arg0:;
s_label_c0_arg1:;
tag_arg2=(arg2&TAG_MASK);
if(tag_arg2==TAG_EOL) {goto s_label_c0_arg2;}
if(tag_arg2!=TAG_VREF) {goto fail_label_c0_no_unwind;}
s_label_c0_arg2:;
if(true && ((arg2&TAG_MASK)==TAG_VREF)) {
// NEVER CALLED
}
*voffset_new=voffset_next;
return true;
}
fail_label_c0:;
// NEVER CALLED
fail_label_c0_no_unwind:;
if(fs!=NULL)fs->clause_index++;
next_label_c0:;
{
uint8_t tag_arg1,tag_var0,tag_arg0,tag_arg2;
UWORD voffset_next=voffset+3;
UWORD var0, var1, var2;
s_label_c1_arg0:;
s_label_c1_arg1:;
tag_arg2=(arg2&TAG_MASK);
UWORD arg2lc, arg2h, arg2t;
if(tag_arg2>=TAG_LIST) {
List* arg2l=&p->list_values[arg2>>TAG_WIDTH];
arg2h=arg2l->head;
pointer_chase_notag(p,&arg2h);
var0=arg2h;
s_label_c1_arg2h:;
arg2t=arg2l->tail;
pointer_chase_notag(p,&arg2t);
var1=arg2t;
s_label_c1_arg2t:;
} else if(tag_arg2==TAG_VREF) {
// NEVER CALLED
} else {goto fail_label_c1_no_unwind;}
s_label_c1_arg2:;
tag_arg0=(arg0&TAG_MASK);
if(tag_arg0!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
tag_arg1=(arg1&TAG_MASK);
if(tag_arg1!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
tag_var0=(var0&TAG_MASK);
if(tag_var0!=TAG_INTEGER) {goto fail_label_c1_no_unwind;}
if(arg0==(arg1+var0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
if(var0==(arg1+arg0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
var2=(arg1+((1<<TAG_WIDTH)+TAG_INTEGER)-TAG_INTEGER);
var_set_add_to_unwind_stack(p,2+voffset,var2);
if(tag_arg2!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,0+voffset,arg2h);
if(tag_arg2!=TAG_VREF)var_set_add_to_unwind_stack_nogc(p,1+voffset,arg2t);
if(tag_arg2==TAG_VREF) /* NEVER CALLED */;
if(tag_arg2==TAG_VREF) /* NEVER CALLED */;
{
uint32_t local_frame_top=p->frame_top;
bool found=not_attack_3(p, arg0, var2, var1, voffset_next, &voffset_next);
set_stack_low_water_mark(p);
pop_frame_stack(p);
if(p->frame_top>=local_frame_top && !found) {
// NEVER CALLED, as not_attack doesn't call anything that might backtrack
}
if(!found) {goto fail_label_c1;}
}
*voffset_new=voffset_next;
return true;
}
fail_label_c1:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c1_no_unwind:;
next_label_c1:;
*voffset_new=voffset;
return false;
}
Four Clauses
This is simply the code:
f(1).
f(2).
f(3).
f(4).
This test was done simply to check that backtracking works when a predicate has more than two clauses. This test passed.
Pass Through
This is the code:
foo(X,Y,Z) :- bar(X,Y,Z).
This is used to check that a predicate that just passes its arguments through to another predicate is reasonably efficient and doesn't, for instance, store any choice points.
The generated C code is:
uint8_t foo_3(Prolog* p, UWORD arg0, UWORD arg1, UWORD arg2, UWORD voffset, UWORD* voffset_new) {
UWORD unwind_stack_decouple_mark=p->top_unwind_stack_decouple;
UWORD unwind_stack_gc_mark=p->top_unwind_stack_gc;
uint32_t function_frame_top=p->function_frame_top_last_n_clause;
UWORD voffset_next=voffset+0;
s_label_c0_arg0:;
s_label_c0_arg1:;
s_label_c0_arg2:;
bool found=bar_3(p, arg0, arg1, arg2, voffset_next, &voffset_next);
set_stack_low_water_mark(p);
if(!found) {goto fail_label_c0_no_unwind;}
*voffset_new=voffset_next;
return true;
fail_label_c0:;
unwind_stack_revert_to_mark(p,unwind_stack_decouple_mark,unwind_stack_gc_mark,function_frame_top);
fail_label_c0_no_unwind:;
next_label_c0:;
*voffset_new=voffset;
return false;
}
The generated assembler code (clang with -O3) is:
foo_3: # @foo_3
.cfi_startproc
# %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
.cfi_offset %rbx, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
movq %r9, %r14
movl %r8d, %r15d
movq %rdi, %rbx
movl %r8d, -28(%rbp)
leaq -28(%rbp), %r9
callq bar_3
movq (%rbx), %rcx
movl 8(%rbx), %edx
imulq $120, %rdx, %rdx
cmpq %rbp, 112(%rcx,%rdx)
jae .LBB1_2
# %bb.1:
addq %rdx, %rcx
addq $112, %rcx
movq %rbp, (%rcx)
.LBB1_2:
testb %al, %al
setne %al
je .LBB1_4
# %bb.3:
movl -28(%rbp), %r15d
.LBB1_4:
movl %r15d, (%r14)
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
.cfi_def_cfa %rsp, 8
retq
This isn’t too horrible. The major improvement that could be made would be to detect that the call to bar can use tail call optimization, and change the call to a jump, which would let this predicate collapse to little more than a single instruction, but that would require more detection of extraneous elements.
Comparison with other implementations
In this section I compare my compiler with some mature Prolog implementations and some other languages of interest:
- C (in particular Clang), as something that can be used to provide a very fast baseline implementation.
- Mercury, because of its similarity to Prolog.
- Haskell, as representative of pure functional languages.
I only list open source language implementations, so for instance SICStus Prolog, which is one of the fastest Prolog implementations, is not included.
NOTE: I’ve only done speed comparisons for one problem (N-queens), so any results are only indicative - speed tests with other problems might yield quite different results.
I give the speed (in seconds) for N=12 with each implementation; see Speed results for the full timing results.
The Prolog to C compiler took 0.592 seconds for N=12; the variant with the hand-optimized not_attack took 0.481 seconds.
SWI Prolog
SWI Prolog is a free Prolog implementation that compiles Prolog to a virtual machine based on the “ZIP Virtual Machine” and has a fast interpreter for this machine. It is released under the 2-clause BSD license. It is not as fast as some other Prolog implementations, but is very mature and has a rich set of libraries. nqueens.pl works with SWI Prolog with the source code above.
Unlike the other implementations described here, which have a 64-bit limit on integer size, SWI Prolog supports unbounded integers, so integer operations will be a lot more expensive for integers outside the range given by:
?- current_prolog_flag(max_tagged_integer,X).
X = 72057594037927935. (0x00ffffffffffffff)
?- current_prolog_flag(min_tagged_integer,X).
X = -72057594037927936. (0xff00000000000000)
Nothing I am doing goes outside this range, but there will still be some overhead for the overflow check to see if larger integer handling is necessary. The other implementations won’t incur this overhead.
The N=12 time was 4.41 seconds. This is way slower than the other implementations that I tested, largely because all the other implementations compile to machine code while SWI Prolog compiles to bytecode (and the large-integer check overhead will slow things down a little).
CLPFD
CLPFD stands for Constraint Logic Programming over Finite Domains, and provides a fast and powerful way of solving certain types of integer and floating point constraint and combinatorial problems. N-queens is an example of a combinatorial problem that might benefit from this, but implementing this approach is beyond what I have done, so in this discussion I don’t investigate it.
GNU Prolog
GNU Prolog is a free, fast Prolog implementation that, from the website: "It first compiles a Prolog program to a WAM file which is then translated to a low-level machine independent language called mini-assembly specifically designed for GNU Prolog. The resulting file is then translated to the assembly language of the target machine (from which an object is obtained).". It is released under the GNU General Public License. nqueens.pl works with GNU Prolog with the source code above.
The N=12 time was 0.97 seconds, taking a little under double the time of my compiler.
C
C is a language that doesn’t need an introduction. It is very widely used and a popular choice for writing efficient low level code. I use Linux, so Clang is my preference, although GCC is also a good choice (code written for one will usually work on the other).
#include <stdio.h>

#define MAX_N 16
void solve(int n) {
int board[MAX_N] = {0}; // Array to track queen positions
int row = 0; // Start at the first row
int col = -1; // Just before the first column
int colmask=(1<<(n+1))-1;
while (row >= 0) {
// Find the next valid position in the current row
col=__builtin_ctz(colmask&((-1)<<(col+1)));
while(col < n) {
for (int i = 0; i < row; i++) {
int b = board[i];
int d = row-i;
if (b == col - d || // Same major diagonal
b == col + d) { // Same minor diagonal
goto conflict;
}
}
colmask -= (1<<col);
board[row] = col; // Place the queen
goto found_solution;
conflict:;
col=__builtin_ctz(colmask&((-1)<<(col+1)));
}
row--;
if (row >= 0) {
col = board[row]; // Try the next column in the previous row
colmask += (1<<col);
}
continue;
found_solution:
// Move to the next row
if (row == n - 1) {
// Found a solution
printf("[");
printf("%d",board[0]);
for (int i = 1; i < n; i++) {
printf(",%d",board[i]);
}
printf("]\n");
colmask += (1<<col);
col = board[row]; // Backtrack
} else {
row++;
col = -1; // Start from the first column in the next row
}
}
}
It’s worth spending a moment looking at what makes this fast:
- The board positions are stored in an array `board` rather than a list. This allows very fast access. The board array is the only memory access in the whole function (except for the calls to `printf`), so it will always be in the CPU's L1 cache.
- I have replaced recursive backtracking with iteration to avoid the overhead of function calls. It makes very little difference in this case, as the compiler is pretty good at optimizing recursion and most of the time is spent in the attack-checking loop. Recursive code can always be replaced by iterative code that is almost always at least as fast, although often at the expense of readability.
- The set of columns still available for use is stored as a bit mask `colmask`, which is very efficient. The Clang/GCC `__builtin_ctz` builtin can be used to quickly find a `1` bit indicating a column to try.
The N=12 time was 0.0586 seconds - over 10x as fast as my compiler. It is of course expected that the C time would be the fastest, as this specific problem can be coded with no overhead for lists and garbage collection, and in this case an array based backtracking method can be used that is very fast.
It is possible to get a significant speed improvement (about a factor of 2) over this by using vector registers (e.g. SSE/AVX2 on x86-64), but that is beyond the scope of this post. Perhaps I will write about this in a future post.
Mercury
Mercury is a pure logic programming language with a syntax similar to Prolog that is strongly typed, strongly moded (each argument of a predicate needs to declare whether it is an input or an output), and has determinism declarations.
As a result of this, it generates very efficient code, first compiling to an intermediate language (one of low-level C, high-level C, Java or C#), and from there to machine code using gcc (for the C backends) or bytecode (for the Java and C# backends). I exclusively use the low-level C backend in this post.
One downside of being strongly moded is that each possible combination of `in`/`out` must be declared separately, rather than the full Prolog unification, which allows any parameter to be either an input or an output (or a combination of both) at runtime. For instance, for the `append` that was discussed in Append example, there need to be explicit declarations (from the Mercury Library Reference):
:- pred append(list(T), list(T), list(T)).
:- mode append(di, di, uo) is det.
:- mode append(in, in, out) is det.
:- mode append(in, in, in) is semidet. % implied
:- mode append(in, out, in) is semidet.
:- mode append(out, out, in) is multi.
This is what the core of N-queens looks like in Mercury (full source here):
:- pred range(int::in, int::in, list(int)::out) is det.
range(Start, End, List) :-
( if Start > End then
List = []
else
range(Start + 1, End, Rest),
List = [Start | Rest]
).
:- pred selectx(int::out, list(int)::in, list(int)::out) is nondet.
selectx(X, [X | Xs], Xs).
selectx(X, [Y | Ys], [Y | Zs]) :- selectx(X, Ys, Zs).
:- pred nqueens(int::in, list(int)::out) is nondet.
nqueens(N, Qs) :-
range(1, N, Ns),
nqueens_aux(Ns, [], Qs).
:- pred nqueens_aux(list(int)::in, list(int)::in, list(int)::out) is nondet.
nqueens_aux([], Qs, Qs).
nqueens_aux(UnplacedQs, SafeQs, Qs) :-
selectx(Q, UnplacedQs, UnplacedQs1),
not_attack(Q, 1, SafeQs),
nqueens_aux(UnplacedQs1, [Q | SafeQs], Qs).
:- pred not_attack(int::in, int::in, list(int)::in) is semidet.
not_attack(_, _, []).
not_attack(Q0, D0, [Q | Qs]) :-
Q0 \= D0 + Q,
Q \= D0 + Q0,
D1 = D0 + 1,
not_attack(Q0, D1, Qs).
Apart from `range` being turned into an `if ... then ... else` to make it deterministic, this code is essentially the Prolog code with extra declarations.
The time for N=12 was 0.269 seconds - about half the time of my code. Specifying the modes, types and determinism greatly assists the compiler with optimization, and it would be extremely difficult for an implementation to beat this.
Haskell
Haskell is a pure functional language, not a logic programming language (it does not use unification or backtracking in its usual operation), but it seemed worth including a pure functional language for comparison, and Haskell is a widely used and powerful choice. It is statically typed, and has lazy evaluation.
This is a fast implementation of N-queens in Haskell:
-- Build with: ghc -O3 -package time --make nqueens.hs
import Data.List (delete)
type Board = [Int]
nqueens :: Int -> [Board]
nqueens n = map fst $ loop [([], [1..n])] 0
where
loop :: [(Board, [Int])] -> Int -> [(Board, [Int])]
loop boards counter
| counter == n = boards
| otherwise = loop (concatMap expand boards) (counter+1)
expand :: (Board, [Int]) -> [(Board, [Int])]
expand (board, candidates) =
[(x : board, delete x candidates) | x <- candidates, safe x board]
safe x board = and [ x /= c + n && x /= c - n | (n,c) <- zip [1..] board]
The time for N=12 was 0.152 seconds, which to me is surprisingly fast. Not having to do Prolog-style backtracking certainly helps, as the Haskell version can use a more limited and efficient form of search. This implementation of N-queens was the fastest of several that I found (I didn’t write it). One of the advantages ascribed to Haskell is that it is high level enough that the compiler can optimize aggressively, although fairly deep knowledge is necessary to not shoot the performance in the foot.
Type/Mode/Determinism declarations
In the Mercury section above, type, mode and determinism declarations were added. Adding some or all of these (or being able to infer them from code analysis, which is harder and likely to be less complete) could shrink the code and make it faster by eliminating a lot of checks. For instance, for `not_attack`, the Mercury declaration is:
:- pred not_attack(int::in, int::in, list(int)::in) is semidet.
Applying this information to the output of the compiler (done by hand for this example, but it could be added to the compiler), I get:
uint8_t not_attack_3(Prolog* p, UWORD arg0, UWORD arg1, UWORD arg2, UWORD voffset, UWORD* voffset_new) {
{
uint8_t tag_arg2;
tag_arg2=(arg2&TAG_MASK);
if(tag_arg2==TAG_EOL) {goto s_label_c0_arg2;}
goto fail_label_c0_no_unwind;
s_label_c0_arg2:;
return true;
}
fail_label_c0_no_unwind:;
{
UWORD var0, var1, var2;
UWORD arg2lc, arg2h, arg2t;
{
List* arg2l=&p->list_values[arg2>>TAG_WIDTH];
arg2h=arg2l->head;
pointer_chase_notag(p,&arg2h);
var0=arg2h;
arg2t=arg2l->tail;
pointer_chase_notag(p,&arg2t);
var1=arg2t;
}
if(arg0==(arg1+var0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
if(var0==(arg1+arg0-TAG_INTEGER)) {goto fail_label_c1_no_unwind;}
var2=(arg1+((1<<TAG_WIDTH)+TAG_INTEGER)-TAG_INTEGER);
{
bool found=not_attack_3(p, arg0, var2, var1, voffset, voffset_new);
if(!found) {goto fail_label_c1;}
}
return true;
}
fail_label_c1:;
fail_label_c1_no_unwind:;
return false;
}
which is significantly faster. It doesn’t need to do any type checking, doesn’t need any unification or backtracking machinery, and nothing needs to be added to the trail because it doesn’t call anything that might backtrack.
Speed results
All tests were done on an Intel(R) Core(TM) i5-1135G7 @ 2.40GHz laptop with 20 GB of RAM running Debian 12, with turbo boost disabled for consistent timing:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
with the machine under otherwise very light load (no web browser open, for instance). For most tests I ran each benchmark as many times as I could in 600 seconds, then divided the total CPU runtime by the run count (GNU Prolog required shorter runs or the global stack would overflow).
I was unable to get results for N=16 for SWI Prolog (Stack limit (1.0Gb) exceeded), and for GNU Prolog (it ran out of global stack which seems to have a limit of 1611392 Kb).
Results (in seconds, smaller is better), again noting that these results are based off one example problem so are only indicative:
| N | SWI Prolog | GNU Prolog | My Code | My Code Adj | Mercury | Haskell | C (Clang) |
|---|---|---|---|---|---|---|---|
| 4 | 0.0000205 | 0.00000684 | 0.00000483 | 0.00000475 | 0.00000199 | 0.00000134 | 0.000000411 |
| 5 | 0.0000823 | 0.0000226 | 0.0000161 | 0.0000149 | 0.00000708 | 0.00000358 | 0.000000738 |
| 6 | 0.000301 | 0.0000756 | 0.0000529 | 0.0000465 | 0.0000184 | 0.0000160 | 0.00000191 |
| 7 | 0.00134 | 0.000326 | 0.000224 | 0.000187 | 0.0000848 | 0.0000463 | 0.00000687 |
| 8 | 0.00604 | 0.00143 | 0.000947 | 0.000796 | 0.000407 | 0.000243 | 0.0000522 |
| 9 | 0.0289 | 0.00666 | 0.00432 | 0.00362 | 0.00203 | 0.00121 | 0.000363 |
| 10 | 0.143 | 0.0321 | 0.0206 | 0.0168 | 0.00931 | 0.00581 | 0.00202 |
| 11 | 0.766 | 0.169 | 0.106 | 0.0876 | 0.0476 | 0.0295 | 0.0108 |
| 12 | 4.41 | 0.970 | 0.592 | 0.481 | 0.269 | 0.152 | 0.0586 |
| 13 | 26.9 | 5.80 | 3.49 | 2.75 | 1.91 | 0.940 | 0.339 |
| 14 | 173 | 37 | 21.8 | 16.7 | 14.9 | 5.81 | 2.11 |
| 15 | 1194 | 252 | 145 | 114 | 152 | 38.3 | 13.9 |
| 16 | * | * | 1020 | 813 | 1368 | 268 | 96.5 |
In general the runtimes are ordered: C (best) < Haskell < Mercury < my code (fast not_attack) < my code < GNU Prolog < SWI Prolog (worst).
Plotting these as a ratio of \(\frac{time\ taken}{time\ taken_{my\ code}}\) (smaller is better):
Guessing about some of the features in this graph:
- The relative proportion of time spent in `not_attack` increases as N increases. SWI Prolog's `not_attack` is particularly inefficient as it doesn't optimize out the choice point, and GNU Prolog may have this problem too - I think this is responsible for the runtime ratio getting worse with increasing N. Making only `not_attack` more efficient has the opposite effect on my code (fast not_attack), reducing the relative runtime as N increases.
- Mercury does well until about N=14, then gets a lot worse - my guess is there is some sort of garbage collection slowdown going on there.
- All the other bumps in the graph are at relatively low N. Low N is going to be pretty noisy for particularly efficient implementations, as the effect of things like the timing code gets in the way of measuring the actual runtime.
Why use machine code or MLIR (LLVM)
Machine Code
For this project, I compiled Prolog to C. What I had originally planned (but decided it would take more time than I had available) was to write a Prolog to x86-64 machine code compiler.
There are several advantages of generating x86-64 machine code over generating C. The first two are large roadblocks, if not showstoppers, for using Prolog-to-C in a production-quality system:
- Better control of the stack. I would be able to better control what goes on the stack to reduce stack use, and determine exactly what ranges of the stack need to be saved. This would improve backtracking speed.
- Tail call optimization can be done aggressively, rather than relying on the C compiler to do the right thing. This is important, as in Prolog looping is usually done using recursion.
- More suitable calling conventions can be chosen for register usage. The C and C++ calling conventions on x86-64 have 6 registers (`rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9`) for input and 2 registers (`rax`, `rdx`) for return, and even then they are meant to return a single wide value. This is a good mix for C or C++, where only one value gets returned from a function or method, but there might be far better options for Prolog, where any parameter can be an input or an output.
- Speed and compiler simplicity - Prolog has assert and retract functions, and it would be slow and complex to have to go through a C compiler every time something was changed.
Disadvantages:
- More complex to write, although not much more so - I've already written two code generators for x86-64 for other projects.
- Much more complex to debug.
- Non-portable - if I wanted to target ARM or RISC-V, I would need new backends.
- I don't get the advantage of the `gcc` or `clang` code optimizers. That's many thousands of person-years of compiler development that a DIY solution has no access to.
MLIR (LLVM)
LLVM is a set of compiler tools (of which `clang` is a part) that contains MLIR (Multi-Level Intermediate Representation), which is cross-platform and gives access to low-level optimization. It removes the last two disadvantages of compiling to machine code (non-portability and lack of access to optimizers).
While it looks promising, I don't have enough experience with it to know if there are any hidden fishhooks. Some time ago I looked at the intermediate representation of `gcc`, and that's a month of my life I won't get back - it was very difficult to use and poorly documented. I've read that MLIR is much better in this regard. I also don't yet know how difficult it is to code for and debug.
Future directions
This prototype is limited in functionality and expandability, but if I were to turn it into something larger, I would want to consider the following:
- Generate machine code or MLIR (LLVM intermediate code) rather than C.
- Make tighter use of the stack to speed up backtracking.
- One or more intermediate representations, for maintainability and better optimization.
- Expand to include the other parts of Prolog. This should not be too hard, as the heavy lifting of unification and backtracking has been done.
- More code analysis for optimization - ideally I would add optional type, mode and determinism annotations to allow faster code.
- More powerful garbage collection. While the current code does remove any garbage that is generated by backtracking, which is plenty for N-queens, it would gradually expand the stack and the trail in cases where backtracking doesn't happen (say, a tail recursion used as a loop).
- Any modern version of Prolog would need to support CLP(FD). I have not investigated what would be involved in implementing that.
Conclusion
I developed a prototype compiler that converts a small subset of Prolog to C, and demonstrated that it runs append and N-queens with the behaviour expected of Prolog. Care has been taken with the storage of choice points to make backtracking reasonably fast, and it runs N-queens much faster than two mature Prolog implementations (SWI and GNU).
While the code here is not suitable for further development, a more complete version could be written using these techniques.
I welcome any comments, corrections, observations, or improvement suggestions. Please feel free to contact me.