Saturday, December 27, 2008

Simple lisp interpreters in C, Java and Erlang


Overview

I have already covered the lisp interpreters in Erlang and Java. I have created another example that is based on C. So, you have three different implementations in three different languages. Also, these interpreters are based on Peter Norvig's JScheme implementation. That is a fourth project that you can compare and contrast.

Peter Norvig created JScheme, a Lisp Scheme dialect written in Java. It was created in 1998 and it has since been forked, extended and used quite extensively in other modern projects for the web and elsewhere. The C lisp interpreter is based on the JScheme interpreter. The JScheme code is copyrighted to Peter Norvig This is only a modifications to make our Lisp even lighter
than it was originally. Norvig's most recent implementation of JScheme contains 1905 lines of Java code. This implementation contains 2000 lines of C code wrapped in additional Javadoc comments, including implementations of list, hashtable datastructures.

There is a lot that can be added to Norvig's implementation. Add full support for a R5RS Scheme Implementation. Add better input/output support. Copy interesting functions from Arc or Common Lisp. Write a Scheme compiler. In fact, the source code in my implementation and Norvig's is fairly straightforward,

C implementation

To Compile:
type 'make' at the command line prompt
To Run:
type 'make run' to run the tests or ./octanec

For Example:

test.lisp contains the following:
-------------
(+ 4 5 6)
(+ 6 6 6)
(+ 1 2 3 )
(+ 1 (+ 1 (* 1 1)))
-------------

./octanec test.lisp

Example Output

...
...
trace: eval[t1] - instance of procedure calltrace: eval - instance of String: data=[+]
trace: eval - instance of non-pair =>
trace: eval - instance of non-pair =>
trace: eval - instance of non-pair =>
at apply id= 27 4.000000 x==>97125c0 16
at plus 27

test eval [result after eval]: object=9712918 type=16 [15.000000]
...
...

Object Oriented C

The code is written with an object oriented approach. There is one generic object definition that contains data for different types.
octane_object.h

/*
 * The default object type for the scheme parser.
 * The obj_type determines.
 * 
 * If the struct is of a particular type 'E.g. String'
 * then we will need reference that particular pointer.
 *
 * String = string_data;
 * Char   = char_data;
 */
typedef struct Object_struct {
    int    obj_type;    
    char   *string_data;
    double double_data;
    int    boolean_data;
    char   char_data;

    struct Pair_struct      *pair_data;
    struct Procedure_struct *proc_data;
} Object;

To create a new string object, we would do something like the following:
struct Object_struct *new_empty_object(int obj_type) {
    struct Object_struct *new_obj = (struct Object_struct *) malloc(sizeof(struct Object_struct));
    new_obj->obj_type = obj_type;
    return new_obj;
}
...
...
...

struct Object_struct *new_str_object(char *string_data) {
    struct Object_struct* obj = new_object(OBJ_STRING_TYPE, string_data, 0, 0);
    return obj;
}

With Java, Compare and Contrast
The C implementation is an almost exact copy of the Java implementation. Of course, with the C implementation, we created the list, hashtable, and buffer libraries that are available in Java as well as some other modifications. But, the logic is essentially the same as the JScheme and mini scheme code. For example, here is the C and Java version of the lisp 'read' routine.
/**
 * Read and return a Scheme expression, or EOF.
 */
struct Object_struct *read(struct Octane_struct *octane) {
    struct Object_struct *token;
    char *str;
    printf("trace: read()\n");

    token = next_token(octane);
    if (token == NULL) {
        return NULL;
    }
    if (token->obj_type == OBJ_STRING_TYPE) {
        str = token->string_data;
        if (strcmp("(", str) == 0) {
            return read_tail(octane);
        } else if (strcmp(")", str) == 0) {
            printf("WARN: Extra ')' ignored.");
            return read(octane);
        } else {
            return token;
        } // End of the if - else    } else {
        return token;       
    } // End of the if - else}

And the Java version...
    /**
     * Read and return a Scheme expression, or EOF.
     */
    public Object read() {
        System.out.println("trace: read()");
        try {
            Object token = nextToken();
            System.out.println("trace: read() token=" + token);
            if (token == "(") {
                return readTail();
            } else if (token == ")") {
                System.out.println("WARN: Extra ')' ignored.");
                return read();
            } else {
                return token;
            } // End of the if - else        } catch (IOException e) {
            System.out.println("WARN: On input, exception: " + e);
            return EOF;
        } // End try - catch    }


References

Full Source for all three implementations, all_simple_lisp
Description of the Java Implementation

Description of the Erlang Implementation

http://jvmnotebook.googlecode.com/files/mini_jscheme.tar.gz

Bugs

In the load_scheme function, there is an error with the use of 'strcmp', change to the following.
load_scheme():
...
    if (object->obj_type == OBJ_STRING_TYPE) {
                if (strcmp(object->string_data, OCT_EOF) == 0) {
                    printf("exiting read loop (EOF)\n");
                    return;
                }
            }
...

-----------
Addendum(2) - Simple RPN Calculator.

With very little effort, I changed the lisp interpreter into a concatenative/stack language interpreter. In the update, instead of parsing the lisp syntax, the parser detects the various tokens and places or pops values off the stack. With Lisp, the main data structure is the 'list', with stack languages the core data structure mainly consists of a LIFO stack.

"In Reverse Polish notation the operators follow their operands; for instance, to add three and four, one would write "3 4 +" rather than "3 + 4". If there are multiple operations, the operator is given immediately after its second operand; so the expression written "3 − 4 + 5" in conventional infix notation would be written "3 4 − 5 +" in RPN: first subtract 4 from 3, then add 5 to that. An advantage of RPN is that it obviates the need for parentheses that are required by infix"

http://en.wikipedia.org/wiki/Reverse_Polish_notation

Here is an example source file (test.joy):
10 20 +
;; Result after call should be 30
4 * dup *

The values 10 and 20 are the first values put on the stack, these numbers are input parameters to the '+' operation.  The result of '30' is placed onto the stack.  Then the value of 4 is placed on the stack. The '*' multiplication operator is called with 4 and 30 as parameters.  And so on.  Here is the output from these operations:
Usage: octanec <source file>
trace: [EVAL RESULT]:  object=661ef8 type=16 [10.000000]

+- STACK -- top ---------+
| id:0 10.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662060 type=16 [20.000000]

+- STACK -- top ---------+
| id:0 20.000000
| id:1 10.000000
+--------- bottom -------+
        $ function input => (+)
+- STACK -- top ---------+
| id:0 20.000000
| id:1 10.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662160 type=16 [30.000000]

+- STACK -- top ---------+
| id:0 30.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662228 type=16 [4.000000]

+- STACK -- top ---------+
| id:0 4.000000
| id:1 30.000000
+--------- bottom -------+
        $ function input => (*)
+- STACK -- top ---------+
| id:0 4.000000
| id:1 30.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662328 type=16 [120.000000]

+- STACK -- top ---------+
| id:0 120.000000
+--------- bottom -------+
        $ function input => (dup)
trace: [EVAL RESULT]:  object=662328 type=16 [120.000000]

+- STACK -- top ---------+
| id:0 120.000000
| id:1 120.000000
+--------- bottom -------+
        $ function input => (*)
+- STACK -- top ---------+
| id:0 120.000000
| id:1 120.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662570 type=16 [14400.000000]

+- STACK -- top ---------+
| id:0 14400.000000
+--------- bottom -------+
        $ function input => (print_stack)
+- STACK -- top ---------+
| id:0 14400.000000
+--------- bottom -------+
trace: [EVAL RESULT]:  object=662570 type=16 [14400.000000]

+- STACK -- top ---------+
| id:0 14400.000000
+--------- bottom -------+
building (EOF)
;;*************************************;; Result after call should be = '14400.000000';; In Joy, we get the following as well:;; 10 20 + 4 * dup *.;;*************************************

Most of the modifications from the original Lisp interpreter code were made to the apply function. Instead of constructing a linked-list pair structure, all operations are made against the main stack:
...
...
    if (minArgs == 2) {
        switch (idNumber) {
        case PLUS:
        case MINUS:
        case DIVIDE:
        case TIMES:
            compute_res = stack_compute(idNumber, main_stack);
            if (compute_res != NULL) {
                stack_push(main_stack, compute_res);
            }
            return compute_res;
        case SWAP:
            // swap      :   X Y  ->   Y X            // Interchanges X and Y on top of the stack.            last_obj  = stack_pop(main_stack);
            last_obj2 = stack_pop(main_stack);
            stack_push(main_stack, last_obj);
            stack_push(main_stack, last_obj2);
            return last_obj2;
        default:
            printf("Invalid id number=%d\n", idNumber);
            error("Internal error: unknown primitive: ");
            return NULL;
        }; // End of Switch    } else if (minArgs == 1) {
        switch(idNumber) {
        case DUP:
            // dup      :   X  ->   X X            // Pushes an extra copy of X onto stack.            last_obj = stack_pop(main_stack);
            stack_push(main_stack, last_obj);
            stack_push(main_stack, last_obj);
            return last_obj;
        case ID:
            last_obj = stack_pop(main_stack);
            stack_push(main_stack, last_obj);
            return last_obj;
        case POP:
            // pop      :   X  ->            // Removes X from top of the stack.            last_obj = stack_pop(main_stack);
            return last_obj;
        default:
            printf("Invalid id number=%d\n", idNumber);
            error("Internal error: unknown primitive: ");
            return NULL;
        }; // End of Switch...
...

More RPN/Concatenative language resources:

http://www.latrobe.edu.au/philosophy/phimvt/joy.html

http://haskellnotebook.googlecode.com/files/simple_rpncalc.tar.gz

http://math-services.appspot.com/lang2.jsp
-----------

8 comments:

Peter said...

Technically, Brandeis's Tim Hickey started JScheme, not Norvig. Does this matter? Not really, I just think he is a great guy.

Berlin Brown said...

Are you sure. I am pretty sure Norvig created and he even says so.

http://norvig.com/jscheme.html

"Jscheme is my implementation of Scheme in Java. This page presents Jscheme version 1.4, the last version that I released (in April 1998)"

His name is copyrighted to him and is on the source code.

"Since April 1998, development has been picked up by others, notably Tim Hickey at Brandeis and Ken Anderson at BBN."

akuhn said...

They say you can write C in any language. Which I agree is a bad thing if you write C in an OO language. But writing OO in C? I wonder what's the performance penalty when writing OO in C.
How fast is you implementation? ^^ Can you run the following code and report the output of the time cmd for all three of your implementations?

(define (fib x)
(cond ((eq x 0) 1)
((eq x 1) 1)
(true (+ (fib (- x 1)) (fib (- x 2))))))
(fib 23)

Berlin Brown said...

http://www.planetpdf.com/codecuts/pdfs/ooc.pdf

Objected Oriented programming in C is a technique that a lot of people use. In theory, objected oriented can mean many things. You can treat an 'int' as an object or a pointer. In the case of my application, I wanted to replicate as much as I could from the java code, so I replicated a small subset of functionality of various Java objects. Strings, Double, Char, List, Hashtable, Buffer etc. But they are basically structs.

With that being said, there shouldn't be a performance penalty. The only performance penalty is that I spent a couple of hours on the implementation, so I it isn't exactly the best code.

If you look at the Linux kernel, they make heavy use of unions and structs and similar structures.

(define (fib x)
(cond ((eq x 0) 1)
((eq x 1) 1)
(true (+ (fib (- x 1)) (fib (- x 2))))))
(fib 23)

When I get around to it, I will have implement these functions.

akuhn said...

Thanks for the link. Nice to see that there are "modern" styles of writing C code.

It happens that I spend the last week with almost the same work as you. I ported my personal Java lisp to C, and my lesson learned was that its hard to beat the JVM. The JVM is a complexe beast that optimizes the hell out of your Java code. A hand-written OO system in C has a hard stand to beat that. In my case, the initial OO-style C was about half as fast as the Java implementation. And even after tuning the C version, Java still runs faster. That is why I asked you to run the same Fibonacci benchmark with both of your versions.

Berlin Brown said...

akuhn,

Thanks for the reply. I haven't and refuse to work with C++. Yet, the concept of object oriented programming can be applied to many different languages, especially C. That is the approach I try to use when I work with it and I normally don't work with C that often.

With that being said, the JIT/Hotspot and most JVMs are pretty well tuned. I think I was reading that it took the JVM creators 10 years to optimize and tune the JVM. So, it is going to be optimized.

Your example will probably run faster on the JVM than a unoptimized, raw C based implementation. But, if I have the time I will try to run some of your examples.

Also, if I get a chance (probably not), would run a more complicated benchmark, there is where you will normally see the difference.

http://shootout.alioth.debian.org/

Keo said...

A little comment. To me, the Object_struct seems to be wasting unnecessarily much memory. With your definition it is something like 7 x sizeof(int) bytes large. If you used a union for the data slots, the size would drop to sizeof(int) + sizeof(double). With gcc it could even be an anonymous union and the code using it would probably need not change at all.

typedef struct Object_struct {
... int obj_type;
... union {
..... char *string_data;
..... double double_data;
..... int boolean_data;
..... char char_data;

..... struct Pair_struct *pair_data;
..... struct Procedure_struct *proc_data;
... };
} Object;

I am afraid that this is not portable to other compilers but even a named union would be simple to use and much more efficient. It seems to be quite common practice in interpreters for various languages.

Max said...

Mr. Brown,
Let me begin by saying I really appreciate your blog and you're clearly a talented programmer. I'm going over your code and I find it interesting, but I've got one major problem with the C implementation, which is that you've re-implemented the hash table as a singly-linked list, yet retained the name "hash table". Dijkstra would have your ass for that. (But he was an old curmudgeon anyway.) The other thing I'd like to note is that, yes, implementing objects in C is an old and effective concept, (C++ was originally a C library -- then a precompiler, before being transformed into the terrible academic bastardization that it is today) but you haven't really done that either. These are simply structs (which, as noted before me would have been better-implemented as unions) which contain basic data types. Googling for objects in C yields a wealth of articles on the topic, so I'll refrain from lecturing you on how it's done.

All in all -- I'm still reviewing the code, but it seems to be a great little implementation and the only problems here seem to be retaining the Java terminology in your port to C.

Other than that, well done and keep bringing us the interesting bits of code!