LLVM Kaleidoscope .NET
LLVM Kaleidoscope .NET
If you're reading and learning about compilers in these days you will at some point find something about LLVM. I did and I tried to implement the Kaleidoscope tutorial from the official documentation using C#. I won't write a full blown tutorial and in the end I didn't actually follow the official tutorial as it was pretty hard for me to read C++ code.
However I did implement the interpreter, with some bumps on the road and some changes to the syntax. Here you'll find all the code I wrote, roughly equivalent to Chapter 6 & 7 of the official tutorial.
Here you can see the code running the Mandelbrot example
I'm going to point out things that I found out there weren't obvious for me and did confuse me along the road.
First, how do we use LLVM in C# code? There is a nuget package from Microsoft, source here, I suggest to use the 11.0.0-beta since the last non beta version uses LLVM 5. The library contains bindings for the C api of LLVM, when you have some issue look for the actual C code that is being called and google it, it's easier than finding the corresponding C# method/type. I did find something useful looking in projects using the nuget itself so look for them on github, some of the projects are incomplete.
Major differences between the official tutorial and my implementation
Here the key differences:
- I used the visitor pattern and the source code is read eagerly into tokens.
- Functions definitions requires a comma
,
between arguments egdef f(x,y) x*y
instead ofdef f(x y) x*x
- I couldn't understand the Jit class they made in the tutorial, so I didn't do it. I'm using the MCJIT in a way that I hope is not too wrong. I cannot exclude that some of the problems I encountered are related to this.
How the interpreter evaluates expressions
Let's say we want to evaluate something simple 1+2;
then the interpreter emits code of a function with no arguments that returns the evaluated expression as a result, in pseudocode
fun a() { return 3; }
Then it evaluates the function and write the result on the screen.
The issues that hit me hard
Here's the list of what I learned by spending an awful lot of time scratching my head :)
- Suppose you want to evaluate something like this
1+10;20*2;
. We have two subsequent expressions to be evaluated, let's call theme1
ande2
. The interpreter, using the MCJIT, does not work if you implement this flow
emit e1, evaluate e1, emit e2, evaluate e2
You need to implement this flow
emit e1, emit e2, evaluate e1, evaluate e2
Maybe it's obvious but it wasn't for me.
- Same as before, we want to evaluate
1+10;20*2;
. We need to define two functions but we have no name for them, at the beginning I passed an empty string in theAddFunction
method. It turns out that LLVM is able to emit the code, it will name the function@1
and the second one@2
and so on. However when you evaluate the second function you get the result of the first one. I solved by calling the functionanon_expr
as in the official tutorial. You'll get@anon_expr
,@anon_expr.1
,@anon_expr.3
and so on and evaluation will be correct. I have some C code to prove this point if needed. - At chapter 7 when implementing mutable variables the official tutorial uses the type
AllocaInst
, there is anAllocaInst
in the C# library but I couldn't find out how to use it. Moreover the methods for allocating, storing and retrieving values in the C api works withLLVMValueRef
in both the pointer val and in the actual val for example
LLVMValueRef LLVMBuildLoad2(LLVMBuilderRef, LLVMTypeRef Ty, LLVMValueRef PointerVal, const char *Name)
In the arguments we find a LLVMValueRef
representing a pointer, the return value is a LLVMValueRef
containing an actual value.
Since the C# library is a binding for the C api this is what you'll find. Forget AllocaInst
and be careful with the LLVMBuildStore
since in its arguments you will find two LLVMValueRef
, one for the pointer and one for the value. If you switch them, like I did a couple of times, the code will obviously fail.
- Again at chapter 7, during the refactoring needed for the mutable variables, the code emitted for the function definition has to be changed. For example for a function
def f(x,y) x+y;
, what it is required to emit is these (in pseudocode)
def f(x,y):
x1 = alloca double
y1 = alloca double
store x x1
store y y1
x2 = load x1
y2 = load y2
z = x2+y2
return z
Which is pretty weird, I couldn't get my head around this and I still don't. But I checked the IR generated from an equivalent C code
int f(x, y)
{
return x + y;
}
compiling the file with clang -S -emit-llvm main.c
I got
define dso_local i32 @f(i32 %0, i32 %1) #0 {
%3 = alloca i32, align 4
%4 = alloca i32, align 4
store i32 %0, i32* %3, align 4
store i32 %1, i32* %4, align 4
%5 = load i32, i32* %3, align 4
%6 = load i32, i32* %4, align 4
%7 = add nsw i32 %5, %6
ret i32 %7
}
If clang does that, I should too. Please note that using -O1
or -O2
or -O3
produces a simpler IR.
Remarks
If you're code is failing, check the generated IR with the optimization passes on and off. Both can give clues on where the error is. If you understand C++ follow the official tutorial. Check how clang
emits IR for simple C code. My code is working in Fedora but it's not thoroughly tested.