I enjoy making toy programming languages to better understand how compilers (and, ultimately, the underlying machine) work and to experiment with techniques that aren’t in my repertoire. LLVM is great because I can tinker, then wire it up as the backend and have it generate fast code that runs on most platforms. If I just wanted to see my code execute, I could get away with a simple hand-rolled interpreter, but having access to LLVM’s JIT, suite of optimizations, and platform support is like having a superpower: your little toy can perform impressively well. Plus, LLVM is the foundation of things like Emscripten and Rust, so I like developing intuition about how new technologies I’m interested in are implemented.

I’m going to show how to use the LLVM API to programmatically construct a function that you can invoke like any other and have it execute directly in the machine language of your platform.

In this example, I’m going to use the C API, because it ships with the LLVM distribution alongside the C++ API and is the simplest way to get started. There are bindings to the LLVM API in other languages (Python, OCaml, Go, Rust), but the concepts behind using LLVM to generate code are the same across the wrapper APIs.

This example sort of skips to the middle phase of compiler construction. Assume the frontend (lexer, parser, type-checker) has built an AST and we’re now walking it to emit the intermediate representation of the code for the backend to take and optimize and spit out machine code.

In this case, we’ll just type out the straight-line procedural code for a simple function. Normally that code would be cobbled together dynamically in an AST walker function, which calls the LLVM API when it encounters certain nodes in the tree.

For the example, we’ll build a simple adder function, which takes two integers as arguments and returns their sum, the equivalent of, in C:

int sum(int a, int b) {
    return a + b;
}

To be clear about what we are doing here: we are using LLVM to dynamically build an in-memory representation of this function, using its API to set up things like function entry and exit, return and parameter types, and the actual integer add instruction. Once this in-memory representation is complete, we can instruct LLVM to jump to it and execute it with arguments we supply, just as if it were an executable we had compiled from a language like C.

Click here to view the final code.

Modules

The first step is to create a module. A module is a collection of the global variables, functions, external references, and other data in LLVM. Modules aren’t quite like, say, modules in Python, in that they don’t provide separate namespaces. But they are the top-level container for all things built in LLVM, so we start by creating one.

LLVMModuleRef mod = LLVMModuleCreateWithName("my_module");

The string "my_module" passed to the module factory function is an identifier of your choosing.

Note that as you’re navigating the LLVM C API documentation, different aspects are grouped together under different header includes. Most of what I’m detailing here, such as modules and functions, is contained in the Core.h header, but I’ll include others as we move along.

Types

Next, I create the sum function and add it to the module. A function consists of:

  • its type (return type),

  • a vector of its parameter types, and

  • a set of basic blocks.

I’ll get to basic blocks in a moment. First, we’ll handle the type and parameter types of the function — its prototype, in C terms — and add it to the module.

LLVMTypeRef param_types[] = { LLVMInt32Type(), LLVMInt32Type() };
LLVMTypeRef ret_type = LLVMFunctionType(LLVMInt32Type(), param_types, 2, 0);
LLVMValueRef sum = LLVMAddFunction(mod, "sum", ret_type);

LLVM types correspond to the types that are native to the platforms we’re targeting, such as integers and floats of fixed bit width, pointers, structs, and arrays. (There’s no platform-dependent int type like in C, where the actual size of the integer, 32- or 64-bit, depends on the underlying machine architecture.)

LLVM types have constructors, and follow the form "LLVM*TYPE*Type()". In our example, both the arguments passed to the sum function and the function’s type itself are 32-bit integers, so we use LLVMInt32Type() for each.

The arguments to LLVMFunctionType() are, in order:

  1. the function’s type (return type),

  2. the function’s parameter type vector (the arity of the function should match the number of types in the array),

  3. the function’s arity, or parameter count, and

  4. a boolean indicating whether the function is variadic, that is, whether it accepts a variable number of arguments.

Notice that the function type constructor returns a type reference. This reinforces the notion that what we did here is the LLVM equivalent of declaring a function prototype in C.

The third line adds a function of that type to the module and gives it the name sum. We get a value reference in return, which can be thought of as a concrete location in the code (ultimately, memory) to which we attach the function’s body, which we do below.

Basic blocks

The next step is to add a basic block to the function. A basic block is a run of code with exactly one entry point and one exit point. In other words, the only way execution can pass through it is by stepping through its list of instructions in order. There is no if/else, no while loop, and no jump of any kind inside the block. Basic blocks are the key to modeling control flow and to the optimizations that come later, so LLVM has first-class support for adding them to our in-progress module.

LLVMBasicBlockRef entry = LLVMAppendBasicBlock(sum, "entry");

Note the "append" in the name of the function: it’s helpful to think of what we’re doing as growing a running tally of chunks of code, and so our basic block is appended relative to the function we added to the module previously.

Instruction builders

This notion of a running tally fits with the instruction builder, which is how we add instructions to our function’s one and only basic block.

LLVMBuilderRef builder = LLVMCreateBuilder();
LLVMPositionBuilderAtEnd(builder, entry);

Similar to appending the basic block to the function, we’re positioning the builder to start writing instructions where we left off with the entry to the basic block.

LLVM IR

Sidebar: LLVM’s main stock-in-trade is the LLVM intermediate representation, or IR. I’ve seen it referred to as a midway point between assembly and C. The LLVM IR is a very strictly defined language that is meant to facilitate the optimizations and platform portability that LLVM is known for. If you look at IR, you can see how individual instructions can be translated into the loads, stores, and jumps of the ultimate assembly that will be generated. The IR has 3 representations:

  • as an in-memory set of objects, which is what we’re using in this example,

  • as a textual language like assembly,

  • as a string of bytes in a compact binary encoding, called bitcode.

You may see clang or other tools emit LLVM IR as text or bitcode.

Back to our example. Now comes the crux of our function: the actual instructions that add the two integers passed in as arguments and return their sum to the caller.

LLVMValueRef tmp = LLVMBuildAdd(builder, LLVMGetParam(sum, 0), LLVMGetParam(sum, 1), "tmp");
LLVMBuildRet(builder, tmp);

LLVMBuildAdd() takes a reference to the builder, the two integers to add, and a name to give the result. (The name is required due to LLVM IR’s restriction that all instructions produce intermediate results. This can further be simplified or optimized away by LLVM later, but while generating IR, we follow its strictures.) Since the numbers we wish to add are the arguments that were supplied to the function by the caller, we can retrieve them in the form of the function’s parameters using LLVMGetParam(): its second argument is the index of the parameter we want from the function.

We call LLVMBuildRet() to generate the return statement and arrange for the temporary result of the add instruction to be the value returned.
