Getting Started with LLVM Core Libraries

Introducing LLVM's basic design principles and its history

LLVM is a notably didactic framework: its tools are highly organized, which lets the curious user observe many steps of the compilation. The design decisions go back to its first versions, more than 10 years ago, when the project had a strong focus on backend algorithms and relied on GCC to translate high-level languages, such as C, to the LLVM intermediate representation (IR). Today, a central aspect of the design of LLVM is its IR. It uses the Static Single Assignment (SSA) form, with two important characteristics:

  • Code is organized as three-address instructions
  • It has an infinite number of registers
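
To make these two characteristics concrete, the following is a minimal sketch in Python, not LLVM code. It lowers a nested expression into three-address instructions, taking a fresh virtual register for every result so that no register is ever assigned twice. The function names (lower, to_three_address) and the %t register naming are assumptions made for this illustration only.

```python
from itertools import count

def lower(expr, code, fresh):
    """Lower a nested expression into three-address instructions.

    expr is either a leaf (a variable name or an integer) or a tuple
    (op, lhs, rhs). Returns the name of the value that holds the result.
    """
    if not isinstance(expr, tuple):
        return str(expr)           # leaves are used directly as operands
    op, lhs, rhs = expr
    a = lower(lhs, code, fresh)
    b = lower(rhs, code, fresh)
    dest = f"%t{next(fresh)}"      # a brand-new virtual register, never reused
    code.append((dest, op, a, b))  # one destination, one operator, two operands
    return dest

def to_three_address(expr):
    code = []
    result = lower(expr, code, count())
    return code, result

# (a + b) * (a - 4)
code, result = to_three_address(("mul", ("add", "a", "b"), ("sub", "a", 4)))
for dest, op, a, b in code:
    print(f"{dest} = {op} {a}, {b}")
```

Because the supply of registers is unlimited, the sketch never needs to decide which value to overwrite. Each register has exactly one definition, which is the property that SSA-based optimizations rely on.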

This does not mean, however, that LLVM has a single form of representing the program. Throughout the compilation process, other intermediary data structures hold the program logic and help its translation across major checkpoints. Technically, they are also intermediate forms of program representation. For example, LLVM employs the following additional data structures across different compilation stages:

  • When translating C or C++ to the LLVM IR, Clang represents the program in memory by using an Abstract Syntax Tree (AST) structure (the TranslationUnitDecl class)
  • When translating the LLVM IR to a machine-specific assembly language, LLVM first converts the program to a Directed Acyclic Graph (DAG) form to allow easy instruction selection (the SelectionDAG class) and then converts it back to a three-address representation to allow instruction scheduling to happen (the MachineFunction class)
  • To implement assemblers and linkers, LLVM uses a fourth intermediary data structure (the MCModule class) to hold the program representation in the context of object files

Despite these other forms of program representation in LLVM, the LLVM IR is the most important one. It is distinctive in being not only an in-memory representation but also one that can be stored on disk. The fact that the LLVM IR has a specific encoding that lets it live in the outside world is another important decision that was made early in the project's lifetime. It reflected, at that time, an academic interest in studying lifelong program optimizations.

In this philosophy, the compiler goes beyond applying optimizations at compile time and explores optimization opportunities at installation time, at runtime, and at idle time (when the program is not running). The optimization thus happens throughout the program's entire life, which explains the name of the concept. For example, when the user is not running the program and the computer is idle, the operating system can launch a compiler daemon that processes the profiling data collected during runtime and reoptimizes the program for the specific use cases of this user.

Notice that, because it can be stored on disk, the LLVM IR is a key enabler of lifelong program optimizations and offers an alternative way to encode entire programs. When the whole program is stored in the form of a compiler IR, it is also possible to perform a new range of very effective inter-procedural optimizations that cross the boundary of a single translation unit, such as a C file. This, in turn, allows powerful link-time optimizations to happen.
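
The benefit of seeing the whole program at once can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not the way LLVM implements link-time optimization: each translation unit is a dictionary that maps a function name to a body, a body is either a constant or a call to another function, and the only optimization is replacing a call with the callee's constant. The names fold_calls, unit_a, and unit_b are hypothetical.

```python
def fold_calls(module):
    """Replace calls to constant-returning functions with the constant.

    module maps a function name to its body. A body is either
    ("const", n) or ("call", callee_name). A call can only be folded
    when the callee's body is visible inside `module`.
    """
    optimized = {}
    for name, body in module.items():
        callee = module.get(body[1]) if body[0] == "call" else None
        if callee is not None and callee[0] == "const":
            optimized[name] = callee   # the callee's constant replaces the call
        else:
            optimized[name] = body     # nothing known about the callee
    return optimized

# Two hypothetical translation units
unit_a = {"main": ("call", "answer")}
unit_b = {"answer": ("const", 42)}

# Optimizing unit_a alone: the body of "answer" is not visible
separate = fold_calls(unit_a)

# Optimizing the merged program: the call can now be folded
whole_program = fold_calls({**unit_a, **unit_b})

print(separate["main"])
print(whole_program["main"])
```

When unit_a is optimized on its own, the call in main must be left untouched. Once both units are merged into a single module, which is what an on-disk IR makes possible at link time, the same pass can see through the call.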

On the other hand, before lifelong program optimizations can become a reality, programs would need to be distributed at the LLVM IR level, which is not what happens in practice. It would imply that LLVM runs as a platform or virtual machine and competes with Java, a path that poses serious challenges of its own. For example, the LLVM IR is not target-independent in the way Java bytecode is. LLVM has also not invested in powerful feedback-directed optimizations for the post-installation time. For the reader who is interested in these technical challenges, we suggest a helpful LLVMdev discussion thread at http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/043719.html.

As the project matured, the design decision to maintain an on-disk representation of the compiler IR remained as an enabler of link-time optimizations, while the original idea of lifelong program optimizations received less attention. Eventually, LLVM's core libraries formalized their lack of interest in becoming a platform by renouncing the acronym Low Level Virtual Machine and keeping just the name LLVM for historical reasons. This made it clear that the LLVM project is geared toward being a strong and practical C/C++ compiler rather than a Java platform competitor.

Still, the on-disk representation alone has promising applications, besides link-time optimizations, that some groups are working to bring to the real world. For example, the FreeBSD community wants to embed the LLVM program representation in program executables to allow install-time or offline microarchitectural optimizations. In this scenario, even if the program was compiled for a generic x86 target, when the user installs it on, for example, a specific Intel Haswell x86 processor, the LLVM infrastructure can use the LLVM representation of the binary and specialize it to use new instructions supported on Haswell. Even though this is a new idea that is currently being assessed, it demonstrates that the on-disk LLVM representation allows for radical new solutions. The expectations are limited to microarchitectural optimizations because the full platform independence seen in Java looks impractical in LLVM, and this possibility is currently explored only in external projects (see PNaCl, Chromium's Portable Native Client).
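
The install-time scenario can be sketched in a few lines of Python. This is a toy model, not the FreeBSD proposal or any LLVM API: the stored program keeps a high-level fused multiply-add operation, and a hypothetical select function lowers it differently depending on the features of the processor found at installation. Fused multiply-add serves here as an example of an instruction that Haswell supports and a generic x86 target does not. The output is pseudo-assembly with made-up mnemonics.

```python
def select(ir, features):
    """Lower a tiny IR to pseudo-assembly for a target with the given features.

    ir is a list of tuples (dest, op, operand, ...). The "fma" operation
    computes operand0 * operand1 + operand2. It is emitted as a single
    instruction when the target has the "fma" feature and is otherwise
    expanded into a multiply followed by an add.
    """
    out = []
    for dest, op, *args in ir:
        if op == "fma" and "fma" not in features:
            a, b, c = args
            out.append(f"mul {dest}, {a}, {b}")
            out.append(f"add {dest}, {dest}, {c}")
        else:
            out.append(f"{op} {dest}, {', '.join(args)}")
    return out

# The representation shipped with the executable (hypothetical)
stored_ir = [("r0", "fma", "x", "y", "z")]

generic = select(stored_ir, features=set())     # conservative lowering
haswell = select(stored_ir, features={"fma"})   # specialized at install time

print(generic)
print(haswell)
```

The point of the sketch is that the decision is deferred: the same stored representation yields two instructions on the generic target and one on the more capable processor, without the program having to be redistributed.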

The two basic principles of the LLVM IR that guided the development of the core libraries are the following:

  • SSA representation and infinite registers that allow fast optimizations
  • Easy link-time optimizations by storing entire programs in an on-disk IR representation