Creating the Bolt Compiler: Part 7
A Protobuf tutorial for OCaml and C++
October 03, 2020
8 min read
Now that we’ve desugared our language into a simple “Bolt IR” that is close to LLVM IR, we want to generate LLVM IR. One problem though: our Bolt IR is an OCaml value, but the LLVM API we’re using is the native C++ API. We can’t import the value directly into the C++ compiler backend, so we need to serialise it first into a language-independent data representation.
Hey there! If you came across this tutorial from Google and don’t care about compilers, don’t worry! This tutorial doesn’t really involve anything compiler-specific (it’s just the illustrative example). If you care about OCaml then the first half will be up your street, and if you’re here for C++ you can skip the OCaml section.
Protocol Buffers Schema
Protocol buffers (aka protobuf) encodes your data as a series of binary messages according to a given schema. This schema (written in a .proto file) captures the structure of your data.
Each message contains a number of fields.
Fields are of the form modifier type name = someIndex.
Each field has a modifier: required / optional / repeated. The repeated modifier corresponds to a list/vector of that field, e.g. repeated PhoneNumber represents a value of type List<PhoneNumber>.
For example, the following schema would define a Person in a contact book. Each person has to have a name and id, and optionally could have an email address. They might have multiple phone numbers (hence repeated PhoneNumber), e.g. for their home, their mobile and work.
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}
Note here within the schema for a Person message, we also number the fields (=1, =2, =3, =4). This allows us to add fields later without altering the existing ordering of fields (so you can continue to parse these fields without having to know about new fields).
Every time we want to add a new “type” of field, we define a schema (just like how you might define a class to add a new type of object). We can even define a schema within another schema, e.g. within the schema for a Person, we defined the schema for the PhoneNumber field.
We can also define the schema for an enum type PhoneType. This enum acted as a “tag” for our phone number.
Now, just as you might want a field to be one of many options (specified by the enum type), you might want the message to contain exactly one of many fields.
For example, in our compiler we might want to encode an identifier as one of two options. Here we don’t want to store data in an index field if we’ve tagged it as a Variable.
type identifier =
  | Variable of string
  | ObjField of string * index (* name and field index *)
In our protobuf, we can add fields for each of these options, and use the keyword oneof to specify that only one of the fields in that group will be set at once.
Now, whilst this tells Protobuf that one of those values is set, we have no way of telling which one of them is set. So we can introduce a tag enum field, and query its value (Variable or ObjField) when deserialising the object.
message identifier {
  enum _tag {
    Variable_tag = 1;
    ObjField_tag = 2;
  }

  message _ObjField {
    required string objName = 1;
    required int64 fieldIndex = 2;
  }

  required _tag tag = 1;

  oneof value {
    string Variable = 2;
    _ObjField ObjField = 3;
  }
}
Note since the constructor Variable of string has only one argument of type string, we can just set the type of that field to string. We equally could have defined a message _Variable with schema:
message _Variable {
  required string varName = 1;
}

...

oneof value {
  _Variable Variable = 2;
  _ObjField ObjField = 3;
}
Mapping OCaml Type definitions to Protobuf Schema
Now we’ve looked at the identifier schema, let’s flesh out the rest of the frontend “Bolt IR” schema.
Notice that whenever we have an OCaml variant type like identifier, we have a straightforward formula to specify the corresponding schema. We:
- Add a tag field to specify which constructor the value has.
- Define a message schema if a constructor has multiple arguments, e.g. for ObjField. If the constructor has only one argument (Variable) we don’t need to.
- Specify a oneof block containing the fields for each of the constructors.
We could manually write out all of these schemas, but we’d have to rewrite them every time our language changed. However, we have a secret weapon up our sleeve - a library that will do this for us!
There’s the full Protobuf Schema Guide if you want to learn more about other message types.
PPX Deriving Protobuf
OCaml allows libraries to extend its syntax using ppx hooks. These ppx libraries preprocess your files using the hooks to extend the language’s functionality, so we can tag our files with extra information, such as which types need to have a protobuf schema generated.
We can update our Dune build file to preprocess our ir_gen library with the ppx_deriving_protobuf library:
(library
 (name ir_gen)
 (public_name ir_gen)
 (libraries core fmt desugaring ast)
 (flags (:standard -w -39))
 (preprocess
  (pps ppx_deriving_protobuf bisect_ppx --conditional)))
Note the flags command suppresses warning 39 - “unused rec flag” - since the code the ppx_deriving_protobuf library generates raises a lot of those warnings. I would highly recommend that you also suppress these warnings!
As an aside, bisect_ppx is the tool we use to calculate test coverage. It has a --conditional flag since we don’t want to preprocess the file with coverage info if we’re not computing the test coverage.
Telling PPX deriving protobuf that we want to serialise a type definition is easy - we just stick a [@@deriving protobuf] at the end of our type definition. For variant types, we have to specify a [@key 1], [@key 2] for each of the variants.
For example, for our identifier type definition in our .mli file:
type identifier =
  | Variable of string [@key 1]
  | ObjField of string * int [@key 2]
[@@deriving protobuf]
We do the same thing in the .ml file, except we also specify the file to which we want to write the protobuf schema.
type identifier =
  | Variable of string [@key 1]
  | ObjField of string * int [@key 2]
[@@deriving protobuf {protoc= "../../frontend_ir.proto"}]
This path is relative to the src file frontend_ir.ml. So since this frontend_ir.ml file is in the src/frontend/ir_gen/ folder of the repo, the protobuf schema file will be written to src/frontend_ir.proto. If you don’t specify a file path to the protoc argument, then the .proto file will be written to the _build folder.
One extra tip: the ppx_deriving_protobuf library won’t serialise lists of lists. For example, you can’t have:
type something = Foo of foo list list [@key 1] [@@deriving protobuf]
This is because if we have a message schema for foo, then to get a field of type foo list is straightforward - we use repeated foo in our field schema. But we can’t say repeated repeated foo to get a list of a list of type foo. So to get around this, you’d have to define another type here:
type list_of_foo = foo list [@@deriving protobuf]
(* note: no @key since not a variant type *)

type something = Foo of list_of_foo list [@key 1] [@@deriving protobuf]
Finally, to serialise our Bolt IR using this schema, PPX deriving protobuf provides us with <type_name>_to_protobuf serialisation functions. We use these to get a binary protobuf message, which we write to an output channel (out_chan) as a sequence of bytes.
let ir_gen_protobuf program out_chan =
  let protobuf_message = Protobuf.Encoder.encode_exn program_to_protobuf program in
  output_bytes out_chan protobuf_message
In our overall Bolt compiler pipeline, I write the output to a .ir file if provided, or to stdout:
...
match compile_out_file with
| Some file_name ->
    Out_channel.with_file file_name ~f:(fun file_oc ->
        ir_gen_protobuf program file_oc)
| None -> ir_gen_protobuf program stdout )
Auto-generated Protobuf Schema
The README of the repository for the PPX Deriving Library has a thorough explanation of the mapping. We’ve already gone through an example of the autogenerated schema, so the schema shouldn’t be too unfamiliar. There was a slight modification I made, though. For (ObjField of string * int) I said the generated output was:
message _ObjField {
  required string objName = 1;
  required int64 fieldIndex = 2;
}
This is unfortunately not quite true. I put in the objName and fieldIndex for clarity, but the library doesn’t know what the semantic meaning of string * int is, so instead it looks like:
message _ObjField {
  required string _0 = 1;
  required int64 _1 = 2;
}
That’s the downside of the autogenerated schema. You get the generic field names _0, _1, _2 and so on instead of objName and fieldIndex. And for type block_expr = expr list, you get the field name _:
message block_expr {
  repeated expr _ = 1;
}
Decoding Protobuf in C++
Alright, so far we’ve looked at Protobuf schema definitions and how PPX Deriving Protobuf encodes messages and generates the schema. Now we need to decode it using C++. As with the encoding section, we don’t need to know the details of how Protobuf represents our messages in binary.
Generating Protobuf Deserialisation Files
The protoc compiler automatically generates classes and function definitions from the .proto file: these are exposed in the .pb.h header file. For frontend_ir.proto, the corresponding header file is frontend_ir.pb.h.
For Bolt, we’re building the C++ compiler backend with the build system Bazel. Rather than manually running the protoc compiler to get our .pb.h header file and then linking in all the other generated files, we can instead take advantage of Bazel’s integration with protoc and get Bazel to run protoc during the build process for us and link it in.
The Bazel build system works with many languages, e.g. Java, Dart and Python, not just C++. So we specify two libraries - a language-independent proto_library, and then a specific C++ library (cc_proto_library) that wraps around that library:
load("@rules_cc//cc:defs.bzl", "cc_library")# Convention:# A cc_proto_library that wraps a proto_library named foo_proto# should be called foo_cc_proto.cc_proto_library(name = "frontend_ir_cc_proto",deps = [":frontend_ir_proto"],visibility = ["//src/llvm-backend/deserialise_ir:__pkg__"],)# Conventions:# 1. One proto_library rule per .proto file.# 2. A file named foo.proto will be in a rule named foo_proto.proto_library(name = "frontend_ir_proto",srcs = ["frontend_ir.proto"],)
We include the cc_proto_library(...) as a build dependency of whatever files need to use the .pb.h file, and Bazel will compile the protobuf file for us. In our case, this is our deserialise_ir library.
load("@rules_cc//cc:defs.bzl", "cc_library")cc_library(name = "deserialise_ir",srcs = glob(["*.cc"]),hdrs = glob(["*.h"]),visibility = ["//src/llvm-backend:__pkg__", "//src/llvm-backend/llvm_ir_codegen:__pkg__", "//tests/llvm-backend/deserialise_ir:__pkg__"],deps = ["//src:frontend_ir_cc_proto", "@llvm"])
Deserialising a Protobuf serialised file
The frontend_ir.pb.h file defines a namespace Frontend_ir, with each message mapping to a class. So for our Bolt program, represented by the program message in our frontend_ir.proto file, the corresponding Protobuf class is Frontend_ir::program.
To deserialise a message of a given type, we create an object of the corresponding class. We then call the ParseFromIstream method, which deserialises the message from the input stream and stores it in the object.
This method returns a boolean value indicating success/failure. We’ve defined our own custom DeserialiseProtobufException to handle failure:
Frontend_ir::program deserialiseProtobufFile(std::string &filePath) {
  Frontend_ir::program program;
  std::fstream fileIn(filePath, std::ios::in | std::ios::binary);
  ...
  if (!program.ParseFromIstream(&fileIn)) {
    throw DeserialiseProtobufException("Protobuf not deserialised from file.");
  }
  return program;
}
So we’re done!
Well, one caveat… this autogenerated class is a dumb data placeholder. We need to read the data out from the program object.
Reading out Protobuf data from a protobuf class
If you try to read the protoc-generated file frontend_ir.pb.h, you’ll realise it’s a gargantuan monstrosity, which is not nearly as nice as the standard example in the official tutorial. So instead of trying to read the methods from the file, here is an explanation of the structure of the header file.
TIP: Make sure you have code completion set up in your IDE - it means you won’t need to manually search through frontend_ir.pb.h for the right methods.
We’ll be deserialising messages using the schema defined below:
package Frontend_ir;

message expr {
  enum _tag {
    Integer_tag = 1;
    Boolean_tag = 2;
    Identifier_tag = 3;
    ...
    IfElse_tag = 11;
    WhileLoop_tag = 12;
    Block_tag = 15;
  }

  message _Identifier {
    required identifier _0 = 1;
    optional lock_type _1 = 2;
  }
  ...
  message _IfElse {
    required expr _0 = 1;
    required block_expr _1 = 2;
    required block_expr _2 = 3;
  }

  message _WhileLoop {
    required expr _0 = 1;
    required block_expr _1 = 2;
  }
  ...

  required _tag tag = 1;

  oneof value {
    int64 Integer = 2;
    bool Boolean = 3;
    _Identifier Identifier = 4;
    ...
    _IfElse IfElse = 12;
    _WhileLoop WhileLoop = 13;
    block_expr Block = 16;
    ...
  }
}

message block_expr {
  repeated expr _ = 1;
}
Protobuf message classes and enums
Each Protobuf message definition maps to a class definition, and Protobuf enums map to C++ enums. The generated code lives in a namespace named after the package (Frontend_ir), and to give each nested class/enum a globally unique name within that namespace, its name is prefixed with the names of the messages it’s nested in.
So message expr corresponds to the class Frontend_ir::expr, and enum _tag, which is nested within message expr, maps to Frontend_ir::expr__tag.
In the repo, the bin_op message also has a nested _tag enum, and this maps to Frontend_ir::bin_op__tag (note how this prefixing means it doesn’t clash with the _tag definition in the expr message).
This holds for arbitrary levels of nesting, e.g. the nested message _IfElse in the expr message maps to the class Frontend_ir::expr__IfElse.
Specific enum values for the enum _tag, e.g. Integer_tag, can be referred to as Frontend_ir::expr__tag_Integer_tag.
Rather than writing out these nested classes/enums by concatenating the enclosing message names with _, Protobuf also generates nested type aliases, so you can write these nested message classes as Frontend_ir::expr::_IfElse (just as you write Frontend_ir::expr for the outer class).
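To make the naming concrete, here is a minimal sketch (not code from the Bolt repo; the include path and function name are assumptions for illustration) showing that the concatenated names and the nested aliases refer to the same generated types:

#include "src/frontend_ir.pb.h" // assumed include path for the generated header

void namingEquivalenceExample() {
  // The nested message _IfElse can be spelled either way:
  Frontend_ir::expr__IfElse viaConcatenatedName;
  Frontend_ir::expr::_IfElse viaNestedAlias;

  // Likewise for the nested enum _tag and its values:
  Frontend_ir::expr__tag tag = Frontend_ir::expr__tag_IfElse_tag;

  (void)viaConcatenatedName;
  (void)viaNestedAlias;
  (void)tag;
}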
Accessing Specific Protobuf Fields
Each protobuf required field field_name has a corresponding accessor method field_name() (where the field name is converted to lower-case). So the field tag in the expr message would map to the method tag() in the Frontend_ir::expr class, and the field Integer would map to the method integer(), IfElse to ifelse(), etc.
NB: to reiterate, don’t get confused between the field name, e.g. IfElse, and the type of the field, _IfElse, in the Protobuf message (note the prepended underscore). This is only because the OCaml PPX deriving protobuf library gave them those names - we could have chosen less confusing names if we were writing this proto schema manually.
For optional fields, we can access the fields in the same way, but we also have a has_field_name() boolean function to check if a field is there.
For repeated fields, we instead have a field_name_size() function to query the number of items, and we can access item i using the field_name(i) method. So for a field named _1 the corresponding methods are _1_size() and _1(i). For the field named _ in the autogenerated schema, the methods would correspondingly be __size() and _(i).
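Putting these accessor rules together, here is a minimal sketch (again, not code from the repo; the include path and function name are assumptions) for the messages in our schema:

#include "src/frontend_ir.pb.h" // assumed include path for the generated header

void accessorExample(const Frontend_ir::expr &expr,
                     const Frontend_ir::block_expr &block) {
  // Required field "tag" -> accessor tag().
  Frontend_ir::expr__tag tag = expr.tag();

  // oneof field "IfElse" (of message type _IfElse) -> accessor ifelse().
  // In practice we'd check the tag first to know which oneof field is set.
  const Frontend_ir::expr::_IfElse &ifElse = expr.ifelse();

  // Optional field "_1" on _Identifier -> has__1() to check presence.
  const Frontend_ir::expr::_Identifier &id = expr.identifier();
  bool hasLockType = id.has__1();

  // Repeated field "_" on block_expr -> __size() and _(i).
  for (int i = 0; i < block.__size(); i++) {
    const Frontend_ir::expr &subExpr = block._(i);
    (void)subExpr;
  }

  (void)tag;
  (void)ifElse;
  (void)hasLockType;
}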
Sanitising our Frontend IR
Let’s put this into practice by sanitising our protobuf classes into C++ classes that directly map to our OCaml type definitions. Each of the OCaml expr variants maps to a subclass of an abstract ExprIR class.
struct ExprIR {
  virtual ~ExprIR() = default;
  ...
};

struct ExprIfElseIR : public ExprIR {
  std::unique_ptr<ExprIR> condExpr;
  std::vector<std::unique_ptr<ExprIR>> thenExpr;
  std::vector<std::unique_ptr<ExprIR>> elseExpr;
  ExprIfElseIR(const Frontend_ir::expr::_IfElse &expr);
  ...
};

struct ExprWhileLoopIR : public ExprIR {
  std::unique_ptr<ExprIR> condExpr;
  std::vector<std::unique_ptr<ExprIR>> loopExpr;
  ExprWhileLoopIR(const Frontend_ir::expr::_WhileLoop &expr);
  ...
};
...
I use std::unique_ptr all over the compiler backend to avoid explicitly managing memory. You can use raw pointers too - this is just a personal preference!
Remember how we had a specific tag field to distinguish between variants of the expr type? We have a switch statement on the value of the tag field. This tag tells us which of the oneof{} fields is set. We then access the correspondingly set field, e.g. expr.ifelse() for the IfElse_tag case:
std::unique_ptr<ExprIR> deserialiseExpr(const Frontend_ir::expr &expr){switch (expr.tag()) {case Frontend_ir::expr__tag_IfElse_tag:return std::unique_ptr<ExprIR>(new ExprIfElseIR(expr.ifelse()));case Frontend_ir::expr__tag_WhileLoop_tag:return std::unique_ptr<ExprIR>(new ExprWhileLoopIR(expr.whileloop()));...}}
Looking at the message schema for the _IfElse and block_expr messages:
message _IfElse {
  required expr _0 = 1;
  required block_expr _1 = 2;
  required block_expr _2 = 3;
}

message block_expr {
  repeated expr _ = 1;
}
Our constructor reads each of the _IfElse message’s fields in turn. We can then use our deserialiseExpr function to directly deserialise the _0 field. However, because the message block_expr has a repeated field _, we have to iterate through the list of expr messages in that field in a for loop:
ExprIfElseIR::ExprIfElseIR(const Frontend_ir::expr::_IfElse &expr) {
  Frontend_ir::expr condMsg = expr._0();
  Frontend_ir::block_expr thenBlockMsg = expr._1();
  Frontend_ir::block_expr elseBlockMsg = expr._2();

  condExpr = deserialiseExpr(condMsg);
  for (int i = 0; i < thenBlockMsg.__size(); i++) {
    thenExpr.push_back(deserialiseExpr(thenBlockMsg._(i)));
  }
  for (int i = 0; i < elseBlockMsg.__size(); i++) {
    elseExpr.push_back(deserialiseExpr(elseBlockMsg._(i)));
  }
}
The rest of the deserialisation code for the Bolt schema follows the same pattern.
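For instance, the ExprWhileLoopIR constructor follows the same recipe - a sketch based on the _WhileLoop message above (the exact code in the repo may differ slightly):

ExprWhileLoopIR::ExprWhileLoopIR(const Frontend_ir::expr::_WhileLoop &expr) {
  // _0 is the loop condition, _1 is the loop body (a block_expr).
  Frontend_ir::expr condMsg = expr._0();
  Frontend_ir::block_expr loopBlockMsg = expr._1();

  condExpr = deserialiseExpr(condMsg);
  for (int i = 0; i < loopBlockMsg.__size(); i++) {
    loopExpr.push_back(deserialiseExpr(loopBlockMsg._(i)));
  }
}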
Wrap up
In this post we’ve converted from our OCaml frontend IR to the equivalent C++ classes. We’ll use these C++ classes to generate LLVM IR code in the next part of the tutorial.