Original Author: Rich Skorski
I’ve become fascinated with x64 code recently, and have taken on a quest to learn about it. There’s a fair amount of information on the net, but there isn’t nearly as much for x64 as for x86 code. Some of the sources I’ve found were wishy-washy, too, since they were created before or shortly after the rules were agreed upon. I have found very little in the way of explaining the performance considerations that are not immediately apparent and would come as a surprise to x86 experts.
If you’re here, I’m sure you’re just as interested as I am about it. let me tell you what I know…
What is an ABI?
ABI stands for Application Binary Interface. It’s a set of rules that describe what happens when a function is called in your program, and answers questions like how to handle parameters and the stack for a function call, what registers (if any) are special, how big data types are…those sorts of things. These are the rules that the compiler guys follow when they’re determining the correct assembly to use for some bit of code. There are a lot of rules in the x64 ABI, but the rules that are most open to interpretation make up what’s known as the calling convention.
What is a calling convention?
A calling convention is a set of rules in an ABI that describes what happens when a function is called in your program. That only applies to an honest to goodness call. If a function is inlined, the calling convention does not come into play. For x86, there are multiple calling conventions. If you don’t know about them, Alex Darby does a great job explaining them: start with C/C++ Low Level Curriculum part 3: The Stack and read the later installments as well.
An ABI can be specific for a processor architecture, OS, compiler, or language. You can use that as the short answer as to why Win32 code doesn’t run on a Mac: the ABI is different. Don’t let the compiler specific implementation scare you. The rules for an OS and processor are quite solid so they’ll all have to follow those. The differences can be in how they define the calling convention.
If you think about it, a processor doesn’t know exactly what the stack or functions are. Those are the crucial parts of a calling convention. There are processor instructions that facilitate the implementation of the concepts, but it’s up to programmers to use them for great justice. The compiler takes care of most of that, so we’re at the whims of their implementation when it comes to calling conventions. It’s more likely that the calling convention rules are influenced by the programming language than anything else.
The finer details will only be a burden if you’re linking targets built by different compilers. Even then you might not run into any problems because the calling convention is currently standardized for a given platform. I only mention it in case you read this sometime after it was written and the compilers have diverged. If it comes to that, certainly consult vendor documentation.
It’s worth highlighting that the idea having multiple calling conventions is unique to 32-bit Windows. The reason for that is partly legacy and partly because there are few registers compared to other architectures. Raymond Chen had a series explaining some of the history. Here’s the 1st in the 5 part series: The history of calling conventions, part 1.
What do you mean by x64?
The label x64 refers to the 64-bit processor architectures that extend the x86 architecture. It’s full name is x86-64. You can run x86 code on these processors. The x86-64 moniker might be something you see in hardware documentation, but x64 is probably what you’ll see most often. AMD and Intel chips have different implementations and instructions, thus their own distinct names. AMD is conveniently named AMD 64. Intel has a few: IA-32e, EMT64, and Intel 64. EMT64 and Intel 64 are synonymous, the latter one being the most prominent in Intel’s docs. They say there are “slight incompatibilities” between IA-32e and Intel 64, but I don’t know what they are. If you are curious, they’re buried somewhere in these docs: MSDN – x64 Architecture.
Are there different x64 calling conventions?
On Windows, there is only one calling convention aptly named the “Windows x64 calling convention.” On other platforms there is another: the “System V calling convention.” That “V” is the roman numeral 5. System V is the only option on those systems. So there are 2 calling conventions, but only 1 will be used on a given platform.
There are some similarities between the Windows and System V calling conventions, but don’t let that fool you. It would be dangerous to treat them as such. I myself made that mistake (or would have if I were developing outside of a Windows environment). There’s also a syscall, which is a direct call to the kernel. There are different rules for calling them as opposed to the functions you’ll be writing.
I won’t be discussing System V or syscalls here. I’m not familiar enough with either to speak well about them, and as a game developer you may never deal with them. But be aware that they exist.
A tip of the hat toward consistency
A theme you’ll see with the Windows x64 calling convention is consistency. The fact that there aren’t optional calling conventions like there were for Windows x86 is an example of that. The stack pointer doesn’t move around very much, and there aren’t many “ifs” in the rules regarding parameter passing. I wasn’t part of any decisions about the calling convention, so I can’t be certain. But looking at how it turned out I get the impression any decision that may seem peculiar was made for consistency. I’m not suggesting that alternative solutions would have led to unbearable pain and destruction. I’m merely suggesting a reason why the calling convention is the way it is.
How does the Windows x64 calling convention work?
The first 4 parameters of a function are passed in registers. The rest go on the stack. Different registers will be used for floats vs. integers. Here’s what registers will be used and the order in which they’ll be used:
Integer: RCX, RDX, R8, R9
Floating-point: XMM0, XMM1, XMM2, XMM3
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters even if your function doesn’t have any params. Those parameters are 8 bytes so that’s at least 32 bytes on the stack for every function (every function actually has at least 48 bytes on the stack…I’ll explain that another time). This stack area is called the home space. There are few reasons behind this home space:
- If the registers need to be used for something else, the called function can store the data in the home space without moving the stack pointer.
- It keeps the stack structure easy to determine. That’s very handy for debugging, and perhaps necessary for x64′s stack metadata (another point I’ll come back to another time).
- It’s easier to implement variable argument and non-prototyped functions.
Don’t worry, it’s not as bad as it sounds. Sure, it can be wasteful and it can destroy apps with excessive recursion if you don’t increase the available stack space. However, the space may not be wasted as often as you think. The calling convention says that the home space must exist. It merely suggests what it should be used for. The compiler can use it for whatever it wants, and an optimized build will likely make great use of it. But don’t take my word for it, keep an eye on your stack if you start working on an x64 platform.
The return value is quite easy to explain: Integers are returned in the RAX register; Floats are returned in XMM0.
Member functions have an implicit 1st parameter for the “this” pointer. Take a moment to think about how that’s different from the x86 calling convention… If you decided there’s no difference, then give yourself some bonus points! The “this” pointer will be treated as an integer parameter, ergo it will use the RCX register. Ok, ok, it’s using the full RCX register instead of only the ECX portion, but you get the point.
With regards to function calls, there are registers that are labeled as volatile or non-volatile. Volatile registers can be used by the called function without storing the original contents (if the calling function cares, it needs to store them before the call). Non-volatile registers must contain their origial value when the called function returns. Here’s a table that labels them: MSDN – Register usage.
Notice that the SSE registers are used for float parameters. Float operations will be taken care of by SSE instructions. The x87/MMX registers and instructions are available, but I’ve yet to see them used in a Windows x64 program. If your code uses the x87/MMX registers, the MSDN says that they must be considered volatile across function calls. As a game developer, you may not care about this at all. In fact, you may welcome it (I do). Be aware that this means x64 code uses the same precision for floating-point intermediate values as the operands being used. On x86, you had the power to use up to 80-bits for intermediate results. Bruce Dawson will explain that much better than I can: Intermediate floating point precision.
Which parameter gets which register?
There is a 1:1 mapping between parameters and registers. If you mix types, you still only get 4 parameters in registers. Take a look at this function declaration:
int DoStuff( float param1, short param2, bool param3, double param4, int param5 );
Where does the bool go? What register do you think will hold param2? Here’s what the registers will look like:
Param5 isn’t in any register even though only 2 of the registers reserved for integer parameters are used. Also, param2 and param3 get their own registers even though they could share the same one and still have room to spare. The compiler will not combine multiple parameters into a single register, nor will it stretch a single parameter over multiple registers.
This makes debugging a little easier. If you want to know where a parameter is in memory or registers, you only need to know where it is in the function’s parameter list. You won’t have to examine the types that came before it. This also makes it easier to support unprototyped functions. There will be details on that in a bit.
A struct might be packed into an integer register. For that to happen, the struct must be <= 8 bytes, and its size must be a power of 2. Meeting that criteria will also allow the struct to be returned in a register. Even if you mix float and integer member types, it will be placed in an integer register if able.
If a struct doesn’t fit that description, then the original object is copied into a temporary object on the stack. That temporary object’s address is passed to the function following the same rules as integers and pointers (first 4 in registers, rest on the stack). It’s the caller’s responsibility to maintain these temporary objects.
You might be wondering what happens if you return a struct that can’t fit in a register. Well, those functions sneak in an extra 1st parameter, just like the “this” pointer. This first parameter is the address to a temporary object maintained by the caller. The RAX register is still used to return the address of the temporary return object. This provides the function the address of the return object and doesn’t require conditional logic to determine which register has a parameter. If, for instance, RAX had the return object’s address, certain functions would store RAX upon entry others wouldn’t.
Structs will not use the SSE registers. Ever. If you have a struct that is a single float, it will get passed to the function in an integer register or on the stack. We’ll talk about why that’s a performance concern another time.
Surprisingly, SSE types are handled the same as a struct for the most part. Even though the SSE registers are perfect in this situation, they will still have a temporary put on the stack and have an address passed to the function. I find that super frustrating, but it does make sense. Remember that the home space reserves space for 4 separate 8 byte parameters. There’s no room for a 16 byte SSE parameter. So instead of messing with that consistent behavior, SSE types use the rules already explained up to this point. Another point for consistency. It also makes it easier to implement vararg functions which are explained below.
Unlike structs, return values will go in the XMM0 register. Hooray for that, at least. If you’re using shiny new hardware that has AVX extensions, then these rules apply to the __m256 types as well, and the YMM0 register is used for the return value.
Since different registers are used to pass parameters of different types, you may be scratching your head wondering how vararg functions are handled. Integers will be treated the same, but floats will have their value duplicated in the integer registers. This lets the called function stores values in the home space without needing to decide which register to use. It will always store the integer register. If it decides one of those varargs was a float, then it can use the SSE register as is or load into the SSE register directly from the home space. This is another reason why SSE types aren’t passed in the SSE registers; they wouldn’t fit in the integer registers.
The C89 standard allowed you to call a function without function prototype. I had no idea that was ever possible until reading up on the x64 calling conventions. This feature paired with varargs is another reason why there’s a 1:1 mapping between parameters and registers and why SSE types aren’t passed in registers. It may even be the only reason, and everything else is just a side effect. Regardless, this is how things are. The x64 calling convention was able to get rid of vestigial pieces like calling convention options and nearly drop x87, but this bit of history sticks with us.
There’s still more to talk about!
There’s a lot here, and there’s a lot more to talk about. I’ll let you digest this for now, and we’ll fill in some of the gaps and answer questions you’re likely to come up with later.
Here’s some of what you can expect in the next post(s):
- How the stack is controlled and how it behaves
- Exception handling and stack walking
- Performance considerations
- RIP-relative addressing