Vectiquette

Original Author: Jaymin Kessler

vect·i·quette

[vect-i-kit, -ket]

–noun
1. the code of ethical behavior regarding professional practice or action among programmers in their dealings with vector hardware: vector etiquette.
2. a prescribed or accepted code of usage in matters of SIMD programming, or a set of formal rules observed by programmers that actually care about performance.
3. How not to be a total douche and disrespect your friend the vector processor, who wants so badly to make your game faster.

A few months back I was talking to a friend who was doing his B.S. in Computer Science at a respectable school. The conversation happened to drift towards SIMD. My jaw hit the floor when he told me he had no idea what that was. I gave him a basic rundown and you could see the excitement in his eyes. I mean how could you not get excited? Even explained in the most basic oversimplified terms, the concept of doing 2, 4, 8, or 16 things at once instead of doing just one is universally appealing. He went off all full of excitement and hope, ready to do some vector programming. He came back one week later, beaten, dejected, and confused. His vector code was orders of magnitude slower than his scalar code.

The problem isn't just with college students. We often get programming tests from professionals that are quite good in many ways, but that show a complete lack of understanding with respect to good vector programming practices. Mistakes tend to fall into two categories: lack of general vector knowledge, and assuming that what works on one CPU is also best practice on a different CPU.

While there are certain good guidelines to follow, be aware that different things carry different penalties on different CPUs, and the only way to write correct code is to know the details of the hardware you are targeting and the compiler you are using. I know they probably told you in school not to code for quirks of a specific compiler, but by not doing so you miss out on tremendous opportunities (see techniques for splitting basic blocks in GCC, and rearranging conditionals to take advantage of GCC's forward and backward branch prediction assumptions, as simple examples).

OK, let's get started on the journey to efficiency! One of the biggest offenders in slow vector code is moving data in and out of vectors too much. Often people calculate something into a bunch of scalar floats, move them into a vector, do a vector add, and then extract back to scalars. What these people don't realize is that moving in and out of vectors is rarely free, and is often one of the most expensive things you can do. The main problem lies in the fact that on most systems float registers and vector registers are two completely separate register sets. More specifically, the problem is that to get from float registers to vector registers, you often first have to store four floats out to memory, and then read them back in to a vector register. After your vector operation, you have to reverse the process to get the values back into scalars. You basically took what could have been 4 consecutively issued adds (assuming your CPU/FPU has pipelined float operations, or a non-IEEE-compatible mode) and turned it into 4 scalar stores, 1 vector load, 1 vector add, 1 vector store, 4 scalar loads, and who knows how many stalls/hazards! As Steven Tovey rightfully pointed out, if the alignment of the vector is bad, the number of vector loads could be 2, plus a bunch of permutes and permute mask gen instructions. Awesome! As a general rule, you don't want to mix scalar and vector calculations, and if you do, make damn sure that you aren't just doing one or two vector operations. You have to do enough in vectorland to justify the cost of getting in and out of the registers.
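To make the cost concrete, here is a minimal sketch of the bad pattern next to the good one. It uses GCC/Clang's portable vector extensions (`vector_size`) as a stand-in for platform vector types; the function names and the `v4f` typedef are illustrative, not from any real API.

```cpp
// Portable stand-in for a platform vector type (GCC/Clang extension).
typedef float v4f __attribute__((vector_size(16)));

// The pattern to avoid: compute scalars, bounce through memory into a
// vector, do one add, then bounce back out. The loads/stores dominate.
inline float sum_mixed(float a0, float a1, float a2, float a3,
                       float b0, float b1, float b2, float b3)
{
    float tmp_a[4] = { a0, a1, a2, a3 };      // 4 scalar stores
    float tmp_b[4] = { b0, b1, b2, b3 };      // 4 more scalar stores
    v4f va, vb;
    __builtin_memcpy(&va, tmp_a, sizeof va);  // vector load
    __builtin_memcpy(&vb, tmp_b, sizeof vb);  // vector load
    v4f vc = va + vb;                         // the one "fast" op
    float out[4];
    __builtin_memcpy(out, &vc, sizeof out);   // vector store
    return out[0] + out[1] + out[2] + out[3]; // 4 scalar loads + adds
}

// Better: the data starts in vector registers and stays there.
inline v4f sum_vector(v4f a, v4f b)
{
    return a + b;
}
```

Both compute the same sums, but the first one turns a single-instruction add into a parade of memory traffic.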

Even if you are on a platform like NEON, where the vector registers and float registers alias each other, you still have to be careful. On NEON, switching between scalar and vector mode requires a pipeline flush on newer Cortexes, and that can be semi-costly. The problem here is almost the opposite of what we described before: instead of moving things in and out of vector registers and calling only vector instructions, you are keeping things in the same registers but mixing scalar and vector instructions. Going from general purpose registers to NEON is just as bad. Unlike the PS3's PPU, which needs to go through memory, ARM<–>NEON actually has forwarding paths between the register sets, but there is still an asymmetrical cost associated with the transfer. It's just something to think about when you think you have a free pass to mix scalar and vector code.

Building whole vectors isn't the only way to screw yourself. Unfortunately, one of the most common things people do with vectors often results in horrific performance! Take a look at this:

// this makes baby altivec cry

some_vec.w = some_float;

See what I did there? We are inserting a non-literal stored in a float register into a vector. I don't mean to sound idealistic, but if you are wrapping built-in vector types in structs, I think it's best not to define functions for inserting/extracting scalars (depending on the CPU). If they are there, people will use them. The least you could do is name them something horrific like

inline void by_using_this_insert_x_function_I_the_undersigned_state_that_i_know_and_understand_the_costs_associated_with_said_action_and_take_full_responsibility_for_the_crappy_code_that_will_undoubtedly_result_from_my_selfishness_and_reckless_disregard_for_good_code( float x );

There, that ought to teach em a lesson!

There is a clever way to get around some of the above hassles, and it's lovingly referred to as "float in vector." The concept is simple enough: instead of using floats all over the place, you make a struct that acts like a float but internally is a vector. This lets you write code that looks like a mix of vector and scalar, but actually lives entirely in vector registers. While some_vec * some_float could be disastrous in some cases, if some_float is secretly a vector, this will compile to a single vector multiply. Hot tip: duplicate your scalar to all lanes of the float in vec's internal vector, because it allows code like the previous example to work unaltered.
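A minimal float-in-vec sketch might look like the following. The `FloatInVec` and `Vector4` names are illustrative, and GCC/Clang vector extensions stand in for real platform intrinsics:

```cpp
// GCC/Clang portable vector type as a stand-in for platform intrinsics.
typedef float v4f __attribute__((vector_size(16)));

struct FloatInVec
{
    v4f v;  // the scalar, duplicated to all four lanes

    // Splat on construction so vec * scalar needs no lane shuffling later.
    explicit FloatInVec(float x)
    {
        v4f t = { x, x, x, x };
        v = t;
    }

    // Only for when you truly must leave vectorland.
    float as_float() const { return v[0]; }
};

struct Vector4
{
    v4f v;
};

// Because every lane already holds the scalar, this is one vector multiply.
inline Vector4 operator*(Vector4 a, FloatInVec s)
{
    return Vector4{ a.v * s.v };
}
```

The key design choice is the splat in the constructor: by paying for the duplication once, every subsequent "scalar" operation is just an ordinary full-width vector operation.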

One last thing I want to quickly mention before moving on to code writing tricks. Aside from the PS2 VUs, most vector units don't have cross-vector math operations (which would be very useful for dot products). Therefore, while code like vec.x * vec.x + vec.y * vec.y + vec.z * vec.z can technically be done completely in vector registers, it takes a lot more work to move stuff around. For a way around this, see point 7 below.

Giving GCC What It Wants

Another important point is to understand the personality of the compiler you are using. Don't take the attitude that the compiler should do something for you. As a programmer, it is your job to help out the compiler as much as possible (best case) and not make the compiler's job harder (worst case). So, what does good vector code look like on GCC? The list below is in no way exhaustive, but it contains a few semi-useful tips that can make a big difference. I'll try reeeeeally hard to keep each item brief so it serves as a good introduction, but if you want more details feel free to ask me (or Google).

1) If possible, use lots of const temporaries. Storing the results of vector operations in lots of const temporaries helps GCC track the lifetime of things in more complex code, and therefore helps the compiler keep stuff in registers.
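As a trivial illustration (function names are made up, and whether this helps in practice depends on your GCC version and the surrounding code, as the tip says):

```cpp
typedef float v4f __attribute__((vector_size(16)));

// One big expression: correct, but intermediates are anonymous.
inline v4f madd_dense(v4f a, v4f b, v4f c, v4f d)
{
    return (a * b + c) * d + a;
}

// Same math, each intermediate in a const temporary. Every value is
// clearly single-assignment, which can make lifetimes easier to track
// in more complex code than this.
inline v4f madd_named(v4f a, v4f b, v4f c, v4f d)
{
    const v4f ab     = a * b;
    const v4f ab_c   = ab + c;
    const v4f scaled = ab_c * d;
    const v4f result = scaled + a;
    return result;
}
```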

2) If a type fits in a register, pass it by value. DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE. If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference. I'll say it again: if the type you are using fits in registers (float, int, or vector), do not pass it to a function by anything but value. In the case of non-sane compilers like Visual Studio for x86, the compiler can't maintain the alignment of objects on the stack, and therefore objects that have align directives must be passed to functions by reference. This may be fixed on the Xbox 360. If you are multiplatform, the best thing you can do is make a parameter passing typedef to avoid having to cater to the lowest common denominator.
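The parameter passing typedef might look something like this sketch. The typedef name is made up, and the MSVC branch is shown only as the pattern (MSVC doesn't compile GCC vector extensions; a real multiplatform header would switch the vector type too):

```cpp
typedef float v4f __attribute__((vector_size(16)));

// Pass by value everywhere registers are sane; fall back to const
// reference only where stack alignment forces it (e.g. 32-bit MSVC).
#if defined(_MSC_VER) && defined(_M_IX86)
typedef const v4f& v4f_param;  // can't keep stack alignment: by reference
#else
typedef v4f v4f_param;         // fits in a register: by value
#endif

inline v4f add_vec(v4f_param a, v4f_param b)
{
    return a + b;
}
```

Callers write `add_vec(a, b)` either way; only the typedef changes per platform.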

3) On a related note, always prefer returning a value to returning something by reference. For example

// bad

void Add(Vector4 a, Vector4 b, Vector4& result);

//good-er

Vector4 Add(Vector4 a, Vector4 b);

The above examples are standalone (non-member) functions, but this applies to member functions as well. Remember that this is a very C/C++ thing. If you are writing in a nutso language like C#, it can be over 40x faster to return by reference because of the compiler's inability to optimize simple struct constructors and copies.

4) When wrapping vector stuff in a struct, make as many member functions const as possible. Avoid modifying this as much as you can. For example

// bad, it sets a member in this

void X(FloatInVec scalar);

// good, it creates a temporary vec and returns it in registers

Vector4 X(FloatInVec scalar) const;

Not only does this help out the compiler, but it also allows you to chain stuff in longer expressions. For example, some_vec.w(some_val).normalize().your_mom();

5) For math operations on built-in vector types, using intrinsics is not always the same as using operators. Let's say you have two vectors. There are two ways to add them

vec_float4 a;

vec_float4 b;

vec_float4 c = a + b;

vec_float4 d = spu_add(a, b); // I like si intrinsics better but…

Which is better depends greatly on the compiler you are using and the version. For example, in older versions of GCC, using functions instead of operators meant that the compiler wasn't able to do mathematical expression simplification: it had semantic information about the operators that it didn't have for the intrinsics. However, I have heard from a few compiler guys that I should avoid using operators because most of the optimization work has gone into intrinsics, since that is the most used path. Not sure if this is still true, but it's definitely worth knowing that the two paths aren't necessarily equal, and you should look out for what your compiler does in different situations.

6) When not writing directly in assembly, separate loads from calculations. It's often a good idea to load all the data you need into vector registers before using the data in actual calculations. You may even want to include a basic block splitter between the loads and the calculations. This can help scheduling in a few ways.
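A sketch of the shape this takes (illustrative function name, GCC/Clang vector extensions standing in for platform loads):

```cpp
typedef float v4f __attribute__((vector_size(16)));

// Load everything up front, then compute. Grouping the loads together
// gives the scheduler room to hide their latency behind the math below.
// Expects at least 12 floats; 16-byte alignment is ideal.
inline v4f sum_of_three(const float* data)
{
    // --- loads first ---
    v4f a, b, c;
    __builtin_memcpy(&a, data + 0, sizeof a);
    __builtin_memcpy(&b, data + 4, sizeof b);
    __builtin_memcpy(&c, data + 8, sizeof c);

    // (a basic block splitter / scheduling barrier could go here)

    // --- then calculations ---
    return a + b + c;
}
```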

7) Depending on what you plan to do with your data, consider using SoA (structure of arrays) instead of AoS (array of structures). I won't go too far into the details of SoA, but it basically boils down to having 4 vectors containing {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3}, {w0, w1, w2, w3} instead of the more "traditional" {x, y, z, w}. There are a few reasons for using this. First of all, if the code you are writing looks and feels something like this

FloatInVec dist = vec.x * vec.x + vec.y * vec.y + vec.z * vec.z

it can be a bit of a pain to do when your vectors are in {x, y, z, w} form. There is a lot of painful shifting and moving things around, and a lot of stalls because you can't add the x, y, and z products until you line them up. Now let's look at this as SoA

Vector4 x_vals, y_vals, z_vals;

Vector4 distances = x_vals * x_vals + y_vals * y_vals …

(image from slide 49 of Steven Tovey's excellent presentation)
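Fleshed out as a compilable sketch (again using GCC/Clang vector extensions rather than any particular platform's intrinsics):

```cpp
typedef float v4f __attribute__((vector_size(16)));

// SoA squared distances: four results from three multiplies and two
// adds, with zero shuffling. Lane i holds x_i*x_i + y_i*y_i + z_i*z_i.
inline v4f squared_distances(v4f x_vals, v4f y_vals, v4f z_vals)
{
    return x_vals * x_vals + y_vals * y_vals + z_vals * z_vals;
}
```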

Now, you can freely write code that looks kinda scalar, but you don’t have to extract and move around the x, y, and z values. Everything is already where it needs to be. Also, unlike the first example, you get four for free! If you are doing cross vector operations or using x, y, and z independently in calculations, and if you have many to do at once, it might be a good idea to use SoA. Depending on the latency of the instructions involved, you might even want to consider unrolling to fill in any gaps caused by any stalls. Speaking of which…

8) Depending on how many registers you have, consider unrolling. Don't just randomly do it; first look at your code in a pipeline analyzer to see if it would even help, and to check register usage. If there aren't that many gaps, or you are already using most of the available registers, unrolling probably won't help and may even end up making your code slower due to spilling or increased pressure on the i-cache.

9) On the SPUs (or any other hardware that has no scalar support), be very wary of consecutive writes to scalar pointers. There is no way to know at compile time if consecutive writes of values already in registers will be to addresses in the same 16 byte vector, so the compiler must be very conservative. In this case, restrict won’t help.
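A sketch of the difference (illustrative function names; on real SPU hardware each scalar store below becomes a load/shuffle/store of the containing 16-byte line, which this portable version can only hint at):

```cpp
typedef float v4f __attribute__((vector_size(16)));

// Risky on scalar-less hardware: the compiler can't prove these four
// pointers don't land in the same 16-byte line, so it must conservatively
// read-modify-write each one in turn. restrict doesn't help here.
inline void store_scalar(float* x, float* y, float* z, float* w)
{
    *x = 1.0f;
    *y = 2.0f;
    *z = 3.0f;
    *w = 4.0f;
}

// If the four values really belong to one vector, say so: one store.
inline void store_vector(v4f* out)
{
    v4f v = { 1.0f, 2.0f, 3.0f, 4.0f };
    *out = v;
}
```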

10) Know your alignment requirements: what constitutes an unaligned load, and what the penalties are for different alignments.
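For instance, a quick way to check whether a pointer is usable for an aligned 16-byte vector load (helper name is made up; the penalty for failing this check ranges from "a slower load plus permutes" to "a fault", depending on the hardware):

```cpp
#include <cstdint>

// True if p sits on a 16-byte boundary, i.e. an aligned vector load is OK.
inline bool is_aligned_16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// alignas forces the whole buffer onto a 16-byte boundary.
struct alignas(16) AlignedFloats
{
    float data[8];
};
```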

Tacked On Advanced Topic: Odd and Even Pipelines

Jonathan Adamczewski rightfully pointed out that this section felt a little out of place, bolted on, and not as fleshed out as some of the sections above. Also, it made my blog post a little too long, so I cut it. But don't worry, those of you who were just dying to hear me drone on and on about the art of balancing odd and even pipelines will get the chance in my next post. It works out well for me because I was almost completely out of ideas as to what to write next. So please wait for it.

Conclusion

Here is the lesson: I don't care how smart you think you are, use your perf tools and look at the disassembly. It's never enough to look at your source code and say it looks faster. It's a bad idea to time your stuff, notice that your new optimized version is slower, and then not try to find out why. Also, it's a terrible idea to take things I say here as absolute fact (or even remotely correct) without verifying for yourself.

Beware of simple tests. Optimizers are complicated beasts, and just because X is inlined in your test or Y is scheduled nicely doesn't mean it will be so in the real world. Whenever possible, test your code where it will be used.

Shoutouts to

  • @nonchaotic the low level ninja, editing buddy, and the guy who hopefully stops me from saying anything too horribly stupid.
  • @Matt_D_ for reminding me why I could never be an english major (or speaker)
  • @twoscomplement for verifying what I already knew to be true: that I can drone on and on and forget what my own point was
  • @CarterBen for reminding me that while he may know what I mean when I use vague language, others may totally misinterpret my words
  • @DylanCuthbert for helping me suck at life less

Remember for all your stupid SPU tricks, especially if you are doing 2D stuff