Vectiquette

Original Author: Jaymin Kessler

vect·i·quette

[vect-i-kit, -ket]

–noun
1. the code of ethical behavior regarding professional practice or action among programmers in their dealings with vector hardware: vector etiquette.
2. a prescribed or accepted code of usage in matters of SIMD programming, or a set of formal rules observed by programmers that actually care about performance.
3. How not to be a total douche and disrespect your friend the vector processor, who wants so badly to make your game faster.

A few months back I was talking to a friend who was doing his B.S. in Computer Science at a respectable school. The conversation happened to drift towards SIMD. My jaw hit the floor when he told me he had no idea what that was. I gave him a basic rundown and you could see the excitement in his eyes. I mean how could you not get excited? Even explained in the most basic oversimplified terms, the concept of doing 2, 4, 8, or 16 things at once instead of doing just one is universally appealing. He went off all full of excitement and hope, ready to do some vector programming. He came back one week later, beaten, dejected, and confused. His vector code was orders of magnitude slower than his scalar code.

The problem isn’t just with college students. We often get programming tests from professionals that are quite good in many ways, but that show a complete lack of understanding with regard to good vector programming practices. Mistakes tend to fall into two categories: lack of general vector knowledge, and assuming that what works on one CPU is also best practice on a different CPU.

While there are certain good guidelines to follow, be aware that different things carry different penalties on various CPUs, and the only way to write properly efficient code is to know the details of the hardware you are targeting, and the compiler you are using. I know they probably told you in school not to code for quirks in a specific compiler, but by not doing so you miss out on tremendous opportunities (see techniques for splitting basic blocks in GCC, and rearranging conditionals to take advantage of GCC’s forward and backwards branch prediction assumptions, as simple examples).

OK, let’s get started on the journey to efficiency! One of the biggest offenders in slow vector code is moving data in and out of vectors too much. Often people calculate something into a bunch of scalar floats, move them into a vector, do a vector add, and then extract back to scalars. What these people don’t realize is that moving in and out of vectors is rarely free, and is often one of the most expensive things you can do. The main problem lies in the fact that on most systems float registers and vector registers are two completely separate register sets. More specifically, the problem is that to get from float registers to vector registers, you often first have to store four floats out to memory, and then read them back in to a vector register. After your vector operation, you have to reverse the process to get the values back into scalars. You basically took what could have been 4 consecutively issued adds (assuming your CPU/FPU has pipelined float operations, or a non-IEEE-compatible mode) and turned it into 4 scalar stores, 1 vector load, 1 vector add, one vector store, 4 scalar loads, and who knows how many stalls/hazards! As Steven Tovey rightfully pointed out, if the alignment of the vector is bad, the number of vector loads could be 2, and a bunch of permutes and permute mask gen instructions. Awesome! As a general rule, you don’t want to mix scalar and vector calculations, and if you do, make damn sure that you aren’t just doing one or two vector operations. You have to do enough in vectorland to justify the cost of getting in and out of the registers.

Even if you are on a platform like NEON where the vector registers and float registers alias each other, you still have to be careful. On NEON, switching between scalar and vector mode requires a pipeline flush on newer Cortexes, and that can be semi-costly. The problem here is almost the opposite of what we described before, because instead of moving things in and out of vector registers and calling only vector instructions, you are keeping things in the same registers but mixing scalar and vector instructions. If you are going from general purpose registers to NEON, it’s just as bad. Unlike the PS3’s PPU, which needs to go through memory, ARM<->NEON actually has forwarding paths between register sets, but there is still an asymmetrical cost associated with the transfer. It’s just something to think about when you think you have a free pass to mix scalar and vector code.

Building whole vectors isn’t the only way to screw yourself. Unfortunately, one of the most common things people do with vectors often results in horrific performance! Take a look at this:

// this makes baby altivec cry

some_vec.w = some_float;

See what I did there? We are inserting a non-literal stored in a float register into a vector. I don’t mean to sound idealistic, but if you are wrapping built-in vector types in structs, I think it’s best not to define functions for inserting/extracting scalars (depending on the CPU). If they are there, people will use them. The least you could do is name them something horrific like

inline void by_using_this_insert_x_function_I_the_undersigned_state_that_i_know_and_understand_the_costs_associated_with_said_action_and_take_full_responsibility_for_the_crappy_code_that_will_undoubtedly_result_from_my_selfishness_and_reckless_disregard_for_good_code( float x );

There, that ought to teach em a lesson!

There is a clever way to get around some of the above hassles, and it’s lovingly referred to as “float in vector.” The concept is simple enough. Instead of using floats all over the place, you make a struct that acts like a float, but internally is a vector. This lets you write code that looks like a mix of vector and scalar, but actually lives entirely in vector registers. While some_vec * some_float could be disastrous in some cases, if some_float is secretly a vector, this will compile to a single vector multiply. Hot tip: duplicate your scalar to all lanes of the float in vec’s internal vector, because it allows code like the previous example to work unaltered.
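To make that concrete, here is a minimal sketch of the idea using SSE intrinsics (the same pattern applies on Altivec or NEON); the types and operator set are illustrative, not a full library:

#include <xmmintrin.h>

// Minimal float-in-vector sketch (SSE flavor, illustrative only).
// The scalar lives splatted across all four lanes, so multiplying
// by it needs no shuffles and no scalar<->vector transfers.
struct FloatInVec
{
    __m128 v;
    explicit FloatInVec(float f) : v(_mm_set1_ps(f)) {} // one-time splat
    explicit FloatInVec(__m128 m) : v(m) {}
};

struct Vector4
{
    __m128 v;
    explicit Vector4(__m128 m) : v(m) {}
};

// some_vec * some_float compiles to a single vector multiply
inline Vector4 operator*(Vector4 a, FloatInVec s)
{
    return Vector4(_mm_mul_ps(a.v, s.v));
}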

One last thing I want to quickly mention before moving on to code writing tricks. Aside from the PS2 VUs, most vector units don’t have cross vector math operations (very useful for dot products). Therefore while code like vec.x * vec.x + vec.y * vec.y + vec.z * vec.z can technically be done completely in vector registers, it takes a lot more work to move stuff around. For a way around this, see point 7 below.

Giving GCC What It Wants

Another important point is to understand the personality of the compiler you are using. Don’t take the attitude that the compiler should do something for you. As a programmer, it is your job to help out the compiler as much as possible (best case) and not make the compiler’s job harder (worst case). So, what does good vector code look like on GCC? The list below is in no way exhaustive, but it contains a few semi-useful tips that can make a big difference. I’ll try reeeeeally hard to keep each item brief so as to serve as a good introduction, but if you want more details feel free to ask me (or google).

1) If possible, use lots of const temporaries. Storing the results of vector operations in lots of const temporaries helps GCC track the lifetime of things in more complex code, and therefore helps the compiler keep stuff in registers.

2) If a type fits in a register, pass it by value. DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE. If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference. I’ll say it again: if the type you are using fits in registers (float, int, or vector), do not pass it to the function by anything but value. In the case of non-sane compilers like Visual Studio for x86, the compiler can’t maintain the alignment of objects on the stack, and therefore objects that have align directives must be passed to functions by reference. This may be fixed on the Xbox 360. If you are multiplatform, the best thing you can do is make a parameter passing typedef to avoid having to cater to the lowest common denominator.
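Here is a hedged sketch of what such a typedef might look like (Vector4Param is my own name for it, and the platform test is illustrative):

// Hypothetical parameter-passing typedef: by value wherever the ABI can
// keep the vector in registers, by const reference only where it can't.
#if defined(_MSC_VER) && defined(_M_IX86)
    typedef const Vector4& Vector4Param;  // x86 VS can't align by-value stack args
#else
    typedef Vector4 Vector4Param;         // sane targets: pass in registers
#endif

Vector4 Add(Vector4Param a, Vector4Param b);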

3) On a related note, always prefer returning a value to returning something by reference. For example:

// bad

void Add(Vector4 a, Vector4 b, Vector4& result);

//good-er

Vector4 Add(Vector4 a, Vector4 b);

The above example uses standalone (non-member) functions, but this applies to member functions as well. Remember that this is a very C/C++ thing. If you are writing in a nutso language like C#, it can be over 40x faster to return by reference because of the compiler’s inability to optimize simple struct constructors and copies.

4) When wrapping vector stuff in a struct, make as many member functions const as possible. Avoid modifying this as much as you can. For example:

// bad, it sets a member in this

void X(FloatInVec scalar);

// good, it creates a temporary vec and returns it in registers

Vector4 X(FloatInVec scalar) const;

Not only does this help out the compiler, but it also allows you to chain stuff in longer expressions. For example, some_vec.w(some_val).normalize().your_mom();

5) For math operations on built-in vector types, using intrinsics is not always the same as using operators. Let’s say you have two vectors. There are two ways to add them:

vec_float4 a;

vec_float4 b;

vec_float4 c = a + b;

vec_float4 d = spu_add(a, b); // I like si intrinsics better but…

Which is better greatly depends on the compiler you are using and the version. For example, in older versions of GCC, using functions instead of operators meant that the compiler wasn’t able to do mathematical expression simplification. It had semantic information about the operators that it didn’t have for the intrinsics. However, I have heard from a few compiler guys that I should avoid using operators because most of the optimization work has gone into intrinsics, since that is the most used path. Not sure if this is still true, but it’s definitely worth knowing the two paths aren’t necessarily equal, and you should look out for what your compiler does in different situations.

6) When not writing directly in assembly, separate loads from calculations. It’s often a good idea to load all the data you need into vector registers before using the data in actual calculations. You may even want to include a basic block splitter between the loads and calculations, as sketched below. This can help scheduling in a few ways.
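One GCC-specific way to do the splitting (a sketch, worth verifying in your own disassembly) is an empty volatile asm between the two halves:

#include <spu_intrinsics.h> // SPU target assumed; adapt types/intrinsics elsewhere

// An empty volatile asm is opaque to GCC, so it ends the basic block here.
#define SPLIT_BLOCK() __asm__ volatile ("")

void add_two(const vec_float4* pos, const vec_float4* vel, vec_float4* out)
{
    // all the loads first...
    const vec_float4 p0 = pos[0];
    const vec_float4 p1 = pos[1];
    const vec_float4 v0 = vel[0];
    const vec_float4 v1 = vel[1];
    SPLIT_BLOCK();
    // ...then the calculations in a block of their own
    out[0] = spu_add(p0, v0);
    out[1] = spu_add(p1, v1);
}

Note how the const temporaries from tip 1 show up here too.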

7) Depending on what you plan to do with your data, consider using SoA (structure of arrays) instead of AoS (array of structures). I won’t go too far into the details of SoA, but it basically boils down to having 4 vectors containing {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3}, {w0, w1, w2, w3} instead of the more “traditional” {x, y, z, w}. There are a few reasons for using this. First of all, if the code you are writing looks and feels something like this

FloatInVec dist = vec.x * vec.x + vec.y * vec.y + vec.z * vec.z;

it can be a bit of a pain to do when your vectors are in {x, y, z, w} form. There is a lot of painful shifting and moving things around, and a lot of stalls, because you can’t add the x, y, and z products until you line them up. Now let’s look at the same thing in SoA form:

Vector4 x_vals, y_vals, z_vals;

Vector4 distances = x_vals * x_vals + y_vals * y_vals …

(image from slide 49 of Steven Tovey’s excellent presentation)

Now, you can freely write code that looks kinda scalar, but you don’t have to extract and move around the x, y, and z values. Everything is already where it needs to be. Also, unlike the first example, you get four for free! If you are doing cross vector operations or using x, y, and z independently in calculations, and if you have many to do at once, it might be a good idea to use SoA. Depending on the latency of the instructions involved, you might even want to consider unrolling to fill in any gaps caused by any stalls. Speaking of which…

8) Depending on how many registers you have, consider unrolling. Don’t just randomly do it, but first look at your code in a pipeline analyzer to see if it would even help, and to check register usage. If there aren’t that many gaps or you are already using most of the available registers, unrolling probably won’t help, and may even end up making your code slower due to spilling or increased pressure on the i-cache.

9) On the SPUs (or any other hardware that has no scalar support), be very wary of consecutive writes to scalar pointers. There is no way to know at compile time if consecutive writes of values already in registers will be to addresses in the same 16 byte vector, so the compiler must be very conservative. In this case, restrict won’t help.
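To make the problem concrete, here is the shape of it (a sketch; the expansion in the comments is roughly what the compiler has to emit):

// With no scalar stores in the ISA, each write below expands to roughly:
// load the containing 16-byte quadword, insert the float, store it back.
void write_two(float* __restrict a, float* __restrict b)
{
    *a = 1.0f;
    *b = 2.0f; // must stay ordered after the first store: even though a and
               // b don't alias, both floats might still live in the same
               // quadword, so the read-modify-writes can't be overlapped
}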

10) Know your alignment requirements: what constitutes an unaligned load, and what the penalties are for different alignments.

Tacked On Advanced Topic: Odd and Even Pipelines

Jonathan Adamczewski rightfully pointed out that this section felt a little out of place, bolted on, and not as fleshed out as some of the sections above. Also, it made my blog post a little too long, so I cut it. But don’t worry, those of you who were just dying to hear me drone on and on about the art of balancing odd and even pipelines will get the chance in my next post. It works out well for me because I was almost completely out of ideas as to what to write next. So please wait for it.

Conclusion

Here is the lesson. I don’t care how smart you think you are: use your perf tools and look at the disassembly. It’s never enough to look at your source code and say it looks faster. It’s a bad idea to time your stuff, notice that your new optimized version is slower, and then not try to find out why. Also, it’s a terrible idea to take things I say in this as absolute fact (or even remotely correct) without verifying for yourself.

Beware of simple tests. Optimizers are complicated beasts, and just because X is inlined in your test or Y is scheduled nicely doesn’t mean it will be so in the real world. Whenever possible, test your code where it will be used.

Shoutouts to

  • @nonchaotic the low level ninja, editing buddy, and the guy who hopefully stops me from saying anything too horribly stupid.
  • @Matt_D_ for reminding me why I could never be an English major (or speaker)
  • @twoscomplement for verifying what I already knew to be true: that I can drone on and on and forget what my own point was
  • @CarterBen for reminding me that while he may know what I mean when I use vague language, others may totally misinterpret my words
  • @DylanCuthbert for helping me suck at life less

Remember for all your stupid SPU tricks, especially if you are doing 2D stuff

Leaderboards: A How-To Guide

Original Author: David Czarnecki

INTRODUCTION

Leaderboards in video games are an interesting beast. They are used to rank players against one another using some quantifiable criteria such as # of kills, XP, total time played, points in a level, etc. Leaderboards have been an intimate part of my life for almost 3 years now at Agora Games. As such, I’m always looking for ways to improve our leaderboard technology and answer questions like: How do you efficiently insert and rank players in real time? How do you efficiently update and re-rank players in real time? How do you retrieve information about a specific member of a leaderboard? How do you retrieve information about a player and players around them in a leaderboard?

It was the morning of January 1, 2011, I had literally ridden the short bus back from my New Year’s dinner some hours earlier, and I woke up with a desire to make the world more awesome. It’s 2011 and we still don’t have flying cars … so ,,|, future. Where to start? Leaderboards.

LEADERBOARDS USING REDIS

At the studio I had discussed with colleagues the possibility of using Redis, an advanced key-value storage engine, for leaderboards. In less than an hour, I had the set of Redis commands using their sorted set data type (a set of data that is sorted based on an associated “score”) to perform operations on leaderboards such as:

  • Retrieving general information about a leaderboard such as total members or total pages
  • Adding or removing members from a leaderboard
  • Retrieving information about a member in the leaderboard such as their rank or score
  • Updating score information for a member in the leaderboard
  • Retrieving an arbitrary page of leaders from the leaderboard
  • Retrieving the leaders around a given member in a leaderboard, also known as an “Around Me” leaderboard
  • Retrieving information for an arbitrary set of members in a leaderboard, e.g. How do my friends compare against me?

The result of all this is now encapsulated in the Redis libraries in a number of programming languages such as C, C#, C++, Lua, etc. The leaderboard gem is merely a semantic wrapper around the underlying Redis commands. It is left as an exercise to the reader to port the leaderboard code to your favorite programming language.

RUBBER MEETS ROAD

Here’s how you’d actually use the library to generate and interact with a ‘highscores’ leaderboard as outlined above.

Create a new leaderboard or attach to an existing leaderboard named 'highscores':
   ruby-1.8.7-p302 > highscore_lb = Leaderboard.new('highscores')
   => #<Leaderboard:0x1018e4250 @page_size=25, @port=6379, @host="localhost", @redis_connection=#<Redis client v2.1.1 connected to redis://localhost:6379/0 (Redis v2.1.10)>, @leaderboard_name="highscores">
If you need to pass in options for Redis, you can do this with the redis_options parameter:
  redis_options = {:host => 'localhost', :port => 6379, :password => 'password', :db => 'some_redis_db'}
  highscore_lb = Leaderboard.new('highscores', redis_options[:host], redis_options[:port], Leaderboard::DEFAULT_PAGE_SIZE, redis_options)
You can set the page size to something other than the default page size (25):
  ruby-1.8.7-p302 > highscore_lb.page_size = 5
   => 5
  ruby-1.8.7-p302 > highscore_lb
   => #<Leaderboard:0x1018e2130 @leaderboard_name="highscores", @page_size=5, @port=6379, @redis_connection=#<Redis client v2.1.1 connected to redis://localhost:6379/0 (Redis v2.1.10)>, @host="localhost", @redis_options={:host=>"localhost", :port=>6379}>
Add members to your leaderboard:
  ruby-1.8.7-p302 > 1.upto(10) do |index|
  ruby-1.8.7-p302 > highscore_lb.add_member("member_#{index}", index)
  ruby-1.8.7-p302 ?> end
   => 1
Get some information about your leaderboard:
  ruby-1.8.7-p302 > highscore_lb.total_members
   => 10
  ruby-1.8.7-p302 > highscore_lb.total_pages
   => 1
Get some information about a specific member(s) in the leaderboard:
  ruby-1.8.7-p302 > highscore_lb.score_for('member_4')
   => 4.0
  ruby-1.8.7-p302 > highscore_lb.rank_for('member_4')
   => 7
  ruby-1.8.7-p302 > highscore_lb.rank_for('member_10')
   => 1
Get page 1 in the leaderboard:
  ruby-1.8.7-p302 > highscore_lb.leaders(1)
=> [{:member=>"member_10", :rank=>1, :score=>"10"}, {:member=>"member_9", :rank=>2, :score=>"9"}, {:member=>"member_8", :rank=>3, :score=>"8"}, {:member=>"member_7", :rank=>4, :score=>"7"}, {:member=>"member_6", :rank=>5, :score=>"6"}, {:member=>"member_5", :rank=>6, :score=>"5"}, {:member=>"member_4", :rank=>7, :score=>"4"}, {:member=>"member_3", :rank=>8, :score=>"3"}, {:member=>"member_2", :rank=>9, :score=>"2"}, {:member=>"member_1", :rank=>10, :score=>"1"}]
Add more members to your leaderboard:
  ruby-1.8.7-p302 > 50.upto(95) do |index|
  ruby-1.8.7-p302 > highscore_lb.add_member("member_#{index}", index)
  ruby-1.8.7-p302 ?> end
   => 50
  ruby-1.8.7-p302 > highscore_lb.total_pages
   => 3
Get an “Around Me” leaderboard for a member:
  ruby-1.8.7-p302 > highscore_lb.around_me('member_53')
=> [{:member=>"member_65", :rank=>31, :score=>"65"}, {:member=>"member_64", :rank=>32, :score=>"64"}, {:member=>"member_63", :rank=>33, :score=>"63"}, {:member=>"member_62", :rank=>34, :score=>"62"}, {:member=>"member_61", :rank=>35, :score=>"61"}, {:member=>"member_60", :rank=>36, :score=>"60"}, {:member=>"member_59", :rank=>37, :score=>"59"}, {:member=>"member_58", :rank=>38, :score=>"58"}, {:member=>"member_57", :rank=>39, :score=>"57"}, {:member=>"member_56", :rank=>40, :score=>"56"}, {:member=>"member_55", :rank=>41, :score=>"55"}, {:member=>"member_54", :rank=>42, :score=>"54"}, {:member=>"member_53", :rank=>43, :score=>"53"}, {:member=>"member_52", :rank=>44, :score=>"52"}, {:member=>"member_51", :rank=>45, :score=>"51"}, {:member=>"member_50", :rank=>46, :score=>"50"}, {:member=>"member_10", :rank=>47, :score=>"10"}, {:member=>"member_9", :rank=>48, :score=>"9"}, {:member=>"member_8", :rank=>49, :score=>"8"}, {:member=>"member_7", :rank=>50, :score=>"7"}, {:member=>"member_6", :rank=>51, :score=>"6"}, {:member=>"member_5", :rank=>52, :score=>"5"}, {:member=>"member_4", :rank=>53, :score=>"4"}, {:member=>"member_3", :rank=>54, :score=>"3"}, {:member=>"member_2", :rank=>55, :score=>"2"}]
Get rank and score for an arbitrary list of members (e.g. friends):
  ruby-1.8.7-p302 > highscore_lb.ranked_in_list(['member_1', 'member_62', 'member_67'], true)
   => [{:rank=>55, :member=>"member_1", :score=>1.0}, {:rank=>33, :member=>"member_62", :score=>62.0}, {:rank=>28, :member=>"member_67", :score=>67.0}]

Let’s cover a few more scenarios in-depth as to what calls you might make.

I’ve based Scenarios 1 and 2 on LittleBigPlanet. NOTE: I have absolutely no idea if this is what actually happens in that game.

Scenario 1: Player finishes a level in game, upload player’s score (add_member), display to the player the scores around them (set page_size and use around_me call for player).

fossil:leaderboard dczarnecki$ irb
ruby-1.8.7-p302 > require 'rubygems'
 => true
ruby-1.8.7-p302 > require 'leaderboard'
 => true
ruby-1.8.7-p302 > highscore_lb = Leaderboard.new('highscores')
 => #<Leaderboard:0x1011b23b0 @leaderboard_name="highscores", @page_size=25, @port=6379, @redis_connection=#<Redis client v2.1.1 connected to redis://localhost:6379/0 (Redis v2.0.4)>, @host="localhost", @redis_options={:host=>"localhost", :port=>6379}>
ruby-1.8.7-p302 > 1.upto(10) do |index|
ruby-1.8.7-p302 > highscore_lb.add_member("member_#{index}", index)
ruby-1.8.7-p302 ?> end
 => 1
ruby-1.8.7-p302 > highscore_lb.add_member('DavidCzarnecki', 6)
 => true
ruby-1.8.7-p302 > highscore_lb.page_size = 5
 => 5
ruby-1.8.7-p302 > highscore_lb.around_me('DavidCzarnecki')
 => [{:member=>"member_7", :score=>"7", :rank=>4}, {:member=>"member_6", :score=>"6", :rank=>5}, {:member=>"DavidCzarnecki", :score=>"6", :rank=>6}, {:member=>"member_5", :score=>"5", :rank=>7}, {:member=>"member_4", :score=>"4", :rank=>8}]
ruby-1.8.7-p302 >

Scenario 2: Player finishes the same level in game, update the player’s score (add_member), display to the player the scores around them (set page_size and use the around_me call for the player). Note that you can call add_member again because, under the hood, Redis will just update the player’s score (if it exists) and re-rank.

ruby-1.8.7-p302 > highscore_lb.add_member('DavidCzarnecki', 15)
 => false
ruby-1.8.7-p302 > highscore_lb.page_size = 5
 => 5
ruby-1.8.7-p302 > highscore_lb.around_me('DavidCzarnecki')
 => [{:member=>"DavidCzarnecki", :score=>"15", :rank=>1}, {:member=>"member_10", :score=>"10", :rank=>2}, {:member=>"member_9", :score=>"9", :rank=>3}, {:member=>"member_8", :score=>"8", :rank=>4}, {:member=>"member_7", :score=>"7", :rank=>5}]
ruby-1.8.7-p302 >

I’ve based Scenarios 3 and 4 on Call of Duty: Black Ops, playing a wager match. NOTE: I have absolutely no idea if this is what actually happens in that game. Also, in Scenarios 3 and 4, you could certainly use the add_member call with the player’s total earnings. I simply want to illustrate the delta call (change_score_for).

Scenario 3: Players finish a match and you want to update their winnings (change_score_for) for that match.

ruby-1.8.7-p302 > total_money_lb = Leaderboard.new('total_money')
 => #<Leaderboard:0x101180680 @leaderboard_name="total_money", @page_size=25, @port=6379, @redis_connection=#<Redis client v2.1.1 connected to redis://localhost:6379/0 (Redis v2.0.4)>, @host="localhost", @redis_options={:host=>"localhost", :port=>6379}>
ruby-1.8.7-p302 > total_money_lb.change_score_for('DavidCzarnecki', 15)
 => "15"
ruby-1.8.7-p302 > total_money_lb.change_score_for('ChristianArca', 7)
 => "7"
ruby-1.8.7-p302 > total_money_lb.leaders(1)
 => [{:member=>"DavidCzarnecki", :score=>"15", :rank=>1}, {:member=>"ChristianArca", :score=>"7", :rank=>2}]
ruby-1.8.7-p302 >

Scenario 4: Players finish another wager match and you want to update their winnings (change_score_for) for that match. NOTE: If you use negative values, the score is decremented.

ruby-1.8.7-p302 > total_money_lb.change_score_for('DavidCzarnecki', -8)
 => "7"
ruby-1.8.7-p302 > total_money_lb.change_score_for('ChristianArca', 16)
 => "23"
ruby-1.8.7-p302 > total_money_lb.leaders(1)
 => [{:member=>"ChristianArca", :score=>"23", :rank=>1}, {:member=>"DavidCzarnecki", :score=>"7", :rank=>2}]
ruby-1.8.7-p302 >

BUT … REDIS?

It’s scalable because you can set up master-slave replication to have multiple read-only slaves. I assume all of these things are important to you as game developers.

All of this amounts to a Guarantee Fairy right? It makes a man feel good. You figure you put this blog post under your pillow at night, the Guarantee Fairy might come by and leave a quarter, am I right? What’s my point? The point is, how do you know the fairy isn’t a crazy glue sniffer? “Building model airplanes” says the little fairy; well, we’re not buying it. He sneaks into your house once, that’s all it takes. The next thing you know, there’s money missing off the dresser, and your daughter’s knocked up. I’ve seen it a hundred times.

Wait … what?

SCIENCE. IT WORKS B*TCHES

One of the things I love about Redis is that it’s backed by science … mathematics and computer science. In particular, all of the commands give their respective time complexity in Big O notation. If you don’t know what that is, consider yourself fired. Not really. But … still. It’s important because you want to know how long operations are going to take given a particular data set size. This might be useful information from a game developer’s perspective to figure out average or worst-case scenario time estimates for interacting with a leaderboard based on total expected players.

At the request of one of my colleagues, Ola Mork, I did some basic benchmarking. Insert 10 million sequential scores into a leaderboard (in practice you will probably not be doing this all at once) and determine the average time to request an arbitrary page from the leaderboard. Insert 10 million random scores into a leaderboard (again, in practice you will probably not be doing this all at once) and determine the average time to request an arbitrary page from the leaderboard.

The long and short of it is that it’s fast, with a negligible difference in leaderboard page retrieval time between the sequential and random cases.

10 million sequential scores insert:
  ruby-1.8.7-p302 > insert_time = Benchmark.measure do
  ruby-1.8.7-p302 > 1.upto(10000000) do |index|
  ruby-1.8.7-p302 > highscore_lb.add_member("member_#{index}", index)
  ruby-1.8.7-p302 ?> end
  ruby-1.8.7-p302 ?> end
   => #<Benchmark::Tms:0x101605660 @label="", @stime=173.61, @total=577.52, @real=911.718175172806, @utime=403.91, @cstime=0.0, @cutime=0.0>
Average time to request an arbitrary page from the leaderboard:
  ruby-1.8.7-p302 > requests_to_make = 50000
   => 50000
  ruby-1.8.7-p302 > lb_request_time = 0
   => 0
  ruby-1.8.7-p302 > 1.upto(requests_to_make) do
  ruby-1.8.7-p302 > lb_request_time += Benchmark.measure do
  ruby-1.8.7-p302 > highscore_lb.leaders(rand(highscore_lb.total_pages))
  ruby-1.8.7-p302 ?> end.total
  ruby-1.8.7-p302 ?> end
   => 1
  ruby-1.8.7-p302 >
  ruby-1.8.7-p302 > p lb_request_time / requests_to_make
  0.001808
   => nil
10 million random scores insert:
  ruby-1.8.7-p302 > insert_time = Benchmark.measure do
  ruby-1.8.7-p302 > 1.upto(10000000) do |index|
  ruby-1.8.7-p302 > highscore_lb.add_member("member_#{index}", rand(50000000))
  ruby-1.8.7-p302 ?> end
  ruby-1.8.7-p302 ?> end
   => #<Benchmark::Tms:0x10164ebf8 @label="", @stime=172.94, @total=577.91, @real=1356.57155895233, @utime=404.97, @cstime=0.0, @cutime=0.0>
Average time to request an arbitrary page from the leaderboard:
  ruby-1.8.7-p302 > requests_to_make = 50000
   => 50000
  ruby-1.8.7-p302 > lb_request_time = 0
   => 0
  ruby-1.8.7-p302 > 1.upto(requests_to_make) do
  ruby-1.8.7-p302 > lb_request_time += Benchmark.measure do
  ruby-1.8.7-p302 > highscore_lb.leaders(rand(highscore_lb.total_pages))
  ruby-1.8.7-p302 ?> end.total
  ruby-1.8.7-p302 ?> end
   => 1
  ruby-1.8.7-p302 >
  ruby-1.8.7-p302 > p lb_request_time / requests_to_make
  0.00179680000000001
   => nil

FIN

Leaderboards may seem complicated, but they are not. By wrapping software like Redis with a set of semantic commands, we have a near-complete set of use cases covered for how we might interact with a leaderboard. And finally, it’s fast.

Take a look at the leaderboard gem code and get involved. Comments or code. The choice is yours.

You can find more hilarity over on my Twitter account, CzarneckiD.

Akka Does Async

Original Author: Sean Parsons

What On Earth Is This All About?

To those who have not seen it, I’m going to introduce the highly useful Akka library and show how asynchronous software can be developed easily using it. Most programming languages provide some kind of threading capability, from fork() in C to Threads and ExecutorService in Java, with a lot of variation in between. The main downside to these models is that they tend to require a lot of manual co-ordination on the part of the programmer. For example, spawning a Thread directly in Java to do some work is fine, but requires a hand-rolled hook to catch the case where the Runnable throws an exception, which often results in you writing more code to handle that than the actual logic you’re interested in.

Akka And Actors.

Akka adopts the Actor Model to alleviate a lot of the issues with normal concurrent programming (Erlang and Scala are languages that have native support for actors). Akka goes much further than the stock Scala actors and supports writing actors in Java as well, with clean API support for both languages. Actor programming in Akka using Scala is based around pattern matching, which I explored in my previous post; the first example from there can be easily translated into an actor, as we can see here:

class IntMatchActor extends Actor {
  def receive = {
    case 1 => self.reply("This is number 1.")
    case 2 => self.reply("This is number 2, it comes after 1.")
    case _ => self.reply("This value is neither 1 nor 2.")
  }
}
val intMatchActor = actorOf[IntMatchActor].start
println(intMatchActor !! 1)
println(intMatchActor !! 99)

The main differences are that it inherits from Actor, calls a reply method on the self instance to give the response to the caller, and rather than calling a method on the actor in the conventional sense, it passes a message to the actor. That operator (!!) is where the special sauce is held: it puts the message into a mailbox for the actor to process asynchronously. I hear you cry “But this is no different to before!”, however there are variations on that operator which do things differently:

  • ! – This is a fire and forget call which immediately returns control to the caller, but means any response from the call won’t even be sent by the callee.
  • !!! – In this case the operator returns an instance of Future which can be checked for completion at a later date. Two IO intensive jobs could be initiated at the same time and their results combined once they have both completed as opposed to running them sequentially, as seen below:

val firstLongProcessActor = actorOf[FirstLongProcessActor].start
val secondLongProcessActor = actorOf[SecondLongProcessActor].start
val startTime = currentTimeMillis
val firstFuture = firstLongProcessActor !!! 1000
val secondFuture = secondLongProcessActor !!! 1000
Futures.awaitAll(List(firstFuture, secondFuture))
val timeTaken = currentTimeMillis - startTime
println("Process took " + timeTaken + "ms.")
// Output is:
// 1: Sleeping for 1000ms.
// 2: Sleeping for 1000ms.
// 1: Slept for 1000ms.
// 2: Slept for 1000ms.
// Process took 1006ms.

Making Sure When Things Go Pop You Don’t Care.

As software developers we go to pains to ensure that problems are averted, handled and/or ignored. Rather than trying to cover every eventuality (try/catch expressions more than 2 levels deep are when I start to cry), it’s much better to embrace failure, expect it and cope with it. Two prime examples of this would be services on the end of a wire, like a database server of some form, or a misbehaving third party library that explodes after 50,000 calls have been made to it. Supervisors can keep an eye on one or more actors and restart them appropriately, effectively invisibly to the caller.

Here’s a simple example of the restart mechanic, whereby the actor BrokenActor uses a class called BrokenLibrary to total up some arbitrary values, but when it exceeds 10 it throws an IllegalStateException:


val supervisor = Supervisor(SupervisorConfig(
  AllForOneStrategy(List(classOf[IllegalStateException]), Some(3), Some(1000)),
  Supervise(actorOf[BrokenActor], Permanent) :: Nil
))
supervisor.start
val brokenActor = supervisor.children.head
printlnRequestResult(brokenActor, 1)
printlnRequestResult(brokenActor, 5)
try {
  printlnRequestResult(brokenActor, 20)
} catch {
  case _ => println("Expected exception thrown.")
}
printlnRequestResult(brokenActor, 5)
// Output is:
// Request = 1, Result = Some(1)
// Request = 5, Result = Some(6)
// Expected exception thrown.
// Creating new BrokenLibrary.
// Request = 5, Result = Some(5)

While we still get the exception that occurred, the supervisor has invisibly restarted the actor back to a stable state before the next request is made against it.

There’s A Lot More.

This one post has only really scratched the surface of Akka; other important aspects are the persistence of mailboxes, software transactional memory, integration with AMQP servers and with Apache Camel. I intend to cover more of these in future posts and to look at how Akka performs at some later date.

Until Next Time…

As with my last post (and I hope every post), I’ve created an SBT project that will allow you to play with all the things discussed in this post to your heart’s content, which is at the following URL:

Omni-directional shadow mapping

Original Author: Bernat Muñoz

Introduction

If there’s one thing that I like more than learning, it’s teaching, so I’ve tried in the past to give back as much knowledge as I’ve received from others: sometimes simply translating from English to Spanish, other times writing from my experiences on a concrete matter. So when blogs became trendy, I started writing about what I knew or was working on. But then I started to have less and less free time, and eventually stopped writing at all.

To keep this short, I plan to write about what I’m working on at home, mostly related to implementing different techniques in my graphics engine for PC. Some posts probably won’t be hard to grasp from a technical point of view, nor exclusive on the net, but I want to give a detailed enough explanation that lets you actually implement the technique and know where to go from there. Also, I might get corrected if I assumed something that’s not true.

Another motivation for talking about something that I implemented a few hours/days ago is providing a working sample. I intend to create a very simple demo, containing as little as possible besides the described technique, so any doubts can be cleared by just looking at the source.

Shadow mapping

When adding shadows to a graphics engine, most people start by implementing what I (and I guess most people) consider the easiest form of shadow mapping, which is spot light shadow mapping. This is quite logical, as it serves as a base for other techniques, and it’s quite straightforward to implement. The funny thing is that most people don’t actually add shadow mapping support for other light types once this basic implementation is done. This might not be totally true of professional engines, but I’ve seen it happen a lot in hobby 3D engines. At least, that’s my experience; I would love to be corrected on this.

From an implementation point of view, there’s one major choice to make when dealing with omni-directional lights and their shadows: either use non-linear projections or multiple linear projections (see these slides from NVidia, page 43). I won’t talk about using non-linear projections, because I’ve never implemented omni-directional shadow maps with them, but you can gather some details from the NVidia slides linked. The good thing about multiple linear projections is that they are a simple extension of spot light shadow mapping: the idea is to use several normal shadow maps to cover all directions (more details later on this).

So, wrapping up, I’m going to detail how to do omni-directional shadow mapping using multiple linear projections, assuming that you already know or have a working implementation of spot light shadow mapping. Let’s hope this is actually useful to someone 🙂

Omni-directional shadow mapping: general idea

Let’s first detail what we want to implement. Keeping in mind the technique we’re aiming at, we need to calculate several shadow maps that cover all directions. This means calculating 6 shadow maps, each one following an axis direction (+-X, +-Y, +-Z), so we cover all the space that the light potentially illuminates and casts shadows into. In order to avoid any gaps or overlaps between shadow maps, we have to force each camera’s FOV to 90º. Visually, we want to mimic this setup:

Then, we could just attach those textures to a cube map, and sort of use the light -> pixel vector to make the shadow lookup. The problem is that, as detailed in the slides linked before, cubemaps and shadow textures don’t actually match what we need. The main problem is that cubemap distance is radial, while planar distance is what we would like to use. That can be fixed: as we have a 90º projection, the depth value can be easily reconstructed from the light -> pixel vector.

Another problem with cubemaps is the lack of hardware support for Z comparison and PCF. That could be easily fixed by just doing the comparison and PCF taps ourselves, but it’s quite a pity losing those, performance-wise. A better (from the performance point of view) solution is to lay out all the shadow maps into a single 2D texture, set up in a grid. Then we can use an indirection cubemap to map the light -> pixel vector to the 2D texture, and do the shadow computation in a regular 2D shadow texture, thus getting hardware Z comparison and PCF. This is known as Virtual Shadow Depth Cube Textures (VSDCT).

Omni-directional shadow mapping: generating the indirection texture and shadow maps

By now, you should have a pretty clear idea of the whole process. First of all, we need to generate the indirection cubemap. This is simply a cubemap that maps from the light->pixel vector to the 2D texture actually containing our calculated shadow maps. The simplest approach is using a small two component cubemap (GL_RG16 in OpenGL). Then, for each face of this cubemap, just generate the corresponding 2D UVs: for example, with the 2D grid layout from the prior image, and assuming a 2D texture size of 1024×1024, the UVs for the +X face would go from (0,0) to (1024/3, 1024/2). That means that, if the indirection cubemap is 256×256, the +X face would contain (0,0) on pixel (0,0), and (1024/3, 1024/2) on pixel (255, 255). All the omni lights that use the same texture size can share the indirection texture, so you want to generate those during loading/light setup.
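As a concrete sketch (C++/OpenGL, assuming a GL_RG16 cubemap texture is already created and bound, and the 3x2 grid layout described above; the face size is illustrative):

#include <vector>
#include <GL/gl.h>

// Fill each face of the bound indirection cubemap with normalized UVs
// into the 2D shadow grid (multiply by the 2D texture size to get the
// texel coordinates described above).
void FillIndirectionCubemap(int faceSize)
{
    std::vector<GLushort> texels(faceSize * faceSize * 2);
    for (int face = 0; face < 6; ++face)
    {
        const int col = face % 3, row = face / 3; // this face's grid cell
        for (int y = 0; y < faceSize; ++y)
        {
            for (int x = 0; x < faceSize; ++x)
            {
                const float u = (col + x / float(faceSize - 1)) / 3.0f;
                const float v = (row + y / float(faceSize - 1)) / 2.0f;
                GLushort* t = &texels[(y * faceSize + x) * 2];
                t[0] = (GLushort)(u * 65535.0f + 0.5f);
                t[1] = (GLushort)(v * 65535.0f + 0.5f);
            }
        }
        glTexImage2D(GL_TEXTURE_CUBE_MAP_POSITIVE_X + face, 0, GL_RG16,
                     faceSize, faceSize, 0, GL_RG, GL_UNSIGNED_SHORT, &texels[0]);
    }
}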

Then, whenever the shadow maps need to be updated, you should repeat the following process 6 times:

  1. Generate a camera pointing to one of the basis vectors, from the light position
  2. Setup the render target to render the shadowmap to one of the positions of the grid
  3. Cull and render depth from that view

This is pretty straightforward: generating the views is a matter of generating a regular view along the axis you’re currently rendering. For example, if you use a radius as a property of your light, and you’re rendering the +X axis, that comes down to generating the view from light_position to light_position+vec(radius,0,0). Then, you need to set your render target offset and size for your current axis, so you only render the part of the VSDCT you are updating, as sketched below.
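In sketch form, the update loop might look like this (Vec3/Mat4 and the BuildLookAt/BuildPerspective/RenderDepth helpers are hypothetical stand-ins for your engine’s own code):

static const Vec3 kFaceDir[6] = {
    Vec3( 1, 0, 0), Vec3(-1, 0, 0),   // +X, -X
    Vec3( 0, 1, 0), Vec3( 0,-1, 0),   // +Y, -Y
    Vec3( 0, 0, 1), Vec3( 0, 0,-1)    // +Z, -Z
};

void UpdateShadowGrid(const Vec3& lightPos, float radius, int cellSize)
{
    // one projection shared by all faces: 90 degree FOV, square aspect
    const Mat4 proj = BuildPerspective(90.0f, 1.0f, 0.1f, radius);
    for (int face = 0; face < 6; ++face)
    {
        // note: each face also needs the matching cubemap-convention up vector
        const Mat4 view = BuildLookAt(lightPos, lightPos + kFaceDir[face] * radius);
        // render only into this face's cell of the 3x2 grid
        glViewport((face % 3) * cellSize, (face / 3) * cellSize, cellSize, cellSize);
        RenderDepth(view, proj); // cull and render depth from this view
    }
}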

Omni-directional shadow mapping: shadowing

Now, you have all the elements, so the process to actually render the lights is the following:

  1. Bind the indirection cube map
  2. Bind the 2D shadow map grid
  3. Calculate light->pixel vector (probably you’ll want to calculate that on vertex shader, interpolate and normalize it on the fragment shader)
  4. Look up the UV from the indirection cube map using the light->pixel vector, getting the UV into the corresponding shadow map of the 6 laid out on the 2D shadow map grid
  5. Calculate the eye space Z from the light->pixel vector.
  6. Create the 3 components vector, XY from the indirection cube map lookup, Z from the calculated value on step 5
  7. Look up on the 2D shadow map grid with the calculated vector on step 6, and get hardware comparison and PCF 🙂

The only tricky part is step 5. In order to get all the shadows in one pass, and fast, we don’t pass the model and view matrices for each axis, so we need to calculate that Z value from the light->pixel vector. To get the eye space Z (keeping in mind the 90º projection), just select the component of the light->pixel vector with the largest magnitude, then transform it to clip space (by the projection matrix). We could, of course, just use the light projection matrix, but there’s an easier way: as we only need to project the Z value, we can derive it from the canonical projection matrix, ending up with the following computation:
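Reconstructing that from the standard OpenGL projection (with n and f being the light camera’s near and far planes; the notation here is mine):

$$ z_{ndc} = \frac{f + n}{f - n} \;-\; \frac{2fn}{(f - n)\,Z_{eye}}, \qquad z_{shadow} = \frac{z_{ndc} + 1}{2} $$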

Where Zeye is the largest component of the light->pixel vector, as explained earlier. The resulting Z can then be compared to the stored shadow map Z.

Omni-directional shadow mapping: issues

Now you know all that’s needed to implement omni-directional shadow mapping. The main issue with the described technique is the contact points between the different shadow maps, and how filtering affects them. The problem comes from lookups into the 2D shadow map grid actually grabbing depth information from one of the other 5 shadow maps, thus getting wrong results. If you do some extra filtering besides the hardware PCF, you can start getting really noticeable artifacts. Leaving a border of several pixels between shadow maps on the 2D shadow map grid can hide some errors, but it’s not a perfect solution. Also, you can try scaling the UVs a bit depending on camera distance, or on the number of taps if that is variable in your engine; that can also hide the most visible errors, which might be enough, depending on your situation.

Closing and what’s next

Well, this was a pretty simple article about new tech I want to add to my engine, and I had planned to release a simple test bed with it, as I intend to with every post, but this time it was impossible due to time constraints and me being a bit busier than expected this week. Let’s hope I actually finish a light framework so I can provide a complete (and very simple) sample with every post, so you can get the explanation AND the source code in case I missed detailing something.

I’ll probably be working on more shadow related techniques for the next articles, with VSM, PSSM and ID shadow maps especially gathering my interest, now that I did warm up a bit 🙂

If this is where we are, where the hell are we going?

Original Author: Matt Ditton

Unlike the multitude of very talented people on this blog, today I won’t be telling you anything. Well except for what you can read between the lines. There will be no deeply technical information and no witty and funny anecdotes about the wacky history of my development experience. I’ll save them for next time.

Today I want to ask a question. You don’t have to answer right away. You can think on it if you want. Please take your time. But I’d love to know your answer.

The question is

“If we look at the current state of the industry, where are we going to be in 5 years? What does 2016 look like for us? As an industry, as developers and as a culture.

What is the world of game development going to look like?”

It’s a loaded question. I’ll even admit that maybe it’s a dumb question. As my first post to a new community of developers maybe it’s a wishy-washy blog topic. But after working through the last eleven years of this industry I really want to ask it. And I’ve got 2 reasons.

First, I keep seeing weird moments of dichotomy. And they fascinate me. I don’t know if they speak to a wider trend in development or if it’s just crazy timing. But looking at something like this…

The original data, because script embedding in posterous is a headache.

… makes me wonder if we’re turning into two very different development worlds.

John Burgardt.

another 4 people in development. As a side note, if you do the Minecraft credits at the same rate as the StarCraft 2 credits they finish in 5.8 seconds. That includes the title. But no timpani guy.

The second reason is that the future is a massive part of what I do these days. In the daytime I look after a games degree at a university. At night I’m part of a 4-6 man indie game startup. I’m trying to help build futures for newbies and veterans who all love this industry. All the while knowing that no matter what my best guess is about the future, it’s still just a guess.

The “Fear: An Appropriate Response to the Business of Next Generation Game Development” slide show about the horror of next gen development was an interesting precursor to the death of work for hire in Australia.

The broad strokes of these two ideas had some really good points; in the fine details maybe there are a few holes. But the real world impact on the lives of my friends has been staggering to watch.

Ok, so enough rambling. Here are my stupid predictions for 2016; I’d love to see yours in the comments.

Mobile

  • “Apple and Google own mobile and desktop. But the difference between mobile and desktop is only screen real-estate, oh and desktop has fewer features. Google might have a larger market but the culture of paying nothing will keep pissing off Developers till they figure out how to sell their players’ data.”

Business

  • “Publishers still own the release schedules on platforms, but Developers will own content. Big deals come with tailoring content to a platform. You’ll still have to give a chunk of IP to get anywhere, but it only took you six months to drop out the game anyway. And if all else fails you can sell it to a game show in Korea.”

Content

  • “Game to Movie and Movie to Game will continue to happen but just like Books to Movies they will continue to piss people off.”

Tech

  • “Engines are bought off the shelf, a large chunk is open source but tools set them apart. Languages are (mostly) interchangeable. You buy an engine, you write in whatever language. The arguments and bugs this causes will be hilarious.”

Outsourcing

  • “Sound and art libraries become the growing middleware market for prototyping, and pretty much everything you need will be part of the engine anyway. Interaction libraries start making an impact, but the growing hatred of “fetch me a spigot” mini games means you still have to do some work. But that job where one guy makes rocks for 5 years is gone. Sorry Mark.”

Consoles

  • “Huh, like my phone?
    No Consoles.
    Like the web?
    No Consoles.
    You mean my Xbox tv?
    I like the Sony Google center better, it’s got American Idol Hero Season 12 coming out.”

Education

  • “Your GPA hinges on your metacritic rating. Once you earn your beer money through microtransactions in your free-to-play art game, you win the degree.”

Development

  • “Autodesk turns MAX, MAYA and Softimage into theme skins for AutoDesk 3D Creator thingy 2015. It’s one 3d editor engine with a thousand bolt-on plugins. Plugins are sold in themed packs “Particles”, “Animation”, “Modeling”. The combinations mean you’ll end up buying several packs just to get the 1-2 features you actually use. And still not enough people will be using Softimage. Blender will make serious inroads once people figure out how to use it.”

Employment

  • “Massive teams only exist in countries with strong government support, or around studios with current hit driven product. Major publishers listed on the stock market will continue to fire staff around quarterly earning time. But will continue to buy whole companies staffed by people they fired 2 years ago who’ve done something good with their new found freedom.”

Ok it’s your turn. Where are you going to be in 2016? And what are you going to be doing?

Rolling an iOS Game Engine Without Breaking the Bank

Original Author: James Podesta

There are a lot of great engines out there now with very reasonable prices. Unity and Unreal being the most well known. However, for small iPhone games it is still possible to roll your own engine without getting bogged down in endless months of development. Even with the adequate 3d power of the iPhone4 and iPad,  low-tech 2D games are still topping the charts and proving that a sole developer can still make a mark in the appstore.

Making your own engine can be a great learning experience,  and gives you complete control over any performance bottlenecks and the ability to fix any bugs and implement any obscure features your game might need.

I’m currently developing an RPG called Magelore. I am using my own engine for this, which I initially wrote with Daniel Krenn for Venger, released under the Wretched Games label. It has been used in 4 released apps, with Magelore to be the 5th. The engine is simple and contains the minimum set of features to streamline iOS app development without any bells and whistles. While Venger was 3D, our subsequent games have been 2D, as is Magelore.

The most important point worth noting is that if you don’t want to get bogged down writing an engine for months and never getting onto the game part, it is best not to get too obsessed with optimising the engine. A typical iOS 2D app is generally fill rate limited, so spending too much of your precious time optimising engine components isn’t going to pay off in the short term. Physics and collisions are the other likely bottleneck. Unless you’re doing your own physics system, there is not much you can improve here from the engine side. You simply need to adjust your design and gameplay code so as not to need too many expensive operations per frame.

I’ll now go through the minimum elements you need for an engine in order to get a game onto the AppStore.  I’ll go into some of these modules in much greater detail in future blogs.  This will need to be a very brief overview or the blog would be massive.

Cross Platform

It can be a huge boon for development to make your engine cross platform, and really doesn’t cost a lot. Our engine runs on PC also, which means we can use Visual Studio, which has a much superior debugger to Xcode’s. It also means we have access to mouse and keyboard for debugging shortcuts, easy screen shots and frapping of videos. The PC version has been set up so it will run in 320×480, 640×960 or 1024×768 modes in both portrait and landscape orientations. This way I can test what the game will look like on any device by changing a command line parameter. Since you’re using OpenGL, to make something cross platform you mainly just need to isolate a few minor modules – startup, sound, textures, input, and minor discrepancies between OpenGL and OpenGL ES. Our Objective-C OSX files are restricted to the AppDelegate, the View class, the OpenGLRenderer1 and 2 classes, SoundEngine and the Texture class.

C++ versus Objective C

My engine is completely C++. The Objective-C parts are restricted to the AppDelegate, the EAGLView class and small OpenGLRenderer1 and OpenGLRenderer2 classes. Most importantly, this means the PC version shares 99% of the code with the iOS version.

File System

On most games, you can get by with just a ReadFile and a WriteFile method. Most games do not require streaming files, and it’s generally more efficient to read them in one chunk and process them in memory. I always use a flat filesystem with my games. This means every resource is accessed by just a unique filename with no path. This removes a lot of possibilities for human error when managing your assets, since they can be moved anywhere and still work, and can never get mixed up. I’ve used that technique for console games with 10 gigs of assets, so there’s no scaling issue once you have worked out good file naming standards.
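A minimal sketch of that ReadFile (ResolvePath being a hypothetical helper that maps the unique asset name to an actual path on the current platform):

#include <cstdio>
#include <cstdlib>
#include <cstddef>

// Read a whole asset into one malloc'd chunk; the caller frees it.
unsigned char* ReadFile(const char* assetName, size_t* outSize)
{
    FILE* fp = fopen(ResolvePath(assetName), "rb"); // ResolvePath: hypothetical
    if (!fp)
        return 0;
    fseek(fp, 0, SEEK_END);
    const size_t size = (size_t)ftell(fp);
    fseek(fp, 0, SEEK_SET);
    unsigned char* data = (unsigned char*)malloc(size);
    if (fread(data, 1, size, fp) != size)
    {
        free(data);
        data = 0;
    }
    fclose(fp);
    if (outSize)
        *outSize = data ? size : 0;
    return data;
}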

Asynchronous loading is an optional extra that will give you some power to polish UI transitions between levels, but it’s not essential in most games. Your sound system will probably need streaming for music, but I have used middleware for sound (namely Apple’s SoundEngine.cpp/h) which handles its own threading and soundbank streaming.

Data File Reader

You’ll want to data drive as much content as possible, so you’ll need to pick a data file format and stick to that standard for all your data files. I use XML and have a small helper object that just manages construction and destruction of TinyXML documents. Generally the runtime hit for loading and parsing your data will not be an issue, but if it is, it usually means you need to break up your XMLs into smaller parts so they don’t need to be loaded all at once. Supporting auto-refresh is very useful for development here. That is, your class that reads an XML should continually check if the file has changed and reload it. This is especially good for any system that needs rapid iteration, like particle systems or combat balancing, and saves you having to spend a lot of development time implementing an editor for those data files.
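Here is a sketch of the auto-refresh check (using stat(); on Windows you would use _stat or the file APIs, and Reload is a hypothetical re-parse call):

#include <sys/stat.h>
#include <ctime>

// Poll the data file's modification time and reload when it changes.
struct WatchedXml
{
    const char* path;
    time_t      lastModTime;

    void Update() // call once a frame (or every N frames)
    {
        struct stat info;
        if (stat(path, &info) == 0 && info.st_mtime != lastModTime)
        {
            lastModTime = info.st_mtime;
            Reload(path); // hypothetical: re-parse the TinyXML document
        }
    }
};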

Maths

Good maths support is essential in any game. Vectors, trigonometry wrappers, random number generators, interpolators, and Min, Max, Clamp templates should be there. In a 3D game you’ll need matrix and quaternion support as well. Code for any complex routines, like quaternion maths, is easily found on the net, so there’s no need to be a maths genius to put together a good library of maths helpers. Depending on your game design, you will generally not need to worry about low-level optimisations of your maths library. When choosing your vector classes, be careful not to fall into the optimising trap. The readability difference between an optimised vector library that efficiently pipelines a vector math unit, versus a simple operator overloaded type-safe vector library, is massive. However, the speed improvement you might get for that pain may be insignificant if maths was never your bottleneck.
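For illustration, the templates mentioned can be as small as this:

// Small maths helpers; trivial, but used everywhere.
template <typename T> inline T Min(T a, T b)          { return a < b ? a : b; }
template <typename T> inline T Max(T a, T b)          { return a < b ? b : a; }
template <typename T> inline T Clamp(T v, T lo, T hi) { return Min(Max(v, lo), hi); }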

Containers

You need to provide a good set of stacks, lists, maps and hashtables. I’d recommend just using STL for this, but keep an eye on your object file sizes.  A common trick to avoid code bloat is to use containers of pointers, as compilers can generally share the same code for a std::vector<classA *> and a std::vector<classB *>.  There are some other advantages to using a container of pointers, so it’s well worth looking into.
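
One way to take that code-sharing trick a step further (this wrapper is my own illustration, not something from the engine) is a thin type-safe shim over a vector of void pointers, so every instantiation funnels into the same underlying container code:

#include <cstddef>
#include <vector>

template <typename T>
class PtrVector
{
public:
    void   PushBack(T* p)        { m_items.push_back(p); }
    T*     operator[](size_t i)  { return static_cast<T*>(m_items[i]); }
    size_t Size() const          { return m_items.size(); }

private:
    std::vector<void*> m_items;  // shared by all PtrVector<T> instantiations
};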

Render Framework

Of course, you’ll need to be able to render graphics, and some sort of render manager class can help co-ordinate sprite systems, material systems and models, and provide an extendable framework for adding new types of rendering primitives.  I have taken a layered approach to rendering.  The scene is split into a number of layers, and each layer holds a bunch of render jobs.  Each layer is given a view object which determines its camera and projection matrix, thus determining whether it is a 2D or 3D layer.  You can specify a sorting mode for each layer.  Each render job is added with a priority value: in a 3D layer, that priority might be the render object’s centre Z co-ordinate, and you can specify that this layer gets sorted.  Other layers can be left unsorted, so the order you add render jobs becomes the order they render; this can be useful for UI.

You can also set render targets and capture targets for each layer.  A render target specifies that this layer will render to a specific texture instead of the back buffer.  A capture target specifies that a copy of the back buffer will be taken when the layer is finished and stored in a specific texture.  In practice, capture targets are quite slow on iOS, so you’re better off sticking with render targets only.  Note that this is really only used for special effects, so you can happily skip this feature for a simple iOS game.

Render jobs are an extendable, callback-driven rendering system.  Any engine or game module can implement its own render job; examples of render jobs I support are sprites, fonts and primitive rendering.  The concept is simple: you tell the engine that you are creating a render job on a specific layer with a specific priority, and supply it with a callback function.  The engine returns a pointer to some raw memory you can use for your job data. Once you have filled as much memory as you require, you tell the engine that you are done.  At the appropriate time in the draw pass it will invoke your callback function, passing it your render job data. You can then access OpenGL directly to implement whatever kind of rendering you need.
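
A simplified sketch of what such a system might look like. The names (RenderLayer, BeginJob) are invented for illustration, the allocator is deliberately naive, and the “tell the engine you are done” step is omitted for brevity; this is a sketch of the pattern, not the engine’s real interface:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

typedef void (*RenderJobCallback)(void* jobData);

struct RenderJob
{
    float             priority;
    void*             data;
    RenderJobCallback callback;
};

inline bool ByPriority(const RenderJob& a, const RenderJob& b)
{
    return a.priority < b.priority;
}

class RenderLayer
{
public:
    // Reserve raw memory for a job; the caller fills it in.
    void* BeginJob(float priority, size_t dataSize, RenderJobCallback callback)
    {
        RenderJob job;
        job.priority = priority;
        job.data     = std::malloc(dataSize); // a real engine would use a frame allocator
        job.callback = callback;
        m_jobs.push_back(job);
        return job.data;
    }

    // Sort by priority (if requested), invoke every callback, then
    // throw the frame's jobs away.
    void Draw(bool sorted)
    {
        if (sorted)
            std::sort(m_jobs.begin(), m_jobs.end(), ByPriority);
        for (size_t i = 0; i < m_jobs.size(); ++i) {
            m_jobs[i].callback(m_jobs[i].data);   // callback talks to OpenGL directly
            std::free(m_jobs[i].data);
        }
        m_jobs.clear();
    }

private:
    std::vector<RenderJob> m_jobs;
};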

Materials and Textures Pipeline

You’ll certainly need textures.  Support for compressed textures (PVR) is also desirable, but not essential: in 2D artwork the compression artifacts can be quite noticeable, so they are really better suited to 3D object textures. I’ve restricted myself to just PNGs and PVRs so as to simplify the art requirements.  It’s fairly easy to support a range of texture formats through middleware, but it can be best to just stick to one format that meets all your needs rather than end up with a massive bloat of code and inconsistent resources.  Materials are currently just a simple table of information detailing what texture they use, and what blend modes and render states to activate.

Fonts

You’ll need to be able to render fonts.  I use AngelCode’s excellent BMFont utility to convert any font into a PNG with a text data file.  Font renderers will need to support all justification modes, and it is very handy to have a render-in-box method that will word-wrap text into a box area; this comes in handy for drawing requesters.  I also like to have support for embedded commands, like changing text colour or jumping to a tab position, so you can easily syntax-highlight a message without having to split it into multiple font draw calls.

Sprites

Sprites are your moving, animating elements in a 2D game, though they can be used as billboards in a 3D game also. For sprites I use PSDs as my input source.  Each sprite animation frame comes in as a separate PSD layer, and hotspots can also be defined through layers. Since I do all my art creation in Photoshop, this is a convenient storage format. It may also be useful to support making a sprite out of lots of single PNG files, which would allow you to render something animated in 3dsMax and bring it quickly into the game.  I only support PSDs on the PC platform, which then dumps out a binary sprite file with all sprites quad-packed into large texture pages.  This sprite file is then used by the iOS device for its textures, which saves on load times and storage size on the iPhone, with the trade-off that I need to run the game on the PC to generate data for the iOS version.

Particle System

At some point you’ll want some explosions or sparkles or some autonomous animating effect. For Magelore,  I use a simple envelope driven sprite manager with movement defined in an XML file. This gives me a very flexible system with minimal coding effort.  If your game is heavily focussed on retro pixel based effects where you need 100s or 1000s of points,  you may want to spend a lot of time here optimising for data cache access, vectorising the updates and reducing the dynamic memory accesses during the draws.  This can bog you down for a long time and needs many iterations of profiling and testing and very low-level knowledge of the memory pipeline for your hardware. However, for a lot of game styles you can get by with just a simple flexible generic system. Fillrate is probably going to be your biggest bottleneck with particle systems.

Primitive Renderer

Being able to dynamically generate and render raw textured, uv’d polygons is always useful for prototyping new effects or one-off effects.  I prefer to put a wrapper over OpenGL to hide any minor OpenGLES discrepancies. It is also bad for anyone to touch the raw render states directly, as your material pipeline will be unaware of those changes to state.

Accelerometer Input

Many iOS games will need accelerometer input. This comes to your app via the accelerometer message, so you can just provide your appDelegate as the delegate for that message, then wrap the values up and send them to your C++ input module for everything else to access.  I put adaptive smoothing over the readings so the values look smoother when rendered on screen but are still responsive to fast movement.
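
A sketch of one way adaptive smoothing can work (my own illustration of the technique, not the engine’s code): blend towards the raw reading with a factor that grows with the size of the change, so slow drift is heavily smoothed but fast movement comes through almost unfiltered.

#include <cmath>

struct AccelFilter
{
    float smoothedX, smoothedY, smoothedZ;

    AccelFilter() : smoothedX(0.0f), smoothedY(0.0f), smoothedZ(0.0f) {}

    void Update(float rawX, float rawY, float rawZ)
    {
        const float delta = std::fabs(rawX - smoothedX)
                          + std::fabs(rawY - smoothedY)
                          + std::fabs(rawZ - smoothedZ);

        // Small changes -> small alpha (heavy smoothing); big changes ->
        // alpha near 1 (responsive). The 0.1/4.0 constants are made up
        // and would need tuning against a real device.
        float alpha = 0.1f + delta * 4.0f;
        if (alpha > 1.0f) alpha = 1.0f;

        smoothedX += (rawX - smoothedX) * alpha;
        smoothedY += (rawY - smoothedY) * alpha;
        smoothedZ += (rawZ - smoothedZ) * alpha;
    }
};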

UI System

My UI system is, in most respects, like a standard windowing input system: frames that have an area on screen, and children.  It deviates in a few areas to be more useful in a multitouch, finger-driven environment like iOS devices.  Frames can be oversized and can overlap; touches go to the frame they are most inside.  Frames can also register for multitouch, which means that while held, all future touches are captured by that frame.  An example use might be a zoom icon: when you touch zoom, you can use a second finger to pinch the screen to control the zoom.  I still expose the raw touch list to the game for anything very freestyle that doesn’t map to the simple concept of buttons.

Serializer

One popular way to save games is via a serializer. Some important features for a serializer are listed below (a minimal interface sketch follows the list):

endian agnostic – always serialize a byte at a time so the endianness of the platform doesn’t matter. Useful if you ever make a PS3 or X360 version of your game and want data compatibility.

bit serializing – I like to include a method that lets you serialise an int to a specified number of bits.  If you use the serializer for creating network packets, you’ll want to pack them as small as possible.

version number – important for your future upgrade path, so you can add new content to your objects.

compression – supporting gzip compression is easy and fast.  If you’re posting large serializations to web servers, any reduced bandwidth is a win.

blocks – I support a StartBlock and EndBlock serializer command.  This inserts a size into the stream for you, enabling you to skip whole blocks during deserialization if that object is no longer important to your game.

ascii output – I include an ASCII serializer (write only) for debugging purposes. When saving my games, I run the process a second time using the ASCII serializer, so if anything is going wrong I can easily see which data is causing it.
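
Here is the promised sketch of a serializer interface with those features. The class and method names are invented for illustration, not the engine’s actual API; the point is that one Serialize method per object serves reading, writing, network packets and the debug ASCII writer alike.

#include <stdint.h>
#include <stddef.h>

class Serializer
{
public:
    virtual ~Serializer() {}

    // Endian agnostic: everything bottoms out in byte-at-a-time I/O.
    virtual void SerializeBytes(void* data, size_t count) = 0;

    // Bit serializing: pack an int into 'bits' bits (for small packets).
    virtual void SerializeBits(uint32_t& value, int bits) = 0;

    // Versioning: stream a version number so old saves stay loadable.
    virtual void SerializeVersion(uint32_t& version) = 0;

    // Blocks: EndBlock patches a size written by StartBlock, so a
    // reader can skip the whole block if it no longer cares about it.
    virtual void StartBlock() = 0;
    virtual void EndBlock() = 0;
};

// A game object serializes itself symmetrically for load and save:
struct PlayerState
{
    uint32_t version;
    uint32_t health;    // known to fit in 7 bits, say
    float    position[3];

    void Serialize(Serializer& s)
    {
        s.SerializeVersion(version);
        s.StartBlock();
        s.SerializeBits(health, 7);
        s.SerializeBytes(position, sizeof(position));
        s.EndBlock();
    }
};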

Scripting

A scripting language is essential for data-driven games. I use a custom scripting language in my engine. If you take an off-the-shelf language like Lua or GameMonkey, just be sure you’re not going to run into performance and memory bottlenecks: a lot of existing script languages can eat up a considerable portion of your CPU just doing garbage collection.  My script language was meant purely to automate simple tasks and has no dynamic memory usage. It is meant to be used as high-level control, so I can embed scripts like “open door; delay 3; close door” into any system.  Note that some scripting languages are not designed to handle a command like “delay 3” – one that stalls the script for 3 seconds – unless they are run as co-routines or threads, which can add unwanted complexity.
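
A sketch of how “delay 3” can work without co-routines or threads (my own minimal interpreter, invented for illustration): the script keeps a program counter and a remaining-delay value, and Update() simply returns early until the delay has elapsed.

#include <string>
#include <vector>

struct ScriptCommand
{
    std::string name;   // e.g. "open", "delay", "close"
    float       arg;    // e.g. 3.0 for "delay 3"
};

class SimpleScript
{
public:
    SimpleScript() : m_pc(0), m_delayRemaining(0.0f) {}

    void AddCommand(const std::string& name, float arg)
    {
        ScriptCommand cmd = { name, arg };
        m_commands.push_back(cmd);
    }

    // Called once per frame with the frame's elapsed time in seconds.
    void Update(float dt)
    {
        if (m_delayRemaining > 0.0f) {
            m_delayRemaining -= dt;
            return;                       // still stalled on a delay
        }
        while (m_pc < m_commands.size()) {
            const ScriptCommand& cmd = m_commands[m_pc++];
            if (cmd.name == "delay") {
                m_delayRemaining = cmd.arg;
                return;                   // resume from here next frame
            }
            // ... dispatch the other commands ("open door", etc.) ...
        }
    }

private:
    std::vector<ScriptCommand> m_commands;
    size_t m_pc;
    float  m_delayRemaining;
};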

Physics

I’ve not used any physics systems in our released games, and Magelore is no exception.  Venger used a simple collision system for ray collides and custom 3D movement routines.  For a 3D physics system I’d recommend looking at Bullet or Newton Physics; for a 2D physics system it’s hard to go past Box2D.  Writing your own requires very strong mathematics skills, and I’ve seen many failed attempts by good programmers.

Sound

For iOS sound we’ve used SoundEngine.cpp, which comes with Apple’s sample programs. It provides simple access to playing mono and stereo caf files, and to streaming m4a files.

Video Playback

Often useful. Fortunately this is extremely easy to implement on iOS through Apple’s MPMoviePlayerController class. You just create one of these and add its view on top of your current view; you get a callback later when the video is complete.

Debugging

It’s important to support a basic suite of asserts, errors and logging. Lately I’ve taken to using HTTP debugging, where you listen on port 80 and deliver game information as HTML pages in response to HTTP GET requests. This is great for getting debug information from remote machines, like someone QA’ing your game.  On my last project, which was very script heavy, I built a full debugger for my script system via the HTTP webpage. That turned out to be a great help to the designers, and if they really got stuck, I could just browse to their X360’s web page and see what was going wrong.
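
A bare-bones sketch of the idea, as my own illustration using POSIX sockets (I’ve used port 8080 here since binding port 80 normally needs elevated rights; a real version would parse the GET path, run on its own thread, and serve richer pages):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Serve a single hand-built HTML debug page per GET request.
void ServeDebugPages(volatile int* frameCount)
{
    int listenFd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);
    bind(listenFd, (sockaddr*)&addr, sizeof(addr));
    listen(listenFd, 4);

    for (;;) {
        int clientFd = accept(listenFd, 0, 0);
        if (clientFd < 0)
            break;

        char request[1024];
        ssize_t ignored = read(clientFd, request, sizeof(request));
        (void)ignored;                   // a real version would parse the path

        char response[512];
        int len = std::snprintf(response, sizeof(response),
            "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
            "<html><body><h1>Game Debug</h1><p>Frame: %d</p></body></html>",
            *frameCount);
        ssize_t written = write(clientFd, response, len);
        (void)written;
        close(clientFd);
    }
    close(listenFd);
}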

Scene Manager and Entity System

You’ll need some kind of object framework – something to keep track of all your objects and call their Update and Draw methods.  I’m just using an old-style fat entity system for simplicity. My scene management is a simple list of entities. I can get away with this because I generally only have objects alive if they are near being visible, and I don’t have so many live objects that it becomes a bottleneck.  Venger was a tunnel shooter, so we were able to just create and destroy all objects as you progressed through the tunnel, meaning all live objects were valid for Update.

I try to keep my hierarchies as flat as possible.  If you’re ever finding that your hierarchies are getting deep, it’s time to switch to a component system.  However, components come with some baggage, since you have extra intra-entity communication, dependency and ordering issues to deal with. So there are still some advantages to sticking with a fat entity system, especially if you already have one working and the alternative would be to scrap it and start all over again 😉

All Done

Phew… Now that I have all that off my chest, I hope to write about more focussed topics in future blogs, where I’ll be able to describe in better detail – with pictures, code samples and videos – how aspects of the game are implemented. Areas I have ear-marked for discussion include how content is generated, applying updates to serialized levels, script usage in XML content, the real-time lighting system, and so forth.

For information on the progress of Magelore, including pictures and video,  check out  

Facebook Is A Console

Original Author: Christian Arca

The “new kid on the block” always gets talked about the most. For starters, they’re new, and that can be exciting, scary, and confusing all at the same time. For the past year, “social games” have been the “new kid on the block”.  Quotes such as “Social games aren’t games!” or “Social games aren’t social!” and “You can’t put title X on Facebook because it would be horrible as a social game!” have been muttered, but there’s one common problem with all of these statements: nobody is thinking of Facebook in the right way. Facebook is a console.

(Image: Broken Social Game Scene)

So let’s say that tomorrow Epic Games and People Can Fly announced that they are releasing a web based version of Bulletstorm exclusively available via Facebook. Some of the options which would be available (in terms of monetization) would be as follows: 

Option 1 

Pay X Facebook Credits to play this game! You will have unlimited access! 

Option 2

Play the Bulletstorm Demo for free but Pay X Facebook credits to unlock the full game! 

Option 3

Bulletstorm is free to play but certain items cost virtual currency and maps are unlocked by the number of “neighbors” you have. 

Option 4

Bulletstorm is free to play but all items and maps cost virtual currency. 

Option 5

A certain gameplay mode is free to play via Facebook and it up-sells / markets the console version of the title attempting to capture a new audience. 

Those are all rather real possibilities, and they aren’t the only options. While some of these might not be the best options, they show the versatility with which one could position a title to monetize on a social network such as Facebook. If some of these options seem familiar, it might be because they’re not too far from how Steam, Xbox Live Marketplace, and the PlayStation Network operate: you purchase a demo and then unlock the full game, or you purchase a title upfront and then purchase DLC for it. Facebook is simply a social marketplace that offers many viral methods of distribution, much like the iTunes App Store, which has seen great success with video-games.

As has been said elsewhere, “mobile devices are inherently social.” Video-games are all about interactions: a player’s interactions with the game, a player’s interaction with the game compared with another player’s, multiple player interactions happening concurrently – there are interactions left and right. The issue isn’t a lack of interactions, but rather how to make them meaningful or, as Koster would say, make them “emotional notes”.

If I am logged on to XBox Live and I want to play a game with my friends and they aren’t online I often find myself getting in touch with them to ask if they want to play a game. A lot of times today I find that I mostly use Facebook or Twitter to contact them. So why not create a party invite within your Facebook stream? In fact, Facebook is one GIANT matchmaking service waiting to happen for video-games. I would consider getting your friends together to play a video-game worthy of the interaction highlight reel. 

Another thing worth mentioning is any progress a player makes within a game. If I unlock a tough achievement, I’ll tell my friends about it. If I overtake one of my friends on a leaderboard, I’m going to let them – and everyone they know – know about it. If I need help on level 4 because the giant spider is really causing me some trouble, I’m going to ask my friends about it. All of these, I believe, are interaction-highlight material, because you want to brag, share, or ask for help, and that is social behavior. Not to mention that services such as Raptr, Gamer DNA, and others are completely dedicated to mimicking a Facebook wall of a player’s game activity.

As in traditional game development, it comes down to the developer executing on their title: what option makes the most sense for the title and keeps the title true to its form? So go ahead, put ANY game you’d like on Facebook.

@ChrisA9

 

The Pain Of Debuggery

Original Author: Tony Albrecht

 

As one of the older guys in the studios I work in (This is for #AltDevBlogADay right? For grumpy old game devs?), I occasionally hear young programmers whinging about how hard debugging is: about how their debugger doesn’t integrate quite the way they would prefer it to, about the complexity of multi-threaded debugging, or about how the froth on their latte isn’t quite the right consistency. When I hear this type of conversation I saunter over with my strong black coffee, introduce myself, tell them to pull their pants up, and then tell them about a job I once had and a debugging session I once survived.

This was in a time before I worked in the game industry, back in the early to mid ’90s. It was my first job where I was being actually paid to program – I’d been there for a couple of years writing code for mining and some visualisation applications for defence – you know, helping big organisations to destroy both the earth and the humans on it. It was fun stuff – I was working on Silicon Graphics machines and writing 3D graphics in OpenGL – it was as close to writing games as I could get without actually writing games.

Then I was moved onto an unusual new task, one that involved porting some PC code to a Silicon Graphics Indy. Now that in itself wasn’t that unusual. The unusual bit of this job was that it was code for a program that was to be run in an abattoir. It was designed to take snap shots of the halved carcasses of the cattle as they moved down the production line, hung by their hooves on a hook, and then analyse those snap shots and determine meat grades, highlight any abnormalities in the flesh (tumors, severe bruising or the like) and then file the images and associated data to be used for stored cataloging. The code was awful – written by physicists (I can say that, I used to be one. Physicists tend to believe that you don’t need to be taught to code, that you just need to transcribe the maths into code and it will all just work. Anyway, I digress…) The code was full of hardcoded floats, painfully copied out to far too many decimal places, and obscure mathematics written with complete disregard of the physical limitations of floating point numbers on that platform. It was impossible to debug, but I was assured that it was working.

The Indy that was to be used wasn’t hardy enough to be on the slaughter floor where the camera was (apparently they weren’t designed to be immersed in blood), so it was kept in a concrete room not too far away. This room had only one window, and just outside that window was where the newly sloughed cow skin was taken and boiled in preparation for tanning. The smell from that neighbouring room was unpleasant, to say the least. This concrete room with the computer in it was the room I had the fragrant pleasure of programming from.

A small podium was built on the slaughter floor, onto which the camera was mounted. On the side of the podium (accessible from the ground) was a small LCD display with a keypad on it. This display would show the image of the carcass when it was captured and then flash up analysed results as a text overlay as it was saved to disk in the other room. That LCD and keypad also acted as an interface to the application (connected to the Indy via 20 or 30 meters of cable) – I could tweak the program settings from there as the carcasses moved past.

Just behind the camera’s podium (to my right as I faced the LCD) was where the cows were sliced in twain by a large mustachioed guy with a chainsaw. Yes, there was splattering blood, grinding bones and lots of noise – but that wasn’t an issue. As the cow was cut in two, the insides of that cow *usually* ended up on a conveyor belt which shipped the steaming internals off to somewhere unknown. Occasionally, the stomach missed the belt or was split by the chainsaw – usually resulting in me having to leap to safety as the semi digested contents splashed to the ground about me. Or, I’d end up buying new shoes.

The workers there seemed to take great pleasure in making me as bloody and gore splattered as possible. As the carcasses rolled past on their hooks in front of the computer’s camera, there was a guy who would ensure that if the carcass was too damaged that it would be pushed off onto another track where it could be trimmed further (if possible) before returning it to the track with the other meat. This other track happened to pass through exactly where I would stand in front of the LCD, so if I wasn’t paying attention then I’d be hit by 300kg of still quivering steak (this only happened once, to the great amusement of those around me).

So, yes, this was an unusual work environment for a programmer. To make it even more of a challenge, I would have to start work at 4am with the rest of the guys on the slaughter floor.  I’d do my 8 hours there, checking readings, collecting data, doing a bit of coding in the stinky room, dodging carcasses and bursting stomachs on the slaughter floor and then drive the hour or so back to my office and code for a few more hours – 12+ hour days were fairly normal (and no, I didn’t get overtime).

I lost count of the number of weeks that I did this for. At some stage I was involved in a 5 car pile-up on my way to the abattoir, with me being car number 5. I was fine, but my car and the $14,000 SGI Indy on the passenger seat didn’t fare so well. The Indy worked, but not for long – the motherboard had cracked when the Indy hit the car floor. Luckily it was still under warranty, and Silicon Graphics replaced it for us (don’t tell them it was due to the car crash – they might want their money back).

As the code matured I was needed less at the abattoir, so I could drop in at the start of the day and install a new build, then head back to my office to code more or work on other tasks. There was no internet connection to the abattoir, so the only way to change anything was to drive there with the code on a 3.5″ floppy. There were two technicians who were trained to start up the computers and set things running, but they weren’t programmers and had no understanding of the OS or anything other than what we’d trained them for.

One day I was at the office and received a call that the software was crashing consistently. I managed to duplicate the problem locally, and it was a quick and easy fix. But I didn’t want to do the 2 hour round-trip drive to the abattoir to install it, so I figured that I could “Airport ’75” the tech guys and talk them through making the code changes required, compiling it and setting it running. I mean, how hard could it be?

The first problem I had was that the mobile (cell) phone didn’t work in the stinky concrete room – there was no signal at all. So, I directed Technician ‘B’ (a guy with no computing skills) to be the phone guy and he had to stand at the doorway of the stinky room and relay my instructions to Technician ‘A’ who would type in what he was told. This started fine – we managed to get emacs up and running from a console window which was a great start.

Then the fun started.

“You need to press Control X Control F directory_name slash filename.cpp”

“Press what?”

“The control key – bottom left corner of the keyboard. Hold it down and then press the ‘x’ key then the ‘f’ key, then let go of the control key and type in the file name… What can you see on the screen now? Ok, now press Control Meta Cokebottle…”

…and so on…

This would have been hard enough at the best of times; it was like coding via Chinese Whispers – in a noisy environment, via a conduit who had no idea what I was talking about, relaying my instructions to someone who had only the vaguest notion of what I was talking about. I had to direct the remote programmer to the correct point in the correct line in the correct file, make the changes required, and then repeat until the code *should* have been fixed. Fortunately, it was an easy fix that required only a couple of changes – it only took us about half an hour to make the necessary code changes.

Compilation was easy: “Hold the Alt key and press Enter and that will build it. Wait until it stops printing and then read me what you see on screen”. I was hoping for success first time.

“What all of it?” came the delayed relayed reply.

“Shit”

I spent the next hour deciphering the relayed compilation errors, reverse engineering them to try and figure what could possibly be the problem in the first place and then directing the remote programmer to the point in the file that might have been the problem in order to fix it. It was fun in almost exactly the same way that drunkenly falling over and smashing face first into concrete is fun – a moment of blissful free fall when you think you might have gotten away with it, followed by the realisation that this will take a long time to fix.

But fix it I did, and it only took slightly longer than it would have for me to drive there and back. Which in my mind was a win.

So next time you’re feeling hard done by because you have to use printfs to debug some code, or you’re having trouble understanding what a particular error relates to, just remember. It could be worse. Much worse. At least you’re not standing ankle deep in blood and bile while you’re doing it.

And what happened to the program I ported? Well, I ended up proving that the original code never worked reliably. It was too heavily dependent on the ambient lighting conditions and would give different readings depending on the time of day.

So, what was your worst debugging session?

(This article and others like it can be found on my blog – Seven Degrees of Freedom)

Radix Sort for Humans

Original Author: Steven Tovey

Introduction

Welcome to my first contribution to this thus far excellent blog.

On my travels around the games industry I have noticed that although many people know about the existence of the Radix Sort, most only know that it’s typically quick (some even know it’s non-comparison based and runs in linear time). What a great number of individuals I’ve met don’t seem to know, however, is the nuts and bolts of how it actually works! Given this, I thought I’d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you’re scratching your head at the notion of a sort that doesn’t do any comparisons, or wondering how one is able to break free from the shackles of that O(n log n) thing that your Comp Sci. professor told you about, then this post will help you through it.

The high-level concept of Radix Sort can be imagined by thinking about what you could do if you had an array of, say, 128 unique 16bit integers that you wanted to sort. What is perhaps the most obvious way to do it, if you didn’t care about memory utilisation or locality of your results? A simple way would be to simply use each value in the array as an index into a results array. Something akin to the following:

for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {
     const uint32_t value = unsortedArray[currentValue];
     sortedArray[value] = value;   // the value itself is the destination index
}

Obviously this is pretty wasteful in this form, but this extremely basic concept is the foundation of how Radix Sort works. Put another way, we aim to put each value into its correct place straight away, without any need to compare it with other values. Obviously in the above example there are numerous points of failure and problems that we need to deal with, but we’ll get to those shortly.

Okay, so we’ve got the concept of roughly what a Radix Sort aims to do; now how do we go about solving the problems with our naive first-pass implementation? Like all good programmers, we’ll begin by defining the problems we’re trying to solve. Perhaps the most fatal blow to the code above is that the larger the values in your unsorted list, the more memory you need to accommodate their placement directly into the correct spot in the results array. This is because of the way we’re placing the values: we use the value itself to determine where to store it in the corresponding results array. Another problem is collisions. The word “collision” here simply means two items mapping to the same location in the results array – for example, when we have the same value twice (or more) in the input array that we’re attempting to sort. For the acquainted, the concept is analogous to a collision in the context of a hash map. Another problem, which we actually wouldn’t have with the above (due to it working on 16bit unsigned integers), is what to do with negative values or values of a different type, such as floating point numbers. It is not a stretch of the imagination to contrive use-cases in which our sort should be robust enough to deal with this type of data.

 

Analysing Data Using Elementary Statistical Tools

One of the reasons I like Radix Sort is that the solutions to these seemingly death-dealing problems come in the form of some mathematical tools you were likely taught at primary school (for those of you that only speak American, read “elementary school”). By performing some rudimentary analysis on the data, we’re able to sort it robustly and efficiently. The analysis that I speak of is something called a histogram, a simple mathematical tool used to depict the distribution of a dataset. In its traditional form a histogram is just a bar chart where the height of each column represents the number of times a given value is present in a set.

To calculate a histogram for an arbitrary set of values we can simply iterate over all the values in our set. For each value in our array we maintain a running total of the number of times it has been encountered… That’s it! (My apologies if you were expecting something more complex). We have a histogram for our array, and we’re well on our way to conquering Radix Sort. Pause for a moment here, and actually take a look at the histogram for the dataset you’re sorting. It can be interesting and very illuminating to see just how your data is spread out and you may gain some interesting insight.

An astute reader may have spotted that I actually glossed over an important implementation detail here. How exactly do you keep track of a running total for each unique value? Ideally we’d like this to be nice and quick, so we don’t want to use some nefarious mapping data structure to keep track of the histogram. Besides, I did promise that I’d keep things simple, so how about a nice, simple, flat array? The problem is, if we’re sorting, for example, 32bit integers and want to use the values themselves as array indices denoting the counter for a specific value, we’re left with the rather inconvenient problem of requiring 2^32 elements just to store the histogram! Not to fear though, as this is where the concept of the radix enters the fray.

If we have a smaller radix of, say, 8 bits, and multiple histograms (one for each byte of the 32bit integers we’re sorting – a total of four), then we can use a relatively small amount of memory for the histograms, proportional to the assumable range of the radix. Calculation of multiple histograms can be done in parallel (or in the same loop) no matter how many histograms you seek to calculate for the dataset; you just shift and mask off the bits you’re interested in for each histogram. Here’s a quick example of what I mean: for an 8bit radix, you’d simply do (x & 0xff), ((x>>8)&0xff), ((x>>16)&0xff) and (x>>24) to access each byte of the 32bit value individually. This type of bit shifting should be immediately familiar to any graphics coders out there, as it is often used to access an individual channel in a 32bit colour value. The code to calculate four histograms (one for each byte) from our 32bit integers ends up looking a little something like this:


for(uint32_t currentItem=0; currentItem<maxCount; ++currentItem) {

     const uint32_t value = unsortedArray[currentItem];
     const uint32_t value0 = value & 0xff;
     const uint32_t value1 = (value>>0x8) & 0xff;
     const uint32_t value2 = (value>>0x10) & 0xff;
     const uint32_t value3 = (value>>0x18);
     histogram0[value0]++;
     histogram1[value1]++;
     histogram2[value2]++;
     histogram3[value3]++;
}

At this point I’d like to stress that there is absolutely no reason why you must have an 8bit radix. It is common, yes, but 11bit, 16bit, or whatever you want will also work. When you’re actually implementing this algorithm you should probably try out a few different radix sizes to see which gives you the best results on your target hardware. It’s essentially a trade-off between cache performance when accessing the supporting data structures and doing fewer passes over the input array. In my experience, the more cache-optimal solution (i.e. the smaller radix size, and hence the smaller histogram/offset table) tends to perform best, as memory latency keeps increasing relative to CPU performance. It’s also worth noting that histograms for different subsets of a full dataset can be summed to produce the histogram for the entire set; this trick can be useful when doing this type of thing in parallel, but we’ll get to that in due course.

 

Offset Calculation

If you cast your mind back to the beginning of this post, I stated that the guiding principle of Radix Sort was to place values at their correct places in the output array immediately, without performing any comparisons. To do this we need to know the offset into our output array at which we should start writing each unique value from our input array. The histogram was actually an intermediate step in calculating this list of offsets. To illustrate this I will use a worked example; consider the following list of unsorted numbers:

1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5

If we wanted to place the value ‘2’ directly into the output array at its correct place, we would actually need to place it at index 4 (i.e. the 5th slot). So the table we’re computing will simply tell us: for any value of ‘2’, begin placing at index 4. We then increment the offset for a value as we place it, so that any subsequently occurring values which are the same (assuming there are any) go just after the one we’ve placed. The offset table we want to calculate for the above example would look something like this:

0 – N/A (there are no 0’s in the set, so it doesn’t matter!)
1 – 0
2 – 4
3 – 5
4 – 7
5 – 8
6 – 9
7 – 10

So how do we calculate this offset table from a histogram? That’s easy: each location in the table is just the running total of each value in the histogram at that point. So in this example, the offset for ‘2’ would be 4, because we have no ‘0’s but four ‘1’s – unsurprisingly, a total of 4! The next offset, for ‘3’, is 5 because in our data set we only have one ‘2’, and 0+4+1 is 5. The technique of increasing the offset for a given value as you place it in the output array gives rise to a very subtle but important property of Radix Sort that is vital for implementations which begin at the least significant byte (LSB Radix Sort) – the sort is stable. In the context of sorting algorithms, a sort is said to be stable if the ordering of like values in the unsorted list is preserved in the sorted one; we shall see why this is so important for LSB Radix Sort later on. Incidentally, don’t worry about what LSB means just yet, we’ll get to that!
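
Concretely, for the 8bit radix case the offset table is just an exclusive prefix sum over the histogram. A minimal sketch, reusing histogram0 from the earlier snippet:

uint32_t offsetTable0[256];
uint32_t runningTotal = 0;
for(uint32_t i=0; i<256; ++i) {
     offsetTable0[i] = runningTotal;   // where this byte value's bucket begins
     runningTotal += histogram0[i];
}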

A quick note at this point: the application of these types of analysis tricks can also be done offline to help you better design a good solution around the data you’re trying to process. I’m hoping that a certain Mr. Acton might be kind enough to share some of the increasingly important statistical tools that have made their way into his bag of tricks at some point on this blog in the future, :).

 

What Have We Got So Far?

We’ve successfully applied some elementary data analysis to our data, and in the process computed the only supporting data structure we need: a table of offsets for each radix which tells us where to begin inserting each datum in our output array. I find a fun and useful way to visualise what we’ve just done is to imagine taking the bars of the histogram for each number in our data set, rotating them 90 degrees, and then laying them end to end. You can imagine these bars coming together to form a contiguous block of memory, with each bar being big enough to hold all the occurrences of the particular value it represents. The offset table is just the set of indices along this array where each bucket begins. Now comes the fun part: actually moving the data.

Things are going to get a little more complicated here, but only very, very slightly I promise. There are actually two ways to approach data movement in Radix Sort. We can either start with the least significant bits or with the most significant bits (LSB or MSB). Most people with a serial programming mindset tend to go with LSB, so we’ll start there and then cover MSB later on as that has some advantages (predominately for parallel programming).

 

Serial Offender!

Radix Sort is typically not an in-place sort; that is to say, implementations typically require some auxiliary storage, proportional to the number of elements in the original unsorted list, in order to operate. To move the data around we need to do one pass through our unsorted array for each radix (the number of passes will change depending on the size of each datum being sorted and the size of your radix, of course). Each time we perform a pass, we are actually sorting the data with respect to the particular radix we are considering. I’ll begin by discussing this for LSB and then discuss MSB.

The first pass will typically sort by the first byte, the second by the second byte (but respecting the positioning of those elements with respect to the first), and so on. Hopefully at this point you are able to see both why the stable property of LSB Radix Sort is so important and where it comes from. The data movement pass is just a single pass over the input array for each radix, as mentioned. For each pass, you read the value in the input array, mask off the bits that are relevant to the particular radix you are considering, and then look up the offset from the table of indices we computed earlier. This offset tells you where in the corresponding output array the value should be moved to. When you do this, you need to be sure to increment the offset table entry, so that the next value from the input array with the same binary signature will be written to the next element in the array (not on top of the one you just wrote!). That’s all there is to it, really! You just do this once for each radix, something like this:


for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {

    const uint32_t value = unsortedArray[currentValue];
    const uint32_t offsetTableIndex = value & 0xff;   // least significant byte first
    const uint32_t writePosition = offsetTable0[offsetTableIndex];
    offsetTable0[offsetTableIndex]++;   // next identical byte goes one slot later
    auxiliaryArray[writePosition] = value;
}

So that’s our LSB Radix Sort for integers, but what if you want to sort floating point numbers, or indeed negative numbers? If you’re reading this you’re probably aware of the common storage formats of floating point values (the most common being IEEE 754) and two’s complement for signed integers. How does Radix Sort deal with these things? The answer is: perfectly well, if you apply a simple transformation to the input data. In time-honoured fashion I’ll leave this as an exercise for the reader, as I don’t have time to delve into it now. If you’re really struggling then please leave a comment or something and I’ll try and find time to write it up for you 🙂
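
As a hint for that exercise, here is one well-known transformation for IEEE 754 floats (this snippet is my addition, not necessarily the author’s intended answer): map the raw float bits to unsigned integers whose ordering matches the float ordering, then radix sort those.

#include <stdint.h>

// Negative floats (sign bit set) have all their bits inverted, because
// they compare in reverse bit order; non-negative floats just get their
// sign bit flipped so they sort above all the negatives. Sorting the
// transformed bits as uint32_t then matches the float ordering.
static inline uint32_t FloatToSortableBits(uint32_t floatBits)
{
    const uint32_t mask = (uint32_t)(-(int32_t)(floatBits >> 31)) | 0x80000000u;
    return floatBits ^ mask;
}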


Making the Best of a Parallel Situation

So we’ve covered how Radix Sort works when starting from the LSB, and it seems to work nicely – why would we want to complicate things any further? This part will talk about how sorting data beginning with the most significant digits changes the nature of the problem, and how we can use this to our advantage in this increasingly parallel world.

If you think about what you actually have at each stage of the data movement passes, you will see that using an MSB radix gives rise to a particularly interesting property: the first pass over the data actually produces n independent sorting problems, where n is the assumable range of the radix. If you can’t see why the parallel programmer in you should now be rubbing his/her hands together, don’t worry – here’s why!

The first pass over the data in an MSB Radix Sort has the effect of roughly bucketing the data according to its most significant byte (or whatever size radix you’re working with). To put it another way, after each pass of MSB Radix Sort you know that none of the values in a bucket will ever need to move into another bucket (which is of course not the case with LSB Radix Sort). This important property opens up the potential for us to distribute the sorting of the different buckets over multiple processing elements, crucially with each element working on independent data, making synchronisation much simpler. To perform MSB Radix Sort in serial you just re-apply Radix Sort to each bucket independently. Another interesting side effect of MSB Radix Sort is that each application makes the data more and more cache coherent for subsequent passes.
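
To make that recursion concrete, here is a minimal serial sketch with an 8bit radix (my own illustration; names and structure are invented, and in keeping with the disclaimer below, no claims of optimality are made):

#include <stdint.h>
#include <cstring>
#include <vector>

// Bucket the values by the byte at 'shift', then recurse into each
// bucket on the next byte down. Call with shift = 24 for 32bit values.
void MsbRadixSort(uint32_t* data, uint32_t count, int shift)
{
    if (count < 2 || shift < 0)
        return;

    // Histogram of the current byte.
    uint32_t histogram[256] = { 0 };
    for (uint32_t i = 0; i < count; ++i)
        histogram[(data[i] >> shift) & 0xff]++;

    // Exclusive prefix sum gives each bucket's start offset.
    uint32_t offsets[256];
    uint32_t runningTotal = 0;
    for (uint32_t i = 0; i < 256; ++i) {
        offsets[i] = runningTotal;
        runningTotal += histogram[i];
    }

    // Scatter into auxiliary storage, then copy back.
    std::vector<uint32_t> aux(count);
    uint32_t writeOffsets[256];
    std::memcpy(writeOffsets, offsets, sizeof(offsets));
    for (uint32_t i = 0; i < count; ++i)
        aux[writeOffsets[(data[i] >> shift) & 0xff]++] = data[i];
    std::memcpy(data, &aux[0], count * sizeof(uint32_t));

    // Each bucket is now an independent sub-problem; this loop is what
    // you could farm out to separate processing elements.
    for (uint32_t i = 0; i < 256; ++i)
        MsbRadixSort(data + offsets[i], histogram[i], shift - 8);
}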

 

Wrapping it up…

So, there you have it: Radix Sort explained (hopefully well enough) for humans! I hope after reading this you have a strong mental model of how Radix Sort works in practice and moves data around in memory, and importantly how it is able to break free of the O(n log n) lower limit on comparison-based sorting algorithms. Just a quick disclaimer for some of my more performance-minded readers: I make no guarantees as to the optimality of my implementation; the purpose here was to explain the concepts behind the algorithm, not to write an optimal implementation.

Non Virtual Interfaces

Original Author: Matt Dalby

Over the years, one technique has found a cosy corner in my bag of tricks. It gets used reasonably regularly, yet everyone I talk to about it has absolutely no idea what it is – although I can guarantee that pretty much everyone has either unwittingly used it, or used an interface which has been implemented using it.

 

The technique in question is the Non Virtual Interface (NVI).

 

A simple example of the technique would be:

class Interface
{
public:
    // publicly accessible interface
    void Execute()
    {
        InternalExecute();
    }

private:
    // private function which will be overridden by specialisation
    virtual void InternalExecute()
    {
        // do work
    }
};

class InterfaceChild : public Interface
{
private:
    virtual void InternalExecute()
    {
        // do work specific to this class.
    }
};

Here we see that the Interface class exposes a public interface with a single entry point, “Execute”. The Execute function calls a privately declared virtual function called “InternalExecute”.  This enforces that any class which inherits from our “Interface” will only be able to provide a specialised version of the “InternalExecute” function, and not the interface itself, keeping the interface consistent across all specialisations of our base “Interface”.

 

Keeping the interface consistent, and limiting the scope of what can be overridden helps to keep the code focused, and helps enforce the original intention and direction of the interface onto child implementations.

 

This is also a useful technique for when you wish to execute code before and/or after the virtual function call. In an interface which uses public virtual functions, we end up replicating the pre/post-amble code throughout each virtual function override; using NVI, however, we can place this common code directly around the call to the private virtual function which specialisations override. For example, we may wish to implement a naive form of thread synchronisation around our private virtual call, which can be achieved by the following:

 

void Interface::Execute()
{
    // preamble
    // e.g.: naive critical section / semaphore lock
    InternalExecute();
    // post-amble
    // e.g.: naive critical section / semaphore unlock
}
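
As a concrete version of that pre/post-amble, a scoped lock makes the unlock automatic. This sketch assumes C++11’s <mutex> and a std::mutex member named m_mutex, neither of which is part of the original example:

#include <mutex>

void Interface::Execute()
{
    std::lock_guard<std::mutex> lock(m_mutex); // preamble: acquire the lock
    InternalExecute();
}   // post-amble: the lock is released automatically when 'lock' leaves scope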

So, next time you’re designing an interface, consider whether you really need to place your virtual functions in the interface itself. Can you hide the implementation from the outside world? If so, NVI is for you!