SWIG and a Miss

Original Author: Nick Darnell

There are so many pitfalls you’ll encounter after using SWIG for any extended period of time or with a large enough codebase.  I thought I would go over some of the more notable ones I’ve encountered creating a C# wrapper that had me sighing and doubling my caffeine intake.  That said, I do like SWIG.  If you color within the lines it works quite well and has in the end saved me more time than it has cost me.

I’m still debating if it’s worth investing time into my own cleaner, purpose-built wrapper generator.  I’d like to hear some thoughts from others who have had to deal with this decision and what you chose.  If you went with writing your own, did you use any helpful libraries (like for parsing C++)?  Also, feel free to add to this list in the comments if you’ve run into another notable problem with SWIG.

1 ) Consistent Generator Configuration

Using different build configurations for code generation is a bad idea.  Even though your generated code will compile under different build configurations, the #defines you set on the SWIG command line (-DMY_DEFINE) should be consistent across all of them.  Varying the generator configuration can lead to things like stale files (unless you rebuild every time), as well as an inconsistent wrapper interface.

2 ) Automatic Macro Expansion Is Evil

If you have a macro that does different things based upon Debug or Release, SWIG will pick the one matching the -D command line define and expand the code to match it.  It won’t just insert your macro call as it would appear if you were writing the code naturally.  The best way I’ve found to prevent this is to just block off the macros it’s trying to expand:

#if !defined(SWIG)
    #define MYNEW ...
    #define MYDELETE ...
#endif

3 ) Const Correctness

C# has no way of mimicking C++’s const-ness, so objects returned from a C++ function as const Type& are treated as regular pointers, allowing your wrapper’s user to change the underlying data even though it’s supposed to be const.  To avoid this, you should probably remove the function entirely.  Alternatively, you can treat the result as a new object: allocate a new C++ object on the heap and copy the value into it, so that any edits made to it won’t be reflected in the actual object.  But that has many downsides too, like not being able to see changes in the object the reference was for.  You also create a memory leak, since SWIG has no idea it has created a new object, so you need to find a way to tell SWIG you created new memory.  To create the copy you can use the %typemap(out) macro like this,

%typemap(out) const Vector3& %{ $result = new Vector3(*$1); %}

4 ) Return Reference

By default SWIG treats TYPE& as a pointer, which is in some sense good and in others horrible.  It makes perfect sense for a List class to have a function Get(int index) that returns a TYPE& to the object in the list; in a case like that, treating it as a pointer is fantastic.  However, suppose you have the following C++ class,

class Matrix
{
public:
    Matrix();
    Matrix& Identity();
};

Then in C# you do the following,

Matrix wrappedMatrix = new Matrix();
return wrappedMatrix.Identity();

Creating the Matrix from C# sets the flag that tells SWIG it needs to delete the underlying native Matrix instance when this object goes out of scope.

Next we call Identity(), which using the default generated C# code will return a pointer to the same native Matrix instance, but a new managed C# wrapper class will be created to wrap it.  This new C# object has the flag set to NOT clean up the memory, since it’s just holding onto a pointer it wasn’t responsible for creating.

Finally we return the C# object created by the Identity() call, which is a different managed instance than wrappedMatrix, even though both point to the same native object.  When we return from the scope we are in, wrappedMatrix is garbage collected; the Identity()-spawned matrix lives on in managed code, but it now points at a completely bogus location in memory, because wrappedMatrix was responsible for deleting the native object and did so.

So how do you solve this one?  The easiest way is the same as #3: the safe bet is to not wrap those functions at all, and instead provide functions that return void.  Ideally, SWIG would provide some way for me to mark up the class and say,

SWIG_THIS Matrix& Identity(); // SWIG_THIS doesn't actually exist.

SWIG_THIS would indicate that I’m just returning the same object, so instead of creating a new wrapper C# class when you return from the wrapped Identity() function, just return “this”.
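Until something like that exists, a workaround in the spirit of #3 is to hide the reference-returning method and expose a void-returning replacement.  A sketch using the standard %ignore and %extend directives (SetIdentity is a name I made up):

%ignore Matrix::Identity;

%extend Matrix {
    void SetIdentity() { self->Identity(); }  // calls the real method, discards the reference
}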

5 ) Customizable Allocation

There’s no standardized way of changing the allocation mechanism SWIG uses.  It will create/destroy arrays using stock malloc and free, and create/destroy everything else using new/delete.  The way around this is to modify the SWIG library files it uses to generate the default code (the *.swg and *.i files under swig/Lib).  In particular, carrays.i for changing array allocation.  For object creation, you’ll have to modify the scripts for your destination language of choice; for C# that’s swig/Lib/csharp/csharp.swg.

6 ) Alignment Fail

SWIG and aligned data types don’t get along.  Presumably you have your own allocator and have either A) overridden global new/delete, B) created a common base class with new/delete functions, or C) created a macro to use on every class whose new/delete you want to override.  If you have, then object creation is safe, though you may want to replace SWIG’s use of new/delete with your macros anyway, since I’m sure you’d like to track exactly where those allocations came from.  If SWIG creates any arrays for you, you’ll definitely need to replace its use of malloc and free with versions that go through your allocator so that you can handle alignment.

7 ) Nested Types

SWIG doesn’t handle nested data types.  I haven’t found a way around it; you just have to pull them out of the parent class if you want them wrapped.

8 ) Where Are The Out/Ref Parameter Flags

SWIG doesn’t map method parameters that are passed by (non-const) reference as “out” or “ref” parameters in C#.  The way around this is to do the following,

%include "typemaps.i"
 
   
 
  %define %TypeOutParam(TYPE)
 
      %apply TYPE& OUTPUT { TYPE& };
 
  %enddef
 
   
 
  %TypeOutParam(bool)
 
  %TypeOutParam(int)
 
  %TypeOutParam(float)
 
  ...

With that in place, any instance of those types used in a context like void MyMethod(int& x, int& y) will be mapped to void MyMethod(out int x, out int y) in C#.  Alternatively, if I had done the following,

%include "typemaps.i"
 
   
 
  %define %TypeRefParam(TYPE)
 
      %apply TYPE& INOUT { TYPE& };
 
  %enddef
 
   
 
  %TypeRefParam(bool)
 
  %TypeRefParam(int)
 
  %TypeRefParam(float)
 
  ...

The method void MyMethod(int& x, int& y) would be mapped to void MyMethod(ref int x, ref int y) in C#.

9 ) IntPtr == void*

For some unknown reason SWIG does not map void* to anything useful in C#, so I came up with the following to map it to C#’s IntPtr struct.

%typemap(ctype)  void * "void *"
%typemap(imtype) void * "IntPtr"
%typemap(cstype) void * "IntPtr"
%typemap(csin)   void * "$csinput"
%typemap(in)     void * %{ $1 = $input; %}
%typemap(out)    void * %{ $result = $1; %}
%typemap(csout)  void * { return $imcall; }
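With these typemaps in place, a wrapped C++ signature such as void* GetUserData() (a hypothetical example) surfaces in C# as IntPtr GetUserData(), and the pointer value round-trips unchanged.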

10 ) string[] -> char**

By default char** is not mapped to string[] for C#.  Here is how you map it,

%typemap(ctype)  char** "char**"
%typemap(imtype) char** "string[]"
%typemap(cstype) char** "string[]"

%typemap(csin) char** "$csinput"
%typemap(csout, excode=SWIGEXCODE) char**, const char**& {
    string[] ret = $imcall;$excode
    return ret;
  }
%typemap(csvarin, excode=SWIGEXCODE2) char** %{
    set {
      $imcall;$excode
    } %}
%typemap(csvarout, excode=SWIGEXCODE2) char** %{
    get {
      string[] ret = $imcall;$excode
      return ret;
    } %}

%typemap(in)  char** %{ $1 = $input; %}
%typemap(out) char** %{ $result = $1; %}

11 ) Operators

Operators are not wrapped by default.  There is probably a way to map them onto C# operators, but I went another route: you can rename them so that they are wrapped as normal methods.  For example,

%rename(Add) Vector3::operator +;
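With that rename in place, what would have been v1 + v2 in C++ is called as v1.Add(v2) on the C# side.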

12 ) Memory Lifecycle Management

SWIG has an internal flag in generated classes that tells it whether it needs to clean up the memory for the native object when the instance is garbage collected by .NET.  If you have pesky life-cycle management requirements in your system, you’ll need to be able to control this flag on the fly.  The way I’ve found to do this is with the %typemap macro,

%typemap(cscode) YOUR_CLASS
%{
    public bool IsLifecycleManaged
    {
        get { return swigCMemOwn; }
        set { swigCMemOwn = value; }
    }
%}

Now you can control on a per-instance basis whether or not .NET garbage collecting the object will clean it up.

13 ) Callbacks

If you need to do callbacks, use %feature("director").  The documentation explains it in more depth; it wasn’t clear to me at first how to do callbacks until I learned about directors.

14 ) Extension Macro

You can extend/add functionality to the C++ class prior to the wrapper code being generated, using the %extend macro.  The extension code is added as a C function, though, so you might wonder how to access the class instance the function is supposed to be part of.  The code SWIG generates refers to the ‘this’ pointer as the variable ‘self’, so you can access anything public on the class by just doing

self->...
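For example, here is a sketch that bolts a helper method onto a hypothetical Vector3 class with public x, y and z members:

%extend Vector3 {
    float LengthSquared() {
        return self->x * self->x + self->y * self->y + self->z * self->z;
    }
}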

15 ) Unfriendly Template Class Names

SWIG generates unfriendly names for templated classes.  The best way around this is to use the %template macro.  It works like this,

%template(ListFloat) List<float>;
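The instantiation then shows up in C# as a regular class named ListFloat instead of an unusable mangled name.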

16 ) Tracking New Memory

If a function creates a new object and returns it, you need to tell SWIG, otherwise you’ll have a memory leak.  By default SWIG does not destroy any object it wraps, unless it created a temporary object on the heap for a value type returned on the stack.  The way you inform SWIG that a method returns a new object is by doing,

%newobject MyFactory::Create;

17 ) Casting

As it stands I don’t have a good solution to this one, but I thought I should mention it because it has been a real thorn in my side. Suppose you wrap a C++ method that returns a base class like BaseObject*. SWIG will wrap the result in a BaseObject C# class, and all the other SWIG-wrapped classes derived from BaseObject will similarly inherit from the C# BaseObject.

However, attempting to cast your new C# BaseObject to anything other than one of its parent classes will result in an exception, because SWIG has no way of knowing the true type of the BaseObject* returned from the function. The C# object it creates is going to be BaseObject, not the C# wrapper class for the native object’s true type.

SWIG and a Miss @ nickdarnell.com


How Complex is Complex Enough?

Original Author: Luke Dicken

I’ve been doing some very simplistic hacking around in Unity lately – really basic stuff, mostly to build a visualisation system for my research. In the process of doing this, I made a little maze, “borrowed” some assets from a Unity tutorial, and set it up so that you could control one of the characters with the arrow keys. What I’m working towards is using Automated Planning to assume control of the character, and then get into the special sauce that I’ve been working on for the past few years to make it more awesome – but more on that another day. Anyhow I decided to show it to my mom as it was, and she seemed to get a real kick out of it, possibly because it looks a lot shinier than whiteboards full of equations and diagrams.

No wizards or fighting – and no control over what your players are taking away.

The next day I got an email talking about how she really enjoyed killing the wizard at the end of the maze. I raised my eyebrows quite high – the assets I’d taken from the tutorials were a robot and a battery, and the aim was to collect the battery and return to the start, which disappeared as soon as the robot model collided with it (like I said – simplistic).

This is something that we often find in the gamedev world – the player has a tendency to bring their own perspective, bias and, frankly, baggage in with them and that can lead to wacky situations and weird unintended narratives being shoehorned by the player into the game, especially when things are not transparent to them.

Last year at the Paris Game AI Conference, the University of Minnesota’s Baylor Wetzel presented an incredibly thought-provoking session titled “Inside Your Players’ Mind with Playtesting”, in which he laid out some experiments he’d been doing with his students. The setup was a simple turn-based fantasy strategy game with two players. Each had a selection of different units in various quantities, and every turn chose to attack one of their opponent’s units, the aim being to eliminate the opposing army and thus kill the other player. The experiment was run with one human and one AI in each game, and after the game concluded the human player was asked to fill in a survey, evaluating the AI’s difficulty, fun and realism as well as attempting to guess the rules that were driving its decision process.

Wetzel’s game is simple but effectively demonstrates the point.

Wetzel’s evaluation was based on a 1-5 grading system and a sample of only 13 participants, which maybe isn’t as fine-grained as it needs to be for the detailed analysis the results were presented in, so I’ll gloss over them very quickly: the results showed that players’ perceptions of difficulty and realism seemed to correlate, but difficulty and fun did not, nor did realism and fun. What I think is most interesting about the results of the survey isn’t the numerical data, however; it’s the narrative that the participants used to justify the actions of the AI players when guessing at the decision making. One of the best examples was given in the slides:

“The AI has a hatred of dragons and always attacks them first. He’ll attack strong units but when he sees a dragon he can’t help himself and immediately goes for them.”

Of course, it would be perfectly possible to create a logic system that obeyed this kind of narrative device, but in actual fact the behaviour the player had observed was significantly simpler: the AI attacked the unit with the longest name first.

In many ways this ties in with a chat I had recently with Mitu Khandaker of the University of Portsmouth for her “Gambrian Explosion” column over at Game Set Watch, on the subject of player morality systems. The system we discussed was incredibly diverse in breadth, covering the full spectrum of potential moral alignment of the player. In practice, players tended to be either polarised fully light-side or dark-side, so the majority of the diversity was never experienced.

The point is that the nature of the system isn’t important – what really matters is how the player perceives the system, because ultimately that is what drives popularity and sales. You can get away with a dumb system masquerading as something more, provided nobody notices. Equally, you can have a very sophisticated system based on a range of features and advanced algorithms, but if that same behaviour could be approximated by something a lot simpler, it’s possible that people will assume you’ve done the simple thing and not understand the subtlety of your implementation.

It’s a balancing act to juggle how smart to make your algorithms and how much quick hacking you can get away with, but the main thing I’d like you to take away from this post is that sometimes less is more, and sometimes more is less. Play into that and make sure the player sees your work at its best – or better, if possible.

Oh and seriously, don’t show your mom half-baked things you’re working on. It never ends well.


A Kiss is always harder than a Kill

Original Author: Claire Blackshaw

Games let us craft interactive systems in which we can play and explore. We play with limited rule-sets, physics, combat, squad based AI, and a million other complex systems which give us a world to explore. Why then is one of the core human experiences, our social nature, all but missing from games?

[Delete several paragraphs explaining where we are and why – assume audience knowledge]

Jump to Nuts & Bolts discussion.

It all boils down to the fact that, in many ways, the dialogue tree system we are so comfortable with is a local maximum in the evolution of social interactions, and it needs to be discarded to make progress.

Let’s take a brutal approach to a social interaction with a game agent. If we look at the three stages (input, the agent’s internal processing, and output) we can separate them nicely. Currently inputs are stupidly simple; the game-pad offers the emotional range of Morse code, so no wonder the agent has trouble interacting. I’m less concerned here because of Kinect and the brilliant academic research being conducted in this field.

Our outputs, however, are extremely high quality, with amazing voice acting, great visuals and polished delivery in our high-end products. But the methods we use aren’t scalable, and are impractical for dynamic systems. Again, there is great research in text-to-speech, automated lip syncing and similar technologies, so great strides are being made in this area; but it will be the most painful transition to make, as early versions of this technology just can’t compete with the pre-baked, polished deliveries we have at our disposal.

While challenging, I think both of those areas are moving forward at a good pace and will be solved in the near future.

Then we get down to the gritty problem of the agent itself and the formulation of a response. This is something we are getting very good at for, say, tactical combat simulation, where the inputs are better and the outputs rule-governed. The dialogue tree, however, is at the end of the day a static graph (path manipulation through skill rolls or simple scripting doesn’t count), which creates a dumb but effective reflex agent.

How can we grow this agent into a more intelligent system which allows for play and exploration? Can we make an agent who processes and develops an opinion of our actions, rather than simply reacting to them via a cast-iron static reflex system?

Now, while research is going on in this area, it’s super fluffy and long range and does not get much funding, mostly due to past developments in academia (see AI Winter). One of the trickiest parts of an intelligent response is that it requires some degree of knowledge, and knowledge representation is a damn tricky problem in the AI world.

Nuts & Bolts

The key point to the entire problem is that an agent in a game-world, unlike an agent trying to exist in “reality”, lives in a world where we know everything. Using this, we can seed agents with the knowledge they require and a basic relationship network. Through this they can perceive and understand any event within the scope of the game world, as we have provided them with a working knowledge base from which to understand it.

This will allow us to make advances faster than our “reality” counterparts, with quicker results, as we have limited the scope of the problem space. Game developers like to cheat 🙂 . It would also help if, for the short term, we allow ourselves to use clunky or unappealing input or output systems which would not be acceptable in a commercial product 🙁

Now I know this all sounds very fluffy, and I sound like a crazy newbie who has discovered neural nets for the first time and has a naïve grin from ear to ear. But I do acknowledge the scope of this problem, and I know the way forward is solving a lot of smaller problems to construct the tools to solve the larger one. My dissertation, which focussed on building a query/response framework along with the concept of 2nd degree authorship, was the umpteenth step for me but the first solid piece of results.

2nd degree authorship had to be proven because I know that at the end of the day we still want to write and construct narratives, rather than purely random simulated situations. I proved this with a fairly simple rule-based modifier system, one possible solution to procedural authorship of social interactions.

Another requirement is a data model generic enough to be expanded without being reworked. Again I managed to propose a fairly simple atomic model which would work for several game systems.

Finally I looked at building a query/response model which could hold context and remain fairly generic. My research mostly hit dead ends until I considered representing a conversation as a tree, which provides context, and a query as a path through that tree with null nodes. Responses could then be sub-trees. XPath fitted this concept nicely, and I’m quite happy with the flexibility of the resulting system.

The full details up to this point can be read in my dissertation here.

The next step is to build a system in which agents can observe a self-contained play session, discuss it with a player entity or another agent, and have a personality that persists over multiple sessions. I had two possible prototypes planned to explore this concept. This is the thing I was struggling over for the last week.

The detective prototype was appealing due to its lack of what we will call secondary work (work not directly pertaining to the research). In this scenario we would use 2nd degree authorship, aided by some procedural systems, to generate a crime scenario. The player would then interrogate several agents to try to build a picture of the truth. The agents would of course have a partial, personal view of the situation and would be manipulating their responses to remove the appearance of guilt.

This is an appealing prototype firmly built on my previous work, with little need for secondary work. The two key concerns, however, were that it strongly depended on the strength of my generation algorithm, and would therefore possibly reach a dead end or be handicapped if that system were inferior. It also caters to a fairly narrow spectrum of games, without many traditional game systems interacting with the social model.

So the second prototype is a tactical turn-based shooter, a game I’ve been throwing around for the last year or so, in which the world is persistent but the missions are throw-away and players build reputation. The idea is that the agents (in this case infiltrating squad members, or defence agents) can debrief their controller after the mission about its events. These agents would be persistent over the game world, getting hired, fired and killed.

The main point here is that the social information is mostly being scraped from a “live” game session in which one side (the defender) is not actually present. The downtime between missions can then be used to de-brief and casually talk about events. This means the game itself is not reliant on the social system, making it easier to have multiple versions, swap out, or rewrite the social component without altering the game itself.

Also, the persistent nature of the agents means that if I leave a server running long enough, distinct social ecosystems should hopefully evolve.

Well, that’s the plan; I’ll discuss more details in future posts as the project advances. If you have any interest in this field or my research, please do contact me; I’m always looking for people to bounce ideas off, and to share my crazy dream to cut down the dialogue tree and burn its pitiful remains in the fires of progress.

Further ramblings on the matter are on my site, FlammablePenguins.com


The best defense is a good offense

Original Author: Francesco Carucci 

My students might be taking their first careful steps in the dangerous lands of programming, but the problems I’m forcing them to face week after week are nothing short of “difficult”, even in code that is seemingly simple and straightforward.  This week we were working on a sprite engine in C# (XNA), adding some code to select different sprite animations to play, where an animation is defined as an array of images.  The sprite keeps track of the index of the current image in the current animation.  Pretty straightforward…. crash.

Oh, crashes happen, it’s part of the game, and we quickly found the source of the problem:

Texture2D GetFrame(int index)
{
    return frames[index];
}

An out-of-range exception was being thrown when a new animation was selected: the current index is maintained by the Sprite class, and it’s not guaranteed that the new animation contains at least as many frames as the previous one.  GetFrame() can, potentially, receive an index that is out of bounds.

Solution: make sure GetFrame() doesn’t fail if the index is out of bounds.

Something like:

Texture2D GetFrame(int index)
{
    if (index >= frames.Count)
        index = 0;

    return frames[index];
}

This is a very common approach to the problem, but in my opinion it doesn’t actually solve anything; it merely hides the dirt under the carpet of an if guard.  Sure, it “fixes” the bug, but at what cost?  Some added complexity (it’s an if: one more branch, and one more unit test to write in an automated testing environment) that hides away the source of the bug, while the system can still be in an inconsistent state.  We are just trying to keep going in the presence of a blatant fault.

The more stable solution, I explained to the students, is to fix the source of the inconsistency: in this case, either keep one separate index per animation or, more simply, reset the frame index to zero when a new animation is selected.  An animation is guaranteed to always have at least one frame, so the code is safe; it doesn’t need a check in GetFrame() (one less branch) and it won’t crash… ever.

Unless we have an animation with no frames.

This is precisely the behavior I want: if there’s an animation with no frames, I want the application to crash and tell me that it’s in an inconsistent state that I cannot ignore.  If it crashes, I have to fix it.
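To make this concrete, here is a minimal sketch of the offensive version (in C++ rather than the article’s C#, and with hypothetical names):

#include <cassert>
#include <cstddef>
#include <vector>

struct Texture2D;                                     // opaque for this sketch
struct Animation { std::vector<Texture2D*> frames; };

class Sprite
{
public:
    void SetAnimation(const Animation& animation)
    {
        // Offensive: an animation with no frames is a content bug.
        // Stop loudly, right here, instead of limping along.
        assert(!animation.frames.empty());
        m_animation  = &animation;
        m_frameIndex = 0;  // restore the invariant: the index is always valid
    }

    Texture2D* GetFrame() const
    {
        // No range check needed: SetAnimation() guarantees the invariant.
        return m_animation->frames[m_frameIndex];
    }

private:
    const Animation* m_animation = nullptr;
    std::size_t m_frameIndex = 0;
};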

Fail early, fail loud

This simple example shows my students a broader approach to software construction: offensive programming versus defensive programming.  Instead of trying to prevent a crash by writing lots of defensive code that checks for anything that can possibly go wrong (but obfuscates meaningful, useful code in a sea of ifs), I prefer writing less code, but making it so execution stops as early as possible in the presence of inconsistent state, invalid preconditions or broken invariants.

In my experience this approach leads to less and simpler code, which is easier to modify and crashes less.

An obvious drawback of offensive programming might be a live system that is used by clients (content creators?) while it’s being developed: ring a bell?  It’s exactly how we work in the game industry.  Although I’m convinced it produces less downtime in the long run (by producing better and less crash-y code), it’s hard to convince an artist staring at a call-stack that this is a Good Thing™.

I’m again very open to suggestions here: how would you tackle the problem of writing lean, offensive code while keeping the content creators’ experience pleasant in the presence of an (undoubtedly rarer) show stopper?

Some ideas:

  • User friendly crash handler
  • Automated unit, functional and smoke testing before release to minimize crashes
  • Build staging

I played a lot in the past with unit and functional testing, especially in the form of asset testing to make sure that assets are sanitized before entering the system, with very promising results that led to fairly stable daily release cycles; something very much appreciated by content creators. Still that show stopper is hard to sell…


Ready, Set, Allocate! (Part 4)

Original Author: Paul Laska

Links to other parts:

In this part the free method is covered.

 The system used for development and testing is an old laptop running Windows XP Pro Service Pack 3, with a Pentium-M 1.6GHz 32-bit processor, 400MHz FSB, and 512MB of RAM.

 To release memory back to the allocation system, the following steps will be taken:

  1. Retrieve the allocation header.
  2. Determine the pages used to track the memory.
  3. Release the pages back into the memory heap.

 The free function signature looks like so:

void my_free (void* pMem)

pMem is the memory to be released back to the allocation system. The free function can be redirected similarly to malloc, using a define.

#define MYFREE( pMem )	my_free( pMem )

1. Retrieve the allocation header.

Recall, the allocation header was placed just prior to the address returned from malloc.  So to retrieve the allocation size and alignment information, the address pointed to by pMem is decremented by the size of an allocation header; this makes the address point to the beginning of the allocation header, so with a quick cast the allocation header members can be retrieved.

// The tracking header exists just prior to the pMem pointer, step backwards the size of an allocation header to get it.
u32 uSize      = ((TALLOCATION_HEADER*)((u32)(pMem) - sizeof(TALLOCATION_HEADER)))->uSize;
u32 uAlignment = ((TALLOCATION_HEADER*)((u32)(pMem) - sizeof(TALLOCATION_HEADER)))->uAlignment;

The amount of padding used before the header to keep the memory aligned is determined, and used to move the pMem pointer back, so that when the memory is released, the memory added for the allocation header is released as well.

u32 uAllocationHeaderPaddingSize = ((sizeof(TALLOCATION_HEADER) % uAlignment) > 0) ? uAlignment - sizeof(TALLOCATION_HEADER) % uAlignment : 0;

// Align the header, so the beginning address of the memory will be aligned.
u32 uAllocationHeaderSize = sizeof( TALLOCATION_HEADER ) + uAllocationHeaderPaddingSize;

// Move the pointer backwards to include the allocation header.
pMem = (void*)( (u32)( pMem ) - uAllocationHeaderSize );
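For example (sizes illustrative): if sizeof(TALLOCATION_HEADER) is 8 bytes and uAlignment is 16, the padding is 16 - (8 % 16) = 8 bytes, so the header block occupies 16 bytes in total and the user pointer stays 16-byte aligned.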

2. Determine the pages used to track the memory.

The first thing to do here is to figure out the “address”, or offset from the beginning of the memory being tracked.  Then the tracking unit where the allocation starts can be determined using that address.  The offset of the beginning bit/page within the first tracking unit is also calculated from the address.

u32 uAddress = (u32)(pMem) - (u32)(const_cast<u8*>(kpMemory));

u32 uBeginningTrackingUnit = uAddress / kMemPageSize / kMemNumPagesPerUnit;

u32 uPreOffset = uAddress / kMemPageSize % kMemNumPagesPerUnit;

The size information obtained from the allocation header allows the number of pages used to track the allocation to also be calculated.

u32 uNumPagesUsed = uSize / kMemPageSize + ((uSize % kMemPageSize) ? 1 : 0);
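For example, with a hypothetical 4096-byte page size, a 6000-byte allocation gives 6000 / 4096 = 1 whole page plus a non-zero remainder, so uNumPagesUsed is 2.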

3. Release the pages back into the memory heap.

Clearing pages for each tracking unit follows the same basic pattern: a bitmask of the pages used is built, then bitwise inverted and logically ANDed with the tracking unit, with the result stored back into the same tracking unit.

If the entire allocation is contained within one tracking unit, then it is cleared and the function returns.

// Check if the used pages are contained within one tracking unit.
if(uNumPagesUsed < (kMemNumPagesPerUnit - uPreOffset))
{
    // The entire allocation is contained in the one Tracking Unit, build a bit mask and clear the pages all at once.
    aMemPageTrackingBits[uBeginningTrackingUnit] &= ~((kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumPagesUsed)) >> uPreOffset);
    return;
}

Otherwise, the free operation is broken down into three parts:

  1. Free the partially used tracking unit for the beginning of the allocation.
  2. Free any whole tracking units used for the allocation.
  3. Free a partially used tracking unit, if used, for the end of the allocation.
// All pages from the PreOffset to the end of the first Tracking Unit are used, clear them.
aMemPageTrackingBits[uBeginningTrackingUnit] &= ~(kMemTrackingUnitAllPagesInUse >> uPreOffset);

// Determine how many pages are left to clear.
uNumPagesUsed -= (kMemNumPagesPerUnit - uPreOffset);

// Clear whole tracking units.
u32 uNextUnitToClear = (uBeginningTrackingUnit + 1);
for( ; uNextUnitToClear <= (uBeginningTrackingUnit + (uNumPagesUsed / kMemNumPagesPerUnit)); ++uNextUnitToClear)
{
    aMemPageTrackingBits[uNextUnitToClear] &= ~kMemTrackingUnitAllPagesInUse;
}

// Determine how many pages are left to clear.
uNumPagesUsed -= ((uNextUnitToClear - (uBeginningTrackingUnit + 1)) * kMemNumPagesPerUnit);

// Terminate if done; also protects from going past the end of the aMemPageTrackingBits array.
if(uNumPagesUsed == 0 || (uNextUnitToClear >= kMemNumTrackingUnits))
{
    // If this assertion fails, then something is messed up with the tracking information,
    // because it was believed that pages were still in use that must exist past the array
    // of tracking bits.
    assert(uNumPagesUsed == 0);
    return;
}

// Clear the remaining pages.
aMemPageTrackingBits[uNextUnitToClear] &= ~(kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumPagesUsed));

And that’s it, the free method is done.  Next time I’ll cover the simple high-performance timer used for testing.


Monitoring your game

Original Author: Niklas Frykholm

Many bugs are easy to fix with debuggers, stack traces and printf-statements. But some are hard to even see with such tools. I’m thinking of things like frame rate hitches, animation glitches and camera stutters. You can’t put a breakpoint on the glitch because what constitutes a glitch is only defined in relation to what happened in the frame before or what will happen in the next frame. And even if you are able to break exactly when the glitch occurs, you might not be able to tell what is going on from the call stack.

In these situations, some way of monitoring and visualizing your game’s behavior can be invaluable. Indeed, if we graph the delta time for each frame, the hitches stand out clear as day.

Delta-time graph with frame rate drops.

A graph like this opens up many new ways of attacking glitch bugs. You can play the game with the graph displayed and try to see what game actions trigger the glitches. Do they happen when a certain enemy is spawned? When a particular weapon is fired? Another approach is to draw the total frame time together with the time spent in all the different subsystems. This immediately shows you which subsystem is causing the frame rate to spike. You can constrain the problem further by graphing the time spent in narrower and narrower profiler scopes.

Visualization tools like these can help with many other issues as well. Want to find out where a weird camera stutter comes from? Plot the camera position, the position of its look-at target and any other variables that may influence its behavior to pin down the source of the problem. Draw a graph representing your memory fragmentation to find problematic allocations and get an overall feeling for how bad the situation is. Does something look slightly off with the animations? Graph the bone rotations to make sure that you don’t have any vibrations or discontinuities. Graph your network usage to make sure you stay below the bandwidth cap.

Rotation of a bone during a jump animation.

When you study your game in this way, you will most likely learn things that surprise you. Games are highly complex systems built by a large number of people over a long period of time. Like all complex systems, they show emergent behavior. You can be quite certain that at least someone has done something completely unexpected and totally weird. You can’t hope to discover these things using just a bottom-up approach; there is too much code and too much data. Instead you must study your game as if it were an alien organism. Prod it and see how it reacts. Keep the graphs on screen and make sure that they look sane.

There are many different kinds of data that can be interesting and many ways of visualizing them – graphs, bars, charts, etc. But in all cases the pattern is pretty much the same. We have some data that we record from the game and then we have a visualizer that takes this data and draws it in some interesting way. Schematically, we can represent it like this:

Basic monitoring system schematic.

I will refine this picture shortly, but first let’s do a little data-oriented design and ask ourselves how we can best store and process this data.

If you have read any of my earlier blog posts you will know that I’m a fan of big dumb continuous memory buffers and data structures that look like “file formats for memory”. And this approach works perfectly for this problem. We can just store the data as a big block of concatenated structs, where each struct represents some recorded data. We begin each record with an enum specifying the type of recorded event and follow that with a variable sized struct with data for that particular event.

Data buffer layout.

The event types might be things such as ENTER_PROFILER_SCOPE, LEAVE_PROFILER_SCOPE, ALLOCATE_MEMORY, FREE_MEMORY, RECORD_GLOBAL_FLOAT, etc.

RECORD_GLOBAL_FLOAT is the event type used for all kinds of data that we want to draw in graphs. We record the data with calls like these:

record_global_float("application.delta_time", dt);
record_global_float("application.frame_rate", 1.0f / dt);

The corresponding data struct is just:

struct RecordGlobalFloatEvent {
    const char *name;
    float value;
};

Note that there is an interesting little trick being used here. When we record the events, we just record the string pointers, not the complete string data. This saves memory, makes the struct fixed size and gives us faster string compares. This works because record_global_float() is called with static string data that is always at the same address and kept in memory throughout the lifetime of the application. (In the rare case where you want to call record_global_float() with a dynamic string, you must allocate a copy of that string at some permanent location, i.e. do a form of string interning.)
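For the rare dynamic case, an interning helper could look like the sketch below. This is my assumption, not the engine’s actual implementation; note that std::unordered_set never moves its elements, so the returned pointers stay valid for the lifetime of the application.

#include <string>
#include <unordered_set>

// Copies each distinct string once and keeps it alive forever, so the
// returned pointer can be stored in an event just like a pointer to
// static string data. (Not thread-safe; a real version would need a
// lock around the insert.)
const char *intern_string(const char *s)
{
    static std::unordered_set<std::string> strings;
    return strings.insert(s).first->c_str();
}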

Now, let’s refine the picture slightly. There is a problem with recording all data to a single memory buffer and that is multithreading. If all threads record their data to the same memory buffer then we need lots of mutex locking to make sure they don’t step on each other’s toes.

We might also want to add support for some kind of off-line (i.e., not in-game) visualization. Off-line visualizers can take advantage of the full power of your development PC to implement more powerful visualization algorithms. And since they have near unlimited memory, they can record the entire data history so that you can explore it back and forth after the game session has ended.

With these refinements our monitoring system now looks like this:

Advanced monitoring system schematic.

Each thread has a small TLS (thread-local-storage) cache with 64 K or so of debug memory where it records its events. When the cache gets full or we reach the end of the frame, the thread acquires the lock to the global event buffer and flushes its data there.

The active on-line visualizers process the events in the buffer and visualize them. Simultaneously, we send the data over TCP so that it can be processed by any off-line visualizers. In the process we consume the buffer data, and the buffer can then be filled with new data from the threads.

(We allocate all the buffers we use on a special debug heap, so that we separate the allocations which we only do for debugging purposes from the allocations done by the main game.)

Recording float data requires just a few lines of code.

enum {RECORD_GLOBAL_FLOAT_EVENT = 17};
enum {THREAD_BUFFER_SIZE = 64*1024};
__thread char *_thread_buffer;
__thread unsigned _thread_buffer_count;

inline void record_global_float(const char *name, float value)
{
    if (_thread_buffer_count + 12 > THREAD_BUFFER_SIZE)
        flush_thread_buffer();

    char *p = _thread_buffer + _thread_buffer_count;
    *(unsigned *)p = RECORD_GLOBAL_FLOAT_EVENT;
    ((RecordGlobalFloatEvent *)(p + 4))->name = name;
    ((RecordGlobalFloatEvent *)(p + 4))->value = value;
    _thread_buffer_count += 12;
}
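For completeness, here is one possible flush_thread_buffer(), again a sketch rather than the engine’s actual code. It uses the thread-locals declared above; the global buffer, its counter and its mutex are assumed names, and overflow handling is omitted.

#include <cstring>
#include <mutex>

char       _global_buffer[1024 * 1024];  // consumed by the visualizers each frame
unsigned   _global_buffer_count;
std::mutex _global_buffer_mutex;

void flush_thread_buffer()
{
    // Take the shared lock only when a local cache fills up or the frame
    // ends, then hand the locally recorded events to the global buffer.
    std::lock_guard<std::mutex> lock(_global_buffer_mutex);
    std::memcpy(_global_buffer + _global_buffer_count, _thread_buffer, _thread_buffer_count);
    _global_buffer_count += _thread_buffer_count;
    _thread_buffer_count = 0;
}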

When you have the data, writing the graph visualizer is not much work. Just save the data over a couple of frames and plot it using a line drawer.

In the BitSquid engine, we also expose all the data recording functions to Lua scripting. This makes it possible to dynamically create graphs for all kinds of data while the game is running.

As an example of this, a couple of days ago a game programmer suspected that some problematic behavior was caused by a low update frequency in the mouse driver. We quickly bashed out a couple of lines in the game console to produce a graph of the mouse data and could immediately confirm that this indeed was the case:

Core.Debug.add_updator(
    function ()
        Profiler.record_statistics("mouse", Mouse.axis(0))
    end
)

graph make mousegraph
graph add_vector3 mousegraph mouse
graph range mousegraph -20 20

Graph of mouse input showing frames with no input.

(Also posted on the BitSquid blog.)


Getting metadata from images on iOS

Original Author: Gustavo Ambrozio

Recap

My latest post covered adding metadata to images on iOS, and people have already been poking around my repo on GitHub. Cool, I got code stalkers!

One thing was missing from the post though: how to get metadata from existing images. In this post I’ll show you a few methods to do this as well as how to use my NSMutableDictionary category to do this.

Getting images using UIImagePickerController

If you’re getting images from a UIImagePickerController you have to implement this delegate method:

- (void)imagePickerController:(UIImagePickerController *)picker didFinishPickingMediaWithInfo:(NSDictionary *)info

In iOS 4.1 or greater your info dictionary has a key called UIImagePickerControllerReferenceURL (for images from the library) or UIImagePickerControllerMediaMetadata (for images taken from the camera). If your info has the UIImagePickerControllerMediaMetadata key, then you just have to initialize your NSMutableDictionary with the NSDictionary you get from the info dictionary:

NSMutableDictionary *metadata = [[NSMutableDictionary alloc] initWithDictionary:[info objectForKey:UIImagePickerControllerMediaMetadata]];

But if you took an image from the library, things are a little more complicated and not obvious at first sight. All you get is an NSURL object. How do you get the metadata from that?? Using the AssetsLibrary framework, that’s how!

__block NSMutableDictionary *imageMetadata = nil;
NSURL *assetURL = [info objectForKey:UIImagePickerControllerReferenceURL];

ALAssetsLibrary *library = [[ALAssetsLibrary alloc] init];
[library assetForURL:assetURL
    resultBlock:^(ALAsset *asset) {
        NSDictionary *metadata = asset.defaultRepresentation.metadata;
        imageMetadata = [[NSMutableDictionary alloc] initWithDictionary:metadata];
    }
    failureBlock:^(NSError *error) {
    }];
[library autorelease];

One caveat on using this: because it uses blocks, there’s no guarantee that your imageMetadata dictionary will be populated when this code returns. In some testing I’ve done, it sometimes runs the code inside the block even before [library autorelease] is executed, but the first time you run this, the code inside the block will only run on another cycle of the app’s main loop. So, if you need to use this info right away, it’s better to schedule a method on the run queue for later with:

[self performSelectorOnMainThread:SELECTOR withObject:SOME_OBJECT waitUntilDone:NO];

To make things easier, I’ve created an init method to my category:

- (id)initWithInfoFromImagePicker:(NSDictionary *)info;

You just have to import NSMutableDictionary+ImageMetadata.h in your file and then use:

NSMutableDictionary *metadata = [[NSMutableDictionary alloc] initWithInfoFromImagePicker:info];

And you’re done! The category checks for the iOS version and for the correct keys and does everything for you. Just be careful about the issue with blocks I mentioned above.

Reading from the asset library

Well, I kinda spoiled the answer to this one already. If you’re using the AssetsLibrary to read images, you can use the method above, with the same caveat: the metadata might not be accessible until some time after the method is called.

Again I created an init method in my category:

- (id)initFromAssetURL:(NSURL*)assetURL;

Using AVFoundation

iOS 4.0 introduced AVFoundation, which gives us a lot of possibilities for working with pictures and the camera. Before iOS 4, if you wanted to take a picture you had to use a UIImagePickerController. Now you can use AVFoundation and have a lot of control over the camera, the flash, the preview, etc…

If you use AVFoundation to capture photos you’ll probably use AVCaptureStillImageOutput‘s:

- (void)captureStillImageAsynchronouslyFromConnection:(AVCaptureConnection *)connection
                                    completionHandler:(void (^)(CMSampleBufferRef imageDataSampleBuffer, NSError *error))handler

The completion handler gives you a CMSampleBufferRef that has the metadata, but how to get it out of there is not clear from the documentation. It turns out it’s really simple:

CFDictionaryRef metadataDict = CMCopyDictionaryOfAttachments(NULL, imageDataSampleBuffer, kCMAttachmentMode_ShouldPropagate);

Since CFDictionaryRef is toll-free bridged with NSDictionary, the whole process looks like this:

CFDictionaryRef metadataDict = CMCopyDictionaryOfAttachments(NULL, imageDataSampleBuffer, kCMAttachmentMode_ShouldPropagate);
NSMutableDictionary *metadata = [[NSMutableDictionary alloc] initWithDictionary:(NSDictionary*)metadataDict];
CFRelease(metadataDict);

At the risk of repeating myself, I again created an init method for this:

- (id)initWithImageSampleBuffer:(CMSampleBufferRef) imageDataSampleBuffer;

Wrapping up

So, there you have it, now you can read and write metadata.

What’s still missing are some methods to easily extract information from this dictionary. I have already created another method to extract the CLLocation information from it. As I now have a way to get and set this information, I even converted it to a @property on the category, giving our NSMutableDictionary a nice way to access the location using dot notation.

It’s very easy to add getter methods for every property but I have not done so yet. Feel free to fork my repo on GitHub and send pull requests for me to incorporate.

I also added another method to add the image’s digital zoom as the next update of Snap will have digital zoom and I’m writing this information to the pictures as well.

Oh, and have I mentioned that you should get Snap!

I’ll be writing here every 15 days, but I try to post at least once a week on my own blog, so visit my blog and subscribe.


What Does Your Game Believe In?

Original Author: Mike Jungbluth

What you believe in is what makes you who you are. It is what informs you of your opinions and dictates much of what you perceive as real. The games we make are no different. From the characters that we control, the world they live in, and how the player interacts with each, if the core beliefs are consistent and persistent, that will be felt on an incredibly deep level. In fact, you could even call it the heart and soul of a game. That sort of special x-factor that helps to make a game feel more alive than even a bigger budget game sitting next to it on the shelf.

But like having beliefs in real life, it is a double-edged sword. As soon as those beliefs are called into question, your entire reality can become questionable. The deeper or more core to the person or world the belief, the further everything can come crashing down the moment it is betrayed.

But, when we are mindful of those beliefs, we can not only plan on where to implement them, but also use them to inform us of priorities in everything from game mechanics to which assets require the most polish.

The easiest place to establish a belief system in your game is with your characters. It can be as simple as a black and white moral code (good or evil, life or death) or something more gray (political beliefs or spirituality). In either case, just mentioning them on a base level can make the character infinitely more identifiable to the player. But it is when you can weave them into the actions and appearance of the character that you will have the most success and payoff.

Left 4 Dead does a great job of having character designs that instantly inform the player of where the main characters fit into society before the zombie outbreak. And from that, we can make assumptions about what they believed in, compare that to our own personal beliefs, and find the one that matches us best. That makes the horror of an outbreak feel all the more real, because not only are we confronting the monsters in the game, but also our own internal fears about what we would do in that situation.

After appearance, motion is the second step towards visually identifying someone’s beliefs. Getting into those beliefs of a character and how it drives their actions is what animation is all about. The idea of giving truth in the motion means more than just its physical believability. It is what is emotionally driving that physical movement that truly defines the character and makes them believable. And those beliefs will help drive the player forward as well.

Knowing that Kratos enjoys dismantling his foes and feels no remorse reinforces the brutal animations and gameplay of cutting through waves of enemies. As the player you relish the chance for a large mob to appear that you can dissect in the cruelest manners possible. Contrast this with Nathan Drake, who values life in cut scenes, yet mows down enemy after enemy in game with little care. This disconnect can betray the sincerity of those beliefs and even be comical when honestly examined. The same can be said of John Marston, who complains about doing the right thing and being his own man in cut scenes, yet without hesitation guns down the rebels you have been helping because someone from the other side tells you to. Very quickly his motives and actions turn hollow, and many people feel that when playing through that portion of the game.

All three are recognizable and memorable characters, but only one allows the player to practice what he preaches.

For games where the avatar is meant to be more of a blank slate, that allows the player to inject their own beliefs or choices, it becomes all the more important for the supporting characters of the world to have strongly held beliefs. Mass Effect 2 does this brilliantly, and many of the most emotionally compelling sequences stem from this. Be it Thane’s deep religious values that conflict with his profession, Mordin’s almost absolute belief in knowledge and science over emotion, or Jack’s inability to believe in anyone except herself, those characters are strong because it feels that their convictions come from an honest place. When those beliefs are challenged by either the player or the world is when they become truly alive.

A big part of establishing this belief system and maintaining it is through an exhaustive character bible. Beyond just model sheets and reference for movements, really think about what drives the character forward. What has led them to the point they are at when the game starts, and where, in their world, do they draw the line as to what they believe in? Do their beliefs change or grow as the game progresses?

Do they mind getting their hands dirty or are they reluctant to do so? Both can allow for the same overall gameplay and creation of assets, but being aware of what they believe can make what happens before, during and after all the more meaningful when the animations or dialog matches those beliefs. This goes for not only the character, but the player. In fact, going a step further, this is how we can even begin to color the player’s beliefs, and make them question their own values versus those of the characters in the game.

Approaching morality and beliefs in this manner requires a level of intellectual investment and involvement by the player that can exceed the binary wheel of good or evil many games use. It also removes the need for an in game win/lose state which can blunt the moral impact for the player, when they care more about the reward or win than the emotion or morality of the decision. Giving the player choice in the outcome of their experience is what sets us apart from other forms of media. But when you force them to not just watch, but play in the life of a character they morally oppose or disagree with, the effect can be incredibly moving.

You can also take those beliefs of a character and inject them into the gameplay, beyond just moral decisions. Valkyria Chronicles goes so far as having a character’s beliefs actually affect their abilities. Each character has a set of special abilities that give them different buffs in battle. Oftentimes they are directly tied to their upbringing, sexual orientation or racial beliefs. So for example, Dallas Wyatt, a female engineer, is attracted to women and dislikes men. To convey in game that she has a strong attraction to girls (one in particular), pairing her with female soldiers will cause her to start fighting better in order to impress them. Conversely, she is given a fighting penalty in-game, where her stats will go down when near male squadmates.  Other characters have a strong bias against other factions or races, which can directly impact how useful they are in battle when around those groups. It adds an emotional complexity to the battles that fosters actual bonds between the player and their preferred team members. Add in the perma-death feature, and very quickly, as a player, you begin to take each battle and every move of your troops personally. I’ve talked to people who have played the game who actually forsook using a higher-tiered character because that character’s beliefs went against their own. They actually chose to make the game harder for themselves to stand firm in their own beliefs. That is the power of merging beliefs into game systems.

Once the beliefs of the main characters have been established, showing how the world reacts to those beliefs is a natural way to show how strong their convictions are and to what lengths they will go to fit in with the world. One of the main moments that sticks out to me from Beyond Good & Evil is when, after fighting to expose the conspirators of the world, I go back to the city hub that I had traveled through many times before. But this time the town was rioting, and tearing down the propaganda they had bought into up to that point. Seeing my actions translate to the world resonated with me in a way I hadn’t really experienced before. My actions and Jade’s beliefs made that world re-evaluate theirs. It is the type of thing everyone dreams about: making a difference and having everyone understand what you believe in. It isn’t something many people get a chance to feel, on that level, in real life. It would be a shame to squander that opportunity when it is possible for everyone to feel that in games.

L.A. Noire reinforces the beliefs of Cole Phelps on multiple levels, from art to design. The little touch of him stepping over the body at a crime scene helps to ingrain his level of respect for the job in a much deeper way for me than anything he says in a cut scene. Having the amount of damage I cause to the city, pedestrians or police car tallied at the end of each case and my partner yelling at me every time I hit something reminds me of the laws I am sworn to uphold more than any of the monologues being spouted. So much so that for the first time in an open world game, I actually paid attention to the stop lights and obeyed traffic laws. And if I was going to disobey them, I used my siren. The game required neither of me, but because it had honestly established what the characters believed in, I wanted to carry that narrative forward through my play style.

I contrast this with my experience in Grand Theft Auto IV. When Niko starts the game reluctant to kill anyone, I instantly felt drawn to his story. His beliefs were established up front, and I really appreciated the effort. When he quickly falls back into the crime lifestyle thanks to his cousin and is forced to kill a man, I was excited when he cursed his cousin’s name after it happened. I couldn’t wait for him to verbally cut loose on his cousin, if not physically cut him, for it. Yet when Niko got into the car and sat next to his cousin, all he said was, “It’s done.” I felt so betrayed that I took the disc out.

That is how powerful we as humans are in our beliefs and how attached we can become to those of the characters in our games. As with any belief system, if it isn’t respected, it is going to cause a person to lash out. Having your own beliefs and knowing those of others is a special gift. Let’s make sure, as game developers, we are not only aware of our characters’ beliefs but that we also take note of what the players’ may be. Neither is easy to tap into, and if done half-assed it can cause more harm than good. But when done right, it can be felt in a way that everyone involved in the experience will deeply appreciate.

Define the beliefs of your game. Identify the parts of your game that speak to those beliefs, and shine a light on them. Root out the moments that betray them, and remove or change those moments to match. Doing so will not only give you a chance to say something; it will allow those playing to honestly listen.


Instruction Level Parallelism

Original Author: Luke Hutchinson

I remember a long time ago reading Michael Abrash’s Ramblings in Real Time articles in DDJ, 15 years back when they were working on Quake, where he explained how they made use of the expensive floating-point divide on the original Pentium (October 1996).

Here is the code again. This time each line has been commented with a clock cycle and a letter; the letters are simply there to make referring to individual lines easier in the explanation below.

 
  inline vec_float4 mat4_mul_vec4( const mat4 & m, vec_float4 v ){
 
    vec_float4 zero;
 
    vec_float4 ret;
 
    vec_float4 x, y, z, w;
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -1       [A]
 
    x = vec_vspltw( v, 0 );                   //  0       [B]
 
    y = vec_vspltw( v, 1 );                   //  1       [C]
 
    z = vec_vspltw( v, 2 );                   //  2       [D]
 
    w = vec_vspltw( v, 3 );                   //  3       [E]
 
    ret = vec_vmaddfp( m.col0, x, zero );     //  4       [F]
 
    ret = vec_vmaddfp( m.col1, y, ret );      // 16       [G]
 
    ret = vec_vmaddfp( m.col2, z, ret );      // 28       [H]
 
    ret = vec_vmaddfp( m.col3, w, ret );      // 40       [I]
 
    return ret;                               // 52       [J]
 
  }
 
  

WTF does cycle -1 mean? Have we built a time machine? No, but I’ll get to that in a minute.

On cycles 0-3, we execute the vec_vspltws [B]-[E]. On cycle 4, the results of instructions [A] and [B] are ready, so we can use them as inputs to the vec_vmaddfp [F]. [F] has a 12 cycle latency, so the next vec_vmaddfp at [G] is blocked until cycle 16, and so on: [H] can’t start until cycle 28, [I] until cycle 40, and the final result is ready at 40 + 12 = 52.

So what’s the deal with the magic -1? [A] does not depend on either input argument m or v, so depending on how mat4_mul_vec4() is used, it may not matter at all. If, as is often the case, m and/or v are calculated just before calling mat4_mul_vec4(), then [A] can generally be slotted in somewhere for free.

For example,

 
  inline vec_float foo( mat4 m, vec_float4 v0, vec_float4 v1 ){
 
    return mat4_mul_vec4( m, vec_vaddfp( v0, v1 ) );
 
  }
 
  

Here the vec_vspltisw can be scheduled in the latency window between the vec_vaddfp and the first vec_vspltw.

So how do we do better than the first implementation?

 
  inline vec_float4 mat4_mul_vec4( const mat4 & m, vec_float4 v ){
 
    vec_float4 zero;
 
    vec_float4 ret, t0, t1;
 
    vec_float4 x, y, z, w;
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -1       [A]
 
    x = vec_vspltw( v, 0 );                   //  0       [B]
 
    z = vec_vspltw( v, 2 );                   //  1       [C]
 
    y = vec_vspltw( v, 1 );                   //  2       [D]
 
    w = vec_vspltw( v, 3 );                   //  3       [E]
 
    t0 = vec_vmaddfp( m.col0, x, zero );      //  4       [F]
 
    t1 = vec_vmaddfp( m.col2, z, zero );      //  5       [G]
 
    t0 = vec_vmaddfp( m.col1, y, t0 );        //  16      [H]
 
    t1 = vec_vmaddfp( m.col3, w, t1 );        //  17      [I]
 
    ret = vec_vaddfp( t0, t1 );               //  29      [J]
 
    return ret;                               //  41      [K]
 
  }
 
  

We’ve got more instructions here: there’s an extra vec_vaddfp. If you were just looking at the intrinsics and considering them as little functions in your mental model, this implementation would look like crack smoking. Yet at 41+1 cycles versus 52+1 cycles, this second implementation is more than 20% faster.

There is an underlying pattern to what we did here to optimize for latency: reducing dependency chains. This is really just the same as any other scheduling problem outside of computer science, be it project management, manufacturing production lines, etc.
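 
To see the same pattern outside of SIMD intrinsics, here’s a minimal scalar sketch (hypothetical code, not part of the tests below): summing an array with a single accumulator forms one long chain of dependent adds, while splitting the work across two accumulators roughly halves the critical path (at the cost of reassociating the floating-point adds, which can change the result slightly).

  // One accumulator: every add waits on the previous one, so the critical
  // path is n dependent adds.
  float sum_serial( const float *a, int n ){
    float s = 0.f;
    for ( int i=0; i<n; ++i )
      s += a[i];
    return s;
  }
 
  // Two independent accumulators: the adds in each iteration can overlap,
  // so the critical path is roughly halved (n assumed even here).
  float sum_pairs( const float *a, int n ){
    float s0 = 0.f, s1 = 0.f;
    for ( int i=0; i<n; i+=2 ){
      s0 += a[i];
      s1 += a[i+1];
    }
    return s0 + s1;
  }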

Let’s look at the two versions again, this time with the instruction dependencies shown. The ASCII art here isn’t brilliant, but if you look at it for a second it hopefully makes sense. Where one intrinsic depends on the result of a previous one, we’ve depicted this with an arc ending in an ‘o‘. The “critical path” is the longest dependency chain; it has been highlighted.

 
  inline vec_float4 mat4_mul_vec4(
 
   const mat4 & m,                            // -.
 
   vec_float4 v ){                            // ===+
 
                                              //  . |
 
    vec_float4 zero;                          //  . |
 
    vec_float4 ret;                           //  . |
 
    vec_float4 x, y, z, w;                    //  . |
 
                                              //  . |
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -.-|-.
 
                                              //  . | .
 
    x = vec_vspltw( v, 0 );                   // ===o===+
 
                                              //  . . . |
 
    y = vec_vspltw( v, 1 );                   // -.-o-.-|-.
 
                                              //  . . . | .
 
    z = vec_vspltw( v, 2 );                   // -.-o-.-|-.-.
 
                                              //  . . . | . .
 
    w = vec_vspltw( v, 3 );                   // -.-o-.-|-.-.-.
 
                                              //  .   . | . . .
 
    ret = vec_vmaddfp( m.col0, x, zero );     // =o===o=o========+
 
                                              //  .       . . .  |
 
    ret = vec_vmaddfp( m.col1, y, ret );      // =o=======o======o=+
 
                                              //  .         . .    |
 
    ret = vec_vmaddfp( m.col2, z, ret );      // =o=========o======o=+
 
                                              //  .           .      |
 
    ret = vec_vmaddfp( m.col3, w, ret );      // =o===========o======o=+
 
                                              //                       |
 
    return ret;                               // ======================o
 
  }
 
  

In the original version, the critical path is

 
    v
 
    x = vec_vspltw( v, 0 );
 
    ret = vec_vmaddfp( m.col0, x, zero );
 
    ret = vec_vmaddfp( m.col1, y, ret );
 
    ret = vec_vmaddfp( m.col2, z, ret );
 
    ret = vec_vmaddfp( m.col3, w, ret );
 
    return ret;
 
  

And the optimized version,

 
  inline vec_float4 mat4_mul_vec4(
 
   const mat4 & m,                            // -.
 
   vec_float4 v ){                            // ===+
 
                                              //  . |
 
    vec_float4 zero;                          //  . |
 
    vec_float4 ret, t0, t1;                   //  . |
 
    vec_float4 x, y, z, w;                    //  . |
 
                                              //  . |
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -.-|-.
 
                                              //  . | .
 
    x = vec_vspltw( v, 0 );                   // -.-o-.-.
 
                                              //  . | . .
 
    z = vec_vspltw( v, 2 );                   // ===o=====+
 
                                              //  . . . . |
 
    y = vec_vspltw( v, 1 );                   // -.-o-.-.-|-.
 
                                              //  . . . . | .
 
    w = vec_vspltw( v, 3 );                   // -.-o-.-.-|-.-.
 
                                              //  .   . . | . .
 
    t0 = vec_vmaddfp( m.col0, x, zero );      // -o---o-o-|-.-.-.
 
                                              //  .   .   | . . .
 
    t1 = vec_vmaddfp( m.col2, z, zero );      // =o===o===o=======+
 
                                              //  .         . . . |
 
    t0 = vec_vmaddfp( m.col1, y, t0 );        // -o---------o-.-.-|-.
 
                                              //  .           . . | .
 
    t1 = vec_vmaddfp( m.col3, w, t1 );        // =o===========o===o===+
 
                                              //                    . |
 
    ret = vec_vaddfp( t0, t1 );               // ===================o=o=+
 
                                              //                        |
 
    return ret;                               // =======================o
 
  }
 
  

The critical path,

 
    v
 
    z = vec_vspltw( v, 2 );
 
    t1 = vec_vmaddfp( m.col2, z, zero );
 
    t1 = vec_vmaddfp( m.col3, w, t1 );
 
    ret = vec_vaddfp( t0, t1 );
 
    return ret;
 
  

Comparing the two versions this way makes the reason for the speed up clear.

Obviously commenting the dependency chains like this in your code is a bit unwieldy, but I find visualizing them this way very helpful.

Earlier I said that we had just used PPC as an example, and that this even applies to SSE. Out-of-order execution does throw an extra complication into the mix, but at the end of the day the CPU isn’t doing anything magic: it can never run faster than the longest dependency chain. But assumptions are dangerous, so to make sure we haven’t gone all theoretical and missed some important practical issues, let’s check this.

These tests are run on a Core i7 Nehalem running Ubuntu GNU/Linux, and on a PS3 running Yellow Dog GNU/Linux. Each test displays the clock cycle count for the original ‘slower’ version and the new ‘faster’ version, plus the percentage speed up the faster version achieved. For the PPU and SPU versions, where we have predicted what the speed up should be, the measured results match up very closely (the SPU results are predicted in comments in the code). Given the myriad of different SSE implementations, attempting to predict the exact speed up is less useful, but the end result is still a good one. The output of the tests is

 
  sse
 
  slower 0x000538f4
 
  faster 0x000482a8
 
  speed up 13.635%
 
  
 
  ppu
 
  slower 0x000032c7
 
  faster 0x0000280a
 
  speed up 21.148%
 
  
 
  spu
 
  slower 0x00001d4e
 
  faster 0x0000186c
 
  speed up 16.662%
 
  

main.c++

 
  #include <stdio.h>
 
  #include <string.h>
 
  
 
  #if defined ( __PPU__ )
 
  # include <altivec.h>
 
  # include <vec_types.h>
 
    typedef vec_float4 float4;
 
  #elif defined ( __SPU__ )
 
  # include <spu_intrinsics.h>
 
    typedef qword float4;
 
  #else
 
  # include <xmmintrin.h>
 
    typedef __m128 float4;
 
  #endif
 
  
 
  extern float4 latency( const float *m4data, const float *v4data );
 
  
 
  int main( int argc, char **argv ){
 
    static const float m4data[16] __attribute__((aligned(16)))={
 
      1.f, 0.f, 0.f, 0.f,
 
      0.f, 1.f, 0.f, 0.f,
 
      0.f, 0.f, 1.f, 0.f,
 
      0.f, 0.f, 0.f, 1.f,
 
    };
 
    static const float v4data[4] __attribute__((aligned(16)))={
 
      2.f, 3.f, 4.f, 5.f,
 
    };
 
    latency( m4data, v4data );
 
    return 0;
 
  }
 
  

latency_sse.c++

 
  #include <stdint.h>
 
  #include <stdio.h>
 
  #include <xmmintrin.h>
 
  
 
  struct mat4{
 
    __m128 col0, col1, col2, col3;
 
  };
 
  
 
  // Slower version where all addps' are dependent on each other
 
  static inline __m128 mat4_mul_vec4_slower( const mat4 & m, __m128 v ){
 
    // Notice we are being careful here with temporary variables to make sure that
 
    // the destination and one source for each intrinsic is the same.  Most SSE
 
    // (but not AVX) instructions modify one of the source operands.  By
 
    // explicitly copying v up front, GCC ((Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5)
 
    // generates slightly better code.
 
    __m128 t0 = v;
 
    __m128 t1 = v;
 
    __m128 t2 = v;
 
    __m128 t3 = v;
 
    t0 = _mm_shuffle_ps( t0, t0, 0xff );
 
    t1 = _mm_shuffle_ps( t1, t1, 0xaa );
 
    t2 = _mm_shuffle_ps( t2, t2, 0x55 );
 
    t3 = _mm_shuffle_ps( t3, t3, 0x00 );
 
    t0 = _mm_mul_ps( t0, m.col0 );
 
    t1 = _mm_mul_ps( t1, m.col1 );
 
    t2 = _mm_mul_ps( t2, m.col2 );
 
    t3 = _mm_mul_ps( t3, m.col3 );
 
    t0 = _mm_add_ps( t0, t1 );
 
    t0 = _mm_add_ps( t0, t2 );
 
    t0 = _mm_add_ps( t0, t3 );
 
    return t0;
 
  }
 
  
 
  // Faster version with two addps' in parallel
 
  static inline __m128 mat4_mul_vec4_faster( const mat4 & m, __m128 v ){
 
    __m128 t0 = v;
 
    __m128 t1 = v;
 
    __m128 t2 = v;
 
    __m128 t3 = v;
 
    t0 = _mm_shuffle_ps( t0, t0, 0xff );
 
    t1 = _mm_shuffle_ps( t1, t1, 0xaa );
 
    t2 = _mm_shuffle_ps( t2, t2, 0x55 );
 
    t3 = _mm_shuffle_ps( t3, t3, 0x00 );
 
    t0 = _mm_mul_ps( t0, m.col0 );
 
    t1 = _mm_mul_ps( t1, m.col1 );
 
    t2 = _mm_mul_ps( t2, m.col2 );
 
    t3 = _mm_mul_ps( t3, m.col3 );
 
    t0 = _mm_add_ps( t0, t1 );
 
    t2 = _mm_add_ps( t2, t3 );
 
    t0 = _mm_add_ps( t0, t2 );
 
    return t0;
 
  }
 
  
 
  static inline uint64_t clock(){
 
    uint32_t lo, hi, pid;
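 
    // Unlike plain rdtsc, rdtscp waits for all preceding instructions to
    // finish before reading the time stamp counter; it also writes the
    // processor id into ecx, which we capture in pid so the compiler knows
    // ecx is clobbered.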
 
    asm volatile( "rdtscp" : "=a"(lo), "=d"(hi), "=c"(pid) );
 
    return ((uint64_t)hi<<32)|lo;
 
  }
 
  
 
  __m128 latency( const float *m4data, const float *v4data ){
 
  
 
    // Load inputs
 
    mat4 m4={
 
      _mm_load_ps( m4data + 0 ),
 
      _mm_load_ps( m4data + 4 ),
 
      _mm_load_ps( m4data + 8 ),
 
      _mm_load_ps( m4data + 12 )
 
    };
 
    __m128 v4 = _mm_load_ps( v4data );
 
  
 
    enum { NUM_LOOPS=10000 };
 
  
 
    // Profile slower code
 
    uint64_t t0;
 
    {
 
      // Call once before we read the time stamp counter, to get rid of cache
 
      // effects
 
      v4 = mat4_mul_vec4_slower( m4, v4 );
 
      t0 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_slower( m4, v4 );
 
      }
 
      t0 = clock() - t0;
 
    }
 
  
 
    // Profile faster code
 
    uint64_t t1;
 
    {
 
      v4 = mat4_mul_vec4_faster( m4, v4 );
 
      t1 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_faster( m4, v4 );
 
      }
 
      t1 = clock() - t1;
 
    }
 
  
 
    printf( "slower 0x%08xnfaster 0x%08xnspeed up %.3f%%n",
 
     (unsigned)t0, (unsigned)t1, 100.f*(float)((int64_t)(t0-t1))/(float)t0 );
 
  
 
    // Return final result so the calculations cannot be optimized out
 
    return v4;
 
  }
 
  

latency_ppu.c++

 
  #include <altivec.h>
 
  #include <ppu_intrinsics.h>
 
  #include <stdint.h>
 
  #include <stdio.h>
 
  #include <vec_types.h>
 
  
 
  struct mat4{
 
    vec_float4 col0, col1, col2, col3;
 
  };
 
  
 
  // Slower version where all vmaddfp's are dependent on each other.  Note that
 
  // passing m by reference here is actually important.  ppu-g++ (4.1.1 from
 
  // Barcelona Supercomputing Center) generates significantly worse code when
 
  // passed by value.
 
  static inline vec_float4 mat4_mul_vec4_slower( const mat4 & m, vec_float4 v ){
 
    vec_float4 zero;
 
    vec_float4 ret;
 
    vec_float4 x, y, z, w;
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -1
 
    x = vec_vspltw( v, 0 );                   //  0
 
    y = vec_vspltw( v, 1 );                   //  1
 
    z = vec_vspltw( v, 2 );                   //  2
 
    w = vec_vspltw( v, 3 );                   //  3
 
    ret = vec_vmaddfp( m.col0, x, zero );     //  4
 
    ret = vec_vmaddfp( m.col1, y, ret );      //  16
 
    ret = vec_vmaddfp( m.col2, z, ret );      //  28
 
    ret = vec_vmaddfp( m.col3, w, ret );      //  40
 
    return ret;                               //  52
 
  }
 
  
 
  // Faster version with two vmaddfp's in parallel
 
  static inline vec_float4 mat4_mul_vec4_faster( const mat4 & m, vec_float4 v ){
 
    vec_float4 zero;
 
    vec_float4 ret, t0, t1;
 
    vec_float4 x, y, z, w;
 
    zero = (vec_float4) vec_vspltisw( 0 );    // -1
 
    x = vec_vspltw( v, 0 );                   //  0
 
    z = vec_vspltw( v, 2 );                   //  1
 
    y = vec_vspltw( v, 1 );                   //  2
 
    w = vec_vspltw( v, 3 );                   //  3
 
    t0 = vec_vmaddfp( m.col0, x, zero );      //  4
 
    t1 = vec_vmaddfp( m.col2, z, zero );      //  5
 
    t0 = vec_vmaddfp( m.col1, y, t0 );        //  16
 
    t1 = vec_vmaddfp( m.col3, w, t1 );        //  17
 
    ret = vec_vaddfp( t0, t1 );               //  29
 
    return ret;                               //  41
 
  }
 
  
 
  static inline uint32_t clock(){
 
    // As a workaround for a bug in the CellBE (where overflow from least
 
    // significant word to most significant word is not atomic), we simply discard
 
    // the most significant word.  Our test is quick enough that we don't need to
 
    // worry about the least significant word overflowing more than once.
 
    register uint64_t tb;
 
    asm volatile( "mftb  %0nt" : "=r"(tb) );
 
    return (uint32_t) tb;
 
  }
 
  
 
  vec_float4 latency( const float *m4data, const float *v4data ){
 
  
 
    // Load inputs
 
    mat4 m4={
 
      vec_lvxl( 0,  m4data ),
 
      vec_lvxl( 16, m4data ),
 
      vec_lvxl( 32, m4data ),
 
      vec_lvxl( 48, m4data ),
 
    };
 
    vec_float4 v4 = vec_lvxl( 0, v4data );
 
  
 
    enum { NUM_LOOPS=10000 };
 
  
 
    // Profile slower code
 
    uint32_t t0;
 
    {
 
      // Call once before we read the time stamp counter, to get rid of cache
 
      // effects
 
      v4 = mat4_mul_vec4_slower( m4, v4 );
 
      t0 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_slower( m4, v4 );
 
      }
 
      t0 = clock() - t0;
 
    }
 
  
 
    // Profile faster code
 
    uint32_t t1;
 
    {
 
      v4 = mat4_mul_vec4_faster( m4, v4 );
 
      t1 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_faster( m4, v4 );
 
      }
 
      t1 = clock() - t1;
 
    }
 
  
 
    printf( "slower 0x%08xnfaster 0x%08xnspeed up %.3f%%n",
 
     (unsigned)t0, (unsigned)t1, 100.f*(float)((int32_t)(t0-t1))/(float)t0 );
 
  
 
    // Return final result so the calculations cannot be optimized out
 
    return v4;
 
  }
 
  

latency_spu.c++

 
  #include <spu_intrinsics.h>
 
  #include <stdint.h>
 
  #include <stdio.h>
 
  
 
  struct mat4{
 
    qword col0, col1, col2, col3;
 
  };
 
  
 
  // Slower version where all fm/fma's are dependent on each other
 
  static inline qword mat4_mul_vec4_slower( const mat4 & m, qword v ){
 
    qword shufAAAA, shufBBBB, shufCCCC, shufDDDD;
 
    qword ret;
 
    qword x, y, z, w;
 
    shufAAAA = si_ila( 0x10203 );             // -5
 
    shufBBBB = si_orbi( shufAAAA, 4 );        // -3
 
    shufCCCC = si_orbi( shufAAAA, 8 );        // -2
 
    shufDDDD = si_orbi( shufAAAA, 12 );       // -1
 
    x = si_shufb( v, v, shufAAAA );           //  0
 
    y = si_shufb( v, v, shufBBBB );           //  1
 
    z = si_shufb( v, v, shufCCCC );           //  2
 
    w = si_shufb( v, v, shufDDDD );           //  3
 
    ret = si_fm ( m.col0, x );                //  4
 
    ret = si_fma( m.col1, y, ret );           //  10
 
    ret = si_fma( m.col2, z, ret );           //  16
 
    ret = si_fma( m.col3, w, ret );           //  22
 
    return ret;                               //  28
 
  }
 
  
 
  // Faster version with two fm's and two fma's in parallel
 
  static inline qword mat4_mul_vec4_faster( const mat4 & m, qword v ){
 
    qword shufAAAA, shufBBBB, shufCCCC, shufDDDD;
 
    qword ret;
 
    qword x, y, z, w;
 
    qword xy, zw;
 
    shufAAAA = si_ila( 0x10203 );             // -5
 
    shufCCCC = si_orbi( shufAAAA, 8 );        // -3
 
    shufBBBB = si_orbi( shufAAAA, 4 );        // -2
 
    shufDDDD = si_orbi( shufAAAA, 12 );       // -1
 
    x = si_shufb( v, v, shufAAAA );           //  0
 
    z = si_shufb( v, v, shufCCCC );           //  1
 
    y = si_shufb( v, v, shufBBBB );           //  2
 
    w = si_shufb( v, v, shufDDDD );           //  3
 
    xy = si_fm( m.col0, x );                  //  4
 
    zw = si_fm( m.col2, z );                  //  5
 
    xy = si_fma( m.col1, y, xy );             // 10
 
    zw = si_fma( m.col3, w, zw );             // 11
 
    ret = si_fa( xy, zw );                    // 17
 
    return ret;                               // 23
 
  }
 
  
 
  static inline uint32_t clock(){
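 
    // The SPU decrementer counts down, so negating the value read gives a
    // counter that increases with time, matching the other platforms.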
 
    return - spu_readch( SPU_RdDec );
 
  }
 
  
 
  qword latency( const float *m4data, const float *v4data ){
 
  
 
    // Load inputs
 
    mat4 m4={
 
      si_lqd( si_from_ptr(m4data), 0  ),
 
      si_lqd( si_from_ptr(m4data), 16 ),
 
      si_lqd( si_from_ptr(m4data), 32 ),
 
      si_lqd( si_from_ptr(m4data), 48 )
 
    };
 
    qword v4 = si_lqd( si_from_ptr(v4data), 0 );
 
  
 
    enum { NUM_LOOPS=10000 };
 
  
 
    // Initialize decrementer to maximum value
 
    spu_writech( SPU_WrDec, 0xffffffff );
 
  
 
    // Profile slower code
 
    uint32_t t0;
 
    {
 
      t0 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_slower( m4, v4 );
 
      }
 
      t0 = clock() - t0;
 
    }
 
  
 
    // Profile faster code
 
    uint32_t t1;
 
    {
 
      t1 = clock();
 
      for ( unsigned i=0; i<NUM_LOOPS; ++i ){
 
        v4 = mat4_mul_vec4_faster( m4, v4 );
 
      }
 
      t1 = clock() - t1;
 
    }
 
  
 
    printf( "slower 0x%08xnfaster 0x%08xnspeed up %.3f%%n",
 
     (unsigned)t0, (unsigned)t1, 100.f*(float)((int32_t)(t0-t1))/(float)t0 );
 
  
 
    // Return final result so the calculations cannot be optimized out
 
    return v4;
 
  }
 
  

build.sh

 
  #! /bin/sh
 
  
 
  PS3_IPADDR=192.168.1.3
 
  
 
  set -e
 
  
 
  COMMON_CXXFLAGS="-O3 -Wall -Werror -pedantic"
 
  g++     ${COMMON_CXXFLAGS}           main.c++ latency_sse.c++ -o latency_sse
 
  ppu-g++ ${COMMON_CXXFLAGS} -maltivec main.c++ latency_ppu.c++ -o latency_ppu
 
  spu-g++ ${COMMON_CXXFLAGS}           main.c++ latency_spu.c++ -o latency_spu
 
  
 
  echo 'sse'
 
  ./latency_sse
 
  
 
  echo '\nppu'
 
  scp latency_ppu ${PS3_IPADDR}:/tmp/latency_ppu
 
  ssh ${PS3_IPADDR} /tmp/latency_ppu
 
  
 
  echo '\nspu'
 
  scp latency_spu ${PS3_IPADDR}:/tmp/latency_spu
 
  ssh ${PS3_IPADDR} /tmp/latency_spu
 
  

These tests were also a good example of why you can’t blindly trust your compiler; always check the disassembly when performance is important! For the SSE version, GCC needed hand-holding to minimise the number of register copies. For both the PPU and SPU versions, passing the matrix argument by value rather than by reference surprisingly caused lots of completely useless stores inside the loop. Unfortunately GCC still insisted on adding a redundant register copy to the SPU code, adding two cycles (and reducing the speed up by around 1%).
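 
For reference, a quick way to eyeball the generated code (a sketch assuming GNU binutils objdump and the prefixed PS3 toolchain variants are on your path):

  objdump -d latency_sse | less
  ppu-objdump -d latency_ppu | less
  spu-objdump -d latency_spu | less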

It is worth noting that SSE4.1 added the dpps dot product instruction. This gives another possible approach to implementing the function, though the cost of the matrix transpose is unfortunately too high and makes things slower. But if the matrix is already row-major and SSE4.1 is available, then using dpps is faster. There are other platforms, too, where transpose-and-dot-product is the way to go.
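 
For illustration only, here’s a hedged sketch of that dpps variant; it assumes SSE4.1 (compile with -msse4.1), a hypothetical row-major mat4r struct, and the _mm_dp_ps intrinsic from <smmintrin.h>. It is not part of the tests above.

  #include <smmintrin.h>
 
  struct mat4r{
    __m128 row0, row1, row2, row3;   // hypothetical row-major layout
  };
 
  static inline __m128 mat4r_mul_vec4_dpps( const mat4r & m, __m128 v ){
    // The immediate's high nibble selects the lanes to multiply and sum,
    // the low nibble the result lane that receives the dot product; the
    // remaining lanes are zeroed, so the four results can simply be OR'd.
    __m128 x = _mm_dp_ps( m.row0, v, 0xf1 );  // dot product -> lane 0
    __m128 y = _mm_dp_ps( m.row1, v, 0xf2 );  // dot product -> lane 1
    __m128 z = _mm_dp_ps( m.row2, v, 0xf4 );  // dot product -> lane 2
    __m128 w = _mm_dp_ps( m.row3, v, 0xf8 );  // dot product -> lane 3
    return _mm_or_ps( _mm_or_ps( x, y ), _mm_or_ps( z, w ) );
  }

Note that the four dpps results still have to be merged with a small tree of ors, and each dpps is itself a long-latency instruction, which is why this route only wins when the transpose comes for free.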

That’s been a pretty in-depth look at a pretty simple function, though the ideas here are generally applicable. Hopefully it has helped you gain a deeper understanding of how your code is executed on the target hardware, and from that, how it can be made to run faster.

Earlier I said that latency “is almost always” the right thing to optimize for. Next time we’ll have a look at an optimization technique where it’s not. We’ll investigate loop pipelining, and (if you’re slightly mad) how it can be done manually in SPU assembler. We’ll also look at something we conveniently skipped in the discussion above: dual issue.