Roll-Your-Own Vector

Original Author: Thomas Young

(Also posted as part of a series of posts about vectors and vector based containers.)

STL vectors offer a great combination of dynamic resizing convenience and low level contiguous buffer access performance, but std::vector is not the only option, and custom implementations of this container can sometimes be a better fit for your specific requirements.

Roll your own

It seems like it’s fairly common practice in the game development industry to replace std::vector (and other standard containers) with custom STL-style vector or array implementations.

One public example is the Bitsquid Foundation Library, and I know that a lot of other developers also do something similar.

I think that a lot of people have been bitten in the past by STL containers, or at least found that they were able to resolve issues or significantly improve performance by replacing STL containers with their own code. This was definitely the case with PathEngine.

We switched to a ‘home grown’ version of std::vector some years back as a result of some concrete run time performance issues, and saw some quite significant improvements from this change.

Two parts

There are potentially two main reasons for switching:

  • issues with the mechanism for custom memory allocation for STL containers, and
  • performance considerations

For some people memory allocation is the main reason for switching to a custom vector implementation, but this wasn’t an issue for us when we first switched to our own vector implementation. At that time, memory allocation customisation in PathEngine was fairly minimal (based on overriding global new and delete and only possible for DLL builds) and we didn’t use STL allocators.

The initial motivation for switching PathEngine to a custom vector was performance, and I’ll be looking exclusively at the performance side of things in this post, but will come back to look at custom allocation issues separately later on.

Initial issue and expectations

The initial issue for us was with std::vector methods showing up as significant in performance profiling, for certain operations, and with clients telling us that they needed these operations to run faster.

Most of our vectors contain simple POD element types, and I was expecting to see the biggest improvements from type specialisations for these elements, but this turned out not to be the most significant issue.

Basic vector implementation

Let’s have a look at what a basic vector implementation might look like.

We’ll make some simplifications.

In particular we’ll avoid adding any iterator classes and just use index values or raw pointers to elements for iteration. But we’ll also ignore things like exceptions. (As with a lot of other game development code PathEngine doesn’t throw or handle exceptions.) And we’ll stick to C++03 for backwards compatibility.

Let’s start with the following:

#include <string.h>
#include <stdlib.h>
#include <cstddef>
#include <new>

template <class T>
class cVector
{
public:
    typedef std::size_t size_type;
private:
    size_type _size;
    size_type _capacity;
    T* _data;

    static T*
    allocate(size_type size)
    {
        return static_cast<T*>(malloc(sizeof(T) * size));
    }
    static void
    copyRange(T* begin, T* end, T* dest)
    {
        while(begin != end)
        {
            new((void*)dest) T(*begin);
            ++begin;
            ++dest;
        }
    }
    static void
    deleteRange(T* begin, T* end)
    {
        while(begin != end)
        {
            begin->~T();
            ++begin;
        }
    }

public:
    typedef T* iterator;
    typedef T value_type;

    cVector()
    {
        _size = 0;
        _capacity = 0;
        _data = 0;
    }
    ~cVector()
    {
        deleteRange(_data, _data + _size);
        free(_data);
    }

    // needs some other constructors
    // size(), empty() and a bunch of other methods!

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        new((void*)(newData + _size)) T(value);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
        ++_size;
    }
    T&
    operator[](size_type index)
    {
        return _data[index];
    }
    const T&
    operator[](size_type index) const
    {
        return _data[index];
    }
    T*
    begin() const
    {
        return _data;
    }
    T*
    end() const
    {
        return _data + _size;
    }
};

This misses out a lot of the vector interface, obviously, but still manages to give us what are arguably the key benefits of STL style vectors, amortized constant time push back and automatic buffer memory management, in quite a small amount of code.

Note that we can continue using a load of other stuff from the standard library (such as sort algorithms), because of the way STL design decouples algorithms from actual container implementations through the iterators concept. Raw pointers into our vector buffer meet the same requirements for random access iterators as the iterators returned by std::vector.
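For example, standard algorithms can be applied directly to a cVector (a quick sketch, assuming the cVector class defined above is in scope):

#include <algorithm>

void SortExample()
{
    cVector<int> v;
    v.push_back(3);
    v.push_back(1);
    v.push_back(2);
    // begin() and end() return raw pointers, which satisfy the
    // random access iterator requirements expected by std::sort
    std::sort(v.begin(), v.end());
}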

The code is pretty straightforward, with one slightly tricky aspect being placement new. This is kind of fundamental to vector containers, and so it’s important to understand a bit about what’s going on here.

We’re using malloc and free for underlying buffer allocation, with constructor and destructor calls invoked separately from actual memory allocation. The call to new((void*)dest) T(*begin) constructs an object of type T at the (already allocated) memory location pointed to by dest, using T’s copy constructor with argument *begin. The call to begin->~T() calls T’s destructor without freeing the underlying memory.
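As a minimal standalone illustration of this construct/destroy separation (just a sketch, independent of the vector code itself):

#include <new>
#include <stdlib.h>

struct cThing
{
    int _value;
    explicit cThing(int value) : _value(value) {}
};

void PlacementNewExample()
{
    void* memory = malloc(sizeof(cThing)); // raw memory, no constructor called yet
    cThing* t = new(memory) cThing(7);     // placement new: construct in place
    t->~cThing();                          // explicit destructor call, memory not freed
    free(memory);                          // release the raw memory separately
}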

Malloc and free will do for now, but I’ll look at customising the underlying memory allocation methods in my next article.

A lot of stuff like size(), empty(), front(), back() has been omitted to save space, but can be added easily enough.

Doing it yourself

So far it’s unlikely that this code will actually give us any speedups over standard library versions, at least in release builds, but there are already some interesting implications and potential advantages of ‘doing it yourself’ like this:

  • We can control exact container semantics, independent of compiler version
  • The code is simpler and easier to understand
  • All of the code is available to us
  • It can work out faster in debug builds
  • There are very minimal dependencies

Knowing exactly what semantics will be applied for underlying buffer allocation can be important in order to use vectors appropriately, for example it is nice to know for sure if calling clear() will free the underlying buffer (as discussed in this previous post).
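With a custom class we can make that choice explicit. For example, a clear() method with 'keep the buffer' semantics might look like this (a sketch of one possible choice; this method isn't part of the code shown above):

    // destroys the elements but intentionally keeps the allocated buffer,
    // so subsequent push_back calls can reuse the existing capacity
    void
    clear()
    {
        deleteRange(_data, _data + _size);
        _size = 0;
    }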

If you step into the push_back operation of your standard library vector implementation in a debugger, you’ll probably find this goes through a bunch of cryptically named template methods, which may then also end up with some calls to standard library binary components (for which source may not be available). It’s not so easy to understand what’s going on, and although this should all get optimised out in release builds, debug builds are obliged to actually perform each of these calls (so that everything makes sense in the debugger).

And the vector implementation in your standard library also probably depends on a bunch of other stuff, with the vector header pulling in a bunch of other headers. In many situations we can depend on the C++ standard library being available, and working correctly, but in some cases this may not be the case, and extra dependencies may be an issue. A case in point was with the Android NDK, particularly in early versions, where C++ standard library support was limited and it wasn’t possible to depend on all of this working completely.

But besides the above points, I think it’s great just to be able to have a look under the hood, understand a bit about how this stuff works (or can work), and dispel some of the mystery that can otherwise shroud some of our standard library invocations!

Reinventing the wheel

Conversely, there are a lot of reasons not to do something like this. We’re told not to ‘reinvent the wheel’, and this is pretty good general advice.

Standard vector implementations get a lot of testing, and there may be non-obvious corner cases or tricky situations, which we don’t consider, but which are handled correctly by std::vector.

A vector implementation written by your compiler vendor may take advantage of compiler specifics, or otherwise work together with the compiler to get better performance than an implementation written in generic C++.

Cutting stuff out of the vector interface means that some standard vector based code won’t work, and other semantic differences from std::vector can be a potential source of confusion.

Standard vectors may be tooled up to the teeth with stuff like debug checking instrumentation, which can help catch some tricky bugs. (But note that this is partly orthogonal to the issue of faster debug builds, because we can add some debug checking back in without the same kind of call complexity as standard vector implementations.)

Also, standard vector implementations are under development by an external party, and if we stick with the standard implementation we can get the benefits from new releases in the future without any work on our side (kind of like middleware licensing).

One case in point is C++11 move semantics. The PathEngine vector implementation was written before move semantics and so you probably won’t get much benefit from implementing move constructors in the contained elements when using this class, unless we update the vector class implementation, but if we stuck with std::vector this is something we would have got ‘for free’. (In practice PathEngine avoids using non-simple element types for vectors in performance critical code, so move semantics are not so important for us, but the basic point about external updates and improvements remains valid.)

Some good points on both sides of the argument then, but keeping these various considerations in mind, let’s go on to look at what we can actually do with custom vector code in place.

Resize methods

We’ll need to look at what goes on in the resize() methods, so let’s go ahead and add these:

    void
    resize(size_type newSize)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize);
        _size = newSize;
    }
    void
    resize(size_type newSize, const T& fillWith)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize, fillWith);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize, fillWith);
        _size = newSize;
    }

These depend on the following additional private helper methods:

    void
    reallocate(size_type newCapacity)
    {
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
    }
    static void
    constructRange(T* begin, T* end)
    {
        while(begin != end)
        {
            new((void*)begin) T();
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end, const T& fillWith)
    {
        while(begin != end)
        {
            new((void*)begin) T(fillWith);
            ++begin;
        }
    }

Potential gotcha: push back reference to an existing element

We’ve moved some shared logic out into that reallocate() method. This is quite similar to some code in the push_back() method, and it can be tempting to use the same reallocate call there also, but there’s a subtle trap to watch out for.

The temptation is to move copy construction of the pushed element until after the buffer copy and delete, so that we can refactor push_back() as follows:

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        reallocate(newCapacity);
        new((void*)(_data + _size)) T(value); // this construction now happens after the old buffer is deleted
        ++_size;
    }

But this will now fail (potentially unpredictably!) for certain cases where the value being pushed back actually references back into the existing vector buffer, for example with something like v.push_back(v.front()).

There’s some discussion about this in this stackoverflow question, but it looks like standard vector implementations all support this kind of push_back operation, and so this is something we should try to support in our custom vector.

[Update: turns out this gotcha got me again! As pointed out by DeadMG on the comments for a later post, the same issue also applies to one of the resize() methods I’ve posted here, e.g. with v.resize(v.size() + 3, v.back()).]
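One way to guard the fill version of resize() against this (a sketch of a possible fix, not necessarily what PathEngine actually does) is to copy the fill value to a local before any reallocation can invalidate the reference:

    void
    resize(size_type newSize, const T& fillWith)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize > _capacity)
        {
            // fillWith may point into the existing buffer (e.g. v.resize(v.size() + 3, v.back())),
            // so take a copy before reallocate() frees that buffer
            T fillCopy(fillWith);
            size_type newCapacity = newSize;
            if(newCapacity < _size * 2)
            {
                newCapacity = _size * 2;
            }
            reallocate(newCapacity);
            constructRange(_data + _size, _data + newSize, fillCopy);
            _size = newSize;
            return;
        }
        constructRange(_data + _size, _data + newSize, fillWith);
        _size = newSize;
    }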

Code repetition, resizing without a fill value

There’s still a lot of code repetition between the two resize methods, but we want to make sure that this is fast, and so I’m kind of treating this as the first priority. For example, we could combine these into a single method by setting a default value for the fillWith argument (fillWith = T()), but this will most likely result in a load of unnecessary copy operations for the case where no fill value is actually required.

In fact, we’ve added a variation on the placement new call for that no-fill construction case. Let’s look at this a bit more closely.

Construction subtleties and zero initialisation

To construct elements without a fill value in the above code, we use new((void*)begin) T(). The idea is to call the default constructor for class T with memory already allocated, but it turns out that there are some subtleties here in certain cases, notably when T is a built in type.

It turns out that when this version of placement new (with the empty brackets) is used for a built in type we get ‘value initialisation’, which means the element is zero-initialised, whereas writing new((void*)begin) T (without the brackets) gives us ‘default initialisation’, which leaves built in types uninitialised.

There’s a good explanation of the difference between these two initialisation types in this stackoverflow question.

Note that the same issue applies on vector construction when we do something like cVector<int> v(10), which uses quite similar code but with the corresponding constructor not shown in the example code.

When we first switched to a custom vector implementation we decided to change these construction semantics, and use default initialisation, and it turned out that this was actually the main concrete benefit in terms of actual speedups for our vector use cases.

I tested this again with the current SDK build, in a modern compiler environment (Clang 3.0, with -std=c++0x), and this still makes a significant difference in certain situations, although now mostly limited to specific load and save methods.

For example, loading unobstructed space for our flatshead_dungeon benchmark mesh:

  • with standard zero initialisation semantics takes 0.000143 seconds
  • with modified default initialisation semantics takes 0.000113 seconds

It looks like the difference in this specific case is then actually mainly down to the following code for loading persistent data into a buffer:

// _backStore is a cVector<char>
// tUnsigned32 is an integer typedef
// 'is' is an input stream object
    tUnsigned32 arraySize;
    is.openNextElementAsArray_GetSize(arraySize);
    _backStore.resize(arraySize);
    is.finishReadElementAsArray(&_backStore[0]);

So with standard vector semantics this will normally zero the buffer before loading, and this can be a noticeable overhead.

If we want _backStore to correctly report its size, and are avoiding nasty hacks, it doesn’t look like there is any way to avoid this overhead with a standard vector implementation, but we could decide just not to use a vector in this case and revert to raw buffer allocation.

Changed semantics

When using our custom vector implementation with this change we need to be aware of the changed semantics, but to be quite honest I actually wasn’t aware of this aspect of std::vector behaviour before we switched (elements being zeroed on vector construction or resize) and so was already writing code as if element values were undefined.

I guess an alternative approach for the resize() method could be to add a separate method called resize_Uninitialised(), allowing this to be chosen explicitly by the calling code, but it’s not so easy to provide this kind of choice on construction.
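A minimal sketch of what such a method might look like (illustrative only, this isn't part of the actual PathEngine interface), reusing the reallocate() helper from earlier:

    // only valid for element types that don't need constructors or destructors run,
    // e.g. the char elements in _backStore above
    void
    resize_Uninitialised(size_type newSize)
    {
        if(newSize <= _size)
        {
            _size = newSize;
            return;
        }
        if(newSize > _capacity)
        {
            reallocate(newSize);
        }
        _size = newSize;
    }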

Standalone benchmark for vector resize

Checking this against std::vector, for completeness, with the following benchmarking code (with Clang 3.0 and -std=c++0x):

template <class tVector> static int
ResizeBenchmark()
{
    int sum = 0;
    for(int i = 0; i != 400; ++i)
    {
        tVector v;
        v.resize(100000);
        v.back() += 1;
        sum += v.back();
    }
    return sum;
}

I get:

container type                    build    time            sum
std::vector                       debug    0.0924 seconds  400
std::vector                       release  0.0125 seconds  400
cVector, zero initialisation      debug    0.0930 seconds  400
cVector, zero initialisation      release  0.0338 seconds  400
cVector, default initialisation   debug    0.0002 seconds  79801
cVector, default initialisation   release  0.0001 seconds  79801

The sum variable is there to help ensure the benchmark logic is not optimised out, but also shows us that different semantics are being applied (and we can see that, in the default initialisation cases, the OS is actually giving us back the same buffer again each time through the loop).

In general stand-alone benchmarks like this should be taken with a handful of salt (because of cache issues and because the actual effects on real code performance will usually be quite different), but can be useful at least to demonstrate that there is a difference.

(Note that there can be other ways to avoid the initialisation cost shown in this benchmark, however. The next post in this series discusses this in more detail.)

Type specialisation

The other main change we made was to detect POD types and use specialised code paths for vectors with these element types.

The point is that, for built-ins and certain user defined types, constructor and destructor calls will have no effect and can be omitted, and copy construction can be replaced with memcpy.

With C++11 this becomes quite straightforward, as we can use is_trivially_copyable to ask the compiler whether or not the corresponding shortcuts can be applied for a given type.
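For example, a C++11 sketch of that kind of dispatch (standalone free functions here, using std::is_trivially_copyable tag dispatch rather than PathEngine's actual trait mechanism):

#include <type_traits>
#include <string.h>
#include <new>

template <class T> void
copyRange(T* begin, T* end, T* dest, std::true_type) // trivially copyable elements
{
    memcpy(dest, begin, sizeof(T) * (end - begin));
}
template <class T> void
copyRange(T* begin, T* end, T* dest, std::false_type) // everything else
{
    while(begin != end)
    {
        new((void*)dest) T(*begin);
        ++begin;
        ++dest;
    }
}
template <class T> void
copyRange(T* begin, T* end, T* dest)
{
    // the compiler selects the memcpy path for trivially copyable T
    copyRange(begin, end, dest, typename std::is_trivially_copyable<T>::type());
}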

Before C++11, however, we had to set this up for ourselves, and we used the following mechanism for this:

struct cPOD_Tag {};
struct cNonPOD_Tag {};
template <class T>
struct cTypeTraitsHelper
{
    typedef cNonPOD_Tag tPODType;
};
#define DECLARE_POD_TYPE(t) template<> struct cTypeTraitsHelper<t*> { typedef cPOD_Tag tPODType; };
DECLARE_POD_TYPE(int)
DECLARE_POD_TYPE(char)
//...... add declarations for other types, as required
template <class T>
typename cTypeTraitsHelper<T>::tPODType PODType(T&)
{
    return typename cTypeTraitsHelper<T>::tPODType();
}

With this set up appropriately we can then switch our buffer helper methods depending on whether or not the contained element is declared as a POD type, for example:

    static void
    constructRange(T* begin, T* end, cPOD_Tag)
    {
    }
    static void
    constructRange(T* begin, T* end, cNonPOD_Tag)
    {
        while(begin != end)
        {
            new((void*)begin) T;
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end)
    {
        constructRange(begin, end, PODType(begin));
    }

(And similarly for stuff like destruction, copy construction and buffer fill.)
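For example, the copy and destruction helpers can be switched in the same way (a sketch following the same pattern as constructRange above; not necessarily the exact PathEngine code):

    static void
    copyRange(T* begin, T* end, T* dest, cPOD_Tag)
    {
        // POD elements can be copied with a straight memcpy
        memcpy(dest, begin, sizeof(T) * (end - begin));
    }
    static void
    copyRange(T* begin, T* end, T* dest, cNonPOD_Tag)
    {
        while(begin != end)
        {
            new((void*)dest) T(*begin);
            ++begin;
            ++dest;
        }
    }
    static void
    deleteRange(T* begin, T* end, cPOD_Tag)
    {
        // no destructor calls needed for POD elements
    }
    static void
    deleteRange(T* begin, T* end, cNonPOD_Tag)
    {
        while(begin != end)
        {
            begin->~T();
            ++begin;
        }
    }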

This kind of type specialisation made some difference to performance historically, but in practice it turned out that the compiler is often clever enough to optimise this stuff out by itself, and the actual performance difference resulting from this change was not hugely significant.

I’ve tested this again in the current release (with Clang 3.0, with and without -std=c++0x), and if I remove all the POD type declarations it looks like the only code that is now significantly affected is some methods for saving preprocess objects to persistence, which are not so performance critical.

Tweaking initial push_back behaviour

Another optimisation is to adjust the initial behaviour when calling push_back on an empty vector, to avoid small allocations.

After some benchmarking we changed the following line in the push_back() method:

    size_type newCapacity = _capacity ? _capacity * 2 : 1;

to

    size_type newCapacity = _capacity ? _capacity * 2 : 8;

What this does is to avoid initial allocations of size 1, 2 and 4 and jump straight to a buffer of size 8, which works out as a net performance gain for us.

If you think about what’s going on in the underlying memory allocation subsystem this makes sense, because small allocations are probably usually going to be aligned and padded up to a certain minimum size anyway.

In-class buffers

Another way to avoid small allocations can be to reserve a fixed amount of space in the vector itself for use when the buffer allocation requirement is below a certain size.

I know of at least one developer that uses this technique but we don’t do this currently at PathEngine.
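The idea (just a rough sketch of the data layout, since this isn't something PathEngine actually implements) is to embed a small fixed-size buffer directly in the vector object and only fall back to heap allocation once that is exceeded:

#include <cstddef>

template <class T, std::size_t STATIC_CAPACITY = 8>
class cVectorWithInClassBuffer
{
    // note: a real implementation needs to ensure this buffer is suitably aligned for T
    char _inClassBuffer[sizeof(T) * STATIC_CAPACITY];
    T* _data;               // points either at _inClassBuffer or at a heap allocation
    std::size_t _size;
    std::size_t _capacity;  // starts at STATIC_CAPACITY, so small vectors never hit malloc()
    // allocation helpers then only call malloc()/free() once capacity exceeds STATIC_CAPACITY
    //...
};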

Amortize faster!

The above code doubles vector capacity each time a new buffer is allocated to ensure amortized constant time push_back, but there’s nothing special about the number 2 here. In fact, we should be able to use any constant > 1 and still get this amortized complexity.

There’s a trade-off here between the number of allocations (which take time) and buffer overallocation (which costs memory, and can also reduce cache efficiency), but in PathEngine, changing this constant to 3 seems to give us a net benefit:

    size_type newCapacity = _capacity ? _capacity * 3 : 8;

Note that it’s all the more important with this setting to make sure we avoid permanent overallocations wherever possible, e.g. by applying the ‘shrink-to-fit’ idiom, as discussed here.
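For reference, the usual C++03 form of that ‘shrink-to-fit’ idiom for a standard vector is the swap trick (C++11 later added a shrink_to_fit() method, as a non-binding request):

#include <vector>

void ShrinkToFit(std::vector<int>& v)
{
    // construct a right-sized temporary copy and swap buffers with the original;
    // the overallocated buffer is then freed when the temporary is destroyed
    std::vector<int>(v).swap(v);
}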

Other optimisations and benchmarking

There are other tweaks you could try out, but this can start to be a bit hit and miss if you don’t have some good high level benchmarks set up to capture the real performance situation needing to be optimised.

With benchmarks in place a sampling profiler can help you pin down areas where you can most benefit from optimisations, but watch out for situations where the real issues are with higher level vector management and using vectors effectively. Maybe a whole bunch of allocations could be avoided just by reusing a vector across calls or across frames, but a profiler isn’t going to tell you this directly.

Note that a number of the above tweaks target the underlying memory allocations because this is often the most significant cost associated with vector operations, and there may then be other ways to reduce these memory allocation costs.

I’ll look into the issue of memory allocation customisation more generally in my next post, and we’ll see, for example, that using realloc() is something that can also potentially improve performance.

Conclusion

It’s good to know that we can replace std::vector, first of all. There’s a bit of placement new trickery, and at least one gotcha to look out for, but overall this turns out to be not as hard as you might expect. After that, whether or not this is a good idea is something to be decided on a case by case basis!

You should have some ideas, now, about how a custom vector class can potentially improve performance and in my next post we’ll see how this can also be desirable if you need a lot of control over buffer memory allocation.

Optimising your vector class is something of a low level optimisation and there are often higher level changes you can make to your program that make a lot more difference, and can even remove the need for this kind of low level optimisation.

In the case of PathEngine, the various low level vector optimisations I describe here are less significant than when we first switched because of other higher level optimisations since then that avoid or reduce costly vector operations in many cases where these were an issue, but our custom vector nevertheless still helps to improve SDK performance.

When measuring performance differences it’s important to test vector operations in the actual target run-time context, and to include the effects of memory allocation and processor caching in real situations. If you think that a custom vector class could improve your application performance, then perhaps the best way to measure this could be to actually go ahead and plug in a custom vector implementation (perhaps based on the code shown above) and compare some high level benchmark timings.

Efficient Vectors of Vectors

Original Author: Thomas Young

(Also posted as part of a series of posts about vectors and vector based containers.)

STL style vectors are convenient because they hide the details of internal buffer management, and present a simplified interface, but sometimes convenience can be a trap!

In my previous post I touched briefly on STL vectors with non-simple element types, and mentioned the ‘vector of vectors’ construct in particular as a specific source of memory woes.

In this post I’ll talk about this construct in more detail, explain a bit about why there is an issue, and go on to suggest a fairly straightforward alternative for many situations.

Interview question

One of my standard interview questions for C++ programmers has been:

What can be done to optimize the following class?

#include <vector>

class cLookup
{
    std::vector<std::vector<long> > _v;
public:

    long addLine(const std::vector<long>& entries)
    {
        _v.push_back(entries);
        return _v.size() - 1;
    }

    long getNumberOfEntriesInLine(long lineIndex) const
    {
        return _v[lineIndex].size();
    }
    long getEntryInLine(long lineIndex, long indexInLine) const
    {
        return _v[lineIndex][indexInLine];
    }
};

(Index type issues ignored for simplicity.)

Note that the question intentionally doesn’t tell you what needs to be optimised. I guess I ideally want applicants to then either:

  1. ask what should be optimised, or
  2. immediately spot what is ‘wrong’ with the code (and go on to make some suggestions for improvement)

It’s generally a good first reflex in any optimisation situation to try to find out some more about exactly what’s going on (whether by profiling, or measuring statistics, or even just looking at how a piece of code is used), and (perhaps more fundamentally), what needs to be optimised, and so asking for more information is probably a good initial response to pretty much any optimisation question.

But the vector of vectors in the above code is such an expensive and inefficient construct that if you’ve ever done something like this in the wild, or come across similar constructs during optimisations, it should really stand out as a red flag, and I don’t think it’s unreasonable to just assume that this is what the question is asking about.

The problem

The class methods can be divided into a ‘lookup building’ part, which might only be used at preprocess or loading time, and a non-mutating ‘query’ part.

In real life we should usually be motivated and directed by some kind of performance measure, and there are then two issues potentially resulting from that vector of vectors, with our performance measures ideally telling us which (or both) of these issues we actually need to focus on:

  • inefficiencies during construction, or
  • run-time memory footprint

Lookup construction

The construction part has the potential to be horribly inefficient, depending on the compilation environment.

It’s all about what the top level vector does with the contained objects when increasing buffer capacity, and, more specifically, whether or not C++11 ‘move semantics’ are being applied.

To increase buffer capacity, vectors allocate a new block of memory and then move the contents of the existing buffer across. Before C++11 move semantics the move part actually requires copy construction of elements in the new buffer from elements in the old buffer and then destruction of the old elements, but with C++11 move semantics this copy and destroy can be avoided.

We can see this in action with the following test code:

#include <vector>
#include <iostream>

int i = 0;

class cSideEffector
{
    int _i;
public:
    cSideEffector()
    {
        _i = i++;
        std::cout << "element " << _i << " constructed\n";
    }
    cSideEffector(const cSideEffector& rhs)
    {
        _i = i++;
        std::cout << "element " << _i << " copy constructed (from element " << rhs._i << ")\n";
    }
    ~cSideEffector()
    {
        std::cout << "element " << _i << " destroyed\n";
    }
};

int
main(int argc, char* argv[])
{
    std::vector<std::vector<cSideEffector> > v;
    v.resize(4);
    v[0].resize(1);
    v[1].resize(1);
    v[2].resize(1);
    v[3].resize(1);
    std::cout << "before resize(5), v.capacity() = " << v.capacity() << '\n';
    v.resize(5);
    return 0;
}

Building this with Clang 3.0, without the -std=c++0x option, I get:

element 0 constructed
element 1 copy constructed (from element 0)
element 0 destroyed
element 2 constructed
element 3 copy constructed (from element 2)
element 2 destroyed
element 4 constructed
element 5 copy constructed (from element 4)
element 4 destroyed
element 6 constructed
element 7 copy constructed (from element 6)
element 6 destroyed
before resize(5), v.capacity() = 4
element 8 copy constructed (from element 1)
element 9 copy constructed (from element 3)
element 10 copy constructed (from element 5)
element 11 copy constructed (from element 7)
element 1 destroyed
element 3 destroyed
element 5 destroyed
element 7 destroyed
element 8 destroyed
element 9 destroyed
element 10 destroyed
element 11 destroyed

But if I add the -std=c++0x option (enabling C++11 move semantics), I get:

element 0 constructed
element 1 constructed
element 2 constructed
element 3 constructed
before resize(5), v.capacity() = 4
element 0 destroyed
element 1 destroyed
element 2 destroyed
element 3 destroyed

So we can see first of all that without move semantics enabled each buffer reallocation triggers copy and delete for all the existing elements, and then that this is something that can potentially be resolved by turning on support for move semantics.

Note that this copy and delete usually won’t be an issue for basic types (where copy can be just a memory copy, and delete a null operation), but if the vector elements have their own internal buffer management (as is the case with vectors of vectors) then this results in a whole lot of essentially unnecessary memory heap operations, which can be both a significant performance hit and bad news for memory fragmentation.

Move semantics can be pretty cool if you can depend on the necessary compiler support, and you can then find a load of stuff on the web discussing this in more detail (e.g. here).
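For the vector of vectors case above, it is std::vector’s own move constructor that eliminates the copies, but for your own buffer-owning element types you would add one yourself. A C++11 sketch for the cSideEffector test class (and note that, with a conforming library, declaring the move constructor noexcept is what allows vector reallocation to move rather than copy):

    // with this in place, reallocation would print 'move constructed' lines
    // instead of copy construct + destroy pairs
    cSideEffector(cSideEffector&& rhs) noexcept
    {
        _i = i++;
        std::cout << "element " << _i << " move constructed (from element " << rhs._i << ")\n";
    }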

At PathEngine, however, we need to support clients building the SDK on a bunch of older compilers, and if there is some way to implement something like this lookup class without all this overhead, regardless of whether C++11 move semantics are actually available, then we definitely want to do this, and the lookup construction stage remains an optimisation target!

It’s all about the runtime

Actually, in practice we would never want to do something like cLookup::addLine() during loading. Something like this would normally only be called at preprocess generation time, with the possibility to store the resulting preprocess data to persistent files and load back in at runtime with much more direct data paths.

(In reality there would be some other stuff to support this in that cLookup class, this is a simplified example and doesn’t give the whole picture.)

And then, while we obviously want to avoid all those memory ops during preprocess generation, the real issue here is probably the implications of the vector of vectors construct on the SDK run-time, and in particular on run-time memory footprint.

It turns out that vectors of vectors are also very bad news for your system memory heap.

For something like PathEngine this is *worse* than the potential construction inefficiencies from unnecessary copying, and it is an issue *whether or not* C++11 move semantics are being applied.

Memory footprint

From the point of view of memory footprint there are two main problems here:

  1. The fact that separate buffer allocations are made per entry in _v, and
  2. Operating system and vector class memory overhead for each of those allocations

The situation will look something like this:

http://upcoder.com/static/image/posts/vector-of-vectors-memory.png

One buffer gets allocated at the top level, by _v, and filled in with the sub vector class data, and a bunch of other buffers then also get allocated by these sub vectors.

Note that each vector has four data members: a buffer pointer, a current size value (or end pointer), a current capacity value (or end pointer), and a pointer to an allocator object. It’s most common to see STL containers without allocation customisation, and so it can be a surprise to see the allocator pointer. When allocator customisation is not being used the allocator pointers will all be set to zero, but they nevertheless get stored once per lookup line, and help to bulk out the memory overhead.

If each of the four vector data members is a 32 bit value then this gives us a starting cost of 16 bytes per line in the lookup, most of which is not really necessary in this case.

And allocating an individual buffer for each line is also a bad thing, because allocating a lot of buffers will lead to memory fragmentation, and because there is a non-negligible amount of built in hidden overhead per system memory allocation. Each individual buffer allocated from the system will require additional data related to tracking the buffer in the system memory allocation heap, and will also most likely be aligned to some fixed value by the system (so with memory buffer lengths actually being extended to a multiple of that fixed value).

As a result, for a lookup object created with short lines, the memory overhead for each line can easily exceed the amount of actual data being stored.

And we should also be aware of the possibility of vector overallocation.

In the code shown, the subvectors probably shouldn’t be overallocated because they are copy constructed from existing vectors (although I haven’t researched the ‘official’ situation for this in the standard!), but the top level vector will be overallocated (by a factor of 1.5, on average, as discussed in my previous post), and in the general case of vector of vector construction overallocation may not just mean buffers being larger than necessary, but also potentially that more buffers are allocated than actually needed.

In practice, in loading code, we set things up so that vectors get loaded without any overallocation, but it can be worth checking that this is definitely working as expected.

With or without overallocation it’s clear that there are some significant memory footprint and fragmentation issues, and these kinds of memory issues are then also usually performance issues, because of the need to keep memory caches working effectively, and because of heap management costs!

Collapsed vector vector

The key point, for this use case, is that we actually only ever need to add data to the end of the lookup, and so we just don’t need all that separate dynamic state tracking for each individual line.

And the solution is then to ‘collapse’ all those individual line buffers into a single buffer, giving us just one single buffer containing all the lookup entries concatenated together in order.

Since the lookup lines are not all the same length we also need some way to find the lookup entries for each line, and so we use a second buffer which indexes into the concatenated entries for this purpose.

http://upcoder.com/static/image/posts/collapsed-vector-of-vectors.png

The index buffer tells us exactly where each line starts in the concatenated entries buffer.

Note that we give the index buffer one entry per lookup line, plus one additional entry which points to the end of the concatenated entries. (This extra index entry enables us to treat all the lookup lines in the same way, and avoids an explicit test for first or last line.)
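In code terms, locating the entries for a given line i then just looks something like this (an illustrative fragment, using the member names from the class shown further down):

    // the entries for line i are the half-open range [_index[i], _index[i + 1]) in _v
    const T* lineBegin = &_v[_index[i]];
    size_type lineSize = _index[i + 1] - _index[i];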

We could add code for this directly into the cLookup class fairly easily, but this is something we’re bound to come across with regard to other data structures, and we should set things up to make it as easy as possible for us to avoid the vector of vectors construct, so why not create a new container class for this purpose, one which can be used to replace vectors of vectors more generally?

Data members

We need to maintain a couple of dynamically resized buffers (with increasing size), and std::vector is actually just the thing for that (!), so we can start out with something like:

#include <vector>

template <class T>
class cCollapsedVectorVector
{
    typedef typename std::vector<T>::size_type size_type;
    std::vector<T> _v;
    std::vector<size_type> _index;
public:

    //... implementation here
};

These vector members save us re-implementing a bunch of low-level buffer management stuff, whilst still resolving all the major issues we had with the vector of vectors construct.

We want to make it easy to replace vector of vectors with this new class, so let’s copy the std::vector interface as far as reasonably possible.

We’ll add things like a size_type typedef and basic STL style vector methods like size() and empty(), and try to make common element access code for vectors work with the new class.

We could also provide the same interface for the actual push_back operation (i.e. taking a subvector to be concatenated), but I’ve chosen to diverge from that interface to make it clear in the calling code that something a bit different is going on at construction time.

I’ve also avoided trying to do anything that might require proxy objects or iterator classes in order to keep things simple. The final class can then look something like this:

#include <vector>
#include <cassert>

template <class T>
class cCollapsedVectorVector
{
public:
    typedef typename std::vector<T>::size_type size_type;
private:
    std::vector<T> _v;
    std::vector<size_type> _index;

public:

    cCollapsedVectorVector() :
     _index(1)
    {
        _index[0] = 0;
    }
    cCollapsedVectorVector(const std::vector<std::vector<T> >& buildFrom) :
     _index(buildFrom.size() + 1)
    {
        _index.clear();
        _index.push_back(0);
        for(size_type i = 0; i != buildFrom.size(); ++i)
        {
            _index.push_back(_index.back() + buildFrom[i].size());
        }
        _v.resize(_index.back());
        size_type vIndex = 0;
        for(size_type i = 0; i != buildFrom.size(); ++i)
        {
            for(size_type j = 0; j != subVectorSize(i); ++j)
            {
                _v[vIndex++] = buildFrom[i][j];
            }
        }
        assert(vIndex == _v.size());
    }

    void
    shrinkToFit()
    {
        if(_v.capacity() != _v.size())
        {
            std::vector<T>(_v).swap(_v);
        }
        if(_index.capacity() != _index.size())
        {
            std::vector<size_type>(_index).swap(_index);
        }
    }

    bool operator==(const cCollapsedVectorVector& rhs) const
    {
        return _index == rhs._index && _v == rhs._v;
    }

    void swap(cCollapsedVectorVector& rhs)
    {
        _v.swap(rhs._v);
        _index.swap(rhs._index);
    }

    bool empty() const
    {
        return _index.size() == 1;
    }
    void clear()
    {
        _index.resize(1);
        _v.clear();
    }

    void pushBackSubVector()
    {
        _index.push_back(_index.back());
    }
    void pushBackSubVector(size_type size)
    {
        _index.push_back(_index.back() + size);
        _v.resize(_v.size() + size);
    }
    void pushBackInLastSubVector(const T& value)
    {
        assert(!empty());
        _index.back()++;
        _v.push_back(value);
    }
    void popBackInLastSubVector()
    {
        assert(!empty());
        assert(subVectorSize(size() - 1) > 0);
        _index.back()--;
        _v.pop_back();
    }

    size_type size() const
    {
        return _index.size() - 1;
    }
    const T* operator[](size_type i) const
    {
        return &_v[_index[i]];
    }
    T* operator[](size_type i)
    {
        return &_v[_index[i]];
    }

    const T* back() const
    {
        assert(!empty());
        return &_v[_index[_index.size() - 2]];
    }
    T* back()
    {
        assert(!empty());
        return &_v[_index[_index.size() - 2]];
    }

    size_type subVectorSize(size_type i) const
    {
        return _index[i + 1] - _index[i];
    }
};

Application

In cases where we only need to add to the end of the data structure (as in the case of cLookup), we can build to this new class directly, but in cases where this is not possible we could also fall back to using a vector of vectors at build or preprocess time, with this then baked into a cCollapsedVectorVector using the relevant constructor.

Anyway, now we can rewrite the lookup class as follows:

class cLookup
{
    cCollapsedVectorVector<long> _v;

public:

    long addLine(const std::vector<long>& entries)
    {
        _v.pushBackSubVector(entries.size());
        for(std::vector<long>::size_type i = 0; i != entries.size(); ++i)
        {
            _v.back()[i] = entries[i];
        }
        return _v.size() - 1;
    }
    void shrinkToFit()
    {
        _v.shrinkToFit();
    }

    long getNumberOfEntriesInLine(long lineIndex) const
    {
        return _v.subVectorSize(lineIndex);
    }
    long getEntryInLine(long lineIndex, long indexInLine) const
    {
        return _v[lineIndex][indexInLine];
    }
};

Note the addition of a shrinkToFit() method, which can be called when the calling code has finished adding lines, to avoid overallocation.

Conclusion

Vectors of vectors should be avoided if possible, and if data is only ever added at the end of the top level vector this is easy to achieve.

This is an example of an optimisation which should be applied pre-emptively, I think, because of the performance and memory footprint implications.

If you have vectors of vectors in your code, take a look at this right now and see if they can be replaced with something like cCollapsedVectorVector!


Havok Physics: 4 tips you won’t find in the docs

Original Author: Eric Undersander

http://www.ericundersander.com/blog2_havok/header_bar_vdb.png

Hey fellow coders, I’ve used Havok Physics at several game studios over the years and I thought I’d share some little-known but hopefully useful tips about this middleware.

For a rigid body, find all rigid bodies touching it

A recent game project involved a house interior scene, and I needed to know what objects were resting on a given table.  I could have added a collision listener to every table and maintained lists of contacting objects, but there’s an easier way.  The simulation already tracks these contacts every frame for all rigid bodies, so I dug into some undocumented internal classes (hkpLinkedCollidable and hkpSimpleConstraintContactMgr) and extracted what I needed.  This function is based on some code in one of the Havok demo projects.

#include <Physics2012/Collide/Agent3/Machine/Nn/hkpAgentNnTrack.h>
#include <Physics2012/Dynamics/Collide/hkpSimpleConstraintContactMgr.h>

void GetContactingBodies(const hkpRigidBody* body,
    hkArray<hkpRigidBody*>& contactingBodies)
{
    const hkArray<hkpLinkedCollidable::CollisionEntry>& collisionEntries =
        body->getLinkedCollidable()->getCollisionEntriesNonDeterministic();
    for(int i = 0; i < collisionEntries.getSize(); i++)
    {
        const hkpLinkedCollidable::CollisionEntry& entry = collisionEntries[i];
        hkpRigidBody* otherBody = hkpGetRigidBody(entry.m_partner);
        if (otherBody != HK_NULL)
        {
            if (entry.m_agentEntry->m_contactMgr->m_type ==
                hkpContactMgr::TYPE_SIMPLE_CONSTRAINT_CONTACT_MGR)
            {
                hkpSimpleConstraintContactMgr* mgr =
                    static_cast<hkpSimpleConstraintContactMgr*>(
                    collisionEntries[i].m_agentEntry->m_contactMgr);
                if (mgr->m_contactConstraintData.getNumContactPoints() > 0)
                {
                    contactingBodies.pushBack(otherBody);
                }
            }
        }
    }
}

Identify objects in the Visual Debugger

The Havok Visual Debugger (VDB) runs alongside your game and shows a debug view of the physics scene. Often you’ll want to select an object in the VDB and learn more about it. If you have a code debugger attached to your running game (such as the Visual Studio debugger), you can inspect the selected object in your debug watch window.

First, double-click the object in the VDB to bring up Display Object Properties. The object’s id is (by default) the memory address of the associated hkpCollidable. Copy-paste this address into your code debugger’s watch window and do a bit of arithmetic to get the address of the parent hkpRigidBody or hkpShapePhantom (subtract 16 on 32-bit platforms; subtract 32 on 64-bit platforms). Cast to the appropriate pointer type.
http://ericundersander.com/blog2_havok/vdb_object_id.png
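For example, with a made-up id of 0x0CF4A5B0 copied from the VDB, a watch expression on a 64-bit build might look something like this (subtract 16 instead of 32 on a 32-bit build):

(hkpRigidBody*)((char*)0x0CF4A5B0 - 32)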

You can also override the display id in your game code. Use whatever’s convenient for debugging, for example, the memory address of the parent game object:

void MyGameObject::SetRigidBodyDisplayID()
{
    m_rigidBody->setProperty(HK_PROPERTY_DEBUG_DISPLAY_ID, this);
}

Large heightfields in the Visual Debugger

Large heightfields are a problem for the VDB. It displays a 1000×1000 heightfield using approximately two million triangles, which brings the tool’s performance from 50 fps down to 5 fps on my PC. For heightfields larger than about 2400×2400, the VDB runs out of memory and crashes. Fortunately, you can override the display shape for a rigid body. Create an hkGeometry (an indexed triangle list) and assign it to the rigid body like so:

m_body->addProperty(HK_PROPERTY_OVERRIDE_DEBUG_DISPLAY_GEOMETRY_NO_DELETE,
    displayGeom);

I’ve written a helper function that inspects a source heightfield and overrides the display geometry if necessary. It reduces the display triangle count by downsampling (reducing resolution) by an appropriate factor. There’s also a function to help with cleanup–you need to delete the hkGeometry created earlier.

Click here for the helper functions.

Convex radius should be zero for your static environment

Okay, I lied! This last tip is actually in the Havok docs. I include it here because I’ve seen it frequently overlooked at game studios and even in the Havok demo projects.

Many Havok shape types feature an optional convex radius. This is an extra shell added to your shape, usually 5 cm, and it’s necessary for best performance. However, it’s inconvenient in the case of triangle-based shapes because Havok only uses the shell for shape queries—it’s ignored for raycasts. So, for example, if you position your character above the ground using a capsule query and then run foot IK using raycasts, you’ll get a 5 cm discrepancy.

Fortunately, triangle-based shapes are primarily used for static environments (fixed rigid bodies), and you can safely use a convex radius of zero for static environments. As the User Guide explains, “We only need to ensure that any two interacting bodies have some extra radius between them. As long as all dynamic bodies have an extra radius, the extra radius on a stationary landscape is not necessary.”

Here’s a breakdown of the relevant shapes–I suggest double-checking how you’re using these in your game:

Type                                 Default convex radius   How to set convex radius
hkpCompressedMeshShape               0.05                    Constructor or setRadius()
hkpExtendedMeshShape                 0.05                    Constructor or setRadius()
hkpSampledHeightFieldShape           n/a                     (No convex radius)
hkpTriSampledHeightFieldCollection   0.05                    Constructor or setRadius()
hkpBvCompressedMeshShape             0                       hkpDefaultBvCompressedMeshShapeCinfo::m_convexRadius
hkpStaticCompoundShape               n/a                     (No convex radius)

Thanks for reading!

Reading The Papers: Guaranteed Candy

Original Author: Michael Cook

For some developers, all this talk of procedural content generation is bittersweet. If you’re generating a world map, for instance, you probably don’t have too many requirements of it. A bit of land, not too many mountains – other than that, the randomness is part of the appeal. But what if you have very specific requirements? Well, then you’d need to test the content you generate to make sure it’s okay to give to the player. For many games this is a tricky and daunting task. How on earth would you automatically test a physics puzzle game, for instance? I’m glad you asked. This week on Reading the Papers – a novel approach to simulation that can test real-time physics-based games (and more).

We’re reading a paper about Cut The Rope; specifically, it talks about how the authors implemented a solver for the game, despite the fact that it’s complex, physics-driven and real-time. It’s a neat little project in its own right, but the authors are hopeful that the approach could be applied to many other types of game, too.

The authors have done a lot of work on Ropossum, and written multiple papers about different elements of the project. The level generator uses computational evolution to place objects in the game world to form a puzzle, and then can evaluate those levels using a variety of approaches – human designers can give feedback for a mixed-initiative approach; it can use a selection of custom heuristics (more on those in this paper); and as we’re going to see today, it can also automatically verify playability of a level it generated. We’ll no doubt see the rest of the Ropossum system in another column some other day, but for now – solvability.

Procedural content generation is all about the balance between unpredictability and reliability. On the one hand, we want to generate content that no-one has seen before, without needing a human designer to be there to do it. But at the same time, if there’s no human designer to curate the content, we need to be confident that what comes out is usable, and hopefully fun. If we’re generating puzzles, we need to be confident that a solution exists.

For turn-based or grid-based puzzle games, this isn’t too tricky. In fact, artificial intelligence has been looking at this kind of problem for a long time, inventing chess puzzles or producing altered versions of existing boardgames. Making a game turn-based and grid-based means that it’s easy to represent within a computer, and you can easily explore the space of possibilities. When a game is real-time, though, and when it has lots of complex systems at work in each frame like a physics-based game, then the problem gets many orders of magnitude harder.

So the problem is this: given a level design for a game like Cut The Rope, decide whether or not it can be solved, and if possible provide a series of actions that would lead to a solution. The way proposed in this paper uses a Prolog solver that runs alongside a simulation of the game. At each timestep, the game is translated into a description which Prolog can understand, and then the solver uses this representation to work out what sensible moves can be taken right now. If you’ve never heard of Prolog before, don’t worry – you’ll get the gist from a quick example. Below is a description of a Cut The Rope level in the paper’s Prolog representation:

candy(527, 140). velocity_up. velocity_to_right.
frog(390, 330). rope(550, 50, 105).
rope(440, 50, 195). air_cush(507, 195, 4).
reachable(rocket).

Some of these are obvious facts, like candy(X, Y) which states the candy’s co-ordinates in 2D space. Others are more subtle – reachable(C) means that the component C is accessible to the candy given its current trajectory, so if a candy is floating upwards then things below it aren’t reachable. This helps the system infer what moves are sensible or not – if the candy is never going to get near a component no matter how long you wait, then it’s not sensible to activate it.

Now, suppose we’re looking at this description of the game state, and we want to figure out what to do next. There are only a few moves we can make in a given Cut The Rope level, like pushing an air cushion or cutting a rope. How do we know which actions are likely to have an effect? The authors suggest a set of rules which Prolog can check to see if a move is ‘sensible’ currently. So for instance, for pressing an air cushion we might check:

air-cush_press :- air-cush(X,Y,Dir), candy_in_dir_of_air-cush(X,Y), distance(X, Y, 80).

In other words, we can press an air cushion if there is an air cushion in the level (obviously); if the candy is in the direction of the air cushion (otherwise there’s no point) and if the candy is near enough to the cushion to be affected by it. The authors suggest heuristics for each of these actions – if you were writing them for your own game, you can imagine inventing your own heuristics for whatever mechanics you’re dealing with. Crucially, this set of rules includes a ‘void’ action which equates to waiting, or not taking an action. Most of the time, this is what the simulator will be doing.

Running alongside this Prolog system is the physics engine of the game itself. The system works by simulating an interval of time in the game engine, and then sending the game state over to the Prolog system to see which actions, if any, are available to the player. If there's more than one action available, the system branches its simulation of the game and continues to search, in a depth-first fashion, through the game's possible futures. The image below shows how a tree of possible moves is built up as the system explores its options. When it reaches a failure state, the system simply backtracks until it finds an option it hasn't explored yet.


[Figure: the tree of branching game states explored during the depth-first search]

Of course, if the game is updating sixty times a second then there probably isn’t much going on between any two timesteps. The authors note that if you check with your Prolog system every single frame then your system very quickly slows down. So instead, they implemented a variable timestep which adds in a delay after an action has been taken, where no checks are made with the Prolog engine. Some actions have small delays, because they often are followed very quickly by new actions (like cutting ropes). Others, like firing rockets, have longer delays. This is all unique to whatever game you’re trying to analyse, but these heuristics are normally easy to figure out for the game’s designer.

So the essence of the system boils down to the relationship between the game’s raw, continuous, real-time simulation, and the elegant logical representation in Prolog. By choosing reasonable time intervals, and having a good intuition of which moves are sensible at any moment, the Prolog system can help tame this huge state space, and figure out a sequence of actions that completes a level. Of course, some game designs may not suit this. Cut The Rope only lets the player interact with what’s on screen, there’s no placement of new components or anything complex like that. And many actions in the game actually reduce the possibility space, by popping bubbles or cutting ropes. Nevertheless, the paper clearly describes a really nice framework for solving these kinds of problems. I think it could easily be extended to a wide variety of game types.

I didn’t have time to cover everything mentioned in this particular paper – and as I mentioned in the introduction, there are other papers about Ropossum because it’s such a large piece of research. For example, the evolutionary level generator can work with a human designer to co-operate on a level design. The human designer can fix certain parts of the design and ask the system to redesign the remaining elements, for example – all the while ensuring the results are solvable. Mixed initiative tools like this are going to be an exciting feature of game design in the future.

I spoke to Noor about the research at AIIDE this year, where this paper was presented. She has a great outlook on this kind of procedural generation, and she and her co-authors seem passionate about tackling these harder generation tasks, where you often need to explore huge, real-time state spaces. We often propose the idea of expert AI players being used to test levels, but this isn’t a good solution for so many reasons. If we can find ways to intelligently explore levels to find ways through them, we give ourselves so many more tools to design and analyse the content we generate. Exciting stuff.

Where To Find More

the Platformer AI Competition – she’s also made great contributions to understanding platformer level design, which we’ll cover in a future column) All three are very friendly and active on Twitter, and this work is active and ongoing for Mohammad I believe, so if you’re interested in more information I’d recommend getting in touch with him.

One final note on this edition of The Saturday Paper – thanks to some very kind suggestions and the lovely people at AltDevBlog, The Saturday Papers will now be appearing over there as well, under the title ‘Reading The Papers’ – thanks to Jurie Hornemann for the title. Hello to all our new readers. I’m really excited about hearing new opinions about the column. Don’t forget, there’s a healthy back catalogue of papers you can read over on my website.

Using STL Vectors

Original Author: Thomas Young

(Also posted to series of posts about Vectors and Vector based containers.)

STL style vectors are a pretty useful construct.

I remember coming to C++ (and the STL) from a background in lower level programming, and finding STL vectors a bit surprising, with the idea of amortized constant time complexity and so on.

And then it was actually something of a minor revolution just to be able to set up a contiguous data buffer, without any significant programming effort, and without knowing the maximum required buffer size in advance.

Before we had C++ and the STL I remember a common source of errors being dynamically generated data overflowing fixed buffers, which is something we just don’t see any more, thanks in no small part to the humble vector (and brethren).

But convenience can come at a cost, and there’s some stuff you need to be careful about when applying STL vectors in performance critical situations.

In this post I’ll discuss the main ways in which STL style vectors are applied in PathEngine, and look at some of the key performance and memory points coming out of this.

Prefer vector over list

Generally speaking, contiguous buffers are good, memory allocations and scattered access patterns are to be avoided, and so we're advised to prefer std::vector over std::list in most situations.

See this blog post, for example.

And this has certainly been our experience with PathEngine. In fact, the SDK now uses vector-like data structures almost completely to the exclusion of pointer based data structures (even for things like dynamic meshes).

In many ways I think vectors can offer us the best of both low-level and higher-level coding paradigms, with the flexibility and automatic memory management of dynamically sized containers, but with the performance benefits of low-level direct access to a contiguous data buffer and the simplicity of integer element indexing.

Vector use cases

The stuff we do with vectors can be broadly categorised into two main use cases:

  1. the construction of (potentially complex) preprocess objects, and
  2. run-time buffers for queries, where buffer size requirements are fundamentally dynamic in nature

Preprocess objects include things like the pathfinding visibility graph. These objects can take time to build, but this is something that can be done by the game content pipeline, with preprocess data saved and loaded back from persistence by the game runtime.

Key issues

Over-allocation in preprocess build

It’s a great simplification in preprocess build code to be able to just build objects in one go, with buffers created and written to without the need to calculate buffer sizes in advance. Vectors let us do this, and then also double up as buffer memory management (and buffer size tracking) once the preprocess objects have been built.

But if we’re not careful, due to the way vectors allocate additional capacity, we can easily end up with a lot of memory wasted in over-allocation.

For a vector built with push_back() calls, if we assume that the vector doubles its capacity each time capacity is exceeded by a push_back() operation (a pretty standard implementation), then we’ll end up on average with capacity being 1.5 times the actual final required buffer size.
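
To make the over-allocation concrete, here's a quick check you can run (just a sketch; actual growth factors and starting capacities vary between standard library implementations, so the exact numbers will differ):

#include <vector>
#include <cstdio>

int main()
{
    std::vector<int> v;
    for(int i = 0; i != 33; ++i)
    {
        v.push_back(i);
    }
    // with a doubling growth policy starting from a small capacity this
    // typically reports something like 'size 33, capacity 64', i.e.
    // nearly half the buffer wasted
    std::printf("size %u, capacity %u\n",
            static_cast<unsigned>(v.size()),
            static_cast<unsigned>(v.capacity()));
    return 0;
}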

We can avoid this overallocation with the shrink to fit idiom, e.g. with something like the following:

template <class T> void
ShrinkToFit(T& container)
{
    if(container.capacity() != container.size())
    {
        T(container).swap(container);
    }
}

This has the effect of replacing the vector buffer with a newly allocated buffer of exactly the required size. So we avoid overallocation at the cost of a buffer copy operation (usually not significant at preprocess generation time), whilst also retaining all the convenience of the vector object buffer management (automatic buffer deletion and buffer size query).

With C++11, you can replace the helper function with a call to the new vector::shrink_to_fit() method, which does the same thing (strictly speaking the standard only makes this a non-binding request, but mainstream implementations do reallocate to fit).

Iterator validity

It’s great to be able to work with vector buffer contents while the vectors are being built, but it’s quite a common gotcha to forget that vector resize operations will sometimes invalidate existing iterators, or raw pointers into the buffer.

A great way to sidestep this issue is to use simple integer indices instead of iterators or pointers, and this also has the advantage that indices into a vector remain valid across load and save operations.
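
For example, something like the following (a sketch, not PathEngine code) stays correct even if push_back() reallocates the buffer mid-loop, whereas holding an iterator or raw pointer across those push_back() calls would not:

#include <vector>
#include <cstddef>

// index-based iteration: 'i' stays valid across reallocations, and each
// access goes back through operator[] rather than holding a pointer or
// iterator into the buffer
void AppendAbsoluteValues(std::vector<int>& values)
{
    std::size_t originalSize = values.size();
    for(std::size_t i = 0; i != originalSize; ++i)
    {
        if(values[i] < 0)
        {
            values.push_back(-values[i]); // may trigger a reallocation
        }
    }
}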

Hidden costs for non simple elements

Vectors are good for managing simple data buffers, but things can get costly for elements with significant set-up or teardown costs, or with their own internal memory management.

One specific example is the ‘vector of vectors’ construct.

I’ll talk about this in more detail in a future post, but the short story is that building vectors of vectors can potentially be a big performance hit, and the resulting data structure can also be the source of quite significant memory inefficiencies, and so this construct should be avoided as far as possible.
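
To give a rough idea of the kind of alternative I have in mind (just a sketch, with a hypothetical cFlattenedRows class, not actual PathEngine code): when the sub-containers are built once and then only read, a single flat element buffer plus a vector of row offsets can stand in for a vector of vectors, with far fewer allocations:

#include <vector>
#include <cstddef>

// flattened stand-in for std::vector<std::vector<int> >
// all elements share one contiguous buffer, with _rowOffsets storing the
// start of each row plus one final end offset
class cFlattenedRows
{
    std::vector<int> _elements;
    std::vector<std::size_t> _rowOffsets;
public:
    cFlattenedRows()
    {
        _rowOffsets.push_back(0);
    }
    void addRow(const int* begin, const int* end)
    {
        _elements.insert(_elements.end(), begin, end);
        _rowOffsets.push_back(_elements.size());
    }
    std::size_t numberOfRows() const
    {
        return _rowOffsets.size() - 1;
    }
    std::size_t rowSize(std::size_t row) const
    {
        return _rowOffsets[row + 1] - _rowOffsets[row];
    }
    const int* rowBegin(std::size_t row) const
    {
        // note: only valid once at least one element has been added
        return &_elements[0] + _rowOffsets[row];
    }
};

The trade-off is that individual rows can't be grown after they've been added, but for data built once (in preprocess, say) and then only read, that's usually fine.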

Run-time dynamic buffers

Memory allocation is expensive (in general, but particularly so if you’re sharing a heap with other threads!), and run-time allocations can lead to memory fragmentation, so the ideal situation, if possible, would be to avoid any run-time memory allocations!

In some cases we can avoid run-time allocations, by preallocating suitably sized buffers, but this isn’t always possible.

We can’t always predict query buffer requirements, or if there is a theoretical maximum buffer requirement this may be too large to be practical.

For example, a lot of queries in PathEngine do some work local to a specific part of a ground mesh, and will usually only need to work with a small number of elements (e.g. mesh faces, or whatever), but with the actual number of elements relating to a given query depending on stuff like the complexity of local geometry or the query distance or range.

Ground meshes can potentially be very large, and if we preallocate buffers based on total ground mesh size, but then only ever make locally restricted queries then we’re potentially wasting a lot of memory.

We could choose some arbitrary size for preallocated buffers, but this is inflexible and prone to failure (in a similar way to ‘old style’ MAX_PATH_LENGTH and fixed buffer path formatting), with queries that fail unexpectedly when query complexity takes us just over the preallocated size requirement, and it’s also much nicer if we can support arbitrary query sizes. Maybe the client does want to query over a large area, and can spare the memory for this.

So often the only choice is to then go ahead and embrace dynamically sized run-time buffers.

Vectors in run-time query code

The simplest thing to do is then just to create vectors directly in the code, as and when needed in the query code.

So this could look something like the following:

void
cLocalisedShape::getFacesOverlapped(cVector<cFace>& result)
{
    std::vector<cFace> openQueueBuffer;
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, result);
}

This is a made up example, but the idea is that we’re going to perform a kind of fill operation through local ground mesh, and we need to keep track of ‘open’ faces during the process.

We don’t know how many faces will actually need to be explored, but the fill operation will be restricted by mesh locality, and in general the number of faces explored is expected to be much smaller than the total number of faces in the mesh.

But the problem is that sequentially pushing faces on to the buffer is going to lead to multiple memory allocations, and if the query code is optimised then these memory allocations can easily become a significant bottleneck.

Preallocating at query time (be careful about clear() semantics!)

If we have an idea about common query situations then we could do something like the following:

void
cLocalisedShape::getFacesOverlapped(cVector<cFace>& result)
{
    std::vector<cFace> openQueueBuffer(20);
    openQueueBuffer.clear();
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, result);
}

The idea is to reserve some capacity in the vector to ensure that the buffer doesn’t subsequently need to be expanded in common query situations.

The first thing to note is that the exact semantics of the clear() call here are important.

What we want this to do is to resize the vector to zero, whilst leaving the buffer allocated. If the vector buffer is freed by the clear call then that leaves us worse off than before.

According to the current standard it should be guaranteed behaviour that clear() does not free the underlying buffer, but this wasn't always the case. In some versions of Microsoft Visual Studio calling clear() does free the buffer, but this can be worked around by replacing clear() with resize(0) (which doesn't free it).

In our case we actually use our own STL style class implementation, so we know we can depend on this, but if you use std::vector and your code will run on different platforms and compilation environments then it’s worth checking the actual semantics being applied (e.g. by asserting that the vector capacity is unchanged after the clear).
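
One cheap way to check this is an assert along the following lines (just a sketch, for debug builds):

#include <vector>
#include <cassert>
#include <cstddef>

// sanity check that clear() keeps the allocated buffer on this
// platform / standard library (if this fires, prefer resize(0))
void CheckClearKeepsCapacity()
{
    std::vector<int> v(20);
    std::size_t capacityBefore = v.capacity();
    v.clear();
    assert(v.capacity() == capacityBefore);
    (void)capacityBefore; // avoid an unused variable warning in release builds
}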

Assuming the buffer is not cleared however, this is still not an ideal solution:

  1. there’s still at least one buffer allocation made for every query (we’d rather see none!), and
  2. the starting buffer size is arbitrary, and suffers from the same kind of issues as other similar arbitrary constants – maybe for some clients the common query situations require exploration of 25 to 30 faces..

Holding on to vectors across queries

What we do in PathEngine is to hold on to a standard set of vectors and reuse these across queries (and frames).

The code for this looks something like the following:

void
cLocalisedShape::getFacesOverlapped(cQueryContext& qc, cVector<cFace>& result)
{
    std::vector<cFace>& openQueueBuffer = qc.lockFaceBuffer();
    openQueueBuffer.clear();
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, result);
    qc.releaseFaceBuffer();
}

Query context objects get passed down through the various query code paths as required.

In single threaded use there's just one query context passed in to each query, and in multi-threaded use a pool of query contexts is created. This still isn't an ideal solution, for various reasons (for example, the requirement to pass a query context object down through the code is quite intrusive), and I've got a feeling there's probably a much nicer general way to handle this run-time buffer requirement out there somewhere, but it does the job for us!
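
For reference, the query context side of this can be as simple as something like the following (a minimal sketch with a single face buffer, reusing the lockFaceBuffer() and releaseFaceBuffer() names from the made-up example above; the real PathEngine class holds a set of buffers and differs in the details):

#include <vector>
#include <cassert>

struct cFace
{
    int index; // placeholder, stands in for whatever a real face element holds
};

class cQueryContext
{
    std::vector<cFace> _faceBuffer;
    bool _faceBufferLocked;
public:
    cQueryContext() : _faceBufferLocked(false) {}
    std::vector<cFace>& lockFaceBuffer()
    {
        assert(!_faceBufferLocked); // catch accidental nested use
        _faceBufferLocked = true;
        return _faceBuffer; // capacity persists across queries
    }
    void releaseFaceBuffer()
    {
        _faceBufferLocked = false;
    }
};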

What we get at runtime is then a bunch of allocations triggered when the game starts to call queries, but with the number of buffer reallocations trailing off as the game ‘warms up’. This is generally fine from a performance point of view, and also minimises fragmentation issues.

Note that this leaves us with the same over-allocation issue as with preprocess construction, but in practice runtime buffer sizes are smaller than preprocess buffers and so this has turned out not to be significant, with both allocation time overheads and fragmentation being higher priority issues. (If runtime buffer overallocation is an issue for you, just periodically shrinking to fit all run-time buffers could potentially be good enough.)

Conclusion

The bottom line is that dynamically sized containers make coding easier and, if you know how to use them, STL style vectors can give you this convenience without sacrifices in performance or memory efficiency.

Note that there are issues with the actual std::vector specification, and with standard library std::vector implementations. In some cases you may want to replace std::vector with your own custom vector class, and this is something I’ll cover in more detail in a future post.

The points in this post are intended to apply equally whether you are using std::vector or your own custom STL style vector class.

To iterate the main points:

  1. Look out for over-allocation issues. It’s easy to overlook, and in extreme cases ‘shrinking to fit’ can give you significant memory gains!
  2. Indexed element accessing is a good way to go with STL style vectors.
  3. Be careful about using vectors with non-simple element types.
  4. Make sure you know what’s going on with your vector buffer allocations. Vectors do a good job of hiding the underlying buffer allocation details, which is a good thing, but sometimes we need to know these details. It can be worth explicitly checking the actual allocation semantics being applied. (Check that clear() isn’t freeing the buffer, for example.)
  5. Avoiding runtime allocations can be difficult, but is sometimes the only way to get the best performance, and there are ways to do this if you need to.