Prototyping and Code Quality

Original Author: Timo Heinapurola

Prototyping is a popular method for getting a feel for the core features of a given application. It is often used as the first step when your company is looking into a new prospect with a lot of uncertainty. The method relies on short iterations to quickly verify whether a set of new functionality is worth investing in.

In this article I focus on how code quality is related to a project developed using prototyping. I will first go through some of the main stages of such a project followed by some pit falls that you should avoid when moving into actual production.

The prototyping phase

Let’s imagine you are in the following situation. You have a great idea for an application or a game and the competence to make it. The first stage of development is going to be research. You will start looking at competing products in order to see what their shortcomings are and whether you should tackle them in your product.

After you have figured out a suitable set of features and nailed the higher level vision of the product you can start prototyping. In prototyping you build one or more smaller applications in order to test the core features that were found to be central to your product.  This stage of development consists of a number of very small iterations where your primary goal is to implement features so that decision makers can try them out to see whether they work or not.

At this point in development it’s important not to focus too much on code quality and documentation. Code written for the prototype should be clean enough for you and other developers working on the prototype to be able to develop features with. Supporting a too high quality level slows down your work and does not add much value to the prototype. This is because, as will be discussed later in this article, prototype code is meant to be “thrown away”.

To minimize the amount of work being thrown away, the code should be as light as possible. This means using architectures and frameworks that allow flexibility and don’t lock you in on certain ways of working. For instance, if you are developing a game that is going to have an online component but the feature set you are currently prototyping does not require online play, you can safely omit the online part all together.

Moving on to quality code

Once you are ready to move into production with the software, I would recommend using the prototype as a reference more than an actual initial implementation. This is because you now have a rather detailed map of the path in front of you when it comes to implementing the prototype features in the actual product and keeping the prototype intact helps you navigate in the future.

Reimplementing the features in the prototype has multiple benefits:

  1. You get to focus on code quality without the need to wade through messy prototype code.
  2. The prototype code will remain as a standalone implementation without dependencies on systems that will now change during actual implementation.
  3. You are practically forced to go through and review the algorithms and data structures developed for the prototype.
  4. You have more than just design on paper as you know exactly how you CAN implement the application features. It’s now about making it work with cleaner code and on the required platforms and frameworks.

Architecture

When starting to develop the actual application you should first do some research on the platform and frameworks you have chosen to use and make sure your choices are still valid. You might want to draw a couple of images and write a few lines of design documentation that will lay out the base of your application architecture. Architecture documentation is important but you should not overdo it. I wrote earlier about documentation in general and I think it’s important to keep it on a sensible level. Architecture documentation, however, is the most important kind of documentation as it gives you the frames within which to work. It’s also very difficult to describe architecture in code commenting or code itself. This is because architecture defines aspects like naming conventions and component interoperability that are more global in nature.

The software architecture will follow you throughout the project. It forms the basis of the whole application, it’s form, if you will. This is why it should be clean and clearly communicated and all developers in the project should follow those same guidelines. Good architecture actually forms the basis of good quality code, which is why it deserves its own section in this article. If all code follows the same form it’s easier for everyone to understand the details when the form is already understood. This also leads to less buggy code as developers know instinctively what is happening and where, as you speak the same language with familiar words.

Once you have an idea of the software architecture you can start working on the actual code. Sometimes you will have to make modifications to the architecture during development. It’s not as frequent as making modifications to individual features due to requirements changes but it still happens. These changes should be clearly justified either by significantly improved code quality, performance or just making a crucial feature possible. Changes to architecture should not be made lightly. It’s like changing the constitution of the application, and constant changes will lead to chaos. You should also always modify the whole application to meet the new architecture definition. It’s not a good idea to have remnants of the old world lying around.

When developing the architecture of the application you should also consult people who are going to be developing the application with you. They might have ideas about the high-level implementation of the software. It’s much cheaper to take those ideas into account at this stage of development instead of incorporating them at a later point in time through architecture changes.

Practical tips

Keeping up code quality is a constant process that requires input from every single member of the team. This is why the whole team has to stand by the common guidelines. No single person should dictate developing the software architecture. It’s important to get feedback from all developers as they will spend all their coding hours using it. Minimizing aggravation minimizes the potential of bugs and makes you want to do a better job overall.

I would also recommend leaving the codebase in a better state than you found it in. This will eventually improve the quality and will not require anyone to use a couple of days to get things in order.

Syntax auto-formatting and refactoring in programming tools have been steadily improving making it easier to maintain common coding conventions. They might not be the holy grail and often do little in the form of naming conventions, though. You should always try and use the same terms everywhere in the code with method and property names being formed in the same manner. It might make sense to try and keep an appendix on used terms but I personally would recommend just getting to know the code base and getting a feeling of the different words being used and then just trying to use them in your own code. Although, conventions like always using the word “Is” in front boolean typed properties should be documented in the naming conventions part of the architecture documentation.

Peer review is also a good way of spotting potential issues in code. If done right, you will get valuable feedback on your code. Peer review also has the added benefit of sharing information about the code base. Features implemented by you will be seen by your peers and wise versa. Having someone review your code will show you how understandable the code actually is. It also pushes you to write code you can feel proud of.

In addition to peer review I personally recommend going through your code changes just before pressing the commit button. You should perform a self review, if you will. This way you might recall situations where you thought you would like to do something but had moved on forgetting to write down what it actually was. You also have time to check whether the code is legible, remove debugging code and add commenting if required.

Perhaps the most important tip I can give you is that you should always take code quality seriously. It makes sense from a business point of view but it also helps you as a developer to make your days that much easier and less frustrating!

Summary

Prototyping allows you to figure out what might work and what doesn’t. You should not overdo code quality in this phase as prototype code should be light and quick to write. The application prototype should also be used more as a reference than an actual basis to develop the final product on.

Good quality code, especially in the final product, can sometimes mean the difference between success and failure. Code that is written with predefined conventions and following structures that everyone agrees on makes it easier for new people to get into working with the system while also minimizing misunderstandings and thus those pesky bugs.

The most important part of code quality is a good and clean architecture. This includes anything from component interactions to coding conventions. It forms the basis of communication between teammates and makes code much more readable.

You might also want to read my article “What Documentation?” for more on my thoughts on documenting.

Building an Engine Plugin System

Original Author: Niklas Frykholm

A plugin system is a useful way to allow developers to extend the capabilities of a game engine. Of course, an engine can also be extended by directly modifying the source code, but there are several drawbacks with that approach:

  • Changing the code requires you to recompile the engine. Anyone who wants to modify the engine must have the full source code, access to all the libraries and the build environment set up correctly.

  • Every time you pull changes from upstream you will have to merge your changes with the incoming patches. Over time, this adds up to a significant chunk of work.

  • Since you work directly in the source code, instead of against a published API, refactoring of engine systems might force you to rewrite your code from scratch.

  • There is no easy way to share the modifications you have made with other people.

A plugin system solves all these issues. Plugins can be distributed as compiled DLLs. They are easily shared and you can install them by just putting them in the engine’s plugin folder. Since the plugins use an explicit API, they will continue to work with new versions of the engine (unless backwards compatibility is explicitly broken).

Of course, the plugin API can never cover everything, so there will always be things you can do by modifying the engine that you can’t do through the plugin API. Nevertheless, it is a good complement.

A Tale of Two APIs

When building a plugin system, there are actually two APIs that you need to think about.

The first, and most obvious one, is the API that the plugin exposes to the engine: a set of exported functions that the engine will call at predefined times. For a very basic system, it could look something like this:

__declspec(dllexport) void init();
 
  __declspec(dllexport) void update(float dt);
 
  __declspec(dllexport) void shutdown();

The other API, which usually is a lot more complicated, is the API that the engine exposes to the plugin.

In order to be useful, the plugin will want to call on the engine to do stuff. This can be things like spawning a unit, playing a sound, rendering some meshes, etc. The engine needs to provide some way for plugins to call on these services.

There are a number of ways of doing this. One common solution is to put all the shared functionality in a common DLL and then link both the engine application and the plugin against this DLL.

plugin_1

The drawback of this approach is that the more functionality that the plugins need access to, the more must go in the shared DLL. Eventually you end up with most of the engine in the shared DLL, which is pretty far from the clean and simple APIs that we strive for.

This creates a very strong coupling between the engine and the plugins. Every time we want to modify something in the engine, we will probably have to modify the shared DLL and thus likely break all of the plugins.

As anyone who has read my previous articles know I really don’t like these kinds of strong couplings. They prevent you from rewriting and refactoring your systems and thus eventually cause your code to stagnate.

Another approach is to let the engine’s scripting language (in our case Lua) work as the engine’s API. With this approach, any time a plugin wanted the engine to do something it would use a Lua call.

For lots of applications I think this can be a really good solution, but in our case it doesn’t seem like a perfect fit. First, the plugins will need access to a lot of stuff that is more “low level” than what you can access from Lua. And I’m not to keen on exposing all of the engine’s innards to Lua. Second, since both the plugins and the engine are written in C++, marshalling all the calls between them through Lua seems both overly complicated and inefficient.

I prefer to have an interface that is minimalistic, data-oriented and C-based (because of C++ ABI compatibility issues and also because of… well… C++).

Interface Querying

Instead of linking the plugin against a DLL that provides the engine API. We can send the engine API to the plugin when we initialize it. Something like this (a simplified example):

plugin_api.h:

typedef struct EngineApi
 
  {
 
  	void (*spawn_unit)(World *world, const char *name, float pos[3]);
 
  	...
 
  } EngineApi;

plugin.h

#include "plugin_api.h"
 
  
 
  __declspec(dllexport) void init(EngineApi *api);
 
  __declspec(dllexport) void update(float dt);
 
  __declspec(dllexport) void shutdown();

This is pretty good. The plugin develeoper does not need to link against anything, just include the header file plugin_api.h, and then she can call the functions in the EngineApi struct to tell the engine to do stuff.

The only thing that is missing is versioning support.

At some point in the future we probably want to modify the EngineApi. Perhaps we discover that we want to add a rotation argument to spawn_unit() or somehting else.

We can achieve this by introducing versioning in the system. Instead of sending the engine API directly to the plugin, we send the plugin a function that lets it query for a specific version of the engine API.

With this approach, we can also break the API up into smaller submodules that can be queried for individually. This gives us a cleaner organization.

plugin_api.h

#define WORLD_API_ID    0
 
  #define LUA_API_ID      1
 
  
 
  typedef struct World World;
 
  
 
  typedef struct WorldApi_v0 {
 
  	void (*spawn_unit)(World *world, const char *name, float pos[3]);
 
  	...
 
  } WorldApi_v0;
 
  
 
  typedef struct WorldApi_v1 {
 
  	void (*spawn_unit)(World *world, const char *name, float pos[3], float rot[4]);
 
  	...
 
  } WorldApi_v1;
 
  
 
  typedef struct lua_State lua_State;
 
  typedef int (*lua_CFunction) (lua_State *L);
 
  
 
  typedef struct LuaApi_v0 {
 
  	void (*add_module_function)(const char *module, const char *name, lua_CFunction f);
 
  	...
 
  } LuaApi_v0;
 
  
 
  typedef void *(*GetApiFunction)(unsigned api, unsigned version);

When the engine instances the plugin, it passes along get_engine_api(), which the plugin can use to get hold of different engine APIs.

The plugin will typically set up the APIs in the init() function:

static WorldApi_v1 *_world_api = nullptr;
 
  static LuaApi_v0 *_lua_api = nullptr;
 
  
 
  void init(GetApiFunction get_engine_api)
 
  {
 
  	_world_api = (WorldApi_v1 *)get_engine_api(WORLD_API, 1);
 
  	_lua_api = (LuaApi_v0 *)get_engine_api(LUA_API, 0);
 
  }

Later, the plugin case use these APIs:

_world_api->spawn_unit(world, "player", pos);

If we need to make a breaking change to an API, we can just introduce a new version of that API. As long as get_engine_api() can still return the old API version when requested for it, all existing plugins will continue to work.

With this querying system in place for the engine, it makes sense to use the same approach for the plugin as well. I.e. instead of exposing individual functions init(), update(), etc, the plugin just exposes a single function get_plugin_api() which the engine can use in the same way to query APIs from the plugin.

plugin_api.h

#define PLUGIN_API_ID 2
 
  
 
  typedef struct PluginApi_v0
 
  {
 
  	void (*init)(GetApiFunction get_engine_api);
 
  	...
 
  } PluginApi_v0;

plugin.c

__declspec(dllexport) void *get_plugin_api(unsigned api, unsigned version);

Since we now have versioning on the plugin API as well, this means we can modify it (add new required functions, etc) without breaking existing plugins.

Putting It All Together

Putting all this together, here is a complete (but very small) example of a plugin that exposes a new function to the Lua layer of the engine:

plugin_api.h

#define PLUGIN_API_ID       0
 
  #define LUA_API_ID          1
 
  
 
  typedef void *(*GetApiFunction)(unsigned api, unsigned version);
 
  
 
  typedef struct PluginApi_v0
 
  {
 
  	void (*init)(GetApiFunction get_engine_api);
 
  } PluginApi_v0;
 
  
 
  typedef struct lua_State lua_State;
 
  typedef int (*lua_CFunction) (lua_State *L);
 
  
 
  typedef struct LuaApi_v0
 
  {
 
  	void (*add_module_function)(const char *module, const char *name, lua_CFunction f);
 
  	double (*to_number)(lua_State *L, int idx);
 
  	void (*push_number)(lua_State *L, double number);
 
  } LuaApi_v0;

plugin.c

#include "plugin_api.h"
 
  
 
  LuaApi_v0 *_lua;
 
  
 
  static int test(lua_State *L)
 
  {
 
  	double a = _lua->to_number(L, 1);
 
  	double b = _lua->to_number(L, 2);
 
  	_lua->push_number(L, a+b);
 
  	return 1;
 
  }
 
  
 
  static void init(GetApiFunction get_engine_api)
 
  {
 
  	_lua = get_engine_api(LUA_API_ID, 0);
 
  
 
  	if (_lua)
 
  		_lua->add_module_function("Plugin", "test", test);
 
  }
 
  
 
  __declspec(dllexport) void *get_plugin_api(unsigned api, unsigned version)
 
  {
 
  	if (api == PLUGIN_API_ID && version == 0) {
 
  		static PluginApi_v0 api;
 
  		api.init = init;
 
  		return &api;
 
  	}
 
  	return 0;
 
  }

engine.c

// Initialized elsewhere.
 
  LuaEnvironment *_env = 0;
 
  
 
  void add_module_function(const char *module, const char *name, lua_CFunction f)
 
  {
 
  	_env->add_module_function(module, name, f);
 
  }
 
  
 
  void *get_engine_api(unsigned api, unsigned version)
 
  {
 
  	if (api == LUA_API_ID && version == 0 && _env) {
 
  		static LuaApi_v0 lua;
 
  		lua.add_module_function = add_module_function;
 
  		lua.to_number = lua_tonumber;
 
  		lua.push_number = lua_pushnumber;
 
  		return &lua;
 
  	}
 
  	return 0;
 
  }
 
  
 
  void load_plugin(const char *path)
 
  {
 
  	HMODULE plugin_module = LoadLibrary(path);
 
  	if (!plugin_module) return;
 
  	GetApiFunction get_plugin_api = (GetApiFunction)GetProcAddress(plugin_module, "get_plugin_api");
 
  	if (!get_plugin_api) return;
 
  	PluginApi_v0 *plugin = (PluginApi_v0 *)get_plugin_api(PLUGIN_API_ID, 0);
 
  	if (!plugin) return;
 
  	plugin->init(get_engine_api);
 
  }

This has also been posted to The Bitsquid blog.

Multi-Threaded Programming 3: Locking, Lock-Free, Wait-Free

Original Author: Joseph Simons

Now I want to cover the basic terminology used when talking about concurrent algorithms. This will be helpful so that you are familiar with the available techniques you have when you have multiple threads operating together. The term concurrent itself refers to an algorithm that can scale from being synchronous on a single thread to utilizing multiple threads at the same time. The terms I am going to cover also exist independently of any architecture, though the actual implementation will vary due to the underlying hardware differences. I am going to cover them in order of increasing difficulty to implement, so if you are interested in implementing an algorithm of your own, I would suggest starting with the simplest implementation and moving on only as needed.

If you want to read up on other things that I have covered, then here are my previous posts in this series:

 

locking

Locking

In the previous post, we ended with an example of a mutex lock. This is the easiest and most straightforward method available in order to restrict access to a particular area of code for only a single thread at a time. However, as you may have surmised, it will also have the worst performance when dealing with many threads all accessing the same area at once. This is because the threads are restricted to operating serially, so you basically have given up on the performance benefits of using multiple threads for the area where you lock. But if it is an area that is accessed infrequently, or there is only a small chance that multiple threads will be in the locked area at the same time, it can easily be a “good enough” solution.

Locking is a good first step when you are dealing with data that can be read and modified by multiple threads. As we saw with the explanation of caches, when you modify data on one thread that change will quickly propagate to other threads that then read that data. However, the trickiness lies in the fact that these modifications can occur at any point within your algorithm. So by forcing the algorithm to operate in a serial manner we prevent any unexpected changes that can occur and disrupt what we expect to happen.

While debugging mutex locks can be fairly simple (as far as threading problems go), there are several ways that things can go wrong. If you are using locks and find that program has stopped responding, simply pause it and examine it in the debugger. Chances are that one of the following three things has occurred.

The first pitfall you might have fallen into is what is referred to as a deadlock. This occurs when the operation of your program results in such a manner where a lock is obtained but is never released. This causes your program to halt because no further progress can be made because all the threads are waiting on something that is never going to happen.

One cause of this can be due to not acquiring locks in the same manner in different places in your code. So if you acquire Lock A then Lock B in one place and Lock B then Lock A in another, you can get in a situation where one has Lock A locked and tried to lock Lock B, but can’t because Lock B has already been acquired. Since it has Lock A already, the thread that is holding Lock B cannot get it and so never unlocks Lock B. If you acquire the locks always in the Lock A then Lock B direction, then you can prevent this problem from ever occurring.

DeadLockThreads-1

 

Another cause is simply attempting to use the same lock in multiple places, but you have code that calls to lock multiple times without a resulting unlock. You can easily design a re-entrant mutex that can handle this sort of use case if desired, but make sure that this behavior is what you want.

DeadLockThreads-2

 

The final case that I will cover is likely the hardest to ascertain. When you have multiple threads of different priorities, you need to be careful when it comes to sharing locks. If you have a case where there is low priority, medium priority, and high priority thread, and the low and high priority threads both execute in the same critical section, then you need to be wary of random boosting in Windows, that can help avoid this situation for you.

DeadLockThreads-3

 

Lock-Free

When using atomic operations, we are also forcing a linear execution on our program, but in a much smaller scope. However, they do not fall prey to the things that can go wrong when using locks. Therefore, if we are not using any locks, then our program will always be able to advance. If our algorithm is guaranteed to never wait (even if it has to repeat some of its work), then it is considered a lock-free algorithm. Another way of thinking about it is if you were to arbitrarily pause any thread that is in your algorithm, it will not affect any of the other threads in there.

However, when dealing with lock-free algorithms, even though you do not need to worry about deadlocks as described above, you have a whole new set of problems. The main issue when writing code that doesn’t force linear execution for large sections is that almost anything can happen at anytime. When thinking through the code, you have to take into account that any other thread can come along and perform operations, such as changing a variable, adding / removing a node, or reorganizing the data. Because of this, it is common with lock-free algorithms to continuously test the assumptions before performing an action. If the assumed state is incorrect, then lock-free algorithms typically will loop and repeat their work until they can safely proceed. So, if an area of code has low contention, then using locks might actually lead to faster execution.

However, just testing the state isn’t without its caveats. One particularly nasty thing that can happen is known as the ABA problem. The idea here is that even if you are making sure that you are operating on the same address, by testing it with a Compare-And-Swap atomic action, it can still fail. If the address was recycled in some way (such as being deleted than allocated again, or popped then pushed again) then even though it is at the same location, it is not the same data. This can lead to crashes or lost data, and is very difficult to track down due to the fact that it can be extremely rare.

ABAProblem

 

Wait-Free

Now, if you have an algorithm that will never repeat work and will always advance, then you have a wait-free algorithm. Because of their properties, they will always finish in a fixed number of steps. These can be exceedingly difficult to implement and it is rare to actually see them in use, though technically any algorithm can be certain circumstances). This means that for games, where performance is typically the highest goal and the number of threads is low, lock-free algorithms are generally the best option for highly threaded, time-critical areas.

 

Final Considerations

Never underestimate the power of a “good enough” solution. Because multi-threaded programming carries with it a high chance of hard to detect bugs, a tiny bit of performance tradeoff for a more stable codebase can definitely be worth it. Therefore, it is actually good practice to start with using simple locks to protect shared data. Then, when you are certain that those locks are a bottleneck, investigate into lock-free solutions. Also, because writing good lock-free algorithms can be difficult, it is usually a good idea to keep an eye out for already implemented libraries that of. These will help reduce  possible implementation errors, since simple tests might fail to catch all the problems that can occur. Especially since the issues can be very dependent on rare, timing specific situations.

It is also a very good idea to redesign your code so that you minimize the areas where multiple threads can interact with each other. This gives you the benefits of parallelism without many the issues that can occur (such as all those described earlier). We will look at ways to do this later in the series, but I wanted the reader to first be familiar with some of the difficulties that can come with multi-threaded programming. That way, if you encounter a situation where you have to worry about multiple threads executing on the same data, you have a good grasp of the problem space going into it.

 

See You Next Time

Next Time

For the next post, we are going to look at a basic locking queue implementation. For the post after that, we will then see a lock-free queue implementation where I will cover why we need to make the changes that we did in order to convert it. For the following post, we will see if we can optimize the lock-free queue further, and some pitfalls to watch out for while doing so. Because of the difficulties with wait-free algorithms and the fact they typically don’t have much real-world benefit, we will not be covering those further.

 

Further Reading

If you want to read some more about locking, lock-free, and wait-free algorithms, feel free to check out the following areas:

Custom Vector Allocation

Original Author: Thomas Young

(Also posted to series of posts about Vectors and Vector based containers.)

A few posts back I talked about the idea of ‘rolling your own’ STL-style vector class, based my experiences with this at PathEngine.In two follow-ups I talked about the general approach and also some specific performance tweaks that actually helped in practice for our vector use cases.I haven’t talked about custom memory allocation yet, however. This is something that’s been cited in a number of places as a key reason for switching away from std::vector so I’ll come back now and look at the approach we took for this (which is pretty simple, but nonstandard, and also pre C++11), and assess some of the implications of using this kind of non-standard approach.

I approach this from the point of view of a custom vector implementation, but I’ll be talking about some issues with memory customisation that also apply more generally.

Why custom allocation?

In many situations it’s fine for vectors (and other containers) to just use the same default memory allocation method as the rest of your code, and this is definitely the simplest approach.

(The example vector code I posted previously used malloc() and free(), but works equally well with global operator new and delete.)

But vectors can do a lot of memory allocation, and memory allocation can be expensive, and it’s not uncommon for memory allocation operations to turn up in profiling as the most significant cost of vector based code. Custom memory allocation approaches can help resolve this.

And some other good reasons for hooking into and customising allocations can be the need to avoid memory fragmentation or to track memory statistics.

For these reasons generalised memory customisation is an important customer requirement for our SDK code in general, and then by extension for the vector containers used by this code.

Custom allocation in std::vector

The STL provides a mechanism for hooking into the container allocation calls (such as vector buffer allocations) through allocators, with vector constructors accepting an allocator argument for this purpose.

I won’t attempt a general introduction to STL allocators, but there’s a load of material about this on the web. See, for example, this article on Dr Dobbs, which includes some example use cases for allocators. (Bear in mind that this is pre C++11, however. I didn’t see any similarly targeted overview posts for using allocators post C++11.)

A non-standard approach

We actually added the possibility to customise memory allocation in our vectors some time after switching to a custom vector implementation. (This was around mid-2012. Before that PathEngine’s memory customisation hooks worked by overriding global new and delete, and required dll linkage if you wanted to manage PathEngine memory allocations separately from allocations in the main game code.)

We’ve generally tried to keep our custom vector as similar as possible to std::vector, in order to avoid issues with unexpected behaviour (since a lot of people know how std::vector works), and to ensure that code can be easily switched between std::vector and our custom vector. When it came to memory allocation, however, we chose a significantly different (and definitely non-standard) approach, because in practice a lot of vector code doesn’t actually use allocators (or else just sets allocators in a constructor), because we already had a custom vector class in place, and because I just don’t like STL allocators!

Other game developers

A lot of other game developers have a similar opinion of STL allocators, and for many this is actually then also a key factor in a decision to switch to custom container classes.

For example, issues with the design of STL allocators are quoted as one of the main reasons for the creation of the EASTL, a set of STL replacement classes, by Electronic Arts. From the EASTL paper:

Among game developers the most fundamental weakness is the std allocator design, and it is this weakness that was the largest contributing factor to the creation of EASTL.

And I’ve heard similar things from other developers. For example, in this blog post about the Bitsquid approach to allocators Niklas Frykholm says:

If it weren’t for the allocator interface I could almost use STL. Almost.

Let’s have a look at some of the reasons for this distaste!

Problems with STL allocators

We’ll look at the situation prior to C++11, first of all, and the historical basis for switching to an alternative mechanism.

A lot of problems with STL allocators come out of confusion in the initial design. According to Alexander Stepanov (primary designer and implementer of the STL) the custom allocator mechanism was invented to deal with a specific issue with Intel memory architecture. (Do you remember near and far pointers? If not, consider yourself lucky I guess!) From this interview with Alexander:

Question: How did allocators come into STL? What do you think of them?

Answer: I invented allocators to deal with Intel’s memory architecture. They are not such a bad ideas in theory – having a layer that encapsulates all memory stuff: pointers, references, ptrdiff_t, size_t. Unfortunately they cannot work in practice.

And it seems like this original design intention was also only partially executed. From the wikipedia entry for allocators:

They were originally intended as a means to make the library more flexible and independent of the underlying memory model, allowing programmers to utilize custom pointer and reference types with the library. However, in the process of adopting STL into the C++ standard, the C++ standardization committee realized that a complete abstraction of the memory model would incur unacceptable performance penalties. To remedy this, the requirements of allocators were made more restrictive. As a result, the level of customization provided by allocators is more limited than was originally envisioned by Stepanov.

and, further down:

While Stepanov had originally intended allocators to completely encapsulate the memory model, the standards committee realized that this approach would lead to unacceptable efficiency degradations. To remedy this, additional wording was added to the allocator requirements. In particular, container implementations may assume that the allocator’s type definitions for pointers and related integral types are equivalent to those provided by the default allocator, and that all instances of a given allocator type always compare equal, effectively contradicting the original design goals for allocators and limiting the usefulness of allocators that carry state.

Some of the key problems with STL allocators (historically) are then:

  • Unnecessary complexity, with some boiler plate stuff required for features that are not actually used
  • A limitation that allocators cannot have internal state (‘all instances of a given allocator type are required to be interchangeable and always compare equal to each other’)
  • The fact the allocator type is included in container type (with changes to allocator type changing the type of the container)

There are some changes to this situation with C++11, as we’ll see below, but this certainly helps explain why a lot of people have chosen to avoid the STL allocator mechanism, historically!

Virtual allocator interface

So we decided to avoid STL allocators, and use a non-standard approach.

The approach we use is based on a virtual allocator interface, and avoids the need to specify allocator type as a template parameter.

This is quite similar to the setup for allocators in the BitSquid engine, as described by Niklas here (as linked above, it’s probably worth reading that post if you didn’t see this already, as I’ll try to avoid repeating the various points he discussed there).

A basic allocator interface can then be defined as follows:

class iAllocator
 
  {
 
  public:
 
      virtual ~iAllocator() {}
 
      virtual void* allocate(tUnsigned32 size) = 0;
 
      virtual void deallocate(void* ptr) = 0;
 
  // helper
 
      template <class T> void
 
      allocate_Array(tUnsigned32 arraySize, T*& result)
 
      {
 
          result = static_cast<T*>(allocate(sizeof(T) * arraySize));
 
      }
 
  };

The allocate_Array() method is for convenience, concrete allocator objects just need to implement allocate() and free().

We can store a pointer to iAllocator in our vector, and replace the direct calls to malloc() and free() with virtual function calls, as follows:

    static T*
 
      allocate(size_type size)
 
      {
 
          T* allocated;
 
          _allocator->allocate_Array(size, allocated);
 
          return allocated;
 
      }
 
      void
 
      reallocate(size_type newCapacity)
 
      {
 
          T* newData;
 
          _allocator->allocate_Array(newCapacity, newData);
 
          copyRange(_data, _data + _size, newData);
 
          deleteRange(_data, _data + _size);
 
          _allocator->deallocate(_data);
 
          _data = newData;
 
          _capacity = newCapacity;
 
      }

These virtual function calls potentially add some overhead to allocation and deallocation. It’s worth being quite careful about this kind of virtual function call overhead, but in practice it seems that the overhead is not significant here. Virtual function call overhead is often all about cache misses and, perhaps because there are often just a small number of actual allocator instance active, with allocations tending to be grouped by allocator, this just isn’t such an issue here.

We use a simple raw pointer for the allocator reference. Maybe a smart pointer type could be used (for better modern C++ style and to increase safety), but we usually want to control allocator lifetime quite explicitly, so we’re basically just careful about this.

Allocators can be passed in to each vector constructor, or if omitted will default to a ‘global allocator’ (which adds a bit of extra linkage to our vector header):

    cVector(size_type size, const T& fillWith,
 
          iAllocator& allocator = GlobalAllocator()
 
          )
 
      {
 
          _data = 0;
 
          _allocator = &allocator;
 
          _size = size;
 
          _capacity = size;
 
          if(size)
 
          {
 
              _allocator->allocate_Array(_capacity, _data);
 
              constructRange(_data, _data + size, fillWith);
 
          }
 
      }

Here’s an example concrete allocator implementation:

class cMallocAllocator : public iAllocator
 
  {
 
  public:
 
      void*
 
      allocate(tUnsigned32 size)
 
      {
 
          assert(size);
 
          return malloc(static_cast<size_t>(size));
 
      }
 
      void
 
      deallocate(void* ptr)
 
      {
 
          free(ptr);
 
      }
 
  };

(Note that you normally can call malloc() with zero size, but this is something that we disallow for PathEngine allocators.)

And this can be passed in to vector construction as follows:

    cMallocAllocator allocator;
 
      cVector<int> v(10, 0, allocator);

Swapping vectors

That’s pretty much it, but there’s one tricky case to look out for.

Specifically, what should happen in our vector swap() method? Let’s take a small diversion to see why there might be a problem.

Consider some code that takes a non-const reference to vector, and ‘swaps a vector out’ as a way of returning a set of values in the vector without the need to heap allocate the vector object itself:

class cVectorBuilder
 
  {
 
      cVector<int> _v;
 
  public:
 
      //.... construction and other building methods
 
      void takeResult(cVector<int>& result); // swaps _v into result
 
  };

So this code doesn’t care about allocators, and just wants to work with a vector of a given type. And maybe there is some other code that uses this, as follows:

void BuildData(/*some input params*/, cVector& result)
 
  {
 
    //.... construct a cVectorBuilder and call a bunch of build methods
 
      builder.takeResult(result);
 
  }

Now there’s no indication that there’s going to be a swap() involved, but the result vector will end up using the global allocator, and this can potentially cause some surprises in the calling code:

   cVector v(someSpecialAllocator);
 
     BuildData(/*input params*/, v);
 
     // lost our allocator assignment!
 
     // v now uses the global allocator

Nobody’s really doing anything wrong here (although this isn’t really the modern C++ way to do things). This is really a fundamental problem arising from the possibility to swap vectors with different allocators, and there are other situations where this can come up.

You can find some discussion about the possibilities for implementing vector swap with ‘unequal allocators’ here. We basically choose option 1, which is to simply declare it illegal to call swap with vectors with different allocators. So we just add an assert in our vector swap method that the two allocator pointers are equal.

In our case this works out fine, since this doesn’t happen so much in practice, because cases where this does happen are caught directly by the assertion, and because it’s generally straightforward to modify the relevant code paths to resolve the issue.

Comparison with std::vector, is this necessary/better??

Ok, so I’ve outlined the approach we take for custom allocation in our vector class.

This all works out quite nicely for us. It’s straightforward to implement and to use, and consistent with the custom allocators we use more generally in PathEngine. And we already had our custom vector in place when we came to implement this, so this wasn’t part of the decision about whether or not to switch to a custom vector implementation. But it’s interesting, nevertheless, to compare this approach with the standard allocator mechanism provided by std::vector.

My original ‘roll-your-own vector’ blog post was quite controversial. There were a lot of responses strongly against the idea of implementing a custom vector, but a lot of other responses (often from the game development industry side) saying something like ‘yes, we do that, but we do some detail differently’, and I know that this kind of customisation is not uncommon in the industry.

These two different viewpoints makes it worthwhile to explore this question in a bit more detail, then, I think.

I already discussed the potential pitfalls of switching to a custom vector implementation in the original ‘roll-your-own vector’ blog post, so lets look at the potential benefits of switching to a custom allocator mechanism.

Broadly speaking, this comes down to three key points:

  • Interface complexity
  • Stateful allocator support
  • Possibilities for further customisation and memory optimisation

Interface complexity

If we look at an example allocator implementation for each setup we can see that there’s a significant difference in the amount of code required. The following code is taken from my previous post, and was used to fill allocated memory with non zero values, to check for zero initialisation:

// STL allocator version
 
  template <class T>
 
  class cNonZeroedAllocator
 
  {
 
  public:
 
      typedef T value_type;
 
      typedef value_type* pointer;
 
      typedef const value_type* const_pointer;
 
      typedef value_type& reference;
 
      typedef const value_type& const_reference;
 
      typedef typename std::size_t size_type;
 
      typedef std::ptrdiff_t difference_type;
 
      template <class tTarget>
 
      struct rebind
 
      {
 
          typedef cNonZeroedAllocator<tTarget> other;
 
      };
 
      cNonZeroedAllocator() {}
 
      ~cNonZeroedAllocator() {}
 
      template <class T2>
 
      cNonZeroedAllocator(cNonZeroedAllocator<T2> const&)
 
      {
 
      }
 
      pointer
 
      address(reference ref)
 
      {
 
          return &ref;
 
      }
 
      const_pointer
 
      address(const_reference ref)
 
      {
 
          return &ref;
 
      }
 
      pointer
 
      allocate(size_type count, const void* = 0)
 
      {
 
          size_type byteSize = count * sizeof(T);
 
          void* result = malloc(byteSize);
 
          signed char* asCharPtr;
 
          asCharPtr = reinterpret_cast<signed char*>(result);
 
          for(size_type i = 0; i != byteSize; ++i)
 
          {
 
              asCharPtr[i] = -1;
 
          }
 
          return reinterpret_cast<pointer>(result);
 
      }
 
      void deallocate(pointer ptr, size_type)
 
      {
 
          free(ptr);
 
      }
 
  
 
      size_type
 
      max_size() const
 
      {
 
          return 0xffffffffUL / sizeof(T);
 
      }
 
      void
 
      construct(pointer ptr, const T& t)
 
      {
 
          new(ptr) T(t);
 
      }
 
      void
 
      destroy(pointer ptr)
 
      {
 
          ptr->~T();
 
      }
 
      template <class T2> bool
 
      operator==(cNonZeroedAllocator<T2> const&) const
 
      {
 
          return true;
 
      }
 
      template <class T2> bool
 
      operator!=(cNonZeroedAllocator<T2> const&) const
 
      {
 
          return false;
 
      }
 
  };

But with our custom allocator interface this can now be implemented as follows:

// custom allocator version
 
  class cNonZeroedAllocator : public iAllocator
 
  {
 
  public:
 
      void*
 
      allocate(tUnsigned32 size)
 
      {
 
          void* result = malloc(static_cast<size_t>(size));
 
          signed char* asCharPtr;
 
          asCharPtr = reinterpret_cast<signed char*>(result);
 
          for(tUnsigned32 i = 0; i != size; ++i)
 
          {
 
              asCharPtr[i] = -1;
 
          }
 
          return result;
 
      }
 
      void
 
      deallocate(void* ptr)
 
      {
 
          free(ptr);
 
      }
 
  };

As we saw previously a lot of stuff in the STL allocator relates to some obsolete design decisions, and is unlikely to actually be used in practice. The custom allocator interface also completely abstracts out the concept of constructed object type, and works only in terms of actual memory sizes and pointers, which seems more natural and whilst doing everything we need for the allocator use cases in PathEngine.

For me this is one advantage of the custom allocation setup, then, although probably not something that would by itself justify switching to a custom vector.

If you use allocators that depend on customisation of the other parts of the STL allocator interface (other than for data alignment) please let me know in the comments thread. I’m quite interested to hear about this! (There’s some discussion about data alignment customisation below.)

Stateful allocator requirement

Stateful allocator support is a specific customer requirement for PathEngine.

Clients need to be able to set custom allocation hooks and have all allocations made by the SDK (including vector buffer allocations) routed to custom client-side allocation code. Furthermore, multiple allocation hooks can be supplied, with the actual allocation strategy selected depending on the actual local execution context.

It’s not feasible to supply allocation context to all of our vector based code as a template parameter, and so we need our vector objects to support stateful allocators.

Stateful allocators with the virtual allocator interface

Stateful allocators are straightforward with our custom allocator setup. Vectors can be assigned different concrete allocator implementations and these concrete allocator implementations can include internal state, without code that works on the vectors needing to know anything about these details.

Stateful allocators with the STL

As discussed earlier, internal allocator state is something that was specifically forbidden by the original STL allocator specification. This is something that has been revisited in C++11, however, and stateful allocators are now explicitly supported, but it also looks like it’s possible to use stateful allocators in practice with many pre-C++11 compile environments.

The reasons for disallowing stateful allocators relate to two specific problem situations:

  • Splicing nodes between linked lists with different allocation strategies
  • Swapping vectors with different allocation strategies

C++11 addresses these issues with This stackoverflow answer discusses what happens, specifically, with C++11, in the vector swap case.

With PathEngine we want to be able to support clients with different compilation environments, and it’s an advantage not to require C++11 support. But according to this stackoverflow answer, you can also actually get away with using stateful allocators in most cases, without explicit C++11 support, as long as you avoid these problem cases.

Since we already prohibit the vector problem case (swap with unequal allocators), that means that we probably can actually implement our stateful allocator requirement with std::vector and STL allocators in practice, without requiring C++11 support.

There’s just one proviso, with or without C++11 support, due to allowances for legacy compiler behaviour in allocator traits. Specifically, it doesn’t look like we can get the same assertion behaviour in vector swap. If propagate_on_container_swap::value is set to false for either allocator then the result is ‘undefined behaviour’, so this could just swap the allocators silently, and we’d have to be quite careful about these kinds of problem cases!

Building on stateful allocators to address other issues

If you can use stateful allocators with the STL then this changes things a bit. A lot of things become possible just by adding suitable internal state to standard STL allocator implementations. But you can also now use this allocator internal state as a kind of bootstrap to work around other issues with STL allocators.

The trick is wrap up the same kind of virtual allocator interface setup we use in PathEngine in an STL allocator wrapper class. You could do this (for example) by putting a pointer to our iAllocator interface inside an STL allocator class (as internal state), and then forward the actual allocation and deallocation calls as virtual function calls through this pointer.

So, at the cost of another layer of complexity (which can be mostly hidden from the main application code), it should now be possible to:

  • remove unnecessary boiler plate from concrete allocator implementations (since these now just implement iAllocator), and
  • use different concrete allocator types without changing the actual vector type.

Although I’m still not keen on STL allocators, and prefer the direct simplicity of our custom allocator setup as opposed to covering up the mess of the STL allocator interface in this way, I have to admit that this does effectively remove two of the key benefits of our custom allocator setup. Let’s move on to the third point, then!

Refer to this presentation about bloomberg allocators in the context C++11 allocator changes).

Memory optimisation

The other potential benefit of custom allocation over STL allocators is basically the possibility to mess around with the allocation interface.

With STL allocators we’re restricted to using the allocate() and deallocate() methods exactly as defined in the original allocator specification. But with our custom allocator we’re basically free to mess with these method definitions (in consultation with our clients!), or to add additional methods, and generally change the interface to better suit our clients needs.

There is some discussion of this issue in this proposal for improving STL allocators, which talks about ways in which the memory allocation interface provided by STL allocators can be sub-optimal.

Some customisations implemented in the Bitsquid allocators are:

  • an ‘align’ parameter for the allocation method, and
  • a query for the size of allocated blocks

PathEngine allocators don’t include either of these customisations, although this is stuff that we can add quite easily if required by our clients. Our allocator does include the following extra methods:

    virtual void*
 
      expand(
 
              void* oldPtr,
 
              tUnsigned32 oldSize,
 
              tUnsigned32 oldSize_Used,
 
              tUnsigned32 newSize
 
              ) = 0;
 
  // helper
 
      template <class T> void
 
      expand_Array(
 
              T*& ptr,
 
              tUnsigned32 oldArraySize,
 
              tUnsigned32 oldArraySize_Used,
 
              tUnsigned32 newArraySize
 
              )
 
      {
 
          ptr = static_cast<T*>(expand(
 
              ptr,
 
              sizeof(T) * oldArraySize,
 
              sizeof(T) * oldArraySize_Used,
 
              sizeof(T) * newArraySize
 
              ));
 
      }

What this does, essentially, is to provide a way for concrete allocator classes to use the realloc() system call, or similar memory allocation functionality in a custom head, if this is desired.

As before, the expand_Array() method is there for convenience, and concrete classes only need to implement the expand() method. This takes a pointer to an existing memory block, and can either add space to the end of this existing block (if possible), or allocate a larger block somewhere else and move existing data to that new location (based on the oldSize_Used parameter).

Implementing expand()

A couple of example implementations for expand() are as follows:

// in cMallocAllocator, using realloc()
 
      void*
 
      expand(
 
          void* oldPtr,
 
          tUnsigned32 oldSize,
 
          tUnsigned32 oldSize_Used,
 
          tUnsigned32 newSize
 
          )
 
      {
 
          assert(oldPtr);
 
          assert(oldSize);
 
          assert(oldSize_Used <= oldSize);
 
          assert(newSize > oldSize);
 
          return realloc(oldPtr, static_cast<size_t>(newSize));
 
      }
// as allocate and move
 
      void*
 
      expand(
 
          void* oldPtr,
 
          tUnsigned32 oldSize,
 
          tUnsigned32 oldSize_Used,
 
          tUnsigned32 newSize
 
          )
 
      {
 
          assert(oldPtr);
 
          assert(oldSize);
 
          assert(oldSize_Used <= oldSize);
 
          assert(newSize > oldSize);
 
          void* newPtr = allocate(newSize);
 
          memcpy(newPtr, oldPtr, static_cast<size_t>(oldSize_Used));
 
          deallocate(oldPtr);
 
          return newPtr;
 
      }

So this can either call through directly to something like realloc(), or emulate realloc() with a sequence of allocation, memory copy and deallocation operations.

Benchmarking with realloc()

With this expand() method included in our allocator it’s pretty straightforward to update our custom vector to use realloc(), and it’s easy to see how this can potentially optimise memory use, but does this actually make a difference in practice?

I tried some benchmarking and it turns out that this depends very much on the actual memory heap implementation in use.

I tested this first of all with the following simple benchmark:

template <class tVector> static void
 
  PushBackBenchmark(tVector& target)
 
  {
 
      const int pattern[] = {0,1,2,3,4,5,6,7};
 
      const int patternLength = sizeof(pattern) / sizeof(*pattern);
 
      const int iterations = 10000000;
 
      tSigned32 patternI = 0;
 
      for(tSigned32 i = 0; i != iterations; ++i)
 
      {
 
          target.push_back(pattern[patternI]);
 
          ++patternI;
 
          if(patternI == patternLength)
 
          {
 
              patternI = 0;
 
          }
 
      }
 
  }

(Wrapped up in some code for timing over a bunch of iterations, with result checking to avoid the push_back being optimised out.)

This is obviously very far from a real useage situation, but the results were quite interesting:

OS container type time
Linux std::vector 0.0579 seconds
Linux cVector without realloc 0.0280 seconds
Linux cVector with realloc 0.0236 seconds
Windows std::vector 0.0583 seconds
Windows cVector without realloc 0.0367 seconds
Windows cVector with realloc 0.0367 seconds

So the first thing that stands out from these results is that using realloc() doesn’t make any significant difference on windows. I double checked this, and while expand() is definitely avoiding memory copies a significant proportion of the time, this is either not significant in the timings, or memory copy savings are being outweighed by some extra costs in the realloc() call. Maybe realloc() is implemented badly on Windows, or maybe the memory heap on Windows is optimised for more common allocation scenarios at the expense of realloc(), I don’t know. A quick google search shows that other people have seen similar issues.

Apart from that it looks like realloc() can make a significant performance difference, on some platforms (or depending on the memory heap being used). I did some extra testing, and it looks like we’re getting diminishing returns after some of the other performance tweaks we made in our custom vector, specifically the tweaks to increase capacity after the first push_back, and the capacity multiplier tweak. With these tweaks backed out:

OS container type time
Linux cVector without realloc, no tweaks 0.0532 seconds
Linux cVector with realloc, no tweaks 0.0235 seconds

So, for this specific benchmark, using realloc() is very significant, and even avoids the need for those other performance tweaks.

Slightly more involved benchmark

The benchmark above is really basic, however, and certainly isn’t a good general benchmark for vector memory use. In fact, with realloc(), there is only actually ever one single allocation made, which is then naturally free to expand through the available memory space!

A similar benchmark is discussed in this stackoverflow question, and in that case the benefits seemed to reduce significantly with more than one vector in use. I hacked the benchmark a bit to see what this does for us:

template <class tVector> static void
 
  PushBackBenchmark_TwoVectors(tVector& target1, tVector& target2)
 
  {
 
      const int pattern[] = {0,1,2,3,4,5,6,7};
 
      const int patternLength = sizeof(pattern) / sizeof(*pattern);
 
      const int iterations = 10000000;
 
      tSigned32 patternI = 0;
 
      for(tSigned32 i = 0; i != iterations; ++i)
 
      {
 
          target1.push_back(pattern[patternI]);
 
          target2.push_back(pattern[patternI]);
 
          ++patternI;
 
          if(patternI == patternLength)
 
          {
 
              patternI = 0;
 
          }
 
      }
 
  }
 
  template <class tVector> static void
 
  PushBackBenchmark_ThreeVectors(tVector& target1, tVector& target2, tVector& target3)
 
  {
 
      const int pattern[] = {0,1,2,3,4,5,6,7};
 
      const int patternLength = sizeof(pattern) / sizeof(*pattern);
 
      const int iterations = 10000000;
 
      tSigned32 patternI = 0;
 
      for(tSigned32 i = 0; i != iterations; ++i)
 
      {
 
          target1.push_back(pattern[patternI]);
 
          target2.push_back(pattern[patternI]);
 
          target3.push_back(pattern[patternI]);
 
          ++patternI;
 
          if(patternI == patternLength)
 
          {
 
              patternI = 0;
 
          }
 
      }
 
  }

With PushBackBenchmark_TwoVectors():

OS container type time
Linux std::vector 0.0860 seconds
Linux cVector without realloc 0.0721 seconds
Linux cVector with realloc 0.0495 seconds

With PushBackBenchmark_ThreeVectors():

OS container type time
Linux std::vector 0.1291 seconds
Linux cVector without realloc 0.0856 seconds
Linux cVector with realloc 0.0618 seconds

That’s kind of unexpected.

If we think about what’s going to happen with the vector buffer allocations in this benchmark, on the assumption of sequential allocations into a simple contiguous memory region, it seems like the separate vector allocations in the modified benchmark versions should actually prevent each other from expanding. And I expected that to reduce the benefits of using realloc. But the speedup is actually a lot more significant for these benchmark versions.

I stepped through the benchmark and the vector buffer allocations are being placed sequentially in a single contiguous memory region, and do initially prevent each other from expanding, but after a while the ‘hole’ at the start of the memory region gets large enough to be reused, and then reallocation becomes possible, and somehow turns out to be an even more significant benefit. Maybe these benchmark versions pushed the memory use into a new segment and incurred some kind of segment setup costs?

With virtual memory and different layers of memory allocation in modern operating systems, and different approaches to heap implementations, it all works out as quite a complicated issue, but it does seem fairly clear, at least, that using realloc() is something that can potentially make a significant difference to vector performance, in at least some cases!

Realloc() in PathEngine

Those are all still very arbitrary benchmarks and it’s interesting to see how much this actually makes a difference for some real uses cases. So I had a look at what difference the realloc() support makes for the vector use in PathEngine.

I tried our standard set of SDK benchmarks (with common queries in some ‘normal’ situations), both with and without realloc() support, and compared the timings for these two cases. It turns out that for this set of benchmarks, using realloc() doesn’t make a significant difference to the benchmark timings. There are some slight improvements in some timings, but nothing very noticeable.

The queries in these benchmarks have already had quite a lot of attention for performance optimisation, of course, and there are a bunch of other performance optimisations already in the SDK that are designed to avoid the need for vector capacity increases in these situations (reuse of vectors for runtime queries, for example). Nevertheless, if we’re asking whether custom allocation with realloc() is ‘necessary or better’ in the specific case of PathEngine vector use (and these specific benchmarks) the answer appears to be that no this doesn’t really seem to make any concrete difference!

Memory customisation and STL allocators

As I’ve said above, this kind of customisation of the allocator interface (to add stuff like realloc() support) is something that we can’t do with the standard allocator setup (even with C++11).

For completeness it’s worth noting the approach suggested by Alexandrescu in this article where he shows how you can effectively shoehorn stuff like realloc() calls into STL allocators.

But this does still depends on using some custom container code to detect special allocator types, and won’t work with std::vector.

Conclusion

This has ended up a lot longer than I originally intended so I’ll go ahead and wrap up here!

To conclude:

  • It’s not so hard to implement your own allocator setup, and integrate this with a custom vector (I hope this post gives you a good idea about what can be involved in this)
  • There are ways to do similar things with the STL, however, and overall this wouldn’t really work out as a strong argument for switching to a custom vector in our case
  • A custom allocator setup will let you do some funky things with memory allocation, if your memory heap will dance the dance, but it’s not always clear that this will translate into actual concrete performance benefits

A couple of things I haven’t talked about:

Memory fragmentation: custom memory interfaces can also be important for avoiding memory fragmentation, and this can be an important issue. We don’t have a system in place for actually measuring memory fragmentation, though, and I’d be interested to hear how other people in the industry actually quantify or benchmark this.

Memory relocation: the concept of ‘relocatable allocators’ is quite interesting, I think, although this has more significant implications for higher level vector based code, and requires moving further away from standard vector usage. This is something I’ll maybe talk about in more depth later on..

** Comments: Please check the comment thread on the original post before commenting. **

GDC 2014 Report

Original Author: Jaewon Jung

It’s that time again. Personally I think GDC this year was more interesting that a few preceding ones. Next-gen consoles have arrived, mobile gaming is really big still, and a quality Virtual Reality is becoming a reality.

PowerVR Ray Tracing Hardware and Mobile Graphics

ImgTec announced its upcoming mobile GPU, PowerVR GR6500, which features dedicated ray tracing units on top of its rasterization capabilities. It has a Ray Tracing Unit (RTU) which uses fixed-function math to perform ray tracing intersection queries and a scene hierarchy generator to speed up dynamic object updates. You can program it by writing ray shaders for OpenRL APIs. It makes a hybrid approach feasible, where ray tracing is used only for stuff it’s good at like shadowing, reflection, and refraction in addition to a regular rasterization. Unity showcased its editor using this technology for interactive real-time lightmap previews. It’s interesting that this kind of hardware innovation came from a mobile chip company rather than traditional a GPU powerhouse like NVIDIA or AMD. But, if you think of it, it makes sense. Arithmetic performance of the latest mobile GPUs has already reached that of previous generation consoles like Xbox 360. But, its bandwidth still remains quite limited compared to the previous gen consoles because of power limitation of mobile devices. Dedicating a small portion of its die space to ray tracing, which can spend less bandwidth to achieve the same effects previously mentioned in comparison to rasterization, can be a win to circumvent such a constraint. As they say, constraints breed creativity. The chip is not yet in production and it’s yet to be seen that other GPU manufactures will follow suit or not, but it remains true that this is really an interesting time for graphics coders.

There have been a few talks targeted for enabling console-level mobile graphics from PowerVR, Qualcomm, and Epic Games. Mobile GPUs usually do a tile-based rendering except Tegra one from NVIDIA, but there seems to be considerable differences even among tile-based GPUs from different vendors, which graphics codes should be aware of, to achieve full potential of each hardware.

VR is the next big thing

Kickstarter-born Oculus Rift recently has become an object of attention with great reviews from everywhere, its second hardware coming up, and a few great gamedev talents, including venerable John Carmack, joining it. (In related news, joined Oculus Rift, or more precisely Facebook.) But, in this GDC, the one who revealed an ace up its sleeve was Sony. Its aptly named Project Morpheus seems to have the same goal overall as Rift, but with its PlayStation 4 ecosystem including its camera and Move controller. According to SCE Worldwide Studios President Shuhei Yoshida, It has been in the works since 2010. The current prototype features a head-mounted display with 1080p resolution and a 90 degree field of view and seems to use Playstation Camera to track head orientation and movement. It’s still unclear what its final spec will look like and when the final product will be released. VR is a totally new medium whose scope extends beyond the domain of video games, as the Sony presentation emphasized, and it’ll be very interesting for us, game developers, both as a player and a developer, to see how these unfold in the coming years.

War of Game Engines

Epic Games introduced a $19 monthly subscription model (with 5 percent of revenue royalty payment) for Unreal Engine 4. Right after this announcement, Crytek tried to undercut it by announcing a %9.90 monthly subscription without any royalty for its CryEngine. Ostensibly, the CryEngine deal sounds much better if we naively assume both have similar capabilities, but there is a big caveat: Crytek deal doesn’t provide a full source code access like the Unreal one. Epic Games even allows keeping the source code after the subscription expired. In other words, as a programmer, you can dig into its latest source code and try to learn all its tricks & techniques with just twenty bucks, if you have time & passion for it. What a bold move. Unity also didn’t sit on its hands. It announced the latest incarnation of its engine, Unity 5, with impressive features like physically-based shading and WebGL deployment.

DirectX is not dead

It’s been a while since DirectX 11 came out. Microsoft also seemed to signal that the announcement of OpenGL ES 3.1.

Misc

A math tutorial by Code Clinic: How to Write Code the Compiler Can Actually Optimize. He also shows how zip archiving can be abused to measure information density of some data used in your code.

(I posted this article to my personal blog, too.)

Links

Principled Design

Original Author: Dan-Speed

What makes good software?

temporal coupling, informative error messages. Perhaps it’s well documented where it needs to be.

These are obvious candidates, but even here there is room for debate between the individuals who work on the software.

One property of software that I like to see is a consistency of vision, which tends to be easy enough to achieve with a one man team over a short period, but is more challenging to maintain over a longer period or with more people. By “vision”, I mean a high level consistency of understanding of what the software is meant to solve (and what it’s not meant to solve) and the general way in which it sets about that. Over a longer period of time, it’s very easy to forget what properties of the software were important when it was created, and therefore how its fitness of purpose should be evaluated.

These sorts of high level concepts are easily lost from documentation but are important for new people working on a piece of software to understand. I’m a big fan of giving a software library a mission statement,  but I’ve found that a list of principles is very useful in a number of ways, including fostering debate.

What principles are important to me?

Here’s the exercise – make a list of principles you feel that are important to how you build something that you’re working on.

Here are some guidelines for making this list:

  • Keep it short and sweet (~10 items)
  • Prefer ideas that force people to go out of their way over things that are expected
  • Avoid being specific to a particular solution (though this may sometimes be called for)
  • Don’t worry about order
  • Ensure that each principle is clearly communicated, but brief
  • You may add explanations about the reason for adopting a principle afterwards
  • Principles can be changed after they are agreed, but doing so ought to mean that you need to address the parts where your software violates them, or consider if you need to start again.
  • Pick only principles that can be evaluated against an implementation; if you can’t then you still lack concreteness or clarity
  • If your organization already has principles for software construction you need to make sure your principles are not contradicting them
  • Put it somewhere it’s easily found for reference

The goal of this list is to:

  • Foster debate within the team or with stakeholders about what is important
  • Help constrain the shape of the solution, without dictating the implementation
  • Provide an explanation and motivation for inconvenient choices
  • Provide a checklist for evaluating new features/changes
  • Help you to realize when a new feature or change would compromise the original principles and therefore lead to an inconsistent and confusing design

A quick bit of reasoning about the principles:

It’s important to be brief because the point is to communicate things to other people quickly and the principles need to be easy to remember. You want to keep the list short, so that you have a reasonable chance of sticking to it, and describe each item at a higher-level, so that the items provide value to a large area of your software, rather than small pockets. Principles are not rules, they can be broken, but there should always be a conversation about doing so.

If everyone in your company believes in XP, TDD etc, then having a principle that your software is unit tested doesn’t require anyone to go out of their way. However, if all of your technology is generally tied to a very particular tech stack and environment but you need this new software to be usable, for instance by a build system, outside of that environment, that’s a good principle to put down.

You should pick a solution because it achieves your goals and fits with your principles, not because you picked it up-front and put a principle around it. Principles should provoke a discussion about when an existing solution doesn’t fit any more rather than enforcing blind continuation. Note that the principle itself trumps the explanation, so an explanation could be a specific example, but that should not limit the principle.

Avoid things like “The software should be well written”, or “The game design should be meaningful to players”. These ideas are subjective and open to interpretation, as well as being almost impossible to evaluate.

A nice thing is that this can be applied at almost any level, to almost anything. You can make some company-wide architectural principles, personal principles or principles for a library.

Some example principles & explanations

  • Modular – the libraries should not require each other. Requirements will change or be different for new projects, new components will develop. This should be embraced without requiring us to discard the whole set of tools.
  • Standard Formats – the tools should use standard formats, allowing tools to be written in other languages or scripts.
  • Local testing – parts of the solution that are vendor specific must be testable locally through the use of interfaces & dependency injection
  • Minimal Setup – the software should be easy to deploy
  • Iteration time is important
  • Familiar – The solution should be familiar to game designers using it, and quick to teach programmers to develop with
  • HTTP First – REST/JSON based network APIs are preferred over custom formats and transports.
  • Push – deployments should be pushed, not pulled at random times by cron jobs / hup / agents.
  • Backwards compatibility –  we achieve backwards compatibility through careful versioning.
  • Free of Game logic – our back-end services will not be polluted by game-specific logic, but will use generic feature flags to achieve the desired behaviors

Software development hot air

This just provides a tool for communication, discussion and a bit of periodic reflection on what you find important. These things are achievable in other ways, but I’ve found that this is a lightweight way to help you mold your solution to set of concrete underlying ideas that might otherwise be lost. The constraints provided give other people a map to follow in understanding the path that you took, and a chance to follow in your footsteps (or express their differences).

constrained_Problem_Space

Principles are a purer motivation than practices. Practices can become rote. You can sprint, groom the backlog, do stand-ups, and still not be Agile. By keeping principles close to your heart, you, your team, your code-base and your practices can be in a continual cycle of self-evaluation and evolution without it being haphazard. Challenge yourself to live up to your principles.