When Even Crashing Doesn’t Work

Original Author: Bruce Dawson

I’ve written previously about how crashing can be used to improve code quality. However, even the seemingly simple task of crashing can be more error prone than you might expect.

I’ve recently become aware of two different problems that can happen when crashing in 64-bit Windows. There is a Windows bug which can make debuggers forget where a crash happened, and there is a Windows design decision which sometimes causes a crash to be completely ignored!

Both problems are (mostly) avoidable once you know what to do, but the required techniques are far from obvious.

Forgetting where a crash happened

It is a reasonable minimum requirement that a debugger should halt on the exact instruction that triggered a fault and then attempt to show source code, local variables, a call stack etc. There are all sorts of reasons it may be difficult or impossible to show source code (none available), local variables (optimized away), or a call stack (stack trashed), but for user-mode debugging it should always be possible to stop on the faulting instruction.

And indeed, in all the decades that I have used Visual C++ it has managed this task quite well – until recently.

Starting a few months ago I noticed that, when the program that I was debugging crashed, the VC++ debugger would not halt on the faulting instruction. It wouldn’t even halt in the crashing function. Instead it would halt two levels into the OS, with a call stack that made no sense. At first I thought that the project I was working on was doing something weird with a structured exception handler, but I was able to reproduce the bug on a fresh project created by the VC++ New Project Wizard. I briefly thought that maybe something was misconfigured on my machine, but then my coworkers started reporting this problem as well. Then I thought maybe it was a newly introduced VC++ bug – but the same problem can be triggered in windbg as well.

I wasn’t sure what was happening but it smelled like a recently introduced Windows bug.

My minimal test program for this bug was to call this Crash() function just before the message pump in a default Win32 program, debug build:

void Crash()
{
    char* p = 0;
    p[0] = 0;
}

If I break on the instruction that will crash then I get the call stack below, and I should get the same call stack after crashing:

[Image: the expected call stack, with the Crash() function at the top.]

That is indeed the call stack that I got in this scenario for years. However, starting a few months ago, on most 64-bit Windows 7 machines that I have tested this on, the actual call stack is this:

[Image: the actual post-crash call stack; the crashing function is not listed.]

Notice that the function that crashed is not even listed! This makes routine bug investigation an expert-level problem.

Sometimes the crash call stack is even worse, with even the parent of the crashing function missing:

[Image: an even worse call stack, where the parent of the crashing function is also missing.]

The actual stack displayed varies. Sometimes it is correct, and sometimes the two ZwRaiseException entries are listed. It seems to depend on subtle details of the code at the crash location, or the stack frames, or the phase of Venus.

Windbg defaults to halting on first-chance exceptions, so it normally avoids this bug. However if you continue execution after a crash then the exception handlers run and the bug appears.

I’ve created a simple test program with a “Crash normally” menu item so that you can easily test it. Source and the executable are available here. You’ll have to build the project file (with VS 2010 or VS 2012) to get symbols in order to see this properly in a debugger.

Another blogger investigated this issue earlier this year and traced it to AVX. Saving the state of the AVX registers requires additional space, and apparently the WoW64 debug support fails to reserve enough space, so the stack gets corrupted. Oops.

There is a fix (well, a couple of workarounds)

The problem with correctly displaying the location of a crash only occurs if the first-chance exception handlers are allowed to run. First-chance exception handlers give a program a chance to take some action when a program crashes (such as saving a minidump, or translating raw exception numbers into something more readable).

Such handlers can be useful – I demonstrated this a few posts ago – but they are not valuable enough to justify the complexity caused by not knowing where you crashed. Other uses of first-chance exceptions – such as ‘fixing’ bugs so that you can continue executing – are morally bankrupt and will not be acknowledged further here.
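The sketch below is my own illustration (not from the original post) of one way a program can observe first-chance exceptions in-process: a vectored exception handler sees every structured exception before any frame-based (__try/__except) handlers run.

#include <windows.h>
#include <stdio.h>

// Sees every SEH exception before frame-based handlers get a chance.
LONG CALLBACK FirstChanceLogger(EXCEPTION_POINTERS* info)
{
    // Translate the raw exception number into something slightly more readable.
    printf("First-chance exception 0x%08lX at %p\n",
           info->ExceptionRecord->ExceptionCode,
           info->ExceptionRecord->ExceptionAddress);
    return EXCEPTION_CONTINUE_SEARCH; // observe only; keep searching for a handler
}

// Once at startup; the '1' asks for this handler to be called first:
// AddVectoredExceptionHandler(1, FirstChanceLogger);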

Clearly what we want to do is to stop any exception handlers from running when our program crashes. We want the debugger to halt when an exception is thrown, instead of after it has complicated things by letting exception handlers run. This is actually the default behavior in windbg but in Visual Studio we have to change a setting. Go to the Debug menu, select Exceptions, and check the box beside Win32 Exceptions.

In an ideal world this would be a global setting and we would be done with the problem, but alas this is a per-solution setting, so you may have to click this check box many times. It’s a minor nuisance, and well worth it for the benefit of actually being able to debug your crashes.

The other workaround comes from the os2museum blog. Its author points out that you can disable AVX support and therefore avoid the problem. The obvious disadvantage is that you lose AVX support, which will eventually become unacceptable. The command below, plus a reboot, will turn off AVX support.

bcdedit /set xsavedisable 1

I think that there are two changes which Microsoft should make. One is that Visual Studio should default to halting immediately when Win32 exceptions are thrown – that is a safer policy in general, and would have avoided most of the impact of this bug.

The other change that Microsoft should make is to actually fix WOW64.

I have reported this bug to Microsoft through informal channels, but I’ve heard no reply so far.

Failure to stop at all

An equally disturbing problem was introduced some years ago with 64-bit Windows and it causes some crashes to be silently ignored.

Structured exception handling is the Windows system that underpins all exception handling (C++ exceptions are implemented using structured exception handling under the hood). Its full implementation relies on being able to unwind the stack (with or without calling destructors) in order to transfer execution from where an exception occurs to a catch/__except block.

The introduction of 64-bit Windows complicated this. On 64-bit Windows it is impossible to unwind the stack across the kernel boundary. That is, if your process calls into the kernel, and then the kernel calls back into your process, and an exception is thrown in the callback that is supposed to be handled on the other side of the kernel boundary, then Windows cannot handle this.

This may seem a bit esoteric and unlikely – writing kernel callbacks seems like a rare activity – but it’s actually quite common. In particular, a WindowProc is a callback, and it is often called by the kernel, as shown below:

[Image: diagram with exception handlers on the left, the kernel in the middle, and user code (a WindowProc callback) on the right.]

If your code crashes in the user code on the right – called from the kernel – then Windows has a problem. Since Windows can’t invoke your exception handlers in the box on the left, and it doesn’t know what they would do, it has to make an executive decision about this exception. It can either crash the process, or it can silently ignore the exception, unwind the stack back to the kernel boundary, and then continue executing as if nothing happened.

Crashing the process may significantly inconvenience users, especially if there is a bug specific to 64-bit Windows in an unsupported product. But silently swallowing the exception means that many developers may be crashing in their WndProc without realizing it, leaving their process in an indeterminate state that may be causing future pain and suffering. Microsoft tries to err on the side of maximum compatibility and stability, but sometimes this just sweeps problems under the rug.

Triggering this behavior is easy. In a Project Wizard “Win32 Project” just drop a call to the Crash() function in the paint handler. To make this demo particularly dramatic be sure to put the Visual Studio exception settings back to normal. That is, make it so that Visual Studio does not stop when an exception is thrown – only when it is unhandled. Here’s a sample of what the modified code could look like, complete with a new/delete pair that straddles the Crash() call:

case WM_PAINT:
    {
        hdc = BeginPaint(hWnd, &ps);
        char* p = new char[1000000];
        Crash();
        delete [] p;
        EndPaint(hWnd, &ps);
        break;
    }

And here’s what the output window looks like:

[Image: the debugger output window, filled with repeated first-chance exception messages.]

The more you resize the window the more frantically the debugger tries to tell you that your program is in trouble. And yet your program continues. Try running it not under the debugger and you will see that it appears to be running normally. But if you look in task manager as you resize the window you will see the memory consumption growing out of control – the delete statement is never reached.

Aside: inevitably somebody will suggest that if I used std::vector then I wouldn’t have this memory leak. And indeed, normally I would use std::vector or some other container class to manage memory – manually calling delete is for chumps. However there are a couple of points to consider here:

  • The ability of std::vector to magically delete memory when an exception is thrown only works with C++ exceptions, not structured exceptions (crashes). There are ways to translate structured exceptions to C++ exceptions, but this is misguided and either slows your program or misses some exceptions. It’s just a bad idea. Don’t do it.
  • Additionally, the whole point of this article is that when you are in a callback from the kernel the exception handling mechanism – structured and C++ exceptions – is impaired. You can’t count on it to save you.

There is a fix

The default policy on 64-bit Windows is to silently swallow the crash, but as of Windows 7 SP1 there is a choice. There is a pair of undocumented functions – GetProcessUserModeExceptionPolicy and SetProcessUserModeExceptionPolicy, not directly listed in the MSDN index although they are on MSDN – and you can start the process of redemption by calling this function:

void EnableCrashingOnCrashes()
{
    typedef BOOL (WINAPI *tGetPolicy)(LPDWORD lpFlags);
    typedef BOOL (WINAPI *tSetPolicy)(DWORD dwFlags);
    const DWORD EXCEPTION_SWALLOWING = 0x1;

    HMODULE kernel32 = LoadLibraryA("kernel32.dll");
    tGetPolicy pGetPolicy = (tGetPolicy)GetProcAddress(kernel32,
                "GetProcessUserModeExceptionPolicy");
    tSetPolicy pSetPolicy = (tSetPolicy)GetProcAddress(kernel32,
                "SetProcessUserModeExceptionPolicy");
    if (pGetPolicy && pSetPolicy)
    {
        DWORD dwFlags;
        if (pGetPolicy(&dwFlags))
        {
            // Turn off the exception-swallowing filter
            pSetPolicy(dwFlags & ~EXCEPTION_SWALLOWING);
        }
    }
}

The GetProcAddress dance is necessary because many versions of Windows don’t have these functions.

Calling this function – once at process startup will do – is a great start. It will ensure that crashes you hit during testing will get noticed and, one hopes, fixed. However there is probably another step that you will want to take. If you normally use an exception handler to record crash dumps then you will probably find that your exception handler is not recording these crashes. That’s because your exception handler is probably on the other side of the kernel boundary. You could try putting in more exception handlers, but then you’re playing callback whack-a-mole. The far simpler solution is to use SetUnhandledExceptionFilter. Unhandled exception filters get criticized when they are used for too much, but in this case they are better than nothing. Note also that the unhandled exception filter is not called when you are debugging.
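A minimal sketch of such a filter, assuming you want to write a minidump (the file name, dump type and error handling here are arbitrary choices of mine, not from the original post):

#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// Called for exceptions that nothing else handles -- including, once
// EnableCrashingOnCrashes() has run, crashes in kernel callbacks.
LONG WINAPI SaveMiniDump(EXCEPTION_POINTERS* info)
{
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE)
    {
        MINIDUMP_EXCEPTION_INFORMATION mei = { GetCurrentThreadId(), info, FALSE };
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, NULL, NULL);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER; // let the process die after the dump
}

// Once at startup:
// SetUnhandledExceptionFilter(SaveMiniDump);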

The same test program that has the “Crash normally” menu item also has a “Crash in callback” menu item to enable crashing in the WM_PAINT handler. As in the sample code above it leaks memory. It even keeps track of how much memory has leaked. Source and the executable are available here.

The test program also has an “Enable crashing on crashes” menu item. If you select this then the next callback crash – typically the next time that you resize the window – will be a real crash, instead of a silently ignored crash.

Note that CrashTest.exe will only exhibit the crash swallowing behavior on 64-bit Windows, and I’ve only tested it on Windows 7.

Things that annoy me

I don’t like the Program Compatibility Assistant. It tells you that something has gone wrong (in this case a crash in a kernel callback) but because it doesn’t tell you what went wrong there is no practical way for a Windows developer to do anything about it. Thus, the cycle of programs not running correctly continues.

It also doesn’t tell you what compatibility settings it has applied. I know that this information is not of interest to Windows users, but there should be some way to get specific information about what went wrong, so that developers are not left impotently scratching their heads.

Your task list

Download the test program from here to see bad exception handling (if you have an AVX capable machine running 64-bit Windows 7 SP1), and exception swallowing (all 64-bit versions of Windows).

You should enable breaking when an exception is thrown for all Win32 Exceptions, using Visual Studio’s Debug menu, Exceptions dialog. Don’t forget to do this for every solution. It is unfortunate that this default behavior of Visual Studio has been incorrect for a couple of decades now. Maybe VS 2015 will correct this. If you enable breaking when Win32 exceptions are thrown then even kernel-callback crashes will break into the debugger, when you are running under a debugger.

Vote on the Connect issue to request that Visual Studio change the defaults so that the debugger halts on first-chance Win32 exceptions.

You should consider disabling AVX to avoid the WOW64 debugging bug: “bcdedit /set xsavedisable 1”

Call EnableCrashingOnCrashes() to ensure that crashes in callbacks are not ignored. Don’t use registry editing or other options to control this behavior.

If you have a crash dump saving system then use SetUnhandledExceptionFilter to ensure that it is called if your code crashes in a callback.

Test on 64-bit Windows.

Watch for more discussion on the compatibility assistant and exception handling in some future post.

Update

The silent swallowing of exceptions documented above is for 32-bit processes on 64-bit Windows. The behavior for 64-bit processes is a bit different. On Windows 7 if a 64-bit process crashes in a kernel callback then it will actually crash. However, if the executable doesn’t have a Windows 7 compatibility manifest (subject of a later post) then the Program Compatibility Assistant will apply a shim that will suppress future crashes. Confused yet?

The summary remains the same, with the possible addition that if you are testing on Windows 7 then you should add a compatibility manifest to say that you are doing so.

Also, I hear rumors that the stack corruption caused by storing AVX state is known by Microsoft and will be fixed. Whether the fix will be for Windows 8 only is not clear at this point.

Leaky Abstractions

Original Author: Claire Blackshaw

In a world of frameworks, simple-to-use engines and added layers of abstraction, we are in danger of leaky abstractions, both in design and in programming. While the concept was familiar to me, a friend introduced me to the phrase at the pub recently, as well as directing me to a brilliant article by Joel Spolsky. I wanted to publicise and explore this in the context of gaming, using a graphics programming problem and a motion design problem.

Do you understand the dot product? No, I mean really: have you sat down with the math, and do you remember it? I thought I had, but recently, while using Unity on a home project, I naively called two functions in separate loops: one to find which side of a plane a point is on, the other to find how far the point is from the plane. I filtered the points by side and then calculated the distance in a follow-up step.

Moments later, while debugging an unrelated but nearby piece of code, I looked at the two functions and a brick of memory flew from a lecture of the past and knocked over my stupid forgetful self. Those familiar with the math have already facepalmed and laughed at my mistake: the math to figure out which side of a plane you are on is the same as that used to calculate the distance. The “side function” merely throws away the distance and returns the sign, leading to a leaky abstraction.
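To make the leak concrete, here is a minimal C++ sketch of the two operations (my own illustration, not Unity’s actual implementation): the side test is just the sign of the signed distance.

struct Vector3 { float x, y, z; };

float dot(const Vector3 &a, const Vector3 &b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A plane with unit normal n, satisfying dot(n, p) + d == 0 for points p on it.
struct Plane { Vector3 n; float d; };

// One dot product gives you everything:
float signed_distance(const Plane &pl, const Vector3 &p)
{
    return dot(pl.n, p) + pl.d;
}

// The "which side" query throws the magnitude away and keeps the sign.
bool is_on_positive_side(const Plane &pl, const Vector3 &p)
{
    return signed_distance(pl, p) > 0.0f;
}

Computing signed_distance once per point and branching on its sign does both jobs in a single loop, instead of paying for the dot product twice.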

It should be noted that you cannot discover this fact from the documentation of these functions, from their names, or from a surface-level inspection. In a world where more and more layers of complexity are being shielded from us, we are in great danger not only of throwing away useful information but of repeating work already done. Increased battery drain, cloud server costs and wasted cycles are the symptoms of this ailment.

Designing for motion or futuristic inputs suffers a similar problem. Too often we see people using keyboard or keypad events with little understanding of how the device delivers them – from electrical signal, to interrupt, to an OS message pump or state – before they are exposed to our program. Poor understanding often introduces additional latency, and this issue is magnified when we start using more complex input systems which we treat as magic boxes.

The Sony Move controller uses a gyroscope, an accelerometer and a camera feed to derive the position of the controller. The camera uses the visible size of a known object, the ball, to do a distance calculation. Accelerometers are inherently noisy. What many people who use the system naively forget is that the data is pre-filtered and sampled over an interval. The default values rely quite heavily on the visibility of the ball.

This filter step does introduce latency to the user control, and in cases where the ball is obscured the data can spike or drift in certain ways. Certain settings or approaches can cause an undocumented increase in latency. What should a motion designer be concerning themselves with here, you ask? Well, designing gestures in which ball tracking is lost, or the ball is even partially obscured for a frame, is harder than on, say, the Wii or Six-Axis controller. On one previous title I worked on around the launch window of the Move, the primary control worked better swinging the Six-Axis controller about than the Move.

Following this trend, at Dare to be Digital last year we saw impressive uses of Kinect, but almost every team was relying entirely on skeleton-based systems. This is the “3rd stage” of Kinect processing and the system with the highest latency. Many of the control systems they were using could have worked off the raw depth data feed, which could have been evaluated faster. Though in Microsoft’s defence, they do a brilliant job of exposing the raw feed and the stages of processing to developers, for optimisation or for use where you only care about simpler, faster motions.

So to come full circle, from point/plane math to futuristic input systems: we will be increasingly surrounded by layers of abstraction, from both a coding and a design view. It is important that we continue to “de-mystify” these systems in order to better use them. And in a call to developers of frameworks, middleware and similar products, I have an old and familiar request.

Document your leaky abstractions, publish your process and enable your developers.

All they will do with that information is make you look good.


Matrices, Rotation, Scale and Drifting

Original Author: Niklas Frykholm

If you are using Matrix4x4s to store your node transforms and want to support scaling you are facing an annoying numerical problem: rotating a node causes its scale to drift from the original value.

The cause of drifting

Drifting happens because in a Matrix4x4 the rotation and the scale are stored together in the upper left 3×3 part of the matrix:
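(The original post illustrates this with an image. As a sketch of the layout, assuming column vectors with the scale applied before the rotation, entry (i, j) of the upper-left block is r_ij * s_j:)

$$M = \begin{pmatrix} r_{11}s_x & r_{12}s_y & r_{13}s_z & t_x \\ r_{21}s_x & r_{22}s_y & r_{23}s_z & t_y \\ r_{31}s_x & r_{32}s_y & r_{33}s_z & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Note that the translation t sits in its own column, untouched by the rotation and scale; this will matter later.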

This means that if we want to change the rotation of a Matrix4x4 without affecting the scale we must extract the scale and reapply it:

void set_rotation(Matrix4x4 &pose, const Quaternion &rot)
{
    Vector3 s = scale(pose);
    Matrix3x3 rotm = matrix3x3(rot);
    scale(rotm, s);
    set_3x3(pose, rotm);
}

The problem here is that since floating point computation is imprecise, scale(pose) is not guaranteed to be exactly the same before this operation as after. Numerical errors will cause a very small difference. So even though we only intended to rotate the node we have inadvertently made it ever so slightly bigger (or smaller).

Does it matter? Sure, it is annoying that an object that we didn’t want to have any scaling at all suddenly has a scale of 1.0000001, but surely such a small change would be imperceptible and couldn’t affect gameplay.

True, if we only rotated the object once. However, if we are dealing with an animated or spinning object we will be changing its rotation every frame. So if the error is 0.0000001 the first frame, it might be 0.0000002 the second frame and 0.0000003 the third frame.

Note that the error growth is linear rather than geometric because the error in each iteration is proportional to the current scale, not to the current error; that is, to (1 + e) rather than to e. We can assume that 1 >> e, because otherwise we would already have a clearly visible error.

I ran a test using our existing math code. Rotating a transform using the method described above yields the following result:

Error      Frames   Time (at 60 Hz)
0.000001      202   3 s
0.000002      437   7 s
0.000005      897   15 s
0.000010     1654   28 s
0.000020     3511   58 s
0.000050     8823   2 min
0.000100    14393   4 min
0.000200    24605   7 min
0.000500    52203   15 min
0.001000   100575   28 min

As you can see, after 28 minutes we have an error of 0.1 %. At this point, it starts to get noticeable.

You could debate if this is something that needs fixing. Maybe you can live with the fact that objects grow by 0.1 % every half hour, because your game sessions are short and the small scale differences will never be noted. However, since Bitsquid is a general purpose engine, we need a better solution to the problem.

At this point, you might be asking yourself why this problem only happens when we introduce scaling. Don’t we have the same issue with just translation and rotation? No, because translation and rotation are stored in completely separate parts of the matrix, as the layout sketched above shows.

Setting the rotation doesn’t touch any of the position elements and can’t introduce errors in them, and vice versa.

Solutions to scale drifting

I can think of four possible solutions to this problem:

  • Store rotation and scale separately

  • Always set rotation and scale together

  • Quantize the scale values

  • Prevent systematic errors

Solution 1: Store rotation and scale separately

The root cause of the problem is that rotation and scale are jumbled together in the Matrix4x4. We can fix that by separating them. So instead of using a Matrix4x4 we would store our pose as:

struct Pose {
    Vector3 translation;
    Matrix3x3 rotation;
    Vector3 scale;
};

With the pose stored like this, changing the rotation does not touch the scale values, so we have eliminated the problem of drifting.

Note that this representation is actually using slightly less memory than a Matrix4x4 — 15 floats instead of 16. (We could reduce the storage space even further by storing the rotation as a quaternion, but then it would be more expensive to convert it to matrix form.)

However, the representation is not as convenient as a Matrix4x4. We can’t compose it or compute its inverse with regular matrix operations, as we can do for a Matrix4x4. We could write custom operations for that, or we could just convert this representation to a temporary Matrix4x4 whenever we needed those operations.
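Such a conversion could look like the sketch below (my illustration, not the actual Bitsquid code; it assumes column vectors, an identity_matrix4x4() helper and v[column][row] element storage):

Matrix4x4 matrix4x4(const Pose &pose)
{
    Matrix4x4 m = identity_matrix4x4(); // assumed helper
    const float s[3] = { pose.scale.x, pose.scale.y, pose.scale.z };
    // Each rotation axis is scaled by the matching scale factor --
    // these are the 9 multiplications mentioned below.
    for (int c = 0; c != 3; ++c)
        for (int r = 0; r != 3; ++r)
            m.v[c][r] = pose.rotation.v[c][r] * s[c];
    // Translation lives in its own column, untouched by rotation or scale.
    m.v[3][0] = pose.translation.x;
    m.v[3][1] = pose.translation.y;
    m.v[3][2] = pose.translation.z;
    return m;
}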

Converting to a Matrix4x4 requires initializing the 16 floats (some with values from the pose) and 9 floating point multiplications (to apply the scale). What kind of a performance impact would this have?

I would guess that the part of the codebase that would be most affected would be the scene graph local-to-world transformation. With this solution, you would want to store the local transform as a Pose and the world transform as a Matrix4x4. The local-to-world transform requires about 36 multiplications and 36 additions (says my quick estimate). So adding a temp Matrix4x4 conversion would take you from 72 to 81 FLOPS.

So a very rough estimate is that this change would make your scene graph transforms about 12 % more expensive. Likely, the real value is less than that since you probably have additional overhead costs that are the same for both methods. And of course, the scene graph transforms are just one small (and parallelizable) part of what your engine does. We rarely spend more than 2 % of our frame time there, meaning the total performance hit is something like 0.2 %.

I think that is a quite reasonable price to pay for a neat solution to the problem of drifting, but you may disagree of course. Also, perhaps the use of Matrix4x4s is so ingrained in your code base that it is simply not possible to change it. So let’s look at the other possible solutions.

Solution 2: Always set rotation and scale together

The fundamental problem with set_rotation() is that we try to change just the orientation of the node without affecting the scale. Extracting the scale and reapplying it is what causes the drifting.

If we don’t allow the user to just change the rotation, but force him to always set the scale and the rotation together, the problem disappears:

void set_rotation_and_scale(Matrix4x4 &pose, const Quaternion &rot, const Vector3 &s)
{
    Matrix3x3 rotm = matrix3x3(rot);
    scale(rotm, s);
    set_3x3(pose, rotm);
}

Since we have eliminated the step where we extract the scale and feed it back, we have rid ourselves of the feedback loop that caused runaway drifting. Of course, we haven’t completely eliminated the problem, because nothing prevents the user from emulating what we did in set_rotation() and recreating the feedback loop:

Vector3 s = scale(pose);
set_rotation_and_scale(pose, new_rotation, s);

Now the drifting problem is back with a vengeance, reintroduced by the user.

To prevent drifting the user must take care not to create such feedback loops. I.e., she can never extract the scale from the matrix. Instead she must store the scale at some other place (separate from the matrix) so that she can always feed the matrix with the correct scale value.

What we have done is essentially to move the burden of keeping track of the scale of objects from the transform (the Matrix4x4) to the user of the transform. This prevents drifting and doesn’t have any performance costs, but it is pretty inconvenient for the user to have to track the scale of objects manually. Also, it is error prone, since the user who is not 100 % certain of what she is doing can accidentally recreate the feedback loop that causes drifting.

Solution 3: Quantize the scale values

If neither of the two options presented so far seems palatable to you, there is actually a third possibility.

Consider what would happen if we changed the Vector3 scale(const Matrix4x4 &) function so that it always returned integer values.

Calling set_rotation() as before would introduce an error to the scale and set it to, say 1.0000001. But the next time we ran set_rotation() and asked for the scale it would be rounded to the nearest integer value, so it would be returned as 1 — the correct value. Applying the new rotation would again introduce an error and change the value to 1.0000001, but then again, the next time the function ran, the value returned would be snapped back to 1.

So by rounding the returned scale to fixed discrete values we prevent the feedback loop. We still get small errors in the scale, but without the runaway effect they are unnoticeable. (Small errors occur everywhere, for example in the scene graph transforms. That’s the nature of floating point computation. It is not the small errors that are the problem but the mechanisms that can cause them to result in visible effects.)

Of course, if we round to integer values we can only scale an object by 1, 2, 3, etc. Not by 0.5, for instance. But we can fix that by using some other set of discrete numbers for the scale. For example, we could round to the nearest 0.0001. This would let us have scales of 0.9998, 0.9999, 1.0000, 1.0001, 1.0002, … Hopefully that is enough precision to cover all the different scales that our artists might want to use.
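A quantizing scale() might look like this sketch (raw_scale(), which would measure the lengths of the three rotation axes, is an assumed helper, and the 0.0001 step is just the example value from above):

Vector3 scale(const Matrix4x4 &m)
{
    const float step = 0.0001f;
    Vector3 s = raw_scale(m); // assumed helper: lengths of the three axes
    // Snap each component to the nearest multiple of step, so tiny
    // floating point errors can never feed back into the next rotation.
    s.x = floorf(s.x / step + 0.5f) * step;
    s.y = floorf(s.y / step + 0.5f) * step;
    s.z = floorf(s.z / step + 0.5f) * step;
    return s;
}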

Drifting won’t happen in this scheme, because the floating point errors will never be big enough to change the number to the next discrete value. (Unless you used really large scale values. If you want to support that — typically not interesting, because things like texture and vertex resolution start to look wonky — you could use a geometric quantization scheme instead of an arithmetic one.)

Snapping the scale values in this way might be OK for static scaling. But what if you want to smoothly change the scaling with an animation? Won’t the discrete steps cause visible jerks in the movement?

Actually not. Remember that it is only the value returned by scale() that is quantized, the user is still free to set_scale() to any non-quantized value. When the scale is driven by an animation, it is fed from an outside source. We don’t need to read it from the matrix and reapply it. So the quantization that happens in scale() never comes into play.

So amazingly enough, this hacky solution of snapping the scale to a fixed set of discrete values actually seems to work for most real world problems. There might be situations where it would cause trouble, but I can’t really come up with any.

Solution 4: Prevent systematic errors

A final approach is to try to address how the numbers are drifting instead of stopping them from drifting. If you look at the table above you see that the errors are growing linearly. That is not what you would expect if the errors were completely random.

If the errors in each iteration were completely random, you would get a random walk process where the total error would be e * sqrt(N) rather than e * N, where e is the error from one iteration and N the number of iterations. The fact that the error grows linearly tells us that our computation has a systematic bias — the error is always pushed in one particular direction.

If we could get rid of this systematic bias and get a truly random error, the accumulated error would grow much more slowly, the square root makes all the difference. For example, for the error to grow to 0.1 % it would take 5.2 years rather than 28 minutes. At that point, we might be ok with the drifting.
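As a back-of-the-envelope check on that figure, using the numbers from the table above and assuming a roughly constant per-frame error: the linear drift reached an error of 0.001 after about N = 100575 frames, so the per-frame error is e = 0.001 / 100575, or roughly 1e-8. A random walk reaches a total error of 0.001 when e * sqrt(N) = 0.001, i.e. when N = 100575^2, which is about 1e10 frames, or a bit over five years at 60 Hz.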

I haven’t thought that much about what would be needed to get rid of the systematic bias in the set_rotation() function. It’s a pretty tricky problem that requires a deep understanding of what happens to all the floating point numbers as they travel through the equations.

Conclusion

In the Bitsquid engine we have so far gone with #2, as a makeshift until we decide on the best permanent solution to this problem. After reviewing the options in this article I think we will most likely go with #1. #3 is an interesting hack and I think it would work almost everywhere, but I’m willing to pay the slight performance price for the cleaner and clearer solution of #1.

This has also been posted to The Bitsquid blog.

Game Developers: Remember Priority #1

Original Author: Aaron San Filippo


(Note: this was originally posted on Flippfly.com)

I questioned the wisdom of writing this since, as of yet, with nothing bigger than Monkey Drum Deluxe (http://itunes.apple.com/us/app/monkey-drum-deluxe/id527723426?mt=8) to my name, there’s really not a lot of inherent credibility to my words of the kind that comes with having highly successful products to back them up.

But there’s a mindset among some indies (see http://www.altdev.co/2012/03/04/indie-devs-the-odds-are-against-you/) that’s really kind of discouraging to me.

Namely: I think many have forgotten that the most important factor to success as a game developer is making an excellent game, and started to believe that financial success is either random, or mostly due to factors outside of the game itself.

Now – keep in mind that I said excellent (as opposed to decent, good, or even great) and that when I use that word, I mean: fun, appealing, polished, and accessible (and don’t take accessible to mean casual or broadly appealing).

Recently a tweet from Jon Blow (developer of Braid) made me think on this issue again:

gamasutra.com/view/feature/1…

— Jonathan Blow (@Jonathan_Blow) June 29, 2012

 

This was in response to a business-centric postmortem about a game that sold 7 copies titled “congratulations, your first indie game is a flop.”

Jon went on to explain:
“He made a game that there’s no reason for people to want, but acts like he is entitled to have people buy it / press cover it.”

Now to be fair, as developers like Michael Brough (http://mightyvision.blogspot.co.uk/2012/06/ios-sale-numbers.html) have shown, talking about failures is important. The developer acknowledged mistakes, and ultimately showed no regret at having done something he loved and believed in.

My concern is that people seem to have an expectation that their game will do reasonably well as long as they get all their ducks in a row, and if this doesn’t happen, often the last thing they focus on is the game itself. Time and again I’ve seen people reference their 75% or so ratings and then go on to talk as if these are “great” reviews and that they just need to get their great game in front of people.

I think this kind of thinking is a big mistake.

Hear me out: it should be a self-evident fact that if you expect to succeed financially, you’re going to need lots of eyes on your game, especially if you’re charging a couple bucks or less for it. This can happen in a variety of ways, but it mostly boils down to two: either you spend money on marketing, or you make a game that is so good that its quality and value make it impossible to ignore. You want people to play it, share it, tweet about it, talk about it at work, review it, and feature it, not because of a great icon or an attractive promo video, but because it’s unquestionably just that good. You want it to be the game about which people say “you really have to play this!”

You should be able to think of your game like a dry pile of sticks doused in gasoline, and of all your marketing efforts as the spark that just ignites it.

It’s tempting to look at counter-examples: all the good games that somehow get passed over, and all the mediocre games that somehow manage to sell millions.

But in the absence of big marketing dollars, I would argue that:

Mediocre games usually fail.

Good games often fail.

Excellent games rarely fail.

Every other case is just noise.

So am I saying that marketing, PR, great icons, promo videos, a great website, social features, killer screenshots, and personal connections are unimportant?

Of course not!

But if your game is less than excellent, then all this stuff is like trying to push a rock up a hill in today’s market. That’s not a fulfilling way to spend your life. And the weaker your game is, the more time you’re going to spend trying to make all these supporting factors make up for it – a really bad cycle to be in when time is your most precious asset!

What’s cool about setting out to make excellent games, is that in addition to taking so much of the randomness out of your success potential, you’re going to enjoy a much more fulfilling career!

Now I feel I should make a point to say that sales isn’t the only reason to make games, as Rami Ismail of Vlambeer points out (http://ramiismail.com/2012/07/on-success-failure-and-the-scene/). It’s a big space and not everybody in it is trying to make a living at it. We actually just submitted a little toy to the app store called “Creepy Eye” – an experiment using face-tracking and the gyroscope that we hadn’t seen explored before. Making an artful experiment or a cool diversion is a reward in itself. But my concern is when people start to feel a sense of entitlement or surprise when these experiments and “pretty good” games don’t garner any attention or sales, with sometimes lengthy and publicized business-focused analysis of where their monetization strategy failed, or scary-sounding warnings to other would-be indie developers.

If you want to sell games, and you don’t like throwing dice with your financial future, you need to be determined to produce excellence.

So do yourself a favor: take a break from your monetization strategizing, video-editing, press-emailing, buzz-creating, and icon-tuning for a minute, and ask yourself:

 

“Is this game Excellent yet?”

The less the code, the better

Original Author: Jaewon Jung

Subtitle: how to encourage simple code even if it may slow you down in the short term

Less is More!?

For some time, I’ve had a thought that no source code is the best source code. Obviously it has no bugs and requires no maintenance. The ideal software! Then, with no source code, our product will be super-awesome, right? Of course not. Our product (game or whatever kind of software it may be) must do something useful, and no source code can do nothing, by definition. Being useful (including giving an enjoyable experience to users) is the raison d’etre of any software product (or of any product, for that matter, I guess).

Fight against the temptation to go easy

From time to time, I have mulled over how one can incentivize short code, or a refactoring that makes the overall code shorter and simpler. The need for such incentives stems from the fact that shorter code (providing identical features, of course) often requires deeper consideration, unremitting diligence and courage, and usually takes more time to accomplish.

Let’s suppose the task of implementing feature A has been assigned to you because it was the highest-priority item in the backlog. While figuring out the best way to implement it, you find some nasty feature duplication and an unnecessarily complected part in the relevant modules. Now you have two choices. One is to finish the feature ASAP by working around or conforming to the bad code (while soothing your guilt by telling yourself “Let’s do the refactoring later,” which more often than not never happens) and get recognition as a coding ninja. The other is to bite the bullet, despite the short-term delay it may induce, and do the right thing right away, even at the risk of breaking some unrelated part of the code. You might not be lauded right away as a super achiever (rather, you may be blamed for missing the deadline or for a temporarily broken build or feature), but in the long run this can save not only your time, but also your colleagues’, whether they become aware of the insidious effect of your effort or not. As for me, I have fallen victim to my impatience, short-sightedness and temptation to go easy more often than I’d like to admit.

It should be measured to be encouraged

One naive metric I can think of is to compute (lines_added - lines_removed) for each commit and award more points the smaller the result is. This too-simplistic approach won’t work, of course. In particular, in the early phase of a project little or no refactoring can occur, since there isn’t much code to be cleaned up yet, and many lines of code naturally need to be written and added at this stage. Writing a lot of code at this point is definitely not something to be discouraged. The crucial thing is how much value the code adds to the product by introducing that many additional lines. The key question is how one can measure this so-called ‘added value’. It is quite tricky, if not impossible, even to define, never mind compute. But a plausible approximation will be enough for our purpose; after all, we want some way to incentivize good directions overall, rather than to calculate an absolute number.

Tests as a proxy of added values

In test-driven development, each meaningful feature (whether a small function in the case of unit testing, or an actual user-facing feature under a high-level functional test) should have a corresponding test set. So an idea that recently came to my mind is to use the size of the test code as an approximation of the added value. Then an equation for the bonus score could look like this:

score = k1 * number_of_new_test_cases - k2 * net_amount_of_code_added

k1 and k2 are some positive coefficients. So, for example, for a pure refactoring change, number_of_new_test_cases will be zero, but net_amount_of_code_added will usually be negative, thanks to applying DRY rigorously. On the other hand, when actively adding several base features in an early phase, both values will be positive in most cases.
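As a toy sketch of how this score could be computed per commit (the coefficient values and the function name are my own inventions for illustration):

// score = k1 * number_of_new_test_cases - k2 * net_amount_of_code_added
double commit_score(int number_of_new_test_cases, int net_amount_of_code_added)
{
    const double k1 = 10.0; // reward per new test case (arbitrary value)
    const double k2 = 0.1;  // penalty per net line of code added (arbitrary value)
    return k1 * number_of_new_test_cases - k2 * net_amount_of_code_added;
}

A pure refactoring (zero new tests, negative net lines) then scores positive, and early feature work also scores positive as long as its tests keep pace with the new code.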

At this point, I should admit that I’ve never succeeded in entrenching TDD in the projects I introduced it to, nor had the luck to work in a team where it was already established. I think TDD can be very beneficial, but again, I’m not in a position to assert that. Still, if you’re already doing it and reaping benefits from it, then another merit of this approach – that it also encourages adding tests for any new feature or function – will appeal to you.

Conclusion (or rather lack of it)

The Boy Scout Rule by Uncle Bob is a wonderful motto, but hard to keep following in reality. I think depending purely on an individual’s discipline and conscience to improve the codebase of a team-based project is a losing battle. Since there are usually quite a few existing incentives for easy/quick/short-term achievements, which can be detrimental to the long-term quality of the codebase, there should be systematic incentives to counter-balance them and reinforce one’s resolve to keep the code clean.

There are other ways of fighting the tendency of code to complect, like dedicating some portion of time to paying down technical debt, or having a dedicated sprint/iteration for it periodically. But it’s usually better to deal with a problem right off the bat, once you find it, as long as no critical deadline is impending. What do you think of this approach? Do you have any other tips or ideas for keeping code simple in spite of our very human tendency to go easy and complect things?

(This article has also been posted to my personal blog.)