Latency Mitigation Strategies

Original Author: John Carmack



Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint.  The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.

Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible.  Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.

A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.


Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions.  With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.

Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.

If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world.  Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness.  Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.

A total system latency of 50 milliseconds will feel responsive, but still subtly lagging.  One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge.  Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion.  When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.

Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed.  It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.

Data collection

It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency.  An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update.  The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.

In most cases there will be significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to show total latencies spanning dozens of 240 fps video frames.

On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds.  Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering.  The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering.  The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.

In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.


Sensors

Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.

In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues.  Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.

Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it.  Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.

Communication bandwidth delay on older serial ports or wireless links can be significant in some cases.  If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message.  Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.

Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter”, as the sensor data is consumed at a video frame rate that differs from the sensor frame rate.  This latency ranges from close to zero, if the sensor packet arrived just before it was queried, up to the full repeat time for sensor messages.  Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware.  The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.


Displays

On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.

Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades.  The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.

Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements.  A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels.  24 fps movie footage will be converted to 60 fps refresh rates.  Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays.  Content protection may be applied.  Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.

Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay.  Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.

Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.

A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display.  This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times.  This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.

Host processing

The classic processing model for a game or VR application is:

Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout

I = Input sampling and dependent calculation

S = simulation / game execution

R = rendering engine

G = GPU drawing time

V = video scanout time

All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.

If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values.  However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast.   This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.

Ample performance, vsync:
  .................. latency 16 – 32 milliseconds

Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines.  The impact of a tear line is related to the disparity between the two frames being torn between and the amount of time the tear line is visible.  Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display.  Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to faster displays.

If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous.  Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.

Ample performance, unsynchronized:
  ..... latency 5 – 8 milliseconds at ~200 frames per second

In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially.  Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.

  GPU:                |GGGGGGGGGGG----|
  VID:                |               |VVVVVVVVVVVVVVVV|
      .................................. latency 32 – 48 milliseconds

When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames.  It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation.  Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.

  CPU2:                |RRRRRRRRR-------|
  GPU :                |                |GGGGGGGGGG------|
  VID :                |                |                |VVVVVVVVVVVVVVVV|
       .................................................... latency 48 – 64 milliseconds

Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that.   Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.

Latency Reduction Strategies

Prevent GPU buffering

The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”.  Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise.  A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:
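A minimal sketch of one such sequence, assuming a double-buffered OpenGL context on Windows; `hdc` and `draw_tiny_primitive` are hypothetical names, and the trivial draw after the swap ensures the following finish must wait for the swap to actually occur:

```c
/* Sketch only: exact semantics are driver dependent. */
SwapBuffers(hdc);        /* request the buffer swap at vertical retrace     */
draw_tiny_primitive();   /* hypothetical helper: any trivial draw against
                            the new back buffer, so glFinish() depends on
                            the swap having completed                       */
glFinish();              /* block until the GPU has executed everything     */
```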





While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.

To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed.  The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls.  If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”.  Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered.  Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.

Run GPU with minimal buffering:
  CPU2:                |RRRRRRRRR-------|
  GPU :                |-GGGGGGGGGG-----|
  VID :                |                |VVVVVVVVVVVVVVVV|
       ................................... latency 32 – 48 milliseconds

Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap.  In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.

High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames.

Alternate Frame Rendering dual GPU:
  CPU2:                |RRRRRRRRR-------|RRRRRRRRR-------|
  GPU1:                | GGGGGGGGGGGGGGGGGGGGGGGG--------|
  GPU2:                |                | GGGGGGGGGGGGGGGGGGGGGGG---------|
  VID :                |                |                |VVVVVVVVVVVVVVVV|
       .................................................... latency 48 – 64 milliseconds

Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.

An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.

The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.

Late frame scheduling

Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it.  If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.

It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic.  If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.

Late frame scheduling:
  CPU2:                |RRRRRRRRR-------|
  GPU :                |-GGGGGGGGGG-----|
  VID :                |                |VVVVVVVVVVVVVVVV|
                      .................... latency 18 – 34 milliseconds

Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.

The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power.  If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.

View bypass

An alternate way of accomplishing a similar, or slightly greater, latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.

At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.

Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died.  It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.

A binary flag could be provided to disable the bypass calculation, but it is more useful to generalize: instead of having the game provide the view parameters themselves, the game provides an object or function with embedded state that produces rendering parameters from sensor input data.  In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information, such as a head/neck positioning model that modifies position based on orientation, or lists of other models to be positioned relative to the updated view.

If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.

View bypass:
  CPU2:                |IRRRRRRRRR------|
  GPU :                |--GGGGGGGGGG----|
  VID :                |                |VVVVVVVVVVVVVVVV|
                        .................. latency 16 – 32 milliseconds

The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task.  Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.

The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.

A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles by the following steps:

Inhibit the execution of GPU commands, forcing them to be buffered.  OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.

All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.

After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.

Kick off the draw command execution.
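Steps three and four might look like the following sketch, assuming every buffered draw call reads its view matrix from a uniform buffer object; `viewUbo`, `Mat4`, `sample_sensors`, `generate_view_params`, and `kick_buffered_draw_commands` are all hypothetical names:

```c
/* After all drawing commands for the frame have been buffered: */
Input in = sample_sensors();            /* freshest available user input */
Mat4 view = generate_view_params(&in);  /* run the parameter generator   */
glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(view), &view);
kick_buffered_draw_commands();          /* begin actual GPU execution    */
```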

Tiler optimized view bypass:
  CPU2:                |IRRRRRRRRRR-----|I
  GPU :                |                |-GGGGGGGGGG-----|
  VID :                |                |                |VVVVVVVVVVVVVVVV|
                                         .................. latency 16 – 32 milliseconds

Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task.  This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary.  Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.

Time warping

If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.

Late frame input sampled view bypass:
  CPU2:                |----IRRRRRRRRR--|
  GPU :                |------GGGGGGGGGG|
  VID :                |                |VVVVVVVVVVVVVVVV|
                            .............. latency 12 – 28 milliseconds

However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily.  Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.

After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters.  Using that transform, warp the rendered image into an updated form on screen that reflects the new input.  If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.

Late frame scheduled time warp:
  CPU2:                |RRRRRRRRRR----IR|
  GPU :                |-GGGGGGGGGG----G|
  VID :                |                |VVVVVVVVVVVVVVVV|
                                      .... latency 2 – 18 milliseconds

If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering.  Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.

If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering.  A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges.  In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case.  This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.

If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to those locations.  This can be mitigated by rendering a larger field of view than the destination requires, but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.

A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations.  At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost.  Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.

Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader.  It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.

If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.

Reverse mapping also allows the possibility of modifying the warp through the video scanout.  The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix.  The warp to be applied can be interpolated between the two of them based on the pixel row being processed.  This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.

Continuously updated time warping

If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.

Continuous time warp:
  CPU2:                |RRRRRRRRRRR-----|
  GPU :                |-GGGGGGGGGGGG---|
  WARP:                |               W| W W W W W W W W|
  VID :                |                |VVVVVVVVVVVVVVVV|
                                       ... latency 2 – 3 milliseconds for 500hz sensor updates

The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display.  Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.

Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.

A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.

GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state.  However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.


Discussion

View bypass and time warping are complementary techniques that can be applied independently or together.  Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.

Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display.  In conjunction with low latency sensors and displays, this will generally be perceived as immediate.  Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.

Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.

A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game.  This offers significant convenience benefits for users, but the inherent network and compression latencies makes it a lower quality experience for action oriented titles.  View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network.  If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.


Acknowledgements

Zenimax for allowing me to publish this openly.

Hillcrest Labs for inertial sensors and experimental firmware.

Emagin for access to OLED displays.

Oculus for a prototype Rift HMD.

Nvidia for an experimental driver with access to the current scan line number.

Why Lua?

Original Author: Niklas Frykholm

A question that I get asked regularly is why we have chosen Lua as our engine scripting language. I guess as opposed to more well-known languages, such as JavaScript or C#. The short answer is that Lua is lighter and more elegant than both those languages. It is also faster than JavaScript and more dynamic than C#.

When we started Bitsquid, we set out four key design principles for the engine:

  • Simplicity. (A small, manageable codebase with a minimalistic, modular design.)

  • Flexibility. (A completely data-driven engine that is not tied to any particular game type.)

  • Dynamism. (Fast iteration times, with hot reload of everything on real target platforms.)

  • Speed. (Excellent multicore performance and cache-friendly data-oriented layouts.)

Whenever we design new systems for the engine, we always keep these four goals in mind. As we shall see below, Lua does very well on all four counts, which makes it a good fit for our engine.

Simplicity in Lua

As I grow older (and hopefully more experienced) I find myself appreciating simplicity more and more. My favorite scripting language has changed several times over the years, and it is now Lua.

Lua is really small for a programming language. The entire Lua syntax fits on a single page. In fact, here it is:

chunk ::= {stat [`;´]} [laststat [`;´]]
  block ::= chunk
  stat ::=  varlist `=´ explist | 
       functioncall | 
       do block end | 
       while exp do block end | 
       repeat block until exp | 
       if exp then block {elseif exp then block} [else block] end | 
       for Name `=´ exp `,´ exp [`,´ exp] do block end | 
       for namelist in explist do block end | 
       function funcname funcbody | 
       local function Name funcbody | 
       local namelist [`=´ explist] 
  laststat ::= return [explist] | break
  funcname ::= Name {`.´ Name} [`:´ Name]
  varlist ::= var {`,´ var}
  var ::=  Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name 
  namelist ::= Name {`,´ Name}
  explist ::= {exp `,´} exp
  exp ::=  nil | false | true | Number | String | `...´ | function | 
       prefixexp | tableconstructor | exp binop exp | unop exp 
  prefixexp ::= var | functioncall | `(´ exp `)´
  functioncall ::=  prefixexp args | prefixexp `:´ Name args 
  args ::=  `(´ [explist] `)´ | tableconstructor | String 
  function ::= function funcbody
  funcbody ::= `(´ [parlist] `)´ block end
  parlist ::= namelist [`,´ `...´] | `...´
  tableconstructor ::= `{´ [fieldlist] `}´
  fieldlist ::= field {fieldsep field} [fieldsep]
  field ::= `[´ exp `]´ `=´ exp | Name `=´ exp | exp
  fieldsep ::= `,´ | `;´
  binop ::= `+´ | `-´ | `*´ | `/´ | `^´ | `%´ | `..´ | 
       `<´ | `<=´ | `>´ | `>=´ | `==´ | `~=´ | 
       and | or
  unop ::= `-´ | not | `#´

The same minimalistic philosophy is applied across the entire language. From the standard libraries to the C interface to the actual language implementation. You can understand all of Lua by just understanding a few key concepts.

Lua’s simplicity and size do not mean that it lacks features. Rather it is just really well designed. It comes with a small set of orthogonal features that can be combined in lots of interesting ways. This gives the language a feeling of elegance, which is quite rare in the programming world. It is not a perfect language (perfect languages don’t exist), but it is a little gem that fits very well into its particular niche. In that way, Lua is similar to C (the original, not the C++ monstrosity) — it has a nice small set of features that fit very well together. (I suspect that Smalltalk and LISP also have this feeling of minimalistic elegance, but I haven’t done enough real-world programming in those languages to really be able to tell.)

As an example of how powerful Lua’s minimalism can be, consider this: Lua does not have a class or object system, but that doesn’t matter, because you can implement a class system in about 20 lines of Lua code. In fact, here is one:

function class(klass, super)
      if not klass then
          klass = {}
          local meta = {}
          meta.__call = function(self, ...)
              local object = {}
              setmetatable(object, klass)
              if object.init then object:init(...) end
              return object
          end
          setmetatable(klass, meta)
      end
      if super then
          for k,v in pairs(super) do
              klass[k] = v
          end
      end
      klass.__index = klass
      return klass
  end
If you prefer prototype based languages — no problem — you can make a prototype object system in Lua too.

Smallness and simplicity makes everything easier. It makes Lua easier to learn, read, understand, port, master and optimize. A project such as LuaJIT — created by a single developer — would not have been possible in a more complicated language.

Flexibility in Lua

Lua is a fully featured language, and in the Bitsquid engine, Lua is not just used as an extension language; rather, it has direct control over the gameplay loop. This means that you have complete control over the engine from Lua. You can create completely different games by just changing the Lua code. (Examples: the first person medieval combat of War of the Roses and the puzzle adventure Hamilton’s Great Adventure.)

Dynamism in Lua

Unlike C#, which only has limited support for Edit and Continue, Lua makes it possible to reload everything — the entire program — on all target platforms, including consoles, mobiles and tablets.

This means that gameplay programmers can work on the code, tweak constants, fix bugs and add features without having to restart the game. And they can do this while running on the real target hardware, so that they know exactly what performance they get, how the controls feel and how much memory they are using. This enables fast iterations which is the key to increasing productivity and improving quality in game development.

Speed of Lua

Measuring the performance of a language is always tricky, but by most accounts, LuaJIT 2 is one of the fastest dynamic language implementations in the world. It outperforms other dynamic languages on many benchmarks, often by a substantial margin.

On the platforms where JITting isn’t allowed, LuaJIT can be run in interpreter mode. The interpreter mode of LuaJIT is very competitive with other non-JITed language implementations.

Furthermore, Lua has a very simple C interoperability interface (simplified further by LuaJIT FFI). This means that in performance critical parts of the code it is really easy to drop into C and get maximum performance.

Lua’s weak points

As I said above, no language is perfect. The things I miss most when programming in Lua are not language features but tools, such as a good IDE with something like IntelliSense or ReSharper. Lua has no “official” debugger, and not much in the way of autocompletion or refactoring tools.

Some people would argue that this shouldn’t be counted as an argument against Lua, since it doesn’t really concern the language Lua. I disagree. A language is not a singular, isolated thing. It is part of a bigger programming experience. When we judge a language we must take that entire experience into account: Can you find help in online forums? Are there any good free-to-use development tools? Is the user base fragmented? Can you easily create GUIs with native look-and-feel? Etc.

The lack of an official debugger is not a huge issue. Lua has an excellent debugging API that can be used to communicate with external debuggers. Using that API you can quite easily write your own debugger (we have) or integrate a debugger into your favorite text editor. Also, quite recently, the Decoda IDE was open sourced, which means there is now a good open source debugger available.

Getting autocompletion and refactoring to work well with Lua is trickier. Since Lua is dynamically typed the IDE doesn’t know the type of variables, parameters or return values. So it doesn’t know what methods to suggest. And when doing refactoring operations, it can’t distinguish between methods that have the same name, but operate on different types.

But I don’t think it necessarily has to be this way. An IDE could do type inference and try to guess the type of variables. For example, if a programmer started to write something like this:

local car = Car()

the IDE could infer that the variable car was of type Car. It could then display suitable autocompletion information for the Car class.

Lua’s dynamic nature makes it tricky to write type inference code that is guaranteed to be 100 % correct. For example, a piece of Lua code could dynamically access the global _G table and change the math.sin() function so that it returned a string instead of a number. But such examples are probably not that common in regular Lua code. Also, autocompletion backed by type inference could still be very useful to the end user even if it wasn’t always 100 % correct.

Type inference could be combined with explicit type hinting to cover the cases where the IDE was not able to make a correct guess (such as for functions exposed through the C API). Hinting could be implemented with a specially formatted comment that specified the type of a variable or a function:

-- @type Car -> number
  function top_speed(car)

In the example above, the comment would indicate that top_speed is a function that takes a Car argument and returns a number.

Type hinting and type inference could also be used to detect “type errors” in Lua code. For example, if the IDE saw something like this:

local bike = Bicycle()
  local s = top_speed(bike)

it could conclude that since bike is probably a Bicycle object and since top_speed expects a Car object, this call will probably result in a runtime error. It could indicate this with a squiggly red line in the source code.
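To make the idea concrete, here is a toy sketch of hint-based checking, written in Python since such a checker would live in IDE tooling rather than in Lua itself. The hint syntax, the regular expressions, and the matching rules are illustrative assumptions, not an existing tool:

```python
import re

# Toy checker for "@type" hint comments of the form described above.
# Hint format and matching rules are assumptions for illustration.
HINT_RE = re.compile(r"--\s*@type\s+(\w+)\s*->\s*(\w+)\s*\n\s*function\s+(\w+)")
CTOR_RE = re.compile(r"local\s+(\w+)\s*=\s*(\w+)\(\)")   # local x = Car()
CALL_RE = re.compile(r"(\w+)\((\w+)\)")                  # f(x)

def check(source):
    """Return a list of (function, expected_type, inferred_type) mismatches."""
    sigs = {name: arg for arg, _ret, name in HINT_RE.findall(source)}
    var_types = {var: ctor for var, ctor in CTOR_RE.findall(source)}
    errors = []
    for fn, arg in CALL_RE.findall(source):
        expected, actual = sigs.get(fn), var_types.get(arg)
        if expected and actual and expected != actual:
            errors.append((fn, expected, actual))
    return errors

lua = """
-- @type Car -> number
function top_speed(car)
local bike = Bicycle()
local s = top_speed(bike)
"""
print(check(lua))  # → [('top_speed', 'Car', 'Bicycle')]
```

A real IDE would use a proper Lua parser with scope-aware inference, but even this level of pattern matching is enough to flag the Bicycle/Car mismatch with a squiggly red line.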

I don’t know of any Lua IDE that really explores this possibility. I might try it for my next hack day.

This has also been posted to The Bitsquid blog.

Vexing puzzle design

Original Author: Kyle-Kulyk

When we started development of Vex Blocks, we set out to create a falling block style arcade game in the vein of Tetris that utilised a device’s rotation.  The job of the player was to chain together blocks on the screen by matching colors, symbols or both and tracing out patterns with their fingers to connect the blocks.  Random blocks would fall into the play area and the goal was to clear as many as possible, rotating the device as necessary so blocks would fall into different arrangements.  Once we had created the basic gameplay mechanics, we set about trying to think of how we could change the rules of the game to create different gameplay modes and a “nice to have feature if we have the time” was puzzle mode.

So, as development moved along I ultimately found myself faced with the job of creating various puzzles for our puzzle mode.  I had never set out to create a puzzle before, but how hard could it be?  Start simple, right?

I started by recreating my playing area in Photoshop and went about duplicating the various game pieces so I could simply drag and drop to create the puzzles before coding them into our game.  My next step was to create something aesthetically pleasing before I even thought of how the puzzle would play.  I’d drop in blocks to create geometric shapes and patterns, often drawing inspiration from simple icons as I only had a 5×8 grid to work with.  Once I had a pattern on the screen that I was relatively happy with, I’d start thinking about how it would play.

Here’s where it really started to get fun.  The point of the puzzle mode was to solve the puzzle, clearing all playing blocks from the screen in as few chains as possible, with an upper limit on the amount of chains you could use before the puzzle would reset.  I’d have a look at the blocks in front of me and start tracing out the various options for chains.  If it was too straightforward, then I’d start to throw in obstacles by swapping out blocks that couldn’t be readily chained together, or could only be part of a chain coming from one particular direction.  Or, I’d start with a puzzle and then mimic a few phone rotations to see what I’d end up with.  It was a bit like messing up a Rubik’s cube.  As challenging as a Rubik’s cube is to solve, there’s a certain amount of satisfaction in taking a solved cube and mixing it up for another to solve.  For a few puzzles, that’s exactly what it was like.  Starting with a solved puzzle that was easy to chain together, then scrambling it.  Mmmm…satisfying.

From a design perspective, starting simple was really the only way for this project to evolve.  As I started to become comfortable designing simple puzzles, I’d gradually add in new game mechanics.  What if I add a block that can’t be chained and has to be surrounded?  What if I introduce blocks that stay fixed in one spot despite device orientation?  What if we throw in blocks that explode if you don’t clear them quickly enough?  What about using specific power-ups?  Adding one new gameplay mechanic at a time and exploring that mechanic fully before moving onto the next, then adding them together provided a nice progression in terms of variety and difficulty.  As I became more familiar with process, design started to shift away from the look of the puzzle and instead started with a particular challenge, and then I moulded the look around the puzzle.

Next up I assembled the puzzles in-game and turned them loose on our testers.  I quickly discovered that what seems easy to me after working on the game full-time for nine months isn’t necessarily as easy for gamers who haven’t spent that type of time with the product.  Test, test, test.  Who knew?  There’s a fine line between challenging and “Nuts to this” with gamers.  Thankfully, I’ve received some excellent feedback and what was originally a “nice to have feature if we have the time” has become a challenging addition to the title that extends the gameplay options while offering us the opportunity to release additional content if gamers like what they see.

Brewing your own game analytics service

Original Author: Colt McAnlis

In this post, I describe how to implement a game analytics system to collect and store game data.  At the end of the post you’ll find links to the source code for my sample implementation.




As a game developer, you must gather data about your game’s users, mine that data, and respond to it. The cloud being what it is today, there are multiple costs associated with collecting and using game data, including costs for transactions and storage.

There are of course tons and tons of services out there that do pretty much the same thing in terms of data collection. This is especially true in the mobile space, where it seems that a new games analytics VC-funded company pops up every day. However, these services often come with lots of questions regarding data ownership, costs, reporting structures, and so forth. At bottom, these services may or may not fit your needs.

As such, if you’re simply looking to understand more about a game analytics system, need more functionality, or just want to roll your own, let’s take a look at how to build your very own low-cost game analytics service from scratch.

The setup

To begin, we make the following assumptions:

  • The user will play your client, and you’ll submit events to a server for cataloging.
  • You have some client code that can make HTTP requests (we’ll use HTML5 in this article).
  • You have some semi-resident server-side compute resource with direct access to a data store. We’ll use Google App Engine (GAE) in this article.

In a naive implementation, we assume that the client does the bulk of the work, and pushes data up to the cloud in a regular fashion. For example, we can push an event every time a rat is killed. This setup results in a simple dataflow between our components:



Let’s hypothetically say that our game has around 15,000 players a day, and each player kills about 2,000 rats – that’s a shocking 30 million events that we’ll be tracking. At this point, we need to take a hard look at what our cost structure is for storing and computing all that data. For instance, the latest pricing on Google App Engine charges about $0.10 per 100k writes to the Datastore, meaning you’d pay about $30 a day for 30 million writes.  That’s a lot of money to throw around just to store in-game events for data-mining. I mean, if you have a dedicated data miner to track that information then the cost might be justified; otherwise, you’re going to need to find a more cost-effective solution.
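The back-of-envelope math here is simple enough to sketch (prices as quoted above; real billing also adds instance hours and bandwidth):

```python
# Daily Datastore write cost at the quoted price of $0.10 per 100k writes.
def datastore_write_cost(events, price_per_100k=0.10):
    return events / 100_000 * price_per_100k

players, rats_per_player = 15_000, 2_000
events_per_day = players * rats_per_player   # 30 million events
cost = datastore_write_cost(events_per_day)  # ~$30 per day
```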

In addition, your client may be sending huge amounts of data to the server (for instance if you’re tracking each mouse click in an RTS game).  That’s lots of traffic and compute time that you’re churning just to collect some floating point numbers.

Client-side batching

In the naive implementation above, the dominating cost factor is the sheer number of writes into the Datastore.  To lower the cost, we have to reduce the number of writes.

The first thing we should do is determine the importance of the data that we’re collecting and how often we need to use that data.  For instance, mouse clicks in an RTS game may be semi-important, but not so important that we need that data instantly.  As such, we could batch those mouse clicks and submit them at the end of the game.  This deferred batch-submit strategy is great for reducing the amount of transfers from client to server (since we only get the data at the end of the game), but doesn’t really help us reduce the number of writes into the Datastore (assuming that we’d write each entry in the batch to the Datastore after receipt).
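A minimal sketch of the deferred batch-submit idea, assuming some submit callable that POSTs the payload to your server (the event fields and transport are placeholders, not the article's sample code):

```python
import json

class EventBatcher:
    """Accumulate events locally and submit them as one payload."""

    def __init__(self, submit):
        self.submit = submit  # e.g. a function that POSTs a JSON string
        self.events = []

    def record(self, name, **data):
        # Instead of one HTTP request per event, just remember the event.
        self.events.append({"event": name, **data})

    def flush(self):
        # One request carries the whole game's worth of events.
        if self.events:
            self.submit(json.dumps(self.events))
            self.events = []

sent = []
batcher = EventBatcher(sent.append)
for _ in range(3):
    batcher.record("rat_killed", x=10, y=20)
batcher.flush()  # a single submission carrying three events
```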

To reduce the number of writes, we turn to another offering in App Engine.  The Blobstore API allows our application to create data objects (called blobs) that are much larger than objects allowed in the Datastore service. The original Blobstore API only allowed clients to submit blobs via HTTP request, but the new experimental Blobstore API allows writing directly to the storage system from server-side code.

Using Blobstore allows us to batch our events on the client side, and submit them in a single blob.


With this setup, we only submit a batch of events at the end of the game and then store them into the Blobstore, which drops our overall cost per day significantly.

There are, however, two issues with this setup.  The first issue is client connectivity.  Say a user plays 20 minutes and is then disconnected, dropping needed data that we may really want. This issue is even more apparent on mobile platforms, where users are on unreliable network connections and where you should expect random data loss during transmission.

Luckily clients can take advantage of persistent storage, which allows them to store batched data and attempt to resubmit the data at a later time. Having reliable network connections means submissions are more likely to succeed the first time, but any client-side batching system needs to have, at its core, the concept of cache-resubmit for any gathered data.
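A sketch of that cache-resubmit core, assuming a local file as the persistent store (the file name and the post callback are placeholders):

```python
import json
import os

CACHE = "pending_events.json"  # placeholder path for the client-side cache

def submit_or_cache(batch, post):
    """Try to send every pending batch; keep whatever fails for next time.

    post(batch) should return True on success, False on failure.
    Returns the number of batches still pending after this attempt.
    """
    pending = []
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            pending = json.load(f)
    pending.append(batch)
    still_pending = [b for b in pending if not post(b)]
    with open(CACHE, "w") as f:
        json.dump(still_pending, f)
    return len(still_pending)
```

On a flaky mobile connection the failed batches simply sit in the cache until a later session succeeds in delivering them.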

The second issue is that storing data in Blobstore limits our ability to do analysis on the data directly. Before we can mess with the data, we must read it from the Blobstore into a computational structure for usage.  In other words, the query “give me all the users who’ve killed a rat today” requires us to read out the data from Blobstore into Datastore (or a similar container) before doing processing.

Server-side batching

In an ideal world we’d allow clients to message-spam our server as much as they want, so that we wouldn’t have to worry about clients dropping out randomly and taking their precious data with them. We can accommodate such spammy submissions by batching events at the server.

For those of you that are new to cloud computing systems like Google App Engine, it’s worth emphasizing that GAE modules aren’t always running – rather, they are instances that are spun up depending on request volume.  More importantly, they can get spun down as well, depending on query volume and infrastructure service scheduling.  That means there’s really no way to keep a resident in-memory copy of data.

This is where App Engine Backends come in.  Backends are pseudo-persistent, heavier-weight processes that can hang around for longer durations. With backends, we can allow clients/GAE instances to communicate as normal, and cache/batch the requests into a backend before submitting to the Blobstore.

The cost for this setup would be about $0.08/hour for the GAE backend, in addition to the size of the data that’s being stored in the Blobstore, as well as any additional front-end / back-end compute times.

With this setup, clients can be very spammy and intermittent, and the GAE instance effectively acts as a pass-through, simply handing data off to the backend.

You can also make the backend public-facing, allowing clients to submit data to it directly rather than going through the GAE instance.  But be warned that this may create a vulnerability point, as spammy/rogue clients may have the ability to engage in Denial Of Service attacks against the backend. Combining the scalability of the GAE instance frontend with the longer-running backend is a good way to eliminate this vulnerability and still allow per-client scaling/throttling.

As the number of writes in our system increases, we’ll eventually hit an upper limit on the number of requests a backend can process.  More specifically, depending on the size of our event structure, our 30 million events may exceed the capacity of the backend’s available RAM.

One way to address this issue is to add an upper storage limit and start flushing data.  For example, once the backend caches say 10MB of data, it can flush that data to the Blobstore for storage. This technique is particularly helpful given that GAE backends are not truly persistent – they are processes that can be rescheduled on different physical machines, which can die or become unavailable for various reasons. As such, we run into similar connectivity issues as we do with clients, although at much lower frequency. Adding regular flushes can ensure regular storage pulses to safeguard our system from losing data. A downside to these regular flushes is that they can take a chunk of time, and generally, we’ll need to flush during an event submission (if the cache gets full, we’ll need to flush before adding the new event). As such, the GAE instance can easily timeout waiting for the backend to flush its data.
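The flush-on-threshold behavior described above can be sketched like this, with a blobstore_write callable standing in for the actual Blobstore API:

```python
class BackendCache:
    """In-memory event cache that flushes to blob storage at a size limit."""

    def __init__(self, blobstore_write, limit_bytes=10 * 1024 * 1024):
        self.write = blobstore_write  # stand-in for the Blobstore API
        self.limit = limit_bytes
        self.buf = []
        self.size = 0

    def add(self, event_bytes):
        # Flush *before* adding when the new event would overflow the cache.
        # This is the step that can stall the front-end request mid-submit.
        if self.size + len(event_bytes) > self.limit:
            self.flush()
        self.buf.append(event_bytes)
        self.size += len(event_bytes)

    def flush(self):
        if self.buf:
            self.write(b"".join(self.buf))
            self.buf, self.size = [], 0

blobs = []
cache = BackendCache(blobs.append, limit_bytes=10)
cache.add(b"12345678")  # fits in the cache
cache.add(b"1234")      # would overflow, so the first batch is flushed
```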

A more scalable method of dealing with the limited number of requests a backend can process is to simply increase the number of backends to a desired capacity, and use a hashing function in the GAE instance to evenly distribute events to the backends. Or rather, once our traffic increases to the point where it exceeds what one backend can handle, we can add additional backends (which obviously increases the cost).
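A sketch of the distribution hash, assuming events carry a player id (any stable key works):

```python
import hashlib

def backend_for(player_id, num_backends):
    """Pick one of N backends with a stable hash, so a given player's
    events always route to the same backend cache."""
    digest = hashlib.md5(player_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_backends

# Scaling up is then just raising num_backends (and accepting that
# existing keys reshuffle across the new backend count).
```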

Creating a balanced approach

One important point to note is that there are actually different types of data that you will want to collect for your game.  Every game is different – there is no one best set of data for all games.  Your game will have a range of statistics that you should collect and store with some variability. For instance:

  • There may be some statistics that are fine to gather locally and push up at periodic intervals; others, you’ll want to store immediately because they are so critical.
  • You may not need guaranteed delivery of every single event from every single game for every single player. You may just need “most” data or a representative amount of data. For example, you can log only data from a statistically relevant percentage of clients and then extrapolate results.
  • Not all events can/should come from clients. For MMO games, most of the event calculation and game state reckoning takes place on a game server instance, and as such, that instance should have access to submit events as well.
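The sampling idea in the list above can be sketched deterministically, so that a given client is always in (or out of) the sample rather than flickering between runs:

```python
import hashlib

def is_sampled(client_id, percent):
    """Deterministically place roughly `percent` of clients in the sample."""
    digest = int(hashlib.sha1(client_id.encode()).hexdigest(), 16)
    return digest % 100 < percent

# e.g. log full event streams from only ~10% of clients, then scale
# aggregated results by ~10x when reporting
sampled = sum(is_sampled(f"client-{i}", 10) for i in range(10_000))
```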

You should also adjust your data collection based on input loads.  Tune your storage/flushing options based on where the bulk of your data is coming from. For any flushing point (client to server, or server to Blobstore) you should adjust when to flush based on duration and how much data is stored. Always remember that you’re trading RAM for IO operations – keep data in RAM longer and you’ll need more RAM; flush data more often and you’ll do more IO.  Tweak the numbers constantly to find a good balance.

One final issue to consider is the cost associated with the size of your data in Blobstore. Since you’re charged per byte, it might be worth reducing the size of the stored data. Thankfully App Engine also has a solution for that through its ZIP API.
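A sketch of the compression step, with Python's zlib standing in for App Engine's ZIP support (the event records are made up for illustration):

```python
import json
import zlib

# A batch of highly repetitive event records, as analytics logs tend to be.
events = [{"event": "rat_killed", "player": "p1", "t": i} for i in range(1000)]
raw = json.dumps(events).encode()
packed = zlib.compress(raw, 9)  # compress the batch before storage

# repetitive event logs compress very well, cutting per-byte storage cost
ratio = len(packed) / len(raw)
```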

Our final implementation, shown in the figure below, includes data segmentation, multiple levels of batching, and data compression.


Additional considerations

True cost gut-check

There are of course many additional costs involved with cloud processing. For example, there’s the computational cost (= instance hours) of processing/handling every individual HTTP request from clients, as well as the bandwidth cost associated with the per-HTTP-request overhead.  To estimate your true cost, the best thing to do is to build a mock system and run some valid traffic through it to ballpark your numbers.

Reducing long-term data costs

When you implement your game analytics system, you should plan for success and consider what you want to do in the long haul. For example, imagine that on a good day, you’ll be storing some 50 million in-game events. That adds up to a significant amount of data to keep around long-term. After a year of production, the likelihood that a single day’s worth of data will be useful drops significantly, so keeping it in the cloud is going to cost you money for data that’s not used. In that case, you should consider moving the data into a form that reduces your cost over time.

One solution that may make sense is to move data regularly from the cloud to a local box, where you can access the data for a longer period of time at a lower cost. Before archiving, you should cache important data elements so that future analysis can reference the results of the data without having to pull it all back out from deep freeze.

The source

You can find source code for each of the 3 tracking methods we’ve discussed here on my github page. With the App Engine SDK for Python, you can quickly upload and run the instances.  Use the given HTML pages to test the system and see how things work.