Smoothsort vs. Timsort

Original Author: Jaewon Jung

I recently found out about an interesting sorting algorithm called Smoothsort, originally created by the legendary Edsger Dijkstra, and about Timsort, another adaptive comparison sort. Naturally, a thought came to me to study these and benchmark them against the STL sorting algorithms.


Smoothsort is built on the Leonardo numbers, a close cousin of the well-known Fibonacci numbers. Anyway, please refer to Keith’s explanation for the ins and outs of this algorithm.


In contrast to Smoothsort, which is mathematically sound and deep in theory (as Keith mentions, it’s hard for one (at least, for Keith and me) to imagine how Dijkstra originally came up with it), Timsort is rather a product of observations of real-world data and clever tricks to exploit common patterns found in it. Compared to Smoothsort, Timsort has a worse space usage of O(n), but it has the benefit of being a stable sort. Unfortunately, this algorithm is also far from simple. In my opinion, the best way to come to a good understanding of it is to read here[7].


Now, fun time. To see how these new kids stand their ground against the battle-tested standard sorting algorithms in the STL, I adopted some benchmark code from the Timsort implementation above and modified it a little. You can find the main code here.

Without further ado, here is the result:

Data                          std::sort  std::stable_sort  Smoothsort  Smoothsort(raw bit)  Timsort
Randomized(int)                     978               970        4871                 2529     1576
Randomized(double)                 1068              1117        5069                 2785     1677
Reversed(int)                       213               753        3729                 1821       19
Reversed(double)                    224               858        3776                 1741       23
Sorted(int)                         167               349        1005                  449       13
Sorted(double)                      179               358         986                  453       17
Partially sorted #0(int)            961               989        4699                 2516     1478
Partially sorted #0(double)        1049              1098        5272                 2774     1594
Partially sorted #1(int)            713               828        4798                 2435      634
Partially sorted #1(double)         761               903        5077                 2630      732
  • Data size: 100,000
  • Data type: int or double
  • Unit: milliseconds
  • Partially sorted #0: each subarray of size 10 in 100,000 sequence is sorted
  • Partially sorted #1: each subarray of size 1000 in 100,000 sequence is sorted
  • The original implementation of Keith’s uses std::bitset. ‘Smoothsort(raw bit)’ is a modified one that uses raw bit operations instead
  • Test hardware: Intel Core i7 920 / 6GB RAM

As you can see, both Timsort and Smoothsort didn’t cut the mustard. Smoothsort is worse than the STL sorts in all cases (even with std::bitset replaced with raw bit operations). Timsort shows a trade-off: on random or almost-random sequences it’s not as good as the STL ones, but on sequences with pre-existing order like ‘Reversed’, ‘Sorted’ and ‘Partially sorted #1’, it shows an impressive speed-up. Admittedly, apart from replacing an obvious culprit like std::bitset, I didn’t try any thorough optimization of either, so if you can see/suggest any optimization opportunities I missed, please leave a comment.


I was somewhat impressed with the STL sorters at this point and realized I hadn’t known much about their internal implementations beyond the foggiest idea that they would be variants of quick sort. So I dug into the source (the STL in VS2012 RC, specifically). As you know, there are two: one which doesn’t maintain the relative order of records with equal keys (unstable) and one which keeps it (stable).

std::sort is basically a quick sort that uses a median-of-medians scheme to choose the partition pivot. Once a subsequence becomes small enough (< 32 elements), it switches to an insertion sort. And if the recursion becomes too deep (> 1.5 log2(N)), it switches to a heap sort.

std::stable_sort is not a quick sort. It’s a sort (pun intended) of merge sort. First, it insertion-sorts each chunk of 32 elements, then merges them hierarchically. Initially, it tries to allocate temporary storage of half the size of the original sequence and uses it when merging. If the allocation fails, it tries a smaller size, but this means more recursion and more merging, so it’s slower, of course. In the sense that it is a combo of merge sort and insertion sort, one can say it’s similar to Timsort in essence, although the latter has many more complex tricks up its sleeve.

The code for std::sort/std::stable_sort is relatively easy to follow (especially in comparison with understanding Smoothsort or Timsort), so I strongly recommend taking a look at it if you haven’t done so before.


  • Asymptotic performance is not the whole story. The constant factor matters (an obvious thing, but still worth repeating)
  • Timsort can be a viable alternative when data are not so random
  • Smoothsort, though mathematically clever, doesn’t cut the mustard
  • std::sort/std::stable_sort are pretty good in most cases
  • For small data, insertion sort is very good, so mixing it into other algorithms is a good strategy for devising a hybrid sorting algorithm



Photon Mapping Part 1

Original Author: Simon Yeung


Photon mapping is a global illumination (GI) technique that uses particle tracing to compute images in offline rendering. It is an easy technique to implement, so I chose to learn it; my target is to bake a light map storing indirect diffuse lighting information using the photon map. Photon mapping consists of 2 passes: the photon map pass and the render pass, which are described below.

Photon Map Pass

In this pass, photons are cast into the scene from the position of the light source. Each photon stores a packet of energy. When a photon hits a surface in the scene, it is either reflected (diffusely or specularly), transmitted, or absorbed, as determined by Russian roulette.

Photons are traced in the scene to simulate light transport
Each hit event represents the incoming energy at that surface and is stored in a k-d tree (known as the photon map) to be looked up in the render pass. Each hit event stores the photon energy, the incoming direction and the hit position.

However, it is more convenient to store radiance than energy in a photon, because with a punctual light source (e.g. a point light) it is hard to compute the energy emitted given the light source radiance. So I use the method described in Physically Based Rendering: a weight of radiance is stored in each photon:
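The formula image from the original post is missing at this point. As a reference point only (this is the classic Jensen-style formulation, not necessarily the exact radiance weight from Physically Based Rendering that the author uses), a light source of power $\Phi$ that emits $n$ photons gives each photon the flux

$$\Phi_p = \frac{\Phi}{n}$$

and the stored per-photon weight is scaled accordingly.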

The Costco-ification of Leaderboards

Original Author: David Czarnecki


It’s beneficial, sometimes, to buy in bulk. Or in this case, to rank in bulk. I’ll take a look at a recent improvement to our open source leaderboard library that shows a significant speed-up when doing a bulk insert operation to rank members in a leaderboard.


Let’s face it, we’ve all been walking through Costco or Sam’s Club at one point and thought to ourselves, “I do need a pallet of chocolate covered pretzels.” or “We could use a drum of grape jam at home.” Regardless of whether or not these are valid units of measurement, the time it takes to purchase an item in bulk is less than the time it takes to purchase the items individually when you know you’re going to need the bulk amount. So let’s explore this idea with leaderboards.

Let’s look at the performance of ranking 1 million members in a leaderboard.

insert_time = Benchmark.measure do
  1.upto(1000000) do |index|
    highscore_lb.rank_member("member_#{index}", index)
  end
end
 => 29.340000 15.050000 44.390000 ( 81.673507)

81 seconds isn’t bad, but can we do better? What if a catastrophic failure forces us to rebuild our leaderboards from scratch? Every second counts right? Let’s rank the same 1 million members in the leaderboard all at once.

member_data = []
 => []
1.upto(1000000) do |index|
  member_data << "member_#{index}"
  member_data << index
end
 => 1
insert_time = Benchmark.measure do
  highscore_lb.rank_members(member_data)
end
 => 22.390000 6.380000 28.770000 ( 31.144027)

31 seconds! As it turns out, “buying in bulk” really paid off. It helps to understand your underlying data store, in this case Redis, to know when it can handle bulk operations to your advantage. If we were to compare the time it’d take to rank, say 10 million members in a leaderboard, we’d be looking at almost 14 minutes (ranking individually) vs. 5 minutes (ranking in bulk).


As shown in this post, bulk operations can significantly impact the performance of your systems if the underlying data store can optimize those bulk operations. So don’t worry about that case of Cheez-Its you just purchased. You’re worth it!

On Simplexity

Original Author: Christian Schladetsch

I just wanted to write a short note about something Anders Hejlsberg said:

“When you take something incredibly complex and try to wrap it in something simpler, you often just shroud the complexity. You don’t actually design a truly simple system.”

This is a profound and insightful observation.

It is often true of system coders that we like to say that we “hide” or “wrap” the complexity of a system, so that it can be used by the application coders to make products.

What Anders says is that such an approach can often lead to a system that overall has more complexity than otherwise. I don’t think he means that systems shouldn’t be complex, or that our APIs should be unsophisticated.

I think he means that as you go down a level of abstraction, the lower levels should be logically and practically simpler than the level above.

Actually achieving this in real-world systems is, I think, the mark of a truly talented system programmer. I know I often fail at this, and my most recent efforts, while providing a “simple” top-level API, are far more complex underneath than perhaps they ought to be, or could be.

When I start a new system, I often start with the unit-tests. The idea is that once the tests pass, the system is a success. However, we can’t really write tests for all usages, so complexity creeps in as we make systems that attempt to deal with all possible future use-cases.

The result after a few iterations and mid-flow design changes is more complexity than there has to be. I pride myself on observing this “creeping complexity”, and also take pride in my willingness to do a deep, tedious and difficult cross-concern refactor in order to reduce it. But I think sometimes I don’t go far enough. I think I often make simplex systems.

Sometimes, it is only when a system is complete, and passes all tests, and works in the application space well enough, that you can see a better system underneath. A system that is simpler as you go down, rather than more complex as you go down. A system that better matches the application space.

However, it is at precisely that same time that you have the least amount of mental energy to do anything about it. The system works; the tests pass, and people are using it to make things. Why rock the boat?

Perhaps it is this indifference that is “the wall” for system coders. Overall, it’s a marathon, and you have to push through that last barrier, and be willing and able to do that last cross-cutting refactor that breaks everything for days or even weeks on end.

Only then can one reduce the complexity, remove the simplexity, and create a truly simple and useful system.

Three Steps from Paid to Freemium

Original Author: Betable Blog

This is a guest post by Michail Katkoff, who works on Monetization at Rovio Entertainment. He writes about his experiences with social and mobile game monetization regularly on his Deconstructor of Fun blog.

Cut The Rope: Experiments saw its ranking jump from 31 to 1 in two days when it switched to freemium.

A quick glance at the top grossing games list in the App Store and you’ll spot several freemium titles at the top; freemium apps now account for over 65% of iOS app revenue. You start to wonder… what if you turned your paid game into a free one? Would it be a good business decision? I put together the following three steps to help you analyse the situation logically and come to the right conclusion for your game.

1. Make the Business Case

Before cutting the price of your app to zero, you need to have a business case ready. Projections of potential growth rates and conversion expectations will be key factors in your decision to go freemium. To get started, you need to construct several scenario analyses based on three key variables: daily active users (DAU), conversion rate (from free to paying), and average revenue per paying user (ARPPU).

Picture 1: An example of a projection formula, which takes into account different combinations of growth, conversion and spending scenarios

Daily Active Users

The whole idea behind cutting away the price of the app is to significantly increase the number of installs. The increase in installs your app will see depends on several variables, such as its current App Store rank, current price, marketing, promotion, and potential cross-promotion. Based on these variables, cutting your price to zero could have a huge impact on your growth. To try this out, you can list your app on a “free app a day” site to see what the combination of promotion and price does to your download rate. (See Picture 2 below for the kind of installs difference you might see from making your app freemium.)

Also, you should be tracking and aware of the ratio of daily active users to your total users. This will give you an idea of how engaging your game is, and how many of your users will be likely to pay in a freemium game. This leads to our next variable, Conversion.

Picture 2: Estimation of Apple App Store downloads and ranks


Conversion

The share of paying users is a very game-specific figure and typically varies between 1% and 5%. The reason I want you to look at your DAU specifically is that you’ll likely only convert players once they are engaged. When converting players from free to paying, avoid aggressive offers and steep progression curves to maximize retention, and thus long-term conversion (read more: Primed Spend). That way, you won’t be sacrificing engagement later for revenue now.

Picture 3: Increase ARPPU by offering large bundles. Indicate the value of larger purchases with a combination of graphics and texts

Average Revenue Per Paying User (ARPPU)

The final variable of the formula is ARPPU, which is tied to your app’s in-app purchase offering. To maximize ARPPU, you should always try to offer high-value, high-price bundles, as in my experience demand tends to be quite price-inelastic at the high purchase points.

Also, favor consumable items over permanent items, both in your game and in your promotions. These are more valuable to engaged players because they will use the items during gameplay, and their consumable nature drives re-purchases (read more: Monetizing Free-to-Play Games).


2. Create Demand & Offering

Before you transition your game to freemium, you need to create offerings (in-app purchases and purchase points), and then ensure that there is demand for those offerings. There are two ways you can create demand for in-app purchases: Restrictions and Achievements.


Restrictions

Restricting gameplay creates a strong incentive for in-app purchases. However, there are a few rules you should remember when restricting players’ gameplay.

  • First, don’t confuse restrictions with walls. By this I mean that players should be able to progress without paying; restrictions are there only to slow that progress. You are selling a shortcut, not the gameplay itself, which prevents players from becoming frustrated.
  • Secondly, start with light restrictions, and then increase them slowly and steadily. This way, you won’t scare away players at the beginning, and once players are engaged they will either convert to paying or wait out even the toughest restrictions.
  • Lastly, always offer at least two purchase options to bypass the restriction. An example of this would be one option offering a smaller amount of paid currency plus a significant amount of free currency, while the other option offers a much larger amount of paid currency. The intention is to make the small price point look extremely unattractive and thus incentivize players to make the reasonable purchase at the higher price point.

Example of Restriction Mechanism:

Title X is a puzzle game in which players pass various levels. Levels are grouped into level packs, each containing 10 levels. To restrict progression, you add a timer mechanic so that every time players complete a level pack they have to wait for a specific time until they can continue the adventure. Players can skip the waiting period either by purchasing a single unlock, which unlocks the next level pack (10 levels), or by unlocking all the level packs (100 levels) with a slightly larger purchase. The waiting period grows after each completed level pack, making the all-levels unlock purchase more and more attractive.

Example of Restriction Mechanism


Achievements

In place of, or in addition to, restricting the player’s progress, you can also add achievements that reward players for perfecting each level. To perfect a level, a player usually needs to play more and refine their strategy, which drives engagement. The first achievements should be pretty easy, but as players progress the achievements should become harder and harder to reach (and therefore demand more play sessions). This creates demand for boosters and power-ups, which assist players in their quest to perfect each level and earn the achievement.

Example of Achievement Mechanism:

Title X is a racing game. For each level, a first place finish earns the player 3 stars for that level. Every sixth level is visibly locked on the level selection screen, and the sixth level is typically a more interesting or unique level design. To enter the locked level, a player must get a first place finish in all five previous levels. Players can improve their chances of winning a race by purchasing consumable boosters with in-app purchases, or they can purchase the ability to unlock the level outright.

Example of Achievement Mechanism


3. Track and Adjust

Now you have a business case, which gives you goals to shoot for in order to make your game a freemium success. You’ve created offers with built-in demand to encourage in-app purchases. Now, in order to successfully make the switch from paid to freemium, you need to implement tracking around your key metrics. Once the app transitions from paid to free in the app stores, the development and marketing teams should be looking to optimize the game based on the data they get.

Be sure to keep the above Restrictions and Achievements in mind when setting up your tracking. Tracking and analyzing the data is essential, both for improving your monetization and engagement and for knowing how you are doing with regard to the business case you’ve made. Once you know where you stand, it’s much easier to prioritize subsequent updates and achieve the business goals you’ve set.

So now that your game is freemium and you’ve got disappointing numbers, how do you adjust? Let me give you some suggestions below:


Going from paid to free should be accompanied by a marketing and PR campaign. Have your game featured on FreeAppADay and various blogs and tweets to ensure exposure. If you have other titles, go hard on cross-promotion. If this is your only title, then consider buying cross-promotion (Android only). And featuring from Google or Apple is naturally worth crawling for…


Low ARPPU usually means that either your prices are too low (counterintuitive, but remember only 1-5% of your players are buying) or that your larger price points fail to offer perceived higher value. Take a closer look at the data and see at which points in the game users are making purchases and what they are purchasing. Once you know this, you can rearrange your offering so that it better suits players in those converting phases of the game. At the same time, make sure you unbalance the price points so that the higher prices offer unbeatable value compared to the lower ones. Again, because only 1-5% of your players are buying in most games, you want to give them a lot of incentive to make a larger purchase.


Conversion is the toughest variable of them all to improve. When it’s low, it indicates that the demand you’ve created isn’t strong enough, or that your in-app offering doesn’t answer the demand. If you suspect that either of those two cases is true, you need to go back to step two and really analyze how players are playing the game.

However, be sure that you’re not missing the low-hanging fruit in your design and user experience. I’ve seen games where purchasable content is simply too well hidden to be noticed because designers are afraid it might scare players. Don’t be afraid! While you don’t want to be obnoxious about it, most players are now used to freemium games and are not fazed by prompts to buy in-app purchases. From your data, see what percentage of DAU is seeing the purchase wall. If it’s below 20%, you need to make better use of the various screens in the game, such as Level Failed, Level Selector and Welcome screens. In particular, Level Failed is a great place to sell your in-app purchases because players are looking for ways to improve their play.

Picture 4: Hambo makes excellent use of the pause screen with several buttons leading players to convert.



In the end, it’s all about understanding the difference between the paid and freemium business models and analysing the potential of your paid game to succeed as a freemium version. The key is in creating demand and giving players the possibility to fulfil that demand with in-app purchases. And one final thing I want to mention: don’t expect to nail the transformation to freemium instantly. You need to track and adjust your game to improve your numbers and revenue. Going from paid to freemium is a tough challenge, but if you execute well then it will certainly pay off. Good luck!

In Praise of Idleness

Original Author: Bruce Dawson

“I want to say, in all seriousness, that a great deal of harm is being done in the modern world by belief in the virtuousness of WORK, and that the road to happiness and prosperity lies in an organised diminution of work.” – Bertrand Russell

In the past few years I have encountered many examples of code that spins in a busy loop, doing nothing useful for dozens of milliseconds. I’ve looked at dozens of xperf traces from customer machines that show both busy waiting and CPU starvation. There are many misconceptions about the costs and benefits of busy waiting, and I want to explain emphatically why busy waiting is rarely a good idea, is a worse idea now than in the past, and will become an even worse idea in the future.

The temptation to busy wait

A simple way of providing mutual exclusion is to have a flag which a thread can atomically test-and-set, using something like InterlockedCompareExchange. If the flag is zero and you are able to atomically set it to one then you ‘own’ that critical region and you can safely execute your code. If you write your own lightweight mutex you have to be careful to avoid reordering problems, but the basics of guaranteeing mutual exclusion are actually quite straightforward.

The problem is, what happens when somebody else owns the lock? You need to wait for them to release the lock – but how?

One option is to wait on a semaphore. That puts you to sleep until the lock is released. That is good. That is appropriate. That is the whole freakin’ point of this article. So do that. But first I want to explain why developers sometimes don’t like waiting on a semaphore.

Waiting on a semaphore means calling out to the kernel (expensive), context switching away from your thread (expensive) and then context switching back to your thread. If the lock is released quickly then it can be more efficient to busy wait for “a while” in order to avoid the costs of the kernel call and a pair of context switches.

So, busy waiting on a mutual exclusion flag can be more efficient in some cases. But can we gain that efficiency without paying an unreasonable price?

The simple case

Imagine that you are running on an N-core system and there are N threads running. Each thread runs on a different core. In this scenario it seems reasonable for a thread that wants a lock to busy wait until it acquires the lock, because (by definition) there are enough CPU resources for all, so we might as well keep spinning, waiting for the lock to be freed.

That all sounds very reasonable, but even in this extremely simple case there may be reasons why busy waiting is a bad idea.

No CPU is an island

  1. In these days of TurboBoost, the total power draw of a chip is capped. If one core draws more power – because it is spinning in a busy loop – then there is less power available for the other cores, and the whole chip may have to run at a lower frequency.
  2. In these days of integrated graphics (and the future is increasingly integrated), the total power draw of a chip is capped. If the CPU draws more power – because it is spinning in a busy loop – then there is less power available for the GPU, and it may have to run at a lower frequency.
  3. Busy waiting wastes energy which can reduce battery life or just waste electricity, increase fan noise, and heat my office or my lap.
  4. Hyperthreading means that two logical cores residing on one physical core may be sharing resources, such as execution units or L1 caches. The exact details of the sharing vary dramatically (Bulldozer modules versus Intel hyperthreading) but the fundamental problem remains the same – code executed on one logical core can slow down the other. Calling Sleep(0) repeatedly is particularly bad because it uses a lot of cache and execution resources. However any sort of spinning is likely to have some negative effect on the other logical core.
  5. InterlockedCompareExchange is, by its very nature, a global operation. In order to do an atomic read-modify-write it needs to use bus locking or cache line reservation which requires communicating with – and potentially blocking – every other core in the system.
  6. Even just a normal read (it turns out that a good optimization when doing a spin lock is to do a normal read before attempting the expensive InterlockedCompareExchange) has to bring in the associated cache line. If it is owned exclusively by another thread (if another thread is writing to the same cache line) then there is some cost. This should be minor, and if this is a problem then the false sharing which it implies should really be fixed, but this is another potential risk.

Because of all of these issues it is likely that busy waiting will reduce overall performance on at least some machines, and waste energy on many others. Even if busy waiting improves performance by a few percent on your machine, there are undoubtedly some customer machines (now, or in the future) where it will cause a much more significant slowdown.

CPU Starvation

Our idealized example of N-cores and N-threads trying to run is very tidy. On Xbox 360 it may actually be realistic (some of the “No CPU is an island” warnings still apply), but on a PC it is completely unrealistic. Even on a PC where your game is the “only application running” there is, I’m afraid to say, a huge amount of background CPU activity. This could be anti-virus, services, the system process, various application update mechanisms, or any one of dozens of other processes. This load varies, but can quite easily add up to several simultaneous threads trying to run, at least part of the time.

In addition, on a PC you don’t really know what core your code is running on. It changes constantly and, while you can set thread affinity, on a PC you usually shouldn’t because it just constrains the operating system and means that your code will be affected even more by other processes. As usual, Xbox 360 is an exception here because it requires thread affinities to be set – this works because there are no other processes running.

It turns out that on Windows you can’t even tell how many logical cores are currently available. Sure, you can ask the OS for the core count, and you can ask about logical versus physical cores (although Bulldozer blurs the distinction), but the actual number of logical cores that the OS considers eligible for thread scheduling varies over time. When demand is low Windows tries to keep the secondary logical cores on each physical core idle. When demand is high the OS makes those secondary logical cores available, and this can change multiple times per second.

You also don’t know what priority your threads are running at. You can set your thread’s priority, but Windows dynamically adjusts threads’ priorities in order to avoid live-lock and to improve responsiveness. These algorithms are complicated and change over time, so don’t even think about trying to predict them. Other operating systems will have different algorithms, so any attempt at ensuring that you are not interfering with useful work is doomed to failure.

The net result of this is that sometimes when you busy wait you are causing CPU starvation, and on a PC this is fundamentally unavoidable and unpredictable.


Sleep(0)

A lot of people decide that they are going to busy wait in a polite way: they repeatedly call Sleep(0) in some vague attempt to let other threads run. When you call Sleep(0) on Windows it will either return immediately or it will switch to another thread, and I am of the opinion that both of these results are bad.

If Sleep(0) returns immediately then what we have is a very complicated and expensive busy wait that is touching lots of cache lines and is probably hitting the OS scheduler lock, for no good reason.

If Sleep(0) switches to another thread then we have lost the CPU, paid all the cost of context switches, but we have failed to tell the operating system what we are waiting for. Thus, there is no way for the operating system to wake us up promptly when the lock is released. This means that we will likely sleep for an entire quantum, which could be 10-20 ms, and then be given a chance to try again. Meanwhile the lock we wanted will probably have been released and acquired hundreds of times and we might have to repeat the process again!

The one theoretical case when Sleep(0) might help is if the thread that owns the lock is waiting to run, in which case Sleep(0) might give it CPU time so that it can finish its task and release the lock. However Sleep(0) will only relinquish the CPU to a thread of equal priority, and due to dynamic thread priority adjustments there is no way to ensure that your worker thread has a matching thread priority. And, even if thread priorities match, Windows tries to avoid moving threads between cores, so if the IdealCPU for this other thread is not the core you are running on then your attempt at generosity may be rejected.

It’s worth pointing out that the problems with Sleep(0) are not implementation defects – it is just the wrong function for the job. Sleep(0) is a poor match for waiting on a lock because what you really want to do is to give CPU time to the owner of the lock, if it needs it, but Sleep(0) has no way of knowing which thread you care about! On the other hand, if you wait on a critical section then Windows knows exactly what thread you are waiting on, and it can (potentially, I don’t know if it does) give that thread a boost in order to avoid priority inversions or otherwise shorten the delay.

Sleep(0) is, in my opinion, more about faith than science. It is simply not a reliable or efficient way to do anything on Windows. It will occasionally do something helpful due to sheer luck, but you can never count on that and the odds are against you.


Sleep(1)

Sleep(1) puts your thread to sleep for at least a millisecond. This avoids the problems of busy waiting, but in video games a millisecond is often too long to be napping. And, by default, Sleep(1) actually puts you to sleep for much longer than a millisecond. But that’s a subject for another post. Therefore events and semaphores should usually be preferred over Sleep(1) in a loop – they will wake you immediately upon being signaled.


The YieldProcessor macro (which generally maps to _mm_pause) can help prevent a busy wait from completely overwhelming the system by inserting pauses into the instruction stream. This is particularly important on hyperthreaded systems since it gives the other logical core time to run. If you must busy wait then be sure to use YieldProcessor(). But I’d rather you didn’t busy wait. YieldProcessor will reduce your load on the system, but it doesn’t change the fact that your idle loop is occupying a logical core that might be needed for something more important.

Complicated enough for you?

That’s a lot of complexity. And there is probably more that I have forgotten or I’m not aware of. Writing your own lock requires you to be an expert in all of the issues mentioned above.

Real world example – waiting for thread termination

There are three popular ways of waiting for a Windows thread to terminate. Two of them should never be used.

One method to detect thread termination is to spin in a loop waiting on a volatile variable that the thread sets as it exits. This is a terrible idea for two reasons. One reason is that it is a busy wait, and therefore bad. The other reason is that it doesn’t actually tell you when the thread has exited. It tells you when the thread is about to exit. This makes it a recipe for race conditions.

Another method to detect thread termination is to spin in a loop and repeatedly call GetExitCodeThread until it no longer gives an exit code of STILL_ACTIVE. This is a terrible idea because this is a busy wait, and therefore bad. In some cases the child thread will be starved for CPU time – and unable to exit – because the parent thread is consuming all available CPU time in its wait-for-exit loop. And heaven help you if your thread’s exit code happens to be STILL_ACTIVE…

The one-true-way to detect thread termination is to call WaitForSingleObject on the thread’s handle. When it returns you can be sure that the thread is terminated. Then, and only then, you should close the thread handle. Simple. And no busy waiting required.

Other platforms should offer equivalent ways of waiting for a thread to terminate, such as pthread_join for pthreads. Or in C++ 11 you can wait for a std::thread to terminate by calling std::thread::join.
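As a portable illustration of the one-true-way, here is a minimal Python sketch; join blocks until the thread has really terminated, with no busy wait and no race between "about to exit" and "exited":

```python
import threading

def worker(out):
    out.append(42)  # simulate some work, then return (thread terminates)

out = []
t = threading.Thread(target=worker, args=(out,))
t.start()

# Block until the thread has actually terminated; no flag polling,
# no exit-code polling loop, no race with "about to exit".
t.join()
assert not t.is_alive()
```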

Real world example – thread pool

I worked with a thread pool a few years ago that contained several busy waits. These busy waits may have made sense when the code was written, but as core-counts and hyperthreading increased, some architectural problems were revealed.

This system was based around using an event to signal when a new job was available. The thread that added the job (we’ll call it the master) would call SetEvent(). The nature of events is that this would wake up all of the worker threads, and the nature of using events for a thread pool is that the thread pool lock had to be held by the master when calling SetEvent().

It turns out that when a thread is woken up after waiting, Windows typically gives it a small priority boost. This means that the worker threads would probably be running at a higher priority than the master thread. Therefore, if there were enough threads in the thread pool (and on hyperthreaded systems ‘enough’ is equal to the core count or less) then the master thread would be switched out so that all of the worker threads could run. However the master thread still holds the lock, so the worker threads would all try to acquire the lock and fail. In the best-case scenario this means there are a huge number of context switches before any work can get done.

In one version of this system the lock was a pure spinlock. So, the worker threads would keep trying to acquire the lock, and keep failing. In the purest form of this flaw we had a six-core twelve-thread CPU with eleven worker threads. Six of the worker threads would get scheduled (Windows tries to leave the hyperthreads idle) and would try to acquire the lock, thus starving out the master thread. After a few quanta had expired Windows would either allow more threads to run, or would decay the priority of the worker threads so that the master thread could finally run, and release its lock. It was common to see 20-40 ms sequences where six cores were spinning in a lock-acquire busy wait, and no work was done!

The xperfview capture below shows a call to SetEvent that took 12.434 ms due to this issue, during which time all cores were busy waiting.


Using a semaphore instead of an event ensures that only one worker thread is woken up at a time, and also allows the worker thread to be triggered when the lock isn’t held. This avoids all unnecessary context switches. Switching to a lock that doesn’t spin endlessly also avoids the worst-case scenario.
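The fixed design can be sketched in miniature: the master queues a job under the lock, releases the lock, and only then releases a semaphore, so exactly one worker wakes per job and the lock is free when it does. A hedged Python sketch (the names and structure are illustrative, not the original thread pool):

```python
import threading
from collections import deque

jobs = deque()
jobs_available = threading.Semaphore(0)  # counts queued jobs
queue_lock = threading.Lock()
done = []

def worker():
    while True:
        jobs_available.acquire()      # exactly one worker wakes per job
        with queue_lock:              # the master is NOT holding this now
            job = jobs.popleft()
        if job is None:               # poison pill: time to exit
            return
        done.append(job())

def submit(job):
    with queue_lock:
        jobs.append(job)
    jobs_available.release()          # signal AFTER releasing the lock

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()
for i in range(8):
    submit(lambda i=i: i * i)
for _ in workers:
    submit(None)                      # one pill per worker
for t in workers:
    t.join()
```

Because the semaphore is released after the lock is dropped, a woken worker never immediately collides with the master, and because a semaphore wakes only one waiter per release, the thundering herd of an event-based design is avoided.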

Real world example – folly::MicroSpinLock

This lock uses a one-byte flag and no semaphore. It spins for 4,000 iterations (using ‘pause’ – excellent) before trying the lock again, and there is no apparent limit to how long it will busy wait. Therefore this lock can lead to live-lock, wasted power, CPU starvation, reduced CPU frequency, and reduced GPU frequency. The only advantage I can see of this lock over a futex is that it is very small (there is also a one-bit version) and it is properly constructed through zero initialization.

The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.

Profiling challenges

Busy waiting means that wait analysis doesn’t work, which means that a valuable profiling tool is lost.

Spin a little bit

Spinning is not entirely morally bankrupt. If there is a lock that is typically only held for a few hundred cycles and if the threads in question are usually simultaneously running on different cores, then spinning for a little while is just fine. The key is understanding the odds of the two threads being on different cores, and settling on a reasonable definition of “for a little while”. You should spin long enough to give yourself a reasonable chance of acquiring the lock, but no longer. And, if you never succeed in acquiring the lock by spinning then you should give up and never spin again.

Windows critical sections provide an option to spin, and they even appear to have an (undocumented) flag that will dynamically adjust the spin count.


If you do your own spinning then you should monitor the results to make sure that it is genuinely helping, and you should absolutely cap the spinning at some small time, probably a few thousand cycles. Remember, the only reason to spin is to avoid the cost of a context switch. Unless you measure how long you are spinning and how often the spinning works then you can’t be sure that you are getting any benefit. An unmeasured optimization is no optimization at all.

If you write your own spin loop then you should definitely use YieldProcessor() and you should do normal reads in your loop to see if there is any chance of acquiring the lock before trying the heavyweight InterlockedCompareExchange.
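The shape of a capped spin can be sketched even in a high-level language: try the lock without blocking a bounded number of times, then fall back to a real blocking acquire so the OS knows what you are waiting for. In C or C++ the cheap try would be a plain read plus InterlockedCompareExchange with YieldProcessor() in the loop; Python's GIL makes genuine spinning pointless, so treat this purely as the algorithm's outline (all names are illustrative):

```python
import threading

SPIN_LIMIT = 100  # cap the spin; never dozens of milliseconds

def acquire_with_spin(lock, spin_limit=SPIN_LIMIT):
    """Try briefly without blocking, then block so the OS can wake us."""
    for _ in range(spin_limit):
        if lock.acquire(blocking=False):  # the cheap "try", like a CAS
            return "spin"                 # acquired while spinning
    lock.acquire()                        # heavyweight wait; OS knows why
    return "block"

lock = threading.Lock()
how = acquire_with_spin(lock)  # uncontended: succeeds on the first try
lock.release()
```

The return value makes the monitoring advice concrete: count how often "spin" wins versus "block", and if spinning almost never wins, stop spinning.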

However I am not impressed with the value of spinning. Spinning while trying to acquire a lock only makes sense if the lock is lightly contended for. If that is the case then the benefit is likely to be modest, due to the light contention. However the risks remain high, should you get it wrong.

If you do spin, keep it short. Spinning for dozens of milliseconds just makes me angry. You wouldn’t like me when I’m angry.

In conclusion…

Please don’t busy wait. It will often harm performance, while offering modest gain. It will reduce battery life and waste power. It has never been a worse idea than right now, but it will become an even worse idea. There are better ways of improving multi-threaded performance.

If you do busy wait, be sure to cap how long you do it, and do normal reads and YieldProcessor() in the wait loop.

Don’t use Sleep(0).

For best and simplest results use a CRITICAL_SECTION, futex, Benaphore, or the equivalent on your platform. Don’t use this lock. User-mode mutexes like these are extremely fast in the uncontended case, and if you really want to you can specify a spin count. If you use a CRITICAL_SECTION then it will normally be acquired instantly, at very low cost, and in those rare cases when you do have to wait you will be rewarded on reawakening with a priority boost and a smug sense of satisfaction at not wasting CPU time. If a lock is heavily contended then you should address that problem (separate memory pools anyone?) instead of wasting CPU cycles for efficiency.

To paraphrase mothers everywhere, if you don’t have anything useful to do, don’t do anything.

  • Spinning in a loop
  • Can be simple and tempting
  • But mother says no

Why I Play, Why I Create

Original Author: Claire Blackshaw

Dev Bot has a formula for The Game


Several weeks ago, triggered partly by work events and partly by the Diablo III release, I started really thinking about why I play[1]. What is Play to me, and how can I make games that speak to my soul and are worth my time in crafting? Sounds grandiose, so let’s simplify it.

Which games do I enjoy playing and feel good for having played?

I force myself to purchase, play and dip into the latest games for research and to stay on top of the curve, like a responsible designer. At first I thought my lackluster enjoyment was simply gaming fatigue from the industry, but then I started questioning gaming as a pastime.

I started examining films, television and interactive storytelling such as Pottermore and HomeStuck. Why do I feel enriched by certain novels but feel dirty when logging off a few hours of Diablo? Why does sketching for hours feel satisfying, and why does building a blocky Minecraft world with my partner fill my heart with joy?

After dissecting the games I personally enjoyed and connected with, two groups stood out to me: Personal Pieces and Emergent Experiences.

The Personal Pieces I enjoyed were almost all short, light on gameplay and the creative work of a single person or small team. These pieces must be completable in a single sitting and not frustrate or block me at any point. Whether using touch mechanics, gentle mouse interactions or a single button, the gameplay should enhance the storytelling and not break it. Less is more.

Studio productions fail at these personal pieces because they are too often meddled with, pulled in different directions and stretched thin to pad out a required length of play time. A good film parallel often mentioned is Citizen Kane, where Orson Welles famously controlled every frame[2]. Authors should also be conscious in this medium that gameplay is secondary and only serves to reinforce the story and that this is totally okay. The key to this group is a personal connection to the audience, a fearless display. As a designer you need a story to tell and the ability to express yourself as a creative force.

The second type gives producers and experienced game developers the willies. To quote a recent colleague, “I don’t know how to design this”. Well, you can’t! Sorry, but you need to have an idea, let it loose and ride it like a raging shark-whale-bear sliding down the mountain. The trick is gently guiding the design and pruning wayward branches. These games have systems and rules which interact in ways the developer cannot predict, and provide a unique, unbalanced experience for each player.

Also, while these games may provide a goal or direction to the player, they should allow the player to explore their own space. These games are also truly social: not spamming Facebook, but moments which we share with our friends. The Minecraft world we prize, the X-Com team’s death moments after leaving the ship, and physics-based explosions of doom.

You are not going to be able to boilerplate-produce these games or stamp out several carbon copies. Middleware and established engines will not help you, but instead hinder you, in the creation of these creative emergent games. You need to inherently understand the rules and game systems and be willing to experiment with them. As a designer, to create these games you need to be able to prototype and understand these systems. This is where the programmer-designer will flourish, even if your code is only prototype quality; see Spelunky.

Well, I hope you enjoyed this little look into my mind and this exploration of why games matter to me. It’s a personal take, not intended to be a grand treatise on the theme, but instead a single viewpoint. If you’re not making games which matter to you personally, then is there any meaning to your creation?

If we are to continue to make games meaningful we need to explore new game structures, systems and not be afraid to be personal and fight for every frame.

[1] The irony of this posting slot falling on E3 does not escape me.

[2] I’m not going to get into the game which makes you cry, I think we already have brilliant games but Citizen Kane is a great example of a big production under the tight control of one person.

SQL Server: Text and Document Retrieval

Original Author: Ted Spence

Once you begin using SQL, it’s very natural to just put every data object you have into your database. However, one of the important lessons of SQL is that not all data needs to be “relational”; just because SQL Server can handle virtually any data type doesn’t mean it’s the ideal solution for those data types. This article is about what happens when you need to start storing large volumes of images, text, XML documents, JSON files, or other documents.

The Natural Tendency to “Grow”

Generally, a server-based data project begins by identifying a need for a data storage system and adding a data repository. When you start, you’re free to choose between SQL Server, MySQL, CouchDB, and whatever else is trendy. You start by looking at all the data you intend to use, and you identify the correct solution for storing that data. For the first few months of the project, all is well, performance is good, and you’re using your data store exactly as you intended.

While your project continues to be a success, you’ll keep adding features. Over the next few months you’ll add new data types, new secondary tables, affiliated data feeds, and so on. This phase of your project likely differs from the beginning: each feature by itself isn’t large enough to justify a radical rewrite. Even if you have enough time to consider complex solutions that involve mixing SQL and NoSQL databases, it’s not a good idea to give the IT management team too many disparate systems to supervise. You shouldn’t just be optimizing your code; you should be optimizing the systems management requirements and complexity.

I often visit companies who have huge SQL databases and ask them why they were built in that particular fashion. One company might have a terabyte-sized database filled with image files but only a few hundred MB of actual relational data; another firm might have a hundred gigabytes of HTML text stored in a custom content management system. In most cases the companies will tell you that they simply started with the database and kept growing.

Of course, on the other side of the fence are the companies that continuously change their database architecture. These firms are able to rapidly throw out an entire data store when they think the data storage system is sub-optimal, then switch back when circumstances change again. If you have the time available, doing these kinds of refactoring projects can be excellent learning opportunities; but for the rest of us, I have some tips for managing large data growth.

Storing Large Blobs in SQL

Your SQL database has tons of great features for large data support. Given that most software needs to store text, comments, or XML / JSON documents for rapid retrieval, let’s begin with varchar(MAX). (The older text and ntext types served the same purpose, but they’re deprecated now, so let’s focus on only this one.)

Some of you may recall my previous article where I mentioned that removing varchar() columns was a useful way to improve performance. This is still correct: extremely high-performance tables are ones that omit all variable-length columns and pack data extremely densely. Density is the reason you want to take all your variable-length text and split it off into a separate system.

As an example, I once received a report that the latest minor point release of our codebase had a terrible performance regression bug. The point release didn’t contain a lot of significant changes, but the database nearly ground to a halt whenever the system retrieved data about a game. Naturally, the first thing you should do is to compare performance before and after the change. I identified one core stored procedure, “get_GameData”, which was the culprit; looking at my application’s debug logs I would see the following:

(old version)
  SQL: get_GameData (Time: 00:00:00.0020000)
  (new version)
  SQL: get_GameData (Time: 00:00:00.0060000)

The next step was to trace the query’s execution plan. It was a compound query that touched about 10 different tables; all of those tables were fast except for one: the table that listed the game’s release date. Why would that be a performance drain? I decided to take a look:

CREATE TABLE game_release_dates (
      release_date_id INT NOT NULL PRIMARY KEY IDENTITY(1,1),
      game_id INT NOT NULL,
      release_date datetime NULL,
      found_by_researcher_id INT NOT NULL,
      researcher_comments text NULL -- New column added in this point release
);

Performance problem spotted! Someone had slipped a minor feature into the software to track user comments. However, this “researcher_comments” field was sparse: it was generally null but in a few dozen instances it contained a “copy-paste” of the contents of a publisher press release verifying the release date of the game. Because this table was touched in hundreds of areas, it contributed to a general slowdown.

Use Case: User-Generated Comments

The reason this performance problem existed is because the data was sparse and inefficiently stored. There’s nothing wrong with using varchar(max) for comments, but out of all the dozens of systems that needed to use the release dates table, only one of them – the researcher user interface – needed to see the comments. One option at this point is to migrate all the comments to a completely different system not part of SQL Server; that’s certainly possible, but here’s another alternative pattern, the comments table:

CREATE TABLE user_comments (
      comment_id INT NOT NULL PRIMARY KEY IDENTITY(1,1),
      -- This is an ID number of the "table" being commented on - e.g. 153 might refer to "release dates"
      table_id INT NOT NULL,
      -- This is the ID number of the "object" being commented on - e.g. game #124721387
      record_id INT NOT NULL,
      -- This is the User ID writing the comment.  Maybe you want to use an Active Directory account name?
      user_id INT NOT NULL,
      comment_date datetime NOT NULL,
      notes nvarchar(MAX)
);

This is a pattern I’ve been using for about twenty years now. It’s simple, useful, and it allows the comments table to be extremely dense, even though comments may be distributed sparsely. If you only have a few dozen comments, and those comments are only used in a few rare cases, you’ve basically eliminated the overhead for all other systems that don’t require them.
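The pattern is straightforward to demonstrate in any SQL engine. Here is a sketch using Python's sqlite3, with the column names from the table above; the table_id value 153 standing for "release dates" is the made-up code from the comment:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE user_comments (
        comment_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        table_id     INTEGER NOT NULL,   -- which table is being commented on
        record_id    INTEGER NOT NULL,   -- which row in that table
        user_id      INTEGER NOT NULL,
        comment_date TEXT NOT NULL,
        notes        TEXT
    )""")
# An index on (table_id, record_id) keeps lookups cheap, while the hot
# release-dates table stays free of sparse variable-length columns.
db.execute("CREATE INDEX ix_comments ON user_comments(table_id, record_id)")
db.execute(
    "INSERT INTO user_comments (table_id, record_id, user_id, comment_date, notes) "
    "VALUES (153, 124721387, 7, '2012-06-01', 'Press release confirms the date')")

# Only the researcher UI ever runs this query; every other system
# touching release dates never pays for the comment text.
rows = db.execute(
    "SELECT notes FROM user_comments WHERE table_id = 153 AND record_id = 124721387"
).fetchall()
```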

Use Case: Document Retrieval

Another area where SQL is often tasked to help out is document retrieval. You may have a large collection of XML documents – perhaps they represent game turns, or content management system text, or zone data files. If you already have SQL Server in place it’s likely more efficient to use SQL Server to index these documents rather than to add a secondary system dedicated to key-value-pair document retrieval.

That advantage is likely to be squandered though if you actually store the data directly in SQL Server. Your database server is powerful only as long as it has free RAM available to cache high-frequency data. Here’s what the bottleneck problem looks like:

[Figure: sequence diagram of SQL Server serving images directly. Why would you store your images here?]

The critical performance improvement you can make is to store big objects elsewhere, and index them in your database. We can use secondary data stores for anything that’s going to be 1) Large, 2) Unstructured, and 3) Non-relational. Ideally, for large objects like these – especially for images, XML, JSON, and so on, you want to cut out all your internal overhead. You can store this data directly on a high performance filesystem and identify each document using a URL schema like this:


This way, you only need to retrieve the integer item ID from the database, and you can tell the client application how to retrieve that object directly. Some people prefer GUIDs for these IDs, but they’re 50% less storage efficient than 64-bit ints. Unless I have a specific need for GUIDs, I tend to go for 64-bit int objects.

Type of Data | Usage Pattern | Secondary Storage
Item thumbnails | Rarely changed, but accessed continuously | Use a CDN, like Akamai, AT&T, or EdgeCast. These are the most effective rapid delivery mechanisms for small files.
User-created images such as character images, icons, guild logos, etc. | Changed frequently, accessed regularly | this guy’s solution.
XML or JSON data for multiplayer games | Changed dozens of times over the duration of a game, accessed once for each revision | Grant your front-end web servers access to a shared redundant data store like Amazon S3, and enable reads and writes to that location. Store each file using a GUID filename. Create a back-end pruning task to eliminate data when it exceeds your age threshold.

A schema for a relational SQL table that links to a back-end secondary data store should look something like this:

CREATE TABLE items (
      item_id BIGINT NOT NULL PRIMARY KEY,
      item_name nvarchar(100) NULL,
      item_class_id INT,
      item_thumbnail_id BIGINT, -- or GUID
      item_fullimage_id BIGINT -- or GUID
);

With this in mind, you will keep your database server working on the small subset of your data that must be relational, and leave the large volume bit pushing to the content distribution network. When it comes time to link to an item thumbnail or full image, you can just generate the URL and retrieve it from the CDN. Alternatively, if you’re writing a client side application, write a startup script that checks a local cache against all images on your CDN at once.
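A minimal sketch of that split, assuming a hypothetical CDN URL prefix and using the local filesystem as a stand-in for the blob store: the database keeps only the integer ID, and the URL is generated from it.

```python
import os
import sqlite3
import tempfile

BLOB_ROOT = tempfile.mkdtemp()               # stand-in for a CDN / file store
CDN_BASE = "http://cdn.example.com/thumbs"   # hypothetical URL prefix

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (item_id INTEGER PRIMARY KEY, "
           "item_name TEXT, item_thumbnail_id INTEGER)")

def store_thumbnail(item_id, name, data):
    """Write the blob to the file store; keep only its ID in the database."""
    thumb_id = item_id  # in a real system an allocator would hand these out
    with open(os.path.join(BLOB_ROOT, "%d.png" % thumb_id), "wb") as f:
        f.write(data)
    db.execute("INSERT INTO items VALUES (?, ?, ?)", (item_id, name, thumb_id))
    return thumb_id

def thumbnail_url(item_id):
    """The database returns an integer; the client fetches from the CDN."""
    (thumb_id,) = db.execute(
        "SELECT item_thumbnail_id FROM items WHERE item_id = ?", (item_id,)
    ).fetchone()
    return "%s/%d.png" % (CDN_BASE, thumb_id)

store_thumbnail(42, "Sword of Density", b"\x89PNG...")
```

The relational row stays a handful of fixed-size integers, so the table remains dense, and the heavy bytes never pass through the database server's RAM cache.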

Where do you go next?

Pretty much all the tricks I’ve shared today are related to moving data aside so it doesn’t impact critical high performance systems. When you focus on building extremely dense data tables, SQL Server can become almost as fast as constructing your own B-Tree in memory (because that’s actually what SQL Server does behind the scenes). As long as you keep in mind the side effects of your design choices, you should be able to continue to tune your database for extremely high performance.

Read my lips: No more loading screens

Original Author: Niklas Frykholm


I’ve already talked a bit about how we manage resources in the Bitsquid engine in two earlier posts.

Today I want to focus on a particular aspect of resource management that I haven’t discussed in any great detail before: streaming.

Streaming is good for two things:

  • Getting rid of loading screens.

  • Showing more detail/variety than we can fit in memory.

This is 2012, and according to the Mayans it is the last year anyone should have to be forced to look at a loading screen. Since I’m not directly involved in games production, but only make an engine, I can afford to make such broad, sweeping statements.

A quick recap

Since those posts are quite long reads, let’s start with a quick summary of how our resource system works.

A resource in the Bitsquid engine is a piece of data uniquely identified by its name and type. For example:

Type Name
unit units/beings/player
texture vegetation/grass/tall_grass_01
wav music/soft_jazz

The resource files are created by our tools. They are human-readable files written in a JSON-like format. Before they can be used in the runtime, they need to be compiled. The data compiler compiles each resource into a platform-specific, optimized binary blob.

For loading and unloading, the resources are grouped into packages. A package (itself a resource) is a list of resources that can be loaded and unloaded together.

In a very small game, the entire game could be a single package. In a larger game a package could contain all the resources for a particular level, or a tile set that is used in multiple levels. The gameplay programmers decide when to load and unload packages.

For the final release of the game, the packages are converted to bundles. A bundle contains all the compiled resources in a package concatenated together into a single file which is compressed by a stream compression filter. The engine loads the entire bundle in one go, without seeking. This is crucial for optical media, but it also really speeds up hard drive performance.

Getting rid of loading screens

To get rid of loading screens we can use package streaming. By this I simply mean the ability to load new packages in the background while the engine is doing other things. This means that when the player is approaching a new area in the game, we can start downloading that data in the background, while she continues playing. When she arrives at the area, the data is already in memory and she can proceed without having to wait for a loading screen.

In the Bitsquid engine, packages are always loaded in the background, so package streaming is enabled by default. The gameplay programmer can tell the engine to start loading a package. The loading is handled by a separate background thread and the gameplay programmer can poll the engine every frame to determine if the loading has completed or not. When the loading is done she can start to use the new data.
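That contract (kick off a background load, then poll once per frame) can be sketched like this; Python's concurrent.futures stands in for the engine's background loading thread, and all names are illustrative rather than Bitsquid's actual API:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def load_package(name):
    time.sleep(0.05)  # simulate reading and decompressing a bundle
    return {"name": name, "resources": ["unit", "texture", "wav"]}

executor = ThreadPoolExecutor(max_workers=1)  # the background loading thread

# Gameplay code: start the load, then keep running frames.
pending = executor.submit(load_package, "levels/jungle")

frames = 0
while not pending.done():  # poll once per frame
    frames += 1            # ...render, update, play: no loading screen
    time.sleep(0.01)

package = pending.result()  # safe: the load has completed
executor.shutdown()
```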

It is up to the gameplay programmer to decide what the engine should do during the loading time. She can choose to stall the engine (ugh), show a loading screen (semi-ugh) or do something more interesting (yes, 2012!).

There are many different ways of organizing and structuring packages for successful streaming and the engine lets you use whichever method you prefer. Different solutions might work for different games. A game with linear progression could have a package for each “stage” and trigger download of the next one when the current is almost complete. An open world game could have a package for each section of the map, together with additional packages for “special” locations. A game with random encounters could have separate packages for the different encounters you can run into.

The fact that the engine doesn’t have a hard-wired streaming model gives the designer a lot of power and flexibility. But with that power also comes a greater responsibility. The designer must bear the entire burden of setting up the packages correctly and deciding when to load them and unload them.

Perhaps in the future, we will pick one or two streaming models as “standard solutions” and provide some convenience functionality for them. You will of course still have the option to drop into “manual” mode for full flexibility if needed.

Showing more detail

For showing more detail than can fit in memory we use something that I for lack of a better word call resource streaming. What resource streaming means is simply that for certain resources, we don’t load the entire resource into memory when we load its package. We only keep the resource partially in memory and stream bits and pieces of it in and out as needed.

The simplest example of a streaming resource is perhaps a video file. Video files can contain hundreds of megabytes of data and we don’t want to load all that into memory as one big blob. Instead we want to stream it in, frame by frame, as the video is being played.

Remember what I said in the recap, that each resource gets compiled into a platform specific binary blob. Actually, I lied. The truth is that each resource gets compiled into two blobs:

The first one is the memory-resident blob. That is the one we already know about. It goes into the bundle, gets loaded into memory by the background thread, etc.

The second one is the streaming blob. It contains data that shouldn’t go straight into memory but instead be accessible to a streaming manager that will move the data in and out of memory as necessary.

Not all data types produce a streaming blob when compiled. In fact, most don’t. Only types that use resource streaming (such as video) output any streaming data.

It is up to the data compiler for the specific content type to decide what goes into the memory-resident blob and what goes into the streaming blob. For example, the video compiler puts all header information (number of frames, content size, frame rate, etc) in the memory-resident blob, and just the raw frame data in the streaming blob. That way we can know the size of the video and other useful information without pulling in any stream data.

If you wanted to build a mega-texture solution on top of this (we haven’t), you would put the lowest MIP level in the memory-resident data, together perhaps with some kind of index that tells you how to quickly locate data in the stream.

For voice data, you could put the first 100 ms or so of the sound into the memory-resident area and the rest into the stream. That way you can start playing the sound immediately when it is triggered, without waiting for any data to be streamed in.

All the streaming blobs for a particular package are put into a streaming bundle that ends up next to the ordinary bundle for the package. The offset and size of the streaming data for each resource gets stored in the ordinary bundle together with the memory-resident data for that resource. That way, each resource always knows where to find its streaming data in the streaming bundle.
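The bookkeeping described above is easy to sketch: concatenate the streaming blobs into one bundle and record each resource's (offset, size) pair alongside its memory-resident data. A minimal Python illustration (not the engine's actual format):

```python
import io

def build_stream_bundle(streaming_blobs):
    """Concatenate streaming blobs; return the bundle plus an (offset, size) index."""
    bundle = io.BytesIO()
    index = {}
    for name, blob in streaming_blobs.items():
        index[name] = (bundle.tell(), len(blob))  # stored with resident data
        bundle.write(blob)
    return bundle.getvalue(), index

def read_stream_data(bundle, index, name):
    offset, size = index[name]  # each resource knows where its data lives
    return bundle[offset:offset + size]

blobs = {"video/intro": b"FRAMEDATA" * 3, "sound/jazz": b"PCMPCM"}
bundle, index = build_stream_bundle(blobs)
```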

Unlike the ordinary bundle, we expect the stream bundle to be accessed randomly. For that reason it should preferably reside on a hard drive rather than on optical media. For the same reason, we do not compress the stream bundle. Most of the resource formats that you would want to stream already have built-in compression (video, sound, texture). For other resources, you can always add whatever kind of compression you want when you compile the data.

When a bundle is loaded, we open the corresponding stream bundle and keep the file handle around for future reference. Any system that wants to access the stream data can use this handle to (asynchronously) read from the stream bundle.

We don’t provide any general system for deciding when to stream in this data, how much to cache in-memory and when to throw it out. Instead we leave that up to each individual system that supports resource streaming. A video streaming solution will have very different ideas about how to cache data than a texture streaming solution. Forcing them both to adhere to the same model would just complicate and suboptimize things.

With the combination of package streaming and resource streaming you can cover almost any imaginable streaming scenario with a model that is both simple and flexible.

This has also been posted to The Bitsquid blog.