Channel API

Original Author: Noel Austin

I’m sure you’ve heard of Google App Engine – it’s a web hosting service that allows you to write server software in Java, Python or Go. Overall I think it’s a great piece of technology that helped me understand the concept of writing scalable services; it was also a useful reason to finally learn Python. Anyway, I’m currently interested in its suitability for realtime applications like games.

The traditional web model is request/response based. The client requests a resource from the server, and the server responds with some data. That’s the end of the transaction. In this model there is no way for the server to update the client without first receiving a request from that client. If you want to build a multiplayer game then being able to push updates from the server is likely to be very important.

It can be simulated either by regular polling from the client (which does not scale well) or by some fairly hacky techniques falling under the umbrella of the Channel API.

Test App Here!

Overall I was disappointed with the results, which indicate to me that if you want any kind of fast response game then you may have to look elsewhere.

  • Very unpredictable latency (200-1000ms)
  • Big packet overhead of HTTP headers etc
  • No peer to peer support at all (limitation of current web technologies)
  • No support for broadcast messages (server must send to each client in separate call)
  • Sending several messages at once often results in a massive increase in latency

That said, I do think the technology is quite promising and certainly opens up the possibility of doing whole classes of games with two way communication: strategy games, or other games where one player does not directly affect another. Adding chat, for instance, is a perfect use case for this. If you are thinking about using Channel then I found out a few things that might be useful to you.

  • Reliable – in all my tests all messages eventually arrived
  • Out of order – messages are not guaranteed to be in sequence so if this is important you need to handle your own sequence checks
  • Multiple instances – don’t forget the service may spin up multiple instances to handle the requests
  • Development server is much worse than the live server (on Windows at least)! It seems to serialize all messages sent with approximately 600ms between each one.
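
For the out of order point, here is a minimal sketch of the kind of sequence check I mean. It is written in C++ purely for illustration and is not the code from my test app. It assumes the sender stamps every message with a sequence number starting at 0; the Resequencer class and its Receive method are hypothetical names, not part of the Channel API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: buffers messages that arrive early and releases them
// strictly in sequence-number order. Assumes the sender numbers its
// messages 0, 1, 2, ... with no gaps.
class Resequencer {
public:
    // Accepts one message and returns every message that can now be
    // delivered in order (possibly none).
    std::vector<std::string> Receive(std::uint32_t seq, std::string payload) {
        std::vector<std::string> ready;
        if (seq < next_) {
            return ready;  // already delivered earlier: treat as a duplicate
        }
        // Has no effect if this sequence number is already buffered.
        pending_.emplace(seq, std::move(payload));
        // Drain the buffer for as long as the next expected number is present.
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++next_;
        }
        return ready;
    }

private:
    std::uint32_t next_ = 0;                        // next number to deliver
    std::map<std::uint32_t, std::string> pending_;  // early arrivals
};
```

If message 1 arrives before message 0, the first call returns nothing and the second returns both messages in order. Since messages were reliable in my tests, a buffer like this should eventually drain, but a real game would probably also want a timeout in case one never turns up.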

I checked the latency from my office in the UK and found it to be very unpredictable and often slow. If anyone from other parts of the world could test and report back your average latency that would be awesome! It’s entirely possible that I’ve screwed something up which is causing the poor performance; if anyone has any comments on this it would be great to hear them.

Addendum

After writing this I became interested in comparing the Channel API with WebSockets. I took this opportunity to play with Nodejitsu, who provided one free app and a fantastic tool to deploy your app. Unsurprisingly, the results are much better than with Channel. My average latency tends to hang around 200ms, but there are no spikes or degradation when sending a lot of messages.

You can test my quick and dirty app here.

It is not really a fair comparison, because the back ends are designed for different purposes. The GAE server uses persistent storage to support multiple instances, whereas the node.js version will not scale. The Channel API is also fully supported on all browsers/platforms whereas support for WebSockets is still fairly thin on the ground.

This is a cross post from my blog.

Embrace Freemium

Original Author: Tom Gaulton

Like it or not, the freemium gaming model (i.e. games are free and the money is made via in-app purchases) appears to be here to stay. I’ve seen a lot of concern voiced recently that freemium content is driving down the quality of games, but I don’t believe that’s necessarily true. There are problems that need addressing – but let’s not throw the baby out with the bathwater.

A Brief History of Freemium

Take away the freemium model and you’re left with users having to pay for games up front. With traditional disk-based media that worked: the barrier to entry meant that there were relatively few titles on sale, so it was possible to get a game noticed given a decent advertising push and by word of mouth – and that meant publishers could afford decent budgets and take a few risks. The business model wasn’t perfect, but it sufficed. Innovation wasn’t always top priority, but the quality bar was fairly consistent.

Fast forward to the age of downloads, in particular the AppStore, and the barrier to entry for developers was suddenly much lower – almost anyone with a computer and a compiler could release a game. Initially this seemed like a good thing: there was much excitement about “indie gaming” and a few notable success stories. But soon the market was flooded and developers began undercutting each other on price in order to get sales.

This process, commonly dubbed “the race to the bottom”, now seems to have run its course, with the majority of games now in the $1-2 bracket, and that has warped the general perception of value. Paying $5 for a game may have seemed cheap once but now looks expensive against the backdrop of bargain bucket titles – despite the fact that people will happily fork out more than that for a latte and a sandwich. As a result it has become incredibly risky to invest in making a high quality game for the mobile and tablet market. Even if your $5+ game is awesome, the chances of enough people finding it and paying for it are slim. They’re more likely to download half a dozen separate dollar titles and conclude that all mobile games are rubbish.

That’s where freemium enters the frame. If you can give your game away for free there’s nothing (bar the initial download) to stop people trying it out – sure, the market is still flooded with choice, but you’ve improved your chances dramatically. Once you’ve got people hooked on your game, then you can start billing them for items, and make your money that way instead – and because you’ve already hooked people in you can charge them a lot more than the minimum app price that people have come to expect for paid apps. Recently we’ve seen a few games exploit this system to make large sums of money.

So what’s the problem?

Notice I used the word “exploit” in the last sentence? Well, that’s the problem with freemium at the moment. Games don’t simply draw you in and then ask for a single payment to continue playing; they attempt to sell you a whole plethora of in-game items, often using devious means that encourage addictive behaviour and disguise the true cost. A while back I played a game called Paradise Island on Android that was a perfect example; you could buy special buildings for your island, but the way you did so involved so many different forms of currency along the way that it wasn’t immediately obvious what the cost would be. I once sat down and calculated that the true cost of a single building in their Halloween pack was a staggering $50.

It’s this addiction-feeding style of game which I believe has given freemium such a bad name. After all, if you look back to the era when magazine cover-disks were the norm you realise that freemium has been with us in spirit for decades, even if the buzzword hasn’t.

Can the problem be fixed?

I think we’re beginning to see light at the end of the tunnel. Some recent games have started to implement less aggressive forms of in-app purchasing – for example Triple Town works by limiting the free version to a fixed number of ‘turns’ per day, and gives you the option to purchase unlimited turns for a fixed fee. It still has the option to buy items to essentially cheat your way to victory, but at least you can opt out of that and make a one off payment. It’s a step in the right direction.

Perhaps more importantly though, the Japanese authorities have just stepped in to ban the “kompu gacha” mechanic, whereby players pay a small fee for the chance of unlocking a rare item (in essence, a lottery), but can only do so after unlocking a whole set of other, more common items through the same mechanic. It’s only one specific mechanic that they’ve banned, and it’s only the tip of the iceberg, but the case has drawn a lot of attention from the press over the last few days, and looks like it might be the tipping point that wakes regulatory bodies, and players themselves, up to the devious way in which players have been manipulated.

Nice Freemium

Much as the makers of yoghurt products keep telling us there are “bad bacteria” and “good bacteria”, I believe there can also be “evil freemium” and “nice freemium”. The evil freemium games are built purely to extort money from their players, while nice freemium games are good honest fun that just happens to come in a freemium package.

I doubt we’ve seen the back of evil freemium just yet, but rather than despair at the state of the freemium market, why not try making your own game freemium – just make sure it’s nice freemium.

Why I went back into the studio……

Original Author: Kevin Dent

I LOVE working in the studio, I really do. I love the freedom it affords me. I love trying to create games that I want to play!

I also really love having a cerebral cortex, so I left the studio life and learned biz magic.

I make a really good living in biz, I LOVE doing what I do now; I get to work with amazing people, I get to talk to amazing people. For the love of god, I spoke to the creator of Fruit Ninja tonight to have a “chat”. How cool is that? Dude totally rocks btw.

My gig allows me to talk to my heroes. Seriously, I love playing video games that much.

The most common thing people say to me is “….wow, you are a business guy and you love playing video games?”

When I stepped out of the studio, I made myself the pledge that I would only work on titles that touched me deeply. I would only work on games that I personally wanted to play. I would be way richer if I just kissed ass and decided to suck it up for the lord, god, almighty dollar!

If I do say so myself, I was pretty decent on the creative side of things too, but to be brutally honest; I was running a studio that sucked at business.

So one day I stopped. One day I decided that I would step out of the studio. I put a 22 year old in charge of it and for 7 mobile games -feature phones- he was shit. Then he just flipped the page and was brilliant.

Recently, I met a guy called Jason Brice. He sent me @ messages constantly, he was really cool and then one day I saw the maps he made in an FPS.

OMFG

They were great!

I looked at them -I hated the crane on the harbor map- BUT I loved the game itself.

Jason was creating the game that I wanted to play.

His view was that there were way too many layers between the player and AK47′ing another guy in the face.

He had me at “hello”.

So as I play this game, I talk about it, I reveal things about it and then I talk about it again.

Simply put, THIS IS GAMING!

I listen to an average of 14 game pitches a week, this was one of the first game pitches that I have seen that melted my resolve. Here I thought that I knew everything and these noobz are teaching me how to love video gaming again.

I want to be fair, I want to be honest and I do not want to be a prick.

BUT I am sick of the modern day FPS titles or as I like to call them “We just came up with another feature that allows us to get you to buy another sequel”………….. ok that is a long title.

I am sick that on NPD day we all shiver with anticipation at how many people bought our game. Here is a novel fucking idea, I am sick with anticipation at the thought of how many people enjoyed our game.

The game is ReKoil BTW. BUY IT PLUX

I am my most vulnerable when I am sitting in front of a metrics screen looking at the numbers, asking, wanting, no begging video gamers to like my title.

That is weak sauce, I want them to like me. I want them, no I crave that they like me!

It is the vulnerable essence of every video game maker. It is that vulnerability that allows me to exist.

This very insecurity, allows me to make games that I want to play, this insecurity allows Jason and the team to jump off a cliff and trust the fact that someone on the team will catch him.

Guess what? When someone that you consider a blood friend jumps off a cliff; you always catch them.

I am proud of what we are making today, I love that we are participating in the conversation! Will we beat Black Ops 2? God no! Their budget is 60X what people say that we are worth, but we will be participating in the conversation.

We are basically fighting the good fight, we are throwing punches and getting the shit kicked out of us. BUT we are Marty McFly and we will knock Biff the fuck out.

Am I desperate? FUCK YEAH!

Just tonight I sent this email:

 

 

From: Kevin Dent [mailto:kevin@XXX.XXXX]
Sent: Thursday, May 10, 2012 8:50 PM
To: ‘Andy McNamara’
Subject: Front cover

Hi Andy,

Who do I have to screw to get the front cover for ReKoil?

Cheers,

Kevin

 

As of today, we are not on Kickstarter, we are totally self-funded and we are totally throwing ourselves under the bus in an attempt to make a better game.

Let me be clear, Andy is an amazing person and I love him dearly. He is smart, conscientious and endearing. BUT Game Informer is owned by Game Stop and those dudes are hardcore publisher fuckers, there is zero chance of us getting the cover.

There is no way that, me, you or your freak parents will get us on the front cover of GI and nor should it. Brilliant games get on there 12 a year at least.

That said, it was worth a shot, we are living in the era of the indie.

There have been so many amazing titles in the last twelve months made by people with way more talent than me.

I rejoice at the next gen of game creators.

I am humbled by them.

But the truth is, that Jason Brice, asked me to jump off a cliff and I made the leap.

Who will catch me?

Kevin

 

Four reasons we’re not as good as we could be

Original Author: Chad Moore


“The lizard brain is not merely a concept. It’s real, and it’s living on the top of your spine, fighting for your survival. But, of course, survival and success are not the same thing. The lizard brain is the reason you’re afraid, the reason you don’t do all the art you can, the reason you don’t ship when you can. The lizard brain is the source of the resistance.”

– Seth Godin, Linchpin: Are You Indispensable?

I’ve adapted a guest post I wrote for Joshua Becker’s blog. Thanks to Joshua for the opportunity to write for his blog and for his help with the original post.

I’ve been giving a lot of thought to single tasking, focus and distractions in my professional and personal life. Here is what I believe to be the four biggest factors contributing to distracting us from doing great creative work. By the way “creative work” to me means art, code, design, music you name it. Many facets and roles in gamedev are creative, perhaps not at first glance, but any creative problem solving applies, in my humble opinion.

#1 We’re distracted by notifications

Everyone is “always on”. We have our email open all day and internal instant messaging clients humming – not to mention the external social networks and IM chats. Instant notifications are being pushed from everywhere to wherever we are at the moment, sitting at our desktops or walking about with our phones. Most of us are on digital leashes of some kind even when we don’t need to be. We’re in a state of constant multiple deliverables or actionable tasks as well as completely open to distractions. How do we get anything done?

Some studies even propose that people can be addicted to the micro-endorphin rush we get when we get an email or a tweet. That’s why we can’t stop checking our phones when we’re away from work or even when we’re changing our brains.

Can you read through this entire 1500 word post without a notification going off?

#2 We’re distracted by geography

Office geography has the entire team sitting in cube farms. These close-knit cubes are designed to “enhance communication.” For example, artists can quickly look over the half-wall and ask a programmer a question. This is great for immediate problem solving; however, the creative tasks we perform require dedicated and focused work in order to be fully realized. The constant interruptions hinder our productivity. What’s the tradeoff? Does the immediate communication outweigh the lack of focus?

Interruptions are hurting productivity, not helping it.

#3 We can’t focus

In this land of distractions, how do we concentrate long enough on a creative task to produce quality? Mental context switching is hard to quantify as there are so many variables. However, I have heard that switching between 3 tasks costs a person 40% of their available work time. You can just throw (at least) one of those tasks out the window, I hope it wasn’t important.

Does your company have a lot of meetings? Meetings in the middle of the day can be a drastic productivity and creativity killer.

Multi-tasking just doesn’t work. Is there a “sweet spot” in your workday that you can work uninterrupted for more than an hour on a single task?

#4 We’re comfortable and afraid

Our lizard brains want us to be stuck in a rut. Safety is the Lizard Brain’s primary objective. Safe is free of conflicts and challenges. Safe is the same thing over and over again. Safe is not standing out in the crowd. Safe, as the lizard brain defines it, is not good for your creativity and thereby the projects you work on.

Image courtesy of Alex Moore

Here are some of the things that I’m using or trying out to combat the lizard brain, distractions and multi-tasking.

Control your time.

Batch processing your email can be effective: once at the beginning of the day and again at the end. This is more efficient than shifting to Outlook with each new notification and processing each single email. If you think that is too long to be out of contact, do the email dance for 10 minutes every two hours. Or whatever works for you, just try to keep the notifications off. If there truly is an emergency you will know; someone will come find you.

Turn off your work IM. Same rules as above. Maybe you’re available and on IM only in the afternoon. Or set your status to do not disturb and teach people politely what that means. Try things and see how they work for you.

Do one thing at a time. Make a list of the 3 or 5 most important things for that day. Do them as early in the day as you can. You’ll accomplish the “big things” early and everything else after that is icing on the cake.

All of this can be tricky at first if your culture wants you to respond immediately, but I think it pays off in the end. You have more uninterrupted time to do something good. Talk to your manager if you need to before implementing a radical change.

Have the good kind of meetings.

The morning standup with your team is probably the best type of meeting. It’s early, not in the middle of the day. Everyone says what they are working on, if anything is blocking them and what they’ll do next. Preferably everyone stands. It is called a stand up after all. This is a bold choice, but decline any meeting that you think is a waste of time. Reply that you can’t make it and ask for a recap. Meeting Agendas and Recaps are such a great tool for meetings, we seem to forget about them.

The Agenda: When booking the meeting provide a list of what topics to discuss, who is responsible for talking for each topic and a guesstimate on how long they’ll have to talk. Send the agenda out a couple of days beforehand and make it known that the attendees are expected to be prepared. Simon Cooke suggests formulating the agenda as questions to get people thinking and vested.

During the meeting, stick to the schedule. Get in, do good stuff, get out.

The Recap: briefly state what was discussed.

Here’s a sample meeting agenda:

Today’s Date, Galactic Domination Meeting Agenda

  • Overview, Palpatine, 15 minutes
  • Death Star #2 construction status, Tarkin, 5 minutes
  • Various disturbances in the force, Vader, 10 minutes

And the recap:

  • Overview, Palpatine, 15 minutes
    • Squash rebellion by destroying secret base.
    • Eliminate Galactic Senate via political maneuvering.
  • Death Star #2 construction status, Tarkin, 5 minutes
    • We fixed the exhaust port / proton torpedo bug
  • Various disturbances in the force, Vader, 10 minutes
    • Obi-Wan is dead, but no one is sure why he just disappeared.
    • Follow up on plan to turn Son of Skywalker to the dark side.

Wait.

Can you wait a while to talk to the artist? Look at him, he has his headphones on (game developer code for “Do Not Disturb”) and is obviously “in the zone” sculpting in the 3D modeling application with a ton of reference images on the other monitor. Don’t interrupt him asking for his hours spent this week on his tasks. Let him do his job. That User Interface Engineer who’s sliding her chair back and forth from her development kit and her two-screen computer rig? She’s fixing a bug in the UI on her computer, testing it on the kit, and she’s obviously busy. Do not walk over and ask her how it’s going right now. Let her do her job. Your opportunity to ask your question will come… but you may need to wait your turn to ask it.

Get people on board.

Use team signals to tell others “do not disturb.” I’ve seen flags put up by artists to denote when they are in the zone and to please come back later. I’ve also seen tech teams have one team member wear a hat signaling that he/she is the person on call for help that day. Find a way of signaling people and teach them to recognize and respect it. Work it out with your team so you can be efficient together.

Protect your industry’s most valuable asset.

The most valuable asset any company has is its people. People make the art, tech and design of a game. They support the game or the other people making the games. There are certain parts of the day when creativity runs wild and other parts of the day that are more suitable for less creative tasks. Do what you can to help reduce roadblocks and ask your Manager for support.

I certainly don’t have all the answers. But I’m trying to communicate with one person at a time, handle one problem at a time, and work on one idea at a time. All as a part of my effort to bring the principles of minimalism, simplicity and single tasking into my personal and professional life. I’d love to hear your thoughts on these topics.


Maximum Creativity: Open & Closed Mode

Original Author: Claire Blackshaw

A video recently cycled through my friends’ social circles which I wanted to share. John Cleese talks about Creativity and Open and Closed thinking modes.

The TL;DR of John Cleese’s talk

Closed Mode
Purposeful, Highly Productive, but not creative. Good for getting things done. Default Mode at Work.
Open Mode
Playful, Curious, Fun, Humorous, Relaxed, Contemplative without goals.

I had a lot of takeaways from that but my biggest takeaway was what could I do to be more open and enable openness around me. In the past I’ve drawn mostly from my dramatic background and improv lessons, always saying yes. See Lisa Brown’s brilliant article on that topic.

Humour and Time are key to being in the Open Mode. Also Goals, Hierarchy and Authority are not conducive to the Open Mode. An interesting thought for those of you who read the Valve Employee Handbook floating around the net at the moment. The key thought this led me to is simple…

You Can’t Play by Proxy

Too often I espouse the brilliance of the programming designer. The real magic is simple: you cannot play by proxy. A designer who can’t play (and change) with gameplay is like a chef who gets someone else to taste the food and describe it. An artist who cannot change the appearance of something in-game without someone else is like painting on canvas by instructing a monkey holding a paintbrush. A programmer who cannot go outside the spec is like a dancer in a straitjacket.

You need to hold the ball in your hand and feel it bounce. Always ensure your source control, build server and other tools do not stop someone’s experimentation.

The Open Mode is Unachievable in Crunch Time

A common example of the open mode in game development: someone has just implemented a system which due to an amusing bug is glitching out or not behaving in the designed fashion. In a crunch time, i.e. closed environment, a bug is filed, profanities are hurled and the bug is squashed. In an open mode and a lighter environment, possibly earlier in the project, the group plays with the bug and says something like, “Isn’t that interesting…”

Now to be clear, most times nothing immediately comes from that comment, though at times a brilliant idea or mechanic emerges. Often these moments lend themselves to interesting gameplay moments. For a prime example of a game development project driven by open mode thinking, look at Minecraft. Many gameplay features started as bugs or unintended behaviour which in a traditional environment would have been eradicated, but the players enjoyed them and Notch left them in.

In order to achieve things we need to be in a closed mode, though it’s easy to get stuck in that mode and, with tunnel vision, run down our path missing all the brilliant opportunities that are being thrown up during game development.

Playful Tools

When designing or creating a tool or pipeline, think about how your tools enable playfulness: experimenting, toying around, and a willingness to create without fear. Observe how playful we are with pens, for many of us our primary tool.

Consider two developers, one creating a scene in Notepad with fragile .xml, the other drawing a blueprint in Photoshop. The drawing based solution is more fluid and open to play. The developer can draw a “humorous shape” and see it in game without fuss, which leads to greater comfort and competence and to discovering creative uses of the tools.

Nothing can frustrate and break playfulness more than a deep pipeline or frustrating tools. A buggy or frustrating tool ensures you’re in the Closed Mode, because in the Closed Mode we are less likely to make mistakes and can follow rigid guidelines.

While being Creative Nothing is Wrong

We live in a digital world of hacks and cheats, we care for results not method. The thing which appears on screen is a beautiful lie. So if your engine is refresh sensitive and those silly European televisions are giving you trouble, well, just turn up the earth’s gravity*. Remember the product is the experience, not the game.
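
For the curious, here is a rough sketch of why a hack like that can work. It assumes an engine that runs exactly one physics step per displayed frame, so a 50 Hz television gets fewer steps per second than a 60 Hz one and everything falls more slowly in real time. The function names and numbers are made up for illustration and are not taken from the engine in question.

```cpp
// Hypothetical frame-locked integrator: one physics step per displayed
// frame, with gravity expressed in units per frame squared.
double DistanceFallen(double gravityPerFrame, int frames)
{
    double velocity = 0.0;
    double distance = 0.0;
    for (int i = 0; i < frames; ++i) {
        velocity += gravityPerFrame;  // accelerate once per frame
        distance += velocity;         // then move once per frame
    }
    return distance;
}

// Gravity multiplier that makes a simulation tuned for tunedHz fall roughly
// the same distance per second when it is stepped at actualHz instead.
// Distance grows with the square of the frame count, hence the squared ratio.
double GravityScale(double tunedHz, double actualHz)
{
    const double ratio = tunedHz / actualHz;
    return ratio * ratio;
}
```

With these assumptions, one second of falling at 50 Hz with gravity multiplied by GravityScale(60.0, 50.0), which is about 1.44, covers nearly the same distance as one second at 60 Hz with the original gravity. It is not exact, and it fixes nothing else that depends on frame rate, which is rather the point: only the result on screen matters.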

Too often a professional artist, designer, programmer or otherwise will dismiss or stamp on a comment because it’s patently silly. They insist on saying it’s too complex or saying it’s daft. Now please, respect the professional skill for which the individual was hired, though at no point should someone be crushed or demotivated by the HAMMER of authority.

Jam, Jive and Jiggle

Encourage people to take time out to play in fixed time periods to jam, jive and jiggle. I write this after having just finished a weekend of Ludum Dare fun. The result of the Jam is less important than the time to play. The time to ponder and play with an idea. If you have a sprint structure try to take some time out after a sprint or a whole sprint, WITHOUT GOALS! Just play. See what comes from it.

People need to feel free to play with ideas, in order to have great ideas.

* A real humorous example of an engine hack seen in the wild.

C/C++ Low Level Curriculum Part 8: looking at optimised assembly

Original Author: Alex Darby

It’s that time again where I have managed to find a few spare hours to squoze out an article for the Low Level Curriculum. This is the 8th post in this series, which is not in any way significant except that I like the number 8. As well as being a power of two, it is also the maximum number of unarmed people who can simultaneously get close enough to attack you (according to a martial arts book I once read).

This post covers how to set up Visual Studio to allow you to easily look at the optimised assembly code generated for simple code snippets like the ones we deal with in this series. If you wonder why I feel this is worth a post of its own here’s the reason – optimising compilers are good, and given code with constants as input and no external output (like the snippets I give as examples in this series) the compiler will generally optimise the code away to nothing – which I find makes it pretty hard to look at. This setup should prove immensely useful, both to refer back to, and for your own experimentation.
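
To make the point about constant inputs concrete, here is a minimal sketch. The two function names are hypothetical and chosen only for this illustration, and exactly what any given compiler does with them will depend on its version and settings.

```cpp
// Hypothetical illustration of constant versus runtime inputs.

// Every input here is a compile-time constant, so an optimising compiler
// can typically work out the answer at build time and reduce the body to
// the equivalent of "return 3".
int SumOfConstants()
{
    int a = 1, b = 2;
    return a + b;
}

// The input here is only known at run time, so the addition has to survive
// into the generated code in some form.
int SumWithRuntimeInput( int iInput )
{
    int b = 2;
    return iInput + b;
}
```

One way to keep a snippet alive in an optimised build, assuming nothing else about the project, is to feed it a value the compiler cannot know in advance (such as argc) and to return the result from main so that it counts as output.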

Here are the backlinks for preceding articles in the series in case you want to refer back to any of them (warning: the first few are quite long):

  1. /2011/11/09/a-low-level-curriculum-for-c-and-c/
  2. /2011/11/24/c-c-low-level-curriculum-part-2-data-types/
  3. /2011/12/14/c-c-low-level-curriculum-part-3-the-stack/
  4. /2011/12/24/c-c-low-level-curriculum-part-4-more-stack/
  5. /2012/02/07/c-c-low-level-curriculum-part-5-even-more-stack/
  6. /2012/03/07/c-c-low-level-curriculum-part-6-conditionals/
  7. Once you have clicked OK just click “Finish” on the next stage of the wizard – in case you’re wondering, the options available when you click next don’t matter for our purposes (and un-checking the “Precompiled header” check box makes no difference, it still generates a console app that uses a precompiled header…).

    Changing the Project Properties

    The next step is to use the menu to select “Project -> <YourProjectName> Properties”, which will bring up the properties dialog for the project.

    When the properties dialog appears (see image below):

    • select “All Configurations” from the Configuration drop list
    • select “Configuration Properties ->General” in the tree view at the left of the window
    • in the main pane change “Whole Program Optimisation” to “No Whole Program Optimisation”.

    Next, in the tree view (see image below):

    • in the tree view, navigate to “C/C++ -> Code Generation”
    • in the main pane, change “Basic Runtime Checks” to “Default” (i.e. off)

    Finally (see image below):

    • in the tree view, go to “C/C++ -> Output Files”
    • in the main pane change “Assembler Output” to “Assembly With Source Code (/FAs)”
    • once you’ve done that click “OK”

    Now, when you compile, the Visual Studio compiler will generate an .asm file as well as an .exe file. This file will contain the intermediate assembly code generated by the compiler, with the source code inserted into it inline as comments.

    You could alternatively choose the “Assembly, Machine Code and Source (/FAcs)” option if you like – this will generate a .cod file that contains the machine code as well as the asm and source.

    I prefer the regular .asm because it’s less visually noisy and the assembler mnemonics are all aligned on the same column, so that’s what I’ll assume you’re using if you’re following the article, but the .cod file is fine.

    So, what did we do there?

    Well, first we turned off link time code generation. Amongst other things, this will prevent the linker stripping the .asm generated for functions that are compiled but not called anywhere.

    Secondly, we turned off the basic runtime checks (which are already off in Release). These checks make the function prologues and epilogues generated do significant amounts of (basically unnecessary) extra work causing a worst case 5x slowdown (see this post by Bruce Dawson on his personal blog for an in depth explanation).

    Finally, we asked the compiler not to throw away the assembly code it generates for our program; this data is produced by the compilation process whenever you compile but is usually thrown away, we’re just asking Visual Studio to write it into an .asm file so we can take a look at it.

    Since we made these changes for “All Configurations” this means we will have access to .asm files containing the assembly code generated by both the Debug and Release build configurations.

    Let’s try it out

    So in the spirit of discovery, let’s try it out (for the sake of familiarity) with a language feature we looked at last time – the conditional operator:

    #include "stdafx.h"

    int ConditionalTest( bool bFlag, int iOnTrue, int iOnFalse )
    {
        return ( bFlag ? iOnTrue : iOnFalse );
    }

    int main(int argc, char* argv[])
    {
        int a = 1, b = 2;
        bool bFlag = false;
        int c = ConditionalTest( bFlag, a, b );
        return 0;
    }

    The question you have in your head at this moment should be “why have we put the code into a function?”. Rest assured that this will become apparent soon enough.

    Now we have to build the code and look in the .asm files generated to see what the compiler has been up to…

    First build the Debug build configuration – this should already be selected in the solution configuration drop-down (at the top of your Visual Studio window unless you’ve moved it).

    Next build the Release configuration.

    Now we need to open the .asm files. Unless you have messed with project settings that I didn’t tell you to, these will be in the following paths:

    <path where you put the project>/Debug/<projectName>.asm

    <path where you put the project>/Release/<projectName>.asm

    .asm files

    I’m not going to go into any significant detail about how .asm files are laid out here; if you want to find out more, here’s a link to the Microsoft documentation for their assembler.

    The main thing you should note is that we can find the C/C++ functions in the .asm file by looking for their names; and that – once we find them – the mixture of source code and assembly code looks basically the same as it does in the disassembly view of Visual Studio in the debugger.

    main()

    Let’s look at main() first. This is where I explain why the code snippet we wanted to look at was put in a function. I can tell you’re excited.

    Here’s main() from the Debug .asm (I’ve reformatted it slightly to make it take up less vertical space):

     1  _TEXT    SEGMENT
     2  _c$ = -16                        ; size = 4
     3  _bFlag$ = -9                     ; size = 1
     4  _b$ = -8                         ; size = 4
     5  _a$ = -4                         ; size = 4
     6  _argc$ = 8                       ; size = 4
     7  _argv$ = 12                      ; size = 4
     8  _main    PROC                    ; COMDAT
     9  ; 9    : {
    10      push    ebp
    11      mov    ebp, esp
    12      sub    esp, 80                    ; 00000050H
    13      push    ebx
    14      push    esi
    15      push    edi
    16  ; 10   :     int a = 1, b = 2;
    17      mov    DWORD PTR _a$[ebp], 1
    18      mov    DWORD PTR _b$[ebp], 2
    19  ; 11   :     bool bFlag = false;
    20      mov    BYTE PTR _bFlag$[ebp], 0
    21  ; 12   :     int c = ConditionalTest( bFlag, a, b );
    22      mov    eax, DWORD PTR _b$[ebp]
    23      push    eax
    24      mov    ecx, DWORD PTR _a$[ebp]
    25      push    ecx
    26      movzx    edx, BYTE PTR _bFlag$[ebp]
    27      push    edx
    28      call    ?ConditionalTest@@YAH_NHH@Z        ; ConditionalTest
    29      add    esp, 12                    ; 0000000cH
    30      mov    DWORD PTR _c$[ebp], eax
    31  ; 13   :     return 0;
    32      xor    eax, eax
    33  ; 14   : }
    34      pop    edi
    35      pop    esi
    36      pop    ebx
    37      mov    esp, ebp
    38      pop    ebp
    39      ret    0
    40  _main    ENDP
    41  _TEXT    ENDS

    As long as you’ve read the previous posts, this should mostly look pretty familiar.

    It breaks down as follows:

    • lines 1-8: these lines define the offsets of the various Stack variables from [ebp] within main()’s Stack Frame
    • lines 10-15: function prologue of main()
    • lines 17-20: initialise the Stack variables
    • lines 22-30: push the parameters to ConditionalTest() onto the Stack, call it, and assign its return value
    • line 32: sets up main()’s return value
    • lines 34-38: function epilogue of main()
    • line 39: return from main()

    Nothing unexpected there really, the only new thing to take in is the declarations of the Stack variable offsets from [ebp].

    I feel these tend to make the assembly code easier to follow than the code in the disassembly window in the Visual Studio debugger.

    And, for comparison, here’s main() for the Release .asm:

    _TEXT    SEGMENT
    _argc$ = 8                       ; size = 4
    _argv$ = 12                      ; size = 4
    _main    PROC                    ; COMDAT
    ; 10   :     int a = 1, b = 2;
    ; 11   :     bool bFlag = false;
    ; 12   :     int c = ConditionalTest( bFlag, a, b );
    ; 13   :     return 0;
        xor    eax, eax
    ; 14   : }
        ret    0
    _main    ENDP
    _TEXT    ENDS

    The astute amongst you will have noticed that the Release assembly code is significantly smaller than the Debug.

    In fact, it’s clearly doing nothing at all other than returning 0. Good optimising! High five!

    As I alluded to earlier, the optimising compiler is great at spotting code that evaluates to a compile time constant and will happily replace any code it can with the equivalent constant.

    So that’s why we put the code snippet in a function

    It should hopefully be relatively clear by this point why we might have put the code snippet into a function, and then asked the linker not to remove code for functions that aren’t called.

    Even if it can optimise away calls to a function, the compiler can’t optimise away the function before link time because some code outside of the object file it exists in might call it. Incidentally, the same effect usually keeps variables defined at global scope from being optimised away before linkage.

    I’m going to call this Schrödinger linkage (catchy, right?). If we want our simple code snippet to stay around after optimising we only need to make sure that it takes advantage of Schrödinger linkage to cheat the optimiser.

    If the compiler can’t tell whether the function will be called, then it certainly can’t tell what the values of its parameters will be during one of these potential calls, or what its return value might be used for and so it can’t optimise away any code that relies on those inputs or contributes to the output either.

    The upshot of this is that if we put our code snippet in a function, make sure that it uses the function parameters as inputs, and that its output is returned from the function then it should survive optimisation.

    It’s really a testament to all the compiler programmers over the years that it takes so much effort to get at the optimised assembly code generated by a simple code snippet – compiler programmers we salute you!

    ConditionalTest()

    So, here’s the Debug .asm for ConditionalTest() (ignoring the prologue / epilogue):

    ; 5    :     return( bFlag ? iOnTrue : iOnFalse );
        movzx    eax, BYTE PTR _bFlag$[ebp]
        test    eax, eax
        je    SHORT $LN3@Conditiona
        mov    ecx, DWORD PTR _iOnTrue$[ebp]
        mov    DWORD PTR tv66[ebp], ecx
        jmp    SHORT $LN4@Conditiona
    $LN3@Conditiona:
        mov    edx, DWORD PTR _iOnFalse$[ebp]
        mov    DWORD PTR tv66[ebp], edx
    $LN4@Conditiona:
        mov    eax, DWORD PTR tv66[ebp]
    ; 6    : }

    As you should be able to see, this is basically doing the same thing as the code we looked at in the Debug disassembly in the previous article:

    • branching based on the result of testing the value of bFlag (the mnemonic test does a bitwise logical AND)
    • both branches set a Stack variable at an offset of tv66 from [ebp]
    • and both branches then execute the last line which copies the content of that address into eax

    Again, the assembly code is arguably easier to follow than the corresponding disassembly because the jmp mnemonic jumps to labels visibly defined in the code, whereas in the disassembly view in Visual Studio you generally have to cross reference the operand to jmp with the memory addresses in the disassembly view to see where it’s jumping to…

    Let’s compare this with the Release assembler (again not showing the function prologue or epilogue):

    ; 5    :     return( bFlag ? iOnTrue : iOnFalse );
        cmp    BYTE PTR _bFlag$[ebp], 0
        mov    eax, DWORD PTR _iOnTrue$[ebp]
        jne    SHORT $LN4@Conditiona
        mov    eax, DWORD PTR _iOnFalse$[ebp]
    $LN4@Conditiona:
    ; 6    : }

    You will note that the work of this function is now done in 4 instructions as opposed to 9 in the Debug:

    • it compares the value of bFlag against 0
    • unconditionally moves the value of iOnTrue into eax
    • if the value of bFlag was not equal to 0 (i.e. it was true) it jumps past the next instruction…
    • …otherwise this moves the value of iOnFalse into eax
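
    To make the shape of that Release code easier to see, here is a rough C++ rendering of the same instruction order. This is only a sketch: the function name ConditionalTestReleaseShape is made up for illustration, and the compiler is free to generate something different for it.

```cpp
// Hypothetical C++ rendering of the Release instruction order shown above.
int ConditionalTestReleaseShape( bool bFlag, int iOnTrue, int iOnFalse )
{
    int iResult = iOnTrue;   // mov eax, iOnTrue (done unconditionally)
    if ( !bFlag )            // cmp bFlag, 0 / jne skips the next line when bFlag is true
    {
        iResult = iOnFalse;  // mov eax, iOnFalse
    }
    return iResult;          // the result is left in eax
}
```

    Calling it with ( true, 1, 2 ) returns 1 and with ( false, 1, 2 ) returns 2, just as the conditional operator would.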

    As I’ve stated before I’m not an assembly code programmer and I’m not an optimisation expert. Consequently, I’m not going to offer my opinion on the significance of the ordering of the instructions in this Release assembly code.

    I am, however, prepared to go out on a limb and say it’s a pretty safe bet that the Release version with 4 instructions is going to execute significantly faster than the Debug version with 9.

    So, why such a big difference between Debug and Release for something that when debugging at source level is a single-step?

    Essentially this is because the unoptimised assembly code generated by the compiler must be amenable to single-step debugging at the source level:

    • it almost always does the exact logical equivalent of what the high level code asked it to do and, specifically, in the same order
    • it also has to frequently write values from CPU registers back into memory so that the debugger can show them updating

    Summary

    What’s the main point I’d like you to take away from this article? Optimising compilers are feisty!

    You have to know how to stop them optimising away your isolated C/C++ code snippets if you want to easily be able to see the optimised assembly code they generate.

    This article shows a simple boilerplate way to short-circuit the Visual Studio optimising compiler – mileage will vary on other platforms.

    There are other strategies to stop the optimiser optimising away your code, but they basically all come down to utilising the Schrödinger linkage effect; in general:

    • use global variables, function parameters, or function call results as inputs to the code
    • use global variables, function return values, or function call parameters as outputs from the code
    • if you’re not using Visual Studio’s compiler you may also need to turn off inlining
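
    As a minimal sketch of the first two points, assuming the same project settings as above, something like the following should survive optimisation. All of the names here (g_iInput, g_iOutput, SumTest, SumViaGlobals) are invented for illustration:

```cpp
// Hypothetical names throughout; none of these come from the article's project.

// Non-const globals with external linkage: when compiling this file on its
// own, the compiler cannot assume what value they hold or who reads them.
int g_iInput  = 3;
int g_iOutput = 0;

// Inputs arrive as parameters and the result leaves via the return value,
// so the addition cannot be folded into a compile time constant here.
int SumTest( int iA, int iB, int iC )
{
    return iA + iB + iC;
}

// Reads one global as input and writes another global as output.
void SumViaGlobals()
{
    g_iOutput = g_iInput + g_iInput;
}
```

    Either function gives you a small, self-contained chunk of assembly code to compare between the Debug and Release .asm files.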

    A final extreme method I have been told about is to insert nop instructions via inline assembly around / within the code you want to isolate. Note that you should use this approach with caution, as it interferes directly with the optimiser and can easily affect the output to the point where it is no longer representative.

    Epilogue

    So, I hope you found this interesting – I certainly expect you will find it useful 🙂

    The next article (as promised last time!) is about looping, which is another reason why it seemed like a good time to cover getting at optimised assembly code for simple C/C++ snippets.

    I will be referring back to this in future articles in situations where looking at the optimised assembly code is particularly relevant.

    If you’re wondering what you should look at first to see how Debug and Release code differ, and want to get practice at beating the optimiser, I’d suggest starting with something straightforward like adding a few numbers together.

    Lastly, but by no means leastly, thanks to Rich, Ted, and Bruce for their input and proof reading; and Bruce for supplying me with the tip that made this post possible.

Xperf Wait Analysis–Finding Idle Time

Original Author: Bruce Dawson

The Windows Performance Toolkit, also known as xperf, is a profiling toolkit for Windows. I have previously used it to investigate slowdowns in PowerPoint by using xperf’s built-in sampling profiler, but that actually understates the true value of xperf. While I think xperf is a better sampling profiler than most of the alternatives (higher frequency, lower overhead, kernel and user mode), xperf is really at its best when it reveals information that other profilers cannot measure at all.

In short, lots of profilers can tell you what your program is doing, but few profilers are excellent at telling you why your program is doing nothing.

Our story so far

Xperf has a high learning curve. Therefore I highly recommend that you start by reading some previous articles from this series. The entire series can be found here, but the most important ones are:

  • Recording a Trace – how to record a trace, including how to add custom ETW events to your game
  • Xperf Analysis Basics – the essential knowledge of how to navigate the xperf UI, including how to set up symbol paths

The rest of this post assumes that you have installed xperf (preferably the Windows 8 version) and have familiarity with at least Xperf Analysis Basics.

Wait Analysis Victories

I’ve had good luck using Wait Analysis to find many performance problems. Some of these delays were short enough to be difficult to notice, yet long enough to matter. Others were debilitating. All were difficult or impossible to analyze through CPU sampling or other ‘normal’ CPU profilers. Some examples include:

  • Finding the cause of brief startup hangs in Internet Explorer and various games
  • Profiling Luke Stackwalker to find out why it caused frame rate glitches in the game it was profiling
  • Finding the cause of a 10x perf-reduction when upgrading to a newer version of Windows
  • Finding the cause of frame rate hitches during fraps recording
  • Finding the chain of lock contention that caused frame rate hitches on a heavily loaded system
  • Finding the cause of (and a workaround for) repeated 2-6 second hangs in Visual Studio’s output window

The last investigation is the one I want to cover today. It is sufficiently simple and self-contained that I can cover it end-to-end in a single (long) post.

Finding the hang

When profiling a transient problem such as a frame-rate glitch or a temporary hang the first challenge is to locate the hang in the trace. A trace might cover 60 seconds, and a hang might last for 2 seconds or less, so knowing where to look is crucial. There are a number of ways to do this:

  • Find the key stroke that triggered the hang, through logging of input events
  • Use instrumentation in the functions of interest
  • Look for patterns in the CPU usage or other data
  • Use OS hang-detection events

I’ve used all four of these techniques. Our internal trace recording tool has an optional input event logger which puts all keyboard and mouse input into the trace (watch for it). If a hang is triggered by a particular key press or mouse click then finding its start point in the trace is trivial.

Custom instrumentation (emitting ETW events at key points in your game, see the Recording a Trace post) is also a common technique. Emitting an event every frame makes a frame rate hitch plainly visible. However this doesn’t work when investigating performance problems in other people’s code, such as in Visual Studio.
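
As a rough illustration of why a per-frame event makes hitches easy to spot, the sketch below scans a list of per-frame timestamps for unusually long gaps. It does not emit or read real ETW events; FindHitches and its parameters are hypothetical, and the timestamps are assumed to have been extracted from a trace by some other means.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices of frames whose gap from the previous frame exceeds
// thresholdSeconds. frameTimes holds one timestamp (in seconds) per frame,
// in ascending order, as a per-frame event would provide.
std::vector<std::size_t> FindHitches( const std::vector<double>& frameTimes,
                                      double thresholdSeconds )
{
    std::vector<std::size_t> hitches;
    for ( std::size_t i = 1; i < frameTimes.size(); ++i )
    {
        if ( frameTimes[i] - frameTimes[i - 1] > thresholdSeconds )
        {
            hitches.push_back( i );
        }
    }
    return hitches;
}
```

With frames arriving roughly every 16 ms and a threshold of 100 ms, a single 200 ms gap shows up as exactly one reported index, which is the same thing your eye does when looking at evenly spaced events on a graph.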

In some cases a hang will be plainly visible in the CPU consumption. One recent hang showed a significant hole in the otherwise consistent CPU usage, plain as day.

A specific event that indicates the time and duration of a hang would be ideal, and Windows 7 actually has such an event. The Microsoft-Windows-Win32k ETW user provider will emit an event whenever a thread resumes pumping messages after a significant delay. Windows Vista and earlier users are out of luck, but on Windows 7 this is often exactly what is needed, and this provider is enabled by my recommended trace recording batch files.

It’s hands on time

I’ve uploaded a .zip file of a sample trace so that you can follow along. This is by far the best way to learn wait analysis.

This trace covers over ten minutes for some types of data, but the detailed sampling and context switch data only covers 28 seconds, from about 782 to 810 seconds.

Start by selecting the region where we have full data, from 782 to 810 s and cloning this selection.

Our path now depends on whether you are using the new (Windows 8) version of xperfview.exe.

Hands on with old versions of xperfview

While this exact technique is only applicable to (and only works with) old versions of xperfview, the general concept is still applicable and the exploration of generic events is crucial whether looking for your custom events or exploring the built-in events.

Scroll down to the Generic Events table. Right-click the selected region and bring up a summary table. Enable the Process Name column and put it first. Enable the ThreadID column and put it after Field 3. Move the Time (s) column and put it after the ThreadID column. I also hid a couple of columns in order to get my screenshot to fit, but that’s less critical. Now we have all of the information we need in a convenient and easy to read place.

If we drill into the data for devenv.exe and select the MessageCheckDelay Task Name we should see something like this:

The Zen of summary tables is all about looking for data columns that seem useful, enabling them, and fearlessly rearranging columns to group/sort/pivot/spindle the data to answer your question. In this case our question was when does devenv.exe (group by Process Name) hang (group by the Microsoft-Windows-Win32k provider, Task Name equals MessageCheckDelay or InputProcessDelay), and for those events, look at the TimeSinceInputRemoveMs, Thread ID, and Time (s) data.

So now, with relatively little effort, I know that devenv.exe hung (didn’t check for messages) for 5,304 ms, its message pump is running on thread ID 9,536, and the hang ended at 805.666 seconds into the trace.

Cool.

Hands on with the new version of xperfview (6.2.8229)

Microsoft is continuing to develop xperf and if you install the latest version (released Feb 29, 2012, and linked to from here) then there are a couple of options. Wpa.exe has a new UI which shows pretty graphs for UI delays:

I don’t know how to dig in deeper so I can’t tell if it is any use, so that’s all I have to say about it.

The new xperfview.exe has removed the MessageCheckDelay and InputProcessDelay events from Generic Events but has added a new UI Delay Information Graph. If you scroll down to this graph and zoom in around 800 s (in the area where we have full detail) then you should see five reports of hung apps. VTrace.exe (my trace recording application) hung for a while, there are three spurious reports of Internet Explorer hanging, and there is a MsgCheckDelay report for devenv.exe. It’s really too easy.

You can right-click to change the threshold for what delays are reported, or to bring up a summary table of delays. You’ll need to bring up the Delay Summary Table to find out the UI thread ID for devenv.exe.

Select the region around the devenv hang and we’re ready for the next step.

Finding the cause

The MessageCheckDelay is emitted at the end of the hang (805.666 seconds) and it tells us the length of the hang (5.304 s) so we now know the range of the hang quite accurately.

The hang runs from 800.362 to 805.666 seconds so we should zoom in on that area of the graphs in the xperf main window and look at CPU Usage by Process. My system has eight hardware threads (four hyperthreaded cores) so 12.5% utilization represents one busy thread. Even without that context we can see from the graph below that my CPUs are idle for most of the time. There’s a bit of devenv activity (the two blue spikes), but mostly this is an idle hang.
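
The arithmetic behind those numbers is simple enough to sketch in a few lines. HangInterval, HangFromDelayEvent and OneThreadPercent are names invented for this example and are not part of xperf:

```cpp
// Start and end of a hang, in seconds into the trace.
struct HangInterval
{
    double startSeconds;
    double endSeconds;
};

// The delay event is emitted when the hang ends and carries the hang's
// duration, so the start is the event time minus the duration.
HangInterval HangFromDelayEvent( double eventTimeSeconds, double delayMs )
{
    return { eventTimeSeconds - delayMs / 1000.0, eventTimeSeconds };
}

// Percentage of total CPU capacity represented by one fully busy hardware thread.
double OneThreadPercent( int hardwareThreads )
{
    return 100.0 / hardwareThreads;
}
```

With the values from the sample trace, HangFromDelayEvent( 805.666, 5304.0 ) gives a start time of about 800.362 s, and OneThreadPercent( 8 ) gives 12.5.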

When analyzing an idle hang you should select the entire region of the hang, and it is particularly important to select the end of the hang. It is better to select a few extra tens or hundreds of milliseconds at the end rather than risk missing the crucial events that end the hang. This selection can be done with the mouse or by right-clicking and using the Select Interval command. For easy reproducibility I right-clicked and used the Select Interval command to select the region from 800.0 s to 806.0 s. I then used Clone Selection to copy it to all of the graphs.

Who woke whom?

If a thread is not running, and it then starts running, then there was a context switch that started it (the new thread) running. That context switch is recorded in our ETW trace and contains all sorts of useful information. Included in this information is (for the traces recorded with my recommended batch files) the new process name and thread ID, the call stack which the thread woke up on (which is the same one it went to sleep on), the length of time it was not running and, for threads that were waiting on some synchronization primitive, the thread that woke it up.

Ponder that, because it’s crucial. An ETW trace tells you, for each context switch, how long the thread was not running, and who woke it up. That’s why it is important to have the end of the hang selected, because that is (presumably) the time of the context switch that gets the thread running again.

In the main xperf window go to the CPU Scheduling graph (make sure the correct time range is selected), right click on the selection, and select “Summary Table with Ready Thread” to view all context switches for the selected region together with the readying thread information. Make sure the columns are in this order:

  1. NewProcess – this is the process whose thread is being scheduled to run
  2. NewThreadId – this is the thread ID of the thread being scheduled to run
  3. NewThreadStack – this is the stack that the thread will resume running at
  4. ReadyingProcess – this is the process, if any, that readied the new thread
  5. ReadyingThreadId – guess. Go ahead, you can figure it out.
  6. ReadyingThreadStack – this is the stack of the readying thread when it readied the new thread
  7. Orange bar – columns to the left of this are used for grouping, columns to the right are for sorting and data display
  8. Count – how many context switches are summarized by each row
  9. Sum:TimeSinceLast (us) – the time the new thread was not running (time since it last ran) summed across all context switches summarized by each row

There are more columns, and for deeper analysis they can be useful, but we don’t need them today.

With our columns thus arranged we can quickly find our problem. Find devenv.exe (be sure to find the correct PID if multiple versions are running) and expand it, find the thread of interest (9,536, from the MessageCheckDelay event), then expand the stack. If you click the “Sum:TimeSinceLast (us)” column so the little arrow is pointing down then as you drill down into the stacks (hint: select the top node and then repeatedly press right-arrow) it will go down the hot call stack. In the sample trace, over the selected region, thread 9,536 starts with a total of about 5.523 s of non-running time over 316 context switches. As we drill down we get to a single context switch that ended an idle gap of 5.202 s. That’s our hang, clear as day.
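
Conceptually, the summary table is grouping context switches by thread and then counting and summing them. Here is a minimal sketch of that aggregation, assuming the context switch records have already been extracted into a simple structure. ContextSwitch, ThreadSummary and Summarise are hypothetical names, not an xperf API:

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct ContextSwitch
{
    std::uint32_t newThreadId;      // thread being switched in
    double        timeSinceLastUs;  // how long that thread had not been running
};

struct ThreadSummary
{
    int    count = 0;               // number of context switches
    double sumTimeSinceLastUs = 0;  // total non-running time
    double maxTimeSinceLastUs = 0;  // longest single idle gap
};

// Groups context switches by the thread being switched in.
std::map<std::uint32_t, ThreadSummary> Summarise( const std::vector<ContextSwitch>& switches )
{
    std::map<std::uint32_t, ThreadSummary> byThread;
    for ( const ContextSwitch& cs : switches )
    {
        ThreadSummary& summary = byThread[cs.newThreadId];
        ++summary.count;
        summary.sumTimeSinceLastUs += cs.timeSinceLastUs;
        if ( cs.timeSinceLastUs > summary.maxTimeSinceLastUs )
        {
            summary.maxTimeSinceLastUs = cs.timeSinceLastUs;
        }
    }
    return byThread;
}
```

A thread whose sum is dominated by one very large gap, rather than by many small ones, is the pattern to look for when hunting an idle hang.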

The NewThreadStack for this 5.202 s call stack starts at _RtlUserThreadStart and winds through a lot of Visual Studio code. Microsoft is kind enough to publish symbols for much of VS, as well as for Windows, and about fifty rows down we get to the interesting details:

It’s a single context switch (‘count’ goes down to one when we got a bit lower in the call stack) that put the Visual Studio UI thread to sleep for 5.202 s. It doesn’t get much clearer than that.

If we go down to the bottom of the stack and expand the next three columns (compressed in the screen shot above for size reasons) then we can see who woke us, which can also be described as “who we were waiting for”:

In this case it was the System process (thread 5880) in an IopfCompleteRequest call that goes through MUP.SYS. If we know what MUP.SYS is then that gives us another clue as to the root cause, but even without that we know that Visual Studio called CreateFileW and it took a long time to return.

What about the other threads?

In our selected region there are context switch events for 11 threads in devenv.exe. For all of those threads the Sum:TimeSinceLast value is greater than for 9,536, the thread we are investigating. So why aren’t we looking at them?

It’s important to understand that Sum:TimeSinceLast just measures how long a thread was idle, and there is nothing wrong with a thread being idle. A thread being idle is only a problem if it is supposed to be doing something and isn’t. In fact, if devenv.exe has 11 threads then they had better be idle most of the time or else my four-core machine is going to be constantly busy.

Many of the threads have a Sum:TimeSinceLast time of about 15 s, which is significantly longer than the 6 s time period selected. That’s because this summary table shows all of the context switches that occurred during this time period, and the first context switch for these threads was after they had been idle for a very long time, most of that time outside of the selected region.

The reason we are looking at thread 9,536 is because (according to the MessageCheckDelay event) it is the UI thread and it went for 5.304 s without pumping messages. It kept me waiting, and that makes me angry. You wouldn’t like me when I’m angry.

File I/O summary table

Since we know that the hang is related to file I/O we should look at what file I/O is happening during this time period. Luckily this information is recorded by the xperf batch files that I recommend.

On the main xperf window go to the File I/O graph and bring up a summary table, for the same time region we’ve been using so far. Arrange the columns as shown below and drill in as usual. I’m sure this screen shot won’t show up very well, but I can’t shorten it any more. It contains too much glorious information. Click on the image for deeper details:

We can see here that a Create file event, from devenv.exe, thread 9,536, took 5.203 s, trying to open \Device\Mup\…, and that ultimately the network path was not found.

Wow.

It turns out that \Device\Mup, or MUP.sys, means the network. The hang is because Visual Studio tried to open a non-existent network file, and sometimes that causes a 5.2 s network timeout. Hence the hang.

The remainder of the hang is from a few other context switches and CPU time that account for the rest of the 5.304 s, but the one long bit of idle time is all that matters in this case. It’s particularly clean.

What’s the cause?

The file name associated with this hang is quite peculiar. The full name is:

\Device\Mup\perforce\main\src\lib\public\win64\vdebug_tool.lib#227 – opened for edit

That doesn’t look like a file name. That looks more like the output from Perforce. And that’s exactly what it is. At Valve we store build results in Perforce so we have pre-build steps to check these files out. The checkout commands print their results to the Visual Studio output window like this:

//perforce/main/src/lib/public/win64/vdebug_tool.lib#227 – opened for edit

Visual Studio ‘helpfully’ reverses the slashes and decides that this represents a file name on \\perforce\main. Since this whole thing started with me pressing F8 (actually double-clicking the output window in this reenactment) this means that Visual Studio was trying desperately to treat this potential file name as a source-file name associated with an error or warning message.
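
To see how an innocent line of Perforce output can turn into a network path, here is a small sketch of the transformation described above. This is not Visual Studio’s actual parser; ReverseSlashes and LooksLikeUncPath are hypothetical helpers that only mimic the observed behaviour:

```cpp
#include <algorithm>
#include <string>

// Replaces every forward slash with a backslash, roughly what the output
// window appears to do before treating the text as a file name.
std::string ReverseSlashes( std::string text )
{
    std::replace( text.begin(), text.end(), '/', '\\' );
    return text;
}

// True if the text starts with two backslashes followed by something else,
// i.e. it looks like a UNC (network) path of the form \\server\share\...
bool LooksLikeUncPath( const std::string& text )
{
    return text.size() > 2 && text[0] == '\\' && text[1] == '\\';
}
```

Feeding a depot path that begins with two forward slashes through ReverseSlashes produces text that LooksLikeUncPath accepts, whereas an ordinary local path such as C:\src does not.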

Oops.

That explains the CResultList::AttemptToNavigate entry on the hang call stack – everything makes more sense once you understand the problem.

Conclusion

Once the cause of the hang was understood I modified our pre-build step to pipe the output through sed.exe and had it rewrite the output so that Visual Studio would no longer find it interesting. This avoids the hang, but also made it so that F8 would take the selection to interesting errors and warnings, instead of to these mundane progress messages. A little sed magic replaces “//” with the empty string, and “…” with “—“ :

sed -e s!//!! -e s!…!—!

This changes the hang-prone results before:

to the hang-proof benign text after:

I also reported the bug to the Visual Studio team. Having a trace is very powerful for this because it meant that I could tell them definitively what the problem was, and I could share the trace in order to let them confirm my findings. Just like minidump files are a powerful way to report crash bugs, xperf traces are a powerful way to report performance bugs. The Visual Studio team has told me that this bug will be fixed in Visual Studio 11 – UNC paths will be ignored by the output window’s parser.

Mup.sys is the driver used for network file I/O. Therefore its presence on the Readying Thread stack was a clue that a network delay was the problem. Doing file I/O on the UI thread is always a bit dodgy if you want to avoid hangs, and doing network file I/O is particularly problematic, so watching for mup.sys is a good idea.

Wait chains

Some wait analysis investigations are more complicated than this one. In several investigations I have found that the main thread of our game was idle for a few hundred milliseconds waiting on a semaphore, critical section, or other synchronization object. In that case the readying thread is critical because that is the thread that released the synchronization object. Once I find out who was holding up the main thread I can move the analysis to that thread and apply either busy-thread analysis (CPU sampling) or idle thread analysis (finding what that thread was waiting on). Usually just one or two levels of hunting is needed to find the culprit, but I did recently trace back across six context switches in four different processes in order to track down an OS scheduling problem.

When following wait chains it is important to understand the order of events. If thread 1234 is readied by thread 5678 at time 10.5 s, then any context switches or CPU activity that happen to thread 5678 after that point are not relevant to the wait chain. Since they happened after thread 1234 was woken they can’t be part of its wait chain.

For CPU activity this is dealt with by selecting the region of interest. For context switches this is dealt with by drilling down all the way and then looking at the SwitchInTime (s) column (which you may want to move to a more convenient location). This column records the time of the context switch.

It’s worth pointing out that if you busy wait (spinning on some global variable flag) or use your own custom synchronization primitives (CSuperFastCriticalSection) then these techniques will not work. The OS synchronization primitives are instrumented with ETW events that allow, in almost all cases, perfect following of wait chains. Even if your custom synchronization code is faster (and it probably isn’t) it isn’t enough faster to make up for the loss of wait analysis. The ability to profile your code trumps any small performance improvement.

Can’t any profiler do this?

Sampling profilers and instrumented profilers might be able to tell you that your program is idle, and they might even be able to tell you where your program is idle, but they generally can’t tell you why your program is idle. Only by following the chain of readying threads and looking at other information can you be sure to find the cause of your idle stalls.

It’s also convenient that you can leave xperf running in continuous-capture mode, where it is constantly recording all system activity to a circular buffer. When you notice a problem you can just record the buffer to disk, and do some post-mortem profiling.

It’s not baking

Baking is all about precisely following a recipe – improvisation tends to lead to failure. Wait analysis, on the other hand, is all about creativity, thinking outside the box, and understanding the entire system. You have to understand context switches, you have to think about what idle time is good and what is bad, you have to think about when to look at CPU usage and when to look at idle time, and you often have to invent some new type of analysis or summary table ordering in order to identify the root cause. It’s not easy, but if you master this skill then you can solve problems that most developers cannot.

Embracing Dynamism

Original Author: Niklas Frykholm


Are you stuck in static thinking? Do you see your program as a fixed collection of classes and functions with unchanging behavior?

While that view is mostly true for old school languages such as C++ and Java, the game is different for dynamic languages: Lua, JavaScript, Python, etc. That can be easy to forget if you spend most of your time in the static world, so in this article I’m going to show some of the tricks you can apply when everything is fluid and malleable.

At Bitsquid our dynamic language of choice is Lua. Lua has the advantage of being fast, fully dynamic, small, simple and having a traditional (i.e. non-LISP-y) syntax. We use Lua for most gameplay code and it interfaces with the engine through an API with exposed C functions, such as World.render() or Unit.set_position().

I will use Lua in all the examples below, but the techniques can be used in most dynamic languages.

1. Read-eval-print-loop

Dynamic languages can compile and execute code at runtime. In Lua, it is as simple as:

loadstring("print(10*10)")()

This can be used to implement a command console where you can type Lua code and directly execute it in the running game. This can be an invaluable debugging and tuning tool. For example if you need to debug some problem with the bazooka:

World.spawn_unit("bazooka", Unit.position(player))

Or tune the player’s run speed:

Unit.set_data(player, "run_speed", 4.3)
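The same idea carries over to other dynamic languages. As a rough sketch in Python (this is not Bitsquid's console; `game_env`, `console_execute` and the `player` dictionary are hypothetical names made up for illustration), a console command is just a string evaluated against the live game state:

```python
# Minimal sketch of an in-game console: run a typed line against live state.
# `game_env` and `player` are hypothetical stand-ins for real game objects.
game_env = {"player": {"run_speed": 3.0}}

def console_execute(line, env):
    """Evaluate `line` as an expression, or fall back to running it as a statement."""
    try:
        return eval(line, env)      # expressions, e.g. "10*10"
    except SyntaxError:
        exec(line, env)             # statements, e.g. assignments
        return None

result = console_execute("10*10", game_env)
console_execute("player['run_speed'] = 4.3", game_env)
```

A real console would also want to catch runtime errors and report them instead of letting them propagate into the game loop.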

2. Reload code

The console can be used for more than giving commands; you can also use it to redefine functions. If the gameplay code defines a scoring rule for kills:

function Player.register_kill(self, enemy)
	self.score = self.score + 10
end

you can use the console to redefine the function and change the rules:

function Player.register_kill(self, enemy)
	if enemy.type == "boss" then
		self.score = self.score + 100
	else
		self.score = self.score + 10
	end
end

Executing this code will replace the existing Player.register_kill function with the new one. All code that previously called the old function will now call the new one and the new scoring rules will apply immediately.

If you take some care with how you use the global namespace you can write your Lua code so that all of it is reloadable using this technique. Then the gameplay programmer can just edit the Lua files on disk and press a key to reload them in-game. The game will continue to run with the new gameplay code, without any need for a reboot. Pretty cool.

You can even get this to work for script errors. If there is an error in the Lua code, don’t crash the game, just freeze it and allow the gameplay programmer to fix the error, reload the code and continue running.
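To make the reload-and-recover loop concrete, here is a minimal Python sketch under stated assumptions: the gameplay code lives in a plain dictionary namespace, callers look functions up at call time, and a failed reload simply returns the error so the old code keeps running. All names (`gameplay_env`, `reload_source`, `on_kill`) are hypothetical.

```python
# Sketch: hot-reloading gameplay code by re-executing its source text.
gameplay_env = {}

def reload_source(source, env):
    """Re-run `source` in `env`; on failure keep the old code and return the error."""
    try:
        code = compile(source, "<gameplay>", "exec")
        exec(code, env)
        return None
    except Exception as err:    # broad on purpose: report it, don't crash the game
        return err

OLD = "def register_kill(player, enemy):\n    player['score'] += 10\n"
NEW = (
    "def register_kill(player, enemy):\n"
    "    player['score'] += 100 if enemy['type'] == 'boss' else 10\n"
)

def on_kill(player, enemy):
    # Looked up at call time, so this always sees the latest definition.
    gameplay_env["register_kill"](player, enemy)

player = {"score": 0}
reload_source(OLD, gameplay_env)
on_kill(player, {"type": "boss"})                        # old rules apply
reload_source(NEW, gameplay_env)
on_kill(player, {"type": "boss"})                        # new rules apply
error = reload_source("def broken(:\n", gameplay_env)    # syntax error is reported
```

The key design point is the indirection in `on_kill`: code that cached a direct reference to the old function would keep calling the old version.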

3. Override system functions

The functions in the engine API don’t have any special privileges; they can be redefined just like other Lua functions. This can be used to add custom functionality or for debugging purposes.

Say, for example, that you have some units that are mysteriously popping up all over the level. You know they are being spawned somewhere in the gameplay code, but you can’t find where. One solution would be to override the World.spawn_unit function and print a stack trace whenever the offending unit is spawned:

old_spawn_unit = World.spawn_unit

function World.spawn_unit(type, position)
	if type == "tribble" then
		print "Tribble spawned by:"
		print_stack_trace()
	end
	-- Pass the original's return value through so callers still get their unit.
	return old_spawn_unit(type, position)
end

Now, whenever a tribble is spawned by the script, a call stack will be printed and we can easily find who is doing the spawning.

Note that before we replace World.spawn_unit, we save the original function in the variable old_spawn_unit. This enables us to call old_spawn_unit() to do the actual spawning.

This technique could also be used to find all (potentially expensive) raycasts being done by the script.
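For comparison, the same override trick can be sketched in Python. The `World` class below is a hypothetical stand-in for an engine namespace, and the call stacks are collected in a list (`spawn_log`) rather than printed, purely to keep the example easy to inspect:

```python
import traceback

# Hypothetical stand-in for an engine API namespace.
class World:
    @staticmethod
    def spawn_unit(unit_type, position):
        return {"type": unit_type, "position": position}

spawn_log = []                        # collected call stacks
_old_spawn_unit = World.spawn_unit    # keep the original so we can still call it

def _traced_spawn_unit(unit_type, position):
    if unit_type == "tribble":
        # Record who is calling us; a real tool might print this instead.
        spawn_log.append("".join(traceback.format_stack()))
    return _old_spawn_unit(unit_type, position)

# From here on, every caller goes through the wrapper.
World.spawn_unit = staticmethod(_traced_spawn_unit)

def suspicious_gameplay_code():
    return World.spawn_unit("tribble", (0, 0, 0))

unit = suspicious_gameplay_code()
World.spawn_unit("bazooka", (1, 2, 3))   # not logged
```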

4. Handle deprecated functions

Sometimes we need to deprecate functions in the engine API. It can be annoying to the people using the engine of course, but backwards compatibility is the mother of stagnation. If you never throw away old code, you will eventually have a huge ugly code mess on your hands.

Luckily, since the script can create functions in the engine namespace, the script can provide the backwards compatibility when needed.

For example, we used to have a function PhysicsWorld.clear_kinematic(world, actor). That naming was inconsistent with some of our other functions so we changed it to Actor.set_kinematic(actor, false).

One way of dealing with this change would be to go through all the code in the project, find all uses of PhysicsWorld.clear_kinematic and change them to use Actor.set_kinematic instead. But another way would be to just implement PhysicsWorld.clear_kinematic in the script:

function PhysicsWorld.clear_kinematic(world, actor)
	Actor.set_kinematic(actor, false)
end

Now the rest of the code can go on using PhysicsWorld.clear_kinematic without even caring that the function has been removed from the engine API. You could even use a combination of the two strategies — implementing the deprecated function in Lua for a quick fix, and then looking into removing the uses of it.
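A Python version of such a compatibility shim might look like the sketch below. The `Actor` and `PhysicsWorld` classes are hypothetical stand-ins, and the deprecation warning is an addition of this sketch (the Lua shim above stays silent); it is one way to combine the quick fix with a nudge to migrate.

```python
import warnings

# Hypothetical stand-ins for the engine namespaces mentioned above.
class Actor:
    @staticmethod
    def set_kinematic(actor, enabled):
        actor["kinematic"] = enabled

class PhysicsWorld:
    pass  # clear_kinematic no longer exists in the "engine" API

def _clear_kinematic(world, actor):
    # Forward to the new API and remind callers to migrate.
    warnings.warn(
        "PhysicsWorld.clear_kinematic is deprecated; "
        "use Actor.set_kinematic(actor, False)",
        DeprecationWarning,
        stacklevel=2,
    )
    Actor.set_kinematic(actor, False)

# The script re-creates the removed function in the engine namespace.
PhysicsWorld.clear_kinematic = staticmethod(_clear_kinematic)

actor = {"kinematic": True}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    PhysicsWorld.clear_kinematic(None, actor)   # old call sites keep working
```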

5. Dynamically inserting profiling

Top-down profiling with explicit profiler scopes is a good way of finding out where a game is spending most of its time. However, to be useful, explicit profiler scopes need to be inserted in all the “right” places (all potentially expensive functions).

In C we need to guess where these right places are before compiling the program. In Lua, we can just insert the profiler scopes dynamically. We can even create a function that adds profiling to any function we want:

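function profile(class_name, method_name)
	local f = _G[class_name][method_name]
	_G[class_name][method_name] = function (...)
		Profiler.start(class_name .. "." .. method_name)
		-- Capture the results so the wrapper stays transparent to callers
		-- (note that trailing nil results may be dropped by unpack).
		local results = {f(...)}
		Profiler.stop()
		return unpack(results)
	end
end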

When we call this function as profile('Player', 'update') it will first save the existing Player.update function and then replace it with a function that calls Profiler.start("Player.update") before calling the original function and Profiler.stop() before returning.

Using this technique, we can dynamically add profiling to any function we want during our optimization session.
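The same wrapper can be sketched in Python. Since there is no `Profiler` object here, this version simply accumulates wall-clock time per scope with `time.perf_counter`; the `Player` class and the `timings` and `call_counts` tables are hypothetical names used only for illustration.

```python
import time
from collections import defaultdict

timings = defaultdict(float)    # scope name -> accumulated seconds
call_counts = defaultdict(int)  # scope name -> number of calls

def profile(namespace, method_name):
    """Replace namespace.method_name with a timed wrapper around the original."""
    original = getattr(namespace, method_name)
    scope = f"{namespace.__name__}.{method_name}"

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)   # pass results through unchanged
        finally:
            # Runs even if the original raises, so scopes are always closed.
            timings[scope] += time.perf_counter() - start
            call_counts[scope] += 1

    setattr(namespace, method_name, staticmethod(wrapper))

class Player:  # hypothetical gameplay class
    @staticmethod
    def update(dt):
        return sum(i * dt for i in range(1000))

profile(Player, "update")
value = Player.update(0.5)
```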

6. Tab completion

If you implement an interactive Lua console, it is nice to support tab completion, so the user doesn’t have to remember all function names. But how do you build the list of callable functions to use with tab completion?

Using Lua of course! Just find all tables (i.e., classes) in the global namespace and all functions stored in those tables:

t = {}

for class_name, class in pairs(_G) do
	if type(class) == 'table' then
		for function_name, value in pairs(class) do
			if type(value) == 'function' then
				t[#t+1] = class_name .. '.' .. function_name
			end
		end
	end
end

After running this, t will contain the full list of function names.
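As an illustration in Python, the sketch below builds the same kind of list from a dictionary of namespaces and then filters it by the prefix typed so far. The helper names (`build_completions`, `complete`) are made up, and the standard `math` module stands in for an engine namespace.

```python
import math

def build_completions(namespaces):
    """Return sorted 'namespace.function' strings for every public callable found."""
    names = []
    for ns_name, ns in namespaces.items():
        for attr in dir(ns):
            if not attr.startswith("_") and callable(getattr(ns, attr)):
                names.append(f"{ns_name}.{attr}")
    return sorted(names)

def complete(prefix, completions):
    """Return the candidates that start with what the user has typed so far."""
    return [name for name in completions if name.startswith(prefix)]

completions = build_completions({"math": math})
matches = complete("math.sq", completions)
```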

7. Looping through all objects

By recursing through _G you can enumerate all reachable objects in the Lua runtime.

function enumerate(f)
	local seen = {}
	-- 'local function' makes the name visible inside its own body,
	-- which the recursive calls below rely on.
	local function recurse(t)
		if type(t) ~= 'table' then return end
		if seen[t] == true then return end
		f(t)
		seen[t] = true
		recurse(getmetatable(t))
		for k,v in pairs(t) do
			recurse(k)
			recurse(v)
		end
	end
	recurse(_G)
end

Calling enumerate(f) will call f(o) on all objects o in the runtime. (Assuming they are reachable from _G. Potentially, there could also be objects only reachable through Lua references held in C.)

Such an enumeration could be used for many things. For example, you could use it to print the health of every object in the game.

function print_health(o)
	if o.health then print(o.health) end
end

enumerate(print_health)

The technique could also be used for memory optimizations. You could loop through all Lua objects and find the memory used by each object type. Then you could focus your optimization efforts on the resource hogs.
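A rough Python analogue of this kind of census is sketched below. It walks everything reachable from a root container and tallies objects by type; the `world` dictionary is a made-up stand-in for the global table, and counting objects per type is only a first approximation of memory use.

```python
from collections import Counter

def enumerate_objects(root, visit):
    """Call visit(obj) once for every object reachable from `root` through containers."""
    seen = set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue                 # already visited (handles shared objects and cycles)
        seen.add(id(obj))
        visit(obj)
        if isinstance(obj, dict):
            stack.extend(obj.keys())
            stack.extend(obj.values())
        elif isinstance(obj, (list, tuple, set)):
            stack.extend(obj)

# Hypothetical game state; the boss is referenced from two places.
boss = {"kind": "boss", "health": 500}
world = {"units": [boss, {"kind": "grunt", "health": 20}], "boss": boss}

counts = Counter()
healths = []

def census(obj):
    counts[type(obj).__name__] += 1
    if isinstance(obj, dict) and "health" in obj:
        healths.append(obj["health"])

enumerate_objects(world, census)
```

An explicit stack is used instead of recursion so that deeply nested data cannot exhaust the call stack.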

This has also been posted to The Bitsquid Blog.

Generating Uniformly Distributed Points on Sphere

Original Author: Jaewon Jung

Recently, while I was working on a screen-space shader effect, I had to do some random sampling over the surface of a sphere. Effective sampling requires a uniform distribution of samples. After some quick googling, I found a way to generate uniformly distributed samples ([1]), and it gave decent results for my application. Still, unsure whether that was the ideal way, I did some proper research on it later. What follows is the result of that short research.

Usually, in graphics applications, one can limit the problem to three-dimensional space. In that case, there are four possible approaches, all of which guarantee a uniform distribution (as for what ‘uniform distribution’ means exactly, [6] has some explanations). If n-dimensional support is required, one approach is out, so three remain. Let’s take stock of each.

Rejection sampling ([2][4][5])

One simple way is called ‘rejection sampling’. For each of the x, y and z coordinates, choose a random value from a uniform distribution on [-1, 1]. If the length of the resulting vector is greater than one, reject it and try again; otherwise, normalize it so that it lies on the sphere. Obviously, this method generalizes to n dimensions. But the higher the dimension, the higher the rejection rate, so the less efficient the technique becomes.
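A minimal Python sketch of rejection sampling follows (this is not the author's C++ implementation; the function name is made up). Points are drawn from the cube, kept only if they fall inside the unit ball, and then normalized onto the sphere. In three dimensions the acceptance rate is the ratio of the ball's volume to the cube's, π/6, or roughly 52%.

```python
import math
import random

def sample_sphere_rejection(rng=random):
    """Return a uniformly distributed point on the unit sphere by rejection sampling."""
    while True:
        x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
        length_sq = x * x + y * y + z * z
        # Reject points outside the unit ball, and the degenerate near-zero vector
        # that could not be normalized reliably.
        if 1e-12 < length_sq <= 1.0:
            inv = 1.0 / math.sqrt(length_sq)
            return (x * inv, y * inv, z * inv)
```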

Normal deviate ([2][5])

This technique chooses x, y and z from a normal distribution of mean 0 and variance 1. Then normalize the resulting vector and that’s it. [2] shows why this method can generate a uniform distribution over a sphere. In short,

It works because the vector chosen (before normalization) has a density that depends only on the distance from the origin.

as [5] explains. This also generalizes to n dimensions without a hassle.
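Sketched in Python (again an illustration, with a made-up function name and a `dim` parameter added to show the n-dimensional case):

```python
import math
import random

def sample_sphere_gaussian(dim=3, rng=random):
    """Return a uniformly distributed point on the unit sphere in `dim` dimensions."""
    while True:
        # Independent standard normal coordinates: mean 0, standard deviation 1.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:               # guard against the (vanishingly unlikely) zero vector
            return tuple(c / norm for c in v)
```

Unlike rejection sampling, nothing is thrown away here, so the cost per sample does not grow with the rejection rate as the dimension increases.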

Trigonometry method ([1][3][4][5])

This one works only for the sphere in three-dimensional space (called the 2-sphere in the literature, because its surface has two degrees of freedom), but it is the easiest one to grasp intuitively. [1] nicely explains why it works using Archimedes’ theorem:

The area of a sphere equals the area of every right circular cylinder circumscribed about the sphere excluding the bases.

The exact steps are as below:

  • Choose z uniformly distributed in [-1,1].
  • Choose t uniformly distributed on [0, 2*pi).
  • Let r = sqrt(1-z^2).
  • Let x = r * cos(t).
  • Let y = r * sin(t).

This is the one I used for my shader effect. Since I had to use a very small number of samples for the sake of performance, I used stratified sampling with this method. The straightforward extension to stratified sampling is another advantage of this technique.
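The steps above, plus one possible way of stratifying them, can be sketched in Python as follows. The jittered-grid stratification is an assumption of this sketch (the shader's actual scheme is not described), and both function names are made up.

```python
import math
import random

def sample_sphere_trig(u, v):
    """Map (u, v) in [0, 1)^2 to the unit sphere: z = 2u - 1, t = 2*pi*v."""
    z = 2.0 * u - 1.0
    t = 2.0 * math.pi * v
    r = math.sqrt(max(0.0, 1.0 - z * z))   # max() guards against rounding below zero
    return (r * math.cos(t), r * math.sin(t), z)

def stratified_sphere_samples(n_z, n_t, rng=random):
    """Return one jittered sample per cell of an n_z x n_t grid over (u, v)."""
    points = []
    for i in range(n_z):
        for j in range(n_t):
            u = (i + rng.random()) / n_z
            v = (j + rng.random()) / n_t
            points.append(sample_sphere_trig(u, v))
    return points
```

Because the mapping from (u, v) to the sphere preserves area, equal-sized grid cells map to equal-area patches on the sphere, which is why stratification is so straightforward with this method.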

Coordinate approach ([2][3][5])

The last one is applicable to general n dimensions, and [2] explains its quite math-heavy derivation in detail. This technique first gets the distribution of a single coordinate of a uniformly distributed point on the N-sphere. Then it recursively gets the distribution of the next coordinate over the (N-1)-sphere, and so on. Fortunately, for the usual 3D space (i.e. the 2-sphere), the distribution of a coordinate is uniform, and one can do rejection sampling in 2D for the remaining 1-sphere (i.e. a circle). The exact way is explained in [5] as a variation of the trigonometry method.
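For the 3D case, one possible reading of that description is sketched below in Python; it is not necessarily the exact procedure given in [5]. The z coordinate is drawn uniformly, and the direction within the horizontal circle is found by 2D rejection sampling instead of calling sin and cos.

```python
import math
import random

def sample_sphere_coordinate(rng=random):
    """Uniform z, then a uniform direction on the circle via 2D rejection sampling."""
    z = rng.uniform(-1.0, 1.0)
    while True:
        a = rng.uniform(-1.0, 1.0)
        b = rng.uniform(-1.0, 1.0)
        s = a * a + b * b
        if 1e-12 < s <= 1.0:        # keep only points inside the unit disk
            break
    # Rescale (a, b) so that it has length sqrt(1 - z^2), the circle's radius at height z.
    scale = math.sqrt((1.0 - z * z) / s)
    return (a * scale, b * scale, z)
```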

Code and Pictures

Even if you haven’t fully understood everything up to this point, don’t worry. The source code will fill in the gaps in your understanding. You can find my naive C++ implementations of the techniques above here:

500 points by rejection sampling

SQL Server Performance: Part 1

Original Author: Ted Spence

I first began working with SQL (the SQL-92 dialect) in 1995.  At the time I’d only ever used raw disk IO for storage, and SQL was a complete shock.  Every bit of data I needed could be stored using a single API, and the database server took care of all the hard work. In 1997 I switched to Microsoft SQL Server 6.5, and it quickly became my preferred database.  Not everyone in the gaming world uses SQL, but if you do, here are some performance tuning lessons I’ve learned along the way.

Part 1: Effective use of SQL Server

SQL Server is almost completely devoid of “magic performance tricks.”  This article isn’t going to be about tuning configuration settings; instead, it’s going to cover how to give yourself enough flexibility and freedom to spot performance problems early, and make changes to improve performance when you identify a problem.

Design Flexibility

Before we start, let’s cover a few helpful design ideas that can make performance tuning easier in the future.  Each one of these could be discussed as its own topic, but let’s do a quick top level view.

  • Use SQL only for data that must be varnish-cached), and so on.
  • Design your SQL environment to be split apart.  If you have eight or nine types of data in your system, make them each their own databases.  That way, when your game suddenly explodes in size, each one can be scaled upward at its own pace.  Within a single database, I find that it’s useful to give each subsystem its own three-character prefix so that I can rapidly group related objects together. The only drawback is that you won’t be able to use foreign keys across distributed databases; but in most cases that’s necessary for high performance anyway.

    Imagine you had to cut your database in half - make sure it's clear how to do it.

    Isolating Tables

  • Use SQL licensing effectively.  Since SQL Server Express is free but comes with a database size limitation, I like to use it in proof of concept environments.  When the database grows larger, I can migrate it to SQL Standard with just a new connection string.
  • If you have a physical SQL server, put your TempDB on its own, physically isolated disk drive.  Ideally, use a sub-$100 SSD for this.  There are lots of nifty tricks you can do with a high performance TempDB drive.
  • Get an MSDN subscription and use your development SQL license to have extra non-production servers sitting around.  This means you can always experiment with a new database configuration.
  • Don’t write SQL directly into your application.  Using stored procedures mostly removes the SQL language parser overhead.  Additionally, when you discover a performance problem, a stored procedure can be surgically altered while still in production.
  • Make use of your database maintenance plans.  Note that these only work on SQL Standard; but it can be awfully useful to have your database run a checkpoint and incremental backup every hour.  Indexes should be rebuilt nightly.
  • Learn “Third Normal Form”.  The full details of 3NF are quite complex, but in practice, what it means is that each datum should exist in the database once and only once.  A classic example is storing the department name for an employee database; here’s the right and wrong way to do it:
    Strings are big - so make sure each string is in the database only once.

    Move your data into specialized tables with useful primary keys.
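As a small illustration of the department example, here is a sketch that uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server. The table and column names are hypothetical. The department name is stored once and employees refer to it by an integer key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each department name is stored exactly once...
    CREATE TABLE departments (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT NOT NULL UNIQUE
    );
    -- ...and employees refer to it by its compact integer key.
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES departments(department_id)
    );
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Art');
    INSERT INTO employees VALUES (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Cho', 2);
""")

# Renaming a department touches one row, no matter how many employees it has.
conn.execute("UPDATE departments SET department_name = 'Tech' WHERE department_id = 1")

rows = conn.execute("""
    SELECT e.employee_name, d.department_name
      FROM employees e
           INNER JOIN departments d ON d.department_id = e.department_id
     ORDER BY e.employee_id
""").fetchall()
```

In the un-normalized alternative, the department name would be repeated as a string on every employee row, and the rename would have to update all of them.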

I should also warn you that databases are easy to take for granted. SQL Server is a very friendly, deceptively simple program.  Often the dev database you create will seem so fast that you’ll be tempted to ignore performance tuning, until the production server grinds to a halt on deploy day. So let’s review a few tricks to enable you to see these problems before they get pushed live.

Spotting Performance Problems

Everyone’s best friend should be “Show Execution Plan.”  For each query you write, click the little button that shows either the estimated or actual execution plan. The results will look like this:

A Sample Execution Plan

An execution plan that could be optimized.

In this screen, you see a “Table Scan” that consumes a large amount of time that could be optimized out by creating an index.  It’s the equivalent to having an unstructured array and scanning through every element blindly.  Adding an index allows your database to use the equivalent of adding a hash table to look up data faster.  The great value of SQL Server is that you can dynamically tune the indexes your application uses without modifying your code!
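The scan-versus-index difference can be reproduced in miniature. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in (SQL Server's plan viewer is graphical and its plan operators are named differently), with a made-up table and index name. It asks the engine for its query plan before and after creating an index, without touching the query itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE itm_items ("
    "item_id INTEGER PRIMARY KEY, itemclass_id INTEGER, item_name TEXT)"
)
conn.executemany(
    "INSERT INTO itm_items (itemclass_id, item_name) VALUES (?, ?)",
    [(i % 10, f"item {i}") for i in range(1000)],
)

def plan(sql):
    """Return SQLite's textual query-plan descriptions for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT item_name FROM itm_items WHERE itemclass_id = 3"
plan_before = plan(query)   # expected to describe a full scan of itm_items
conn.execute("CREATE INDEX ix_items_class ON itm_items (itemclass_id)")
plan_after = plan(query)    # expected to describe a search using ix_items_class
```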

You should look through the execution plan for the following:

  • Missing indexes. SQL Server will helpfully tell you, highlighted in green, the exact index it wishes it had.  Right click on it and select “Missing Index Details”; you can then give it a name and create it for an instant performance boost. But remember that this isn’t always the best idea.  Each index you create slows down inserts, updates, and deletes.  Indexes also consume memory, which can result in page faults when you’re under memory pressure.  If you can, make all your indexes use integer IDs, since they are very memory-dense and fast to search.  Organize indexes with multiple values so that the most restrictive values are first – that will allow the query optimizer to exclude large volumes of rows rapidly.
  • Table Scans. As described above, if SQL server is spending a significant amount of time scanning your tables, try to figure out ways to modify your indexes to convert them to “Index Seek” instead.
  • Too many joins.  If your execution plan is really tall and wide, this usually means that SQL server is joining lots of tables together in a single step.  Lots of joins are an inescapable side effect of a normalized database; I’ll write a future article on the subject in detail. But for the moment, look for any tables that can be optimized away. Try declaring a variable to contain some of the values you want and selecting them early.  For example, look at these two equivalent SQL snippets – one of them only joins two tables, and the other joins three:

-- SLOW WAY: Get weapons sales by querying three tables
-- (price and unit_count are summed per item so that the GROUP BY is valid)
SELECT i.item_name, i.item_id, SUM(h.price) AS total_price, SUM(h.unit_count) AS total_units
  FROM act_auctionhistory h
       INNER JOIN itm_items i ON i.item_id = h.item_id
       INNER JOIN itm_itemclasses ic ON ic.itemclass_id = i.itemclass_id
 WHERE ic.itemclass_name = 'weapon'
 GROUP BY i.item_id, i.item_name

-- FAST WAY: Get the same data but use only two tables at a time
DECLARE @cid INT
SELECT @cid = itemclass_id FROM itm_itemclasses WHERE itemclass_name = 'weapon'

SELECT i.item_name, i.item_id, SUM(h.price) AS total_price, SUM(h.unit_count) AS total_units
  FROM act_auctionhistory h
       INNER JOIN itm_items i ON i.item_id = h.item_id
 WHERE i.itemclass_id = @cid
 GROUP BY i.item_id, i.item_name

Here’s another tip. Log on to your live server (make sure you use read-only privileges of course!). Right click on your SQL Server in the left hand pane and select “Activity Monitor”, then “Recent Expensive Queries”.  I find that I get a lot of mileage out of just checking this page every few hours:

SQL Server's activity monitor - worth checking regularly, like a facebook page for your database.

Check this page regularly!

So when you see this screen, what do you look for and how do you make use of the information? Start by right clicking on each query and showing its execution plan.

Log Your Database Activity

Another worthwhile tip is to modify your code to write a log of database queries to your debug output.  When you’re using advanced ORM tools like NHibernate, it’s very easy to write code that has a side effect of generating unexpected database hits.  For example, consider this simple logic:

for (int i = 0; i < itemarray.Length; i++) {
    if (itemarray[i].IsQuestObject()) {
        return true;
    }
}

In most cases this logic is fine and fast.  But imagine that the programmer down the hall fixes a bug by making IsQuestObject() ping the database – all of a sudden you’ve got a nearly-invisible massive performance penalty. The database is often so fast that you don’t notice these kinds of performance sinks. So I modify my ORM to emit debug output every time it executes a SQL statement.

As a result, when running in debug, I see the following:

SQL: list_OrderedUserAccounts (Time: 00:00:00.0030000)
SQL: get_PermissionsByUser (Time: 00:00:00.0020000)
SQL: RetrievePermissionGroupObject (Time: 00:00:00.0020000)
SQL: get_JoinsByLabel (Time: 00:00:00.0010000)
SQL: get_JoinsByLabel (Time: 00:00:00.0010000)
SQL: get_JoinsByLabel (Time: 00:00:00.0010000)
SQL: ListProductObject (Time: 00:00:01.3680000)
SQL: list_OrderedPlatforms (Time: 00:00:00.0020000)

When I look at this output, I immediately check to see if I’ve accidentally written a loop that calls get_JoinsByLabel too many times.  Next I investigate the query “ListProductObject” to see why it took almost 1.4 seconds.
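The logging hook itself depends on your ORM, but the idea can be sketched independently of any particular tool. The Python example below wraps a `sqlite3` connection (a stand-in for a real data layer; `LoggingConnection` and its methods are hypothetical names), logs each statement with its duration, and flags statements that repeat suspiciously often:

```python
import sqlite3
import time

class LoggingConnection:
    """Thin wrapper that records every statement and how long it took."""

    def __init__(self, conn, log=print):
        self._conn = conn
        self._log = log
        self.history = []            # (sql, seconds) pairs, oldest first

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self._conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        self.history.append((sql, elapsed))
        self._log(f"SQL: {sql} (Time: {elapsed:.7f}s)")
        return rows

    def repeated_statements(self, threshold=3):
        """Statements issued at least `threshold` times: a hint of a query inside a loop."""
        counts = {}
        for sql, _ in self.history:
            counts[sql] = counts.get(sql, 0) + 1
        return [sql for sql, n in counts.items() if n >= threshold]

lines = []
db = LoggingConnection(sqlite3.connect(":memory:"), log=lines.append)
db.execute("CREATE TABLE joins (label TEXT)")
for _ in range(3):
    db.execute("SELECT * FROM joins WHERE label = ?", ("x",))
```

Logging the parameterized statement text rather than the bound values keeps identical queries grouped together, which makes an accidental query-in-a-loop easy to spot.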

Combine multiple queries

Nothing improves performance quite like reducing the number of queries you execute.  Rather than submitting ten queries rapidly, why not make use of a single stored procedure that returns ten result sets?  Let’s say you’re working on the front page of your in-game auction house, but when the user mouses over an item name or a player name, you want to show some secondary details.

You could do this by writing one query for the auction house page and a second query for each time the user hovers their mouse.  But that would produce dozens of queries every time the mouse moved.

Instead, let’s create a single compound query that returns both the auction page and the popup text for everything.  Even though at first glance you may think you’re doing unnecessary work, SQL Server is already looking up all the objects using hash tables, and you’re reducing query overhead significantly by reducing volume.  Here’s roughly how to do it:

CREATE PROCEDURE get_AuctionHouseFirstPage AS

-- Preparation: Select the basic auction data into a temporary table
SELECT auction_id, character_id, item_id, auction_date, ...
  INTO #temp_auction_page
  FROM ACT_Auctions

-- First result set: Return the auction page
SELECT ... FROM #temp_auction_page

-- Second result set: Return pop-up information about characters
SELECT user_name, guild_name, ...
  FROM CHR_Characters c
       INNER JOIN #temp_auction_page t ON c.char_id = t.character_id

-- Third result set: Return pop-up information about items
SELECT item_name, item_enchantment, ...
  FROM ITM_Items i
       INNER JOIN #temp_auction_page t ON i.item_id = t.item_id

For Next Time

Many of you may be asking – what about SQL vs NoSQL databases? I encourage you to tread lightly and consider them carefully on their own merits. Unlike SQL, which provides specific guarantees to data behavior at the cost of performance, NoSQL databases have their own unique advantages and drawbacks. There are many situations in gaming that are extremely well suited for NoSQL implementations; but this is a complex issue that really deserves its own article.

For the next SQL Server performance article, I’ll look at the speed of database inserts and updates, and the side effects of some basic choices in software style.