In developing new systems we frequently focus on a new
system's functional requirements, its interface, its extensibility and how
generically we can write the code, taking a "we'll optimise this later if it
turns out to be a problem" approach, very much in the spirit of the often
cited quote "premature optimization is the root of all evil". This blog post
outlines why I believe failing to consider memory access patterns and
performance from the outset is flawed, and potentially very expensive, for
the developer of any performance critical application.
What Donald Knuth actually wrote in his 1974 paper Structured
Programming with go to Statements is this: "Programmers waste
enormous amounts of time thinking about, or worrying about, the speed of
noncritical parts of their programs, and these attempts at efficiency actually
have a strong negative impact when debugging and maintenance are considered. We
should forget about small
efficiencies, say about 97% of the time: premature optimization is the
root of all evil. Yet we should not pass up our opportunities
in that critical 3%."
That was 1974.
Taking a step back just for a moment, let us consider: what is
optimisation?
Optimisation falls broadly into two categories:
- Doing less of what you’re doing
- Doing what you’re doing only faster
I'm not going to waste any of your time going over option 1;
eliminating work is the best win there is.
Ok, so you have ensured the task runs only the minimum number
of times and it's still a bottleneck. What next? The first thing
is to identify where the time is going. Here are a few commonly encountered
culprits:
- Arithmetic instructions
- Branch prediction misses
- LHS stalls (Load hit store stalls)
- Cache misses
- Function call overhead
When we come to optimising, diving in and changing code
is easy. If your code ends up limited by arithmetic instructions, branch
prediction misses or even LHS stalls, there are often easy wins to be
made just by changing the code: reworking the maths, inlining, vectorisation,
loop unrolling, clever bit twiddling and so on.
Cache misses are harder to tackle. Sure, we can sprinkle prefetch
instructions around, but they also take computation time and do not really help
if your main bus is already suffering bandwidth problems. (The main bus is,
after all, a shared resource these days.) Beware false prophets: always use a
tool which actually times your code, especially when it comes to using
prefetch!
A bit of background in case you're not familiar with what
cache lines are (which is the important bit). CPUs and GPUs don't really ever
load/store single words from main memory any more; typically they are isolated
from main memory by a tier of increasingly faster but smaller memory caches.
All memory fetches go through this cache hierarchy. The minimum load from main
memory is what's called a cache line, typically around 64-128 bytes (for the
purposes of this blog I will use 128); that is, each cache line maps a 128 byte
region of memory which must be fetched and later stored as a complete block.
The CPU only ever accesses the highest cache tier (L1), from which it can load
individual words as it needs them. Typically access to this L1 cache tier is so
fast that its latency can be hidden by instruction pipelining (a whole topic in
itself). The important takeaway is that reading from L1 is super cheap; having
to stall while another cache line is fetched up is not. If we stall to get data
from L2 or L3, that's bad; if it has to be fetched from main memory, that's a
lot worse. Worse still is loading 128 bytes all the way through the hierarchy
only to use a small percentage of those bytes.
Let us take a step back again and think about memory. The
trend since 1974 has been that processor speeds increase year
on year much faster than memory speeds. Sure, we now have clever caches to help
mitigate this, but overwhelmingly we have moved to an era where instructions
are cheap and memory fetches are not.
So what happens if you're memory access bound? The strategy
is simple: do more work with fewer cache line misses:
- Use all the data in each cache line loaded, don’t load data into the cache that your task doesn’t need
- Fit more data into fewer bytes: reconstruct data or use smaller types (being mindful of int->float and float->int conversion costs)
- Align data to minimise the number of cache lines spanned (tends to waste a little memory)
- Access data sequentially, some platforms predict array access and hardware prefetch the next cache line for you.
- Minimise the number of different memory locations your task is accessing
Ok, data packing is pretty easy to add to an existing
system; so is reshuffling data members, and even recalculating data instead of
storing it (how long is it since sin/cos tables made sense?). But that's really
the end of what you can do short of more serious rewrites to improve access
patterns and eliminate loading of data you don't use (cache pollution).
Unfortunately, large scale data reorganisation is a substantial and difficult
undertaking, meaning both cost and risk, particularly if it spills over into
tools and/or API changes.
In fact this removal of cache pollution is where I (and the
data orientated design guys) have an issue with OOP's encapsulation of all
things 'object' in sequential memory. Ok, for the record: yes, I know
encapsulation doesn't require laying out all data in sequential memory, but in
practice that's what tends to occur; not doing this requires a degree of
discipline which feels like working against the grain of the C++ language. All
things object being encapsulated in sequential memory is absolutely fine, of
course, as long as every performance critical method uses every member and
objects are processed sequentially.
Let's say your object isn't an exact multiple of a cache line in size and/or isn't aligned to the start of a cache line boundary (or your hardware has sequential access cache prefetch units). Loading this unaligned object will also bring in neighbouring data; it has to, as the cache can only load in 128 byte units, and unless you use this data you are pouring bandwidth down the plug hole. Arrays are your friend; naive class factories which 'new' every class you allocate are typically plain evil.
And then there’s concurrency
One of the often cited 'magic bullets' is to make the task
thread safe so it can be executed concurrently with other 'stuff'; that is, we
can execute our task at the same time as other tasks without unsafe concurrent
data reads/writes being made. Making tasks concurrent and fast (i.e. without
thousands of locks) means really getting the memory design right. Get the
memory access protection wrong and you could be in for some really nasty,
infrequent, timing related, master-build-only bugs (everyone's favourite!), or
see all your potential performance increase swallowed by time spent locking.
Remember also that you have only one memory bus: if your task is memory
bandwidth limited, going concurrent isn't really going to help all that much.
Converting to a cache efficient and thread safe data
organisation can be such an upheaval that it's often easier and less risky to
write new than to attempt to move an old system to where it needs to be.
Attempting to rework the old can result in unwanted compromises because the
change is too wide ranging and too risky; on the other hand, writing a new
system can be extremely expensive if the component is large. Neither expense
nor risk-and-compromise is particularly attractive.
Then there’s the API
If you need to change your API to facilitate the
optimisations you need, you can be in for a whole world of hurt, as it's no
longer just 'your' code which needs modification; it's all of your users' code
as well.
Lastly, death by 1000 cuts
Have you ever seen a profile without any hot spots, yet your
frame time is still blown? These are the very worst profiles, where everything
is simply taking too long and everything is memory fetch limited. There is a
ceiling on how many data cache fetches you can perform per frame; you must
maximise the work done per fetch, but where do you start?
The root of all evil 2013
I agree with Donald Knuth when it comes to optimising
arithmetic and logic operations: leave it until you see it climbing up the
profile. But that is certainly no longer the root of all evil. In 2013, failing
to design a cache efficient and thread safe memory organisation is the root of
all evil. As a sin, premature optimisation of code pales by comparison to
producing a new system with poor memory access patterns. Iterating and peer
reviewing memory design is risk free and a relatively quick and easy process.
In 1974 we were writing a lot in assembly, obfuscation was a big problem and
optimising made that problem significantly worse. Ok, we still very
occasionally drop down to writing in assembly, but only as a last resort;
we've optimised for memory access already, right?
Comments
"Good read, thanks! I'd suggest a sixth point to the strategy: do more work with less data when possible. Spend some CPU time to calculate data instead of fetching it from memory."
Hi, thanks for the feedback :) You're definitely correct with this; I made mention of reconstructing data in point 2, but I think perhaps I could have made more of this point.
"Plenty of valuable information and a good read, thanks :)"