In developing new systems we frequently focus on a new
system's functional requirements, its interface, its extensibility and how
generically we can write the code, taking a "we'll optimise this later if it
turns out to be a problem" approach, very much in the spirit of the often
cited quote "premature optimization is the root of all evil". This blog post
outlines why I believe failing to consider memory access patterns and
performance from the outset is flawed, and potentially very expensive, for
the developer of any performance critical application.
What Donald Knuth actually wrote in his 1974 paper Structured
Programming with go to Statements is this: "Programmers waste
enormous amounts of time thinking about, or worrying about, the speed of
noncritical parts of their programs, and these attempts at efficiency actually
have a strong negative impact when debugging and maintenance are considered. We
should forget about small
efficiencies, say about 97% of the time: premature optimization is the
root of all evil. Yet we should not pass up our opportunities
in that critical 3%."
That was 1974.
Taking a step back just for a moment, let us consider: what is
optimisation?
Optimisation falls broadly into two categories:
- Doing less of what you’re doing
- Doing what you’re doing only faster
I'm not going to waste any of your time going over option 1;
eliminating work is the best win there is.
Ok, so you have ensured the task runs only the minimum number
of times and it's still a bottleneck. What next? The first thing
is to identify where the time is going. Here are a few commonly encountered
culprits:
- Arithmetic instructions
- Branch prediction misses
- LHS stalls (Load hit store stalls)
- Cache misses
- Function call overhead
When we come to optimising, diving in and changing code
is easy. If your code ends up limited by arithmetic instructions, branch
prediction misses or even LHS stalls, there are often easy wins to be
made just by changing the code: reworking the maths, inlining, vectorisation,
loop unrolling, clever bit twiddling and so on.
Cache misses are harder to tackle. Sure, we can sprinkle prefetch
instructions around, but they also take computation time and do not really help
if your main bus is already suffering bandwidth problems. (The main bus is,
after all, a shared resource these days.) Beware false prophets: always use a
tool which actually times your code, especially when it comes to using
prefetch!
A bit of background in case you're not familiar with what
cache lines are (which is the important bit). CPUs and GPUs don't really ever
load/store single words from main memory any more; typically they are isolated
from main memory by a tier of increasingly faster but smaller memory caches.
All memory fetches go through this cache hierarchy. The minimum load from main
memory is what's called a cache line, typically around 64-128 bytes (for the
purposes of this blog I will use 128); that is, each cache line maps a 128 byte
region of memory which must be fetched and later stored as a complete block.
The CPU only ever accesses the highest cache tier (L1), from which it can load
individual words as it needs them. Typically access to this L1 cache tier is so
fast that its latency can be hidden by instruction pipelining (a whole topic in
itself). The important takeaway is that reading from L1 is super cheap; having
to stall while another cache line is fetched up is not. If we stall to get data
from L2 or L3, that's bad; if it has to be fetched from main memory, that's a
lot worse. Worse still is loading 128 bytes all the way through the hierarchy
only to use a small percentage of those bytes.
Let us take a step back again and think about memory. The
trend since 1974 has been that processor speeds increase year
on year much faster than memory speeds. Sure, we now have clever caches to help
mitigate this, but overwhelmingly we have moved to an era where instructions
are cheap and memory fetches are not.
So what happens if you're memory access bound? The strategy
is simple: do more work with fewer cache line misses:
- Use all the data in each cache line loaded, don’t load data into the cache that your task doesn’t need
- Fit more data into fewer bytes: reconstruct data or use smaller types (being mindful of int->float and float->int conversion costs)
- Align data to minimise the number of cache lines spanned (tends to waste a little memory)
- Access data sequentially, some platforms predict array access and hardware prefetch the next cache line for you.
- Minimise the number of different memory locations your task is accessing
Ok, data packing is pretty easy to add to an existing
system; so is reshuffling data members, and even recalculating data instead of
storing it (how long is it since sin/cos tables made sense?). But that's really
the end of what you can do short of more serious rewrites to improve access
patterns and eliminate loading of data you don't use (cache pollution).
Unfortunately, large scale data reorganisation is a substantial and difficult
undertaking, meaning both cost and risk, particularly if it spills over into
tools and/or API changes.
In fact this removal of cache pollution is where I (and the
data orientated design guys) have an issue with OOP's encapsulation of all
things 'object' in sequential memory. Ok, for the record: yes, I know
encapsulation doesn't require laying out all data in sequential memory, but in
practice that's what tends to occur; not doing this requires a degree of
discipline which feels like working against the grain of the C++ language. All
things object being encapsulated in sequential memory is absolutely fine, of
course, as long as every performance critical method uses every member and
objects are processed sequentially.
Let's say your object isn't an exact multiple of a cache line in size and/or isn't aligned to the start of a cache line boundary (or your hardware has sequential access cache prefetch units). Loading this unaligned object will also bring in neighbouring data; it has to, as the cache can only load in 128 byte units, and unless you use this data you are pouring bandwidth down the plug hole. Arrays are your friend; naive class factories which 'new' every class you allocate are typically plain evil.
And then there’s concurrency
One of the often cited 'magic bullets' is to make the task
thread safe so it can be executed concurrently with other 'stuff'; that is, we
can execute our task at the same time as other tasks without unsafe concurrent
data reads/writes being made. Making tasks concurrent and fast (i.e. without
thousands of locks) means really getting the memory design right. Get the
memory access protection wrong and you could be in for some really nasty,
infrequent, timing related, master-build-only bugs (everyone's favourite!), or
see all your potential performance increase swallowed by time spent locking.
Remember also that you have only one memory bus: if your task is memory
bandwidth limited, going concurrent isn't really going to help all that much.
Converting to a cache efficient and thread safe data
organisation can be such an upheaval that it's often easier and less risky to
write new than to attempt to move an old system to where it needs to be.
Attempting to rework the old can result in unwanted compromises because the
change is too wide ranging and too risky; on the other hand, writing a new
system can be extremely expensive if the component is large. Neither expense
nor risk-and-compromise is particularly attractive.
Then there’s the API
If you need to change your API to facilitate the
optimisations you need, you can be in for a whole world of hurt, as it's no
longer just 'your' code which needs modification; it's all of your users' code
as well.
Lastly, death by 1000 cuts
Have you ever seen a profile without any hot spots, yet your
frame time is still blown? These are the very worst profiles, where everything
is simply taking too long and everything is memory fetch limited. There is a
ceiling on how many data cache fetches you can perform per frame; you must
maximise the work done per fetch, but where do you start?
The root of all evil 2013
I agree with Donald Knuth when it comes to optimising
arithmetic and logic operations: leave it until you see it climbing up the
profile. But that is certainly no longer the root of all evil. In 2013, failing
to design a cache efficient and thread safe memory organisation is the root of
all evil. As a sin, premature optimisation of code pales by comparison to
producing a new system with poor memory access patterns. Iterating and peer
reviewing memory design is risk free and a relatively quick and easy process.
In 1974 we were writing a lot in assembly, obfuscation was a big problem and
optimising made that problem significantly worse. Ok, we still very
occasionally drop down to writing in assembly, but only as a last resort;
we've optimised for memory access already, right?
Comments
"Good read, thanks! I'd suggest a sixth point to the strategy: do more work with less data when possible. Spend some CPU time to calculate data instead of fetching it from memory."
Hi, thanks for the feedback :) You're definitely correct with this; I made mention of reconstructing data in point 2, but I think perhaps I could have made more of this point.
"Plenty of valuable information and a good read, thanks :)"