<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Mike Acton</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/" />
    <link rel="self" type="application/atom+xml" href="http://www.cellperformance.com/mike_acton/atom.xml" />
   <id>tag:www.cellperformance.com,2008:/mike_acton//2</id>
    <link rel="service.post" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2" title="Mike Acton" />
    <updated>2007-04-08T07:14:12Z</updated>
    <subtitle>Thoughts on performance, the video game industry, and development.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 3.2</generator>
 
<entry>
    <title>Utility: match</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2007/04/utility_match.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=88" title="Utility: match" />
    <id>tag:www.cellperformance.com,2007:/mike_acton//2.88</id>
    
    <published>2007-04-07T06:13:23Z</published>
    <updated>2007-04-08T07:14:12Z</updated>
    
    <summary>Sharing a little utility called match which I use in conjunction with uniq.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="sticky-note">
<b>Update!</b> I fixed up all the greater-than and less-than symbols in this entry. It didn't make much sense before. I always forget to escape those in the HTML.
</div>

I'm just sharing a little utility I use all the time called <b>match</b>. <br />
<br />

<pre class="code">
Usage: ./match [-h] &lt;source_file&gt; &lt;uniq_file&gt;

For each line in &lt;source_file&gt; print the index to the 
first matching line in &lt;uniq_file&gt;.

[-h] Print results in 32 bit hexadecimal (default is decimal)

Note: The max line width supported is 4095 characters.
Note: Maximum number of lines supported is (2^32)
</pre>

If I have a source file of data represented as text (as I often do, because I find binary dumps easier to read in a text editor than in a special "hex editor"), I use match to create a table of indices to unique lines (often these correspond to 128 bits since that's the size of an SPU register).<br />
<br />
I commonly use it like so (given I have a file called "source_file"):
<pre class="code">
sort source_file | uniq &gt; uniq_file
match source_file uniq_file
</pre>

Now I have a handy table of indices! <br />
<br />
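As a rough illustration of the behavior described in the usage text (this is not the downloadable match.c, just a sketch under assumptions), the core of such a tool could look like the following. The names <b>first_match</b> and <b>print_matches</b> are hypothetical, the unique lines are assumed to be already loaded into memory without trailing newlines, and printing -1 for a line with no match is an assumption since the usage text does not say what happens in that case.<br />

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_CHARS 4095  /* max line width quoted in the usage text */

/* Return the index of the first entry in uniq[] equal to line, or -1 if none.
   A linear scan keeps the sketch simple; the real tool may do something smarter. */
long first_match(const char *line, char *const *uniq, size_t uniq_count)
{
    for (size_t i = 0; i < uniq_count; i++) {
        if (strcmp(line, uniq[i]) == 0) {
            return (long)i;
        }
    }
    return -1;
}

/* For each line read from src, print the index of its first match in uniq[].
   A non-zero hex selects 32 bit hexadecimal output, otherwise decimal. */
void print_matches(FILE *src, char *const *uniq, size_t uniq_count,
                   int hex, FILE *out)
{
    char line[MAX_LINE_CHARS + 2];  /* room for the newline and the terminator */

    while (fgets(line, sizeof line, src) != NULL) {
        line[strcspn(line, "\n")] = '\0';  /* drop the trailing newline */
        long index = first_match(line, uniq, uniq_count);
        if (index < 0) {
            fprintf(out, "-1\n");  /* assumption: miss handling is unspecified */
        } else if (hex) {
            fprintf(out, "%08lx\n", (unsigned long)index);
        } else {
            fprintf(out, "%ld\n", index);
        }
    }
}
```

Calling <b>print_matches(stdin, uniq, uniq_count, 0, stdout)</b> after reading uniq_file into <b>uniq</b> would reproduce the decimal mode of the command shown above.<br />
<br />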
Download: <a href="http://www.cellperformance.com/public/attachments/match.c">match.c</a><br />
]]>
        
    </content>
</entry>
<entry>
    <title>Open Source and Console Games</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2006/08/open_source_and_console_games.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=69" title="Open Source and Console Games" />
    <id>tag:www.cellperformance.com,2006:/mike_acton//2.69</id>
    
    <published>2006-08-09T08:45:28Z</published>
    <updated>2006-12-26T06:25:26Z</updated>
    
    <summary>The free and open source software which we gladly take advantage of can be thought
of as the proverbial &quot;shoulders of giants&quot;. When we forget what brought us the 
advantages to get where we are, we do a disservice to ourselves and the health of 
our industry, and thus ultimately a disservice to our shareholders and customers.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="sticky-note">
	On August 16, 2006 I participated in a <a href="http://www.digitalhollywood.com/%231BBlkSessions/BBWedFourWork.html">
	panel discussion on Open Source and media</a> as part of Digital Hollywood's <a href="http://www.digitalhollywood.com/BuildingBlocks.html">Building Blocks 2006</a>
	conference.<br />
	<br />
	Here is the description of the panel [from digitalhollywood.com]
	<div class="quote">
		The Open Source movement began during the dot.com rise with young companies
		developing great tools to deliver applications and services across
		multiple platforms. The consumer’s appetite for new content driven
		experiences has expanded to include ways to view, manage, and share
		content across devices. With the changing landscape around the home,
		Open Source promises to power a new generation of applications running
		over today’s high-speed networks and the systems used to create,
		manage, and distribute that content.<br /> 
		<br /> 
		Come join key leaders in the global electronics, online, and media 
		communities to discuss Open Source's definition, and learn how companies
		will create systems, infrastructure, and applications for the next 
		generation of the Consumer Entertainment Experience.
	</div><br />
	<br />
	For those of you who did not attend, I would like to take an
	opportunity to discuss here my personal opinions on these issues.
</div>]]>
        <![CDATA[<div class="subtitle">Background</div>

From the description of the panel, some people might be led to believe that free and
open source software are new phenomena, somehow linked to the internet bubble. This
is definitely not true.<br />
<br />

Certainly the history of "free software" can be traced back much further than the
"dot com rise", and because much of the software we use (for example, GNU/Linux) is a mix 
of both "open" and "free" software, we should consider the larger context.<br />
<br />

By most accounts, this begins with Richard Stallman <a href="http://www.gnu.org/gnu/initial-announcement.html">announcing his plan</a>
to "free unix" on usenet in 1983. But of course, the distribution of source code
freely among programmers can be traced back much further than that.<br />
<br />

<div class="subtitle">Terminology</div>

No discussion of "open source" can be complete without distinguishing between 
the subtle differences of "open source software" and "free software":<br />
<br />
See: <a href="http://www.gnu.org/philosophy/free-sw.html">Definition of "free software"</a><br />
See: <a href="http://www.opensource.org/docs/definition_plain.php">Definition of "open source software"</a>
<br />
Here's an article which tries to <a href="http://www.itworld.com/AppDev/350/LWD010523vcontrol4/">clarify the differences.</a><br />
<br />
There has been some muddying of the waters by Microsoft's relatively recent 
<a href="http://en.wikipedia.org/wiki/Shared_source">"shared source"</a> initiative
(but there is general agreement that this is neither really "open" nor "free" and
so not part of this discussion).<br />
<br />

<div class="subtitle">Licenses</div>

There are very many <a href="http://www.opensource.org/licenses/">open source licenses</a> 
and one canonical <a href="http://www.gnu.org/copyleft/gpl.html">free software license</a>.
Many companies (including IBM, SGI, Apple, ...) have also produced their own variants.<br />
<br />

I think applicability of license to product merits some discussion here. 
For example, in my field (console video games), we are restricted by the platform 
owners (Sony, Nintendo, etc.) by NDA and cannot release specific details to the public. 
This necessarily limits our choices when using "open source software" and nearly 
eliminates "free software" as an option, as we often cannot fully reciprocate our 
modifications to the public.<br />
<br />
I do not see this as a problem, nor as a challenge to be overcome. Authors of free
software are often willing to distribute their software without cost so that 
others may take advantage of the work that they've done, and they only ask for
one thing in return - that the software remain free. Just as when a
middleware vendor charges half a million dollars more than you're willing
to pay, if the price for the software is too steep, then something else must
be used instead. I wholeheartedly respect the work of the FSF 
(<a href="http://www.fsf.org">Free Software Foundation</a>), but I understand
that the practical nature of our business makes it very difficult to directly
use the products of their hard work, and the work of so many other free software
developers, in our own products.<br />
<br />
Free software definitely has its place in game development, however. I do most
of my work on GNU/Linux desktops and GCC is my compiler of choice. Additionally, many
offline tools used directly or indirectly to develop the games themselves are
based on free or open software, and I'm grateful that those tools exist.<br />
<br />

<div class="subtitle">Reciprocity</div>

I think reciprocity is the most important thing we can be discussing in the 
context of open source software and console games. It cannot be simply a 
matter of "how open source benefits us", but we must also discuss "how we can
participate in the open source community" and what responsibilities we have
for doing so.<br />
<br />

The free and open source software which we gladly take advantage of (if not in the games 
themselves, then certainly in the tools that develop them) can be thought
of as the proverbial "shoulders of giants". When we forget what brought us the 
advantages to get where we are, we do a disservice to ourselves and the health of 
our industry, and thus ultimately a disservice to our shareholders and customers.<br />
<br />

I think Yahoo Search's vision statement applies equally well to the role of open source
software:<br />
<br />
"Enable people to <b>find</b>, <b>use</b>, <b>share</b> and <b>expand</b> all human knowledge"<br />
<br />

To share and contribute not only benefits us now, but will continue to benefit us when our 
current products are forgotten and dusty.<br />
<br />

<div class="subtitle">Cost of Openness</div>

There is an ongoing debate on the cost of sharing your work with the world. Perhaps there
will be a higher cost in support when calls and emails arrive from users that have configured
the software in some strange environment. Maybe it will give competitors an edge when they
can clearly read the "secrets" of your product in the source code. Most arguments, including
these, are never really so much about the costs involved (consider how many millions of dollars
are spent developing the typical console title) but rather question the value of sharing, i.e. the return
on the investment. <br />
<br />  

<div class="sticky-note">
Consider this: The console game industry is a fast-moving industry. Consoles
change, methods change and even the developers themselves change rapidly and constantly. Success of
a title is usually determined by the quality of the content, not the engine that drives it, although occasionally
the field of successful titles is punctuated by technical achievement. But if competitors need access to the
source of a successful product in order to become successful themselves, <i>they are already behind</i>, and
no amount of access will allow them to gain on the continued developments of the leaders. And if it does help
to make their product a little better, that's a good thing - good games are good for the platform, and what's good
for the platform is good for developers wanting to sell their games on that platform.
</div>

<b>The value of openness is in the people, not the source code.</b><br />
<br />   
<ul>
<li><b>Invest in the future.</b> The programmers reading, modifying and commenting on the source may 
belong to the next-generation of coders in the industry. Help them learn by providing examples of real-world
challenges and their solutions.</li>
<li><b>Invest in your team.</b> The best way to learn is to teach. Simply by explaining what they've done, 
programmers will come up with new ideas and find areas that they've missed. This is no minor point - a 
studio's value is in its people and since there are very few traditional training courses for the professional developer,
a good studio must find different ways of helping make those developers better each day at what they do.</li>
</ul>

<div class="subtitle">Call to Arms</div>
 
Electronic Arts made a considerable difference to not only games but to many different industries when they released
the <a href="http://www.szonye.com/bradd/iff.html">EA IFF 85</a> Standard for Interchange Format Files. And it is
in that tradition, almost twenty-two years later, that I hope game developers, studios and publishers will redouble their
efforts to share what they have created and learned with the community. id Software, the modern poster child for
sharing their technology, certainly hasn't lost anything by releasing some of their older sources.<br />
<br />  

Start small - a function, a snippet even. But if we make it a habit, we will all be rewarded. <br />
<br />  

<div class="sticky-note">
Has your studio released something into the wild? Tell me about it and I will happily list it here.
</div>
]]>
    </content>
</entry>
<entry>
    <title>Understanding Strict Aliasing</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=51" title="Understanding Strict Aliasing" />
    <id>tag:www.cellperformance.com,2006:/mike_acton//2.51</id>
    
    <published>2006-06-01T09:35:19Z</published>
    <updated>2008-02-26T05:21:46Z</updated>
    
    <summary>Strict aliasing has been part of C programming for the better part of the last decade but a thorough understanding of the details of this feature is still clouded in mystery for many programmers. Examine detailed examples and some peculiarities of GCC&apos;s implementation.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="sticky-note"><strong>UPDATED! (25 Feb 08) Added note on bitfields. Corrected a typo (Thanks <span class="monospace-strong">Turly O'Connor</span>!)</strong></div>
<div class="sticky-note"><strong>UPDATED! (08 Aug 06) More Clarifications! 
Special thanks to <span class="monospace-strong">Nicolas Riesch</span>, <span class="monospace-strong">André de Leiradella</span> and <span class="monospace-strong">pinskia</span> for their comments and suggestions.
</strong></div>
<div class="sticky-note"><strong>UPDATED! (28 Dec 06) Minor fixes. 
Special thanks to <span class="monospace-strong">Kobi Cohen-Arazi</span> and <span class="monospace-strong">Chris Pickett</span>.
</strong></div>


<div class="subtitle">Aliasing</div>

One pointer is said to <i>alias</i> another pointer when both refer to the same location or object. In this example,
<pre class="code">
<span class="line-number">  0</span>uint32_t 
<span class="line-number">  1</span>swap_words( uint32_t arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  uint16_t* const sp = (uint16_t*)&amp;arg;
<span class="line-number">  4</span>  uint16_t        hi = sp[0];
<span class="line-number">  5</span>  uint16_t        lo = sp[1];
<span class="line-number">  6</span>  
<span class="line-number">  7</span>  sp[1] = hi;
<span class="line-number">  8</span>  sp[0] = lo;
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  return (arg);
<span class="line-number"> 11</span>} 
</pre>
<div class="rule-of-thumb">
Using GCC 3.4.1 and above, the above code will generate <b>warning: dereferencing type-punned pointer will break strict-aliasing rules</b> on line 3.
</div>
The memory referred to by <b>sp</b> is an alias of <b>arg</b> because they refer to the same address in memory. In C99, it is <i>illegal</i> to create an alias of a different type than the original. This is often referred to as the <b>strict aliasing</b> rule. The rule is enabled by default in GCC at optimization levels at or above -O2. Although the above example would compile, the results are undefined. 
More than likely, <strong>arg</strong> would be returned unchanged because a pointer to uint16_t cannot be an alias
to a pointer to uint32_t when applying the strict aliasing rule.<br />
<br />
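For comparison, here is a minimal sketch of the same word swap written so that no <b>uint16_t*</b> ever points at <b>arg</b>. It copies the bytes with <b>memcpy</b>, which is a different technique from the union-based approaches covered later in this article; the function name <b>swap_words_memcpy</b> is made up for this illustration.<br />

```c
#include <stdint.h>
#include <string.h>

/* Swap the two 16 bit halves of arg without creating a type-punned pointer. */
uint32_t swap_words_memcpy(uint32_t arg)
{
    uint16_t halves[2];
    uint16_t tmp;

    memcpy(halves, &arg, sizeof arg);  /* copy the object representation out */
    tmp       = halves[0];
    halves[0] = halves[1];
    halves[1] = tmp;
    memcpy(&arg, halves, sizeof arg);  /* and copy the swapped halves back */

    return arg;
}
```

Whether the two copies are optimized away depends on the compiler and its version, so as with the other examples the generated code should be checked rather than assumed.<br />
<br />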

<div class="rule-of-thumb">
Dereferencing a cast of a variable from one type of pointer to a different type is <em>usually</em> in violation of the strict aliasing rule.
</div>

However, having multiple representations of the same location in memory is often beneficial. Properly balancing the compiler's memory optimizations and the programmer's optimizations based on real-world context and data is a bit of a black art. It requires an understanding of the tradeoffs among what's permitted by the standard, what's the reality of compilers and the value of a particular transformation based on the architecture and the data. It's worth it in the end though when the results speak for themselves.<br />

<div class="sticky-note"> All of the examples in this article have been tested with various versions of GCC. Although you can expect most of the examples to generate similar results across the major compilers, programmers' expectations should always be validated for the compilers and compiler revisions required. </div>
<br /> 

Read on for details on the strict aliasing rule and some common pitfalls.<br />
]]>
        <![CDATA[<ul>
<li><a href="#introduction">What is strict aliasing?</a></li>
<li><a href="#benefits">Benefits to the strict aliasing rule</a></li>
<li><a href="#compatible_type">Casting compatible types</a></li>
<li><a href="#union_1">Casting through a union (1)</a></li>
<li><a href="#union_2">Casting through a union (2)</a></li>
<li><a href="#union_3">Casting through a union (3)</a></li>
<li><a href="#cast_to_char_pointer">Casting to <i>char*</i></a></li>
<li><a href="#gcc_rule_breaking">GCC rule breaking</a></li>
<li><a href="#qa_bitfields">Question about bitfields</a></li>
<li><a href="#c99_standard">C99 standard</a></li>
<li><a href="#summary">Summary</a></li>
</ul>

<div id="introduction" class="subtitle">What is strict aliasing?</div>

<span class="monospace-strong">Strict aliasing is an assumption, made by the C (or C++) compiler, 
that dereferencing pointers to objects of different types will never refer to the same memory location 
(i.e. alias each other).</span>
<br /><br />

Here are some basic examples of assumptions that may be made by the compiler when strict aliasing is
enabled:<br />
<br />

<b>Pointers to different built in types do not alias:</b>
<pre class="code">
<span class="line-number">  0</span>int16_t* foo;
<span class="line-number">  1</span>int32_t* bar;
</pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
never refer to the same location.
<br /><br />
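A small sketch of what that assumption permits, using a hypothetical function <b>store_both</b> that is not part of the original examples: because the two pointers have different built in types, the compiler may treat the store through <b>foo</b> as unable to change <b>*bar</b>.<br />

```c
#include <stdint.h>

/* With strict aliasing enabled, the compiler may return the constant 1
   without reloading *bar after the store through foo. */
int32_t store_both(int32_t *bar, int16_t *foo)
{
    *bar = 1;
    *foo = 2;     /* assumed not to modify *bar */
    return *bar;  /* may be folded to "return 1" */
}
```

When the two pointers really do refer to separate objects, as the rule requires, the result is 1 either way; the optimization only changes behavior for code that was already breaking the rule.<br />
<br />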

<b>Pointers to aggregate or union types with differing tags do not alias:</b>
<pre class="code">
<span class="line-number">  0</span>typedef struct
<span class="line-number">  1</span>{
<span class="line-number">  2</span>  uint16_t a;
<span class="line-number">  3</span>  uint16_t b;
<span class="line-number">  4</span>  uint16_t c;
<span class="line-number">  5</span>} Foo;
<span class="line-number">  6</span>
<span class="line-number">  7</span>typedef struct
<span class="line-number">  8</span>{
<span class="line-number">  9</span>  uint16_t a;
<span class="line-number"> 10</span>  uint16_t b;
<span class="line-number"> 11</span>  uint16_t c;
<span class="line-number"> 12</span>} Bar;
<span class="line-number"> 13</span>
<span class="line-number"> 14</span>Foo* foo;
<span class="line-number"> 15</span>Bar* bar;
</pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
never refer to the same location, even though the contents of the structures are the same.
<br /><br />
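If a <b>Bar</b> does need to be treated as a <b>Foo</b>, one way to stay within the rule is to convert by value rather than cast the pointer. The helper below is a hypothetical sketch (the name <b>foo_from_bar</b> is not from the original examples) and assumes the member-by-member copy is acceptable for the data involved.<br />

```c
#include <stdint.h>

typedef struct { uint16_t a; uint16_t b; uint16_t c; } Foo;
typedef struct { uint16_t a; uint16_t b; uint16_t c; } Bar;

/* Build a Foo from a Bar member by member; no Foo* ever points at a Bar. */
Foo foo_from_bar(const Bar *bar)
{
    Foo foo = { .a = bar->a, .b = bar->b, .c = bar->c };
    return foo;
}
```
<br />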

<b>Pointers to aggregate or union types which differ only by name may alias:</b>
<pre class="code">
<span class="line-number">  0</span>typedef struct
<span class="line-number">  1</span>{
<span class="line-number">  2</span>  uint16_t a;
<span class="line-number">  3</span>  uint16_t b;
<span class="line-number">  4</span>  uint16_t c;
<span class="line-number">  5</span>} Foo;
<span class="line-number">  6</span>
<span class="line-number">  7</span>typedef Foo Bar;
<span class="line-number">  8</span>
<span class="line-number">  9</span>Foo* foo;
<span class="line-number"> 10</span>Bar* bar;
</pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
may refer to the same location, and will not perform the optimizations described below.
<br /><br />


<div id="benefits" class="subtitle">Benefits to The Strict Aliasing Rule</div>

When the compiler cannot assume that two objects are not aliased, it must act very conservatively when accessing memory. For example:

<pre class="code">
<span class="line-number">  0</span>typedef struct
<span class="line-number">  1</span>{
<span class="line-number">  2</span>  uint16_t a;
<span class="line-number">  3</span>  uint16_t b;
<span class="line-number">  4</span>  uint16_t c;
<span class="line-number">  5</span>} Sample;
<span class="line-number">  6</span>
<span class="line-number">  7</span>void
<span class="line-number">  8</span>test( uint32_t* values,
<span class="line-number">  9</span>      Sample*   uniform,
<span class="line-number"> 10</span>      uint64_t  count )
<span class="line-number"> 11</span>{
<span class="line-number"> 12</span>  uint64_t i;
<span class="line-number"> 13</span>
<span class="line-number"> 14</span>  for (i=0;i&lt;count;i++)
<span class="line-number"> 15</span>  {
<span class="line-number"> 16</span>    values[i] += (uint32_t)uniform-&gt;b;
<span class="line-number"> 17</span>  }
<span class="line-number"> 18</span>}
</pre>

Compiled with <b><span style="color:#ff0000">-fno-strict-aliasing</span> -O3 -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>test:
<span class="line-number">  1</span>  li     10, 0      # i      = 0
<span class="line-number">  2</span>  cmpld  7,  10, 5  # done   = (i==count)
<span class="line-number">  3</span>  bgelr- 7          # if (done) return
<span class="line-number">  4</span>  mtctr  5          # ctr    = count
<span class="line-number">  5</span>.L8:
<span class="line-number">  6</span>  sldi   11, 10, 2  # offset = i * 4
<span class="line-number">  7</span><span style="color:#ff0000">  lhz    9,  2(4)   # b      = *(uniform+2)</span>
<span class="line-number">  8</span>  addi   10, 10, 1  # i++
<span class="line-number">  9</span>  lwzx   5,  11, 3  # value  = *(values+offset)
<span class="line-number"> 10</span>  add    0,  5,  9  # value  = value + b
<span class="line-number"> 11</span>  stwx   0,  11, 3  # *(values+offset) = value
<span class="line-number"> 12</span>  bdnz  .L8         # if (ctr--) goto .L8
<span class="line-number"> 13</span>  blr               # return
</pre>

In this case <b>uniform->b</b> <i>must</i> be loaded during each iteration of the loop. This is because the compiler cannot be certain that <b>values</b> does not overlap <b>b</b> in memory. If, in fact, they do overlap, the programmer would expect that <b>uniform->b</b> would be properly updated and the values stored into the <b>values</b> array adjusted accordingly. The only method for the compiler to guarantee these results is reloading <b>uniform->b</b> at every iteration.<br />
<br />

It was noted that this case is extremely uncommon in <i>most</i> code and the decision was made to <i>presume</i> objects of different types are not aliased and to be more aggressive with optimizations. The fact that this presumption would break some existing code was certainly discussed in detail. It must have been decided that those most likely to use memory aliasing techniques for optimization are few, and that those who do use them are the most willing and capable of making the necessary changes.  <br />
<br />

The result, even for this small case, can make a significant performance impact. Compiled with <b><span style="color:#ff0000">-fstrict-aliasing</span> -Wstrict-aliasing=2 -O3 -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code">
<span class="line-number">  0</span>test:
<span class="line-number">  1</span>  li     11,0     # i      = 0
<span class="line-number">  2</span>  cmpld  7,11,5   # done   = (i == count)
<span class="line-number">  3</span>  bgelr- 7        # if (done) return
<span class="line-number">  4</span><span style="color:#ff0000">  lhz    4,2(4)   # b      = uniform.b</span>
<span class="line-number">  5</span>  mtctr  5        # ctr    = count
<span class="line-number">  6</span>.L8:
<span class="line-number">  7</span>  sldi   9,11,2   # offset = i * 4
<span class="line-number">  8</span>  addi   11,11,1  # i++
<span class="line-number">  9</span>  lwzx   5,9,3    # value  = *(values+offset)
<span class="line-number"> 10</span>  add    0,5,4    # value  = value + b
<span class="line-number"> 11</span>  stwx   0,9,3    # *(values+offset) = value
<span class="line-number"> 12</span>  bdnz   .L8      # if (ctr--) goto .L8
<span class="line-number"> 13</span>  blr             # return
</pre>

The load of <b>b</b> is now only done once, outside the loop. For more examples of optimizations for non-aliasing memory see: <a href="http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html">Demystifying The Restrict Keyword</a>
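<br />
<br />
The same hoisting can also be written by hand, which makes the intent explicit and does not depend on the aliasing flags. The sketch below uses a hypothetical name, <b>test_hoisted</b>, and assumes that <b>values</b> never overlaps <b>uniform</b>; if they did overlap, this version would behave differently from the original loop.<br />

```c
#include <stdint.h>

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Sample;

/* Same work as test(), but uniform->b is read once into a local. */
void test_hoisted(uint32_t *values, const Sample *uniform, uint64_t count)
{
  const uint32_t b = uniform->b;  /* loaded once, before the loop */
  uint64_t i;

  for (i = 0; i < count; i++)
  {
    values[i] += b;
  }
}
```
<br />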

<div id="compatible_type" class="subtitle">Casting Compatible Types</div>
Aliases are permitted for types that only differ by qualifier or sign.
<pre class="code">
<span class="line-number">  0</span>uint32_t
<span class="line-number">  1</span>test( uint32_t a )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  uint32_t* const       a0 = &amp;a;
<span class="line-number">  4</span>  uint32_t* volatile    a1 = &amp;a;
<span class="line-number">  5</span>  int32_t*              a2 = (int32_t*)&amp;a;
<span class="line-number">  6</span>  int32_t* const        a3 = (int32_t*)&amp;a;
<span class="line-number">  7</span>  int32_t* volatile     a4 = (int32_t*)&amp;a;
<span class="line-number">  8</span>  const int32_t* const  a5 = (int32_t*)&amp;a;
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  (*a0)++;
<span class="line-number"> 11</span>  (*a1)++;
<span class="line-number"> 12</span>  (*a2)++;
<span class="line-number"> 13</span>  (*a3)++;
<span class="line-number"> 14</span>  (*a4)++;
<span class="line-number"> 15</span>
<span class="line-number"> 16</span>  return (*a5);
<span class="line-number"> 17</span>}
</pre>
In this case <b>a0</b>-<b>a5</b> are all valid aliases of <b>a</b> and this function will return <b>(a + 5)</b>.

<div class="sticky-note">
GCC has two flags to enable warnings related to strict aliasing. <b>-Wstrict-aliasing</b> enables warnings for most common errors related to type-punning. <b>-Wstrict-aliasing=2</b> attempts to warn about a larger class of cases, however false positives may be returned.</div>

<div id="union_1" class="subtitle">Casting through a union (1)</div>

The most commonly accepted method of converting one type of object to another is by using a union type as in this example:
<pre class="code">
<span class="line-number">  0</span>typedef union
<span class="line-number">  1</span>{
<span class="line-number">  2</span>  uint32_t u32;
<span class="line-number">  3</span>  uint16_t u16[2];
<span class="line-number">  4</span>}
<span class="line-number">  5</span>U32;
<span class="line-number">  6</span>
<span class="line-number">  7</span>uint32_t
<span class="line-number">  8</span>swap_words( uint32_t arg )
<span class="line-number">  9</span>{
<span class="line-number"> 10</span>  U32      in;
<span class="line-number"> 11</span>  uint16_t lo;
<span class="line-number"> 12</span>  uint16_t hi;
<span class="line-number"> 13</span>
<span class="line-number"> 14</span>  in.u32    = arg;
<span class="line-number"> 15</span>  hi        = in.u16[0];
<span class="line-number"> 16</span>  lo        = in.u16[1];
<span class="line-number"> 17</span>  in.u16[0] = lo;
<span class="line-number"> 18</span>  in.u16[1] = hi;
<span class="line-number"> 19</span>
<span class="line-number"> 20</span>  return (in.u32);
<span class="line-number"> 21</span>}
</pre>
This method is not properly called <i>casting</i> at all (although it may be called <em>type-punning</em>) as the value is simply copied into a union which permits aliasing among its members. From a performance point of view, this method relies on the ability of the optimizer to remove the redundant stores and loads.  When using recent versions of GCC, if the transformation is reasonably simple, it is very likely that the compiler will be able to remove the redundancies and produce an optimal code sequence.<br />

<div class="sticky-note">
Strictly speaking, reading a member of a union different from the one written to is undefined in ANSI/ISO C99 except in the special case of type-punning to a <b>char*</b>, similar to the example below: <a href="#cast_to_char_pointer">Casting to <b>char*</b></a>. However, it is an extremely common idiom and is well-supported by all major compilers. As a practical matter, reading and writing to any member of a union, in any order, is acceptable practice.</div>

For example, when compiled with <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>, the argument is simply rotated 16 bits.
<pre class="code">
<span class="line-number">  0</span>swap_words:
<span class="line-number">  1</span>  rlwinm r3,r3,16,0xffffffff
<span class="line-number">  2</span>  blr
</pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU, the loads and stores are removed but the instruction sequence is less than optimal.
<pre class="code">
<span class="line-number">  0</span>swap_words:
<span class="line-number">  1</span>  slwi    4,3,16     ; hi    = arg &lt;&lt; 16
<span class="line-number">  2</span>  rldicl  3,3,48,48  ; lo    = arg &gt;&gt; 16
<span class="line-number">  3</span>  or      0,4,3      ; out   = hi | lo;
<span class="line-number">  4</span>  rldicl  3,0,0,32   ; final = out &amp; 0xffffffff
<span class="line-number">  5</span>  blr
</pre>
<br />
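Both listings above compute a 16 bit rotate of the argument. As a point of reference (this is an illustrative sketch, not one of the original examples, and the name <b>swap_words_rotate</b> is made up), the same result can be expressed directly with shifts, which involves no aliasing at all:<br />

```c
#include <stdint.h>

/* Swap the high and low 16 bit halves of arg by rotating it 16 bits. */
uint32_t swap_words_rotate(uint32_t arg)
{
  return (arg << 16) | (arg >> 16);
}
```

Because <b>arg</b> is an unsigned 32 bit value, both shifts are well defined and the bits shifted out of one half land in the other.<br />
<br />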

In order to generate reasonably good code across both the GCC3 and GCC4 families, use C99 style initializers:
<pre class="code">
<span class="line-number">  0</span>uint32_t
<span class="line-number">  1</span>swap_words( uint32_t arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  U32    in  = { .u32=arg };
<span class="line-number">  4</span>  U32    out = { .u16[0]=in.u16[1], 
<span class="line-number">  5</span>                 .u16[1]=in.u16[0] };
<span class="line-number">  6</span>
<span class="line-number">  7</span>  return (out.u32);
<span class="line-number">  8</span>}
</pre>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the 32 bit build of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>swap_words:
<span class="line-number">  1</span>  stwu 1,-16(1)              ; Push stack
<span class="line-number">  2</span>  rlwinm 3,3,16,0xffffffff   ; Rotate 16 bits
<span class="line-number">  3</span>  addi 1,1,16                ; Pop stack
<span class="line-number">  4</span>  blr
</pre>

<div class="sticky-note">
It is a peculiarity of the 32 bit build of GCC 3.4.1 for the Cell PPU that the stack is <i>always</i> pushed and popped regardless of whether or not it is used. 

</div>

<div class="rule-of-thumb">
This method is most valuable for use with primitive types which can be returned <i>by value</i>.
This is because it relies on doing a complete copy of the object (by value) and removing the redundancies. 
With more complex aggregate or union types copying may be done on the stack or through the memcpy function
and redundancies are harder to eliminate.
</div>
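
For reference, the by-value method can be put together as a single self-contained unit. This is only a sketch: it assumes <b>U32</b> is the union of one <b>uint32_t</b> and two <b>uint16_t</b> halves used throughout this article, and the name <b>swap_words_by_value</b> is made up for illustration.

```c
#include <stdint.h>

/* Assumed to match the U32 union used in the earlier examples. */
typedef union
{
  uint32_t u32;
  uint16_t u16[2];
} U32;

/* Swap the two 16 bit halves of arg by copying through the union. */
uint32_t
swap_words_by_value( uint32_t arg )
{
  U32 in  = { .u32 = arg };
  U32 out = { .u16[0] = in.u16[1],
              .u16[1] = in.u16[0] };

  return (out.u32);
}
```

Because both halves are exchanged, the result is a 16 bit rotate on big-endian and little-endian targets alike; for example 0x11223344 becomes 0x33441122.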

<div id="union_2" class="subtitle">Casting through a union (2)</div>

Casting proper may be done between a pointer to a type and a pointer to an aggregate or union type which contains a member of a <a href="#compatible_type">compatible type</a>, as in the following example:
<pre class="code">
<span class="line-number">  0</span>uint32_t
<span class="line-number">  1</span>swap_words( uint32_t arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  U32*     in = (U32*)&amp;arg;
<span class="line-number">  4</span>  uint16_t lo = in-&gt;u16[0];
<span class="line-number">  5</span>  uint16_t hi = in-&gt;u16[1];
<span class="line-number">  6</span>
<span class="line-number">  7</span>  in-&gt;u16[0] = hi;
<span class="line-number">  8</span>  in-&gt;u16[1] = lo;
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  return (in-&gt;u32);
<span class="line-number"> 11</span>}
</pre>

<b>in</b> is a pointer to a <b>U32</b>, which contains the member <b>u32</b> of type <b>uint32_t</b>. That type is compatible with <b>arg</b>, which is also of type <b>uint32_t</b>.


<div class="sticky-note">
The above source when compiled with GCC 4.0 with the <b>-Wstrict-aliasing=2</b> flag enabled will generate a warning. This warning is an example of a <b>false positive</b>. This type of cast is  allowed and will generate the appropriate code (see below). It is documented clearly that <b>-Wstrict-aliasing=2</b> may return false positives.</div>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,

<pre class="code">
<span class="line-number">  0</span>swap_words:
<span class="line-number">  1</span>  stw r3,24(r1)  ; Store arg
<span class="line-number">  2</span>  lhz r0,24(r1)  ; Load hi
<span class="line-number">  3</span>  lhz r2,26(r1)  ; Load lo
<span class="line-number">  4</span>  sth r0,26(r1)  ; Store result[1] = hi
<span class="line-number">  5</span>  sth r2,24(r1)  ; Store result[0] = lo
<span class="line-number">  6</span>  lwz r3,24(r1)  ; Load result
<span class="line-number">  7</span>  blr            ; Return
</pre>

GCC is extremely poor at combining loads and stores done through a pointer to a union type as can be seen from the generated code above. The output is a very naive interpretation of the source and would perform badly compared to the previous examples on most architectures.<br /> 
<br />

However, once this fact is accounted for, this method can be very useful. Rather than copying the argument <i>by value</i>, which is problematic on large or complex structures, a pointer can be passed in and the value modified directly. If the loads and stores can be combined in the source the results will usually be excellent.

<div class="sticky-note">
<i>"But when the address of a variable is taken, 
doesn't the compiler force it to be stored in memory rather than in a register?"</i>
<br /><br />
Yes, both a store and a load may then be generated as part of the trace. However, when alias analysis is done it
can be determined that the object cannot be changed by another mechanism, so the load and store may be marked as
redundant and removed.
</div>

<div class="rule-of-thumb">
Do not rely on the compiler to combine loads and stores. The programmer is <i>always</i> better equipped to make those decisions based on alignment concerns and complex instruction penalty rules.
</div>

<pre class="code">
<span class="line-number">  0</span>void
<span class="line-number">  1</span>swap_words( uint16_t* arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  U32*     combined = (U32*)arg;
<span class="line-number">  4</span>  uint32_t start    = combined-&gt;u32;
<span class="line-number">  5</span>  uint32_t lo       = start &gt;&gt; 16;
<span class="line-number">  6</span>  uint32_t hi       = start &lt;&lt; 16;
<span class="line-number">  7</span>  uint32_t final    = lo | hi;
<span class="line-number">  8</span>
<span class="line-number">  9</span>  combined-&gt;u32 = final;
<span class="line-number"> 10</span>}
</pre>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,
<pre class="code">
<span class="line-number">  0</span>swap_words:
<span class="line-number">  1</span>  lwz r0,0(r3)                ; Load arg
<span class="line-number">  2</span>  rlwinm r0,r0,16,0xffffffff  ; Rotate 16 bits
<span class="line-number">  3</span>  stw r0,0(r3)                ; Store arg
<span class="line-number">  4</span>  blr                         ; Return
</pre>

<div class="rule-of-thumb">
If the above source is called as a <i>non-inline</i> function, there will be a significant penalty on most architectures waiting for the load before the rotate and the store on return.<br />

If the above source is called as an <i>inline</i> function, it can be safely assumed the load and store will be removed by the compiler as redundant.
</div>

<div class="sticky-note">
In C99, a <b>static inline</b> function, which may be included in a header file, differs from automatic inlining in that the function may be defined multiple times (e.g. included by multiple source files). Each definition of a <b>static inline</b> function must be identical.
</div>

<pre class="code">

<span class="line-number">  0</span>static inline void
<span class="line-number">  1</span>swap_words( uint16_t* arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  U32*     combined = (U32*)arg;
<span class="line-number">  4</span>  uint32_t start    = combined-&gt;u32;
<span class="line-number">  5</span>  uint32_t lo       = start &gt;&gt; 16;
<span class="line-number">  6</span>  uint32_t hi       = start &lt;&lt; 16;
<span class="line-number">  7</span>  uint32_t final    = lo | hi;
<span class="line-number">  8</span>
<span class="line-number">  9</span>  combined-&gt;u32 = final;
<span class="line-number"> 10</span>}
</pre>

<div class="rule-of-thumb">
With some care, this method is the most appropriate for modifying large or complex structures by multiple types.
</div>
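
As an illustration of that rule of thumb, the sketch below modifies a larger object in place through a pointer to a union. Everything here is hypothetical (the <b>Quad</b> union and the <b>swap_words_quad</b> name do not appear elsewhere in this article), and the loads and stores are combined by hand into one 32 bit access per element rather than left to the compiler.

```c
#include <stdint.h>

/* Hypothetical 128 bit block viewed either as words or as halfwords. */
typedef union
{
  uint32_t u32[4];
  uint16_t u16[8];
} Quad;

/* Swap the 16 bit halves of every 32 bit word, in place. */
static inline void
swap_words_quad( Quad* q )
{
  for (int i = 0; i < 4; i++)
  {
    const uint32_t start = q->u32[i];   /* one 32 bit load  */
    const uint32_t lo    = start >> 16;
    const uint32_t hi    = start << 16;

    q->u32[i] = lo | hi;                /* one 32 bit store */
  }
}
```

A caller that has filled the block through <b>u16</b> can call <b>swap_words_quad</b> and then read either member; halfwords 2*i and 2*i+1 trade places for each i.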

<div id="union_3" class="subtitle">Casting through a union (3)</div>

Occasionally a programmer may encounter the following <span style="color:#FF0000">INVALID</span> method for creating an alias with a pointer of a different type:
<pre class="code">
<span class="line-number">  0</span>typedef union 
<span class="line-number">  1</span>{
<span class="line-number">  2</span>  uint16_t* sp; 
<span class="line-number">  3</span>  uint32_t* wp;
<span class="line-number">  4</span>} U32P;
<span class="line-number">  5</span>
<span class="line-number">  6</span>uint32_t 
<span class="line-number">  7</span>swap_words( uint32_t arg )
<span class="line-number">  8</span>{
<span class="line-number">  9</span>  U32P             in = { .wp = &amp;arg };
<span class="line-number"> 10</span>  const uint16_t   hi = in.sp[0];
<span class="line-number"> 11</span>  const uint16_t   lo = in.sp[1];
<span class="line-number"> 12</span>  
<span class="line-number"> 13</span>  in.sp[0] = lo;
<span class="line-number"> 14</span>  in.sp[1] = hi;
<span class="line-number"> 15</span>
<span class="line-number"> 16</span>  return ( arg ); <span style="color:#FF0000">&lt;-- RESULT IS UNDEFINED</span>
<span class="line-number"> 17</span>} 
</pre>
The problem with this method is although <b>U32P</b> does in fact say that <b>sp</b> is an alias for <b>wp</b>, it does not say anything about the relationship between the values pointed to by <b>sp</b> and <b>wp</b>. This differs in a critical way from <a href="#union_1">"Casting Through a Union (1)"</a> and <a href="#union_2">"Casting Through a Union (2)"</a> which both define aliases for the <i>values being pointed to</i>, not the pointers themselves.<br />
<br />
The presumption of strict aliasing remains true: Two pointers of different types are assumed, except in a few very limited conditions <a href="#c99_standard">specified in the C99 standard</a>, not to alias. This is <b>not</b> one of those exceptions.

<div class="sticky-note">
The above source when compiled with GCC 3.4.1 or GCC 4.0 with the <b>-Wstrict-aliasing=2</b> flag enabled will <b>NOT</b> generate a warning. This should serve as an example to <i>always</i> check the generated code. Warnings are often helpful hints, but they are by no means exhaustive and do not always detect when a programmer makes an error. Like any piece of software, a compiler has limits. Knowing them can <i>only</i> be helpful.</div>

For example, when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,
<pre class="code">
<span class="line-number">  0</span>swap_words:      ; <span style="color:#FF0000">RETURNS ARG UNCHANGED</span>
<span class="line-number">  1</span>  lhz r0,24(r1)  ; Load lo from stack (<i>What value?!</i>)
<span class="line-number">  2</span>  lhz r2,26(r1)  ; Load hi from stack (<i>What value?!</i>)
<span class="line-number">  3</span>  stw r3,24(r1)  ; Store arg to stack
<span class="line-number">  4</span>  sth r0,26(r1)  ; Store hi to stack
<span class="line-number">  5</span>  sth r2,24(r1)  ; Store lo to stack
<span class="line-number">  6</span>  blr            ; Return
</pre>

In this case notice that because <b>hi</b>, <b>lo</b> and <b>arg</b> are assumed not to alias, 
the resulting order of instructions is meaningless:
<ul>
<li><span class="monspace-strong">[Line 1]: </span><b>lo</b> is loaded from the stack before anything is stored to the stack</li>
<li><span class="monspace-strong">[Line 2]: </span><b>hi</b> is loaded from the stack before anything is stored to the stack</li>
<li><span class="monspace-strong">[Line 3]: </span><b>arg</b> is stored to the stack, but this value will not be read.</li>
<li><span class="monspace-strong">[Line 4]: </span><b>hi</b> is stored to the stack, but this value will not be read.</li>
<li><span class="monspace-strong">[Line 5]: </span><b>lo</b> is stored to the stack, but this value will not be read.</li>
</ul>

Or when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>swap_words:     # <span style="color:#FF0000">RETURNS ARG UNCHANGED</span>
<span class="line-number">  1</span>  stw 3,48(1)   # Store arg to stack
<span class="line-number">  2</span>  lhz 9,48(1)   # Load hi
<span class="line-number">  3</span>  lhz 0,50(1)   # Load lo
<span class="line-number">  4</span>  lwz 3,48(1)   # Load arg
<span class="line-number">  5</span>  sth 0,48(1)   # Store hi to stack
<span class="line-number">  6</span>  sth 9,50(1)   # Store lo to stack
<span class="line-number">  7</span>  blr           # Return
</pre>

Or when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>swap_words:     # <span style="color:#FF0000">RETURNS ARG UNCHANGED</span>
<span class="line-number">  1</span>  stwu 1,-16(1) # Push stack
<span class="line-number">  2</span>  addi 1,1,16   # Pop stack
<span class="line-number">  3</span>  blr           # Return 
</pre>


<div id="cast_to_char_pointer" class="subtitle">Casting to <i>char*</i></div>
  
It is always presumed that a <b>char*</b> may refer to an alias of any object. It is therefore quite safe, if perhaps a bit <i>unoptimal</i> (for architectures with wide loads and stores), to cast any pointer of any type to a <b>char*</b> type.
<pre class="code">
<span class="line-number">  0</span>uint32_t 
<span class="line-number">  1</span>swap_words( uint32_t arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  char* const cp = (char*)&amp;arg;
<span class="line-number">  4</span>  const char  c0 = cp[0];
<span class="line-number">  5</span>  const char  c1 = cp[1];
<span class="line-number">  6</span>  const char  c2 = cp[2];
<span class="line-number">  7</span>  const char  c3 = cp[3];
<span class="line-number">  8</span>
<span class="line-number">  9</span>  cp[0] = c2;
<span class="line-number"> 10</span>  cp[1] = c3;
<span class="line-number"> 11</span>  cp[2] = c0;
<span class="line-number"> 12</span>  cp[3] = c1;
<span class="line-number"> 13</span>
<span class="line-number"> 14</span>  return (arg);
<span class="line-number"> 15</span>} 
</pre>

The converse is not true. Casting a <b>char*</b> to a pointer of any type other than a <b>char*</b> and dereferencing it is <em>usually</em> in violation of the strict aliasing rule.
<div class="rule-of-thumb">In other words, casting from a pointer of one type to pointer of an unrelated type through a <b>char*</b> is <b>undefined</b>. </div>

<pre class="code">
<span class="line-number">  0</span>uint32_t
<span class="line-number">  1</span>test( uint32_t arg )
<span class="line-number">  2</span>{
<span class="line-number">  3</span>  char*     const cp = (char*)&amp;arg;
<span class="line-number">  4</span>  uint16_t* const sp = (uint16_t*)cp;
<span class="line-number">  5</span>
<span class="line-number">  6</span>  sp[0] = 0x0001;
<span class="line-number">  7</span>  sp[1] = 0x0002;
<span class="line-number">  8</span>
<span class="line-number">  9</span>  return (arg);
<span class="line-number"> 10</span>}
</pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code">
<span class="line-number">  0</span>test:
<span class="line-number">  1</span>  stw 3, 48(1)   # arg stored to stack
<span class="line-number">  2</span>  li  0, 1       # hi = 0x0001
<span class="line-number">  3</span>  li  9, 2       # lo = 0x0002
<span class="line-number">  4</span>  lwz 3, 48(1)   # result = loaded from stack
<span class="line-number">  5</span>  sth 0, 48(1)   # store hi to stack
<span class="line-number">  6</span>  sth 9, 50(1)   # store lo to stack
<span class="line-number">  7</span>  blr            # return (result) <span style="color:#FF0000">&lt;-- RETURNS ARG UNCHANGED</span>
</pre>

As <a href="http://cellperformance.com/phpBB2/viewtopic.php?t=48&start=0&postdays=0&postorder=asc&highlight=#115">clarified by <b>Pinskia</b></a>, it is not dereferencing a <b>char*</b> per se that is specifically recognized
as a potential alias of any object, but any address referring to a <b>char</b> object. This includes an array of <b>char</b>
objects, as in the following example which will also break the strict aliasing assumption.

<pre class="code">
<span class="line-number">  0</span>  char      const cp[4] = { arg0, arg1, arg2, arg3 };
<span class="line-number">  1</span>  uint16_t* const sp    = (uint16_t*)cp;
<span class="line-number">  2</span>
<span class="line-number">  3</span>  sp[0] = 0x0001;
<span class="line-number">  4</span>  sp[1] = 0x0002;
</pre>
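
When the underlying storage really is an array of <b>char</b>, one way to stay inside the rules is never to form the <b>uint16_t*</b> at all, and instead to move the bytes individually through <b>unsigned char</b>. The following is a minimal sketch; the helper names <b>store_u16_be</b> and <b>load_u16_be</b> are hypothetical, and big-endian byte order is assumed only to match the PowerPC targets used above.

```c
#include <stdint.h>

/* Write a 16 bit value into two bytes, most significant byte first. */
static inline void
store_u16_be( unsigned char* dst, uint16_t value )
{
  dst[0] = (unsigned char)(value >> 8);
  dst[1] = (unsigned char)(value & 0xff);
}

/* Read a 16 bit value back from two bytes, most significant byte first. */
static inline uint16_t
load_u16_be( const unsigned char* src )
{
  return (uint16_t)(((unsigned)src[0] << 8) | (unsigned)src[1]);
}
```

Assuming <b>cp</b> is declared as a writable <b>unsigned char</b> array, the two stores in the example become <span class="monospace-strong">store_u16_be( cp + 0, 0x0001 )</span> and <span class="monospace-strong">store_u16_be( cp + 2, 0x0002 )</span>, at the cost of narrower accesses that the compiler may or may not recombine.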


<div id="gcc_rule_breaking" class="subtitle">GCC RULE BREAKING</div>

GCC allows type-punned values to be dereferenced at independent locations in memory (i.e. different objects) when the source of the lvalue is not directly known.<br />

<pre class="code">
<span class="line-number">  0</span>void
<span class="line-number">  1</span>set_value( uint64_t* c, 
<span class="line-number">  2</span>           uint32_t  a_val, 
<span class="line-number">  3</span>           uint16_t  b_val ) 
<span class="line-number">  4</span>{
<span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;
<span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;
<span class="line-number">  7</span>  
<span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0
<span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4
<span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6
<span class="line-number"> 11</span>}
</pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code">
<span class="line-number">  0</span>set_value:
<span class="line-number">  1</span>  stw 4,0(3)   # (c+0) = a_val
<span class="line-number">  2</span>  sth 5,6(3)   # (c+6) = b_val
<span class="line-number">  3</span>  sth 5,4(3)   # (c+4) = b_val
<span class="line-number">  4</span>  blr          # return
</pre>

Note any use of <b>c[0]</b> here would be (more?) undefined because it would alias the uses of <b>a</b> and <b>b</b>.

<pre class="code">
<span class="line-number">  0</span>void
<span class="line-number">  1</span>set_value( uint64_t* c, 
<span class="line-number">  2</span>           uint32_t  a_val, 
<span class="line-number">  3</span>           uint16_t  b_val ) 
<span class="line-number">  4</span>{
<span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;
<span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;
<span class="line-number">  7</span>  
<span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0
<span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4
<span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6
<span class="line-number"> 11</span>  
<span class="line-number"> 12</span>  <span style="color:#FF0000">// WHAT VALUE THIS WOULD PRINT IS UNDEFINED</span>
<span class="line-number"> 13</span>  printf("c = 0x%016llx\n", (unsigned long long)c[0] ); 
<span class="line-number"> 14</span>}
</pre>

However, when <b>set_value</b> is compiled inline (perhaps automatically), the source of <b>c</b> may be known and GCC will assume the values do <b>not</b> alias and may reduce the expression differently and generate completely different code.

<pre class="code">
<span class="line-number">  0</span>static inline void
<span class="line-number">  1</span>set_value( uint64_t* c, 
<span class="line-number">  2</span>           uint32_t  a_val, 
<span class="line-number">  3</span>           uint16_t  b_val ) 
<span class="line-number">  4</span>{
<span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;
<span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;
<span class="line-number">  7</span>  
<span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0
<span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4
<span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6
<span class="line-number"> 11</span>}
</pre>

<pre class="code">
<span class="line-number">  0</span>int64_t
<span class="line-number">  1</span>test( int64_t  a
<span class="line-number">  2</span>     ,int64_t  b
<span class="line-number">  3</span>     ,uint32_t hi32
<span class="line-number">  4</span>     ,uint16_t lo16 )
<span class="line-number">  5</span>{
<span class="line-number">  6</span>  int64_t c = a + b;
<span class="line-number">  7</span>
<span class="line-number">  8</span>  set_value( &amp;c, hi32, lo16 );
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  return (c);
<span class="line-number"> 11</span>}
</pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre id="test_set_value_original" class="code">
<span class="line-number">  0</span>test:
<span class="line-number">  1</span>  add 3,3,4    # c = (a+b)
<span class="line-number">  2</span>  blr          # return (c)
</pre>

In this case because the object <b>c</b> is never accessed through any <i>valid</i> aliases in <b>set_value</b>, the expression is reduced out.

<div class="sticky-note"> The above example will <strong>NOT</strong> currently generate any warnings with <b>-Wstrict-aliasing=2</b> and will simply generate <i>different</i> results depending on whether or not the expression is inlined. This is another good reason to always double check the generated code. Also, when writing unit tests, it is a good idea to test a function both as an inline function and an extern function.</div>

<div class="sticky-note"> With GCC, strict aliasing warnings are <em>more likely</em> to be generated at the point where an address is taken (e.g. <span class="monospace-strong">uint16_t* a = (uint16_t*)&amp;b;</span>) than with pre-existing pointers (e.g. <span class="monospace-strong">uint16_t* a = (uint16_t*)b_ptr;</span>). Take special care when type-punning pre-existing pointers. </div>
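
For comparison, here is a sketch of how the same three stores might be expressed without type-punning a pre-existing pointer, by making the aliasing explicit in a union as in <a href="#union_2">"Casting through a union (2)"</a>. The <b>U64</b> union and the name <b>set_value_union</b> are assumptions made for this example, not part of the code above.

```c
#include <stdint.h>

/* Hypothetical union naming every view of the 64 bit object. */
typedef union
{
  uint64_t u64;
  uint32_t u32[2];
  uint16_t u16[4];
} U64;

static inline void
set_value_union( U64*     c,
                 uint32_t a_val,
                 uint16_t b_val )
{
  c->u32[0] = a_val; /* Address of c + 0 */
  c->u16[2] = b_val; /* Address of c + 4 */
  c->u16[3] = b_val; /* Address of c + 6 */
}
```

Because every store goes through a member of the union itself, a later read of <b>u64</b> should observe the stores whether or not the call is inlined. It is still worth checking the generated code on each compiler build.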

Perhaps surprisingly, illegal aliasing within a loop generates completely different results. It is probably not completely accidental though, as most of the historical arguments <i>against</i> strict aliasing have revolved around optimized versions of functions like <b>memset</b> and <b>memcpy</b> which would cast the data to the widest available register size to minimize the trips to and from memory.

<pre class="code">
<span class="line-number">  0</span>void
<span class="line-number">  1</span>set_value( uint64_t* c,
<span class="line-number">  2</span>           uint32_t  a_val,
<span class="line-number">  3</span>           uint16_t  b_val,
<span class="line-number">  4</span>           uint32_t  count )
<span class="line-number">  5</span>{
<span class="line-number">  6</span>  uint32_t* a  = (uint32_t*)c;
<span class="line-number">  7</span>  uint16_t* b  = (uint16_t*)c;
<span class="line-number">  8</span>  uint32_t  i  = 0;
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  for (i=0;i&lt;count;i++,a++,b+=2)
<span class="line-number"> 11</span>  {
<span class="line-number"> 12</span>    a[0]  = a_val;
<span class="line-number"> 13</span>    b[2]  = b_val;
<span class="line-number"> 14</span>    b[3]  = b_val;
<span class="line-number"> 15</span>  }
<span class="line-number"> 16</span>}
</pre>

As the previous example suggests, this should still generate the "expected" result:<br />
<br />

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>set_value:
<span class="line-number">  1</span>  cmpwi 0, 6, 0   # done = (count == 0)
<span class="line-number">  2</span>  stwu  1, -16(1) # Push stack
<span class="line-number">  3</span>  mr    9, 3      # Copy c
<span class="line-number">  4</span>  beq-  0, .L7    # if (done) goto .L7
<span class="line-number">  5</span>  mtctr 6         # i = count
<span class="line-number">  6</span>.L8:
<span class="line-number">  7</span>  stw   4, 0(9)   # a[0] = a_val
<span class="line-number">  8</span>  addi  9, 9, 4   # a++
<span class="line-number">  9</span>  sth   5, 4(3)   # b[2] = b_val
<span class="line-number"> 10</span>  sth   5, 6(3)   # b[3] = b_val
<span class="line-number"> 11</span>  addi  3, 3, 4   # b+=2
<span class="line-number"> 12</span>  bdnz  .L8       # if (i) goto .L8
<span class="line-number"> 13</span>.L7:
<span class="line-number"> 14</span>  addi  1, 1, 16  # Pop stack
<span class="line-number"> 15</span>  blr             # return
</pre>

When called inline, the previous example would suggest that the compiler, assuming <b>c</b> is not aliased, would also return <span class="monospace-strong">(a + b)</span>:<br />
<br />

<pre class="code">
<span class="line-number">  0</span>int64_t
<span class="line-number">  1</span>test_loop( int64_t  a,
<span class="line-number">  2</span>           int64_t  b,
<span class="line-number">  3</span>           uint32_t hi32,
<span class="line-number">  4</span>           uint16_t lo16,
<span class="line-number">  5</span>           uint32_t count )
<span class="line-number">  6</span>{
<span class="line-number">  7</span>  static int64_t c[ C_COUNT ];
<span class="line-number">  8</span>
<span class="line-number">  9</span>  c[0] = a + b;
<span class="line-number"> 10</span>
<span class="line-number"> 11</span>  set_value( c, hi32, lo16, count );
<span class="line-number"> 12</span>
<span class="line-number"> 13</span>  return (c[0]);
<span class="line-number"> 14</span>}
</pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code">
<span class="line-number">  0</span>test_loop:
<span class="line-number">  1</span>  lis   12, c.0@ha      # cloc     = location of c
<span class="line-number">  2</span>  mr.   0,  9           # i        = count
<span class="line-number">  3</span>  la    11, c.0@l(12)   # c        = *cloc
<span class="line-number">  4</span>  addc  10, 4, 6        # c1       = addlo (a,b)
<span class="line-number">  5</span>  adde  9,  3, 5        # c2       = addhi (a,b)
<span class="line-number">  6</span>  stwu  1, -16(1)       # Push stack
<span class="line-number">  7</span>  stw   9,  0(11)       # c[0].hi  = c2
<span class="line-number">  8</span>  mr    6,  11          # a        = c
<span class="line-number">  9</span>  stw   10, 4(11)       # c[0].lo  = c1
<span class="line-number"> 10</span>  mr    9,  11          # b        = c
<span class="line-number"> 11</span>  beq-  0,  .L19        # if (i==0) goto .L19
<span class="line-number"> 12</span>  mtctr 0               # i        = count
<span class="line-number"> 13</span>.L20:
<span class="line-number"> 14</span>  stw   7,  0(9)        # a[0]     = hi32
<span class="line-number"> 15</span>  addi  9,  9, 4        # a++
<span class="line-number"> 16</span>  sth   8,  4(6)        # b[2]     = lo16
<span class="line-number"> 17</span>  sth   8,  6(6)        # b[3]     = lo16
<span class="line-number"> 18</span>  addi  6,  6, 4        # b+=2
<span class="line-number"> 19</span>  bdnz  .L20            # if (i) goto .L20
<span class="line-number"> 20</span>.L19:
<span class="line-number"> 21</span>  la    9,  c.0@l(12)   # c        = *cloc
<span class="line-number"> 22</span>  addi  1,  1, 16       # Pop stack
<span class="line-number"> 23</span>  lwz   3,  0(9)        # result.hi = c[0].hi
<span class="line-number"> 24</span>  lwz   4,  4(9)        # result.lo = c[0].lo
<span class="line-number"> 25</span>  blr                   # return (result)
</pre>

The result is clearly different from the <a href="#test_set_value_original">original version</a> without the loop.<br />
<br />

It is not the existence of the loop in the source that changes the transformation, but rather the existence of a loop <i>after</i> the initial optimization passes. For example, GCC is fairly good at optimizing (unrolling) loops with a fixed iteration count. Examine the following example:

<pre class="code">
<span class="line-number">  0</span>int64_t
<span class="line-number">  1</span>test_noloop( int64_t  a,
<span class="line-number">  2</span>             int64_t  b,
<span class="line-number">  3</span>             uint32_t hi32,
<span class="line-number">  4</span>             uint16_t lo16 )
<span class="line-number">  5</span>{
<span class="line-number">  6</span>  int64_t c = a + b;
<span class="line-number">  7</span>
<span class="line-number">  8</span>  set_value( &amp;c, hi32, lo16, 1 );
<span class="line-number">  9</span>
<span class="line-number"> 10</span>  return (c);
<span class="line-number"> 11</span>}
</pre>

It wouldn't be completely outrageous to expect the above example to generate similar, albeit unrolled, code. That is unless you know to expect simple loop transformations to be done fairly early in the compilation process and alias analysis to be done later.  When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code">
<span class="line-number">  0</span>test_noloop:      # &lt;--- RETURNS (A+B)
<span class="line-number">  1</span>  stwu 1,-16(1)   # Push stack
<span class="line-number">  2</span>  addc 4,4,6      # c.lo = addlo(a,b)
<span class="line-number">  3</span>  adde 3,3,5      # c.hi = addhi(a,b)
<span class="line-number">  4</span>  addi 1,1,16     # Pop stack
<span class="line-number">  5</span>  blr             # return (c)
</pre>

<div class="sticky-note">
The existence of a loop around accessed aliases and whether or not the iteration count is known at compile time may impact the generated code. Tests should include both constant and <b>extern</b>'d iteration counts.
</div>
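
A sketch of what such a pair of tests could look like follows. It is deliberately simplified: <b>fill_words</b>, <b>run_constant_count</b>, <b>run_runtime_count</b> and <b>g_runtime_count</b> are hypothetical names, and a <b>volatile</b> global stands in for an <b>extern</b>'d count defined in another source file so that the example fits in one unit.

```c
#include <stdint.h>

/* Hypothetical union, as in the earlier union examples. */
typedef union
{
  uint32_t u32;
  uint16_t u16[2];
} U32;

/* Stands in for a count the compiler cannot see at compile time.
   Must not exceed the buffer size (4) used below. */
volatile uint32_t g_runtime_count = 4;

static inline void
fill_words( U32* dst, uint16_t value, uint32_t count )
{
  for (uint32_t i = 0; i < count; i++)
  {
    dst[i].u16[0] = value;
    dst[i].u16[1] = value;
  }
}

/* Iteration count known at compile time: the loop may be unrolled away. */
uint32_t
run_constant_count( void )
{
  U32 buf[4] = { { 0 } };

  fill_words( buf, 0x1234, 4 );
  return (buf[3].u32);
}

/* Iteration count only known at run time: the compiler has to keep a loop. */
uint32_t
run_runtime_count( void )
{
  U32 buf[4] = { { 0 } };

  fill_words( buf, 0x1234, g_runtime_count );
  return (buf[3].u32);
}
```

Both functions should return the same value. If they ever disagree on a given compiler build, that is a strong hint that an aliasing assumption is being applied differently to the two forms of the loop.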

What is surprising is that the 64 bit build of the same version of the same compiler generates  different results. When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color:#FF00FF">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code">
<span class="line-number">  0</span>test_loop:
<span class="line-number">  1</span>  li     10, 0           # i = 0
<span class="line-number">  2</span>  cmplw  7,  10, 7       # done = (i==count)
<span class="line-number">  3</span>  add    4,  3, 4        # sum  = a + b
<span class="line-number">  4</span>  ld     3,  .LC0@toc(2) # cloc = location of c
<span class="line-number">  5</span>  std    4,  0(3)        # c[0] = sum
<span class="line-number">  6</span>  mr     9,  3           # a    = c
<span class="line-number">  7</span>  mr     11, 3           # b    = c
<span class="line-number">  8</span>  bge-   7,  .L18        # if (done) goto .L18
<span class="line-number">  9</span>.L22:
<span class="line-number"> 10</span>  addi   0,  10, 1       # i++
<span class="line-number"> 11</span>  stw    5,  0(11)       # a[0] = hi32
<span class="line-number"> 12</span>  rldicl 10, 0, 0, 32    # i    = i & 0xffffffff
<span class="line-number"> 13</span>  sth    6,  4(9)        # b[2] = lo16
<span class="line-number"> 14</span>  sth    6,  6(9)        # b[3] = lo16
<span class="line-number"> 15</span>  cmplw  7,  10, 7       # done = (i==count)
<span class="line-number"> 16</span>  addi   11, 11, 4       # a++
<span class="line-number"> 17</span>  addi   9,  9, 4        # b+= 2
<span class="line-number"> 18</span>  blt+   7,  .L22        # if (!done) goto .L22
<span class="line-number"> 19</span>.L18:
<span class="line-number"> 20</span>  ld     3,0(3)          # result = c[0]
<span class="line-number"> 21</span>  blr                    # return (result)
</pre>

This indicates that there are significant <b>non-obvious</b> side-effects to building GCC as 32 bits versus 64 bits that <em>someone might want to look into</em>.
<div class="sticky-note">
The platform, version number and build date (i.e. the output of <span class="monospace-strong">gcc --version</span>) are not sufficient information for compatibility testing. To be thorough, unit tests should be run across all builds of the same compiler version, if more than one is known to exist.</div>

<div id="qa_bitfields" class="subtitle">Question about bitfields</div>
<div class="sticky-note">
On 08 Jan 2008, Royous Zacharias asked me to clarify a question about bitfields and strict aliasing. With his permission, I'm posting his question and my response here in the hope that it will either be helpful or spur someone else to provide a more definitive answer.

<br />
<br />
<pre class="monospace-strong">
Mike,

I was checking out your web page and I seem to be running into a problem related to the issue you 
pointed out.  Does, strict aliasing apply to Bitfield structures having 32-bits.  I am running on a 32-bit 
power pc motorola board the following code:

BitFieldId id;

id.field0 = 0
id.field1 = 1
id.field2 = 0
id.field3 = 1

uint *ptr = (uint*)&id;

function(*ptr)

where function(...) is defined as void function(uint n) { ... }

When I have strict-aliasing turned on, the code above does not execute correctly (e.g. ptr is zeroed-out)?  
However, everything works fine when I remove this option from -O2 optimization on gcc 4.0.0?  Can the 
above give unreliable results when executed with strict-aliasing turned on?  I really appreciate your views 
about this.

Thanks,
Royous Zacharias
</pre>
</div>

<br />
<br />

The short answer is yes, it does apply. (BitFieldId*) is not related to (uint*) here (and thus cannot be aliased). To make things more complicated, how exactly a bit field should be related to an int here is somewhat open to interpretation in the standard (or at least as I read it). The main gotcha in the standard is that you can't take the address of a bit field member (it's not defined).
<br />
<br />

But you still have a couple of options --
<br />
<br />

Bottom line: You need to tell the compiler that BitFieldId and uint are related.
<br />
<br />

CASE 1: If you don't mind using compiler extensions (most compilers support this though), you can have BitFieldId be a union of an anonymous struct of the named bits and a uint. Because then BitFieldId would contain a uint as a member, a (BitFieldId*) and a (uint*) would then be related.
<br />
<br />

CASE 2: You can create a composite type which includes both a BitFieldId and a uint, let's call that a BitFieldUint. Now:<br />
(BitFieldId*) is related to (BitFieldUint*)<br />
(uint*) is related to (BitFieldUint*)<br />
So this:<br />
<span class="monospace-strong">uint* ptr = (uint*)(BitFieldUint*)&amp;id; </span><br />
would be valid.<br />
<br />
<br />
Interestingly, in CASE 2, it doesn't matter if BitFieldUint is a struct or a union (or how big it is or anything else, really), all that's important is that it contains both types so that they become related through this new type.
<br />
<br />
I've attached a small bit of code that will hopefully clear that up.

<pre class="code">
#include &lt;stdint.h&gt;

typedef struct BitFieldId       BitFieldId;
typedef union  BitFieldId_2     BitFieldId_2;
typedef struct BitFieldIdStruct BitFieldIdStruct;
typedef union  BitFieldIdUnion  BitFieldIdUnion;

/* Plain bit field structure: not related to uint32_t. */
struct BitFieldId
{
  uint32_t field0 : 1;
  uint32_t field1 : 1;
  uint32_t field2 : 1;
  uint32_t field3 : 1;
  uint32_t field4 : 1;
  uint32_t field5 : 1;
  uint32_t field6 : 1;
  uint32_t field7 : 1;
  uint32_t field8 : 1;
  uint32_t field9 : 1;
  uint32_t field10 : 1;
  uint32_t field11 : 1;
  uint32_t field12 : 1;
  uint32_t field13 : 1;
  uint32_t field14 : 1;
  uint32_t field15 : 1;
  uint32_t field16 : 1;
  uint32_t field17 : 1;
  uint32_t field18 : 1;
  uint32_t field19 : 1;
  uint32_t field20 : 1;
  uint32_t field21 : 1;
  uint32_t field22 : 1;
  uint32_t field23 : 1;
  uint32_t field24 : 1;
  uint32_t field25 : 1;
  uint32_t field26 : 1;
  uint32_t field27 : 1;
  uint32_t field28 : 1;
  uint32_t field29 : 1;
  uint32_t field30 : 1;
  uint32_t field31 : 1;
};

/* CASE 1: union of a uint32_t and an anonymous struct (compiler extension). */
union BitFieldId_2
{
  uint32_t u32;
  struct 
  {
    uint32_t field0 : 1;
    uint32_t field1 : 1;
    uint32_t field2 : 1;
    uint32_t field3 : 1;
    uint32_t field4 : 1;
    uint32_t field5 : 1;
    uint32_t field6 : 1;
    uint32_t field7 : 1;
    uint32_t field8 : 1;
    uint32_t field9 : 1;
    uint32_t field10 : 1;
    uint32_t field11 : 1;
    uint32_t field12 : 1;
    uint32_t field13 : 1;
    uint32_t field14 : 1;
    uint32_t field15 : 1;
    uint32_t field16 : 1;
    uint32_t field17 : 1;
    uint32_t field18 : 1;
    uint32_t field19 : 1;
    uint32_t field20 : 1;
    uint32_t field21 : 1;
    uint32_t field22 : 1;
    uint32_t field23 : 1;
    uint32_t field24 : 1;
    uint32_t field25 : 1;
    uint32_t field26 : 1;
    uint32_t field27 : 1;
    uint32_t field28 : 1;
    uint32_t field29 : 1;
    uint32_t field30 : 1;
    uint32_t field31 : 1;
  };
};

/* CASE 2: composite types that contain both a uint32_t and a BitFieldId. */
union BitFieldIdUnion
{
  uint32_t   u32;
  BitFieldId bit_field;
};

struct BitFieldIdStruct
{
  uint32_t   u32;
  BitFieldId bit_field;
};

void
CopyBitFieldId_BAD( BitFieldId* id0, BitFieldId* id1 )
{
  uint32_t* id0_u32 = (uint32_t*)id0;  
  uint32_t* id1_u32 = (uint32_t*)id1;  

  *id0_u32 = *id1_u32;
}

void
CopyBitFieldId_BITFIELD_RELATED_TO_INT( BitFieldId_2* id0, BitFieldId_2* id1 )
{
  uint32_t* id0_u32 = (uint32_t*)&amp;id0->u32;
  uint32_t* id1_u32 = (uint32_t*)&amp;id1->u32;

  *id0_u32 = *id1_u32;
}

void
CopyBitFieldId_VIA_CAST_THROUGH_RELATED_UNION( BitFieldId* id0, BitFieldId* id1 )
{
  uint32_t* id0_u32 = (uint32_t*)(BitFieldIdUnion*)id0;
  uint32_t* id1_u32 = (uint32_t*)(BitFieldIdUnion*)id1;

  *id0_u32 = *id1_u32;
}

void
CopyBitFieldId_VIA_CAST_THROUGH_RELATED_STRUCT( BitFieldId* id0, BitFieldId* id1 )
{
  uint32_t* id0_u32 = (uint32_t*)(BitFieldIdStruct*)id0;
  uint32_t* id1_u32 = (uint32_t*)(BitFieldIdStruct*)id1;

  *id0_u32 = *id1_u32;
}

</pre>


<div id="c99_standard" class="subtitle">C99 Standard</div>
This article has been pretty relaxed with the use of terminology and there is always room for some interpretation when reading a standard. There are many additional cases not covered above and compiler specific issues to consider. But for those interested in up-to-date, definitive information on the C standard, refer to <a href="http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf">ISO/IEC 9899:TC2 [open-std.org]</a>. Here is the most relevant text from section "6.5 Expressions":<br />
<br />
<br />
<div class="monospace-strong">
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
<ul>
<li>a type compatible with the effective type of the object,</li>
<li>a qualified version of a type compatible with the effective type of the object,</li>
<li>a type that is the signed or unsigned type corresponding to the effective type of the
object,</li>
<li>a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,</li>
<li>an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or</li>
<li>a character type.</li>
</ul>
</div>
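
As a minimal sketch of the last item in that list (a character type), the following reads the bytes of a uint32_t through an unsigned char pointer, and copies the bytes of a float into a uint32_t instead of casting the pointer. The function names are hypothetical and chosen for illustration, and the second helper assumes that float is 32 bits wide on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sum the bytes of a 32-bit value.
   Reading any object through an unsigned char lvalue is permitted
   by the "character type" item quoted above. */
uint32_t sum_bytes_u32( const uint32_t* value )
{
  const unsigned char* bytes = (const unsigned char*)value;
  uint32_t             sum   = 0;
  size_t               i;

  for (i = 0; i < sizeof(*value); i++)
  {
    sum += bytes[i];
  }
  return sum;
}

/* Hypothetical helper: fetch the bit pattern of a float by copying
   its bytes, rather than by dereferencing a cast (uint32_t*). */
uint32_t float_bits( float f )
{
  uint32_t bits;

  /* Assumption of this sketch: float and uint32_t have the same size. */
  assert( sizeof(bits) == sizeof(f) );
  memcpy( &bits, &f, sizeof(bits) );
  return bits;
}
```

The byte sum does not depend on endianness, so it is a convenient way to check the first helper on any platform.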

<div class="sticky-note">
Note the use of types like <b>uint64_t</b> and <b>uint32_t</b> in the above examples. For decades programmers have been creating their own integer types and reworking their header files for each platform simply to get consistent integer sizes across multiple architectures. This is because the standard does not guarantee types like <b>int</b> or <b>short</b> to be of any <i>particular</i> width, it only guarantees their sizes relative to each other. But finally, with C99, the debate is over. Standard width integers are now defined in <b>stdint.h</b>. <i>Always</i> use this header, and if your implementation does not have it (e.g. Microsoft), there are portable public domain versions available (e.g. this <a href="http://www.cs.colorado.edu/~main/cs1300/include/stdint.h">stdint.h</a> can be used for Win32).
</div>
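
A minimal sketch of what the fixed-width types buy you, assuming a C99 compiler that provides stdint.h along with the exact-width types. pack_u16_pair is a hypothetical helper written for this note.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit halves into one 32-bit word.
   The widths are part of the type names, so the shift amount does not
   need to be adjusted per platform. */
uint32_t pack_u16_pair( uint16_t hi, uint16_t lo )
{
  /* Exact-width types have exactly the advertised number of bits. */
  assert( sizeof(uint32_t) * CHAR_BIT == 32 );
  assert( sizeof(uint16_t) * CHAR_BIT == 16 );

  return ((uint32_t)hi << 16) | (uint32_t)lo;
}
```

The same function written with <b>unsigned int</b> and <b>unsigned short</b> would rely on widths that the standard does not promise.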

<div id="summary" class="subtitle">Summary</div>

<ul>
<li>Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the <strong>-fstrict-aliasing</strong> flag. Be sure that <i>all</i> code can safely run with this rule enabled. Enable strict aliasing related warnings with <strong>-Wstrict-aliasing</strong>, but do not expect to be warned in all cases. </li>
<li>In order to discover aliasing problems as quickly as possible, <b>-fstrict-aliasing</b> should always be included in the compilation flags for GCC. Otherwise problems may only be visible at the highest optimization levels where it is the most difficult to debug.</li>
</ul>

<div class="sticky-note">
Be wary of code that <i>requires</i> the use of <b>-fno-strict-aliasing</b> (turns off strict aliasing at any level) in order to work. This is a very good indication that the code relies on aliased memory access and is likely to be dominated by poor memory access patterns. At the very least only the minimum number of files should have it disabled, and only because time has not permitted their repair <i>yet</i>. Although it may seem complex to properly alias memory, the cases where it is really necessary for performance are actually quite few and should already be tested rigorously. It is unlikely that code that does not enable strict aliasing would be able to take advantage of the <b>restrict</b> keyword. Using the restrict keyword allows a significant class of memory access optimizations critical to high performance code. For more information on the restrict keyword see: <a href="http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html">Demystifying The Restrict Keyword</a>.
</div>

<div class="subtitle">May Also Interest You</div>

	<a href="http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html">Demystifying The Restrict Keyword</a> (Mike Acton)<br />
	Optimizing data access is a critical part of good performance. Read on to find out how to use the restrict keyword to open up a whole class of optimizations that were previously impossible for a C compiler.<br />
	<br />

	<a href="http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html">Better Performance Through Branch Elimination</a> (Mike Acton and André de Leiradella)<br />

	An introduction to branch penalties, why it's a good idea to avoid branchy code, and some techniques for eliminating branches.<br />
	<br />

	<a href="http://www.cellperformance.com/articles/2006/04/avoiding_microcoded_instructio.html">Avoiding Microcoded Instructions On The PPU</a> (Mike Acton)<br />

	Executing instructions from microcode can wreak havoc on inner loop
	performance. Find out which instructions are microcoded and how to
	avoid them.<br />
	<br />

]]>
    </content>
</entry>
<entry>
    <title>Demystifying The Restrict Keyword</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=33" title="Demystifying The Restrict Keyword" />
    <id>tag:www.cellperformance.com,2006:/mike_acton//2.33</id>
    
    <published>2006-05-29T08:13:54Z</published>
    <updated>2006-06-01T10:24:36Z</updated>
    
    <summary>Optimizing data access is a critical part of good performance. Read on to find out how to use the restrict keyword to open up a whole class of optimizations that were previously impossible for a C compiler.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="sticky-note"><strong>
UPDATED! More examples! More detailed explanations! </strong></div>
<div class="subtitle">Contract</div>

The restrict keyword can be considered an extension to the strict aliasing rule. It allows the programmer to declare that pointers which share the same type (or were otherwise validly created) <b>do not</b> alias each other. By using restrict the programmer can declare that any loads and stores through the qualified pointer (or through another pointer copied either directly or indirectly from the restricted pointer) are the <b>only</b> loads and stores to the same address during the lifetime of the pointer. In other words, the pointer is not aliased by any pointers other than its own copies.<br />
<br />
<div class="rule-of-thumb">
Restrict is a “no data hazards will be generated” contract between the programmer and the compiler. The compiler relies on this information to make optimizations. If the data is, in fact, aliased, the results are undefined and a programmer should not expect the compiler to output a warning. The compiler assumes the programmer is not <i>lying</i>.</div>
<br />
<br />
<div class="contract-header">THE RESTRICT CONTRACT</div>
<div class="contract">
I, [insert your name], a PROFESSIONAL or AMATEUR [circle one] programmer recognize that there are
limits to what a compiler can do. I certify that, to the best of my knowledge, there are no magic
elves or monkeys in the compiler which through the forces of fairy dust can always make code faster.
I understand that there are some problems for which there is not enough information to solve. I 
hereby declare that given the opportunity to provide the compiler with sufficient information,
perhaps through some key word, I will gladly use said keyword and not bitch and moan about how 
"the compiler should be doing this for me."<br />
<br />
In this case, I promise that the pointer declared along with the restrict qualifier is not aliased.
I certify that writes through this pointer will not affect the values read through any other pointer
available in the same context which is also declared as restricted.<br />
<br />
* Your agreement to this contract is implied by use of the restrict keyword ;)
</div>
<br />
<br />
Read on for more information on the practical use and benefits to using the restrict keyword...]]>
        <![CDATA[<div class="subtitle">Restrict is a type qualifier</div>

<div class="quote"> A new feature of C99: The restrict type qualifier allows programs to be written so that translators can produce significantly faster executables. [...] Anyone for whom this is not a concern can safely ignore this feature of the language.</div>
<div class="quote-cite"> -- <a href="http://std.dkuug.dk/JTC1/SC22/WG14/www/C99RationaleV5.10.pdf">From Rationale for International Standard – Programming Languages – C [std.dkuug.dk]</a> (6.7.3.1 Formal definition of restrict)</div>
<br />

The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.<br />
<br />
Example usage:
<div class="code">
int* restrict foo;
</div>
Notice that the restrict keyword qualifies the pointer and not the object being pointed to.
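
A minimal sketch of where the qualifier goes in a declaration, assuming a C99 compiler. add_arrays is a hypothetical function written for this note; the qualifier sits after the <b>*</b>, so it applies to the pointer itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example. Each parameter is a restrict-qualified pointer.
   "const int* restrict a" reads as: a is a restricted pointer to const int.
   Writing "restrict int* a" would try to qualify the int instead, which
   C99 does not allow (only pointer types can be restrict-qualified). */
void add_arrays( int* restrict       dst,
                 const int* restrict a,
                 const int* restrict b,
                 size_t              count )
{
  assert( dst != NULL && a != NULL && b != NULL );

  for (size_t i = 0; i < count; i++)
  {
    dst[i] = a[i] + b[i];
  }
}
```

By calling this function the caller promises that the three arrays do not overlap in the elements that are read and written.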

<div class="sticky-note">
Not all compilers are compliant with the C99 standard. For example, Microsoft's compiler does not support the C99 standard <i>at all</i>. If you are using MSVC on an x86 platform you will not have access to this critical optimization option.<br />
</div>

<div class="sticky-note">
When using GCC, remember to enable the C99 standard by adding <b>-std=c99</b> to your compilation flags. In code that cannot be compiled with C99, use either <b>__restrict</b> or <b>__restrict__</b> to enable the keyword as a GCC extension.<br />
</div>
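
One way to apply this note is a small wrapper macro. This is a sketch rather than an established convention: the macro name RESTRICT and the function scale are assumptions made for illustration, and the macro expands to nothing on compilers where no spelling is known.

```c
#include <assert.h>
#include <stddef.h>

/* RESTRICT is a hypothetical macro name chosen for this sketch. */
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
#  define RESTRICT restrict        /* C99 keyword */
#elif defined(__GNUC__)
#  define RESTRICT __restrict__    /* GCC extension outside C99 mode */
#else
#  define RESTRICT                 /* unknown compiler: no qualifier */
#endif

/* Hypothetical example function: dst[i] = src[i] * k.
   The loop counter is declared up front so this also builds as C89. */
void scale( float* RESTRICT dst, const float* RESTRICT src, float k, size_t count )
{
  size_t i;

  assert( dst != NULL && src != NULL );

  for (i = 0; i < count; i++)
  {
    dst[i] = src[i] * k;
  }
}
```

Falling back to an empty macro keeps the code correct everywhere; it only gives up the optimization on compilers that are not recognized.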

<div class="sticky-note">
The restrict keyword was not included as part of the C++98 standard. However, some C++ compilers <i>may</i> support it as an extension. When restrict is used in C++, it's important to remember that the implicit <i>this</i> pointer should also be restricted. Consult your compiler's manual for how to do this, if possible.
</div>

<div class="rule-of-thumb">An understanding of the <a href="http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html">strict aliasing rule</a> will provide good context for problems related to the restrict keyword. </div>

<div class="subtitle">Why was restrict introduced into C99?</div>
<div class="quote">
The problem that the restrict qualifier addresses is that potential aliasing can inhibit optimizations. Specifically, if a translator cannot determine that two different pointers are being used to reference different objects, then it cannot apply optimizations such as maintaining the values of the objects in registers rather than in memory, or reordering loads and stores of these values. This problem can have a significant effect on a program that, for example, performs arithmetic calculations on large arrays of numbers. The effect can be measured by comparing a program that uses pointers with a similar program that uses file scope arrays (or with a similar Fortran program). The array version can run faster by a factor of ten or more on a system with vector processors. Where such large performance gains are possible, implementations have of course offered their own solutions, usually in the form of compiler directives that specify particular optimizations. Differences in the spelling, scope, and precise meaning of these directives have made them troublesome to use in a program that must run on many different systems. This was the motivation for a standard solution.</div>
<div class="quote-cite"> -- <a href="http://std.dkuug.dk/JTC1/SC22/WG14/www/C99RationaleV5.10.pdf">From Rationale for International Standard – Programming Languages – C [std.dkuug.dk]</a> (6.7.3.1 Formal definition of restrict)</div>
<br />
In other words, proper use of the restrict keyword gives the compiler enough information to select a more optimal order of loads and stores to/from memory and to potentially make better use of registers to store non-aliased objects.<br />
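
To make the point concrete, here is a small sketch (the function names are hypothetical). In the unrestricted version the compiler must reload <b>*value</b> after the first store, because <b>a</b> might point at the same int. The restricted version promises that this cannot happen, so the value is free to stay in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Unrestricted: value may alias a, so *value has to be read twice. */
void add_twice( int* a, int* b, const int* value )
{
  assert( a != NULL && b != NULL && value != NULL );

  *a += *value;
  *b += *value;   /* reload: the store to *a may have changed *value */
}

/* Restricted: the caller promises that a, b and value do not alias,
   so the compiler is allowed to load *value only once. */
void add_twice_restrict( int* restrict a, int* restrict b, const int* restrict value )
{
  assert( a != NULL && b != NULL && value != NULL );

  *a += *value;
  *b += *value;
}
```

Calling add_twice( &x, &y, &x ) with x = 1 and y = 10 is legal and leaves x = 2 and y = 12, because the second read sees the updated x. The same call to add_twice_restrict would break the contract and its result would be undefined.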

<div class="subtitle">Non-aliased Memory Windows</div>

Given the following structure, there is a significant difference in performance in even the smallest update loops.

<pre class="code">
typedef struct vector3  vector3;

struct vector3
{
  float x;
  float y;
  float z;
};
</pre>

What follows is a simple example function that updates some "particles" with unrestricted pointers. Note that the pointers share the same type, so the compiler will assume they can be aliased, per the strict aliasing rule.

<div class="sticky-note">
The example code sections in the article are not meant to serve as examples of real production code, but rather as examples of real <em>patterns</em> often found in production code.</div>

<pre class="code">
void
move( vector3* velocity, 
      vector3* position, 
      vector3* acceleration, 
      float    time_step, 
      size_t   count )
{
  for (size_t i=0;i&lt;count;i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}
</pre>
<br />

<div class="sticky-note">This article will examine the assembly output generated for the PowerPC. However, the principles and suggestions presented are applicable to many common architectures.</div>
<pre class="code">
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: <b>-O3 -fstrict-aliasing -std=c99</b>
#
move:
  cmpwi  0,6,0
  stwu   1,-16(1)
  beq-   0,.L7
  li     8,0
  mtctr  6
.L8:
  add    9,8,3
  lfsx   13,8,5
  add    10,8,5
  lfsx   0,8,3
  lfs    8,4(9)
  add    11,8,4
  lfs    5,8(10)
  lfs    7,4(10)
  lfs    6,8(9)
  fmadds 4,13,1,0
  fmadds 3,7,1,8
  fmadds 2,5,1,6
  <span style="color: #FF0000;">stfsx  4,8,3      # Store velocity_x
  stfs   3,4(9)     # Store velocity_y
  stfs   2,8(9)     # Store velocity_z</span>
  <span style="color: #0000FF;">lfsx   11,8,4     # Load position_x
  lfs    10,4(11)   # Load position_y
  lfs    9,8(11)    # Load position_z</span>
  fmadds 12,4,1,11
  fmadds 0,3,1,10
  fmadds 13,2,1,9
  stfsx  12,8,4
  addi   8,8,12
  stfs   0,4(11)
  stfs   13,8(11)
  bdnz   .L8
.L7:
  addi   1,1,16
  blr
</pre>

Notice above that <b>position</b> must wait for <b>velocity</b> to be stored. This is because the compiler cannot guarantee that the two are not aliased and must assume that the write to <b>velocity</b> can overwrite the location where <b>position</b> will be read. Because the compiler must <i>effectively</i> perform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.<br />

<div class="rule-of-thumb">
The use of unrestricted pointers inhibits the compiler's ability to schedule loads and may cause redundant loads in many cases. With few exceptions, accessing any value through a pointer will force the compiler to load, or reload, the value after any store. This is because the compiler cannot guarantee that the value being loaded was not aliased by the value that was stored.</div>

For instance, there is no reason (other than sanity) why the programmer could not call the function in this way:
<pre class="code">
void 
call_move( vector3* some_data, float time_step, size_t count )
{
  move( some_data, some_data, some_data, time_step, count );
}
</pre>
The use of restricted pointers would specifically disallow this.<br />
<br />
Compare this to the same function working with arrays of file scope. Working with file scope arrays represents the best case for the compiler with regard to alias analysis and should be used as the baseline for implementing  functions with restricted pointers.
<pre class="code">
vector3 velocity     [ PARTICLE_COUNT ];
vector3 position     [ PARTICLE_COUNT ];
vector3 acceleration [ PARTICLE_COUNT ];
&nbsp;
void
move( float time_step )
{
  for (size_t i=0;i&lt;PARTICLE_COUNT;i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}
</pre>

With the above code the compiler knows the arrays will be stored separately and can determine that they are three independent data <i>windows</i>, or <i>stripes</i>, and that there can be no aliasing among them. A data stripe can be thought of as a <i>data channel</i> made up of indexable elements. <br />
<br />

<table width="400" border="1">
  <tr>
    <th scope="col">Data Channel </th>
    <th scope="col">Channel Elements (by Index) </th>
    </tr>
  <tr>
    <td>velocity</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N] </td>
    </tr>
  <tr>
    <td>position</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
    </tr>
  <tr>
    <td>acceleration</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
    </tr>
</table>

<div class="rule-of-thumb">
An element in a restricted data stripe can be a function of one or more elements of any other restricted data stripes, but <b>cannot</b> be a function of a <i>change</i> in an element of a data stripe.</div>

<pre class="code">
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: <b>-O3 -fstrict-aliasing -std=c99</b>
#
move:
  lis    3,velocity@ha
  lis    11,acceleration@ha
  lis    9,position@ha
  la     6,velocity@l(3)
  la     5,acceleration@l(11)
  la     7,position@l(9)
  li     8,0
  stwu   1,-16(1)
  li     0,8192
  mtctr  0
.L18:
  add    12,8,6
  <span style="color:#0000FF">lfsx   12,8,6     # Load  velocity     + 0</span>
  add    10,8,5
  <span style="color:#0000FF">lfsx   13,8,5     # Load  acceleration + 0
  lfs    8,4(12)    # Load  velocity     + 4</span>
  add    4,8,7
  <span style="color:#0000FF">lfs    5,8(10)    # Load  acceleration + 8
  lfs    6,8(12)    # Load  velocity     + 8
  lfs    7,4(10)    # Load  acceleration + 4</span>
  fmadds 9,13,1,12
  fmadds 10,7,1,8
  fmadds 11,5,1,6
  <span style="color:#0000FF">lfsx   4,8,7      # Load  position     + 0
  lfs    3,4(4)     # Load  position     + 4
  lfs    2,8(4)     # Load  position     + 8</span>
  fmadds 0,9,1,4
  fmadds 13,10,1,3
  fmadds 12,11,1,2
  <span style="color:#FF0000">stfsx  9,8,6      # Store velocity     + 0
  stfs   11,8(12)   # Store velocity     + 8
  stfs   10,4(12)   # Store velocity     + 4
  stfsx  0,8,7      # Store position     + 0</span>
  addi   8,8,12
  <span style="color:#FF0000">stfs   13,4(4)    # Store position     + 4
  stfs   12,8(4)    # Store position     + 8</span>
  bdnz   .L18
  addi   1,1,16
  blr
</pre>

All the stores are completed at the end of the loop. More specifically, the load for <strong>position</strong> is scheduled <em>before</em> the store of <strong>velocity</strong>. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded. <br />
<br />

In order to get this same behavior with non-file scope pointers, use the restrict keyword to declare that every location which is either loaded or stored has no aliases.
<pre class="code">
void
move( vector3* velocity, 
      vector3* position, 
      vector3* acceleration, 
      float    time_step, 
      size_t   count, 
      size_t   stride )
{
  float* <span style="color: #0000FF">restrict</span> acceleration_x = &amp;acceleration->x;
  float* <span style="color: #0000FF">restrict</span> velocity_x     = &amp;velocity->x;
  float* <span style="color: #0000FF">restrict</span> position_x     = &amp;position->x;
  float* <span style="color: #0000FF">restrict</span> acceleration_y = &amp;acceleration->y;
  float* <span style="color: #0000FF">restrict</span> velocity_y     = &amp;velocity->y;
  float* <span style="color: #0000FF">restrict</span> position_y     = &amp;position->y;
  float* <span style="color: #0000FF">restrict</span> acceleration_z = &amp;acceleration->z;
  float* <span style="color: #0000FF">restrict</span> velocity_z     = &amp;velocity->z;
  float* <span style="color: #0000FF">restrict</span> position_z     = &amp;position->z;

  for (size_t i=0;i&lt;count*stride;i+=stride)
  {
    velocity_x[i] += acceleration_x[i] * time_step;
    velocity_y[i] += acceleration_y[i] * time_step;
    velocity_z[i] += acceleration_z[i] * time_step;
    position_x[i] += velocity_x[i]     * time_step;
    position_y[i] += velocity_y[i]     * time_step;
    position_z[i] += velocity_z[i]     * time_step;
  }
}
</pre>

Nine (9) non-aliased memory stripes were declared in the above code. This completely defines the aliasing relationships between all the loads and stores.<br />
<br />

<table width="400" border="1">
  <tr>
    <th scope="col">Data Channel </th>
    <th scope="col">Channel Elements (by Index) </th>
  </tr>
  <tr>
    <td>velocity_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N] </td>
  </tr>
  <tr>
    <td>velocity_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>velocity_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
</table>
<br />
By copying addresses from one pointer to another, an implicit hierarchy (or tree) of pointers is created. The child pointers are usually completely aliased by the parent pointer and it's important not to use them both at the same time (i.e. in the same scope). When restricted child pointers are created, consider the parent pointer to be <i>out of scope</i> and do not make any accesses through it. Note that in this case, any use of <b>velocity</b>, <b>position</b> or <b>acceleration</b> would invalidate the restrict contract and the results would be undefined.

<pre class="ascii-art">
                |---> velocity_x
velocity -------|---> velocity_y
                |---> velocity_z

                |---> position_x
position -------|---> position_y
                |---> position_z

                |---> acceleration_x
acceleration ---|---> acceleration_y
                |---> acceleration_z
</pre>

<div class="rule-of-thumb">
Typically, only the leaf nodes in a hierarchy of restricted pointers should be used.</div> 

<pre class="code">
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: <b>-O3 -fstrict-aliasing -std=c99</b>
#
move:
  stwu   1,-32(1)
  stw    31,28(1)
  mullw  31,6,7
  stw    30,24(1)
  cmplwi 7,31,0
  mr     30,7
  addi   12,3,4
  addi   6,5,4
  addi   8,4,4
  addi   7,5,8
  addi   10,3,8
  addi   11,4,8
  li     9,0
  ble-   7,.L27
.L31:
  slwi   0,9,2
  <span style="color: #0000FF">lfsx   13,3,0</span><span style="color:#0000FF">     # Load  velocity_x</span>
  add    9,9,30
  <span style="color: #0000FF">lfsx   8,12,0</span><span style="color:#0000FF">     # Load  velocity_y</span>
  cmplw  7,31,9
  <span style="color: #0000FF">lfsx   6,10,0     # Load  velocity_z
  lfsx   12,5,0     # Load  acceleration_x
  lfsx   7,6,0      # Load  acceleration_y
  lfsx   5,7,0      # Load  acceleration_z</span>
  fmadds 11,12,1,13
  fmadds 10,7,1,8
  fmadds 9,5,1,6
  <span style="color: #0000FF">lfsx   4,4,0      # Load  position_x
  lfsx   3,8,0      # Load  position_y
  lfsx   2,11,0</span><span style="color: #0000FF">     # Load  position_z</span>
  fmadds 0,11,1,4
  fmadds 13,10,1,3
  fmadds 12,9,1,2
  <span style="color: #FF0000">stfsx  11,3,0     # Store velocity_x
  stfsx  10,12,0    # Store velocity_y
  stfsx  9,10,0     # Store velocity_z
  stfsx  0,4,0      # Store position_x
  stfsx  13,8,0     # Store position_y
  stfsx  12,11,0</span><span style="color: #FF0000">    # Store position_z</span>
  bgt+   7,.L31
.L27:
  lwz    30,24(1)
  lwz    31,28(1)
  addi   1,1,32
  blr
</pre>

This version has all the flexibility of the first (unrestricted) version and the performance of the second (file scope arrays) version. You should expect code where all aliasing information is declared with the restrict keyword to <i>almost always</i> perform significantly better, and <em>never</em> worse, than with unrestricted pointers. This is especially true on superscalar RISC, or RISC-like architectures with large register files, like the PowerPC or MIPS R4000. <br />
<br />
The astute reader may also have noticed that because nine (9) restricted stripes were used instead of three (3) file scope arrays, the compiler has been able to select a much simpler addressing scheme. Much of the pointer arithmetic has been hoisted out of the loop. The version with the restricted pointers is actually <i>more</i> efficient than the one with file scope arrays.

<div class="subtitle">Non-aliased Memory Access Patterns</div>

An important distinction to make is that the restrict keyword is not restricting anything. It is in fact <i>allowing</i> the compiler to do more than it could previously. It should also be noted that the type of the pointer that is qualified with restrict is not important; what matters is the location and size used when loading or storing through the pointer. The restrict keyword does not declare that the object being pointed to is completely without aliases, only that the addresses that are loaded from and stored to are unaliased.<br />
<br />
For example, the following setup would be a completely valid use of restricted pointers:
<pre class="code">
struct particle
{
  vector3 position;
  vector3 velocity;
  vector3 acceleration;
};
&nbsp;
[ ... ]
&nbsp;
void 
call_move( particle* particles, float time_step, size_t count )
{
  move( &amp;particles->velocity, 
        &amp;particles->position, 
        &amp;particles->acceleration, 
        time_step, 
        count, 
        sizeof(particle) / sizeof(float) ); /* stride is counted in floats */
}
</pre>

Although each stripe of data is part of the same "object", none of the accesses would be aliased. Some runtime systems try to determine whether or not pointers are aliased by simply checking to see if the memory windows overlap. That is not sufficient. 

<div class="rule-of-thumb">
Memory windows <i>can</i> overlap and still be non-aliased.
</div>
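As a minimal sketch of that rule, assume a hypothetical interleaved buffer in which each particle stores one position float followed by one velocity float. The names <b>integrate</b> and <b>integrate_demo</b> are made up for this illustration and are not part of the examples above. The two stripes cover overlapping address ranges, yet no float is ever touched through both pointers, so the restrict contract should hold:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the move() above): advance one position stripe
   by one velocity stripe. 'stride' is counted in floats, not bytes. */
static void
integrate( float* restrict       position,
           const float* restrict velocity,
           float                 time_step,
           size_t                count,
           size_t                stride )
{
  for (size_t i = 0; i < count * stride; i += stride)
  {
    const float p  = position[i];            /* Load   */
    const float v  = velocity[i];
    const float np = p + ( v * time_step );  /* Update */
    position[i]    = np;                     /* Store  */
  }
}

/* Assumed layout: { position, velocity } repeated for each particle. */
static void
integrate_demo( void )
{
  float data[6] = { 0.0f, 1.0f,   10.0f, 2.0f,   20.0f, 4.0f };

  /* The windows are data[0..4] and data[1..5]. They overlap, but the
     floats actually accessed (even vs. odd indices) are disjoint. */
  integrate( &data[0], &data[1], 0.5f, 3, 2 );

  assert( data[0] == 0.5f );
  assert( data[2] == 11.0f );
  assert( data[4] == 22.0f );
}
```

A runtime check that only compared the start and end of each window would reject this call, even though nothing here is aliased.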

<div class="subtitle">Usage and Suggestions</div>
Use of the restrict keyword should be very common. It should be used as a standard part of all new code. Older code should be revisited where possible to take advantage of the new optimization opportunities. It is somewhat difficult to refactor restricted requirements into pre-existing code as a certain amount of alias analysis must be done by the programmer. However, for the majority of live code in typical applications, memory access is not aliased (nor are memory windows overlapping) and aliasing hazards will be limited to a small fraction of the code base.<br />

<div class="rule-of-thumb">
Before modifying code to use the restrict keyword, ensure that all code can compile safely with strict aliasing enabled.
</div>

Programmers using functions that make assumptions about aliasing must know what those assumptions are. Certainly, if at all possible, memory usage patterns should be documented. However, at the very least, aliasing assumptions in the parameters passed to the functions should be declared. In the above examples, the parameters <b>velocity</b>, <b>position</b> and <b>acceleration</b> must not be aliased and the restrict contract should be made public by <i>also</i> declaring those parameters restricted.

<pre class="code">
void 
move( vector3* restrict velocity, 
      vector3* restrict position, 
      vector3* restrict acceleration, 
      float             time_step, 
      size_t            count, 
      size_t            stride );
</pre>

Not publishing aliasing assumptions will lead to bugs that are very difficult to find. Programmers will not know that the data must be independent and someone, someday, will find a reason to use the same array in two or more pointers.<br />
<br />
Take for example <b>memcpy</b>, which has been officially changed to have the following declaration:
<pre class="code">
void* 
memcpy(void*       restrict s1, 
       const void* restrict s2, 
       size_t               n );
</pre>
<i>Can you guess why?</i><br />

<div class="rule-of-thumb">
Use restrict in function prototypes and in structure definitions to publish the assumptions made about aliasing.
</div>
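One way to read the <b>memcpy</b> declaration: its source and destination must not overlap. The standard <b>memmove</b>, by contrast, is declared without restrict and accepts overlapping ranges. A small sketch of choosing between the two follows; the helper names are hypothetical and only there to show the contract appearing in a prototype:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the caller promises the two ranges are independent
   and the prototype says so. memcpy requires the same promise. */
static void
copy_independent( char* restrict dst, const char* restrict src, size_t n )
{
  memcpy( dst, src, n );
}

/* Hypothetical helper: slide the first n bytes of buffer up by 'shift'
   bytes. Source and destination overlap whenever shift < n, so the
   restrict contract of memcpy cannot be met. memmove handles overlap. */
static void
shift_up( char* buffer, size_t n, size_t shift )
{
  memmove( buffer + shift, buffer, n );
}

static void
copy_demo( void )
{
  char a[8] = "abcde";
  char b[8] = { 0 };

  copy_independent( b, a, 6 );   /* separate arrays: nothing aliased */
  assert( strcmp( b, "abcde" ) == 0 );

  shift_up( a, 5, 2 );           /* overlapping ranges: use memmove  */
  assert( strcmp( a, "ababcde" ) == 0 );
}
```

Note that <b>shift_up</b> deliberately has no restrict on its parameter list, which tells the caller that overlap is expected there.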

Restricted pointers can be copied from one to another to create a hierarchy of pointers. However, there is one limitation defined in the C99 standard: the child pointer <b>must not</b> be in the same block-level scope as the parent pointer. The result of copying restricted pointers in the same block-level scope is undefined.
<pre class="code">
{
  vector3* restrict position   = &amp;obj_a->position;
  float*   restrict position_x = &amp;position->x; &lt;-- UNDEFINED
  {
    float* restrict position_y = &amp;position->y; &lt;-- VALID
  }
}
</pre>  

<div class="rule-of-thumb">
Restricted child pointers must be in a different block-level scope than the parent pointer.
</div>
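A minimal sketch of one way to follow that rule inside a function is to open a nested block for the child pointers. The <b>vector3</b> layout and the <b>scale</b> function are assumptions made for this illustration:

```c
#include <assert.h>

typedef struct { float x, y, z; } vector3;  /* assumed layout */

/* Hypothetical helper: scale one vector in place. */
static void
scale( vector3* restrict v, float k )
{
  /* Nested block: the children are declared one scope below the parent. */
  {
    float* restrict vx = &v->x;
    float* restrict vy = &v->y;
    float* restrict vz = &v->z;

    const float x = *vx;   /* Load   */
    const float y = *vy;
    const float z = *vz;

    *vx = x * k;           /* Update and store */
    *vy = y * k;
    *vz = z * k;
  }
}

static void
scale_demo( void )
{
  vector3 v = { 1.0f, 2.0f, 3.0f };
  scale( &v, 2.0f );
  assert( v.x == 2.0f && v.y == 4.0f && v.z == 6.0f );
}
```

The three children are siblings derived from the same parent and each one touches a different float, so none of their accesses alias.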

<br />
There is one additional problem in the assembly output above which is somewhat particular to the GCC scheduler. Notice that the load for <b>position</b> happens immediately before its update and store. The first multiply-add will stall waiting for the first load to complete before executing. The first float (<b>position_x</b>) <i>will not</i> be ready in three (3) cycles. It would be considerably better (and faster) if the load could be pushed closer to the top of the loop so that it is more likely to be completed by the time it is needed.

<pre class="code">
  <span style="color: #0000FF">lfsx   4,4,0      # Load   position_x
  lfsx   3,8,0      # Load   position_y
  lfsx   2,11,0     # Load   position_z</span>
  <span style="color: #FF00FF">fmadds 0,11,1,4   # Update position_x
  fmadds 13,10,1,3  # Update position_y
  fmadds 12,9,1,2   # Update position_z</span>
</pre>

Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:
<pre class="code">
void
move( vector3* <span style="color: #0000FF">restrict</span> velocity, 
      vector3* <span style="color: #0000FF">restrict</span> position, 
      vector3* <span style="color: #0000FF">restrict</span> acceleration, 
      float             time_step,  
      size_t            count, 
      size_t            stride )
{
  float* <span style="color: #0000FF">restrict</span> acceleration_x = &amp;acceleration->x;
  float* <span style="color: #0000FF">restrict</span> velocity_x     = &amp;velocity->x;
  float* <span style="color: #0000FF">restrict</span> position_x     = &amp;position->x;
  float* <span style="color: #0000FF">restrict</span> acceleration_y = &amp;acceleration->y;
  float* <span style="color: #0000FF">restrict</span> velocity_y     = &amp;velocity->y;
  float* <span style="color: #0000FF">restrict</span> position_y     = &amp;position->y;
  float* <span style="color: #0000FF">restrict</span> acceleration_z = &amp;acceleration->z;
  float* <span style="color: #0000FF">restrict</span> velocity_z     = &amp;velocity->z;
  float* <span style="color: #0000FF">restrict</span> position_z     = &amp;position->z;

  for (size_t i=0;i&lt;count*stride;i+=stride)
  {
    const float ax  = acceleration_x[i];
    const float ay  = acceleration_y[i];
    const float az  = acceleration_z[i];
    const float vx  = velocity_x[i];
    const float vy  = velocity_y[i];
    const float vz  = velocity_z[i];
    const float px  = position_x[i];
    const float py  = position_y[i];
    const float pz  = position_z[i];

    const float nvx = vx + ( ax * time_step );
    const float nvy = vy + ( ay * time_step );
    const float nvz = vz + ( az * time_step );
    const float npx = px + ( vx * time_step );
    const float npy = py + ( vy * time_step );
    const float npz = pz + ( vz * time_step );

    velocity_x[i]   = nvx;
    velocity_y[i]   = nvy;
    velocity_z[i]   = nvz;
    position_x[i]   = npx;
    position_y[i]   = npy;
    position_z[i]   = npz;
  }
}
</pre>

<pre class="code">
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: <b>-O3 -fstrict-aliasing -std=c99</b>
#
move:
  stwu   1,-32(1)
  stw    31,28(1)
  mullw  31,6,7
  stw    30,24(1)
  cmplwi 7,31,0
  mr     30,7
  addi   12,3,4
  addi   6,5,4
  addi   8,4,4
  addi   7,5,8
  addi   10,3,8
  addi   11,4,8
  li     9,0
  ble-   7,.L47
.L51:
  slwi   0,9,2
  <span style="color: #0000FF">lfsx   8,3,0       # Load   vx</span>
  add    9,9,30
  <span style="color: #0000FF">lfsx   7,12,0      # Load   vy</span>
  cmplw  7,31,9
  <span style="color: #0000FF">lfsx   6,10,0      # Load   vz
  lfsx   10,4,0      # Load   px
  lfsx   9,8,0       # Load   py
  lfsx   5,11,0      # Load   pz
  lfsx   4,5,0       # Load   ax
  lfsx   3,6,0       # Load   ay
  lfsx   2,7,0       # Load   az</span>
  <span style="color: #FF00FF">fmadds 0,8,1,10    # Update npx
  fmadds 13,7,1,9    # Update npy
  fmadds 12,6,1,5    # Update npz
  fmadds 11,4,1,8    # Update nvx
  fmadds 10,3,1,7    # Update nvy
  fmadds 9,2,1,6     # Update nvz</span>
  <span style="color: #FF0000">stfsx  0,4,0       # Store  npx
  stfsx  13,8,0      # Store  npy
  stfsx  12,11,0     # Store  npz
  stfsx  11,3,0      # Store  nvx
  stfsx  10,12,0     # Store  nvy
  stfsx  9,10,0      # Store  nvz</span>
  bgt+   7,.L51
.L47:
  lwz    30,24(1)
  lwz    31,28(1)
  addi   1,1,32
  blr
</pre>

The loads are now properly scheduled and moved as far in advance as possible. The pattern [Load --&gt; Update --&gt; Store] is usually the optimal pattern for simple memory transformations on a superscalar RISC-like architecture, and is exactly what is being emitted. This is reasonably close to good hand-written assembly for the same code (without re-defining the problem), and the code is now very suitable for unrolling.<br />

<div class="rule-of-thumb">
Simplify expressions. Do not mix memory access with calculations. Use the [ Load --&gt; Update --&gt; Store ] pattern.
</div>
 
<div class="subtitle">Summary</div>
<ul>
<li>Strict aliasing means that the compiler may assume two pointers to objects of different types never refer to the same location in memory. Enable this option in GCC with the <strong>-fstrict-aliasing</strong> flag. Be sure that <i>all</i> code can safely run with this rule enabled. Enable strict aliasing related warnings with <strong>-Wstrict-aliasing</strong>, but do not expect to be warned in all cases. </li>
<li>Compare the assembly output of the function with restricted pointers and file scope arrays to ensure that all of the possible aliasing information has been used.</li>
<li>Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.</li>
<li>Publish as many assumptions as possible about aliasing information in the function declaration.</li>
<li>Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.</li>
<li>Begin using the restrict keyword immediately. Retrofit old code as soon as possible.</li>
<li>Keep loads and stores separated from calculations. This results in better scheduling in GCC, and makes the relationship between the output assembly and the original source clearer.</li>
</ul>

<div class="subtitle">Additional Reading</div>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Aliasing_%28computing%29">Aliasing (computing) [wikipedia.org]</a></li>
<li><a href="http://mail-index.netbsd.org/tech-kern/2003/08/11/0001.html">Aliasing, Krister Walfridsson [netbsd.org]</a></li>
<li><a href="http://www.intel.com/software/products/compilers/clin/docs/main_cls/mergedprojects/optaps_cls/common/optaps_perf_run.htm">Memory Aliasing on Itanium®-based Systems [intel.com]</a></li>
<li><a href="http://www.cs.princeton.edu/~jqwu/Memory/survey.html">Survey of Alias Analysis [princeton.edu]</a></li>
<li><a href="http://realtimecollisiondetection.net/pubs/GDC03_Ericson_Memory_Optimization.ppt">Memory Optimization, Christer Ericson [realtimecollisiondetection.net]</a></li>
<li><a href="http://www.cs.pitt.edu/~mock/papers/clei2004.pdf">Why Programmer-specified Aliasing is a Bad Idea, Markus Mock [pitt.edu]</a></li>
<li><a href="http://www.hlrs.de/organization/tsc/services/tools/docu/kcc/UserGuide/chapter_4.html">KAI C++ User's Guide, 4.1 Writing Optimizable Code [hlrs.de]</a></li>
</ul>]]>
    </content>
</entry>
<entry>
    <title>A Practical GCC Trick To Use During Optimization</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2006/04/a_practical_gcc_trick_to_use_d.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=4" title="A Practical GCC Trick To Use During Optimization" />
    <id>tag:www.cellperformance.com,2006:/mike_acton//2.4</id>
    
    <published>2006-04-15T09:10:13Z</published>
    <updated>2006-05-01T01:04:58Z</updated>
    
    <summary>Splitting a basic block (by force) Warning: This is a trick to use during optimization. It is not documented nor guaranteed to work across multiple platforms or different revisions of the compiler. Many programmers will say that this non-portable code...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="subtitle">Splitting a basic block (by force)</div>
<br />
<div class="quote"><b>Warning:</b> This is a trick to use during optimization. It is not documented nor guaranteed to work across multiple platforms or different revisions of the compiler. Many programmers will say that this non-portable code should not be used in production, but there is such a huge practical benefit to using it that it has to be mentioned.</div>
<br/>

One of the first things that the compiler's scheduler does is organize the code into blocks of code without branches, out-of-line function calls or other optimization barriers. These basic blocks will then be scheduled among each other based on their dependencies. Toward the end of scheduling, individual instructions will be scheduled within basic blocks. Finally, instructions may be moved across block boundaries.<br/>
<br/>
Inline assembly is most definitely an optimization barrier and every block of inline assembly is treated as an independent basic block. This is why assembly should be inlined one instruction at a time - to give the compiler the maximum number of options for scheduling.<br/>
<br/>
<br/>
<b>Why do I mention this?</b> <br/>
<br/>
Well, it turns out that if you have an empty inline assembly statement you can rely on the side-effect of splitting the basic block without actually adding any instructions.
<div class="code">
#define GCC_SPLIT_BLOCK __asm__ ("");
</div>

]]>
        <![CDATA[So in this case we want to split the block after the loads but before the calculations. The compiler will then schedule these two blocks separately, then try to merge them - but there will be nothing to merge since it will meet all the constraints.

I'll also split the comparisons and branches section of the code at the end so it's easier to see what's happening (so easier to optimize). There's no problem with letting the compiler mix the instructions between the FPU calculations and the comparisons and branches, though.

So the new version:
<div class="code">
    bool overlaps(const Box& b) const
    {
      //
      // LOADS
      //

      const float a_c0 = m_center[0];
      const float a_c1 = m_center[1];
      const float a_c2 = m_center[2];
      const float a_e0 = m_extent[0];
      const float a_e1 = m_extent[1];
      const float a_e2 = m_extent[2];
      const float b_c0 = b.m_center[0];
      const float b_c1 = b.m_center[1];
      const float b_c2 = b.m_center[2];
      const float b_e0 = b.m_extent[0];
      const float b_e1 = b.m_extent[1];
      const float b_e2 = b.m_extent[2];

      GCC_SPLIT_BLOCK

      //
      // CALCULATIONS
      //

      const float delta_c0     = a_c0 - b_c0;
      const float delta_c1     = a_c1 - b_c1;
      const float delta_c2     = a_c2 - b_c2;
      const float abs_delta_c0 = ::fabs( delta_c0 );
      const float abs_delta_c1 = ::fabs( delta_c1 );
      const float abs_delta_c2 = ::fabs( delta_c2 );
      const float sum_e0       = a_e0 + b_e0;
      const float sum_e1       = a_e1 + b_e1;
      const float sum_e2       = a_e2 + b_e2;

      GCC_SPLIT_BLOCK

      //
      // COMPARES AND BRANCHES
      //

      const bool  in_0     = abs_delta_c0 &lt;= sum_e0;
      const bool  in_1     = abs_delta_c1 &lt;= sum_e1;
      const bool  in_2     = abs_delta_c2 &lt;= sum_e2;
      const bool  result   = in_0 &amp;&amp; in_1 &amp;&amp; in_2;

      return (result);
    }
</div>
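The same split can be tried on something much smaller. What follows is only a sketch in plain C on a hypothetical three-float dot product (<b>dot3</b> is not from the code above), with the macro guarded so that it expands to nothing on compilers that do not define __GNUC__. As the warning at the top says, whether the split is honored depends on the compiler and its version:

```c
#include <assert.h>

/* Empty inline assembly on GCC-compatible compilers, nothing elsewhere. */
#if defined(__GNUC__)
#define GCC_SPLIT_BLOCK __asm__ ("");
#else
#define GCC_SPLIT_BLOCK
#endif

/* Hypothetical example function: dot product of two 3-float vectors. */
static float
dot3( const float* a, const float* b )
{
  /* LOADS */
  const float a0 = a[0];
  const float a1 = a[1];
  const float a2 = a[2];
  const float b0 = b[0];
  const float b1 = b[1];
  const float b2 = b[2];

  GCC_SPLIT_BLOCK

  /* CALCULATIONS */
  const float m0  = a0 * b0;
  const float m1  = a1 * b1;
  const float m2  = a2 * b2;
  const float sum = ( m0 + m1 ) + m2;

  return (sum);
}

static void
dot3_demo( void )
{
  const float a[3] = { 1.0f, 2.0f, 3.0f };
  const float b[3] = { 4.0f, 5.0f, 6.0f };
  assert( dot3( a, b ) == 32.0f );
}
```

The macro never changes the result, only (possibly) the schedule, so the way to judge it is to compare the assembly with and without it. The listing that follows is the compiler output for the overlaps function above, not for this sketch.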

The final output is exactly what we were expecting. It's very obvious from the output that the code was scheduled on either side of the splits. The new output:
<div class="code">
_Z12test_overlapRK3BoxS1_:
    //
    // PUSH STACK
    //

    stwu 1,-16(1)

    //
    // LOADS
    //

    lfs 4,20(3)
    lfs 3,20(4)
    lfs 1,0(3)
    lfs 13,4(3)
    lfs 12,8(3)
    lfs 11,12(3)
    lfs 10,16(3)
    lfs 9,0(4)
    lfs 8,4(4)
    lfs 7,8(4)
    lfs 6,12(4)
    lfs 5,16(4)

    //
    // CALCULATIONS
    //

    fsubs 0,1,9
    fsubs 2,13,8
    fsubs 1,12,7
    fadds 11,11,6
    fadds 10,10,5
    fadds 4,4,3
    fabs 0,0
    fabs 13,2
    fabs 12,1

    //
    // COMPARES AND BRANCHES
    //

    fcmpu 7,13,10
    li 3,0
    crnot 30,29
    fcmpu 1,0,11
    mfcr 0
    rlwinm 0,0,31,1
    fcmpu 6,12,4
    crnot 26,25
    cmpwi 7,0,0
    mfcr 0
    rlwinm 0,0,27,1
    bgt- 1,.L14
    cmpwi 6,0,0
    beq- 7,.L14
    beq- 6,.L14
    li 3,1

    //
    // POP STACK AND RETURN
    //
.L14:
    addi 1,1,16
    blr
</div>]]>
    </content>
</entry>
<entry>
    <title>Performance and Good Data Design</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/mike_acton/2006/04/basic_principles_of_good_data.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=2/entry_id=2" title="Performance and Good Data Design" />
    <id>tag:www.cellperformance.com,2006:/mike_acton//2.2</id>
    
    <published>2006-04-03T08:47:20Z</published>
    <updated>2006-05-13T13:25:50Z</updated>
    
    <summary>What follows are some simple rules of thumb that programmers can follow to create a solid pipeline from the content creators to the screen and speakers.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="Public" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/mike_acton/">
        <![CDATA[<div class="subtitle">Why Is Data Design Important For Games?</div>
All game technology is simply a function to manipulate data. A game can be thought of as a very complicated DSP with controllers, source art and time as inputs and an audio-visual display as output. This is not a radical or revolutionary concept. It is, however, widely forgotten or ignored in console game development. Development has changed dramatically over the years and the idea behind this article is to remind game programmers that the only thing we do now of any consequence is transform data. The only thing that really makes games unique is the types and amounts of data programmers must transform in a short, fixed period of time.<br />
<br />
Data access is the biggest problem in attaining maximum performance from a game console. This is necessarily true. Modern consoles are made up of deeply pipelined systems – The CPUs and coprocessors are designed for minimum cycle throughput, caches are designed for maximum throughput on sequential access, DSPs and GPUs are designed to maximize performance at the cost of instruction and data space – Any significant change in data access patterns will stall any or all of these pipelines. Of course there are required stalls in any pipelined system, however game systems themselves provide a pipeline from the content creators to the hardware and the best of those maximize the width and speed of those pipelines by minimizing non-required stalls along the way.<br />
<br />
<div class="subtitle">Good code follows good data, not the other way around.</div>
Fundamentally game systems can not be built for performance independent of the data. Good code follows good data, not the other way around. It is a common misperception that the products of programmers are programs – code. The truth is that the only people who care about code are programmers and we do not ultimately serve ourselves – The product of programmers is a service – console game programmers provide a mechanism for the content creators to put the content on a particular piece of hardware. Both the content creators and the game players want more and more and if we do not do everything we can within the limits of time and budget simply because the code doesn’t meet our expectations of how code should be designed or because a particular piece of hardware is more difficult to work with than we had hoped, then we are doing a disservice to our ultimate customers.<br />
<br />
<div class="rule-of-thumb">
In order to effectively design and optimize a system for a console game, both the data and how it is used must be known.  This is the most obvious, most crucial and most neglected principle in software architecture.
</div>
<br />
It used to be that programmers did everything - the design, the art, the testing, not to mention writing the code. Things have changed.  There are professional designers, level builders, graphic artists, technical artists, art directors, creative directors, and a QA staff large enough to require their own building.<br />
<br />
In today's console game development world, programmers are responsible for only one thing: the data traffic through the console. Without detailed knowledge of what that content actually is, when it is transferred and how it is constructed, it is very difficult to shape that data for best-fit performance. <br />
<br />
What follows are some simple rules of thumb that programmers can follow to create a solid pipeline from the content creators to the screen and speakers.<br />
<br />]]>
        <![CDATA[<div class="subtitle">Work closely with the content creators</div>

Games have the unique and distinct advantage of having a finite set of data that must be managed and having the people responsible for generating that data immediately available to them. For game programmers there exists a straightforward and simple method of understanding how data is generated and gaining insight into possible patterns in that data and the construction process - Maintain a close relationship with the content creators. <br />
<br />
The lone cowboy programmer who sits in the back room making Mountain Dew pyramids is dead. Development methodologies whose main contribution is to simply group more programmers together do nothing but create pairs or groups of cowboys. Only by maintaining constant dialogue with the content creators can a programmer understand what data is critical to the vision and bring it to the screen and speakers. And only by maintaining constant dialogue with the content creators can those content creators benefit from the technical expertise of the programmer and articulate their vision while generating data more suitable to the platform.<br />
<br />
<div class="subtitle">Know the data and access patterns</div>

No matter how closely the programmer works with the content creators there are some patterns that can only be found through a more traditional data analysis. Whether ad hoc or a detailed system in its own right, there must be some method of logging raw data as it passes through different stages of the transformation pipeline and some ability to inspect it visually for patterns. It should not come as a surprise that simply viewing data content as a hex dump will make obvious patterns that the programmer would not have considered otherwise.<br />
<br />
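As an ad hoc starting point, logging can be as simple as formatting raw bytes into a line of hex that gets written wherever the pipeline already logs. The sketch below is one possible helper, not part of any existing system; the name <b>hex_line</b> and its truncation behavior are choices made for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format 'count' bytes as space-separated hex into
   'out'. Stops early if 'out' is too small. Returns the number of
   characters written, not counting the terminating NUL. */
static size_t
hex_line( char* out, size_t out_size, const void* data, size_t count )
{
  const unsigned char* bytes   = (const unsigned char*)data;
  size_t               written = 0;

  if (out_size == 0)
  {
    return 0;
  }
  out[0] = '\0';

  for (size_t i = 0; i < count; i++)
  {
    /* Worst case per byte: separator + two digits + NUL. */
    if (out_size - written < 4)
    {
      break;
    }
    written += (size_t)snprintf( out + written, out_size - written,
                                 "%s%02x", (i > 0) ? " " : "",
                                 (unsigned)bytes[i] );
  }
  return written;
}

static void
hex_line_demo( void )
{
  const unsigned char raw[4] = { 0xde, 0xad, 0xbe, 0xef };
  char                line[32];

  const size_t n = hex_line( line, sizeof(line), raw, sizeof(raw) );
  assert( n == 11 );
  assert( strcmp( line, "de ad be ef" ) == 0 );
}
```

Dumping the same structure at each stage of the pipeline, one record per line, is often enough to make repeated values, padding and unused fields stand out.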
Often generic data shaping techniques are used that improve average performance, but with even a cursory knowledge of the data multiple optimizations become obvious. Programmers must not build models based on their guesses of what the data probably looks like; rather they should design for the patterns of the content creators. The same can be said for function-level optimizations – The game programmer should not optimize a function or loop without knowing exactly what patterns of data are being transformed. <br />

<div class="subtitle">Be prepared to re-organize the data</div>

There has been much discontent with the Waterfall Process of code development through the years and quite a few alternative methods of managing the process have been espoused with varying degrees of success. For example, currently at High Moon Studios we are learning to implement the Agile SCRUM methodology. However it is obvious that in these processes a most critical factor is widely ignored – the data. The waterfall process is alive and well in data design. Most data in games, either as file formats or as runtime data passed through the system are fixed through some early design process and only relatively minor adjustments are made through the development process as failures are evident. To make matters worse, when those inevitable failures do become evident game programmers search for every conceivable alternative to changing a data format because the cost of those changes is perceived as very high. This is where the development process often goes from bad to worse. <br />
<br />
The problem is that with data just as with code, the real cost is not in the changes, but in the lack of ability to adapt to changing demands and to reality. It is often impossible to change the pipeline design if, on analysis, it is shown that memory (or other data) access in general is slow and spread through the entire system. Some optimizations cannot wait until the final phase of application development. Game programmers must be prepared to adapt the data as demand requires. To try to design data without information on what content creators will actually provide (in reality) is foolish at best. To build code and optimizations around that data is even worse. The key is to only store, transform or transfer the data that is immediately useful, and build on that as demands become obvious while minimizing the impact on the content creators. <br />
<br />
<div class="subtitle">Everything you add must make a difference</div>

There is a truism that can oft be heard through the halls of the offices of government contractors, “[It’s] good enough for government work.” Regardless of how it was originally meant, it should serve as a warning that the quest for a perfect solution is counter-productive and not cost-effective. Game developers do not build space shuttle control systems, they build games. Games are about performance hacks – the quest to find real-time solutions to a set of normally slow, calculation and memory-intensive problems. That consoles use polygons is a performance hack; the shading and lighting models are performance hacks; a z-buffer is a performance hack; etc. Game programmers tend to transfer, and thus transform, a lot of extra data in an attempt to create a more perfect solution. They must not forget that if their model does not make a significant difference to either the content creators or the players, it’s probably not worth doing. “Performance hacks” are not only acceptable in game development, they are the entire reason that console game programming is a unique discipline. What level of performance must be attained is entirely determined by the data and expectations of the player.<br />
<br />
<div class="subtitle">Design for the hardware</div>

A good console programmer, no matter whether that programmer is working on AI or UI, will have at least a basic understanding of the hardware the game will be running on. When a programmer develops a system without taking into account the consequences on the target platform or platforms, that programmer is doomed to a difficult and long process of optimizing the system. Every console has unique problems that demand unique solutions: different processor mix, different cache mechanisms, different DMA transfer mechanisms, different register file sizes, etc. Of course, given the reality of budgets and time, all data transformation cannot be perfectly well suited to the hardware, but if the hardware is reasonably understood, any initial implementation will be much better suited and the system overall will realize performance benefits.<br />
<br />
<div class="subtitle">Sort by dominant type</div>

Fundamentally modern systems benefit from doing the simplest possible thing as many times as possible - reducing the number of times the same inputs must be read, increasing the locality of reads and writes, and minimizing branches in order to maximize the potential parallel calculations. The first step in maximizing the benefits from specialized console hardware is determining the types of data changes that cause the most significant impact on the data transformation pipeline – i.e. the data that would require the most, or most significant, pipelines to flush.<br />
<br />
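A rough sketch of the idea, under the assumption (made up for this example) that a shader change is the most expensive state change in the pipeline and a texture change the next most expensive. The <b>draw_item</b> structure and its fields are hypothetical; which type is actually dominant has to come from measuring the real data:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical render item. */
typedef struct
{
  unsigned shader_id;
  unsigned texture_id;
  unsigned mesh_id;
} draw_item;

/* Order by the most expensive change first, then the next one. */
static int
compare_by_dominant_type( const void* lhs, const void* rhs )
{
  const draw_item* a = (const draw_item*)lhs;
  const draw_item* b = (const draw_item*)rhs;

  if (a->shader_id != b->shader_id)
  {
    return (a->shader_id < b->shader_id) ? -1 : 1;
  }
  if (a->texture_id != b->texture_id)
  {
    return (a->texture_id < b->texture_id) ? -1 : 1;
  }
  return 0;
}

static void
sort_draw_items( draw_item* items, size_t count )
{
  qsort( items, count, sizeof(draw_item), compare_by_dominant_type );
}

/* Count how often the shader changes when items are submitted in order. */
static size_t
count_shader_changes( const draw_item* items, size_t count )
{
  size_t changes = 0;
  for (size_t i = 1; i < count; i++)
  {
    changes += (items[i].shader_id != items[i - 1].shader_id) ? 1u : 0u;
  }
  return changes;
}

static void
sort_demo( void )
{
  draw_item items[4] = { { 2, 7, 0 }, { 1, 7, 1 }, { 2, 3, 2 }, { 1, 3, 3 } };

  assert( count_shader_changes( items, 4 ) == 3 );
  sort_draw_items( items, 4 );
  assert( count_shader_changes( items, 4 ) == 1 );
}
```

Counting the changes before and after the sort is a cheap way to check that the chosen key really is the dominant one for a given set of content.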
<div class="subtitle">Clearly distinguish RO, WO, RW data</div>

Whether or not data is read-only (RO), write-only (WO) or read-write (RW) has a dramatic effect on the most suitable organization. It should be readily apparent that read-only and write-only data are significantly easier to manage than read-write data and if that distinction is clearly made, no effort need be wasted on examining unsuitable optimizations. Additionally in systems where memory is shared across multiple processors the amount of necessary data latches can be drastically reduced if the read-write limitations are known in advance.<br />
<br />
Benefits: Easier to move to shared memory model, easier to guarantee restricted access, easier to separate reads and writes.<br />
<br />
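In C, part of that distinction can be written straight into a function's parameter list. The following is a sketch only; <b>scale_heights</b> and its parameters are invented for the example, and the RO/WO/RW labels are conventions in the comments that the compiler only partly enforces (const covers the read-only case):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transform:
     heights      read-only  (RO): const, never written
     scaled       write-only (WO): every element written, never read
     running_max  read-write (RW): loaded once, stored once            */
static void
scale_heights( const float* restrict heights,
               float* restrict       scaled,
               float* restrict       running_max,
               float                 factor,
               size_t                count )
{
  float max_so_far = *running_max;          /* single RW load  */

  for (size_t i = 0; i < count; i++)
  {
    const float h = heights[i] * factor;    /* RO load, update */
    scaled[i]     = h;                      /* WO store        */
    if (h > max_so_far)
    {
      max_so_far = h;
    }
  }

  *running_max = max_so_far;                /* single RW store */
}

static void
scale_heights_demo( void )
{
  const float heights[3]  = { 1.0f, 3.0f, 2.0f };
  float       scaled[3]   = { 0.0f, 0.0f, 0.0f };
  float       running_max = 5.0f;

  scale_heights( heights, scaled, &running_max, 2.0f, 3 );

  assert( scaled[0] == 2.0f && scaled[1] == 6.0f && scaled[2] == 4.0f );
  assert( running_max == 6.0f );
}
```

Keeping the read-write data down to one load and one store makes it the only value that would need any synchronization if this transform were moved to another processor.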
<div class="subtitle">Know that almost everything belongs to a set</div>

- Source data: enemy / pickup<br />
- RT data: collision vector<br />

<div class="subtitle">Take advantage of frame coherence</div>

… Another unique advantage of games is frame-to-frame coherence – it is very likely that much of the data will remain either identical or nearly so between any two frames of gameplay. Although many of these cases are readily apparent, it is good practice to evaluate the runtime data for unexpected frame coherence.<br />
<br />
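One very simple way to exploit that coherence is to keep last frame's input next to last frame's result and skip the transform when the input has not changed. The sketch below assumes a hypothetical light whose bounding box is recomputed each frame; every name in it is made up for illustration, and the bitwise comparison relies on the input structure having no padding:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-frame input: four floats, so no padding is expected. */
typedef struct
{
  float position[3];
  float radius;
} light_input;

/* Last frame's input stored with the result derived from it. */
typedef struct
{
  light_input last_input;
  float       bounds_min[3];
  float       bounds_max[3];
  int         valid;
  unsigned    recompute_count;   /* instrumentation for this sketch only */
} light_cache;

static void
update_light_bounds( light_cache* cache, const light_input* input )
{
  /* Frame coherence: identical input means last frame's result stands. */
  if (cache->valid &&
      memcmp( &cache->last_input, input, sizeof(light_input) ) == 0)
  {
    return;
  }

  for (size_t i = 0; i < 3; i++)
  {
    cache->bounds_min[i] = input->position[i] - input->radius;
    cache->bounds_max[i] = input->position[i] + input->radius;
  }

  cache->last_input = *input;
  cache->valid      = 1;
  cache->recompute_count++;
}

static void
coherence_demo( void )
{
  light_cache cache;
  light_input input = { { 1.0f, 2.0f, 3.0f }, 0.5f };

  memset( &cache, 0, sizeof(cache) );

  update_light_bounds( &cache, &input );   /* frame 1: computed */
  update_light_bounds( &cache, &input );   /* frame 2: reused   */
  assert( cache.recompute_count == 1 );

  input.radius = 1.0f;
  update_light_bounds( &cache, &input );   /* frame 3: changed  */
  assert( cache.recompute_count == 2 );
}
```

Logging a counter like <b>recompute_count</b> against the number of frames gives a quick measure of how much coherence a given piece of runtime data really has.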
<div class="subtitle">Summary</div>


]]>
    </content>
</entry>

</feed> 

