<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
   <channel>
      <title>CellPerformance</title>
      <link>http://www.cellperformance.com/articles/</link>
      <description>All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.</description>
      <language>en</language>
      <copyright>Copyright 2007</copyright>
      <lastBuildDate>Tue, 10 Jul 2007 07:17:23 +0000</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/?v=3.2</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>Fast Matrix Multiplication on Cell (SMP) Systems</title>
         <description><![CDATA[<p>Daniel Hackenberg wrote to tell me about some matrix multiply code he has written for the Cell. <br /><br />
<br /><br />
From his page:<br />
<div class="quote"><br />
This site describes a fast matrix multiplication code for Cell BE processors. It has been developed as part of a seminar paper at the Center for Information Services and High Performance Computing. The program is freely available under the GNU GPL.<br />
</div><br />
<br /><br />
Go ahead and check it out: <a href="http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/architektur_und_leistungsanalyse_von_hochleistungsrechnern/cell/">Fast Matrix Multiplication on Cell (SMP) Systems [tu-dresden.de]</a><br />
</p>]]></description>
         <link>http://www.cellperformance.com/articles/2007/07/fast_matrix_multiplication_on.html</link>
         <guid>http://www.cellperformance.com/articles/2007/07/fast_matrix_multiplication_on.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 10 Jul 2007 07:17:23 +0000</pubDate>
      </item>
            <item>
         <title>Cleaning House</title>
         <description><![CDATA[<div class="sticky-note">
<b>UPDATE! 7 July 2007</b> The <a href="http://forum.beyond3d.com/forumdisplay.php?f=57">new CellPerformance Forums</a> are now up and running, hosted by our friends at Beyond3D. [Thanks guys!]<br />
<br />
I'll be fixing up the links and generally cleaning things up to point all article discussions over to the new forums. It might take a little time, so be patient - but the quality of their forums is great, and I know that the addition of the existing B3D community to our own will drive a lot of good discussion.<br />
<br />
Remember the main articles will continue to be posted here. Hopefully, a few more than I've had time for in recent months. <br />
<br />
We'll be back up and running full-speed shortly!<br />
<br />
Mike.
</div>

<div class="sticky-note">
Hey everyone! I know our forums have been hacked. You'd think that these kids would have better things to do. You'd also think that they'd appreciate exactly the kind of info we're trying to share here. Dumb. <br />
<br />
Anyway, not worth the effort to worry about them. I'm working on a plan that will make the forums better and more useful. And hopefully, I can get a little help from some friends.<br />
<br />
Stay tuned. It's time for me to get back to this and get all of you more of the info you want!<br />
<br />
Mike.
</div>]]></description>
         <link>http://www.cellperformance.com/articles/2007/06/cleaning_house.html</link>
         <guid>http://www.cellperformance.com/articles/2007/06/cleaning_house.html</guid>
         <category>CBE</category>
         <pubDate>Fri, 29 Jun 2007 07:43:27 +0000</pubDate>
      </item>
            <item>
         <title>Handy PS3 Linux Framebuffer Utilities</title>
         <description><![CDATA[While the documentation within Sony's vsync example should be enough to get you started with writing to the framebuffer, here are a couple of handy functions to test the framebuffer settings, open the virtual terminal and get access to the framebuffer.<br />
<br />
Open the virtual terminal:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_vt.h">cp_vt.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_vt.c">cp_vt.c</a><br />
<br />
Open the framebuffer:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_fb.h">cp_fb.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_fb.c">cp_fb.c</a><br />
<br />
Dump framebuffer info:<br />
<a href="http://www.cellperformance.com/public/attachments/fb_info.c">fb_info.c</a><br />
<br />
<a href="http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html#fb_info">Example output from fb_info</a><br />
<a href="http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html#fb_use">Example of using cp_vt and cp_fb</a><br />
<br />

<div class="sticky-note">
Files should be compiled with:
<pre class="code">
ppu-gcc -std=c99 -pedantic -W -Wall -O3
</pre>
</div>
]]></description>
         <link>http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html</link>
         <guid>http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html</guid>
         <category>CBE</category>
         <pubDate>Sat, 31 Mar 2007 06:14:13 +0000</pubDate>
      </item>
            <item>
         <title>HowTo: Huge TLB pages on PS3 Linux</title>
         <description><![CDATA[<div class="sticky-note">
<b>Updated! (22 Mar 07) Minor edits. Added notes for YellowDog Linux. Added source code for using huge page allocation.</b> <br />
<b>Updated! (30 Mar 07) A couple minor fixes. Thanks to Guénaël Renault for pointing them out!</b><br />
<b>Updated! (15 July 07) Added notes for kernel 2.6.21</b>
</div>

<div class="sticky-note">
Guest article: Understanding the TLB and minimizing misses is a critical part of high performance Cell programming. Unfortunately some PS3 kernels do not come with huge page support enabled. Jakub Kurzak and Alfredo Buttari step through the details of recompiling the kernel for huge page support.
</div>

The availability of huge TLB pages depends on the way the Linux kernel has been configured prior to compilation. The default kernel that ships with Fedora Core 5 (and most likely with any other distribution that has binary kernel packages) doesn't include this option. So, in order to have huge TLB pages, it is necessary to reconfigure the kernel, recompile it, and instruct the boot loader about the newly created kernel image. Finally we will also show a way to allocate the TLB pages automatically at boot time.<br />
<br />

<div class="sticky-note">
[Mike Acton] This process also works with YellowDog Linux virtually unchanged.
</div>]]></description>
         <link>http://www.cellperformance.com/articles/2007/01/howto_huge_tlb_pages_on_ps3_li.html</link>
         <guid>http://www.cellperformance.com/articles/2007/01/howto_huge_tlb_pages_on_ps3_li.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 30 Jan 2007 04:26:49 +0000</pubDate>
      </item>
            <item>
         <title>Cross-compiling for PS3 Linux</title>
         <description><![CDATA[Now that the PS3 is out and multiple Linux-based distributions are available which can be installed using <a href="http://www.playstation.com/ps3-openplatform/index.html">Open Platform [playstation.com]</a>, it's time to start developing on some publicly available hardware!<br />
<br />
Although the PPU and SPU compilers can be installed and used on the PS3 directly, I find it much more familiar and convenient to cross-compile from my desktop and just ship the resulting executables over to the target (PS3). <br />
<br />
In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

<ul>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_linux">Install Linux</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_libspe2">Install elfspe2 and libspe2 on PS3</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_toolchain">Install toolchain on host PC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_libspe2_host">Install libspe2 on host PC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#build_hello_libspe2">Building Hello World (for libspe2)</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#hello_source_libspe2">Hello World source (for libspe2)</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#using_ibm_sdk">Using the IBM SDK</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#access_ps3_over_vnc">Access the PS3 Over VNC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#upgrade_libspe">Upgrade libspe and libspe2</a></li>
</ul>]]></description>
         <link>http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html</link>
         <guid>http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html</guid>
         <category>CBE</category>
         <pubDate>Wed, 29 Nov 2006 08:13:56 +0000</pubDate>
      </item>
            <item>
         <title>Unaligned scalar load and store on the SPU</title>
         <description><![CDATA[Albert Noll, a student at UC Irvine, is working on an interesting project. According to him:
<div class="quote">
"I am currently working on a java virtual machine runtime environment which
hides the heterogenity of the cell architecture. Conventional java code
code be executed and benefit from the numerous execution units the Cell
architecture offers. I am doing some java benchmarks (java grande) to test
the efficiancy of the implementation, but I still have some problems
achieving really good results."
</div>
<br />
One of the problems Albert has encountered recently is in loading and storing scalar doubles to the java stack. He recently posed a question on the <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=135517&cat=46">Cell Broadband Engine Architecture forum [ibm.com]</a>:
<div class="quote">
"I have the following problem: I want to load a double value from an array (represents stack of the application)
which is of type unsigned int. The two 32-bit values, which
represent the double value have been casted to a double before, so the bits
are set according to the double representation of the value."
</div>
<br />
The solution to this problem is to remember that the SPU does not have a scalar instruction set and does not access local memory in anything except 16-byte quadwords. The ability to compile scalar code on the SPU is something of a convenience, but it doesn't come without a penalty. <br />
<br />
The first step, before considering performance,  is to properly be able to load and store the unaligned double values. <br />
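As a rough illustration of that first step, here is a minimal portable C sketch, not the SPU-specific code from the article. The function names are hypothetical, and it assumes the two 32-bit words are already stored in the machine's native order for a double. Copying the bytes avoids dereferencing a double pointer that is only 4-byte aligned; on the SPU any such access still has to be assembled from whole quadwords, which is the penalty mentioned above.<br />

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers (names are assumptions, not from the article).
 * The double occupies stack[index] and stack[index + 1]; the slot is only
 * guaranteed to be 4-byte aligned, so the bytes are copied rather than
 * read through a double pointer. */
static double load_double_from_words(const uint32_t *stack, size_t index)
{
    double value;
    memcpy(&value, &stack[index], sizeof value);
    return value;
}

static void store_double_to_words(uint32_t *stack, size_t index, double value)
{
    memcpy(&stack[index], &value, sizeof value);
}
```

Storing a value at an odd index and loading it back from the same index should return the same double without touching the neighbouring words.<br />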
<br />]]></description>
         <link>http://www.cellperformance.com/articles/2006/09/unaligned_scalar_load_and_stor_1.html</link>
         <guid>http://www.cellperformance.com/articles/2006/09/unaligned_scalar_load_and_stor_1.html</guid>
         <category>CBE</category>
         <pubDate>Fri, 15 Sep 2006 10:59:18 +0000</pubDate>
      </item>
            <item>
         <title>atan2 on SPU</title>
         <description><![CDATA[On 2006 March 03 on the IBM developerWorks <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=109947&message=13795522&cat=46&q=atan2#13795522">Cell Broadband Engine Architecture forum [ibm.com]</a> an interesting question was asked:<br />
<div class="quote">
"I am trying to port an application from an older version of SDK to SDK 1.0. It uses atan2(.....) function, which is causing trouble... This code worked fine on SDK28, but now it looks like the new functions dont have this particular function defined..<br />
I did change the makefile to include $(SDKLIB)/libmath.a<br />
<br />
I searched in ./sysroot/usr/spu/include/* and src/include/spu/* but couldn't find a headerfile that has it defined.<br />
<br />
Can anyone please suggest if I should just change the code to not use that function or is there a way to invoke it still?<br />
<br />
Thanks!"
</div>
<br />
It turned out this function was not available in the SDK.<br />
<br />
The following is a branch-free implementation of atan2 for vector floats on the SPU. A scalar version which simply casts to vector and back is also provided. This implementation is fairly quick-and-dirty and no particular level of accuracy is guaranteed, but it should be usable for many purposes.<br />
<br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan">static inline vector float cp_fatan( const vector float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan_scalar">static inline float cp_fatan_scalar( const float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan2">static inline vector float cp_fatan2( const vector float y, const vector float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan2_scalar">static inline float cp_fatan2_scalar( const float y, const float x );</a><br />

<br />
Or download the source files:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_fatan-cbe-spu.h">cp_fatan-cbe-spu.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_fatan-cbe-spu.c">cp_fatan-cbe-spu.c</a><br />
<br />

<div class="sticky-note">
This code is C99 source. For gcc, use the following flags: <span class="monospace-strong">-std=c99 -pedantic</span>
</div>]]></description>
         <link>http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html</link>
         <guid>http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 12 Sep 2006 10:21:52 +0000</pubDate>
      </item>
            <item>
         <title>Branch-free implementation of half-precision (16 bit) floating point</title>
         <description><![CDATA[<div class="sticky-note">Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().</div>
<div class="sticky-note">Update! (17 July 06) The code has been considerably refactored. Decided to go with single function per expression. The expressions have been reduced as a first optimization pass.</div>

<div class="subtitle">Project</div>

The goal of this project is to serve as an example of developing some relatively complex operations completely without branches: a software implementation of half-precision floating point numbers (one that does not use floating point hardware). This example should echo the IEEE 754 standard for floating point numbers as closely as reasonable, including support for +/- INF, QNaN, SNaN, and denormalized numbers. However, exceptions will not be implemented.<br />
<br />
Half-precision floats are used in cases where neither the range nor the precision of 32 bit floating point numbers are needed, but where some dynamic precision is required. Two common uses are for image transformation, where the range of each component (e.g. red, green, blue, alpha) is typically limited to or near [0.0,1.0] or vertex data (e.g. position, texture coordinates, color values, etc.).<br />
<br />
The main advantage of half-precision floats is their size. Beyond the considerable potential for memory savings, processing a large number of half-precision values is more cache-friendly than using 32 bit values.<br />
<br />
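To give a flavour of the technique, here is a deliberately simplified sketch of a half-to-float conversion that uses only integer operations and mask selection. It is not the code from half.tgz: the function names are hypothetical, and it flushes denormalized halves to signed zero instead of normalizing them, which the full implementation described above does handle.<br />

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch (assumption: denormal halves are flushed to zero).
 * Returns the IEEE 754 single-precision bit pattern for a half value. */
static uint32_t half_to_float_bits_sketch(uint16_t h)
{
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    uint32_t exp  = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)h & 0x03FFu;

    /* All-ones masks taken from the top bit of a wrapped subtraction. */
    uint32_t exp_is_zero = 0u - ((exp - 1u) >> 31);   /* exp == 0  */
    uint32_t exp_is_max  = 0u - ((30u - exp) >> 31);  /* exp == 31 */

    /* Rebias the exponent from 15 to 127 and widen the mantissa. */
    uint32_t normal  = ((exp + 112u) << 23) | (mant << 13);
    uint32_t inf_nan = 0x7F800000u | (mant << 13);

    /* Select one result with masks instead of if/else. */
    uint32_t body = (normal & ~(exp_is_zero | exp_is_max))
                  | (inf_nan & exp_is_max);
    return sign | body;
}

static float half_to_float_sketch(uint16_t h)
{
    uint32_t bits = half_to_float_bits_sketch(h);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Both candidate results are always computed and the masks pick one, so the amount of work does not depend on the input value.<br />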
The current released version (including tests) can be downloaded here: <a href="http://www.cellperformance.com/public/attachments/half.tgz">half.tgz</a><br />
<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_to_float">half_to_float()</a> Convert Half To Float (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_from_float">half_from_float()</a> Convert Float to Half (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_add">half_add()</a> Half Add (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_sub">half_sub()</a> Half Subtract (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_mul">half_mul()</a> Half Multiply (Scalar Version)<br />
<br />
]]></description>
         <link>http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html</link>
         <guid>http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html</guid>
         <category>CBE</category>
         <pubDate>Mon, 17 Jul 2006 09:20:36 +0000</pubDate>
      </item>
            <item>
         <title>Better Performance Through Branch Elimination</title>
         <description><![CDATA[<div class="sticky-note">Update! (11 July 06) Major Revision. With much help from André de Leiradella, these are improved working drafts of the series on branch elimination. Now included are a more detailed background on branches and many more examples!
</div>
 
<div class="subtitle">Introduction</div>

Second only to poor data access patterns, branches can have a big negative impact on the performance of a program. Methods for reducing branch penalties, such as dynamic and static (software-assisted) branch prediction, are, despite their successes, increasingly less effective as instruction pipelines grow longer, particularly on in-order architectures where execution must stall when hardware prediction fails.<br />
<br />
<div class="quote">
Branching, both conditional and unconditional, slows most implementations. Even an unconditional branch or a correctly predicted taken branch may cause a delay if the target instruction is not in the fetch buffer or the cache. It is therefore best to use branch instructions carefully and to devise algorithms that reduce branching. Many operations that normally use branches may be performed either with fewer or no branches.
</div>
<div class="quote-cite">
-- From IBM's The PowerPC Compiler Writer's Guide 3.1.5
</div>
<br />
Branches represent a significant part of both performance-critical and general-purpose code: as a general rule of thumb, 20% of the instructions in typical code are branches. Inner loops and other code sections which demand the highest performance may benefit from a multifold increase in performance when branches are eliminated or reduced.<br />
<br />
This series of articles will present the types of delays that branches may cause in program execution and some programming patterns that help avoid those delays.<br />
<br />
<br />
<strong>Part 1: Introduction</strong><br />
<strong>Part 2:</strong> <a href="http://www.cellperformance.com/articles/2006/04/background_on_branching.html">Background on Branching</a><br />
<strong>Part 3:</strong> <a href="http://www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html">Benefits to Branch Elimination</a><br />
<strong>Part 4:</strong> <a href="http://www.cellperformance.com/articles/2006/04/programming_with_branches_patt.html">Programming with Branches, Patterns and Tips</a><br />
<strong>Part 5:</strong> <a href="http://www.cellperformance.com/articles/2006/04/more_techniques_for_eliminatin_1.html">More Techniques for Eliminating Branches</a><br />
<br />
<strong>Additional Examples:</strong><br />
<br />
<a href="http://www.cellperformance.com/articles/2006/07/increment_and_decrement_wrappi.html">Increment And Decrement Wrapping Values</a><br />
Occasionally you have a set of values that you want to wrap around as
you increment and decrement them. But the straightforward implementation
can have a big impact on processors where comparisons and branches are
expensive (e.g. PowerPC). This article presents a straightforward
branch-free implementation of these functions.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/04/choosing_to_avoid_branches_a_s.html">Choosing to Avoid Branches: A Small Altivec Example</a><br />
An example of why fewer instructions don't always equal faster code.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/06/branchfree_implementation_of_h_1.html">Branch-free implementation of half-precision (16 bit) floating point</a><br />
The goal of this project is to serve as an example of developing some
relatively complex operations completely without branches - a software
implementation of half-precision floating point numbers.<br />
<br />
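As a small taste of the wrapping increment and decrement mentioned in the examples above, here is a minimal sketch. It is not the code from the linked article: the function names are hypothetical, and it assumes 0 <= value <= max with max well below INT32_MAX so that nothing overflows.<br />

```c
#include <stdint.h>

/* Hypothetical helpers; assumes 0 <= value <= max and no overflow. */
static int32_t inc_wrap_sketch(int32_t value, int32_t max)
{
    int32_t  next = value + 1;
    /* The top bit of (next - (max + 1)) is set exactly when next <= max,
     * so keep is all ones in range and zero when the value must wrap. */
    uint32_t keep = 0u - ((uint32_t)(next - (max + 1)) >> 31);
    return (int32_t)((uint32_t)next & keep);
}

static int32_t dec_wrap_sketch(int32_t value, int32_t max)
{
    int32_t  prev = value - 1;
    /* wrapped is all ones when prev went below zero. */
    uint32_t wrapped = 0u - ((uint32_t)prev >> 31);
    return (int32_t)(((uint32_t)prev & ~wrapped) | ((uint32_t)max & wrapped));
}
```

The comparison is replaced by a subtraction whose sign bit is turned into a mask, so there is no conditional jump for the processor to predict.<br />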

<div class="sticky-note">
If this article interests you, I highly recommend <a href="http://www.awprofessional.com/authors/bio.asp?a=4bdf8d5b-1419-4de9-8f87-fb9e4cd9c569">Henry S. Warren</a>'s book <a href="http://www.awprofessional.com/bookstore/product.asp?isbn=0201914654&rl=1">Hacker's Delight</a> and his associated website, <a href="http://www.hackersdelight.org/">Hacker's Delight</a>. This book is a must-have for every programmer.
</div>]]></description>
         <link>http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html</link>
         <guid>http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 11 Jul 2006 04:30:07 +0000</pubDate>
      </item>
            <item>
         <title>Box Overlap</title>
         <description><![CDATA[<div class="subtitle">Background</div>
Interactive 3D applications frequently need to check whether one geometric object overlaps another.  In this article, we'll look at a function to test for overlap between 3D boxes, and we'll show how to optimize this function for the CBE.<br />
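For orientation, a plain scalar starting point might look like the sketch below. The box layout (per-axis minimum and maximum corners) and the names are assumptions made for illustration, not necessarily the representation used in the article, and boxes that merely touch are counted as overlapping.<br />

```c
/* Hypothetical axis-aligned box: minimum and maximum corner per axis. */
typedef struct {
    float min[3];
    float max[3];
} box3_sketch;

/* Returns 1 when the boxes overlap on every axis, 0 otherwise.
 * Bitwise & is used instead of && so that no axis test is skipped
 * and the compiler is not pushed toward early-out branches. */
static int boxes_overlap_sketch(const box3_sketch *a, const box3_sketch *b)
{
    int overlap = 1;
    for (int axis = 0; axis < 3; ++axis) {
        overlap &= (a->min[axis] <= b->max[axis])
                 & (b->min[axis] <= a->max[axis]);
    }
    return overlap;
}
```

Two boxes overlap only if their extents overlap on all three axes, which is why the per-axis results are combined with AND.<br />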
]]></description>
         <link>http://www.cellperformance.com/articles/2006/06/box_overlap.html</link>
         <guid>http://www.cellperformance.com/articles/2006/06/box_overlap.html</guid>
         <category>CBE</category>
         <pubDate>Sun, 18 Jun 2006 11:21:21 +0000</pubDate>
      </item>
            <item>
         <title>A 4x4 Matrix Inverse</title>
         <description><![CDATA[<div class="sticky-note">
<b>GUEST ARTICLE!</b> Cédric Lallain is a Frenchman who has been working with me on Cell/PS3 research at Highmoon Studios in Carlsbad, CA. I hope that this is only the first of many contributions to the community by Cédric. Welcome aboard! -- Mike.
</div>
<div class="subtitle">
Inverse matrix on PPU and on SPU using SIMD instructions.
</div>
<p>
This article will talk about how to convert some scalar code to SIMD code for the PPU and SPU using the inverse matrix as an example.
<p>
Most of the time in video games, programmers do not compute a full matrix inverse. 
It is too expensive. Instead, to invert a matrix, they treat it as orthonormal and just do a 3x3 transpose of the rotation part with a dot product for the translation. 
Sometimes the full inverse algorithm is necessary. 
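<p>
The orthonormal shortcut can be sketched as follows. This is only an illustration: it assumes a row-major 4x4 matrix with the translation stored in the last column (elements 3, 7 and 11), a bottom row of 0 0 0 1, and an output array that does not alias the input. The function name is hypothetical.

```c
/* Inverse of a rigid transform [R t; 0 1] is [R^T  -(R^T * t); 0 1].
 * Assumptions: row-major storage, translation in m[3], m[7], m[11],
 * R orthonormal, and out does not alias m. */
static void invert_rigid_sketch(const float m[16], float out[16])
{
    /* Transpose the 3x3 rotation block. */
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            out[r * 4 + c] = m[c * 4 + r];
        }
    }

    /* New translation: one dot product per row of the transposed block. */
    const float tx = m[3], ty = m[7], tz = m[11];
    for (int r = 0; r < 3; ++r) {
        out[r * 4 + 3] = -(out[r * 4 + 0] * tx
                         + out[r * 4 + 1] * ty
                         + out[r * 4 + 2] * tz);
    }

    out[12] = 0.0f;
    out[13] = 0.0f;
    out[14] = 0.0f;
    out[15] = 1.0f;
}
```

This costs nine copies and three dot products, which is why it is preferred whenever the matrix is known to contain only rotation and translation.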
<p>
The main goal is to be able to do it as fast as possible. 
This is why the code should use SIMD instructions as much as possible.
  <div class="quote">A vector is an instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be fixed-point or floating-point values. Most Vector/SIMD Multimedia Extension and SPU instructions operate on vector operands. Vectors are also called
Single-Instruction, Multiple-Data (SIMD) operands, or packed operands.<br />
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.</div>
<div class="quote-cite">[Chapter 2.5.1 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>]. </div>

  <div class="quote">Each SPE is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD and scalar applications.</div>
  <div class="quote-cite">[Chapter 3 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>]. </div>
<p><br />
The number of branches should also be kept to a strict minimum. Any extra branches will slow down the final solution.
For more information about it, check the article: <a href="http://www.cellperformance.com/articles/2006/04/tutorial_branch_elimination_pa.html"
 > Better Performance Through Branch Elimination [CellPerformance.com]</a>.

The first step is to choose the most suitable algorithm in order to reach the objectives.
Several algorithms exist to invert a matrix:

<p><b> The Gauss-Jordan elimination: </b>
Gauss-Jordan elimination is a method for finding the inverse matrix by solving a system of linear equations.
 A good explanation of how this algorithm works can be found in the book <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.1. <br />
For a visual demonstration using a java applet see: <a href="http://www.cse.uiuc.edu/eot/modules/linear_equations/gauss_jordan/"> Gauss-Jordan Elimination [cse.uiuc.edu]</a>.
In this algorithm, the choice of a good pivot is a critical part. 
To do it, all floating point values of a specific column need to be tested against each other, one by one. This kind of element-by-element comparison is a poor fit for SIMD code. 
<br />
While performing the algorithm, some multiplications are done between columns 
(e.g.: to apply the pivot) and some other operations between rows
(e.g.: to apply the multiplier to the rest of the matrix). 
This requires extra code to swap rows and columns in order to use SIMD instructions.
<p>
<b> Inversion using LU decomposition: </b>
The description of the inverse calculation can be found in <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.3.
<div class="quote">
In linear algebra, a block LU decomposition is a decomposition of a block matrix into a lower block triangular matrix L and an upper block triangular matrix U. This decomposition is used in numerical analysis to reduce the complexity of the block matrix formula.</div>
<div class="quote-cite">[<a href="http://en.wikipedia.org/wiki/Block_LU_decomposition">Block LU decomposition [wikipedia.org]</a>]</div>
This algorithm would probably be very useful if the size of the matrix were 8x8. 
For a 4x4 matrix, however, it requires doing the calculation two floating point values at a time, 
whereas a vector type contains four.<p>

  <b> Inversion by Partitioning: </b>
To invert a matrix A (size N) by partitioning, the matrix is partitioned into:
<pre>
       |  A0    A1  |
   A = |            | with A0 and A3 square matrices of respective sizes
       |  A2    A3  |                s0 and s3, where s0 + s3 = N
</pre>
The inverse is
<pre>
          |  B0    B1  |
   InvA = |            |
          |  B2    B3  |
</pre>
with:
<pre>
  B0 = Inv(A0 - A1 * InvA3 * A2)
  B1 = - B0 * (A1 * InvA3)
  B2 = - (InvA3 * A2) * B0
  B3 = InvA3 + B2 * (A1 * InvA3)
</pre>
More information can be found in <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.7<p>
The issue described above is also present here; the idea is to work on four floating point values at a time and not only two. <p>

  <b> Using the inverse formula ( (1/det(M)) * Transpose(Cofactor(M))): </b>
Check the article about <a href="http://mathworld.wolfram.com/MatrixInverse.html"> Matrix Inverse [mathworld.wolfram.com]</a> for more information about this formula.
<p>
This is the algorithm which will be used to invert the matrix. Each step factors very well; it's possible to group the operations in order to replace them with SIMD instructions. <br />
The most critical part of this algorithm is the calculation of all cofactors. This part also has two great advantages for our objectives. It's 100% calculation, which allows writing code without branching. And all cofactor values are computed the same way, so they can be computed in parallel and independently of each other. This is a perfect place to use the SIMD instructions.
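<p>
For reference, the formula can be written out in scalar C roughly as below. This is a sketch of the standard cofactor expansion using shared 2x2 sub-determinants, not the article's own implementation; the function name and the row-major layout are assumptions. The cofactor work itself contains no branches; the caller is expected to check that the returned determinant is not zero before using the result.

```c
/* Sketch: out = (1/det(a)) * Transpose(Cofactor(a)), row-major 4x4.
 * Returns det(a); out is only meaningful when the result is non-zero. */
static float invert4x4_cofactor_sketch(const float a[16], float out[16])
{
    /* 2x2 sub-determinants of rows 0-1 (s) and rows 2-3 (c). */
    const float s0 = a[0] * a[5] - a[4] * a[1];
    const float s1 = a[0] * a[6] - a[4] * a[2];
    const float s2 = a[0] * a[7] - a[4] * a[3];
    const float s3 = a[1] * a[6] - a[5] * a[2];
    const float s4 = a[1] * a[7] - a[5] * a[3];
    const float s5 = a[2] * a[7] - a[6] * a[3];
    const float c5 = a[10] * a[15] - a[14] * a[11];
    const float c4 = a[9]  * a[15] - a[13] * a[11];
    const float c3 = a[9]  * a[14] - a[13] * a[10];
    const float c2 = a[8]  * a[15] - a[12] * a[11];
    const float c1 = a[8]  * a[14] - a[12] * a[10];
    const float c0 = a[8]  * a[13] - a[12] * a[9];

    const float det = s0 * c5 - s1 * c4 + s2 * c3
                    + s3 * c2 - s4 * c1 + s5 * c0;
    const float inv = 1.0f / det;

    /* Transposed cofactors scaled by 1/det. */
    out[0]  = ( a[5]  * c5 - a[6]  * c4 + a[7]  * c3) * inv;
    out[1]  = (-a[1]  * c5 + a[2]  * c4 - a[3]  * c3) * inv;
    out[2]  = ( a[13] * s5 - a[14] * s4 + a[15] * s3) * inv;
    out[3]  = (-a[9]  * s5 + a[10] * s4 - a[11] * s3) * inv;
    out[4]  = (-a[4]  * c5 + a[6]  * c2 - a[7]  * c1) * inv;
    out[5]  = ( a[0]  * c5 - a[2]  * c2 + a[3]  * c1) * inv;
    out[6]  = (-a[12] * s5 + a[14] * s2 - a[15] * s1) * inv;
    out[7]  = ( a[8]  * s5 - a[10] * s2 + a[11] * s1) * inv;
    out[8]  = ( a[4]  * c4 - a[5]  * c2 + a[7]  * c0) * inv;
    out[9]  = (-a[0]  * c4 + a[1]  * c2 - a[3]  * c0) * inv;
    out[10] = ( a[12] * s4 - a[13] * s2 + a[15] * s0) * inv;
    out[11] = (-a[8]  * s4 + a[9]  * s2 - a[11] * s0) * inv;
    out[12] = (-a[4]  * c3 + a[5]  * c1 - a[6]  * c0) * inv;
    out[13] = ( a[0]  * c3 - a[1]  * c1 + a[2]  * c0) * inv;
    out[14] = (-a[12] * s3 + a[13] * s1 - a[14] * s0) * inv;
    out[15] = ( a[8]  * s3 - a[9]  * s1 + a[10] * s0) * inv;
    return det;
}
```

Every output element is the same pattern of three multiply-adds, which is the regularity that makes this formula a natural candidate for SIMD.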
<p>
This article will start with a basic implementation of the inverse formula using scalar instructions. Then this code will be modified to prepare the SIMD version. The first SIMD version will be done for the PPU. The final one will be a conversion using the SPU intrinsic instruction set.
]]></description>
         <link>http://www.cellperformance.com/articles/2006/06/a_4x4_matrix_inverse_1.html</link>
         <guid>http://www.cellperformance.com/articles/2006/06/a_4x4_matrix_inverse_1.html</guid>
         <category>CBE</category>
         <pubDate>Sat, 03 Jun 2006 09:17:03 +0000</pubDate>
      </item>
            <item>
         <title>Avoiding Microcoded Instructions On The PPU</title>
         <description><![CDATA[<div class="subtitle">What are microcoded instructions?</div>

Microcode is a special instruction set that is (usually) only available to the hardware. On the PPU (PowerPC Unit), small microprograms made up of microcode are stored in ROM and executed in the place of those PowerPC instructions that were too costly to implement directly in hardware or do not fit into the pipeline design very well. The size of a microprogram is measured in microwords. <br />
<br />
The PowerPC instructions for which a microprogram is executed are often called <i>microcoded instructions</i>.<br />
<br />
Microcoded instructions may be <i>conditionally executed</i> or <i>unconditionally executed</i>. Unconditionally executed microcoded instructions <i>always</i> execute the microprogram. Conditionally executed microcoded instructions will only execute the microprogram when the values of the register operands are exceptional in some way. Microcoded instructions are a special case of normal instructions and conditionally executed microcoded instructions are a special case of those.<br />

<div class="subtitle">Why avoid microcoded instructions?</div>

<div class="quote-open">
<div class="quote">The G5 core implements several instructions in microcode. These instructions cause a pipeline bubble during decode. The most commonly used microcoded instructions are load and store multiple -- lmw and stmw. These are often generated by the compiler to save space when saving and restoring registers on the stack. You can force GCC to avoid these instructions by specifying -mnomultiple. Indexed forms and/or algebraic forms of updating load and stores are also executed as microcode. You can force GCC to avoid these instructions by specifying -mno-update.</div>
<div class="quote-cite">-- From <a href="http://developer.apple.com/technotes/tn/tn2087.html">G5 Performance Primer [apple.com]</a></div>
</div>
<br />

Like the G5, the PPU contains microcoded instructions. Microcoded instructions are implemented in order to maintain compatibility with the PowerPC standard (a processor can only be called a PowerPC processor if it adheres to <a href="http://www-128.ibm.com/developerworks/eserver/articles/archguide.html?S_TACT=105AGX16&S_CMP=DWPA">the standard [ibm.com]</a>.) When one of these instructions is decoded, the current pipeline is flushed, and the microcoded program is then fetched from ROM and executed as a single atomic unit. The process of flushing the pipeline, fetching the microcode and executing the program takes quite a long time compared to other instructions. Additionally, because the instruction must be executed atomically in order to remain as transparent to the user as possible, any resources needed by the microcode program must be locked.<br />

<br />

<div class="quote">
;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued. 
</div>
<div class="quote-cite">
-- From <a href="http://www.cellperformance.com/public/attachments/cellpu.md">cellpu.md</a> (<a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/gcctoolchain_cbe.html">CBE Toolchain 2.3 source code [bsc.es]</a>)
</div>

<div class="rule-of-thumb">
The minimum seven (7) cycle stall for microcoded instructions is derived from the fixed stages of the microcode section of the instruction pipeline. Microcoded stages are inserted after the last instruction buffer stage and before the first instruction decode stage. The actual penalty is determined by the complexity and length of the instruction. <br />
<br />
For more information on the PPU pipeline stages see: <a href="http://www.research.ibm.com/journal/rd/494/kahle.html">Introduction to the Cell multiprocessor [ibm.com]</a>
</div>

<br />

The details on which instructions are microcoded and the associated penalties are specific to each PowerPC device and are outlined in the User's Guide for the individual processor. For example, see the <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/AE818B5D1DBB02EC87256DDE00007821/$file/970FX_user_manual.v1.6.2006FEB09.pdf">IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com]</a>, paying particular attention to <i>Section 6.3.3 Instruction Decode, Cracking, and Microcode</i>.<br />

<br />

The PPU User's Guide has not been released publicly. So how is a programmer to know which instructions are microcoded and how to avoid them?<br />

<br />
Read on to find out.

<div class="sticky-note">
<b>UPDATE: 11 MAY 2006</b><br />
<br />
On May 10, 2006, IBM released the <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of the instructions that are always microcoded, including latency information and microword count. Before this document was released, no public document described in detail the penalties for using microcoded instructions. This article has been updated to reflect those details.<br />
<br />
From the document:<br />
<br />
<div class="quote">
<b>Note:</b> A minimum of 11 cycles are required before the first instruction is received from the microcode ROM, so microcoded instructions should be avoided if possible.<br />
<br />
Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.<br />
</div>
</div>]]></description>
         <link>http://www.cellperformance.com/articles/2006/04/avoiding_microcoded_instructio.html</link>
         <guid>http://www.cellperformance.com/articles/2006/04/avoiding_microcoded_instructio.html</guid>
         <category>CBE</category>
         <pubDate>Fri, 28 Apr 2006 04:45:14 +0000</pubDate>
      </item>
            <item>
         <title>Choosing to Avoid Branches: A Small Altivec Example</title>
         <description><![CDATA[<div class="subtitle">Balancing speed and instruction count</div>

I came across an <a href="http://www.simdtech.org/altivec/archive/msg?list_name=altivec&monthdir=200604&msg=msg00009.html">interesting bit of Altivec code [simdtech.org]</a> on the <a href="http://www.simdtech.org/altivec">Altivec mailing list [simdtech.org]</a> today and I thought it would make a good example of how speed is not simply a matter of instruction count on the PPU and SPU.<br />
<br />

The function (modified slightly for this example):
<div class="code">
<a href="http://www.cellperformance.com/articles/2006/04/vector_unsigned_short.html">vector unsigned short</a>
test( <a href="http://www.cellperformance.com/articles/2006/04/vector_unsigned_short.html">vector unsigned short</a> a, <a href="http://www.cellperformance.com/articles/2006/04/vector_unsigned_short.html">vector unsigned short</a> b ) 
{
  const <a href="http://www.cellperformance.com/articles/2006/04/vector_unsigned_short.html">vector unsigned short</a> mask = (<a href="http://www.cellperformance.com/articles/2006/04/vector_unsigned_short.html">vector unsigned short</a>)<a href="http://www.cellperformance.com/articles/2006/04/vec_cmplt.html">vec_cmplt</a>( a, b );

  if (<a href="http://www.cellperformance.com/articles/2006/04/vec_all_ge.html">vec_all_ge</a>(a, b)) 
  {
    return (a);
  }

  return (mask);
}
</div>

Ignoring the function stack push/pop, we would have very few instructions. Something like:

<div class="code">
  <span style="color:#ff0000;">vcmpgtuh. v0,  v3, v2</span>
  mfcr      r0
  rlwinm    r0,  r0, 27, 1
  <span style="color:#ff0000;">vcmpgtuh  v0,  v3, v2</span>
  cmpwi     cr7, r0, 0
  bne+      cr7, L2
  vor       v2,  v0, v0
L2:
</div>

<a href="http://www.shellandslate.com/bencv.html">Ben Weiss [shellandslate.com]</a> wrote this snippet to demonstrate that GCC does not optimize out redundant calls when Altivec instructions are called both with and without the CR-modify bit set.  It's worth noting that this code (in the general case) will be a performance loss. Not only will there very likely be performance issues in these instructions themselves, but it will also affect the surrounding code by creating an optimization barrier.<br />
<br />
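One way to picture a form of this function without the if/return is to model the eight 16-bit lanes in plain C. This is only a sketch and not necessarily the alternative the full article arrives at. The name <i>test_select</i> and the lane loops are assumptions made for illustration; on real hardware each loop would be a single vector operation, with the final blend being the kind of work <i>vec_sel</i> does.<br />
<br />

```c
enum { LANES = 8 };   /* a vector unsigned short holds eight 16-bit lanes */

/* Scalar model of test() with the branch replaced by a select.
   out[i] receives a[i] when every lane has a[i] >= b[i]; otherwise it
   receives the compare mask (0xFFFF where b[i] > a[i], 0 elsewhere). */
void test_select(const unsigned short *a, const unsigned short *b,
                 unsigned short *out)
{
    unsigned short mask[LANES];
    unsigned short any_lt = 0;
    int i;

    /* Lane-wise compare, standing in for vec_cmplt(a, b). */
    for (i = 0; i != LANES; ++i) {
        mask[i] = (unsigned short)(0u - (unsigned int)(b[i] > a[i]));
        any_lt |= mask[i];   /* becomes 0xFFFF if any lane compared less-than */
    }

    /* Blend: take a[i] when no lane compared less-than, else mask[i]. */
    for (i = 0; i != LANES; ++i) {
        out[i] = (unsigned short)((a[i] & ~any_lt) | (mask[i] & any_lt));
    }
}
```

Both results are always computed and the choice between them is made with bitwise operations, so there is no data-dependent branch to predict.<br />
<br />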
Read on to find out more about the issues and examine the alternative...]]></description>
         <link>http://www.cellperformance.com/articles/2006/04/choosing_to_avoid_branches_a_s.html</link>
         <guid>http://www.cellperformance.com/articles/2006/04/choosing_to_avoid_branches_a_s.html</guid>
         <category>CBE</category>
         <pubDate>Thu, 20 Apr 2006 07:54:25 +0000</pubDate>
      </item>
            <item>
         <title>More Techniques for Eliminating Branches</title>
         <description><![CDATA[<div class="subtitle">Better Performance Through Branch Elimination</div>

<strong>Part 1:</strong> <a href="http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html">Introduction</a><br />
<strong>Part 2:</strong> <a href="http://www.cellperformance.com/articles/2006/04/background_on_branching.html">Background on Branching</a><br />
<strong>Part 3:</strong> <a href="http://www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html">Benefits to Branch Elimination</a><br />
<strong>Part 4:</strong> <a href="http://www.cellperformance.com/articles/2006/04/programming_with_branches_patt.html">Programming with Branches, Patterns and Tips</a><br />
<strong>Part 5: More Techniques for Eliminating Branches</strong><br />
<br />
<strong>Additional Examples:</strong><br />
<br />
<a href="http://www.cellperformance.com/articles/2006/07/increment_and_decrement_wrappi.html">Increment And Decrement Wrapping Values</a><br />
Occasionally you have a set of values that you want to wrap around as
you increment and decrement them. But the straightforward implementation
can have a big impact on processors where comparisons and branches are
expensive (e.g. PowerPC). This article presents a straightforward
branch-free implementation of these functions.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/04/choosing_to_avoid_branches_a_s.html">Choosing to Avoid Branches: A Small Altivec Example</a><br />
An example of why fewer instructions doesn't always equal faster code.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/06/branchfree_implementation_of_h_1.html">Branch-free implementation of half-precision (16 bit) floating point</a><br />
The goal of this project is to serve as an example of developing some
relatively complex operations completely without branches - a software
implementation of half-precision floating point numbers.<br />
<br />
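As a rough illustration of the wrapping increment and decrement described above, the sketch below uses sign-bit masks instead of comparisons and branches. It is not taken from the linked article. The names <i>wrap_inc</i>, <i>wrap_dec</i> and <i>SIGN_SHIFT</i> are hypothetical, and the code assumes 8-bit bytes, two's-complement integers, and 0 &lt;= value &lt;= max on entry.<br />
<br />

```c
/* Assumption: 8-bit bytes, so this is the bit position of the sign bit. */
enum { SIGN_SHIFT = sizeof(int) * 8 - 1 };

/* Returns value + 1, wrapping to 0 once value has reached max. */
int wrap_inc(int value, int max)
{
    /* All ones while value is below max (value - max is negative), else 0. */
    unsigned int below = 0u - ((unsigned int)(value - max) >> SIGN_SHIFT);
    return (int)(((unsigned int)value + 1u) & below);
}

/* Returns value - 1, wrapping to max once value has reached 0. */
int wrap_dec(int value, int max)
{
    int prev = value - 1;   /* -1 when value is 0 */
    /* All ones when prev went negative, else 0. */
    unsigned int wrapped = 0u - ((unsigned int)prev >> SIGN_SHIFT);
    return (int)(((unsigned int)prev & ~wrapped) | ((unsigned int)max & wrapped));
}
```

For example, with max = 3, wrap_inc steps 0, 1, 2, 3 and then back to 0, while wrap_dec steps from 0 back up to 3.<br />
<br />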
]]></description>
         <link>http://www.cellperformance.com/articles/2006/04/more_techniques_for_eliminatin_1.html</link>
         <guid>http://www.cellperformance.com/articles/2006/04/more_techniques_for_eliminatin_1.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 11 Apr 2006 08:00:21 +0000</pubDate>
      </item>
            <item>
         <title>Programming with Branches, Patterns and Tips</title>
         <description><![CDATA[<div class="subtitle">Better Performance Through Branch Elimination</div>

<strong>Part 1:</strong> <a href="http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html">Introduction</a><br />
<strong>Part 2:</strong> <a href="http://www.cellperformance.com/articles/2006/04/background_on_branching.html">Background on Branching</a><br />
<strong>Part 3:</strong> <a href="http://www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html">Benefits to Branch Elimination</a><br />
<strong>Part 4: Programming with Branches,  Patterns and Tips</strong><br />
<strong>Part 5:</strong> <a href="http://www.cellperformance.com/articles/2006/04/more_techniques_for_eliminatin_1.html">More Techniques for Eliminating Branches</a><br />
<br />
<strong>Additional Examples:</strong><br />
<br />
<a href="http://www.cellperformance.com/articles/2006/07/increment_and_decrement_wrappi.html">Increment And Decrement Wrapping Values</a><br />
Occasionally you have a set of values that you want to wrap around as
you increment and decrement them. But the straightforward implementation
can have a big impact on processors where comparisons and branches are
expensive (e.g. PowerPC). This article presents a straightforward
branch-free implementation of these functions.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/04/choosing_to_avoid_branches_a_s.html">Choosing to Avoid Branches: A Small Altivec Example</a><br />
An example of why fewer instructions doesn't always equal faster code.<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/06/branchfree_implementation_of_h_1.html">Branch-free implementation of half-precision (16 bit) floating point</a><br />
The goal of this project is to serve as an example of developing some
relatively complex operations completely without branches - a software
implementation of half-precision floating point numbers.<br />
<br />]]></description>
         <link>http://www.cellperformance.com/articles/2006/04/programming_with_branches_patt.html</link>
         <guid>http://www.cellperformance.com/articles/2006/04/programming_with_branches_patt.html</guid>
         <category>CBE</category>
         <pubDate>Tue, 11 Apr 2006 07:35:33 +0000</pubDate>
      </item>
      
   </channel>
</rss>
