<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>CellPerformance</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/" />
    <link rel="self" type="application/atom+xml" href="http://www.cellperformance.com/articles/atom.xml" />
   <id>tag:www.cellperformance.com,2007:/articles//1</id>
    <link rel="service.post" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1" title="CellPerformance" />
    <updated>2007-07-10T07:21:46Z</updated>
    <subtitle>All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 3.2</generator>
 
<entry>
    <title>Fast Matrix Multiplication on Cell (SMP) Systems</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2007/07/fast_matrix_multiplication_on.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=90" title="Fast Matrix Multiplication on Cell (SMP) Systems" />
    <id>tag:www.cellperformance.com,2007:/articles//1.90</id>
    
    <published>2007-07-10T07:17:23Z</published>
    <updated>2007-07-10T07:21:46Z</updated>
    
    <summary>Daniel Hackenberg wrote to tell me about some matrix multiply code he has written for the Cell. </summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[<p>Daniel Hackenberg wrote to tell me about some matrix multiply code he has written for the Cell. <br /><br />
<br /><br />
From his page:<br />
<div class="quote"><br />
This site describes a fast matrix multiplication code for Cell BE processors. It has been developed as part of a seminar paper at the Center for Information Services and High Performance Computing. The program is freely available under the GNU GPL.<br />
</div><br />
<br /><br />
Go ahead and check it out: <a href="http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/architektur_und_leistungsanalyse_von_hochleistungsrechnern/cell/">Fast Matrix Multiplication on Cell (SMP) Systems [tu-dresden.de]</a><br />
</p>]]>
        
    </content>
</entry>
<entry>
    <title>Cleaning House</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2007/06/cleaning_house.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=89" title="Cleaning House" />
    <id>tag:www.cellperformance.com,2007:/articles//1.89</id>
    
    <published>2007-06-29T07:43:27Z</published>
    <updated>2007-07-08T03:07:59Z</updated>
    
    <summary> I&apos;m working on a plan that will make the forums better and more useful. And hopefully, I can get a little help from some friends.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[<div class="sticky-note">
<b>UPDATE! 7 July 2007</b> The <a href="http://forum.beyond3d.com/forumdisplay.php?f=57">new CellPerformance Forums</a> are now up and running, hosted by our friends at Beyond3D. [Thanks guys!]<br />
<br />
I'll be fixing up the links and generally cleaning things up to point all article discussions over to the new forums. It might take a little time, so be patient - but the quality of their forums is great, and I know that the addition of the existing B3D community to our own will drive a lot of good discussion.<br />
<br />
Remember, the main articles will continue to be posted here. Hopefully, a few more than I've had time for in recent months. <br />
<br />
We'll be back up and running full-speed shortly!<br />
<br />
Mike.
</div>

<div class="sticky-note">
Hey everyone! I know our forums have been hacked. You'd think that these kids would have better things to do. You'd also think that they'd appreciate exactly the kind of info we're trying to share here. Dumb. <br />
<br />
Anyway, not worth the effort to worry about them. I'm working on a plan that will make the forums better and more useful. And hopefully, I can get a little help from some friends.<br />
<br />
Stay tuned. It's time for me to get back to this and get all of you more of the info you want!<br />
<br />
Mike.
</div>]]>
        
    </content>
</entry>
<entry>
    <title>Handy PS3 Linux Framebuffer Utilities</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=87" title="Handy PS3 Linux Framebuffer Utilities" />
    <id>tag:www.cellperformance.com,2007:/articles//1.87</id>
    
    <published>2007-03-31T06:14:13Z</published>
    <updated>2007-03-31T07:44:51Z</updated>
    
    <summary>While the documentation within Sony&apos;s vsync example should be enough to get you started with writing to the framebuffer, here are a couple of handy functions to test the framebuffer settings, open the virtual terminal, and get access to the framebuffer.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[While the documentation within Sony's vsync example should be enough to get you started with writing to the framebuffer, here are a couple of handy functions to test the framebuffer settings, open the virtual terminal, and get access to the framebuffer.<br />
<br />
Open the virtual terminal:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_vt.h">cp_vt.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_vt.c">cp_vt.c</a><br />
<br />
Open the framebuffer:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_fb.h">cp_fb.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_fb.c">cp_fb.c</a><br />
<br />
Dump framebuffer info:<br />
<a href="http://www.cellperformance.com/public/attachments/fb_info.c">fb_info.c</a><br />
<br />
<a href="http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html#fb_info">Example output from fb_info</a><br />
<a href="http://www.cellperformance.com/articles/2007/03/handy_ps3_linux_framebuffer_ut.html#fb_use">Example of using cp_vt and cp_fb</a><br />
<br />

<div class="sticky-note">
Files should be compiled with:
<pre class="code">
ppu-gcc -std=c99 -pedantic -W -Wall -O3
</pre>
</div>
]]>
        <![CDATA[<div id="fb_info" class="subtitle">fb_info</div>

fb_info dumps the current settings for the framebuffer setup on the PS3.<br />
<br />
For example - for 480i the output should look something like this:
<pre class="code">
FBIOGET_VBLANK:
  flags:
    FB_VBLANK_VBLANKING   : FALSE
    FB_VBLANK_HBLANKING   : FALSE
    FB_VBLANK_HAVE_VBLANK : FALSE
    FB_VBLANK_HAVE_HBLANK : FALSE
    FB_VBLANK_HAVE_COUNT  : FALSE
    FB_VBLANK_HAVE_VCOUNT : FALSE
    FB_VBLANK_HAVE_HCOUNT : FALSE
    FB_VBLANK_VSYNCING    : FALSE
    FB_VBLANK_HAVE_VSYNC  : TRUE
  count  : 0
  vcount : 1
  hcount : 0
-------------------------------------
FBIOGET_FSCREENINFO:
  id          : "PS3 FB"
  smem_start  : 0x00000000
  smem_len    : 18874368
  type        : FB_TYPE_PACKED_PIXELS (0)
  type_aux    : N/A
  visual      : FB_VISUAL_TRUECOLOR (2)
  xpanstep    : 1
  ypanstep    : 1
  ywrapstep   : 1
  line_length : 2880
  mmio_start  : 0x00000000
  mmio_len    : 0
  accel       : FB_ACCEL_NONE (0)
-------------------------------------
PS3FB_IOCTL_SCREENINFO:
    xres        : 720
    yres        : 480
    xoff        : 72
    yoff        : 48
    num_frames  : 2
-------------------------------------
</pre>

<div id="fb_use" class="subtitle">Using cp_vt and cp_fb</div>

These functions are very simple to use. The user running them should have read/write access to the framebuffer (/dev/fb0) and the main console (/dev/console).

<pre class="code">
{
    cp_vt vt;
    cp_fb fb;

    cp_vt_open_graphics(&vt);
    cp_fb_open(&fb);

    uint32_t frame_ndx = 0;

    while (1)
    {
        uint32_t* const restrict frame_top = (uint32_t*)fb.draw_addr[ frame_ndx ];

        // Write pixel to the frame buffer ...
        // x and y are image position
        // rgb24 is 32bit pixel value (where top 8 bits are unused)

        frame_top[ ( y * fb.stride ) + x ] = rgb24;

        // At the vsync, the previous frame is finished sending to the CRT
        cp_fb_wait_vsync( &fb );

        // Send the frame just drawn to the CRT by the next vblank
        cp_fb_flip( &fb, frame_ndx );

        frame_ndx  = frame_ndx ^ 0x01;
    }

    cp_vt_close(&vt);
    cp_fb_close(&fb);
}
</pre>

A more complete example: <a href="http://www.cellperformance.com/public/attachments/fb_test.c">fb_test.c</a><br />
]]>
    </content>
</entry>
<entry>
    <title>HowTo: Huge TLB pages on PS3 Linux</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2007/01/howto_huge_tlb_pages_on_ps3_li.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=86" title="HowTo: Huge TLB pages on PS3 Linux" />
    <id>tag:www.cellperformance.com,2007:/articles//1.86</id>
    
    <published>2007-01-30T04:26:49Z</published>
    <updated>2007-07-18T07:04:48Z</updated>
    
    <summary>Understanding the TLB and minimizing misses is a critical part of high performance Cell programming. Unfortunately some PS3 kernels do not come with huge page support enabled. Jakub Kurzak and Alfredo Buttari step through the details of recompiling the kernel for huge page support.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[<div class="sticky-note">
<b>Updated! (22 Mar 07) Minor edits. Added notes for YellowDog Linux. Added source code for using huge page allocation.</b> <br />
<b>Updated! (30 Mar 07) A couple minor fixes. Thanks to Guénaël Renault for pointing them out!</b><br />
<b>Updated! (15 July 07) Added notes for kernel 2.6.21</b>
</div>

<div class="sticky-note">
Guest article: Understanding the TLB and minimizing misses is a critical part of high performance Cell programming. Unfortunately some PS3 kernels do not come with huge page support enabled. Jakub Kurzak and Alfredo Buttari step through the details of recompiling the kernel for huge page support.
</div>

The availability of huge TLB pages depends on how the Linux kernel was configured prior to compilation. The default kernel that ships with Fedora Core 5 (and most likely with any other distribution that has binary kernel packages) doesn't include this option. So, in order to have huge TLB pages, it is necessary to reconfigure the kernel, recompile it, and tell the boot loader about the newly created kernel image. Finally, we will also show a way to allocate the huge TLB pages automatically at boot time.<br />
<br />

<div class="sticky-note">
[Mike Acton] This process also works with YellowDog Linux virtually unchanged.
</div>]]>
        <![CDATA[<div class="subtitle">Rebuilding the PS3 Linux Kernel</div>

<div class="sticky-note">
[Mike Acton] For more detailed information on the Linux Kernel and the build process, see: 
<ul>
<li><a href="http://www.faqs.org/docs/Linux-HOWTO/Kernel-HOWTO.html">The Linux Kernel HOWTO [faqs.org]</a></li>
<li><a href="http://www.cellperformance.com/public/linux-20061110-docs/LinuxKernelOverview.html">PS3 Linux Kernel Overview</a></li>
<li>Also see: <a href="http://julipedia.blogspot.com/2007/03/building-updated-kernel-for-ps3.html">Building an Updated Kernel for PS3 [julipedia.blogspot.com]</a></li>
<li>Also see: <a href="http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-nfs-root-howto.txt">PS3 NFS Root File System HOWTO</a> by Geoff Levand (PS3 kernel maintainer)</li>
</ul>
</div>

<div class="sticky-note">
[Mike Acton] For more information on using huge TLB pages, especially from user space, read <a href="http://www.gelato.unsw.edu.au/lxr/source/Documentation/vm/hugetlbpage.txt?v=2.6.16;a=ppc">hugetlbpage.txt</a>, which is found in the kernel source under Documentation/vm/
</div>

Here are the steps:<br />
<br />

<ol>
<li>Recompile the kernel in order to have huge TLB pages
<ol>

<li> Take the kernel source from the add-on cd (filename is linux-20061110.tar.bz2)
<div class="sticky-note">[Mike Acton] Download the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-Source-PlayStation-3/pg/12/fid/11310/catid/514">PS3 Source Add-On CD [qj.net]</a>.</div>

<div class="sticky-note">[Mike Acton] A more recent kernel (2.6.21 as of this update) and its sources can be found in the more recent add-on disc package (CELL-Linux-CL_20070516-ADDON), which is available from various Linux mirrors: <br />
<ul>
<li><a href="http://ftp.uk.linux.org/pub/linux/Sony-PS3/">http://ftp.uk.linux.org/pub/linux/Sony-PS3/</a></li>
<li><a href="http://www.kernel.org/pub/linux/kernel/people/geoff/cell/">http://www.kernel.org/pub/linux/kernel/people/geoff/cell/</a></li>
<li><a href="http://ftp.riken.go.jp/pub/Linux/kernel/people/geoff/cell/">http://ftp.riken.go.jp/pub/Linux/kernel/people/geoff/cell/</a></li>
</ul>
</div>

</li>
<li> unpack it in the /usr/src directory

</li>
<li> make a link:
<pre class="code">
	$ ln -s /usr/src/linux-20061110 /usr/src/linux
</pre>
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
<pre class="code">
	$ ln -s /usr/src/linux-2.6.21-20070425 /usr/src/linux
</pre>
</div>
</li>


<li> prepare for kernel configuration:
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
To build a more recent kernel you will need to install a few things first:

<ol>
<li><a href="http://www.methods.co.nz/asciidoc/index.html">AsciiDoc</a>. Download: <a href="http://www.methods.co.nz/asciidoc/asciidoc-8.2.1.tar.gz">asciidoc-8.2.1.tar.gz [methods.co.nz]</a>
<pre class="code">
$ cd /usr/src
$ tar xzvf asciidoc-8.2.1.tar.gz
$ cd asciidoc-8.2.1
$ ./install.sh
</pre>
</li>

<li><a href="http://cyberelk.net/tim/software/xmlto/">xmlto</a>. Download: <a href="http://cyberelk.net/tim/data/xmlto/stable/xmlto-0.0.18.tar.bz2">xmlto-0.0.18.tar.bz2 [cyberelk.net]</a>
<pre class="code">
$ cd /usr/src
$ tar xjvf xmlto-0.0.18.tar.bz2
$ cd xmlto-0.0.18
$ ./configure
$ make
$ make install 
</pre>
</li>

<li><a href="http://git.or.cz/">git</a>, a revision control system. Download: <a href="http://www.kernel.org/pub/software/scm/git/git-1.5.2.tar.gz">git 1.5.2 [kernel.org]</a>
<pre class="code">
$ cd /usr/src
$ tar xzvf git-1.5.2.tar.gz
$ cd git-1.5.2
$ make prefix=/usr all doc
$ make prefix=/usr install install-doc 
</pre>
</li>

<li><a href="http://dtc.ozlabs.org/">dtc</a> (Device Tree Compiler). NOTE: To build the kernel, you need a version newer than the dtc-20060419.tar.gz version available on the dtc web page.
<pre class="code">
$ cd /usr/src
$ git clone git://www.jdl.com/software/dtc.git 
$ cd dtc
$ make
$ make install
</pre>
</li>
</ol>

</div>
</li>

<li><div class="sticky-note">[Mike Acton] Run mrproper before make to clean out any older build data, if you have any.</div>
<pre class="code">
$ make mrproper
</pre>
</li>

<li> copy the kernel config file that comes with the Fedora installation into /usr/src/linux
<pre class="code">
$ cp /boot/config-2.6.16 /usr/src/linux/.config
</pre>

<div class="sticky-note">
[Mike Acton] On YellowDog Linux, this file is /boot/config-2.6.16-20061110.ydl.1ps3
</div>
<div class="sticky-note">[Mike Acton] <b>For Linux 2.6.21:</b><br />
The config file has been updated significantly since the original 2.6.16 release. It's much easier to start with the file included in the kernel distribution. 
<pre class="code">
$ cd /usr/src/linux
$ cp arch/powerpc/configs/ps3_defconfig .config
</pre>
</div>
</li>

<li>This next step goes through the old configuration file and prompts the user whenever 
     a new kernel option that is not present in the old kernel is encountered (none in this case
     since the old and the new kernels are exactly the same version)
<pre class="code">
$ make oldconfig
</pre>
<div class="sticky-note">[Mike Acton] <b>For Linux 2.6.21:</b> There's no need for this step if you copied the file from the kernel distribution itself.
</div>
</li>

<li> enable huge TLB pages in the kernel configuration
<pre class="code">
$ make menuconfig
</pre>
     Now go to File systems --> Pseudo filesystems and enable huge TLB pages by pressing
     the space bar on the "HugeTLB file system support" option. Now select "exit" repeatedly and
     answer "yes" when asked to save the new kernel configuration
</li>

<li> compile kernel and modules and install modules (it will take around 20 minutes):
<pre class="code">
$ make all
$ make modules_install
</pre>
</li>
</ol>
</li>

<li> install the new kernel:
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b> Replace references to 2.6.16 with 2.6.21 in this and the following steps.
</div>
<pre class="code">
$ cp /usr/src/linux/vmlinux /boot/vmlinux-2.6.16_HTLB
</pre>
</li>
<li> create a ramdisk image for the new kernel:

<pre class="code">
$ mkinitrd /boot/initrd-2.6.16_HTLB.img 2.6.16
</pre>

<div class="sticky-note">
[Mike Acton] On YellowDog Linux, mkinitrd lives in /sbin.
</div>

<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
<i>"When I do mkinitrd, it says: No modules available for kernel "2.6.21". What's up?"</i><br />
<br />
The problem is that this version of the kernel isn't installed as "2.6.21"; it's installed as "2.6.21-rc7". You can discover that by looking in /lib/modules:
<pre class="code">
$ ls -l /lib/modules
total 16
drwxr-xr-x 3 root root 4096 Mar 22 05:57 2.6.16
drwxr-xr-x 5 root root 4096 Jan 19 06:06 2.6.16-20061110.ydl.1ps3
drwxr-xr-x 3 root root 4096 Jul 15 08:24 2.6.20
drwxr-xr-x 3 root root 4096 Jul 17 06:22 2.6.21-rc7
</pre>
So the actual command you need to run is:
<pre class="code">
$ mkinitrd /boot/initrd-2.6.21_HTLB.img 2.6.21-rc7
</pre>
</div>

</li>
<li> tell the bootloader (kboot) where the new kernel is:
<pre class="code">
$ vim /etc/kboot.conf
</pre>
     add the following line
<pre class="code">
linux_htlb='/boot/vmlinux-2.6.16_HTLB initrd=/boot/initrd-2.6.16_HTLB.img'
<div class="sticky-note">[Mike Acton] For YellowDog Linux, use:
ydl_htlb      ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:3 rhgb'
ydl480i_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:1 rhgb'
ydl1080i_htlb ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:4 rhgb'
ydltext_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init 3'</div>
</pre>
     if you want this kernel to be loaded by default then change the "default" line into
<pre class="code">
default=linux_htlb
<div class="sticky-note">[Mike Acton] For YellowDog Linux, use one of the modes above.</div>
</pre>
</li>

<li> set up the boot process to allocate huge TLB pages. (Pick one of the following two options)
<ol>
<li> OPTION 1:
<pre class="code">
$ vim /etc/rc.local
</pre>
     add the following lines:
<pre class="code">
mkdir -p /huge
echo 20 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs nodev /huge
chown root:root /huge
chmod 755 /huge
</pre>
     be sure to change the "chown" line according to your system settings.
</li>


<li> OPTION 2: create a /etc/init.d/htlb script with the following content:<br />
<div class="quote"><i>All the commands added to the rc.local file in the previous step are executed at the end of the boot sequence.
This means that the huge TLB page allocation is performed when much of the system memory has
already been allocated by other processes. This results in the allocation of only 6 or 7 pages. To obtain
a few more pages (8 or 9), we have to move the huge TLB page allocation earlier in the boot sequence (i.e. to
runlevel-1)</i>
</div>
<br />

<div class="sticky-note">
[Mike Acton] chkconfig required some additional settings not in the previous version of this script. Modified version is here:
<pre class="code">
	#!/bin/sh
	#
	# htlb:	Start/stop huge TLB pages allocation
	#
        # [Mike Acton] The runlevel and priority settings for chkconfig are stolen straight out of cpuspeed.
        
        # chkconfig: 12345 06 99
        # description: Start/stop huge TLB pages allocation

	. /etc/rc.d/init.d/functions

	start()
	{
	    mkdir -p /huge
	    echo 20 > /proc/sys/vm/nr_hugepages
	    mount -t hugetlbfs nodev /huge
	    chown root:root /huge
	    chmod 775 /huge
        }

	stop()
	{
	    echo 0 > /proc/sys/vm/nr_hugepages
	}
	
	case "$1" in
	  start)
		start
		;;
	  stop)
		stop
		;;
	  restart|reload)
	        stop
	        start
	        ;;
	  *)
	        echo $"Usage: $0 {start|stop|restart|reload}"
	        exit 1
		;;
	esac
	
	exit 0
</pre>
</div>
Make the new service executable:
<pre class="code">
$ chmod a+x /etc/init.d/htlb
</pre>
Add the service to runlevel-1:
<pre class="code">
$ /sbin/chkconfig --add htlb
</pre>
</li>
</ol>
</li>

<li> reboot. During the boot process, when presented with the "kboot:" prompt, you'll be able to choose your kernel using the "tab" key.
</li>

</ol>
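
Before rebooting, two quick sanity checks can save a failed boot: confirming that the saved configuration really has huge TLB support, and finding the exact release name that mkinitrd expects. The following is a minimal sketch, not part of the original procedure. The function names hugetlbfs_enabled and kernel_release_for are hypothetical helpers, and the paths in the usage comments assume the /usr/src/linux link and the /lib/modules layout described in the steps above. CONFIG_HUGETLBFS is the kernel config symbol behind the "HugeTLB file system support" menu entry.<br />

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are assumptions for illustration.

# Succeeds if the given kernel .config has HugeTLB file system support built in.
# $1 = path to a kernel config file
hugetlbfs_enabled() {
    grep -q '^CONFIG_HUGETLBFS=y' "$1"
}

# Prints the names of the directories under a modules directory that start
# with the given version (for example 2.6.21 may turn out to be 2.6.21-rc7).
# $1 = modules directory, $2 = version prefix
kernel_release_for() {
    for dir in "$1"/"$2"*; do
        if [ -d "$dir" ]; then
            basename "$dir"
        fi
    done
}

# Usage on the PS3 (commented out so the sketch has no side effects):
# hugetlbfs_enabled /usr/src/linux/.config && echo "HugeTLB support enabled"
# kernel_release_for /lib/modules 2.6.21
```

If kernel_release_for prints more than one name, pick the one that matches the kernel you just built and pass it to mkinitrd.<br />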

<div class="sticky-note">
[Mike Acton] Verify that huge pages are now installed and working:
<pre class="code">
$ cat /proc/meminfo | grep Huge
</pre>

You should see something like:

<pre class="code">
HugePages_Total:     8
HugePages_Free:      8
Hugepagesize:    16384 kB
</pre>

and...
<pre class="code">
$ cat /proc/filesystems  | grep huge
</pre>

You should see something like:

<pre class="code">
nodev   hugetlbfs
</pre>
</div>
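
The values reported by /proc/meminfo can be combined into a single figure: the total amount of memory reserved for huge pages. A minimal sketch, assuming the field layout shown above; hugepage_total_kb is a hypothetical helper name, not an existing tool.<br />

```shell
#!/bin/sh
# Hypothetical helper: prints the total memory reserved for huge pages, in kB.
# $1 = a meminfo-style file; defaults to /proc/meminfo when omitted.
hugepage_total_kb() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }     # number of reserved pages
        $1 == "Hugepagesize:"    { size_kb = $2 }   # size of one page in kB
        END { print total * size_kb }
    ' "${1:-/proc/meminfo}"
}

# Usage on the PS3 (commented out so the sketch has no side effects):
# hugepage_total_kb
```

With the sample output above (8 pages of 16384 kB each) this prints 131072, which is 128 MB.<br />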


<div class="sticky-note">
[Mike Acton] Here are some helper functions for allocating and freeing huge memory:<br />
<br />
<a href="http://www.cellperformance.com/public/attachments/cp_hugemem.c">cp_hugemem.c</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_hugemem.h">cp_hugemem.h</a><br />
<br />
They are very simple to use:
<pre class="code">
{
    // Allocate...
    const size_t  hmem_size = 128 * 1024 * 1024;
    cp_hugemem    hmem;

    int was_hugemem_allocated = cp_hugemem_alloc( &hmem, hmem_size );
    if ( !was_hugemem_allocated )
    {
        fprintf(stderr,"Error: Could not allocate hugemem\n");
        return (-1);
    }

    // Use the memory...
    char* ptr = (char*)hmem.addr;

    // Free...
    cp_hugemem_free( &hmem );
}
</pre>
</div>
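
An allocation like the 128 MB request above can only succeed if enough huge pages were reserved at boot. Here is a rough sketch of the arithmetic, assuming the 16384 kB page size reported by /proc/meminfo earlier. hugepages_needed is a hypothetical helper name, and it says nothing about how cp_hugemem_alloc is implemented.<br />

```shell
#!/bin/sh
# Hypothetical helper: prints how many huge pages a request needs.
# $1 = request size in bytes, $2 = huge page size in kB (Hugepagesize)
hugepages_needed() {
    page_bytes=$(( $2 * 1024 ))
    # Round up: a partially used page still occupies a whole huge page.
    echo $(( ($1 + page_bytes - 1) / page_bytes ))
}

# The 128 MB request from the example above, with 16384 kB pages:
hugepages_needed $(( 128 * 1024 * 1024 )) 16384
```

This prints 8. Since allocating from rc.local reportedly yields only 6 or 7 pages, check HugePages_Total before relying on a request of this size; the earlier allocation in OPTION 2 is the safer choice here.<br />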

<div class="subtitle">About the Authors</div>

<b><a href="http://www.cs.utk.edu/~kurzak/">Jakub Kurzak</a> AKA Koobas</b> is a researcher at the University of Tennessee, Knoxville, and a member of the Innovative Computing Lab (ICL - http://icl.cs.utk.edu/), where he mostly does things related to programming multi-core processors and the Cell processor. Before that he was a student at the University of Houston, where he dealt with programming distributed memory machines using message passing (MPI). Jakub's interests are in parallel programming techniques (message passing, multi-threading), parallel number crunching algorithms, and performance optimization.<br />
<br />

<b><a href="http://www.cs.utk.edu/~buttari/">Alfredo Buttari</a></b> is a research associate at the Computer Science dept. of the University of Tennessee, Knoxville. Alfredo is a member of the Innovative Computing Laboratory, which deals with many aspects of High Performance Computing. His interests are in developing high performance software for Linear Algebra, which is mostly achieved through parallel programming techniques of all sorts (MPI, OpenMP, threads...), including more exotic approaches like the Cell programming model. Before coming to Tennessee, Alfredo earned a PhD and a Master's degree in Computer Science from the "Tor Vergata" University of Rome (Italy).<br />


]]>
    </content>
</entry>
<entry>
    <title>Cross-compiling for PS3 Linux</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=82" title="Cross-compiling for PS3 Linux" />
    <id>tag:www.cellperformance.com,2006:/articles//1.82</id>
    
    <published>2006-11-29T08:13:56Z</published>
    <updated>2006-12-22T19:39:30Z</updated>
    
    <summary>In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[Now that the PS3 is out and multiple Linux-based distributions are available that can be installed using <a href="http://www.playstation.com/ps3-openplatform/index.html">Open Platform [playstation.com]</a>, it's time to start developing on some publicly available hardware!<br />
<br />
Although the PPU and SPU compilers can be installed and used on the PS3 directly, I find it much more familiar and convenient to cross-compile from my desktop and just ship the resulting executables over to the target (PS3). <br />
<br />
In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

<ul>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_linux">Install Linux</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_libspe2">Install elfspe2 and libspe2 on PS3</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_toolchain">Install toolchain on host PC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#install_libspe2_host">Install libspe2 on host PC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#build_hello_libspe2">Building Hello World (for libspe2)</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#hello_source_libspe2">Hello World source (for libspe2)</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#using_ibm_sdk">Using the IBM SDK</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#access_ps3_over_vnc">Access the PS3 Over VNC</a></li>
<li><a href="http://www.cellperformance.com/articles/2006/11/crosscompiling_for_ps3_linux.html#upgrade_libspe">Upgrade libspe and libspe2</a></li>
</ul>]]>
        <![CDATA[<div class="subtitle" id="install_linux"> Install Linux </div> 
I have successfully compiled and run using both <a href="http://www.terrasoftsolutions.com/products/ydl/">Yellow Dog Linux [terrasoftsolutions.com]</a> and <a href="http://fedora.redhat.com/">Fedora Core [redhat.com]</a>.<br/> <br/> This article assumes that Linux is already installed on the PS3. It's very easy to install and the process is already documented quite well.<br/> <br/> 

Carl Bender over at <a href="http://www.ps3pc.net">PS3PC.net</a> has written a very good guide on <a href="http://linuxps3.net/index.php?option=com_content&task=view&id=33&Itemid=32">Installing Fedora 5 Linux on Your PS3 [linuxps3.net]</a><br /><br />


See also: <a href="http://www.terrasoftsolutions.com/support/installation/">Installation Guide for Yellow Dog Linux [terrasoftsolutions.com]</a><br/> 
See also: <a href="http://www.cellperformance.com/public/linux-20061110-docs/HowToEnableYourDistro.html">Installation Guide for Fedora Core [cellperformance.com]</a><br/> 
See also: <a href="http://www.pslinux.org/index.php?title=Main_Page">Linux on the Playstation 3 Wiki [pslinux.org]</a><br/>
See also: <a href="http://www.daniel.jp/joomla/info/ps3/installing-gentoo-on-the-ps3.html">Installing Gentoo on the PS3 [daniel.jp]</a><br />
<br/>

<div class="sticky-note"> 
<b>
NOTE: For the sake of this article, Yellow Dog Linux 5 (32 bit version for PS3) will be assumed. A 32 bit host PowerPC Fedora Core 5 installation will also be assumed (although 64 bit and x64 versions of the libraries are available for other types of hosts).
</b>
</div>

<div class="sticky-note"> 
<span class="monospace-strong">cat /proc/cpuinfo</span> (For the Target PS3)<br/> 
<pre class="code">
processor : 0
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)

processor : 1
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)

timebase : 79800000
machine : PS3PF
</pre>
<br />

<span class="monospace-strong">cat /proc/interrupts</span> (For the Target PS3)<br/> 
<pre class="code">
 CPU0 CPU1
 10: 19437 0 PS3PF irq controller Edge ehci_hcd:usb1
 11: 20767742 0 PS3PF irq controller Edge ehci_hcd:usb2
 16: 0 0 PS3PF irq controller Edge ohci_hcd:usb3
 17: 0 0 PS3PF irq controller Edge ohci_hcd:usb4
128: 0 574866 PS3PF irq controller Edge IPI0 (call function)
129: 0 3024105 PS3PF irq controller Edge IPI1 (reschedule)
130: 0 0 PS3PF irq controller Edge IPI2 (unused)
131: 0 0 PS3PF irq controller Edge IPI3 (debugger break)
132: 555759 0 PS3PF irq controller Edge IPI0 (call function)
133: 2998857 0 PS3PF irq controller Edge IPI1 (reschedule)
134: 0 0 PS3PF irq controller Edge IPI2 (unused)
135: 0 0 PS3PF irq controller Edge IPI3 (debugger break)
136: 0 0 PS3PF irq controller Edge Virtual UART
137: 0 0 PS3PF irq controller Edge spe00.0
138: 1 0 PS3PF irq controller Edge spe00.1
139: 7 0 PS3PF irq controller Edge spe00.2
140: 0 0 PS3PF irq controller Edge spe01.0
141: 2 0 PS3PF irq controller Edge spe01.1
142: 6 0 PS3PF irq controller Edge spe01.2
143: 0 0 PS3PF irq controller Edge spe02.0
144: 2 0 PS3PF irq controller Edge spe02.1
145: 6 0 PS3PF irq controller Edge spe02.2
146: 0 0 PS3PF irq controller Edge spe03.0
147: 2 0 PS3PF irq controller Edge spe03.1
148: 13 0 PS3PF irq controller Edge spe03.2
149: 0 0 PS3PF irq controller Edge spe04.0
150: 2 0 PS3PF irq controller Edge spe04.1
151: 13 0 PS3PF irq controller Edge spe04.2
152: 0 0 PS3PF irq controller Edge spe05.0
153: 1 0 PS3PF irq controller Edge spe05.1
154: 9 0 PS3PF irq controller Edge spe05.2
155: 27210328 0 PS3PF irq controller Edge ps3fb vsync
156: 1809885 0 PS3PF irq controller Edge PS3PF stor
157: 387328 0 PS3PF irq controller Edge PS3PF stor
158: 65 0 PS3PF irq controller Edge PS3PF stor
159: 1509 0 PS3PF irq controller Edge snd_ps3pf
160: 0 78885 PS3PF irq controller Edge gbec connection
BAD: 0
</pre> 
</div> 

<div class="subtitle" id="install_libspe2"> Install elfspe2 and libspe2 on PS3 </div> 
<b>elfspe2</b> allows SPU executables to be run standalone from the commandline (aka spulets)<br/> 
<b>libspe2</b> is a PPU library for launching and communicating with SPU executables.<br/> 
<br/> 

1. Copy the following files to the PS3. These files can be found on the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-PlayStation-3/pg/12/fid/11308/catid/514">PS3 Linux Add-On Packages CD</a> in the <b>spu</b> directory.<br/> 
<ul>
<li> libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li> 
<li> elfspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li> 
</ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br/> 
<br/> 

<div class="subtitle" id="install_toolchain"> Install toolchain on host PC </div> 

I am using Fedora Core 5 installed on a PowerPC Mac Mini as my host machine for PS3 development. Working from a PowerPC platform is extremely convenient. However, all of the following libraries are also available as i686 packages, or can be recompiled for the i686 platform, if you prefer that.<br/> 
<br/> 

<div class="sticky-note"> 
<span class="monospace-strong">cat /proc/cpuinfo</span> (For the Host PC)<br/> 
<pre class="code">
processor : 0
cpu : 7447A, altivec supported
clock : 1249.999995MHz
revision : 0.2 (pvr 8003 0102)
bogomips : 83.20
timebase : 41620997
machine : PowerMac10,1
motherboard : PowerMac10,1 MacRISC3 Power Macintosh
detected as : 287 (Mac mini)
pmac flags : 00000010
L2 cache : 512K unified
pmac-generation : NewWorld
</pre> 
</div> 

1. Copy the following files to the host PC. These files can be found at <a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/">Barcelona Supercomputing Center, Linux on Cell [bsc.es]</a> under <span class="monospace-strong">Programming Models -&gt; Linux on Cell -&gt; Cell BE Components -&gt; GNU Toolchain</span>. 
<ul> 
<li> ppu-binutils-3.2-4.ppc.rpm </li> 
<li> ppu-gcc-3.2-4.ppc.rpm </li> 
<li> ppu-gcc-c++-3.2-4.ppc.rpm </li> 
<li> ppu-toolchain-3.2-4.src.rpm </li> 
<li> ppu-toolchain-debuginfo-3.2-4.ppc.rpm </li> 
<li> spu-binutils-3.2-6.ppc.rpm </li> 
<li> spu-gcc-3.2-6.ppc.rpm </li> 
<li> spu-gcc-c++-3.2-6.ppc.rpm </li> 
<li> spu-newlib-1.14.0.200610300000-1.ps3pf.ppc.rpm </li> 
<li> spu-toolchain-3.2-6.src.rpm </li> 
<li> spu-toolchain-debuginfo-3.2-6.ppc.rpm </li> 
</ul>
<br />

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br/> 

<div class="subtitle" id="install_libspe2_host"> Install libspe2 on host PC </div>

1. Copy the following files to the host PC. These files can be found on the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-PlayStation-3/pg/12/fid/11308/catid/514">PS3 Linux Add-On Packages CD</a> in the <b>spu</b> directory.<br/> 
<ul> 
<li> libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li> 
<li> libspe2-devel-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li> 
</ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br/> 
<br/> 
<div class="subtitle" id="build_hello_libspe2"> Building Hello World (for libspe2)</div> 

1. On the host PC, compile the example:<br/> 
<br/> 
<span class="monospace-strong">ppu-gcc -m32 ppu_hello.c -lspe2 -o ppu_hello</span><br/> 
<span class="monospace-strong">spu-gcc spu_hello.c -o spu_hello</span><br/> 
<br/> 
NOTE: If the 64-bit support headers and libraries are installed on the host, <span class="monospace-strong">-m32</span> can be omitted from the PPU compilation step.<br/> 
<br/> 

2. Copy the two executables to the PS3.<br/> 
3. To execute <span class="monospace-strong">spu_hello</span> using libspe2, just run <span class="monospace-strong">./ppu_hello</span><br/> 
4. To execute <span class="monospace-strong">spu_hello</span> using elfspe2, just run <span class="monospace-strong">./spu_hello</span> directly.<br/> 
<br/> 

<div class="subtitle" id="hello_source_libspe2"> Hello World source (for libspe2)</div> 

<a href="http://www.cellperformance.com/public/attachments/ppu_hello.c">ppu_hello.c</a> 
<pre class="code">
<span class="line-number">  0</span>#include &lt;stdlib.h&gt;
<span class="line-number">  1</span>#include &lt;libspe2.h&gt;
<span class="line-number">  2</span>
<span class="line-number">  3</span>int
<span class="line-number">  4</span>main()
<span class="line-number">  5</span>{
<span class="line-number">  6</span>  unsigned int          createflags = 0;
<span class="line-number">  7</span>  unsigned int          runflags    = 0;
<span class="line-number">  8</span>  unsigned int          entry       = SPE_DEFAULT_ENTRY;
<span class="line-number">  9</span>  void*                 argp        = NULL;
<span class="line-number"> 10</span>  void*                 envp        = NULL;
<span class="line-number"> 11</span>
<span class="line-number"> 12</span>  spe_program_handle_t* program     = spe_image_open("spu_hello");
<span class="line-number"> 13</span>  spe_context_ptr_t     spe         = spe_context_create(createflags, NULL);
<span class="line-number"> 14</span>  spe_stop_info_t       stop_info;
<span class="line-number"> 15</span>
<span class="line-number"> 16</span>  spe_program_load(spe, program);
<span class="line-number"> 17</span>  spe_context_run(spe, &amp;entry, runflags, argp, envp, &amp;stop_info);
<span class="line-number"> 18</span>  spe_image_close(program);
<span class="line-number"> 19</span>  spe_context_destroy(spe);
<span class="line-number"> 20</span>
<span class="line-number"> 21</span>  return (0);
<span class="line-number"> 22</span>}
</pre>


<a href="http://www.cellperformance.com/public/attachments/spu_hello.c">spu_hello.c</a> 
<pre class="code">
<span class="line-number"> 0</span>#include &lt;stdio.h&gt;
<span class="line-number"> 1</span> 
<span class="line-number"> 2</span>int
<span class="line-number"> 3</span>main( unsigned long spuid )
<span class="line-number"> 4</span>{
<span class="line-number"> 5</span> printf("Hello, World! (From SPU:%lu)\n",spuid);
<span class="line-number"> 6</span> return (0);
<span class="line-number"> 7</span>}
</pre> 

<div class="subtitle" id="using_ibm_sdk"> Using the IBM SDK </div> 

The IBM SDK uses libspe, not libspe2, so libspe must be installed in order to build the IBM libraries and samples.<br/> 
<br/> 

<div class="sticky-note">
<b>What is the difference between libspe and libspe2? Will both continue to be used?</b><br />
<br />
libspe2 is a re-design of libspe. The folks at IBM have strongly implied that libspe is on its way out and we should expect a future revision of the SDK to be refactored for libspe2.<br />
<br />
<b>Roland (RSei)</b> gave an excellent description of reasoning behind the design of libspe2 in <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?message=13896030&cat=46&thread=144504&treeDisplayType=threadmode1&forum=739#13896030">IBM's Cell Broadband Engine Architecture forum</a>:

<div class="quote">
"There have been a number of requirements and issues with libspe1 that led to the design of a new major version with a different API. I'll try to explain a few major aspects just briefly:<br /><br />
1. libspe is supposed to be the "low-level API" to use SPE resources. We think that the "SPE context" introduced in libspe2 is the better low-level construct than the "SPE thread" (as defined in libspe1), which already suggests a particular programming model and view. By using "SPE contexts", it is, e.g., possible to have other models like (synchronous) function offload to SPEs more easily without introducing the complexity and overhead of threading into an application. Another example is the possibility to exchange the code on an SPE, but leaving the data in place, which allows for easy and efficient "chaining" of processing steps und PPE control. In the thread model, this would have to rely on SPE programs using overlays. By the way, it is very easy to have the libspe1 thread model as a special case implemented on top of libspe2 and we have actually done this exercise internally.<br /><br />
2. Many people asked for a more complete "SPE thread library" (similar to what you usually have, e.g., in pthread). By removing the special concept of an "SPE thread" (in the libspe1 sense), we are actually addressing this requirement. When using libspe2, the programmer relies on the thread package of choice and just uses SPEs in these threads. All thread-specific aspects of the application are standard - so you have full functionality.<br /><br />
3. There were many complaints about the event API in libspe1 - from usability to efficiency. We think, we found a good solution in libspe2.<br /><br />
4. We feel that the "SPE groups" in libspe1 were tieing together rather orthogonal concepts like scheduling and event handling. So we gave up this construct. You may have noticed that we introduced "SPE gang contexts" and you have probably already guessed that we are working on gang scheduling to leverage this - but "gangs" are purely a scheduling construct and do *not* replace the previous groups.<br /><br />
5. You are right that binding threads to specific, physical SPEs has been part of the libspe1 API, although it had never been implemented. There are many discussions about this feature. At this point, we don't have a conclusive answer how we want to support "affinity" of threads to physical SPE resources. We simply felt we are not ready yet to define the API and stick to it in the future."<br />
</div>

</div>

1. Copy the following files to the host PC. These files can be found at <a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/">Barcelona Supercomputing Center, Linux on Cell [bsc.es]</a> under <span class="monospace-strong">Programming Models -&gt; Linux on Cell -&gt; Cell BE Components -&gt; GNU Toolchain</span>. 
<ul> 
<li> libspe-1.1.0-1.ppc.rpm </li> 
<li> libspe-debuginfo-1.1.0-1.ppc.rpm </li> 
<li> libspe-devel-1.1.0-1.ppc.rpm </li> 
</ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br/> <br/> 
3. Copy the libspe libraries from the Host PC at <span class="monospace-strong">/usr/lib/libspe.so.*</span> to <span class="monospace-strong">/usr/lib/</span> on the PS3.<br />
4. Copy the following file onto the host PC. This file can be found at IBM alphaWorks' <a href="http://www.alphaworks.ibm.com/tech/cellsw/download">IBM Cell Broadband Engine Software Development Kit download page</a>. You will need to agree to the licenses in order to download the file. 
<ul> 
<li> cell-sdk-lib-samples-1.1-10.noarch.rpm </li> 
</ul> 

5. As root, <span class="monospace-strong">rpm -ivh cell-sdk-lib-samples-1.1-10.noarch.rpm</span>. The source files should now be installed in <span class="monospace-strong">/opt/IBM/cell-sdk-1.1</span>.<br/> 
6. Only minor modifications are needed to cross-compile the SDK.<br/> 
<ul> 
<li> <span class="monospace-strong">cd /opt/IBM/cell-sdk-1.1</span> </li> 
<li> Open <span class="monospace-strong">make.footer</span> </li> 
<li> Search for (starting at line 84 in my copy):<br/> 
<pre class="code">
########################################################################
# Common GNU Defines (Host, PPU32, PPU64, SPU)
########################################################################
</pre> 
</li> 
<li> Delete the following section (starting at line 91 in my copy): 
<pre class="code">
ifeq "$(HOST_PROCESSOR)" "ppc64"
 SCE_ROOT =
 SCE_SYSROOT =
 SCE_PPU_BINDIR = /usr/bin
 SCE_SPU_BINDIR = /usr/bin
 PPU_TOOL_PREFIX =
 PPU32_TOOL_PREFIX =
else
 # SCE_VERSION is defined in environment or in make.env
 SCE_ROOT = /opt/sce/$(SCE_VERSION)
 SCE_SYSROOT = $(SCE_ROOT)/ppu/sysroot
 SCE_PPU_BINDIR = $(SCE_ROOT)/ppu/bin
 SCE_SPU_BINDIR = $(SCE_ROOT)/spu/bin
 PPU_TOOL_PREFIX = $(PPU_PREFIX)
 PPU32_TOOL_PREFIX = $(PPU32_PREFIX)
endif
</pre>
</li> 
<li> Insert the following section at the same location: 
<pre class="code">
 SCE_ROOT =
 SCE_SYSROOT =
 SCE_PPU_BINDIR = /usr/bin
 SCE_SPU_BINDIR = /usr/bin</pre> 
</li> 
<li> If 64 bit support is not installed, search for (line 150 in my copy):<br/> 
<pre class="code">
#********************
# 64-bit PPU Targets
#********************
</pre> 
</li> 
<li> If 64 bit support is not installed, delete the following lines:<br/> 
<pre class="code">
PPU64_TARGETS := $(strip $(PROGRAM_ppu64) \
 $(PROGRAMS_ppu64) \
 $(LIBRARY_ppu64) \
 $(SHARED_LIBRARY_ppu64))

ifdef PPU64_TARGETS
 TARGET_PROCESSOR := ppu64
endif
</pre> 
</li> 
<li> Save the changes </li> 
</ul> 
7. If GLUT is not installed on the host PC, install it (for Fedora-based hosts) with <span class="monospace-strong">yum install freeglut-devel</span><br/> 
8. The SDK and samples should now build without errors: <span class="monospace-strong">cd src; make</span> (although quite a few warnings will be generated; the SDK contains some non-standards-compliant code that should be fixed).<br/> 
9. Copy the following files from the host PC to the target PS3's <span class="monospace-strong">/usr/lib</span> directory. 
<ul> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/matrix/ppu_shared/libmatrix.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/image/ppu_shared/libimage.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/vector/ppu_shared/libvector.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/surface/ppu_shared/libsurface.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/noise/ppu_shared/libnoise.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/fft/ppu_shared/libfft.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/gmath/ppu_shared/libgmath.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/math/ppu_shared/libmath.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/misc/ppu_shared/libmisc.so </li> 
<li> /opt/IBM/cell-sdk-1.1/src/lib/audio_resample/ppu_shared/libaudio_resample.so </li> 
</ul> 
10. Now anything built with the IBM SDK should be able to run on the PS3.<br/> 

<div class="subtitle" id="access_ps3_over_vnc"> Access the PS3 Over VNC </div> 

I have two PlayStation 3 units and only one HDMI input on my HDTV, and that input is reserved for playing games, not development. So the PS3 I use for development is headless. The vast majority of the time, a simple secure shell to the PS3 is all I need. Occasionally, though, I want to use the machine as though I were sitting in front of it, and that is what VNC is for.<br/> 
<br/> 

How to set up VNC on the PS3 (for Yellow Dog Linux):<br/> 
1. Secure shell from the host to the PS3 with X11 using: <span class="monospace-strong">ssh -X [PS3_IP_ADDRESS]</span><br/> 
2. On the PS3, launch the firewall security settings application using: <span class="monospace-strong">system-config-securitylevel</span>. At this point you will need to enter the root password for the PS3.<br/> 
3. Click on "Other ports", then "+ Add" and add port 5901 (TCP). This will allow the VNC connection through the firewall running on the PS3. Go ahead and close the application.<br/> 
4. On the PS3, run the VNC server using: <span class="monospace-strong">vncserver</span>. If this is the first time you've run the server, you will need to provide a password that will be used to access the machine.<br/> 
5. On the host PC, start the VNC client using: <span class="monospace-strong">vncviewer [PS3_IP_ADDRESS]:[DISPLAY_NUMBER]</span>. The display number was printed when the server was started. It defaults to 1 (ONE).<br/> 
6. After you enter the password, you should now see the PS3 window manager running with an open shell by default.<br/> 
7. In order to kill the VNC server use: <span class="monospace-strong">vncserver -kill :[DISPLAY_NUMBER]</span><br/> 
8. In order to use the default Yellow Dog window manager (Enlightenment), uncomment the following lines in ~/.vnc/xstartup on the PS3 and restart the server.<br/> 

<pre class="code">
unset SESSION_MANAGER
exec /etc/X11/xinit/xinitrc
</pre> 
<br/> 

The only practical difference between using the PS3 over VNC and using it locally appears when you write graphics directly to the framebuffer: that output is shown only on the locally connected display.<br/> <br/> 

<div class="subtitle" id="upgrade_libspe"> Upgrade libspe and libspe2 </div> 

The official releases of libspe and libspe2 that were available at launch have some minor issues that were patched recently. Both libraries are being actively developed, and there will always be new patches available for brave developers. A cumulative release that includes the patches through December 6 is available.<br />
<br /> 

To build and install the latest version:<br/> <br /> 

1. Download the following files from <a href="http://ozlabs.org/pipermail/cbe-oss-dev/2006-December/000682.html"> [Cbe-oss-dev] libspe and libspe2 december release</a> to the Host PC.
<ul>
<li>libspe-1.2.0.tar.gz</li>
<li>libspe2-2.0.1.tar.gz</li>
</ul>
The files will probably need to be renamed locally after download.<br />
2. Untar the two files with:
<ul>
<li><span class="monospace-strong">tar xzvf libspe-1.2.0.tar.gz</span></li>
<li><span class="monospace-strong">tar xzvf libspe2-2.0.1.tar.gz</span></li>
</ul>
3. In the <span class="monospace-strong">libspe2-2.0.1</span> directory, open the <span class="monospace-strong">make.defines</span> file, and change the equivalent section to be:
<pre class="code">
ifeq "$(CROSS_COMPILE)" "1"
SYSROOT ?= sysroot
prefix ?= /usr
CROSS ?= ppu-
EXTRA_CFLAGS = -m32 -mabi=altivec
else
</pre>
4. Save the file, then apply the patches for <span class="monospace-strong">speevent</span> using:
<pre class="code">
patch -p1 &lt; initevent.diff
patch -p1 &lt; event-public.diff
patch -p1 &lt; make_speevent_thread_safe.diff
</pre>
5. Build the library using: <span class="monospace-strong">make; make install</span><br />
6. Copy all the files (recursively) in the <span class="monospace-strong">libspe2-2.0.1/sysroot/usr/</span> directory to the <span class="monospace-strong">/usr/</span> directory on the PS3 <b>and</b> the Host PC.<br />
7. In the <span class="monospace-strong">libspe-1.2.0</span> directory, open the <span class="monospace-strong">Makefile</span> file, and change the equivalent section to be:
<pre class="code">
ifeq "$(CROSS_COMPILE)" "1"
SYSROOT ?= sysroot
prefix ?= /usr
CROSS ?= ppu-
EXTRA_CFLAGS = -m32 -mabi=altivec
else
</pre>
8. Save the file, then build the library using: <span class="monospace-strong">make; make install</span><br />
9. Copy all the files (recursively) in the <span class="monospace-strong">libspe-1.2.0/sysroot/usr/</span> directory to the <span class="monospace-strong">/usr/</span> directory on the PS3 <b>and</b> the Host PC.<br />
<br />
Congratulations, <span class="monospace-strong">libspe-1.2.0</span> and <span class="monospace-strong">libspe2-2.0.1</span> are now installed on the PS3 and will be used by any applications that are dynamically linked to either of those libraries.<br />
<br />
<div class="sticky-note">
Special thanks to <b>Dirk Herrendoerfer</b> for both making the release available and for answering my questions on the build procedures.
</div>

]]>
    </content>
</entry>
<entry>
    <title>Unaligned scalar load and store on the SPU</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2006/09/unaligned_scalar_load_and_stor_1.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=77" title="Unaligned scalar load and store on the SPU" />
    <id>tag:www.cellperformance.com,2006:/articles//1.77</id>
    
    <published>2006-09-15T10:59:18Z</published>
    <updated>2006-09-15T11:55:59Z</updated>
    
    <summary>An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set and accesses local memory only in 16-byte quadwords.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[Albert Noll, a student at UC Irvine, is working on an interesting project. According to him:
<div class="quote">
"I am currently working on a java virtual machine runtime environment which
hides the heterogenity of the cell architecture. Conventional java code
code be executed and benefit from the numerous execution units the Cell
architecture offers. I am doing some java benchmarks (java grande) to test
the efficiancy of the implementation, but I still have some problems
achieving really good results."
</div>
<br />
One of the problems Albert has encountered recently is loading scalar doubles from, and storing them to, the Java stack. He recently posed a question on the <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=135517&cat=46">Cell Broadband Engine Architecture forum [ibm.com]</a>:
<div class="quote">
"I have the following problem: I want to load a double value from an array (represents stack of the application)
which is of type unsigned int. The two 32-bit values, which
represent the double value have been casted to a double before, so the bits
are set according to the double representation of the value."
</div>
<br />
The solution to this problem is to remember that the SPU does not have a scalar instruction set and accesses local memory only in 16-byte quadwords. The ability to compile scalar code on the SPU is something of a convenience, but it doesn't come without a penalty. <br />
<br />
The first step, before considering performance, is to be able to load and store the unaligned double values correctly. <br />
<br />]]>
        <![CDATA[Instead of:
<div class="code">
{
  // [ostack is unsigned int]
  double tmp[2];
  memcpy(tmp, ostack - 4, 16);

  double res = tmp[0] * tmp[1];
  memcpy(ostack - 4, &res, 8);
  ostack -= 2;
}
</div>

Simplify a bit:
<div class="code">
{
  const double arg0   = <a href="#unaligned_load_double">unaligned_load_double</a>( ostack-4 );
  const double arg1   = <a href="#unaligned_load_double">unaligned_load_double</a>( ostack-2 );
  const double result = arg0 * arg1;

  <a href="#unaligned_store_double">unaligned_store_double</a>( ostack-4, result );
  ostack -= 2;
}
</div>

After Albert gets his basic scalar code working, the next two steps to optimizing this access would probably be:<br />
<ul>
<li>Instead of loading and storing each individual element, keep a "cache" of a couple of quadwords in the stack that are being worked with, and keep using them until the data must be flushed (stored) or a new line must be loaded.</li>
<li>Start moving away from the scalar implementation completely. Both the SPU compiler and the SPU itself are most effective when working with vectors.</li>
</ul>
<br />
Have a look at the unaligned load and store functions below, and hopefully it will be clear why scalar values necessarily complicate things on the SPU and ultimately lead to poorer performance.<br />
<a href="#unaligned_load_double">double unaligned_load_double( const char* const arg0 );</a><br />
<a href="#unaligned_store_double">void   unaligned_store_double( char* const arg0, double arg1 );</a><br />
<br />
Or download the files directly:<br />
<a href="http://www.cellperformance.com/public/attachments/unaligned_double.c">unaligned_double.c</a><br />
<a href="http://www.cellperformance.com/public/attachments/unaligned_double.h">unaligned_double.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/vsl.c">vsl.c</a><br />
<a href="http://www.cellperformance.com/public/attachments/vsl.h">vsl.h</a><br />
<br />
<div class="sticky-note">
If you have a question you think <a href="http://www.cellperformance.com/mike_acton">Mike Acton</a> or one of the other members of CellPerformance can help you with, feel free to <a href="mailto:macton@cellperformance.com">email</a> or post in our <a href="http://cellperformance.com/phpBB2/">forums</a>.
</div>
<br />
unaligned_double.h<br />
<div class="code">
<span class="line-number">  0</span>#ifndef UNALIGNED_DOUBLE_H
<span class="line-number">  1</span>#define UNALIGNED_DOUBLE_H
<span class="line-number">  2</span>
<span class="line-number">  3</span>double unaligned_load_double( const char* const arg0 );
<span class="line-number">  4</span>void   unaligned_store_double( char* const arg0, double arg1 );
<span class="line-number">  5</span>
<span class="line-number">  6</span>#endif
</div>

<br />
unaligned_double.c<br />
<div class="code">
<span class="line-number">  0</span>#include "unaligned_double.h"
<span class="line-number">  1</span>#include "vsl.h"
<span class="line-number">  2</span>#include &lt;stdint.h&gt;
<span class="line-number">  3</span>#include &lt;spu_intrinsics.h&gt;
<span class="line-number">  4</span>
<span class="line-number">  5</span>// Return the double stored at the unaligned address at (arg0)
<span id="unaligned_load_double" class="line-number">  6</span>double 
<span class="line-number">  7</span>unaligned_load_double( const char* const arg0 )
<span class="line-number">  8</span>{
<span class="line-number">  9</span>    uintptr_t  source_addr_u         = (uintptr_t)arg0;
<span class="line-number"> 10</span>    qword      source_addr           = si_from_uint( source_addr_u );
<span class="line-number"> 11</span>
<span class="line-number"> 12</span>    // Unaligned load requires two reads since the double may be
<span class="line-number"> 13</span>    // stored across two aligned quadwords. These loads will read
<span class="line-number"> 14</span>    // starting at the quadword aligned address of source_addr
<span class="line-number"> 15</span>    // i.e. will ignore the low 4 bits of source_addr
<span class="line-number"> 16</span>
<span class="line-number"> 17</span>    qword      source_lo             = si_lqd( source_addr, 0x00 );
<span class="line-number"> 18</span>    qword      source_hi             = si_lqd( source_addr, 0x10 );
<span class="line-number"> 19</span>
<span class="line-number"> 20</span>    // Get low 4 bits of source address. This is the offset from the
<span class="line-number"> 21</span>    // aligned address.
<span class="line-number"> 22</span>
<span class="line-number"> 23</span>    qword      shift_offset          = si_andi( source_addr, 0x0f );
<span class="line-number"> 24</span>
<span class="line-number"> 25</span>    // Create a shuffle pattern which selects the appropriate
<span class="line-number"> 26</span>    // bytes of the double from the two quadwords.
<span class="line-number"> 27</span>    //
<span class="line-number"> 28</span>    // The value of shift offset is stored in byte[3] of shift_offset,
<span class="line-number"> 29</span>    // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
<span class="line-number"> 30</span>    // 2. Use shuffle to generate a qword with each byte filled with 
<span class="line-number"> 31</span>    //    the shift offset.
<span class="line-number"> 32</span>
<span class="line-number"> 33</span>    qword      lo_byte_pattern       = si_ilh( 0x0303 );
<span class="line-number"> 34</span>    qword      shift_offset_pattern  = si_shufb( shift_offset, shift_offset, lo_byte_pattern );
<span class="line-number"> 35</span>
<span class="line-number"> 36</span>    // Add shift_offset_pattern to a vector for shift left (which is
<span class="line-number"> 37</span>    // just a byte vector with bytes incrementing from 0-15). This will
<span class="line-number"> 38</span>    // generate a shuffle pattern which will be used to move the unaligned 
<span class="line-number"> 39</span>    // double into the preferred slot of the result.
<span class="line-number"> 40</span>
<span class="line-number"> 41</span>    qword      vector_shift_left     = (qword)_vsl;
<span class="line-number"> 42</span>    qword      shift_pattern         = si_a( shift_offset_pattern, vector_shift_left );
<span class="line-number"> 43</span>
<span class="line-number"> 44</span>    // Move double into preferred slot and extract.
<span class="line-number"> 45</span>
<span class="line-number"> 46</span>    qword      result_preferred      = si_shufb( source_lo, source_hi, shift_pattern );
<span class="line-number"> 47</span>    double     result                = si_to_double( result_preferred );
<span class="line-number"> 48</span>
<span class="line-number"> 49</span>    return (result);
<span class="line-number"> 50</span>}
<span class="line-number"> 51</span>
<span class="line-number"> 52</span>// Write the double (arg1) at the unaligned address at (arg0)
<span id="unaligned_store_double" class="line-number"> 53</span>void 
<span class="line-number"> 54</span>unaligned_store_double( char* const arg0, double arg1 )
<span class="line-number"> 55</span>{
<span class="line-number"> 56</span>    uintptr_t  source_addr_u         = (uintptr_t)arg0;
<span class="line-number"> 57</span>    qword      source_addr           = si_from_uint( source_addr_u );
<span class="line-number"> 58</span>    qword      value                 = si_from_double( arg1 );
<span class="line-number"> 59</span>
<span class="line-number"> 60</span>    // Unaligned store requires two reads since the double may be
<span class="line-number"> 61</span>    // stored across two aligned quadwords. The two source lines
<span class="line-number"> 62</span>    // will be loaded, modified then stored.
<span class="line-number"> 63</span>
<span class="line-number"> 64</span>    qword      source_lo             = si_lqd( source_addr, 0x00 );
<span class="line-number"> 65</span>    qword      source_hi             = si_lqd( source_addr, 0x10 );
<span class="line-number"> 66</span>
<span class="line-number"> 67</span>    // Get low 4 bits of source address. This is the offset from the
<span class="line-number"> 68</span>    // aligned address.
<span class="line-number"> 69</span>
<span class="line-number"> 70</span>    qword      shift_offset          = si_andi( source_addr, 0x0f );
<span class="line-number"> 71</span>
<span class="line-number"> 72</span>    // Create a shuffle pattern which selects the appropriate
<span class="line-number"> 73</span>    // bytes of the double from the low quadword.
<span class="line-number"> 74</span>    //
<span class="line-number"> 75</span>    // The value of shift offset is stored in byte[3] of shift_offset,
<span class="line-number"> 76</span>    // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
<span class="line-number"> 77</span>    // 2. Use shuffle to generate a qword with each byte filled with 
<span class="line-number"> 78</span>    //    the shift offset.
<span class="line-number"> 79</span>
<span class="line-number"> 80</span>    qword      lo_byte_pattern          = si_ilh( 0x0303 );
<span class="line-number"> 81</span>    qword      shift_offset_pattern_lo  = si_shufb( shift_offset, shift_offset, lo_byte_pattern );
<span class="line-number"> 82</span>
<span class="line-number"> 83</span>
<span class="line-number"> 84</span>    // Create a shuffle pattern which selects the appropriate
<span class="line-number"> 85</span>    // bytes of the double from the high quadword 
<span class="line-number"> 86</span>    // high offset = (16 bytes - low offset)
<span class="line-number"> 87</span>
<span class="line-number"> 88</span>    qword      shift_adjust_hi          = si_ilh( 0x1010 );
<span class="line-number"> 89</span>    qword      shift_offset_pattern_hi  = si_sf( shift_adjust_hi, shift_offset_pattern_lo );
<span class="line-number"> 90</span>
<span class="line-number"> 91</span>    // Subtract shift offset pattern from a vector for shift left (which is
<span class="line-number"> 92</span>    // just a byte vector with bytes incrementing from 0-15). This will
<span class="line-number"> 93</span>    // generate a shuffle pattern which will be used to store the unaligned 
<span class="line-number"> 94</span>    // double at the unaligned locations in the two source lines.
<span class="line-number"> 95</span>
<span class="line-number"> 96</span>    qword      vector_shift_left     = (qword)_vsl;
<span class="line-number"> 97</span>    qword      shift_pattern_lo      = si_sf( shift_offset_pattern_lo, vector_shift_left );
<span class="line-number"> 98</span>    qword      shift_pattern_hi      = si_sf( shift_offset_pattern_hi, vector_shift_left );
<span class="line-number"> 99</span>
<span class="line-number">100</span>    // Build masks for the bytes of the source lines that must stay unmodified.
<span class="line-number">101</span>    // A shuffle pattern byte outside the range [0x00,0x07] does not select a
<span class="line-number">102</span>    // byte of the value, so the source byte at that position is kept.
<span class="line-number">103</span>
<span class="line-number">104</span>    qword      source_bits_mask_lo   = si_clgtbi( shift_pattern_lo, 0x07 );
<span class="line-number">105</span>    qword      source_bits_mask_hi   = si_clgtbi( shift_pattern_hi, 0x07 );
<span class="line-number">106</span>    qword      value_lo              = si_shufb( value, value, shift_pattern_lo );
<span class="line-number">107</span>    qword      value_hi              = si_shufb( value, value, shift_pattern_hi );
<span class="line-number">108</span>
<span class="line-number">109</span>    // Clear space in source lines to store the value
<span class="line-number">110</span>
<span class="line-number">111</span>    qword      prepped_source_lo     = si_and( source_lo, source_bits_mask_lo );
<span class="line-number">112</span>    qword      prepped_source_hi     = si_and( source_hi, source_bits_mask_hi );
<span class="line-number">113</span>
<span class="line-number">114</span>    // Clear everything unwanted from the value
<span class="line-number">115</span>
<span class="line-number">116</span>    qword      prepped_value_lo     = si_andc( value_lo, source_bits_mask_lo );
<span class="line-number">117</span>    qword      prepped_value_hi     = si_andc( value_hi, source_bits_mask_hi );
<span class="line-number">118</span>
<span class="line-number">119</span>    // Combine the source lines and the value lines
<span class="line-number">120</span>
<span class="line-number">121</span>    qword      result_lo            = si_or( prepped_source_lo, prepped_value_lo );
<span class="line-number">122</span>    qword      result_hi            = si_or( prepped_source_hi, prepped_value_hi );
<span class="line-number">123</span>
<span class="line-number">124</span>    // Store the result
<span class="line-number">125</span>
<span class="line-number">126</span>    si_stqd( result_lo, source_addr, 0x00 );
<span class="line-number">127</span>    si_stqd( result_hi, source_addr, 0x10 );
<span class="line-number">128</span>}
</div>

<br />
vsl.h<br/>
<div class="code">
<span class="line-number">  0</span>#ifndef VSL_H
<span class="line-number">  1</span>#define VSL_H
<span class="line-number">  2</span>
<span class="line-number">  3</span>extern const vector unsigned char _vsl;
<span class="line-number">  4</span>
<span class="line-number">  5</span>#endif
</div>
<br />
vsl.c<br />
<div class="code">
<span class="line-number">  0</span>// Vector for shift left
<span class="line-number">  1</span>const vector unsigned char _vsl = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
<span class="line-number">  2</span>                                   ,0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
</div>
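<br />
For readers without an SPU toolchain, the read-modify-write merge in the listing above can be modeled byte by byte in portable C. This is only a sketch under stated assumptions: the name <span class="monospace-strong">store_unaligned_model</span>, the 32-byte buffer layout (two consecutive 16-byte lines) and the restriction of the offset to 0..15 are illustrative choices, not part of the original source. The index <span class="monospace-strong">k</span> plays the role of one shuffle pattern byte (position minus offset, wrapping on underflow), and <span class="monospace-strong">keep</span> plays the role of the mask produced by si_clgtbi.<br />

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the article): portable model of storing an
 * 8-byte value at byte offset 0..15 inside two consecutive 16-byte lines. */
static void store_unaligned_model(uint8_t *aligned_base, unsigned offset,
                                  const uint8_t value[8])
{
    uint8_t lines[32];
    unsigned i;

    /* Load both lines, as the two quadword loads do. */
    memcpy(lines, aligned_base, sizeof lines);

    for (i = 0; i < 32u; ++i) {
        /* Pattern byte: position minus offset. Unsigned wrap-around makes
         * positions before the value produce a large index. */
        unsigned k = i - offset;

        /* Anything outside [0,7] does not select a value byte, so the
         * source byte is kept (all-ones mask). */
        uint8_t keep = (k > 7u) ? 0xFFu : 0x00u;
        uint8_t v = value[k & 7u];

        /* (source AND mask) OR (value AND NOT mask) */
        lines[i] = (uint8_t)((lines[i] & keep) | (v & (uint8_t)~keep));
    }

    /* Store both lines back, as the two quadword stores do. */
    memcpy(aligned_base, lines, sizeof lines);
}
```

With an offset of 12, for example, the first four value bytes land at the end of the low line and the remaining four at the start of the high line, while every other byte of both lines is written back unchanged.<br />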
]]>
    </content>
</entry>
<entry>
    <title>atan2 on SPU</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=75" title="atan2 on SPU" />
    <id>tag:www.cellperformance.com,2006:/articles//1.75</id>
    
    <published>2006-09-12T10:21:52Z</published>
    <updated>2006-09-16T07:13:39Z</updated>
    
    <summary>A branch-free implementation of atan2 vector floats for the SPU.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[On 2006 March 03 on the IBM developerWorks <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=109947&message=13795522&cat=46&q=atan2#13795522">Cell Broadband Engine Architecture forum [ibm.com]</a> an interesting question was asked:<br />
<div class="quote">
"I am trying to port an application from an older version of SDK to SDK 1.0. It uses atan2(.....) function, which is causing trouble... This code worked fine on SDK28, but now it looks like the new functions dont have this particular function defined..<br />
I did change the makefile to include $(SDKLIB)/libmath.a<br />
<br />
I searched in ./sysroot/usr/spu/include/* and src/include/spu/* but couldn't find a headerfile that has it defined.<br />
<br />
Can anyone please suggest if I should just change the code to not use that function or is there a way to invoke it still?<br />
<br />
Thanks!"
</div>
<br />
It turned out this function was not available in the SDK.<br />
<br />
The following is a branch-free implementation of atan2 for vector floats on the SPU. A scalar version, which simply casts to vector and back, is also provided. This implementation is fairly quick-and-dirty and no particular level of accuracy is guaranteed, but it should be usable for many purposes.<br />
<br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan">static inline vector float cp_fatan( const vector float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan_scalar">static inline float cp_fatan_scalar( const float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan2">static inline vector float cp_fatan2( const vector float y, const vector float x );</a><br />
<a href ="http://www.cellperformance.com/articles/2006/09/atan2_on_spu.html#cp_fatan2_scalar">static inline float cp_fatan2_scalar( const float y, const float x );</a><br />

<br />
Or download the source files:<br />
<a href="http://www.cellperformance.com/public/attachments/cp_fatan-cbe-spu.h">cp_fatan-cbe-spu.h</a><br />
<a href="http://www.cellperformance.com/public/attachments/cp_fatan-cbe-spu.c">cp_fatan-cbe-spu.c</a><br />
<br />
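For orientation, the branch-free pattern (compute every range-reduction candidate, then pick one with a bit mask) can be sketched in portable scalar C. Everything here is an assumption made for illustration: the helper names are hypothetical, plain division stands in for the SPU reciprocal estimate plus refinement, and the polynomial and the tan(PI/8) threshold are taken from the single-precision Cephes atanf rather than from this article's coefficient tables and 0.66 threshold.<br />

```c
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers (hypothetical names, not from the article). */
static uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Scalar analogue of si_selb: bits of b where mask is set, bits of a elsewhere. */
static float select_f32(float a, float b, uint32_t mask)
{
    return bits_f32((f32_bits(a) & ~mask) | (f32_bits(b) & mask));
}

/* All-ones when cond is true, all-zeros otherwise (like a compare result). */
static uint32_t mask_if(int cond)
{
    return (uint32_t)0 - (uint32_t)(cond != 0);
}

/* Sketch of a select-based atan; assumes IEEE 754 float arithmetic. */
static float atan_sketch(float x)
{
    const float pio2 = 1.5707963267948966f;
    const float pio4 = 0.7853981633974483f;
    const float t3p8 = 2.414213562373095f;   /* tan(3*PI/8) */
    const float tp8  = 0.4142135623730950f;  /* tan(PI/8), Cephes atanf threshold */

    /* pos_x = |x|, remembering the sign for the end. */
    const uint32_t sign_mask = mask_if(x < 0.0f);
    const float pos_x = select_f32(x, -x, sign_mask);

    const uint32_t range0_mask = mask_if(pos_x > t3p8);
    const uint32_t range1_mask = mask_if(pos_x <= tp8);

    /* Every candidate is computed unconditionally. range0_x may be -INF
     * when pos_x is 0, but the bitwise select discards it in that case. */
    const float range0_x = -1.0f / pos_x;
    const float range2_x = (pos_x - 1.0f) / (pos_x + 1.0f);

    const float rx0 = select_f32(range2_x, range0_x, range0_mask);
    const float rx  = select_f32(rx0, pos_x, range1_mask);
    const float ry0 = select_f32(pio4, pio2, range0_mask);
    const float ry  = select_f32(ry0, 0.0f, range1_mask);

    /* Polynomial core on the reduced argument (Cephes atanf coefficients). */
    const float z = rx * rx;
    const float p = (((8.05374449538e-2f * z - 1.38776856032e-1f) * z
                      + 1.99777106478e-1f) * z - 3.33329491539e-1f) * z * rx + rx;

    /* Add the range offset and restore the sign. */
    const float y = ry + p;
    return select_f32(y, -y, sign_mask);
}
```

Because the selection is bitwise, an infinity or NaN produced by a candidate that is not selected never reaches the result, which is what makes it safe to evaluate all three ranges for every input.<br />
<br />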

<div class="sticky-note">
This code is C99 source. For gcc, use the following flags: <span class="monospace-strong">-std=c99 -pedantic</span>
</div>]]>
        <![CDATA[<div class="code">
<span class="line-number">  0</span>// ## cp_fatan-cbe-spu.h (C99)
<span class="line-number">  1</span>// ## Version 1.0
<span class="line-number">  2</span>// ##                        
<span class="line-number">  3</span>// ## Copyright (c) 2006 Mike Acton <macton@gmail.com>
<span class="line-number">  4</span>// ##                        
<span class="line-number">  5</span>// ## SIGNIFICANT REFERENCES:
<span class="line-number">  6</span>// ##                        
<span class="line-number">  7</span>// ##    [1] Cephes Math Library Release 2.8:  June, 2000
<span class="line-number">  8</span>// ##        Copyright 1984, 1995, 2000, Stephen L. Moshier
<span class="line-number">  9</span>// ##    [2] Numerical Computation Guide (PDF)
<span class="line-number"> 10</span>// ##        Copyright 2000, Sun Microsystems, Inc.
<span class="line-number"> 11</span>// ##    [3] IEEE 754 Support in C99 (PDF)
<span class="line-number"> 12</span>// ##        Copyright 2001, Jim Thomas
<span class="line-number"> 13</span>// ##    [4] Solaris 10 Reference Manual : atan2(3M)
<span class="line-number"> 14</span>// ##        Copyright 1994-2005, Sun Microsystems, Inc.
<span class="line-number"> 15</span>// ##                        
<span class="line-number"> 16</span>// ## Permission is hereby granted, free of charge, to any person obtaining
<span class="line-number"> 17</span>// ## a copy of this software and associated documentation files 
<span class="line-number"> 18</span>// ## (the "Software"), to deal in the Software without restriction, including
<span class="line-number"> 19</span>// ## without limitation the rights to use, copy, modify, merge, publish, 
<span class="line-number"> 20</span>// ## distribute, sublicense, and/or sell copies of the Software, and to permit
<span class="line-number"> 21</span>// ## persons to whom the Software is furnished to do so, subject to the 
<span class="line-number"> 22</span>// ## following conditions:
<span class="line-number"> 23</span>// ##                        
<span class="line-number"> 24</span>// ## The above copyright notice and this permission notice shall be included 
<span class="line-number"> 25</span>// ## in all copies or substantial portions of the Software.
<span class="line-number"> 26</span>// ##                        
<span class="line-number"> 27</span>// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 
<span class="line-number"> 28</span>// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
<span class="line-number"> 29</span>// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
<span class="line-number"> 30</span>// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
<span class="line-number"> 31</span>// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
<span class="line-number"> 32</span>// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
<span class="line-number"> 33</span>// ## THE SOFTWARE.
<span class="line-number"> 34</span>// ##                        
<span class="line-number"> 35</span>
<span class="line-number"> 36</span>#ifndef CP_FATAN_CBE_SPU_H
<span class="line-number"> 37</span>#define CP_FATAN_CBE_SPU_H
<span class="line-number"> 38</span>
<span class="line-number"> 39</span>#include &lt;stdint.h&gt;
<span class="line-number"> 40</span>#include &lt;spu_intrinsics.h&gt;
<span class="line-number"> 41</span>
<span class="line-number"> 42</span>// ##                        
<span class="line-number"> 43</span>// ## Global Floating-point constants (32 bit)
<span class="line-number"> 44</span>// ##                        
<span class="line-number"> 45</span>// ## Constant is loaded in each element of 32 bit floating-point vector
<span class="line-number"> 46</span>// ## from local store.
<span class="line-number"> 47</span>// ##                        
<span class="line-number"> 48</span>// ## cp_flpio4()  +PI/+4
<span class="line-number"> 49</span>// ## cp_flt3p8()  tan( +3.0 * PI / +8.0 )
<span class="line-number"> 50</span>// ## cp_flnpio2() -PI/+2
<span class="line-number"> 51</span>// ## cp_flpio2()  +PI/+2
<span class="line-number"> 52</span>// ## cp_flpt66()  +0.66
<span class="line-number"> 53</span>// ## cp_flpi()    +PI
<span class="line-number"> 54</span>// ## cp_flnpi()   -PI
<span class="line-number"> 55</span>
<span class="line-number"> 56</span>extern const vector unsigned int _cp_f_pio4;
<span class="line-number"> 57</span>extern const vector unsigned int _cp_f_t3p8;
<span class="line-number"> 58</span>extern const vector unsigned int _cp_f_npio2;
<span class="line-number"> 59</span>extern const vector unsigned int _cp_f_pio2;
<span class="line-number"> 60</span>extern const vector unsigned int _cp_f_pt66;
<span class="line-number"> 61</span>extern const vector unsigned int _cp_f_pi;
<span class="line-number"> 62</span>extern const vector unsigned int _cp_f_npi;
<span class="line-number"> 63</span>
<span class="line-number"> 64</span>static inline qword
<span class="line-number"> 65</span>cp_flpio4( void )
<span class="line-number"> 66</span>{
<span class="line-number"> 67</span>    return si_lqa( (intptr_t)&_cp_f_pio4 );
<span class="line-number"> 68</span>}
<span class="line-number"> 69</span>
<span class="line-number"> 70</span>static inline qword
<span class="line-number"> 71</span>cp_flt3p8( void )
<span class="line-number"> 72</span>{
<span class="line-number"> 73</span>    return si_lqa( (intptr_t)&_cp_f_t3p8 );
<span class="line-number"> 74</span>}
<span class="line-number"> 75</span>
<span class="line-number"> 76</span>static inline qword
<span class="line-number"> 77</span>cp_flnpio2( void )
<span class="line-number"> 78</span>{
<span class="line-number"> 79</span>    return si_lqa( (intptr_t)&_cp_f_npio2 );
<span class="line-number"> 80</span>}
<span class="line-number"> 81</span>
<span class="line-number"> 82</span>static inline qword
<span class="line-number"> 83</span>cp_flpio2( void )
<span class="line-number"> 84</span>{
<span class="line-number"> 85</span>    return si_lqa( (intptr_t)&_cp_f_pio2 );
<span class="line-number"> 86</span>}
<span class="line-number"> 87</span>
<span class="line-number"> 88</span>static inline qword
<span class="line-number"> 89</span>cp_flpt66( void )
<span class="line-number"> 90</span>{
<span class="line-number"> 91</span>    return si_lqa( (intptr_t)&_cp_f_pt66 );
<span class="line-number"> 92</span>}
<span class="line-number"> 93</span>
<span class="line-number"> 94</span>static inline qword
<span class="line-number"> 95</span>cp_flpi( void )
<span class="line-number"> 96</span>{
<span class="line-number"> 97</span>    return si_lqa( (intptr_t)&_cp_f_pi );
<span class="line-number"> 98</span>}
<span class="line-number"> 99</span>
<span class="line-number">100</span>static inline qword
<span class="line-number">101</span>cp_flnpi( void )
<span class="line-number">102</span>{
<span class="line-number">103</span>    return si_lqa( (intptr_t)&_cp_f_npi );
<span class="line-number">104</span>}
<span class="line-number">105</span>
<span class="line-number">106</span>// ##                        
<span class="line-number">107</span>// ## Load-Immediate Floating-point constants (32 bit)
<span class="line-number">108</span>// ##                        
<span class="line-number">109</span>// ## Constant is loaded in each element of 32 bit floating-point vector
<span class="line-number">110</span>// ## using immediate values. i.e. No loads
<span class="line-number">111</span>// ##                        
<span class="line-number">112</span>// ## cp_filzero()   +0.0  +0x00000000
<span class="line-number">113</span>// ## cp_filnzero()  -0.0  +0x80000000
<span class="line-number">114</span>// ## cp_filone()    +1.0  +0x3f800000
<span class="line-number">115</span>// ## cp_filtwo()    +2.0  +0x40000000
<span class="line-number">116</span>// ## cp_filinf()    +INF  +0x7f800000
<span class="line-number">117</span>// ## cp_filninf()   -INF  +0xff800000
<span class="line-number">118</span>// ## cp_filnan()     NaN  +0x7fc00000
<span class="line-number">119</span>// ##                        
<span class="line-number">120</span>
<span class="line-number">121</span>static inline qword 
<span class="line-number">122</span>cp_filzero( void )
<span class="line-number">123</span>{
<span class="line-number">124</span>    return si_ilhu( (int16_t)0x0000 );
<span class="line-number">125</span>}
<span class="line-number">126</span>
<span class="line-number">127</span>static inline qword 
<span class="line-number">128</span>cp_filnzero( void )
<span class="line-number">129</span>{
<span class="line-number">130</span>    return si_ilhu( (int16_t)0x8000 );
<span class="line-number">131</span>}
<span class="line-number">132</span>
<span class="line-number">133</span>static inline qword 
<span class="line-number">134</span>cp_filone( void )
<span class="line-number">135</span>{
<span class="line-number">136</span>    return si_ilhu( (int16_t)0x3f80 );
<span class="line-number">137</span>}
<span class="line-number">138</span>
<span class="line-number">139</span>static inline qword 
<span class="line-number">140</span>cp_filtwo( void )
<span class="line-number">141</span>{
<span class="line-number">142</span>    return si_ilhu( (int16_t)0x4000 );
<span class="line-number">143</span>}
<span class="line-number">144</span>
<span class="line-number">145</span>static inline qword 
<span class="line-number">146</span>cp_filinf( void )
<span class="line-number">147</span>{
<span class="line-number">148</span>    return si_ilhu( (int16_t)0x7f80 );
<span class="line-number">149</span>}
<span class="line-number">150</span>
<span class="line-number">151</span>static inline qword 
<span class="line-number">152</span>cp_filninf( void )
<span class="line-number">153</span>{
<span class="line-number">154</span>    return si_ilhu( (int16_t)0xff80 );
<span class="line-number">155</span>}
<span class="line-number">156</span>
<span class="line-number">157</span>static inline qword 
<span class="line-number">158</span>cp_filnan( void )
<span class="line-number">159</span>{
<span class="line-number">160</span>    return si_ilhu( (int16_t)0x7fc0 );
<span class="line-number">161</span>}
<span class="line-number">162</span>
<span class="line-number">163</span>// ##                        
<span class="line-number">164</span>// ## cp_fatan() Coefficients and other constants
<span class="line-number">165</span>// ##                        
<span class="line-number">166</span>
<span class="line-number">167</span>extern const vector unsigned int _cp_f_atan_q4;
<span class="line-number">168</span>extern const vector unsigned int _cp_f_atan_q3;
<span class="line-number">169</span>extern const vector unsigned int _cp_f_atan_q2;
<span class="line-number">170</span>extern const vector unsigned int _cp_f_atan_q1;
<span class="line-number">171</span>extern const vector unsigned int _cp_f_atan_q0;
<span class="line-number">172</span>extern const vector unsigned int _cp_f_atan_p4;
<span class="line-number">173</span>extern const vector unsigned int _cp_f_atan_p3;
<span class="line-number">174</span>extern const vector unsigned int _cp_f_atan_p2;
<span class="line-number">175</span>extern const vector unsigned int _cp_f_atan_p1;
<span class="line-number">176</span>extern const vector unsigned int _cp_f_atan_p0;
<span class="line-number">177</span>extern const vector unsigned int _cp_f_hmorebits;
<span class="line-number">178</span>extern const vector unsigned int _cp_f_morebits;
<span class="line-number">179</span>
<span class="line-number">180</span>// ## cp_fatan(x)
<span class="line-number">181</span>// ##                        
<span class="line-number">182</span>// ## 0     <= x           <= 0.66
<span class="line-number">183</span>// ## -PI/2 <= cp_fatan(x) <= +PI/2
<span class="line-number">184</span>// ##                        
<span class="line-number">185</span>// ## Each floating-point component of the result is a function of
<span class="line-number">186</span>// ## the corresponding components of x:
<span class="line-number">187</span>// ##
<span class="line-number">188</span>// ##    0.0                                             { x == 0.0
<span class="line-number">189</span>// ##                        
<span class="line-number">190</span>// ##    +PI                                             {
<span class="line-number">191</span>// ##    ---                                             { x == INF
<span class="line-number">192</span>// ##    2.0                                             {
<span class="line-number">193</span>// ##                        
<span class="line-number">194</span>// ##    -PI                                             {
<span class="line-number">195</span>// ##    ---                                             { x == -INF
<span class="line-number">196</span>// ##    2.0                                             {
<span class="line-number">197</span>// ##                        
<span class="line-number">198</span>// ##                           
<span class="line-number">199</span>// ##                 8      6      4      2             {
<span class="line-number">200</span>// ##              P x  + P x  + P x  + P x  + P         {
<span class="line-number">201</span>// ##        2      0      1      2      3      4        {
<span class="line-number">202</span>// ##    x  x   ------------------------------------ + x { otherwise
<span class="line-number">203</span>// ##            10      8      6      4      2          {
<span class="line-number">204</span>// ##           x   + Q x  + Q x  + Q x  + Q x  + Q      {
<span class="line-number">205</span>// ##                  0      1      2      3      4     {
<span class="line-number">206</span>
<span class="line-number">207</span>static inline qword
<span class="line-number">208</span>_cp_fatan( const qword x )
<span class="line-number">209</span>{
<span class="line-number">210</span>    // ##                        
<span class="line-number">211</span>    // ## Load constants
<span class="line-number">212</span>    // ##                        
<span class="line-number">213</span>    
<span class="line-number">214</span>    const qword f_one           = cp_filone();
<span class="line-number">215</span>    const qword f_inf           = cp_filinf();
<span class="line-number">216</span>    const qword f_ninf          = cp_filninf();
<span class="line-number">217</span>    const qword f_msb           = cp_filnzero();
<span class="line-number">218</span>    const qword f_zero          = cp_filzero();
<span class="line-number">219</span>
<span class="line-number">220</span>    const qword f_pt66          = si_lqa( (intptr_t)&_cp_f_pt66      );
<span class="line-number">221</span>    const qword f_pio2          = si_lqa( (intptr_t)&_cp_f_pio2      );
<span class="line-number">222</span>    const qword f_npio2         = si_lqa( (intptr_t)&_cp_f_npio2     );
<span class="line-number">223</span>    const qword f_pio4          = si_lqa( (intptr_t)&_cp_f_pio4      );
<span class="line-number">224</span>    const qword f_t3p8          = si_lqa( (intptr_t)&_cp_f_t3p8      );
<span class="line-number">225</span>
<span class="line-number">226</span>    const qword f_atan_p0       = si_lqa( (intptr_t)&_cp_f_atan_p0    );
<span class="line-number">227</span>    const qword f_atan_p1       = si_lqa( (intptr_t)&_cp_f_atan_p1    );
<span class="line-number">228</span>    const qword f_atan_p2       = si_lqa( (intptr_t)&_cp_f_atan_p2    );
<span class="line-number">229</span>    const qword f_atan_p3       = si_lqa( (intptr_t)&_cp_f_atan_p3    );
<span class="line-number">230</span>    const qword f_atan_p4       = si_lqa( (intptr_t)&_cp_f_atan_p4    );
<span class="line-number">231</span>    const qword f_atan_q0       = si_lqa( (intptr_t)&_cp_f_atan_q0    );
<span class="line-number">232</span>    const qword f_atan_q1       = si_lqa( (intptr_t)&_cp_f_atan_q1    );
<span class="line-number">233</span>    const qword f_atan_q2       = si_lqa( (intptr_t)&_cp_f_atan_q2    );
<span class="line-number">234</span>    const qword f_atan_q3       = si_lqa( (intptr_t)&_cp_f_atan_q3    );
<span class="line-number">235</span>    const qword f_atan_q4       = si_lqa( (intptr_t)&_cp_f_atan_q4    );
<span class="line-number">236</span>    const qword f_morebits      = si_lqa( (intptr_t)&_cp_f_morebits  );
<span class="line-number">237</span>    const qword f_hmorebits     = si_lqa( (intptr_t)&_cp_f_hmorebits );
<span class="line-number">238</span>    
<span class="line-number">239</span>    // ##                        
<span class="line-number">240</span>    // ## pos_x = -x            { x < 0
<span class="line-number">241</span>    // ##          x            { otherwise
<span class="line-number">242</span>    // ##                        
<span class="line-number">243</span>    
<span class="line-number">244</span>    const qword neg_x           = si_xor( x, f_msb );          
<span class="line-number">245</span>    const qword sign_mask       = si_fcgt( f_zero, x );
<span class="line-number">246</span>    const qword pos_x           = si_selb( x, neg_x, sign_mask );
<span class="line-number">247</span>    
<span class="line-number">248</span>    // ##                        
<span class="line-number">249</span>    // ## Range reduction
<span class="line-number">250</span>    // ##                        
<span class="line-number">251</span>    
<span class="line-number">252</span>    // ##                        
<span class="line-number">253</span>    // ## range0_mask = ( pos_x > tan( 3.0 * PI / 8.0 ) )
<span class="line-number">254</span>    // ## range1_mask = ( pos_x <= 0.66 )
<span class="line-number">255</span>    // ## range2_mask = !( range0_mask || range1_mask )
<span class="line-number">256</span>    // ##                        
<span class="line-number">257</span>    
<span class="line-number">258</span>    const qword range0_mask     = si_fcgt( pos_x, f_t3p8 );
<span class="line-number">259</span>    const qword range1_gt_mask  = si_fcgt( f_pt66, pos_x );
<span class="line-number">260</span>    const qword range1_eq_mask  = si_fceq( f_pt66, pos_x );
<span class="line-number">261</span>    const qword range1_mask     = si_or( range1_gt_mask, range1_eq_mask );
<span class="line-number">262</span>    const qword range2_mask     = si_nor( range0_mask, range1_mask );
<span class="line-number">263</span>    
<span class="line-number">264</span>    // ##                        
<span class="line-number">265</span>    // ## range0_x = -1.0 
<span class="line-number">266</span>    // ##            -----
<span class="line-number">267</span>    // ##            pos_x
<span class="line-number">268</span>    // ##                        
<span class="line-number">269</span>    // ## range0_y = PI
<span class="line-number">270</span>    // ##            ---
<span class="line-number">271</span>    // ##            2.0
<span class="line-number">272</span>    // ##                        
<span class="line-number">273</span>    
<span class="line-number">274</span>    const qword range0_x0       = si_frest( pos_x );
<span class="line-number">275</span>    const qword range0_x1       = si_fi( pos_x, range0_x0 );
<span class="line-number">276</span>    const qword range0_x2       = si_fnms( range0_x1, pos_x, f_one );
<span class="line-number">277</span>    const qword range0_x3       = si_fma( range0_x2, range0_x1, range0_x1 );
<span class="line-number">278</span>    const qword range0_x        = si_xor( range0_x3, f_msb );
<span class="line-number">279</span>    const qword range0_y        = f_pio2;
<span class="line-number">280</span>    
<span class="line-number">281</span>    // ##                        
<span class="line-number">282</span>    // ## range1_x = pos_x
<span class="line-number">283</span>    // ## range1_y = 0.0
<span class="line-number">284</span>    // ##                        
<span class="line-number">285</span>    
<span class="line-number">286</span>    const qword range1_x        = pos_x;
<span class="line-number">287</span>    const qword range1_y        = f_zero;
<span class="line-number">288</span>    
<span class="line-number">289</span>    
<span class="line-number">290</span>    // ##                        
<span class="line-number">291</span>    // ## range2_x = (pos_x-1.0)
<span class="line-number">292</span>    // ##            -----------
<span class="line-number">293</span>    // ##            (pos_x+1.0)
<span class="line-number">294</span>    // ##                        
<span class="line-number">295</span>    // ## range2_y = PI
<span class="line-number">296</span>    // ##            ---
<span class="line-number">297</span>    // ##            4.0
<span class="line-number">298</span>    // ##                        
<span class="line-number">299</span>    
<span class="line-number">300</span>    const qword range2_y        = f_pio4;
<span class="line-number">301</span>    const qword range2_x0num    = si_fs( pos_x, f_one );
<span class="line-number">302</span>    const qword range2_x0den    = si_fa( pos_x, f_one );
<span class="line-number">303</span>    const qword range2_x0       = si_frest( range2_x0den );
<span class="line-number">304</span>    const qword range2_x1       = si_fnms( range2_x0, range2_x0den, f_one );
<span class="line-number">305</span>    const qword range2_x2       = si_fma( range2_x1, range2_x0, range2_x0 );
<span class="line-number">306</span>    const qword range2_x        = si_fm( range2_x0num, range2_x2 );
<span class="line-number">307</span>    
<span class="line-number">308</span>    // ##                        
<span class="line-number">309</span>    // ## range_x  = range0_x { range0_mask
<span class="line-number">310</span>    // ##            range1_x { range1_mask
<span class="line-number">311</span>    // ##            range2_x { range2_mask
<span class="line-number">312</span>    // ##                        
<span class="line-number">313</span>    // ## range_y  = range0_y { range0_mask
<span class="line-number">314</span>    // ##            range1_y { range1_mask
<span class="line-number">315</span>    // ##            range2_y { range2_mask
<span class="line-number">316</span>    // ##                        
<span class="line-number">317</span>    
<span class="line-number">318</span>    const qword range_x0        = si_selb( range2_x, range0_x, range0_mask );
<span class="line-number">319</span>    const qword range_x         = si_selb( range_x0, range1_x, range1_mask );
<span class="line-number">320</span>    const qword range_y0        = si_selb( range2_y, range0_y, range0_mask );
<span class="line-number">321</span>    const qword range_y         = si_selb( range_y0, range1_y, range1_mask );
<span class="line-number">322</span>    
<span class="line-number">323</span>    // ##                        
<span class="line-number">324</span>    // ##                  2
<span class="line-number">325</span>    // ## xp2    =  range_x 
<span class="line-number">326</span>    // ##                4        3        2
<span class="line-number">327</span>    // ##           P xp2  + P xp2  + P xp2  + P xp2 + P
<span class="line-number">328</span>    // ##            0        1        2        3       4
<span class="line-number">329</span>    // ## zdiv   =  --------------------------------------------
<span class="line-number">330</span>    // ##              5        4        3        2
<span class="line-number">331</span>    // ##           xp2  + Q xp2  + Q xp2  + Q xp2  + Q xp2 + Q
<span class="line-number">332</span>    // ##                   0        1        2        3       4
<span class="line-number">333</span>    // ## 
<span class="line-number">334</span>    // ## z1     = range_x * ( xp2 * zdiv ) + range_x
<span class="line-number">335</span>    // ## 
<span class="line-number">336</span>    
<span class="line-number">337</span>    const qword xp2             = si_fm( range_x, range_x );
<span class="line-number">338</span>    const qword znum0           = f_atan_p0;
<span class="line-number">339</span>    const qword znum1           = si_fma( znum0, xp2, f_atan_p1 );
<span class="line-number">340</span>    const qword znum2           = si_fma( znum1, xp2, f_atan_p2 );
<span class="line-number">341</span>    const qword znum3           = si_fma( znum2, xp2, f_atan_p3 );
<span class="line-number">342</span>    const qword znum            = si_fma( znum3, xp2, f_atan_p4 );
<span class="line-number">343</span>    const qword zden0           = si_fa( xp2, f_atan_q0 );
<span class="line-number">344</span>    const qword zden1           = si_fma( zden0, xp2, f_atan_q1 );
<span class="line-number">345</span>    const qword zden2           = si_fma( zden1, xp2, f_atan_q2 );
<span class="line-number">346</span>    const qword zden3           = si_fma( zden2, xp2, f_atan_q3 );
<span class="line-number">347</span>    const qword zden            = si_fma( zden3, xp2, f_atan_q4 );
<span class="line-number">348</span>    const qword zden_r0         = si_frest( zden );
<span class="line-number">349</span>    const qword zden_r1         = si_fnms( zden_r0, zden, f_one );
<span class="line-number">350</span>    const qword zden_r          = si_fma( zden_r1, zden_r0, zden_r0 );
<span class="line-number">351</span>    const qword zdiv            = si_fm( znum, zden_r );
<span class="line-number">352</span>    const qword z0              = si_fm( xp2, zdiv );
<span class="line-number">353</span>    const qword z1              = si_fma( range_x, z0, range_x );
<span class="line-number">354</span>    
<span class="line-number">355</span>    // ##                        
<span class="line-number">356</span>    // ## zadd      =  z1 + 0.5 * MOREBITS { range2_mask
<span class="line-number">357</span>    // ##              z1 + MOREBITS       { range1_mask
<span class="line-number">358</span>    // ##              z1                  { otherwise
<span class="line-number">359</span>    // ##                        
<span class="line-number">360</span>    // ## yaddz     = range_y + zadd
<span class="line-number">361</span>    // ##                        
<span class="line-number">362</span>    // ## pos_yaddz = yaddz      { yaddz >= 0
<span class="line-number">363</span>    // ##             -yaddz     { yaddz <  0
<span class="line-number">364</span>    // ##                        
<span class="line-number">365</span>
<span class="line-number">366</span>    const qword zadd0           = si_selb( f_zero, f_hmorebits, range2_mask );
<span class="line-number">367</span>    const qword zadd1           = si_selb( zadd0,  f_morebits,  range1_mask );
<span class="line-number">368</span>    const qword zadd            = si_fa( z1, zadd1 );
<span class="line-number">369</span>    const qword yaddz           = si_fa( range_y, zadd );
<span class="line-number">370</span>    const qword neg_yaddz       = si_xor( yaddz, f_msb );
<span class="line-number">371</span>    const qword pos_yaddz       = si_selb( yaddz,  neg_yaddz,  sign_mask );
<span class="line-number">372</span>    
<span class="line-number">373</span>    // ##                        
<span class="line-number">374</span>    // ## result_y0 = 0.0        { x == 0.0
<span class="line-number">375</span>    // ##             pos_yaddz  { otherwise
<span class="line-number">376</span>    // ##                        
<span class="line-number">377</span>    
<span class="line-number">378</span>    const qword x_eqz_mask      = si_fceq( f_zero, x );
<span class="line-number">379</span>    const qword result_y0       = si_selb( pos_yaddz, x, x_eqz_mask );
<span class="line-number">380</span>
<span class="line-number">381</span>    // ##                        
<span class="line-number">382</span>    // ## result    = +PI         {
<span class="line-number">383</span>    // ##             ---         { x == INF
<span class="line-number">384</span>    // ##             2.0         {
<span class="line-number">385</span>    // ##                        
<span class="line-number">386</span>    // ##             -PI         {
<span class="line-number">387</span>    // ##             ---         { x == -INF
<span class="line-number">388</span>    // ##             2.0         {
<span class="line-number">389</span>    // ##                        
<span class="line-number">390</span>    // ##             result_y0   { otherwise
<span class="line-number">391</span>    // ##                        
<span class="line-number">392</span>
<span class="line-number">393</span>    const qword x_eqinf_mask    = si_fceq( f_inf,  x );
<span class="line-number">394</span>    const qword x_eqninf_mask   = si_fceq( f_ninf, x );
<span class="line-number">395</span>    const qword result_y1       = si_selb( result_y0, f_pio2,  x_eqinf_mask );
<span class="line-number">396</span>    const qword result          = si_selb( result_y1, f_npio2, x_eqninf_mask );
<span class="line-number">397</span>
<span class="line-number">398</span>    return (result);
<span class="line-number">399</span>}
<span class="line-number">400</span>
<span id="cp_fatan" class="line-number">401</span>static inline vector float
<span class="line-number">402</span>cp_fatan( const vector float x )
<span class="line-number">403</span>{
<span class="line-number">404</span>    return (vector float)( _cp_fatan( (qword)x ) );
<span class="line-number">405</span>}
<span class="line-number">406</span>
<span id="cp_fatan_scalar" class="line-number">407</span>static inline float
<span class="line-number">408</span>cp_fatan_scalar( const float x )
<span class="line-number">409</span>{
<span class="line-number">410</span>    const qword vx      = si_from_float( x );
<span class="line-number">411</span>    const qword vresult = _cp_fatan( vx );
<span class="line-number">412</span>    const float result  = si_to_float( vresult );
<span class="line-number">413</span>
<span class="line-number">414</span>    return (result);
<span class="line-number">415</span>}
<span class="line-number">416</span>
<span class="line-number">417</span>// ## cp_fatan2(y,x)
<span class="line-number">418</span>// ## 
<span class="line-number">419</span>// ## -INF <= x              <= INF
<span class="line-number">420</span>// ## -INF <= y              <= INF
<span class="line-number">421</span>// ## -PI  <= cp_fatan2(y,x) <= +PI
<span class="line-number">422</span>// ##                        
<span class="line-number">423</span>// ## Each floating-point component of the result is a function of
<span class="line-number">424</span>// ## the corresponding components of y and x:
<span class="line-number">425</span>// ##                        
<span class="line-number">426</span>// ##     +PI                  { (y == +0.0) && (x < 0.0)
<span class="line-number">427</span>// ##                        
<span class="line-number">428</span>// ##     -PI                  { (y == -0.0) && (x < 0.0)
<span class="line-number">429</span>// ##     
<span class="line-number">430</span>// ##     +0.0                 { (y == +0.0) && (x > 0.0)
<span class="line-number">431</span>// ##     
<span class="line-number">432</span>// ##     -0.0                 { (y == -0.0) && (x > 0.0)
<span class="line-number">433</span>// ##      
<span class="line-number">434</span>// ##     -PI                  {
<span class="line-number">435</span>// ##     ----                 { (y < 0.0) && (x == 0.0)
<span class="line-number">436</span>// ##     +2.0                 {
<span class="line-number">437</span>// ##     
<span class="line-number">438</span>// ##     +PI                  {
<span class="line-number">439</span>// ##     ----                 { (y > 0.0) && (x == 0.0)
<span class="line-number">440</span>// ##     +2.0                 {
<span class="line-number">441</span>// ##     
<span class="line-number">442</span>// ##     NaN                  { (y == NaN) || (x == NaN) 
<span class="line-number">443</span>// ##     
<span class="line-number">444</span>// ##     +PI                  { (y == +0.0) && (x == -0.0)
<span class="line-number">445</span>// ##                        
<span class="line-number">446</span>// ##     -PI                  { (y == -0.0) && (x == -0.0)
<span class="line-number">447</span>// ##                        
<span class="line-number">448</span>// ##     +0.0                 { (y == +0.0) && (x == +0.0)
<span class="line-number">449</span>// ##                        
<span class="line-number">450</span>// ##     -0.0                 { (y == -0.0) && (x == +0.0)
<span class="line-number">451</span>// ##                        
<span class="line-number">452</span>// ##     +PI                  {
<span class="line-number">453</span>// ##     ---                  { (y == +INF) && (x == +INF)
<span class="line-number">454</span>// ##     4.0                  {
<span class="line-number">455</span>// ##                        
<span class="line-number">456</span>// ##     -PI                  {
<span class="line-number">457</span>// ##     ---                  { (y == -INF) && (x == +INF)
<span class="line-number">458</span>// ##     4.0                  {
<span class="line-number">459</span>// ##                        
<span class="line-number">460</span>// ##     +3.0 PI              {
<span class="line-number">461</span>// ##     -------              { (y == +INF) && (x == -INF)
<span class="line-number">462</span>// ##     +4.0                 {
<span class="line-number">463</span>// ##                        
<span class="line-number">464</span>// ##     -3.0 PI              {
<span class="line-number">465</span>// ##     -------              { (y == -INF) && (x == -INF)
<span class="line-number">466</span>// ##     +4.0                 {
<span class="line-number">467</span>// ##                        
<span class="line-number">468</span>// ##     +PI                  { isfinite(y) && (+y > 0) && (x == -INF)
<span class="line-number">469</span>// ##                        
<span class="line-number">470</span>// ##     -PI                  { isfinite(y) && (-y > 0) && (x == -INF)
<span class="line-number">471</span>// ##                        
<span class="line-number">472</span>// ##     +0.0                 { isfinite(y) && (+y > 0) && (x == +INF)
<span class="line-number">473</span>// ##                        
<span class="line-number">474</span>// ##     -0.0                 { isfinite(y) && (-y > 0) && (x == +INF)
<span class="line-number">475</span>// ##                        
<span class="line-number">476</span>// ##     +PI                  {
<span class="line-number">477</span>// ##     ----                 { isfinite(x) && (y == +INF)
<span class="line-number">478</span>// ##     +2.0                 {
<span class="line-number">479</span>// ##                        
<span class="line-number">480</span>// ##     -PI                  {
<span class="line-number">481</span>// ##     ---                  { isfinite(x) && (y == -INF)
<span class="line-number">482</span>// ##     +2.0                 {
<span class="line-number">483</span>// ##                        
<span class="line-number">484</span>// ##                    ( y )  {
<span class="line-number">485</span>// ##     +PI  + cp_fatan( - )  { ( x <  0.0 ) && ( y >= 0.0 )
<span class="line-number">486</span>// ##                    ( x )  {
<span class="line-number">487</span>// ##                                     
<span class="line-number">488</span>// ##                    ( y )  {
<span class="line-number">489</span>// ##     -PI  + cp_fatan( - )  { ( x <  0.0 ) && ( y < 0.0 )
<span class="line-number">490</span>// ##                    ( x )  {
<span class="line-number">491</span>// ##                                     
<span class="line-number">492</span>// ##                    ( y )  {
<span class="line-number">493</span>// ##     +0.0 + cp_fatan( - )  { otherwise
<span class="line-number">494</span>// ##                    ( x )  {
<span class="line-number">495</span>// ##                                     
<span class="line-number">496</span>
<span class="line-number">497</span>qword _cp_fatan2( qword y, qword x )
<span class="line-number">498</span>{
<span class="line-number">499</span>    const qword f_one       = cp_filone();
<span class="line-number">500</span>    const qword f_zero      = cp_filzero();
<span class="line-number">501</span>    const qword f_pi        = si_lqa( (intptr_t)&_cp_f_pi  );
<span class="line-number">502</span>    const qword f_npi       = si_lqa( (intptr_t)&_cp_f_npi );
<span class="line-number">503</span>
<span class="line-number">504</span>    // ##                        
<span class="line-number">505</span>    // ## yox = y
<span class="line-number">506</span>    // ##       -
<span class="line-number">507</span>    // ##       x
<span class="line-number">508</span>    // ##                        
<span class="line-number">509</span>    // ## z   = +PI + cp_fatan( yox ) { ( x <  0.0 ) && ( y >= 0.0 )
<span class="line-number">510</span>    // ##       -PI + cp_fatan( yox ) { ( x <  0.0 ) && ( y <  0.0 )
<span class="line-number">511</span>    // ##       0.0 + cp_fatan( yox ) { otherwise
<span class="line-number">512</span>
<span class="line-number">513</span>    const qword x_ltz_mask  = si_fcgt( f_zero, x );
<span class="line-number">514</span>    const qword y_ltz_mask  = si_fcgt( f_zero, y );
<span class="line-number">515</span>    const qword xy_ltz_mask = si_and( x_ltz_mask, y_ltz_mask );
<span class="line-number">516</span>    const qword zadd0       = si_selb( f_zero, f_pi, x_ltz_mask );
<span class="line-number">517</span>    const qword zadd        = si_selb( zadd0, f_npi, xy_ltz_mask );
<span class="line-number">518</span>    const qword x_r0        = si_frest( x );
<span class="line-number">519</span>    const qword x_r1        = si_fnms( x_r0, x, f_one );
<span class="line-number">520</span>    const qword x_r         = si_fma( x_r1, x_r0, x_r0 );
<span class="line-number">521</span>    const qword yox         = si_fm( y, x_r );
<span class="line-number">522</span>    const qword atan_yox    = _cp_fatan( yox );
<span class="line-number">523</span>    const qword result      = si_fa( zadd, atan_yox );
<span class="line-number">524</span>
<span class="line-number">525</span>    return (result);
<span class="line-number">526</span>}
<span class="line-number">527</span>
<span id="cp_fatan2" class="line-number">528</span>vector float cp_fatan2( vector float arg0 /* y */, vector float arg1 /* x */ )
<span class="line-number">529</span>{
<span class="line-number">530</span>    const qword y           = (qword)arg0;
<span class="line-number">531</span>    const qword x           = (qword)arg1;
<span class="line-number">532</span>    const qword result      = _cp_fatan2( y, x );
<span class="line-number">533</span>
<span class="line-number">534</span>    return (vector float)(result);
<span class="line-number">535</span>}
<span class="line-number">536</span>
<span id="cp_fatan2_scalar" class="line-number">537</span>float cp_fatan2_scalar( float arg0 /* y */, float arg1 /* x */ )
<span class="line-number">538</span>{
<span class="line-number">539</span>    const qword y           = si_from_float( arg0 );
<span class="line-number">540</span>    const qword x           = si_from_float( arg1 );
<span class="line-number">541</span>    const qword z           = _cp_fatan2( y, x );
<span class="line-number">542</span>    const float result      = si_to_float( z );
<span class="line-number">543</span>
<span class="line-number">544</span>    return( result );
<span class="line-number">545</span>}
<span class="line-number">546</span>
<span class="line-number">547</span>#endif /* CP_FATAN_CBE_SPU_H */
</div>
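<br />
A minimal scalar sketch, in portable C, of two building blocks that appear in the header above: the multiply-add chains that evaluate the numerator and denominator polynomials (Horner's rule), and the reciprocal estimate refined by one Newton-Raphson step. The names sketch_horner and sketch_refine_recip are hypothetical, and the coarse estimate passed in by the caller merely stands in for si_frest; this is an illustration, not the SPU code itself.<br />
<br />
```c
/* Horner's rule: c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1].
   Each loop step is one multiply-add, like the si_fma chains above. */
static float sketch_horner( const float *c, int n, float x )
{
    float acc = c[0];
    int   i;

    for ( i = 1; i < n; i++ )
    {
        acc = acc * x + c[i];
    }
    return acc;
}

/* One Newton-Raphson step toward 1/d from a coarse estimate r0.
   e  = 1 - r0*d    (same shape as si_fnms( r0, d, one ))
   r1 = e*r0 + r0   (same shape as si_fma( e, r0, r0 )) */
static float sketch_refine_recip( float d, float r0 )
{
    const float e  = 1.0f - r0 * d;
    const float r1 = e * r0 + r0;

    return r1;
}
```
<br />
With d = 3 and a coarse estimate of 0.33, one call to sketch_refine_recip gives roughly 0.3333. Each step approximately squares the relative error, which is why a single step after a reasonable estimate is often enough for single precision.<br />
<br />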
<div class="code">
<span class="line-number">  0</span>// ## cp_fatan-cbe-spu.c (C99)
<span class="line-number">  1</span>// ## Version 1.0
<span class="line-number">  2</span>// ##                        
<span class="line-number">  3</span>// ## Copyright (c) 2006 Mike Acton &lt;macton@gmail.com&gt;
<span class="line-number">  4</span>// ##                        
<span class="line-number">  5</span>// ## SIGNIFICANT REFERENCES:
<span class="line-number">  6</span>// ##                        
<span class="line-number">  7</span>// ##    [1] Cephes Math Library Release 2.8:  June, 2000
<span class="line-number">  8</span>// ##        Copyright 1984, 1995, 2000, Stephen L. Moshier
<span class="line-number">  9</span>// ##    [2] Numerical Computation Guide (PDF)
<span class="line-number"> 10</span>// ##        Copyright 2000, Sun Microsystems, Inc.
<span class="line-number"> 11</span>// ##    [3] IEEE 754 Support in C99 (PDF)
<span class="line-number"> 12</span>// ##        Copyright 2001, Jim Thomas
<span class="line-number"> 13</span>// ##    [4] Solaris 10 Reference Manual : atan2(3M)
<span class="line-number"> 14</span>// ##        Copyright 1994-2005, Sun Microsystems, Inc.
<span class="line-number"> 15</span>// ##                        
<span class="line-number"> 16</span>// ## Permission is hereby granted, free of charge, to any person obtaining
<span class="line-number"> 17</span>// ## a copy of this software and associated documentation files 
<span class="line-number"> 18</span>// ## (the "Software"), to deal in the Software without restriction, including
<span class="line-number"> 19</span>// ## without limitation the rights to use, copy, modify, merge, publish, 
<span class="line-number"> 20</span>// ## distribute, sublicense, and/or sell copies of the Software, and to permit
<span class="line-number"> 21</span>// ## persons to whom the Software is furnished to do so, subject to the 
<span class="line-number"> 22</span>// ## following conditions:
<span class="line-number"> 23</span>// ##                        
<span class="line-number"> 24</span>// ## The above copyright notice and this permission notice shall be included 
<span class="line-number"> 25</span>// ## in all copies or substantial portions of the Software.
<span class="line-number"> 26</span>// ##                        
<span class="line-number"> 27</span>// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 
<span class="line-number"> 28</span>// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
<span class="line-number"> 29</span>// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
<span class="line-number"> 30</span>// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
<span class="line-number"> 31</span>// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
<span class="line-number"> 32</span>// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
<span class="line-number"> 33</span>// ## THE SOFTWARE.
<span class="line-number"> 34</span>// ##                        
<span class="line-number"> 35</span>
<span class="line-number"> 36</span>// Loading these constants from (global) SPU local memory is going to be a win over building them
<span class="line-number"> 37</span>// or storing them locally near the function.
<span class="line-number"> 38</span>
<span class="line-number"> 39</span>const vector unsigned int _cp_f_pio4            = {+0x3F490FDA,+0x3F490FDA,+0x3F490FDA,+0x3F490FDA};
<span class="line-number"> 40</span>const vector unsigned int _cp_f_t3p8            = {+0x401A8279,+0x401A8279,+0x401A8279,+0x401A8279};
<span class="line-number"> 41</span>const vector unsigned int _cp_f_npio2           = {-0x4036F026,-0x4036F026,-0x4036F026,-0x4036F026};
<span class="line-number"> 42</span>const vector unsigned int _cp_f_pio2            = {+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA};
<span class="line-number"> 43</span>const vector unsigned int _cp_f_pt66            = {+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2};
<span class="line-number"> 44</span>const vector unsigned int _cp_f_pi              = {+0x40490fda,+0x40490fda,+0x40490fda,+0x40490fda};
<span class="line-number"> 45</span>const vector unsigned int _cp_f_npi             = {-0x3fb6f026,-0x3fb6f026,-0x3fb6f026,-0x3fb6f026};
<span class="line-number"> 46</span>
<span class="line-number"> 47</span>const vector unsigned int _cp_f_atan_q4         = {+0x43428CF7,+0x43428CF7,+0x43428CF7,+0x43428CF7};
<span class="line-number"> 48</span>const vector unsigned int _cp_f_atan_q3         = {+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8};
<span class="line-number"> 49</span>const vector unsigned int _cp_f_atan_q2         = {+0x43D870C6,+0x43D870C6,+0x43D870C6,+0x43D870C6};
<span class="line-number"> 50</span>const vector unsigned int _cp_f_atan_q1         = {+0x432506EA,+0x432506EA,+0x432506EA,+0x432506EA};
<span class="line-number"> 51</span>const vector unsigned int _cp_f_atan_q0         = {+0x41C6DE22,+0x41C6DE22,+0x41C6DE22,+0x41C6DE22};
<span class="line-number"> 52</span>const vector unsigned int _cp_f_atan_p4         = {-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1};
<span class="line-number"> 53</span>const vector unsigned int _cp_f_atan_p3         = {-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07};
<span class="line-number"> 54</span>const vector unsigned int _cp_f_atan_p2         = {-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F};
<span class="line-number"> 55</span>const vector unsigned int _cp_f_atan_p1         = {-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E};
<span class="line-number"> 56</span>const vector unsigned int _cp_f_atan_p0         = {-0x409FFC03,-0x409FFC03,-0x409FFC03,-0x409FFC03};
<span class="line-number"> 57</span>const vector unsigned int _cp_f_hmorebits       = {+0x240D3131,+0x240D3131,+0x240D3131,+0x240D3131};
<span class="line-number"> 58</span>const vector unsigned int _cp_f_morebits        = {+0x248D3131,+0x248D3131,+0x248D3131,+0x248D3131};
<span class="line-number"> 59</span>
</div>]]>
    </content>
</entry>
<entry>
    <title>Branch-free implementation of half-precision (16 bit) floating point</title>
    <link rel="alternate" type="text/html" href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.cellperformance.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=57" title="Branch-free implementation of half-precision (16 bit) floating point" />
    <id>tag:www.cellperformance.com,2006:/articles//1.57</id>
    
    <published>2006-07-17T09:20:36Z</published>
    <updated>2006-12-24T19:55:19Z</updated>
    
    <summary>The goal of this project is to serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://www.cellperformance.com/mike_acton</uri>
    </author>
            <category term="CBE" />
    
    <content type="html" xml:lang="en" xml:base="http://www.cellperformance.com/articles/">
        <![CDATA[<div class="sticky-note">Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().</div>
<div class="sticky-note">Update! (17 July 06) The code has been considerably refactored. I decided to go with a single function per expression. The expressions have been reduced as a first optimization pass.</div>

<div class="subtitle">Project</div>

The goal of this project is to serve as an example of developing some relatively complex operations completely without branches: a software implementation of half-precision floating point numbers that does not use floating point hardware. This example should echo the IEEE 754 standard for floating point numbers as closely as is reasonable, including support for +/- INF, QNaN, SNaN, and denormalized numbers. However, exceptions will not be implemented.<br />
<br />
Half-precision floats are used in cases where neither the range nor the precision of 32 bit floating point numbers is needed, but where some dynamic precision is required. Two common uses are image transformation, where the range of each component (e.g. red, green, blue, alpha) is typically limited to or near [0.0,1.0], and vertex data (e.g. position, texture coordinates, color values, etc.).<br />
<br />
The main advantage of half-precision floats is their size. Beyond the considerable potential for memory savings, processing a large number of half-precision values is more cache-friendly than using 32 bit values.<br />
<br />
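To make the format concrete, here is a small reference decoder, assuming the layout documented in half.c (sign in bit 15, five exponent bits with a bias of 15, ten mantissa bits). It is only a sketch for checking values by hand: the name sketch_half_to_float_ref is hypothetical, and it deliberately uses ordinary branches and float arithmetic, so it is not the branch-free code from this project.<br />
<br />
```c
#include <stdint.h>
#include <math.h>   /* NAN and INFINITY macros only */

/* Hypothetical reference decoder: half bit pattern to float value.
   Plain branches and float arithmetic, for checking results by hand. */
static float sketch_half_to_float_ref( uint16_t h )
{
    const uint32_t sign     = ( h >> 15 ) & 0x1u;    /* bit  15     */
    const uint32_t exponent = ( h >> 10 ) & 0x1fu;   /* bits 14..10 */
    const uint32_t mantissa = h & 0x3ffu;            /* bits  9..0  */
    const float    s        = sign ? -1.0f : 1.0f;

    if ( exponent == 0x1fu )
    {
        /* all-ones exponent: INF when the mantissa is zero, else NaN */
        return mantissa ? NAN : s * INFINITY;
    }
    if ( exponent == 0 )
    {
        /* zero or denormal: mantissa * 2^-10 * 2^-14, no implicit one */
        return s * (float)mantissa * ( 1.0f / 16777216.0f );
    }
    /* normal: (1024 + mantissa) * 2^-10 * 2^(exponent - 15) */
    return s * (float)( 1024u + mantissa )
             * (float)( (uint32_t)1 << exponent )
             * ( 1.0f / 33554432.0f );
}
```
<br />
For example, 0x3C00 decodes to 1.0, 0x7BFF to 65504 (the largest finite half), and 0x0001 to 2^-24 (the smallest denormal).<br />
<br />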
The current released version (including tests) can be downloaded here: <a href="http://www.cellperformance.com/public/attachments/half.tgz">half.tgz</a><br />
<br />
<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_to_float">half_to_float()</a> Convert Half To Float (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_from_float">half_from_float()</a> Convert Float to Half (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_add">half_add()</a> Half Add (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_sub">half_sub()</a> Half Subtract (Scalar Version)<br />
<a href="http://www.cellperformance.com/articles/2006/07/branchfree_implementation_of_h_1.html#half_mul">half_mul()</a> Half Multiply (Scalar Version)<br />
<br />
]]>
        <![CDATA[<div class="sticky-note">
There is sometimes confusion about the best way to pluralize a half-precision floating-point number (half) in English. I asked <a href="http://pinker.wjh.harvard.edu/about/shortbio.html">Steven Pinker</a>, a renowned expert in the English language and our misuse of it, what he thought. Here is his reply:
<div class="quote">
"Dear Mike,<br />
<br />
Well, in my line of work I should be asking what sounds right to people and
then trying to explain that as a datum, rather than telling people what is
right. But if it was me, I would probably say "halfs," not "halves," for the
same reason as the team in Toronto is called the Maple Leafs, and as
explained in the chapter "Of Mice and Men" in my book Words and Rules.
Basically, a half-precision floating point datum is not a "half" of
anything, even metaphorically speaking, so the irregular form doesn't get
inherited by the neologism. See the chapter for more details.<br />
<br />
Best,<br />
Steve Pinker"
</div> 
</div>

<div class="sticky-note">
Note that all integer operations are encapsulated in macros (implemented as static inline functions). This both helps to enforce the rule of branch-free 
coding and will make writing the SIMD version much easier.
</div>
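<br />
As a rough illustration of why wrapping each operation helps, the sketch below builds a conditional out of nothing but shifts, masks and arithmetic. The names sketch_sign_mask, sketch_select and sketch_min are hypothetical and are not functions from half.c; the mask is formed by negating the top bit, an assumption chosen here to avoid relying on how a compiler shifts negative signed values.<br />
<br />
```c
#include <stdint.h>

/* All-ones if bit 31 of a is set, otherwise all-zeros. */
static inline uint32_t sketch_sign_mask( uint32_t a )
{
    return (uint32_t)0 - ( a >> 31 );
}

/* Returns a if bit 31 of test is set, otherwise b. No branch in the source. */
static inline uint32_t sketch_select( uint32_t test, uint32_t a, uint32_t b )
{
    const uint32_t mask = sketch_sign_mask( test );

    return ( a & mask ) | ( b & ~mask );
}

/* Minimum of two values that are both below 2^31:
   a - b wraps around and sets bit 31 exactly when a < b. */
static inline uint32_t sketch_min( uint32_t a, uint32_t b )
{
    return sketch_select( a - b, a, b );
}
```
<br />
Because every step is a plain integer operation on a whole word, the same sequence can in principle be applied lane by lane in a SIMD register, where there is no per-element branch to fall back on.<br />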
<br />
<a href="http://www.cellperformance.com/public/attachments/half.c">half.c</a><br />
<pre class="code">
<span class="line-number">  0</span>// Branch-free implementation of half-precision (16 bit) floating point
<span class="line-number">  1</span>// Copyright 2006 Mike Acton &lt;macton@gmail.com&gt;
<span class="line-number">  2</span>// 
<span class="line-number">  3</span>// Permission is hereby granted, free of charge, to any person obtaining a 
<span class="line-number">  4</span>// copy of this software and associated documentation files (the "Software"),
<span class="line-number">  5</span>// to deal in the Software without restriction, including without limitation
<span class="line-number">  6</span>// the rights to use, copy, modify, merge, publish, distribute, sublicense, 
<span class="line-number">  7</span>// and/or sell copies of the Software, and to permit persons to whom the 
<span class="line-number">  8</span>// Software is furnished to do so, subject to the following conditions:
<span class="line-number">  9</span>// 
<span class="line-number"> 10</span>// The above copyright notice and this permission notice shall be included 
<span class="line-number"> 11</span>// in all copies or substantial portions of the Software.
<span class="line-number"> 12</span>// 
<span class="line-number"> 13</span>// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
<span class="line-number"> 14</span>// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
<span class="line-number"> 15</span>// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
<span class="line-number"> 16</span>// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
<span class="line-number"> 17</span>// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
<span class="line-number"> 18</span>// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
<span class="line-number"> 19</span>// THE SOFTWARE
<span class="line-number"> 20</span>//
<span class="line-number"> 21</span>// Half-precision floating point format
<span class="line-number"> 22</span>// ------------------------------------
<span class="line-number"> 23</span>//
<span class="line-number"> 24</span>//   | Field    | Last | First | Note
<span class="line-number"> 25</span>//   |----------|------|-------|----------
<span class="line-number"> 26</span>//   | Sign     | 15   | 15    |
<span class="line-number"> 27</span>//   | Exponent | 14   | 10    | Bias = 15
<span class="line-number"> 28</span>//   | Mantissa | 9    | 0     |
<span class="line-number"> 29</span>//
<span class="line-number"> 30</span>// Compiling
<span class="line-number"> 31</span>// ---------
<span class="line-number"> 32</span>//
<span class="line-number"> 33</span>//  Preferred compile flags for GCC: 
<span class="line-number"> 34</span>//     -O3 -fstrict-aliasing -std=c99 -pedantic -Wall -Wstrict-aliasing
<span class="line-number"> 35</span>//
<span class="line-number"> 36</span>//     This file is a C99 source file, intended to be compiled with a C99 
<span class="line-number"> 37</span>//     compliant compiler. However, for the moment it remains compatible
<span class="line-number"> 38</span>//     with C++98. Therefore if you are using a compiler that poorly implements
<span class="line-number"> 39</span>//     C standards (e.g. MSVC), it may be compiled as C++. This is not
<span class="line-number"> 40</span>//     guaranteed for future versions. 
<span class="line-number"> 41</span>//
<span class="line-number"> 42</span>
<span class="line-number"> 43</span>#include "half.h"
<span class="line-number"> 44</span>
<span class="line-number"> 45</span>// Load immediate
<span class="line-number"> 46</span>static inline uint32_t _uint32_li( uint32_t a )
<span class="line-number"> 47</span>{
<span class="line-number"> 48</span>  return (a);
<span class="line-number"> 49</span>}
<span class="line-number"> 50</span>
<span class="line-number"> 51</span>// Decrement
<span class="line-number"> 52</span>static inline uint32_t _uint32_dec( uint32_t a )
<span class="line-number"> 53</span>{
<span class="line-number"> 54</span>  return (a - 1);
<span class="line-number"> 55</span>}
<span class="line-number"> 56</span>
<span class="line-number"> 57</span>// Increment
<span class="line-number"> 58</span>static inline uint32_t _uint32_inc( uint32_t a )
<span class="line-number"> 59</span>{
<span class="line-number"> 60</span>  return (a + 1);
<span class="line-number"> 61</span>}
<span class="line-number"> 62</span>
<span class="line-number"> 63</span>// Complement
<span class="line-number"> 64</span>static inline uint32_t _uint32_not( uint32_t a )
<span class="line-number"> 65</span>{
<span class="line-number"> 66</span>  return (~a);
<span class="line-number"> 67</span>}
<span class="line-number"> 68</span>
<span class="line-number"> 69</span>// Negate
<span class="line-number"> 70</span>static inline uint32_t _uint32_neg( uint32_t a )
<span class="line-number"> 71</span>{
<span class="line-number"> 72</span>  return (-a);
<span class="line-number"> 73</span>}
<span class="line-number"> 74</span>
<span class="line-number"> 75</span>// Extend sign
<span class="line-number"> 76</span>static inline uint32_t _uint32_ext( uint32_t a )
<span class="line-number"> 77</span>{
<span class="line-number"> 78</span>  return (((int32_t)a)&gt;&gt;31);
<span class="line-number"> 79</span>}
<span class="line-number"> 80</span>
<span class="line-number"> 81</span>// And
<span class="line-number"> 82</span>static inline uint32_t _uint32_and( uint32_t a, uint32_t b )
<span class="line-number"> 83</span>{
<span class="line-number"> 84</span>  return (a &amp; b);
<span class="line-number"> 85</span>}
<span class="line-number"> 86</span>
<span class="line-number"> 87</span>// Exclusive Or
<span class="line-number"> 88</span>static inline uint32_t _uint32_xor( uint32_t a, uint32_t b )
<span class="line-number"> 89</span>{
<span class="line-number"> 90</span>  return (a ^ b);
<span class="line-number"> 91</span>}
<span class="line-number"> 92</span>
<span class="line-number"> 93</span>// And with Complement
<span class="line-number"> 94</span>static inline uint32_t _uint32_andc( uint32_t a, uint32_t b )
<span class="line-number"> 95</span>{
<span class="line-number"> 96</span>  return (a &amp; ~b);
<span class="line-number"> 97</span>}
<span class="line-number"> 98</span>
<span class="line-number"> 99</span>// Or
<span class="line-number">100</span>static inline uint32_t _uint32_or( uint32_t a, uint32_t b )
<span class="line-number">101</span>{
<span class="line-number">102</span>  return (a | b);
<span class="line-number">103</span>}
<span class="line-number">104</span>
<span class="line-number">105</span>// Shift Right Logical
<span class="line-number">106</span>static inline uint32_t _uint32_srl( uint32_t a, int sa )
<span class="line-number">107</span>{
<span class="line-number">108</span>  return (a &gt;&gt; sa);
<span class="line-number">109</span>}
<span class="line-number">110</span>
<span class="line-number">111</span>// Shift Left Logical (sa must be in the range 0..31)
<span class="line-number">112</span>static inline uint32_t _uint32_sll( uint32_t a, int sa )
<span class="line-number">113</span>{
<span class="line-number">114</span>  return (a &lt;&lt; sa);
<span class="line-number">115</span>}
<span class="line-number">116</span>
<span class="line-number">117</span>// Add
<span class="line-number">118</span>static inline uint32_t _uint32_add( uint32_t a, uint32_t b )
<span class="line-number">119</span>{
<span class="line-number">120</span>  return (a + b);
<span class="line-number">121</span>}
<span class="line-number">122</span>
<span class="line-number">123</span>// Subtract
<span class="line-number">124</span>static inline uint32_t _uint32_sub( uint32_t a, uint32_t b )
<span class="line-number">125</span>{
<span class="line-number">126</span>  return (a - b);
<span class="line-number">127</span>}
<span class="line-number">128</span>
<span class="line-number">129</span>// Multiply
<span class="line-number">130</span>static inline uint32_t _uint32_mul( uint32_t a, uint32_t b )
<span class="line-number">131</span>{
<span class="line-number">132</span>  return (a * b);
<span class="line-number">133</span>}
<span class="line-number">134</span>
<span class="line-number">135</span>// Select on Sign bit: returns a if the sign bit of test is set, otherwise b
<span class="line-number">136</span>static inline uint32_t _uint32_sels( uint32_t test, uint32_t a, uint32_t b )
<span class="line-number">137</span>{
<span class="line-number">138</span>  const uint32_t mask   = _uint32_ext( test );
<span class="line-number">139</span>  const uint32_t sel_a  = _uint32_and(  a,     mask  );
<span class="line-number">140</span>  const uint32_t sel_b  = _uint32_andc( b,     mask  );
<span class="line-number">141</span>  const uint32_t result = _uint32_or(   sel_a, sel_b );
<span class="line-number">142</span>
<span class="line-number">143</span>  return (result);
<span class="line-number">144</span>}
<span class="line-number">145</span>
<span class="line-number">146</span>// Select Bits on mask: takes each bit from a where mask is 1, from b where mask is 0
<span class="line-number">147</span>static inline uint32_t _uint32_selb( uint32_t mask, uint32_t a, uint32_t b )
<span class="line-number">148</span>{
<span class="line-number">149</span>  const uint32_t sel_a  = _uint32_and(  a,     mask  );
<span class="line-number">150</span>  const uint32_t sel_b  = _uint32_andc( b,     mask  );
<span class="line-number">151</span