Recent changes to this wiki:

Add q3a + ogt shaders news.
diff --git a/index.mdwn b/index.mdwn
index 8799d58..d16e142 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,7 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2013-03-18: Q3A now runs with shaders generated by open source tools! Read all about it [at libv's blog](http://libv.livejournal.com/24402.html)
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.

diff --git a/index.mdwn b/index.mdwn
index 89e6fde..8799d58 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -65,6 +65,7 @@ Lima Documents
 * [[Fragment+Shader+Backend]]
 * [[Render State]]
 * [[Texel Formats]]
+* [[Compiling Q3A Shaders]]
 
 ## Contribute
 ===

diff --git a/Compiling_Q3A_Shaders.mdwn b/Compiling_Q3A_Shaders.mdwn
new file mode 100644
index 0000000..422a599
--- /dev/null
+++ b/Compiling_Q3A_Shaders.mdwn
@@ -0,0 +1,21 @@
+This page explains how to use open-gpu-tools to generate the shaders that the limare port of Quake 3 Arena needs, so that it can run without the binary compiler. The shaders have been hand-converted from ESSL (the input language of the binary compiler) to a custom assembly/IR, so some experimentation and reading of the source is needed to understand how they work.
+
+## Setting up open-gpu-tools
+
+Clone [my open-gpu-tools tree](https://gitorious.org/~cwabbott/open-gpu-tools/cwabbotts-open-gpu-tools), and switch to the ir branch. Build libcommon.so by running make in the common directory, then do the same in the ir_tools and assemble directories.
+
+## Fragment shaders
+
+The fragment shaders are written in assembly, meaning that you have to use the assemble tool to generate a working MBS file. To assemble a shader ~/my_shader.in into an MBS file ~/my_shader.mbs, from the assemble directory do:
+
+    ./assemble -a lima_pp -s verbose -t fragment -o ~/my_shader.mbs ~/my_shader.in
+
+## Vertex shaders
+
+The vertex shaders are compiled from gp_ir, meaning you need to use the ir_tools to compile them to MBS. To parse an input shader ~/my_shader.in into a binary gp_ir file ~/my_shader.ir, from the ir_tools directory do:
+
+    ./ir_parse -i lima_gp_ir -o ~/my_shader.ir ~/my_shader.in
+
+And to compile that to MBS, do:
+
+    ./ir_lower -i lima_gp_ir -a lima_gp -f mbs -o ~/my_shader.mbs ~/my_shader.ir
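When converting many Q3A shaders, the three tool invocations above can be wrapped in a small shell function. This is a minimal sketch and not part of open-gpu-tools. The OGT checkout location, the RUN dry-run knob and the compile_shader/abspath names are hypothetical. It also assumes the vertex tools live in ir_tools/ and, like the page, runs each tool from its own directory.

```shell
#!/bin/sh
# Sketch only: wraps the open-gpu-tools invocations from this page.
# Assumptions (not from the page): OGT points at a built checkout of the ir
# branch, the vertex tools live in ir_tools/, and RUN=echo gives a dry run.
OGT=${OGT:-"$HOME/open-gpu-tools"}
RUN=${RUN:-}

# Make a path absolute, since the tools are run from their own directories.
abspath() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$PWD/$1" ;;
    esac
}

# Usage: compile_shader fragment|vertex INPUT.in OUTPUT.mbs
compile_shader() {
    kind=$1
    src=$(abspath "$2")
    dst=$(abspath "$3")
    case "$kind" in
        fragment)
            # Fragment shaders are hand-written lima_pp assembly.
            (cd "$OGT/assemble" &&
                $RUN ./assemble -a lima_pp -s verbose -t fragment -o "$dst" "$src")
            ;;
        vertex)
            # Vertex shaders are gp_ir: parse to a binary .ir, then lower to MBS.
            ir=${dst%.mbs}.ir
            (cd "$OGT/ir_tools" &&
                $RUN ./ir_parse -i lima_gp_ir -o "$ir" "$src" &&
                $RUN ./ir_lower -i lima_gp_ir -a lima_gp -f mbs -o "$dst" "$ir")
            ;;
        *)
            echo "usage: compile_shader fragment|vertex INPUT OUTPUT" >&2
            return 1
            ;;
    esac
}
```

For example, `compile_shader vertex ~/my_shader.in ~/my_shader.mbs` runs ir_parse and then ir_lower, keeping the intermediate ~/my_shader.ir next to the output. Running it with RUN=echo first shows exactly what would be executed.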

Move into setting up X
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index e29580e..4c8b0c3 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -193,5 +193,42 @@ locale-gen en_US.UTF-8
 </pre>
 or by whichever locale is listed as LANG when running locale.
 
+# Setting up X
+
+Create /etc/X11/xorg.conf with the following content:
+<pre>
+Section "Device"
+        identifier "FBDEV"
+        Driver "fbdev"
+        Option "fbdev" "/dev/fb6"
+EndSection
+
+Section "Screen"
+        identifier "Default Screen"
+        Device "FBDEV"
+        DefaultDepth 16
+EndSection
+</pre>
+
+You can now start the display manager:
+<pre>
+lightdm&
+</pre>
+
+I haven't yet figured out how the strange Exynos fb drivers can be coaxed into doing 24-bit colour.
+
 # Mali binaries
 
+Install es2gears and es2_info through:
+
+<pre>
+apt-get install mesa-utils-extra
+</pre>
+
+This drags in the full Mesa, which includes an OpenGL ES 2.0 library that we really do not need. Move Mesa's EGL library directory out of the way:
+
+<pre>
+mv /usr/lib/arm-linux-gnueabihf/mesa-egl /usr/lib/arm-linux-gnueabihf/.mesa-egl
+</pre>
+
+Nasty, but works.
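The framebuffer node can differ between kernels, so a minimal sketch that only writes the xorg.conf above when the device actually exists may save some head-scratching. FBDEV, XORG_CONF and write_xorg_conf are hypothetical names, and /dev/fb6 is just the default taken from this page. Keywords are capitalised as in the Xorg documentation; xorg.conf keywords are case-insensitive, so this is the same configuration.

```shell
#!/bin/sh
# Sketch only: generate the fbdev xorg.conf from this page.
# Assumptions (not from the page): FBDEV and XORG_CONF can be overridden,
# e.g. for testing; /dev/fb6 is what this page uses on the ODROID-X2.
FBDEV=${FBDEV:-/dev/fb6}
XORG_CONF=${XORG_CONF:-/etc/X11/xorg.conf}

write_xorg_conf() {
    # Refuse to write a config pointing at a framebuffer that is not there.
    if [ ! -e "$FBDEV" ]; then
        echo "framebuffer $FBDEV not found, check ls /dev/fb*" >&2
        return 1
    fi
    cat > "$XORG_CONF" <<EOF
Section "Device"
        Identifier "FBDEV"
        Driver "fbdev"
        Option "fbdev" "$FBDEV"
EndSection

Section "Screen"
        Identifier "Default Screen"
        Device "FBDEV"
        DefaultDepth 16
EndSection
EOF
}
```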

Add random fluff for getting ALIP running.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 8fb3b3f..e29580e 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -138,4 +138,60 @@ Then run the following to create boot.scr, the file that u-boot looks for:
 mkimage -A arm -O linux -T script -C none -a 0 -e 0 -n "BOOT Script for ODROID-X2" -d boot.txt boot.scr
 </pre>
 
+# First boot setup
+
+I experienced some resolver issues, apparently because the DHCP nameserver info was not passed on properly (NetworkManager?), so I added the following to /etc/resolv.conf to override things manually:
+
+<pre>
+nameserver 192.168.x.x
+</pre>
+
+I then went on to install the most important package for any network-connected device:
+<pre>
+apt-get update
+apt-get install openssh-server
+</pre>
+
+I could then ssh into the device and start changing things. First, give root a password:
+<pre>
+sudo -s
+passwd
+</pre>
+
+Now you can just ssh in as root.
+
+<pre>
+echo "odroid" > /etc/hostname
+</pre>
+
+Log out and in again to see this take effect.
+
+Then drop the linaro user and add your own. Make sure it is added to the video group.
+<pre>
+userdel -r linaro
+adduser user
+adduser user video
+</pre>
+
+You will see loads of locale warnings when running any apt command:
+<pre>
+perl: warning: Setting locale failed.
+perl: warning: Please check that your locale settings:
+	LANGUAGE = (unset),
+	LC_ALL = (unset),
+	LANG = "en_US.UTF-8"
+    are supported and installed on your system.
+perl: warning: Falling back to the standard locale ("C").
+locale: Cannot set LC_CTYPE to default locale: No such file or directory
+locale: Cannot set LC_MESSAGES to default locale: No such file or directory
+locale: Cannot set LC_ALL to default locale: No such file or directory
+</pre>
+
+You can fix these by running:
+<pre>
+locale-gen en_US.UTF-8
+</pre>
+or by whichever locale is listed as LANG when running locale.
+
 # Mali binaries
+
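Most of the first-boot steps above can be collected into one script that is run as root. This is a sketch, not a tested recipe. The NEW_HOSTNAME, NEW_USER, HOSTNAME_FILE and RUN knobs are hypothetical. The interactive root password step (sudo -s; passwd) and the resolv.conf workaround are left out. This sketch also runs locale-gen first, so that the apt commands after it no longer warn.

```shell
#!/bin/sh
# Sketch only: the first-boot steps from this page as one script, run as root.
# Assumptions (not from the page): the NEW_HOSTNAME, NEW_USER, HOSTNAME_FILE
# and RUN knobs; RUN=echo prints the commands instead of running them.
NEW_HOSTNAME=${NEW_HOSTNAME:-odroid}
NEW_USER=${NEW_USER:-user}
HOSTNAME_FILE=${HOSTNAME_FILE:-/etc/hostname}
RUN=${RUN:-}

first_boot() {
    # Generate the locale named in LANG first, so that apt stops warning.
    $RUN locale-gen "${LANG:-en_US.UTF-8}" || return 1
    $RUN apt-get update || return 1
    $RUN apt-get install -y openssh-server || return 1
    # The new hostname shows up after logging out and in again.
    printf '%s\n' "$NEW_HOSTNAME" > "$HOSTNAME_FILE" || return 1
    # Replace the stock linaro user with your own, in the video group.
    $RUN userdel -r linaro || return 1
    $RUN adduser "$NEW_USER" || return 1    # interactive: asks for a password
    $RUN adduser "$NEW_USER" video
}
```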

Add boot.scr creation information, and make the boot partition vfat.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 475aa35..8fb3b3f 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -29,17 +29,18 @@ I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disk identifier: 0x834d1732
 
         Device Boot      Start         End      Blocks   Id  System
-/dev/mmcblk0p1            3072      527359      262144   83  Linux
+/dev/mmcblk0p1            3072      527359      262144    b  W95 FAT32
 /dev/mmcblk0p2          527360    31116287    15294464   83  Linux
 </pre>
 
-The important thing to note is that the first partition should start at 3072, as the space underneath is used by the u-boot and trustedzone binaries. It also might pay to provide a separate boot partition, with kernel images and u-boot script files. Apart from that, you are free to partition as you like, as long as you update u-boot script accordingly.
+The important thing to note is that the first partition should start at sector 3072, as the space below it is used by the u-boot and TrustZone binaries, and that it should be a FAT-based boot partition. Apart from that, you are free to partition as you like, as long as you update the u-boot script accordingly.
 
 Note that this for a 16GB card, actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given for one big root filesystem.
 
 Now format all partitions:
 <pre>
-mkfs.ext3 /dev/mmcblkX
+mkfs.vfat /dev/mmcblkXp1
+mkfs.ext3 /dev/mmcblkXp2
 </pre>
 
 # U-boot setup Pt.1
@@ -125,4 +126,16 @@ cp arch/arm/boot/zImage PATH_TO_BOOTFS
 </pre>
 # U-boot setup Pt.2
 
+Now we create a file called boot.txt in our boot partition, and it should contain the following:
+<pre>
+setenv bootargs 'root=/dev/mmcblk0p2 rw rootwait console=tty0 console=ttySAC1,115200n8 mem=2047M'
+fatload mmc 0:1 0x40008000 zImage
+bootm 0x40008000
+</pre>
+
+Then run the following to create boot.scr, the file that u-boot looks for:
+<pre>
+mkimage -A arm -O linux -T script -C none -a 0 -e 0 -n "BOOT Script for ODROID-X2" -d boot.txt boot.scr
+</pre>
+
 # Mali binaries
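Because boot.txt and boot.scr must stay in sync, both can be regenerated in one step. This is a minimal sketch, assuming BOOTFS is where the FAT boot partition is mounted and that mkimage (from the u-boot tools) is installed. The BOOTFS, RUN and make_boot_scr names are hypothetical. In this revision partition 1 is formatted vfat, so the sketch loads the kernel with u-boot's fatload rather than ext2load, which only understands ext2-style filesystems.

```shell
#!/bin/sh
# Sketch only: write boot.txt and compile it into boot.scr.
# Assumptions (not from the page): BOOTFS mount point, RUN=echo dry-runs mkimage.
BOOTFS=${BOOTFS:-/mnt/boot}
RUN=${RUN:-}

make_boot_scr() {
    # Same boot script as above, but loading zImage from the FAT partition.
    cat > "$BOOTFS/boot.txt" <<'EOF'
setenv bootargs 'root=/dev/mmcblk0p2 rw rootwait console=tty0 console=ttySAC1,115200n8 mem=2047M'
fatload mmc 0:1 0x40008000 zImage
bootm 0x40008000
EOF
    $RUN mkimage -A arm -O linux -T script -C none -a 0 -e 0 \
        -n "BOOT Script for ODROID-X2" \
        -d "$BOOTFS/boot.txt" "$BOOTFS/boot.scr"
}
```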

Add line for installing zImage.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index c3f370b..475aa35 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -119,6 +119,10 @@ Once that's done, run:
 make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=/PATH_TO_ROOTFS/ modules_install
 </pre>
 
+You can now copy the kernel image to the boot partition:
+<pre>
+cp arch/arm/boot/zImage PATH_TO_BOOTFS
+</pre>
 # U-boot setup Pt.2
 
 # Mali binaries
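The kernel steps on this page (clone, defconfig, build, module install, zImage copy) can be chained into one function. This is a sketch under stated assumptions. KDIR, ROOTFS, BOOTFS, JOBS and RUN are hypothetical knobs, and the cross toolchain must already be on PATH. It uses make -C so it never has to cd into the tree, which also keeps the RUN=echo dry run working before anything has been cloned.

```shell
#!/bin/sh
# Sketch only: the kernel build steps from this page in one place.
# Assumptions (not from the page): KDIR, ROOTFS, BOOTFS, JOBS and RUN
# (RUN=echo for a dry run); arm-linux-gnueabihf- tools are on PATH.
KDIR=${KDIR:-kernel}
ROOTFS=${ROOTFS:-/mnt/rootfs}
BOOTFS=${BOOTFS:-/mnt/boot}
JOBS=${JOBS:-5}
RUN=${RUN:-}

# Run make in the kernel tree with the cross-compile settings used on this page.
kmake() {
    $RUN make -C "$KDIR" ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- "$@"
}

build_kernel() {
    # Shallow clone of the odroid-3.0.y branch, skipped if KDIR already exists.
    if [ ! -d "$KDIR" ]; then
        $RUN git clone --depth 1 https://github.com/hardkernel/linux.git \
            -b odroid-3.0.y "$KDIR" || return 1
    fi
    kmake odroidx2_ubuntu_mali_defconfig || return 1
    kmake -j"$JOBS" zImage modules || return 1
    kmake INSTALL_MOD_PATH="$ROOTFS" modules_install || return 1
    $RUN cp "$KDIR/arch/arm/boot/zImage" "$BOOTFS"/
}
```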

Add module installation.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 41b5f15..c3f370b 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -114,6 +114,11 @@ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j5 zImage modules
 
 Now go and make some tea :)
 
+Once that's done, install the kernel modules into the root filesystem:
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=/PATH_TO_ROOTFS/ modules_install
+</pre>
+
 # U-boot setup Pt.2
 
 # Mali binaries

Add first part of kernel build info.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index a272e2c..41b5f15 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -73,6 +73,47 @@ Pick a root filesystem laid out for arm hardfloat. My current preference is a Li
 
 # Kernel build
 
+First, you need a clone of the odroid kernel.
+
+You can either clone an existing kernel tree, and then fetch the odroid one on top:
+
+<pre>
+git clone /home/user/kernel/linux-2.6/ kernel
+cd kernel/
+git remote rm origin
+git remote add origin https://github.com/hardkernel/linux.git
+git fetch
+git checkout odroid-3.0.y
+</pre>
+
+Or you can just make a shallow clone of the odroid branch, without downloading the full (and huge) git history.
+
+<pre>
+git clone --depth 1 https://github.com/hardkernel/linux.git -b odroid-3.0.y kernel
+</pre>
+
+Make sure that your cross toolchain is in your path.
+
+You can now select one of the many odroid defconfigs, although I personally find the "ubuntu" in their names very short-sighted:
+
+<pre>
+ls arch/arm/configs/odroid*ubuntu*
+</pre>
+
+Here, we pick the odroid-x2 with mali enabled:
+
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- odroidx2_ubuntu_mali_defconfig 
+</pre>
+
+After this we can build our kernel:
+
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j5 zImage modules
+</pre>
+
+Now go and make some tea :)
+
 # U-boot setup Pt.2
 
 # Mali binaries

Add rootfs description.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index a61288f..a272e2c 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -1,5 +1,7 @@
 This document gathers all the necessary info to set up an SD-Card with a bootable gnu/linux, with your own kernel and with mali binaries installed.
 
+Order your ODROID with the UART module. As with any ARM device, a serial console is indispensable for debugging boot failures. Make sure that you are using a recent enough cp210x driver: this driver only became usable after Linux kernel 3.2, but the fixes to it can easily be backported. Ask libv on IRC if you need more info.
+
 # SD-Card
 
 First off, clear out the first bits of the SD-Card for sanity's sake:
@@ -35,10 +37,42 @@ The important thing to note is that the first partition should start at 3072, as
 
 Note that this for a 16GB card, actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given for one big root filesystem.
 
-# U-boot setup
+Now format all partitions:
+<pre>
+mkfs.ext3 /dev/mmcblkX
+</pre>
+
+# U-boot setup Pt.1
+
+Samsung currently does not provide sources for its build of u-boot, so both Samsung and Hardkernel are violating the GPL.
+
+You can download a tarball with all the u-boot binaries from [here](http://www.mdrjr.net/odroid/mirror/BSPs/Alpha4/unpacked/boot.tar.gz)
+
+Untar this:
+
+<pre>
+tar -zxvf boot.tar.gz
+</pre>
+
+Then make the script in there executable:
+<pre>
+chmod +x sd_fusing.sh
+</pre>
+
+And now make this script install all the blobs to your SD-Card:
+
+<pre>
+./sd_fusing.sh /dev/mmcblkX
+</pre>
+
+After that, your SD-Card should be bootable: if you have a UART, you should already be able to see u-boot attempting to load.
+
+# Root filesystem
+
+Pick a root filesystem built for ARM hard-float. My current preference is a Linaro ALIP style image, with lightdm and xfce. It can be downloaded [here](https://snapshots.linaro.org/quantal/images/alip). Once you have downloaded it, you can simply untar it in the root partition of your SD-Card. After untarring, you need to move everything in the binary directory one level up; the image is laid out like this to protect people from overwriting their main filesystem. Do not forget to remove SHA256SUMS :)
 
 # Kernel build
 
-# Root fs
+# U-boot setup Pt.2
 
 # Mali binaries
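Unpacking the ALIP image and flattening its binary directory can be scripted. This is a minimal sketch that assumes GNU tar, find and mv (for mv -t) and is run as root so file ownership is kept. The unpack_rootfs name and its arguments are hypothetical. Unlike a plain `mv binary/* .`, it also moves hidden entries.

```shell
#!/bin/sh
# Sketch only: unpack a Linaro ALIP tarball into the mounted root partition
# and flatten its binary/ directory.
# Assumptions (not from the page): function name and arguments; GNU tools;
# run as root so that file ownership is preserved.
unpack_rootfs() {
    tarball=$1
    rootfs=$2
    # -p keeps permissions; GNU tar detects the compression by itself.
    tar -xpf "$tarball" -C "$rootfs" || return 1
    # Move everything under binary/ one level up, hidden entries included.
    find "$rootfs/binary" -mindepth 1 -maxdepth 1 -exec mv -t "$rootfs" {} + || return 1
    rmdir "$rootfs/binary" || return 1
    # The checksum file is not part of the filesystem.
    rm -f "$rootfs/SHA256SUMS"
}
```

For example: `unpack_rootfs alip-armhf.tar.gz /mnt/rootfs`, where both paths are placeholders for your download and mount point.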

Fill out SD-Card section.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 33d63c2..a61288f 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -2,8 +2,43 @@ This document gathers all the necessary info to set up an SD-Card with a bootabl
 
 # SD-Card
 
+First off, clear out the first bits of the SD-Card for sanity's sake:
+
+<code>
+dd if=/dev/zero of=/dev/mmcblkX bs=1M count=5
+</code>
+
+Then set up some partitions on the SD-Card:
+
+<code>
+fdisk /dev/mmcblkX
+</code>
+
+And work it until it looks somewhat like this:
+
+<pre>
+Command (m for help): p
+
+Disk /dev/mmcblk0: 15.9 GB, 15931539456 bytes
+4 heads, 16 sectors/track, 486192 cylinders, total 31116288 sectors
+Units = sectors of 1 * 512 = 512 bytes
+Sector size (logical/physical): 512 bytes / 512 bytes
+I/O size (minimum/optimal): 512 bytes / 512 bytes
+Disk identifier: 0x834d1732
+
+        Device Boot      Start         End      Blocks   Id  System
+/dev/mmcblk0p1            3072      527359      262144   83  Linux
+/dev/mmcblk0p2          527360    31116287    15294464   83  Linux
+</pre>
+
+The important thing to note is that the first partition should start at sector 3072, as the space below it is used by the u-boot and TrustZone binaries. It might also pay to provide a separate boot partition, with kernel images and u-boot script files. Apart from that, you are free to partition as you like, as long as you update the u-boot script accordingly.
+
+Note that this is for a 16GB card; actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given to one big root filesystem.
+
 # U-boot setup
 
 # Kernel build
 
+# Root fs
+
 # Mali binaries
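The interactive fdisk session can also be replaced by a non-interactive sfdisk script that reproduces the listing above. Partition 1 is 524288 sectors, i.e. 256 MiB, starting at sector 3072. The sketch uses the FAT32 (type b) boot partition that a later revision of this page switches to. Assumptions: util-linux sfdisk 2.26 or newer, which accepts named-field input counted in sectors. DEV names the whole card, partition names follow the mmcblk "p1" style, and the function names and RUN=echo dry-run knob are hypothetical. Every command here is destructive, so double-check DEV.

```shell
#!/bin/sh
# Sketch only: partition and format the SD-Card as described on this page.
# Assumptions (not from the page): DEV is the whole card, util-linux
# sfdisk >= 2.26, mmcblk-style partition names, RUN=echo for a dry run.
DEV=${DEV:-/dev/mmcblk0}
RUN=${RUN:-}

# 256 MiB FAT32 boot partition at sector 3072, rest of the card as Linux root.
sd_layout() {
    cat <<'EOF'
label: dos
start=3072, size=524288, type=b
start=527360, type=83
EOF
}

prepare_sd() {
    # Clear the first 5MB, where the partition table and boot blobs live.
    $RUN dd if=/dev/zero of="$DEV" bs=1M count=5 || return 1
    sd_layout | $RUN sfdisk "$DEV" || return 1
    $RUN mkfs.vfat "${DEV}p1" || return 1
    $RUN mkfs.ext3 "${DEV}p2"
}
```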

Initial structure.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
new file mode 100644
index 0000000..33d63c2
--- /dev/null
+++ b/OdroidSetup.mdwn
@@ -0,0 +1,9 @@
+This document gathers all the necessary info to set up an SD-Card with a bootable gnu/linux, with your own kernel and with mali binaries installed.
+
+# SD-Card
+
+# U-boot setup
+
+# Kernel build
+
+# Mali binaries

Fill out ODROID section. Links to hardkernel are deliberately not put in place, hardkernel does not wish to support the lima driver project.
diff --git a/Devices.mdwn b/Devices.mdwn
index c941665..2b612d9 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -35,13 +35,19 @@ According to the [spec sheet](http://www.pointofview-online.com/showroom.php?sho
 
 # Exynos 4 (**GPL VIOLATOR**)
 
+These SoCs are the best performing Mali-400 devices out there. They are proper speed demons. The Exynos 42xx series has a dual-core A9, whereas the Exynos 44xx series has a quad-core A9. All come with a Mali-400MP4.
+
 All exynos 4 devices come with binary only u-boot. This means that Samsung, and its device makers, are violating the GPL.
 
 ## Origen Board (**GPL VIOLATOR**)
 
 ## ODROIDs (**GPL VIOLATOR**)
 
+The ODROIDs are small developer boards with many possible connections. The ODROID-X2/U2 is hyperfast: it can clock its four A9 cores to 2GHz, and its Mali-400MP4 can clock up to 640MHz. This makes for a nice high-end benchmark platform, and a good comparison for the comparatively meek A10.
+
+Hardkernel tries to portray itself as open source friendly, but it still has a lot to learn. It provides some sort of crazy Android and Ubuntu pre-made SD-card images, and even hands out Mali binaries for Ubuntu. Hardkernel knows the pain of getting the Mali binaries built and integrated, yet it is not interested in cooperating with our project. It officially claims to have "community based" support, and we all know what that means. Since these devices are developer boards, they have a much longer life span than your average mobile phone or tablet. In the mid to long term, Hardkernel, well, Hardkernel's customers, will end up depending on the support of the lima driver project.
 
+Here is [[some_information|OdroidSetup]] on how to set up your own SD card with a custom built kernel and with mali binaries (which we need for reverse engineering).
 
 ## Samsung Galaxy S II (**GPL VIOLATOR**)
 

Expand allwinner section and mark samsung as a gpl violator.
diff --git a/Devices.mdwn b/Devices.mdwn
index da1b79a..c941665 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -4,11 +4,15 @@ Be careful where you buy, most cheap shops will not ship from your country but w
 
 # AllWinner A10
 
-The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali. There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly.
+The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali.
 
-## Cubieboard (Open Source Hardware!)
+These SoCs have a single Cortex-A8 core clocked at a little over 1GHz, and come with lots of expansion possibilities, even SATA. They feature a Mali-400MP1, so they are not stellar performers, but they more than make up for that in availability, price, and openness. Allwinner itself does not directly support open source software, and would be a GPL violator in itself. Luckily, its lack of control over its device makers let the necessary code fall through the cracks, and it is the most compliant of any Chinese SoC maker today.
 
-The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is currently only available for pre-order.
+There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly. Check out [the main linux-sunxi page](http://linux-sunxi.org/Main_Page) to find out about the supported, and sometimes even fully open source, hardware available.
+
+## Cubieboard (**Open Source Hardware!**)
+
+The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD.
 
 ## Gooseberry
 
@@ -18,8 +22,6 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
 
-## Mele A1000
-
 # AMLogic 8726-M (Mali 400)
 
 ## Zenithink ZT-280 (**GPL VIOLATOR**)
@@ -31,15 +33,19 @@ The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be
 
 According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competetively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
 
-# Exynos 4
+# Exynos 4 (**GPL VIOLATOR**)
+
+All exynos 4 devices come with binary only u-boot. This means that Samsung, and its device makers, are violating the GPL.
+
+## Origen Board (**GPL VIOLATOR**)
+
+## ODROIDs (**GPL VIOLATOR**)
 
-## Origen Board
 
-## ODROID
 
-## Samsung Galaxy S II
+## Samsung Galaxy S II (**GPL VIOLATOR**)
 
-## Samsung Galaxy S III
+## Samsung Galaxy S III (**GPL VIOLATOR**)
 
 # Exynos 5
 

Add link to linux-sunxi, and list allwinner first. It is our prime target today.
diff --git a/Devices.mdwn b/Devices.mdwn
index 5f4e858..da1b79a 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -2,20 +2,11 @@ This page lists some of the available devices with a Mali GPU, together with som
 
 Be careful where you buy, most cheap shops will not ship from your country but will ship from China. This means that you might end up paying customs, and end up wasting some time at the customs office.
 
-# AMLogic 8726-M (Mali 400)
-
-## Zenithink ZT-280 (**GPL VIOLATOR**)
-
-The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
-
-
-## Point of View ProTab 2XXL (**GPL VIOLATOR**)
-
-According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competetively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
-
 # AllWinner A10
 
-## Cubieboard
+The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali. There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly.
+
+## Cubieboard (Open Source Hardware!)
 
 The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is currently only available for pre-order.
 
@@ -29,6 +20,17 @@ The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20B
 
 ## Mele A1000
 
+# AMLogic 8726-M (Mali 400)
+
+## Zenithink ZT-280 (**GPL VIOLATOR**)
+
+The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
+
+
+## Point of View ProTab 2XXL (**GPL VIOLATOR**)
+
+According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competitively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
+
 # Exynos 4
 
 ## Origen Board

Carlos is not an anonymous user, but just a useless user.
This reverts commit 71a2ca94db7a71c268455a744d2708bb090ed774
diff --git a/index.mdwn b/index.mdwn
index 150cc4f..89e6fde 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===

Carlos is not an anonymous user, but just a useless user.
This reverts commit da881376fedc3e8ed8dcc4377b62f6ae656a643e
diff --git a/index.mdwn b/index.mdwn
index 32fedd9..150cc4f 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===
@@ -82,7 +82,3 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER

Carlos is not an anonymous user, but just a useless user.
This reverts commit f80dab29f9403a8e1c6da8028d61db2c71a938b8
diff --git a/index.mdwn b/index.mdwn
index 246f68c..32fedd9 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -5,27 +5,11 @@ Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs
 
 The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 ## News
 ===
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach
@@ -37,14 +21,6 @@ ANONYMOUS USER
 * 2012-02-03: First public renders of [smoothed triangle, smoothed strip, smoothed fan, flat quad, triangle quad, smoothed lighted rotated cube](http://limadriver.org/content)
 * 2012-01-24: A new name has been chosen for the project: remali now becomes Lima! We now have a gitorious project, there is the #lima channel on freenode. A mailing list will be created soon.
 * 2012-01-23: [Codethink](http://www.codethink.co.uk/) puts out a [press release](http://www.prweb.com/releases/2012/1/prweb9130318.htm) for the business world. This is definitely not vaporware!
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 * 2012-01-21: Talk appears on [the FOSDEM schedule.](http://fosdem.org/2012/schedule/event/mali "Liberating ARM's Mali GPU")[The cat is out of the bag!](http://twitter.com/#!/codethink/status/160803588929626112) Story published by [phoronix](http://www.phoronix.com/vr.php?view=16971), hits [slashdot](http://linux.slashdot.org/story/12/01/21/0935248/coming-soon-an-open-source-reverse-engineered-mali-gpu-driver), [golem](http://www.golem.de/1201/89274.html), [pro-linux](http://www.pro-linux.de/news/1/17948/freier-treiber-fuer-mali-grafikprozessoren-angekuendigt.html) and [tweakers](http://tweakers.net/nieuws/79485/opensourcedriver-voor-arms-mali-gpu-in-ontwikkeling.html).
 
 ## Software
@@ -58,14 +34,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 ===
 
 ### [Mali-400](Hardware#Mali-400):
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
 
-===
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
@@ -77,14 +46,7 @@ ANONYMOUS USER
 
 ## Documentation
 ===
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
 
-===
 The documentation is currently kept in the wiki, pages of interest are:
 
 Original (Falanx) datasheets:
@@ -106,14 +68,6 @@ Lima Documents
 
 ## Contribute
 ===
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 
 The Lima driver currently only has some preliminary and highly experimental support. This experimental phase is necessary to gain a full and complete understanding of how the Mali GPUs work. Once more is known, an actual graphics driver (most likely based off of Mesa/Gallium) can be written. There is a lot of interesting work that still needs to be done!
 
@@ -132,5 +86,3 @@ PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
 
 GREETINGS,
 ANONYMOUS USER
-
-===

Carlos is not an anonymous user, but just a useless user.
This reverts commit a23a8009ccc54368db33c8c1b96a68c4de3e4d9e
diff --git a/index.mdwn b/index.mdwn
index 9a6cd4a..246f68c 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,11 +1,3 @@
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 # **Lima**: An open source graphics driver for ARM Mali GPUs
 ===
 

Carlos is not an anonymous user, but just a useless user.
This reverts commit ae53600143425a7d6bf7abebb26c9e6dc16797ae
diff --git a/index.mdwn b/index.mdwn
index 3573463..9a6cd4a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -23,6 +23,8 @@ ANONYMOUS USER
 ===
 ## News
 ===
+* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
+* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 ===
 

This reverts commit 8927230e9df7a8c63c997baa3bf707108e600842
diff --git a/index.mdwn b/index.mdwn
index 9a6cd4a..3573463 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -23,8 +23,6 @@ ANONYMOUS USER
 ===
 ## News
 ===
-* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
-* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 ===
 

diff --git a/index.mdwn b/index.mdwn
index 246f68c..9a6cd4a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,3 +1,11 @@
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 # **Lima**: An open source graphics driver for ARM Mali GPUs
 ===
 

diff --git a/index.mdwn b/index.mdwn
index 32fedd9..246f68c 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -5,11 +5,27 @@ Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs
 
 The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 ## News
 ===
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach
@@ -21,6 +37,14 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 * 2012-02-03: First public renders of [smoothed triangle, smoothed strip, smoothed fan, flat quad, triangle quad, smoothed lighted rotated cube](http://limadriver.org/content)
 * 2012-01-24: A new name has been chosen for the project: remali now becomes Lima! We now have a gitorious project, there is the #lima channel on freenode. A mailing list will be created soon.
 * 2012-01-23: [Codethink](http://www.codethink.co.uk/) puts out a [press release](http://www.prweb.com/releases/2012/1/prweb9130318.htm) for the business world. This is definitely not vaporware!
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 * 2012-01-21: Talk appears on [the FOSDEM schedule.](http://fosdem.org/2012/schedule/event/mali "Liberating ARM's Mali GPU")[The cat is out of the bag!](http://twitter.com/#!/codethink/status/160803588929626112) Story published by [phoronix](http://www.phoronix.com/vr.php?view=16971), hits [slashdot](http://linux.slashdot.org/story/12/01/21/0935248/coming-soon-an-open-source-reverse-engineered-mali-gpu-driver), [golem](http://www.golem.de/1201/89274.html), [pro-linux](http://www.pro-linux.de/news/1/17948/freier-treiber-fuer-mali-grafikprozessoren-angekuendigt.html) and [tweakers](http://tweakers.net/nieuws/79485/opensourcedriver-voor-arms-mali-gpu-in-ontwikkeling.html).
 
 ## Software
@@ -34,7 +58,14 @@ Documentation for the shader compiler, and the initial investigation of the inst
 ===
 
 ### [Mali-400](Hardware#Mali-400):
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
 
+===
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
@@ -46,7 +77,14 @@ Documentation for the shader compiler, and the initial investigation of the inst
 
 ## Documentation
 ===
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
 
+===
 The documentation is currently kept in the wiki, pages of interest are:
 
 Original (Falanx) datasheets:
@@ -68,6 +106,14 @@ Lima Documents
 
 ## Contribute
 ===
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 
 The Lima driver currently only has some preliminary and highly experimental support. This experimental phase is necessary to gain a full and complete understanding of how the Mali GPUs work. Once more is known, an actual graphics driver (most likely based off of Mesa/Gallium) can be written. There is a lot of interesting work that still needs to be done!
 
@@ -86,3 +132,5 @@ PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
 
 GREETINGS,
 ANONYMOUS USER
+
+===

diff --git a/index.mdwn b/index.mdwn
index 150cc4f..32fedd9 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===
@@ -82,3 +82,7 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER

diff --git a/index.mdwn b/index.mdwn
index 89e6fde..150cc4f 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===

Add FOSDEM news.
diff --git a/index.mdwn b/index.mdwn
index 51d2f5b..89e6fde 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,8 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
+* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!

add a section on latencies (possibly incomplete?)
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index b1dbb60..2f01bb2 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, read-after-write has a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields, which are set to 0.
 
 ## Output Transformation
 
@@ -488,6 +488,10 @@ These are the known inputs:
     28-31: Register 0 Output [-1, last instruction] (Register/Attribute)
 Note: If attribute_load_en is disabled then the attribute slot can be used to load registers too.
 
+## Latencies
+
+Temporaries have a latency of 4 instructions, i.e. writes take 4 cycles to appear. Registers have a similar latency of 3 instructions. Writes to address registers 1-3 have a latency of 4 instructions. Writes to address register 0 (temporary store) have no latency though, so it can be set in the same instruction as the temporary store itself. The complex1 operation has a latency of 2 cycles.
+
 Instruction format:
 
     0-4:   Multiply 0 Input A
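The latency rules added in this commit can be captured as a small scheduling check. This is a minimal sketch with hypothetical names (`gp_write_kind`, `gp_read_is_ready`). It assumes "a latency of N instructions" means the consumer must sit at least N instructions after the producer, so a latency of 0 allows use in the same instruction. That is how the note describes address register 0, but an off-by-one in the other cases is possible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the GP vertex shader writes listed under "Latencies". */
enum gp_write_kind {
    GP_WRITE_TEMPORARY,    /* temporary store: 4 instructions */
    GP_WRITE_REGISTER,     /* register store: 3 instructions */
    GP_WRITE_ADDR_REG_1_3, /* address registers 1-3: 4 instructions */
    GP_WRITE_ADDR_REG_0,   /* address register 0 (temporary store address): 0 */
    GP_WRITE_COMPLEX1      /* complex1 operation: 2 instructions */
};

static int gp_write_latency(enum gp_write_kind kind)
{
    switch (kind) {
    case GP_WRITE_TEMPORARY:    return 4;
    case GP_WRITE_REGISTER:     return 3;
    case GP_WRITE_ADDR_REG_1_3: return 4;
    case GP_WRITE_ADDR_REG_0:   return 0;
    case GP_WRITE_COMPLEX1:     return 2;
    }
    assert(!"unknown write kind");
    return -1;
}

/* Assumed rule: the instruction at index `read` may consume a value written by
 * the instruction at index `write` once read - write >= latency. */
static bool gp_read_is_ready(enum gp_write_kind kind, int write, int read)
{
    return read - write >= gp_write_latency(kind);
}
```

A scheduler would call `gp_read_is_ready()` before placing a consumer, and otherwise insert passthroughs or nops until the distance is large enough.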

remove the codethink logo
diff --git a/index.mdwn b/index.mdwn
index fc6f7c6..51d2f5b 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -80,5 +80,3 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
-<p class="alignright">The Lima driver is sponsored by <a href="http://www.codethink.co.uk/2012/01/23/open-source-graphics-drivers/"><img border="0" src="/codethink.png" alt="Codethink" />
-</a></p>

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index ad7e95b..b1dbb60 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -471,12 +471,13 @@ These are the known inputs:
 
     0-3:   Register 0 Output [0, current] (Register/Attribute)
     4-7:   Register 1 Output [0, current] (Register)
-    8-11:  Unknown (Never seen)
+    8:     Unused, same as 21? (seen in m200_hw_workarounds.c nop shader)
+    9-11:  Unknown
     12-15: Load Result [0, current] (Uniform/Temporary)
     16,17: Accumulator 0,1 Output [-1, last instruction]
     18,19: Multiplier 0,1 Output [-1, last instruction]
     20:    Passthrough Output [-1, last instruction]
-    21:    Unused
+    21:    Unused/nop (i.e. this ALU is not used during this instruction)
     22:    Complex Output [-1, last instruction]
     22:    Identity/Passthrough (0 for add, 1 for multiply)
              Accumulator 0,1 Input 1: add(a, -ident) means pass(a)

Add glAlphaFunc reference value.
diff --git a/Render_State.mdwn b/Render_State.mdwn
index ff8992e..3595aa5 100644
--- a/Render_State.mdwn
+++ b/Render_State.mdwn
@@ -47,6 +47,7 @@ The Mali render state is a record of 16 32-bit words (64 bytes). It consists of
 
     0x1C [7] stencil test
       00000000 00000000 11111111 11111111 GL_STENCIL_TEST (either all bits are set or not)
+      00000000 11111111 00000000 00000000 glAlphaFunc reference value: 0.5 = 0x80, 1.0 = 0xFF.
 
     0x20 [8] multisample
       00000000 00000000 00000000 00000111 always set? could be another CompareFunc
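The word at 0x1C now carries two fields: bits 0-15 for the stencil test enable and bits 16-23 for the glAlphaFunc reference value. Packing it might look like the sketch below. The names are hypothetical. The rounding rule `ref * 255` rounded to nearest is an assumption: it reproduces both documented points (0.5 = 0x80, 1.0 = 0xFF) but has not been confirmed for other values. Bits 24-31 are not described in the table, so the sketch leaves them zero.

```c
#include <assert.h>
#include <stdint.h>

/* Render state word 7 (offset 0x1C), per the bit table above. */
#define RS7_STENCIL_TEST_MASK 0x0000FFFFu /* all bits set = GL_STENCIL_TEST on */
#define RS7_ALPHA_REF_SHIFT   16
#define RS7_ALPHA_REF_MASK    0x00FF0000u

/* Assumed encoding: reference value scaled to 0..255, rounded to nearest. */
static uint32_t rs7_alpha_ref_bits(float ref)
{
    assert(ref == ref);          /* reject NaN; converting it to an integer is undefined */
    if (ref < 0.0f) ref = 0.0f;  /* glAlphaFunc clamps ref to [0, 1] */
    if (ref > 1.0f) ref = 1.0f;
    uint32_t ref8 = (uint32_t)(ref * 255.0f + 0.5f);
    return (ref8 << RS7_ALPHA_REF_SHIFT) & RS7_ALPHA_REF_MASK;
}

static uint32_t rs7_pack(int stencil_test, float alpha_ref)
{
    uint32_t word = rs7_alpha_ref_bits(alpha_ref);
    if (stencil_test)
        word |= RS7_STENCIL_TEST_MASK; /* "either all bits are set or not" */
    return word;
}
```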

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 0f19801..ad7e95b 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -113,7 +113,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         It seems that varyings (floats) can be loaded in aligned groups of 1, 2, or 4.
         This specifies how many to load at once. Note that the alignment affects the addressing;
         for example, loading from an index of x at an alignment of 4 is equivalent to loading from 2*x
-        at an alignment of 2.
+         and 2*x+1 at an alignment of 2.
         00 - no alignment (load 1 float)
         01 - alignment by 2 (load 2 floats)
         11 - alignment by 4 (load 4 floats)
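The corrected alignment note implies that a load index is scaled by the group size. Loading index x at alignment 4 covers floats 4x to 4x+3, which is the same data as indices 2x and 2x+1 at alignment 2. This is a minimal sketch of that inference with hypothetical helper names. Code 10 is not listed, so it is treated as invalid here.

```c
#include <assert.h>

/* Floats fetched per 2-bit alignment code, as listed above. */
static int varying_load_count(unsigned code)
{
    switch (code) {
    case 0x0: return 1;  /* 00: no alignment, 1 float */
    case 0x1: return 2;  /* 01: aligned by 2, 2 floats */
    case 0x3: return 4;  /* 11: aligned by 4, 4 floats */
    default:  return -1; /* 10: not documented */
    }
}

/* Assumed addressing: the first float fetched is index * group size. */
static int varying_first_float(unsigned code, int index)
{
    int count = varying_load_count(code);
    assert(count > 0);
    return index * count;
}
```

With this model, `varying_first_float(0x3, x)` equals `varying_first_float(0x1, 2*x)`, and `varying_first_float(0x1, 2*x+1)` starts two floats later. That matches the equivalence stated in the note.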

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index dbe8fd2..0f19801 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, read-after-write has a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
 
 ## Output Transformation
 
@@ -560,6 +560,7 @@ Instruction format:
         0 - multiply (out = a * b)
         1 - complex 1 (inverse, inverse sqrt, etc.)
             takes all four inputs as arguments
+            This instruction has a latency of 2 cycles.
         3 - complex 2 (inverse, inverse sqrt, etc.)
             takes first two inputs as arguments,
             the other two are normal (multiply)

whoops
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index ace9bf9..dbe8fd2 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 6 cycles (i.e. a temporary cannot be read until 6 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
 
 ## Output Transformation
 

mali gp temporary stuff
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index c7ae062..ace9bf9 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields.
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 6 cycles (i.e. a temporary cannot be read until 6 instructions after it is written).
 
 ## Output Transformation
 
@@ -547,6 +547,7 @@ Instruction format:
         4 - inverse sqrt (Partial)
         5 - reciprocal (Partial)
         9 - passthrough
+        10 - Set Address Register 0 & Address Register 1 from result of passthrough unit
         12 - Set Address Register 0 (Temporary Store address)
         13 - Set Address Register 1
         14 - Set Address Register 2

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index acc5951..c7ae062 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -471,7 +471,7 @@ These are the known inputs:
 
     0-3:   Register 0 Output [0, current] (Register/Attribute)
     4-7:   Register 1 Output [0, current] (Register)
-    9-11:  Unknown (Never seen)
+    8-11:  Unknown (Never seen)
     12-15: Load Result [0, current] (Uniform/Temporary)
     16,17: Accumulator 0,1 Output [-1, last instruction]
     18,19: Multiplier 0,1 Output [-1, last instruction]

diff --git a/index.mdwn b/index.mdwn
index c34b4b3..fc6f7c6 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -36,7 +36,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
-* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab/Note)
+* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab/Note, Samsung Chromebook)
 
 ### [Mali-200](Hardware#Mali-200):
 

Add new codepush
diff --git a/index.mdwn b/index.mdwn
index b6e2c65..c34b4b3 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,7 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach

added Mele A1000
diff --git a/Devices.mdwn b/Devices.mdwn
index 966b6c1..5f4e858 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -27,6 +27,8 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
 
+## Mele A1000
+
 # Exynos 4
 
 ## Origen Board

added some Exynos 4 and 5 devices
diff --git a/Devices.mdwn b/Devices.mdwn
index 7fe9c86..966b6c1 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -26,3 +26,25 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 ## Hackberry
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
+
+# Exynos 4
+
+## Origen Board
+
+## ODROID
+
+## Samsung Galaxy S II
+
+## Samsung Galaxy S III
+
+# Exynos 5
+
+This SoC incorporates the Mali-T604 GPU along with 2 Cortex-A15 cores.
+
+## Arndale Board
+
+## Samsung Chromebook XE303C12
+
+This is, as of December 2012, the only ARM-based Chromebook. It costs 249 USD.
+
+## Google Nexus 10

added some AllWinner A10 boards
diff --git a/Devices.mdwn b/Devices.mdwn
index b0de43f..7fe9c86 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -1,16 +1,28 @@
-This page lists some of the available devices with a mali GPU, together with some useful info about them. The GPL VIOLATOR status for most of the devices is pretty much a given at this point, so let's just mark devices as such unless proven otherwise.
+This page lists some of the available devices with a Mali GPU, together with some useful info about them. The GPL VIOLATOR status for most of the devices is pretty much a given at this point, so let's just mark devices as such unless proven otherwise.
 
 Be careful where you buy, most cheap shops will not ship from your country but will ship from China. This means that you might end up paying customs, and end up wasting some time at the customs office.
 
 # AMLogic 8726-M (Mali 400)
 
 ## Zenithink ZT-280 (**GPL VIOLATOR**)
-===
 
 The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
 
 
 ## Point of View ProTab 2XXL (**GPL VIOLATOR**)
-===
 
 According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competetively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
+
+# AllWinner A10
+
+## Cubieboard
+
+The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is currently only available for pre-order.
+
+## Gooseberry
+
+The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet board. It comes with 4 GB of on-board storage, 802.11n Wi-Fi, HDMI, and a microSD card slot. Android 4.0 "Ice Cream Sandwich" is officially supported.
+
+## Hackberry
+
+The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 8de419e..acc5951 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -541,8 +541,6 @@ Instruction format:
         7 - max/logical or (a || b)
         note: abs(a) is implemented as max(a, -a)
     86-89: Complex OpCode
-        For complex functions (rcp, sqrt, etc.), the inputs to the multiply ALU0 and
-        the input to the complex ALU are the same value.
         0 - unused
         2 - exp2 (Partial)
         3 - log2 (Partial)

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index a28496d..8de419e 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -430,7 +430,7 @@ For more information on the disassembly/decompiling tools see [[Vertex+Disassemb
 
 The vertex shader has a scalar VLIW architecture. Each instruction has a field for 2 addition ALU's, 2 multiplication ALU's, a complex ALU, a passthrough ALU, an attribute load unit, a register load unit, a uniform/temporary load unit, and a varying/register/temporary store unit. Instructions are fixed-length - each instruction consists of 4 words. Constants are implemented internally by uniforms.
 
-Unlike a normal CPU, there are no explicit output registers for the ALU's, nor are there any explicit input registers. Instead, the input field(s) for each ALU can directly reference the ALU results from previous instructions (see below). However, there are 16 registers (maybe less?) that can be used when two instructions are too far apart for one to reference the result of the other, or for special cases such as loops. Only one (4-component) register may be loaded & stored per instruction, and storing registers and temporaries shares some of the same fields as storing varyings.
+Unlike a normal CPU, there are no explicit output registers for the ALU's, nor are there any explicit input registers. Instead, the input field(s) for each ALU can directly reference the ALU results from previous instructions (see below). However, there are 16 registers that can be used when two instructions cannot be scheduled so that one references the result of the other (either directly, or through one or more passthroughs), or for special cases such as loops. Only one (4-component) register may be loaded & stored per instruction, and storing registers and temporaries shares some of the same fields as storing varyings.
 
 # Temporaries
 

Added some new Exynos options.
diff --git a/Hardware.mdwn b/Hardware.mdwn
index e240b13..c6c6cff 100644
--- a/Hardware.mdwn
+++ b/Hardware.mdwn
@@ -35,7 +35,7 @@ From a driver point of view, very few infrastructural changes are needed for sup
 
 ## Mali-T604/T658
 
-These unified shader designs were announced by ARM but are not currently shipping. Once this hardware is available to lima developers, support for it can be evaluated.
+The T604 was first released in November 2012 as part of the Exynos 5250 chipset by Samsung, integrated in the Google Nexus 10 tablet and Samsung Chromebook. This and the as of yet unreleased T658 are of a unified shader design. Once this hardware is available to lima developers, support for it can be evaluated.
 
 # SoCs #
 ===
@@ -63,7 +63,9 @@ There is a pre-built image of Linaro Android with the Lima(re) demo included. Th
 
 ## Samsung Exynos
 
-The [Samsung Exynos](http://en.wikipedia.org/wiki/Exynos) 42xx is a range of ARM Cortex A9 devices clocked between 1.2 and 1.8GHz. They are the only devices currently carrying a Mali-400MP4. The Exynos of course stars in the top selling, high end Samsung android based smartphones and tablets. The best sold phone of 2011, the Samsung Galaxy S II, comes with an Exynos. A [Single Board Computer with a 4210, called origen,](http://www.origenboard.org/) is available with android and ubuntu support.
+The [Samsung Exynos](http://en.wikipedia.org/wiki/Exynos) 42xx is a range of ARM Cortex A9 devices clocked between 1.2 and 1.8GHz. They are the only devices currently carrying a Mali-400MP4. The Exynos of course stars in the top selling, high end Samsung android based smartphones and tablets. The best sold phone of 2011, the Samsung Galaxy S II, comes with an Exynos. A [Single Board Computer with a 4210, called origen,](http://www.origenboard.org/) is available with android and ubuntu support. Another option is the [Cotton Candy](http://www.cstick.com) by FXI Tech, a USB/HDMI thumb computer, which in its initial revision has the 4210. 
+
+Exynos 5250 (also seen as Exynos 5) is a dual-core ARM Cortex A15 device clocked at 1.7GHz with a Mali T604. The first releases were part of the first Google Nexus 10, and the first ARM-based Chromebook, both by Samsung.
 
 ## Telechips 8902
 

lower bits of frag shader address
diff --git a/Render_State.mdwn b/Render_State.mdwn
index 61d37d5..ff8992e 100644
--- a/Render_State.mdwn
+++ b/Render_State.mdwn
@@ -59,7 +59,10 @@ The Mali render state is a record of 16 32-bit words (64 bytes). It consists of
       00000000 00000000 11110000 00000111 (default in GLES2) 
       00000000 00000000 11111000 00000111 (default in lima)
 
-    0x24 [9] shader address (16-aligned)
+    0x24 [9] shader address
+
+      11111111 11111111 11111111 11100000 Fragment shader address
+      00000000 00000000 00000000 00011111 Size of first instruction
 
     0x28 [10] varying types
 
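The new layout of word 9 can be illustrated with a small C sketch. It assumes the shader address must leave the low 5 bits free, which implies 32-byte alignment. It also treats "size of first instruction" as an opaque 5-bit value, because its unit is not documented here. The helper names are hypothetical and are not taken from the Lima source.

```c
#include <stdint.h>

/* Hypothetical helpers for render state word 9 (offset 0x24):
 * bits 5..31 hold the fragment shader address,
 * bits 0..4 hold the size of the first instruction (unit not documented). */
#define RSW9_ADDR_MASK 0xFFFFFFE0u
#define RSW9_SIZE_MASK 0x0000001Fu

static uint32_t rsw9_pack(uint32_t shader_addr, uint32_t first_instr_size)
{
    /* The address must not use the low 5 bits, i.e. be 32-byte aligned. */
    return (shader_addr & RSW9_ADDR_MASK) | (first_instr_size & RSW9_SIZE_MASK);
}

static uint32_t rsw9_addr(uint32_t word)       { return word & RSW9_ADDR_MASK; }
static uint32_t rsw9_first_size(uint32_t word) { return word & RSW9_SIZE_MASK; }
```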

and/or/xor are for scalar multiply ALU too
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index de1e2df..a28496d 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -207,6 +207,9 @@ There also exists various "pipeline registers" (four of them listed above) which
     o - opcode:
         00xxx - arg0 * arg1 * 2^x where x is in two's complement format
         01000 - not(arg0)
+        01001 - and(arg0, arg1)
+        01010 - or(arg0, arg1)
+        01011 - xor(arg0, arg1)
         01100 - notEqual(arg0, arg1)
         01101 - lessThan(arg0, arg1)
         01110 - lessThanEqual(arg0, arg1)
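A disassembler could decode the scalar multiply opcode field roughly as follows. The sketch covers the encodings listed on this page for that ALU, including min/max from the same table. The enum and function names are hypothetical, and any encoding not listed here is reported as unknown rather than guessed.

```c
/* Hypothetical decoded form of the 5-bit scalar multiply ALU opcode. */
enum smul_op {
    SMUL_UNKNOWN = -1,
    SMUL_MUL,   /* 00xxx: arg0 * arg1 * 2^x */
    SMUL_NOT,   /* 01000 */
    SMUL_AND,   /* 01001 */
    SMUL_OR,    /* 01010 */
    SMUL_XOR,   /* 01011 */
    SMUL_NE,    /* 01100 notEqual */
    SMUL_LT,    /* 01101 lessThan */
    SMUL_LE,    /* 01110 lessThanEqual */
    SMUL_MIN,   /* 10000 */
    SMUL_MAX    /* 10001 */
};

static enum smul_op smul_decode(unsigned op)
{
    op &= 0x1Fu;                /* the opcode field is 5 bits wide */
    if ((op & 0x18u) == 0)      /* top two bits clear: 00xxx multiply */
        return SMUL_MUL;
    switch (op) {
    case 0x08: return SMUL_NOT;
    case 0x09: return SMUL_AND;
    case 0x0A: return SMUL_OR;
    case 0x0B: return SMUL_XOR;
    case 0x0C: return SMUL_NE;
    case 0x0D: return SMUL_LT;
    case 0x0E: return SMUL_LE;
    case 0x10: return SMUL_MIN;
    case 0x11: return SMUL_MAX;
    default:   return SMUL_UNKNOWN;
    }
}
```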

add link to list of texel formats
diff --git a/index.mdwn b/index.mdwn
index 0dd4519..b6e2c65 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -61,6 +61,7 @@ Lima Documents
 * [[MBS+File+Format]]
 * [[Fragment+Shader+Backend]]
 * [[Render State]]
+* [[Texel Formats]]
 
 ## Contribute
 ===

Big list of texel formats
diff --git a/Texel_Formats.mdwn b/Texel_Formats.mdwn
new file mode 100644
index 0000000..dfbf9f2
--- /dev/null
+++ b/Texel_Formats.mdwn
@@ -0,0 +1,55 @@
+Like many GPUs, the Mali GPU supports a wide range of texel formats:
+
+                 alpha   flags    components
+     id bpp byt  ia  ha  rb  ro   r   g   b   a   d   s   l   i   note
+
+     00   1   1                                           1    
+     01   1   1   +   +                       1                
+     02   1   1                                               1
+     03   2   1       +       +               1           1    
+     04   4   1                                           4    
+     05   4   1   +   +                       4                
+     06   4   1                                               4
+     07   4   1       +   +   +   1   1   1   1                
+     08   8   1       +       +               4           4    
+     09   8   1                                           8    
+     0A   8   1   +   +                       8                
+     0B   8   1                                               8
+     0C   8   1           +       3   3   2                    
+     0D   8   1       +   +   +   2   2   2   2                
+     0E  16   2           +       5   6   5                    
+     0F  16   2       +   +   +   5   5   5   1                
+     10  16   2       +   +   +   4   4   4   4                
+     11  16   1       +       +               8           8    
+     12  16   2                                          16    
+     13  16   2   +   +                      16                
+     14  16   2                                              16
+     15 N/A   1                   8   8   8                    
+     16  32   1       +   +   +   8   8   8   8                
+     17  32   1           +       8   8   8                    
+     18  32   4       +   +   +  10  10  10   2                
+     19  32   4           +      11  11  10                    
+     1A  32   4           +      10  12  10                    
+     1B  32   2       +       +                  16      16    
+     1C  64   2       +   +   +  16  16  16  16                
+     1D   4   1                                                  Paletted?
+     1E   8   1                                                  Paletted?
+     20   4   1                                                  ETC1_RGB8 (Ericsson Texture Compression)
+     22  16   2                                          16      Float
+     23  16   2   +   +                      16                  Float
+     24  16   2                                              16  Float
+     25  32   2       +       +              16          16      Float
+     26  64   2       +   +   +  16  16  16  16                  Float
+     2C  32   4                                  24   8          Depth/stencil
+     2D  64   4                                                
+     2E  48   2           +      16  16  16                    
+     2F  48   2           +      16  16  16                      Float
+     32  32   4                                                  ?
+     3F   0   0                                                  INVALID
+
+    bpp: bits per pixel
+    byt: bytes per copy element
+    ia: is alpha
+    ha: has alpha
+    rb/ro: ?
+    r/g/b/a/d/s/l/i: red/green/blue/alpha/depth/stencil/luminance/intensity
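The bpp column can be used to size texture data. The following sketch copies only a handful of uncompressed formats from the table. It computes a naive, unpadded image size and ignores any tiling, alignment or mipmap layout the hardware may require. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt of the texel format table: bits per pixel.
 * Returns 0 for formats this sketch does not cover (including 0x15, bpp N/A). */
static unsigned texel_bpp(uint8_t id)
{
    switch (id) {
    case 0x0E: return 16;   /* r5 g6 b5 */
    case 0x0F: return 16;   /* r5 g5 b5 a1 */
    case 0x10: return 16;   /* r4 g4 b4 a4 */
    case 0x16: return 32;   /* r8 g8 b8 a8 */
    case 0x17: return 32;   /* r8 g8 b8 in a 32-bit texel */
    default:   return 0;
    }
}

/* Naive size in bytes of a width x height image, with no padding or tiling. */
static size_t texel_image_bytes(uint8_t id, size_t width, size_t height)
{
    return width * height * texel_bpp(id) / 8;
}
```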

some new bits
diff --git a/Render_State.mdwn b/Render_State.mdwn
index 959771b..61d37d5 100644
--- a/Render_State.mdwn
+++ b/Render_State.mdwn
@@ -22,6 +22,8 @@ The Mali render state is a record of 16 32-bit words (64 bytes). It consists of
     0x0C [3] depth test
       00000000 00000000 00000000 00000001 GL_DEPTH_TEST
       00000000 00000000 00000000 00001110 depthFunc (CompareFunc)
+      00000000 11111111 00000000 00000000 polygonOffset factor
+      11111111 00000000 00000000 00000000 polygonOffset units
 
     0x10 [4] depth range
       11111111 11111111 00000000 00000000 max(nearVal, farVal)
@@ -47,27 +49,35 @@ The Mali render state is a record of 16 32-bit words (64 bytes). It consists of
       00000000 00000000 11111111 11111111 GL_STENCIL_TEST (either all bits are set or not)
 
     0x20 [8] multisample
+      00000000 00000000 00000000 00000111 always set? could be another CompareFunc
+      00000000 00000000 00000000 01101000 (0x00006800 "4x MSAA" in lima)
       00000000 00000000 00000000 10000000 GL_SAMPLE_ALPHA_TO_COVERAGE
+      00000000 00000000 00000001 00000000 GL_SAMPLE_ALPHA_TO_ONE
       00000000 00000000 11110000 00000000 sampleCoverage (SampleCoverage)
+      00000000 11000000 00000000 00000000 vertex selector? (00 GL_POINTS 01 GL_LINE* 10 GL_TRIANGLE*)
 
       00000000 00000000 11110000 00000111 (default in GLES2) 
       00000000 00000000 11111000 00000111 (default in lima)
-      00000000 00000000 00000000 01101000 (0x00006800 "4x MSAA" in lima)
 
-    0x24 [9] shader address
+    0x24 [9] shader address (16-aligned)
 
     0x28 [10] varying types
 
-    0x2C [11] uniforms address
+    0x2C [11] uniforms address (16-aligned)
 
-    0x30 [12] textures address
+    0x30 [12] textures address (16-aligned)
 
     0x34 [13] ?
+      00000000 00000000 00000001 00000000 ? usually 1
+      00000000 00000000 00000010 00000000 Enable early Z
+      00000000 00000000 00010000 00000000 Enable pixel kill
 
-    0x38 [14] dither (and maybe more)
+    0x38 [14] dither etc
+      00000000 00000000 00010000 00000000 glFrontFace (0=GL_CCW, 1=GL_CW)
       00000000 00000000 00100000 00000000 GL_DITHER
+      00000000 00000001 00000000 00000000 set if(uniform_size) in Lima
 
-    0x3C [15] varyings address
+    0x3C [15] varyings address (16-aligned)
 
 ## Bitfields
 
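The depth test word and the word 13 flags documented above can be packed with helpers like these. This is a sketch under the layouts given on this page. The polygonOffset bytes are left at zero because their encoding is not described, and the helper names are hypothetical.

```c
#include <stdint.h>

/* CompareFunc encoding from the Bitfields section. */
enum mali_compare_func {
    MALI_NEVER = 0, MALI_LESS, MALI_EQUAL, MALI_LEQUAL,
    MALI_GREATER, MALI_NOTEQUAL, MALI_GEQUAL, MALI_ALWAYS
};

/* Hypothetical helper for word 3 (0x0C): bit 0 = GL_DEPTH_TEST,
 * bits 1..3 = depthFunc. polygonOffset factor (bits 16..23) and
 * units (bits 24..31) are left at zero. */
static uint32_t rsw3_depth_test(int enable, enum mali_compare_func func)
{
    return (enable ? 1u : 0u) | (((uint32_t)func & 0x7u) << 1);
}

/* Hypothetical helper for word 13 (0x34): bit 8 is "usually 1",
 * bit 9 enables early Z, bit 12 enables pixel kill. */
static uint32_t rsw13_flags(int early_z, int pixel_kill)
{
    uint32_t w = 1u << 8;
    if (early_z)
        w |= 1u << 9;
    if (pixel_kill)
        w |= 1u << 12;
    return w;
}
```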

clarify first sentence a bit
diff --git a/Render_State.mdwn b/Render_State.mdwn
index 891a734..959771b 100644
--- a/Render_State.mdwn
+++ b/Render_State.mdwn
@@ -1,6 +1,6 @@
 # Render state
 
-The Mali render state is a record of 16 32-bit words (64 bytes). It consists of mainly rasterizer state. When queuing the draw command it is passed `LIMA_PLBU_CMD_RSW_VERTEX_ARRAY` (see `vs_commands_draw_add` in the Lima source).
+The Mali render state is a record of 16 32-bit words (64 bytes). It consists mainly of rasterizer state. When queuing a draw command, the address of such a structure is passed with `LIMA_PLBU_CMD_RSW_VERTEX_ARRAY` (see `vs_commands_draw_add` in the Lima source).
 
     0x00 [0] blend color
       00000000 00000000 00000000 11111111 blendColor blue component
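A C view of the render state helps keep the "0xNN [n]" labels straight: word n lives at byte offset 4 * n. The struct and helper below are hypothetical. The blend color packing follows the documented 8-bit fields only, and writes zero to the bits whose meaning is unknown.

```c
#include <stdint.h>

/* Hypothetical C view of the render state: 16 32-bit words, 64 bytes. */
struct mali_render_state {
    uint32_t word[16];
};

_Static_assert(sizeof(struct mali_render_state) == 64, "render state is 64 bytes");

/* Word 0: blue in bits 0..7, green in bits 16..23.
 * Word 1: red in bits 0..7, alpha in bits 16..23.
 * Undocumented bits are written as zero. */
static void rsw_set_blend_color(struct mali_render_state *rs,
                                uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    rs->word[0] = (uint32_t)b | ((uint32_t)g << 16);
    rs->word[1] = (uint32_t)r | ((uint32_t)a << 16);
}
```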

add link to new Render State page
diff --git a/index.mdwn b/index.mdwn
index c07c28b..0dd4519 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -60,6 +60,7 @@ Lima Documents
 * [[Mali_Offline_Shader_Compiler]]
 * [[MBS+File+Format]]
 * [[Fragment+Shader+Backend]]
+* [[Render State]]
 
 ## Contribute
 ===

Add with my findings about the mali render state word
diff --git a/Render_State.mdwn b/Render_State.mdwn
new file mode 100644
index 0000000..891a734
--- /dev/null
+++ b/Render_State.mdwn
@@ -0,0 +1,132 @@
+# Render state
+
+The Mali render state is a record of 16 32-bit words (64 bytes). It consists of mainly rasterizer state. When queuing the draw command it is passed `LIMA_PLBU_CMD_RSW_VERTEX_ARRAY` (see `vs_commands_draw_add` in the Lima source).
+
+    0x00 [0] blend color
+      00000000 00000000 00000000 11111111 blendColor blue component
+      00000000 11111111 00000000 00000000 blendColor green component  
+
+    0x04 [1] blend color
+      00000000 00000000 00000000 11111111 blendColor red component
+      00000000 11111111 00000000 00000000 blendColor alpha component
+
+    0x08 [2] alpha blend
+      00000000 00000000 00000000 00000111 modeRGB (BlendEquation)
+      00000000 00000000 00000000 00111000 modeAlpha (BlendEquation)
+      00000000 00000000 00000111 11000000 srcRGB (ColorBlendFunc)
+      00000000 00000000 11111000 00000000 dstRGB (ColorBlendFunc)
+      00000000 00001111 00000000 00000000 srcAlpha (AlphaBlendFunc)
+      00000000 11110000 00000000 00000000 dstAlpha (AlphaBlendFunc)
+      ???????? 00000000 00000000 00000000 always 11111100? (TODO: check whether this is GLES1 glAlphaFunc)
+
+    0x0C [3] depth test
+      00000000 00000000 00000000 00000001 GL_DEPTH_TEST
+      00000000 00000000 00000000 00001110 depthFunc (CompareFunc)
+
+    0x10 [4] depth range
+      11111111 11111111 00000000 00000000 max(nearVal, farVal)
+      00000000 00000000 11111111 11111111 min(nearVal, farVal)
+
+    0x14 [5] stencil GL_FRONT
+      00000000 00000000 00000000 00000111 func (CompareFunc)
+      00000000 00000000 00000000 00111000 sfail (StencilOp)
+      00000000 00000000 00000001 11000000 dpfail (StencilOp)
+      00000000 00000000 00001110 00000000 dppass (StencilOp)
+      00000000 11111111 00000000 00000000 ref
+      11111111 00000000 00000000 00000000 mask
+
+    0x18 [6] stencil GL_BACK
+      00000000 00000000 00000000 00000111 func (CompareFunc)
+      00000000 00000000 00000000 00111000 sfail (StencilOp)
+      00000000 00000000 00000001 11000000 dpfail (StencilOp)
+      00000000 00000000 00001110 00000000 dppass (StencilOp)
+      00000000 11111111 00000000 00000000 ref
+      11111111 00000000 00000000 00000000 mask
+
+    0x1C [7] stencil test
+      00000000 00000000 11111111 11111111 GL_STENCIL_TEST (either all bits are set or not)
+
+    0x20 [8] multisample
+      00000000 00000000 00000000 10000000 GL_SAMPLE_ALPHA_TO_COVERAGE
+      00000000 00000000 11110000 00000000 sampleCoverage (SampleCoverage)
+
+      00000000 00000000 11110000 00000111 (default in GLES2) 
+      00000000 00000000 11111000 00000111 (default in lima)
+      00000000 00000000 00000000 01101000 (0x00006800 "4x MSAA" in lima)
+
+    0x24 [9] shader address
+
+    0x28 [10] varying types
+
+    0x2C [11] uniforms address
+
+    0x30 [12] textures address
+
+    0x34 [13] ?
+
+    0x38 [14] dither (and maybe more)
+      00000000 00000000 00100000 00000000 GL_DITHER
+
+    0x3C [15] varyings address
+
+## Bitfields
+
+    CompareFunc:
+        000 GL_NEVER
+        001 GL_LESS
+        010 GL_EQUAL
+        011 GL_LEQUAL
+        100 GL_GREATER
+        101 GL_NOTEQUAL
+        110 GL_GEQUAL
+        111 GL_ALWAYS
+
+    StencilOp:
+        000 GL_KEEP
+        001 GL_REPLACE
+        010 GL_ZERO
+        011 GL_INVERT
+        100 GL_INCR_WRAP
+        101 GL_DECR_WRAP
+        110 GL_INCR
+        111 GL_DECR
+
+    BlendEquation:
+        000 GL_FUNC_SUBTRACT
+        001 GL_FUNC_REVERSE_SUBTRACT
+        010 GL_FUNC_ADD
+        100 GL_MIN_EXT
+        101 GL_MAX_EXT
+
+    ColorBlendFunc:
+        00000 GL_SRC_COLOR
+        00001 GL_DST_COLOR
+        00010 GL_CONSTANT_COLOR
+        00011 GL_ZERO
+        00111 GL_SRC_ALPHA_SATURATE
+        01000 GL_ONE_MINUS_SRC_COLOR
+        01001 GL_ONE_MINUS_DST_COLOR
+        01010 GL_ONE_MINUS_CONSTANT_COLOR
+        01011 GL_ONE
+        10000 GL_SRC_ALPHA
+        10001 GL_DST_ALPHA
+        11000 GL_ONE_MINUS_SRC_ALPHA
+        11001 GL_ONE_MINUS_DST_ALPHA
+        10010 GL_CONSTANT_ALPHA
+        11010 GL_ONE_MINUS_CONSTANT_ALPHA
+
+    AlphaBlendFunc is the same as ColorBlendFunc, except that the upper bit is missing.
+      This is possible because the upper bit selects between _ALPHA and _COLOR, and for the alpha factor
+      these are equivalent.
+
+    SampleCoverage:
+        0000 value=0.00 inverted=FALSE
+        0001 value=0.25 inverted=FALSE
+        0011 value=0.50 inverted=FALSE
+        0111 value=0.75 inverted=FALSE
+        1111 value=1.0  inverted=FALSE
+        1111 value=0.00 inverted=TRUE
+        1110 value=0.25 inverted=TRUE
+        1100 value=0.50 inverted=TRUE
+        1000 value=0.75 inverted=TRUE
+        0000 value=1.00 inverted=TRUE
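The SampleCoverage table follows a simple pattern. The value, in quarters, is encoded as a run of set low bits (0.25 -> 0001, 0.50 -> 0011, ...), and inverted=TRUE is the 4-bit complement of that pattern. The sketch below assumes this pattern. It rounds values that are not exact quarters to the nearest quarter, which is an assumption about hardware behaviour that is not confirmed here. The helper names are hypothetical.

```c
#include <stdint.h>

/* Encode the 4-bit SampleCoverage field as inferred from the table. */
static uint32_t sample_coverage_bits(float value, int inverted)
{
    if (value < 0.0f) value = 0.0f;
    if (value > 1.0f) value = 1.0f;
    unsigned n = (unsigned)(value * 4.0f + 0.5f);  /* quarters covered: 0..4 */
    uint32_t mask = (1u << n) - 1u;                /* n low bits set */
    if (inverted)
        mask = ~mask & 0xFu;                       /* 4-bit complement */
    return mask;
}

/* Place the field in bits 12..15 of render state word 8 (0x20). */
static uint32_t rsw8_set_sample_coverage(uint32_t word, float value, int inverted)
{
    return (word & ~0xF000u) | (sample_coverage_bits(value, inverted) << 12);
}
```

With value=1.0 and no inversion, the GLES2 default of word 8 (0x0000F007) is reproduced from a word whose low bits are 0x007.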

fix multiplier comparison opcode
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index aa697cf..de1e2df 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -193,7 +193,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         01010 - or(arg0, arg1)
         01011 - xor(arg0, arg1)
         01100 - notEqual(arg0, arg1)
-        01101 - greaterThan(arg0, arg1)
+        01101 - lessThan(arg0, arg1)
         01110 - lessThanEqual(arg0, arg1)
         01111 - equal(arg0, arg1)
         10000 - min(arg0, arg1)
@@ -208,7 +208,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         00xxx - arg0 * arg1 * 2^x where x is in two's complement format
         01000 - not(arg0)
         01100 - notEqual(arg0, arg1)
-        01101 - greaterThan(arg0, arg1)
+        01101 - lessThan(arg0, arg1)
         01110 - lessThanEqual(arg0, arg1)
         10001 - max(arg0, arg1)
         10000 - min(arg0, arg1)
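For the 00xxx multiply form, x is a 3-bit two's-complement value, so the scale factor ranges from 2^-4 to 2^3. A small software model makes the sign extension explicit. The helper names are hypothetical. The model uses double precision, which represents these powers of two exactly, and it makes no claim about the hardware's own precision.

```c
/* Hypothetical helper: extract x from a 00xxx multiply opcode.
 * x is 3-bit two's complement, so it ranges from -4 to 3. */
static int mul_shift(unsigned op)
{
    int x = (int)(op & 0x7u);
    if (x & 0x4)          /* sign bit of the 3-bit field */
        x -= 8;
    return x;
}

/* Software model of arg0 * arg1 * 2^x. */
static double mul_eval(unsigned op, double arg0, double arg1)
{
    int x = mul_shift(op);
    double scale = (x >= 0) ? (double)(1 << x) : 1.0 / (double)(1 << -x);
    return arg0 * arg1 * scale;
}
```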

SATT is also a possibility for table
diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 7fa28ee..e54a9a2 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -59,7 +59,7 @@
     }
 
     table {
-    	chunk    header; // ="SUNI"/"SVAR"
+    	chunk    header; // ="SUNI"/"SVAR"/"SATT"
     	uint32_t count;
     	symbol   symbols[count];
     }
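A reader of MBS files can recognise the three symbol table kinds by their chunk identifier. This sketch assumes a chunk header layout that is not shown in this diff: a 4-byte ASCII identifier followed by a little-endian uint32 payload size. The struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed chunk header layout: 4-byte ASCII ident + little-endian uint32 size. */
struct mbs_chunk {
    char     ident[4];
    uint32_t size;
};

static void mbs_chunk_read(const uint8_t *p, struct mbs_chunk *c)
{
    memcpy(c->ident, p, 4);
    c->size = (uint32_t)p[4] | ((uint32_t)p[5] << 8) |
              ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
}

/* True for the symbol table chunk types named above. */
static int mbs_is_symbol_table(const struct mbs_chunk *c)
{
    return memcmp(c->ident, "SUNI", 4) == 0 ||
           memcmp(c->ident, "SVAR", 4) == 0 ||
           memcmp(c->ident, "SATT", 4) == 0;
}
```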

add implementation of asin and acos
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 6e61dda..aa697cf 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -339,6 +339,10 @@ There also exists various "pipeline registers" (four of them listed above) which
     $temp.x *= $temp.y;
     result = atan_pt2 $temp;
 
+    asin and acos are implemented using atan2, as follows:
+    asin(x) = atan2(x, sqrt(1 - x^2))
+    acos(x) = atan2(sqrt(1 - x^2), x)
+
     atan_pt1:
 
     dd ddmm mmaa aaaa AAbb bbbb BBoo oo01

fill in more of the vertex shader format (mainly from mbs_dump)
diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 38ed1b5..7fa28ee 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -34,7 +34,7 @@
     }
 
     symbol {
-    	chunk    header; // ="VUNI"/"VVAR"
+    	chunk    header; // ="VUNI"/"VVAR"/"VATT"
     	string   symbol;
     	uint8_t  unknown_0; // =0x00
         // type: 
@@ -71,6 +71,9 @@
 
     frag {
     	chunk    header;  // ="CFRA"
+        // version (seems to match _mali_core_type from mali_ioctl.h)
+        //   0x05 MALI_200
+        //   0x07 MALI_400_PP
     	uint32_t version; // =5
     	frag_sta sta;
     	frag_dis dis;
@@ -80,8 +83,28 @@
     	dbin     code;
     }
 
+    vert_fins {
+        chunk header; // ="FINS"
+        uint32_t unknown_0;
+        uint32_t instructions;
+        uint32_t attrib_prefetch;
+    }
+
+    vert {
+        chunk header; // ="CVER"
+        // version (seems to match _mali_core_type from mali_ioctl.h)
+        //   0x02 MALI_GP2
+        //   0x06 MALI_400_GP
+        uint32_t version;
+        vert_fins fins;
+        table uniforms; // ="SUNI"
+        table attributes; // ="SATT"
+        table variants; // ="SVAR"
+        dbin code;
+    }
 
     file {
     	chunk header; // ="MBS1"
     	frag  fragment;
+        vert  vertex;
     }
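The version word of CFRA and CVER seems to follow `_mali_core_type`, so a dump tool could label it as shown below. This hypothetical helper covers only the four values listed on this page. The names are copied from the comments above, not from mali_ioctl.h itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: name of the core type stored in a CFRA/CVER version word. */
static const char *mbs_core_name(uint32_t version)
{
    switch (version) {
    case 0x02: return "MALI_GP2";
    case 0x05: return "MALI_200";
    case 0x06: return "MALI_400_GP";
    case 0x07: return "MALI_400_PP";
    default:   return NULL;   /* not listed on this page */
    }
}

/* CFRA carries a fragment (PP) core type, CVER a vertex (GP) core type. */
static int mbs_version_is_fragment(uint32_t version)
{
    return version == 0x05 || version == 0x07;
}
```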

fill in FSTA, FDIS, FBUU
diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 03bcf8c..38ed1b5 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -12,19 +12,25 @@
 
     frag_sta {
     	chunk    header;    // ="FSTA"
-    	uint32_t unknown_0; // =1
-    	uint32_t unknown_1; // =1
+    	uint32_t stacksize; // fragment stack size
+    	uint32_t stackofs;  // starting offset
     }
 
     frag_dis {
     	chunk    header;    // ="FDIS"
-    	uint32_t unknown_0; // =0
+    	uint32_t discard;   // 1 if shader has discard instruction
     }
 
     frag_buu {
     	chunk    header;    // ="FBUU"
-    	uint32_t unknown_0; // =256
-    	uint32_t unknown_1; // =0
+        uint8_t reads_color; // gl_FBColor
+        uint8_t writes_color; // gl_FragColor
+        uint8_t reads_depth; // gl_FBDepth
+        uint8_t writes_depth; // ? gl_FragDepth (not supported in GLES2)
+        uint8_t reads_stencil; // gl_FBStencil
+        uint8_t writes_stencil; // ? gl_FragStencil (not supported in GLES2)
+        uint8_t unknown_0;
+        uint8_t unknown_1;
     }
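The FBUU payload is eight one-byte flags, so decoding it is a straight copy. The sketch assumes the flags follow the chunk header directly, in the order listed above. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical decoded FBUU payload. */
struct mbs_fbuu {
    uint8_t reads_color, writes_color;
    uint8_t reads_depth, writes_depth;
    uint8_t reads_stencil, writes_stencil;
    uint8_t unknown_0, unknown_1;
};

/* payload points at the 8 bytes following the "FBUU" chunk header. */
static void mbs_fbuu_parse(const uint8_t payload[8], struct mbs_fbuu *out)
{
    out->reads_color    = payload[0];
    out->writes_color   = payload[1];
    out->reads_depth    = payload[2];
    out->writes_depth   = payload[3];
    out->reads_stencil  = payload[4];
    out->writes_stencil = payload[5];
    out->unknown_0      = payload[6];
    out->unknown_1      = payload[7];
}
```

This layout is consistent with the earlier reading of the first word as 256: on a little-endian file, that is the bytes 00 01 00 00, i.e. writes_color = 1 for a shader that only writes gl_FragColor.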
 
     symbol {

describe types: add struct, samplerExternalOES
diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 56d3e0e..03bcf8c 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -31,7 +31,16 @@
     	chunk    header; // ="VUNI"/"VVAR"
     	string   symbol;
     	uint8_t  unknown_0; // =0x00
-    	uint8_t  type;      // =0x00
+        // type: 
+        //   0x01 float
+        //   0x02 int
+        //   0x03 bool 
+        //   0x04 matrix
+        //   0x05 sampler2D
+        //   0x06 samplerCube
+        //   0x08 struct
+        //   0x09 samplerExternalOES 
+    	uint8_t  type;      
     	uint16_t component_count;
     	uint16_t component_size;
     	uint16_t entry_count;
@@ -40,7 +49,7 @@
     	uint8_t  precision;
     	uint32_t invariant; // 1 if "invariant" keyword specified, otherwise 0
     	uint16_t offset;
-    	uint16_t index; // Usually -1 (0xFFFF)
+    	uint16_t index; // Usually -1 (0xFFFF) otherwise index of parent struct
     }
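The type byte and the index field can be given names in a symbol table reader. The enum below mirrors the values listed above; 0x07 is absent there and is left out here. The helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical enum mirroring the symbol type byte. */
enum mbs_symbol_type {
    MBS_TYPE_FLOAT                = 0x01,
    MBS_TYPE_INT                  = 0x02,
    MBS_TYPE_BOOL                 = 0x03,
    MBS_TYPE_MATRIX               = 0x04,
    MBS_TYPE_SAMPLER_2D           = 0x05,
    MBS_TYPE_SAMPLER_CUBE         = 0x06,
    MBS_TYPE_STRUCT               = 0x08,
    MBS_TYPE_SAMPLER_EXTERNAL_OES = 0x09
};

static int mbs_type_is_sampler(unsigned type)
{
    return type == MBS_TYPE_SAMPLER_2D ||
           type == MBS_TYPE_SAMPLER_CUBE ||
           type == MBS_TYPE_SAMPLER_EXTERNAL_OES;
}

/* index is usually 0xFFFF; otherwise it names the parent struct symbol. */
static int mbs_symbol_has_parent(uint16_t index)
{
    return index != 0xFFFFu;
}
```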
 
     table {

chunk vuni/vvar: invariant
diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 3127f20..56d3e0e 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -30,15 +30,15 @@
     symbol {
     	chunk    header; // ="VUNI"/"VVAR"
     	string   symbol;
-    	uint8_t  type;      // =0x00
     	uint8_t  unknown_0; // =0x00
+    	uint8_t  type;      // =0x00
     	uint16_t component_count;
     	uint16_t component_size;
     	uint16_t entry_count;
     	uint16_t src_stride;
     	uint8_t  dst_stride;
     	uint8_t  precision;
-    	uint32_t unknown_1; // =0x00000000
+    	uint32_t invariant; // 1 if "invariant" keyword specified, otherwise 0
     	uint16_t offset;
     	uint16_t index; // Usually -1 (0xFFFF)
     }

add logical and/or/xor
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 2c06e6a..6e61dda 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -189,6 +189,9 @@ There also exists various "pipeline registers" (four of them listed above) which
     Opcode:
         00xxx - arg0 * arg1 * 2^x, where x is in two's-complement format
         01000 - not(arg0)
+        01001 - and(arg0, arg1)
+        01010 - or(arg0, arg1)
+        01011 - xor(arg0, arg1)
         01100 - notEqual(arg0, arg1)
         01101 - greaterThan(arg0, arg1)
         01110 - lessThanEqual(arg0, arg1)

note additional multiply for atan2
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 348bed8..2c06e6a 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -331,6 +331,11 @@ There also exists various "pipeline registers" (four of them listed above) which
     atan_pt1 takes the (scalar) input and produces a 3-component vector.
     atan_pt2 takes the vector and produces the final output.
 
+    Unlike atan (atan_pt1 followed directly by atan_pt2), atan2 needs an additional multiply between atan2_pt1 and atan_pt2:
+    $temp.xyz = atan2_pt1 y, x;
+    $temp.x *= $temp.y;
+    result = atan_pt2 $temp;
+
     atan_pt1:
 
     dd ddmm mmaa aaaa AAbb bbbb BBoo oo01

note special varying source values for inputting into textureCube()
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index e5994dd..348bed8 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -127,8 +127,10 @@ There also exists various "pipeline registers" (four of them listed above) which
         However, I haven't been able to test this theory because I haven't gotten the compiler
         to produce a value for O other than 11. 
     s - source:
-        00pp - varying
+        00pp - normal varying
         01pp - register (see second instruction format)
+        1000 - varying, input to textureCube()
+        1001 - register, input to textureCube()
         1011 - gl_FragCoord
         1100 - gl_PointCoord
         1101 - gl_FrontFacing
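The 4-bit source field of the varying fetch can be classified as follows. The sketch only handles the values listed above and does not interpret the pp bits. The enum and function names are hypothetical.

```c
/* Hypothetical classification of the varying fetch source field (s). */
enum vf_source {
    VF_SRC_UNKNOWN,
    VF_SRC_VARYING,        /* 00pp normal varying */
    VF_SRC_REGISTER,       /* 01pp register */
    VF_SRC_VARYING_CUBE,   /* 1000 varying, input to textureCube() */
    VF_SRC_REGISTER_CUBE,  /* 1001 register, input to textureCube() */
    VF_SRC_FRAG_COORD,     /* 1011 gl_FragCoord */
    VF_SRC_POINT_COORD,    /* 1100 gl_PointCoord */
    VF_SRC_FRONT_FACING    /* 1101 gl_FrontFacing */
};

static enum vf_source vf_decode_source(unsigned s)
{
    s &= 0xFu;
    if ((s >> 2) == 0)
        return VF_SRC_VARYING;
    if ((s >> 2) == 1)
        return VF_SRC_REGISTER;
    switch (s) {
    case 0x8: return VF_SRC_VARYING_CUBE;
    case 0x9: return VF_SRC_REGISTER_CUBE;
    case 0xB: return VF_SRC_FRAG_COORD;
    case 0xC: return VF_SRC_POINT_COORD;
    case 0xD: return VF_SRC_FRONT_FACING;
    default:  return VF_SRC_UNKNOWN;
    }
}
```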

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index b639a37..e5994dd 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -381,7 +381,7 @@ There also exists various "pipeline registers" (four of them listed above) which
     control[16], Branch/Discard
 
     Branch:
-    0 0011  tttt tttt tttt tttt tttt tttt ttt0 0000 0000 0000 0000 0000 0ccc aaaa aabb bbbb 0000
+    0 0011 tttt tttt tttt tttt tttt tttt ttt0 0000 0000 0000 0000 0000 0ccc aaaa aabb bbbb 0000
 
     Discard:
     0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0111 1111 0000 0000 0000 0011

add discard
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index c33dc9c..b639a37 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -32,7 +32,7 @@ Apparently, most GPU's processes fragments in groups of 2x2; I suspect ours does
     13:     {31, 1} Scalar Addition ALU
     14:     {30, 1} Vec4-Scalar Multiply/Transcendental Scalar ALU
     15:     {41, 1} Temporary Write/Framebuffer Read
-    16:     {73, 2} Branch
+    16:     {73, 2} Branch/Discard
     17:     {64, 2} Vec4 Constant Fetch 0
     18:     {64, 2} Vec4 Constant Fetch 1
     19..24: {     } Scheduling
@@ -378,10 +378,14 @@ There also exists various "pipeline registers" (four of them listed above) which
         Note: since gl_FBDepth is a float, and the alignment is set to 1,
         this instr will always set the x component of the specified destination register.
 
-    control[16], branch
+    control[16], Branch/Discard
 
+    Branch:
     0 0011  tttt tttt tttt tttt tttt tttt ttt0 0000 0000 0000 0000 0000 0ccc aaaa aabb bbbb 0000
 
+    Discard:
+    0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0111 1111 0000 0000 0000 0011
+
     c - condition:
         bit 0 - jump if a > b
         bit 1 - jump if a = b

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 995a488..c33dc9c 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -101,7 +101,6 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     control[7], Varying Fetch
     
-    0xmdii3C60
     00 mmmm dddd iiii iiOO 00oo oo00 0aa0 ssss
     Or, for loading from a register (used for loading texture coordinates from a register):
     00 mmmm dddd SSSS SSSS Anrr rr00 0000 01pp

Galaxy Note (GT-N7000) also has Mali-400
diff --git a/index.mdwn b/index.mdwn
index 8fe3c3a..c07c28b 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -35,7 +35,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
-* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab)
+* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab/Note)
 
 ### [Mali-200](Hardware#Mali-200):
 

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 85a6ff4..995a488 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -155,13 +155,6 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     The coordinates for the texture fetch are always the output of the varying load.
 
-    The actual sampler index (i.e. which sampler unit to use) is passed in by the driver
-    as a uniform vec2, as indicated in the symbol table, and read by the sampler unit before
-    actually performing the texture sample. I suspect the use of 2 indices may have to do with
-    the "virtualized textures" feature (see datasheet on front page), but the driver doesn't
-    seem to implement this. Note that the index is aligned, just like for varying vec2 loads,
-    so, for example, an index of 1 tells the processor to load uniform[0].zw.
-
     s - sampler index (offset into uniform table)
     o - sampler index register offset enable
     c - sampler index offset register

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index cdbaf01..85a6ff4 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -247,7 +247,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         01110 - min(arg0, arg1)
         10000 - sum3 - dest.xyzw = sum of first 3 components of arg1
         10001 - sum4 - dest.xyzw = sum of all components of arg1
-            Note: the output is broadcast to all channels - 
+            Note: for sum3 and sum4, the output is broadcast to all channels - 
             you can use the write mask to select which component to write to
         10100 - dFdx(arg0, arg1)
         10101 - dFdy(arg0, arg1)
@@ -383,8 +383,8 @@ There also exists various "pipeline registers" (four of them listed above) which
     s - source
         11 - gl_FBColor
         10 - gl_FBDepth
-        Note: since gl_FBDepth is a float, this instr will always set the x component
-        of the specified destination register.
+        Note: since gl_FBDepth is a float, and the alignment is set to 1,
+        this instr will always set the x component of the specified destination register.
 
     control[16], branch
 

update sum3 and sum4
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index c17bd74..cdbaf01 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -245,7 +245,10 @@ There also exists various "pipeline registers" (four of them listed above) which
         01010 - lessThanEqual(arg0, arg1)
         01111 - max(arg0, arg1)
         01110 - min(arg0, arg1)
-        10001 - dest.w = sum of all components of arg1
+        10000 - sum3 - dest.xyzw = sum of first 3 components of arg1
+        10001 - sum4 - dest.xyzw = sum of all components of arg1
+            Note: the output is broadcast to all channels - 
+            you can use the write mask to select which component to write to
         10100 - dFdx(arg0, arg1)
         10101 - dFdy(arg0, arg1)
             Note: dFdx(x) is actually implemented as dFdx(-x, x) (same for dFdy)
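The broadcast behaviour of sum3/sum4 can be modelled in software. The helper name is hypothetical. The write mask convention (bit 0 = x through bit 3 = w) is an assumption for illustration and is not taken from the instruction encoding.

```c
/* Hypothetical model of sum3 (n = 3) and sum4 (n = 4): the sum is broadcast
 * to every channel, and write_mask (bit 0 = x .. bit 3 = w, assumed)
 * selects which destination components are actually written. */
static void vadd_sum(float dest[4], const float arg1[4], int n, unsigned write_mask)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += arg1[i];
    for (int i = 0; i < 4; i++)
        if (write_mask & (1u << i))
            dest[i] = s;
}
```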

diff --git a/index.mdwn b/index.mdwn
index 64e9df3..8fe3c3a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -33,7 +33,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 ### [Mali-400](Hardware#Mali-400):
 
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
-* [Allwinner A10](Hardware#Allwinner+A10)
+* [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
 * [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab)
 

describe the texture sampler unit more thoroughly
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 3824551..c17bd74 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -151,11 +151,18 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     control[8], Texture fetch
 
-    00111001000000000001ssssssssssssottttt00000b000000ccccccrrrrrr
+    00 1110 0100 0000 0000 01ss ssss ssss ssot tttt 0000 0b00 0000 cccc ccrr rrrr
 
     The coordinates for the texture fetch are always the output of the varying load.
 
-    s - sampler index
+    The actual sampler index (i.e. which sampler unit to use) is passed in by the driver
+    as a uniform vec2, as indicated in the symbol table, and read by the sampler unit before
+    actually performing the texture sample. I suspect the use of 2 indices may have to do with
+    the "virtualized textures" feature (see datasheet on front page), but the driver doesn't
+    seem to implement this. Note that the index is aligned, just like for varying vec2 loads,
+    so, for example, an index of 1 tells the processor to load uniform[0].zw.
+
+    s - sampler index (offset into uniform table)
     o - sampler index register offset enable
     c - sampler index offset register
     t - sampler type
@@ -223,8 +230,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         11 - round to integer
 
     control[12], Vec4 Addition ALU
-    
-    0xoaaA?b??
+
     iooo ooMM mmmm dddd CCaa aaaa aaAA AADD bbbb bbbb BBBB
     
     i - whether to get Argument 1 from the multiplication ALU (below)

added framebuffer read stuff
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 2f8afb3..3824551 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -31,7 +31,7 @@ Apparently, most GPU's processes fragments in groups of 2x2; I suspect ours does
     12:     {44, 1} Vec4 Addition ALU
     13:     {31, 1} Scalar Addition ALU
     14:     {30, 1} Vec4-Scalar Multiply/Transcendental Scalar ALU
-    15:     {41, 1} Temporary Write
+    15:     {41, 1} Temporary Write/Framebuffer Read
     16:     {73, 2} Branch
     17:     {64, 2} Vec4 Constant Fetch 0
     18:     {64, 2} Vec4 Constant Fetch 1
@@ -348,7 +348,9 @@ There also exists various "pipeline registers" (four of them listed above) which
     a - source (vector)
     A - swizzle descriptor
 
-    control[15], Temporary Write
+    control[15], Temporary Write/Framebuffer Read
+
+    Temporary Write:
 
     i iiii iiii iiii iiio rrrr rr00 0000 a0ss ssss 00dd
 
@@ -364,6 +366,17 @@ There also exists various "pipeline registers" (four of them listed above) which
     o - register offset enable
     r - offset register
 
+    Framebuffer Read:
+
+    0 0000 0000 0000 0000 0000 0000 0000 10dd dd00 11ss
+
+    d - destination register
+    s - source
+        11 - gl_FBColor
+        10 - gl_FBDepth
+        Note: since gl_FBDepth is a float, this instr will always set the x component
+        of the specified destination register.
+
     control[16], branch
 
     0 0011  tttt tttt tttt tttt tttt tttt ttt0 0000 0000 0000 0000 0000 0ccc aaaa aabb bbbb 0000
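The Framebuffer Read bit pattern, `0 0000 ... 10dd dd00 11ss`, can be turned into an encoder. Reading from the right: bits 0..1 select the source, bits 2..3 are set, bits 6..9 hold the destination register, bit 11 is set, and all other bits are zero. The enum and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical encoder for the Framebuffer Read form of control[15]. */
enum fb_read_src { FB_READ_DEPTH = 0x2, FB_READ_COLOR = 0x3 };

static uint64_t fb_read_encode(unsigned dest_reg, enum fb_read_src src)
{
    return ((uint64_t)1 << 11) |                    /* fixed "1" of "10dd" */
           ((uint64_t)(dest_reg & 0xFu) << 6) |     /* dddd */
           ((uint64_t)0x3 << 2) |                   /* fixed "11" of "11ss" */
           (uint64_t)src;                           /* ss */
}
```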

add link to freedreno's swanky new site
diff --git a/index.mdwn b/index.mdwn
index 5544f70..64e9df3 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. Lima is going to solve this for you, but some time is needed still to get there.
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===

turns out there are 128 stages
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 3f00398..2f8afb3 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -2,7 +2,7 @@
 
 ## Fragment Shader Architecture
 
-The architecture consists of a large pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (~256 stages), the architecture is likely barrelled and interleaves execution of a large number (~256) of fragments at once, and scheduling is done by the machine in order to minimize stalls.
+The architecture consists of a large pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (128 stages for Mali-200, see [this page](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12787.html)), the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
 The instruction stream is compressed down from a maximum of 18 words per instruction, depending on which units are in use.
 The remaining bits give each unit individual instructions and constants.
 

Updated and cleaned up vertex terminology, now clearer and the same as the latest source.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 833ebd6..3f00398 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,115 +434,116 @@ it would seem that pass.op5 performs the opposite of pass.op4.
 
 These are the known inputs:
 
-    0-3: current instruction, attribute load result
-    4-7: current instruction, register load result
-    12-15: current instruction, uniform load result 
-    16, 17: last instruction, acc ALU0/1 results
-    18, 19: last instruction, mul ALU0/1 results
-    20: last instruction, passthrough unit (bits 111-115)
-    21: unused
-    22: identity/passthrough (0 for add, 1 for multiply)
-        For addition some_reg + -r22 means to passthrough some_reg.
-        For multiplication some_reg * r22 also means to passthrough some_reg.
-        This register also means last instruction, complex ALU result
-        when it is in the "input 0" field (instead of input 1)
-        as well as when used in the complex/passthrough ALU's.
-    23: two instructions ago, passthrough unit
-    24, 25: two instructions ago, acc ALU0/1 results
-    26, 27: two instructions ago, mul ALU0/1 results
-    28-31: last instruction, attribute load result
+    0-3:   Register 0 Output [0, current] (Register/Attribute)
+    4-7:   Register 1 Output [0, current] (Register)
+    9-11:  Unknown (Never seen)
+    12-15: Load Result [0, current] (Uniform/Temporary)
+    16,17: Accumulator 0,1 Output [-1, last instruction]
+    18,19: Multiplier 0,1 Output [-1, last instruction]
+    20:    Passthrough Output [-1, last instruction]
+    21:    Unused
+    22:    Complex Output [-1, last instruction]
+    22:    Identity/Passthrough (0 for add, 1 for multiply)
+             Accumulator 0,1 Input 1: add(a, -ident) means pass(a)
+             Multiplier  0,1 Input 1: mul(a,  ident) means pass(a)
+    23:    Passthrough Output [-2, two instructions ago]
+    24,25: Accumulator 0,1 Output [-2, two instructions ago]
+    26,27: Multiplier 0,1 Output [-2, two instructions ago]
+    28-31: Register 0 Output [-1, last instruction] (Register/Attribute)
 Note: If attribute_load_en is disabled then the attribute slot can be used to load registers too.
 
 Instruction format:
 
-    0-4:   multiply ALU0 input 0
-    5-9:   multiply ALU0 input 1
-    10-14: multiply ALU1 input 0
-    15-19: multiply ALU1 input 1
-    20:    multiply ALU0 negate
-    21:    multiply ALU1 negate
-    22-26: add ALU0 input 0
-    27-31: add ALU0 input 1
-    32-36: add ALU1 input 0
-    37-41: add ALU1 input 1
-    42:    add ALU0 input 0 negate
-    43:    add ALU0 input 1 negate
-    44:    add ALU1 input 0 negate
-    45:    add ALU1 input 1 negate
-    46-54: uniform/temporary/global (max 304) load
-    55-57: uniform offset register select
-        1 - temporary/uniform load offset 0
-        2 - temporary/uniform load offset 1
-        3 - temporary/uniform load offset 2
-        7 - no offset
-    58-61: attribute/register load
-    62:    attribute load enable (load attribute in attribute slot)
-    63-66: register load
-    67: temporary store 0 enable
-    68: temporary store 1 enable
-    69: branch
-    70: branch target low (< 0x100)
-    71-73: varying/register/temporary store input 0
-    74-76: varying/register/temporary store input 1
-    77-79: varying/register/temporary store input 2
-    80-82: varying/register/temporary store input 3
-        0 - add ALU0 output
-        1 - add ALU1 output
-        2 - mul ALU0 output
-        3 - mul ALU1 output
-        4 - passthrough ALU output
-        6 -  complex ALU output
-        7 - no input (do not store)
-    83-85: add ALU0/1 opcode
+    0-4:   Multiply 0 Input A
+    5-9:   Multiply 0 input B
+    10-14: Multiply 1 Input A (Wide-Operation Input C)
+    15-19: Multiply 1 Input B (Wide-Operation Input D)
+    20:    Multiply 0 Output Negate
+    21:    Multiply 1 Output Negate
+    22-26: Accumulator 0 Input A
+    27-31: Accumulator 0 Input B
+    32-36: Accumulator 1 Input A
+    37-41: Accumulator 1 Input B
+    42:    Accumulator 0 Input A Negate
+    43:    Accumulator 0 Input B Negate
+    44:    Accumulator 1 Input A Negate
+    45:    Accumulator 1 Input B Negate
+    46-54: Load Address (Uniform/Temporary)
+    55-57: Load Offset (Uniform/Temporary)
+        0   - Address Register 0? (Never seen)
+        1   - Address Register 1
+        2   - Address Register 2
+        3   - Address Register 3
+        4-6 - Unknown (Never seen)
+        7   - Unused (No offset)
+    58-61: Register 0 Address (Register/Attribute)
+    62:    Register 0 Attribute (Load attribute in Register 0 unit)
+    63-66: Register 1 Address
+    67: Store 0 Temporary (Store Temporary in Store 0)
+    68: Store 1 Temporary (Store Temporary in Store 1)
+    69: Branch
+    70: Branch Target Low (< 0x100)
+    71-73: Store 0 Input X (Register/Varying/Temporary)
+    74-76: Store 0 Input Y (Register/Varying/Temporary)
+    77-79: Store 1 Input Z (Register/Varying/Temporary)
+    80-82: Store 1 Input W (Register/Varying/Temporary)
+        0 - Accumulator 0 Output
+        1 - Accumulator 1 Output
+        2 - Multiplier 0 Output
+        3 - Multiplier 1 Output
+        4 - Passthrough Output
+        5 - Unknown
+        6 - Complex Output
+        7 - Unused (Don't store)
+    83-85: Accumulator (0 & 1) opcode
         0 - add
         1 - floor
         2 - sign
-        4 - src0 >= src1 / step(src1, src0)
-        5 - src0 < src1
-        6 - min/and
-        7 - max/or
+        3 - unknown
+        4 - greater-equal/step (a >= b)
+        5 - less-than (src0 < src1)
+        6 - min/logical and (a && b)
+        7 - max/logical or (a || b)
         note: abs(a) is implemented as max(a, -a)
-    86-89: complex ALU opcode
+    86-89: Complex OpCode
         For complex functions (rcp, sqrt, etc.), the inputs to the multiply ALU0 and
         the input to the complex ALU are the same value.
         0 - unused
-        2 - exp2
-        3 - log2
-        4 - inverse sqrt
-        5 - inverse
+        2 - exp2 (Partial)
+        3 - log2 (Partial)
+        4 - inverse sqrt (Partial)
+        5 - reciprocal (Partial)
         9 - passthrough
-        12 - temporary store address
-        13 - temporary/uniform load offset 0 set
-        14 - temporary/uniform load offset 1 set
-        15 - temporary/uniform load offset 2 set
-    90-93: varying/register store 0
-    94:    varying/register/temporary store 0 destination
-        0 - temporary/register
-        1 - varying
-    95-98: varying/register store 1
-    99:    varying/register/temporary store 1 destination
-        0 - temporary/register
-        1 - varying
-    100-102: multiply ALU opcode
-        0 - multiply
+        12 - Set Address Register 0 (Temporary Store address)
+        13 - Set Address Register 1
+        14 - Set Address Register 2
+        15 - Set Address Register 3
+    90-93: Store 0 Address (Varying/Register/Temporary)
+    94:    Store 0 Varying (Store Varying in Store 0)
+    95-98: Store 1 Address (Varying/Register/Temporary)
+    99:    Store 1 Varying (Store Varying in Store 1)
+    100-102: Multiply (0 & 1) OpCode
+        0 - multiply (out = a * b)
         1 - complex 1 (inverse, inverse sqrt, etc.)
             takes all four inputs as arguments
         3 - complex 2 (inverse, inverse sqrt, etc.)
             takes first two inputs as arguments,
             the other two are normal (multiply)
-        4 - mul0_src1 ? mul1_src0 : mul0_src0 (note: mul1_src1 = 21 because it is unused)
-    103-105: passthrough opcode
-        2 - pass
-        6 - clamp(input, uniform.x, uniform.y)
-    106-110: complex ALU input
-    111-115: passthrough input
-    116-119: unknown
-        0 - normal
-        12 - temporary write
-        13 - branch
-    120-127: branch target (absolute, 0 is 1st instruction of program)
+        4 - select (out = (b ? a : c), wide operation)
+        5-7 - unknown
+    103-105: Passthrough OpCode
+        2 - passthrough (out = in)
+        6 - clamp (out = max(min(in, uniform.x), uniform.y))
+        0-1, 3-5, 7 - unknown
+    106-110: Complex Input

(Diff truncated)
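The bit ranges in the vertex instruction format above can be pulled out of a raw instruction with one generic bit-field helper. This is a minimal C sketch, assuming the 128-bit instruction is stored as four 32-bit words with bit n at bit (n % 32) of word n / 32. `vs_get_bits`, `vs_decode`, `vs_acc_op_name` and `struct vs_fields` are hypothetical names, and only a handful of the documented fields are decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bit n of the instruction is bit (n % 32) of word n / 32.
 * Some fields straddle a word boundary (e.g. bits 63-66), so go bit by bit. */
static uint32_t vs_get_bits(const uint32_t insn[4], unsigned start, unsigned len)
{
    assert(len <= 32 && start + len <= 128);
    uint32_t value = 0;
    for (unsigned i = 0; i < len; i++) {
        unsigned bit = start + i;
        value |= ((insn[bit / 32] >> (bit % 32)) & 1u) << i;
    }
    return value;
}

struct vs_fields {
    unsigned load_addr;   /* 46-54: Load Address (Uniform/Temporary) */
    unsigned load_offset; /* 55-57: Load Offset, 7 = no offset */
    unsigned reg0_addr;   /* 58-61: Register 0 Address */
    bool     reg0_attr;   /* 62: Register 0 loads an attribute */
    unsigned reg1_addr;   /* 63-66: Register 1 Address */
    unsigned acc_op;      /* 83-85: Accumulator (0 & 1) opcode */
    unsigned complex_op;  /* 86-89: Complex OpCode */
    unsigned mul_op;      /* 100-102: Multiply (0 & 1) OpCode */
    unsigned pass_op;     /* 103-105: Passthrough OpCode */
};

static struct vs_fields vs_decode(const uint32_t insn[4])
{
    struct vs_fields f;
    f.load_addr   = vs_get_bits(insn, 46, 9);
    f.load_offset = vs_get_bits(insn, 55, 3);
    f.reg0_addr   = vs_get_bits(insn, 58, 4);
    f.reg0_attr   = vs_get_bits(insn, 62, 1) != 0;
    f.reg1_addr   = vs_get_bits(insn, 63, 4);
    f.acc_op      = vs_get_bits(insn, 83, 3);
    f.complex_op  = vs_get_bits(insn, 86, 4);
    f.mul_op      = vs_get_bits(insn, 100, 3);
    f.pass_op     = vs_get_bits(insn, 103, 3);
    return f;
}

/* Names follow the 83-85 table above; 3 is listed as unknown. */
static const char *vs_acc_op_name(unsigned op)
{
    static const char *const names[8] = {
        "add", "floor", "sign", "unknown",
        "greater-equal", "less-than", "min", "max",
    };
    return op < 8 ? names[op] : "invalid";
}
```

Going bit by bit is slower than shifting whole words. It does, however, keep fields that straddle a word boundary correct without special cases, such as Register 1 Address at bits 63-66.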
Added a really pretty compiler diagram.
diff --git a/Fragment+Shader+Backend.mdwn b/Fragment+Shader+Backend.mdwn
index a327b67..140f130 100644
--- a/Fragment+Shader+Backend.mdwn
+++ b/Fragment+Shader+Backend.mdwn
@@ -13,3 +13,7 @@ Allocation for Irregular Architectures](http://user.it.uu.se/~svenolof/wpo/Alloc
 It seems to me that the main problem isn't scheduling (mostly an issue of finding the right heuristics and doing the actual grunt work to see if you can add an instruction to a packet) or register allocation (thinking of using the above-linked algorithm), but how the two should interact. Mainly, the issue has to do with how to deal with register coalescing and spills. Due to the architecture's pipelined nature and abundance of pipeline registers, scheduling has to be able to change the semantics of the program. Furthermore, scheduling would be constrained in how it could pipeline together operations (replacing normal registers with pipeline registers) if register allocation were to be performed first, because it would be harder to determine if it's legal to replace a normal register with a pipeline register. On the other side, scheduling will have the tendency to "hide" certain reads and writes, either because a register was replaced with a pipeline register, or because the instruction writes to a register that isn't the overall destination for an instruction packet (for example, a varying load unit when an ALU is also being used). Certainly, the register allocator will want to take advantage of that reduction in register pressure. Therefore, it seems that the best option is to have an instruction scheduling pass before register allocation.
 
 The difficulty with that, though, is that modern graph-coloring register allocators expect to be able to change the program semantics, even in the middle of allocation. Iterated register coalescing, for example, interleaves register coalescing/copy folding passes into the process of reducing the interference graph. However, doing so changes the way that instructions can be ordered and gives new opportunities to the scheduler, and therefore can change the interference graph. Similarly, adding spill code (temporary reads and writes) can change the structure of the program and therefore the interference graph. This breaks the guarantee, implicit in both stages, that the interference graph won't be changed.
+
+# Pretty Compiler Picture
+
+[<img src="http://img545.imageshack.us/img545/3179/compilerb.png">](http://img545.imageshack.us/img545/3179/compilerb.png)

Added attribute register loading.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index a4c77c0..833ebd6 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -451,6 +451,7 @@ These are the known inputs:
     24, 25: two instructions ago, acc ALU0/1 results
     26, 27: two instructions ago, mul ALU0/1 results
     28-31: last instruction, attribute load result
+Note: If attribute_load_en is disabled then the attribute slot can be used to load registers too.
 
 Instruction format:
 
@@ -474,8 +475,8 @@ Instruction format:
         2 - temporary/uniform load offset 1
         3 - temporary/uniform load offset 2
         7 - no offset
-    58-61: attribute load
-    62:    attribute load enable
+    58-61: attribute/register load
+    62:    attribute load enable (load attribute in attribute slot)
     63-66: register load
     67: temporary store 0 enable
     68: temporary store 1 enable

Updated vertex pipeline diagram (minor fix).
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 03cd653..a4c77c0 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -544,4 +544,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img191.imageshack.us/img191/6044/limavertexpipeline.png">](http://img191.imageshack.us/img191/6044/limavertexpipeline.png)
+[<img src="http://img441.imageshack.us/img441/7590/limavertexpipelinen.png">](http://img441.imageshack.us/img441/7590/limavertexpipelinen.png)

Updated vertex pipeline diagram (minor fix).
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index cd8ac10..03cd653 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -544,4 +544,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img98.imageshack.us/img98/6044/limavertexpipeline.png">](http://img98.imageshack.us/img98/6044/limavertexpipeline.png)
+[<img src="http://img191.imageshack.us/img191/6044/limavertexpipeline.png">](http://img191.imageshack.us/img191/6044/limavertexpipeline.png)

Updated vertex pipeline diagram.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 88fe378..cd8ac10 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -544,4 +544,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img440.imageshack.us/img440/6044/limavertexpipeline.png">](http://img440.imageshack.us/img440/6044/limavertexpipeline.png)
+[<img src="http://img98.imageshack.us/img98/6044/limavertexpipeline.png">](http://img98.imageshack.us/img98/6044/limavertexpipeline.png)

Added 3rd load address register.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 6018d57..88fe378 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -472,6 +472,7 @@ Instruction format:
     55-57: uniform offset register select
         1 - temporary/uniform load offset 0
         2 - temporary/uniform load offset 1
+        3 - temporary/uniform load offset 2
         7 - no offset
     58-61: attribute load
     62:    attribute load enable
@@ -512,6 +513,7 @@ Instruction format:
         12 - temporary store address
         13 - temporary/uniform load offset 0 set
         14 - temporary/uniform load offset 1 set
+        15 - temporary/uniform load offset 2 set
     90-93: varying/register store 0
     94:    varying/register/temporary store 0 destination
         0 - temporary/register

Updated vertex pipeline diagram (minor fix).
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index ca10b66..6018d57 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -542,4 +542,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img546.imageshack.us/img546/6044/limavertexpipeline.png">](http://img546.imageshack.us/img546/6044/limavertexpipeline.png)
+[<img src="http://img440.imageshack.us/img440/6044/limavertexpipeline.png">](http://img440.imageshack.us/img440/6044/limavertexpipeline.png)

Updated vertex pipeline diagram.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 84f482c..ca10b66 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -542,4 +542,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img13.imageshack.us/img13/6044/limavertexpipeline.png">](http://img13.imageshack.us/img13/6044/limavertexpipeline.png)
+[<img src="http://img546.imageshack.us/img546/6044/limavertexpipeline.png">](http://img546.imageshack.us/img546/6044/limavertexpipeline.png)

Added multiple load offset registers.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 096fab9..84f482c 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -470,7 +470,8 @@ Instruction format:
     45:    add ALU1 input 1 negate
     46-54: uniform/temporary/global (max 304) load
     55-57: uniform offset register select
-        1 - temporary/uniform load offset
+        1 - temporary/uniform load offset 0
+        2 - temporary/uniform load offset 1
         7 - no offset
     58-61: attribute load
     62:    attribute load enable
@@ -509,7 +510,8 @@ Instruction format:
         5 - inverse
         9 - passthrough
         12 - temporary store address
-        13 - temporary/uniform load offset
+        13 - temporary/uniform load offset 0 set
+        14 - temporary/uniform load offset 1 set
     90-93: varying/register store 0
     94:    varying/register/temporary store 0 destination
         0 - temporary/register

Changed vertex control opcodes to flags, updated vertex pipeline diagram.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 9f4706b..096fab9 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -475,12 +475,10 @@ Instruction format:
     58-61: attribute load
     62:    attribute load enable
     63-66: register load
-    67-70: control opcode
-        0 - nop
-        1 - temporary store
-        2 - ??? (something to do with temporaries...)
-        4 - branch to branch target + 256
-        12 - branch
+    67: temporary store 0 enable
+    68: temporary store 1 enable
+    69: branch
+    70: branch target low (< 0x100)
     71-73: varying/register/temporary store input 0
     74-76: varying/register/temporary store input 1
     77-79: varying/register/temporary store input 2
@@ -542,4 +540,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img72.imageshack.us/img72/6044/limavertexpipeline.png">](http://img72.imageshack.us/img72/6044/limavertexpipeline.png)
+[<img src="http://img13.imageshack.us/img13/6044/limavertexpipeline.png">](http://img13.imageshack.us/img13/6044/limavertexpipeline.png)
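Reading this change together with the previous one suggests how the two branch flags combine. The old control value 12 (bits 69 and 70 set) meant a plain branch, while 4 (bit 69 only) meant "branch to branch target + 256".

A minimal C sketch of that reading follows. It makes three assumptions:

* The 8-bit branch target sits at bits 120-127, as in the older format listing.
* The instruction is stored as four 32-bit words with bit n at bit (n % 32) of word n / 32.
* `vs_bit` and `vs_branch_target` are hypothetical names.

Treat it as an interpretation of the log, not a confirmed encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit n of the instruction is bit (n % 32) of word n / 32 (assumed layout). */
static unsigned vs_bit(const uint32_t insn[4], unsigned n)
{
    assert(n < 128);
    return (insn[n / 32] >> (n % 32)) & 1u;
}

/* Hypothetical decoder for the branch flags at bits 69-70.
 * Returns false if bit 69 (branch) is clear; otherwise stores the
 * absolute target, counted in instructions, in *target. */
static bool vs_branch_target(const uint32_t insn[4], unsigned *target)
{
    assert(target);
    if (!vs_bit(insn, 69))
        return false;

    unsigned low = (insn[3] >> 24) & 0xFFu; /* bits 120-127 (assumed) */
    /* Assumed: bit 70 set means the target is below 0x100; clear means +256. */
    *target = vs_bit(insn, 70) ? low : low + 0x100;
    return true;
}
```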

add/modify control opcode for vertex shaders
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 8499f2e..9f4706b 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -475,10 +475,12 @@ Instruction format:
     58-61: attribute load
     62:    attribute load enable
     63-66: register load
-    67:    temporary store enable
-    68-70: control opcode
+    67-70: control opcode
         0 - nop
-        6 - branch
+        1 - temporary store
+        2 - ??? (something to do with temporaries...)
+        4 - branch to branch target + 256
+        12 - branch
     71-73: varying/register/temporary store input 0
     74-76: varying/register/temporary store input 1
     77-79: varying/register/temporary store input 2

Updated vertex pipeline diagram.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 91ba0c6..8499f2e 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -540,4 +540,4 @@ Instruction format:
 
 
 ## Lima Vertex Pipeline
-[<img src="http://img441.imageshack.us/img441/6044/limavertexpipeline.png">](http://img441.imageshack.us/img441/6044/limavertexpipeline.png)
+[<img src="http://img72.imageshack.us/img72/6044/limavertexpipeline.png">](http://img72.imageshack.us/img72/6044/limavertexpipeline.png)

Added vertex pipeline diagram.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 3401d0f..91ba0c6 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -536,3 +536,8 @@ Instruction format:
         12 - temporary write
         13 - branch
     120-127: branch target (absolute, 0 is 1st instruction of program)
+
+
+
+## Lima Vertex Pipeline
+[<img src="http://img441.imageshack.us/img441/6044/limavertexpipeline.png">](http://img441.imageshack.us/img441/6044/limavertexpipeline.png)

fixed texture fetch coordinate load, again
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 10bbb4e..3401d0f 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -108,8 +108,7 @@ There also exists various "pipeline registers" (four of them listed above) which
     
     m - Mask, (0001 = float, 0011 = vec2, 0111 = vec3, 1111 = vec4)
     d - Destination Register
-        Mali200: Writing to register 15 here loads coordinates for the texture sampler.
-        Mali400: The input to the sampler is always the output of this unit.
+        Note: writing to register 15 discards the output (used for loading texture coordinates)
     i - Varying Index
     a - alignment
         It seems that varyings (floats) can be loaded in aligned groups of 1, 2, or 4.
@@ -154,6 +153,8 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     00111001000000000001ssssssssssssottttt00000b000000ccccccrrrrrr
 
+    The coordinates for the texture fetch are always the output of the varying load.
+
     s - sampler index
     o - sampler index register offset enable
     c - sampler index offset register
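The sampler-index fields of the texture fetch encoding shown in this change can be extracted with shifts and masks. This is a minimal C sketch, assuming the 62-character diagram is written most significant bit first and the field is held right-aligned in a `uint64_t`. `decode_tex_sampler` and `struct tex_sampler_fields` are hypothetical names. The t, b and r fields are left out because their meaning is not given in this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct tex_sampler_fields {
    unsigned sampler;       /* s: sampler index, 12 bits (bits 30-41) */
    bool     offset_enable; /* o: sampler index register offset enable (bit 29) */
    unsigned offset_reg;    /* c: sampler index offset register, 6 bits (bits 6-11) */
};

/* Hypothetical helper; bit 0 is the rightmost character of the diagram. */
static struct tex_sampler_fields decode_tex_sampler(uint64_t field)
{
    struct tex_sampler_fields f;
    assert((field >> 62) == 0); /* the diagram is 62 bits wide */
    f.offset_reg    = (unsigned)((field >> 6) & 0x3F);
    f.offset_enable = ((field >> 29) & 1) != 0;
    f.sampler       = (unsigned)((field >> 30) & 0xFFF);
    return f;
}
```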

Mali200/400 sampler co-ordinate difference documented.
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 86978b5..10bbb4e 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -108,7 +108,8 @@ There also exists various "pipeline registers" (four of them listed above) which
     
     m - Mask, (0001 = float, 0011 = vec2, 0111 = vec3, 1111 = vec4)
     d - Destination Register
-        Note: writing to register 15 here loads coordinates for the texture sampler.
+        Mali200: Writing to register 15 here loads coordinates for the texture sampler.
+        Mali400: The input to the sampler is always the output of this unit.
     i - Varying Index
     a - alignment
         It seems that varyings (floats) can be loaded in aligned groups of 1, 2, or 4.

remove bogus restriction from fragment shader intro, added section on pipeline registers
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 406b7ed..86978b5 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -2,7 +2,7 @@
 
 ## Fragment Shader Architecture
 
-The architecture consists of a large pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. Usually, each unit can affect/produce results which are used by all later units (see "Lima Fragment Pipeline" below). In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. The result of the multiplication unit can be used as the input of the addition unit, in order to implement Fused Multiply-Add and other combinations. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Note that, as shown by the pipeline diagram, there is only one write port, and therefore only one (vector) register can be written to per instruction. The register written for the entire instruction is chosen by the machine as the register written by the last enabled unit. Although earlier units can write to different registers, the effects of those writes will be ignored beyond the current instruction. To overcome the pipeline stall issues inherent in such a long pipeline, the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
+The architecture consists of a large pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (~256 stages), the architecture is likely barrelled and interleaves execution of a large number (~256) of fragments at once, and scheduling is done by the machine in order to minimize stalls.
 The instruction stream is compressed down from a maximum of 18 words per instruction, depending on which units are in use.
 The remaining bits give each unit individual instructions and constants.
 

diff --git a/index.mdwn b/index.mdwn
index 151079d..5544f70 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -46,6 +46,13 @@ Documentation for the shader compiler, and the initial investigation of the inst
 
 The documentation is currently kept in the wiki, pages of interest are:
 
+Original (Falanx) datasheets:
+
+* [Mali200 Product](http://web.archive.org/web/20060515063019/http://www.falanx.no/download/Mali200_Product.pdf)
+* [Mali Geometry Product Spec](http://web.archive.org/web/20060515063211/http://www.falanx.no/download/Mali%20Geometry%20Product%20Spec%20USL.pdf)
+
+Lima Documents:
+
 * [[Lima+Assembler]]
 * [[Lima+ISA]]
 * [[Fragment+Assembly+Syntax]]

Added info on dumping malisc symbols
diff --git a/Mali_Offline_Shader_Compiler.mdwn b/Mali_Offline_Shader_Compiler.mdwn
index f9b759c..350f472 100644
--- a/Mali_Offline_Shader_Compiler.mdwn
+++ b/Mali_Offline_Shader_Compiler.mdwn
@@ -14,3 +14,11 @@ Full documentation can be found at [[MBS+File+Format]].
 
 There's a tool in our git tree called mbs_dump which will dump out an MBS file in a readable form, it also takes various options for how to disassemble/decompile the fragment/vertex code.
 For more info on this tool and how to use it read [[Lima+Assembler]].
+
+# Reverse Engineering
+
+It turns out that the Mali developers were kind enough to leave all the debug symbols in the final binary, and their underlying code is clean enough that it's possible to see a lot of what's going on just via the function/symbol names.
+
+To view the symbols do the following:
+
+    objdump -t `which malisc`

add gl_PointSize
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 0b2ae31..406b7ed 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -403,9 +403,15 @@ Akin to the fragment shader, there are also temporaries, which unlike registers
 
 gl_Position is implemented internally as a varying; it seems that it is hard-coded to varying 0. The compiler implements some transforms internally to convert the value calculated for gl_Position in the shader to the actual value sent to the hardware. In particular (in pseudocode):
 
+    uniform vec4 gl_mali_ViewportTransform[2];
     gl_position_actual.w = clamp(1.0 / gl_Position.w, -1e10, 1e10);
     gl_position_actual.xyz  = gl_Position.xyz * gl_position_actual.w * gl_mali_ViewportTransform[0].xyz + gl_mali_ViewportTransform[1].xyz;
 
+gl_PointSize is also implemented internally as a varying. However, unlike gl_Position, the varying it uses doesn't appear to be fixed. There are also some transforms involved:
+
+    uniform vec4 gl_mali_PointSizeParameters;
+    gl_PointSize_actual = clamp(gl_PointSize, gl_mali_PointSizeParameters.x, gl_mali_PointSizeParameters.y) * gl_mali_PointSizeParameters.z;
+
 ## Complex functions
 
 Complex functions are implemented using multiply ALU opcodes 1 and 3, as well as various complex ALU opcodes. The computation looks like this:
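Returning to the gl_Position and gl_PointSize pseudocode added in this change, the two transforms can be written out in plain C, for instance to check them against values dumped from the binary driver. This is a minimal sketch, assuming `gl_mali_ViewportTransform[0]` acts as a per-component scale and `[1]` as an offset, which is how the formula uses them. `vec4`, `clampf` and the function names are hypothetical.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } vec4;

static float clampf(float v, float lo, float hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* gl_position_actual: w becomes a clamped 1/w, and xyz are multiplied by
 * that value, then scaled by viewport_transform[0] and offset by [1]. */
static vec4 mali_position_actual(vec4 pos, const vec4 viewport_transform[2])
{
    vec4 out;
    out.w = clampf(1.0f / pos.w, -1e10f, 1e10f);
    out.x = pos.x * out.w * viewport_transform[0].x + viewport_transform[1].x;
    out.y = pos.y * out.w * viewport_transform[0].y + viewport_transform[1].y;
    out.z = pos.z * out.w * viewport_transform[0].z + viewport_transform[1].z;
    return out;
}

/* gl_PointSize_actual: clamp to [params.x, params.y], then scale by params.z. */
static float mali_point_size_actual(float point_size, vec4 params)
{
    return clampf(point_size, params.x, params.y) * params.z;
}
```

Under IEEE 754 arithmetic, a w of 0 gives an infinite 1/w, which the clamp turns into +1e10 or -1e10. That may be why the compiler emits the clamp at all.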

Uploaded malisc symbol table.
diff --git a/malisc+symbols.c b/malisc+symbols.c
new file mode 100644
index 0000000..abc076a
--- /dev/null
+++ b/malisc+symbols.c
@@ -0,0 +1,1091 @@
+
+/bin/malisc:     file format elf32-i386
+
+SYMBOL TABLE:
+08048134 l    d  .interp	00000000              .interp
+08048148 l    d  .note.ABI-tag	00000000              .note.ABI-tag
+08048168 l    d  .note.gnu.build-id	00000000              .note.gnu.build-id
+0804818c l    d  .hash	00000000              .hash
+080482d4 l    d  .gnu.hash	00000000              .gnu.hash
+08048300 l    d  .dynsym	00000000              .dynsym
+080485b0 l    d  .dynstr	00000000              .dynstr
+08048732 l    d  .gnu.version	00000000              .gnu.version
+08048788 l    d  .gnu.version_r	00000000              .gnu.version_r
+080487f8 l    d  .rel.dyn	00000000              .rel.dyn
+08048810 l    d  .rel.plt	00000000              .rel.plt
+08048940 l    d  .init	00000000              .init
+08048970 l    d  .plt	00000000              .plt
+08048be0 l    d  .text	00000000              .text
+0808fd7c l    d  .fini	00000000              .fini
+0808fd98 l    d  .rodata	00000000              .rodata
+08096c18 l    d  .eh_frame	00000000              .eh_frame
+08097efc l    d  .ctors	00000000              .ctors
+08097f04 l    d  .dtors	00000000              .dtors
+08097f0c l    d  .jcr	00000000              .jcr
+08097f10 l    d  .dynamic	00000000              .dynamic
+08097ff0 l    d  .got	00000000              .got
+08097ff4 l    d  .got.plt	00000000              .got.plt
+08098098 l    d  .data	00000000              .data
+080980a0 l    d  .bss	00000000              .bss
+00000000 l    d  .comment	00000000              .comment
+00000000 l    df *ABS*	00000000              crtstuff.c
+08097efc l     O .ctors	00000000              __CTOR_LIST__
+08097f04 l     O .dtors	00000000              __DTOR_LIST__
+08097f0c l     O .jcr	00000000              __JCR_LIST__
+08048c10 l     F .text	00000000              __do_global_dtors_aux
+080980c4 l     O .bss	00000001              completed.7021
+080980c8 l     O .bss	00000004              dtor_idx.7023
+08048c70 l     F .text	00000000              frame_dummy
+00000000 l    df *ABS*	00000000              crtstuff.c
+08097f00 l     O .ctors	00000000              __CTOR_END__
+08096c18 l     O .eh_frame	00000000              __FRAME_END__
+08097f0c l     O .jcr	00000000              __JCR_END__
+0808fd50 l     F .text	00000000              __do_global_ctors_aux
+00000000 l    df *ABS*	00000000              driver.c
+00000000 l    df *ABS*	00000000              commandline.c
+08049064 l     F .text	0000022e              parse_hardware_revision
+08090720 l     O .rodata	0000000c              CSWTCH.41
+00000000 l    df *ABS*	00000000              essl_test_system.c
+00000000 l    df *ABS*	00000000              compiler.c
+08049d78 l     F .text	00000059              examine_error
+08049edf l     F .text	00000088              allocate_compiler_context
+08090748 l     O .rodata	00000008              CSWTCH.7
+00000000 l    df *ABS*	00000000              error_reporting.c
+08090bc0 l     O .rodata	00000020              CSWTCH.76
+0804a234 l     F .text	0000003d              increase_buf
+0804a271 l     F .text	0000005d              write_internal_compiler_error
+08090a58 l     O .rodata	00000168              CSWTCH.73
+00000000 l    df *ABS*	00000000              essl_list.c
+0804aa4b l     F .text	000000b6              split_and_merge
+00000000 l    df *ABS*	00000000              essl_mem.c
+0804abc7 l     F .text	0000004f              allocate_block
+00000000 l    df *ABS*	00000000              compiler_options.c
+00000000 l    df *ABS*	00000000              essl_stringbuffer.c
+0804af7b l     F .text	0000008a              _essl_string_buffer_reserve
+00000000 l    df *ABS*	00000000              essl_target.c
+00000000 l    df *ABS*	00000000              output_buffer.c
+00000000 l    df *ABS*	00000000              frontend.c
+0804b85e l     F .text	000000a9              function_partial_sort
+00000000 l    df *ABS*	00000000              typecheck.c
+0804bf00 l     F .text	00000047              type_is_or_has_sampler
+0804bf47 l     F .text	0000003c              type_is_or_has_array
+0804c012 l     F .text	0000015e              check_lvalue
+0804c170 l     F .text	00000102              typecheck_array_size
+0804e5f7 l     F .text	0000009a              typecheck
+00000000 l    df *ABS*	00000000              preprocessor.c
+0804e7a1 l     F .text	000000db              read_scanner_token
+0804e87c l     F .text	00000045              push_if_stack_entry
+0804e8c1 l     F .text	0000004a              encounter_command
+08092818 l     O .rodata	0000009c              command_strings
+0804e90b l     F .text	0000007d              add_predefined_macro
+0804ea5a l     F .text	0000011c              unary
+0804efec l     F .text	0000004f              logical_inclusive_or
+0804eb76 l     F .text	0000010e              multiplicative
+0804ec84 l     F .text	00000072              additive
+0804ecf6 l     F .text	0000009c              bitwise_shift
+0804ed92 l     F .text	000000b2              relational
+0804ee44 l     F .text	00000081              equality
+0804eec5 l     F .text	00000046              bitwise_and
+0804ef0b l     F .text	00000046              bitwise_exclusive_or
+0804ef51 l     F .text	00000046              bitwise_inclusive_or
+0804ef97 l     F .text	00000055              logical_and
+0804f03b l     F .text	0000006a              get_pp_token
+0804f0a5 l     F .text	000000aa              peek_pp_token
+0804f14f l     F .text	00000137              defined_operator
+0804f286 l     F .text	00000065              generate_integer_token
+0804f2eb l     F .text	00000cda              replace_macro
+0804ffc5 l     F .text	0000037e              directive_constant_expression
+08050343 l     F .text	000003da              skip_tokens
+00000000 l    df *ABS*	00000000              lang.c
+08092ac0 l     O .rodata	00000024              extension_names
+08092ae4 l     O .rodata	0000000c              CSWTCH.12
+00000000 l    df *ABS*	00000000              callgraph.c
+08052434 l     F .text	00000061              record_func
+08052495 l     F .text	00000167              note_calls
+00000000 l    df *ABS*	00000000              precision.c
+08052740 l     F .text	00000022              type_has_precision_qualification
+08092b38 l     O .rodata	00000024              CSWTCH.24
+08052762 l     F .text	00000053              get_default_precision_for_type
+080527b5 l     F .text	0000004e              new_type_conversion
+08052803 l     F .text	000000a3              insert_bitwise_casts_for_children_with_specific_type
+080528a6 l     F .text	0000005c              insert_bitwise_casts_for_children
+08052902 l     F .text	0000033b              insert_bitwise_casts
+08052c3d l     F .text	0000014f              get_type_with_set_precision
+08052d8c l     F .text	0000006e              set_precision_qualifier_for_node
+08052dfa l     F .text	000000b4              propagate_precision_upward
+08052eae l     F .text	000000f7              propagate_default_precision_upward
+080534ae l     F .text	000000e8              calculate_precision
+00000000 l    df *ABS*	00000000              global_variable_inlining.c
+0805367c l     F .text	000000d5              find_and_rewrite_nodes
+08053751 l     F .text	0000032d              visit_function
+08092bf0 l     O .rodata	00000008              CSWTCH.8
+00000000 l    df *ABS*	00000000              middle.c
+00000000 l    df *ABS*	00000000              control_deps.c
+08053ea4 l     F .text	00000065              symbol_for_node
+08053f09 l     F .text	0000003c              add_dependency
+080543a4 l     F .text	000000df              addresses_identical
+00000000 l    df *ABS*	00000000              optimise_loop_entry.c
+080548e0 l     F .text	00000071              clone_exp
+08054951 l     F .text	00000192              optimise_loop_entry_stmt
+00000000 l    df *ABS*	00000000              optimise_inline_functions.c
+08054b2f l     F .text	00000268              clone_node
+08054d97 l     F .text	000001fe              clone_basic_block
+08054f95 l     F .text	00000045              remove_control_dependent_op_node
+00000000 l    df *ABS*	00000000              optimise_basic_blocks.c
+00000000 l    df *ABS*	00000000              optimise_constant_fold.c
+08055a20 l     F .text	00000141              constant_fold
+00000000 l    df *ABS*	00000000              eliminate_complex_ops.c
+08055f98 l     F .text	0000005c              is_expensive_matrix_result
+08055ff4 l     F .text	000000f6              replace_returns
+080560ea l     F .text	00000066              create_index_int_constant
+08056150 l     F .text	00000d4c              process_single_node
+08092c50 l     O .rodata	00000078              CSWTCH.131
+080573af l     F .text	000001a2              explode_struct_comparison
+08057551 l     F .text	0000011f              store_reload_variable
+08057670 l     F .text	000001e0              rewrite_component_wise_matrix_op
+08056e9c l     F .text	0000009c              process_node
+00000000 l    df *ABS*	00000000              ssa.c
+08057850 l     F .text	0000009d              var_hash_fun
+080578ed l     F .text	0000003c              node_stack_push
+08057929 l     F .text	000000e0              insert_phi_node
+08057a09 l     F .text	0000005e              clone_address
+08057a67 l     F .text	00000064              create_dummy_symbol
+08057acb l     F .text	0000005b              node_stack_get_or_create
+08057b26 l     F .text	00000059              node_stack_node_get_or_create
+08057b7f l     F .text	00000050              node_stack_get_or_create_top
+08057bcf l     F .text	00000334              ssa_rename
+08058265 l     F .text	00000129              var_equal_fun
+00000000 l    df *ABS*	00000000              conditional_select.c
+00000000 l    df *ABS*	00000000              static_cycle_count.c
+00000000 l    df *ABS*	00000000              mali200_target.c
+08058b0c l     F .text	00000007              cycles_for_jump
+08058b13 l     F .text	0000002b              cycles_for_block
+08058b3e l     F .text	0000000a              is_variable_in_indexable_memory
+00000000 l    df *ABS*	00000000              mali200_type.c
+08058ca4 l     F .text	00000078              internal_type_alignment
+00000000 l    df *ABS*	00000000              mali200_driver.c
+00000000 l    df *ABS*	00000000              mali200_instruction.c
+08092e78 l     O .rodata	00000064              CSWTCH.113
+08092edc l     O .rodata	00000040              CSWTCH.116
+08059443 l     F .text	000004fa              handle_input
+00000000 l    df *ABS*	00000000              mali200_slot.c
+0805a810 l     F .text	00000148              can_be_replaced_by
+08092f1c l     O .rodata	00000010              CSWTCH.12
+00000000 l    df *ABS*	00000000              mali200_regalloc.c
+0805af54 l     F .text	0000008f              init_regalloc_context
+0805afe3 l     F .text	00000086              reset_allocations
+0805b069 l     F .text	0000006b              prepare_ranges_for_coloring
+0805b155 l     F .text	00000110              allocate_all_ranges
+00000000 l    df *ABS*	00000000              mali200_register_integration.c
+0805b518 l     F .text	00000390              integrate_instruction
+08092f60 l     O .rodata	0000015c              CSWTCH.9
+00000000 l    df *ABS*	00000000              mali200_spilling.c
+0805b95c l     F .text	00000143              put_load
+0805ba9f l     F .text	000000ab              put_store
+0805be5f l     F .text	0000012e              complete_spill_range
+080930bc l     O .rodata	00000006              spillname
+080930c4 l     O .rodata	00000040              mask_n_comps
+00000000 l    df *ABS*	00000000              mali200_word_insertion.c
+0805c37c l     F .text	0000007a              insert_cycle_into_instructions
+00000000 l    df *ABS*	00000000              mali200_emit.c
+0805c7a4 l     F .text	00000065              in_sub_reg
+0805c809 l     F .text	000000c0              opcode_of_mult
+080933a8 l     O .rodata	00000014              CSWTCH.93
+0805c8c9 l     F .text	00000154              opcode_of_add

(Diff truncated)
index: add link to fragment shader backend doc
diff --git a/index.mdwn b/index.mdwn
index 2561862..151079d 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -52,6 +52,7 @@ The documentation is currently kept in the wiki, pages of interest are:
 * [[Vertex+Disassembly]]
 * [[Mali_Offline_Shader_Compiler]]
 * [[MBS+File+Format]]
+* [[Fragment+Shader+Backend]]
 
 ## Contribute
 ===

add page on fragment shader backend
diff --git a/Fragment+Shader+Backend.mdwn b/Fragment+Shader+Backend.mdwn
new file mode 100644
index 0000000..a327b67
--- /dev/null
+++ b/Fragment+Shader+Backend.mdwn
@@ -0,0 +1,15 @@
+## Notes on fragment shader backend
+
+I'm just using this page for now to collect my thoughts on how to write the backend for the fragment processor, and to collect links that may be useful for future reference. I'm focusing on the backend, since it's the most difficult part of the compiler due to the fragment shader's novel architecture.
+
+# Links
+
+* [Retargetable Graph-Coloring Register Allocation for Irregular Architectures](http://user.it.uu.se/~svenolof/wpo/AllocSCOPES2003.20030626b.pdf) - used by various Mesa backends, should work well for us
+* [Iterated Register Coalescing](http://www.cs.cmu.edu/afs/cs/academic/class/15745-s07/www/papers/george.pdf) - standard technique for register coalescing
+
+# Thoughts
+
+It seems to me that the main problem isn't scheduling (mostly an issue of finding the right heuristics and doing the actual grunt work to see if you can add an instruction to a packet) or register allocation (thinking of using the above-linked algorithm), but how the two should interact. Mainly, the issue has to do with how to deal with register coalescing and spills. Due to the architecture's pipelined nature and abundance of pipeline registers, scheduling has to be able to change the semantics of the program. Furthermore, scheduling would be constrained in how it could pipeline together operations (replacing normal registers with pipeline registers) if register allocation were to be performed first, because it would be harder to determine if it's legal to replace a normal register with a pipeline register. On the other side, scheduling will have the tendency to "hide" certain reads and writes, either because a register was replaced with a pipeline register, or because the instruction writes to a register that isn't the overall destination for an instruction packet (for example, a varying load unit when an ALU is also being used). Certainly, the register allocator will want to take advantage of that reduction in register pressure. Therefore, it seems that the best option is to have an instruction scheduling pass before register allocation.
+
+The difficulty with that, though, is that modern graph-coloring register allocators expect to be able to change the program semantics, even in the middle of allocation. Iterated register coalescing, for example, interleaves register coalescing/copy folding passes into the process of reducing the interference graph. However, doing so changes the way that instructions can be ordered and gives new opportunities to the scheduler, and therefore can change the interference graph. Similarly, adding spill code (temporary reads and writes) changes the structure of the program and therefore the interference graph. This breaks the guarantee, implicit in both stages, that the interference graph won't be changed.

added note about restriction on register writes
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 0cf5cbf..0b2ae31 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -2,7 +2,7 @@
 
 ## Fragment Shader Architecture
 
-The architecture consists of a long pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. Usually, each unit can affect/produce results which are used by all later units (see "Lima Fragment Pipeline" below). In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. The result of the multiplication unit can be used as the input of the addition unit, in order to implement Fused Multiply-Add and other combinations. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. To overcome the pipeline stall issues inherent in such a long pipeline, the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
+The architecture consists of a long pipeline made up of a number of vector and scalar units which can be enabled and disabled by the control word. Usually, each unit can affect/produce results which are used by all later units (see "Lima Fragment Pipeline" below). In particular, there are two vector ALUs, one of which can do addition, another which can do multiplication. The result of the multiplication unit can be used as the input of the addition unit, in order to implement Fused Multiply-Add and other combinations. There are also two similar scalar ALUs, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Note that, as shown by the pipeline diagram, there is only one write port, and therefore only one (vector) register can be written to per instruction. The register written for the entire instruction is chosen by the machine as the register written by the last enabled unit. Although earlier units can write to different registers, the effects of those writes will be ignored beyond the current instruction. To overcome the pipeline stall issues inherent in such a long pipeline, the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
 The instruction stream is compressed down from a maximum of 18 words per instruction, depending on which units are in use.
 The remaining bits give each unit individual instructions and constants.
 

diff --git a/Hardware.mdwn b/Hardware.mdwn
index 7e9fb86..e240b13 100644
--- a/Hardware.mdwn
+++ b/Hardware.mdwn
@@ -58,6 +58,9 @@ A Dual Core ARM Cortex A9 running at 1GHz, which includes a Mali-400 MP1. Most e
 
 Plenty of information, which might be very Snowball-specific, can be found on [the igloo community website](http://igloocommunity.org/).
 
+There is a pre-built image of Linaro Android with the Lima(re) demo included. It consists of three files, [system.tar.bz2](http://snapshots.linaro.org/android/~joe-burmeister/test-lima-snowball/27/target/product/snowball/system.tar.bz2), [boot.tar.bz2](http://snapshots.linaro.org/android/~joe-burmeister/test-lima-snowball/27/target/product/snowball/boot.tar.bz2) and [userdata.tar.bz2](http://snapshots.linaro.org/android/~joe-burmeister/test-lima-snowball/27/target/product/snowball/userdata.tar.bz2), which you put onto an SD card following [Linaro's image installation instructions](https://wiki.linaro.org/Platform/Android/ImageInstallation).
+
+
 ## Samsung Exynos
 
 The [Samsung Exynos](http://en.wikipedia.org/wiki/Exynos) 42xx is a range of ARM Cortex A9 devices clocked between 1.2 and 1.8GHz. They are the only devices currently carrying a Mali-400MP4. The Exynos of course stars in the top-selling, high-end, Android-based Samsung smartphones and tablets. The best-selling phone of 2011, the Samsung Galaxy S II, comes with an Exynos. A [Single Board Computer with a 4210, called Origen,](http://www.origenboard.org/) is available with Android and Ubuntu support.

diff --git a/Mali_Offline_Shader_Compiler.mdwn b/Mali_Offline_Shader_Compiler.mdwn
index af2a33f..f9b759c 100644
--- a/Mali_Offline_Shader_Compiler.mdwn
+++ b/Mali_Offline_Shader_Compiler.mdwn
@@ -12,4 +12,5 @@ Full documentation can be found at [[MBS+File+Format]].
 
 #Extracting the program binary
 
-So far, I've written a very simple C program which extracts the program itself from the malisc output - it just gets the data from the DBIN tag. I've uploaded it [here](http://pastebin.com/aF9c1GKG) for now.
+There's a tool in our git tree called mbs_dump which dumps an MBS file in a readable form; it also takes various options controlling how the fragment/vertex code is disassembled/decompiled.
+For more info on this tool and how to use it read [[Lima+Assembler]].

diff --git a/index.mdwn b/index.mdwn
index 9923b7e..2561862 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -32,14 +32,14 @@ Documentation for the shader compiler, and the initial investigation of the inst
 
 ### [Mali-400](Hardware#Mali-400):
 
-* [AMLogic 8726-M](Hardware#AMLogic+8726-M)
+* [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
-* [Samsung Exynos](Hardware#Samsung+Exynos)
+* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab)
 
 ### [Mali-200](Hardware#Mali-200):
 
-* [Telechips 8902](Hardware#Telechips+8902), [8803](Hardware#Telechips+8803)
+* [Telechips 8902](Hardware#Telechips+8902), [8803](Hardware#Telechips+8803) (Haipad MID701)
 
 ## Documentation
 ===