Recent changes to this wiki:

looks like gl_mali_FragCoordScale disappeared in the latest compiler
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 0afdbc1..225a37c 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -141,8 +141,7 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     Note: for gl_FragCoord and gl_PointCoord, the shader has to apply some transforms to get the right value.
     For gl_FragCoord, it looks like this in pseudocode:
-    uniform vec3 gl_mali_FragCoordScale; //created by the compiler
-    gl_FragCoord.xyz = gl_FragCoord_orig.xyz * gl_mali_FragCoordScale.xyz;
+    gl_FragCoord.xyz = gl_FragCoord_orig.xyz;
     gl_FragCoord.w = 1.0 / gl_FragCoord_orig.w;
 
     And for gl_PointCoord:

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index c686f53..0afdbc1 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -373,13 +373,13 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     Temporary Write:
 
-    i iiii iiii iiii iiio rrrr rr00 0000 a0ss ssss 00dd
+    i iiii iiii iiii iiio rrrr rr00 0000 aass ssss 00dd
 
     i - destination index
-    a - alignment (similar to varying load, but only 1 or 4-aligned)
-        If set to 1, store four floats at once (vec4)
-        If set to 0, only store one float (scalar)
-        Note: index 1 when a is 1 = indices 4, 5, 6, and 7 when a is 0
+    a - alignment
+        0 - float
+        1 - vec2
+        2 - vec4
     s - source register
         If the alignment is set to vec4, then only the upper 4 bits are used.
     d - destination

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 5e3638f..71b4ef7 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -117,7 +117,7 @@ Each entry in the symbol tables represents an array of a single base type (vec4,
 Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed:
 
 * For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
-* For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...) and vec2's inside of structures where the alignment is 2 as well (WTF?!).
+* Uniforms in the PP have the same alignment restrictions as varyings. Note that ints and bools have the exact same behavior as floats, and a sampler takes up 1 float of space (WTF?).
 * For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
 * For attributes, there is no packing so everything occupies a separate vec4 variable.
 * An array has the same alignment as its base type, and a stride equal to the stride of the base type times the number of elements.

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 5bca87d..5e3638f 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -125,7 +125,7 @@ Each storage type (uniform, varying, attribute) has a different alignment and st
 
 Here is a more detailed description of each field:
 
-* component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure. For samplers, it is the number of dimensions of the sampler (why is this needed??).
+* component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure. For samplers, it is the number of dimensions of the sampler, sampler2D is 2 and samplerCube is 3 (and presumably sampler3D is 3) (why is this needed??).
 * component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end. Normally, this is the same as the stride, except for matrices where component_size is the same as component_count.
 * entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
 * src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
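
Note on the src_stride formula above: a minimal sketch of the indirect-access arithmetic, `offset + index * src_stride` with `0 <= index < entry_count`. The function name `element_offset` is hypothetical, and the unit of the offsets (floats vs. bytes) is not stated by the page, so the numbers here are unitless.

```python
def element_offset(offset: int, index: int, src_stride: int, entry_count: int) -> int:
    """Return the offset of element `index` of an array symbol.

    Implements offset + index * src_stride from the field description.
    Assumption: entry_count > 0 (the symbol is an array).
    """
    if not 0 <= index < entry_count:
        raise IndexError(f"index {index} outside 0..{entry_count - 1}")
    return offset + index * src_stride
```

For example, with a base offset of 8 and a src_stride of 4, element 3 of a 5-entry array would land at 8 + 3 * 4 = 20.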

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index c72d2eb..5bca87d 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -125,8 +125,8 @@ Each storage type (uniform, varying, attribute) has a different alignment and st
 
 Here is a more detailed description of each field:
 
-* component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure.
-* component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end.
+* component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure. For samplers, it is the number of dimensions of the sampler (why is this needed??).
+* component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end. Normally, this is the same as the stride, except for matrices where component_size is the same as component_count.
 * entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
 * src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
 * dst_stride: always 16??? except for varyings fed directly into texture fetches, where it's 24

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index b26aff6..c72d2eb 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -129,6 +129,6 @@ Here is a more detailed description of each field:
 * component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end.
 * entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
 * src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
-* dst_stride: always 16???
+* dst_stride: always 16??? except for varyings fed directly into texture fetches, where it's 24
 * offset: the base offset of the symbol, relative to the start of the structure or to the start of the array if this symbol is not part of a structure. Note that this must be a multiple of the alignment of this symbol.
 * index: -1 if not part of a structure, otherwise the index of the parent structure in the symbol table.

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 8ba6841..b26aff6 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -112,12 +112,12 @@
 
 ## Symbols & Symbol layout
 
-Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying and uniform packing algorithm, there is no notion of columns and row.
+Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying and uniform packing algorithm, there is (mostly) no notion of columns and row.
 
 Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed:
 
 * For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
-* For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
+* For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...) and vec2's inside of structures where the alignment is 2 as well (WTF?!).
 * For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
 * For attributes, there is no packing so everything occupies a separate vec4 variable.
 * An array has the same alignment as its base type, and a stride equal to the stride of the base type times the number of elements.

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 056d398..8ba6841 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -112,13 +112,13 @@
 
 ## Symbols & Symbol layout
 
-Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying-packing algorithm, there is no notion of columns and row.
+Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying and uniform packing algorithm, there is no notion of columns and row.
 
 Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed:
 
 * For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
 * For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
-* For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the origina, ESSL packing rules.
+* For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
 * For attributes, there is no packing so everything occupies a separate vec4 variable.
 * An array has the same alignment as its base type, and a stride equal to the stride of the base type times the number of elements.
 * A structure's alignment is determined by the largest alignment of all its children, and its stride is the smallest multiple of its alignment greater than or equal to its size.
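
Note on the structure rule above: a rough sketch of how a structure's alignment and stride could be computed from its members. The alignment rule (largest child alignment) and the stride rule (smallest multiple of the alignment that is at least the size) come from the list above. Packing members sequentially, each at the next aligned offset, is an assumption of this sketch, as are the names `align_up` and `struct_layout`.

```python
def align_up(value: int, alignment: int) -> int:
    """Smallest multiple of `alignment` that is >= `value`."""
    return (value + alignment - 1) // alignment * alignment


def struct_layout(members):
    """Lay out a non-empty list of (alignment, size) members.

    Returns (member_offsets, struct_alignment, struct_stride).
    Assumption: members are placed in order at the next aligned offset.
    """
    offsets = []
    cursor = 0
    for alignment, size in members:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    struct_alignment = max(alignment for alignment, _ in members)
    struct_stride = align_up(cursor, struct_alignment)
    return offsets, struct_alignment, struct_stride
```

Using the varying alignments (float 1, vec2 2, vec3 4) and assuming sizes of 1, 2 and 3, `struct_layout([(1, 1), (2, 2), (4, 3)])` places the members at offsets 0, 2 and 4, with a structure alignment of 4 and a stride of 8.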

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 93f21af..056d398 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -118,8 +118,10 @@ Each storage type (uniform, varying, attribute) has a different alignment and st
 
 * For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
 * For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
-* For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
+* For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the origina, ESSL packing rules.
 * For attributes, there is no packing so everything occupies a separate vec4 variable.
+* An array has the same alignment as its base type, and a stride equal to the stride of the base type times the number of elements.
+* A structure's alignment is determined by the largest alignment of all its children, and its stride is the smallest multiple of its alignment greater than or equal to its size.
 
 Here is a more detailed description of each field:
 

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index 5395331..93f21af 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -114,28 +114,19 @@
 
 Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying-packing algorithm, there is no notion of columns and row.
 
-Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed.
+Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed:
 
-For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
-
-For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
-
-For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
-
-For attributes, there is no packing so everything occupies a separate vec4 variable.
+* For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
+* For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
+* For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
+* For attributes, there is no packing so everything occupies a separate vec4 variable.
 
 Here is a more detailed description of each field:
 
-component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure.
-
-component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end.
-
-entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
-
-src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
-
-dst_stride: always 16???
-
-offset: the base offset of the symbol, relative to the start of the structure or to the start of the array if this symbol is not part of a structure. Note that this must be a multiple of the alignment of this symbol.
-
-index: -1 if not part of a structure, otherwise the index of the parent structure in the symbol table.
+* component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure.
+* component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end.
+* entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
+* src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
+* dst_stride: always 16???
+* offset: the base offset of the symbol, relative to the start of the structure or to the start of the array if this symbol is not part of a structure. Note that this must be a multiple of the alignment of this symbol.
+* index: -1 if not part of a structure, otherwise the index of the parent structure in the symbol table.

diff --git a/MBS+File+Format.mdwn b/MBS+File+Format.mdwn
index e54a9a2..5395331 100644
--- a/MBS+File+Format.mdwn
+++ b/MBS+File+Format.mdwn
@@ -108,3 +108,34 @@
     	frag  fragment;
         vert  vertex;
     }
+
+
+## Symbols & Symbol layout
+
+Each entry in the symbol tables represents an array of a single base type (vec4, struct, etc.) or just a single base type. Note that, conceptually, variables are laid out on one big array; unlike the ESSL varying-packing algorithm, there is no notion of columns and row.
+
+Each storage type (uniform, varying, attribute) has a different alignment and stride (which are usually the same) for each type, due to restrictions on how the storage type in question can be indirectly addressed.
+
+For varyings, float has an alignment and stride of 1, vec2 has an alignment of 2, and vec3 and vec4 have an alignment of 4 (mat2 has the same alignment as vec2, mat3 the same as vec3, etc.).
+
+For uniforms in the PP, the alignment for float 1 is one, and the alignment for everything else is 4, except for mat2 where the alignment is 2 (but the stride is 4...).
+
+For uniforms in the GP, the stride for everything is 4, but floats, vec2's, and vec3's are fine as long as they fit within a vec4... basically it's the original ESSL packing rules.
+
+For attributes, there is no packing so everything occupies a separate vec4 variable.
+
+Here is a more detailed description of each field:
+
+component_count: for base types, the number of components (for example, vec3 has a three components). For structures, the number of elements in the structure.
+
+component_size: The total size of each element in the array, or the total size of the symbol if not in an array. For structures, this includes any padding at the end.
+
+entry_count: The number of elements in the array. 0 indicates that this symbol is not an array.
+
+src_stride: The stride needed when accessing elements of the array indirectly. When getting the offset of an element in an array, the formula is offset + index * src_stride, where 0 <= index < entry_count. Note that src_stride must be a multiple of the alignment of the symbol type.
+
+dst_stride: always 16???
+
+offset: the base offset of the symbol, relative to the start of the structure or to the start of the array if this symbol is not part of a structure. Note that this must be a multiple of the alignment of this symbol.
+
+index: -1 if not part of a structure, otherwise the index of the parent structure in the symbol table.

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 9a4e30a..c686f53 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -169,13 +169,13 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     control[9], Uniform/Temporary Fetch
 
-    i iiii iiii iiii iiio rrrr rr00 0000 a000 0000 00ss
+    i iiii iiii iiii iiio rrrr rr00 0000 aa00 0000 00ss
 
     i - source index
-    a - alignment (similar to varying load, but only 1 or 4-aligned)
-        If set to 1, load four floats at once
-        If set to 0, only load one float
-        Note: index 1 when a is 1 = indices 4, 5, 6, and 7 when a is 0
+    a - alignment
+        00 - 1-aligned
+        01 - 2-aligned
+        10 - 4-aligned
     s - source
         00 - uniform
         11 - temporary
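
Note on the alignment field above: a small sketch of which float slots a fetch would touch for each alignment value. The 4-aligned case follows the removed note ("index 1 when a is 1 = indices 4, 5, 6, and 7 when a is 0"). Treating the 2-aligned case the same way (index i covers floats 2i and 2i+1) is an extrapolation, and the names `ALIGNMENT_WIDTH` and `fetched_floats` are hypothetical.

```python
# Alignment field value -> number of floats covered by one index.
ALIGNMENT_WIDTH = {0b00: 1, 0b01: 2, 0b10: 4}


def fetched_floats(index: int, alignment: int) -> list:
    """Return the 1-aligned float indices covered by `index` at `alignment`.

    Assumption: an N-aligned index i covers floats N*i .. N*i + N - 1.
    """
    width = ALIGNMENT_WIDTH[alignment]
    first = index * width
    return list(range(first, first + width))
```

With this reading, `fetched_floats(1, 0b10)` gives `[4, 5, 6, 7]`, matching the removed note.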

add integer move instruction
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 48625e8..383bdb0 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -174,7 +174,7 @@ The scalar multiply and add ALU's have the same format as well.
     14 - fmul
     28 - fmin
     2C - fmax
-    30 - mov
+    30 - fmov
     36 - ffloor
     37 - fceil
     3C - fdot3
@@ -184,6 +184,7 @@ The scalar multiply and add ALU's have the same format as well.
     40 - iadd
     46 - isub
     58 - imul
+    7B - imov
     80 - feq
     81 - fne
     82 - flt (less than)
@@ -237,4 +238,5 @@ The load/store word consists of the standard 8-bit tag, followed by two 60-bit i
 The mask and swizzle acts like a move instruction. For example, a load with a mask of xzw and a swizzle of xywz means "take the x, w, and z components of the input and move them into the x, z, and w components of the register respectively."
 
 TODO: indirect access
+
 TODO: uniform buffers

add initial load/store word description
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 13f34a2..48625e8 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -213,3 +213,28 @@ Pseudocode for how atan/atan2 is implemented:
     float result = fatan_pt2(temp1.x, temp1.z * temp1.w);
 
 To do atan instead of atan2, replace y with 1.0. asin and acos are implemented just like in the Mali 200 PP.
+
+## Load/store words
+
+The load/store word consists of the standard 8-bit tag, followed by two 60-bit instructions whose format is described below. Each instruction can load or store up to 128 bits at once.
+
+    0-7: opcode
+        03 - noop (no load/store)
+        94 - load attribute (32-bit)
+        95 - load attribute (16-bit)
+        98 - load varying (32-bit)
+        99 - load varying (16-bit)
+        AC - load uniform (16-bit)
+        B0 - load uniform (32-bit)
+        D4 - store varying (32-bit)
+        D5 - store varying (16-bit)
+    8-12: source/destination register
+    13-16: mask
+    17-24: swizzle
+    25-50: unknown
+    51-59: load/store address
+
+The mask and swizzle acts like a move instruction. For example, a load with a mask of xzw and a swizzle of xywz means "take the x, w, and z components of the input and move them into the x, z, and w components of the register respectively."
+
+TODO: indirect access
+TODO: uniform buffers
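
Note on the load/store layout above: a minimal sketch of splitting one 60-bit load/store instruction into the listed fields. It assumes bit 0 is the least significant bit of the instruction and that the opcode values are hexadecimal. How the two 60-bit instructions are packed after the 8-bit tag is not covered here. The helper names `bits` and `decode_load_store` are hypothetical.

```python
LOAD_STORE_OPCODES = {
    0x03: "noop",
    0x94: "load attribute (32-bit)",
    0x95: "load attribute (16-bit)",
    0x98: "load varying (32-bit)",
    0x99: "load varying (16-bit)",
    0xAC: "load uniform (16-bit)",
    0xB0: "load uniform (32-bit)",
    0xD4: "store varying (32-bit)",
    0xD5: "store varying (16-bit)",
}


def bits(value: int, low: int, high: int) -> int:
    """Extract bits low..high (inclusive), with bit 0 as the LSB."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_load_store(instr: int) -> dict:
    """Split a 60-bit load/store instruction into its documented fields."""
    opcode = bits(instr, 0, 7)
    return {
        "opcode": opcode,
        "opcode_name": LOAD_STORE_OPCODES.get(opcode, "unknown"),
        "register": bits(instr, 8, 12),
        "mask": bits(instr, 13, 16),
        "swizzle": bits(instr, 17, 24),
        "unknown": bits(instr, 25, 50),
        "address": bits(instr, 51, 59),
    }
```

The mapping of individual mask and swizzle bits to components is left undecoded, since the description above does not give the bit order.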

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index c30c109..13f34a2 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -43,7 +43,7 @@ It seems like the architecture for the T6xx pipeline is based off of the Mali-20
 
 ## Registers
 
-The ALU pipeline can read/write to 32 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.
+The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.
 
 ## Special Registers
 

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index d65725d..c30c109 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -127,7 +127,7 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
     40-47: write mask
         2 bits for each output when 32-bit, 1 bit when 16-bit
 
-When the output size is set to half, the operation is performed on the high and low half-registers at the same time. The low 4 bits of the write mask control what components of the low half-register are written, and the high 4 bits control the high half-register. Normally, the operation is performed on the input 1 low register and input 2 low register to produce the output low register, and on the input 1 high register and input 2 high register to produce the high register. This can be overridden, however, by the "input 1/2 replicate lower/upper half-register" bits which cause the given half-register to be used as an input to both operations at once.
+When the register mode is set to half, the operation is performed on the high and low half-registers at the same time. The low 4 bits of the write mask control what components of the low half-register are written, and the high 4 bits control the high half-register. Normally, the operation is performed on the input 1 low register and input 2 low register to produce the output low register, and on the input 1 high register and input 2 high register to produce the high register. This can be overridden, however, by the "input 1/2 replicate lower/upper half-register" bits which cause the given half-register to be used as an input to both operations at once.
 
 ### Scalar ALU word format
 

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 75a6083..d65725d 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -92,7 +92,7 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
         2 - full (32-bit)
     10: input 1 abs
     11: input 1 neg
-    if output size is half:
+    if input/output mode is half:
         12: input 1 replicate lower half-register
         13: input 1 replicate upper half-register
     otherwise:
@@ -107,7 +107,7 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
         25-27: inline const 8-10
         28-35: inline const 0-7
     otherwise:
-        if output size is half:
+        if input/output mode is half:
             25: input 2 replicate lower half-register
             26: input 2 replicate upper half-register
         otherwise:

add output size override
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 4123825..75a6083 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -87,7 +87,7 @@ The register 2 inline constant is a way to store a 16-bit float directly in the
 The vector multiply, add, and LUT ALU's share the same instruction format.
 
     0-7: opcode
-    8-9: output size
+    8-9: input/output mode
         1 - half (16-bit)
         2 - full (32-bit)
     10: input 1 abs
@@ -114,7 +114,11 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
             25: input 2 half-register selection (high or low)
             26: unused
         28-35: input 2 swizzle
-    36-37: unknown
+    36-37: output size override
+        0 - half, write to lower half
+        1 - half, write to upper half
+        2 - normal
+        Note: I've only seen this for comparison instructions that compare two full floats or ints and need to return a half float
     38-39: output modifier
         0 - none
         1 - clamp positive

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 030f4e0..4123825 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -37,9 +37,13 @@ It appears that the shader binaries are the same between the T600 and T650 targe
 
 The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
 
+## Comparison to Mali-200 PP
+
+It seems like the architecture for the T6xx pipeline is based off of the Mali-200 PP pipeline. Both are barreled with a relatively deep pipeline that can execute a number of threads, possibly with different shaders/uniforms/other state at the same time. The main difference is that the single, large pipeline is broken down into 3 smaller pipelines. This simplifies the logic: the ALU pipelines don't need to know how to access memory, the Load/Store and texture pipelines don't need to access work registers or uniform registers, and only the texture pipeline needs to have logic for synchronizing threads in a group and exchanging values for computing derivatives. There are more work registers, and now there's a uniform register file, in addition to normal uniform buffers accessed through the load/store pipeline, and perhaps a Radeon-like register sharing mechanism (the compiler now reports the number of work registers and uniform registers used).
+
 ## Registers
 
-The processor has up to 24 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's).
+The ALU pipeline can read/write to 32 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.
 
 ## Special Registers
 

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 421af6c..030f4e0 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -26,7 +26,7 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 ## Instruction format
 
-It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:
+It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instructions. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Each instruction is always started only after the previous instruction has fully completed, and like on the Mali 200 PP the pipeline is barreled so a number of threads, potentially with different shaders, are running at once (see next-instruction-type patent). Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:
 
     3 - Texture (4 words)
     5 - Load/Store (4 words)
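
Note on parsing the instruction stream: a short sketch of reading the tag from the first word of an instruction. The lowest 4 bits give the type of the current instruction, and (per the page text quoted in a later change) bits 4-7 give the type of the next one. Only the two type values visible in this excerpt are named; the function name `decode_tag` and the table `KNOWN_TYPES` are hypothetical.

```python
# Only the type values visible in this excerpt; the page lists more.
KNOWN_TYPES = {3: "texture", 5: "load/store"}


def decode_tag(first_word: int):
    """Return (current_type, next_type) from the first word of an instruction.

    current_type is bits 0-3 and next_type is bits 4-7.
    """
    current_type = first_word & 0xF
    next_type = (first_word >> 4) & 0xF
    return current_type, next_type
```

For instance, a first word whose low byte is 0x35 would decode as a load/store instruction (5) followed by a texture instruction (3).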

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 3e0f9c6..421af6c 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -47,6 +47,7 @@ The processor has up to 24 128-bit "work registers," which can be divided into 4
     r26 - inline constant
     r27 - load/store offset when used as output register
     r28-r29 - texture pipeline results
+    r31.w - conditional select input when written to in scalar add ALU
 
 r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.
 
@@ -166,6 +167,7 @@ The scalar multiply and add ALU's have the same format as well.
     28 - fmin
     2C - fmax
     30 - mov
+    36 - ffloor
     37 - fceil
     3C - fdot3
     3D - fdot3r
@@ -183,6 +185,7 @@ The scalar multiply and add ALU's have the same format as well.
     A1 - ine
     A4 - ilt
     A5 - ile
+    C5 - csel (conditional select)
     B8 - i2f
     E8 - fatan_pt2
     F0 - frcp (reciprocal)

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 3aa0fc0..3e0f9c6 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -186,6 +186,7 @@ The scalar multiply and add ALU's have the same format as well.
     B8 - i2f
     E8 - fatan_pt2
     F0 - frcp (reciprocal)
+    F2 - frsqrt (inverse square root, 1/sqrt(x))
     F3 - fsqrt (square root)
     F4 - fexp2 (2^x)
     F5 - flog2

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index d3b77e5..3aa0fc0 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -43,10 +43,10 @@ The processor has up to 24 128-bit "work registers," which can be divided into 4
 
 ## Special Registers
 
-r24 - can mean "unused" for 1-src instructions, or a pipeline register
-r26 - inline constant
-r27 - load/store offset when used as output register
-r28-r29 - texture pipeline results
+    r24 - can mean "unused" for 1-src instructions, or a pipeline register
+    r26 - inline constant
+    r27 - load/store offset when used as output register
+    r28-r29 - texture pipeline results
 
 r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.
 

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 60ae2e4..d3b77e5 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -46,6 +46,7 @@ The processor has up to 24 128-bit "work registers," which can be divided into 4
 r24 - can mean "unused" for 1-src instructions, or a pipeline register
 r26 - inline constant
 r27 - load/store offset when used as output register
+r28-r29 - texture pipeline results
 
 r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.
 

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 818ae80..60ae2e4 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -39,7 +39,15 @@ The next 4 bits (bits 4-7) store the type of the next instruction (presumably fo
 
 ## Registers
 
-The processor has XXX (todo) 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's).
+The processor has up to 24 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's).
+
+## Special Registers
+
+r24 - can mean "unused" for 1-src instructions, or a pipeline register
+r26 - inline constant
+r27 - load/store offset when used as output register
+
+r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.
 
 ## ALU Words
 
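
The work/uniform split described above can be sketched as a small classifier. This is only an illustration of the stated rule (with N uniform registers, r0 to r(23-N) are work registers and r(24-N) to r23 are uniform registers); the function name and the returned strings are made up for this example, and only the special registers listed in this change are included.

```python
# Hypothetical helper: classify a register index given N uniform registers.
# Names and return strings are illustrative, not taken from any real tool.
SPECIAL_REGISTERS = {
    24: "unused / pipeline register",
    26: "inline constant",
    27: "load/store offset",
}

def classify_register(index, num_uniform):
    """Return a description of register r<index> when num_uniform
    of the 24 general registers are devoted to uniforms."""
    if not 0 <= num_uniform <= 24:
        raise ValueError("num_uniform must be in 0..24")
    if 0 <= index <= 23:
        # r0 .. r(23-N) are work registers, r(24-N) .. r23 are uniforms.
        return "work" if index <= 23 - num_uniform else "uniform"
    if index in SPECIAL_REGISTERS:
        return SPECIAL_REGISTERS[index]
    return "unknown"
```

For example, with 8 uniform registers, r15 is the last work register and r16 is the first uniform register.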

add integer comparisons
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 6c50b25..818ae80 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -170,6 +170,10 @@ The scalar multiply and add ALU's have the same format as well.
     82 - flt (less than)
     83 - fle (less than or equal)
     99 - f2i
+    A0 - ieq
+    A1 - ine
+    A4 - ilt
+    A5 - ile
     B8 - i2f
     E8 - fatan_pt2
     F0 - frcp (reciprocal)

add atan
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 19adcda..6c50b25 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -171,11 +171,20 @@ The scalar multiply and add ALU's have the same format as well.
     83 - fle (less than or equal)
     99 - f2i
     B8 - i2f
+    E8 - fatan_pt2
     F0 - frcp (reciprocal)
     F3 - fsqrt (square root)
     F4 - fexp2 (2^x)
     F5 - flog2
     F6 - fsin
     F7 - fcos
+    F9 - fatan_pt1
         Note: for sin and cos, the input needs to be divided by pi
 
+
+Pseudocode for how atan/atan2 is implemented:
+
+    vec4 temp1.xzw = fatan_pt1(x, y); //Note: a vec4 temporary is required, although the write mask is xzw so the y component isn't affected
+    float result = fatan_pt2(temp1.x, temp1.z * temp1.w);
+
+To do atan instead of atan2, replace y with 1.0. asin and acos are implemented just like in the Mali 200 PP.

Adjust Samsung gpl violation status
diff --git a/Devices.mdwn b/Devices.mdwn
index 2b612d9..26b20ee 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -33,15 +33,15 @@ The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be
 
 According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competetively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
 
-# Exynos 4 (**GPL VIOLATOR**)
+# Exynos 4
 
 These SoCs are the best performing Mali-400 devices out there. They are proper speed-daemons. The exynos 42xx series has a dual A9, whereas the exynos 44xx series has a quad A9. All come with a Mali-400MP4.
 
-All exynos 4 devices come with binary only u-boot. This means that Samsung, and its device makers, are violating the GPL.
+Some Exynos 4 devices might still come with a binary-only u-boot. This means that Samsung, or some device makers, might still be violating the GPL.
 
-## Origen Board (**GPL VIOLATOR**)
+## Origen Board (**GPL VIOLATOR ??? **)
 
-## ODROIDs (**GPL VIOLATOR**)
+## ODROIDs
 
 The Odroids are small developer boards with many possible connections. The Odroid-x2/u2 is hyperfast, as it can clock the 4 A9s to 2GHz, and the Mali-400MP4 can clock up to 640MHz. This makes for a nice high-end benchmarker, and a good comparison for the comparatively meek A10.
 
@@ -49,9 +49,9 @@ Hardkernel tries to portray itself as open source friendly, but they have a lot
 
 Here is [[some_information|OdroidSetup]] on how to set up your own SD card with a custom built kernel and with mali binaries (which we need for reverse engineering).
 
-## Samsung Galaxy S II (**GPL VIOLATOR**)
+## Samsung Galaxy S II (**GPL VIOLATOR ??? **)
 
-## Samsung Galaxy S III (**GPL VIOLATOR**)
+## Samsung Galaxy S III (**GPL VIOLATOR ??? **)
 
 # Exynos 5
 

moar opcodes
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 40fc68b..19adcda 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -158,10 +158,19 @@ The scalar multiply and add ALU's have the same format as well.
     2C - fmax
     30 - mov
     37 - fceil
+    3C - fdot3
+    3D - fdot3r
+    3E - fdot4
+    3F - freduce
+    40 - iadd
+    46 - isub
+    58 - imul
     80 - feq
     81 - fne
     82 - flt (less than)
     83 - fle (less than or equal)
+    99 - f2i
+    B8 - i2f
     F0 - frcp (reciprocal)
     F3 - fsqrt (square root)
     F4 - fexp2 (2^x)

opcodes
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 0cf6fe6..40fc68b 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -104,6 +104,7 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
     38-39: output modifier
         0 - none
         1 - clamp positive
+        2 - output integer
         3 - saturate
     40-47: write mask
         2 bits for each output when 32-bit, 1 bit when 16-bit
@@ -140,6 +141,7 @@ The scalar multiply and add ALU's have the same format as well.
     26-27: output modifier
         0 - none
         1 - clamp positive
+        2 - output integer
         3 - saturate
     if output size = full
         29: unused
@@ -147,3 +149,24 @@ The scalar multiply and add ALU's have the same format as well.
     otherwise:
         29-30: output component
         31: output half-register selection (high or low)
+
+### Opcodes
+
+    10 - fadd
+    14 - fmul
+    28 - fmin
+    2C - fmax
+    30 - mov
+    37 - fceil
+    80 - feq
+    81 - fne
+    82 - flt (less than)
+    83 - fle (less than or equal)
+    F0 - frcp (reciprocal)
+    F3 - fsqrt (square root)
+    F4 - fexp2 (2^x)
+    F5 - flog2
+    F6 - fsin
+    F7 - fcos
+        Note: for sin and cos, the input needs to be divided by pi
+
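
As a rough sketch of how the opcode list above could be used in a disassembler, the table can be turned into a lookup keyed on the low 8 bits of an ALU field. This assumes the opcodes are hexadecimal (entries such as 2C and F0 suggest so) and contains only the entries added in this change, so it is incomplete; `OPCODES` and `mnemonic` are names invented for the example.

```python
# Opcode values copied from the list above, read as hexadecimal.
# Incomplete by construction: anything else is reported as unknown.
OPCODES = {
    0x10: "fadd", 0x14: "fmul", 0x28: "fmin", 0x2C: "fmax",
    0x30: "mov", 0x37: "fceil",
    0x80: "feq", 0x81: "fne", 0x82: "flt", 0x83: "fle",
    0xF0: "frcp", 0xF3: "fsqrt", 0xF4: "fexp2", 0xF5: "flog2",
    0xF6: "fsin", 0xF7: "fcos",
}

def mnemonic(alu_field):
    """Look up the mnemonic for an ALU field; the opcode is bits 0-7."""
    opcode = alu_field & 0xFF
    return OPCODES.get(opcode, "unknown_%02X" % opcode)
```

Returning a placeholder rather than raising keeps a disassembly pass going when it meets an opcode that has not been identified yet.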

half-register stuff
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 9b5cad2..0cf6fe6 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -78,7 +78,13 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
         2 - full (32-bit)
     10: input 1 abs
     11: input 1 neg
-    12-14: unknown
+    if output size is half:
+        12: input 1 replicate lower half-register
+        13: input 1 replicate upper half-register
+    otherwise:
+        12: input 1 half-register selection (high or low)
+        13: unused
+    14: input 1 half-register (when output is a full register)
     15-22: input 1 swizzle
     23: input 2 abs
     24: input 2 neg
@@ -87,7 +93,12 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
         25-27: inline const 8-10
         28-35: inline const 0-7
     otherwise:
-        25-27: unknown
+        if output size is half:
+            25: input 2 replicate lower half-register
+            26: input 2 replicate upper half-register
+        otherwise:
+            25: input 2 half-register selection (high or low)
+            26: unused
         28-35: input 2 swizzle
     36-37: unknown
     38-39: output modifier
@@ -97,6 +108,8 @@ The vector multiply, add, and LUT ALU's share the same instruction format.
     40-47: write mask
         2 bits for each output when 32-bit, 1 bit when 16-bit
 
+When the output size is set to half, the operation is performed on the high and low half-registers at the same time. The low 4 bits of the write mask control which components of the low half-register are written, and the high 4 bits control the high half-register. Normally, the operation is performed on the input 1 low half-register and input 2 low half-register to produce the output low half-register, and on the input 1 high half-register and input 2 high half-register to produce the output high half-register. This can be overridden, however, by the "input 1/2 replicate lower/upper half-register" bits, which cause the given half-register to be used as an input to both operations at once.
+
 ### Scalar ALU word format
 
 The scalar multiply and add ALU's have the same format as well.
@@ -104,9 +117,13 @@ The scalar multiply and add ALU's have the same format as well.
     0-7: opcode
     8: input 1 abs
     9: input 1 negate
-    10: unknown
-    11-12: input 1 component
-    13: unknown
+    10: input 1 size (0 = half, 1 = full)
+    if input 1 size = full
+        11: unused
+        12-13: input 1 component
+    otherwise:
+        11-12: input 1 component
+        13: input 1 half-register selection (high or low)
     if "input 2 inline constant" set:
         14-24: input 2 inline constant low 11 bits
         14-15: inline const 9-10
@@ -124,5 +141,9 @@ The scalar multiply and add ALU's have the same format as well.
         0 - none
         1 - clamp positive
         3 - saturate
-    30-31: output component
-
+    if output size = full
+        29: unused
+        30-31: output component
+    otherwise:
+        29-30: output component
+        31: output half-register selection (high or low)
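
The input 1 part of the scalar ALU layout above could be decoded along these lines. This is a sketch under the assumption that bit 0 is the least significant bit of the 32-bit field; the function name and dictionary keys are invented here, and the half-register selection is returned as a raw bit because the notes do not say which value means high and which means low.

```python
def decode_scalar_input1(field):
    """Decode opcode and input 1 bits of a scalar ALU field (bit 0 = LSB)."""
    info = {
        "opcode": field & 0xFF,             # bits 0-7
        "abs": bool((field >> 8) & 1),      # bit 8
        "negate": bool((field >> 9) & 1),   # bit 9
        "full": bool((field >> 10) & 1),    # bit 10: 0 = half, 1 = full
    }
    if info["full"]:
        # Full size: bit 11 unused, component in bits 12-13.
        info["component"] = (field >> 12) & 0x3
        info["half_select"] = None
    else:
        # Half size: component in bits 11-12, half-register select in bit 13.
        info["component"] = (field >> 11) & 0x3
        info["half_select"] = (field >> 13) & 1
    return info
```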

inline constants
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index ea965af..9b5cad2 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -62,10 +62,12 @@ After the control word comes a series of 16-bit words, one for each enabled ALU
 ### Register word format
 
     0-4: input 1 register
-    5-9: input 2 register
+    5-9: input 2 register / inline constant
     10-14: output register
     15: input 2 inline constant
 
+The input 2 inline constant is a way to store a 16-bit float directly in the instruction. The upper 5 bits (15-11) are stored where the input 2 register would normally go, and the lower 11 bits (10-0) are stored in the ALU field as defined below. The constant is replicated across all 4 components of the input. This is much more compact than the normal embedded constants, but much more limited as well.
+
 ### Vector ALU word format
 
 The vector multiply, add, and LUT ALU's share the same instruction format.
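
To make the inline constant encoding concrete, here is a minimal sketch that reassembles the 16-bit float for a vector ALU. It assumes bit 0 is the least significant bit of each word, and that the low 11 bits sit in the vector ALU field as bits 25-27 (constant bits 8-10) and bits 28-35 (constant bits 0-7), as the vector ALU word format on the page describes. The function names are made up for this example.

```python
import struct

def decode_register_word(word):
    """Split the 16-bit register word into its documented fields."""
    return {
        "input1": word & 0x1F,                    # bits 0-4
        "input2": (word >> 5) & 0x1F,             # bits 5-9
        "output": (word >> 10) & 0x1F,            # bits 10-14
        "input2_inline": bool((word >> 15) & 1),  # bit 15
    }

def vector_inline_constant(register_word, alu_field):
    """Rebuild the half-float inline constant for a vector ALU field."""
    regs = decode_register_word(register_word)
    if not regs["input2_inline"]:
        raise ValueError("input 2 is a register, not an inline constant")
    upper5 = regs["input2"]                 # constant bits 11-15
    const_8_10 = (alu_field >> 25) & 0x7    # constant bits 8-10
    const_0_7 = (alu_field >> 28) & 0xFF    # constant bits 0-7
    half_bits = (upper5 << 11) | (const_8_10 << 8) | const_0_7
    # Reinterpret the 16 bits as an IEEE half-precision float.
    return struct.unpack("<e", struct.pack("<H", half_bits))[0]
```

With this layout, the half-float 1.0 (0x3C00) would be stored as 7 in the input 2 register field and 4 in bits 25-27 of the ALU field.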

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index f725153..ea965af 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -68,7 +68,7 @@ After the control word comes a series of 16-bit words, one for each enabled ALU
 
 ### Vector ALU word format
 
-The vector multiply and add ALU's share the same instruction format.
+The vector multiply, add, and LUT ALU's share the same instruction format.
 
     0-7: opcode
     8-9: output size
@@ -97,6 +97,8 @@ The vector multiply and add ALU's share the same instruction format.
 
 ### Scalar ALU word format
 
+The scalar multiply and add ALU's have the same format as well.
+
     0-7: opcode
     8: input 1 abs
     9: input 1 negate

Initial ALU field descriptions
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index ae7e7fd..f725153 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -37,6 +37,10 @@ It appears that the shader binaries are the same between the T600 and T650 targe
 
 The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
 
+## Registers
+
+The processor has XXX (todo) 128-bit "work registers," which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's).
+
 ## ALU Words
 
 The first (32-bit) word is a control word which, in addition to the usual 8-bit tag in bits 0-7, contains a bitfield describing which ALU's in the pipeline are in use. There are 5 ALU's, in addition to a framebuffer write (?) and branch unit.
@@ -53,4 +57,68 @@ The first (32-bit) word is a control word which, in addition to the usual 8-bit
 
 It's not clear why only every other bit is used for the ALU's (fp64?).
 
-After the control word comes a series of 16-bit words, one for each enabled ALU (up to 5). After that come the actual fields for each ALU/unit, whose sizes are noted in the table above. The instruction word is then padded with 0's to make sure it is a multiple of 4 words. Finally, embedded constants may be inserted, which consist of 4 32-bit numbers, interpreted as 4 IEEE 32-bit floats if the input is a float.
+After the control word comes a series of 16-bit words, one for each enabled ALU (up to 5) which control the input and output registers for each ALU. After that come the actual fields for each ALU/unit, whose sizes are noted in the table above. The instruction word is then padded with 0's to make sure it is a multiple of 4 words. Finally, embedded constants may be inserted, which consist of 4 32-bit numbers, interpreted as 4 IEEE 32-bit floats if the input is a float.
+
+### Register word format
+
+    0-4: input 1 register
+    5-9: input 2 register
+    10-14: output register
+    15: input 2 inline constant
+
+### Vector ALU word format
+
+The vector multiply and add ALU's share the same instruction format.
+
+    0-7: opcode
+    8-9: output size
+        1 - half (16-bit)
+        2 - full (32-bit)
+    10: input 1 abs
+    11: input 1 neg
+    12-14: unknown
+    15-22: input 1 swizzle
+    23: input 2 abs
+    24: input 2 neg
+    if "input 2 inline constant" set:
+        25-35: input 2 inline constant low 11 bits
+        25-27: inline const 8-10
+        28-35: inline const 0-7
+    otherwise:
+        25-27: unknown
+        28-35: input 2 swizzle
+    36-37: unknown
+    38-39: output modifier
+        0 - none
+        1 - clamp positive
+        3 - saturate
+    40-47: write mask
+        2 bits for each output when 32-bit, 1 bit when 16-bit
+
+### Scalar ALU word format
+
+    0-7: opcode
+    8: input 1 abs
+    9: input 1 negate
+    10: unknown
+    11-12: input 1 component
+    13: unknown
+    if "input 2 inline constant" set:
+        14-24: input 2 inline constant low 11 bits
+        14-15: inline const 9-10
+        16: inline const 8
+        17-19: inline const 5-7
+        20-24: inline const 0-4
+    otherwise:
+        14: input 2 abs
+        15: input 2 negate
+        16: unknown
+        17-18: input 2 component
+        19-24: unknown
+    25: unknown
+    26-27: output modifier
+        0 - none
+        1 - clamp positive
+        3 - saturate
+    30-31: output component
+

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index afd7ca3..ae7e7fd 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -22,6 +22,8 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
 
+[Embedded opcode within an intermediate value passed between instructions](https://www.google.com/patents/US20120204006?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CHIQ6AEwCTgK)
+
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 65bf9e3..afd7ca3 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -22,8 +22,6 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
 
-[Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
-
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 464a7b7..65bf9e3 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -24,8 +24,6 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
 
-[Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
-
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 75adcb0..464a7b7 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -22,11 +22,10 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
 
-<<<<<<< HEAD
-=======
 [Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
 
->>>>>>> 84f748d804d1a65da85ccaf28470d9604619fe12
+[Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
+
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index afd7ca3..75adcb0 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -22,6 +22,11 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
 
+<<<<<<< HEAD
+=======
+[Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CFYQ6AEwBTgK)
+
+>>>>>>> 84f748d804d1a65da85ccaf28470d9604619fe12
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index d4cf48e..afd7ca3 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -12,6 +12,8 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Floating-point vector normalisation](https://www.google.com/patents/WO2012038708A1?cl=en&hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CEIQ6AEwAg)
 
+[Vector floating point argument reduction](https://www.google.com/patents/US20120078987?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CF0Q6AEwBjgK)
+
 [Next-instruction-type field](https://www.google.com/patents/EP2585906A1?cl=en&hl=en&sa=X&ei=SZn0UdHEOrj54AOOsYHIBg&ved=0CDsQ6AEwAQ)
 
 [Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=lpP0UcbvDvOt4AP_7IDgAg&ved=0CD0Q6AEwAQ)

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 040ecc7..d4cf48e 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -16,6 +16,10 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 [Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=lpP0UcbvDvOt4AP_7IDgAg&ved=0CD0Q6AEwAQ)
 
+[Number format pre-conversion instructions](https://www.google.com/patents/US20120215822?hl=en&sa=X&ei=1pn0UYb8Orix4APMnICQBg&ved=0CFIQ6AEwBA)
+
+[Graphics processing](https://www.google.com/patents/US20120223946?hl=en&sa=X&ei=p5r0UbSJHtXh4AOqo4HYCw&ved=0CEgQ6AEwAzgK)
+
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index ee3cdaa..040ecc7 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -4,6 +4,18 @@
 
 ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can excecute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
 
+## Patents
+
+[Data processing apparatus and method for processing a received workload in order to generate result data](https://www.google.com/patents/US20120304194?hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CDsQ6AEwAQ)
+
+[Processing order with integer inputs and floating point inputs](https://www.google.com/patents/US20120299935?hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CDQQ6AEwAA)
+
+[Floating-point vector normalisation](https://www.google.com/patents/WO2012038708A1?cl=en&hl=en&sa=X&ei=lpT0UZP-MPjk4AOZu4HwDg&ved=0CEIQ6AEwAg)
+
+[Next-instruction-type field](https://www.google.com/patents/EP2585906A1?cl=en&hl=en&sa=X&ei=SZn0UdHEOrj54AOOsYHIBg&ved=0CDsQ6AEwAQ)
+
+[Generating and resolving pixel values within a graphics processing pipeline](https://www.google.com/patents/US8059144?hl=en&sa=X&ei=lpP0UcbvDvOt4AP_7IDgAg&ved=0CD0Q6AEwAQ)
+
 ## Instruction format
 
 It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from lowest 4 bits of the first word:

diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 1228b7b..ee3cdaa 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -2,7 +2,7 @@
 
 ## Publicly available information
 
-ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines with different bitstreams: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can excecute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
+ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
 
 ## Instruction format
 

add some ALU description
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index ed3be54..1228b7b 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -16,3 +16,21 @@ It appears that the shader binaries are the same between the T600 and T650 targe
     B - ALU (16 words)
 
 The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
+
+## ALU Words
+
+The first (32-bit) word is a control word which, in addition to the usual 8-bit tag in bits 0-7, contains a bitfield describing which ALU's in the pipeline are in use. There are 5 ALU's, in addition to a framebuffer write (?) and branch unit.
+
+    0-3: instruction word type (0x8-0xB)
+    4-7: next instruction word type
+    17: vector multiply ALU (48 bits)
+    19: scalar add ALU (32 bits)
+    21: vector add ALU (48 bits)
+    23: scalar multiply ALU (32 bits)
+    25: LUT / multiply ALU 2 (48 bits)
+    26: output write/discard? (16 bits)
+    27: branch (48 bits)
+
+It's not clear why only every other bit is used for the ALU's (fp64?).
+
+After the control word comes a series of 16-bit words, one for each enabled ALU (up to 5). After that come the actual fields for each ALU/unit, whose sizes are noted in the table above. The instruction word is then padded with 0's to make sure it is a multiple of 4 words. Finally, embedded constants may be inserted, which consist of 4 32-bit numbers, interpreted as 4 IEEE 32-bit floats if the input is a float.
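
The layout described above can be sketched as a size calculation: one 32-bit control word, one 16-bit register word per enabled ALU, the per-unit fields, then zero padding to a multiple of 4 words. This assumes bit 0 is the least significant bit of the control word and that the output write and branch units do not get a register word (the text only mentions one per ALU); embedded constants are not counted. `ALU_UNITS` and `alu_layout` are names invented for the example.

```python
# (control bit, unit name, field size in bits, has a 16-bit register word)
ALU_UNITS = [
    (17, "vector multiply", 48, True),
    (19, "scalar add", 32, True),
    (21, "vector add", 48, True),
    (23, "scalar multiply", 32, True),
    (25, "LUT / multiply 2", 48, True),
    (26, "output write/discard", 16, False),
    (27, "branch", 48, False),
]

def alu_layout(control_word):
    """Return (enabled unit names, unpadded bits, padded size in words)."""
    enabled = [u for u in ALU_UNITS if (control_word >> u[0]) & 1]
    bits = 32                                         # control word
    bits += 16 * sum(1 for u in enabled if u[3])      # register words
    bits += sum(u[2] for u in enabled)                # unit fields
    padded_bits = -(-bits // 128) * 128               # round up to 128 bits
    return [u[1] for u in enabled], bits, padded_bits // 32
```

If the assumptions hold, the padded word count should agree with the size implied by the instruction word type (0x8 to 0xB) whenever no embedded constants follow.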

Confirmed type 0xB
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index bd0d49d..ed3be54 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -13,6 +13,6 @@ It appears that the shader binaries are the same between the T600 and T650 targe
     8 - ALU (4 words)
     9 - ALU (8 words)
     A - ALU (12 words)
-    B - ALU (16 words)? (Never seen)
+    B - ALU (16 words)
 
 The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
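
Reading the two 4-bit tags out of the first word might look like the following sketch. It assumes bit 0 is the least significant bit of the first 32-bit word and uses the type values listed on the page (3, 5, and 8 through B); the table and function names are invented for the example.

```python
# Instruction word type -> (pipeline, size in 32-bit words).
INSTRUCTION_WORDS = {
    0x3: ("texture", 4),
    0x5: ("load/store", 4),
    0x8: ("alu", 4),
    0x9: ("alu", 8),
    0xA: ("alu", 12),
    0xB: ("alu", 16),
}

def decode_tag(first_word):
    """Return (pipeline, size in words, next instruction type)."""
    kind = first_word & 0xF              # bits 0-3: this instruction's type
    next_kind = (first_word >> 4) & 0xF  # bits 4-7: next instruction's type
    if kind not in INSTRUCTION_WORDS:
        raise ValueError("unknown instruction word type 0x%X" % kind)
    pipeline, words = INSTRUCTION_WORDS[kind]
    return pipeline, words, next_kind
```

A disassembler could walk a shader by calling this on the first word of each instruction and advancing by the returned word count.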

add explicit LOD
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 7d492d1..9a4e30a 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -151,7 +151,7 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     control[8], Texture fetch
 
-    00 1110 0100 0000 0000 01ss ssss ssss ssot tttt 0000 0b00 0000 cccc ccrr rrrr
+    00 1110 0100 0000 0000 01ss ssss ssss ssot tttt 0000 0lb0 0000 cccc ccrr rrrr
 
     The coordinates for the texture fetch are always the output of the varying load.
 
@@ -161,8 +161,11 @@ There also exists various "pipeline registers" (four of them listed above) which
     t - sampler type
         00000 - sampler2D
         11111 - samplerCube
-    b - lod bias enable
-    r - lod bias register
+    l - lod register enable
+    b - explicit lod
+        If true, the LOD register specifies the actual LOD.
+        If false, the LOD register specifies an offset applied to the normally-calculated LOD.
+    r - lod register
 
     control[9], Uniform/Temporary Fetch
 

add t6xx prefetch field
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 8d95737..bd0d49d 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -6,7 +6,7 @@ ARM has been a lot more open this time with the architecture behind the T6xx. Fo
 
 ## Instruction format
 
-It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the lowest 4 bits of the first word:
+It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the type from the lowest 4 bits of the first word:
 
     3 - Texture (4 words)
     5 - Load/Store (4 words)
@@ -14,3 +14,5 @@ It appears that the shader binaries are the same between the T600 and T650 targe
     9 - ALU (8 words)
     A - ALU (12 words)
     B - ALU (16 words)? (Never seen)
+
+The next 4 bits (bits 4-7) store the type of the next instruction (presumably for prefetch purposes), except if either the instruction is the last instruction or it's the second-to-last and the last instruction is an ALU instruction, in which case the value of 1 is used.
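
The type and prefetch fields described above are enough to walk a T6xx instruction stream. The following Python sketch assumes the shader is already available as a list of 32-bit words; `walk_instructions` and `WORDS_BY_TYPE` are hypothetical names, and the size for type B is the unconfirmed value from the table above.

```python
# Instruction sizes in 32-bit words, keyed by the low 4 bits of the first word.
WORDS_BY_TYPE = {
    0x3: 4,   # Texture
    0x5: 4,   # Load/Store
    0x8: 4,   # ALU
    0x9: 8,   # ALU
    0xA: 12,  # ALU
    0xB: 16,  # ALU (never seen, size unconfirmed)
}


def walk_instructions(words):
    """Return (word_index, type, next_type, size_in_words) per instruction."""
    result = []
    index = 0
    while index < len(words):
        first = words[index]
        kind = first & 0xF              # bits 0-3: type of this instruction
        next_kind = (first >> 4) & 0xF  # bits 4-7: type of the next instruction
        size = WORDS_BY_TYPE.get(kind)
        if size is None:
            raise ValueError(f"unknown instruction type {kind:#x} at word {index}")
        result.append((index, kind, next_kind, size))
        index += size
    return result
```

In a two-instruction stream consisting of a 4-word ALU instruction followed by a final Load/Store instruction, the first entry would carry next type 5 and the last would carry the value 1, as described above.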

Add info on instruction words
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 1f112fc..8d95737 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -3,3 +3,14 @@
 ## Publicly available information
 
 ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines with different bitstreams: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
+
+## Instruction format
+
+It appears that the shader binaries are the same between the T600 and T650 targets; the only difference is in how many cycles it takes to execute the ALU instructions (T650 takes half as many due to having twice as many ALU's). The shader consists of a stream of instruction words. There are 3 types of instruction words, corresponding to the three pipelines; instruction words can appear in any order. Instruction words are always a multiple of 4 words (128 bits). They can be parsed by reading the lowest 4 bits of the first word:
+
+    3 - Texture (4 words)
+    5 - Load/Store (4 words)
+    8 - ALU (4 words)
+    9 - ALU (8 words)
+    A - ALU (12 words)
+    B - ALU (16 words)? (Never seen)

Add T6xx isa page to the main page
diff --git a/index.mdwn b/index.mdwn
index d16e142..568b7d0 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -59,6 +59,7 @@ Lima Documents
 
 * [[Lima+Assembler]]
 * [[Lima+ISA]]
+* [[T6xx+ISA]]
 * [[Fragment+Assembly+Syntax]]
 * [[Vertex+Disassembly]]
 * [[Mali_Offline_Shader_Compiler]]

mention T6xx is a unified architecture
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
index 1c6ddfb..1f112fc 100644
--- a/T6xx+ISA.mdwn
+++ b/T6xx+ISA.mdwn
@@ -2,4 +2,4 @@
 
 ## Publicly available information
 
-ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). There are 3 separate pipelines: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.
+ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). T6xx is the first Mali unified architecture; unlike the Mali 200/400, the vertex and fragment shaders use the same pipelines. There are 3 separate pipelines with different bitstreams: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.

Adding initial T6xx page
diff --git a/T6xx+ISA.mdwn b/T6xx+ISA.mdwn
new file mode 100644
index 0000000..1c6ddfb
--- /dev/null
+++ b/T6xx+ISA.mdwn
@@ -0,0 +1,5 @@
+# T6xx Architecture
+
+## Publicly available information
+
+ARM has been a lot more open this time with the architecture behind the T6xx. For a good overview with some slides from ARM, see [this Anandtech article](http://www.anandtech.com/show/6136/arm-announces-8core-2nd-gen-malit600-gpus). There are 3 separate pipelines: ALU, Load/Store, and Texture lookup (A, L, and T in the verbose output of the compiler). The Mali-600 target for the compiler (T604, T622, T624, T628) has 2 ALU's and so can execute 2 ALU ops per cycle, and the T650 target (T658, T678) has 4 ALU's.

d'oh... this makes a *lot* more sense now
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index bc6f48d..7d492d1 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -20,33 +20,29 @@ Apparently, most GPU's processes fragments in groups of 2x2; I suspect ours does
 
 ### Bitfield Description
 
-     0...4: {     } Instruction Length
-     5:     { 0, 0} Output to gl_FragColor and end program.
-     6:     { 0, 0} Inter-thread synchronization?
-     7:     {34, 1} Varying Fetch
-     8:     {62, 2} Texture Sampler
-     9:     {41, 1} Uniform/Temporary Fetch
-    10:     {43 ,1} Vec4 Multiply ALU
-    11:     {30, 1} Scalar Multiply ALU
-    12:     {44, 1} Vec4 Addition ALU
-    13:     {31, 1} Scalar Addition ALU
-    14:     {30, 1} Vec4-Scalar Multiply/Transcendental Scalar ALU
-    15:     {41, 1} Temporary Write/Framebuffer Read
-    16:     {73, 2} Branch/Discard
-    17:     {64, 2} Vec4 Constant Fetch 0
-    18:     {64, 2} Vec4 Constant Fetch 1
-    19..24: {     } Scheduling
-    25:     {     } Writeback enable
-    28..31  {     } Unknown (=0)
+     0...4: {  } Instruction Length
+     5:     {0 } Output to gl_FragColor and end program.
+     6:     {0 } Inter-thread synchronization?
+     7:     {34} Varying Fetch
+     8:     {62} Texture Sampler
+     9:     {41} Uniform/Temporary Fetch
+    10:     {43} Vec4 Multiply ALU
+    11:     {30} Scalar Multiply ALU
+    12:     {44} Vec4 Addition ALU
+    13:     {31} Scalar Addition ALU
+    14:     {30} Vec4-Scalar Multiply/Transcendental Scalar ALU
+    15:     {41} Temporary Write/Framebuffer Read
+    16:     {73} Branch/Discard
+    17:     {64} Vec4 Constant Fetch 0
+    18:     {64} Vec4 Constant Fetch 1
+    19..24: {  } Next Instruction Length (prefetch)
+    25:     {  } Prefetch enable
+    28..31  {  } Unknown (=0)
     
     
-    {n, m} means this bit set adds n bits to the instruction 
-    and m to the scheduling field of the previous instruction.
-    Except for the last instruction, the scheduling field starts at 2 before taking into account the
-    next instruction.
-    Writeback enable is enabled for all instructions except for the last instruction and discard instructions.
-    If you add the n's together for each enabled unit, and then round upward to the nearest word,
-    (+1 word for the control word) they should equal the bottom 5 bits.
+    {n} means that setting this bit adds n bits to the instruction
+
+Prefetch enable is set for all instructions which can possibly reach the next instruction, i.e. all instructions except for the last instruction and discard instructions. For all instructions, the next instruction length is set to the instruction length of the next instruction (duh), except for the last instruction, where it's set to 0. Presumably this is for prefetching, since the shader still works if this field is set too high but crashes if it is set too low. If you add the n's together for each enabled unit, round up to the nearest 32-bit word, and add 1 word for the control word, the result should equal the bottom 5 bits.
 
 
 ### Vector Swizzling
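
The length rule from the bitfield description in the diff above can be checked mechanically. This Python sketch assumes that the numbers in the table are bit positions in the 32-bit control word, with bit 0 as the least significant bit; `UNIT_BITS`, `expected_length_words`, and `decode_control` are hypothetical names.

```python
# Bits added to the instruction by each enable bit (control word bits 5..18),
# copied from the bitfield description above.
UNIT_BITS = {
    5: 0, 6: 0, 7: 34, 8: 62, 9: 41, 10: 43, 11: 30, 12: 44,
    13: 31, 14: 30, 15: 41, 16: 73, 17: 64, 18: 64,
}


def expected_length_words(control):
    """Sum the enabled units, round up to 32-bit words, add the control word."""
    payload_bits = sum(n for bit, n in UNIT_BITS.items() if control & (1 << bit))
    return 1 + (payload_bits + 31) // 32


def decode_control(control):
    """Extract the length and prefetch fields and the recomputed length."""
    return {
        "length": control & 0x1F,               # bits 0..4
        "next_length": (control >> 19) & 0x3F,  # bits 19..24
        "prefetch": bool((control >> 25) & 1),  # bit 25
        "expected_length": expected_length_words(control),
    }
```

With only the varying fetch and texture sampler enabled, the payload is 34 + 62 = 96 bits, i.e. 3 words, so the length field should read 4 once the control word is counted.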

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 3da0db1..bc6f48d 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -2,7 +2,7 @@
 
 ## Fragment Shader Architecture
 
-The architecture consists of a large pipeline, consisting of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALU's, one of which can do addition, another which can do multiplication.  There are also two similar scalar ALU's, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (128 stages for Mali-200, see [this page](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12787.html)), the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
+The architecture consists of a large pipeline, consisting of a number of vector and scalar units which can be enabled and disabled by the control word. In particular, there are two vector ALU's, one of which can do addition, another which can do multiplication.  There are also two similar scalar ALU's, and the "combiner" capable of executing scalar-vector multiplies as well as various complex/transcendental functions (sqrt, rcp, sin, cos, exp2, etc.). Furthermore, there are varying, uniform/temporary, and texture load units, a temporary store unit, and a branching unit for implementing jumps. Each unit can affect/produce results which are used by all later units, as if all 6 registers are passed between each unit in the pipeline (see "Lima Fragment Pipeline" below). The only exception is that each vector ALU unit is executed in parallel with its scalar counterpart - so the vector and scalar multiply ALU's run in parallel, and so do the vector and scalar addition ALU's. Furthermore, to reduce register pressure, there are a number of "pipeline registers". A pipeline register is a direct connection between two units in the pipeline, in addition to the normal registers which are passed between every unit. For more details on registers (including pipeline registers), see the "Registers" section below. To overcome the pipeline stall issues inherent in such a long pipeline (128 stages for Mali-200, see [this page](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12787.html)), the architecture is likely barrelled and interleaves execution of a large number of fragments at once, and scheduling is done by the machine in order to minimize stalls.
 The instruction stream is compressed down from a maximum of 18 words per instruction, depending on which units are in use.
 The remaining bits give each unit individual instructions and constants.
 

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 4d564bc..3da0db1 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -236,7 +236,7 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     iooo ooMM mmmm dddd CCaa aaaa aaAA AADD bbbb bbbb BBBB
     
-    i - whether to get Argument 1 from the multiplication ALU (below)
+    i - whether to get Argument 1 from the vector multiplication ALU (above)
     o - opcode:
         00000 - arg0 + arg1
         00100 - fract(arg1)
@@ -279,7 +279,7 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     ioo ooor r1dd dddd AAaa aaaa BBbb bbbb
 
-    i - whether to get Argument 1 from Scalar Multiply (below)
+    i - whether to get Argument 1 from Scalar Multiply (above)
     o - opcode:
         00000 - arg0 + arg1
         00100 - fract(arg1)

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 8aac380..4d564bc 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -159,7 +159,7 @@ There also exists various "pipeline registers" (four of them listed above) which
 
     The coordinates for the texture fetch are always the output of the varying load.
 
-    s - sampler index (offset into uniform table)
+    s - sampler index
     o - sampler index register offset enable
     c - sampler index offset register
     t - sampler type

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 9750895..8aac380 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -133,6 +133,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         01pp - register (see second instruction format)
         1000 - varying, input to textureCube()
         1001 - register, input to textureCube()
+        1010 - vec3 normalize (see third instruction format)
         1011 - gl_FragCoord
         1100 - gl_PointCoord
         1101 - gl_FrontFacing

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 2f0fa0d..9750895 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -103,7 +103,9 @@ There also exists various "pipeline registers" (four of them listed above) which
     
     00 mmmm dddd iiii iiOO 00oo oo00 0aa0 ssss
     Or, for loading from a register (used for loading texture coordinates from a register):
-    00 mmmm dddd SSSS SSSS ANrr rr00 n000 01pp
+    00 mmmm dddd SSSS SSSS ANrr rr00 0000 01pp
+    Or, for normalizing a vec3 input register (uses most of the same fields as loading from a register):
+    00 mmmm dddd SSSS SSSS ANrr rr00 1000 1010
     
     m - Mask, (0001 = float, 0011 = vec2, 0111 = vec3, 1111 = vec4)
     d - Destination Register
@@ -139,8 +141,6 @@ There also exists various "pipeline registers" (four of them listed above) which
         00 - normal
         10 - divide by z
         11 - divide by w
-    n - normalize input
-        Set this to normalize a vec3 inside the varying load unit. Must set p = 10 (divide by z) as well. Seems to only work with a register as a source (as opposed to a varying).
 
     Note: for gl_FragCoord and gl_PointCoord, the shader has to apply some transforms to get the right value.
     For gl_FragCoord, it looks like this in pseudocode:

more varying load weirdness
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 89748cc..2f0fa0d 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -103,7 +103,7 @@ There also exists various "pipeline registers" (four of them listed above) which
     
     00 mmmm dddd iiii iiOO 00oo oo00 0aa0 ssss
     Or, for loading from a register (used for loading texture coordinates from a register):
-    00 mmmm dddd SSSS SSSS Anrr rr00 0000 01pp
+    00 mmmm dddd SSSS SSSS ANrr rr00 n000 01pp
     
     m - Mask, (0001 = float, 0011 = vec2, 0111 = vec3, 1111 = vec4)
     d - Destination Register
@@ -118,7 +118,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         01 - alignment by 2 (load 2 floats)
         11 - alignment by 4 (load 4 floats)
     A - absolute value input modifier
-    n - negate input modifier
+    N - negate input modifier
     r - input register
     o - Offset register - vector part
         1111: no offset (0)
@@ -139,6 +139,8 @@ There also exists various "pipeline registers" (four of them listed above) which
         00 - normal
         10 - divide by z
         11 - divide by w
+    n - normalize input
+        Set this to normalize a vec3 inside the varying load unit. Must set p = 10 (divide by z) as well. Seems to only work with a register as a source (as opposed to a varying).
 
     Note: for gl_FragCoord and gl_PointCoord, the shader has to apply some transforms to get the right value.
     For gl_FragCoord, it looks like this in pseudocode:

add missing output modifier for LUT ops
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 2f01bb2..89748cc 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -302,9 +302,10 @@ There also exists various "pipeline registers" (four of them listed above) which
     
     control[14], Vec4-Scalar Multiply/Transcendental Scalar ALU
 
-    dd dddd 00ss ssss na00 0000 00oo oo00
+    dd dddd MMss ssss na00 0000 00oo oo00
 
     d - destination
+    M - output modifier
     s - source
     n - negate source
     a - take absolute value of source

Add q3a + ogt shaders news.
diff --git a/index.mdwn b/index.mdwn
index 8799d58..d16e142 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,7 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2013-03-18: Q3A now runs with open source generated shaders! Read all about it [on libv's blog](http://libv.livejournal.com/24402.html).
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.

diff --git a/index.mdwn b/index.mdwn
index 89e6fde..8799d58 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -65,6 +65,7 @@ Lima Documents
 * [[Fragment+Shader+Backend]]
 * [[Render State]]
 * [[Texel Formats]]
+* [[Compiling Q3A Shaders]]
 
 ## Contribute
 ===

diff --git a/Compiling_Q3A_Shaders.mdwn b/Compiling_Q3A_Shaders.mdwn
new file mode 100644
index 0000000..422a599
--- /dev/null
+++ b/Compiling_Q3A_Shaders.mdwn
@@ -0,0 +1,21 @@
+This page explains how to use open-gpu-tools to generate the required shaders for the limare port of Quake 3 Arena to run without using the binary compiler. The shaders have been hand-converted from ESSL (the input of the binary compiler) to a custom assembly/IR, and so some playing around/learning/reading the source is necessary in order to understand how the shaders work.
+
+## Setting up open-gpu-tools
+
+Clone [my open-gpu-tools tree](https://gitorious.org/~cwabbott/open-gpu-tools/cwabbotts-open-gpu-tools), and switch to the ir branch. Compile libcommon.so by cd'ing to the common directory and running make. Same thing with ir_tools and assemble.
+
+## Fragment shaders
+
+The fragment shaders are written in assembly, meaning that you have to use the assemble tool to generate a working MBS file. To assemble a shader ~/my_shader.in into an MBS file ~/my_shader.mbs, from the assemble directory do:
+
+    ./assemble -a lima_pp -s verbose -t fragment -o ~/my_shader.mbs ~/my_shader.in
+
+## Vertex shaders
+
+The vertex shaders are compiled from gp_ir, meaning you need to use the ir_tools to compile it to MBS. To parse an input shader ~/my_shader.in into a binary gp_ir file ~/my_shader.ir, from the ir_tool directory do:
+
+    ./ir_parse -i lima_gp_ir -o ~/my_shader.ir ~/my_shader.in
+
+And to compile that to MBS, do:
+
+    ./ir_lower -i lima_gp_ir -a lima_gp -f mbs -o ~/my_shader.mbs ~/my_shader.ir

Move into setting up X
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index e29580e..4c8b0c3 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -193,5 +193,42 @@ locale-gen en_US.UTF-8
 </pre>
 or by whichever locale is listed as LANG when running locale.
 
+# Setting up X
+
+Create /etc/X11/xorg.conf with the following content:
+<pre>
+Section "Device"
+        identifier "FBDEV"
+        Driver "fbdev"
+        Option "fbdev" "/dev/fb6"
+EndSection
+
+Section "Screen"
+        identifier "Default Screen"
+        Device "FBDEV"
+        DefaultDepth 16
+EndSection
+</pre>
+
+You can now start the display manager:
+<pre>
+lightdm&
+</pre>
+
+I haven't yet figured out how the strange exynos fb drivers can be coaxed into doing 24 bit colour.
+
 # Mali binaries
 
+Install es2gears and es2_info through:
+
+<pre>
+apt-get install mesa-utils-extra
+</pre>
+
+This will drag in the full mesa, which includes an openGLESv2 lib, which we really do not need.
+
+<pre>
+mv /usr/lib/arm-linux-gnueabihf/mesa-egl /usr/lib/arm-linux-gnueabihf/.mesa-egl
+</pre>
+
+Nasty, but works.

Add random fluff for getting ALIP running.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 8fb3b3f..e29580e 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -138,4 +138,60 @@ Then run the following to create boot.scr, the file that u-boot looks for:
 mkimage -A arm -O linux -T script -C none -a 0 -e 0 -n "BOOT Script for ODROID-X2" -d boot.txt boot.scr
 </pre>
 
+# First boot setup
+
+I experienced some resolver issues, as apparently the dhcpd nameserver info was not passed on properly (networkmanager?). So I added the following to /etc/resolv.conf to manually override things:
+
+<pre>
+nameserver 192.168.x.x
+</pre>
+
+I then went on to install the most important package for any network connected device:
+<pre>
+apt-get update
+apt-get install openssh-server
+</pre>
+
+I could then ssh into the device and start changing some things.
+<pre>
+sudo -s
+passwd
+</pre>
+
+Now you can just ssh in as root.
+
+<pre>
+echo "odroid" > /etc/hostname
+</pre>
+
+Log out and in again to see this take effect.
+
+Then drop the linaro user and add your own. Make sure it is added to the video group.
+<pre>
+userdel -r linaro
+adduser user
+adduser user video
+</pre>
+
+You will see loads of locale issues when running any apt things:
+<pre>
+perl: warning: Setting locale failed.
+perl: warning: Please check that your locale settings:
+	LANGUAGE = (unset),
+	LC_ALL = (unset),
+	LANG = "en_US.UTF-8"
+    are supported and installed on your system.
+perl: warning: Falling back to the standard locale ("C").
+locale: Cannot set LC_CTYPE to default locale: No such file or directory
+locale: Cannot set LC_MESSAGES to default locale: No such file or directory
+locale: Cannot set LC_ALL to default locale: No such file or directory
+</pre>
+
+You can fix these by running:
+<pre>
+locale-gen en_US.UTF-8
+</pre>
+or by whichever locale is listed as LANG when running locale.
+
 # Mali binaries
+

Add boot.scr creation information, and make the boot partition vfat.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 475aa35..8fb3b3f 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -29,17 +29,18 @@ I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disk identifier: 0x834d1732
 
         Device Boot      Start         End      Blocks   Id  System
-/dev/mmcblk0p1            3072      527359      262144   83  Linux
+/dev/mmcblk0p1            3072      527359      262144    b  W95 FAT32
 /dev/mmcblk0p2          527360    31116287    15294464   83  Linux
 </pre>
 
-The important thing to note is that the first partition should start at 3072, as the space underneath is used by the u-boot and trustedzone binaries. It also might pay to provide a separate boot partition, with kernel images and u-boot script files. Apart from that, you are free to partition as you like, as long as you update u-boot script accordingly.
+The important thing to note is that the first partition should start at 3072, as the space underneath is used by the u-boot and trustedzone binaries, and it should be a FAT based boot partition. Apart from that, you are free to partition as you like, as long as you update u-boot script accordingly.
 
 Note that this is for a 16GB card; actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given for one big root filesystem.
 
 Now format all partitions:
 <pre>
-mkfs.ext3 /dev/mmcblkX
+mkfs.vfat /dev/mmcblkXp1
+mkfs.ext3 /dev/mmcblkXp2
 </pre>
 
 # U-boot setup Pt.1
@@ -125,4 +126,16 @@ cp arch/arm/boot/zImage PATH_TO_BOOTFS
 </pre>
 # U-boot setup Pt.2
 
+Now we create a file called boot.txt in our boot partition, and it should contain the following:
+<pre>
+setenv bootargs 'root=/dev/mmcblk0p2 rw rootwait console=tty0 console=ttySAC1,115200n8 mem=2047M'
+fatload mmc 0:1 0x40008000 zImage
+bootm 0x40008000
+</pre>
+
+Then run the following to create boot.scr, the file that u-boot looks for:
+<pre>
+mkimage -A arm -O linux -T script -C none -a 0 -e 0 -n "BOOT Script for ODROID-X2" -d boot.txt boot.scr
+</pre>
+
 # Mali binaries

Add line for installing zImage.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index c3f370b..475aa35 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -119,6 +119,10 @@ Once that's done, run:
 make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=/PATH_TO_ROOTFS/ modules_install
 </pre>
 
+You can now copy the kernel image to the boot partition:
+<pre>
+cp arch/arm/boot/zImage PATH_TO_BOOTFS
+</pre>
 # U-boot setup Pt.2
 
 # Mali binaries

Add module installation.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 41b5f15..c3f370b 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -114,6 +114,11 @@ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j5 zImage modules
 
 Now go and make some tea :)
 
+Once that's done, run:
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=/PATH_TO_ROOTFS/ modules_install
+</pre>
+
 # U-boot setup Pt.2
 
 # Mali binaries

Add first part of kernel build info.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index a272e2c..41b5f15 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -73,6 +73,47 @@ Pick a root filesystem laid out for arm hardfloat. My current preference is a Li
 
 # Kernel build
 
+First, you need a clone of the odroid kernel.
+
+You can either clone an existing kernel tree, and then fetch the odroid one on top:
+
+<pre>
+git clone /home/user/kernel/linux-2.6/ kernel
+cd kernel/
+git remote rm origin
+git remote add origin https://github.com/hardkernel/linux.git
+git fetch
+git checkout odroid-3.0.y
+</pre>
+
+Or you can just make a quick copy of the top level tree, without downloading a full (and huge) git repository.
+
+<pre>
+git clone --depth 1 https://github.com/hardkernel/linux.git -b odroid-3.0.y kernel
+</pre>
+
+Make sure that your cross toolchain is in your path.
+
+You can now select one of many odroid machine targets, although I personally find "ubuntu" very shortsighted:
+
+<pre>
+ls arch/arm/configs/odroid*ubuntu*
+</pre>
+
+Here, we pick the odroid-x2 with mali enabled:
+
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- odroidx2_ubuntu_mali_defconfig 
+</pre>
+
+After this we can build our kernel:
+
+<pre>
+make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j5 zImage modules
+</pre>
+
+Now go and make some tea :)
+
 # U-boot setup Pt.2
 
 # Mali binaries

Add rootfs description.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index a61288f..a272e2c 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -1,5 +1,7 @@
 This document gathers all the necessary info to set up an SD-Card with a bootable gnu/linux, with your own kernel and with mali binaries installed.
 
+Order your odroid with the uart module. As with any ARM device, serial is indispensable for debugging any boot failures. Make sure that you are using a recent enough driver for cp210x; this module only became useful after linux kernel 3.2, but the fixes to this specific module can easily be backported. Ask libv on irc for more info if you need this.
+
 # SD-Card
 
 First off, clear out the first bits of the SD-Card for sanity's sake:
@@ -35,10 +37,42 @@ The important thing to note is that the first partition should start at 3072, as
 
 Note that this is for a 16GB card; actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given for one big root filesystem.
 
-# U-boot setup
+Now format all partitions:
+<pre>
+mkfs.ext3 /dev/mmcblkX
+</pre>
+
+# U-boot setup Pt.1
+
+Samsung currently does not provide sources with its build of u-boot, so both Samsung and Hardkernel are violating the GPL.
+
+You can download a tarball with all the u-boot binaries from [here](http://www.mdrjr.net/odroid/mirror/BSPs/Alpha4/unpacked/boot.tar.gz)
+
+Untar this:
+
+<pre>
+tar -zxvf boot.tar.gz
+</pre>
+
+Then make the script in there executable:
+<pre>
+chmod +x sd_fusing.sh
+</pre>
+
+And now make this script install all the blobs to your SD-Card:
+
+<pre>
+./sd_fusing.sh /dev/mmcblkX
+</pre>
+
+After that, your SD-Card should be bootable. If you have a uart, you should already be able to see U-boot attempting to load.
+
+# Root filesystem
+
+Pick a root filesystem laid out for arm hardfloat. My current preference is a Linaro ALIP style image, with lightdm and xfce. It can be downloaded [here](https://snapshots.linaro.org/quantal/images/alip). Once you have downloaded it, you can simply untar it in the root partition of your sd-card. After untarring, you need to move everything in the binary directory one level up. This is there to protect people from overwriting their main filesystem. Do not forget to remove SHA256SUMS :)
 
 # Kernel build
 
-# Root fs
+# U-boot setup Pt.2
 
 # Mali binaries

Fill out SD-Card section.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
index 33d63c2..a61288f 100644
--- a/OdroidSetup.mdwn
+++ b/OdroidSetup.mdwn
@@ -2,8 +2,43 @@ This document gathers all the necessary info to set up an SD-Card with a bootabl
 
 # SD-Card
 
+First off, clear out the first bits of the SD-Card for sanity's sake:
+
+<code>
+dd if=/dev/zero of=/dev/mmcblkX bs=1M count=5
+</code>
+
+Then set up some partitions on the SD-Card:
+
+<code>
+fdisk /dev/mmcblkX
+</code>
+
+And work it until it looks somewhat like this:
+
+<pre>
+Command (m for help): p
+
+Disk /dev/mmcblk0: 15.9 GB, 15931539456 bytes
+4 heads, 16 sectors/track, 486192 cylinders, total 31116288 sectors
+Units = sectors of 1 * 512 = 512 bytes
+Sector size (logical/physical): 512 bytes / 512 bytes
+I/O size (minimum/optimal): 512 bytes / 512 bytes
+Disk identifier: 0x834d1732
+
+        Device Boot      Start         End      Blocks   Id  System
+/dev/mmcblk0p1            3072      527359      262144   83  Linux
+/dev/mmcblk0p2          527360    31116287    15294464   83  Linux
+</pre>
+
+The important thing to note is that the first partition should start at 3072, as the space underneath is used by the u-boot and trustedzone binaries. It also might pay to provide a separate boot partition, with kernel images and u-boot script files. Apart from that, you are free to partition as you like, as long as you update u-boot script accordingly.
+
+Note that this is for a 16GB card; actual offsets and sizes might look different for you. In this setup, 256MB was reserved for the boot partition, and the remainder was given for one big root filesystem.
+
 # U-boot setup
 
 # Kernel build
 
+# Root fs
+
 # Mali binaries
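
The numbers in the fdisk listing above can be sanity-checked with a little arithmetic. This Python sketch uses only the values printed in the listing (512-byte sectors, 1 KiB blocks); `partition_bytes` is a hypothetical helper name.

```python
SECTOR_BYTES = 512  # "Units = sectors of 1 * 512 = 512 bytes"
MIB = 1024 * 1024


def partition_bytes(start_sector, end_sector):
    """Size in bytes of a partition given inclusive start/end sectors."""
    return (end_sector - start_sector + 1) * SECTOR_BYTES


# Space below the first partition, used by the u-boot and trustedzone binaries.
reserved = 3072 * SECTOR_BYTES

boot = partition_bytes(3072, 527359)        # /dev/mmcblk0p1
root = partition_bytes(527360, 31116287)    # /dev/mmcblk0p2

print(f"reserved: {reserved / MIB} MiB")
print(f"boot:     {boot // MIB} MiB ({boot // 1024} blocks)")
print(f"root:     {root // 1024} blocks")
```

The boot partition works out to 262144 blocks of 1 KiB, which is the 256MB mentioned above, and the root partition to 15294464 blocks, matching the Blocks column of the listing.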

Initial structure.
diff --git a/OdroidSetup.mdwn b/OdroidSetup.mdwn
new file mode 100644
index 0000000..33d63c2
--- /dev/null
+++ b/OdroidSetup.mdwn
@@ -0,0 +1,9 @@
+This document gathers all the necessary info to set up an SD-Card with a bootable gnu/linux, with your own kernel and with mali binaries installed.
+
+# SD-Card
+
+# U-boot setup
+
+# Kernel build
+
+# Mali binaries

Fill out ODROID section. Links to hardkernel are deliberately not put in place, hardkernel does not wish to support the lima driver project.
diff --git a/Devices.mdwn b/Devices.mdwn
index c941665..2b612d9 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -35,13 +35,19 @@ According to the [spec sheet](http://www.pointofview-online.com/showroom.php?sho
 
 # Exynos 4 (**GPL VIOLATOR**)
 
+These SoCs are the best performing Mali-400 devices out there. They are proper speed demons. The Exynos 42xx series has a dual A9, whereas the Exynos 44xx series has a quad A9. All come with a Mali-400MP4.
+
 All exynos 4 devices come with binary only u-boot. This means that Samsung, and its device makers, are violating the GPL.
 
 ## Origen Board (**GPL VIOLATOR**)
 
 ## ODROIDs (**GPL VIOLATOR**)
 
+The Odroids are small developer boards with many possible connections. The Odroid-x2/u2 is hyperfast, as it can clock the 4 A9s to 2GHz, and the Mali-400MP4 can clock up to 640MHz. This makes for a nice high-end benchmarker, and a good comparison for the comparatively meek A10.
+
+Hardkernel tries to portray itself as open source friendly, but they have a lot to learn still. They are providing some sort of crazy android and ubuntu pre-made SD-card images, and even hand out Mali binaries for ubuntu. Hardkernel knows the pain of getting the Mali binaries built and integrated, yet they are not interested in cooperating with our project. They officially claim to have "community based" support, and we all know what that means. Since these devices are developer boards, they have a much longer life span than your average mobile phone or tablet. In the mid to long term, hardkernel, well, hardkernel's customers, will end up depending on the support of the lima driver project.
 
+Here is [[some_information|OdroidSetup]] on how to set up your own SD card with a custom built kernel and with mali binaries (which we need for reverse engineering).
 
 ## Samsung Galaxy S II (**GPL VIOLATOR**)
 

Expand allwinner section and mark samsung as a gpl violator.
diff --git a/Devices.mdwn b/Devices.mdwn
index da1b79a..c941665 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -4,11 +4,15 @@ Be careful where you buy, most cheap shops will not ship from your country but w
 
 # AllWinner A10
 
-The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali. There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly.
+The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali.
 
-## Cubieboard (Open Source Hardware!)
+These devices have a Cortex A8 capable of clocking a little over 1GHz, and come with lots of expansion possibilities, even SATA. They feature a Mali-400MP1, so they are not stellar performers, but they more than make up for that in availability, price, and openness. Allwinner itself is not directly supporting open source software, and would be a GPL violator in itself. But luckily, their lack of control over their device makers made the necessary code fall out through the cracks, and they are the most compliant of any Chinese SoC maker today.
 
-The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is currently only available for pre-order.
+There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly. Check out [the main linux-sunxi page](http://linux-sunxi.org/Main_Page) to find out about the supported, and sometimes even fully open source, hardware available.
+
+## Cubieboard (**Open Source Hardware!**)
+
+The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD.
 
 ## Gooseberry
 
@@ -18,8 +22,6 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
 
-## Mele A1000
-
 # AMLogic 8726-M (Mali 400)
 
 ## Zenithink ZT-280 (**GPL VIOLATOR**)
@@ -31,15 +33,19 @@ The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be
 
 According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competitively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
 
-# Exynos 4
+# Exynos 4 (**GPL VIOLATOR**)
+
+All exynos 4 devices come with binary only u-boot. This means that Samsung, and its device makers, are violating the GPL.
+
+## Origen Board (**GPL VIOLATOR**)
+
+## ODROIDs (**GPL VIOLATOR**)
 
-## Origen Board
 
-## ODROID
 
-## Samsung Galaxy S II
+## Samsung Galaxy S II (**GPL VIOLATOR**)
 
-## Samsung Galaxy S III
+## Samsung Galaxy S III (**GPL VIOLATOR**)
 
 # Exynos 5
 

Add link to linux-sunxi, and list allwinner first. It is our prime target today.
diff --git a/Devices.mdwn b/Devices.mdwn
index 5f4e858..da1b79a 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -2,20 +2,11 @@ This page lists some of the available devices with a Mali GPU, together with som
 
 Be careful where you buy, most cheap shops will not ship from your country but will ship from China. This means that you might end up paying customs, and end up wasting some time at the customs office.
 
-# AMLogic 8726-M (Mali 400)
-
-## Zenithink ZT-280 (**GPL VIOLATOR**)
-
-The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
-
-
-## Point of View ProTab 2XXL (**GPL VIOLATOR**)
-
-According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competitively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
-
 # AllWinner A10
 
-## Cubieboard
+The allwinner A10 and A13 SoCs are currently the easiest and best supported targets for developing an open source driver for the ARM Mali. There is a very active open source community, called [linux-sunxi](http://linux-sunxi.org), to support these SoCs, and device support is growing rapidly.
+
+## Cubieboard (Open Source Hardware!)
 
 The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is currently only available for pre-order.
 
@@ -29,6 +20,17 @@ The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20B
 
 ## Mele A1000
 
+# AMLogic 8726-M (Mali 400)
+
+## Zenithink ZT-280 (**GPL VIOLATOR**)
+
+The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
+
+
+## Point of View ProTab 2XXL (**GPL VIOLATOR**)
+
+According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competitively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
+
 # Exynos 4
 
 ## Origen Board

Carlos is not an anonymous user, but just a useless user.
This reverts commit 71a2ca94db7a71c268455a744d2708bb090ed774
diff --git a/index.mdwn b/index.mdwn
index 150cc4f..89e6fde 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===

Carlos is not an anonymous user, but just a useless user.
This reverts commit da881376fedc3e8ed8dcc4377b62f6ae656a643e
diff --git a/index.mdwn b/index.mdwn
index 32fedd9..150cc4f 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===
@@ -82,7 +82,3 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER

Carlos is not an anonymous user, but just a useless user.
This reverts commit f80dab29f9403a8e1c6da8028d61db2c71a938b8
diff --git a/index.mdwn b/index.mdwn
index 246f68c..32fedd9 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -5,27 +5,11 @@ Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs
 
 The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 ## News
 ===
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach
@@ -37,14 +21,6 @@ ANONYMOUS USER
 * 2012-02-03: First public renders of [smoothed triangle, smoothed strip, smoothed fan, flat quad, triangle quad, smoothed lighted rotated cube](http://limadriver.org/content)
 * 2012-01-24: A new name has been chosen for the project: remali now becomes Lima! We now have a gitorious project, there is the #lima channel on freenode. A mailing list will be created soon.
 * 2012-01-23: [Codethink](http://www.codethink.co.uk/) puts out a [press release](http://www.prweb.com/releases/2012/1/prweb9130318.htm) for the business world. This is definitely not vaporware!
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 * 2012-01-21: Talk appears on [the FOSDEM schedule.](http://fosdem.org/2012/schedule/event/mali "Liberating ARM's Mali GPU")[The cat is out of the bag!](http://twitter.com/#!/codethink/status/160803588929626112) Story published by [phoronix](http://www.phoronix.com/vr.php?view=16971), hits [slashdot](http://linux.slashdot.org/story/12/01/21/0935248/coming-soon-an-open-source-reverse-engineered-mali-gpu-driver), [golem](http://www.golem.de/1201/89274.html), [pro-linux](http://www.pro-linux.de/news/1/17948/freier-treiber-fuer-mali-grafikprozessoren-angekuendigt.html) and [tweakers](http://tweakers.net/nieuws/79485/opensourcedriver-voor-arms-mali-gpu-in-ontwikkeling.html).
 
 ## Software
@@ -58,14 +34,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 ===
 
 ### [Mali-400](Hardware#Mali-400):
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
 
-===
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
@@ -77,14 +46,7 @@ ANONYMOUS USER
 
 ## Documentation
 ===
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
 
-===
 The documentation is currently kept in the wiki, pages of interest are:
 
 Original (Falanx) datasheets:
@@ -106,14 +68,6 @@ Lima Documents
 
 ## Contribute
 ===
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 
 The Lima driver currently only has some preliminary and highly experimental support. This experimental phase is necessary to gain a full and complete understanding of how the Mali GPUs work. Once more is known, an actual graphics driver (most likely based off of Mesa/Gallium) can be written. There is a lot of interesting work that still needs to be done!
 
@@ -132,5 +86,3 @@ PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
 
 GREETINGS,
 ANONYMOUS USER
-
-===

Carlos is not an anonymous user, but just a useless user.
This reverts commit a23a8009ccc54368db33c8c1b96a68c4de3e4d9e
diff --git a/index.mdwn b/index.mdwn
index 9a6cd4a..246f68c 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,11 +1,3 @@
-===
-
-PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
-
-GREETINGS,
-ANONYMOUS USER
-
-===
 # **Lima**: An open source graphics driver for ARM Mali GPUs
 ===
 

Carlos is not an anonymous user, but just a useless user.
This reverts commit ae53600143425a7d6bf7abebb26c9e6dc16797ae
diff --git a/index.mdwn b/index.mdwn
index 3573463..9a6cd4a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -23,6 +23,8 @@ ANONYMOUS USER
 ===
 ## News
 ===
+* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
+* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 ===
 

This reverts commit 8927230e9df7a8c63c997baa3bf707108e600842
diff --git a/index.mdwn b/index.mdwn
index 9a6cd4a..3573463 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -23,8 +23,6 @@ ANONYMOUS USER
 ===
 ## News
 ===
-* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
-* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 ===
 

diff --git a/index.mdwn b/index.mdwn
index 246f68c..9a6cd4a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,3 +1,11 @@
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 # **Lima**: An open source graphics driver for ARM Mali GPUs
 ===
 

diff --git a/index.mdwn b/index.mdwn
index 32fedd9..246f68c 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -5,11 +5,27 @@ Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs
 
 The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 ## News
 ===
 * 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
 * 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach
@@ -21,6 +37,14 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 * 2012-02-03: First public renders of [smoothed triangle, smoothed strip, smoothed fan, flat quad, triangle quad, smoothed lighted rotated cube](http://limadriver.org/content)
 * 2012-01-24: A new name has been chosen for the project: remali now becomes Lima! We now have a gitorious project, there is the #lima channel on freenode. A mailing list will be created soon.
 * 2012-01-23: [Codethink](http://www.codethink.co.uk/) puts out a [press release](http://www.prweb.com/releases/2012/1/prweb9130318.htm) for the business world. This is definitely not vaporware!
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 * 2012-01-21: Talk appears on [the FOSDEM schedule.](http://fosdem.org/2012/schedule/event/mali "Liberating ARM's Mali GPU")[The cat is out of the bag!](http://twitter.com/#!/codethink/status/160803588929626112) Story published by [phoronix](http://www.phoronix.com/vr.php?view=16971), hits [slashdot](http://linux.slashdot.org/story/12/01/21/0935248/coming-soon-an-open-source-reverse-engineered-mali-gpu-driver), [golem](http://www.golem.de/1201/89274.html), [pro-linux](http://www.pro-linux.de/news/1/17948/freier-treiber-fuer-mali-grafikprozessoren-angekuendigt.html) and [tweakers](http://tweakers.net/nieuws/79485/opensourcedriver-voor-arms-mali-gpu-in-ontwikkeling.html).
 
 ## Software
@@ -34,7 +58,14 @@ Documentation for the shader compiler, and the initial investigation of the inst
 ===
 
 ### [Mali-400](Hardware#Mali-400):
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
 
+===
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
@@ -46,7 +77,14 @@ Documentation for the shader compiler, and the initial investigation of the inst
 
 ## Documentation
 ===
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
 
+===
 The documentation is currently kept in the wiki, pages of interest are:
 
 Original (Falanx) datasheets:
@@ -68,6 +106,14 @@ Lima Documents
 
 ## Contribute
 ===
+===
+
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER
+
+===
 
 The Lima driver currently only has some preliminary and highly experimental support. This experimental phase is necessary to gain a full and complete understanding of how the Mali GPUs work. Once more is known, an actual graphics driver (most likely based off of Mesa/Gallium) can be written. There is a lot of interesting work that still needs to be done!
 
@@ -86,3 +132,5 @@ PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
 
 GREETINGS,
 ANONYMOUS USER
+
+===

diff --git a/index.mdwn b/index.mdwn
index 150cc4f..32fedd9 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===
@@ -82,3 +82,7 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
+PLEASE PROTECT THIS CONTENT FROM ANONYMOUS USERS
+
+GREETINGS,
+ANONYMOUS USER

diff --git a/index.mdwn b/index.mdwn
index 89e6fde..150cc4f 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -3,7 +3,7 @@
 
 Lima is an open source graphics driver which supports Mali-200 and Mali-400 GPUs. 
 
-The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
+The aim of this driver and others such as [freedreno](http://freedreno.github.com) is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary xcvdrivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. 
 
 ## News
 ===

Add FOSDEM news.
diff --git a/index.mdwn b/index.mdwn
index 51d2f5b..89e6fde 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,8 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2013-02-06: Libv blogs about [Quake 3 Arena running on top of the limare prototype driver](http://libv.livejournal.com/23886.html).
+* 2013-02-02: Libv talks about [Open ARM GPU drivers](https://fosdem.org/2013/schedule/event/operating_systems_open_arm_gpu/) at FOSDEM, and Connor talks about [his compiler work in the Xorg DevRoom](https://fosdem.org/2013/schedule/event/maliisa/). Public goes wild for Q3A running on limare, and Connor fills the DevRoom to the brim!
 * 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!

add a section on latencies (possibly incomplete?)
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index b1dbb60..2f01bb2 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, read-after-write has a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields, which are set to 0.
 
 ## Output Transformation
 
@@ -488,6 +488,10 @@ These are the known inputs:
     28-31: Register 0 Output [-1, last instruction] (Register/Attribute)
 Note: If attribute_load_en is disabled then the attribute slot can be used to load registers too.
 
+## Latencies
+
+Temporaries have a latency of 4 instructions, i.e. writes take 4 cycles to appear. Registers have a similar latency of 3 instructions. Writes to address registers 1-3 have a latency of 4 instructions. Writes to address register 0 (temporary store) have no latency though, so it can be set in the same instruction as the temporary store itself. The complex1 operation has a latency of 2 cycles.
+
 Instruction format:
 
     0-4:   Multiply 0 Input A
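
The latency rules in the Latencies section above lend themselves to a small hazard check. The following Python sketch is only an illustration under stated assumptions: it reads "a latency of N instructions" as "a read must come at least N instructions after the write", and the `LATENCY` table keys, the `(kind, index)` resource tuples, and `find_hazards` are hypothetical names, not anything from the Lima code.

```python
# Latencies in instructions, taken from the Latencies section.
LATENCY = {
    "temporary": 4,
    "register": 3,
    "address": 4,    # address registers 1-3
    "address0": 0,   # address register 0 (temporary store), no latency
}


def find_hazards(program):
    """Find reads that come too soon after a write.

    program is a list of (writes, reads) pairs, one per instruction, where
    writes and reads are sets of (kind, index) tuples such as ("temporary", 0).
    Returns a list of (instruction_number, resource) for every early read.
    """
    last_write = {}
    hazards = []
    for number, (writes, reads) in enumerate(program):
        for resource in sorted(reads):
            written_at = last_write.get(resource)
            if written_at is not None and number - written_at < LATENCY[resource[0]]:
                hazards.append((number, resource))
        for resource in writes:
            last_write[resource] = number
    return hazards


if __name__ == "__main__":
    temp0 = ("temporary", 0)
    program = [
        ({temp0}, set()),   # 0: write temporary 0
        (set(), set()),     # 1
        (set(), {temp0}),   # 2: too early, only 2 instructions after the write
        (set(), set()),     # 3
        (set(), {temp0}),   # 4: fine, 4 instructions after the write
    ]
    print(find_hazards(program))
```

A scheduler could use a check like this to decide where nop instructions are needed, though whether the hardware counts the writing instruction itself is an assumption here.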

remove the codethink logo
diff --git a/index.mdwn b/index.mdwn
index fc6f7c6..51d2f5b 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -80,5 +80,3 @@ Please subscribe to our [mailinglist](http://vlists.pepperfish.net/cgi-bin/mailm
 
 ===
 
-<p class="alignright">The Lima driver is sponsored by <a href="http://www.codethink.co.uk/2012/01/23/open-source-graphics-drivers/"><img border="0" src="/codethink.png" alt="Codethink" />
-</a></p>

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index ad7e95b..b1dbb60 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -471,12 +471,13 @@ These are the known inputs:
 
     0-3:   Register 0 Output [0, current] (Register/Attribute)
     4-7:   Register 1 Output [0, current] (Register)
-    8-11:  Unknown (Never seen)
+    8:     Unused, same as 21? (seen in m200_hw_workarounds.c nop shader)
+    9-11:  Unknown
     12-15: Load Result [0, current] (Uniform/Temporary)
     16,17: Accumulator 0,1 Output [-1, last instruction]
     18,19: Multiplier 0,1 Output [-1, last instruction]
     20:    Passthrough Output [-1, last instruction]
-    21:    Unused
+    21:    Unused/nop (i.e. this ALU is not used during this instruction)
     22:    Complex Output [-1, last instruction]
     22:    Identity/Passthrough (0 for add, 1 for multiply)
              Accumulator 0,1 Input 1: add(a, -ident) means pass(a)

Add glAlphaFunc reference value.
diff --git a/Render_State.mdwn b/Render_State.mdwn
index ff8992e..3595aa5 100644
--- a/Render_State.mdwn
+++ b/Render_State.mdwn
@@ -47,6 +47,7 @@ The Mali render state is a record of 16 32-bit words (64 bytes). It consists of
 
     0x1C [7] stencil test
       00000000 00000000 11111111 11111111 GL_STENCIL_TEST (either all bits are set or not)
+      00000000 11111111 00000000 00000000 glAlphaFunc reference value: 0.5 = 0x80, 1.0 = 0xFF.
 
     0x20 [8] multisample
       00000000 00000000 00000000 00000111 always set? could be another CompareFunc
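The two data points in the glAlphaFunc line above (0.5 = 0x80, 1.0 = 0xFF) are consistent with an 8-bit unorm encoding. A minimal sketch of packing the reference value, assuming the bit patterns are written most-significant bit first (so the field sits in bits 16-23 of word 7) and assuming round-to-nearest; the function name `alpha_ref_bits` is hypothetical and not part of the Lima code:

```python
def alpha_ref_bits(ref: float) -> int:
    """Pack a glAlphaFunc reference value into the assumed bits 16-23 of word 7."""
    # Clamp to [0, 1], as glAlphaFunc does with its reference value.
    clamped = min(max(ref, 0.0), 1.0)
    # Assumed rounding mode; it reproduces 0.5 -> 0x80 and 1.0 -> 0xFF.
    byte = int(clamped * 255.0 + 0.5)
    return byte << 16


if __name__ == "__main__":
    for value in (0.0, 0.5, 1.0):
        print(value, hex(alpha_ref_bits(value)))
```

Only the two values listed in the change are confirmed by observation; other inputs follow from the assumed rounding and may differ on real hardware.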

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 0f19801..ad7e95b 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -113,7 +113,7 @@ There also exists various "pipeline registers" (four of them listed above) which
         It seems that varyings (floats) can be loaded in aligned groups of 1, 2, or 4.
         This specifies how many to load at once. Note that the alignment affects the addressing;
         for example, loading from an index of x at an alignment of 4 is equivalent to loading from 2*x
-        at an alignment of 2.
+         and 2*x+1 at an alignment of 2.
         00 - no alignment (load 1 float)
         01 - alignment by 2 (load 2 floats)
         11 - alignment by 4 (load 4 floats)
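The addressing rule in the hunk above can be sketched as a small helper that lists which float slots a varying load touches. This assumes the index is simply scaled by the group size, which matches the stated equivalence (index x at alignment 4 covers indices 2*x and 2*x+1 at alignment 2); the names `ALIGN_FIELD` and `covered_floats` are hypothetical:

```python
# Alignment field encodings as listed on the page: field value -> floats loaded.
ALIGN_FIELD = {0b00: 1, 0b01: 2, 0b11: 4}


def covered_floats(index: int, field: int) -> list[int]:
    """Return the float slots read by a load at `index` with the given alignment field."""
    if field not in ALIGN_FIELD:
        raise ValueError(f"unknown alignment field: {field:#04b}")
    count = ALIGN_FIELD[field]
    base = index * count  # assumption: the index is in units of the group size
    return list(range(base, base + count))


if __name__ == "__main__":
    x = 1
    print(covered_floats(x, 0b11))
    print(covered_floats(2 * x, 0b01) + covered_floats(2 * x + 1, 0b01))
```

Both printed lists cover the same four float slots, which is the equivalence the corrected sentence describes.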

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index dbe8fd2..0f19801 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, read-after-write has a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
 
 ## Output Transformation
 
@@ -560,6 +560,7 @@ Instruction format:
         0 - multiply (out = a * b)
         1 - complex 1 (inverse, inverse sqrt, etc.)
             takes all four inputs as arguments
+            This instruction has a latency of 2 cycles.
         3 - complex 2 (inverse, inverse sqrt, etc.)
             takes first two inputs as arguments,
             the other two are normal (multiply)
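The read-after-write rule for temporaries lends itself to a simple scheduling check. The sketch below assumes that "cannot be read until 4 instructions after it is written" means the read must sit at least 4 instruction slots after the write; the program representation (a list of `(reads, writes)` tuples of temporary indices) and the name `find_hazards` are made up for illustration:

```python
TEMP_LATENCY = 4  # from the page: temporaries are readable 4 instructions after a write


def find_hazards(program, latency=TEMP_LATENCY):
    """Return (instruction index, temporary) pairs that read a temporary too early."""
    last_write = {}
    hazards = []
    for pc, (reads, writes) in enumerate(program):
        for temp in reads:
            if temp in last_write and pc - last_write[temp] < latency:
                hazards.append((pc, temp))
        # Record writes after checking reads of the same instruction.
        for temp in writes:
            last_write[temp] = pc
    return hazards


if __name__ == "__main__":
    program = [
        ((), (0,)),  # 0: write temporary 0
        ((), ()),    # 1: nop
        ((0,), ()),  # 2: read temporary 0, only 2 instructions after the write
        ((), ()),    # 3: nop
        ((0,), ()),  # 4: read temporary 0, 4 instructions after the write
    ]
    print(find_hazards(program))
```

A scheduler could use such a check to decide where padding instructions are needed; the same function with `latency=2` would model the complex 1 latency, assuming it follows the same counting convention.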

whoops
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index ace9bf9..dbe8fd2 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 6 cycles (i.e. a temporary cannot be read until 6 instructions after it is written).
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 4 cycles (i.e. a temporary cannot be read until 4 instructions after it is written).
 
 ## Output Transformation
 

mali gp temporary stuff
diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index c7ae062..ace9bf9 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -434,7 +434,7 @@ Unlike a normal CPU, there are no explicit output registers for the ALU's, nor a
 
 # Temporaries
 
-Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields.
+Akin to the fragment shader, there are also temporaries, which unlike registers can be indexed using a base register/input. They share the same namespace and method of loading as uniforms. Storing temporaries uses the same fields as storing a register/varying does, except that the "temporary store flag" is enabled, the unknown field is changed, and the complex ALU is used to select the store address instead of the "varying/register store 0" and "varying/register store 1" fields. Also, it seems they have a latency of 6 cycles (i.e. a temporary cannot be read until 6 instructions after it is written).
 
 ## Output Transformation
 
@@ -547,6 +547,7 @@ Instruction format:
         4 - inverse sqrt (Partial)
         5 - reciprocal (Partial)
         9 - passthrough
+        10 - Set Address Register 0 & Address Register 1 from result of passthrough unit
         12 - Set Address Register 0 (Temporary Store address)
         13 - Set Address Register 1
         14 - Set Address Register 2

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index acc5951..c7ae062 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -471,7 +471,7 @@ These are the known inputs:
 
     0-3:   Register 0 Output [0, current] (Register/Attribute)
     4-7:   Register 1 Output [0, current] (Register)
-    9-11:  Unknown (Never seen)
+    8-11:  Unknown (Never seen)
     12-15: Load Result [0, current] (Uniform/Temporary)
     16,17: Accumulator 0,1 Output [-1, last instruction]
     18,19: Multiplier 0,1 Output [-1, last instruction]

diff --git a/index.mdwn b/index.mdwn
index c34b4b3..fc6f7c6 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -36,7 +36,7 @@ Documentation for the shader compiler, and the initial investigation of the inst
 * [AMLogic 8726-M](Hardware#AMLogic+8726-M) (Zenithink C71)
 * [Allwinner A10](Hardware#Allwinner+A10) (Mele A1000, MK802)
 * [ST-Ericsson Novathor](Hardware#ST-Ericsson+Novathor)
-* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab/Note)
+* [Samsung Exynos](Hardware#Samsung+Exynos) (Galaxy S2/S3/Tab/Note, Samsung Chromebook)
 
 ### [Mali-200](Hardware#Mali-200):
 

Add new odepush
diff --git a/index.mdwn b/index.mdwn
index b6e2c65..c34b4b3 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -7,6 +7,7 @@ The aim of this driver and others such as [freedreno](http://freedreno.github.co
 
 ## News
 ===
+* 2012-12-07: After what can only be described as an eternity, the LinuxTag demo code and tons of other changes have now been pushed to gitorious. Here is [a video of our brand new spinning companion cube](http://www.youtube.com/watch?v=k16ve88d-L0) spinning away at 60Hz.
 * 2012-05-27: Linuxtag talk slides and a separate demo of limare was posted on [phoronix](http://www.phoronix.com/scan.php?page=news_item&px=MTEwODA).
 * 2012-05-26: Lima talk at [Linuxtag Berlin](http://www.linuxtag.org/2012/de/program/program/vortragsdetails.html?no_cache=1&talkid=481): Textured, lighted portal cube, spins away correctly [(full video)](http://blip.tv/opensuse/linuxtag2012-lima-liberating-arm-s-mali-gpu-6166702)!
 * 2012-04-14: Rob Clark announces the [freedreno project](http://bloggingthemonkey.blogspot.co.uk/2012/04/fighting-back-against-binary-blobs.html) inspired by the Lima approach

added Mele A1000
diff --git a/Devices.mdwn b/Devices.mdwn
index 966b6c1..5f4e858 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -27,6 +27,8 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
 
+## Mele A1000
+
 # Exynos 4
 
 ## Origen Board

added some Exynos 4 and 5 devices
diff --git a/Devices.mdwn b/Devices.mdwn
index 7fe9c86..966b6c1 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -26,3 +26,25 @@ The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet bo
 ## Hackberry
 
 The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.
+
+# Exynos 4
+
+## Origen Board
+
+## ODROID
+
+## Samsung Galaxy S II
+
+## Samsung Galaxy S III
+
+# Exynos 5
+
+This SoC incorporates the Mali-T604 GPU along with 2 Cortex-A15 cores.
+
+## Arndale Board
+
+## Samsung Chromebook XE303C12
+
+This is, as of December 2012, the only ARM-based Chromebook. It costs 249 USD.
+
+## Google Nexus 10

added some AllWinner A10 boards
diff --git a/Devices.mdwn b/Devices.mdwn
index b0de43f..7fe9c86 100644
--- a/Devices.mdwn
+++ b/Devices.mdwn
@@ -1,16 +1,28 @@
-This page lists some of the available devices with a mali GPU, together with some useful info about them. The GPL VIOLATOR status for most of the devices is pretty much a given at this point, so let's just mark devices as such unless proven otherwise.
+This page lists some of the available devices with a Mali GPU, together with some useful info about them. The GPL VIOLATOR status for most of the devices is pretty much a given at this point, so let's just mark devices as such unless proven otherwise.
 
 Be careful where you buy: most cheap shops will not ship from your country but from China. This means that you might end up paying customs and wasting some time at the customs office.
 
 # AMLogic 8726-M (Mali 400)
 
 ## Zenithink ZT-280 (**GPL VIOLATOR**)
-===
 
 The ZT-280 range includes the C71, a 7" tablet with a capacitive display. Can be had for under EUR 100 these days, but add customs and postage to that.
 
 
 ## Point of View ProTab 2XXL (**GPL VIOLATOR**)
-===
 
 According to the [spec sheet](http://www.pointofview-online.com/showroom.php?shop_mode=product_detail&product_id=308) provided by its manufacturer/reseller, the ProTab 2XXL features a Mali-400 GPU. This tablet features a 10" capacitive touch-screen, and is very competitively priced - it retails for [about EUR 170](http://geizhals.eu/713232). Point of View publishes "Firmware Updates" in its somewhat chaotic [download area](http://downloads.pointofview-online.com/Drivers/), but there's no source code in sight anywhere.
+
+# AllWinner A10
+
+## Cubieboard
+
+The [Cubieboard](http://cubieboard.org) comes with 512 or 1024 MB of DDR3 RAM, 4 GB of NAND flash storage, a microSD card slot, Fast Ethernet, USB host ports, a SATA port, HDMI output and can be had for as low as 49 USD. As of December 2012, it is only available for pre-order.
+
+## Gooseberry
+
+The [Gooseberry](http://gooseberry.atspace.co.uk/) board is actually a tablet board. It comes with 4 GB of on-board storage, 802.11n Wi-Fi, HDMI, and a microSD card slot. Android 4.0 "Ice Cream Sandwich" is officially supported.
+
+## Hackberry
+
+The [Hackberry](https://www.miniand.com/products/Hackberry%20A10%20Developer%20Board) development board comes with 1 GB of DDR3 RAM, 4 GB of NAND flash storage, a full-size SDHC card slot, Fast Ethernet, USB host ports, built-in 802.11n Wi-Fi, HDMI output and can be had for 65 USD.

diff --git a/Lima+ISA.mdwn b/Lima+ISA.mdwn
index 8de419e..acc5951 100644
--- a/Lima+ISA.mdwn
+++ b/Lima+ISA.mdwn
@@ -541,8 +541,6 @@ Instruction format:
         7 - max/logical or (a || b)
         note: abs(a) is implemented as max(a, -a)
     86-89: Complex OpCode
-        For complex functions (rcp, sqrt, etc.), the inputs to the multiply ALU0 and
-        the input to the complex ALU are the same value.
         0 - unused
         2 - exp2 (Partial)
         3 - log2 (Partial)