T64 compression
I learned about the T64 (Transpose 64x64) compression algorithm while reading the ClickHouse docs. My assumption is that it was created in-house, because it's the first time I've seen any reference to it; the overall approach and idea is actually quite old, I'm specifically referring to the T64 variant here. It's so simple that I'm going to explain it with examples.
How does it work?
Choose the size of the matrix you're going to be transposing; for this example we're going to choose 8x8 (the ClickHouse version uses a scaled-up 64x64).
Each cell in the matrix is going to be represented by a single bit - a bit matrix.
Create the matrix out of 8 consecutive 8-bit (byte) values.
00011110
00000011
00010101
00000111
00001011
00010011
00011001
00001110
You can see that for all of these values the leading 3 bits are zero. It's a shame, because we know that 99% of the time we need to store values that require at most 5 bits. We also know that in very rare situations we need to store up to 8 bits of information. The last thing we know is that even though the values are small most of the time, they are very unpredictable (random). If that weren't the case, we could try to compress the deltas between elements, which for predictable data results in heavy compression. So how can we compress this data with an acceptable ratio while keeping compression/decompression stupid fast? Well, what if we take the matrix and flip it along its diagonal (transpose it), like so.
00000000
00000000
00000000
10100110
10001011
10110001
11011101
01111110
Now we can just store each row as a separate byte and omit the rows that are all zeroes.
We also need to store how many rows are non-zero; for this we are going to need 4 bits.
In our example we store 5 * 8 + 4 = 44 bits - about 69% of the uncompressed 64 bits.
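To make the bookkeeping concrete, here is a minimal sketch of this encoding for the 8x8 case. The names (`Compressed8x8`, `transpose8x8`, `compress8x8`) and the struct layout are mine, chosen for illustration - this is not ClickHouse's actual on-disk format.

```zig
// Hypothetical illustration of the scheme above; rows are stored one byte
// each, with the most significant bit being column 0.
const Compressed8x8 = struct {
    non_zero_rows: u4, // the 4-bit row count from the text
    rows: [8]u8, // only the first `non_zero_rows` entries are meaningful
};

fn transpose8x8(rows: [8]u8) [8]u8 {
    var out = [_]u8{0} ** 8;
    for (rows, 0..) |row, r| {
        for (0..8) |c| {
            const bit: u8 = (row >> @intCast(7 - c)) & 1;
            out[c] |= bit << @intCast(7 - r);
        }
    }
    return out;
}

fn compress8x8(rows: [8]u8) Compressed8x8 {
    const t = transpose8x8(rows);
    // The leading all-zero rows of the transposed matrix are exactly the
    // unused high bits shared by every input value, so we drop them.
    var first_non_zero: usize = 0;
    while (first_non_zero < 8 and t[first_non_zero] == 0) {
        first_non_zero += 1;
    }
    var result = Compressed8x8{
        .non_zero_rows = @intCast(8 - first_non_zero),
        .rows = [_]u8{0} ** 8,
    };
    @memcpy(result.rows[0 .. 8 - first_non_zero], t[first_non_zero..]);
    return result;
}
```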
Decompression is reconstructing the matrix and transposing it one more time, since
$$ (A^{T})^{T} = A $$
Reconstructing the matrix is easy: we just pre-fill the dropped leading rows with zeroes and copy the stored rows after them. One nice characteristic of this compression is that it's a block compression - we can start decompressing basically in the middle of the stream for free.
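For the same made-up format as above, decompression is just the reverse - again only a sketch:

```zig
fn decompress8x8(c: Compressed8x8) [8]u8 {
    var t = [_]u8{0} ** 8; // the dropped leading rows were all zeroes
    const zero_rows = 8 - @as(usize, c.non_zero_rows);
    @memcpy(t[zero_rows..], c.rows[0..c.non_zero_rows]);
    // Transposing a second time restores the original 8 values.
    return transpose8x8(t);
}
```

Round-tripping the example block from above through `compress8x8` and `decompress8x8` gives back the original 8 bytes, and since every block is self-contained you can decode one without touching its neighbours.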
Is this viable?
Now that we know this approach is capable of compressing data, the biggest question to answer is: what do we need to pay to get it? How expensive is it to transpose a bit matrix? This has been the question on top of my mind ever since I learned about this technique. Almost never do you have to swap or manipulate more than a few bits at a time, and even when you do, you don't usually have dependencies that require reading different bytes just to get a single bit value to insert. Let's think about it from first principles.
We are going to assume that our matrix is square, $N \times N$, with $N^2$ elements. To compute the transpose we have to set the element at $(x, y)$ to the value found at $(y, x)$. We can observe that the two assignments for $(x, y)$ and $(y, x)$ simplify to a single swap. This means we are going to have $N^2$ assignments or $\frac{N^2}{2}$ swaps. Now the question becomes: how many cycles do we have to pay to execute a bit swap? It's also interesting to note that there is a trade-off here: the runtime cost scales as $O(N^2)$, but at the same time, as $N$ grows, the average compression ratio increases if our data is compressible by this approach.
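Plugging in the block size we actually care about, $N = 64$:

$$
N^2 = 64^2 = 4096 \text{ bit assignments} \quad\text{or}\quad \frac{N^2}{2} = 2048 \text{ bit swaps per block}
$$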
Simple approach
The first solution is going to be naive: just write code that builds the new matrix bit by bit.
```zig
const std = @import("std");

pub fn transpose64Naive(matrix: *[64]u64) void {
    var temp = std.mem.zeroes([64]u64);
    for (matrix) |row| {
        // For every output column, shift in the current row's bit for that
        // column; after all 64 rows, temp[i] holds column i of the input.
        for (&temp, 0..) |*col, i| {
            col.* <<= 1;
            col.* |= @as(u1, @truncate(row >> @intCast(63 - i)));
        }
    }
    @memcpy(matrix, &temp);
}
```
This works of course, but if we benchmark it we get a result that is motivating but not impressive: 1.2 GB/s¹.
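As a sanity check (my own addition, not part of the original benchmark), the involution property $(A^{T})^{T} = A$ gives us a cheap property test; this assumes it lives in the same file as `transpose64Naive` and the `std` import above.

```zig
test "transpose64Naive is an involution" {
    var matrix: [64]u64 = undefined;
    // Fill with an arbitrary but deterministic bit pattern.
    for (&matrix, 0..) |*row, i| {
        row.* = @as(u64, i) *% 0x9E3779B97F4A7C15;
    }

    const original = matrix;
    transpose64Naive(&matrix);
    transpose64Naive(&matrix);
    try std.testing.expectEqualSlices(u64, &original, &matrix);
}
```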
Divide and conquer
Just like when looking for something better than bubble sort, let us turn to a divide and conquer approach. What we can notice is that matrix transposition is a self-similar problem. If we have a matrix of four elements, transposing it is a trivial operation:
$$
\begin{bmatrix}
A & B \\
C & D
\end{bmatrix}^T = \begin{bmatrix}
A & C \\
B & D
\end{bmatrix}
$$
You can notice that $A$ and $D$ didn't even move. Now we should ask ourselves: what if $A$ is also a matrix? Well, we need to transpose it too. But since we already know that our matrix is square and we can guarantee that $N$ is a power of two, we can just divide and conquer. At every level, all we have to do is swap the two off-diagonal sub-matrices and then recurse into each quadrant. The power of this approach is that we now operate at the word level instead of the bit level. At each level we need to loop through all $N$ rows, and we need to go down through $\log N$ levels of sub-matrices. So all in all, it's $O(N \log N)$.
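To make the word-level idea concrete, here is just the first level of that recursion written out by hand. This helper is my own illustration, not part of the final algorithm, and it uses the same bit convention as the naive version (bit 63 of a row is column 0).

```zig
// Treat the 64x64 bit matrix as a 2x2 grid of 32x32 blocks and swap the
// top-right block with the bottom-left one. Only the off-diagonal blocks
// move; all four blocks still get transposed internally by deeper levels.
fn swapOffDiagonalBlocks(matrix: *[64]u64) void {
    for (0..32) |i| {
        const top = matrix[i]; // low 32 bits hold the top-right block
        const bottom = matrix[i + 32]; // high 32 bits hold the bottom-left block
        matrix[i] = (top & 0xFFFFFFFF00000000) | (bottom >> 32);
        matrix[i + 32] = (bottom & 0x00000000FFFFFFFF) | (top << 32);
    }
}
```

The real implementation below repeats this idea for block sizes 32, 16, 8, 4, 2 and 1, using a mask so that every level still only costs a couple of word operations per row.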
Our quick analysis tells us that we can expect a $\frac{N^2}{N \log N} = \frac{N}{\log N}$ speed-up, which for $N = 64$ is around $\frac{64}{6} \approx 10.6$. This approach is actually called the Steele transpose, after Guy Steele. The actual implementation is a little wild, but it's brilliant how it all comes together, so I highly recommend trying to reimplement it yourself.
```zig
pub fn transpose64Steele(matrix: *[64]u64) void {
    // Swap 32x32 blocks first, then 16x16, ... down to single bits.
    comptime var mask: u64 = 0x00000000FFFFFFFF;
    comptime var j: u8 = 32;
    inline while (j > 0) : ({
        j >>= 1;
        mask ^= mask << @truncate(j);
    }) {
        comptime var k: u8 = 0;
        // Visit every row index whose j-bit is zero; its partner is k | j.
        inline while (k < 64) : (k = ((k | j) + 1) & ~j) {
            const l = k;
            const r = k | j;
            // Delta swap: exchange the masked bits of row l with the
            // corresponding bits of row r, shifted by j columns.
            const t = (matrix[l] ^ (matrix[r] >> @truncate(j))) & mask;
            matrix[l] ^= t;
            matrix[r] ^= (t << @truncate(j));
        }
    }
}
```
And in the benchmark we get around 7.73 GB/s, which is far more acceptable for something that will run on every decompression of our data.
Is there more?
You can push this further with SIMD and achieve almost another 2x over what we got above without much trouble. Even just playing with @Vector(S, T) types in Zig and relying on LLVM to understand what I wanted to do, I was able to reach 13.13 GB/s. Hand-rolled, intrinsics-based code will probably get a little more.
What about compression on real-world data?
I took data that represents token lengths after lexing a big source file. There are some big values, represented by U32_MAX (-1). Most of the tokens are small, but we need to be able to store BIG tokens, so we assume that a token can't be bigger than 4GB. In this exercise, assume that we have a ton of this data and we need to store it compressed but also be able to decompress it quickly. After compressing it with 32x32 matrices the final ratio was 10.7x. But be wary that this is basically perfect data for this compression method, as MOST tokens are very small. It will not work as well on random data!
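To see why this kind of data is so friendly, note that the number of rows a block has to keep is determined entirely by its largest value. A tiny sketch for the 64x64 variant (the helper name is mine):

```zig
// The leading all-zero rows of a transposed block are the high bits that are
// zero in every value, so the stored row count is 64 minus the number of
// leading zero bits of the OR of all values in the block.
fn storedRows(block: [64]u64) u7 {
    var any_bits: u64 = 0;
    for (block) |v| {
        any_bits |= v;
    }
    return 64 - @clz(any_bits);
}
```

A block full of small token lengths keeps only a handful of rows, while a single large outlier forces its whole block to keep many more - which is exactly why mostly-small data compresses so well here and random data doesn't.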
¹ All tests were run on an M2 Pro with 64x64 matrices.