
A way too serious look at the Weissman score

Over the past few days I’ve been rewatching Silicon Valley. Something moved in me; maybe I’ve gotten nostalgic for the show, or maybe I just remembered how much I enjoyed it. In the first season - which I think is the most iconic one - the final episode features a memorable scene where Richard manages to create a “middle-out” compression algorithm that shatters the theoretical limit as measured by the Weissman score. I wanted to have some fun by looking at this as if it were super serious, as if it weren’t a fictional show but a documentary. A magical number, 2.9, is referenced as a limit on how small files can really get. Of course the protagonist surpasses that limit and achieves a score of 5.2, instantly creating a potential billion-dollar company. But does the logic behind the script hold any water? What even is the Weissman score, and why hadn’t we heard of it before the show?

What does it measure?

The first part of “Weissman score” is the surname of the Stanford professor who created the metric for the show. His equation is even shown on screen during the finale of the first season, so we have something to work with. We can write it down as

$$ W = \alpha \frac{r}{\bar{r}} \frac{\log \bar{T}}{\log T} $$

Where:

  • $W$ = Weissman score
  • $\alpha$ = constant
  • $r$ = compression ratio of the algorithm being benchmarked
  • $\bar{r}$ = compression ratio of the baseline algorithm
  • $T$ = time taken to compress using the benchmarked algorithm
  • $\bar{T}$ = time taken to compress using the baseline

Let’s focus on what this essentially means. We have some algorithm that lets us compress data, and we want to benchmark it to know how good it is. To do that we pick some baseline algorithm to compare against; most likely it’s going to be gzip. With that in mind, we can spot the first issue in the show’s script.
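To make the comparison concrete, here is a minimal sketch of the formula in Python. The `weissman` helper name is my own, and the gzip-like numbers below are illustrative assumptions, not official values:

```python
import math

def weissman(r, r_bar, t, t_bar, alpha=1.0):
    """Weissman score of a benchmarked compressor against a baseline.

    r, t         -- compression ratio and time of the algorithm under test
    r_bar, t_bar -- compression ratio and time of the baseline (e.g. gzip)
    alpha        -- the undefined scaling constant, taken as 1 here
    """
    return alpha * (r / r_bar) * (math.log(t_bar) / math.log(t))

# Sanity check: any algorithm benchmarked against itself scores exactly alpha.
print(weissman(6.18, 6.18, 0.45, 0.45))  # → 1.0
```

Note that the base of the logarithm doesn’t matter, since only the ratio of the two logs enters the score.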

It’s not absolute

The score itself is given without any context. This is an issue because the score is comparative in nature, not absolute. It’s not saying “I’m traveling at 300 km/h”; it’s saying “I’m traveling 10x faster than X is”. Once we understand this, the claim of 2.9 being the theoretical limit becomes weird. I think the show is trying to create something like Shannon entropy, but for algorithms instead of data.

The equation is really messy

Constants in equations are not desirable because they mean the result can be anything you want: just change the constant. Of course there are famous constants like the Boltzmann constant, but those seem to be inherent in the fabric of the universe, just like the speed of light ($c$). Meanwhile $\alpha$ is not defined, and even the examples in the Wikipedia article assume it’s one. If you want the score to represent the improvement between algorithms, everyone needs to agree on the same constant. Ideally it’s going to be memorable, like the number one - in which case you can just drop it from the equation. Also, why is the time inside a log? Why not square the ratio? And where is the decompression speed in this equation?

Breaking 5.2 ourselves

I have a JSON file that is 29MB; let’s beat Pied Piper’s performance ourselves with cp. The time it takes on my computer to copy that file is 4ms. The compression ratio is of course 1, because we didn’t compress anything. Now let’s use gzip: it takes 450ms to achieve a ratio of 6.1804 - quite high, but I’ll get to that later too.

$$ W = \frac{1}{6.1804} \frac{\log 0.45s}{\log 0.004s} = 0.023 $$

Womp, womp! We got crushed, nooo! But wait, let me just rearrange the numbers a little bit because these seconds are really small.

$$ W = \frac{1}{6.1804} \frac{\log 450ms}{\log 4ms} = 0.713 $$

Huh? Our patented compression algorithm cp just got over an order of magnitude better by nothing more than a change of timescale? What if we just…

$$ W = \frac{1}{6.1804} \frac{\log (0.45s \cdot 260)}{\log (0.004s \cdot 260)} = 19.65 $$

OH MY! WE JUST WON TECH CRUNCH DISRUPT WITH OUR COPY OUT ALGORITHM!
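The unit dependence is easy to reproduce. Here is a small sketch, reusing the same illustrative `weissman` helper (my own naming, not a standard function) with the cp and gzip measurements quoted above, evaluated in three different time units:

```python
import math

def weissman(r, r_bar, t, t_bar, alpha=1.0):
    # Both times must be in the same unit, yet the score
    # still depends on which unit we happen to pick.
    return alpha * (r / r_bar) * (math.log(t_bar) / math.log(t))

# cp (ratio 1, 4ms) vs gzip (ratio 6.1804, 450ms), in three units:
for scale, unit in [(1, "s"), (1_000, "ms"), (1_000_000, "µs")]:
    w = weissman(1, 6.1804, 0.004 * scale, 0.45 * scale)
    print(f"{unit:>2}: W = {w:.3f}")
```

Same two measurements, three different scores - and nothing about either algorithm changed.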

It’s not absolute part 2

Not only is the score relative to the algorithm we use as a baseline, it’s also relative to the data we actually compress. I have a video file that takes up around 583MB when stored as raw frames. Compressing with gzip at default settings reduces it to 84MB in 6.3 seconds. But using the Pied Piper of our times, I can call on ffmpeg to compress it losslessly with libx264 in 0.3 seconds and store it in 31MB. Let’s check the Weissman score without playing any tricks with the time.

$$ W = \frac{18.806}{6.940} \frac{\log 6.3s}{\log 0.3s} = -4.14 $$

HUH? What does negative even mean? I thought ffmpeg was going to destroy gzip here; I’ll touch on what happened later. The point I want to make is that different data will give you different scores. If I ran ffmpeg on a text file, of course it would produce trash. This goes back to the fact that there is no best generic compressor. H264 exploits patterns in the data to extract as much compression as possible while being 20 times faster than gzip. But if I switch to even a different video file, the score will change, because the data changed.

What does it mean that it’s negative?

If we want to use the score to say that this algorithm is x times better on this set of data, it still misses the mark. That’s because if we make our compressor too fast, the score starts to decline. The time taken by the compressor we are benchmarking sits in the denominator, inside a logarithm, which means that to maximize the score we want to be fast - but not too fast. The logarithm gives negative results for values less than one. For me this is the biggest point proving that the Weissman score is nothing more than a fun equation made for the show. It’s as if the creator of the equation tried really hard to make something that passes the sniff test but in reality is next to useless. Imagine that you are tasked with working on a compressor and you notice your scores going up and up. Motivated by this, you optimize it so hard that you expect the best score ever seen - only to be even more confused when the score you trust now tells you that the thing you wrote scores negatively against the baseline, as if it were inflating files instead of shrinking them. All because you tried too hard and *pop*, you passed the magical barrier of being too fast.
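The sign flip is easy to see numerically: the score diverges as the benchmarked time approaches one unit, then flips sign the moment it drops below it. A sketch, using the gzip numbers from the video example as a fixed baseline and a hypothetical compressor (invented for illustration) that keeps the same ratio while getting faster:

```python
import math

def weissman(r, r_bar, t, t_bar, alpha=1.0):
    return alpha * (r / r_bar) * (math.log(t_bar) / math.log(t))

# Baseline: gzip-like, ratio 6.94 in 6.3s. Benchmark: a hypothetical
# compressor with a constant ratio of 18.8 that keeps getting faster.
for t in [2.0, 1.1, 1.001, 0.999, 0.5, 0.3]:
    print(f"t = {t:>5}s  W = {weissman(18.8, 6.94, t, 6.3):+10.2f}")
```

Around t = 1s the score explodes toward plus infinity, then reappears from minus infinity: shaving two milliseconds off the compression time takes you from the best score in history to the worst.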

Real world Weissman scores

I decided it’s only fitting to try to use the Weissman score on something real. The setup is as follows: using zstd, I compressed the same 29MB text file and measured how long it took across 19 different compression levels. Then I computed the ratios of the resulting files and built a matrix that shows the Weissman score between any pair of configurations. The matrix is not mirrored across the diagonal; to understand why, we need to see what each cell represents. The first column shows the scores when the first compression level is used as the benchmark and each of the other levels serves as the baseline. I also had to multiply the times by 1000 to work in milliseconds, otherwise we would get negative results. Personally, I don’t see much value in these scores, which further reinforces my position that the Weissman score is not that useful.
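For reference, this is roughly how such a matrix can be computed. The per-level (time, ratio) numbers below are made-up placeholders, not my actual measurements; the times are already in milliseconds so every logarithm stays positive:

```python
import math

def weissman(r, r_bar, t, t_bar, alpha=1.0):
    return alpha * (r / r_bar) * (math.log(t_bar) / math.log(t))

# Hypothetical (time_ms, ratio) per zstd level - placeholder values only.
levels = {1: (90, 4.8), 5: (220, 5.6), 12: (1400, 6.2), 19: (9000, 6.9)}

# Cell [i][j]: level i as the benchmark, level j as the baseline.
matrix = {
    i: {j: weissman(r_i, r_j, t_i, t_j) for j, (t_j, r_j) in levels.items()}
    for i, (t_i, r_i) in levels.items()
}

# The matrix is not mirrored: swapping benchmark and baseline gives
# the reciprocal of the score, not the same value.
for i, row in matrix.items():
    print(i, {j: round(w, 3) for j, w in row.items()})
```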

[Figure: matrix of Weissman scores between pairs of zstd compression levels]