Why 20% of Probability distributions show up 80% of the time
Why do I keep seeing Pareto distributions everywhere? The 80:20 rule is a decent heuristic, but it doesn’t tell you WHY those are the proportions. I eventually answered this for myself, and ended up learning a lot about both the foundations of probability and the frontier of complexity science.
So let’s start with the known unknowns.
Uncertainty is a Relation
Uncertainty is the opposite of determinism, right? It’s not. Uncertainty is a property of models but not of the systems themselves.
Consider the following: You walk into a room where I have my hand on the table, underneath which, I tell you, is a coin. I know it is definitely heads or definitely tails, but you don’t have enough information to determine that, so all you can express is probability.
Uncertainty isn't really a property of a system, so much as it's a property of a system and an observer. Stated differently, uncertainty is something that's in your head, not in the world.
While uncertainty is not a system property, chaos is. Any systems we use as pseudorandom number generators are chaotic. What does that mean, precisely? In common parlance, chaotic just means that something is hard to predict, or that there's a lot of noise.
In mathematics, chaos is a very specific property. The famous quote is “When the present determines the future but the approximate present does not approximately determine the future”.
A chaotic system is extremely sensitive to its initial conditions: very similar inputs can give very different outputs.
A classic non-chaotic example is a pendulum. If you pull the pendulum back a small amount, the swing length will increase according to nice, friendly laws of harmonic motion. A double pendulum, however, could act mostly like a single pendulum, do crazy loop-de-loops or more, depending on very small changes in how far back you pull it.
Keep in mind that this behavior is still fully determined. The loop repeats itself the same way every time. However, it requires a lot of expensive precision to know the starting conditions to the required level of detail, and a lot of computational power to predict the system evolution as time goes on. What I’m saying, for the programmers reading this, is that all random number generation is pseudorandom number generation.
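A minimal sketch of that sensitivity, using the logistic map as a stand-in chaotic system. The logistic map is an assumption chosen here for brevity (a double pendulum needs a numerical integrator), and `logistic_trajectory` is a hypothetical helper name. Two runs from the same start agree exactly, while a start that differs by one part in ten billion eventually ends up somewhere unrelated.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), a standard toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.2, 100)
b = logistic_trajectory(0.2 + 1e-10, 100)  # the "approximate present"
gaps = [abs(x - y) for x, y in zip(a, b)]

# Fully determined: the same start gives the same path, every time.
print("same start, same path:", a == logistic_trajectory(0.2, 100))
# Early on the two paths are indistinguishable...
print("gap after 5 steps:", gaps[5])
# ...but the tiny initial difference keeps growing until it dominates.
print("largest gap over 100 steps:", max(gaps))
```

Nothing in this code is random, yet without knowing the starting value to many digits you could not tell the output from noise.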
Let’s return to the coin. It’s chaotic, but not as chaotic as the double pendulum. In fact, multiple people have built deterministic coin-flipping machines. But, to me or you flipping a coin, it is chaotic enough to be unpredictable. This is still a known unknown though: flip it a bunch, and there will be some heads and some tails, in predictable proportions.
From Chaos to Probability
Even when we can’t predict the outputs of a chaotic system, we can still draw bounds on its unknowability. This is where probability distributions come into play. For example, a dreidel will land on one of four sides with equal probability. A coin will be either heads or tails.
Most gamblers’ physical pseudorandom number generators are chaotic, but use attractors to produce something useful. So, while a coin tumbling in air is obviously a chaotic system, the shape of the coin means that when it lands, it will either end up resting heads-up or tails-up. Similarly with a top, or dice, or a roulette table. (Divination methods such as reading chicken livers are more like uncovering the results of an unseen coin-flip, but let’s not get into it for now.)
There’s a sort of grand cosmic dance: Chaotic systems take simple input and produce output that is complex and unpredictable, but attractors rein in that chaos into something more regular and, in the long run, predictable.
Now, if you have a background in probability, you know that a coin flip is modeled by a Bernoulli distribution. Bernoulli trials don’t just model coin flips — they also model many other chaotic processes, such as bouncing a ball off a peg that sends it to either the left or right based on minor changes in trajectory. The ball-and-peg has two attractors: left and right.
Add up a bunch of these Bernoulli trials and you get the Binomial distribution. One famous physical model of this generative process is the Galton Board. It’s a chaotic process made up of multiple other chaotic processes. It has its own built-in attractors, just like the individual ball-and-peg trials it is composed of.
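Here is a small simulation sketch of a Galton Board, assuming each peg is an independent fair left/right bounce. The names `galton_board` and `binomial_pmf`, the board size, and the seed are all illustrative choices. Each ball's final bin is just the count of rightward bounces, which is why the bins fill up in Binomial proportions.

```python
import random
from math import comb

def galton_board(n_balls, n_rows, p_right=0.5, seed=0):
    """Drop n_balls through n_rows of pegs; return how many balls land in each bin."""
    rng = random.Random(seed)
    bins = [0] * (n_rows + 1)
    for _ in range(n_balls):
        # Each peg is one Bernoulli trial; the bin index is the number of rights.
        rights = sum(rng.random() < p_right for _ in range(n_rows))
        bins[rights] += 1
    return bins

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_balls, n_rows = 20_000, 10
bins = galton_board(n_balls, n_rows)
for k, count in enumerate(bins):
    print(f"bin {k:2d}: simulated {count / n_balls:.3f}   binomial {binomial_pmf(k, n_rows):.3f}")
```

The number of rows plays the role of the Binomial's n, while the number of balls only controls how closely the simulated frequencies track the formula.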
What happens as n, the number of peg rows each ball passes (or coin flips), becomes large [1]? Well, by the Central Limit Theorem, the limit (after centering and rescaling) is the Normal or Gaussian distribution.
If you're not familiar with the central limit theorem or the normal distribution, and the amazing reasons why it's so common in nature, you should check this video out. A key point it makes is that the mathematical operation of convolution is intrinsically related to the shape of the Normal distribution (and, of course, convolution is how you get the distribution of a sum of two independent random variables). There are a lot of ways to formulate the central limit theorem, but the way it works in my head is:
There are Normal distributions everywhere in nature. Why? Because there's a process: you add up a bunch of random coin-flip-type events (Bernoulli trials) into a Binomial distribution, and as you add more, the result eventually looks like a Normal distribution.
It gets better: You can add independent samples from almost any distribution [2], not just the Binomial distribution. As long as you add them together, rather than some other operation, they will form a Normal distribution. The important thing is the addition.
Imagine a multi-dimensional space describing different types of probability distributions. If you take something sampled from a Uniform and add another Uniform, you will get something kinda like Uniform, but eventually, you're going to start circling that drain of the Normal distribution. Once you're there, there's no path out (under addition of probability distributions with finite variance). So, much like “heads” is an attractor for the coin system, the Normal distribution is an attractor for the Galton Board system, and those like it.
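One rough way to watch the drain-circling, sketched under the assumption that excess kurtosis is an acceptable yardstick for "distance from Normal" (it is only one of many possible measures). A Uniform has excess kurtosis -1.2, a Normal has 0, and a sum of n independent Uniforms has -1.2 / n. The helper names are hypothetical.

```python
import random
import statistics

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; equals 0 for a Normal distribution."""
    mean = statistics.fmean(xs)
    var = statistics.fmean((x - mean) ** 2 for x in xs)
    fourth = statistics.fmean((x - mean) ** 4 for x in xs)
    return fourth / var**2 - 3.0

def sums_of_uniforms(n_terms, n_samples=20_000, seed=0):
    """Draw n_samples values, each the sum of n_terms independent Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]

for n_terms in (1, 2, 5, 30):
    estimate = excess_kurtosis(sums_of_uniforms(n_terms))
    print(f"{n_terms:2d} terms: excess kurtosis ~ {estimate:+.3f} (theory {-1.2 / n_terms:+.3f})")
```

The estimates are noisy, but the trend toward zero as terms are added is the attractor at work.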
So there's some intrinsic result of addition of random variables that ends up getting you closer and closer to a Normal distribution. And what happens if you add a Normal distribution to a Normal distribution? You get a Normal distribution. What happens if you add some other distribution to a Normal distribution? Probably a Normal distribution, (unless its variance is infinite or doesn’t exist). So to wrap that all up, a Normal distribution is what's called a stable fixed point in the space of probability distributions, under summation.
This is all pretty basic chaos theory. And pretty basic probability. But they're not often talked about together. So that's why they interest me. Because if you just apply some very basic chaos theory and some very basic probability, you develop a really strong intuition about where these things come from and why they're everywhere.
So the Normal distribution is an attractor under addition. What about other operations, such as multiplication?
Have You Tried Logarithms?
Normally-distributed data are common, but increasingly, they’re not the go-to in many areas of science. So we’ll turn our attention to two others. First up is the Lognormal distribution.
A very important fact about logarithms: If you take the logarithm of something, multiplication in the original space becomes addition in the logarithm space.
Take a second and think about what happens when you apply that multiplication/addition idea to the Central Limit Theorem.
If addition of random variables gets you a Normal distribution, multiplication of random variables gets you a Lognormal distribution. You can imagine all sorts of processes that involve multiplication of random variables. Gwern goes through a bunch here.
To repeat, when you multiply random variables together, you get a Lognormal distribution, because you're just adding them in the log space (assuming the variance exists and is finite).
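A quick sketch of the multiplication case, assuming factors drawn from Uniform(0.5, 1.5). That factor distribution and the name `random_products` are arbitrary choices for illustration. It relies on the identity log(a * b) = log(a) + log(b): the products themselves are heavily right-skewed (mean far above the median), while their logarithms are roughly symmetric, as a sum of many independent terms should be.

```python
import math
import random
import statistics

def random_products(n_factors, n_samples=10_000, seed=0):
    """Each sample is the product of n_factors independent positive random factors."""
    rng = random.Random(seed)
    return [
        math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
        for _ in range(n_samples)
    ]

products = random_products(n_factors=50)
logs = [math.log(x) for x in products]  # multiplication becomes addition here

print("raw space: mean", statistics.fmean(products), "median", statistics.median(products))
print("log space: mean", statistics.fmean(logs), "median", statistics.median(logs))
```

A gap between mean and median in the raw space that mostly closes in the log space is the signature of an approximately Lognormal variable.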
Lognormal distributions have a lot of handy properties. They don’t require “centering” operations to set the mean to zero, because they are not symmetric around 0 like the Normal. They also have fat tails, which capture events that are rare but far more likely than a Normal would suggest. This is in contrast with a Normal distribution, where truly exceptional values almost never happen. For these reasons and others, it’s been strongly argued that Lognormal should be the default choice for modeling “amounts”.
There are also generating models similar to the Galton Board for Lognormal distributions, notably random multiplication (such as principal invested in the stock market) or random splitting (such as asteroids colliding in outer space).
These distributions have a fat tail, but it's not fat enough for some people. They have high variance, but they don't have infinite variance. And that's one of the key things that distinguish them from a Pareto distribution.
Mandelbrot uses the metaphor of the blind archer to explain how different infinite variance is. It’s a tough concept that’s controversial in our finite world [3]. And it's true that there are a lot of ways you can mathematically massage Lognormals to get Pareto distributions [4]. So, for now, I want to just move on to Pareto distributions rather than trying to argue why one or the other is more correct in general, or for some particular set of data. If you’re interested in that debate I highly recommend the aptly-named “Scale-free Networks Well Done”.
Paretos Lost, Paretos Regained
Why are Pareto distributions so common? That was the question that motivated me. And I’m not exactly the first to ask it.
A funny historical note about these infinite-variance distributions is that they keep getting reinvented. People forget about them, or rediscover them in a new field. So Pareto, the original, came up with these looking at economic data in the late 1800s. Yule was looking at biological data in 1925. Zipf was looking at linguistic data in the 1930s. And Mandelbrot was looking at financial data in the 60s, and was a big popularizer of them. Nassim Nicholas Taleb, in the present day, is sort of Mandelbrot’s Bulldog. He’s a big proponent of infinite-variance distributions because, as shown by financial panics, finite-variance distributions underestimate risks.
Culturally speaking, we have a strange situation: these distributions “and their quantitative features [e.g. infinite moments] are generally taken at face value and considered as simultaneously ubiquitous, arcane, and exotic” (“More ‘Normal’ than Normal”, Willinger et al, 2004). So, let’s survey some of these exotic solutions, and then work our way back to what's actually kind of a pretty simple answer.
Infinite moments must arise from some exotic model, right? But what do we mean by models? Remember the Galton board. That’s one example of a generative model for a probability distribution. There are a huge number of such generative models for Pareto distributions, in the Complexity Science and Scale-Free Networks space.
The above demo I coded up shows how a preferential attachment process produces an increasingly Pareto-like distribution. This is a foundational model in the scale-free networks literature. For a review of these models I recommend M.E.J. Newman.
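For readers who want something runnable, here is a minimal sketch of one preferential attachment process. It assumes a Simon-style rule: each new unit either founds a new group with probability `p_new` or joins an existing group with probability proportional to that group's size. This is an illustration with hypothetical names and parameters, not necessarily the rule the demo uses.

```python
import random
import statistics
from collections import Counter

def preferential_attachment(n_steps, p_new=0.1, seed=0):
    """Grow groups one unit at a time; return group sizes, largest first."""
    rng = random.Random(seed)
    owners = [0]  # one entry per unit; the value is the group that owns it
    n_groups = 1
    for _ in range(n_steps - 1):
        if rng.random() < p_new:
            owners.append(n_groups)  # found a new group of size 1
            n_groups += 1
        else:
            # A uniformly random existing unit belongs to group g with
            # probability proportional to g's size: the rich get richer.
            owners.append(rng.choice(owners))
    return sorted(Counter(owners).values(), reverse=True)

def top_share(sizes, fraction=0.2):
    """Share of all units held by the largest `fraction` of groups (sizes sorted descending)."""
    k = max(1, int(len(sizes) * fraction))
    return sum(sizes[:k]) / sum(sizes)

sizes = preferential_attachment(50_000)
print("groups:", len(sizes))
print("largest group:", sizes[0], "median group:", statistics.median(sizes))
print("share held by the top 20% of groups:", round(top_share(sizes), 3))
```

Storing one list entry per unit makes the proportional-to-size choice a plain uniform pick, which keeps the sketch short at the cost of memory.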
After surveying these generative models, I would say they broadly fall into two categories: Random and Optimization. This also applies to the simple examples we started out with to build intuition for the Central Limit Theorem and Normal distribution. The Galton board, I’d argue, is random. It just lays out some simple rules and plays them out to show that they result in the Normal distribution. There are also some optimization approaches to the CLT, which show that the Normal maximizes entropy among all distributions with a given mean and variance.
There are so many of these exotic generative models, it’s really dizzying. You can spend an entire lifetime looking through them. There are generally both random and optimization versions of any given model. They have pretty awesome names: Random Graph Generation with Preferential Attachment, Edge Of Chaos, Self-Organized Criticality, Stochastic Splitting.
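The entropy claim can be spot-checked with closed-form differential entropies. The sketch below compares a Normal against two other distributions matched to the same standard deviation. Uniform and Laplace are picked here purely as comparison points, and the function names are hypothetical.

```python
import math

def entropy_normal(sigma):
    """Differential entropy (in nats) of a Normal with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def entropy_uniform(sigma):
    """A Uniform of width w has variance w**2 / 12, so w = sigma * sqrt(12)."""
    return math.log(sigma * math.sqrt(12))

def entropy_laplace(sigma):
    """A Laplace with scale b has variance 2 * b**2, so b = sigma / sqrt(2)."""
    return 1 + math.log(2 * sigma / math.sqrt(2))

for sigma in (0.5, 1.0, 3.0):
    print(
        f"sigma={sigma}: normal {entropy_normal(sigma):.4f}, "
        f"laplace {entropy_laplace(sigma):.4f}, uniform {entropy_uniform(sigma):.4f}"
    )
```

Comparing three distributions proves nothing in general, but it shows the ordering the maximum-entropy result predicts: at equal variance, the Normal comes out on top.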
I don’t want to get into the weeds, so I’m just going to skim over the most important example of this.
The classic debate, for me, is the debate over Zipf’s law (box a, word frequency, in the top-left corner of the log-log plots).
We have:
Shannon Entropy (optimization, Mandelbrot)
Preferential Attachment model (random, Simon)
Monkeys On Infinite Keyboards (random, Miller).
These models and the heated exchanges between their authors are all worth looking up. They’re not only arguing about the models themselves, but are likely trying to smuggle in metaphysical claims about the nature of the universe:
logos marshaling order from chaos, top-down (Mandelbrot)
emergence of order from randomness, bottom-up (Simon)
life’s a tale, typed by a monkey, signifying nothing (Miller).
There are a lot of these models, and a lot of ink has been spilled. Perhaps, someday, a ruthless science warlord will forge the True Model, the One Model, to Rule Them All. For the time being, we’ll have to put up with the squabbling, both of the greats like Mandelbrot and Simon, and of lesser men rolling around in the mud of Nature Communications, making desperate ploys to save some model or other, all while smuggling in metaphysics or crude sociological conclusions. But, you and I, dear reader, have no need for such muckery.
What do we see as we beneficently survey the landscape of battling models? Most of them are quite plausible. This is not a problem. This makes sense. You can describe vastly different physical systems with the same governing dynamics. Conversely, you can also draw analogies between physical and electrical systems. For example, it’s possible to model a DC motor as a mechanical object or as an electrical circuit. So, to sum up, most of them seem to be mostly right.
I still want something decisive though, something that makes me feel like I’m communing with the mind of God rather than tinkering with models. Something like the Central Limit Theorem, and how it shows the Normal distribution is a fixed point under addition. Maybe it’s been under our nose the whole time...
We return, yet again, to the Central Limit Theorem. It turns out there’s a generalization we can make.
There is a more general form of the CLT for which the Normal CLT is a special case. In fact, it’s a boundary case. Remember that the normal CLT only applies when you add random variables with finite variance? This more general form of the CLT applies to both finite- and infinite-variance distributions, describing a class called “Stable distributions”. Apart from the Normal, every member of this class has infinite variance and a Pareto-like tail, and sums of Pareto-distributed variables with tail exponent below 2 converge, after rescaling, to one of them. As Mandelbrot described, the Normal distribution is the well-behaved cousin of this family, with five intermediary distributions of increasing wildness until we get to the maximally-chaotic log-Cauchy distribution.
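To make "stable with infinite variance" concrete, here is a sketch using the standard Cauchy distribution, one well-known member of the Stable family. It is chosen here as an illustration and the helper names are hypothetical. Averaging 100 Normal draws shrinks the spread tenfold. Averaging 100 Cauchy draws gives back a standard Cauchy, so the spread does not shrink at all.

```python
import math
import random
import statistics

def iqr(xs):
    """Interquartile range: a spread measure that exists even when variance does not."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    # Inverse-CDF sampling of a standard Cauchy from a Uniform(0, 1) draw.
    return math.tan(math.pi * (rng.random() - 0.5))

def sample_means(draw, n_terms, n_trials=4_000, seed=0):
    """Repeat n_trials times: average n_terms independent draws."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n_terms)) for _ in range(n_trials)]

for name, draw in (("normal", draw_normal), ("cauchy", draw_cauchy)):
    single = iqr(sample_means(draw, 1))
    averaged = iqr(sample_means(draw, 100))
    print(f"{name}: IQR of one draw {single:.3f}, IQR of a 100-draw average {averaged:.3f}")
```

The interquartile range is used instead of the standard deviation because the Cauchy's variance is infinite, so a sample standard deviation would never settle down.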
Why Paretos Are So Common
These distributions aren’t only stable under aggregation (like the Normal), they are also stable under Maximization, Weighted Mixture, Marginalization, and all sorts of other transformations. They are stable fixed points, like the Normal and like heads-up. So, given the abundance of plausible generating models for fat-tailed distributions, combined with their status as attractors under most common transformations, it only makes sense that we see them so often in so many different situations.
[1]: If you take the limit a different way, you get the Poisson Limit Theorem.
[2]: The Lyapunov CLT, my preferred formulation, does not require identically distributed random variables, as long as they are independent, have finite variance, and satisfy a growth condition on a moment of order slightly above two.
[3]: There are serious problems that come from trying to gather evidence for infinite, rather than merely large, moments. This is because the data is necessarily finite. There has been some promising work using subsampling.
[4]: Search “lower reflective barrier” in this review paper.