Benford's law refers to probability distributions that seem to govern the significant digits in real data sets. The law is named for the American physicist and engineer Frank Benford, although the law
was actually discovered earlier by the astronomer and mathematician Simon Newcomb.
To understand Benford's law, we need some preliminaries. Recall that a positive real number \(x\) can be written uniquely in the form \(x = y \cdot 10^n\) (sometimes called scientific notation) where \(y \in \left[\frac{1}{10}, 1\right)\) is the mantissa and \(n \in \Z\) is the exponent (both of these terms are base 10, of course). Note that \[ \log(x) = \log(y) + n \] where the logarithm function is the base 10 common logarithm instead of the usual base \(e\) natural logarithm. In the old days BC (before calculators), one would compute the logarithm of a number by looking up the logarithm of the mantissa in a table of logarithms, and then adding the exponent. Of course, these remarks apply to any base \(b \gt 1\), not just base 10. Just replace 10 with \(b\) and the common logarithm with the base \(b\) logarithm.
Suppose now that \(X\) is a number selected at random from a certain data set of positive numbers. Based on empirical evidence from a number of different types of data, Newcomb, and later Benford, noticed that the mantissa \(Y\) of \(X\) seemed to have distribution function \(F(y) = 1 + \log(y)\) for \(y \in \left[\frac{1}{10}, 1\right)\). We will generalize this to an arbitrary base \(b \gt 1\).
Random variable \( Y \) has the Benford mantissa distribution with base \( b \in (1, \infty) \), if \( Y \) has a continuous distribution on \(\left[\frac{1}{b}, 1\right)\) with distribution function \( F \) given by \[ F(y) = 1 + \log_b(y), \quad \frac{1}{b} \le y \lt 1 \]
Note that \( F \) is continuous and strictly increasing on \( \left[\frac{1}{b}, 1\right) \) with \(F\left(\frac{1}{b}\right) = 0 \) and \( F(1) = 1 \).
In the special case \( b = 10 \), \( Y \) has the standard Benford mantissa distribution.
\( Y \) has probability density function \( f \) given by \[ f(y) = \frac{1}{y \ln(b)}, \quad \frac{1}{b} \le y \lt 1 \]
These results follow from the distribution function \( F \) above and standard calculus. Recall that \( f = F^\prime \).
Open the Special Distribution Simulator and select the Benford mantissa distribution. Vary the base \( b \) and note the shape of the probability density function. For various values of \( b \), run the simulation 1000 times and compare the empirical density function to the probability density function.
The quantile function \( F^{-1} \) is given by \[ F^{-1}(p) = \frac{1}{b^{1-p}}, \quad p \in [0, 1] \]
The formula for \( F^{-1}(p) \) follows by solving \( F(x) = p \) for \( x \) in terms of \( p \).
Open the special distribution calculator and select the Benford mantissa distribution. Vary the base and note the shape and location of the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.
Assume again that \( Y \) has the Benford mantissa distribution with base \( b \in (1, \infty) \).
The moments of \( Y \) are \[ \E\left(Y^n\right) = \frac{b^n - 1}{n b^n \ln(b)}, \quad n \in (0, \infty) \]
For \( n \gt 0 \), \[ \E\left(Y^n\right) = \int_{1/b}^1 y^n \frac{1}{y \ln(b)} dy = \frac{1}{\ln(b)} \int_{1/b}^1 y^{n-1} dy = \frac{1 - 1 \big/ b^n}{n \ln(b)} \]
Note that for fixed \( n \gt 0 \), \( \E(Y^n) \to 1 \) as \( b \downarrow 1 \) and \( \E(Y^n) \to 0 \) as \( b \to \infty \). We will learn more about the below.
The mean and variance of \( Y \) are
In the Special Distribution Simulator, select the Benford mantissa distribution. Vary the base \( b \) and note the size and location of the mean \( \pm \) standard deviation bar. For selected values of \( b \), run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The Benford mantissa distribution has the usual connections to the standard uniform distribution by means of the distribution and quantile functions.
Suppose that \( b \in (1, \infty) \).
Since the quantile function has a simple closed form, the Benford mantissa distribution can be simulated using the random quantile method.
Open the random quantile experiment and select the Benford mantissa distribution. Vary the base \( b \) and note again the shape and location of the distribution and probability density functions. For selected values of \( b \), run the simulation 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
Also of interest, of course, are the limiting distributions of \( Y \) with respect to the base \( b \).
The Benford mantissa distribution with base \( b \in (1, \infty) \) converges to
Note that the distribution function of \( Y \) above can be written as \( F(y) = 1 + \ln(y) \big/ \ln(b) \) for \( 1/b \le y \lt 1 \), and of course we also have \( F(y) = 0 \) for \( y \lt 1/b \) and \( F(y) = 1 \) for \( y \ge 1 \).
Since the probability density function is bounded on a bounded support interval, the Benford mantissa distribution can also be simulated via the rejection method.
Open the rejection method experiment and select the Benford mantissa distribution. Vary the base \( b \) and note again the shape and location of the probability density functions. For selected values of \( b \), run the simulation 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
Assume now that the base is a positive integer \(b \in \{2, 3, \ldots\}\), which of course is the case in standard number systems. Suppose that the sequence of digits of our mantissa \(Y\) (in base \(b\)) is \(\left(N_1, N_2, \ldots\right)\), so that \[ Y = \sum_{k=1}^\infty \frac{N_k}{b^k} \] Thus, our leading digit \(N_1\) takes values in \(\{1, 2, \ldots, b - 1\}\), while each of the other significant digits takes values in \(\{0, 1, \ldots, b - 1\}\). Note that \( \left(N_1, N_2, \ldots\right) \) is a stochastic process so at least we would like to know the finite dimensional distributions. That is, we would like to know the joint probability density function of the first \(k\) digits for every \( k \in \N_+ \). But let's start, appropriately enough, with the first digit law. The leading digit is the most important one, and fortunately also the easiest to analyze mathematically.
\( N_1 \) has probability density function \( g_1 \) given by \(g_1(n) = \log_b \left(1 + \frac{1}{n} \right) = \log_b(n + 1) - \log_b(n)\) for \(n \in \{1, 2, \ldots, b - 1\}\). The density function \( g_1 \) is decreasing and hence the mode is \( n = 1 \).
Note that \(N_1 = n\) if and only if \(\frac{n}{b} \le Y \lt \frac{n + 1}{b}\) for \( n \in \{1, 2, \ldots, b - 1\} \). Hence using the PDF of \( Y \) above, \[ \P(N_1 = n) = \int_{n/b}^{(n+1)/b} \frac{1}{y \ln(b)} dy = \log_b\left(\frac{n+1}{b}\right) - \log_b\left(\frac{n}{b}\right) = \log_b(n + 1) - \log_b(n) \]
Note that when \( b = 2 \), \( N_1 = 1 \) deterministically, which of course has to be the case. The first significant digit of a number in base 2 must be 1
In the Special Distribution Simulator, select the Benford first digit distribution. Vary the base \( b \) with the input control and note the shape of the probability density function. For various values of \( b \), run the simulation 1000 times and compare the empirical density function to the probability density function.
\( N_1 \) has distribution function \( G_1 \) given by \( G_1(x) = \log_b \left(\lfloor x \rfloor + 1\right) \) for \( x \in [1, b - 1] \).
Using the PDF above, note that \[ G_1(n) = \sum_{k=1}^n [\log_b(k + 1) - \log_b(k)] = \log_b(n + 1), \quad n \in \{1, 2, \ldots, b - 1\} \] More generally, \( G_1(x) = G_1 \left(\lfloor x \rfloor\right) \) for \( x \in [1, b - 1] \)
\( N_1 \) has quantile function \( G_1^{-1} \) given by \( G_1^{-1}(p) = \left\lceil b^p - 1\right\rceil \) for \( p \in (0, 1] \).
As usual, the formula for \( G_1^{-1}(p) \) follows from the CDF above, by solving \( p = G(x) \) for \( x \) in terms of \( p \).
Open the special distribution calculator and choose the Benford first digit distribution. Vary the base and note the shape and location of the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.
For the most part the moments of \( N_1 \) do not have simple expressions. However, we do have the following result for the mean.
\( \E(N_1) = (b - 1) - \log_b[(b - 1)!] \).
From the PDF result above and using standard properties of the logarithm, \[ \E(N_1) = \sum_{n=1}^{b-1} n \log_b\left(\frac{n + 1}{n}\right) = \log_b\left[\prod_{n=1}^{b-1} \left(\frac{n + 1}{n}\right)^n\right] \] The product in the displayed equation simplifies to \( b^{b - 1} / (b - 1)! \), and the base \( b \) logarithm of this expression is \( (b - 1) - \log_b[(b - 1)!] \).
Opne the Special Distribution Simulator and select the Benford first digit distribution. Vary the base \( b \) with the input control and note the size and location of the mean \( \pm \) standard deviation bar. For various values of \( b \), run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation..
Since the quantile function has a simple, closed form, the Benford first digit distribution can be simulated via the random quantile method.
Open the random quantile experiment and select the Benford first digit distribution. Vary the base \( b \) and note again the shape and location of the probability density function. For selected values of the base, run the experiment 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
Now, to compute the joint probability density function of the first \(k\) significant digits, some additional notation will help.
If \(n_1 \in \{1, 2, \ldots, b - 1\}\) and \(n_j \in \{0, 1, \ldots, b - 1\}\) for \(j \in \{2, 3, \ldots, k\}\), let \[ [n_1 \, n_2 \, \cdots \, n_k]_b = \sum_{j=1}^k n_j b ^{k - j} \]
Of course, this is just the base \(b\) version of what we do in our standard base 10 system: we represent integers as strings of digits between 0 and 9 (except that the first digit cannot be 0). Here is a base 5 example: \[ [324]_5 = 3 \cdot 5^2 + 2 \cdot 5^1 + 4 \cdot 5^0 = 89 \]
The joint probability density function \( h_k \) of \( (N_1, N_2, \ldots, N_k) \) is given by
\[ h_k\left(n_1, n_2, \ldots, n_k\right) = \log_b \left( 1 + \frac{1}{[n_1 \, n_2 \, \cdots \, n_k]_b} \right), \quad n_1 \in \{1, 2, \ldots, b - 1\}, (n_2, \ldots, n_k) \in \{2, \ldots, b - 1\}^{k-1} \]Note that \(\{N_1 = n_1, N_2 = n_2, \ldots, N_k = n_k\} = \{l \le Y \lt u \}\). where \[ l = \frac{[n_1 \, n_2 \, \cdots, n_k]_b}{b^k}, \; u = \frac{[n_1 \, n_2 \, \cdots \, n_k]_b + 1}{b^k} \] Hence using the PDF of \( Y \) above and properties of logarithms, \[ h_k\left(n_1, n_2, \ldots, n_k\right) = \int_l^u \frac{1}{y \ln(b)} dy = \log_b(u) - \log_b(l) = \log_b\left([n_1 \, n_2 \, \cdots \, n_k]_b + 1\right) - \log_b\left([n_1 \, n_2 \, \cdots, n_k]_b\right) \]
Of course, the probability density function of a given digit can be obtained by summing the joint probability density over the unwanted digits in the usual way. However, except for the first digit, these functions do not reduce to simple expressions.
The probability density function \( g_2 \) of \( N_2 \) is given by \[ g_2(n) = \sum_{k=1}^{b-1} \log_b \left(1 + \frac{1}{[k \, n]_b} \right) = \sum_{k=1}^{b-1} \log_b \left(1 + \frac{1}{k \, b + n} \right), \quad n \in \{0, 1, \ldots, b - 1\} \]
Aside from the empirical evidence noted by Newcomb and Benford (and many others since), why does Benford's law work? For a theoretical explanation, see the article A Statistical Derivation of Significant Digit Law, by Ted Hill.
In the following exercises, suppose that \( Y \) has the standard Benford mantissa distribution (the base 10 decimal case), and that \( \left(N_1, N_2, \ldots\right) \) are the digits of \( Y \).
Find each of the following for the mantissa \( Y \)
For \( N_1 \), find each of the following numerically
\(n\) | \(\P(N_1 = n)\) |
---|---|
1 | 0.3010 |
2 | 0.1761 |
3 | 0.1249 |
4 | 0.0969 |
5 | 0.0792 |
6 | 0.0669 |
7 | 0.0580 |
8 | 0.0512 |
9 | 0.0458 |
Explicitly compute the values of the joint probability density function of \((N_1, N_2)\).
\(\P(N_1 = n_1, N_2 = n_2)\) | \(n_1 = 1\) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
\(n_2 = 0\) | 0.0414 | 0.0212 | 0.0142 | 0.0107 | 0.0086 | 0.0072 | 0.0062 | 0.0054 | 0.0048 |
1 | 0.0378 | 0.0202 | 0.0138 | 0.0105 | 0.0084 | 0.0071 | 0.0061 | 0.0053 | 0.0047 |
2 | 0.0348 | 0.0193 | 0.0134 | 0.0102 | 0.0083 | 0.0069 | 0.0060 | 0.0053 | 0.0047 |
3 | 0.0322 | 0.0185 | 0.0130 | 0.0100 | 0.0081 | 0.0068 | 0.0059 | 0.0052 | 0.0046 |
4 | 0.0300 | 0.0177 | 0.0126 | 0.0098 | 0.0080 | 0.0067 | 0.0058 | 0.0051 | 0.0046 |
5 | 0.0280 | 0.0170 | 0.0122 | 0.0092 | 0.0078 | 0.0066 | 0.0058 | 0.0051 | 0.0045 |
6 | 0.0263 | 0.0164 | 0.0119 | 0.0093 | 0.0077 | 0.0065 | 0.0057 | 0.0050 | 0.0045 |
7 | 0.0248 | 0.0158 | 0.0116 | 0.0091 | 0.0076 | 0.0064 | 0.0056 | 0.0050 | 0.0045 |
8 | 0.0235 | 0.0152 | 0.0113 | 0.0090 | 0.0074 | 0.0063 | 0.0055 | 0.0049 | 0.0044 |
9 | 0.0223 | 0.0147 | 0.0110 | 0.0088 | 0.0073 | 0.0062 | 0.0055 | 0.0049 | 0.0044 |
For \( N_2 \), find each of the following numerically
\(n\) | \(\P(N_2 = n)\) |
---|---|
0 | 0.1197 |
1 | 0.1139 |
2 | 0.1088 |
3 | 0.1043 |
4 | 0.1003 |
5 | 0.0967 |
6 | 0.0934 |
7 | 0.0904 |
8 | 0.0876 |
9 | 0.0850 |
Comparing the result above for \( N_1 \) and the result above for \( N_2 \), note that the distribution of \(N_2\) is flatter than the distribution of \(N_1\). In general, it turns out that distribution of \(N_k\) converges to the uniform distribution on \(\{0, 1, \ldots, b - 1\}\) as \(k \to \infty\). Interestingly, the digits are dependent.
\(N_1\) and \(N_2\) are dependent.
This result follows from the joint PDF, the marginal PDF of \( N_1 \), and the marginal PDF of \( N_2 \) above.
Find each of the following.