\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\) \(\newcommand{\var}{\text{var}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\)

3. The Multivariate Hypergeometric Distribution

Basic Theory

The Multitype Model

Suppose that \(D\) is a population with \(m \in \N_+\) objects. Each object is one of \(k \in \N_+\) types, so that we have a multitype population. Let \(D_i\) denote the subset of all type \(i\) objects and let \(m_i = \#(D_i)\) for \(i \in \{1, 2, \ldots, k\}\). So \(D = \bigcup_{i=1}^k D_i\) and \(m = \sum_{i=1}^k m_i\). We sample \(n\) objects at random from \(D\). The outcome of the experiment is \(\bs{X} = (X_1, X_2, \ldots, X_n)\) where \(X_i \in D\) is the \(i\)th object chosen. The parameters of the model are \((m_1, m_2, \ldots, m_k)\) and \(n\).

For example, we could have an urn with balls of several different colors, or a population of voters who are democrat, republican, or independent. The dichotomous model considered earlier is clearly a special case, with \(k = 2\).

Let \(Y_i\) denote the number of type \(i\) objects in the sample, for \(i \in \{1, 2, \ldots, k\}\), so that \(\sum_{i=1}^k Y_i = n\) and \[ Y_i = \sum_{j=1}^n \bs{1}\left(X_j \in D_i\right) \]

Note that if we know the values of \(k - 1\) of the counting variables, we can find the value of the remaining counting variable. We assume initially that the sampling is without replacement, since this is the realistic case in most applications, and so \(n \in \{1, 2, \ldots, m\}\).

Distributions

Basic combinatorial arguments can be used to derive the probability density function of the random vector of counting variables. Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size \(n\) chosen from \(D\).

The probability density function of \((Y_1, Y_2, \ldots, Y_k)\) is given by \[ \P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{\binom{m_1}{y_1} \binom{m_2}{y_2} \cdots \binom{m_k}{y_k}}{\binom{m}{n}}, \quad (y_1, y_2, \ldots, y_k) \in \N^k \text{ with } \sum_{i=1}^k y_i = n \]

Details:

The binomial coefficient \(\binom{m_i}{y_i}\) is the number of unordered subsets of \(D_i\) (the type \(i\) objects) of size \(y_i\). The binomial coefficient \(\binom{m}{n}\) is the number of unordered samples of size \(n\) chosen from \(D\). Thus the result follows from the multiplication principle of combinatorics and the uniform distribution of the unordered sample.
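
As a numerical companion to this formula, here is a minimal Python sketch using only the standard library; the helper name `mv_hypergeom_pmf` and the example population are ours, chosen for illustration. (Recent versions of SciPy also provide `scipy.stats.multivariate_hypergeom`.)

```python
from math import comb, prod

def mv_hypergeom_pmf(y, m, n):
    """P(Y_1 = y_1, ..., Y_k = y_k) when n objects are sampled at random,
    without replacement, from a population with m[i] objects of type i."""
    if sum(y) != n or any(yi < 0 or yi > mi for yi, mi in zip(y, m)):
        return 0.0
    return prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(sum(m), n)

# Population with 2 red, 2 green, 1 blue; sample 3 without replacement
print(mv_hypergeom_pmf((1, 1, 1), (2, 2, 1), 3))  # 4/10 = 0.4
```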

The distribution of \((Y_1, Y_2, \ldots, Y_k)\) is the multivariate hypergeometric distribution with parameters \((m_1, m_2, \ldots, m_k)\), and \(n\). We also say that \((Y_1, Y_2, \ldots, Y_{k-1})\) has this distribution (recall again that the values of any \(k - 1\) of the variables determines the value of the remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to \(k = 2\).

An alternate form of the probability density function of \((Y_1, Y_2, \ldots, Y_k)\) is \[ \P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{(y_1)} m_2^{(y_2)} \cdots m_k^{(y_k)}}{m^{(n)}}, \quad (y_1, y_2, \ldots, y_k) \in \N^k \text{ with } \sum_{i=1}^k y_i = n \]

Details:

A combinatorial proof is to consider the ordered sample, which is uniformly distributed on the set of permutations of size \(n\) from \(D\). The multinomial coefficient on the right is the number of ways to partition the index set \(\{1, 2, \ldots, n\}\) into \(k\) groups where group \(i\) has \(y_i\) elements (these are the coordinates of the type \(i\) objects). The number of (ordered) ways to select the type \(i\) objects is \(m_i^{(y_i)}\). The denominator \(m^{(n)}\) is the number of ordered samples of size \(n\) chosen from \(D\).

There is also a simple algebraic proof, starting with the PDF above. Write each binomial coefficient as \(\binom{a}{j} = a^{(j)}/j!\) and rearrange a bit.
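
The agreement of the two forms is easy to check numerically. The following sketch (helper names and parameters ours) evaluates both at the same point:

```python
from math import comb, factorial, prod

def falling(a, j):
    """Falling power a^(j) = a (a - 1) ... (a - j + 1)."""
    return prod(a - i for i in range(j))

def multinom(n, y):
    """Multinomial coefficient n! / (y_1! y_2! ... y_k!)."""
    c = factorial(n)
    for yi in y:
        c //= factorial(yi)
    return c

m, n, y = (5, 3, 4), 6, (2, 1, 3)
binom_form = prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(sum(m), n)
falling_form = (multinom(n, y) * prod(falling(mi, yi) for mi, yi in zip(m, y))
                / falling(sum(m), n))
print(binom_form, falling_form)  # the two forms agree, about 0.12987
```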

For \(i \in \{1, 2, \ldots, k\}\), \(Y_i\) has the hypergeometric distribution with parameters \(m\), \(m_i\), and \(n\) \[ \P(Y_i = y) = \frac{\binom{m_i}{y} \binom{m - m_i}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\} \]

Details:

An analytic proof is possible, starting with either form of the joint PDF above and summing over the unwanted variables. However, a probabilistic proof is much better: \(Y_i\) is the number of type \(i\) objects in a sample of size \(n\) chosen at random (and without replacement) from a population of \(m\) objects, with \(m_i\) of type \(i\) and the remaining \(m - m_i\) not of this type.
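
A quick numerical check of the marginal, with illustrative parameters of our choosing: summing the joint PDF over the unwanted variables matches the ordinary hypergeometric formula.

```python
from math import comb

m, n = (4, 3, 5), 6          # three types
mtot = sum(m)

def joint(y1, y2, y3):
    if y1 + y2 + y3 != n:
        return 0.0
    return comb(m[0], y1) * comb(m[1], y2) * comb(m[2], y3) / comb(mtot, n)

for y in range(min(n, m[0]) + 1):
    # marginal of Y_1 by summing the joint over the other two counts
    summed = sum(joint(y, y2, n - y - y2) for y2 in range(n - y + 1))
    # ordinary hypergeometric with parameters mtot, m[0], n
    direct = comb(m[0], y) * comb(mtot - m[0], n - y) / comb(mtot, n)
    print(y, round(summed, 12), round(direct, 12))  # the two columns agree
```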

The multivariate hypergeometric distribution is preserved when the counting variables are combined.

Suppose that \((A_1, A_2, \ldots, A_l)\) is a partition of the index set \(\{1, 2, \ldots, k\}\) into nonempty, disjoint subsets. Let \(W_j = \sum_{i \in A_j} Y_i\) and \(r_j = \sum_{i \in A_j} m_i\) for \(j \in \{1, 2, \ldots, l\}\). Then \((W_1, W_2, \ldots, W_l)\) has the multivariate hypergeometric distribution with parameters \((r_1, r_2, \ldots, r_l)\), and \(n\).

Details:

Again, an analytic proof is possible, but a probabilistic proof is much better. Effectively, we now have a population of \(m\) objects with \(l\) types, and \(r_i\) is the number of objects of the new type \(i\). As before we sample \(n\) objects without replacement, and \(W_i\) is the number of objects in the sample of the new type \(i\).
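
The grouping result can be illustrated the same way. In this sketch (parameters ours), four types are merged into two groups, and the grouped count matches the ordinary hypergeometric PDF:

```python
from math import comb, prod
from itertools import product

m, n = (3, 4, 2, 5), 5       # four types, merged into groups {1, 2} and {3, 4}
mtot = sum(m)

def joint(y):
    return prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(mtot, n)

r1 = m[0] + m[1]             # objects of the first merged type
for w in range(n + 1):
    summed = sum(joint(y) for y in product(*(range(mi + 1) for mi in m))
                 if sum(y) == n and y[0] + y[1] == w)
    direct = comb(r1, w) * comb(mtot - r1, n - w) / comb(mtot, n)
    print(w, round(summed, 12), round(direct, 12))
```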

Note that the marginal distribution of \(Y_i\) above is a special case of grouping. We have two types: type \(i\) and not type \(i\). More generally, the marginal distribution of any subsequence of \((Y_1, Y_2, \ldots, Y_k)\) is multivariate hypergeometric, with the appropriate parameters. The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed.

Suppose that \((A, B)\) is a partition of the index set \(\{1, 2, \ldots, k\}\) into nonempty, disjoint subsets. Suppose that we observe \(Y_j = y_j\) for \(j \in B\). Let \(z = n - \sum_{j \in B} y_j\) and \(r = \sum_{i \in A} m_i\). The conditional distribution of \((Y_i: i \in A)\) given \(\left(Y_j = y_j: j \in B\right)\) is multivariate hypergeometric with parameters \((m_i: i \in A)\), and \(z\).

Details:

Once again, an analytic argument is possible using the definition of conditional probability and the appropriate joint distributions. A probabilistic argument is much better. Effectively, we are selecting a sample of size \(z\) from a population of size \(r\), with \(m_i\) objects of type \(i\) for each \(i \in A\).
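
And a sketch of the conditioning result with the same made-up population: normalizing the joint PDF over the event \(\{Y_3 = 1, Y_4 = 2\}\) reproduces the smaller multivariate hypergeometric PDF.

```python
from math import comb, prod

m, n = (3, 4, 2, 5), 6       # four types; we will observe Y_3 = 1 and Y_4 = 2
mtot = sum(m)

def joint(y):
    return prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(mtot, n)

y3, y4 = 1, 2
z = n - y3 - y4              # sample size for the remaining types
pB = sum(joint((a, z - a, y3, y4)) for a in range(z + 1))
for a in range(z + 1):
    conditional = joint((a, z - a, y3, y4)) / pB
    direct = comb(m[0], a) * comb(m[1], z - a) / comb(m[0] + m[1], z)
    print((a, z - a), round(conditional, 12), round(direct, 12))
```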

Combinations of the grouping result and the conditioning result above can be used to compute any marginal or conditional distribution of the counting variables.

Moments

We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the representation in terms of indicator variables above are the main tools.

For \(i \in \{1, 2, \ldots, k\}\), \begin{align*} & \E(Y_i) = n \frac{m_i}{m} \\ & \var(Y_i) = n \frac{m_i}{m} \frac{m - m_i}{m} \frac{m - n}{m - 1} \end{align*}

Details:

This follows immediately, since \(Y_i\) has the hypergeometric distribution with parameters \(m\), \(m_i\), and \(n\).
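
As a sanity check (illustrative parameters ours), enumerating the joint PDF confirms the mean and variance formulas:

```python
from math import comb, prod
from itertools import product

m, n = (5, 3, 4), 6
mtot = sum(m)
pmf = {y: prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(mtot, n)
       for y in product(*(range(mi + 1) for mi in m)) if sum(y) == n}

for i, mi in enumerate(m):
    mean = sum(y[i] * p for y, p in pmf.items())
    var = sum((y[i] - mean) ** 2 * p for y, p in pmf.items())
    print(mean, n * mi / mtot)  # enumerated mean vs formula
    print(var, n * (mi / mtot) * ((mtot - mi) / mtot) * (mtot - n) / (mtot - 1))
```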

Now let \(I_{t i} = \bs{1}(X_t \in D_i)\), the indicator variable of the event that the \(t\)th object selected is type \(i\), for \(t \in \{1, 2, \ldots, n\}\) and \(i \in \{1, 2, \ldots, k\}\).

Suppose that \(r\) and \(s\) are distinct elements of \(\{1, 2, \ldots, n\}\), and \(i\) and \(j\) are distinct elements of \(\{1, 2, \ldots, k\}\). Then \begin{align*} \cov\left(I_{r i}, I_{r j}\right) & = -\frac{m_i}{m} \frac{m_j}{m}\\ \cov\left(I_{r i}, I_{s j}\right) & = \frac{1}{m - 1} \frac{m_i}{m} \frac{m_j}{m} \end{align*}

Details:

Recall that if \(A\) and \(B\) are events with indicator variables \(\bs{1}_A\) and \(\bs{1}_B\), then \(\cov(\bs{1}_A, \bs{1}_B) = \P(A \cap B) - \P(A) \P(B)\). In the first case the events are that sample item \(r\) is type \(i\) and that sample item \(r\) is type \(j\). These events are disjoint, and the individual probabilities are \(\frac{m_i}{m}\) and \(\frac{m_j}{m}\). In the second case, the events are that sample item \(r\) is type \(i\) and that sample item \(s\) is type \(j\). The probability that both events occur is \(\frac{m_i}{m} \frac{m_j}{m-1}\) while the individual probabilities are the same as in the first case.

Suppose again that \(r\) and \(s\) are distinct elements of \(\{1, 2, \ldots, n\}\), and \(i\) and \(j\) are distinct elements of \(\{1, 2, \ldots, k\}\). Then \begin{align*} \cor\left(I_{r i}, I_{r j}\right) & = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \\ \cor\left(I_{r i}, I_{s j}\right) & = \frac{1}{m - 1} \sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \end{align*}

Details:

This follows from the previous result and the definition of correlation. Recall that if \(I\) is an indicator variable with parameter \(p\) then \(\var(I) = p (1 - p)\).

In particular, \(I_{r i}\) and \(I_{r j}\) are negatively correlated while \(I_{r i}\) and \(I_{s j}\) are positively correlated.

For distinct \(i, \, j \in \{1, 2, \ldots, k\}\), \begin{align*} \cov\left(Y_i, Y_j\right) & = -n \frac{m_i}{m} \frac{m_j}{m} \frac{m - n}{m - 1}\\ \cor\left(Y_i, Y_j\right) & = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \end{align*}

Details:

Since \(Y_i = \sum_{r=1}^n I_{r i}\) and \(Y_j = \sum_{s=1}^n I_{s j}\), bilinearity gives \(\cov(Y_i, Y_j) = \sum_{r=1}^n \sum_{s=1}^n \cov(I_{r i}, I_{s j})\). By the result above, the \(n\) terms with \(r = s\) each equal \(-\frac{m_i}{m} \frac{m_j}{m}\), and the \(n(n - 1)\) terms with \(r \ne s\) each equal \(\frac{1}{m - 1} \frac{m_i}{m} \frac{m_j}{m}\); simplifying gives the covariance. The correlation then follows from the variance above.
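
The covariance and correlation formulas can be verified by the same brute-force enumeration (a sketch, reusing the illustrative parameters above):

```python
from math import comb, prod, sqrt
from itertools import product

m, n = (5, 3, 4), 6
mtot, k = sum(m), len(m)
pmf = {y: prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(mtot, n)
       for y in product(*(range(mi + 1) for mi in m)) if sum(y) == n}

means = [sum(y[i] * p for y, p in pmf.items()) for i in range(k)]
varis = [sum((y[i] - means[i]) ** 2 * p for y, p in pmf.items()) for i in range(k)]

i, j = 0, 1
cov = sum((y[i] - means[i]) * (y[j] - means[j]) * p for y, p in pmf.items())
print(cov, -n * (m[i] / mtot) * (m[j] / mtot) * (mtot - n) / (mtot - 1))
print(cov / sqrt(varis[i] * varis[j]),
      -sqrt(m[i] / (mtot - m[i]) * m[j] / (mtot - m[j])))
```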

Sampling with Replacement

Suppose now that the sampling is with replacement, even though this is usually not realistic in applications. In this case, the sample size \(n\) can be any integer in \(\N_+\).

The types of the objects in the sample form a sequence of \(n\) multinomial trials with parameters \((m_1 / m, m_2 / m, \ldots, m_k / m)\).

The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above could also be used.

\((Y_1, Y_2, \ldots, Y_k)\) has the multinomial distribution with parameters \(n\) and \((m_1 / m, m_2 / m, \ldots, m_k / m)\): \[ \P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{y_1} m_2^{y_2} \cdots m_k^{y_k}}{m^n}, \quad (y_1, y_2, \ldots, y_k) \in \N^k \text{ with } \sum_{i=1}^k y_i = n \]

For distinct \(i, \, j \in \{1, 2, \ldots, k\}\), \begin{align*} & \E\left(Y_i\right) = n \frac{m_i}{m} \\ & \var\left(Y_i\right) = n \frac{m_i}{m} \frac{m - m_i}{m} \\ & \cov\left(Y_i, Y_j\right) = -n \frac{m_i}{m} \frac{m_j}{m} \\ & \cor\left(Y_i, Y_j\right) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \end{align*}

Comparing with the corresponding results for sampling without replacement, note that the means and correlations are the same, whether sampling with or without replacement. The variances and covariances are smaller when sampling without replacement, by a factor of the finite population correction factor \((m - n) / (m - 1)\).
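
A tiny sketch of the comparison (parameters ours): multiplying the with-replacement variance and covariance by the finite population correction factor gives the without-replacement values.

```python
m, n = (5, 3, 4), 6
mtot = sum(m)
fpc = (mtot - n) / (mtot - 1)   # finite population correction factor

i, j = 0, 1
var_with = n * (m[i] / mtot) * ((mtot - m[i]) / mtot)  # with replacement
cov_with = -n * (m[i] / mtot) * (m[j] / mtot)
print(var_with, var_with * fpc)  # without-replacement variance is smaller
print(cov_with, cov_with * fpc)  # same for the covariance
```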

Convergence to the Multinomial Distribution

Suppose that the population size \(m\) is very large compared to the sample size \(n\). In this case, it seems reasonable that sampling without replacement is not much different from sampling with replacement, and hence the multivariate hypergeometric distribution should be well approximated by the multinomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly. For the approximate multinomial distribution, we do not need to know \(m_i\) and \(m\) individually, but only the ratio \(m_i / m\).

Suppose that \(m_i\) depends on \(m\) and that \(m_i / m \to p_i\) as \(m \to \infty\) for \(i \in \{1, 2, \ldots, k\}\). For fixed \(n\), the multivariate hypergeometric probability density function with parameters \((m_1, m_2, \ldots, m_k)\), and \(n\) converges to the multinomial probability density function with parameters \((p_1, p_2, \ldots, p_k)\) and \(n\).

Details:

Consider the falling-power form of the PDF above. The multinomial coefficient does not depend on \(m\). In the remaining fraction, there are \(n\) factors in the numerator and \(n\) in the denominator. If we group the factors to form a product of \(n\) fractions, then each fraction in group \(i\) converges to \(p_i\).
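
The convergence is easy to see numerically. In the sketch below (the fractions \(p_i\) and the evaluation point are ours), the population grows with the type fractions held fixed, and the multivariate hypergeometric probability approaches the multinomial probability.

```python
from math import comb, factorial, prod

def mv_hypergeom_pmf(y, m, n):
    return prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(sum(m), n)

def multinomial_pmf(y, p, n):
    c = factorial(n)
    for yi in y:
        c //= factorial(yi)
    return c * prod(pi ** yi for pi, yi in zip(p, y))

p, n, y = (0.5, 0.3, 0.2), 5, (2, 2, 1)
for mtot in (100, 1_000, 10_000, 100_000):
    m = tuple(round(pi * mtot) for pi in p)  # so that m_i / m -> p_i
    print(mtot, mv_hypergeom_pmf(y, m, n))
print("multinomial limit:", multinomial_pmf(y, p, n))
```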

Examples and Applications

A population of 100 voters consists of 40 republicans, 35 democrats and 25 independents. A random sample of 10 voters is chosen. Find each of the following:

  1. The joint density function of the number of republicans, number of democrats, and number of independents in the sample.
  2. The mean of each variable in (a).
  3. The variance of each variable in (a).
  4. The covariance of each pair of variables in (a).
  5. The probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
Details:
  1. \[\P(X = x, Y = y, Z = z) = \frac{\binom{40}{x} \binom{35}{y} \binom{25}{z}}{\binom{100}{10}}, \quad x, \, y, \, z \in \N, \; x + y + z = 10\]
  2. \(\E(X) = 4\), \(\E(Y) = 3.5\), \(\E(Z) = 2.5\)
  3. \(\var(X) = 2.1818\), \(\var(Y) = 2.0682\), \(\var(Z) = 1.7045\)
  4. \(\cov(X, Y) = -1.2727\), \(\cov(X, Z) = -0.9091\), \(\cov(Y, Z) = -0.7955\)
  5. 0.2370 (see the sketch below)
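
These answers can be reproduced with a short Python sketch (standard library only; variable names ours):

```python
from math import comb, prod
from itertools import product

m, n = (40, 35, 25), 10      # republicans, democrats, independents
mtot = sum(m)

def pmf(y):
    return prod(comb(mi, yi) for mi, yi in zip(m, y)) / comb(mtot, n)

fpc = (mtot - n) / (mtot - 1)
print([n * mi / mtot for mi in m])                                  # means
print([n * (mi / mtot) * ((mtot - mi) / mtot) * fpc for mi in m])   # variances

# part 5: at least 4 republicans, at least 3 democrats, at least 2 independents
p = sum(pmf((x, y, z)) for x, y, z in product(range(n + 1), repeat=3)
        if x + y + z == n and x >= 4 and y >= 3 and z >= 2)
print(p)                     # about 0.2370
```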

Bridge

Recall that a bridge hand consists of 13 cards selected at random from a standard deck of 52 cards. Bridge has a wealth of applications where the multivariate hypergeometric distribution plays an important role. A more complete analysis is given in the section on bridge in the chapter on games of chance.

In bridge, the aces, kings, queens, and jacks are honor cards or high cards. Let \(A, \, K, \, Q, \, J, \, N\) denote the number of aces, kings, queens, jacks, and non-honor cards, respectively, in a random bridge hand. Find the probability density function of \((A, K, Q, J, N)\).

Details:

\((A, K, Q, J, N)\) has the multivariate hypergeometric distribution with parameters \((4, 4, 4, 4, 36)\) and \(13\). So the probability density function is given by \[\P(A = a, K = k, Q = q, J = j, N = n) = \frac{\binom{4}{a} \binom{4}{k} \binom{4}{q} \binom{4}{j} \binom{36}{n}}{\binom{52}{13}}, \quad a, \, k, \, q, \, j, \, n \in \N, \; a + k + q + j + n = 13\]

For the variables \(A, \, K, \, Q, \, J, \, N\) in the exercise above, find

  1. The mean and variance of each variable
  2. The covariance and correlation of each pair of distinct variables
Details:

We just need to use the moment results above.

  1. The high card variables are identically distributed with common mean \(13 (4 / 52) = 1\) and the common variance is \(13 (4 / 52) (48 / 52) (39 / 51) = 12 / 17\).
  2. Variable \(N\) has mean \(13 (36 / 52) = 9\) and variance \(13 (36 / 52) (16 / 52) (39 / 51) = 36 / 17\).
  3. The common covariance of a pair of distinct high card variables is \(-13 (4 / 52) (4 / 52) (39 / 51) = -1 / 17\) and the common correlation is \(-1 / 12\).
  4. The covariance of \(N\) with a high card variable is \(-13 (4 / 52) (36 / 52) (39 / 51) = -9 / 17\) and the correlation is \(-\sqrt{3} / 4\).

Run the bridge app 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.

  1. The number of aces \(A\)
  2. The number of kings \(K\)
  3. The number of queens \(Q\)
  4. The number of jacks \(J\)
  5. The number of non-honor cards \(N\)

In the most common high-card point system in bridge, an ace is worth four points, a king three points, a queen two points, and a jack one point. The non-honor cards are not awarded points. So in the notation of the exercise above, the high card value of the hand is \(V = 4 A + 3 K + 2 Q + J\). The high card value is one measure of the strength of a bridge hand.

Find each of the following:

  1. \(\P(V = 8)\)
  2. \(\E(V)\)
  3. \(\var(V)\)
Details:

Part (a) of course can be obtained from the joint density above. The following table gives the 12 points in the support set of \((A, K, Q, J, N)\) that have high card value 8, along with the probability of each.

\(V = 8\)
\(a\) \(k\) \(q\) \(j\) \(n\) \(p\)
0 0 2 4 7 0.000078874032
0 0 3 2 8 0.001143673468
0 0 4 0 9 0.000148253968
0 1 1 3 8 0.003049795915
0 1 2 1 9 0.014232380936
0 2 0 2 9 0.005337142851
0 2 1 0 10 0.009606857132
1 0 0 4 8 0.000190612245
1 0 1 2 9 0.014232380936
1 0 2 0 10 0.009606857132
1 1 0 1 10 0.025618285686
2 0 0 0 11 0.005676779214
So summing gives \(\P(V = 8) = 0.088921893516\). Parts (b) and (c) follow from the moment results above and standard properties of mean, variance, and covariance: \(\E(V) = 10\) and \(\var(V) = 290 / 17\). The standard deviation is about \(4.13\).
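
The table and the moments can be reproduced with a short sketch (standard-library Python; names ours):

```python
from math import comb
from itertools import product

def pmf(a, k, q, j):
    n = 13 - (a + k + q + j)         # number of non-honor cards
    if n < 0:
        return 0.0
    return (comb(4, a) * comb(4, k) * comb(4, q) * comb(4, j)
            * comb(36, n) / comb(52, 13))

honors = list(product(range(5), repeat=4))   # (a, k, q, j), each 0..4
pV8 = sum(pmf(a, k, q, j) for a, k, q, j in honors
          if 4 * a + 3 * k + 2 * q + j == 8)
print(pV8)                           # about 0.08892

ev = sum((4*a + 3*k + 2*q + j) * pmf(a, k, q, j) for a, k, q, j in honors)
ev2 = sum((4*a + 3*k + 2*q + j) ** 2 * pmf(a, k, q, j) for a, k, q, j in honors)
print(ev, ev2 - ev ** 2)             # 10 and 290/17, about 17.06
```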

As you can probably tell from part (a) above, the probability density function of \(V\) is a bit of a computational mess, but it relies only on the joint density of \((A, K, Q, J, N)\) given earlier. The complete density of \(V\) is given in the section on bridge.

Open the bridge app and select High-card value in the drop-down box. Note the location and shape of the probability density function of \(V\). Run the experiment 1000 times and compare the empirical distribution with the probability distribution.

In addition to high cards, the distribution of a bridge hand by suits is important.

Let \(S, \, H, \, D, \, C\) denote the number of spades, hearts, diamonds, and clubs in a bridge hand. Find the probability density function of \((S, H, D, C)\).

Details:

\((S, H, D, C)\) has the multivariate hypergeometric distribution with parameters \((13, 13, 13, 13)\) and \(13\). Hence \[\P(S = s, H = h, D = d, C = c) = \frac{\binom{13}{s} \binom{13}{h} \binom{13}{d} \binom{13}{c}}{\binom{52}{13}}, \quad s, \, h, \, d, \, c \in \N, \; s + h + d + c = 13 \]

For the variables \(S, \, H, \, D, \, C\) in the exercise above, find

  1. The mean and variance of each variable
  2. The covariance and correlation of each pair of distinct variables
Details:

Note that the variables are identically distributed, as are each pair of distinct variables. From the moment results above,

  1. The common mean is \(13 (13 / 52) = 13 / 4\) and the common variance is \(13 (13 / 52) (39 / 52) (39 / 51) = 507 / 272\).
  2. The common covariance is \(-13 (13 / 52) (13 / 52) (39 / 51) = -169 / 272\) and the common correlation is \(-1 / 3\).

Run the bridge app 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.

  1. The number of spades \(S\)
  2. The number of hearts \(H\)
  3. The number of diamonds \(D\)
  4. The number of clubs \(C\)

Bridge hands that have sparse suits are important, particularly when the contract is in another suit because of the opportunity to trump.

Consider a bridge hand.

  1. If the hand has no cards in a suit then the hand is void in that suit.
  2. If the hand has just one card in a suit then the hand has a singleton in the suit.
  3. If the hand has just two cards in a suit then the hand has a doubleton in the suit.

The distribution value of the hand is \(W = 3 Z_0 + 2 Z_1 + Z_2\) where \(Z_0\), \(Z_1\), and \(Z_2\) are the number of voids, singletons, and doubletons in the hand, respectively.

The section on bridge has the joint and marginal density functions of \((Z_0, Z_1, Z_2)\) and the density function of \(W\).

Run the bridge experiment 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.

  1. The number of voids \(Z_0\)
  2. The number of singletons \(Z_1\)
  3. The number of doubletons \(Z_2\)
  4. The distribution value \(W\)

Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is \[ \frac{32\,427\,298\,180}{635\,013\,559\,600} \approx 0.051 \]
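
A sketch of the inclusion-exclusion computation (the term for four void suits vanishes, since a 13-card hand cannot be void in every suit):

```python
from math import comb

hands = comb(52, 13)
# inclusion-exclusion over the number s of suits required to be void
voids = sum((-1) ** (s + 1) * comb(4, s) * comb(52 - 13 * s, 13)
            for s in range(1, 5))
print(voids, voids / hands)   # 32427298180, about 0.051
```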

Bridge is a rich and interesting game in part because the players gain information as the game progresses.

Suppose that East has won a contract in hearts and then, when West lays down her hand, realizes that she and her partner have only 7 hearts between them. What is the most likely split of the remaining 6 hearts among the opponents North and South?

Details:

Let \(X\) denote the number of hearts in North's hand. Then \(X\) has the hypergeometric distribution with parameters 26 (the number of outstanding cards), 6 (the number of outstanding hearts) and 13 (North's cards). The smaller number in the split is \(U = \min\{X, 6 - X\}\). Here are the probabilities of the various splits.

  1. (3, 3) split: \(\P(U = 3) = \P(X = 3) = 286 / 805 \approx 0.35528\)
  2. (2, 4) split: \(\P(U = 2) = \P(X = 2) + \P(X = 4) = 2 \P(X = 2) = 78 / 161 \approx 0.484472\)
  3. (1, 5) split: \(\P(U = 1) = \P(X = 1) + \P(X = 5) = 2 \P(X = 1) = 117 / 805 \approx 0.145342\)
  4. (0, 6) split: \(\P(U = 0) = \P(X = 0) + \P(X = 6) = 2 \P(X = 0) = 12 / 805 \approx 0.0149068\)

So the (2, 4) split is the most likely.
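
The split probabilities follow from the ordinary hypergeometric PDF; a short sketch (variable names ours):

```python
from math import comb

outstanding, hearts, north = 26, 6, 13

def pX(x):
    """P(North holds x of the outstanding hearts): hypergeometric."""
    return (comb(hearts, x) * comb(outstanding - hearts, north - x)
            / comb(outstanding, north))

for u in range(hearts // 2 + 1):
    p = pX(u) if 2 * u == hearts else pX(u) + pX(hearts - u)
    print(f"({u}, {hearts - u}) split: {p:.6f}")
```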

Open the missing card app. Note the location and shape of the probability density function and moments of \(U\). Run the experiment 1000 times and compare with the empirical density function and moments.

The split distributions are given for any number of missing cards in the section on bridge.