Basic Theory
The Multitype Model
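Suppose that $D$ is a population with $m$ objects. Each object is one of $k$ types, so that we have a multitype population. Let $D_i$ denote the subset of all type $i$ objects and let $m_i = \#(D_i)$ for $i \in \{1, 2, \ldots, k\}$. So $D = \bigcup_{i=1}^k D_i$ and $m = \sum_{i=1}^k m_i$. We sample $n$ objects at random from $D$. The outcome of the experiment is $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$ where $X_j \in D$ is the $j$th object chosen. The parameters of the model are $(m_1, m_2, \ldots, m_k)$ and $n$.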
For example, we could have an urn with balls of several different colors, or a population of voters who are either democrat, republican, or independent. The dichotomous model considered earlier is clearly a special case, with $k = 2$.
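Let $Y_i$ denote the number of type $i$ objects in the sample, for $i \in \{1, 2, \ldots, k\}$, so that $\sum_{i=1}^k Y_i = n$ and
$$Y_i = \#\{j \in \{1, 2, \ldots, n\}: X_j \in D_i\} = \sum_{j=1}^n \mathbf{1}(X_j \in D_i)$$
Note that if we know the values of $k - 1$ of the counting variables, we can find the value of the remaining counting variable. We assume initially that the sampling is without replacement, since this is the realistic case in most applications, and so $n \in \{0, 1, \ldots, m\}$.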
Distributions
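Basic combinatorial arguments can be used to derive the probability density function of the random vector of counting variables. Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size $n$ chosen from $D$.
The probability density function of $(Y_1, Y_2, \ldots, Y_k)$ is given by
$$\mathbb{P}(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{\binom{m_1}{y_1} \binom{m_2}{y_2} \cdots \binom{m_k}{y_k}}{\binom{m}{n}}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n$$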
Details:
The binomial coefficient $\binom{m_i}{y_i}$ is the number of unordered subsets of $D_i$ (the type $i$ objects) of size $y_i$. The binomial coefficient $\binom{m}{n}$ is the number of unordered samples of size $n$ chosen from $D$. Thus the result follows from the multiplication principle of combinatorics and the uniform distribution of the unordered sample.
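The distribution of $(Y_1, Y_2, \ldots, Y_k)$ is the multivariate hypergeometric distribution with parameters $m$, $(m_1, m_2, \ldots, m_k)$, and $n$. We also say that $(Y_1, Y_2, \ldots, Y_{k-1})$ has this distribution (recall again that the values of any $k - 1$ of the variables determine the value of the remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to $k = 2$.
An alternate form of the probability density function of $(Y_1, Y_2, \ldots, Y_k)$ is
$$\mathbb{P}(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{(y_1)} m_2^{(y_2)} \cdots m_k^{(y_k)}}{m^{(n)}}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n$$
where $a^{(j)} = a (a - 1) \cdots (a - j + 1)$ denotes the falling power.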
Details:
A combinatorial proof is to consider the ordered sample, which is uniformly distributed on the set of permutations of size $n$ from $D$. The multinomial coefficient on the right is the number of ways to partition the index set $\{1, 2, \ldots, n\}$ into $k$ groups where group $i$ has $y_i$ elements (these are the coordinates of the type $i$ objects). The number of (ordered) ways to select the type $i$ objects is $m_i^{(y_i)}$. The denominator $m^{(n)}$ is the number of ordered samples of size $n$ chosen from $D$.
There is also a simple algebraic proof, starting with the PDF in [3]. Write each binomial coefficient in the form $\binom{a}{j} = a^{(j)} / j!$ and rearrange a bit.
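The agreement between the two forms of the density can also be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names `mvh_pmf` and `mvh_pmf_falling`, the helper `falling`, and the example parameters are illustrative assumptions. Exact rational arithmetic is used so that rounding does not affect the comparison.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial, prod


def falling(a, j):
    """Falling power a^(j) = a (a - 1) ... (a - j + 1), with a^(0) = 1."""
    return prod(a - t for t in range(j))


def mvh_pmf(y, m_types):
    """Binomial-coefficient form of the density at the count vector y."""
    m, n = sum(m_types), sum(y)
    numerator = prod(comb(mi, yi) for mi, yi in zip(m_types, y))
    return Fraction(numerator, comb(m, n))


def mvh_pmf_falling(y, m_types):
    """Multinomial-coefficient form of the density, using falling powers."""
    m, n = sum(m_types), sum(y)
    multinomial = factorial(n) // prod(factorial(yi) for yi in y)
    numerator = multinomial * prod(falling(mi, yi) for mi, yi in zip(m_types, y))
    return Fraction(numerator, falling(m, n))


# Hypothetical example: 5, 4, and 3 objects of three types, sample size 4.
m_types, n = (5, 4, 3), 4
support = [y for y in product(range(n + 1), repeat=len(m_types)) if sum(y) == n]

# The two forms agree at every point, and the density sums to 1.
assert all(mvh_pmf(y, m_types) == mvh_pmf_falling(y, m_types) for y in support)
assert sum(mvh_pmf(y, m_types) for y in support) == 1
```

Points with $y_i \gt m_i$ are handled automatically, since both `comb` and the falling power return 0 there.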
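For $i \in \{1, 2, \ldots, k\}$, $Y_i$ has the hypergeometric distribution with parameters $m$, $m_i$, and $n$:
$$\mathbb{P}(Y_i = y) = \frac{\binom{m_i}{y} \binom{m - m_i}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\}$$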
Details:
An analytic proof is possible, starting with the joint PDF in [3] or [4] and summing over the unwanted variables. However, a probabilistic proof is much better: $Y_i$ is the number of type $i$ objects in a sample of size $n$ chosen at random (and without replacement) from a population of $m$ objects, with $m_i$ of type $i$ and the remaining $m - m_i$ not of this type.
The multivariate hypergeometric distribution is preserved when the counting variables are combined.
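Suppose that $(A_1, A_2, \ldots, A_l)$ is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Let $W_j = \sum_{i \in A_j} Y_i$ and $r_j = \sum_{i \in A_j} m_i$ for $j \in \{1, 2, \ldots, l\}$. Then $(W_1, W_2, \ldots, W_l)$ has the multivariate hypergeometric distribution with parameters $m$, $(r_1, r_2, \ldots, r_l)$, and $n$.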
Details:
Again, an analytic proof is possible, but a probabilistic proof is much better. Effectively, we now have a population of $m$ objects with $l$ types, and $r_j$ is the number of objects of the new type $j$. As before we sample $n$ objects without replacement, and $W_j$ is the number of objects in the sample of the new type $j$.
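The grouping property can be verified by brute force on a small population. The sketch below is an illustration rather than part of the source: the helper `mvh_pmf`, the partition, and the parameters are all assumed for the example, and indices are 0-based as is usual in Python.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def mvh_pmf(y, m_types):
    """Multivariate hypergeometric density at the count vector y."""
    return Fraction(prod(comb(mi, yi) for mi, yi in zip(m_types, y)),
                    comb(sum(m_types), sum(y)))


# Hypothetical population with four types; sample size 5.
m_types, n = (4, 3, 2, 6), 5
# Partition of the (0-based) index set into two groups.
groups = [(0, 1), (2, 3)]
r = tuple(sum(m_types[i] for i in g) for g in groups)  # grouped type sizes

# Distribution of the grouped counts, obtained by summing the joint density.
grouped = {}
for y in product(range(n + 1), repeat=len(m_types)):
    if sum(y) == n:
        w = tuple(sum(y[i] for i in g) for g in groups)
        grouped[w] = grouped.get(w, 0) + mvh_pmf(y, m_types)

# It agrees with the multivariate hypergeometric density with type sizes r.
assert all(p == mvh_pmf(w, r) for w, p in grouped.items())
```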
Note that the marginal distribution of $Y_i$ in [5] is a special case of grouping. We have two types: type $i$ and not type $i$. More generally, the marginal distribution of any subsequence of $(Y_1, Y_2, \ldots, Y_k)$ is hypergeometric, with the appropriate parameters.
The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed.
Suppose that $(A, B)$ is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Suppose that we observe $Y_j = y_j$ for $j \in B$. Let $z = n - \sum_{j \in B} y_j$ and $r = \sum_{i \in A} m_i$. The conditional distribution of $(Y_i: i \in A)$ given $(Y_j = y_j: j \in B)$ is multivariate hypergeometric with parameters $r$, $(m_i: i \in A)$, and $z$.
Details:
Once again, an analytic argument is possible using the definition of conditional probability and the appropriate joint distributions. A probabilistic argument is much better. Effectively, we are selecting a sample of size $z$ from a population of size $r$, with $m_i$ objects of type $i$ for each $i \in A$.
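The conditioning property can be checked the same way, by normalizing the joint density over the points that match the observed counts. This is a sketch under assumed parameters: the helper `mvh_pmf`, the index sets, and the observed value are hypothetical, and indices are 0-based.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def mvh_pmf(y, m_types):
    """Multivariate hypergeometric density at the count vector y."""
    return Fraction(prod(comb(mi, yi) for mi, yi in zip(m_types, y)),
                    comb(sum(m_types), sum(y)))


# Hypothetical population with four types; sample size 5.
m_types, n = (4, 3, 2, 6), 5
A, B = (0, 1, 2), (3,)       # partition of the 0-based index set
observed = {3: 2}            # the count for the type with index 3 is observed to be 2

z = n - sum(observed.values())          # remaining sample size
r_types = tuple(m_types[i] for i in A)  # sizes of the unobserved types

# Joint density restricted to the observed values, then normalized.
joint = {}
for ya in product(range(z + 1), repeat=len(A)):
    if sum(ya) == z:
        y = list(ya) + [observed[j] for j in B]
        joint[ya] = mvh_pmf(y, m_types)
total = sum(joint.values())
conditional = {ya: p / total for ya, p in joint.items()}

# The conditional density is multivariate hypergeometric on the reduced population.
assert all(p == mvh_pmf(ya, r_types) for ya, p in conditional.items())
```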
Combinations of the grouping result in [6] and the conditioning result in [7] can be used to compute any marginal or conditional distributions of the counting variables.
Moments
We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the representation in terms of indicator variables in [2] are the main tools.
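For $i \in \{1, 2, \ldots, k\}$,
- $\mathbb{E}(Y_i) = n \frac{m_i}{m}$
- $\operatorname{var}(Y_i) = n \frac{m_i}{m} \frac{m - m_i}{m} \frac{m - n}{m - 1}$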
Details:
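This follows immediately, since $Y_i$ has the hypergeometric distribution with parameters $m$, $m_i$, and $n$.
Now let $I_{r i} = \mathbf{1}(X_r \in D_i)$, the indicator variable of the event that the $r$th object selected is type $i$, for $r \in \{1, 2, \ldots, n\}$ and $i \in \{1, 2, \ldots, k\}$.
Suppose that $r$ and $s$ are distinct elements of $\{1, 2, \ldots, n\}$, and $i$ and $j$ are distinct elements of $\{1, 2, \ldots, k\}$. Then
- $\operatorname{cov}(I_{r i}, I_{r j}) = -\frac{m_i}{m} \frac{m_j}{m}$
- $\operatorname{cov}(I_{r i}, I_{s j}) = \frac{1}{m - 1} \frac{m_i}{m} \frac{m_j}{m}$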
Details:
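Recall that if $A$ and $B$ are events, then $\operatorname{cov}(\mathbf{1}_A, \mathbf{1}_B) = \mathbb{P}(A \cap B) - \mathbb{P}(A) \mathbb{P}(B)$. In the first case the events are that sample item $r$ is type $i$ and that sample item $r$ is type $j$. These events are disjoint, and the individual probabilities are $\frac{m_i}{m}$ and $\frac{m_j}{m}$. In the second case, the events are that sample item $r$ is type $i$ and that sample item $s$ is type $j$. The probability that both events occur is $\frac{m_i}{m} \frac{m_j}{m - 1}$ while the individual probabilities are the same as in the first case.
Suppose again that $r$ and $s$ are distinct elements of $\{1, 2, \ldots, n\}$, and $i$ and $j$ are distinct elements of $\{1, 2, \ldots, k\}$. Then
- $\operatorname{cor}(I_{r i}, I_{r j}) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}}$
- $\operatorname{cor}(I_{r i}, I_{s j}) = \frac{1}{m - 1} \sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}}$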
Details:
This follows from [9] and the definition of correlation. Recall that if $I$ is an indicator variable with parameter $p$ then $\operatorname{var}(I) = p (1 - p)$.
In particular, $I_{r i}$ and $I_{r j}$ are negatively correlated while $I_{r i}$ and $I_{s j}$ are positively correlated.
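The signs and sizes of the indicator covariances can be confirmed by enumerating every ordered sample from a tiny population. The following sketch is only an illustration: the population `types`, the sample size 2, and all helper names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical population of m = 6 objects, each labeled by its type (0, 1, or 2).
types = [0, 0, 0, 1, 1, 2]
m = len(types)
m0, m1 = types.count(0), types.count(1)

# All ordered samples of size 2 without replacement; each is equally likely.
samples = list(permutations(range(m), 2))


def expect(f):
    """Expected value of f(sample) under the uniform distribution."""
    return sum(Fraction(f(s)) for s in samples) / len(samples)


def cov(f, g):
    """Covariance of f(sample) and g(sample)."""
    return expect(lambda s: f(s) * g(s)) - expect(f) * expect(g)


def first_is_0(s):
    return int(types[s[0]] == 0)


def first_is_1(s):
    return int(types[s[0]] == 1)


def second_is_1(s):
    return int(types[s[1]] == 1)


same_draw = cov(first_is_0, first_is_1)    # same draw, distinct types
diff_draw = cov(first_is_0, second_is_1)   # distinct draws, distinct types

# Compare with the closed forms: negative in the first case, positive in the second.
assert same_draw == -Fraction(m0 * m1, m * m)
assert diff_draw == Fraction(m0 * m1, m * m * (m - 1))
```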
Sampling with Replacement
Suppose now that the sampling is with replacement, even though this is usually not realistic in applications. In this case, the sample size $n$ can be any integer in $\mathbb{N}_+ = \{1, 2, \ldots\}$.
The types of the objects in the sample form a sequence of $n$ multinomial trials with parameters $(m_1 / m, m_2 / m, \ldots, m_k / m)$.
The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above could also be used.
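$(Y_1, Y_2, \ldots, Y_k)$ has the multinomial distribution with parameters $n$ and $(m_1 / m, m_2 / m, \ldots, m_k / m)$:
$$\mathbb{P}(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{y_1} m_2^{y_2} \cdots m_k^{y_k}}{m^n}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n$$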
Comparing [14] with [8] and [10], note that the means and correlations are the same, whether sampling with or without replacement. The variances and covariances are smaller when sampling without replacement, by a factor of the finite population correction factor $\frac{m - n}{m - 1}$.
Convergence to the Multinomial Distribution
Suppose that the population size $m$ is very large compared to the sample size $n$. In this case, it seems reasonable that sampling without replacement is not too much different than sampling with replacement, and hence the multivariate hypergeometric distribution should be well approximated by the multinomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly. For the approximate multinomial distribution, we do not need to know $m_i$ and $m$ individually, but only in the ratio $m_i / m$.
Suppose that $m_i$ depends on $m$ and that $m_i / m \to p_i$ as $m \to \infty$ for $i \in \{1, 2, \ldots, k\}$. For fixed $n$, the multivariate hypergeometric probability density function with parameters $m$, $(m_1, m_2, \ldots, m_k)$, and $n$ converges to the multinomial probability density function with parameters $n$ and $(p_1, p_2, \ldots, p_k)$.
Details:
Consider the version of the PDF in [4]. In the fraction, there are $n$ factors in the denominator and $n$ in the numerator. If we group the factors to form a product of $n$ fractions, then each fraction in group $i$ converges to $p_i$.
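The convergence can be seen numerically by fixing a point and letting the population grow while the type proportions stay fixed. In this sketch the proportions `p`, the point `y`, the population sizes, and the function names are all assumptions chosen for illustration; floating-point arithmetic is used, so the values are approximate.

```python
from math import comb, factorial, prod


def mvh_pmf(y, m_types):
    """Multivariate hypergeometric density (sampling without replacement)."""
    return prod(comb(mi, yi) for mi, yi in zip(m_types, y)) / comb(sum(m_types), sum(y))


def multinomial_pmf(y, p):
    """Multinomial density (sampling with replacement)."""
    coef = factorial(sum(y)) // prod(factorial(yi) for yi in y)
    return coef * prod(pi ** yi for pi, yi in zip(p, y))


p = (0.4, 0.35, 0.25)   # limiting proportions (assumed for the example)
y = (4, 3, 3)           # a fixed point of the support, with n = 10
target = multinomial_pmf(y, p)

errors = []
for m in (100, 1_000, 10_000, 100_000):
    m_types = tuple(round(pi * m) for pi in p)   # type sizes in proportion p
    value = mvh_pmf(y, m_types)
    errors.append(abs(value - target))
    print(m, value, target)
```

The absolute error stored in `errors` shrinks as the population size grows with the sample size held at 10.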
Examples and Applications
A population of 100 voters consists of 40 republicans, 35 democrats and 25 independents. A random sample of 10 voters is chosen. Find each of the following:
- The joint density function of the number of republicans, number of democrats, and number of independents in the sample
- The mean of each variable in (a).
- The variance of each variable in (a).
- The covariance of each pair of variables in (a).
- The probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
Details:
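Let $X$, $Y$, and $Z$ denote the number of republicans, democrats, and independents in the sample, respectively.
- $\mathbb{P}(X = x, Y = y, Z = z) = \frac{\binom{40}{x} \binom{35}{y} \binom{25}{z}}{\binom{100}{10}}$ for $x, y, z \in \mathbb{N}$ with $x + y + z = 10$
- $\mathbb{E}(X) = 4$, $\mathbb{E}(Y) = 3.5$, $\mathbb{E}(Z) = 2.5$
- $\operatorname{var}(X) \approx 2.1818$, $\operatorname{var}(Y) \approx 2.0682$, $\operatorname{var}(Z) \approx 1.7045$
- $\operatorname{cov}(X, Y) \approx -1.2727$, $\operatorname{cov}(X, Z) \approx -0.9091$, $\operatorname{cov}(Y, Z) \approx -0.7955$
- 0.2370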
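The answers to this exercise can be computed directly from the density in [3] and the moment formulas. The following is a rough sketch, assuming the moment formulas of the multivariate hypergeometric distribution with the finite population correction; the dictionary keys and function names are hypothetical labels chosen for the example.

```python
from math import comb

m_types = {"republican": 40, "democrat": 35, "independent": 25}
m, n = sum(m_types.values()), 10
fpc = (m - n) / (m - 1)   # finite population correction factor


def pmf(x, y, z):
    """Joint density of the three counts; zero unless x + y + z = n."""
    if x + y + z != n:
        return 0.0
    return comb(40, x) * comb(35, y) * comb(25, z) / comb(m, n)


means = {t: n * mi / m for t, mi in m_types.items()}
variances = {t: n * (mi / m) * (1 - mi / m) * fpc for t, mi in m_types.items()}


def cov(s, t):
    """Covariance of the counts for two distinct types s and t."""
    return -n * (m_types[s] / m) * (m_types[t] / m) * fpc


# P(at least 4 republicans, at least 3 democrats, at least 2 independents)
prob = sum(pmf(x, y, n - x - y)
           for x in range(4, n + 1)
           for y in range(3, n + 1 - x)
           if n - x - y >= 2)

print(means, variances, cov("republican", "democrat"), prob)
```

Only three points of the support satisfy all three constraints, so the last probability is a sum of three terms.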
Bridge
Recall that a bridge hand consists of 13 cards selected at random from a standard deck of 52 cards. Bridge has a wealth of applications where the multivariate hypergeometric distribution plays an important role. A more complete analysis is given in the section on bridge in the chapter on games of chance.
In bridge, the aces, kings, queens, and jacks are honor cards or high cards. Let $X$, $Y$, $Z$, $U$, and $V$ denote the number of aces, kings, queens, jacks, and non-honor cards, respectively, in a random bridge hand. Find the probability density function of $(X, Y, Z, U, V)$.
Details:
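$(X, Y, Z, U, V)$ has the multivariate hypergeometric distribution with parameters $m = 52$, $(4, 4, 4, 4, 36)$, and $n = 13$. So the probability density function is given by
$$\mathbb{P}(X = x, Y = y, Z = z, U = u, V = v) = \frac{\binom{4}{x} \binom{4}{y} \binom{4}{z} \binom{4}{u} \binom{36}{v}}{\binom{52}{13}}, \quad x, y, z, u, v \in \mathbb{N} \text{ with } x + y + z + u + v = 13$$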
For the variables in exercise [17], find
- The mean and variance of each variable
- The covariance and correlation of each pair of distinct variables
Details:
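We just need to use [8] and [11].
- The high card variables are identically distributed with common mean $1$ and the common variance is $\frac{12}{17} \approx 0.706$.
- Variable $V$ has mean $9$ and variance $\frac{36}{17} \approx 2.118$.
- The common covariance of a pair of distinct high card variables is $-\frac{1}{17} \approx -0.0588$ and the common correlation is $-\frac{1}{12} \approx -0.0833$.
- The covariance of $V$ with a high card variable is $-\frac{9}{17} \approx -0.5294$ and the correlation is $-\frac{\sqrt{3}}{4} \approx -0.4330$.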
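These moments follow mechanically from the general formulas, as the following sketch suggests. It assumes the covariance formula $-n \frac{m_i}{m} \frac{m_j}{m} \frac{m - n}{m - 1}$ for distinct counting variables; the dictionary `counts` and the function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

m, n = 52, 13
counts = {"ace": 4, "king": 4, "queen": 4, "jack": 4, "non-honor": 36}
fpc = Fraction(m - n, m - 1)   # finite population correction factor


def mean(t):
    return Fraction(n * counts[t], m)


def var(t):
    p = Fraction(counts[t], m)
    return n * p * (1 - p) * fpc


def cov(s, t):
    """Covariance of the counts for two distinct card types s and t."""
    return -n * Fraction(counts[s], m) * Fraction(counts[t], m) * fpc


def cor(s, t):
    return float(cov(s, t)) / sqrt(float(var(s) * var(t)))


print(mean("ace"), var("ace"), mean("non-honor"), var("non-honor"))
print(cov("ace", "king"), cor("ace", "king"))
print(cov("ace", "non-honor"), cor("ace", "non-honor"))
```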
Run the bridge app 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.
- The number of aces
- The number of kings
- The number of queens
- The number of jacks
- The number of non-honor cards
In the most common high-card point system in bridge, an ace is worth four points, a king three points, a queen two points, and a jack one point. The non-honor cards are not awarded points. So in the notation of exercise [17], the high card value of the hand is $W = 4 X + 3 Y + 2 Z + U$. The high card value is one measure of the strength of a bridge hand.
Find each of the following:
- $\mathbb{P}(W = 8)$
- $\mathbb{E}(W)$
- $\operatorname{var}(W)$
Details:
Part (a) of course can be obtained from the joint density in [17]. The following table gives the 12 points in the support set of $(X, Y, Z, U, V)$ that have high card value 8, along with the probability.
Points $(x, y, z, u, v)$ with $W = 8$:

| $x$ | $y$ | $z$ | $u$ | $v$ | Probability |
|---|---|---|---|---|---|
| 0 | 0 | 2 | 4 | 7 | 0.000078874032 |
| 0 | 0 | 3 | 2 | 8 | 0.001143673468 |
| 0 | 0 | 4 | 0 | 9 | 0.000148253968 |
| 0 | 1 | 1 | 3 | 8 | 0.003049795915 |
| 0 | 1 | 2 | 1 | 9 | 0.014232380936 |
| 0 | 2 | 0 | 2 | 9 | 0.005337142851 |
| 0 | 2 | 1 | 0 | 10 | 0.009606857132 |
| 1 | 0 | 0 | 4 | 8 | 0.000190612245 |
| 1 | 0 | 1 | 2 | 9 | 0.014232380936 |
| 1 | 0 | 2 | 0 | 10 | 0.009606857132 |
| 1 | 1 | 0 | 1 | 10 | 0.025618285686 |
| 2 | 0 | 0 | 0 | 11 | 0.005676779214 |
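So summing gives $\mathbb{P}(W = 8) \approx 0.0889$.
Parts (b) and (c) follow from [18] and standard properties of mean, variance and covariance: $\mathbb{E}(W) = 10$ and $\operatorname{var}(W) = \frac{290}{17} \approx 17.059$. The standard deviation is about $4.13$.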
As you can probably tell from part (a) of [20], the probability density function of $W$ is a bit of a computational mess, but relies only on the joint density of $(X, Y, Z, U, V)$ in exercise [17]. The complete density of $W$ is given in the section on bridge.
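Although messy by hand, the density of the high card value is easy to tabulate by enumerating the honor card counts. The following is a minimal sketch, assuming the 4-3-2-1 point system described above; the variable names are illustrative, and exact rational arithmetic is used.

```python
from fractions import Fraction
from itertools import product
from math import comb

hands = comb(52, 13)   # number of bridge hands

# Accumulate the density of the high card value over all honor card counts.
pmf_w = {}
for x, y, z, u in product(range(5), repeat=4):   # aces, kings, queens, jacks
    v = 13 - (x + y + z + u)                      # non-honor cards
    if v < 0:
        continue
    p = Fraction(comb(4, x) * comb(4, y) * comb(4, z) * comb(4, u) * comb(36, v), hands)
    w = 4 * x + 3 * y + 2 * z + u                 # high card value
    pmf_w[w] = pmf_w.get(w, 0) + p

mean_w = sum(w * p for w, p in pmf_w.items())
var_w = sum(w * w * p for w, p in pmf_w.items()) - mean_w ** 2

print(float(pmf_w[8]), mean_w, var_w)
```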
Open the bridge app and select High-card value in the drop-down box. Note the location and shape of the probability density function of $W$. Run the experiment 1000 times and compare the empirical distribution with the probability distribution.
In addition to high cards, the distribution of a bridge hand by suits is important.
Let $Y_1$, $Y_2$, $Y_3$, and $Y_4$ denote the number of spades, hearts, diamonds, and clubs, respectively, in a bridge hand. Find the probability density function of $(Y_1, Y_2, Y_3, Y_4)$.
Details:
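The vector of suit counts has the multivariate hypergeometric distribution with parameters $m = 52$, $(13, 13, 13, 13)$, and $n = 13$. Hence
$$\mathbb{P}(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3, Y_4 = y_4) = \frac{\binom{13}{y_1} \binom{13}{y_2} \binom{13}{y_3} \binom{13}{y_4}}{\binom{52}{13}}, \quad y_1, y_2, y_3, y_4 \in \mathbb{N} \text{ with } y_1 + y_2 + y_3 + y_4 = 13$$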
For the variables in exercise [22], find
- The mean and variance of each variable
- The covariance and correlation of each pair of distinct variables
Details:
Note that the variables are identically distributed, as is each pair of distinct variables. From [8] and [11],
- The common mean is $\frac{13}{4}$ and the common variance is $\frac{507}{272} \approx 1.864$.
- The common covariance is $-\frac{169}{272} \approx -0.6213$ and the common correlation is $-\frac{1}{3}$.
Run the bridge app 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.
- The number of spades
- The number of hearts
- The number of diamonds
- The number of clubs
Bridge hands that have sparse suits are important, particularly when the contract is in another suit because of the opportunity to trump.
Consider a bridge hand.
- If the hand has no cards in a suit then the hand is void in that suit.
- If the hand has just one card in a suit then the hand has a singleton in the suit.
- If the hand has just two cards in a suit then the hand has a doubleton in the suit.
The distribution value of the hand is $R = 3 N_0 + 2 N_1 + N_2$ where $N_0$, $N_1$, and $N_2$ are the number of voids, singletons, and doubletons in the hand, respectively.
The section on bridge has the joint and marginal density functions of $(N_0, N_1, N_2)$ and the density function of $R$.
Run the bridge experiment 1000 times. For each of the following variables, note the location and shape of the probability density function and moments, and compare with the empirical density function and moments.
- The number of voids
- The number of singletons
- The number of doubletons
- The distribution value
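Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is
$$\frac{4 \binom{39}{13} - 6 \binom{26}{13} + 4 \binom{13}{13}}{\binom{52}{13}} = \frac{1\,621\,364\,909}{31\,750\,677\,980} \approx 0.0511$$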
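The inclusion-exclusion computation can be cross-checked against a direct enumeration of the suit counts. This is a sketch only, and the variable names `incl_excl` and `direct` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

hands = comb(52, 13)   # number of bridge hands

# Inclusion-exclusion over the suits: j specified suits are all void in
# comb(52 - 13 * j, 13) hands, and four simultaneous voids are impossible.
incl_excl = Fraction(sum((-1) ** (j + 1) * comb(4, j) * comb(52 - 13 * j, 13)
                         for j in range(1, 4)), hands)

# Direct enumeration of the suit counts (spades, hearts, diamonds, clubs).
direct = sum(Fraction(comb(13, a) * comb(13, b) * comb(13, c) * comb(13, d), hands)
             for a, b, c, d in product(range(14), repeat=4)
             if a + b + c + d == 13 and min(a, b, c, d) == 0)

assert incl_excl == direct
print(incl_excl, float(incl_excl))
```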
Bridge is a rich and interesting game in part because the players gain information as the game progresses.
Suppose that East has won a contract in hearts and then, when West lays down her hand, realizes that she and her partner have only 7 hearts between them. What is the most likely split of the remaining 6 hearts among the opponents North and South?
Details:
Let $Y$ denote the number of hearts in North's hand. Then $Y$ has the hypergeometric distribution with parameters 26 (the number of outstanding cards), 6 (the number of outstanding hearts) and 13 (North's cards). The smaller number in the split is $\min\{Y, 6 - Y\}$. Here are the probabilities of the various splits.
- (3, 3) split: $\frac{286}{805} \approx 0.3553$
- (2, 4) split: $\frac{78}{161} \approx 0.4845$
- (1, 5) split: $\frac{117}{805} \approx 0.1453$
- (0, 6) split: $\frac{12}{805} \approx 0.0149$
So the (2, 4) split is the most likely.
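The split probabilities come from folding the hypergeometric density of North's heart count about its center. The following sketch assumes the parameters stated in the details (26 outstanding cards, 6 outstanding hearts, 13 cards to North); the names `pmf`, `splits`, and `most_likely` are hypothetical.

```python
from fractions import Fraction
from math import comb

outstanding, hearts, north = 26, 6, 13


def pmf(y):
    """Probability that North holds exactly y of the outstanding hearts."""
    return Fraction(comb(hearts, y) * comb(outstanding - hearts, north - y),
                    comb(outstanding, north))


# Combine North's count y with its mirror image 6 - y to get each split.
splits = {}
for y in range(hearts + 1):
    key = (min(y, hearts - y), max(y, hearts - y))
    splits[key] = splits.get(key, 0) + pmf(y)

most_likely = max(splits, key=splits.get)
print(splits, most_likely)
```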
Open the missing card app. Note the location and shape of the probability density function and its moments. Run the experiment 1000 times and compare with the empirical density function and moments.
The split distributions are given for any number of missing cards in the section on bridge.