In the basic statistical model, we have a population of objects of interest. The objects could be persons, families, computer chips, or acres of corn. In addition, we have various measurements or variables defined on the objects. We select a sample from the population and record the variables of interest for each object in the sample. Here are a few examples based on the data sets in this project:
Examples
Often the observed outcome of a statistical experiment (the data) has the form \(\bs x = (x_1, x_2, \ldots, x_n)\) where \(x_i\) is the vector of measurements for the \(i\)th object chosen from the population, for each \( i \in \{1, 2, \ldots, n\} \). The set \(S\) of possible values of \(\bs x\) (before the experiment is conducted) is called the sample set; it is literally the set of possible samples. Thus, although the outcome of a statistical experiment can have quite a complicated structure (a vector of vectors), the hallmark of mathematical abstraction is the ability to set aside the features that are not relevant at any particular time and to treat a complex structure as a single object. This is what we do with the outcome \(\bs x\) of the experiment.
The techniques of statistics have been enormously successful; these techniques are widely used in just about every subject that deals with quantification—the natural sciences, the social sciences, law, and medicine. On the other hand, statistics has a legalistic quality and a great deal of terminology and jargon that can make the subject a bit intimidating at first. In the rest of this section, we begin discussing some of this terminology.
Suppose that a statistical experiment results in the data \(\bs x = (x_1, x_2, \ldots, x_n)\) for some \( n \in \N_+ \). The empirical distribution associated with \(\bs x\) is the probability distribution that places probability \(1/n\) at \(x_i\) for each \( i \in \{1, 2, \ldots, n\} \).
So if the values are distinct, the empirical distribution is the discrete uniform distribution on \(\{x_1, x_2, \ldots, x_n\}\). More generally, if \(x\) occurs \(k\) times in the data for some \( k \in \{1, 2, \ldots, n\} \), then the empirical distribution assigns probability \(k / n\) to \(x\). Thus, every finite data set defines a probability distribution. The empirical distribution is very important, not just practically but also theoretically. Modern resampling methods are based on the empirical distribution.
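As a quick illustration, the empirical distribution is easy to compute directly from the definition. The following sketch (in Python, with made-up data) assigns probability \(k/n\) to each value that occurs \(k\) times among the \(n\) observations:

```python
from collections import Counter

def empirical_distribution(data):
    """Probability k/n for each value occurring k times in n observations."""
    n = len(data)
    return {x: k / n for x, k in Counter(data).items()}

# The value 2 occurs twice in five observations, so it gets mass 2/5;
# the distinct values 1, 3, 5 each get mass 1/5.
dist = empirical_distribution([1, 2, 2, 3, 5])
# dist == {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}
```

When all the data values are distinct, the dictionary returned is exactly the discrete uniform distribution described above.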
Suppose again that a statistical experiment results in the data \(\bs x = (x_1, x_2, \ldots, x_n)\) for some \( n \in \N_+ \).
Technically, a statistic \(w = w(\bs x)\) is an observable function of the outcome \(\bs x\) of the experiment.
That is, a statistic is a computable function defined on the sample set \(S\). The term observable means that the function should not contain any unknown quantities, because we need to be able to compute the value \(w\) of the statistic from the observed data \(\bs x\). As with the data \(\bs x\), a statistic \(w\) may have a complicated structure; typically, \(w\) is vector valued. Indeed, the outcome \(\bs x\) of the experiment is itself a statistic; all other statistics are derived from \(\bs x\).
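For instance, the sample mean is a statistic, since it is computable from the data alone; a quantity that involves an unknown population parameter is not. A minimal sketch:

```python
def sample_mean(x):
    # A statistic: an observable function, computable from the data alone.
    return sum(x) / len(x)

# By contrast, sum(x) / len(x) - mu would NOT be a statistic, because it
# depends on the unknown population mean mu and so is not observable.
w = sample_mean([2.0, 4.0, 6.0])  # w == 4.0
```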
Statistics \(u\) and \(v\) are equivalent if there exists a one-to-one function \(r\) from the range of \(u\) onto the range of \(v\) such that \(v = r(u)\). Equivalent statistics give equivalent information about \(\bs x\).
Statistics \(u\) and \(v\) are equivalent if and only if the following condition holds: for any \(\bs x \in S\) and \(\bs{y} \in S\), \(u(\bs x) = u(\bs{y})\) if and only if \(v(\bs x) = v(\bs{y})\).
Equivalence really is an equivalence relation on the collection of statistics for a given statistical experiment. That is, if \(u\), \(v\), and \(w\) are arbitrary statistics then \(u\) is equivalent to \(u\) (the reflexive property); if \(u\) is equivalent to \(v\) then \(v\) is equivalent to \(u\) (the symmetric property); and if \(u\) is equivalent to \(v\) and \(v\) is equivalent to \(w\) then \(u\) is equivalent to \(w\) (the transitive property).
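The condition above can be checked computationally for particular statistics. As an illustration with toy data, the sample sum and sample mean are equivalent for a fixed sample size \(n\), since the mean is a one-to-one function of the sum (\(r(s) = s/n\)):

```python
from itertools import product

def u(x):  # sample sum
    return sum(x)

def v(x):  # sample mean; v = r(u) with r(s) = s / n, one-to-one for fixed n
    return sum(x) / len(x)

# Verify the criterion on a small sample set S of outcomes of fixed size n = 2:
# u(x) == u(y) if and only if v(x) == v(y), for all x, y in S.
S = list(product([0, 1, 2], repeat=2))
equivalent = all((u(x) == u(y)) == (v(x) == v(y)) for x in S for y in S)
# equivalent == True
```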
There are two broad branches of statistics. The term descriptive statistics refers to methods for summarizing and displaying the observed data \(\bs x\). As the name suggests, the methods of descriptive statistics usually involve computing various statistics (in the technical sense) that give useful information about the data: measures of center and spread, measures of association, and so forth. In the context of descriptive statistics, the term parameter refers to a characteristic of the entire population.
The deeper and more useful branch of statistics is known as inferential statistics. Our point of view in this branch is that the statistical experiment (before it is conducted) is a random experiment with a probability measure \(\P\) on an underlying sample space. Thus, the outcome \(\bs x\) of the experiment is an observed value of a random variable \(\bs{X}\) defined on this probability space, with the distribution of \(\bs{X}\) not completely known to us. Our goal is to draw inferences about the distribution of \(\bs{X}\) from the observed value \(\bs x\). Thus, in a sense, inferential statistics is the dual of probability. In probability, we try to predict the value of \(\bs{X}\) assuming complete knowledge of the distribution. In statistics, by contrast, we observe the value of \(\bs x\) of the random variable \(\bs{X}\) and try to infer information about the underlying distribution of \(\bs{X}\). In inferential statistics, a statistic (a function of \(\bs{X}\)) is itself a random variable with a distribution of its own. On the other hand, the term parameter refers to a characteristic of the distribution of \(\bs{X}\). Often the inferential problem is to use various statistics to estimate or test hypotheses about a parameter. Another way to think of inferential statistics is that we are trying to infer from the empirical distribution associated with the observed data \(\bs x\) to the true distribution associated with \(\bs{X}\).
There are two basic types of random experiments in the general area of inferential statistics. A designed experiment, as the name suggests, is carefully designed to study a particular inferential question. The experimenter has considerable control over how the objects are selected, what variables are to be recorded for these objects, and the values of certain of the variables. In an observational study, by contrast, the researcher has little control over these factors. Often the researcher is simply given the data set and asked to make sense out of it. For example, the Polio field trials were designed experiments to study the effectiveness of the Salk vaccine. The researchers had considerable control over how the children were selected, and how the children were assigned to the treatment and control groups. By contrast, the Challenger data sets used to explore the relationship between temperature and O-ring erosion are observational studies. Of course, just because an experiment is designed does not mean that it is well designed.
A number of difficulties can arise when trying to explore an inferential question. Often, problems arise because of confounding variables, which are variables that (as the name suggests) interfere with our understanding of the inferential question. In the first Polio field trial design, for example, age and parental consent are two confounding variables that interfere with the determination of the effectiveness of the vaccine. The entire point of the Berkeley admissions data, to give another example, is to illustrate how a confounding variable (department) can create a spurious correlation between two other variables (gender and admissions status). When we correct for the interference caused by a confounding variable, we say that we have controlled for the variable.
Problems also frequently arise because of measurement errors. Some variables are inherently difficult to measure, and systematic bias in the measurements can interfere with our understanding of the inferential question. The first Polio field trial design again provides a good example. Knowledge of the vaccination status of the children led to systematic bias by doctors attempting to diagnose polio in these children. Measurement errors are sometimes caused by hidden confounding variables.
Confounding variables and measurement errors abound in political polling, where the inferential question is who will win an election. How do confounding variables such as race, income, age, and gender (to name just a few) influence how a person will vote? How do we know that a person will vote for whom she says she will, or if she will vote at all (measurement errors)? The Literary Digest poll in the 1936 presidential election and the professional polls in the 1948 presidential election illustrate these problems.
Confounding variables, measurement errors, and other problems often lead to selection bias, which means that the sample does not represent the population with respect to the inferential question at hand. Randomization is often used to overcome the effects of confounding variables and measurement errors.
The most common and important special case of the inferential statistical model occurs when the observation variable \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] is a sequence of independent and identically distributed random variables. Again, in the standard sampling model, \(X_i\) is itself a vector of measurements for the \(i\)th object in the sample, and thus, we think of \((X_1, X_2, \ldots, X_n)\) as independent copies of an underlying measurement vector \(X\). In this case, \((X_1, X_2, \ldots, X_n)\) is said to be a random sample of size \(n\) from the distribution of \(X\).
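As a simple sketch, a random sample can be simulated by drawing independent copies of the underlying variable. Here the draws are uniform on \([0, 1)\), an arbitrary choice made purely for illustration:

```python
import random

def random_sample(n, draw=random.random):
    # Independent copies X_1, X_2, ..., X_n of an underlying measurement X:
    # each call to draw() is a fresh, independent observation.
    return [draw() for _ in range(n)]

random.seed(17)  # fixed seed so the illustration is reproducible
sample = random_sample(5)  # a random sample of size 5 from the uniform(0, 1) distribution
```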
The mathematical operations that make sense for a variable in a statistical experiment depend on the type and level of measurement of the variable.
Recall that a real variable \(x\) is continuous if the possible values form an interval of real numbers. For example, the weight variable in the M&M data set, and the length and width variables in Fisher's iris data are continuous. In contrast, a discrete variable is one whose set of possible values forms a discrete set. For example, the counting variables in the M&M data set, the type variable in Fisher's iris data, and the denomination and suit variables in the card experiment are discrete. Continuous variables represent quantities that can, in theory, be measured to any degree of accuracy. In practice, of course, measuring devices have limited accuracy so data collected from a continuous variable are necessarily discrete. That is, there is only a finite (but perhaps very large) set of possible values that can actually be measured. So, the distinction between a discrete and continuous variable is based on what is theoretically possible, not what is actually measured. Some additional examples may help:
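The point about measurement accuracy can be made concrete: recording a continuous measurement to a fixed number of decimal places produces a discrete recorded variable. A small illustration with made-up weights:

```python
def measure(true_weight, decimals=2):
    # A scale that reads to two decimal places can record only finitely many
    # distinct values in any bounded range, even though weight is continuous.
    return round(true_weight, decimals)

# Three distinct true weights, but only two distinct recorded values:
readings = {measure(w) for w in [0.9132, 0.9137, 0.9161]}
# readings == {0.91, 0.92}
```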
A real variable \(x\) is also distinguished by its level of measurement.
Qualitative variables simply encode types or names, and thus few mathematical operations make sense, even if numbers are used for the encoding. Such variables have the nominal level of measurement. For example, the type variable in Fisher's iris data is qualitative. Gender, a common variable in many studies of persons and animals, is also qualitative. Qualitative variables are almost always discrete; it's hard to imagine a continuous infinity of names.
A variable for which only order is meaningful is said to have the ordinal level of measurement; differences are not meaningful even if numbers are used for the encoding. For example, in many card games, the suits are ranked, so the suit variable has the ordinal level of measurement. For another example, consider the standard 5-point scale (terrible, bad, average, good, excellent) used to rank teachers, movies, restaurants, etc.
A quantitative variable for which differences, but not ratios, are meaningful is said to have the interval level of measurement. Equivalently, a variable at this level has a relative, rather than absolute, zero value. Typical examples are temperature (in Fahrenheit or Celsius) and time (clock or calendar).
Finally, a quantitative variable for which ratios are meaningful is said to have the ratio level of measurement. A variable at this level has an absolute zero value. The count and weight variables in the M&M data set, and the length and width variables in Fisher's iris data are examples.
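The distinction between the interval and ratio levels can be checked numerically: converting temperatures between Celsius and Fahrenheit preserves ratios of differences but not ratios of values. A small sketch:

```python
def c_to_f(c):
    # Celsius to Fahrenheit: an affine (not purely multiplicative) rescaling,
    # reflecting the relative zero of both temperature scales.
    return 9 * c / 5 + 32

# Ratios of differences survive the change of scale...
diff_ratio = (c_to_f(30) - c_to_f(20)) / (c_to_f(20) - c_to_f(10))  # 1.0, same as (30-20)/(20-10)
# ...but ratios of values do not: 20 C is numerically "twice" 10 C, yet
value_ratio = c_to_f(20) / c_to_f(10)  # 68 / 50 = 1.36, not 2.0
```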
In the basic statistical model, subsamples corresponding to some of the variables can be constructed by filtering with respect to other variables. This is particularly common when the filtering variables are qualitative. Consider the cicada data for example. We might be interested in the quantitative variables body weight, body length, wing width, and wing length by species, that is, separately for species 0, 1, and 2. Or, we might be interested in these quantitative variables by gender, that is, separately for males and females.
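Such filtering is a one-line operation in most software. The sketch below uses hypothetical rows loosely modeled on the cicada data; the field layout and values are invented for illustration:

```python
# Hypothetical records: each row is a (species, gender, body_weight) triple.
cicadas = [
    (0, "female", 0.15), (0, "male", 0.12),
    (1, "female", 0.20), (2, "male", 0.18),
]

def filter_by(data, field_index, value):
    # Subsample: keep the rows whose qualitative variable matches `value`.
    return [row for row in data if row[field_index] == value]

species_0 = filter_by(cicadas, 0, 0)        # rows for species 0
females = filter_by(cicadas, 1, "female")   # rows for females
```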
Study Michelson's experiment to measure the velocity of light.
Study Cavendish's experiment to measure the density of the earth.
Study Short's experiment to measure the parallax of the sun.
In the M&M data, classify each variable in terms of type and level of measurement.
Each color count variable: discrete, ratio; Net weight: continuous, ratio
In the Cicada data, classify each variable in terms of type and level of measurement.
Body weight, wing length, wing width, body length: continuous, ratio. Gender, type: discrete, nominal
In Fisher's iris data, classify each variable in terms of type and level of measurement.
Petal width, petal length, sepal width, sepal length: continuous, ratio. Type: discrete, nominal
Study the Challenger experiment to explore the relationship between temperature and O-ring erosion.
In the Vietnam draft data, classify each variable in terms of type and level of measurement.
Birth month: discrete, interval; Birth day: discrete, interval
In the two SAT data sets, classify each variable in terms of type and level of measurement.
SAT math and verbal scores: probably continuous, ratio; State: discrete, nominal; Year: discrete, interval
Study the Literary Digest experiment to predict the outcome of the 1936 presidential election.
Study the 1948 polls to predict the outcome of the presidential election between Truman and Dewey. Are these designed experiments or observational studies?
Designed experiments, but poorly designed
Study Pearson's experiment to explore the relationship between heights of fathers and heights of sons.
Study the Polio field trials.
Identify the parameters in each of the following:
Note the parameters for each of the following families of special distributions:
During World War II, the Allies recorded the serial numbers of captured German tanks. Classify the underlying serial number variable by type and level of measurement.
Discrete, ordinal.
For a discussion of how the serial numbers were used to estimate the total number of tanks, see the section on Order Statistics in Chapter 11 on finite sampling models.
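As a brief preview of that discussion, the classical estimate is based on the sample maximum: if \(k\) serial numbers are observed with maximum \(m\), the total number of tanks is estimated by \(m(1 + 1/k) - 1\). A short sketch, with made-up serial numbers:

```python
def tank_estimate(serials):
    # Classical estimate based on the sample maximum:
    # N_hat = m * (1 + 1/k) - 1, where m is the largest serial number
    # observed and k is the number of tanks captured.
    k = len(serials)
    m = max(serials)
    return m * (1 + 1 / k) - 1

# If 4 captured tanks bear serial numbers 19, 40, 42, and 60:
estimate = tank_estimate([19, 40, 42, 60])  # 60 * 1.25 - 1 == 74.0
```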