Clinical Epidemiology & Evidence-Based Medicine
Glossary:
Experimental Design and Statistics Terminology
Updated August 22, 1999
Contents:
- General Statistical Terms:
- Statistics: Statistics are the methods used to evaluate the effects of
chance. They are the methods to quantify and evaluate information containing uncertainty
of random origin (noise) in results from groups of individuals, each with inherent
biological differences and thus biological variability, when these individuals represent a
sample drawn from a population that could not be evaluated in its entirety (e.g., all the
individuals on which the test could have been done, to which the treatment could have been
applied, could have been vaccinated with the product, ... ). Statistics are valid only
to the degree that the opportunity for bias is minimized in the design and execution of
the study.
- P-value:
The p-value is the probability that an outcome as large as or larger than
that observed would occur in a properly designed, executed, and analyzed analytical study if
in reality there was no difference between the groups, i.e., that the outcome
was due entirely to chance variability of individuals or measurements alone. A
p-value is not the probability that a given result is wrong or right, the probability
that the result occurred by chance, or a measure of the clinical significance of the
results. A very small p-value cannot compensate for the presence of a large amount of
systematic error (bias). If the opportunity for bias is large, the p-value is likely
invalid and irrelevant. Some introductory texts seriously misdefine this term.
- Biological (Clinical) Significance:
Biological significance is the significance of
the difference between outcomes in the clinical situation and must be determined by the
clinician with respect to the patient. Biological (clinical) significance is unrelated
to statistical significance. What is biologically or clinically significant is
measured in terms of a biological outcome (e.g., difference in measures such as morbidity
or mortality, difference in weight gain). Many studies with statistically insignificant
findings are not of sufficient size to detect the minimum clinically significant
difference. Conversely, with a large enough sample size a study can obtain statistical
significance for differences that are too small to have any biological (clinical)
significance.
- Statistically Significant:
The conclusion that the results of a study are not likely
to be due to chance alone because the P-value derived from the statistical analysis is
smaller than the critical alpha value (usually 0.05). A conclusion of statistical
significance must occur prior to (but is not directly related to) conclusions about
biologic, clinical, or economic significance. No matter how small the P-value, the
conclusion of statistical significance is valid only when opportunities for bias are
minimal.
- Statistically Insignificant:
The conclusion that the results of a study could plausibly
to be due to chance alone because the P-value derived from the statistical analysis is
larger than the critical alpha value (usually 0.05). Note that this conclusion is not
directly related to conclusions about biological, clinical, or economical significance
unless one considers the minimum difference or effect that the study had the power to
detect (but did not).
- Power:
Power is the likelihood that a study will detect a true difference of a given
magnitude between groups if it actually exists (i.e., a true positive). Power is a
function of study sample size, the biological variability in the population, the desired
proportions of false positives (alpha) and false negatives (beta), and the type of
statistical test used. Establishing the minimum clinically or biologically significant
difference one wishes to detect and the power with which one wishes to detect at least
that difference determine study size. Typical power levels are 0.80 and 0.90; higher
powers require larger study sizes. The concept of power is extremely important because the
lack of it (i.e., the study size was too small) can lead to statistical insignificance in
the presence of biological significance.
- Sample:
A sample is a group of individuals that is a subset of a population and has
been selected from the population in some fashion (random or haphazard).
- Sample Size (n):
The number of individuals in a group under study. The larger the
sample size, the greater the precision and thus power for a given study design to detect
an effect of a given size. For statisticians, an n > 30 is usually sufficient for the
Central Limit Theorem to hold so that normal theory approximations can be used for
measures such as the standard error of the mean. However, this sample size (n = 30) is unrelated
to the clinician's objective of detecting biologically significant effects, which
determines the specific sample size needed for a specific study.
- Variability (Variation):
"Noise" due to random (chance) and non-random
(systematic) factors that obscure the actual factor of interest.
- Biological Variability:
Natural variability either within an individual over time
due to diurnal cycles and other rhythms, biological repair mechanisms, intermittent and
varying food consumption, aging, and so on or between individuals due to dietary
differences, genetic differences, immune status differences, and so on. The natural
variability of a physiologic parameter in a normal individual tested over time often
equals that in a population of normal individuals tested at one time. The presence of
biological variability in a group generally means that studies of that group must be
large, particularly if the variability is large compared to the size of the difference in
the biological parameter being measured. Because biological repair mechanisms tend to
reduce a disease in an individual over time, this source of biological variability must be
taken into account in study designs, particularly when individuals are compared with
themselves over time. Otherwise, doing anything innocuous may appear to be associated with
improvement, just as doing nothing would have been.
- Laboratory Variability:
Variability in the laboratory setting due to changing
environmental conditions, aging and batch differences of testing components, personnel
differences, and so on. Laboratory variability is minimized by testing samples collected
over time from an individual all at one time and by replicating the tests on a single
sample with the personnel blind to the replications.
- Observer Variability:
Variability due to differences in interpretation of measures
that require any degree of subjective judgment (e.g., auscultation and palpation findings,
radiographs, histology sections) either within the same observer over time or between
observers. Observer variability is minimized by blinding observers to hypotheses, group
assignment in trials, and other findings, by increasing objectivity of measures as much as
possible, by providing standards and guidelines, and by training of observers. Observer
variability can be random but is usually systematic (bias) and is usually due to human
nature and the subtle effects of prior beliefs on perception rather than being due to
deliberate deception.
- Correlation Coefficient (r):
Pearson's correlation coefficient is the
extent to which the association between two variables can be described by a straight line.
A value of +1 indicates a straight line with a positive slope and all data points on the line, 0
indicates no linear association, and -1 indicates a straight line with a
negative slope and all data points on the line. Values in between -1 and +1 indicate
that the data points are scattered around the line with values closer to zero indicating
wider scatter. Depending on how the points are distributed, the correlation coefficient
can be a very misleading indicator of the relationship between the two variables so
looking at a plot of the data points is recommended.
- Coefficient of Determination (R²):
The proportion of the variability
observed in the response (or dependent) variable (from 0.0 to 1.0) that is accounted for
by the statistical model of the predictor (or independent) variables, usually in the form
of a linear regression equation. Note that the test of statistical significance of R²
is usually whether it equals 0 or not, which is dependent on sample size, and is not a
test of biological significance. For linear regression models with one predictor
variable, R² is the square of the correlation coefficient.
- Confidence Interval (CI):
A confidence interval indicates the likely location of the
true value of a measure estimated in a sample from a population, the width of which
shrinks as sample size grows (roughly in proportion to 1/sqrt(n)). The "95" of a 95% CI means that the
estimation procedure has a 0.95 probability of producing an interval containing the true
population value if the study is repeated numerous times. Note that this is the long-run
probability that the interval contains the true value over many studies but is not
the probability for the single study; the interval either does or does not include the
true population value for a given study. A 100% interval is infinitely wide and 99%, 95%
and 90% intervals are successively narrower. If the confidence intervals for a measure in
two groups do not overlap, the measures are statistically significantly different between the
two groups; overlapping intervals do not by themselves establish the absence of a
significant difference. If the confidence interval of a comparative measure such as a relative risk or
odds ratio includes 1 (or 0 if the measure is on a log scale), the association between the
risk factor and the outcome is not statistically significant.
- "Normally" (Gaussian) Distributed Data:
"Normally" distributed
data are data whose frequency distribution "fits" (i.e., is closely approximated
by) the bell-shaped curve described by the Gaussian distribution, which is an exact
function described by the data mean and standard deviation. Such a distribution arises
from the independent contributions of many sources of random variation of different
magnitudes. Data distributed in this fashion allow the use of statistical procedures
based on normal theory (e.g., t-tests). Note that "normally" distributed in
the statistical sense has no relationship to "normal" in the medical
sense.
- Non-parametric Test:
A non-parametric test is a statistical test or procedure that
requires no assumptions about the distribution of the data (e.g., normally distributed)
but rather uses the relative positions or ranks (sorted order) of the data points to
establish a p-value. If data are normally distributed, these tests are less powerful than
equivalent parametric procedures because not all the information contained in the data is
used. However, under other conditions, the p-values from non-parametric tests are more
valid, such as when applied to data with censored values, outliers, or non-normal
distributions (i.e., most biologic data). Such tests are often called "robust".
- Parametric Test:
A parametric test is a statistical test or procedure using a
quantitative measure (standard error, standard deviation, mean square error) of
variability or spread in the data to establish a p-value (t-tests, ANOVA). For these tests
to produce valid p-values, the data must closely follow Gaussian or "normal"
distributions.
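As a rough illustration of the Power and Sample Size entries in this section, the sketch below estimates how many individuals per group are needed to detect a chosen minimum clinically significant difference. It assumes a two-sided comparison of two group means and uses the standard normal-theory approximation, which the glossary itself does not specify; the function name and the example numbers (difference of 5, standard deviation of 10) are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means.

    Normal-theory approximation (an assumption of this sketch):
        n = 2 * ((z_(1 - alpha/2) + z_power) * sd / delta) ** 2
    delta is the minimum clinically significant difference to detect and
    sd is the standard deviation of individuals within each group.
    """
    z = NormalDist()  # standard normal distribution (mean 0, SD 1)
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return math.ceil(n)  # round up to whole individuals


# Hypothetical inputs: detect a difference of 5 units when the SD is 10.
n_80 = sample_size_per_group(delta=5, sd=10, power=0.80)
n_90 = sample_size_per_group(delta=5, sd=10, power=0.90)
# Halving the difference to detect roughly quadruples the required size.
n_small_delta = sample_size_per_group(delta=2.5, sd=10, power=0.80)
print(n_80, n_90, n_small_delta)
```

The pattern matches the text: higher power requires larger groups, and greater variability relative to the difference of interest drives the study size up quickly. An exact calculation based on the t-distribution would give slightly larger numbers for small studies.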
- Data Types:
Form of the information obtained from
observation and measurements, which determines the types of summary measures, analysis
procedures, and graphical displays appropriate for the data.
- Categorical Data:
Integer data with two or more exclusive categories that are
enumerated (counted) rather than measured. The values for a group of individuals are
usually tabulated in a contingency (multi-cell row by column) table with each individual
contributing only once to the table.
- Binary (Dichotomous) Data:
Data with only two exclusive categories (alive /
dead, sick / well, smoker / non-smoker, pregnant / non-pregnant, high / low).
- Nominal:
Data values consist of scores that have no inherent ordering (hair color,
breed, reproductive status (e.g., female, male, neutered)).
- Ordinal:
Data values consist of scores that are inherently ordered (e.g., disease
severity 0, 1+, 2+, 3+, high / moderate / low). Note that unless the steps between the
scores are equal, parametric procedures should not be used to summarize and compare such
data.
- Continuous Data:
Data based on a continuous scale of measurement, such as age,
weight, serum chemistry values, and temperature, that is not restricted to integer values
and that is measured rather than enumerated. Continuous data can be reduced to discrete
data by rounding and to categorical data by establishing cutoffs and classifying it into
categories.
- Discrete Data:
Integer data based on an ordered scale with the same interval width
between intervals such as parity (number of offspring), heart and respiratory counts per
unit time, blood cell counts per unit volume.
- Qualitative (Subjective) Data:
Data, typically categorical, that are prone to
observer variation and to low repeatability without strict, validated criteria (e.g.,
disease severity 0, 1+, 2+, 3+, ...).
- Quantitative (Objective) Data:
Data, typically measured with a calibrated instrument,
that are less prone to observer variation (age, weight, heart rate, ...).
- Primary Data:
Primary data are data collected by the investigators for the purposes
of the study. This allows the opportunity to improve precision and to minimize measurement
bias through the use of precise definitions, systematic procedures, trained observers, and
blinding during data collection. Such data are usually expensive to acquire compared to
secondary data.
- Secondary Data:
Secondary data are data collected for purposes other than that of
the study, such as patient clinical records, and are used frequently for case-control
studies. Because the investigator has no control over definitions, collection procedures,
observers (clinicians) or other opportunities for measurement bias reduction, the
opportunity for bias is large. The advantages of secondary data are that these data are
usually considerably less expensive and much more readily available than are primary data.
The severe disadvantage is the opportunity for the presence of large amounts of
measurement bias.
- Censored (Truncated) Data:
Commonly, follow-up data are incomplete for some
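individuals in a study that occurs over time. Left-censored data occur when the event of
interest has already happened before follow-up of an individual begins, so that only an
upper bound on the event time is known; follow-up that starts at a later time than for
other subjects is more precisely called left truncation (delayed entry).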
Right-censored data occur when an individual is lost to follow-up for reasons other than
the occurrence of the event of interest, such as the end of the study, death due to
another cause or simply loss of contact prior to the event of interest. Failure to account
for individuals with censored data can seriously bias the results of a study.
- Data Description:
- Statistic:
A numerical value calculated to summarize the values in a sample and that
provides an estimate of that characteristic in the population.
- Rank:
The position of a data value when the data values are sorted in numerical
order.
- 25th Percentile:
The data value that separates the bottom quarter of the
data from the upper three-quarters, which numerically is the data value at rank 0.25 * (n
+ 1).
- Lower Quartile:
The lower quartile of a data set is those values below the 25th
percentile, which is one-fourth of the data in a data set. The lower quartile data values
that are not outliers are depicted by the lower whisker on a box-and-whisker plot.
- 75th Percentile:
The data value that separates the top quarter of the
data from the bottom three-quarters, which numerically is the data value at rank 0.75 * (n
+ 1).
- Upper Quartile:
The upper quartile is those values above the 75th
percentile, which is one-fourth of the data in a data set. The upper quartile data values
that are not outliers are depicted by the upper whisker on a box-and-whisker plot.
- Interquartile Range (IQR):
The difference between the values of the 25th
and 75th percentiles, which define the boundaries of the middle one-half of the
values of a data set when sorted in numerical order. The IQR appears as the width of the
box on a box-and-whisker plot and contains one-half of the data values in a data set.
- Median (50th percentile):
The median is the value that divides the data in half when the values are
sorted in numerical order, with one-half of the values below it and one-half above.
Numerically, the median is the data value at rank 0.5 * (n +
1). The median is a better measure than is the mean of the center of a data distribution
when the data are not symmetrically (normally) distributed because it is not affected as
severely as the mean by the outliers and non-symmetry typical of biological data. The
median appears as a line in the box of a box-and-whisker plot and divides the middle two
quartiles. Medians are compared by non-parametric statistical procedures.
- Mean (µ, x-bar):
The mean is the average value of a data set and mathematically is
the sum of all values divided by the number of values. Used as a measure of the most
common value, or "center", of a data distribution, the mean applies only to
symmetrically (normally) distributed datasets and is severely affected by outliers common
in biological data sets. Means are compared by parametric statistical procedures.
- Mode:
Most common data value, which is the highest peak of a frequency distribution.
The mode is not particularly useful other than for describing shape: unimodal - one peak,
bimodal - two peaks, ... .
- Outlier:
Outliers are unusually large or small values compared to the rest of the
data in a data set. Outliers are often defined as any value more than 1.5 times the
interquartile range below the 25th percentile or above the 75th percentile, or any value 2 or more standard
deviations from the mean in a large "normally" distributed data set. By
convention, mild outliers are depicted by asterisks beyond the whiskers on box-and-whisker
plots and severe outliers by open circles beyond the asterisks.
- Standard Deviation (SD, s):
The standard deviation is a
mathematical measure of the spread or dispersion of the data around the mean value for
normally distributed data. What proportion of the data lies within multiples of the
standard deviation depends upon the underlying distribution (e.g., t-distribution,
"normal", normalized z, uniform).
- Standard Error of the Mean (SEM):
The precision of the estimate of a sample mean,
which is very common in the literature. SEM is a measure of the spread of the sample means
from repeated samples of a population and is the basis of parametric statistical
procedures for comparing group means. Mathematically, the SEM is the SD divided by the
square root of the sample size, meaning that it is always smaller than the SD (for n > 1). This
relationship means that to halve the SEM, the n must be quadrupled. SEM is often used
incorrectly in place of the SD to describe variability of individuals in a population.
- Standard Error of a Proportion (SEP):
The precision of the estimate of a proportion,
which is very common in the literature. Mathematically, the standard error of a proportion
p is (p(1 - p)/n)^0.5, where n is the sample size. For reasonably large n and
proportions that are not close to 0.0 or 1.0 so that normal theory approximations are
reasonable, the 95% confidence interval for the proportion is p ±
1.96 * SEP.
- Range:
The range is the difference between the largest and smallest values in a set
of data. Because of the severe influence of outliers on the range, it is not particularly
useful statistically.
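The rank-based and precision measures defined in this section can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the helper names and the small data set are hypothetical, and linear interpolation between neighbouring values when the rank p * (n + 1) is not a whole number is an assumption, since the glossary does not say how fractional ranks are handled.

```python
import math
import statistics


def percentile(values, p):
    """Data value at rank p * (n + 1) in the sorted data (1-based ranks).

    Fractional ranks are interpolated linearly between neighbours
    (an assumption of this sketch).
    """
    data = sorted(values)
    n = len(data)
    rank = min(max(p * (n + 1), 1), n)  # keep the rank inside the data
    lower = math.floor(rank)
    if lower >= n:
        return data[-1]
    frac = rank - lower
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


def sem(values):
    """Standard error of the mean: SD divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))


def sep_with_ci(p, n, z=1.96):
    """Standard error of a proportion and its normal-theory 95% CI."""
    sep = math.sqrt(p * (1 - p) / n)
    return sep, (p - z * sep, p + z * sep)


# Hypothetical data set with n = 11, so the quartile ranks are whole numbers.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 18, 21]
q1 = percentile(data, 0.25)      # rank 0.25 * 12 = 3
median = percentile(data, 0.50)  # rank 0.50 * 12 = 6
q3 = percentile(data, 0.75)      # rank 0.75 * 12 = 9
iqr = q3 - q1
sd = statistics.stdev(data)
sem_value = sem(data)
sep, ci = sep_with_ci(p=0.2, n=100)  # hypothetical: 20 of 100 affected
print(q1, median, q3, iqr, sd, sem_value, sep, ci)
```

Reporting the SD describes the spread of individuals, whereas the SEM and SEP describe the precision of an estimate, which is why the two should not be interchanged.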
- Data Display:
- X-axis (Abscissa):
By convention, the horizontal axis of a plot or graph.
- Y-axis (Ordinate):
By convention, the vertical axis of a plot or graph.
- Error Bar:
"T" shaped bars of various lengths on plots that indicate the
precision of the estimate of the mean value of a variable at that point. The length of the
bar is usually the SEM (standard error of the mean) but may be the CI (confidence
interval) or the SD (standard deviation) of that point.
- Frequency Plot:
A plot of the data distribution. The data values of the variable
being plotted are on the x-axis, a count or percentage is on the y-axis. Each point on the
plot indicates the number or percentage of the datapoints that have that value. The
Gaussian or bell-shaped "normal" curve is a frequency plot.
- Box-and-Whisker Plot:
A frequency plot that indicates the median, the interquartile
range (the box), the range of the non-outlier data (the whiskers), and the outliers in the
data set. Subsets of the data categorized by values of another variable (case-control
status, sex, ...) may be plotted with their own set of boxes and whiskers on the same
graph.
- Histogram:
A frequency plot using bars. The x-axis may be a continuous variable
classified into categories or be a categorical variable.
- "Normal" (Gaussian) Curve:
A frequency plot of a "normal
distribution" defined by a mean and standard deviation where 95% of the points lie
within ± 1.96 standard deviations of the mean and 68% of the
points lie within ± 1 standard deviation of the mean.
- Scatter Plot:
A plot of data points in which each point represents the simultaneous
value of two variables, usually with the independent or explanatory variable on the x-axis
and the dependent or outcome variable on the y-axis. The x-axis variable may be
continuous, interval, or categorical. Scatterplots are often used to show relationships
between levels of two variables.
- Epidemic Curve:
A histogram of the number of cases by time of onset.
- Survival Curve:
A plot of the probability that a member of a group is event-free up
to a time point. The x-axis is follow-up time starting with a common zero time and the
y-axis is a probability from 0.0 to 1.0. The name is derived from a plot of group
mortality over time, but it has more general application; e.g. to recovery, pregnancy, or
other health outcomes that occur in a group over time.
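To make the Survival Curve entry concrete, the sketch below computes the points of such a curve from follow-up data that include right-censored individuals. It uses the Kaplan-Meier product-limit estimator, a common choice that this glossary does not name, so treat it as one possible method; the function name and the six follow-up records are hypothetical.

```python
def kaplan_meier(follow_up):
    """Estimate event-free probability over time.

    follow_up is a list of (time, event_observed) pairs, where
    event_observed is False for right-censored individuals.
    Returns (time, probability) pairs at each time an event occurred.
    """
    at_risk = len(follow_up)  # everyone is at risk at time zero
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in follow_up}):
        events = sum(1 for time, seen in follow_up if time == t and seen)
        censored = sum(1 for time, seen in follow_up if time == t and not seen)
        if events:
            # Multiply by the fraction of those at risk who stay event-free.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Individuals with an event or censored at t leave the risk set.
        at_risk -= events + censored
    return curve


# Hypothetical follow-up times; False marks loss to follow-up.
records = [(2, True), (3, False), (5, True), (5, True), (8, False), (9, True)]
curve = kaplan_meier(records)
for time, probability in curve:
    print(time, round(probability, 3))
```

Censored individuals contribute to the number at risk until they drop out, which is how the estimator accounts for incomplete follow-up instead of discarding those records.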
- Statistical Analysis Methods:
- Analysis of Variance (ANOVA):
The most common parametric procedure for comparing
multiple group means by using mean square error in an F-test to produce a p-value.
- Linear Regression:
A parametric procedure for determining the relationship between
one or more (multiple) continuous or categorical predictor (or independent) variables and
a continuous outcome (or dependent) variable that results in an equation of the general
form y = ax + b.
- Logistic Regression:
A special form of regression to determine the relationship
between one or more continuous or categorical predictor variables and a binary outcome
variable (live / dead, sick / well, ...). The regression procedure produces an equation
that predicts an outcome probability between 0.0 and 1.0 for values of the predictor
variables.
- Repeated Measures:
Data from successive testing of the same individuals over time or
under different treatments. Such data usually require special repeated measures analysis
procedures to arrive at the correct statistical conclusion because later measurements on
an individual are related to previous ones (i.e., are not independent). Analyzing such
data as if they were single measurements on more individuals has been reported to be the
most common error in veterinary data analysis (JAVMA 182:138(1985)), resulting in a biased
p-value.
- χ² (Chi-square) test:
A non-parametric test
for association in categorical data arranged as counts in cells of a row by column table
with the number of cells or counts equal to the number of rows times the number of
columns.
- Two Sample (Independent) t-test:
A parametric test that determines whether the means
from two independent groups are similar, within the bounds of chance variation.
- Paired (Dependent) t-test:
A parametric test that determines whether the mean
difference obtained by testing the same individuals on two different occasions (e.g., before
treatment, after treatment) is similar to zero, within the bounds of chance variation.
- Survival Analysis:
Procedures to compare survival curves.
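As an illustration of the chi-square test entry in this section, the sketch below computes the test statistic for a row by column table of counts by comparing each observed count with the count expected if rows and columns were unrelated. The function names and the 2 x 2 table are hypothetical, no continuity correction is applied, and the p-value helper is valid only for one degree of freedom (a 2 x 2 table); larger tables would need a statistics library.

```python
import math


def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if there were no association.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_one_df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: rows are exposed / unexposed, columns are sick / well.
table = [[20, 30],
         [30, 20]]
statistic, df = chi_square(table)
p_value = p_value_one_df(statistic)
print(statistic, df, round(p_value, 4))
```

Each individual must appear in exactly one cell of the table; counting the same individuals more than once violates the independence the test relies on.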