Clinical Epidemiology & EBM Glossary: Experimental Design & Statistics

Clinical Epidemiology & Evidence-Based Medicine Glossary:

Experimental Design and Statistics Terminology

Updated November 02, 2010

Contents:

1. General Statistical Terms:
1. Statistics: Statistics are the methods used to evaluate the effects of chance. They are the methods to quantify and evaluate information containing uncertainty of random origin (noise) in results from groups of individuals, each with inherent biological differences and thus biological variability, when these individuals represent a sample drawn from a population that could not be evaluated in its entirety (e.g., all the individuals on which the test could have been done, to which the treatment could have been applied, could have been vaccinated with the product, ... ). Statistics are valid only to the degree that the opportunity for bias is minimized in the design and execution of the study.
2. P-value: The p-value is the probability that an outcome as large as or larger than that observed would occur in a properly designed, executed, and analyzed analytical study if in reality there was no difference between the groups, i.e., if the outcome was due entirely to chance variability of individuals or measurements alone. A p-value is not the probability that a given result is wrong or right, the probability that the result occurred by chance, or a measure of the clinical significance of the results. A very small p-value cannot compensate for the presence of a large amount of systematic error (bias). If the opportunity for bias is large, the p-value is likely invalid and irrelevant. Some introductory texts seriously misdefine this term.
3. Biological (Clinical) Significance: Biological significance is the significance of the difference between outcomes in the clinical situation and must be determined by the clinician with respect to the patient. Biological (clinical) significance is unrelated to statistical significance. What is biologically or clinically significant is measured in terms of a biological outcome (e.g., difference in measures such as morbidity or mortality, difference in weight gain). Many studies with statistically insignificant findings are not of sufficient size to detect the minimum clinically significant difference. Conversely, a study with a large enough sample size can obtain statistical significance for differences that are too small to have any biological (clinical) significance.
4. Statistically Significant: The conclusion that the results of a study are not likely to be due to chance alone because the P-value derived from the statistical analysis is smaller than the critical alpha value (usually 0.05). A conclusion of statistical significance must occur prior to (but is not directly related to) conclusions about biologic, clinical, or economic significance. No matter how small the P-value, the conclusion of statistical significance is valid only when opportunities for bias are minimal.
5. Statistically Insignificant: The conclusion that the results of a study could plausibly be due to chance alone because the P-value derived from the statistical analysis is larger than the critical alpha value (usually 0.05). Note that this conclusion is not directly related to conclusions about biological, clinical, or economic significance unless one considers the minimum difference or effect that the study had the power to detect (but did not).
6. Power: Power is the likelihood that a study will detect a true difference of a given magnitude between groups if it actually exists (i.e., a true positive). Power is a function of study sample size, the biological variability in the population, the desired proportions of false positives (alpha) and false negatives (beta), and the type of statistical test used. Establishing the minimum clinically or biologically significant difference one wishes to detect and the power with which one wishes to detect at least that difference determine study size. Typical power levels are 0.80 and 0.90; higher powers require larger study sizes. The concept of power is extremely important because the lack of it (i.e., the study size was too small) can lead to statistical insignificance in the presence of biological significance.
7. Sample: A sample is a group of individuals that is a subset of a population and has been selected from the population in some fashion (random or haphazard).
8. Sample Size (n): The number of individuals in a group under study. The larger the sample size, the greater the precision and thus power for a given study design to detect an effect of a given size. For statisticians, an n > 30 is usually sufficient for the Central Limit Theorem to hold so that normal theory approximations can be used for measures such as the standard error of the mean. However, this sample size (n = 30) is unrelated to the clinicians’ objective of detecting biologically significant effects, which determines the specific sample size needed for a specific study.
9. Variability (Variation): "Noise" due to random (chance) and non-random (systematic) factors that obscure the actual factor of interest.
1. Biological Variability: Natural variability either within an individual over time due to diurnal cycles and other rhythms, biological repair mechanisms, intermittent and varying food consumption, aging, and so on or between individuals due to dietary differences, genetic differences, immune status differences, and so on. The natural variability of a physiologic parameter in a normal individual tested over time often equals that in a population of normal individuals tested at one time. The presence of biological variability in a group generally means that studies of that group must be large, particularly if the variability is large compared to the size of the difference in the biological parameter being measured. Because biological repair mechanisms tend to reduce a disease in an individual over time, this source of biological variability must be taken into account in study designs, particularly when individuals are compared with themselves over time. Otherwise, doing anything innocuous may appear to be associated with improvement, just as doing nothing would have been.
2. Laboratory Variability: Variability in the laboratory setting due to changing environmental conditions, aging and batch differences of testing components, personnel differences, and so on. Laboratory variability is minimized by testing samples collected over time from an individual all at one time and by replicating the tests on a single sample with the personnel blind to the replications.
3. Observer Variability: Variability due to differences in interpretation of measures that require any degree of subjective judgment (e.g., auscultation and palpation findings, radiographs, histology sections) either within the same observer over time or between observers. Observer variability is minimized by blinding observers to hypotheses, group assignment in trials, and other findings, by increasing objectivity of measures as much as possible, by providing standards and guidelines, and by training of observers. Observer variability can be random but is usually systematic (bias) and is usually due to human nature and the subtle effects of prior beliefs on perception rather than being due to deliberate deception.
10. Correlation Coefficient (r): Pearson’s correlation coefficient measures the extent to which the association between two variables can be described by a straight line. A value of +1 indicates a straight line with a positive slope and all data points on the line, 0 indicates no linear association, and -1 indicates a straight line with a negative slope and all data points on the line. Values in between -1 and +1 indicate that the data points are scattered around the line with values closer to zero indicating wider scatter. Depending on how the points are distributed, the correlation coefficient can be a very misleading indicator of the relationship between the two variables so looking at a plot of the data points is recommended.
11. Coefficient of Determination (R²): The proportion of the variability observed in the response (or dependent) variable (from 0.0 to 1.0) that is accounted for by the statistical model of the predictor (or independent) variables, usually in the form of a linear regression equation. Note that the test of statistical significance of R² is usually whether it equals 0 or not, which is dependent on sample size, and is not a test of biological significance. For linear regression models with one predictor variable, R² is the square of the correlation coefficient.
12. Confidence Interval (CI): A confidence interval indicates the likely location of the true value of a measure estimated in a sample from a population; its width is inversely proportional to the square root of the sample size. The "95" of a 95% CI means that the estimation procedure has a 0.95 probability of producing an interval containing the true population value if the study is repeated numerous times. Note that this is the long-run probability that the interval contains the true value over many studies but is not the probability for the single study; the interval either does or does not include the true population value for a given study. A 100% interval is infinitely wide and 99%, 95% and 90% intervals are successively narrower. If the confidence intervals for a measure in two groups do not overlap, the measures are statistically significantly different between the two groups; overlapping intervals, however, do not by themselves establish that the difference is not statistically significant. If the confidence interval of a comparative measure such as a relative risk or odds ratio includes 1 (or 0 if the measure is in log scale), the association between the risk factor and the outcome is not statistically significant.
13. "Normally" (Gaussian) Distributed Data: "Normally" distributed data are data whose frequency distribution "fits" (i.e., is closely approximated by) the bell-shaped curve described by the Gaussian distribution, which is an exact function described by the data mean and standard deviation. Such a distribution arises from the independent contributions of many sources of random variation of different magnitudes. Data distributed in this fashion allow the use of statistical procedures based on normal theory (e.g., t-tests). Note that "normally" distributed in the statistical sense has no relationship to "normal" in the medical sense.
14. Non-parametric Test: A non-parametric test is a statistical test or procedure that requires no assumptions about the distribution of the data (e.g., normally distributed) but rather uses the relative positions or ranks (sorted order) of the data points to establish a p-value. If data are normally distributed, these tests are less powerful than equivalent parametric procedures because not all the information contained in the data is used. However, under other conditions, the p-values from non-parametric tests are more valid, such as when applied to data with censored values, outliers, or non-normal distributions (i.e., most biologic data). Such tests are often called "robust".
15. Parametric Test: A parametric test is a statistical test or procedure using a quantitative measure (standard error, standard deviation, mean square error) of variability or spread in the data to establish a p-value (t-tests, ANOVA). For these tests to produce valid p-values, the data must closely follow Gaussian or "normal" distributions.
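
The entries on Power, Sample Size, and Biological (Clinical) Significance describe how the minimum difference worth detecting, the variability, alpha, and the desired power together determine study size. The Python sketch below illustrates that relationship for a comparison of two group means. It assumes a two-sided test, normally distributed data with a common standard deviation, and the usual normal-approximation formula; the function name `n_per_group` and the example numbers are hypothetical and are not taken from the glossary.

```python
import math
from statistics import NormalDist


def n_per_group(min_difference, sd, alpha=0.05, power=0.80):
    """Approximate individuals needed per group to detect `min_difference`
    between two means (two-sided test, normal approximation)."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for false positives
    z_beta = z.inv_cdf(power)             # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / min_difference) ** 2
    return math.ceil(n)


# Hypothetical example: SD of 10 units, minimum clinically
# significant difference of 5 units.
print(n_per_group(5, 10))                 # power 0.80 -> 63 per group
print(n_per_group(5, 10, power=0.90))     # power 0.90 -> 85 per group
print(n_per_group(2.5, 10))               # smaller difference -> larger study
```

Under these assumptions, raising the power or shrinking the difference to be detected both increase the required study size, which is the point made in the Power entry.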

2. Data Types: Form of the information obtained from observation and measurements, which determines the types of summary measures, analysis procedures, and graphical displays appropriate for the data.
1. Categorical Data: Integer data with two or more exclusive categories that are enumerated (counted) rather than measured. The values for a group of individuals are usually tabulated in a contingency (multi-cell row by column) table with each individual contributing only once to the table.
1. Binary (Dichotomous) Data: Data with only two exclusive categories (alive / dead, sick / well, smoker / non-smoker, pregnant / non-pregnant, high / low).
2. Nominal: Data values consist of scores that have no inherent ordering (hair color, breed, reproductive status (e.g., female, male, neutered)).
3. Ordinal: Data values consist of scores that are inherently ordered (e.g., disease severity 0, 1+, 2+, 3+, high / moderate / low). Note that unless the steps between the scores are equal, parametric procedures should not be used to summarize and compare such data.
2. Continuous Data: Data based on a continuous scale of measurement, such as age, weight, serum chemistry values, and temperature, that is not restricted to integer values and that is measured rather than enumerated. Continuous data can be reduced to discrete data by rounding and to categorical data by establishing cutoffs and classifying it into categories.
3. Discrete Data: Integer data based on an ordered scale with the same interval width between intervals such as parity (number of offspring), heart and respiratory counts per unit time, blood cell counts per unit volume.
4. Qualitative (Subjective) Data: Data, typically categorical, that are prone to observer variation and to low repeatability without strict, validated criteria (e.g., disease severity 0, 1+, 2+, 3+, ...).
5. Quantitative (Objective) Data: Data, typically measured with a calibrated instrument, that are less prone to observer variation (age, weight, heart rate, ...).
6. Primary Data: Primary data are data collected by the investigators for the purposes of the study. This allows the opportunity to improve precision and to minimize measurement bias through the use of precise definitions, systematic procedures, trained observers, and blinding during data collection. Such data are usually expensive to acquire compared to secondary data.
7. Secondary Data: Secondary data are data collected for purposes other than that of the study, such as patient clinical records, and are used frequently for case-control studies. Because the investigator has no control over definitions, collection procedures, observers (clinicians) or other opportunities for measurement bias reduction, the opportunity for bias is large. The advantages of secondary data are that these data are usually considerably less expensive and much more readily available than are primary data. The severe disadvantage is the opportunity for the presence of large amounts of measurement bias.
8. Censored (Truncated) Data: Commonly, follow-up data are incomplete for some individuals in a study that occurs over time. Left-censored data occur when follow-up of an individual at risk of an event starts at a later time than other subjects. Right-censored data occur when an individual is lost to follow-up for reasons other than the occurrence of the event of interest, such as the end of the study, death due to another cause or simply loss of contact prior to the event of interest. Failure to account for individuals with censored data can seriously bias the results of a study.
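
The Continuous Data entry notes that continuous data can be reduced to discrete data by rounding and to categorical data by establishing cutoffs. A minimal Python sketch of both reductions follows; the body weights, the cutoff values, and the category labels are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical body weights in kg (continuous, measured data).
weights_kg = [3.8, 4.4, 5.1, 6.7, 7.2]

# Continuous -> discrete: round to the nearest whole kilogram.
discrete = [round(w) for w in weights_kg]

# Continuous -> ordinal categories using cutoffs (assumed values):
# below 4.0 is "low", 4.0 to below 6.0 is "moderate", 6.0 and above is "high".
cutoffs = [4.0, 6.0]
labels = ["low", "moderate", "high"]
categories = [labels[bisect_right(cutoffs, w)] for w in weights_kg]

print(discrete)     # [4, 4, 5, 7, 7]
print(categories)   # ['low', 'moderate', 'moderate', 'high', 'high']
```

Each reduction discards information: two animals of 3.8 kg and 4.4 kg become indistinguishable after rounding, yet fall into different categories once the cutoff at 4.0 kg is applied.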

3. Data Description:
1. Statistic: A numerical value calculated to summarize the values in a sample and that provides an estimate of that characteristic in the population.
2. Rank: The position of a data value when the data values are sorted in numerical order.
3. 25th Percentile: The data value that separates the bottom quarter of the data from the upper three-quarters, which numerically is the data value at rank 0.25 * (n + 1).
4. Lower Quartile: The lower quartile of a data set is those values below the 25th percentile, which is one-fourth of the data in a data set. The lower quartile data values that are not outliers are depicted by the lower whisker on a box-and-whisker plot.
5. 75th Percentile: The data value that separates the top quarter of the data from the bottom three-quarters, which numerically is the data value at rank 0.75 * (n + 1).
6. Upper Quartile: The upper quartile is those values above the 75th percentile, which is one-fourth of the data in a data set. The upper quartile data values that are not outliers are depicted by the upper whisker on a box-and-whisker plot.
7. Interquartile Range (IQR): The difference between the values of the 25th and 75th percentiles, which define the boundaries of the middle one-half of the values of a data set when sorted in numerical order. The IQR appears as the length of the box on a box-and-whisker plot and contains one-half of the data values in a data set.
8. Median (50th percentile): The median is the value that exactly one-half of the values are less than and one-half of the values are more than when the values are sorted in numerical order. Numerically, the median is the data value at rank 0.5 * (n + 1). The median is a better measure than is the mean of the center of a data distribution when the data are not symmetrically (normally) distributed because it is not affected as severely as the mean by the outliers and non-symmetry typical of biological data. The median appears as a line in the box of a box-and-whisker plot and divides the middle two quartiles. Medians are compared by non-parametric statistical procedures.
9. Mean (average, x-bar): The mean is the average value of a data set and mathematically is the sum of all values divided by the number of values. Used as a measure of the "center" of a data distribution, the mean is appropriate mainly for symmetrically (normally) distributed data sets and is severely affected by the outliers common in biological data sets. Means are compared by parametric statistical procedures.
10. Mode: Most common data value, which is the highest peak of a frequency distribution. The mode is not particularly useful other than for describing shape: unimodal - one peak, bimodal - two peaks, ... .
11. Outlier: Outliers are unusually large or small values compared to the rest of the data in a data set. Outliers are often defined as any value more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile, or any value 2 or more standard deviations from the mean in a large "normally" distributed data set. By convention, mild outliers are depicted by asterisks beyond the whiskers on box-and-whisker plots and severe outliers by open circles beyond the asterisks.
12. Standard Deviation (SD, s): The standard deviation is a mathematical measure of the spread or dispersion of the data around the mean value for normally distributed data. What proportion of the data lies within multiples of the standard deviation depends upon the underlying distribution (e.g., t-distribution, "normal", normalized z, uniform).
13. Standard Error of the Mean (SEM): The precision of the estimate of a sample mean, which is very common in the literature. SEM is a measure of the spread of the sample means from repeated samples of a population and is the basis of parametric statistical procedures for comparing group means. Mathematically, the SEM is the SD divided by the square root of the sample size, meaning that it is always smaller than the SD. This relationship means that to halve the SEM, the n must be quadrupled. SEM is often used incorrectly in place of the SD to describe variability of individuals in a population.
14. Standard Error of a Proportion (SEP): The precision of the estimate of a proportion, which is very common in the literature. Mathematically, the standard error of a proportion p is (p(1 - p)/n)^0.5, i.e., the square root of p(1 - p)/n, where n is the sample size. For reasonably large n and proportions that are not close to 0.0 or 1.0 so that normal theory approximations are reasonable, the 95% confidence interval for the proportion is p ± 1.96 * SEP.
15. Range: The range is the difference between the largest and smallest values in a set of data. Because of the severe influence of outliers on the range, it is not particularly useful statistically.
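
The rank formulas given for the percentiles and median, and the formulas for the SEM and SEP, can be checked with a short Python sketch. The function names and the data values below are hypothetical. Interpolating between neighbouring values when the rank p * (n + 1) is not a whole number is an assumption of this sketch, since the glossary does not say how fractional ranks are handled.

```python
import math
from statistics import stdev


def value_at_rank(values, p):
    """Data value at rank p * (n + 1) in the sorted data (1-based rank),
    interpolating linearly when the rank is fractional (an assumption)."""
    data = sorted(values)
    n = len(data)
    rank = min(max(p * (n + 1), 1.0), float(n))   # keep the rank within 1..n
    lower = math.floor(rank)                      # rank of the lower neighbour
    upper = min(lower + 1, n)
    frac = rank - lower
    return data[lower - 1] + frac * (data[upper - 1] - data[lower - 1])


def standard_error_of_mean(values):
    """SEM = SD / sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


def standard_error_of_proportion(p, n):
    """SEP = (p(1 - p)/n)^0.5."""
    return math.sqrt(p * (1 - p) / n)


data = [2, 4, 4, 5, 7, 9, 10]          # hypothetical sample, n = 7
q25 = value_at_rank(data, 0.25)        # rank 0.25 * 8 = 2
median = value_at_rank(data, 0.50)     # rank 0.50 * 8 = 4
q75 = value_at_rank(data, 0.75)        # rank 0.75 * 8 = 6
print(q25, median, q75, q75 - q25)     # percentiles and the IQR

print(stdev(data), standard_error_of_mean(data))   # SEM is smaller than SD

# Normal-approximation 95% CI for a proportion: p +/- 1.96 * SEP.
p, n = 0.5, 100
sep = standard_error_of_proportion(p, n)
print(p - 1.96 * sep, p + 1.96 * sep)
```

Because both standard errors divide by the square root of n, quadrupling the sample size halves them, as the SEM entry states.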

4. Data Display:
1. X-axis (Abscissa): By convention, the horizontal axis of a plot or graph.
2. Y-axis (Ordinate): By convention, the vertical axis of a plot or graph.
3. Error Bar: "T" shaped bars of various lengths on plots that indicate the precision of the estimate of the mean value of a variable at that point. The length of the bar is usually the SEM (standard error of the mean) but may be the CI (confidence interval) or the SD (standard deviation) of that point.
4. Frequency Plot: A plot of the data distribution. The data values of the variable being plotted are on the x-axis, a count or percentage is on the y-axis. Each point on the plot indicates the number or percentage of the datapoints that have that value. The Gaussian or bell-shaped "normal" curve is a frequency plot.
5. Box-and-Whisker Plot: A frequency plot that indicates the median, the interquartile range (the box), the range of the non-outlier data (the whiskers), and the outliers in the data set. Subsets of the data categorized by values of another variable (case-control status, sex, ...) may be plotted with their own set of boxes and whiskers on the same graph.
6. Histogram: A frequency plot using bars. The x-axis may be a continuous variable classified into categories or be a categorical variable.
7. "Normal" (Gaussian) Curve: A frequency plot of a "normal distribution" defined by a mean and standard deviation where 95% of the points lie within 1.96 standard deviations of the mean and 68% of the points lie within 1 standard deviation of the mean.
8. Scatter Plot: A plot of data points in which each point represents the simultaneous value of two variables, usually with the independent or explanatory variable on the x-axis and the dependent or outcome variable on the y-axis. The x-axis variable may be continuous, interval, or categorical. Scatterplots are often used to show relationships between levels of two variables.
9. Epidemic Curve: A histogram of the number of cases by time of onset.
10. Survival Curve: A plot of the probability that a member of a group is event-free up to a time point. The x-axis is follow-up time starting with a common zero time and the y-axis is a probability from 0.0 to 1.0. The name is derived from a plot of group mortality over time, but it has more general application; e.g. to recovery, pregnancy, or other health outcomes that occur in a group over time.
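
To make the Survival Curve entry concrete, the Python sketch below computes the points of such a curve while accounting for right-censored individuals. It uses the product-limit (Kaplan-Meier) estimate, a standard technique that the glossary itself does not name, so treat it as one possible approach rather than the method the glossary has in mind. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, probability of being event-free)] at each event time.

    times  -- follow-up time for each individual from a common zero time
    events -- True if the event occurred at that time, False if the
              individual was right-censored (lost to follow-up, study end)
    """
    at_risk = len(times)
    event_free = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Multiply by the fraction of those still at risk who
            # remain event-free at this time point.
            event_free *= 1 - n_events / at_risk
            curve.append((t, event_free))
        at_risk -= n_leaving      # events and censored both leave the risk set
    return curve


# Hypothetical follow-up (e.g., months) for five individuals.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
for t, prob in kaplan_meier(times, events):
    print(t, round(prob, 3))
```

Censored individuals contribute to the number at risk until they leave the study but never count as events, which is how the curve avoids the bias that the Censored Data entry warns about.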

5. Statistical Analysis Methods:
1. Analysis of Variance (ANOVA): The most common parametric procedure for comparing multiple group means by using mean square error in an F-test to produce a p-value.
2. Linear Regression: A parametric procedure for determining the relationship between one or more (multiple) continuous or categorical predictor (or independent) variables and a continuous outcome (or dependent) variable that results in an equation of the general form y = ax + b.
3. Logistic Regression: A special form of regression to determine the relationship between one or more continuous or categorical predictor variables and a binary outcome variable (live / dead, sick / well, ...). The regression procedure produces an equation that predicts an outcome probability between 0.0 and 1.0 for values of the predictor variables.
4. Repeated Measures: Data from successive testing of the same individuals over time or under different treatments. Such data usually require special repeated measures analysis procedures to arrive at the correct statistical conclusion because later measurements on an individual are related to previous ones (i.e., are not independent). Analyzing such data as if they were single measurements on more individuals has been reported to be the most common error in veterinary data analysis (JAVMA 182:138(1985)), resulting in a biased p-value.
5. χ² (Chi-square) test: A non-parametric test for association in categorical data arranged as counts in cells of a row by column table with the number of cells or counts equal to the number of rows times the number of columns.
6. Two Sample (Independent) t-test: A parametric test that determines whether the means from two independent groups are similar, within the bounds of chance variation.
7. Paired (Dependent) t-test: A parametric test that determines whether the mean difference obtained by testing the same individuals on two different occasions (e.g., before treatment, after treatment) is similar to zero, within the bounds of chance variation.
8. Survival Analysis: Procedures to compare survival curves.
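
As an illustration of the chi-square test entry, the Python sketch below computes the test statistic for counts arranged in a row by column table. It compares each observed count with the count expected if rows and columns were unrelated. The p-value helper is limited to a 2 x 2 table (1 degree of freedom), where the statistic is the square of a standard normal value; larger tables would need a chi-square distribution function from a statistics library. The function names and the counts are hypothetical.

```python
import math


def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if there were no association.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_one_df(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical 2 x 2 table: rows = exposed / unexposed, columns = sick / well.
counts = [[20, 10],
          [10, 20]]
statistic, df = chi_square(counts)
print(round(statistic, 3), df, round(p_value_one_df(statistic), 4))
```

Each individual must contribute to exactly one cell; the sketch does not check the usual requirement that expected counts not be too small.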