Clinical Epidemiology & Evidence-Based Medicine Glossary:

Terminology Specific to Clinical Testing

Updated November 02, 2010

 Clinical Testing

  1. Test: A test is anything that produces evidence from a patient at any stage in the clinical process, based on which a different clinical course will be taken depending on the different possible test outcomes (positive or negative, normal or abnormal, present or absent, high or low, ...). From the lay perspective, a test is a laboratory procedure performed on a specimen (feces, urine, blood, CSF, biopsy, ...) from a patient. From the clinical epidemiology perspective, the following are also examples of a "test": history taking (presence or absence of a component), clinical exam results (presence or absence of a sign), imaging findings (presence or absence of a feature on a radiograph), or response to therapy (as anticipated or not). Few if any tests in medicine are perfect; that is, few produce results that can always be interpreted with absolute certainty for every patient to whom the test is applied. The performance of tests can be compared objectively (e.g., two clinicians (the "tests") can be compared in their ability to detect a particular clinical sign in each of a group of patients).
    1. Sensitive Test: In a diagnostic sense, a higher proportion of the individuals with the disease will test positive than with a less sensitive test.
    2. Specific Test: In a diagnostic sense, a higher proportion of the individuals without the disease will test negative than with a less specific test.
    3. Screening Test: A test applied to individuals without observed signs of disease and in which differential diagnoses of the disease of interest or clinically similar diseases have not been established. That is, the population being tested is composed predominantly of normal individuals who have not been identified as possibly having a clinical case of the disease. Thus, the probability that such an individual has the disease is the prevalence of the disease in the population being screened. Because the disease manifestations are likely minimal in affected individuals, the spectrum of disease is generally less severe in a screening than in a diagnostic setting.
    4. Diagnostic Test: A test used in the clinical environment on individuals with clinical signs or other clinical information consistent with the presence of the condition. The presence of disease has been recognized and the disease of interest is one of the differential diagnoses. This fact raises the expected prevalence (the clinician’s estimate of the probability that the individual has the disease based on what the clinician knows to that point) prior to performing the test and thus changes the test performance considerably compared to the situation when the same test is used as a screening test (where the probability that a randomly selected individual has the disease is the prevalence of the disease in that population). Affected individuals are more likely to have more prominent disease manifestations in the clinical setting, meaning that the spectrum of disease is generally more severe in a diagnostic than in a screening setting.
    5. "Gold" Standard Reference (Definitive) Test: The tests and procedures necessary to definitively establish to a high level of certainty the presence or absence of the disease in an individual. The reference standard usually requires death (necropsy examination) or is too expensive, too risky, or too slow to be used regularly in the clinical setting. For inevitably progressive, chronic conditions, the "gold standard" may be prolonged follow-up. Note that the standard test, the current most widely accepted test used day-to-day, is often not the "gold" standard test and using it as such may be a serious (but not uncommon) mistake.
  2. Accuracy: Accuracy is the degree to which, on average, a test represents the true value (that is, is unbiased). Accuracy alone is insufficient for describing the performance of medical tests and for deciding which test to use when, because accuracy has two separate components (see Se, Sp below) and is dependent on the prevalence of the condition for which the test is appropriate.
  3. Precision: Precision is the inverse of the influence of random or chance error on a measurement: the less the error, the greater the precision. When testing an individual, repeating the measurements and using a summary value increases precision. When testing individuals to learn more about a group, precision is increased by increasing the number of individuals tested (i.e., increasing sample size) and increases in proportion to the square root of the number tested (equivalently, the standard error of the estimate shrinks as the reciprocal of the square root of the number tested). However, precise measurements may still deviate systematically from the true value and thus be biased or invalid, a problem that can’t be reduced by repeating the measurements on an individual or testing more individuals in a group.
  4. Disease Spectrum: For sensitivity of diagnostic tests, the disease spectrum is the range of the disease states represented by the diseased individuals (acute vs. chronic or convalescent cases, mild vs. severe cases, clinical vs. subclinical). For specificity of diagnostic tests, the disease spectrum is the range of the disease states in the individuals with diseases presenting similar clinical signs but not having the disease of interest. The two disease spectrums among the patients used to develop tests tend to be more severe than those of the typical clinical situation, meaning that test performance in practice is often lower than published estimates. Removing the test positives from a group changes the two disease spectrums, which is likely to adversely affect the performance of subsequent tests. For determining sensitivity and specificity of screening tests, the appropriate disease spectrums are those of a cross-section of an appropriate population, which are usually less severe than those of individuals being diagnosed in clinical settings.
  5. Bayes’ Theorem: The mathematical relationship between the probability that an individual has the disease before the test is run and the probability that the individual has the disease after the test result is known. This theorem relates five different probabilities (Se, Sp, Pvn, Pvp, and Pr) and is crucial to understanding how to optimize the use of imperfect tests in the diagnostic process. Bayes’ theorem essentially relates the certainty that the individual has a disease prior to doing the test, the two possible test results, and the certainty that the individual has the disease after doing the test.
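As a worked illustration of the relationship described in the Bayes' Theorem entry, the standard form of the theorem for a dichotomous test can be written out as below. This is a sketch under one assumption: the glossary does not define Pr, and it is taken here to mean the pre-test probability of disease (the prevalence in a screening setting). The numbers in the example are invented for illustration only.

```latex
% Post-test probabilities from Se, Sp, and the pre-test probability Pr
% (assumption: Pr denotes pre-test probability / prevalence)
\[
  Pvp = \frac{Se \cdot Pr}{Se \cdot Pr + (1 - Sp)(1 - Pr)},
  \qquad
  Pvn = \frac{Sp\,(1 - Pr)}{Sp\,(1 - Pr) + (1 - Se)\,Pr}
\]

% Illustrative values: Se = 0.90, Sp = 0.95, Pr = 0.10
\[
  Pvp = \frac{0.90 \times 0.10}{0.90 \times 0.10 + 0.05 \times 0.90}
      = \frac{0.090}{0.135} \approx 0.67,
  \qquad
  Pvn = \frac{0.95 \times 0.90}{0.95 \times 0.90 + 0.10 \times 0.10}
      = \frac{0.855}{0.865} \approx 0.99
\]
```

With these illustrative values, a positive result raises the probability of disease from 0.10 to about 0.67, and a negative result lowers it from 0.10 to about 0.01 (that is, 1 - Pvn).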
  1. Test Performance Measures:
    1. False Negative (Fn): An individual that is test negative but is disease positive (equivalent to a type II error in statistics, governed by beta). False negatives are undesirable test outcomes as such individuals are misclassified.
    2. False Positive (Fp): An individual that is test positive but is disease negative (equivalent to a type I error in statistics, governed by alpha). False positives are undesirable test outcomes as such individuals are misclassified.
    3. True Negative (Tn): An individual that is both test negative and disease negative.
    4. True Positive (Tp): An individual that is both test positive and disease positive.
    5. Diagnostic Sensitivity (Se): Given that an individual has the disease, sensitivity is the probability (between 0 and 1.0) that the individual will test positive. For groups, sensitivity is the proportion of diseased individuals that will test positive. Sensitivity is mathematically equivalent to Tp / (Tp + Fn). Raising a test’s Se by changing its cutoff lowers the test’s specificity. Note that although many mistakenly view this value as fixed, Se depends on the spectrum of the target disease that is present in the group or conceptual population that the test is being applied to.
    6. Analytical Sensitivity: Analytical sensitivity is the ability of a test to detect the target analyte, such as an antibody or antigen, and is usually expressed as the minimum concentration of the analyte that can be detected. Analytical sensitivity is related to the diagnostic sensitivity above but is not a probability.
    7. Diagnostic Specificity (Sp): Given that an individual does not have the disease, specificity is the probability that the individual will test negative. For groups, specificity is the proportion of non-diseased individuals that will test negative. Specificity is mathematically equivalent to Tn / (Tn + Fp). Raising a test’s Sp by changing its cutoff lowers the test’s Se. Note that although many mistakenly view this value as fixed, for diagnostic tests Sp depends on the disease spectrum of the competing diseases in the conceptual population that the test is being applied to. The competing diseases are often different for individuals in different regions and different circumstances.
    8. Negative Predictive Value (Pvn): Given a negative test result (the clinician’s perspective), negative predictive value is the probability that the individual does not have the disease. For groups, negative predictive value is the proportion of test-negative individuals that do not have the disease. Negative predictive value is mathematically equivalent to Tn / (Tn + Fn). Note that for a given Se and Sp, this value changes depending on the disease prevalence estimate prior to the testing being done.
    9. Positive Predictive Value (Pvp): Given a positive test result (the clinician’s perspective), positive predictive value is the probability that the individual actually has the disease. For groups, positive predictive value is the proportion of test-positive individuals that actually have the disease. Positive predictive value is mathematically equivalent to Tp / (Tp + Fp). Note that for a given Se and Sp, this value changes depending on the disease prevalence estimate prior to the testing being done.
    10. Receiver Operating Characteristic (ROC) Curve: Plot of Sensitivity vs. (1 - Specificity) for different test cutoff values, which is used to establish the "best" cutoff for a test with variable parameters. The optimum cutoff depends on the relative costs of false positives and false negatives.
    11. Apparent Prevalence (Test Prevalence): The proportion of test positives in the population tested. Note that apparent prevalence is equivalent to disease prevalence under most circumstances only if a perfect test (no false negatives or false positives) is used.
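The test performance measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not a validated tool: the class name `TwoByTwo`, the function `predictive_values`, and the counts in the example are all hypothetical and chosen only to make the arithmetic easy to follow. The second function applies Bayes' theorem to show how the predictive values move with prevalence while Se and Sp are held fixed.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from comparing a test against a reference standard (hypothetical helper)."""
    tp: int  # test positive, disease positive
    fp: int  # test positive, disease negative
    fn: int  # test negative, disease positive
    tn: int  # test negative, disease negative

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)      # Se = Tp / (Tp + Fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)      # Sp = Tn / (Tn + Fp)

    def pvp(self) -> float:
        return self.tp / (self.tp + self.fp)      # Pvp = Tp / (Tp + Fp)

    def pvn(self) -> float:
        return self.tn / (self.tn + self.fn)      # Pvn = Tn / (Tn + Fn)

    def true_prevalence(self) -> float:
        return (self.tp + self.fn) / self.total   # proportion actually diseased

    def apparent_prevalence(self) -> float:
        return (self.tp + self.fp) / self.total   # proportion testing positive


def predictive_values(se: float, sp: float, prevalence: float) -> tuple[float, float]:
    """Return (Pvp, Pvn) for fixed Se and Sp at a given pre-test probability."""
    pvp = se * prevalence / (se * prevalence + (1 - sp) * (1 - prevalence))
    pvn = sp * (1 - prevalence) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return pvp, pvn


# Invented counts: 1000 individuals, 100 of them diseased.
table = TwoByTwo(tp=90, fp=45, fn=10, tn=855)

if __name__ == "__main__":
    print(f"Se  = {table.sensitivity():.3f}")
    print(f"Sp  = {table.specificity():.3f}")
    print(f"Pvp = {table.pvp():.3f}")
    print(f"Pvn = {table.pvn():.3f}")
    print(f"true prevalence     = {table.true_prevalence():.3f}")
    print(f"apparent prevalence = {table.apparent_prevalence():.3f}")
    # Same Se and Sp, different pre-test probabilities.
    for prev in (0.01, 0.10, 0.50):
        pvp, pvn = predictive_values(0.90, 0.95, prev)
        print(f"prevalence {prev:.2f}: Pvp = {pvp:.3f}, Pvn = {pvn:.3f}")
```

In this invented table the apparent prevalence (0.135) differs from the true prevalence (0.10) because the test is imperfect, and the loop shows Pvp falling sharply as prevalence drops even though Se and Sp are unchanged.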
  2. Information Gain: Having done a test, the amount of information the clinician gained about the probability that the individual has the disease. This is the difference between the clinician’s estimate of the probability that an individual has the disease before the test is done and the probability that the individual has the disease after the test result is known. Depending on the test’s Se and Sp values, on the pre-test probability, and in a relationship that is defined by Bayes’ Theorem, the information gain from a positive test is usually different from the information gain from a negative test.
    1. "Rule-in" Test: A rule-in test has a large information gain when it is positive, which means that the clinician keeps the differential on the list. In a diagnostic situation, tests with very high specificity are generally rule-in tests.
    2. "Rule-out" Test: A rule-out test has a large information gain when it is negative, which means the clinician removes the differential from the list. In a diagnostic situation, tests with very high sensitivity are generally rule-out tests.
    3. Negative Likelihood Ratio: The number of times more likely that a negative test comes from an individual with the disease rather than from an individual without the disease. Equivalent to (1 - Se) / Sp.
    4. Positive Likelihood Ratio: The number of times more likely that a positive test comes from an individual with the disease rather than from an individual without the disease. Equivalent to Se / (1 - Sp).
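The likelihood ratios and the information gain described above can be combined in a short sketch. It uses the odds form of Bayes' theorem (post-test odds = pre-test odds x likelihood ratio), which is a standard identity rather than something stated in this glossary. The function names and the Se, Sp, and pre-test values are hypothetical, and information gain is taken here as the absolute change in probability, following the definition above.

```python
def likelihood_ratios(se: float, sp: float) -> tuple[float, float]:
    """Return (positive LR, negative LR). Assumes 0 < sp < 1."""
    lr_pos = se / (1 - sp)        # Se / (1 - Sp)
    lr_neg = (1 - se) / sp        # (1 - Se) / Sp
    return lr_pos, lr_neg


def post_test_probability(pre_test: float, lr: float) -> float:
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test / (1 - pre_test)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


def information_gain(pre_test: float, se: float, sp: float) -> tuple[float, float]:
    """Return the absolute change in probability after a positive and a negative result."""
    lr_pos, lr_neg = likelihood_ratios(se, sp)
    gain_pos = post_test_probability(pre_test, lr_pos) - pre_test
    gain_neg = pre_test - post_test_probability(pre_test, lr_neg)
    return gain_pos, gain_neg


if __name__ == "__main__":
    # Invented tests: one highly specific, one highly sensitive.
    for label, se, sp in (("high Sp", 0.70, 0.99), ("high Se", 0.99, 0.70)):
        gain_pos, gain_neg = information_gain(0.50, se, sp)
        print(f"{label}: gain if positive = {gain_pos:.3f}, gain if negative = {gain_neg:.3f}")
```

With a pre-test probability of 0.50, the highly specific test in this example gains more from a positive result (rule-in behavior) and the highly sensitive test gains more from a negative result (rule-out behavior). The size of the difference depends on the pre-test probability chosen.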
  3. Reproducibility (Repeatability, Consistency): The degree to which a test yields the same results when repeated under identical conditions on identical specimens.
  4. Reliability: How well a procedure performs when applied by different users; that is, the degree to which different clinicians (observers) applying the procedure classify diseased individuals into the same diagnostic, prognostic or treatment categories.
  5. Cohen’s Kappa (κ): Cohen’s Kappa is a measure for comparing two tests. It is a summary measure, ranging between -1 and +1, of the level of agreement beyond chance when two tests (or observers) are classifying the same set of specimens into two or more exclusive categories (e.g., infected, not infected or normal, mild, moderate, severe, critical) with 0 being no agreement beyond that expected by chance, 1 being complete agreement, and -1 being contrary to agreement. This measure is used when two imperfect tests (or observers) are being compared rather than one test (or observer) being compared with the "Gold" or definitive standard. When the classification categories are ordered and more than three (e.g., normal, mild, severe, critical), Cohen’s Kappa often underestimates the degree of actual agreement and a weighted Kappa or other statistic captures it better. Kappa values for many components of the clinical examination range between 0.4 and 0.7, which is the source of many differences between clinicians.
    1. Intra-Rater Agreement: A measure of repeatability. The level of agreement beyond chance, typically quantified by Cohen’s Kappa, that a test or observer has with itself or themselves ("intra" = within) when repeated on the same set of materials. For example, the agreement that an observer such as a radiologist has with themselves when they unknowingly (blindly) repeat reading the same films.
    2. Inter-Rater Agreement: A measure of reliability. The level of agreement beyond chance, typically quantified by Cohen’s Kappa, that two different tests (observers) have ("inter" = between) when performed on the same materials. For example, the agreement that two observers such as two radiologists have between them when they unknowingly (blindly) read the same films or that two different diagnostic tests have when they are performed on the same specimens.
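A minimal sketch of unweighted Cohen's Kappa for two raters, using the usual formula (observed agreement minus chance agreement, divided by one minus chance agreement). The function name `cohens_kappa` and the ten film readings are invented for illustration. The weighted variant mentioned above for ordered categories is not implemented here.

```python
from collections import Counter


def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Unweighted Cohen's Kappa for two raters classifying the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must classify the same non-empty set of items")
    n = len(ratings_a)

    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented readings of the same ten films by two radiologists.
reader_1 = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
reader_2 = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg", "pos"]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(reader_1, reader_2):.3f}")
```

Here the readers agree on 8 of 10 films, but with both calling 4 of 10 positive, 0.52 agreement is expected by chance alone, so Kappa is (0.80 - 0.52) / (1 - 0.52), about 0.58. The same function applied to one reader's first and second blinded readings would give intra-rater agreement.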
