BERKSONIAN BIAS PDF
A form of selection bias arising when both the exposure and the disease under study affect selection. The healthy-worker effect, by contrast, is an example of confounding rather than selection bias (Hernan et al.). Berksonian bias: there may be a spurious association between diseases, or between a characteristic and a disease, because of the different probabilities of ...
|Published (Last):||4 May 2017|
In some situations, whether data are missing at random or missing not at random is less important than the causal structure of the missing-data process. While dealing with missing data always relies on strong assumptions about unobserved variables, the intuitions built with simple examples can provide a better understanding of approaches to missing data in real-world situations.
Joseph Berkson 1 described bias in the assessment of the relationship between an exposure and a disease arising from the conduct of the study in a clinic, where attendance was affected by both exposure and disease (Figure 1A). Figure 1A (left) shows a causal structure with an exposure E, an outcome D, and a factor C (clinic attendance) affected by both E and D. Here, I draw analogies between Berksonian selection bias and missing data. Like Berkson, I restrict the main discussion to a situation in which there is no confounding of the exposure-outcome relationship under study, and so consider only three variables: E, D, and C. The structure of this paper is as follows.
I first remark on the structure proposed by Berkson Figures 1A and 1B and on close variants of that structure as a model for both selection bias and missing data bias. I then explore the four possible causal diagrams generated by the three variables E, D, C and the further assumption that, due to temporality, C has no causal effect on either E or D.
I discuss implications of the causal structure for bias, and provide brief illustrative examples. The paper addresses additional issues in missing data, and concludes with a brief discussion. Before proceeding, it will be useful to review the standard definitions of three types of missingness (missingness completely at random, at random, and not at random) as well as the definition of complete case analysis.
Data are missing completely at random (MCAR) when the probability of missingness depends on values of neither observed nor unobserved data. Data are missing at random (MAR) when the probability of missingness depends only on observed data. Data are missing not at random (MNAR; alternately, there are non-ignorable missing data or non-random missingness) when the probability of missingness depends in part on unobserved data.
Collider bias (or collider-stratification bias, or collider-conditioning bias) 2,3,7 is bias resulting from conditioning on a common effect of at least two causes.
In Figure 1, attendance at clinic C is an effect of both exposure E and disease D. Conditioning on C can therefore induce an association between E and D; this association is represented by a dotted line in Figure 1B. Two points deserve emphasis. First, collider stratification is usually (though by no means always) explained in a situation in which exposure and disease are marginally independent; it is important to note that stratification on a collider can also introduce bias when exposure and disease are not independent.
Second, while some explanations of collider bias emphasize stratification, today we understand that similar biases are introduced by any form of conditioning on a collider, including restriction as well as stratification. While an apparently minor point, this recognition gives us a key pivot for moving from selection bias to missing data. Restriction to a single level of a collider C is strongly analogous to restricting data to persons who are not missing.
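The effect of restricting to one level of a collider can be checked with simple arithmetic on expected counts. The sketch below is illustrative only: the cohort counts and the attendance probabilities in `p_attend` are hypothetical values chosen for the example, not figures from the paper. E and D are constructed to be independent in the full cohort, yet an association appears among clinic attenders.

```python
# Hypothetical expected counts keyed by (e, d). E and D are independent here,
# so the odds ratio in the full cohort is exactly 1.
full = {(1, 1): 1000, (1, 0): 4000, (0, 1): 1000, (0, 0): 4000}

# Assumed probability of attending clinic (C = 1) given E and D.
# Both E and D raise attendance, so C is a collider.
p_attend = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 0.1}


def odds_ratio(table):
    """Cross-product odds ratio for a 2x2 table keyed by (e, d)."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


# Restricting to attenders keeps each cell in proportion to its attendance probability.
attenders = {cell: n * p_attend[cell] for cell, n in full.items()}

or_full = odds_ratio(full)
or_attenders = odds_ratio(attenders)
print(round(or_full, 3), round(or_attenders, 3))
```

With these assumed values the odds ratio among attenders is about 0.36 rather than 1, an association created purely by the restriction.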
If the study is conducted at an antenatal care clinic, then both pregnancy and a new diagnosis of AIDS may affect presence at the clinic, and conduct of the study in that setting may lead to a biased estimate of the relationship between pregnancy and time to AIDS.
Figure 2 shows a causal diagram in which neither E nor D has any causal effect on C. Thus, conditioning on C (or restricting to a level of C) is equivalent to taking a simple random sample of the original cohort.
From a selection-bias perspective, this obviously will introduce no bias; from a missing-data perspective, this is equivalent to data missing completely at random.
Neither E nor D affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling. Table 1 shows the hypothetical cohort of patients we would have observed if we had studied the effect of E on D in, for example, a population sampled at random from the total eligible population, including some who attended clinic and some who did not.
Because C is unaffected by E or D, this is equivalent to simple random sampling; we observe a fixed proportion of individuals regardless of values of E and D (in this case, some fraction f). In this case, conditioning on clinic attendance amounts to a simple random sample of size fN from the original N subjects, repeated independently for every combination of E and D. As can be readily seen in Table 2, all measures are unbiased.
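As a rough check of this claim, the following sketch scales every cell of a hypothetical 2x2 table by the same fraction f and recomputes the usual effect measures. The counts and the value of `f` are assumptions made for illustration; the paper's own Tables 1 and 2 are not reproduced here.

```python
# Hypothetical full cohort (expected counts), keyed by (e, d).
full = {(1, 1): 300, (1, 0): 700, (0, 1): 100, (0, 0): 900}

# Assumed sampling fraction: the same for every combination of E and D.
f = 0.25


def measures(table):
    """Risk difference, risk ratio, and odds ratio for a 2x2 table keyed by (e, d)."""
    risk1 = table[(1, 1)] / (table[(1, 1)] + table[(1, 0)])
    risk0 = table[(0, 1)] / (table[(0, 1)] + table[(0, 0)])
    return {
        "risk_difference": risk1 - risk0,
        "risk_ratio": risk1 / risk0,
        "odds_ratio": (table[(1, 1)] * table[(0, 0)])
        / (table[(1, 0)] * table[(0, 1)]),
    }


# Selection unrelated to E or D: every cell shrinks by the same factor f.
observed = {cell: f * n for cell, n in full.items()}

full_measures = measures(full)
observed_measures = measures(observed)
for name in full_measures:
    print(name, round(full_measures[name], 4), round(observed_measures[name], 4))
```

Because f cancels from every ratio of cells, each measure in the observed data matches the full cohort in expectation.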
Clinic attendance might be influenced by various additional factors. Independence of these additional factors and both E and D is sufficient but not necessary for lack of bias when conditioning on C. If attendance at our clinic is due only to distance of home from the clinic, and not due to pregnancy status or AIDS diagnosis (directly or indirectly), then analyses of these women will be unbiased.
Figure 3 shows a case in which exposure E is the only cause of C. From a selection-bias perspective, restricting on C will amount to simple random sampling within level of exposure; from a missing data perspective, data are missing at random, or completely at random within level of exposure.
As can be ascertained from Table 3, a crude estimate of exposure or disease prevalence will in general be biased under these conditions. However, because data are missing completely at random within exposure category, the risk by exposure status can be calculated without bias. In consequence, all contrasts of risks, including risk differences, risk ratios, and odds ratios, are unbiased in this setting.
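A small numerical sketch may help make this concrete. The counts and the exposure-specific attendance probabilities below are hypothetical, chosen only to illustrate the structure in which E alone affects C.

```python
# Hypothetical full cohort (expected counts), keyed by (e, d).
full = {(1, 1): 300, (1, 0): 700, (0, 1): 100, (0, 0): 900}

# Assumed probability of attending clinic, depending on exposure only.
p_attend_by_e = {1: 0.2, 0: 0.8}
observed = {(e, d): n * p_attend_by_e[e] for (e, d), n in full.items()}


def risk(table, e):
    """Risk of disease among those with exposure level e."""
    return table[(e, 1)] / (table[(e, 1)] + table[(e, 0)])


def disease_prevalence(table):
    """Crude proportion with disease, ignoring exposure."""
    return (table[(1, 1)] + table[(0, 1)]) / sum(table.values())


# Exposure-specific risks survive the selection; the crude prevalence does not.
print(risk(full, 1), risk(observed, 1))
print(risk(full, 0), risk(observed, 0))
print(disease_prevalence(full), disease_prevalence(observed))
```

Under these assumed values the risks stay at 0.3 and 0.1, while the crude disease prevalence shifts from 0.2 to about 0.14 because the exposed are under-represented among attenders.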
E, but not D, affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling within level of E. However, in real-data analysis it is almost never the case that the causal diagram is as simple as Figure 3; with more complications, it is less likely that this condition will hold.
For example, if we add to Figure 3 a third variable F that causes both C and D, then C is a collider for E and F, and conditioning on C biases the E-D relationship via F (as in a figure in the book by Rothman and colleagues). Assume our clinic does not provide extensive antenatal care beyond antiretroviral therapy, and so attendance at our clinic is lower among women after they become pregnant.
If attendance is not affected by AIDS diagnosis or any other factors, then a contrast of risk of AIDS comparing pregnant and non-pregnant women attending our clinic will be unbiased.
Figure 4 shows a case in which disease status D is the only cause of C. Conditioning on C leads to simple random sampling within level of the outcome (Table 4).
Berkson’s bias, selection bias, and missing data
As with Figure 3, the causal structure in Figure 4 leads to biased estimates of prevalence; but in addition, this structure leads to biased estimates of risk. D, but not E, affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling within level of D. This structure resembles a case-control study, in which selection depends on outcome status. In such a case-control study, the case-control odds ratio provides an unbiased estimate of the cohort odds ratio; this is true in Table 4 as well.
Just as in such a case-control study, we are unable to directly estimate absolute risks, risk differences, or risk ratios without additional information. Thus if outcome status is the sole direct cause of selection into a study or analysis, or of missing data, the study is analogous to a case-control study under a particular control-sampling scheme: the cohort odds ratio will be unbiased in complete case analysis, assuming no additional variables of interest, as in previous examples.
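The contrast between the odds ratio and the risk ratio under outcome-dependent selection can be sketched numerically. The cohort counts and the outcome-specific selection probabilities below are hypothetical and serve only to illustrate the structure in which D alone affects C.

```python
# Hypothetical full cohort (expected counts), keyed by (e, d).
full = {(1, 1): 300, (1, 0): 700, (0, 1): 100, (0, 0): 900}

# Assumed probability of being observed, depending on outcome only.
p_observed_by_d = {1: 0.9, 0: 0.3}
observed = {(e, d): n * p_observed_by_d[d] for (e, d), n in full.items()}


def risk_ratio(table):
    risk1 = table[(1, 1)] / (table[(1, 1)] + table[(1, 0)])
    risk0 = table[(0, 1)] / (table[(0, 1)] + table[(0, 0)])
    return risk1 / risk0


def odds_ratio(table):
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


rr_full, rr_observed = risk_ratio(full), risk_ratio(observed)
or_full, or_observed = odds_ratio(full), odds_ratio(observed)
print(round(rr_full, 3), round(rr_observed, 3))
print(round(or_full, 3), round(or_observed, 3))
```

In the odds ratio the selection probabilities for cases and for non-cases each appear once in the numerator and once in the denominator, so they cancel; in the risk ratio they do not. With these assumed values the risk ratio moves from 3 to 2.25 while the odds ratio is unchanged.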
However, when the true effect of an exposure on the outcome is null, bias will not be introduced into the risk difference and risk ratio. Assume that women are more likely to miss clinic visits if they become seriously ill, and so attendance in clinic is affected by AIDS status.
If attendance at clinic is not affected by pregnancy status or any other factors and there is a non-null association between pregnancy and time to AIDS, then the risk difference and risk ratio for AIDS comparing pregnant and non-pregnant women will generally be biased, while an odds ratio for AIDS comparing pregnant and non-pregnant women will generally be unbiased. When both E and D affect C, one critical special case is when E and D are non-interacting in their effects on C. In this case, Table 5 reduces to Table 4 and the odds ratio is unbiased in expectation.
E and D affect factor C, so conditioning on or restricting to a level of C amounts to simple random sampling within level of both E and D.
If attendance at our clinic rises during pregnancy and with an AIDS-defining event, and if attendance changes synergistically with both pregnancy and AIDS together, then contrasts of the risk and odds of AIDS comparing pregnant and non-pregnant women will generally be biased.
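The difference between non-interacting and synergistic effects on attendance can be illustrated with expected counts. This is a sketch under an assumption that is mine rather than the source's: "non-interacting" is read here as selection probabilities that factor into an exposure term times an outcome term. All counts and probabilities are hypothetical.

```python
# Hypothetical full cohort (expected counts), keyed by (e, d).
full = {(1, 1): 300, (1, 0): 700, (0, 1): 100, (0, 0): 900}


def odds_ratio(table):
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


def select(table, p_select):
    """Expected counts after selection with cell-specific probabilities."""
    return {cell: n * p_select[cell] for cell, n in table.items()}


# Assumed non-interacting selection: P(C=1 | e, d) = a[e] * b[d].
a = {1: 0.5, 0: 0.8}
b = {1: 0.9, 0: 0.3}
p_product = {(e, d): a[e] * b[d] for e in (0, 1) for d in (0, 1)}

# Assumed synergistic selection: attendance is especially high when E = 1 and D = 1.
p_synergy = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.5, (0, 0): 0.4}

or_full = odds_ratio(full)
or_product = odds_ratio(select(full, p_product))
or_synergy = odds_ratio(select(full, p_synergy))
print(round(or_full, 3), round(or_product, 3), round(or_synergy, 3))
```

The observed odds ratio equals the full-cohort odds ratio multiplied by the cross-product ratio of the selection probabilities. That factor is 1 when the probabilities factor as a[e] * b[d], and is 3.6 for the synergistic values assumed here.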
Recall that data are missing at random when the probability of missingness depends on observed data, and are missing not at random when the probability of missingness depends at least in part on the missing data themselves. Figure 3 showed a situation in which missingness is caused by exposure alone, and complete case analysis can be expected to yield unbiased risk differences, risk ratios, and odds ratios.
But this figure does not specify which variable was missing as a result of the exposure. In particular, then, the discussion of Figure 3 applies whether the exposure caused missingness in the outcome (and so data are missing at random), or whether the exposure caused missingness in the exposure (and so data are missing not at random). Whether the value of the exposure led to missing outcome, or to missing exposure, missingness remains completely at random within levels of the exposure and so equivalent to simple random sampling by exposure level.
Thus, even when these data are missing not at random, the complete case analysis yields unbiased estimates of the risks, risk differences, risk ratios, and odds ratios. Echoing earlier examples, pregnancy status alone might make it more likely that pregnancy status is missing (not at random), or that AIDS status is missing (at random). Figure 4 is also compatible with a missing-at-random condition; for example, if the value of the outcome caused the value of the exposure to be missing, then missingness would depend on observed data alone.
But even when these data are missing at random, the complete case analysis yields biased estimates of the risks, the risk difference, and the risk ratio, with the odds ratio remaining unbiased. However, when data are missing at random (and models are fit correctly), both weighting 15 and multiple imputation 16 approaches can be used to obtain unbiased estimates of the risk difference and risk ratio.
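One way to see how weighting can help in the missing-at-random case is the following sketch of inverse probability weighting, in which the outcome D is recorded for everyone but the exposure is recorded less often among non-cases. This is a simplified illustration, not the specific method of the cited references: the counts and recording probabilities are hypothetical, and the weights are computed from expected counts rather than estimated from a fitted model.

```python
# Hypothetical full cohort (expected counts), keyed by (e, d).
full = {(1, 1): 300, (1, 0): 700, (0, 1): 100, (0, 0): 900}

# Assumed probability that exposure is recorded, depending only on outcome D.
p_recorded_by_d = {1: 0.9, 0: 0.3}
complete = {(e, d): n * p_recorded_by_d[d] for (e, d), n in full.items()}


def risk_ratio(table):
    risk1 = table[(1, 1)] / (table[(1, 1)] + table[(1, 0)])
    risk0 = table[(0, 1)] / (table[(0, 1)] + table[(0, 0)])
    return risk1 / risk0


# D is observed for everyone, so the weight for each level of D uses only
# observable quantities: (all subjects with D = d) / (complete cases with D = d).
# Only the D margin of `full` is used here, never the unobserved exposure.
weights = {}
for d in (0, 1):
    n_all = full[(1, d)] + full[(0, d)]
    n_complete = complete[(1, d)] + complete[(0, d)]
    weights[d] = n_all / n_complete

# Each complete case stands in for itself and for similar subjects with missing exposure.
weighted = {(e, d): n * weights[d] for (e, d), n in complete.items()}

rr_full = risk_ratio(full)
rr_complete = risk_ratio(complete)
rr_weighted = risk_ratio(weighted)
print(round(rr_full, 3), round(rr_complete, 3), round(rr_weighted, 3))
```

Under these assumptions the complete-case risk ratio is 2.25, while the weighted analysis recovers the full-cohort value of 3. In real data the weights would be estimated and would carry sampling error.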
Analogies between selection bias and missing data have been made implicitly by other authors, but these analogies are not a routine part of teaching and understanding these subjects. Just as others have argued with regard to selection bias 2,3 and overadjustment bias, 17,18 I here argue that structural considerations are critical for assessing the impact of missing data on estimates of effect.
If the exposure is the only cause of missingness (Figure 3), then whether data are missing at random or missing not at random is largely inconsequential: in either case, complete case analysis yields unbiased risk differences, risk ratios, and odds ratios. If the outcome is the only cause of missingness (Figure 4), then it is likewise moot as to whether data are missing at random or missing not at random: in either case, complete case analysis yields biased risk differences and risk ratios but an unbiased odds ratio. In these simple settings at least, it is the structure of the data, not whether the data are missing at random or not at random, that leads to bias in complete case analysis.
Of course, in the presence of a third variable (that is, in the majority of real-world data analytic situations) these statements require closer consideration. Although structure is key to understanding missing data as well as selection bias, whether data are missing at random or not at random remains important because key methods for coping with missingness depend on these assumptions.
Multiple imputation makes a missing-at-random assumption, for example, 16 and equivalent assumptions are made for inverse probability-of-censoring weights.
Throughout this paper, I have noted that bias may be introduced by various selection mechanisms, but without attempting to quantify the bias. Bias is likely to be small when the amount of missing data is small at all levels of the exposure and disease (and, in other scenarios, the covariates). 14 The amount of bias observed in any real-world situation will depend on specifics. As future work, it may be useful to characterize realistic values of such variables, and to attempt to estimate the amount of bias that might be introduced by such values.
Several caveats to this work should be noted. First, the situations explored here are quite simplified. The causal diagrams do not include confounders, which might occur even in a randomized setting. But as well, the causal diagrams do not include external risk factors for the outcome; this absence is essentially never the case even in a trial.
This may be a particular problem if the external risk factor for the outcome is also a cause of missingness (or selection); such external factors would be a subject of future work. An additional limitation of the present discussion is that it ignores random error. Of course, I do not intend to suggest that any bias discussed here is deterministic; as Greenland 2 noted, biases correspond to asymptotic biases.
For example, collider bias is selection bias, but need not result in missing data, 2,3,7 as in the birth-weight paradox. Despite their simplified nature, these examples can help build intuition for the subjects at hand, and may find application in many settings.
One particular setting, of course, is antiretroviral therapy treatment cohorts among HIV-positive individuals in sub-Saharan Africa. Vital status is a key outcome of interest in such settings, where there are high rates of loss to follow-up or drop-out, 20,21 for which death is a relatively common reason. Vital status may sometimes be the dominant cause of loss to follow-up. This is an area where a more structural approach to missing data may be of benefit; in addition, this is a specific situation in which simulation studies might focus on quantifying the degree and amount of bias introduced by missing data.
The application of any analytic method to missing data relies on strong assumptions about the processes that have led to missing data; if those assumptions are incorrect, then results of analysis will be misleading. In all cases, sensitivity analysis of well-defined and transparent scenarios will provide the most robust, and most responsible, inference.
I am grateful to Sander Greenland, Charles Poole, and Lynne Messer for their helpful comments on and discussions of this work.