
Pre-Lecture: Causal Inference


  • Notation

    • A: Exposure

    • Y: Outcome

    • L: Measured known covariates

      • Can condition

    • U: Unmeasured covariates

      • Can never condition

  • Counterfactual 

    • What would have happened to the exposed had they not been exposed?

    • What would have happened to the person who got treatment A, had they got treatment B?

  • The Causal Effect for an Individual

    • Potential outcomes or counterfactual outcomes:

    • Y^(a=1)

      • Y under treatment a=1

      • Outcome variable that would have been observed under treatment a=1 

    • Y^(a=0) 

      • Outcome variable that would have been observed with no treatment

    • The treatment A has a causal effect on an individual’s outcome
      Y if: 

      • Y^(a=1) ≠ Y^(a=0)

    • Causation

      • DIFFERENT RISK IN THE ENTIRE POPULATION UNDER TWO EXPOSURE VALUES  

      • Pr[Y^(a=1) = 1]: risk in all subjects of the population had they received the counterfactual exposure level a

        • If everybody had been exposed, what’s the probability of the outcome?

      • Causal Risk Ratio: Pr[Y^(a=1) = 1] / Pr[Y^(a=0) = 1]

        • Probability of the outcome had everyone been exposed / Probability of the outcome had everyone not been exposed

        • But we don’t have this!

        • = 1 if no causal effect 

    • Association

      • DIFFERENT RISK IN TWO DISJOINT SUBSETS OF THE POPULATION DETERMINED BY THE SUBJECTS’ ACTUAL EXPOSURE VALUE 

      • Pr[Y=1|A=1] is the risk in subjects of the population that meet the condition “having actually received exposure level a”

        • Population that you have outcome Y given that you received treatment A

      • Associational Risk Ratio: Pr[Y=1|A=1] / Pr[Y=1|A=0] (= 1 if no association)

        • Probability of outcome for those who were treated / Probability of outcome for those who weren’t treated
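
The causation-versus-association contrast above can be sketched numerically. A minimal simulation, assuming a hypothetical confounder U that raises both the chance of exposure and the risk of the outcome (all numbers are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: confounder U raises both
# the probability of exposure A and the risk of outcome Y.
U = rng.random(n) < 0.5
A = rng.random(n) < np.where(U, 0.8, 0.2)

# Potential (counterfactual) outcomes for every subject
Y1 = rng.random(n) < np.where(U, 0.4, 0.2)   # Y under a=1
Y0 = rng.random(n) < np.where(U, 0.3, 0.1)   # Y under a=0
Y = np.where(A, Y1, Y0)                      # observed outcome

# Causal risk ratio: Pr[Y^(a=1)=1] / Pr[Y^(a=0)=1] (needs both counterfactuals)
causal_rr = Y1.mean() / Y0.mean()

# Associational risk ratio: Pr[Y=1|A=1] / Pr[Y=1|A=0] (what the data give us)
assoc_rr = Y[A].mean() / Y[~A].mean()

print(round(causal_rr, 2), round(assoc_rr, 2))
```

Here the associational RR (roughly 2.6) overstates the causal RR (roughly 1.5) because U opens a backdoor path between A and Y.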

  • [Figure: population shown as a diamond — causation vs. association]

    • Causal: need to know what happened if whole diamond got treatment and whole diamond didn’t get treatment

    • Association: each half of diamond gets either treatment or no treatment and are compared to each other 


  • What we want to be true

    • Pr[Y^(a=1) = 1] = Pr[Y=1|A=1] 

    • Pr[Y^(a=0) = 1] = Pr[Y=1|A=0] 

    • We want the probability of the outcome for people who were treated to be the same as the probability of the outcome had everyone been treated

    • We want the probability of the outcome for people who were untreated to be the same as the probability of the outcome had everyone been untreated


  • Confounding via the counterfactual

    • Confounding arises when the outcome in the truly non-exposed differs from what would have occurred in the exposed group in the absence of exposure

      • Pr[Y^(a=1) = 1] ≠ Pr[Y=1|A=1] 

  • Conditional exchangeability

    • Critical criterion for causal inference

    • Exchangeability says within levels of L (measured covariates):

      • Exposed subjects would have had the same risk as unexposed subjects had they been unexposed

      • Unexposed subjects would have had the same risk as exposed subjects had they been exposed

      • In other words, a group of exposed people is a fine approximation for the counterfactual of the people who didn’t get the exposure

    • Conditional exchangeability says

      • There’s something about these groups that is different (usually confounders)

      • Something different about those who are exposed vs not exposed

        • Ex. Those who choose/are able to take treatment and those who don’t

      • When we look within levels of the confounder (ex. Gender) we should have exchangeability

        • Ex. When we look at just women do we have exchangeability? When we look at men do we have exchangeability?

      • Goal when trying to control confounding is to achieve the greatest degree of conditional exchangeability possible
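
Under conditional exchangeability within levels of L, the counterfactual risks can be recovered by standardization: Pr[Y^(a) = 1] = Σ_l Pr[Y=1|A=a, L=l] · Pr[L=l]. A small sketch with made-up stratum-specific risks (illustrative numbers only):

```python
# Distribution of the measured confounder L (illustrative)
p_L = {0: 0.5, 1: 0.5}

# Stratum-specific observed risks Pr[Y=1 | A=a, L=l] (illustrative)
risk = {
    (1, 0): 0.2, (1, 1): 0.4,   # treated
    (0, 0): 0.1, (0, 1): 0.3,   # untreated
}

def standardized_risk(a):
    # Weight each stratum-specific risk by the prevalence of that stratum;
    # this equals Pr[Y^(a)=1] only under conditional exchangeability given L.
    return sum(risk[(a, l)] * p_L[l] for l in p_L)

causal_rr = standardized_risk(1) / standardized_risk(0)
print(standardized_risk(1), standardized_risk(0), round(causal_rr, 2))
```

The crude comparison would mix the strata in proportion to who actually got treated; standardization forces both arms to share the same L distribution.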


Directed Acyclic Graphs (DAG)

  • Aka causal diagrams

  • Counterfactual models underlie DAGs


DAG basics

  • Time as an invisible X-axis (Helpful when in longitudinal structures)

  • Directed – edges/arrows imply direction

  • Acyclic – no cycles, variable cannot cause itself

  • Graph


  • Simple picture that:

  • 1. Encodes subject-matter knowledge

  • 2. Encodes our assumptions


  • Under the Null

[DAG: under the null — L causes both A and Y; no arrow from A to Y]

  • A is not associated with Y

    • They are independent of one another (no arrow between them)

  • L (confounder) is associated with A and Y

    • Precede exposure and outcome in time

    • Are associated with exposure and outcome 

    • Not an intermediate in the pathway between A and Y

  • Even though A and Y are not associated with each other, there’s a backdoor path between them

  • Structural / causal properties of confounders:

    • precede both the exposure and outcome in time

    • associated with exposure and the outcome

    • not an intermediate in the causal pathway between A and Y

  • We say we’ve conditioned on L to block it (accounting for it in some way)

    • When conditioned, the backdoor path of association between A and Y is blocked

    • Ex. of conditioning is restriction 



D-separation

  • Graphical rules to assess whether 2 variables are independent (vs. D-connected, which implies they are not independent)

  • Can we get these through a backdoor path?

  • Path: arrow-based route between the 2 variables in the graph

  • Rule 1:

    • If there are no variables being conditioned on, a path is blocked if and only if two arrowheads on the path collide at some variable on the path

[DAG: a path where two arrowheads collide at D]

  • D should be Y

  • 2 arrow heads come together on D, so D is called a collider 


  • Rule 2:

    • Any path that contains a noncollider that has been conditioned on is blocked

    • Conditioning on is depicted by a box

      • Think of it as a door is closed

    • When you condition on a noncollider, it blocks that path (we want the causal path open, but backdoor paths blocked)

    • So on backdoor paths we want noncolliders that are conditioned on



  • Rule 3:

    • A collider that has been conditioned on does not block a path


  • Conditioning on D opens the path between L and A

  • By conditioning on a common outcome, knowing D and having information on either L or A gives me information on the other 

  • The arrows pointing toward the collider remain

  • So we want colliders that aren’t conditioned on (conditioning on a collider leads to selection bias)



  • Rule 4

    • A collider that has a descendant (something that comes after) that has been conditioned on does not block a path (opens the path)

    • Conditioning on D opens the path between L and A
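
Rule 3 (the collider rule) can be checked by simulation: make L and A independent causes of a common effect D, then condition on D and watch an association appear. All numbers below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# L and A are independent; both cause the collider D
L = rng.random(n) < 0.5
A = rng.random(n) < 0.5
D = rng.random(n) < np.where(L | A, 0.9, 0.1)   # D likely if either cause present

# Marginally, L and A are independent (risk ratio near 1)
marginal_rr = L[A].mean() / L[~A].mean()

# Conditioning on the collider (restricting to D=1) opens the path:
# among D=1, knowing A tells you something about L
conditional_rr = L[A & D].mean() / L[~A & D].mean()

print(round(marginal_rr, 2), round(conditional_rr, 2))
```

Among subjects with D=1 the ratio falls well below 1 even though L and A are causally unrelated; this induced association is the structure behind selection bias.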




  • If exposure precedes disease  (A→Y) (as we want it to) then the overall association has 2 components

    • 1. Spurious association due to the sharing of a common cause- CONFOUNDING

    • 2. Causal effect of A on Y- THE GOAL


  • Example 1: Interested in A→D – is there confounding?

    • [DAG for Example 1 — L is a collider on the path between A and D]


  1. First, assume that the null is true



  2. Can you still get from the outcome to the exposure? (make sure you don’t include the arrow from exposure to outcome when considering colliders)

  • Then look for d-separation to determine if A ⊥ D

  • ⊥ means independent

  • In this case, L is a collider and it isn’t conditioned on, so the path through L is blocked (no open backdoor path between A and D): NOT A CONFOUNDER

  • If you condition on the collider, it opens up the path between A and D (an open backdoor path)


Open backdoor path = confounding


Example 2: Interested in A→D – is there confounding?




1. Suppose the null is true (remove arrows from A)

2. Is there a back door path? (YES)

  • Arrow direction doesn’t matter UNLESS there are colliders

3. Can we block it?

  • YES, by conditioning on L


[DAG: U is an unmeasured confounder; L is correlated with U]

  • U is confounder

  • Should we adjust for L?

    • No backdoor path that L is involved in

    • But U and L are correlated

      • How much adjusting for L helps is dictated by the correlation between U and L


  • How do people look for confounding?

    • Change in estimate

    • Comparing crude and adjusted (without cutpoint)

    • A priori knowledge and DAG

    • Autopilot (what others have done, period)

    • Automated selection methods

    • Satisfying the definition of a confounder


What if you don’t know which DAG is correct?

  • Sensitivity Analyses




  • Must distinguish between incidence and prevalence

  • How do you define an individual?


Steps to get diagnosis

  • Takes a while to go to the doctor

  • Doesn’t get better

  • Primary care physician runs tests (symptomatic but not yet diagnosed)

  • Referred to specialist

  • WHAT HAPPENS WHEN PEOPLE AREN’T IN YOUR SYSTEM


Methodologic Challenges – Incidence vs Prevalence

  • Is the first time that a patient is seen in a dataset with a specific diagnosis when they are diagnosed?

  • It depends

    • We need to understand where these data come from

  • If exposure comes a lot earlier than the outcome, this may be an issue


Defining outcome

Timing of outcomes with complex multistage diagnoses 

  • Multiple sclerosis (>1 episode) (have to have outcome a few times before getting diagnosed?)

  • Systemic lupus erythematosus (avg 2-4 yrs onset to dx)

  • Myocardial infarction

  • Prostate cancer


  • What is the time that outcome occurs

    • 1st suspected MS episode?

    • Symptom onset?

    • Admission to the emergency department? 

    • Elevated PSA?


  • If exposure can change over time (ex. Someone is on a medication, then off the medication), how does the outcome get assigned?

  • Do you define onset as when first diagnosed?


  • Why people like using Kaiser (closed system)

    • Primary care, pharmacy, specialists all there


  • Case reports and case series

    • Important in evidence-based practice

    • Often first line of evidence, hypothesis generating

    • Not stand-alone nor definitive; prone to selection bias

    • No comparator group

    • Describe rare clinical events or unusual manifestations

    • Describes series of cases


  • Ecologic study

    • Ex.

[Figure: ecologic study example — fat intake vs. breast cancer at the country level]

    • Don’t know if people who had high fat intake had breast cancer, don’t know at individual level

  • Ecological fallacy: the associations observed at the population or group level may not hold up when looking at the same association among individuals within the group

    • Use of aggregate data to draw individual level inferences

  • Why conduct ecological studies?

    • Individual level-study is not possible

    • Measurement impossible

    • Design not possible, including unethical

    • Relatively new hypothesis

    • Time or money is limited but data are easily available

    • Interested at the ecological level


  • Ecologic as a level

    • Can still be interested in biologic hypothesis/mechanism but have ecologic-level exposure data (e.g. environmental measurements)

    • Can have ecologic exposure and outcome measures

    • Might be interested in group and individual level effects


Cross-sectional studies


  • Snapshot of a community or group

  • Exposure and outcome are measured at the same time

  • Usually captures prevalent outcomes/disease

  • Can be descriptive and/or analytic


  • Relationship Between Prevalence, Incidence Density/Rate and Duration of Disease:


P= I X D


where, I = incidence rate, P = prevalence,  D = duration
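
A quick worked example of P = I × D (the relationship holds in a steady state with a rare outcome); the numbers are illustrative:

```python
# Steady-state relationship: prevalence = incidence rate x mean duration
incidence_rate = 0.002   # 2 new cases per 1,000 person-years
duration_years = 5.0     # average time spent with the disease

prevalence = incidence_rate * duration_years
print(prevalence)   # about 0.01, i.e. 1% of the population has the disease
```

This is why a long-duration disease can have high prevalence even when incidence is low, the key caution when interpreting cross-sectional data.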


  • Prevalent outcome interpretation

  • With prevalence, you don’t know if exposure causes incident disease, exposure affects survival with disease, or disease causes exposure

  • If we see an association between a prevalent outcome and exposure it may be that:

1. Exposure → incident disease

2. Exposure → survival with disease

3. Disease → exposure


  • Cross sectional studies use Point Prevalence Rate Ratio (PPRR) to estimate the relative risk 

    • 2 types of potential bias will differentiate these 2 measures

      • The ratio of the disease durations

      • The ratio of the complements of the point prevalence estimates in the exposed and unexposed groups.


Cohort Studies

  • Extension of RCT into observational study realm

  • If this question could be answered by a randomized experiment, what would that experiment look like?

  • Why is this appealing?

  • How does randomization change a DAG?

  • “Causal inference from observational data then revolves around the hope that the observational study can be viewed as a conditionally randomized experiment.”


  • How does the DAG change with randomization?

[DAG: randomization removes the arrow from C to the exposure E]

  • Arrow from C to D stays

  • BUT arrow from C to E disappears, getting rid of backdoor path


  •  Simplest transition from experimental studies

    • Except that subjects “choose” exposure rather than it being assigned by investigator


What is a cohort?

  • A group of individuals (potential subjects)

    • defined on the basis of the presence or absence of exposure to a suspected risk factor for a disease 

    • whose disease or mortality is measured over time 


  • We want to know the effect of an exposure on the occurrence of a particular outcome during some observation period.

  • We determine what the outcome is in the exposed group.

  • Do the same for unexposed group



For external comparison group, use source population (represents what happens to population in absence of exposure)




Old Test Question: 

  • What is the randomized experiment that we would like to conduct (but cannot)?

  • How does the observational study emulate that randomized experiment?

  • (Note: Target trials and emulating a trial to come later this term)




  • When we use the OR as an estimate of the RR

    • There is a “built in” bias, which is away from the null hypothesis

    • The OR is always further away from 1.0 than the RR

    • The two are closest when the disease is rare


  • OR vs. RR: Advantages

    • OR can be estimated from a case-control study.

    • OR can be estimated from logistic regression.

    • OR of an “event” is the reciprocal of the OR of a “non-event.”
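
The OR-versus-RR behavior above can be seen in a single 2x2 table (counts are made up for illustration):

```python
# 2x2 table: exposed = 30 cases / 70 non-cases, unexposed = 10 cases / 90 non-cases
a, b = 30, 70
c, d = 10, 90

rr = (a / (a + b)) / (c / (c + d))   # risk ratio: 0.30 / 0.10 = 3.0
odds_ratio = (a * d) / (b * c)       # cross-product: (30*90)/(70*10), about 3.86

print(round(rr, 2), round(odds_ratio, 2))   # the OR is farther from 1.0 than the RR
```

If the same exposure effect were applied to a rarer disease (say 3 vs 1 cases per 100), the OR would shrink toward the RR: the rare disease assumption in action.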


Closed Cohort

  • Once a member, always a member

  • Fixed population

  • Membership-defining event

  • The number of people in the cohort can be counted and is fixed at baseline (T0)


Open cohort: people come in and out

  • Because people come in and out of the study population, they contribute person-time for the time that they are observed

  • Person-time accrues from a changing population of individuals

  • Can account for differing length of follow-up as well as loss to follow-up

  • Participants can contribute time to multiple exposure categories

  • Person-time may not be intuitive



Follow-up

Participants are followed for the outcome of interest, and therefore must be at risk for the outcome.

In practice we can only follow-up until:

  • Outcome

  • Death

  • Emigration

  • Drop out

  • Lost for unknown reasons


  • Prospective Cohort Study: Exposures are measured by the investigator before the outcomes have occurred


  • Retrospective Cohort Study: Exposures are measured by the investigator after the outcomes have occurred


  • Cohort is not the same as prospective (can be either prospective or retrospective)


  • Relative risk is not the same as risk ratio

    • Relative risk is generic term for any ratio based measure of association (odds ratio, rate ratio, hazard ratio, risk ratio)

    • Risk ratio = risk of outcome in exposed/risk of outcome in unexposed


  • Induction vs Latent Period

    • Induction: how long it takes exposure to induce disease

    • Latent Period: from the time the event happens until it’s on our radar

    • Induction time is an important part of the study hypothesis



  • In studies of chronic exposures, it is easy to confuse the time during which exposure occurs with the time at risk of exposure effects

  • Ex. Atomic bomb has very long risk period due to exposure

  • WHAT ABOUT WHEN EXPOSURE CHANGES OVER TIME?

    • Start treatment A, switch to treatment B.

    • If outcome occurs, when is it attributed to A and when to B?


  • Changing exposure

    • There are many exposures where patients vary between exposed and unexposed, or vary across different exposures.

    • How do you handle events that occur during transition periods or shortly after treatment switch?

      • Assumptions about washout, induction, and the underlying biology are all important in making these decisions.


Immortal time bias

  • Time under observation or follow-up during which the outcome could not have occurred (ex. a heart transplant patient counted as exposed before actually receiving the transplant; that wait time should technically be counted as unexposed, since they had to survive it to get the exposure and were essentially immortal during it)

  • Also has been referred to as survivor treatment selection bias in some studies
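
A toy person-time calculation, with made-up numbers, showing how misclassifying the pre-transplant wait as exposed time biases the exposed rate downward:

```python
# One hypothetical patient: waits 2 years for a transplant (immortal time,
# since dying while waiting would remove them from the exposed group),
# then lives 3 years on treatment before the outcome occurs.
wait_years = 2.0
exposed_years = 3.0
events = 1

# Biased: the immortal wait is counted as exposed person-time
biased_rate = events / (wait_years + exposed_years)

# Correct: the wait contributes unexposed person-time
correct_rate = events / exposed_years

print(round(biased_rate, 3), round(correct_rate, 3))
```

The diluted denominator makes the exposure look protective; summed over a whole cohort, this is how immortal time bias manufactures spurious survival benefits.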




Case-Control Study


How do we get controls?

  • We generally understand where cases come from…

  • The biggest concern in case-control studies is control selection

  • Imagine underlying hypothetical cohort

  • Cases are all cases that occurred in the hypothetical cohort during the study

  • Controls are selected from among those without the disease of interest (non-cases)

  • Nested case control study: have cohort and do case control study within it (you know your entire cohort, but aren’t doing the full cohort study)

    • ex. Nurse’s study, do experiments on subset


  • 2 main ways for sampling for controls


  1. Cumulative incidence sampling

  • Wait until the end of follow-up (assuming an underlying closed cohort) and sample all cases regardless of when they occurred.

  • The controls are those that did not get the outcome

  • People who survive all that time may be different!!!! (healthier, very adherent to certain lifestyle)


  2. Risk-set sampling (Incidence density sampling)


  • Choosing control from cohort when a case becomes a case (?)

    • Every time a case becomes a case, you pull a control (can be matched or unmatched) from other members of the cohort that don’t yet have the disease 

  • Advantage: samples on person-time

    • Matching cases and controls to be eligible up until the same amount of time

  • Sampling must be independent of exposure

  • Controls are matched to cases on time at risk (same amount of follow-up time)

  • Because controls are matched on time, the probability of being selected is proportional to an individual’s person-time in the study base

  • Someone who is a control at one time can later be a control again as well as a case

  • BEST WAY (LESS BIAS)




  • Are there differences in the type of controls we get from these two different sampling mechanisms?

    • YES


Rare disease assumption & sampling

  • Cumulative incidence sampling requires that the outcome is rare for the OR to approximate the risk ratio

  • The rare disease assumption is not needed with risk-set (density) sampling of controls for the OR to approximate the rate ratio


Nested vs. Non-nested case-control studies

  • Nested: hypothetical cohort is real…

  • Non-nested: think about hypothetical cohort

  • Goal of controls is to sample the study base to get an unbiased estimate of the exposure distribution in population that gave rise to the cases

  • Incidence density sampling preferable




  • Where does the information on controls come from?

    • Source population: the larger population that the cohort was derived from

      • How does it compare to the study population?

        • These are the individuals in my cohort, sampled from the source population

      • And what is this study base that I keep hearing about?

        • The person-time from which the cases arise, out of the cohort


Primary Study Base (Population Controls)

  • Base population identified first 

  • Cases identified from this population (or person-time experience)

    • Nested case-control

    • Roster

  • Enumerated (everyone in source population is enumerated)


Secondary Study Base

  • Source of cases identified first

  • Thinking backwards

  • Investigator determines where the cases came from

    • Cases from a hospital with no pre-defined base population 

    • Disease clusters

  • More error prone


Where can I find controls?

  • Population register 

  • Neighborhood 

  • Friends/family 

  • Hospitals

  • Dead case = dead control

  • Random digit dialing


Population controls (primary base)

  • Often relies on a roster or register (same study base)

  • In the absence of a roster or register, it is possible that not every person eligible has the same chance of being selected (possible selection bias)

    • Random digit dialing

      • May cause problems because some people have multiple phone numbers (no longer proportional)

    • Neighborhood controls (residences instead of phone numbers)

      • May cause problems due to environmental exposures


Hospital or disease registry controls

  • Not all cases within a hospital are the same

    • Ex. Multiple sclerosis patients referred to an academic center and those who live nearby

      • Some disease groups may travel from really far because specific/complicated conditions

      • Other disease groups may be from near by

    • Therefore, how does one control group reflect the same differences in referral patterns within that one academic center? (some people come from far for complicated stuff, some people come from close)

  • Secondary study base may not be identifiable

  • Berkson’s Bias

    • If exposure is related to risk of being hospitalized with the “control disease”, then the distribution of exposure in our control group will be different from the distribution in the study base.

    • (control is more likely to be exposed)


Friends and family as controls

  • Relying on the case to identify controls (behaviors are more similar in family and friend groups)


Dead controls

  • Records or family interviews for exposure information 

  • No longer at risk of the outcome

  • What if case is dead?

  • What if exposure is related to mortality?


Questions:





False, you can leave them in. They still get an outcome and they are unexposed (there are lots of reasons people can be exposed or unexposed)

For controls, the population must be at risk for the outcome, NOT the exposure


  1. What about women who had a hysterectomy before menopause? Should they be excluded?


  • No, not at risk for the outcome


Matched case-control studies

  • Cases are matched to controls with respect to important confounders

  • Case:Control ratio can vary from 1:1 to 1:x (x>1)

  • Remember that matching makes cases and controls more similar to each other than they would be with random sampling


Applications of Stratified Analysis Methods

  • Analysis of matched data involves the same statistical methods as used for unmatched data. Even though many textbooks present special “matched-data” techniques, these are just special cases of general stratified methods for sparse data!

  • Stratified Analysis

    • Stratify data on the confounding variables to form strata (perhaps these are your matching factors)

    • Test of no association uses the Mantel-Haenszel chi-sq test statistic

    • Can do 1-1 matching

    • McNemar’s test: tests discordance and concordance (get info just from the discordant pairs, for matched data)
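
For 1:1 matched pairs, McNemar's statistic uses only the discordant pairs. A minimal sketch with hypothetical pair counts:

```python
import math

# Discordant matched pairs (hypothetical counts):
# b = case exposed / control unexposed, c = case unexposed / control exposed
b, c = 40, 20

mcnemar_chi2 = (b - c) ** 2 / (b + c)   # concordant pairs carry no information
matched_or = b / c                       # conditional (matched-pair) odds ratio

# Two-sided p-value from the chi-square (1 df) approximation
p_value = math.erfc(math.sqrt(mcnemar_chi2 / 2))

print(round(mcnemar_chi2, 2), matched_or, round(p_value, 4))
```

The matched OR of b/c is the same estimate the Mantel-Haenszel procedure gives when each matched pair is treated as its own stratum.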


  • What could we do differently?

    • Can we learn something by only looking at cases?

    • Can we design a study similar to a case-control design but have multiple outcomes/cases defined?

    • Can a case-control study look at multiple exposures for the same outcome?

    • Any other thoughts?


Could we do this?


  • Yes, with a subcohort


Case-cohort

  • Pick a subcohort of the full cohort

    • Randomly sampled subcohort from the entire cohort/base

    • PLUS all cases (who may or may not be in the subcohort)

  • This is your ”control” population

  • + collect information on cases that might occur outside of subcohort within larger cohort

  • Pros of this

    • Efficient: testing and detailed data not required for the entire cohort

    • Flexible: can test multiple hypotheses and multiple outcomes

    • Cases and subcohort arise from the same base (reduce selection bias)

    • Collect exposure information independent of outcome (reduce information/recall bias)

    • Subcohort can calculate person-time

    • Sampling of subcohort in proximity to study start


Case-only designs


  • Gene-Environment Interaction Studies

    • Stratify by genes and environment to see if the environment is an effect modifier/there is interaction ONLY IN CASES

      • All this question can answer is the gene-environment association

      • Effect modifier: 3rd variable that is neither the exposure nor the outcome, by which when we look at different levels of that variable, the exposure-outcome relationship changes

    • Properties of this design:

      • Estimate the association between exposure and genotype among cases

      • Do not have to worry about control selection and the corresponding biases

      • Cannot assess main effects

      • Assumption that E and G are independent in the underlying source population


  • Cross-over trials

    • Each individual is exposed to both treatments

    • But still considered randomized!!!

      • Because the treatment order is randomized


  • A washout period is defined as the time between treatment periods. Instead of immediately stopping one treatment and starting the new one, there is a period of time during which the drug from the first treatment period is washed out of the patient’s system


  • What are the strengths of cross-over designs?

    • Don’t have to identify a comparator/control population

    • Closer to the counterfactual, but without the time machine

  • Limitations

    • Assumptions about induction and washout period


  • Case-crossover

    • Observational study

    • Case serves as their own control

    • For each case, earlier time periods are selected as control periods

    • Tend to see it for acute events


  • Case-crossover assumptions

    • Triggers

      • Risk factors for outcomes in close temporal proximity

    • Acute events

    • No confounding by time-invariant factors


  • Challenges

    • Index and reference interval/times not always obvious

    • What about exposures that increase over time?

    • For example ambient air pollution

    • Solution: bidirectional/ambispective design (take day before and day after)

    • Assumption: case events will not influence subsequent exposure (e.g. if someone has a heart attack, they will not exercise the next day)


Homework 1 Answer: Was supposed to be case-control study but was cohort

  • IF effect modifier would look at in CASES (exposed) AND CONTROLS (unexposed)



Selection Bias


  • Statistical definition of bias: Bias occurs when the average value of the association measure obtained from an infinite number of studies is not the true value

  • Epidemiological Definition of Bias: deviation of results or inferences from the truth, or processes leading to such deviation. Any trend in the collection, analysis, interpretation, publication, or review of data that can lead to conclusions that are systematically different from the truth



  • What is selection bias?

    • The study population is not representative of the population one intended to analyze

    • Present when individuals have different probabilities of being included in the study sample according to relevant study characteristics

    • ISSUE OF INTERNAL VALIDITY

      • Generalizability: can it apply to the population? (not a bias)

      • Difference: internal validity asks “can I trust these results?”; external validity asks “to whom do these results apply?”

    • Most biases we have in epi can be reduced to missing data


  • Is it true that you can’t have selection bias in a prospective study? NOT TRUE

    • 1. Differential loss to follow-up

    • 2. Volunteer/self-selection

    • More in-depth discussion in “structural approach to selection bias”


  • Sampling fractions/selection probabilities

[Table: sampling fractions α, β, γ, δ for the four exposure-by-disease cells]

  • No selection bias is present if the cross product of the sampling fractions is 1 (i.e. no association between exposure and disease)

  • The cross product is called the “selection bias factor” or the “selection odds ratio” = αδ/βγ
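
The selection odds ratio can be made concrete. Assuming hypothetical sampling fractions in which exposed non-cases are under-selected:

```python
# Sampling fractions: probability that each exposure-by-disease cell
# makes it into the study (hypothetical values)
alpha = 0.8   # exposed cases
beta = 0.8    # unexposed cases
gamma = 0.4   # exposed non-cases: under-selected
delta = 0.8   # unexposed non-cases

selection_or = (alpha * delta) / (beta * gamma)   # cross product; 1 means no bias

# The observed OR is the true OR multiplied by the selection odds ratio
true_or = 1.5
observed_or = true_or * selection_or

print(round(selection_or, 2), round(observed_or, 2))
```

With these fractions the cross product is 2, so a true OR of 1.5 would be observed as about 3.0: selection alone doubles the apparent effect.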




Selection bias can occur in many designs

  • Cross-sectional

    • Prevalence/incidence bias – survival, long duration 

    • Non participation

  • Cohort

    • Non-response (complete data)

    • Unrepresentative group (E- for example) 

    • Attrition

  • Case-control

    • Case selection

    • Control selection


Self-selection bias: Study can only occur among those who volunteer


[DAG: volunteering is a collider that has been conditioned on (boxed)]

  • Because volunteering is conditioned on, put a box around it (on DAGs, put a box around a factor when conditioning on it: restricting, stratifying, adjusting for, matching on)

  • U’s are unmeasured covariates; missing data, could be confounders but not included

  • What you’re doing when you’re looking for a backdoor path is asking: “if I remove the arrow that connects my exposure to my outcome, can I start at the outcome and end up getting back to the exposure?” (backward arrows)

  • Volunteer is a collider: 2 arrows going into it

    • If you don’t do anything to a collider, the path is naturally blocked

    • But if you condition on a collider, you open the path

  • Because it is open, the path can continue (an induced backdoor path)


  • Selection bias is not only about how people are selected, but also how they are retained

    • This is called: Differential Loss to Follow-up

    • Remember: the study population is not representative of the population one intended to analyze


  • Censor = included or not included 



Depletion of Susceptibles: studying people who are gone (ex. infections: people are susceptible, get the disease, and then become immune, so the population of those susceptible is depleted)


  • Example of depletion of susceptibles

    • Those who were susceptible to getting disease, their risk was highest in the first year and then starts to drop off because they have been depleted from the population (if they were gonna have an effect, it would’ve happened shortly after)

    • Those who remain were the superwomen who were basically immune to that effect (superwomen effect)


Homework 2 Answers:  incidence vs prevalence, survivorship, depletion of susceptibles


Do the types of selection bias that we just discussed also have the potential to impact randomized trials?

  • Yes!

  • At least two different ways that the benefit of randomization may be gone:

    • 1. Post-randomization run-in phase

    • 2. Differential retention by treatment arm

    • Run in Phase: Give people treatment, just to see if they can tolerate it

    • People who end up not successfully completing run-in phase may have an adverse reaction to treatment and can’t complete study

      • Could be problem if differential

    • To fix that: randomize after the run-in, but make sure there is a proper wash-out period


  • Practicalities

    • Cost (time & money)

      • Rare outcomes require large populations

      • Long induction time for event requires long follow-up

        • Loss to follow-up may lead to bias

    • Exposure changing over time

      • How to attribute outcome to exposure

      • How to define exposure

      • But, can there be potential biases with time-varying exposures?


Collider-Stratification Bias (DAGS)  is the structural form of selection bias


Measurement Error and Misclassification


  • Confounding isn’t the only important error/bias to look out for

  • We should always be concerned for measurement error

  • Measurement error can occur at any phase of a study

    • Instrument design

    • Errors in the protocol regarding the instrument

    • Improper execution of protocol during data collection

    • Individual subject limitations

      • Memory

      • Day-to-day variability in biologic characteristics

      • Social acceptance

    • Errors during data entry and analysis

  • Measurement Error and Misclassification are the same type of bias: misclassification occurs for categorical variables, and measurement error occurs for continuous variables


  • Exposure

    • Risk factor under investigation

    • Can be ascertained a number of ways, depending upon the study

      • Questionnaire

      • Register/record data

      • Direct measurement

    • Because we are interested in the association between exposure and outcome, we need to compare outcomes among those exposed and those unexposed. 


  • Defining exposure is a key issue

    • Putting people into exposed and unexposed groups and seeing who gets the outcome and who doesn’t isn’t enough

    • How is exposure classified? 

      • Yes/no; continuous dosage; high/low/none

      • Timing of exposure 

      • Etiologically relevant time period

    • Misclassification vs Model misspecification?

      • What distinguishes is missing data

      • Misclassification: you don’t have the correct data, so the subject is placed in a category they don’t belong in

      • Model misspecification: have data but chose not to use it (have data, just didn’t model it in the right way)


  • Chronic Exposures

    • Persist over time

    • Accumulation of exposure is a function of intensity and time

    • Options include (but not limited to):

      • Maximum intensity

      • Average intensity over some time

      • Cumulative amount

    • Example:  Pack-years of cigarette smoking is a composite of duration and intensity.  Often analyses include exposure reclassified as duration of smoking or packs per day.

    • The choice of exposure metric makes implicit assumptions.

    • What are the implicit assumptions in the above example?
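One implicit assumption can be made concrete with a small sketch: pack-years treats intensity and duration as interchangeable, so very different smoking histories collapse to the same exposure value. The smokers below are invented for illustration.

```python
# Pack-years collapses intensity and duration into a single number.
def pack_years(packs_per_day, years):
    """Cumulative exposure = intensity x duration."""
    return packs_per_day * years

# Two hypothetical smokers with identical pack-years but very different
# histories -- the metric implicitly assumes these are equivalent exposures.
heavy_short = pack_years(packs_per_day=2, years=10)    # 20 pack-years
light_long = pack_years(packs_per_day=0.5, years=40)   # 20 pack-years
print(heavy_short, light_long)  # 20 20.0
```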


  • Where can we have misclassification

    • Exposure

    • Outcome

    • Confounders


Misclassification of exposure

  • Exposure misclassification related to measurement error in onset time of an acute CV event

    • Misclassification of exposure DUE TO measurement error




Classification of outcome

  • What is the trajectory?

    • Pathogenesis?

    • Subclinical and clinical manifestations?

    • Can we get insights from understanding the disease or outcome that will help us better understand the degree and form of misclassification? 

  • When is someone defined as having the outcome?

  • How can heterogeneity of disease impact our studies?

    • Ex. MS there are 3 types

    • Lumping vs splitting: do we create our own misclassification by categorizing or dichotomizing some things?


How can we prevent outcome misclassification?

  • Consider homogeneous subgroups or phenotypes

  • Think about the causal pathway are there precursors to consider?

    • Precursors instead of events!!

    • But are the precursors the only thing contributing to the event?


What about if we are talking about measurement error of a confounder?

  • Same issues as with chronic exposures: accumulation is a function of intensity and time, and options include maximum intensity, average intensity over some time, or cumulative amount

  • Example:  Pack-years of cigarette smoking is a composite of duration and intensity.  Often analyses include the confounder reclassified as duration of smoking or packs per day.

    • Because lumping together doesn’t account for dose-response effect and results in residual confounding

      • Is this residual confounding known or unknown?


Sources of data to reduce misclassification

  • Want something objective and close to the gold standard, if possible

  • imaging tests

  • pathology

  • databases

  • environmental measures

  • direct observation


Terminology:

  • Agreement: how close two measurements made on the same subject are; it is measured on the same scale as the measurements themselves

    • Kappa

  • Reliability: relates the magnitude of the measurement error in observed measurements to the inherent variability in the ‘error-free’, ‘true’, or underlying level of the quantity 

    • ICC

  • Repeatability: variation in repeat measurements made on the same subject under identical conditions

  • Reproducibility: “variation in the measurements made on a subject under changing conditions”

    • Reproducibility can influence both the validity and statistical precision of your studies.


How reliable is our measurement?

  • Is the result reproducible?

  • Test-Retest (same instrument, 2 people in time)

    • % agreement

    • Cohen’s Kappa

    • Weighted Kappa

    • Pearson correlation coefficient

    • Intraclass correlation coefficient (ICC)

  • Inter-method reliability (between tests)

    • CC

    • Sensitivity and specificity

    • Misclassification matrix




What’s the difference between Correlation and Kappa?

  • Correlation: can be correlated but doesn’t necessarily agree

    • Can you predict info of one person from the other

    • Ex. Could be negative correlation

    • When one thing changes the other does

    • Shows association but not same value

  • Kappa: how often do 2 people looking at the same thing agree, accounting for chance?


Validity

  • Do we measure what is intended to be measured?

  • Criterion validity: how well does the measure compare with a direct measure of the truth?

  • Content validity: Does the instrument capture all facets of a construct?

  • Construct validity: Instrument measures what it claims to be measuring


What is measurement error/misclassification?

  • Can be broken down into two components:

  • Systematic error (differential) – threatens validity

  • Random error (nondifferential) – threatens precision




Effects of measurement error

  • Measurement error leads to bias called: Misclassification bias or Information Bias

  • Differential misclassification

    • Detection of emphysema in smokers vs nonsmokers

    • Recall of pregnancy exposures by mothers who gave birth to a healthy baby vs. mothers whose baby had a malformation

    • Can bias in either direction

    • Time elapsed between exposure and recall is an important indicator of recall accuracy – therefore, if the time elapsed differs between the exposed and the unexposed, there could also be differential misclassification


  • Non-Differential misclassification

    • Does not depend on the status of a subject with respect to other variables 

    • More likely to bias towards the null, but not always

    • (Think of this as nondiff or diff misclassication of X with respect to Y)

    • With nondiff misclassification, the sensitivity and specificity of the measurement method are the same across groups

    • So if we are talking about the exposure, the sensitivity and specificity of the exposure measurement are the same for cases and controls.


  • Type 1 error: False Positive

  • Type 2 error: False Negative


Ex. [2×2 table: true exposure status among cases and controls]

  • Remember that we have non-differential misclassification here because we are saying that it doesn’t differ by whether the individuals are cases or controls. 

  • Now also remember that sensitivity represents the probability that individuals who have the disease are captured as having the disease – the truly positive. And we have a table here of the truth! We know that 60 cases and 200 controls are truly positive!

  • Specificity is the probability that you are correctly classified as unexposed given that you truly are unexposed. 

  • Then because of the law of total probability we know that 1-sensitivity gives us false negatives and 1-specificity gives us false positives.

  • To figure out how many individuals are misclassified as exposed, we need to figure out how many of the cases will be true positives and how many will be false positives given the numbers that we have of true exposure distribution in this table.


  • Nondifferential because the sensitivity and specificity are the same for the groups being compared

  • If differential, the validity (sensitivity and specificity) would differ between groups


Misclassified A = TPs + FPs

Sensitivity = A/(A+C) (true positive rate)

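The arithmetic above can be sketched in a few lines. The notes give 60 truly exposed cases and 200 truly exposed controls; the unexposed totals and the sensitivity/specificity values below are made-up assumptions, not the numbers from the slide.

```python
# Observed "exposed" count = true positives + false positives.
def misclassified_exposed(true_exp, true_unexp, sens, spec):
    tp = sens * true_exp           # truly exposed, captured as exposed
    fp = (1 - spec) * true_unexp   # truly unexposed, wrongly called exposed
    return tp + fp

# Nondifferential: the SAME sens/spec applied to cases and controls.
# 60 exposed cases / 200 exposed controls are from the notes; the
# unexposed counts (40, 300) and sens=0.9, spec=0.8 are hypothetical.
cases = misclassified_exposed(60, 40, sens=0.9, spec=0.8)
controls = misclassified_exposed(200, 300, sens=0.9, spec=0.8)
print(round(cases), round(controls))  # 62 240
```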


  • Sensitivity = Θ

  • Specificity = ????




  • Nondifferential misclassification occurs when neither sensitivity nor specificity for disease classification varies by exposure category. By contrast, differential misclassification occurs when misclassification of disease status varies by exposure category.

  • INCLUDE HOMEWORK 3 AND UNDERSTAND IT

    • Include this in homework 3



  • ADD HETEROGENEITY AND FLOWCHART


Confounding


  • Is this a causal relationship or might it be due to some sort of bias?

  • Traditional definition of confounders:

    • They are associated with A in the base population 

    • They are independently associated with Y in the unexposed (independent of exposure)

    • They are not intermediate variables (not on the causal pathway)

    • They must precede Y, and can precede or occur at the same time as A

    • “common cause of both A and Y”

  • Properties of a confounder

    • A confounding factor must be an extraneous risk factor for disease

    • A confounding factor must be associated with the exposure under study in the source population (the population at risk from which the cases are derived)

    • A confounding factor must not be affected by the exposure or the disease

  • Want adjusted to be different from crude

    • But stratified estimates should be similar to one another

  • Could we have predicted the direction of potential bias?


  • Overestimates 

    • Individuals who have malaria are more likely to work outside

    • Individuals who work outside are more likely to be male

  • The bias will result in a crude OR that is in absolute magnitude too big


Assuming the null is true (no association between smoking and MI), what would we expect to see in an unadjusted OR?


  • Underestimates: toward the null

    • People who have MI are less likely to be moderate alcohol consumers

    • If they are less likely to be alcohol consumers, they are also less likely to smoke

  • Bias is absolute downward, therefore observed crude OR <1
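A minimal numeric sketch of crude vs. adjusted (all counts invented): within each stratum of the confounder the OR is 1.0, but collapsing the strata yields a crude OR well above 1, i.e., the crude estimate is distorted by the confounder.

```python
# 2x2 table convention: exposed cases a, exposed controls b,
# unexposed cases c, unexposed controls d.
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Hypothetical strata of a confounder; within each, OR = 1.0.
s1 = (40, 20, 20, 10)   # confounder present: 40*10 / (20*20) = 1.0
s2 = (10, 20, 20, 40)   # confounder absent:  10*40 / (20*20) = 1.0
crude = tuple(x + y for x, y in zip(s1, s2))  # collapsed: (50, 40, 40, 50)

print(odds_ratio(*s1), odds_ratio(*s2), odds_ratio(*crude))  # 1.0 1.0 1.5625
```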


  • What confounding isn’t 

    • Effect modification

    • Outcome heterogeneity

    • Exposure heterogeneity

    • Mediation


Confounding vs Effect Modification

  • Confounding

    • Measure is Distorted

    • Source of bias

    • Crude vs adjusted

  • Effect Modification

    • Measure varies by modifier

    • “it depends”

    • Across strata

    • Not source of bias


[Figure: DAG with mediator M on the pathway from exposure to outcome]

  • Mediator: part of the causal pathway; our exposure has an effect on disease through two pathways: one that goes through M and one that doesn’t go through M

    • If we adjust for M, it blocks that part of the effect of exposure on outcome (we don’t capture the full effect of exposure on the outcome)


WITH EFFECT MODIFIERS: HAVE TO BE ABLE TO STRATIFY UNEXPOSED BY SAME CATEGORIES

  • Ex. If the exposure is coffee drinking, decaf vs. regular cannot be an effect modifier (the unexposed can’t be stratified by it)


DAGS


  • L → A assumes a direct causal effect (that is not mediated by other variables) for at least one individual

  • The lack of an arrow is also important!

  • Arrows don’t encode effect size or direction

  • Interaction is not encoded (i.e., we don’t know how A and L, both causes of Y, might interact)

  • Causal DAG must include all common causes of any pair of variables in the graph whether U or C or L


  • Causation

  • Different risk in the entire population under two exposure values

  • Pr[Y^(a=1) = 1]: risk in all subjects of the population had they received the counterfactual exposure level a

  • Causal Risk Ratio: Pr[Y^(a=1) = 1] / Pr[Y^(a=0) = 1] (= 1 under the null)

  • Can also define/assess in terms of odds ratio or difference.

  • What is the null value of the causal risk difference? (0)

  • What is the null value of the causal odds ratio? (1)

  • Impossible to find causation though!!!

  • What we observe

    • Pr[Y=1|A=a] : risk of outcome Y in subjects of the population that meet the condition “having actually received exposure level a” 

    • Associational Risk Ratio: 

      • Pr[Y=1|A=1] / Pr[Y=1|A=0] = 1

    • This also implies that A and Y are independent

      • A ╨ Y

  • Counterfactual: what would happen if exposed were unexposed?
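A tiny potential-outcomes sketch (all individuals invented) shows how the two ratios can disagree. Here no individual's outcome depends on treatment, so the causal risk ratio is 1, but the sicker people were the ones treated, so the associational risk ratio exceeds 1 — lack of exchangeability.

```python
# Each person: (Y_a0, Y_a1, A) = outcome under no treatment, outcome
# under treatment, and the treatment actually received.
pop = [
    (1, 1, 1), (1, 1, 1), (0, 0, 1), (0, 0, 1),  # treated (sicker group)
    (1, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),  # untreated
]

# Causal risk ratio: Pr[Y^(a=1)=1] / Pr[Y^(a=0)=1] over the WHOLE population.
p_y1 = sum(y1 for _, y1, _ in pop) / len(pop)
p_y0 = sum(y0 for y0, _, _ in pop) / len(pop)
causal_rr = p_y1 / p_y0

# Associational risk ratio: Pr[Y=1|A=1] / Pr[Y=1|A=0], using observed Y.
treated = [y1 for _, y1, a in pop if a == 1]
untreated = [y0 for y0, _, a in pop if a == 0]
assoc_rr = (sum(treated) / len(treated)) / (sum(untreated) / len(untreated))

print(causal_rr, assoc_rr)  # 1.0 2.0
```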


  • Identifiability Conditions: IN ORDER TO ANALYZE A CAUSAL EFFECT THESE MUST BE TRUE

    • 1. The values of treatment under comparison correspond to well-defined interventions that, in turn, correspond to the versions of treatment in the data  → consistency

      • Observed outcome for every treated individual equals their outcome if they had received treatment, and that the observed outcome for every untreated individual equals their outcome if they had remained untreated

      • Requires sufficiently well-defined treatments and treatment-variation irrelevance

      • Observed outcome and potential outcome (theoretical world) are consistent 

    • 2. The conditional probability of receiving every value of treatment, though not decided by the investigators, depends only on the measured covariates  → exchangeability

      • Ya ╨ A for all a

      • Independence between the counterfactual outcome and the observed treatment

      • Potential outcomes are not conditional on what your actual treatment in the world is

      • The treated and untreated would experience the same risk of the outcome if they received the same level of treatment

      • Confounding leads to lack of exchangeability 

      • Conditional exchangeability

        • Critical criterion for causal inference

        • Weaker than marginal exchangeability

        • Within levels of L:

          • Exposed subjects would have had the same risk as unexposed subjects had they been unexposed

          • Unexposed subjects would have had the same risk as exposed subjects had they been exposed

        • Goal for confounding: achieve greatest degree of exchangeability as possible 

    • 3. The conditional probability of receiving every value of treatment is greater than zero, i.e., positive → positivity

      • Probability of treatment 

      • Everybody has to have a nonzero probability of each treatment 

      • Pr[A=a | L=l] > 0 for all values l with Pr[L=l] ≠ 0 in the population of interest
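The positivity condition can be checked empirically with a minimal sketch (records invented): within every observed level of the measured covariate L, both treatment values must actually occur.

```python
from collections import defaultdict

# Hypothetical (L, A) pairs: covariate level and treatment received.
records = [
    ("young", 1), ("young", 0), ("old", 1), ("old", 0), ("old", 1),
]

# Collect the set of treatment values seen in each stratum of L.
by_l = defaultdict(set)
for l, a in records:
    by_l[l].add(a)

# A stratum violates positivity if it lacks treated or untreated subjects.
violations = [l for l, treatments in by_l.items() if treatments != {0, 1}]
print(violations)  # [] -> no empirical positivity violation in this sample
```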

  • There are two major benefits to randomization in an RCT:

    • It addresses confounding

    • It guarantees positivity

  • Identification of confounding – lots of different approaches


    • But evaluation of confounding is different in Case-control and cohort studies because of how the populations are sampled and what they represent

  • What can we do about confounding?

    • Design

      • Randomization

      • Restriction

      • Matching

    • Analysis

      • Stratification

      • Multivariable adjustment

      • Propensity scores

      • Instrumental variables