---
authors:
- givenNames:
- James
- A
familyNames:
- Watson
type: Person
emails:
- jwatowatson@gmail.com
affiliations:
- name: >-
Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical
Medicine
parentOrganization:
name: Mahidol University
type: Organization
address:
addressCountry: Thailand
addressLocality: Bangkok
type: PostalAddress
type: Organization
- name: >-
Nuffield Department of Medicine, Centre for Tropical Medicine and
Global Health
parentOrganization:
name: University of Oxford
type: Organization
address:
addressCountry: United Kingdom
addressLocality: Oxford
type: PostalAddress
type: Organization
- givenNames:
- Stije
- J
familyNames:
- Leopold
type: Person
emails:
- stije@tropmedres.ac
affiliations:
- name: >-
Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical
Medicine
parentOrganization:
name: Mahidol University
type: Organization
address:
addressCountry: Thailand
addressLocality: Bangkok
type: PostalAddress
type: Organization
- name: >-
Nuffield Department of Medicine, Centre for Tropical Medicine and
Global Health
parentOrganization:
name: University of Oxford
type: Organization
address:
addressCountry: United Kingdom
addressLocality: Oxford
type: PostalAddress
type: Organization
- givenNames:
- Julie
- A
familyNames:
- Simpson
type: Person
affiliations:
- name: >-
Centre for Epidemiology and Biostatistics, Melbourne School of
Population and Global Health
parentOrganization:
name: The University of Melbourne
type: Organization
address:
addressCountry: Australia
addressLocality: Melbourne
type: PostalAddress
type: Organization
- givenNames:
- Nicholas
- PJ
familyNames:
- Day
type: Person
affiliations:
- name: >-
Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical
Medicine
parentOrganization:
name: Mahidol University
type: Organization
address:
addressCountry: Thailand
addressLocality: Bangkok
type: PostalAddress
type: Organization
- name: >-
Nuffield Department of Medicine, Centre for Tropical Medicine and
Global Health
parentOrganization:
name: University of Oxford
type: Organization
address:
addressCountry: United Kingdom
addressLocality: Oxford
type: PostalAddress
type: Organization
- givenNames:
- Arjen
- M
familyNames:
- Dondorp
type: Person
affiliations:
- name: >-
Nuffield Department of Medicine, Centre for Tropical Medicine and
Global Health
parentOrganization:
name: University of Oxford
type: Organization
address:
addressCountry: United Kingdom
addressLocality: Oxford
type: PostalAddress
type: Organization
- givenNames:
- Nicholas
- J
familyNames:
- White
type: Person
emails:
- nickw@tropmedres.ac
affiliations:
- name: >-
Nuffield Department of Medicine, Centre for Tropical Medicine and
Global Health
parentOrganization:
name: University of Oxford
type: Organization
address:
addressCountry: United Kingdom
addressLocality: Oxford
type: PostalAddress
type: Organization
editors:
- givenNames:
- Marc
familyNames:
- Lipsitch
type: Person
affiliations:
- name: Harvard TH Chan School of Public Health
address:
addressCountry: United States
type: PostalAddress
type: Organization
datePublished:
value: '2019-01-28'
type: Date
dateReceived:
value: '2018-10-26'
type: Date
dateAccepted:
value: '2019-01-22'
type: Date
title: >-
Collider bias and the apparent protective effect of glucose-6-phosphate
dehydrogenase deficiency on cerebral malaria
description: >-
Case fatality rates in severe falciparum malaria depend on the pattern and
degree of vital organ dysfunction. Recent large-scale case-control analyses of
pooled severe malaria data reported that glucose-6-phosphate dehydrogenase
deficiency (G6PDd) was protective against cerebral malaria but increased the
risk of severe malarial anaemia. A novel formulation of the balancing
selection hypothesis was proposed as an explanation for these findings,
whereby the selective advantage is driven by the competing risks of death from
cerebral malaria and death from severe malarial anaemia. We re-analysed these
claims using causal diagrams and showed that they are subject to collider
bias. A simulation based sensitivity analysis, varying the strength of the
known effect of G6PDd on anaemia, showed that this bias is sufficient to
explain all of the observed association. Future genetic epidemiology studies
in severe malaria would benefit from the use of causal reasoning.
isPartOf:
volumeNumber: 8
isPartOf:
title: eLife
issns:
- 2050-084X
identifiers:
- name: nlm-ta
propertyID: 'https://registry.identifiers.org/registry/nlm-ta'
value: elife
type: PropertyValue
- name: publisher-id
propertyID: 'https://registry.identifiers.org/registry/publisher-id'
value: eLife
type: PropertyValue
publisher:
name: 'eLife Sciences Publications, Ltd'
type: Organization
type: Periodical
type: PublicationVolume
licenses:
- url: 'http://creativecommons.org/licenses/by/4.0/'
content:
- content:
- 'This article is distributed under the terms of the '
- content:
- Creative Commons Attribution License
target: 'http://creativecommons.org/licenses/by/4.0/'
type: Link
- >-
, which permits unrestricted use and redistribution provided that
the original author and source are credited.
type: Paragraph
type: CreativeWork
keywords:
- causal inference
- severe malaria
- collider bias
- glucose-6-phosphate dehydrogenase deficiency
- Plasmodium falciparum
- P. falciparum
identifiers:
- name: publisher-id
propertyID: 'https://registry.identifiers.org/registry/publisher-id'
value: 43154
type: PropertyValue
- name: doi
propertyID: 'https://registry.identifiers.org/registry/doi'
value: 10.7554/eLife.43154
type: PropertyValue
- name: elocation-id
propertyID: 'https://registry.identifiers.org/registry/elocation-id'
value: e43154
type: PropertyValue
fundedBy:
- identifiers: []
funders:
- name: Wellcome Trust
type: Organization
type: MonetaryGrant
- identifiers:
- value: Senior Research Fellowship 1104975
type: PropertyValue
funders:
- name: National Health and Medical Research Council
type: Organization
type: MonetaryGrant
about:
- name: Epidemiology and Global Health
type: DefinedTerm
genre:
- Short Report
bibliography: article.references.bib
---

# Introduction

Severe falciparum malaria is defined by one or more criteria indicating vital organ dysfunction in the presence of microscopy confirmed asexual blood stages of _Plasmodium falciparum_ in the peripheral blood film (@bib15). Multiple vital organ dysfunction is associated with increased mortality (@bib14). Common major clinical manifestations of severe malaria include coma, acidosis, renal failure and anaemia. Of these manifestations, anaemia is an inevitable consequence of symptomatic malaria (@bib13). However, anaemia in individuals at risk of _Plasmodium falciparum_ infection can also be the consequence of red cell genetic polymorphisms frequent in the populations at risk, such as glucose-6-phosphate dehydrogenase deficiency (G6PDd) or haemoglobinopathies.

There is considerable interest in understanding the mechanisms conferring protective effects against severe falciparum malaria of the genetic polymorphisms which are common in malaria endemic areas (@bib12). For some, such as the sickle cell trait, several different mechanisms have been proposed. These include reduced parasite erythrocyte invasion, enhanced parasitised red cell phagocytosis and a reduced propensity of infected red cells to sequester in the microvasculature (@bib7; @bib1; @bib2; @bib16). The mechanism underlying protection from severe falciparum malaria is less clear for others such as G6PDd. This X-linked genetic polymorphism results in the most common human enzymopathy. Nearly 200 different genetic variants have been reported (@bib5; @bib6). Both the mechanism whereby G6PD deficiency protects against malaria and the natural selection forces which have resulted in the different genotypes are still debated. Prospective observational hospital- or clinic-based patient studies have provided the major component of the evidence base. Estimating causal effects from observational studies in severe malaria patients is difficult due to both confounding and selection bias. This work focuses on collider bias introduced by inappropriate data filtering (@bib9; @bib8).

It has been suggested that G6PDd both increases the risk of severe malarial anaemia (SMA) and decreases the risk of cerebral malaria (CM) (@bib7; @bib3). These conclusions were based on a pooled analysis of observational data from over 11,000 patients with severe malaria studied in Africa and Asia, and relevant population controls. Based on these genetic association studies, a new formulation of the balancing-selection hypothesis was proposed in which G6PD polymorphisms are maintained in human populations, at least in part, by an evolutionary trade-off between different adverse outcomes of _P. falciparum_ infection (@bib3). Collider bias probably explains this negative association between G6PDd and CM, suggesting that causal interpretations of this association and the novel formulation of balancing selection in G6PDd are invalid.

# Results

Two published analyses of pooled data from observational studies of patients with severe falciparum malaria used severe malarial anaemia (SMA) and cerebral malaria (CM) as the main endpoints (outcomes) of interest (@bib7; @bib3). Both these published analyses defined cases of CM as the presence of coma but without concomitant SMA, and cases of SMA as patients with severe anaemia but who were conscious. Therefore, these case definitions excluded patients who had both SMA and CM. All other presentations of severe malaria were also excluded (pulmonary oedema, shock, etc.). Population controls were recruited at each site to match the ethnic composition of cases, and in some instances cord blood samples were used as controls. The consequence of these case definitions is to create an artificial dependency between SMA and CM: if a patient has SMA then they cannot have CM. G6PDd is known to influence haemoglobin concentrations directly by causing haemolysis of older erythrocytes in acute malaria. Therefore, it is to be expected that SMA is positively correlated with G6PDd, thus creating a negative correlation between CM and G6PDd. In probabilistic terms, this conditional dependence is written as $\mathrm{P}(\mathsf{SMA} \mid \mathsf{CM}) \ne \mathrm{P}(\mathsf{SMA})$. Indeed, when all the G6PDd mutations were mapped onto the WHO severity classification score (@bib17), it was observed that "The mean G6PDd score was 13.5% in controls, 13% in cerebral malaria cases and 16.9% in severe malarial anaemia cases ..." (page 8, @bib3). This pattern remained consistent (6, 5.6, and 7.1%, respectively) after exclusion of the G6PD c.202C > T mutation (one of the ‘A-’ mutations and the most prevalent in the pooled data).

By excluding patients with both SMA and CM (approximately 12% of those with either SMA or CM in the pooled data), the number of G6PDd patients in the CM category is artificially reduced and we would expect a lower proportion of G6PDd patients than in the control group. [Figure 1](#fig1) proposes a simple causal diagram which posits plausible inter-dependencies limited to the variables of interest. A simple simulation study based on the assumptions shown in [Figure 1](#fig1) can be used to estimate the relationship between G6PDd and SMA which would result in the observed odds ratio for G6PDd in CM cases versus controls reported in @bib3. We assessed the null model in which there is no direct causal link between G6PDd and CM in severe falciparum malaria (i.e. no arrow from G6PD deficiency to CM). We also assumed that there is no direct link between SMA and CM. We calibrated the model with the marginal probabilities of SMA and CM reported in @bib3 and varied only the odds ratio for G6PDd in SMA cases versus controls, from 1 (assuming no effect of G6PDd on SMA) to 2 (the odds of G6PDd in the SMA cases being twice those in the controls). For simplicity (avoiding assumptions concerning gene dose effects) we restricted the analysis to males and compared the simulation results only with the reported associations in males.
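
The calibration described above can be sketched in a few lines of R. This is an illustration only, not the authors' implementation: the helper name `sma_probs_from_or` is hypothetical, and it relies on the observation (an assumption of this sketch, derived from the stated marginals) that with population controls the odds ratio for G6PDd in SMA cases versus controls equals P(SMA | G6PD deficient) / P(SMA | G6PD normal).

```r
# Hypothetical helper (not part of the original analysis): convert a target
# odds ratio for G6PDd in SMA cases versus controls into the two conditional
# probabilities of SMA, keeping the marginal probability of SMA fixed.
sma_probs_from_or <- function(or, p_sma = 0.24, p_def = 0.15) {
  # Two constraints are solved jointly:
  #   or    = P(SMA | deficient) / P(SMA | normal)
  #   p_sma = p_def * P(SMA | deficient) + (1 - p_def) * P(SMA | normal)
  p_sma_norm <- p_sma / (p_def * or + (1 - p_def))
  p_sma_def <- or * p_sma_norm
  c(p_sma_def = p_sma_def, p_sma_norm = p_sma_norm)
}

# Conditional probabilities at the two ends and the middle of the range 1 to 2
probs <- sapply(c(1, 1.5, 2), sma_probs_from_or)
colnames(probs) <- c("OR=1", "OR=1.5", "OR=2")
print(round(probs, 3))
```

At an odds ratio of 1 both conditional probabilities equal the marginal value of 0.24, which corresponds to the case of no effect of G6PDd on SMA.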

figure: Figure 1.
:::
![](article.rmd.media/fig1.jpg)

## Causal diagram highlighting collider bias in @bib3 and @bib7.

G6PD deficiency is the exposure of interest (green) and cerebral malaria (CM) is the outcome of interest (red). By defining the CM cases as those who had coma but no severe anaemia, collider bias operates on the effect of G6PDd on CM.
:::
{#fig1}

[Figure 2](#fig2) shows that if the odds ratio for G6PDd in SMA cases versus controls is strictly greater than 1, the estimated odds ratio for G6PDd in CM cases versus controls is biased (the thick red line is below the true simulated value of 1). The magnitude of this bias increases monotonically as the odds ratio for G6PDd in SMA cases versus controls increases. As can be seen from the causal diagram in [Figure 1](#fig1), there is no bias in the estimated odds ratio for G6PDd in SMA cases versus controls (in [Figure 2](#fig2) the thick blue line approximates the identity line). This simple simulation model, restricted to males, shows that any value of the odds ratio for G6PDd in SMA cases versus controls inside the reported 95% confidence interval (CI) $1.22–1.8$ from @bib3 results in a biased odds ratio for G6PDd in CM cases versus controls inside the interval $0.69–0.98$, the reported 95% CI for G6PDd in CM cases versus controls (@bib3). Moreover, if we use their reported point estimate of 1.48, restricted to males, for the odds ratio of G6PDd in SMA cases versus controls, the simulation model estimates that the observed (biased) odds ratio for G6PDd in CM cases versus controls is 0.87, qualitatively very close to their estimate of 0.82. We note that the effect of G6PDd on severe anaemia in homozygous G6PDd girls and hemizygous G6PDd boys reported in @bib10, an odds ratio of 1.71 (95% CI: 1.34–2.18) for G6PDd in SMA cases versus controls, is also consistent with these results. Therefore collider bias could be sufficient to explain all the observed association.
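
The size of the bias can also be approximated without simulation. The sketch below is not taken from the original analysis: the function `expected_cm_or` and its closed form are derived here from the null model in [Figure 1](#fig1) (CM independent of both G6PD status and SMA), using the marginal probabilities 0.24 for SMA and 0.15 for G6PDd.

```r
# Sketch under the null model of Figure 1. The closed form is derived for
# illustration and is an assumption of this example, not a published formula.
expected_cm_or <- function(or_sma, p_sma = 0.24, p_def = 0.15) {
  p_sma_norm <- p_sma / (p_def * or_sma + (1 - p_def))  # P(SMA | G6PD normal)
  p_sma_def <- or_sma * p_sma_norm                      # P(SMA | G6PD deficient)
  # Excluding patients with SMA from the CM cases keeps a fraction
  # (1 - P(SMA | G6PD status)) of each G6PD group; P(CM) cancels in the ratio.
  (1 - p_sma_def) / (1 - p_sma_norm)
}

# Reported 95% CI limits and point estimate for G6PDd in SMA cases (males)
or_sma <- c(lower = 1.22, point = 1.48, upper = 1.8)
print(round(expected_cm_or(or_sma), 3))
```

Under these assumptions the expected biased odds ratios are roughly 0.93, 0.86 and 0.78, all inside the reported interval of 0.69 to 0.98. The value of 0.87 quoted in the text comes from the stochastic simulation, so small differences from the closed form are to be expected.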

chunk: Figure 2.
:::
## Results of the simulation based sensitivity analysis showing how collider bias can explain all the reported association between CM and G6PDd.

The simulation assumes that CM is independent of G6PD status but that SMA is dependent on G6PDd status ([Figure 1](#fig1)). Case definitions of CM and SMA exclude patients with both. The left panel shows the observed simulation based estimate of the odds ratio (OR) for G6PDd in SMA cases versus controls (y-axis) as a function of the true simulated value (x-axis). No bias arises (the observed and true values lie on the line of identity). The right panel shows the observed simulation based estimate of the OR for G6PDd in CM cases versus controls (y-axis), again as a function of the true simulated value of the OR for G6PDd in SMA cases versus controls (x-axis). This estimate suffers from collider bias (the true value of the OR for G6PDd in CM cases versus controls was set to 1). The faint blue shaded areas show the 95% CI $1.22–1.8$ for the odds ratio of G6PDd in SMA cases versus controls, restricted to males (@bib3). The point estimate (1.48) is shown by the dashed blue line. The faint red shaded area shows the 95% CI $0.69–0.98$ for G6PDd in CM cases versus controls, also restricted to males, with the point estimate (0.82) shown by the dashed red line. CI: confidence interval.

```{r}
#' @width 18
#' @height 20

library(RColorBrewer)

# The number of malaria patients
N = 10^6
# Marginal probabilities of coma (CM), anaemia (SMA) and G6PD deficiency
P_coma = 0.34
P_anaemia = 0.24
P_G6PDdef = 0.15

# Grid of values for P(anaemia | G6PD deficient)
PIs = seq(P_anaemia, 0.5, length.out = 100)
ORcoma = ORanaemia = array(dim = length(PIs))
TrueOR_anaemia = array(dim = length(PIs))

# odds of G6PDd in controls
O1 = P_G6PDdef/(1-P_G6PDdef)

for(i in seq_along(PIs)){

  # The Probability of being anaemic if you are G6PD deficient
  P_anaemia_def = PIs[i]

  # We solve the equation to work out the Probability of being anaemic if you are G6PD normal
  # This is dependent on the previous probabilities (simple algebra)
  P_anaemia_norm = (P_anaemia - P_G6PDdef*P_anaemia_def)/(1-P_G6PDdef)

  # We assume that G6PD status and Coma status are independent
  G6PDstatus = sample(c('Normal','Def'), size = N, replace = TRUE,
                      prob = c(1-P_G6PDdef, P_G6PDdef))
  Comastatus = sample(c('No Coma','Coma'), size = N, replace = TRUE,
                      prob = c(1-P_coma, P_coma))
  # Generate anaemia status dependent on G6PD status
  Anaemiastatus = array(dim = N)
  normals = G6PDstatus=='Normal'
  defs = !normals
  Anaemiastatus[normals] = sample(x = c('No Anaemia','Anaemia'),
                                  size = sum(normals),
                                  replace = TRUE,
                                  prob = c(1-P_anaemia_norm,P_anaemia_norm))
  Anaemiastatus[defs] = sample(x = c('No Anaemia','Anaemia'),
                               size = sum(defs),
                               replace = TRUE,
                               prob = c(1-P_anaemia_def,P_anaemia_def))

  Study_dat = data.frame(Coma = Comastatus,
                         G6PD = G6PDstatus,
                         Anaemia = Anaemiastatus)

  # True odds ratio for G6PDd in anaemia cases versus controls (before exclusion)
  P_G6PDd_Anaemia = PIs[i]*P_G6PDdef/P_anaemia
  TrueOR_anaemia[i] = P_G6PDd_Anaemia/(1-P_G6PDd_Anaemia)/O1
  # now keep only patients with exactly one of coma or anaemia (not both)
  study_patients = xor(Comastatus=='Coma', Anaemiastatus=='Anaemia')

  Study_dat = Study_dat[study_patients,]

  # odds of G6PDd in cerebral malaria group
  O2 = (sum(Study_dat$Coma=='Coma' & Study_dat$G6PD=='Def')/
          sum(Study_dat$Coma=='Coma' & Study_dat$G6PD=='Normal'))

  # odds ratio for G6PDd between cases and controls
  ORcoma[i] = O2/O1

  # odds of G6PDd in SMA group
  O2 = sum(Study_dat$Anaemia=='Anaemia' & Study_dat$G6PD=='Def')/
    sum(Study_dat$Anaemia=='Anaemia' & Study_dat$G6PD=='Normal')
  # odds ratio for G6PDd between cases and controls: SMA
  ORanaemia[i] = O2/O1

}
Results = list(ORanaemia=ORanaemia,
               ORcoma=ORcoma,
               TrueOR_anaemia=TrueOR_anaemia)

# Plots of results

bluecols = brewer.pal(3, 'Blues')
redcols = brewer.pal(3, 'Reds')

par(las = 1, bty='n', mfrow=c(1,2))

# Panel 1

{
  plot(Results$TrueOR_anaemia, Results$ORanaemia,
       xlab = '',
       type='l', lwd=3, col =bluecols[3],
       ylab='Observed odds ratio for G6PDd in SMA versus controls',
       ylim = c(0.7,1.8), xlim=c(1,2))
  title('Severe malarial anaemia (SMA)')
  mtext(text = 'True odds ratio for G6PDd\nin SMA versus controls',
        side = 1,line = 3.5)
  axis(2, at = c(0.7,.8), labels = c(0.7,''))
  polygon(c(-1,20,20,-1),c(1.22,1.22,1.8,1.8),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  polygon(c(1.22,1.22,1.8,1.8),c(-1,20,20,-1),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  abline(h = 1.48, col= bluecols[2], lwd=2, lty=2)
  abline(v = 1.48, col= bluecols[2], lwd=2, lty=2)

  abline(h=1)
  lines(Results$TrueOR_anaemia, Results$ORanaemia,
        lwd=3, col =bluecols[3])
}

# Panel 2

{
  plot(Results$TrueOR_anaemia, Results$ORcoma,
       lwd=3, col =redcols[3],
       xlab = '',
       type='l', xlim=c(1,2),
       ylab='Observed odds ratio for G6PDd in CM versus controls',
       ylim = c(0.7,1.8))
  title('Cerebral malaria (CM)')
  mtext(text = 'True odds ratio for G6PDd\nin SMA versus controls',
        side = 1,line = 3.5)
  axis(2, at = c(0.7,.8), labels = c(0.7,''))
  polygon(c(-1,20,20,-1),c(0.69,0.69,0.98,0.98),
          col=adjustcolor(redcols[1],alpha.f = 0.4), border = NA)
  polygon(c(1.22,1.22,1.8,1.8),c(-1,20,20,-1),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  abline(h = 0.82, col= redcols[2], lwd=2, lty=2)
  abline(v = 1.48, col= bluecols[2], lwd=2, lty=2)

  abline(h=1)
  lines(Results$TrueOR_anaemia, Results$ORcoma,
        lwd=3, col =redcols[3])
}
```


![](article.rmd.media/fig2.jpg)

:::
{#fig2}

# Discussion

This re-analysis of recent reports that G6PDd reduced the risk of CM directly (@bib7; @bib3) suggests that the observations could have resulted entirely from collider bias. This highlights the difficulty of inferring causal relationships between baseline patient covariates (in this case G6PDd) and covariates which define inclusion criteria. The necessary causal odds ratios for G6PDd in SMA cases versus controls which would give rise to the biased observed association between G6PD status and CM fit with the recent estimate of 1.71 in homozygous girls and hemizygous boys (@bib10). This is not to say that the risk of CM is unaffected by G6PDd, but that the observations reported could have arisen purely as a result of the implicit collider bias induced by the selection of patients by the severe malaria criteria.

The example reported here highlights a major difficulty when attempting to estimate causal contributions when factors of interest define inclusion into the clinical study and the subsequent data analysis. All prospective observational severe malaria studies suffer from two major issues. First, case definitions are subjective and change over time (even though standard guidelines exist, see @bib15) and mortality is strongly dependent on the case definition. Second, enrolment into studies can only be done at the hospital or clinic level and neither duration of illness nor treatment seeking behaviour can be accounted for adequately.

The notion and definition of ‘severe malaria’ has two operational purposes. First it is a clinical tool for appropriate triage of malaria patients at high risk of death. Second it is a research tool for the evaluation of novel interventions seeking to reduce mortality. Interventions aimed at reducing mortality need to be trialled in the most severely ill patients in order to demonstrate intervention efficacy in this important subgroup. Pooled analyses of severe malaria studies need to take into account the variability of study inclusion and exclusion criteria. Researchers must appreciate that severe malaria is not an objective category but a subjective case definition subset from a spectrum of severity. Moreover, restricting analyses to specific patient subgroups, especially in the analysis of large pooled datasets, can have a considerable impact on the final result (@bib4). With increased emphasis on providing open access data so that analyses can be evaluated and best use made of clinical research it would be very helpful if investigators could publish reproducible code alongside their analyses. Future genetic epidemiological studies could benefit from use of causal diagrams and would be more readily evaluable by provision of accompanying code.

# Materials and methods

## Data analysis in @bib3 and @bib7

The odds ratios for G6PDd in cases versus controls are given in Table 3 of @bib7 (page 1201). The results, restricted to males, are 0.81 (95% CI: 0.68–0.96) for CM and 1.49 (95% CI: 1.24–1.79) for SMA. The case phenotype definitions are denoted ‘Cerebral malaria only’ for CM and ‘Severe malarial anaemia only’ for SMA. These case definitions, whereby patients with both SMA and CM are excluded, are also given in their Table 1. A total of 6283 cases had cerebral malaria or severe malarial anaemia, broken down as follows: 3345 had cerebral malaria only; 2196 had severe malarial anaemia only; 742 had both cerebral malaria and severe malarial anaemia. The reported odds ratios were computed using logistic regression models with the main adjustment of interest being sickle haemoglobin genotype (HbS). We only consider the reported results restricted to males. The relevant section from the paper is: "_Single-SNP tests, adjusted for HbS genotype, sex and ancestry, for association with severe malaria and the severe malaria subtypes cerebral malaria only and severe malarial anemia only were performed for the 55 SNPs with a known association with severe malaria. Standard logistic regression models were used for tests of association at each autosomal SNP (Supplementary Table 25). Primary analyses comprised tests of association between each SNP and severe malaria phenotypes across all individuals combined as well as separately by sex (X-chromosome SNPs only) and study site: genotypic, additive, dominant, recessive and heterozygote advantage genetic models of inheritance were considered._" (online Methods, Statistical analysis, @bib7).

## @bib3

The odds ratios for G6PDd in cases versus controls are given in Table 3 of @bib3 (page 6). In this publication, the results, restricted to males, are 0.82 (95% CI: 0.69–0.98) for CM and 1.48 (95% CI: 1.22–1.8) for SMA. The reason for the slight discrepancy between the two publications does not appear to be stated. They included a total of 6284 patients with cerebral malaria or severe malarial anaemia, broken down as follows: 3359 individuals had cerebral malaria only; 2184 had severe malarial anaemia only; 741 had both cerebral malaria and severe malarial anaemia. Tables 1 and 6 of @bib3 show the case definitions for CM and SMA, highlighting that those who have both clinical presentations are excluded from the respective case definitions. This is further confirmed on page 18 where the authors state: "_For reasons of sample size, we did not conduct a detailed analysis of other sub-types of severe malaria, or of those individuals who had both cerebral malaria and severe malarial anaemia._" Standard logistic regression models were also used to obtain the odds ratios: "_In primary analyses, standard fixed effects logistic regression methods were used for tests of association with severe malaria and sub-types at each SNP under additive, dominant, recessive and heterozygous models. ... Results were adjusted for sickle hemoglobin (HbS), gender and ethnicity._" (bottom of page 18).
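
As a quick arithmetic check on the counts quoted for the two publications, the following sketch recomputes the totals and the share of patients with both CM and SMA (the group removed by the case definitions). The data frame and its column names are illustrative, and the `study` labels simply reuse the citation keys.

```r
# Counts as quoted in the text for the two publications
counts <- data.frame(
  study    = c("bib7", "bib3"),
  cm_only  = c(3345, 3359),
  sma_only = c(2196, 2184),
  both     = c(742, 741)
)

# Total with CM or SMA, and proportion excluded for having both
counts$total <- counts$cm_only + counts$sma_only + counts$both
counts$prop_both <- counts$both / counts$total
print(counts)
```

Both totals agree with the figures quoted above (6283 and 6284), and in each publication just under 12% of these patients had both presentations.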

## Sensitivity analysis

In order to characterise the proportion of the reported association explained by collider bias we constructed a simple simulation study based on the analytical procedures in @bib7 and @bib3. We restrict our simulation to males, and calibrate and test the model using only the results reported in males in both publications (identical to one decimal place). For simplicity, the simulation ignores the effect of HbS, which is a confounder between SMA and CM, and generates data where the two presentations occur independently, which is equivalent to adjusting for HbS in the regression model. The procedure generates simulated data dependent on a parameter characterising the effect of G6PDd on severe malarial anaemia. As no adjustment is necessary, we then compute the non-parametric odds ratio for G6PDd in CM cases versus controls, excluding from the CM case definition all those who have SMA. This simulated ‘observed’ odds ratio estimated from the 2 x 2 table (cases and controls versus G6PDd and G6PD normals) is thereby directly comparable to the reported odds ratios (obtained from logistic regression with the appropriate adjustments) in @bib7 and @bib3, if we assume that sex, ethnicity and HbS are the only true confounders (i.e. all necessary adjustments were made in both publications).

The hypothetical data were simulated based on the following assumptions:

1.  G6PD deficiency increases the risk of SMA in acute symptomatic malaria (the size of the effect is the only free parameter in the model and varies from 1 to 2 as defined by the odds ratio for G6PDd in SMA cases versus controls). In reality this would be expected to be a function of genotype and gene dose (i.e. hemizygotes and homozygotes would have a greater risk than heterozygotes).
2.  CM is independent of G6PD status.
3.  A population of males only (i.e. no heterozygote women, so no partial effects) with homogeneous background frequency of G6PDd.
4.  CM and SMA occur independently.

From the data in @bib3 and @bib7, we can estimate the marginal probability of CM as 0.34, independent of G6PD status (assumption 2). We can also estimate the marginal probability of SMA as 0.24. The probability of G6PDd in males was 0.15.

If we denote $\pi = \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd}) \ge \mathrm{P}(\mathsf{SMA}) = 0.24$, then by the law of total probability:

$$\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) = \frac{\mathrm{P}(\mathsf{SMA}) - \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd})\,\mathrm{P}(\mathsf{G6PDd})}{1 - \mathrm{P}(\mathsf{G6PDd})} = \frac{0.24 - 0.15\pi}{0.85}$$

where G6PDd denotes G6PD deficient and G6PDn denotes G6PD normal.
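
For completeness, the rearrangement behind this expression can be written out step by step. The value $\pi = 0.30$ in the last line is chosen purely for illustration and does not come from the source.

$$
\begin{aligned}
\mathrm{P}(\mathsf{SMA}) &= \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd})\,\mathrm{P}(\mathsf{G6PDd}) + \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn})\,\bigl(1 - \mathrm{P}(\mathsf{G6PDd})\bigr) \\
0.24 &= 0.15\,\pi + 0.85\,\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) \\
\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) &= \frac{0.24 - 0.15\,\pi}{0.85} \\
\pi = 0.30 \;\Rightarrow\; \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) &= \frac{0.24 - 0.045}{0.85} \approx 0.229
\end{aligned}
$$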

The true proportion of G6PDd in the controls is known and fixed as $\mathrm{P}(\mathsf{G6PDd})$, and the proportion of G6PDn in the controls is given by $1 - \mathrm{P}(\mathsf{G6PDd})$. The odds of G6PDd in the control group are therefore given by $\frac{\mathrm{P}(\mathsf{G6PDd})}{1 - \mathrm{P}(\mathsf{G6PDd})}$.

We then simulate cases as follows. For each value of $\pi \in [0.24, 0.5]$:

**Step 1**. Simulate 1 million patients such that:

$\mathrm{P}(\mathsf{CM}) = 0.34$,

$\mathrm{P}(\mathsf{G6PDd}) = 0.15$,

$\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd}) = \pi$,

$\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) = (0.24 - 0.15\pi)/0.85$.

**Step 2**. Select only the patients who have either just CM, or just SMA, filtering out those with concomitant SMA and CM.

**Step 3**. In the remaining data compute $\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{SMA}}$ (number of G6PDd patients with SMA); $\mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{SMA}}$ (number of G6PDn patients with SMA); $\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{CM}}$ (number of G6PDd patients with CM); $\mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{CM}}$ (number of G6PDn patients with CM). The simulated odds ratio for G6PDd in SMA cases versus controls is $\frac{\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{SMA}} / \mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{SMA}}}{\mathrm{P}(\mathsf{G6PDd}) / (1 - \mathrm{P}(\mathsf{G6PDd}))}$, and the simulated odds ratio for G6PDd in CM cases versus controls is $\frac{\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{CM}} / \mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{CM}}}{\mathrm{P}(\mathsf{G6PDd}) / (1 - \mathrm{P}(\mathsf{G6PDd}))}$.
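
The three steps can be sketched compactly for a single value of $\pi$. This is a minimal illustration rather than the authors' code: it uses `rbinom` with logical vectors instead of `sample` with character labels, and the seed and the value `pi_def = 0.33` are arbitrary choices made for this example.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

n <- 1e6
p_cm <- 0.34
p_def <- 0.15
p_sma <- 0.24
pi_def <- 0.33  # illustrative value of P(SMA | G6PDd), an assumption
p_sma_norm <- (p_sma - p_def * pi_def) / (1 - p_def)  # P(SMA | G6PDn)

# Step 1: simulate patients; CM is independent of G6PD status and of SMA
def <- rbinom(n, 1, p_def) == 1
cm <- rbinom(n, 1, p_cm) == 1
sma <- rbinom(n, 1, ifelse(def, pi_def, p_sma_norm)) == 1

# Step 2: keep patients with CM only or SMA only (exclude those with both)
keep <- xor(cm, sma)

# Step 3: odds of G6PDd in each case group relative to the fixed control odds
odds_controls <- p_def / (1 - p_def)
or_sma <- (sum(keep & sma & def) / sum(keep & sma & !def)) / odds_controls
or_cm <- (sum(keep & cm & def) / sum(keep & cm & !def)) / odds_controls
print(c(or_sma = or_sma, or_cm = or_cm))
```

With one million simulated patients the SMA odds ratio should sit well above 1 and the CM odds ratio below 1, even though CM was generated independently of G6PD status.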

The causal diagram which corresponds to the exclusion operating in Step 2 is shown in [Figure 1](#fig1). Selection bias can be seen via the role of the vertex _Case definition_, which is a collider between _G6PD deficiency_ and _Cerebral malaria (CM)_. Implementation of the simulation model in R can be found at: <https://github.com/Stije/SevereMalariaAnalysis/SelectionBiasSimulation.Rmd> (@bib11; copy archived at <https://github.com/elifesciences-public