--- authors: - givenNames: - James - A familyNames: - Watson type: Person emails: - jwatowatson@gmail.com affiliations: - name: >- Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine parentOrganization: name: Mahidol University type: Organization address: addressCountry: Thailand addressLocality: Bangkok type: PostalAddress type: Organization - name: >- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health parentOrganization: name: University of Oxford type: Organization address: addressCountry: United Kingdom addressLocality: Oxford type: PostalAddress type: Organization - givenNames: - Stije - J familyNames: - Leopold type: Person emails: - stije@tropmedres.ac affiliations: - name: >- Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine parentOrganization: name: Mahidol University type: Organization address: addressCountry: Thailand addressLocality: Bangkok type: PostalAddress type: Organization - name: >- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health parentOrganization: name: University of Oxford type: Organization address: addressCountry: United Kingdom addressLocality: Oxford type: PostalAddress type: Organization - givenNames: - Julie - A familyNames: - Simpson type: Person affiliations: - name: >- Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health parentOrganization: name: The University of Melbourne type: Organization address: addressCountry: Australia addressLocality: Melbourne type: PostalAddress type: Organization - givenNames: - Nicholas - PJ familyNames: - Day type: Person affiliations: - name: >- Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine parentOrganization: name: Mahidol University type: Organization address: addressCountry: Thailand addressLocality: Bangkok type: PostalAddress type: Organization - name: >- Nuffield Department of Medicine, Centre for Tropical Medicine 
and Global Health parentOrganization: name: University of Oxford type: Organization address: addressCountry: United Kingdom addressLocality: Oxford type: PostalAddress type: Organization - givenNames: - Arjen - M familyNames: - Dondorp type: Person affiliations: - name: >- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health parentOrganization: name: University of Oxford type: Organization address: addressCountry: United Kingdom addressLocality: Oxford type: PostalAddress type: Organization - givenNames: - Nicholas - J familyNames: - White type: Person emails: - nickw@tropmedres.ac affiliations: - name: >- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health parentOrganization: name: University of Oxford type: Organization address: addressCountry: United Kingdom addressLocality: Oxford type: PostalAddress type: Organization editors: - givenNames: - Marc familyNames: - Lipsitch type: Person affiliations: - name: Harvard TH Chan School of Public Health address: addressCountry: United States type: PostalAddress type: Organization datePublished: value: '2019-01-28' type: Date dateReceived: value: '2018-10-26' type: Date dateAccepted: value: '2019-01-22' type: Date title: >- Collider bias and the apparent protective effect of glucose-6-phosphate dehydrogenase deficiency on cerebral malaria description: >- Case fatality rates in severe falciparum malaria depend on the pattern and degree of vital organ dysfunction. Recent large-scale case-control analyses of pooled severe malaria data reported that glucose-6-phosphate dehydrogenase deficiency (G6PDd) was protective against cerebral malaria but increased the risk of severe malarial anaemia. A novel formulation of the balancing selection hypothesis was proposed as an explanation for these findings, whereby the selective advantage is driven by the competing risks of death from cerebral malaria and death from severe malarial anaemia. 
We re-analysed these claims using causal diagrams and showed that they are subject to collider bias. A simulation based sensitivity analysis, varying the strength of the known effect of G6PDd on anaemia, showed that this bias is sufficient to explain all of the observed association. Future genetic epidemiology studies in severe malaria would benefit from the use of causal reasoning. isPartOf: volumeNumber: 8 isPartOf: title: eLife issns: - 2050-084X identifiers: - name: nlm-ta propertyID: 'https://registry.identifiers.org/registry/nlm-ta' value: elife type: PropertyValue - name: publisher-id propertyID: 'https://registry.identifiers.org/registry/publisher-id' value: eLife type: PropertyValue publisher: name: 'eLife Sciences Publications, Ltd' type: Organization type: Periodical type: PublicationVolume licenses: - url: 'http://creativecommons.org/licenses/by/4.0/' content: - content: - 'This article is distributed under the terms of the ' - content: - Creative Commons Attribution License target: 'http://creativecommons.org/licenses/by/4.0/' type: Link - >- , which permits unrestricted use and redistribution provided that the original author and source are credited. type: Paragraph type: CreativeWork keywords: - causal inference - severe malaria - collider bias - glucose-6-phosphate dehydrogenase deficiency - Plasmodium falciparum - P. 
falciparum identifiers: - name: publisher-id propertyID: 'https://registry.identifiers.org/registry/publisher-id' value: 43154 type: PropertyValue - name: doi propertyID: 'https://registry.identifiers.org/registry/doi' value: 10.7554/eLife.43154 type: PropertyValue - name: elocation-id propertyID: 'https://registry.identifiers.org/registry/elocation-id' value: e43154 type: PropertyValue fundedBy: - identifiers: [] funders: - name: Wellcome Trust type: Organization type: MonetaryGrant - identifiers: - value: Senior Research Fellowship 1104975 type: PropertyValue funders: - name: National Health and Medical Research Council type: Organization type: MonetaryGrant about: - name: Epidemiology and Global Health type: DefinedTerm genre: - Short Report bibliography: article.references.bib --- # Introduction Severe falciparum malaria is defined by one or more criteria indicating vital organ dysfunction in the presence of microscopy confirmed asexual blood stages of _Plasmodium falciparum_ in the peripheral blood film (@bib15). Multiple vital organ dysfunction is associated with increased mortality (@bib14). Common major clinical manifestations of severe malaria include coma, acidosis, renal failure and anaemia. Of these manifestations, anaemia is an inevitable consequence of symptomatic malaria (@bib13). However, anaemia in individuals at risk of _Plasmodium falciparum_ infection can also be the consequence of red cell genetic polymorphisms frequent in the populations at risk, such as glucose-6-phosphate dehydrogenase deficiency (G6PDd) or haemoglobinopathies. There is considerable interest in understanding the mechanisms conferring protective effects against severe falciparum malaria of the genetic polymorphisms which are common in malaria endemic areas (@bib12). For some, such as the sickle cell trait, several different mechanisms have been proposed. 
These include reduced parasite erythrocyte invasion, enhanced parasitised red cell phagocytosis and a reduced propensity of infected red cells to sequester in the microvasculature (@bib7; @bib1; @bib2; @bib16). The mechanism underlying protection from severe falciparum malaria is less clear for others such as glucose-6-phosphate dehydrogenase deficiency (G6PDd). This X-linked genetic polymorphism results in the most common human enzymopathy. Nearly 200 different genetic variants have been reported (@bib5; @bib6). The mechanism whereby G6PD deficiency protects against malaria, and the natural selection forces which have resulted in the different genotypes are still debated. Prospective observational hospital or clinic based patient studies have provided the major component of the evidence base. Estimating causal effects from observational studies in severe malaria patients is difficult due to both confounding and selection bias. This work focuses on collider bias introduced by inappropriate data filtering (@bib9; @bib8). It has been suggested that G6PDd both increases the risk of severe malarial anaemia (SMA) and decreases the risk of cerebral malaria (CM) (@bib7; @bib3). These conclusions were based on a pooled analysis of observational data from over 11,000 patients with severe malaria studied in Africa and Asia, and relevant population controls. Based on these genetic association studies, a new formulation of the balancing-selection hypothesis was proposed in which G6PD polymorphisms are maintained in human populations, at least in part, by an evolutionary trade-off between different adverse outcomes of _P. falciparum_ infection (@bib3). Collider bias probably explains this negative association between G6PDd and CM, suggesting that causal interpretations of this association and the novel formulation of balancing selection in G6PDd are invalid. 
# Results

Two published analyses of pooled data from observational studies of patients with severe falciparum malaria used severe malarial anaemia (SMA) and cerebral malaria (CM) as the main endpoints (outcomes) of interest (@bib7; @bib3). Both of these analyses defined cases of CM as the presence of coma without concomitant SMA, and cases of SMA as patients with severe anaemia who were conscious. These case definitions therefore excluded patients who had both SMA and CM. All other presentations of severe malaria were also excluded (pulmonary oedema, shock, etc.). Population controls were recruited at each site to match the ethnic composition of cases, and in some instances cord blood samples were used as controls.

The consequence of these case definitions is to create an artificial dependency between SMA and CM: a patient included as an SMA case cannot also be a CM case. G6PDd is known to influence haemoglobin concentrations directly by causing haemolysis of older erythrocytes in acute malaria. Therefore it is to be expected that SMA is positively correlated with G6PDd, thus creating a negative correlation between CM and G6PDd among the included cases. In probabilistic terms, this conditional dependence is written as $\mathrm{P}(\mathsf{SMA} \mid \mathsf{CM}) \ne \mathrm{P}(\mathsf{SMA})$. Indeed, when all the G6PDd mutations were mapped onto the WHO severity classification score (@bib17), it was observed that "The mean G6PDd score was 13.5% in controls, 13% in cerebral malaria cases and 16.9% in severe malarial anaemia cases \[..\]." (page 8, @bib3). This pattern remained consistent (6, 5.6, and 7.1%, respectively) after exclusion of the G6PD c.202C > T mutation (one of the ‘A-’ mutations and the most prevalent in the pooled data).
By excluding patients with both SMA and CM (approximately 12% of those with either SMA or CM in the pooled data), the number of G6PDd patients in the CM category is artificially reduced and we would expect a lower proportion of G6PDd patients among the CM cases than in the control group.

[Figure 1](#fig1) proposes a simple causal diagram which posits plausible inter-dependencies limited to the variables of interest. A simple simulation study based on the assumptions shown in [Figure 1](#fig1) can be used to estimate the relationship between G6PDd and SMA which would result in the observed odds ratio for G6PDd in CM cases versus controls reported in @bib3. We assessed the null model in which there is no direct causal link between G6PDd and CM in severe falciparum malaria (i.e. no arrow from G6PD deficiency to CM). We also assumed that there is no direct link between SMA and CM. We calibrated the model with the marginal probabilities of SMA and CM reported in @bib3 and varied only the odds ratio of G6PDd in SMA cases versus controls, from 1 (assuming no effect of G6PDd on SMA) to 2 (twice as likely to be G6PDd in the SMA cases than in the controls). For simplicity (avoiding assumptions concerning gene dose effects) we restricted the analysis to males and compared the simulation results only with the reported associations in males.

figure: Figure 1.
:::
![](article.rmd.media/fig1.jpg)

## Causal diagram highlighting collider bias in @bib3 and @bib7.

G6PD deficiency is the exposure of interest (green) and cerebral malaria (CM) is the outcome of interest (red). By defining the CM cases as those who had coma but no severe anaemia, collider bias operates on the effect of G6PDd on CM.

::: {#fig1}

[Figure 2](#fig2) shows that if the odds ratio for G6PDd in SMA cases versus controls is strictly greater than 1, the estimated odds ratio for G6PDd in CM cases versus controls is biased (the thick red line is below the true simulated value of 1).
The magnitude of this bias increases monotonically as the odds ratio for G6PDd in SMA cases versus controls increases. As can be seen from the causal diagram in [Figure 1](#fig1), there is no bias in the estimated odds ratio for G6PDd in SMA cases versus controls (in [Figure 2](#fig2) the thick blue line approximates the identity line). This simple simulation model, restricted to males, shows that any value of the odds ratio for G6PDd in SMA cases versus controls inside the reported 95% confidence interval (CI) \[1.22–1.8\] from @bib3 results in a biased odds ratio for G6PDd in CM cases versus controls inside the interval \[0.69–0.98\], the reported 95% CI for G6PDd in CM cases versus controls (@bib3). Moreover, if we use their reported point estimate of 1.48, restricted to males, for the odds ratio of G6PDd in SMA cases versus controls, the simulation model estimates that the observed (biased) odds ratio for G6PDd in CM cases versus controls is 0.87, close to their estimate of 0.82. We note that the effect of G6PDd on severe anaemia in homozygous G6PDd girls and hemizygous G6PDd boys reported in @bib10, an odds ratio of 1.71 (95% CI: 1.34–2.18) for G6PDd in SMA cases versus controls, is also consistent with these results. Therefore collider bias could be sufficient to explain all the observed association.

chunk: Figure 2.
:::
## Results of the simulation based sensitivity analysis showing how collider bias can explain all the reported association between CM and G6PDd.

The simulation assumes that CM is independent of G6PD status but that SMA is dependent on G6PDd status ([Figure 1](#fig1)). Case definitions of CM and SMA exclude patients with both. The left panel shows the observed simulation based estimate of the odds ratio (OR) for G6PDd in SMA cases versus controls (y-axis) as a function of the true simulated value (x-axis). No bias arises (the observed and true values lie on the line of identity).
The right panel shows the observed simulation based estimate of the OR for G6PDd in CM cases versus controls (y-axis), again as a function of the true simulated value of the OR for G6PDd in SMA cases versus controls (x-axis). This estimate suffers from collider bias (the true value of the OR for G6PDd in CM cases versus controls was set to 1). The faint blue shaded areas show the 95% CI \[1.22–1.8\] for the odds ratio of G6PDd in SMA cases versus controls, restricted to males (@bib3). The point estimate (1.48) is shown by the dashed blue line. The faint red shaded area shows the 95% CI \[0.69–0.98\] for G6PDd in CM cases versus controls, also restricted to males, with the point estimate (0.82) shown by the dashed red line. CI: confidence interval.

```{r}
#' @width 18
#' @height 20
library(RColorBrewer)

# The number of malaria patients
N = 10^6
P_coma = 0.34
P_anaemia = 0.24
P_G6PDdef = 0.15
# Grid of values for P(anaemia | G6PD deficient)
PIs = seq(P_anaemia, 0.5, length.out = 100)
ORcoma = ORanaemia = array(dim=length(PIs))
TrueOR_anaemia = array(dim=length(PIs))
# odds of G6PDd in controls
O1 = P_G6PDdef/(1-P_G6PDdef)

for(i in seq_along(PIs)){
  # The Probability of being anaemic if you are G6PD deficient
  P_anaemia_def = PIs[i]
  # We solve the equation to work out the Probability of being anaemic if you are G6PD normal
  # This is dependent on the previous probabilities (simple algebra)
  P_anaemia_norm = (P_anaemia - P_G6PDdef*P_anaemia_def)/(1-P_G6PDdef)

  # We assume that G6PD status and Coma status are independent
  G6PDstatus = sample(c('Normal','Def'), size = N, replace = TRUE,
                      prob = c(1-P_G6PDdef, P_G6PDdef))
  Comastatus = sample(c('No Coma','Coma'), size = N, replace = TRUE,
                      prob = c(1-P_coma, P_coma))

  # Generate anaemia status dependent on G6PD status
  Anaemiastatus = array(dim = N)
  normals = G6PDstatus=='Normal'
  defs = !normals
  Anaemiastatus[normals] = sample(x = c('No Anaemia','Anaemia'), size = sum(normals),
                                  replace = TRUE, prob = c(1-P_anaemia_norm,P_anaemia_norm))
  Anaemiastatus[defs] = sample(x = c('No Anaemia','Anaemia'), size = sum(defs),
                               replace = TRUE, prob = c(1-P_anaemia_def,P_anaemia_def))

  Study_dat = data.frame(Coma = Comastatus, G6PD = G6PDstatus, Anaemia = Anaemiastatus)

  # True P(G6PDd | anaemia) by Bayes' rule, and the corresponding true odds ratio
  P_G6PDd_Anaemia = PIs[i]*P_G6PDdef/P_anaemia
  TrueOR_anaemia[i] = P_G6PDd_Anaemia/(1-P_G6PDd_Anaemia)/O1

  # now subselect only those without both
  study_patients = xor(Comastatus=='Coma' , Anaemiastatus == 'Anaemia')
  Study_dat = Study_dat[study_patients,]

  # odds of G6PDd in cerebral malaria group
  O2 = (sum(Study_dat$Coma=='Coma' & Study_dat$G6PD=='Def')/
          sum(Study_dat$Coma=='Coma' & Study_dat$G6PD=='Normal'))
  # odds ratio for G6PDd between cases and controls
  ORcoma[i] = O2/O1

  # odds of G6PDd in SMA group
  O2 = sum(Study_dat$Anaemia=='Anaemia' & Study_dat$G6PD=='Def')/
    sum(Study_dat$Anaemia=='Anaemia' & Study_dat$G6PD=='Normal')
  # odds ratio for G6PDd between cases and controls: SMA
  ORanaemia[i] = O2/O1
}
Results = list(ORanaemia=ORanaemia, ORcoma=ORcoma, TrueOR_anaemia=TrueOR_anaemia)

# Plots of results
bluecols = brewer.pal(3, 'Blues')
redcols = brewer.pal(3, 'Reds')
par(las = 1, bty='n', mfrow=c(1,2))

# Panel 1
{
  plot(Results$TrueOR_anaemia, Results$ORanaemia, xlab = '', type='l', lwd=3, col =bluecols[3],
       ylab='Observed odds ratio for G6PDd in SMA versus controls',
       ylim = c(0.7,1.8), xlim=c(1,2))
  title('Severe malarial anaemia (SMA)')
  mtext(text = 'True odds ratio for G6PDd\nin SMA versus controls', side = 1,line = 3.5)
  axis(2, at = c(0.7,.8), labels = c(0.7,''))
  polygon(c(-1,20,20,-1),c(1.22,1.22,1.8,1.8),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  polygon(c(1.22,1.22,1.8,1.8),c(-1,20,20,-1),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  abline(h = 1.48, col= bluecols[2], lwd=2, lty=2)
  abline(v = 1.48, col= bluecols[2], lwd=2, lty=2)
  abline(h=1)
  lines(Results$TrueOR_anaemia, Results$ORanaemia, lwd=3, col =bluecols[3])
}

# Panel 2
{
  plot(Results$TrueOR_anaemia, Results$ORcoma, lwd=3, col =redcols[3], xlab = '', type='l',
       xlim=c(1,2), ylab='Observed odds ratio for G6PDd in CM versus controls',
       ylim = c(0.7,1.8))
  title('Cerebral malaria (CM)')
  mtext(text = 'True odds ratio for G6PDd\nin SMA versus controls', side = 1,line = 3.5)
  axis(2, at = c(0.7,.8), labels = c(0.7,''))
  polygon(c(-1,20,20,-1),c(0.69,0.69,0.98,0.98),
          col=adjustcolor(redcols[1],alpha.f = 0.4), border = NA)
  polygon(c(1.22,1.22,1.8,1.8),c(-1,20,20,-1),
          col=adjustcolor(bluecols[1],alpha.f = 0.4), border = NA)
  abline(h = 0.82, col= redcols[2], lwd=2, lty=2)
  abline(v = 1.48, col= bluecols[2], lwd=2, lty=2)
  abline(h=1)
  lines(Results$TrueOR_anaemia, Results$ORcoma, lwd=3, col =redcols[3])
}
```

![](article.rmd.media/fig2.jpg)

::: {#fig2}

# Discussion

This re-analysis of recent reports that G6PDd reduced the risk of CM directly (@bib7; @bib3) suggests that the observations could have resulted entirely from collider bias. This highlights the difficulty of inferring causal relationships between baseline patient covariates (in this case G6PDd) and covariates which define inclusion criteria. The causal odds ratios for G6PDd in SMA cases versus controls that would be needed to produce the biased observed association between G6PD status and CM are consistent with the recent estimate of 1.71 in homozygous girls and hemizygous boys (@bib10). This is not to say that the risk of CM is unaffected by G6PDd, but that the observations reported could have arisen purely as a result of the implicit collider bias induced by the selection of patients by the severe malaria criteria.

The example reported here highlights a major difficulty when attempting to estimate causal contributions when factors of interest define inclusion into the clinical study and the subsequent data analysis. All prospective observational severe malaria studies suffer from two major issues. First, case definitions are subjective and change over time (even though standard guidelines exist, see @bib15) and mortality is strongly dependent on the case definition.
Second, enrolment into studies can only be done at the hospital or clinic level and neither duration of illness nor treatment seeking behaviour can be accounted for adequately.

The notion and definition of ‘severe malaria’ have two operational purposes. First, it is a clinical tool for appropriate triage of malaria patients at high risk of death. Second, it is a research tool for the evaluation of novel interventions seeking to reduce mortality. Interventions aimed at reducing mortality need to be trialled in the most severely ill patients in order to demonstrate intervention efficacy in this important subgroup. Pooled analyses of severe malaria studies need to take into account the variability of study inclusion and exclusion criteria. Researchers must appreciate that severe malaria is not an objective category but a subjective case definition subset from a spectrum of severity. Moreover, restricting analyses to specific patient subgroups, especially in the analysis of large pooled datasets, can have a considerable impact on the final result (@bib4).

With increased emphasis on providing open access data so that analyses can be evaluated and best use made of clinical research, it would be very helpful if investigators could publish reproducible code alongside their analyses. Future genetic epidemiological studies could benefit from use of causal diagrams and would be more readily evaluable by provision of accompanying code.

# Materials and methods

## Data analysis in @bib3 and @bib7

The odds ratios for G6PDd in cases versus controls are given in Table 3 of @bib7 (page 1201). The results, restricted to males, are 0.81 (95% CI: 0.68–0.96) for CM and 1.49 (95% CI: 1.24–1.79) for SMA. The case phenotype definitions are denoted ‘Cerebral malaria only’ for CM and ‘Severe malarial anaemia only’ for SMA. These case definitions, whereby patients with both SMA and CM are excluded, are also given in their Table 1.
A total of 6283 cases had cerebral malaria or severe malarial anaemia: 3345 had cerebral malaria only, 2196 had severe malarial anaemia only, and 742 had both cerebral malaria and severe malarial anaemia. The reported odds ratios were computed using logistic regression models with the main adjustment of interest being sickle haemoglobin genotype (HbS). We only consider the reported results restricted to males. The relevant section from the paper is: "_Single-SNP tests, adjusted for HbS genotype, sex and ancestry, for association with severe malaria and the severe malaria subtypes cerebral malaria only and severe malarial anemia only were performed for the 55 SNPs with a known association with severe malaria. Standard logistic regression models were used for tests of association at each autosomal SNP (Supplementary Table 25). Primary analyses comprised tests of association between each SNP and severe malaria phenotypes across all individuals combined as well as separately by sex (X-chromosome SNPs only) and study site: genotypic, additive, dominant, recessive and heterozygote advantage genetic models of inheritance were considered._" (online Methods, Statistical analysis; @bib7).

## @bib3

The odds ratios for G6PDd in cases versus controls are given in Table 3 of @bib3 (page 6). In this publication, the results, restricted to males, are 0.82 (95% CI: 0.69–0.98) for CM and 1.48 (95% CI: 1.22–1.8) for SMA (their Table 3). The reason for the slight discrepancy between the two publications does not appear to be stated. They included a total of 6284 patients with cerebral malaria or severe malarial anaemia: 3359 had cerebral malaria only, 2184 had severe malarial anaemia only, and 741 had both cerebral malaria and severe malarial anaemia. Tables 1 and 6 of @bib3 show the case definitions for CM and SMA, highlighting that those who have both clinical presentations are excluded from the respective case definitions.
This is further confirmed on page 18 where the authors state: "_For reasons of sample size, we did not conduct a detailed analysis of other sub-types of severe malaria, or of those individuals who had both cerebral malaria and severe malarial anaemia._" Standard logistic regression models were also used to obtain the odds ratios: "_In primary analyses, standard fixed effects logistic regression methods were used for tests of association with severe malaria and sub-types at each SNP under additive, dominant, recessive and heterozygous models. \[..\] Results were adjusted for sickle hemoglobin (HbS), gender and ethnicity._" (bottom of page 18).

## Sensitivity analysis

In order to characterise the proportion of the reported association explained by collider bias we constructed a simple simulation study based on the analytical procedures in @bib7 and @bib3. We restrict our simulation to males, and calibrate and test the model using only the results reported in males in both publications (identical to one decimal place). For simplicity, the simulation ignores the effect of HbS, which is a confounder between SMA and CM, and generates data where the two presentations occur independently, which is equivalent to adjusting for HbS in the regression model. The procedure generates simulated data dependent on a parameter characterising the effect of G6PDd on severe malarial anaemia. As no adjustment is necessary, we then compute the non-parametric odds ratio for G6PDd in CM cases versus controls, excluding from the CM case definition all those who have SMA. This simulated ‘observed’ odds ratio estimated from the 2 x 2 table (cases and controls versus G6PDd and G6PD normals) is thereby directly comparable to the reported odds ratios (obtained from logistic regression with the appropriate adjustments) in @bib7 and @bib3, if we assume that sex, ethnicity and _HbS_ are the only true confounders (i.e. all necessary adjustments were made in both publications).
The hypothetical data were simulated based on the following assumptions:

1. G6PD deficiency increases the risk of SMA in acute symptomatic malaria (the size of the effect is the only free parameter in the model and varies from 1 to 2 as defined by the odds ratio for G6PDd in SMA cases versus controls). In reality this would be expected to be a function of genotype and gene dose (i.e. hemizygotes and homozygotes would have a greater risk than heterozygotes).
2. CM is independent of G6PD status.
3. A population of males only (i.e. no heterozygote women, so no partial effects) with homogeneous background frequency of G6PDd.
4. CM and SMA occur independently.

From the data in @bib3 and @bib7, we can estimate the marginal probability of CM as 0.34, independent of G6PD status (assumption 2). We can also estimate the marginal probability of SMA as 0.24. The probability of G6PDd in males was 0.15. If we denote $\pi = \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd}) \ge \mathrm{P}(\mathsf{SMA}) = 0.24$, then by the law of total probability:

$$
\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}) = \frac{\mathrm{P}(\mathsf{SMA}) - \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd})\,\mathrm{P}(\mathsf{G6PDd})}{1 - \mathrm{P}(\mathsf{G6PDd})} = \frac{0.24 - 0.15\pi}{0.85}
$$

where G6PDd denotes G6PD deficient and G6PDn denotes G6PD normal.
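
As a small illustration of the law of total probability used above, the following R sketch evaluates $\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn})$ for a few values of $\pi$ and checks that the marginal probability of SMA is recovered. The function name `p_sma_given_normal` and the five-point grid are illustrative choices, not part of the published analysis.

```r
# Marginal probabilities taken from the text
P_SMA <- 0.24   # marginal probability of severe malarial anaemia
P_def <- 0.15   # probability of G6PD deficiency in males

# P(SMA | G6PD normal) implied by pi = P(SMA | G6PD deficient)
# (hypothetical helper name, for illustration only)
p_sma_given_normal <- function(pi_def) {
  stopifnot(all(pi_def >= 0), all(pi_def <= 1))
  (P_SMA - P_def * pi_def) / (1 - P_def)
}

# A coarse grid over the range of pi considered in the text
pi_grid <- seq(P_SMA, 0.5, length.out = 5)
pi_norm <- p_sma_given_normal(pi_grid)

# Check: mixing the two conditional probabilities gives back P(SMA)
marginal <- P_def * pi_grid + (1 - P_def) * pi_norm

print(data.frame(pi_def = pi_grid, pi_norm = pi_norm, marginal = marginal))
```

At the lower end of the grid ($\pi = 0.24$) G6PD status has no effect on SMA, so both conditional probabilities equal the marginal; as $\pi$ increases, $\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn})$ must decrease to keep the marginal fixed at 0.24.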
The true proportion of G6PDd in the controls is known and fixed as $\mathrm{P}(\mathsf{G6PDd})$, and the proportion of G6PDn in the controls is $1 - \mathrm{P}(\mathsf{G6PDd})$. The odds of G6PDd in the control group are therefore $\frac{\mathrm{P}(\mathsf{G6PDd})}{1 - \mathrm{P}(\mathsf{G6PDd})}$. We then simulate cases as follows. For each value of $\pi \in [0.24, 0.5]$:

**Step 1**. Simulate 1 million patients such that: $\mathrm{P}(\mathsf{CM}) = 0.34$, $\mathrm{P}(\mathsf{G6PDd}) = 0.15$, and $\mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDd}) = \pi$.

**Step 2**. Select only the patients who have either just CM, or just SMA, filtering out those with concomitant SMA and CM.

**Step 3**. In the remaining data compute $\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{SMA}}$ (number of G6PDd patients with SMA); $\mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{SMA}}$ (number of G6PDn patients with SMA); $\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{CM}}$ (number of G6PDd patients with CM); $\mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{CM}}$ (number of G6PDn patients with CM).
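
The three steps can also be evaluated in expectation, without random sampling, as a cross-check on the simulation. The sketch below is not the authors' implementation: it replaces the simulated counts with expected proportions under the stated assumptions, and takes each odds ratio as the odds of G6PDd among the selected cases divided by the odds of G6PDd in controls. The function name `expected_ORs` and the grid of $\pi$ values are hypothetical.

```r
# Marginal probabilities taken from the text
P_CM  <- 0.34   # marginal probability of cerebral malaria
P_SMA <- 0.24   # marginal probability of severe malarial anaemia
P_def <- 0.15   # probability of G6PD deficiency in males

# Expected odds ratios after the case-definition filter
# (hypothetical helper, for illustration only)
expected_ORs <- function(pi_def) {
  # P(SMA | G6PD normal) from the law of total probability
  pi_norm <- (P_SMA - P_def * pi_def) / (1 - P_def)

  # Step 1: population proportions by G6PD status; CM is independent of both
  p_g6pd <- c(def = P_def, norm = 1 - P_def)
  p_sma  <- c(def = pi_def, norm = pi_norm)

  # Step 2: keep "CM only" and "SMA only"; patients with both are excluded
  cm_only  <- p_g6pd * P_CM * (1 - p_sma)
  sma_only <- p_g6pd * p_sma * (1 - P_CM)

  # Step 3: odds of G6PDd in each case group relative to the control odds
  odds_controls <- P_def / (1 - P_def)
  c(OR_SMA = (sma_only[["def"]] / sma_only[["norm"]]) / odds_controls,
    OR_CM  = (cm_only[["def"]]  / cm_only[["norm"]])  / odds_controls)
}

pi_grid <- c(0.24, 0.30, 0.40, 0.50)
res <- t(sapply(pi_grid, expected_ORs))
print(round(cbind(pi = pi_grid, res), 3))
```

Under these assumptions the expected CM odds ratio simplifies to $(1 - \pi) / (1 - \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn}))$, which is below 1 whenever $\pi > 0.24$, while the expected SMA odds ratio simplifies to $\pi / \mathrm{P}(\mathsf{SMA} \mid \mathsf{G6PDn})$ and is unaffected by the exclusion.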
The simulated odds ratio for G6PDd in SMA cases versus controls is $\frac{\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{SMA}} / \mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{SMA}}}{\mathrm{P}(\mathsf{G6PDd}) / (1 - \mathrm{P}(\mathsf{G6PDd}))}$, and the simulated odds ratio for G6PDd in CM cases versus controls is $\frac{\mathrm{N}_{\mathsf{G6PDd}}^{\mathsf{CM}} / \mathrm{N}_{\mathsf{G6PDn}}^{\mathsf{CM}}}{\mathrm{P}(\mathsf{G6PDd}) / (1 - \mathrm{P}(\mathsf{G6PDd}))}$.

The causal diagram which corresponds to the exclusion operating in Step 2 is shown in [Figure 1](#fig1). Selection bias can be seen via the role of the vertex _Case definition_, which is a collider between _G6PD deficiency_ and _Cerebral malaria (CM)_. Implementation of the simulation model in R can be found at: <https://github.com/Stije/SevereMalariaAnalysis/SelectionBiasSimulation.Rmd> (@bib11; copy archived at <https://github.com/elifesciences-public