X-chromosome association methods

X-chromosome Location (mean) association

X-chromosome association poses a unique set of challenges. Even when considering GWAS data alone, the following must be taken into account:

  • status of X-chromosome inactivation (XCI)
  • \(G_A\): choice of baseline allele (\(a\) vs \(A\)) for the additive coding
  • \(S\): sex as a confounder
  • \(G_A\times S\): a possible gene-by-sex interaction
  • \(G_D\): dominance effect

Ignoring any of the above can compromise inference and/or estimation of genetic effects. Our proposed approach addresses the inference problem, but unbiased estimation of effects remains an open question.

The following 3 d.f. test resolves the above issues simultaneously from a testing perspective (Chen et al., 2021):

\[g(E(Y))=\beta_0+\beta_S \text{Sex}+\beta_{A} G_{\text{Additive}}+\beta_{D} G_{\text{Dominance}} + \beta_{GS}\, G\times \text{Sex},\] \[H_0: \beta_A=\beta_D=\beta_{GS}=0.\]

Essentially, with this model, we are free to code \(G_A\) in whichever way we choose, and the association test results will be the same. At the same time, we have shown the statistical equivalence between:

  • XCI uncertainty = \(G\times Sex\)
  • XCI skewness = a dominance effect

This is both a blessing and a curse. On the bright side, we have a testing model in which we do not have to worry about XCI status when performing X-chromosome association testing. But at the same time, because of the statistical equivalence, we cannot identify the true data-generating model from GWAS data alone. Indeed, Song et al. (2021) demonstrated that SNP coefficients estimated under a mis-specified data-generating model can be biased in various situations (when the assumed XCI status does not match the truth).
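As a numerical illustration of the 3 d.f. test above, the following Python sketch fits the full and null models by ordinary least squares and compares them with a nested-model F statistic. It is only a sketch under stated assumptions: the link \(g\) is taken to be the identity, the data are simulated, sex is coded 0 = female and 1 = male, and the helper names `design`, `rss`, and `f_test_3df` are hypothetical rather than part of any published software.

```python
import numpy as np

def design(copies, sex, male_code):
    """Columns: intercept, Sex, G_Additive, G_Dominance, G x Sex.

    copies: allele count (0/1/2 in females, 0/1 in males).
    sex: 0 = female, 1 = male (assumed coding).
    male_code: additive value given to a male carrier (1 or 2).
    """
    copies = np.asarray(copies, dtype=float)
    sex = np.asarray(sex, dtype=float)
    g_add = np.where(sex == 1, male_code * copies, copies)
    # Dominance indicator: heterozygous females only.
    g_dom = ((sex == 0) & (copies == 1)).astype(float)
    return np.column_stack([np.ones_like(sex), sex, g_add, g_dom, g_add * sex])

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_test_3df(y, copies, sex, male_code):
    """F statistic for H0: beta_A = beta_D = beta_GS = 0 (identity link assumed)."""
    X_full = design(copies, sex, male_code)
    X_null = X_full[:, :2]  # intercept and Sex only
    df_num = X_full.shape[1] - X_null.shape[1]
    df_den = len(y) - X_full.shape[1]
    rss_full, rss_null = rss(X_full, y), rss(X_null, y)
    F = ((rss_null - rss_full) / df_num) / (rss_full / df_den)
    return F, df_num, df_den

# Simulated data, for illustration only.
rng = np.random.default_rng(1)
n = 600
sex = rng.integers(0, 2, size=n)
copies = np.where(sex == 0,
                  rng.binomial(2, 0.3, size=n),
                  rng.binomial(1, 0.3, size=n))
y = 0.3 * copies + 0.2 * sex + rng.normal(size=n)

# Males coded 0/1 versus 0/2.
F_no_xci, df1, df2 = f_test_3df(y, copies, sex, male_code=1)
F_xci, _, _ = f_test_3df(y, copies, sex, male_code=2)

# Swapping the baseline allele.
copies_flip = np.where(sex == 0, 2 - copies, 1 - copies)
F_flip, _, _ = f_test_3df(y, copies_flip, sex, male_code=1)

print(F_no_xci, F_xci, F_flip, df1, df2)
```

The three F statistics agree up to floating-point error because each coding spans the same column space in the full model, which is the sense in which the test does not depend on how \(G_A\) is coded.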

X-chromosome Scale (variance) association

The solution here is essentially a model-based generalized Levene’s test, in which additional covariates and the confounding sex variable (S) can be modelled explicitly. The general framework is a two-stage regression on the genotype (G); if additional covariates are included, they must be present in both stages:

Stage One: Mean models \[ Y \sim \beta_0 + \beta_{G}G + \beta_{S} S + \beta_{GS}GS \]

Stage Two: Variance models \[ Z^{*} \sim \gamma_0 + \gamma_{G}G + \gamma_{S}S + \gamma_{GS}GS, \] where \(Z^{*}\) is the weighted absolute residual.

Let \(Z = |Y - \hat{Y}|\) be the absolute difference between the observed values and the fitted values from the mean model. The weighted residual is: \[ Z^{*} = \frac{Z}{1_{(S=0)}\hat{\sigma}_\text{F} + 1_{(S=1)}\hat{\sigma}_\text{M}}, \] where \(\hat{\sigma}_\text{F}\) and \(\hat{\sigma}_\text{M}\) denote the sample standard deviations of \(Y\) in females and males, respectively.

Note that a non-additive variance model (NAV) in stage two may also be considered to capture non-linear variance effects (via the genotypic option): \[ Z^{*} \sim \gamma_0 + \gamma_{G1}G1 + \gamma_{G2} G2 + \gamma_{S}S + \gamma_{G1S}G1S \]

where \(G1\) and \(G2\) are indicator variables for the Bb and BB+B groups under X-inactivation, respectively, or alternatively for the Bb+B and BB groups without X-inactivation.
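The grouping described above can be sketched as a small Python helper. This is an illustrative reading of the two codings, not the authors' implementation: the function name `nav_indicators`, the input encoding (counts of allele B, with 0/1/2 in females and 0/1 in males), and the sex coding (0 = female, 1 = male) are all assumptions.

```python
import numpy as np

def nav_indicators(copies, sex, xci=True):
    """Return (G1, G2) indicator vectors for the non-additive variance model.

    copies: count of allele B (0/1/2 in females, 0/1 in males), assumed encoding.
    sex: 0 = female, 1 = male (assumed coding).
    xci=True:  G1 marks Bb,     G2 marks BB or B.
    xci=False: G1 marks Bb or B, G2 marks BB.
    """
    copies = np.asarray(copies)
    sex = np.asarray(sex)
    het = (sex == 0) & (copies == 1)   # Bb females
    hom = (sex == 0) & (copies == 2)   # BB females
    hemi = (sex == 1) & (copies == 1)  # B males
    if xci:
        g1, g2 = het, hom | hemi
    else:
        g1, g2 = het | hemi, hom
    return g1.astype(int), g2.astype(int)

# Three females (bb, Bb, BB) followed by two males (b, B).
copies = [0, 1, 2, 0, 1]
sex = [0, 0, 0, 1, 1]
g1_xci, g2_xci = nav_indicators(copies, sex, xci=True)
g1_no, g2_no = nav_indicators(copies, sex, xci=False)
print(g1_xci, g2_xci, g1_no, g2_no)
```

The only difference between the two codings is whether a male carrying B is grouped with BB females or with Bb females.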

The test for variance heterogeneity is carried out in stage two by testing \[ H_0: \gamma_{G}= \gamma_{GS} = 0 \quad \text{or} \quad \gamma_{G1} =\gamma_{G2}= \gamma_{G1S}= 0 \] via the standard regression \(F\)-test, where the model is fitted using ordinary least squares (OLS) for independent samples or generalized least squares (GLS) for dependent samples.
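The two-stage procedure with the additive variance model can be sketched in Python as follows. This is a minimal sketch for independent samples only (OLS in both stages, no GLS and no extra covariates). The function names `ols_resid` and `scale_test`, the simulated data, and the additive coding of \(G\) in males are assumptions made for illustration.

```python
import numpy as np

def ols_resid(X, y):
    """Residuals from the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def scale_test(y, g, sex):
    """Two-stage test of H0: gamma_G = gamma_GS = 0 (independent samples).

    sex: 0 = female, 1 = male, matching the weighting formula.
    """
    n = len(y)
    ones = np.ones(n)
    X = np.column_stack([ones, g, sex, g * sex])

    # Stage one: mean model, then absolute residuals Z.
    z = np.abs(ols_resid(X, y))

    # Weight by the sex-specific sample standard deviation of Y.
    sd_f = y[sex == 0].std(ddof=1)
    sd_m = y[sex == 1].std(ddof=1)
    z_star = z / np.where(sex == 0, sd_f, sd_m)

    # Stage two: regress Z* on the same terms; compare against a sex-only null.
    r_full = ols_resid(X, z_star)
    r_null = ols_resid(np.column_stack([ones, sex]), z_star)
    rss_full, rss_null = float(r_full @ r_full), float(r_null @ r_null)
    df_num, df_den = 2, n - X.shape[1]
    F = ((rss_null - rss_full) / df_num) / (rss_full / df_den)
    return F, df_num, df_den

# Simulated data with genotype-dependent variance, for illustration only.
rng = np.random.default_rng(2)
n = 800
sex = rng.integers(0, 2, size=n)
copies = np.where(sex == 0,
                  rng.binomial(2, 0.3, size=n),
                  rng.binomial(1, 0.3, size=n))
g = np.where(sex == 1, 2 * copies, copies).astype(float)  # assumed coding
y = rng.normal(scale=1.0 + 0.3 * g, size=n)

F, df_num, df_den = scale_test(y, g, sex)
print(F, df_num, df_den)
```

A p-value can then be read from the upper tail of the \(F\) distribution with `df_num` and `df_den` degrees of freedom, for example with `scipy.stats.f.sf(F, df_num, df_den)` if SciPy is available.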