Using PLINK R Plugin and R scripts

Considering the computational and memory requirement of genome-wide data, the package offers two possible approaches for running the gJLS analyses in non-GUI: 1) using the command-line Rscript, or 2) using the genetic software PLINK via an R plugin function.

In general, we recommend the users to divide the genotype data by chromosome (internally in the scripts) and maybe take advantage of parallel computing when using a server with multiple cores or processors.

PLINK and an R plugin function

We illustrate gJLS analyses using simulated phenotype, covariates, and real X-chromosome genotypes from 1000 Genomes Project via a simple yet adaptable R plugin function for autosome and X-chromsome; the function can be easily modified to suit the user’s needs. We plan to have a separate software just for autosomal scale analysis in the future, so here we focus on X-chromosome analyses alone.

Before we start, we have to make sure both and packages are installed. More details on the R plugin function can be found here. In some cases, the default port does not work and you can specify a port number when starting and in PLINK.

The following R plugin script is quite simple since the majority of the coding is now nested in the gJLS2 function. Notice that for X-chromosome, the genetic sex must be provided and this is done through setting the first column of the covariate file to be the sex variable. To run PLINK R plugin function, save this coding chunk into an R script and give it a name, such as ‘run_gJLS2PLINK_Xchr.R’.

Rplink <- function(PHENO,GENO,CLUSTER,COVAR){
 require(gJLS2)
 
  f1 <- function(s) 
       {    
      r <-  gJLS2(GENO=s, SEX=COVAR[,1], Y=PHENO, COVAR=COVAR[,-1], Xchr=TRUE, genotypic=TRUE)
      rr <- as.numeric(r[3:5])
      
      c( length(rr) , rr )
       }
      apply( GENO , 2 , f1 )

}

Notice that the first variable in the PLINK COVAR term is in fact the sex variable, taken from the .fam file. So please be sure to check if sex information is correctly stored in .fam file prior to this analysis.

The remaining step is done in the bash command line by calling PLINK and you can test the script with the files extracted from the file here.

R CMD Rserve --RS-port 8221 --no-save

plink --bfile ./input/chrX_5_snp \
--R run_gJLS2PLINK_Xchr.R \
--pheno ./input/Pheno.txt \
--pheno-name pheno1 \
--R-port 8221 \
--covar ./input/Pheno.txt \
--covar-name SEX,covar1,covar2,covar3 \
--out ./output/testRun

The PLINK has an option to debug the script when necessary, it can be done on a subset of the data using the following:

plink --bfile ./input/chrX_5_snp \
--R run_gJLS2PLINK_Xchr.R \
--R-debug \
--pheno ./input/Pheno.txt \
--pheno-name pheno1 \
--R-port 8221 \
--covar ./input/Pheno.txt \
--covar-name SEX,covar1,covar2,covar3 \
--out ./output/testRun

This generates a testRun.debug.R file in which the data needed for the analysis are stored, and the user can proceed to debug in R:

source("testRun.debug.R")
require(gJLS2)
 
  f1 <- function(s) 
       {    
      r <-  gJLS2(GENO=s, SEX=ifelse(COVAR[,1]==0, 2, COVAR[,1]), Y=PHENO, COVAR=COVAR[,-1], Xchr=TRUE)
      rr <- as.numeric(r[3:5])
      
      c( length(rr) , rr )
       }
apply(GENO , 2 , f1)

This option has the flexibility to perform data filtering via PLINK, but sometimes setting up Rserve to communicate with PLINK properly could be a problem on some servers. We recommend running this analysis by either chromosome or smaller units when sample sizes become large as the R apply() function supported by PLINK is not the most efficient. There is a risk of losing the results if the analyses were not fully completed on GENO due to external factors, such as power loss or server downtime.

To have better write control over larger analyses (whole chromosome), we recommend splitting data into manageable chunks and combine the results after using the following bash script:

#!/bin/bash

export R_LIBS_USER=~/R/x86_64-pc-linux-gnu-library/4.1/
R CMD ~/R/x86_64-pc-linux-gnu-library/4.1/Rserve/libs/Rserve --RS-port 4221 --no-save

declare -a chunk_name=()

ctr=$1 ## your choice of the chunk size

total=($(awk '{print NR}' ./input/chrX_5_snp))
END=(${#total[@]})

b=($(seq 1 $ctr $END))
b+=($END)

for ((i = 0; i < ((${#b[@]}-1)); i++))
do
    echo "$i"
    echo "Running anaysis between ${b[$i]}"
    j=$((1 + i))
    echo "and ${b[$j]}"
   
    if [ $i -gt 0 ]; then
        chunk_1=$((${b[$i]} + 1))
    else
        chunk_1=$((${b[$i]}))
    fi
    chunk_2=(${b[$j]})
    snp1=$(awk -v pattern="$chunk_1" 'NR==pattern' ./input/chrX_5_snp.bim | cut -f2)
    snp2=$(awk -v pattern="$chunk_2" 'NR==pattern' ./input/chrX_5_snp.bim | cut -f2)
    echo $snp1
    echo $snp2
    
plink --bfile ./input/chrX_5_snp \
--R run_gJLS2PLINK_Xchr.R \
--R-port 4221 \
--from $snp1 \
--to $snp2 \
--pheno /input/Pheno.txt \
--pheno-name pheno1 \
--covar /input/Pheno.txt \
--covar-name SEX covar1 covar2 covar3 \
--out chunk${j}.results.temp   
done

The below script can be created and named run_chr_PLINK.sh. And the user can run on either the entire genome/chromosome by splitting data into chunk size of chunk_num:

chunk_num=1
bash run_chr_PLINK.sh $chunk_num

The script can be further customized, such as to use variables in place of the genotype/pheno files, etc. to suit the needs of the user.

Rscript

The Rscript run_gJLS2.R or run_gJLS2_foreach.R in the extdata folder can serve as a starting point to customized the analyses for each user. The arguments available in this Rscript included the very basic ones and additional ones can be added easily. A useful option is “write”, where the user can specify the chunk size for results to be written in the file as the program runs. Another important feature is the “-nThreads” option, where the user can take advantage of the multiple cores and processors available from high performance computing clusters.

As an example, in the extdata directory, the Rscript can be executed on the command line and output the results:

Rscript run_gJLS2.R --bfile ./input/chrX_5_snp \
--pfile ./input/Pheno.txt \
--pheno pheno1 \
--Xchr TRUE \
--write 2 \
--nThreads 2 \
--covar SEX,covar1,covar2,covar3 \
--out ./output/testRun.results.txt

run_gJLS2.R has only been tested on linux and Mac OS as it invokes mclapply from parallel R package, which has some known issues for multi-core computing on Windows systems. Also, the current implementation of multi-core does not work well on biobank scale data (needing to create copies of the data over cores) and renders the nTasks option useless (no significant improvement on speed on Linux, but is okay on Mac OS). So another option using foreach from the foreach package is provided:

Rscript run_gJLS2_foreach.R --bfile ./input/chrX_5_snp \
--pfile ./input/Pheno.txt \
--pheno pheno1 \
--Xchr TRUE \
--nTasks 2 \
--write 2 \
--covar SEX,covar1,covar2,covar3 \
--out ./output/testRun.results.txt

The write option here can be specified to control the number of markers results printed to file in case of any interruptions. For genome-wide/chromosome-wide analyses, it can be set to a large number such as 100 or 1000.

Alternatively, summary statistics can also be analyzed using the command-line option (without the need for multi-cores as it is quite fast):

Rscript run_gJLS2.R --sumfile ./input/GIANT_BMI_chr16_gJLS_summary.txt \
--out ./output/GIANT_BMI_Sum.chr16_results.txt