Abstract
Large genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) analyses can be integrated to better understand the genetic causality of complex traits and diseases. Colocalization methods allow to combine those datasets. One of the main challenges in colocalization analyses is the presence of multiple signals in GWAS regions around a gene. This multiplicity of signals can confound the colocalization process, making it difficult to pinpoint specific variants driving the associations. To address this, colocRedRibbon isolates the signal peaks of interest before conducting the colocalization analysis.
To demonstrate colocRedRibbon’s basic workflow, we will perform a colocalization analysis using type 2 diabetes GWAS data and pancreatic islet eQTL data. We will use the real dataset “th”, which is composed of a GWAS for type 2 diabetes (derived from the https://www.diagram-consortium.org/) and a human pancreatic eQTL colocalization study for the gene TH, encoding tyrosine hydroxylase (originating from http://tiger.bsc.es/). The region around the TH gene will be analyzed to identify shared genetic variants that influence both type 2 diabetes risk and TH gene expression in pancreatic islets.
We first load the package and the dataset with
library(colocRedRibbon)
#> Loading required package: coloc
#> This is coloc version 5.2.3
#> Loading required package: data.table
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Loading required package: RedRibbon
#> Loading required package: scales
#> Loading required package: ggrepel
## load data.table package for fast data.frame data structure
library(data.table)
## load the test dataset
data("th", package = "colocRedRibbon")
The data() function loads a data.table
named th.dt into the current environment. Once the data is
loaded, you can use the head() function to display the
first few rows of the dataset.
| rsid | pval.GWAS | n.GWAS | eaf.GWAS | or.GWAS | ea.GWAS | nea.GWAS | pval.eQTL | pos | n.eQTL | zscore.eQTL | ea.eQTL | nea.eQTL | eaf.eQTL | dir.GWAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs1003483 | 0.35 | 231420 | 0.510 | 1.0063199 | T | G | 0.68710 | 2167543 | 404 | 0.403 | T | G | 0.510 | 0.0063 |
| rs1003484 | 0.64 | 231420 | 0.260 | 0.9965061 | A | G | 0.92180 | 2167618 | 404 | -0.098 | A | G | 0.260 | -0.0035 |
| rs1003889 | 0.99 | 187126 | 0.011 | 0.9993002 | T | G | 0.58720 | 1970108 | 317 | 0.543 | T | G | 0.011 | -0.0007 |
| rs1004446 | 0.38 | 231420 | 0.380 | 1.0059174 | A | G | 0.92870 | 2170143 | 404 | 0.089 | A | G | 0.380 | 0.0059 |
| rs1005135 | 0.52 | 231420 | 0.210 | 0.9949130 | C | G | 0.08475 | 2539890 | 404 | -1.724 | C | G | 0.210 | -0.0051 |
| rs1005236 | 0.16 | 231420 | 0.730 | 1.0110607 | A | G | 0.89080 | 2448519 | 404 | 0.137 | A | G | 0.730 | 0.0110 |
Create an S3 colocRedRibbon object for the risk alleles decreasing expression,
rrc.dec <- RedRibbonColoc(th.dt, risk="a", effect=`<=`,
columns=c(id="rsid", position="pos", a="pval.GWAS", b="pval.eQTL",
a.n="n.GWAS", a.eaf="eaf.GWAS", a.dir="dir.GWAS",
b.n="n.eQTL", b.eaf="eaf.eQTL", b.dir="zscore.eQTL"))
The risk allele is specified with the risk parameter,
which can be either NULL, ‘a’ or ‘b’. It
represents the odds ratios for the analysis assigned to the a
or b column. The effect parameter is set to
<= to subselect the decreasing risk alleles. The
column parameter defines the role of the
data.table columns. The data.table should include the
columns id, a, b and position, which
represent the name of the SNP (Single Nucleotide Polymorphism), the
p-value of the first analysis (here, type 2 diabetes GWAS), the p-value
of the second analysis (here, human pancreatic islet eQTL), and the
position of the SNP on the chromosome, respectively. In the following
steps, RedRibbon will use the specified p-values for the overlap
analysis. In addition to these essential columns, the data.table may
include supplementary columns that facilitate the computation of
co-localization.
Next, the co-localization can be computed with
## Run C. Wallace coloc()
rrc.dec <- coloc(rrc.dec)
Several optional parameters can be specified to customize the behavior of the coloc function. For more information see the documentation.
To extract the best SNP:
coloc.res["bestSnp"]
## Is this command correct?
## Anything else that we can extract?
The RedRibbon overlap p-value is given by
rrc.dec$quadrants$whole$pvalue
#> [1] 2.621389e-41
and the overlapping variants are
rrc.dec$data[rrc.dec$quadrants$whole$positions]$id
#> [1] "rs10743152" "rs10770140" "rs10770141" "rs10770142" "rs10770143"
#> [6] "rs10840495" "rs10840496" "rs10840500" "rs11042965" "rs11042966"
#> [11] "rs11042976" "rs11042978" "rs11564711" "rs4929964" "rs4929965"
#> [16] "rs4929966" "rs4930046" "rs7115640" "rs7119275" "rs7128097"
#> [21] "rs72853903" "rs7482891"
The coloc structure contains
rrc.dec$coloc
#> $bestSnp
#> [1] "rs4929965"
#>
#> $PP.H4.abf
#> PP.H4.abf
#> 0.9799012
#>
#> $SNP.PP.H4
#> [1] 0.2890225
#>
#> $ncredibleSet99
#> [1] 12
#>
#> $credibleSet99
#> [1] "rs4929965" "rs10770142" "rs4929964" "rs7482891" "rs10840496"
#> [6] "rs11042966" "rs10743152" "rs10840495" "rs7128097" "rs7115640"
#> [11] "rs7119275" "rs10770141"
#>
#> $all.snps
#> snp position pvalues.df1 MAF.df1 N.df1 V.df1 z.df1
#> 1: rs4929965 2197286 4.8e-25 0.38 231420 9.170517e-06 10.336862
#> 2: rs10770142 2194420 7.7e-25 0.38 231420 9.170517e-06 10.291460
#> 3: rs4929964 2197132 1.4e-24 0.38 231420 9.170517e-06 10.233744
#> 4: rs7482891 2197112 1.7e-24 0.38 231420 9.170517e-06 10.214931
#> 5: rs10840496 2195844 1.9e-24 0.38 231420 9.170517e-06 10.204138
#> 6: rs11042966 2195538 1.9e-24 0.38 231420 9.170517e-06 10.204138
#> 7: rs10743152 2195981 1.9e-24 0.38 231420 9.170517e-06 10.204138
#> 8: rs10840495 2195837 2.3e-24 0.38 231420 9.170517e-06 10.185573
#> 9: rs7128097 2195045 2.3e-24 0.38 231420 9.170517e-06 10.185573
#> 10: rs7115640 2194914 2.6e-24 0.38 231420 9.170517e-06 10.173642
#> 11: rs7119275 2194810 3.1e-24 0.38 231420 9.170517e-06 10.156502
#> 12: rs10770141 2193840 3.1e-24 0.39 231420 9.081857e-06 10.156502
#> 13: rs10770143 2195267 7.1e-23 0.39 231420 9.081857e-06 9.846461
#> 14: rs10770140 2193597 2.9e-23 0.39 231420 9.081857e-06 9.936085
#> 15: rs72853903 2198665 1.6e-12 0.31 231420 1.010086e-05 7.065534
#> 16: rs4929966 2197436 1.2e-12 0.28 231420 1.071713e-05 7.105371
#> 17: rs11042976 2198259 6.9e-13 0.48 231420 8.656145e-06 7.181402
#> 18: rs11042978 2198418 8.6e-13 0.48 231420 8.656145e-06 7.151236
#> 19: rs11564711 2194062 6.9e-13 0.48 231420 8.656145e-06 7.181402
#> 20: rs4930046 2197148 1.3e-12 0.48 231420 8.656145e-06 7.094308
#> 21: rs10840500 2196910 1.5e-12 0.48 231420 8.656145e-06 7.074489
#> 22: rs11042965 2195288 2.1e-12 0.48 231420 8.656145e-06 7.027677
#> snp position pvalues.df1 MAF.df1 N.df1 V.df1 z.df1
#> r.df1 lABF.df1 pvalues.df2 MAF.df2 N.df2 V.df2 z.df2 r.df2
#> 1: 0.9995926 49.50075 9.255e-06 0.38 404 0.005253072 4.433885 0.8107211
#> 2: 0.9995926 49.03266 1.155e-05 0.38 404 0.005253072 4.385914 0.8107211
#> 3: 0.9995926 48.44058 1.041e-05 0.38 404 0.005253072 4.408477 0.8107211
#> 4: 0.9995926 48.24831 1.004e-05 0.38 404 0.005253072 4.416310 0.8107211
#> 5: 0.9995926 48.13817 1.112e-05 0.38 404 0.005253072 4.394164 0.8107211
#> 6: 0.9995926 48.13817 1.132e-05 0.38 404 0.005253072 4.390289 0.8107211
#> 7: 0.9995926 48.13817 1.179e-05 0.38 404 0.005253072 4.381435 0.8107211
#> 8: 0.9995926 47.94898 1.113e-05 0.38 404 0.005253072 4.393968 0.8107211
#> 9: 0.9995926 47.94898 1.172e-05 0.38 404 0.005253072 4.382732 0.8107211
#> 10: 0.9995926 47.82757 1.313e-05 0.38 404 0.005253072 4.357927 0.8107211
#> 11: 0.9995926 47.65341 1.326e-05 0.38 404 0.005253072 4.355770 0.8107211
#> 12: 0.9995965 47.64876 2.967e-05 0.39 404 0.005202286 4.175985 0.8122073
#> 13: 0.9995965 44.54914 2.046e-05 0.39 404 0.005202286 4.259811 0.8122073
#> 14: 0.9995965 45.43528 7.092e-05 0.39 404 0.005202286 3.973178 0.8122073
#> 15: 0.9995513 21.09513 2.730e-04 0.31 404 0.005785992 3.639658 0.7954467
#> 16: 0.9995239 21.40618 1.919e-03 0.28 404 0.006139007 3.102490 0.7856418
#> 17: 0.9996154 21.84466 4.742e-03 0.48 404 0.004958429 2.824057 0.8194205
#> 18: 0.9996154 21.62856 4.996e-03 0.48 404 0.004958429 2.807292 0.8194205
#> 19: 0.9996154 21.84466 7.957e-03 0.48 404 0.004958429 2.653889 0.8194205
#> 20: 0.9996154 21.22323 5.884e-03 0.48 404 0.004958429 2.754177 0.8194205
#> 21: 0.9996154 21.08288 5.926e-03 0.48 404 0.004958429 2.751848 0.8194205
#> 22: 0.9996154 20.75293 7.348e-03 0.48 404 0.004958429 2.680647 0.8194205
#> r.df1 lABF.df1 pvalues.df2 MAF.df2 N.df2 V.df2 z.df2 r.df2
#> lABF.df2 internal.sum.lABF SNP.PP.H4 SNP.PP.H4.cumsum
#> 1: 7.136852 56.63760 2.890225e-01 0.2890225
#> 2: 6.965345 55.99800 1.524606e-01 0.4414830
#> 3: 7.045782 55.48636 9.140217e-02 0.5328852
#> 4: 7.073802 55.32211 7.755740e-02 0.6104426
#> 5: 6.994708 55.13287 6.418574e-02 0.6746284
#> 6: 6.980911 55.11908 6.330628e-02 0.7379346
#> 7: 6.949430 55.08760 6.134434e-02 0.7992790
#> 8: 6.994012 54.94299 5.308502e-02 0.8523640
#> 9: 6.954038 54.90301 5.100483e-02 0.9033688
#> 10: 6.866150 54.69372 4.137307e-02 0.9447419
#> 11: 6.858530 54.51194 3.449607e-02 0.9792380
#> 12: 6.245771 53.89453 1.860508e-02 0.9978431
#> 13: 6.532944 51.08209 1.117385e-03 0.9989604
#> 14: 5.574603 51.00989 1.039554e-03 1.0000000
#> 15: 4.475223 25.57036 9.302493e-15 1.0000000
#> 16: 3.011022 24.41720 2.936230e-15 1.0000000
#> 17: 2.411768 24.25643 2.500162e-15 1.0000000
#> 18: 2.373088 24.00165 1.937845e-15 1.0000000
#> 19: 2.029849 23.87451 1.706486e-15 1.0000000
#> 20: 2.252061 23.47530 1.144792e-15 1.0000000
#> 21: 2.246808 23.32969 9.896725e-16 1.0000000
#> 22: 2.088332 22.84127 6.072547e-16 1.0000000
#> lABF.df2 internal.sum.lABF SNP.PP.H4 SNP.PP.H4.cumsum
Co-localization results can be visualized using the helper function ggRedRibbonColoc. This function helps generate plots that illustrate the colocalization analysis. By evaluating the p-values of the overlap map, users can easily interpret the relationships between SNPs associated with an increased risk for disease and those responsible for gene expression variation.
gg <- ggRedRibbonColoc(rrc.dec, shortid = "TH")
gg
For additional optional parameters to customize the visualization of
colocalization results please see the function documentation in R (with
?ggRedRibbonColoc).
The gg variable contains a standard ggplot2 object, which can be modified and customized using the ggplot2 parameters in R.
class(gg)
#> [1] "gg" "ggplot"
colocRedRibbon is a method designed to identify common causal candidates by examining the co-localization of GWAS and eQTL. This approach aims to pinpoint variants that are linked to both disease risk and variation in gene expression.
The method employs a two-step approach for shortlisting variants: - Risk allele effect step: In this step, variants are categorized into two distinct groups based on their direction of effect on gene expression, i.e., down- or upregulating. The upregulating variant set comprises the variants whose risk alleles increase gene expression, and the downregulating set risk alleles that decrease gene expression. Each of these variant sets is analyzed independently in subsequent steps. - RedRibbon Overlap Step: In this step, the RedRibbon rank-rank hypergeometric overlap method is applied on both GWAS and eQTL variants, which are ranked according to their P-values. This analysis examines the potential overlap between the two ranked lists. If a significant overlap is detected by RedRibbon, these shortlisted SNPs are further analyzed by the coloc package. If no significant overlap is found, the coloc method is still applied to the two effect sets without the preliminary overlap shortlisting.
The area between the 25th percentile (first quartile) and the 75th percentile (third quartile) of the chromosomal positions of the core set is referred to as the interquartile range (IQR). IQR method includes the overlapping variants from RedRibbon and the variants in the IQR. The method delimits the region scrutiny.
The IQR mode is activated by setting the region.mode
parameter,
## Run C. Wallace coloc()
rrcIQR.dec <- coloc(rrc.dec, region.mode = "IQR")
gg <- ggRedRibbonColoc(rrcIQR.dec, shortid = "TH")
gg
sessionInfo()
#> R version 4.2.2 Patched (2022-11-10 r83330)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] colocRedRibbon_0.2-1 RedRibbon_1.1-1 ggrepel_0.9.3
#> [4] scales_1.2.1 ggpubr_0.6.0 ggplot2_3.4.1
#> [7] data.table_1.14.8 coloc_5.2.3 kableExtra_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 svglite_2.1.1 lattice_0.20-45 tidyr_1.3.0
#> [5] assertthat_0.2.1 digest_0.6.31 utf8_1.2.3 R6_2.5.1
#> [9] plyr_1.8.8 backports_1.4.1 evaluate_0.20 highr_0.10
#> [13] pillar_1.8.1 rlang_1.0.6 rstudioapi_0.14 irlba_2.3.5.1
#> [17] car_3.1-1 jquerylib_0.1.4 Matrix_1.5-3 rmarkdown_2.20
#> [21] labeling_0.4.2 stringr_1.5.0 munsell_0.5.0 mixsqp_0.3-48
#> [25] broom_1.0.3 compiler_4.2.2 xfun_0.37 pkgconfig_2.0.3
#> [29] systemfonts_1.0.4 htmltools_0.5.4 tidyselect_1.2.0 tibble_3.1.8
#> [33] gridExtra_2.3 matrixStats_0.63.0 reshape_0.8.9 fansi_1.0.4
#> [37] viridisLite_0.4.1 crayon_1.5.2 dplyr_1.0.10 withr_2.5.0
#> [41] grid_4.2.2 jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3
#> [45] DBI_1.1.3 magrittr_2.0.3 cli_3.6.0 stringi_1.7.12
#> [49] cachem_1.0.7 carData_3.0-5 farver_2.1.1 ggsignif_0.6.4
#> [53] viridis_0.6.2 xml2_1.3.3 bslib_0.4.2 generics_0.1.3
#> [57] vctrs_0.5.2 cowplot_1.1.1 tools_4.2.2 glue_1.6.2
#> [61] susieR_0.12.35 purrr_1.0.1 abind_1.4-5 fastmap_1.1.1
#> [65] yaml_2.3.7 colorspace_2.1-0 rstatix_0.7.2 knitr_1.42
#> [69] sass_0.4.5