Large-scale plasma proteomics comparisons through genetics and disease associations

Study populations

Plasma samples collected from 54,265 UKB participants at their baseline visit were measured using Olink Explore 3072 as a part of UKB-PPP (UK Biobank application number 65851). All participants provided informed consent. A large majority of the samples were randomly selected across the UK Biobank, and only those were used for the analysis presented here. Many GWASs using the UKB data⁵⁶ have been based on a prescribed European ancestry subset of 409,559 participants who self-identified as ‘white British’⁵⁷. To better leverage the value of a wider range of UKB participants, we defined three cohorts encompassing 450,690 individuals, based on genetic clustering of microarray genotypes informed by self-described ethnicity and supervised ancestry inference¹²: 431,805 individuals with British or Irish ancestry (UKB-BI, 46,218 with Olink data), 9,633 individuals with African ancestries (UKB-AF, 1,513 with Olink data) and 9,252 individuals with South Asian ancestries (UKB-SA, 953 with Olink data).

We measured the plasma protein levels of 35,892 Icelanders using SomaScan v4 (ref. ²). All participants who donated samples gave informed consent, and the National Bioethics Committee of Iceland approved the study, which was conducted in agreement with conditions issued by the Data Protection Authority of Iceland (VSN_14-015). Personal identities for the participants’ data and biological samples were encrypted by a third-party system (Identity Protection System), approved and monitored by the Data Protection Authority. In addition, we measured 1,514 of these Icelanders with the Olink Explore 3072 platform using the same plasma sample.

We used 1,474 and 227 additional duplicate measurements of samples to evaluate assay precision for the Olink Explore (UKB sets) and SomaScan (Iceland 36K) platforms, respectively. For samples that were measured more than twice, two of the measurements were chosen at random.

External data sources

URLs for external data used are as follows: the GWAS catalogue (https://www.ebi.ac.uk/gwas/), the GTEx project (https://gtexportal.org/home/), the Human Protein Atlas (https://www.proteinatlas.org/), STRING database (https://string-db.org/; file name: 9606.protein.actions.v11.txt.gz) and UniProt (https://www.uniprot.org/).

Software and data processing pipelines

We used the following publicly available software in conjunction with the algorithms described above. BamQC (v1.0.0, https://github.com/DecodeGenetics/BamQC), GraphTyper (v2.7.1, v1.4, v2.7.2, https://github.com/DecodeGenetics/graphtyper), GATK resource bundle (v4.0.12, gs://genomics-public-data/resources/broad/hg38/v0), Svimmer (v0.1, https://github.com/DecodeGenetics/svimmer), popSTR (v2.0, https://github.com/DecodeGenetics/popSTR), Admixture (v1.3.0, https://dalexander.github.io/admixture), Dipcall (v0.1, https://github.com/lh3/dipcall), RTG Tools (v3.8.4, https://github.com/RealTimeGenomics/rtg-tools), bcl2fastq (v2.20.0.422, https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html), Samtools (v1.9, v1.3, https://github.com/samtools/samtools), samblaster (v0.1.24, https://github.com/GregoryFaust/samblaster), BWA (v0.7.10 mem, https://github.com/lh3/bwa), GenomeAnalysisTKLite (v2.3.9, https://github.com/broadgsa/gatk), Picard tools (v1.117, https://broadinstitute.github.io/picard), Bedtools (v2.25.0-76-g5e7c696z, https://github.com/arq5x/bedtools2), Variant Effect Predictor (release 100, https://github.com/Ensembl/ensembl-vep), BOLT-LMM (v2.1, https://data.broadinstitute.org/alkesgroup/BOLT-LMM/downloads), IMPUTE2 (v2.3.1, https://mathgen.stats.ox.ac.uk/impute/impute_v2.html), dbSNP (v140, https://www.ncbi.nlm.nih.gov/SNP), BiNGO (v3.0.3, https://www.psb.ugent.be/cbd/papers/BiNGO/Download.html), Cytoscape (v3.7.1, https://cytoscape.org/download.html), COLOC (v5.1.0.1, https://github.com/chr1swallace/coloc). The genomics and pQTL processing pipelines have been extensively described previously^2,12. To process data generated on the Olink platform, we used Olink Explore (v1.9.0, https://www.olink.com/products-services/data-analysis-products/npx-explore/). Data were analysed and figures generated using Python (version 3.9.1), along with packages numpy (version 1.20.3), scipy (version 1.7.1), matplotlib (version 3.4.3), and pandas (version 1.3.0), and R (version 3.6.0).

Proteomic platforms

The Olink Explore 3072 proximity extension assay (PEA) platform is based upon an in-solution binding of two polyclonal antibody pools to a target protein and subsequent hybridization and enrichment of two unique single-stranded DNA probes to create a double stranded barcode unique for the antigen⁵⁸. The platform consists of 2,941 immunoassays targeting 2,925 proteins. Each assay is based on a pair of polyclonal antibodies. The antibodies bind to different sites on the target protein and are labelled with single-stranded complementary oligonucleotides. If matching pairs of antibodies bind to the protein, the attached oligonucleotides hybridize, and are then measured using next-generation sequencing^59,60. Olink Explore 3072 consists of 8 panels of 384 assays analysed by next-generation sequencing. Four of those panels make up a previous iteration of the platform, Olink Explore 1536, which can be considered a subset of the Olink Explore 3072, along with the Expansion set. The Olink measurements were based on the NPX values recommended by the manufacturer, which include normalization⁵⁸.

The UKB plasma samples were measured at Olink’s facilities in Uppsala Sweden. All samples were randomized and plated by the UK Biobank laboratory team prior to delivery. Samples were processed across three NovaSeq 6000 Sequencing Systems. Extensive quality control measures and normalization of protein concentration was performed at Olink’s facilities, producing NPX values for each protein per participant. NPX is Olink’s relative protein quantification unit on a log₂ scale.

The Olink measurements of the Icelandic plasma samples were performed at deCODE’s facility in accordance with the Olink Explore manual⁶¹. Quality control measures were the same as used by Olink for the UK Biobank samples.

The SomaScan platform utilizes a surface bound enrichment of proteins alongside a universal polyanionic competitor to prevent transient non-specific interactions⁶². SomaScan v4 consists of 4,907 aptamer-based assays targeting 4,719 proteins. Aptamers are short, single-stranded oligonucleotides that bind to protein targets. The bound aptamers are then quantified using DNA microarray technology^62,63. In most studies performed using SomaScan, the last step in the normalization process adjusts the median protein levels for each individual to a reference^5,64. As this can affect the correlation of protein levels to other factors, some studies omit this step². We refer to the former data as normalized and the latter as non-normalized. In addition to using non-normalized SomaScan protein measurements as we had done before², we also applied SomaLogic’s SMP normalization⁶⁴ and performed all analyses using both non-normalized and normalized data. Comparison of the two normalization methods can be found in Supplementary Note 4.

We refer to the outcome of a particular assay as the level of the protein, noting that the assay may not in fact measure the targeted protein.

Both Olink and SomaScan use dilutions of plasma samples to compensate for different concentrations of proteins in plasma^13,59,62. For the set of proteins targeted by both platforms, the two platforms are generally in agreement on the placement of proteins into low, intermediate or high dilution groups (Supplementary Tables 1, 2 and 7).

Genotyping and imputation

The whole genomes of 150,119 UKB participants were sequenced to a median of 32.5× using Illumina technology¹². Sequence variant calling was performed using GraphTyper⁶⁵. In addition, all UKB participants were single-nucleotide polymorphism (SNP) genotyped with Affymetrix SNP chips^66,67. After filtering, the sequence variants along with the phased SNP chip data by Bycroft et al.⁵⁷ were used to create a haplotype reference panel. Sequence variants were then imputed into the chip-genotyped samples using tools and methods described previously^68,69. The genotyping and imputation of the UKB dataset have previously been described in greater detail¹². We restricted our analysis to variants with MAF >0.01% and imputation information >0.9, resulting in 57.7 million variants in the UKB-BI, 36.5 million variants in the UKB-SA and 68.6 million variants in the UKB-AF datasets.

The whole genomes of 63,118 Icelanders were sequenced to a median of 32× using Illumina technology⁶⁸. Sequence variants were called using GraphTyper⁶⁵. In addition, the samples were SNP genotyped with Illumina SNP chips and long-range phased, and the data was used to impute genotypes. In total, 173,025 Icelanders were SNP genotyped, long-range phased and imputed based on the sequenced datasets. Where genotypes for an individual were missing for association studies, they were inferred using genealogic information if possible. The imputation learning set was based on whole-genome sequencing of 15% of Icelanders, which allowed rare variant imputation. The genotyping and imputation on the Icelandic dataset have been previously described in greater detail⁵⁰. We restricted our analysis to variants with MAF >0.01% and imputation information >0.9, resulting in 33.5 million variants. Other software tools used for various tasks in the genotyping pipeline were BamQC, GATK resource bundle, Svimmer, popSTR, Admixture, Dipcall, RTG Tools, bcl2fastq, Samtools, samblaster, BWA, GenomeAnalysisTKLite, Picard tools, Bedtools, Variant Effect Predictor, IMPUTE2, dbSNP, BiNGO and Cytoscape.

Phenotypes

In UKB we used health care records to identify the diagnosis of a disease or disease category, both prior and post plasma collection, based on the first three letters of the corresponding ICD10 code. When the number of individuals diagnosed exceeded 50, we estimated the association of protein levels with disease diagnosis. This resulted in 324, 29 and 20 case–control phenotypes for UKB-BI, UKB-AF and UKB-SA, respectively. In addition, we had measurements of 208, 56, and 60 quantitative traits in UKB-BI, UKB-AF and UKB-SA respectively with at least 50 individuals measured for each trait. The quantitative traits were measured at the same time as the plasma was collected, when available.

In Iceland we used health care records to construct lists of disease diagnoses, both prior and post plasma collection. This resulted in 275 case–control phenotypes. We furthermore had measurements of 110 quantitative traits from various sources, in general not measured at the same time as the plasma was collected.

Protein–phenotype associations

Annotation of assay targets

We assigned genomic coordinates to assay targets using UniProt IDs⁷⁰ for each assay provided by the manufacturer. Out of 4,963 valid assays on the SomaScan platform (excluding non-human proteins and assays marked as defective by the manufacturer), this resulted in 4,961 assays getting assigned the genomic coordinates of their intended targets. Out of 2,941 valid assays on the Olink Explore platform, this resulted in 2,923 assays getting assigned the genomic coordinates of their intended targets.

We identified assays targeting the same protein using their UniProt IDs. This resulted in 2,023 pairs of assays targeting 1,848 UniProt IDs; 1,864 Olink assays and 1,994 SomaScan assays (Supplementary Table 4).

Assay precision

Following Olink⁵⁸, we assumed a log-normal distribution of protein levels. On the logarithm scale, denoting the mean protein level with \(\mu \) and variance with \({\sigma }^{2}\), the mean and variance of protein levels will be \({e}^{\mu +{\sigma }^{2}/2}\) and \(\left({e}^{{\sigma }^{2}}-1\right){e}^{2\mu +{\sigma }^{2}}\). The CV is defined as the s.d. divided by the mean and therefore equals \(\sqrt{{e}^{{\sigma }^{2}}-1}\) assuming a log-normal distribution.

To evaluate the precision of the assays, we estimated the CV for the available duplicate measurements and the expectation of the CV under the assumption that the two duplicates were independent of each other, that is, if the repeated measurements were not of the same sample, but of samples selected at random from the population (Supplementary Fig. 1). We used the robust median absolute deviation estimator to estimate the s.d. of the repeated measurements on the log-scale and inserted this estimate into the formula for the CV above (Fig. 1, Supplementary Fig. 1 and Supplementary Tables 1 and 2).

Relative evaluation of batch effects between platforms

Both Olink and SomaScan use repeated measurements of control samples, specific to the platform, for quality control. When using two measurements of the same control sample on the same plate to evaluate the CV, the evaluation does not include the inter-plate variation, while the CV estimated assuming that the samples are not measured on the same plate but chosen at random from the set of all samples does include inter-plate variation. Comparing the CV computed from the repeated control samples between the two platforms can therefore help comparison in terms of batch effects, with values closer to one suggesting that the platform is less susceptible to batch effects and closer to zero that the platform is more so (Supplementary Note 2, Supplementary Fig. 9).

Correlation of assays across platforms

We calculated correlations between protein levels measured in the same samples using Spearman correlation.

pQTL analysis

We carried out pQTL analysis in the same way as we have previously described². The following three sections briefly describe this process.

Genome-wide association study

We computed P values using a likelihood ratio test and adjusted for multiple testing by using the same significance threshold (1.8 × 10⁻⁹) as in our previous study on the Icelandic dataset².

We defined a pQTL association to be cis if the pQTL was located within 1 Mb of the transcription start site for the gene that encodes the target protein, as reported by UniProt, and trans otherwise.

Of the 2,941 assays on the Olink Explore 3072 platform, data from UKB for 2,931 assays were used for GWAS analysis.

The number of variants we test in Iceland (33.5 million) is about 40% lower than in UKB (57.7 million). The difference is largely due to very rare variants. However, the difference between them would result in a multiple testing correction threshold in UKB of 8.7 × 10⁻¹⁰ instead of 1.8 × 10⁻⁹. A total of 153 (1%) of the cis pQTLs are between those two thresholds and 1,608 (5%) of the trans pQTLs.

For replication between platforms, the P value threshold is 0.05, with the requirement that initial and replication associations are in the same direction.

Conditional analysis

We performed recursive conditional analysis separately for each assay and each chromosome based on individual-level genotypes. For computational efficiency, we restricted this analysis to the candidate set of sequence variants associating with the assay with a P <5 × 10⁻⁶. If the variant, v1, with the lowest P value had P <1.8 × 10⁻⁹, we removed v1 from the candidate set and the association of all other variants in the candidate set was recomputed, conditional on v1. If any variant in the candidate set had P < 1.8 × 10⁻⁹, we assigned the label v2 to the variant with the lowest P value, removed v2 from the candidate set, and calculated the conditional association of the variants remaining in the candidate set given v1 and v2. We repeated this process until no variant in the candidate set had P < 1.8 × 10⁻⁹. Conditional analysis for two assays did not finish for all secondary signals but did return values for sentinel pQTLs.

We observe that 92% and 97% of secondary variants have an r² below 0.2 and 0.5, respectively, to the primary variant on Olink (based on r² calculated in the UK Biobank British and Irish set).

In addition, we estimated significance and effect based on a joint model of all variants at the locus to the phenotype for the variants selected in the stepwise model. When jointly estimating the effect on a protein at a locus, and examining pQTL associations at loci that contain more than 1 variant associated to a protein, 96% and 92% of the associations detected using SomaScan and Olink, respectively, remained significant when using the same genome-wide significance threshold as in the stepwise model (that is, 1.8 × 10⁻⁹).

Merging pQTLs

We considered sequence variants from the conditional analysis to belong to the same region if they were within 2 Mb of each other. Furthermore, we considered the major histocompatibility complex (MHC) region (build 38 chr. 6:25.5-34.0MB) as a single region. We refer to the most significant variant in each region as the sentinel variant for the assay in the region, and other variants as secondary variants.

We used the ‘LD-based clumping approach’ proposed by Sun et al.⁶ to identify pQTLs associating with multiple assays: we considered variants associating with a different assay to belong to the same pQTL if they are in high LD (r² > 0.8).

pQTL replication

For replication between platforms, the P value threshold was 0.05, with the requirement that initial and replication associations were in the same direction.

Power calculation

For a given P value threshold P, sample size N, effect size β, and MAF f, the probability of rejecting the null hypothesis of no association is given by 1 − F(X − 1(1 − P), 2Nβ²f(1 − f)), where X^–1(·) denotes the inverse cumulative distribution function (inverse CDF) of the chi-squared distribution with one degree of freedom, while F(a, b) denotes the CDF of the non-central chi-squared distribution with one degree of freedom for quantile a and non-centrality parameter⁷³ b.

PAV annotation of pQTL variants

For each pQTL, we tested whether the variant itself and variants in high LD (r² > 0.8) could affect the coding sequence of genes or their splicing, as described previously².

Based on SomaScan, 40% of variants with cis pQTL and 28% of variants with trans pQTL are in high LD with a PAV (r² > 0.80), and 44% of variants with cis pQTL and 38% of variants with trans pQTL are in high LD with cis eQTL (r² > 0.8).

Based on Olink, 39% of variants with cis pQTL and 23% of variants with trans pQTL are in high LD with a PAV (r² > 0.80), and 47% of variants with cis pQTL and 41% of variants with trans pQTL are in high LD with cis eQTL (r² > 0.8).

Thus, when considering the neighbouring genes within ±1 Mb, we note that cis pQTLs are more likely to be in high LD with a PAV or cis eQTL on both platforms compared to trans. Similar results were observed on both platforms and when restricting to assays measuring proteins targeted by both platforms.

In addition, for cis pQTLs we also report if the PAV or cis eQTL is for the gene encoding the targeted protein (Supplementary Tables 21 and 19).

Cis eQTL and cis pQTL

For each cis pQTL, we tested whether the variant itself and variants in high LD (r² > 0.8) corresponded to one or more top cis eQTLs based on 73 tissues and 17 sources including the GTEx project, using the same methods and data as described previously².

Integration of pQTL and disease associations

We calculated r² values (based on the Icelandic population for SomaScan-Iceland and the UKB-BI population for Olink-UKB) between all sentinel pQTL variants and top (most significantly associated) variants per Mb bin and per experimental factor ontology (EFO) term reported in the NHGRI-EBI GWAS catalogue⁷⁴ (downloaded 7 April 2022), using the same methods as described previously².

Relationship between pQTLs and disease-associated variants

We identified all variants reported in the NHGRI-EBI catalogue of human GWAS⁷⁴ (excluding proteomics studies) in high LD (r² > 0.8) with sentinel pQTLs based on Olink-UKB-BI data and Icelandic SomaScan data (Supplementary Tables 38 and 39). For each sentinel pQTL association, we also identified a 95% credible set of variants (variants that most parsimoniously explain regional association⁷⁵) likely to include the causal variant⁷⁶. We then checked whether GWAS catalogue variants in high LD with the pQTL variant (with r² > 0.8 with the pQTL) were included in the credible set. In addition to high LD between the disease-associated variant and both the pQTL and a variant in the credible set, for the highlighted examples, we estimated the posterior probability of statistical colocalization for the variants associating with disease and protein levels when they were not identical and when we had access to the necessary statistics⁷⁷.

Disease and pQTL colocalization

To test for colocalization of the pQTL signals with signals in other traits we used the COLOC software package implemented in R⁷⁷. Using summary statistics for the pQTL A and the trait B—that is, effects and P values—we calculated Bayes factors for each of the variants in the associated region for the two traits and used COLOC to calculate the posterior probability for two hypotheses: (1) that the association with the pQTL A and the trait B are independent signals (PP3) and (2) that the association with the pQTL A and the trait B are due to a shared signal (PP4). Prior probabilities for COLOC were left at default.

Protein subcellular locations

Protein subcellular locations were determined using annotations from the Human Protein Atlas¹⁴, using the same approach as in Sun et al.⁶ where proteins annotated as ‘membrane’ by the Human Protein Atlas were considered to be membrane proteins, proteins annotated in the Human Protein Atlas as ‘secreted’ (but not ‘membrane’) considered to be secreted proteins, while other proteins were considered to be intracellular.

IL-10 ELISA

Blood was collected in EDTA tubes that were inverted 4–5 times and then centrifuged for 10 min at 3,000g at 4 °C. Plasma samples were frozen in aliquots at −80 °C. Plasma aliquots were allowed to thaw on ice and kept away from light during defrosting. Before measurement, the aliquots were mixed by inverting the tubes a couple of times and then centrifuged for 10 min at 3,220g at 4 °C.

IL-10 in plasma was measured by using MSD V-PLEX Human IL-10 (cat: K151QUD) according to the manufacturer’s protocol (Meso Scale Diagnostics).

NFL ELISA

Plasma samples were measured in duplicates with commercially available Simoa NF-light Advantage (SR-X) kit (Quanterix, cat. 103400). Samples were diluted 4:1 and incubated with 25 µl anti-NF-light immunocapture beads and 20 µl biotinylated detector antibody at 30 °C and 800 rpm for 30 min. Following the incubation, the bead-immunocomplexes were washed and resuspended before being incubated with 100 µl streptavidin-labelled β-galactosidase at 30 °C and 800 rpm for 10 min. After a second washing step, the bead-immunocomplexes and resorufin β-d-galactopyranoside were loaded onto an SR-X instrument (Quanterix) for processing and analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Subscribe to Updates

What's Hot