I was expecting these two different search strategies to identify the same variant IDs

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

discrepancies between get_variant and get_association about gwasrapidd HOT 10 CLOSED

ramiromagno commented on May 24, 2024

discrepancies between get_variant and get_association

from gwasrapidd.

Comments (10)

mightyphil2000 commented on May 24, 2024 1

Sorry I confused the get_variant with get_association function. The get_association function does not allow one to search on reported trait. I should have said "and" instead of "but" in point 2. I'll try to put some code together to illustrate what I mean in point 3. Thanks!

from gwasrapidd.

ramiromagno commented on May 24, 2024

Hi @mightyphil2000:

The example is not complete because the definition of the variable efo_id is missing. Could you provide that? May I suggest you use the reprex function to generate the minimal working example?

Furthermore, by searching by all those criteria simultaneously, do you intend to retrieve matches that fulfill all criteria, or that fulfill (at least) one of the criteria?

from gwasrapidd.

mightyphil2000 commented on May 24, 2024

Sorry about. I've updated the original post with efo_id set to NULL (which is how I've been setting it). I'm looking for matches that fulfill at least one of the criteria.

from gwasrapidd.

ramiromagno commented on May 24, 2024

Okay, thank you. I will take a look at it tomorrow.

from gwasrapidd.

ramiromagno commented on May 24, 2024

Hi @mightyphil2000:

I just looked into your example, and I do not have a good explanation for why we are seeing these differences; we should be getting the same variants either way. As far as I could inspect into gwasrapidd code, there seems to be no problem with gwasrapidd itself. So I forwarded your example to the GWAS Catalog team hoping to get some clarification on this issue. I will let you know as soon as I have a reply (likely this Monday).

from gwasrapidd.

ramiromagno commented on May 24, 2024

Hi @mightyphil2000,

I got a reply from Elliot Sollis (with the GWAS Catalog team).

Short answer

Use gwasrapidd::get_variants() function.

Explanation

Like you, I too had the expectation that both approaches would retrieve the same results. But apparently there are historical reasons that explain these differences for a subset of the older data in the Catalog.

Here is a post of the emails we exchanged (with permission from Elliot):

My question

Hi GWAS Catalog team,

I am getting a different number of variants depending on the query route I make using the REST API.

My query is based on the reported trait, i.e., "arachidonic acid measurement". I tried two approaches, here dubbed approach 1 and 2.

Approach 1: Search for studies by "arachidonic acid measurement" (which results in GCST002712 and GCST003236), then search for the respective associations, and from those associations retrieve the variants.

Approach 2: Search for variants directly by "arachidonic acid measurement".

I find that all variants obtained with approach 2 are included in those obtained with approach 1. But there is a set of variants that are exclusive to approach 1. Why? Is this a bug?

Here are the variants that are exclusive to approach 1:

# A tibble: 11 x 7
   association_id locus_id variant_id risk_allele risk_frequency genome_wide limited_list
   <chr>             <int> <chr>      <chr>                <dbl> <lgl>       <lgl>      
 1 10079997              1 rs4246215  NA                    0.36 FALSE       FALSE      
 2 10079998              1 rs174601   NA                    0.35 FALSE       FALSE      
 3 10079999              1 rs8523     A                     0.44 FALSE       FALSE      
 4 10080000              1 rs174549   NA                    0.29 FALSE       FALSE      
 5 10080001              1 rs174541   NA                    0.36 FALSE       FALSE      
 6 10080002              1 rs174549   NA                    0.29 FALSE       FALSE      
 7 10080003              1 rs174550   NA                    0.33 FALSE       FALSE      
 8 10080004              1 rs12580543 C                     0.06 FALSE       FALSE      
 9 10080005              1 rs3811444  T                     0.32 FALSE       FALSE      
10 10080006              1 rs174535   NA                    0.33 FALSE       FALSE      
11 10080007              1 rs2110073  T                     0.09 FALSE       FALSE

Elliot's reply

Hi Ramiro,

I think I’ve figured this one out.

To explain it, I think I just need to confirm some of the terminology to make sure we’re on the same page: sorry if it seems pedantic! The reported trait is a free-text trait description that we annotate independently for each study. The term you used in your query, “arachidonic acid measurement” is what we just call the trait, which is the more standardised ontology term that is more comparable across different studies.

What’s relevant here is that for some older publications that included multiple GWAS analyses (e.g. 25500335 from 2014), technical limitations meant that we used to keep all of the analyses together in a single study entry (GCST002712) with a broad reported trait (in this case “Red blood cell fatty acid levels”), and put multiple ontology terms for each of the traits analysed in the trait field (in this case arachidonic acid measurement, fatty acid measurement, linolenic acid measurement, oleic acid measurement, docosapentaenoic acid measurement, linoleic acid measurement). So if you search for any of those trait terms, this study will come up. But be aware that not all variants in the study are associated with every one of those terms.

That’s at the level of the study, but at the level of the association we were able to be more specific, so each association is only annotated with ontology terms for the specific analysis that association came from, e.g. in that same study rs2581624-C has [arachidonic acid measurement, fatty acid measurement] but rs4246215-? has [fatty acid measurement, linoleic acid measurement]. That means that if you search by “arachidonic acid measurement” you will get rs2581624 but not rs4246215.

We now try to split every publication into separate studies for every GWAS analysis, but I’m not sure if we’ll necessarily be able to adjust all of the older studies. So I would suggest that Approach 2 is a better option if you want to make sure you only capture direct associations with a particular trait.

Hope that helps!

Best wishes,

Elliot Sollis, PhD
Scientific Curator, GWAS Catalog
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge, CB10 1SD

from gwasrapidd.

mightyphil2000 commented on May 24, 2024

Thanks @ramiromagno. I see now. A problem with the get_variants function is that you can't search on reported trait (I guess because the older studies are not sufficiently annotated at the association level?). I am concerned that by only searching on EFO I might miss some associations. This could happen when the EFO is less specific than the reported trait. For example, rs4770891-? is associated with the reported trait "Circulating odd-numbered chain saturated fatty acid levels (C19:0)" but with EFO "fatty acid measurement" in study GCST005862. If I was specifically interested in the saturated fatty acid C19:0, then searching on EFO would not identify this association, whereas searching for studies on reported trait would. In other words, you could argue that search strategy 1 (search for studies by reported trait then retrieve associations) is more sensitive (but suffers more false positives), while search strategy 2 (search for variant-EFO associations) is more specific (but suffers more false negatives). What do you think?

from gwasrapidd.

ramiromagno commented on May 24, 2024

Hi @mightyphil2000:

I have divided this answer into three points.

Point 1

A problem with the get_variants function is that you can’t search on
reported trait (I guess because the older studies are not sufficiently
annotated at the association level?). I am concerned that by only
searching on EFO I might miss some associations.

I am a bit confused now. You say you cannot search on report_trait
with the get_variants() function, but you can, right?

my_variants_by_rep_trait <- get_variants(reported_trait = "Circulating odd-numbered chain saturated fatty acid levels (C19:0)")
my_variants_by_efo_trait <- get_variants(efo_trait = "fatty acid measurement")

my_variants_by_rep_trait@variants
## # A tibble: 15 x 7
##    variant_id merged functional_class    chromosome_name chromosome_position
##    <chr>       <int> <chr>               <chr>                         <int>
##  1 rs1468510       0 3_prime_UTR_variant 7                         151360599
##  2 rs11842790      0 intron_variant      13                         50350312
##  3 rs13226693      0 intron_variant      7                         151353528
##  4 rs17074145      0 intron_variant      13                         50355253
##  5 rs17363566      0 intron_variant      13                         50360756
##  6 rs706603        0 intron_variant      13                         50280887
##  7 rs12871645      0 intron_variant      13                         50357429
##  8 rs12703118      0 intron_variant      7                         151345197
##  9 rs12853498      0 intron_variant      13                         50339738
## 10 rs17074143      0 intron_variant      13                         50348637
## 11 rs13221923      0 intron_variant      7                         151357494
## 12 rs17074093      0 intron_variant      13                         50314567
## 13 rs12874827      0 intron_variant      13                         50326600
## 14 rs12874278      0 intron_variant      13                         50361786
## 15 rs4770891       0 intron_variant      13                         25983588
## # … with 2 more variables: chromosome_region <chr>, last_update_date <dttm>

my_variants_by_efo_trait@variants
## # A tibble: 156 x 7
##    variant_id merged functional_class          chromosome_name chromosome_posit…
##    <chr>       <int> <chr>                     <chr>                       <int>
##  1 rs1604805       0 intron_variant            4                        21569464
##  2 rs7198595       0 intron_variant            16                       12471050
##  3 rs12098564      0 non_coding_transcript_ex… 10                       85193571
##  4 rs11842790      0 intron_variant            13                       50350312
##  5 rs788076        0 intergenic_variant        10                       29047920
##  6 rs742614        0 intergenic_variant        20                       33894826
##  7 rs10518201      0 regulatory_region_variant 4                        78730706
##  8 rs12871645      0 intron_variant            13                       50357429
##  9 rs7534537       0 intron_variant            1                       204305391
## 10 rs2118674       0 intron_variant            2                       170462384
## # … with 146 more rows, and 2 more variables: chromosome_region <chr>,
## #   last_update_date <dttm>

Point 2

This could happen when the EFO is less specific than the reported
trait. For example, rs4770891-? is associated with the reported trait
“Circulating odd-numbered chain saturated fatty acid levels (C19:0)”
but with EFO “fatty acid measurement” in study GCST005862.

I did not quite understand your “but” clause. Regardless, note that the
variant you mentioned, i.e., rs4770891 is present in both sets of
variants above:

'rs4770891' %in% my_variants_by_rep_trait@variants$variant_id
## [1] TRUE

'rs4770891' %in% my_variants_by_efo_trait@variants$variant_id
## [1] TRUE

Point 3

If I was specifically interested in the saturated fatty acid C19:0,
then searching on EFO would not identify this association, whereas
searching for studies on reported trait would. In other words, you
could argue that search strategy 1 (search for studies by reported
trait then retrieve associations) is more sensitive (but suffers more
false positives), while search strategy 2 (search for variant-EFO
associations) is more specific (but suffers more false negatives).
What do you think?

I could not follow your reasoning because I did not observe the
differences (as shown above) you are referring to for this specific
example… So perhaps you could illustrate your concerns with some code?

from gwasrapidd.

mightyphil2000 commented on May 24, 2024

Actually I think this specific issue is now resolved. I can see that in this particular example get_variants is the best option. The new problem I was trying to explain is actually to do with the get_associations function. I'm therefore going to post this issue on the get associations issue page I opened the other day.

from gwasrapidd.

ramiromagno commented on May 24, 2024

Okay, I will close this issue then.

from gwasrapidd.

discrepancies between get_variant and get_association about gwasrapidd HOT 10 CLOSED

Comments (10)

Short answer

Explanation

My question

Elliot's reply

Point 1

Point 2

Point 3

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent