Comments (10)
Sorry I confused the get_variant with get_association function. The get_association function does not allow one to search on reported trait. I should have said "and" instead of "but" in point 2. I'll try to put some code together to illustrate what I mean in point 3. Thanks!
from gwasrapidd.
Hi @mightyphil2000:
The example is not complete because the definition of the variable efo_id
is missing. Could you provide that? May I suggest you use the reprex function to generate the minimal working example?
Furthermore, by searching by all those criteria simultaneously, do you intend to retrieve matches that fulfill all criteria, or that fulfill (at least) one of the criteria?
from gwasrapidd.
Sorry about. I've updated the original post with efo_id set to NULL (which is how I've been setting it). I'm looking for matches that fulfill at least one of the criteria.
from gwasrapidd.
Okay, thank you. I will take a look at it tomorrow.
from gwasrapidd.
Hi @mightyphil2000:
I just looked into your example, and I do not have a good explanation for why we are seeing these differences; we should be getting the same variants either way. As far as I could inspect into gwasrapidd code, there seems to be no problem with gwasrapidd itself. So I forwarded your example to the GWAS Catalog team hoping to get some clarification on this issue. I will let you know as soon as I have a reply (likely this Monday).
from gwasrapidd.
Hi @mightyphil2000,
I got a reply from Elliot Sollis (with the GWAS Catalog team).
Short answer
Use gwasrapidd::get_variants()
function.
Explanation
Like you, I too had the expectation that both approaches would retrieve the same results. But apparently there are historical reasons that explain these differences for a subset of the older data in the Catalog.
Here is a post of the emails we exchanged (with permission from Elliot):
My question
Hi GWAS Catalog team,
I am getting a different number of variants depending on the query route I make using the REST API.
My query is based on the reported trait, i.e., "arachidonic acid measurement". I tried two approaches, here dubbed approach 1 and 2.
Approach 1: Search for studies by "arachidonic acid measurement" (which results in GCST002712 and GCST003236), then search for the respective associations, and from those associations retrieve the variants.
Approach 2: Search for variants directly by "arachidonic acid measurement".
I find that all variants obtained with approach 2 are included in those obtained with approach 1. But there is a set of variants that are exclusive to approach 1. Why? Is this a bug?
Here are the variants that are exclusive to approach 1:
# A tibble: 11 x 7
association_id locus_id variant_id risk_allele risk_frequency genome_wide limited_list
<chr> <int> <chr> <chr> <dbl> <lgl> <lgl>
1 10079997 1 rs4246215 NA 0.36 FALSE FALSE
2 10079998 1 rs174601 NA 0.35 FALSE FALSE
3 10079999 1 rs8523 A 0.44 FALSE FALSE
4 10080000 1 rs174549 NA 0.29 FALSE FALSE
5 10080001 1 rs174541 NA 0.36 FALSE FALSE
6 10080002 1 rs174549 NA 0.29 FALSE FALSE
7 10080003 1 rs174550 NA 0.33 FALSE FALSE
8 10080004 1 rs12580543 C 0.06 FALSE FALSE
9 10080005 1 rs3811444 T 0.32 FALSE FALSE
10 10080006 1 rs174535 NA 0.33 FALSE FALSE
11 10080007 1 rs2110073 T 0.09 FALSE FALSE
Elliot's reply
Hi Ramiro,
I think I’ve figured this one out.
To explain it, I think I just need to confirm some of the terminology to make sure we’re on the same page: sorry if it seems pedantic! The reported trait is a free-text trait description that we annotate independently for each study. The term you used in your query, “arachidonic acid measurement” is what we just call the trait, which is the more standardised ontology term that is more comparable across different studies.
What’s relevant here is that for some older publications that included multiple GWAS analyses (e.g. 25500335 from 2014), technical limitations meant that we used to keep all of the analyses together in a single study entry (GCST002712) with a broad reported trait (in this case “Red blood cell fatty acid levels”), and put multiple ontology terms for each of the traits analysed in the trait field (in this case arachidonic acid measurement, fatty acid measurement, linolenic acid measurement, oleic acid measurement, docosapentaenoic acid measurement, linoleic acid measurement). So if you search for any of those trait terms, this study will come up. But be aware that not all variants in the study are associated with every one of those terms.
That’s at the level of the study, but at the level of the association we were able to be more specific, so each association is only annotated with ontology terms for the specific analysis that association came from, e.g. in that same study rs2581624-C has [arachidonic acid measurement, fatty acid measurement] but rs4246215-? has [fatty acid measurement, linoleic acid measurement]. That means that if you search by “arachidonic acid measurement” you will get rs2581624 but not rs4246215.
We now try to split every publication into separate studies for every GWAS analysis, but I’m not sure if we’ll necessarily be able to adjust all of the older studies. So I would suggest that Approach 2 is a better option if you want to make sure you only capture direct associations with a particular trait.
Hope that helps!
Best wishes,
Elliot Sollis, PhD
Scientific Curator, GWAS Catalog
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge, CB10 1SD
from gwasrapidd.
Thanks @ramiromagno. I see now. A problem with the get_variants function is that you can't search on reported trait (I guess because the older studies are not sufficiently annotated at the association level?). I am concerned that by only searching on EFO I might miss some associations. This could happen when the EFO is less specific than the reported trait. For example, rs4770891-? is associated with the reported trait "Circulating odd-numbered chain saturated fatty acid levels (C19:0)" but with EFO "fatty acid measurement" in study GCST005862. If I was specifically interested in the saturated fatty acid C19:0, then searching on EFO would not identify this association, whereas searching for studies on reported trait would. In other words, you could argue that search strategy 1 (search for studies by reported trait then retrieve associations) is more sensitive (but suffers more false positives), while search strategy 2 (search for variant-EFO associations) is more specific (but suffers more false negatives). What do you think?
from gwasrapidd.
Hi @mightyphil2000:
I have divided this answer into three points.
Point 1
A problem with the get_variants function is that you can’t search on
reported trait (I guess because the older studies are not sufficiently
annotated at the association level?). I am concerned that by only
searching on EFO I might miss some associations.
I am a bit confused now. You say you cannot search on report_trait
with the get_variants()
function, but you can, right?
my_variants_by_rep_trait <- get_variants(reported_trait = "Circulating odd-numbered chain saturated fatty acid levels (C19:0)")
my_variants_by_efo_trait <- get_variants(efo_trait = "fatty acid measurement")
my_variants_by_rep_trait@variants
## # A tibble: 15 x 7
## variant_id merged functional_class chromosome_name chromosome_position
## <chr> <int> <chr> <chr> <int>
## 1 rs1468510 0 3_prime_UTR_variant 7 151360599
## 2 rs11842790 0 intron_variant 13 50350312
## 3 rs13226693 0 intron_variant 7 151353528
## 4 rs17074145 0 intron_variant 13 50355253
## 5 rs17363566 0 intron_variant 13 50360756
## 6 rs706603 0 intron_variant 13 50280887
## 7 rs12871645 0 intron_variant 13 50357429
## 8 rs12703118 0 intron_variant 7 151345197
## 9 rs12853498 0 intron_variant 13 50339738
## 10 rs17074143 0 intron_variant 13 50348637
## 11 rs13221923 0 intron_variant 7 151357494
## 12 rs17074093 0 intron_variant 13 50314567
## 13 rs12874827 0 intron_variant 13 50326600
## 14 rs12874278 0 intron_variant 13 50361786
## 15 rs4770891 0 intron_variant 13 25983588
## # … with 2 more variables: chromosome_region <chr>, last_update_date <dttm>
my_variants_by_efo_trait@variants
## # A tibble: 156 x 7
## variant_id merged functional_class chromosome_name chromosome_posit…
## <chr> <int> <chr> <chr> <int>
## 1 rs1604805 0 intron_variant 4 21569464
## 2 rs7198595 0 intron_variant 16 12471050
## 3 rs12098564 0 non_coding_transcript_ex… 10 85193571
## 4 rs11842790 0 intron_variant 13 50350312
## 5 rs788076 0 intergenic_variant 10 29047920
## 6 rs742614 0 intergenic_variant 20 33894826
## 7 rs10518201 0 regulatory_region_variant 4 78730706
## 8 rs12871645 0 intron_variant 13 50357429
## 9 rs7534537 0 intron_variant 1 204305391
## 10 rs2118674 0 intron_variant 2 170462384
## # … with 146 more rows, and 2 more variables: chromosome_region <chr>,
## # last_update_date <dttm>
Point 2
This could happen when the EFO is less specific than the reported
trait. For example, rs4770891-? is associated with the reported trait
“Circulating odd-numbered chain saturated fatty acid levels (C19:0)”
but with EFO “fatty acid measurement” in study GCST005862.
I did not quite understand your “but” clause. Regardless, note that the
variant you mentioned, i.e., rs4770891
is present in both sets of
variants above:
'rs4770891' %in% my_variants_by_rep_trait@variants$variant_id
## [1] TRUE
'rs4770891' %in% my_variants_by_efo_trait@variants$variant_id
## [1] TRUE
Point 3
If I was specifically interested in the saturated fatty acid C19:0,
then searching on EFO would not identify this association, whereas
searching for studies on reported trait would. In other words, you
could argue that search strategy 1 (search for studies by reported
trait then retrieve associations) is more sensitive (but suffers more
false positives), while search strategy 2 (search for variant-EFO
associations) is more specific (but suffers more false negatives).
What do you think?
I could not follow your reasoning because I did not observe the
differences (as shown above) you are referring to for this specific
example… So perhaps you could illustrate your concerns with some code?
from gwasrapidd.
Actually I think this specific issue is now resolved. I can see that in this particular example get_variants is the best option. The new problem I was trying to explain is actually to do with the get_associations function. I'm therefore going to post this issue on the get associations issue page I opened the other day.
from gwasrapidd.
Okay, I will close this issue then.
from gwasrapidd.
Related Issues (20)
- Fix documentation of `cytogenetic_bands`
- Unexpected error in get_variants_by_efo_id: Error: length(efo_trait) not greater than 0
- Make functions that search by either `efo_trait` or `reported_trait` case sensitive again
- Very Small P Values are Rounded to 0 HOT 2
- get_associations using reported trait HOT 7
- Retrieve all genetic associations searching on reported trait and efo simultaneously HOT 7
- NCBI build conversion HOT 8
- Extract other information using gwasrapidd HOT 1
- Is it possible to add additional gene information? HOT 4
- Response code 500 when using get_studies() HOT 8
- parsing issue while using get_variants() HOT 10
- Gwascatcollect<-function(gene, chr=xx, start=xx, end=xx) HOT 3
- GRASP: Genome-Wide Repository of Associations Between SNPs and Phenotypes HOT 4
- The most efficient way to retrieve association results by study_id? HOT 2
- institutional logo not rendered in footer
- revisit FAQ 5 HOT 1
- consider transitioning to the (not so) new pkgdown website template HOT 2
- Error: parse error: premature EOF in study responses
- Response code was 500. HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gwasrapidd.