Add taxonomic name resolution to the EcoData Retriever to facilitate data science approaches to ecology

numfocus commented on July 17, 2024

Add taxonomic name resolution to the EcoData Retriever to facilitate data science approaches to ecology

from gsoc.

Comments (34)

akshayah3 commented on July 17, 2024

@ethanwhite @r-gaia-cs I would like to work on this. Can you guys guide me as to where to begin?

Umang-Goel commented on July 17, 2024

Hi @ethanwhite @r-gaia-cs ,
I would like to contribute to this project. Can you help me get started?

anuraghota commented on July 17, 2024

Hello, I would like to work on this project.
While installing the EcoData Retriever, during the configuration settings, it always ends up saying "No module named pymysql" even though I have that module installed. What is the problem here?

rgaiacs commented on July 17, 2024

@akshayah3, @Umang-Goel and @anuraghota Thanks for your interest in applying to GSoC.

@ethanwhite, @bendmorris and @sckott Could you help our possible students?

While installing the EcoData Retriever, during the configuration settings, it always ends up saying "No module named pymysql" even though I have that module installed. What is the problem here?

If this is a bug you can report it at https://github.com/weecology/retriever/issues.

ethanwhite commented on July 17, 2024

Hi @akshayah3, @Umang-Goel and @anuraghota! Thanks for your interest in this project and my apologies for the slow response (I've been traveling). I'm really excited to see so many folks interested in this project.

So, how to get started on this project. I think there are really three main components:

1. Build the system for sending a species name to one or more of the taxonomic name resolution services and processing the response from that service to determine the best name to use for that species. This could either be built directly into the Retriever or we could help build out pytaxize and have the Retriever use it. Using these taxonomic name resolution service APIs is an area where @sckott is an expert and will be able to provide a lot of guidance, and @bendmorris and I can help with integrating it into the Retriever.
2. Update the data model and the user interfaces to work with information about species and taxonomic name resolution.
3. Build the control flow for running or not running taxonomic name resolution depending on the type of data and the user's desires.
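
As a rough illustration of how the first and third components could fit together, here is a minimal sketch in Python. It is not the Retriever's or pytaxize's actual API: `build_name_map`, `toy_resolver`, and the `enabled` flag are hypothetical names, and the resolver is a stand-in for whatever service call is eventually used.

```python
from typing import Callable, Dict, Iterable, Optional

# A resolver takes one name and returns the preferred name, or None when no
# acceptable match was found. In practice it might wrap pytaxize or a direct
# call to a name resolution service (assumption, not an existing interface).
Resolver = Callable[[str], Optional[str]]


def build_name_map(names: Iterable[str], resolver: Resolver,
                   enabled: bool = True) -> Dict[str, str]:
    """Map each distinct original name to the name that should be stored."""
    name_map: Dict[str, str] = {}
    for name in names:
        if name in name_map:
            continue  # resolve each distinct name only once
        resolved = resolver(name) if enabled else None
        # Fall back to the original name when resolution is off or fails.
        name_map[name] = resolved if resolved is not None else name
    return name_map


def toy_resolver(name: str) -> Optional[str]:
    """Hypothetical stand-in for a real service: fixes one known misspelling."""
    known = {"Struthio camellus": "Struthio camelus"}
    return known.get(name)


names = ["Struthio camellus", "Struthio camelus", "Struthio camellus"]
name_map = build_name_map(names, toy_resolver)
```

Passing the resolver in as a function keeps the control flow independent of which service (or pytaxize function) ends up doing the lookup.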

@anuraghota If you could report a new issue over at the main repository as @r-gaia-cs suggests I'd be happy to work with you to figure out what's going on.

akshayah3 commented on July 17, 2024

Thanks for the response @ethanwhite.
I just had a glance at pytaxize and it seems that a lot of work has gone into it, even though it's not up to the mark yet compared to its R counterpart. Would it be reasonable to continue building that as a separate project and have the Retriever use it?

ethanwhite commented on July 17, 2024

@akshayah3 yes, definitely. If I was implementing this that's the direction I would probably choose to take and it's part of the reason that @sckott is on the mentoring team.

sckott commented on July 17, 2024

I think it's best to modularize if possible, so using pytaxize in the EcoData Retriever is a good idea. The aim with pytaxize was a complete port of the R version, but we can prioritize anything that needs to be done for the EcoData Retriever.

akshayah3 commented on July 17, 2024

@ethanwhite I have a query regarding your 3rd point. You mentioned that we have to validate the user's data before running the name resolution. On what basis do we validate the data?

ethanwhite commented on July 17, 2024

@akshayah3 Basically in the scripts that define each dataset we'll need to add an indication of which column(s) contain the species name information. If there is no species name information in a given table then we won't run name resolution on that table.
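
One way to picture this is a per-table declaration of species columns, with resolution skipped when the list is empty. The sketch below is only illustrative: `TableSpec` and `species_columns` are hypothetical names, not fields that exist in the Retriever's dataset scripts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableSpec:
    """Hypothetical stand-in for a table definition in a dataset script."""
    name: str
    columns: List[str]
    # Columns holding species names; an empty list means "skip resolution".
    species_columns: List[str] = field(default_factory=list)


def tables_to_resolve(tables: List[TableSpec]) -> List[TableSpec]:
    """Return only the tables that declare at least one species column."""
    return [t for t in tables if t.species_columns]


tables = [
    TableSpec("species", ["id", "genus_species", "mass_g"], ["genus_species"]),
    TableSpec("sites", ["site_id", "lat", "lon"]),
]
selected = [t.name for t in tables_to_resolve(tables)]
```

Here only the `species` table would be passed on to name resolution; the `sites` table declares no species columns and is left untouched.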

lyttonhao commented on July 17, 2024

Hi @ethanwhite, I'm Yanghao Li and have contacted you by email about this project. I have a question about the best name. Is the best name always the top-ranked matched name in the results from the services? And is our task to replace the original species name with the best name found? Do I understand it correctly? Thanks.

ethanwhite commented on July 17, 2024

Hi @lyttonhao - First, I think picking the top-ranked match will serve as a good starting point, but ultimately we will want to do something a little more sophisticated. I think we'd also want to check the quality of that match (which is another value that most name resolution services provide). If the quality is poor we may want to err on the side of being cautious and not replace the name. We will also have to make decisions about what to do if there are two almost equally good matches. Some discussion with @sckott, myself, and maybe a couple of folks involved in taxonomic name resolution will be useful in terms of figuring out the specifics. Second, yes, the task is to replace the original species name with the best match if that match is of suitable quality.
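
The rule described here (take the top-ranked match, but keep the original name when the score is poor or two candidates are nearly tied) could be sketched as follows. The `Match` type, the `min_score` and `tie_margin` parameters, and their default values are all assumptions chosen for illustration; the real thresholds would come out of the discussion mentioned above.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    name: str
    score: float  # match quality reported by the service, higher is better


def select_best_match(matches: List[Match], min_score: float = 0.9,
                      tie_margin: float = 0.01) -> Optional[str]:
    """Return the replacement name, or None to keep the original name."""
    if not matches:
        return None
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    best = ranked[0]
    if best.score < min_score:
        return None  # poor quality: err on the side of caution
    # Rivals are *different* names that score almost as well as the best one.
    rivals = [m for m in ranked[1:]
              if m.name != best.name and best.score - m.score <= tie_margin]
    if rivals:
        return None  # ambiguous: leave it for a person or a smarter rule
    return best.name


clear = select_best_match([Match("Struthio camelus", 0.99),
                           Match("Struthio molybdophanes", 0.75)])
ambiguous = select_best_match([Match("Aus bus", 0.95), Match("Aus cus", 0.945)])
```

Returning `None` rather than a guess keeps the "do not replace" decision explicit, so the caller can log or report names that were left unchanged.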

lyttonhao commented on July 17, 2024

Thanks for your response @ethanwhite. I will survey name resolution first and think about this problem more seriously. Are there any other ways to discuss this with you folks, such as IRC? Or is this issue the right place? Thanks.

rajat503 commented on July 17, 2024

Hi @ethanwhite @r-gaia-cs @sckott ,
I am interested in working on this project. I have experience with web development, Python and SQL. Looking forward to hearing from you on how to go about contributing to the project, preparing the application and hopefully working over the summer! It would be great if you could assign a task so that I could get familiar with the code base.

sckott commented on July 17, 2024

@lyttonhao Services that do name resolution per se include http://resolver.globalnames.org/, http://api.phylotastic.org/tnrs, and http://tnrs.iplantcollaborative.org/. These give back names to match the queried name with a score of how well they match (often many names have exactly the same score, with e.g., 10 or more slight variants on the same thing). These services, I think, have a field that says whether each name is an accepted name or a synonym, etc.

Other data sources like NCBI, ITIS, Tropicos, etc. don't do name resolution per se (they don't have an API endpoint for giving a name and getting back possible matches)

Some of these data sources provide SQL dumps, while others do not, so some allow a local query solution if needed

ethanwhite commented on July 17, 2024

@lyttonhao - we have been focusing discussion on issues since their asynchronous nature works well for us. Feel free to open issues in the retriever repository for questions or discussion of particular points. Keeping each issue focused on one thing will make it easier for us to manage. If having another communication channel becomes useful/necessary as part of GSoC, we'll be happy to open one, but keeping everything in the same place and asynchronous has been working well so far.

ethanwhite commented on July 17, 2024

Hi @rajat503! Glad to hear that you're interested in the project. In terms of good tasks to start getting familiar with the code base there are two levels of tasks that are good starter tasks.

The simplest would be to write a script for a new dataset. This is simple because the scripts are relatively isolated from the rest of the code base, but that's also a downside in terms of getting into the code (for examples of existing scripts see .script and .py files in the ./scripts directory). If you'd like to go this route let me know and I can recommend some datasets.
A better option in terms of something that integrates more directly with the core code would be to add a new data engine. These engines let us export datasets into a variety of file and database formats. Right now we support MySQL, PostgreSQL, SQLite, MS Access, and csv. There are open issues for adding JSON (weecology/retriever#152; see also weecology/retriever#165; most of this script is actually implemented, but it has a bug that needs to be cleaned up and tests added before being included), HDF5 (weecology/retriever#79), and XML (weecology/retriever#58).

lyttonhao commented on July 17, 2024

@sckott @ethanwhite Thanks for your responses. My current idea is to run some experiments on a few datasets with several services. With these results, we can see the score distributions and the consistency among these services. I think this can help us determine how to choose the best name in the next step.
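
A small sketch of what such an experiment might compute, assuming the results from each service have already been collected into a dictionary. The data layout, the `summarize` function, and the service names `service_a` and `service_b` are hypothetical; the sample values are made up purely to exercise the code.

```python
from statistics import median
from typing import Dict, Optional, Tuple

# service name -> {queried name -> (matched name or None, score)}
Results = Dict[str, Dict[str, Tuple[Optional[str], float]]]


def summarize(results: Results) -> Dict[str, float]:
    """Report the median score per service and cross-service agreement."""
    if not results:
        return {}
    summary: Dict[str, float] = {}
    for service, by_name in results.items():
        summary[f"median_score:{service}"] = median(s for _, s in by_name.values())
    # Only names that were queried against every service can be compared.
    shared = set.intersection(*(set(r) for r in results.values()))
    agree = sum(
        1 for name in shared
        if len({results[svc][name][0] for svc in results}) == 1
    )
    summary["agreement"] = agree / len(shared) if shared else 0.0
    return summary


# Made-up sample data, not real service output.
results = {
    "service_a": {"Struthio camelus": ("Struthio camelus", 0.99),
                  "Parus majr": ("Parus major", 0.75)},
    "service_b": {"Struthio camelus": ("Struthio camelus", 0.98),
                  "Parus majr": (None, 0.0)},
}
summary = summarize(results)
```

The median is just one summary of the score distribution; a histogram or quantiles per dataset would show more of the shape.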

ethanwhite commented on July 17, 2024

@lyttonhao That sounds like an excellent next step.

tpoisot commented on July 17, 2024

I like this idea. If it means continuing the work on pytaxize, I'm happy to help too.

sckott commented on July 17, 2024

@tpoisot 👍 awesome

lyttonhao commented on July 17, 2024

Hi, @ethanwhite @sckott. I have tested the Avian Body Size Dataset with GNR. I looked up the species names with gnr_resolve in pytaxize and want to share some of the results.

Since the original gnr_resolve() doesn't support multiple names in the same query, I have modified it a bit but still have some problems, as shown in this issue.
Time-consuming problem. I queried 3756 names, 300 names at a time, and it took about 900 seconds in total with my computer and network conditions. I think this may not be tolerable for the Retriever when generating databases.
I'm a little confused about two terms. The result returned by the GNR service includes two names: name_string (the name string found in this data source) and canonical_form (a "canonical" version of the name generated by the Global Names parser). I found that name_strings sometimes differ between data sources but the canonical_forms are often the same. For example, when querying "Struthio camelus", the returned name_strings include "Struthio camelus", "Struthio camelus Linnaeus", and "Struthio camelus L.", but the canonical_forms are all the same: "Struthio camelus". So is it correct to replace the species name with the returned canonical_form? I think this can reconcile different species names from different datasets.
The results for 3754 of the 3756 names are exact matches with a match score > 0.988. So I think the cases where a name needs replacing are very rare.

I will also try other datasets and services to see if the above results are common.
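
To make the canonical_form question concrete, here is a minimal sketch that collapses the matches returned for one queried name to a single canonical form. The record layout mirrors the two fields named above (`name_string`, `canonical_form`); `pick_canonical` is a hypothetical helper, and picking the most common form is an assumption rather than an agreed rule.

```python
from collections import Counter
from typing import Dict, List, Optional


def pick_canonical(records: List[Dict[str, str]]) -> Optional[str]:
    """Return the most common canonical_form among one name's matches."""
    forms = [r["canonical_form"] for r in records if r.get("canonical_form")]
    if not forms:
        return None
    return Counter(forms).most_common(1)[0][0]


# Records shaped like the "Struthio camelus" example in the comment above.
records = [
    {"name_string": "Struthio camelus", "canonical_form": "Struthio camelus"},
    {"name_string": "Struthio camelus Linnaeus", "canonical_form": "Struthio camelus"},
    {"name_string": "Struthio camelus L.", "canonical_form": "Struthio camelus"},
]
canonical = pick_canonical(records)
```

With these three records the differing name_strings all reduce to the same canonical form, which is what makes it attractive as the replacement value.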

sckott commented on July 17, 2024

@lyttonhao See my recent commit and comment on sckott/pytaxize#12 - with those fixes, I imagine the speed problem will mostly go away? Let me know if it doesn't.

If it is still too slow, there is a POST request option http://resolver.globalnames.org/api which may be faster.

I think canonical_form should work.

In terms of data sources in GNR, one can choose which data source included in GNR to trust the most, so you can query just that one data source.

If speed becomes a problem, and if it is an option to download or ship a SQL DB with the EcoData Retriever, then some of the taxonomy DBs provide SQL dumps, which should be very fast since it's simply reading a local SQL DB.
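
The batching described earlier (300 names per request) combined with a POST body restricted to chosen data sources might look roughly like this. No request is actually sent here. The `names` and `data_source_ids` field names and the `|` separator are assumptions about the service's request format and should be checked against the API documentation linked above; `chunked` and `build_payload` are hypothetical helpers, and the IDs `1` and `2` are placeholders.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(names: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` names."""
    for start in range(0, len(names), size):
        yield list(names[start:start + size])


def build_payload(batch: List[str],
                  data_source_ids: Sequence[int] = ()) -> Dict[str, str]:
    """Build the form fields for one POST request (field names assumed)."""
    payload = {"names": "|".join(batch)}
    if data_source_ids:
        payload["data_source_ids"] = "|".join(str(i) for i in data_source_ids)
    return payload


# Same sizes as in the test above: 3756 names, 300 per request.
names = [f"Genus species{i}" for i in range(3756)]
batches = list(chunked(names, 300))
example = build_payload(["Aus bus", "Cus dus"], data_source_ids=[1, 2])
```

Each payload would then be handed to an HTTP client of choice; keeping payload construction separate makes it easy to test without network access.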

ethanwhite commented on July 17, 2024

Time-consuming problem. I queried 3756 names, 300 names at a time, and it took about 900 seconds in total with my computer and network conditions. I think this may not be tolerable for the Retriever when generating databases.

This dataset has a really large number of species compared to most, so it should be on the extreme end time-wise. Personally I'd generally be happy to have the computer spend a few hours doing the hard work of cleaning up the taxonomy for me. That said, if @sckott's fixes don't speed it up a lot we may want to warn the user that it will be time consuming.

The results for 3754 of the 3756 names are exact matches with a match score > 0.988. So I think the cases where a name needs replacing are very rare.

This is partly a result of the taxon involved and the fact that it's a relatively recent dataset. Older datasets and datasets in taxa where the taxonomy is less stable/geographically consistent (e.g., plants) will be worse.

sckott commented on July 17, 2024

Taxonomic DBs do vary in strengths/foci, so we may want to target certain DBs for certain groups (e.g., mammals vs. plants, or North America vs. global)

tpoisot commented on July 17, 2024

I don't think 900 seconds is unreasonable for a large dataset. Especially since the results will be cached after the first query, so the user won't have to go through that every time (right?).

ethanwhite commented on July 17, 2024

Especially since the results will be cached after the first query, so the user won't have to go through that every time (right?)

Definitely.
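
A minimal sketch of the caching idea, assuming resolved names are stored in a local JSON file so that a second run does not query the service again. `NameCache` and `slow_resolver` are hypothetical; where the Retriever would actually keep such a cache is not decided in this thread.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional


class NameCache:
    """Small JSON-file cache of original name -> resolved name (or None)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: Dict[str, Optional[str]] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.data = json.load(handle)

    def resolve(self, name: str,
                resolver: Callable[[str], Optional[str]]) -> Optional[str]:
        if name not in self.data:
            self.data[name] = resolver(name)  # slow path: ask the service
            # Rewrites the whole file per new name: simple, not efficient.
            with open(self.path, "w", encoding="utf-8") as handle:
                json.dump(self.data, handle)
        return self.data[name]


calls: List[str] = []


def slow_resolver(name: str) -> Optional[str]:
    """Hypothetical stand-in for a network lookup; records each call."""
    calls.append(name)
    return name.strip()


cache_path = os.path.join(tempfile.mkdtemp(), "names.json")
first = NameCache(cache_path).resolve(" Parus major ", slow_resolver)
second = NameCache(cache_path).resolve(" Parus major ", slow_resolver)
```

The second lookup uses a fresh `NameCache` object on the same file, so it is served from disk and `slow_resolver` is only called once.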

lyttonhao commented on July 17, 2024

Thank you for all responses! I will test again on more datasets with the fixed pytaxize first.

lyttonhao commented on July 17, 2024

Hi, @ethanwhite @sckott, I’m very sorry for my slow response. (I was busy with some other things these days and my computer also had some problems.) I’ve tested some datasets again and found that

if we choose a few specific data sources and use the POST method, the speed is much faster. However, it seems that different data sources may give different results. So I think we need to determine some standard default data sources and give the user an option to choose specific data sources. If so, I think a local DB is also unnecessary because the speed is acceptable.
As @ethanwhite said, the results for other, older datasets are a little different; for example, some names have no matches and many returned scores are much lower.

Finally, I think I must work on my proposal right now because I'm a little late. Again, I’m very sorry for my delay.
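
The "standard defaults plus user override" idea, together with the earlier suggestion of targeting particular databases for particular groups, could be sketched like this. All source names and the `choose_sources` function are placeholders for illustration, not a recommendation of specific databases.

```python
from typing import Dict, List, Optional

# Placeholder source names; which sources to trust is still an open question.
DEFAULT_SOURCES: List[str] = ["source_a", "source_b"]
GROUP_SOURCES: Dict[str, List[str]] = {
    "plants": ["plant_source"],
    "mammals": ["mammal_source"],
}


def choose_sources(user_sources: Optional[List[str]] = None,
                   taxon_group: Optional[str] = None) -> List[str]:
    """Pick data sources: explicit user choice, then group, then defaults."""
    if user_sources:
        return list(user_sources)
    if taxon_group in GROUP_SOURCES:
        return list(GROUP_SOURCES[taxon_group])
    return list(DEFAULT_SOURCES)


by_default = choose_sources()
for_plants = choose_sources(taxon_group="plants")
overridden = choose_sources(user_sources=["my_source"], taxon_group="plants")
```

An explicit user choice always wins in this sketch, which matches the idea of offering defaults without taking control away from the user.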

ethanwhite commented on July 17, 2024

So I think we need to determine some standard default data sources and give the user an option to choose specific data sources.

Sounds like a good plan.

Finally, I think I must work on my proposal right now because I'm a little late.

Sounds good. Let us know when it's up and we'll try to provide some helpful feedback.

lyttonhao commented on July 17, 2024

Hi, @ethanwhite @sckott. I have a question about the different name resolution services that came up while working on my proposal. Since one name resolution service contains many data sources, there would be a lot of overlap between different services. For example, it seems that GNR has the largest number of data sources and covers almost all of the data sources in Phylotastic/TNRS. So I think maybe GNR is enough because it has many more data sources. Am I misunderstanding something?

sckott commented on July 17, 2024

Right, GNR has probably the largest number of data sources it pulls from. However, I think you'd still want to make sure it suits the needs of the project. Just saying to make sure during the project that it is all that is needed, instead of deciding now to ignore all other sources.

lyttonhao commented on July 17, 2024

Okay, I get your point. Thanks @sckott.

rgaiacs commented on July 17, 2024

I'm closing this issue since the student application period is over.
