
gndiff's Introduction

GNames


The goal of the GNames project is to provide accurate and fast verification of scientific names in unlimited quantities. Verification should be fast (at least 1000 names per second) and include exact and fuzzy matching of input strings against scientific names aggregated from a large number of data-sources.

If you do not need the exact records of matched names from data-sources and just want to know whether a name-string is known, you can use GNmatcher instead of this project. GNmatcher is significantly faster and has simpler output.

Features

  • Fast verification of an unlimited number of scientific names.
  • Multiple levels of verification:
    • Exact matching (exact string match for viruses, exact canonical-form match for Plantae, Fungi, Bacteria, and Animalia).
    • Fuzzy matching detects human and/or Optical Character Recognition (OCR) errors without producing a large number of false positives. To avoid false positives, uninomial names are only checked for exact matches.
    • PartialExact matching happens when no match is found for the full name-string. In such cases, middle or end words are removed and each variant is verified. Matches of names with the last word intact are preferred.
    • PartialFuzzy matching is provided for partial matches of species and infraspecies. To avoid false positives, uninomials are only checked for exact matches.
    • Virus matching provides verification of virus names.
    • FacetedSearch allows the use of a flexible query language for searching.
  • Providing information about a name from the data-sources that contain it.
    • Returning the "best" result. The BestResult is calculated by a scoring algorithm.
    • Optionally limiting results to data-sources that are important to a GNames user.
  • Providing outlink URLs to some data-sources' websites to show the original record of a name.
  • Providing meta-information about the aggregated data-sources.
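The PartialExact strategy described above can be illustrated with a small sketch. This is only an illustration of the idea, not the actual GNames implementation, and the name 'Aus bus cus dus' is a made-up placeholder:

```python
# Illustration of the PartialExact idea: when the full name-string does not
# match, remove middle or end words and verify each shorter variant.
# This is NOT the GNames implementation, just a sketch of the concept.

def partial_variants(name: str) -> list[str]:
    words = name.split()
    if len(words) < 3:
        return []  # nothing sensible to remove from a uninomial or binomial
    variants = []
    # Variants that keep the last word intact (preferred, per the text above):
    for i in range(1, len(words) - 1):
        variants.append(" ".join(words[:i] + words[i + 1:]))
    # Variant with the last word removed:
    variants.append(" ".join(words[:-1]))
    return variants

print(partial_variants("Aus bus cus dus"))
# → ['Aus cus dus', 'Aus bus dus', 'Aus bus cus']
```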

Installation

Most users do not need to install GNames; they can use the remote GNames API service at http://verifier.globalnames.org/api/v1 or the command-line client GNverifier. Nevertheless, it is possible to install a local copy of the service.

Installation prerequisites

  • A Linux-based operating system.
  • At least 32GB of memory.
  • At least 50GB of free disk space.
  • A fast Internet connection during installation. After installation, GNames can operate without a remote connection.
  • PostgreSQL database.

Installation process

  1. PostgreSQL

    We do not cover the basics of PostgreSQL administration here; there are many tutorials and resources for Linux-based operating systems that can help.

    Create a database named gnames. Download the gnames database dump. Restore the database with:

    gunzip -c gnames_latest.tar.gz | pg_restore -d gnames
  2. GNmatcher

    Refer to the GNmatcher documentation for its installation.

  3. GNames

    Download the latest release of GNames, unpack it, and place the binary somewhere in your PATH.

    Run gnames -V. It will show you the version of GNames and also generate the $HOME/.config/gnames.yaml configuration file.

    Edit $HOME/.config/gnames.yaml according to your preferences.

    Try it by running

    gnames rest -p 8888

    To start the service automatically, you can create a systemd unit for it, if your system supports systemd.

    Alternatively, you can use the Docker image to run GNames. You will need to create a file with the corresponding environment variables described in the .env.example file.

    docker pull gnames/gnames:latest
    docker run --env-file path_to_env_file -d -i -t -p 8888:8888 \
      gnames/gnames:latest rest -p 8888

    We provide an example environment file (.env.example). Environment variables override configuration-file settings.

Configuration

Configuration settings can be given either in the config file located at $HOME/.config/gnames.yaml or via the following environment variables:

Env. Var.         Configuration
GN_CACHE_DIR      CacheDir
GN_JOBS_NUM       JobsNum
GN_MATCHER_URL    MatcherURL
GN_MAX_EDIT_DIST  MaxEditDist
GN_PG_DB          PgDB
GN_PG_HOST        PgHost
GN_PG_PASS        PgPass
GN_PG_PORT        PgPort
GN_PG_USER        PgUser
GN_PORT           Port

The meaning of each configuration setting is provided in the default gnames.yaml.
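For the Docker route, the same settings can be supplied via an environment file. A minimal illustrative .env file follows; all values here are placeholders, so consult the project's .env.example for the authoritative content:

```shell
# Illustrative .env file for the Docker setup (values are placeholders;
# see .env.example in the repository for the authoritative list).
GN_CACHE_DIR=/var/gnames/cache
GN_JOBS_NUM=8
GN_MATCHER_URL=http://localhost:8080
GN_MAX_EDIT_DIST=1
GN_PG_DB=gnames
GN_PG_HOST=localhost
GN_PG_PASS=secret
GN_PG_PORT=5432
GN_PG_USER=postgres
GN_PORT=8888
```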

Usage as API

Please note that the API under current development (documentation) is publicly served at https://verifier.globalnames.org/api/v1.

If you installed GNames locally and want to run its API, run:

gnames rest
# to change from the default port 8888
gnames rest -p 8787

Refer to GNames' RESTful API documentation for details on interacting with the GNames API.
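For scripted use of a local instance, the snippet below sketches the shape of a verification request. The endpoint path and the JSON field names (nameStrings, dataSources, withAllMatches) are assumptions based on the public verifier API; confirm them against the RESTful API documentation before relying on them:

```python
# Sketch of building a verification request for a local GNames instance.
# Endpoint path and JSON field names are assumptions based on the public
# verifier API; confirm them against the RESTful API documentation.
import json

payload = {
    "nameStrings": ["Pomatomus saltatrix", "Bubo bubo"],
    "dataSources": [1],       # optional: limit results to preferred data-sources
    "withAllMatches": False,  # False: return only the BestResult per name
}

url = "http://localhost:8888/api/v1/verifications"  # local instance from the steps above
body = json.dumps(payload).encode("utf-8")
print(url)
print(body.decode())
```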

Usage with GNverifier

GNverifier is a command-line client for the GNames backend. It uses the publicly available remote API of GNames. Install and use it according to the GNverifier documentation.

GNverifier also provides a web-based user interface to GNames. To launch it, use something like:

gnverifier -p 8777

Known limitations of the verification

  • Exact matches of misspellings that exist in poorly curated databases prevent finding fuzzy matches from better-curated sources.

    To increase performance, we stop further tries once a name matches
    successfully. This prevents fuzzy matching when a misspelled name is found
    somewhere. It is helpful to check the 'curation' field of the returned
    result and see how many data-sources contain the name.
    
  • Fuzzy matching of a name whose genus string is broken by a space.

    For example, we cannot match 'Abro stola triplasia' to 'Abrostola triplasia'. There is only 1 edit distance between the strings; however, we stem specific epithets, so in reality we fuzzy-match 'Abro stol triplas' against 'Abrostola triplas'. That means the edit distance is now 2, which is usually beyond our threshold.
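The edit distances quoted above can be checked with a quick standalone computation (the stemmed forms are the ones given in the text):

```python
# Why "Abro stola triplasia" fails to fuzzy-match "Abrostola triplasia":
# the raw strings are 1 edit apart, but the stemmed forms are 2 edits apart,
# which exceeds the usual fuzzy-matching threshold.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Abro stola triplasia", "Abrostola triplasia"))  # → 1 (raw)
print(levenshtein("Abro stol triplas", "Abrostola triplas"))       # → 2 (stemmed)
```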

Development

  • Install the Go language on your Linux operating system.
  • Create a PostgreSQL database as described in the installation section.
  • Clone the GNames code.
  • Clone the GNmatcher and set it up for development.
  • Install Docker and Docker Compose.
  • Go to your local gnames directory:
    • Run make dc
    • Run docker-compose up
    • In another terminal window run go test ./...

Authors

License

The GNames code is released under the MIT license.

gndiff's People

Contributors

dimus


Forkers

abubelinha

gndiff's Issues

-p and --port options not working

Running gndiff -V shows no info about how to run it as a server,
so I was assuming it is still work in progress.

But I have realized the documentation mentions it as an already implemented feature.

It is not working for me:

c:>gnames\gndiff -V

version: v0.2.1

build:   2023-09-06_18:45:19UTC

c:>gnames\gndiff --port 8080
Error: unknown flag: --port
Usage:
  gndiff source_file reference_file [flags]

Flags:
  -f, --format string   Sets output format. Can be one of:
                        'csv', 'tsv', 'compact', 'pretty'
                        default is 'csv'.
  -h, --help            help for gndiff
  -q, --quiet           Do not output info and warning logs.
  -V, --version         shows build version and date, ignores other flags.

Error: unknown flag: --port

file disk read vs other input options

Hi @dimus
This is more a question than an issue.

My initial idea of gndiff was comparing two files so its .csv input design is perfect for that:

  • I call the executable from a Python script.
  • That script has to generate reference.csv and source.csv files
  • Then it calls gndiff command line executable passing those filenames
  • It catches the output and processes its content from Python again.

That just works.

Now I am wondering about some other possible use cases with frequent repetitive gndiff calls (also from Python).
My concern is whether so many disk-write/disk-read of .csv files could/should be avoided.

Imagine my script is parsing a long list of new specimens to include in a museum collection.
I might prefer to gndiff-match them one by one, for whatever reason (my script might need to perform other intermediate tasks in a certain order before processing the next specimen name).
So I would be passing gndiff a small source.csv with just one row, but so many times.

In such a scenario, would it make sense not to create a source.csv on disk (which means a Python file-write plus a gndiff file-read), but somehow pass the source info as a parameter instead?
Maybe this is already possible, although I am not sure what syntax I should try.
Or maybe this doesn't make sense at all because the script's performance would be similar (i.e., the intermediate tasks are slower than the gndiff call).

Of course, I can always design my script to process all gndiff-matching operations in advance.
I am just thinking before scripting, and I am not a professional, so don't take me too seriously.


Somehow related to this, in #13 I suggested the possibility of using gndiff as a server (so we can run gndiff in one machine and call it from others).
If that feature ever becomes possible, I wonder how such a server would work.

  • I guess the idea is to repeat exactly the same process (gndiff receives two files, does the work, and returns the output as an HTTP response).
  • But another possible scenario is running it as a server with a predefined reference list: reference.csv is not passed in HTTP requests but is defined at server start time, so requests only contain a list of source taxa (or just one taxon) to match against that reference.csv. Again, the server could be receiving small but repetitive matching tasks.

Just wondering

first impressions and questions

First of all, thanks a lot for creating gndiff. It's gonna be so useful for me.

I have just tried with a small file, to get the feeling of how it works.
I already found some issues to comment:

  • Unclear input file formats description: it says "Prepare two files with names. There are 3 possible file formats:" but actually only two formats are mentioned: (1) a simple list, one name per line; (2) a CSV file with some other fields (see below).
    Also, it is unclear to me whether the CSV format applies only to the reference.csv file or also to source.csv:

    • Names to be matched (source.csv) might also contain their own IDs, but it is unclear to me whether gndiff suggests the user provide them or not (i.e., for adding them as a new column in the output, so it is easier for me to rejoin that output with my original database).
      I think that is not the idea, because the output already provides an autonumeric index. So I understand source.csv would usually contain just one field, with names and nothing else (just one column), with one possible exception:
    • Except when using Family: I suppose in that case it should be present in both files. Correct? But I couldn't make it work properly in my tries (see below):
  • I might have misunderstood the input CSV format description above. But if Family and TaxonID are optional fields, then the JSON output sometimes contains errors:
    1. If I don't provide a Family column in reference.csv, then the JSON output referenceRecords[n].family contains the same value as name (the ScientificName field provided in my reference.csv file).
    2. If I provide a Family column in reference.csv (even with empty values), then the JSON output seems correct (referenceRecords[n].family contains the family values I provided).
    3. But if I also provide a Family in source.csv, then the JSON output includes a new sourceRecord.id which contains the same value as sourceRecord.name.
    4. If source.csv contains other columns (i.e., ScientificName + LifeForm), then the JSON output produces sourceRecord.family=sourceRecord.id=sourceRecord.name (all containing the ScientificName provided in source.csv).

    So I am a bit confused. I think it would be worth providing a couple of sample input files and explicitly saying whether they can/should contain other columns or not.

    Regarding family: a real example of how "tricky homonyms where family helps to resolve taxa from each other" would be useful too (I think family is not going to solve anything in my case, but just to be sure). I wonder how this "use family" option affects speed: does it make matching faster or slower for large datasets?

  • CSV/TSV outputs are missing column headers. This could seem irrelevant, but it makes it a bit difficult to check whether the output content is correct. Also, I cannot proceed with further tasks, like merging this output with other tabular data by means of column joins (I can try to figure out the headers and add them myself, but it would be safer if gndiff did it, to avoid mistakes).

EDIT: I have just realized that some of the above suggestions were already addressed by @Adafede in a previously closed issue (#12).
Sorry about that. My comments are pretty verbose, so @dimus might still find some helpful feedback in them.
This is a new one:

  • Shouldn't the output include some sort of calculated numeric similarity between the matched names? I have some cases where the JSON produces several "Exact" matches (i.e., two referenceRecords for the same sourceRecord) because my reference.csv contains two similar versions of the same name (i.e., a subsp. rank vs. a var. rank, identical in everything else), but my source.csv only contains one (i.e., the subsp.). How can I decide which is the most similar in these cases?
    I will better post an example in a new comment to illustrate this.

Thanks a lot in advance !!

some feedback

Hi, I quickly tested your tool and here is what I can say:

  • Installation works fine
  • Commands are clear and consistent with other GNtools
  • Adding headers to csv/tsv format would be nice (found them via -f pretty)
  • The family column in the output is quite cryptic for external users; a brief description in the README would probably help.
  • The error message FATA[0000] the CSV file needs `scientifiName` field contains a typo.
  • Globally, very nice. Even if I don't see applications for it on my side at the moment, adding a more straightforward way for the user to run gnfinder and then gndiff on files might open possibilities. I am sure you already had this in mind, but I see this combination as very powerful in the future.

Your work is amazing and useful to so many people out there!
