gnames / bhlindex

BHLindex is used by Biodiversity Heritage Library to create their scientific names index

License: MIT License

Go 97.75% Makefile 1.01% Ruby 1.23%

bhlindex's Introduction

GNames

The goal of the GNames project is to provide accurate and fast verification of scientific names in unlimited quantities. The verification should be fast (at least 1000 names per second) and include exact and fuzzy matching of input strings against scientific names aggregated from a large number of data-sources.

If you do not need exact records of matched names from data-sources and just want to know whether a name-string is known, you can use GNmatcher instead of this project. GNmatcher is significantly faster and has simpler output.

Features
Installation
- Installation prerequisites
- Installation process
Configuration
Usage as API
Usage with GNverifier
Web-Logs
Known limitations of the verification
Development
Authors
License

Features

Fast verification of an unlimited number of scientific names.
Multiple levels of verification:
- Exact matching (exact string match for viruses, exact canonical form match for Plantae, Fungi, Bacteria, and Animalia).
- Fuzzy matching detects human and/or Optical Character Recognition (OCR) errors without producing a large number of false positives. To avoid false positives, uninomial names are only checked for an exact match.
- PartialExact matching happens when a match for the full name-string is not found. In such cases middle or end words are removed and each variant is verified. Matches that keep the last word intact are preferred.
- PartialFuzzy matching is provided for partial matches of species and infraspecies. To avoid false positives, uninomials are only checked for an exact match.
- Virus matching provides verification of virus names.
- FacetedSearch allows searching with a flexible query language.
Providing names information from data-sources that contain a particular name.
- Returning the "best" result. The BestResult is calculated by a scoring algorithm.
- Optionally, limiting results to data-sources that are important to a GNames user.
Providing outlink URLs to the websites of some data-sources to show the original record of a name.
Providing meta-information about aggregated data-sources.

Installation

Most users do not need to install GNames and can use the remote GNames API service at http://verifier.globalnames.org/api/v1 or the command line client GNverifier. Nevertheless, it is possible to install a local copy of the service.

Installation prerequisites

A Linux-based operating system.
At least 32GB of memory.
At least 50GB of free disk space.
A fast Internet connection during installation. After installation GNames can operate without a remote connection.
PostgreSQL database.

Installation process

PostgreSQL

We do not cover the basics of PostgreSQL administration here. There are many tutorials and resources for Linux-based operating systems that can help.

Create a database named gnames. Download the gnames database dump. Restore the database with:
```
gunzip -c gnames_latest.tar.gz |pg_restore -d gnames
```
GNmatcher

Refer to the GNmatcher documentation for its installation.

GNames

Download the latest release of GNames, unpack it, and place it somewhere in the PATH.

Run gnames -V. It shows the version of GNames and also generates the $HOME/.config/gnames.yaml configuration file.

Edit $HOME/.config/gnames.yaml according to your preferences.

Try it by running:
```
gnames rest -p 8888
```
To start the service automatically, you can create a systemctl configuration for it, if your system supports systemctl.

Alternatively, you can use the Docker image to run GNames. You will need to create a file with the corresponding environment variables, which are described in the .env.example file.
```
docker pull gnames/gnames:latest
docker run --env-file path_to_env_file -d -i -t -p 8888:8888 \
  gnames/gnames:latest rest -p 8888
```
We provide an example of an environment file. Environment variables override configuration file settings.

Configuration

Configuration settings can either be given in the config file located at $HOME/.config/gnames.yaml, or by setting the following environment variables:

Env. Var.	Configuration
GN_CACHE_DIR	CacheDir
GN_JOBS_NUM	JobsNum
GN_MATCHER_URL	MatcherURL
GN_MAX_EDIT_DIST	MaxEditDist
GN_PG_DB	PgDB
GN_PG_HOST	PgHost
GN_PG_PASS	PgPass
GN_PG_PORT	PgPort
GN_PG_USER	PgUser
GN_PORT	Port

The meaning of each configuration setting is described in the default gnames.yaml.

Usage as API

Please note that the API currently under development (documentation) is publicly served at https://verifier.globalnames.org/api/v1.

If you installed GNames locally and want to run its API, run:

gnames rest
# to change from default 8888 port
gnames rest -p 8787

Refer to the GNames RESTful API documentation for details on interacting with the GNames API.
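
As a rough idea of what a client call might look like, the sketch below builds (but does not send) a verification request against a local instance on the default port 8888. The `/verifications` path, the `nameStrings` payload field, and the assumption that a local instance serves under `/api/v1` are not stated in this README; treat them as assumptions and confirm them in the API documentation. The names `verifyInput` and `newVerifyRequest` are invented for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// verifyInput is an assumed request payload; check the field name against
// the API documentation before relying on it.
type verifyInput struct {
	NameStrings []string `json:"nameStrings"`
}

// newVerifyRequest builds a POST request for the assumed /verifications
// endpoint and returns it together with the JSON body. It does not send it.
func newVerifyRequest(baseURL string, names []string) (*http.Request, []byte, error) {
	body, err := json.Marshal(verifyInput{NameStrings: names})
	if err != nil {
		return nil, nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/verifications", bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, body, nil
}

func main() {
	// Assumed base URL of a local instance started with "gnames rest".
	req, body, err := newVerifyRequest("http://localhost:8888/api/v1", []string{"Abrostola triplasia"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(body))
	// To actually send the request: resp, err := http.DefaultClient.Do(req)
}
```

Swapping the base URL for the public service address is the only change needed to target the remote API instead of a local one.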

Usage with GNverifier

GNverifier is a command line client for the GNames backend. It uses the publicly available remote API of GNames. Install and use it according to the GNverifier documentation.

GNverifier also provides a web-based user interface to GNames. To launch it, use something like:

gnverifier -p 8777

Known limitations of the verification

Exact matches of misspellings that might exist in poorly curated databases prevent finding fuzzy matches from better curated sources.

To increase performance we stop any further attempts once a name is matched successfully. This prevents fuzzy-matching if a misspelled name is found somewhere. It is helpful to check the 'curation' field of the returned result and to see how many data-sources contain the name.

Fuzzy matching of a name where the genus string is broken by a space.

For example, we cannot match 'Abro stola triplasia' to 'Abrostola triplasia'. The edit distance between the strings is only 1; however, we stem specific epithets, so in reality we fuzzy-match 'Abro stol triplas' to 'Abrostola triplas'. The edit distance is then 2, which is usually beyond our threshold.

Development

Install the Go language for your Linux operating system.
Create a PostgreSQL database as described in Installation.
Clone the GNames code.
Clone GNmatcher and set it up for development.
Install Docker and Docker Compose.
Go to your local gnames directory:
- Run make dc
- Run docker-compose up
- In another terminal window run go test ./...

Authors

Dmitry Mozzherin

License

The GNames code is released under the MIT license.

bhlindex's Issues

Add classification ranks and ids

Fix pagination to use IDs instead of offsets and limits

As a User I want to use gnverifier for verification

As a User I want to know how many times a name occurred in BHL and what its average odds are

Add indices/foreign keys after the end of the name indexing

As a User I want bhlindex to keep low memory profile

Currently, a lack of swap or a small amount of memory makes the program unusable, because it keeps data about all found names in memory. Move this information to the database. This will make checking whether a name is new slower, but will give a much smaller memory footprint.

ExactMatch sometimes appears with edit distance > 0, or wrong classification.

The problem appears only with ExactMatch

  count  |     match_type      
---------+---------------------
   67978 | ExactMatch
 3328099 | FuzzyCanonicalMatch
  157216 | FuzzyPartialMatch

Sometimes only the edit distance is wrong; at other times the classification is wrong as well.

-[ RECORD 1 ]------+-----------------------------------------------------------------------------------
id                 | 46732
name               | Abacocrinus cappelleri
match_type         | ExactMatch
edit_distance      | 2
stem_edit_distance | 2
matched_name       | Abacocrinus cappelleri
current_name       | Abacocrinus cappelleri
classification     | Animalia|Porifera|Demospongiae|Axinellida|Raspailiidae|Eurypon|Eurypon unispiculum
datasource_id      | 168
datasources_number | 1
curation           | Unknown
retries            | 1
error              | 
updated_at         | 2018-12-05 20:50:53.715639

The problem cannot be reproduced on the gnindex server or with gnfinder

As a User I want to know version number of bhlindex I have

As a User I want to see words before and after found name

As a User I want more information available via gRPC

As a Developer I want a better way to compile and pack the project.

We need a Makefile to create tarballs for uploading to GitHub.

Upgrade to gnfinder v0.11.1

As a User I want to use latest verification code

Upgrade to gnfinder 0.7.0

As a User I want to use Ruby client for gRPC as a gem

As a User I want "bhlindex find" command to do name-finding

We also want this command to have a --workers flag to set the number of workers.

As a User I want to verify if found scientific names are 'real'

We connect to gnindex to do the verification.

As a User I want to know expanded genus of abbreviated scientific names.

As a gRPC User I want to get a stream of titleIDs only

The purpose of this method is to find out what titleIDs are out there, and then use them to get their pages and names.

For now we will stream all 200 000 of them. In the future we might need to add offset and limit options to page through the titles.

Provide more details in the dump for classification of a verified name

We need to provide the ranks and IDs of the classifications.

As a User I want to know if there is nomenclatural annotation attached to a name

As a user I want to know the predominant language of every volume and page in BHL

A volume might have many languages. For example, Spanish journals might have papers in English, there may be a big chunk of Latin text for species descriptions, etc. The best way to do this is probably to use cld2 from Google. There is a binding to it for Go.

As a user I want name finding not to be slowed down by name-verification

At the moment, inserting unique names starts to slow down name finding at some point. There are two ways to solve it: using a key-value store, or running verification sequentially. I decided to go with the second approach for now. It will make the process a couple of hours slower than the first one, but it is also much easier to implement.