checklistbank's Introduction

GBIF ChecklistBank

ChecklistBank is the taxonomy store with its associated webservices that allows GBIF to index any number of checklist datasets and match them to a global management taxonomy, the GBIF backbone taxonomy.

Warning! We have evolved ChecklistBank together with the Catalogue of Life into a standalone service based on a new open code base. This project is still in use by GBIF, but is no longer very active.

This is a multi-module project containing all ChecklistBank modules, from the API to the persistence layer to the webservice client. Integration tests are part of the respective modules, in particular the webservice client and the mybatis modules.

Maven profile

To run all tests you need a maven profile with the following properties defined:

<profile>
  <id>clb-local</id>
  <properties>
    <checklistbank.db.host>localhost</checklistbank.db.host>
    <checklistbank.db.name>clb</checklistbank.db.name>
    <checklistbank.db.username>postgres</checklistbank.db.username>
    <checklistbank.db.password>123456</checklistbank.db.password>
  </properties>
</profile>

ChecklistBank database schema

ChecklistBank relies on PostgreSQL 9 and uses the hstore extension. The simplest way of enabling this is to add it to the postgres template database, which is copied whenever postgres creates a new database. Thus if you run the following (or similar) before creating the database, you are all set:

psql -U postgres template1 -c 'create extension hstore;'

You can create a database schema from scratch or update one to the latest version using the maven liquibase plugin like this:

mvn -P clb-local liquibase:update

A diagram of the database schema is available for convenience.

checklistbank's People

Contributors

bwakkie, cgendreau, fmendezh, gbif-jenkins, marcos-lg, mattblissett, mdoering, mike-podolskiy90, omeyn, thomasstjerne, timrobertson100

checklistbank's Issues

Species suggest gives erratic results

Species suggest is much better than it was a year ago, but there are still oddities like this: flabellina+voig gives zero results whereas flabellina+voigt gives 1.

0 http://api.gbif.org/v1/species/suggest?limit=10&q=flabellina+vo
0 http://api.gbif.org/v1/species/suggest?limit=10&q=flabellina+voig
1 http://api.gbif.org/v1/species/suggest?limit=10&q=flabellina+voigt

There is something odd about seeing a result in the list of suggestions, typing one more (correct) character, and then seeing it disappear.
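
The inconsistency can be checked mechanically: a name suggested for the full query should also be suggested for each shorter prefix of that query. The following is a minimal Python sketch under stated assumptions. `disappearing_suggestions` and its `fetch` parameter are hypothetical names; `fetch` stands for any callable that returns the suggested names for a query string, for example a thin wrapper around the species/suggest URLs above (where the space is encoded as `+`). Because the service is called with `limit=10`, a name missing for a short prefix may also just have been pushed out by other candidates, so the result is a list of cases to inspect, not proof of a bug.

```python
from typing import Callable, List, Tuple


def disappearing_suggestions(
    full_query: str,
    fetch: Callable[[str], List[str]],
    min_len: int = 3,
) -> List[Tuple[str, str]]:
    """Return (prefix, name) pairs where `name` is suggested for the full
    query but missing from the suggestions for a shorter prefix of it."""
    expected = fetch(full_query)
    problems = []
    for end in range(min_len, len(full_query)):
        prefix = full_query[:end]
        got = set(fetch(prefix))
        for name in expected:
            if name not in got:
                problems.append((prefix, name))
    return problems
```

Passing a stub `fetch` backed by a dictionary makes the check usable in a unit test without network access.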

Changes in names from NMV to ALA to GBIF

Bob Mesibov has a guest post on iPhylo: http://iphylo.blogspot.co.uk/2017/09/guest-post-our-taxonomy-is-not-your.html. In it he shows that of some 85,000 occurrences in a sample of NMV (Museums Victoria) data, 1 in 5 ends up with a different name in GBIF(!). He gives one example in detail, which I think is another case of basionym mismatching (closely related species names with the same author and year being erroneously linked). We keep hitting this problem, and I think it's partly because we are relying on string matching of names instead of matching identifiers. In this example, the Australian Faunal Directory (AFD) actually has enough information and identifiers to express the correct relationships between these names. We should attempt to use this info when it's available.

Names Fauna Europaea double author name /import error? Map error? parsing error?

Platycheirus granditarsus (Forster, 1771) in Fauna Europaea becomes Platycheirus granditarsus (Forster, 1771) Forster, 1771 in GBIF backbone
Vipera berus (Linnaeus, 1758) in Fauna Europaea becomes Vipera berus (Linnaeus, 1758) Linnaeus, 1758 in GBIF backbone
Limenitis camilla (Linnaeus, 1764) in Fauna Europaea becomes Limenitis camilla (Linnaeus, 1764) Linnaeus, 1764 in GBIF backbone (https://www.gbif.org/species/5772986)
etc...

It looks like the author name is placed again after the scientificName,

...
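
To find how many names are affected, the pattern in the examples above (a bracketed authorship repeated verbatim after the closing bracket) can be detected with a small check. This is a hedged Python sketch, not the parser used by ChecklistBank; the function names are hypothetical. Whitespace and case are ignored when comparing the two authorship strings, so a variant such as "(S.Studer,1820) S.Studer, 1820" is caught as well.

```python
import re

# "<name> (<bracket authorship>) <trailing authorship>" with no nested brackets
_PATTERN = re.compile(r"^(?P<name>.*?)\s*\((?P<bracket>[^()]+)\)\s*(?P<tail>[^()]+)$")


def _squash(text: str) -> str:
    """Lower-case and drop all whitespace so spacing differences are ignored."""
    return re.sub(r"\s+", "", text).lower()


def has_duplicated_authorship(scientific_name: str) -> bool:
    """True if the bracketed authorship is repeated after the brackets."""
    m = _PATTERN.match(scientific_name.strip())
    return bool(m) and _squash(m.group("bracket")) == _squash(m.group("tail"))


def strip_duplicated_authorship(scientific_name: str) -> str:
    """Drop the repeated trailing authorship; leave other names untouched."""
    m = _PATTERN.match(scientific_name.strip())
    if m and _squash(m.group("bracket")) == _squash(m.group("tail")):
        return f"{m.group('name')} ({m.group('bracket')})"
    return scientific_name
```

Names with a genuinely different combination author, such as botanical recombinations, are left alone because the two authorship strings differ.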

Verify backbone names from zoological sources as Bryophyta

Likely wrong to have moss names from zoological sources. Check these names:


   id    |    rank    |                     scientific_name                     
---------+------------+---------------------------------------------------------
 8539848 | GENUS      | Pilatobius
 8529939 | SPECIES    | Pilatobius bullatum Murray, 1905
 8671511 | SPECIES    | Pilatobius recamieri Richters, 1911
 8959396 | SPECIES    | Bathyscia kerkyrana Reitter, 1884
 8765481 | SPECIES    | Macrocoma cylindrica (Kuster, 1846) Kuster, 1846
 9142380 | SPECIES    | Macrocoma divisa (Wollaston, 1864) Wollaston, 1864
 8988393 | SPECIES    | Macrocoma dubia (Wollaston, 1864) Wollaston, 1864
 9059090 | SPECIES    | Macrocoma latifrons Lindberg, 1953
 8965228 | SPECIES    | Macrocoma leprieuri (Lefevre, 1876) Lefevre, 1876
 9216000 | SPECIES    | Macrocoma obscuripes (Wollaston, 1892) Wollaston, 1892
 9024104 | SPECIES    | Macrocoma occidentalis Palm, 1976
 8908751 | SPECIES    | Macrocoma oromiana Daccordi, 1978
 9179719 | SPECIES    | Macrocoma palmaensis Palm, 1977
 9022785 | SUBSPECIES | Macrocoma palmaensis subsp. franzi Palm, 1976
 8786927 | SUBSPECIES | Macrocoma palmaensis subsp. palmaensis
 9118312 | SPECIES    | Macrocoma rubripes (Schaufuss, 1862) Schaufuss, 1862
 8896681 | SPECIES    | Macrocoma splendens Lindberg, 1950
 9000646 | SPECIES    | Macrocoma splendidula (Wollaston, 1862) Wollaston, 1862
 8937212 | SPECIES    | Muelleriella cephalonica Jeannel, 1929
 8727482 | SPECIES    | Muelleriella cretica Jeannel, 1929
 8846704 | SPECIES    | Muelleriella kerkyrana (Reitter, 1884) Reitter, 1884
 9112295 | SPECIES    | Muelleriella moczarskii Jeannel, 1924
 9089017 | SPECIES    | Muelleriella taygetana Casale, 1983
 9155210 | SPECIES    | Pachnephorus cylindrica Kuster, 1846
 8754130 | SPECIES    | Pseudocolaspis divisa Wollaston, 1864
 8819080 | SPECIES    | Pseudocolaspis dubia Wollaston, 1864
 8732793 | SPECIES    | Pseudocolaspis leprieuri Lefevre, 1876
 8841119 | SPECIES    | Pseudocolaspis obscuripes Wollaston, 1892
 8874059 | SPECIES    | Pseudocolaspis rubripes Schaufuss, 1862
 9046522 | SPECIES    | Pseudocolaspis splendidula Wollaston, 1862
 8793854 | SPECIES    | Aptychus crassus Hebert, 1855
 9187447 | SPECIES    | Aptychus insignis Hebert, 1855
 9031607 | SPECIES    | Aptychus moravicus Blaschke, 1911
 8778885 | SPECIES    | Aptychus ortusus Hebert, 1855
 9015291 | SPECIES    | Aptychus praeseranonis Blaschke, 1911
 9213091 | SPECIES    | Thelia angulata Walker, 1851
 8906061 | SPECIES    | Thelia bipuncta Walker, 1851
 9085807 | SPECIES    | Thelia collina Walker, 1851
 8756659 | SPECIES    | Thelia conica Walker, 1851
 9145218 | SPECIES    | Thelia constans Walker, 1851
 8843714 | SPECIES    | Thelia gladiator Walker, 1851
 8991154 | SPECIES    | Thelia lutea Walker, 1851
 8934141 | SPECIES    | Thelia rufivitta Walker, 1851
 9182562 | SPECIES    | Thelia semifasciata Walker, 1851
 8871374 | SPECIES    | Thelia similis Walker, 1851
 9056342 | SPECIES    | Thelia spinigera Walker, 1851
 9115520 | SPECIES    | Thelia substriata Walker, 1851
 8962507 | SPECIES    | Thelia tacta Walker, 1851
 8789511 | SPECIES    | Thelia tumida Walker, 1851
 8730314 | SPECIES    | Thelia unanimis Walker, 1851
 8816539 | SPECIES    | Thelia varia Walker, 1851
(51 rows)

Add landrace and cultivar level names

To be able to cover relevant use cases in the field of agrobiodiversity, taxon names need to include relevant infraspecific levels, e.g. cultivars and landraces.

For changing the backbone and API rank vocabulary (from Markus):

  • modify our API rank vocab to create a new landrace entry. This also means updating postgres vocabs
  • modify nub builder code to include such ranks
  • verify our species matching works fine with these names

Required: good source lists (@dschigel)

Requested by: fitness for use task group on agrobiodiversity (2015), TG report see http://www.gbif.org/resource/82283

also relates to: http://dev.gbif.org/issues/browse/POR-2789

clb importer mybatis exception

java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.ibatis.exceptions.PersistenceException: 
### Error querying database.  Cause: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
### The error may exist in org/gbif/checklistbank/service/mybatis/mapper/ParsedNameMapper.xml
### The error may involve org.gbif.checklistbank.service.mybatis.mapper.ParsedNameMapper.getByName-Inline
### The error occurred while setting parameters
### SQL: SELECT          n.id, n.scientific_name, n.canonical_name, n.type,     n.genus_or_above, n.infra_generic, n.specific_epithet, n.infra_specific_epithet, n.cultivar_epithet,     n.notho_type, n.rank, n.nom_status, n.sensu, n.remarks,     n.authors_parsed, n.parsed, n.authorship, n.year, n.bracket_authorship, n.bracket_year             FROM               name n             WHERE n.scientific_name=? AND n.rank=?::rank
### Cause: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.gbif.checklistbank.cli.importer.Importer.run(Importer.java:142)
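
The error message says that a query parameter contained a NUL character (0x00), which PostgreSQL does not accept in text values. Where the NUL came from is not known from this trace. One defensive option, sketched here in Python with hypothetical helper names, is to strip NUL characters from name strings before they are used as query parameters.

```python
from typing import Dict, Optional


def strip_nul(value: Optional[str]) -> Optional[str]:
    """Remove NUL (U+0000) characters; PostgreSQL rejects them in text values."""
    if value is None:
        return None
    return value.replace("\x00", "")


def clean_params(params: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Apply strip_nul to every string parameter of a query."""
    return {key: strip_nul(val) for key, val in params.items()}
```

Stripping silently hides the bad source data, so logging the affected record alongside the cleanup may be preferable.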

Incorrect structure of taxon name "Trochulus caelatus (S.Studer,1820) S.Studer, 1820

The correct name should be "Trochulus caelatus (S.Studer, 1820)", since a) recombinations in zoology do not get the combination author appended, and b) the combination author would be a much more recent person/event in any case. The cited source (Fauna Europaea, http://www.faunaeur.org/full_results.php?id=428061) does not contain that duplication. A name generation artifact?

From gbif/portal-feedback#93

Reported by: @ahahn-gbif
Referer: https://demo.gbif.org/species/5780344

Species of "spec." incorrectly placed in the backbone

There exists a species in the backbone as follows

{
  "key": 7011685,
  "nubKey": 7011685,
  "nameKey": 7526463,
  "taxonID": "gbif:7011685",
  "sourceTaxonKey": 108205698,
  "kingdom": "Animalia",
  "phylum": "Platyhelminthes",
  "order": "Cyclophyllidea",
  "family": "Paruterinidae",
  "genus": "Neyraia",
  "species": "spec.",
  "canonicalName": "spec."
}

In addition, there are 335,424 entries in the nub_rel pointing to this entry.

There are 2 issues:

  1. We need to ensure that blacklisted epithets like spec. do not make it into the nub
  2. We need to remove those nub_rel entries

The nub_rel entries cause a serious problem because the http://api.gbif.org/v1/species/7011685/related?limit=10&offset=0 call triggers a query whereby PG loads 22GB of data into temporary tables when computing the query. This causes significant IOWait on the machine, and triggers a whole slew of subsequent issues.

For the time being http://api.gbif.org/v1/species/7011685/related* is blocked at the Varnish cache.
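
For the first point, a guard of the following kind could reject such records before they reach the nub. This is a Python sketch only; the blacklist contents beyond "spec." and the function name are assumptions, and the field names are taken from the JSON record above.

```python
# Hypothetical blacklist: only "spec." comes from the record above,
# the other entries are assumed common placeholder epithets.
BLACKLISTED_EPITHETS = {"spec.", "spec", "sp.", "sp"}


def has_blacklisted_epithet(usage: dict) -> bool:
    """True if the species or canonical name is just a placeholder epithet."""
    for field in ("species", "canonicalName"):
        value = (usage.get(field) or "").strip().lower()
        if value in BLACKLISTED_EPITHETS:
            return True
    return False
```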

Checklistbank-ws leaks DB connections?

I found a lot of these in ELK

java.lang.Exception: Apparent connection leak detected
at org.apache.ibatis.transaction.jdbc.JdbcTransaction.openConnection(JdbcTransaction.java:138)
at org.apache.ibatis.transaction.jdbc.JdbcTransaction.getConnection(JdbcTransaction.java:60)
at org.apache.ibatis.executor.BaseExecutor.getConnection(BaseExecutor.java:336)
at org.apache.ibatis.executor.SimpleExecutor.prepareStatement(SimpleExecutor.java:84)
at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:62)
at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:324)
at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:156)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:109)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:83)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:148)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:141)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:77)
at sun.reflect.GeneratedMethodAccessor128.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.ibatis.session.SqlSessionManager$SqlSessionInterceptor.invoke(SqlSessionManager.java:357)
at com.sun.proxy.$Proxy38.selectOne(Unknown Source)
at org.apache.ibatis.session.SqlSessionManager.selectOne(SqlSessionManager.java:166)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:82)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
at com.sun.proxy.$Proxy85.count(Unknown Source)
at org.gbif.checklistbank.ws.resources.SpeciesSitemapResource.sitemapIndex(SpeciesSitemapResource.java:75)
at sun.reflect.GeneratedMethodAccessor573.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:518)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Thread.java:748)

This seemingly resulted in a lot of PG connections acquired by CLB and brought down the registry services on the same DB which all reported timeouts getting connections.

I think we need to find the cause of this leak, and also consider dropping pool sizes to not starve other apps.

Capture unmatched occurrence species as unreviewed backbone taxa

GBIF has quite a few names in occurrences that are not covered by the backbone and end up either matched to the genus or in the worst case as entirely unmatched.

Many of these names are well accepted names. Most very recently published names are not present in the backbone, but get published through occurrences already.

It would be good if these missing taxa could be added to the backbone while indexing occurrences, marked as unreviewed to clearly flag them. We kind of do so already by including cleaned names from type specimens found in the GBIF occurrences (see blog post).

This will open the door to dirty names and larger pollution of the backbone. The omnipresent typos and spelling variations of names in occurrences should mostly be captured by the fuzzy matching of occurrence names, though, and would therefore only enter the backbone in a few cases.

An example being the bird records of recently changed duck species.
E.g. Mareca strepera in our backbone and formerly known as Anas strepera:
https://www.gbif.org/occurrence/search?taxon_key=2498314

When adding new occurrence based species to the backbone we should also try to discover their basionym and synonymize them if possible just as we do for the rest of the backbone species. E.g. the species epithet strepera within the family Anatidae is unique: https://www.gbif.org/species/search?q=strepera&rank=SPECIES&highertaxon_key=2986

Trochulus caelatus: verify whether present again in the new backbone version

via mail, Nov.16:
http://www.gbif.org/species/5780344, until some time in 2016, related to the Type specimen NMBE-WL-15419 of the Naturhistorischen Museums Bern. From late 2016, the taxon was no longer available, though it is contained in e.g. http://www.faunaeur.org/full_results.php?id=428061 and
http://www.iucnredlist.org/details/22108/0. The still-current GBIF interpretation (late 2016) lists Trochulus caelatus as a marine (!) species Trochus caelatus Gmelin, 1791, in the wrong group of organisms (Turbinidae). The issue comments this as "Taxon match fuzzy".

Check: does the taxon and genus reappear in the new version?

Author name variation considered synonyms

I notice some names are considered synonyms, but the only difference seems to be a slight variation in spelling of the author name:

Mesembryanthemum cordifolium L. f. http://www.gbif.org/species/3084878 SYNONYM
Mesembryanthemum cordifolium L. fil. http://www.gbif.org/species/8185812 ACCEPTED

Delosperma cooperi (Hook. f.) L. Bolus http://www.gbif.org/species/7538678 SYNONYM
Delosperma cooperi (Hook. fil.) L. Bol. http://www.gbif.org/species/8078333 ACCEPTED

Are those truly synonyms?
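
If the two strings differ only in how the author is abbreviated, a normalization step before comparison would treat them as the same name. The sketch below is Python with a hypothetical function name and covers just the "f." versus "fil." (filius) variation and spacing. It is an assumption about what such a step could look like, not the backbone's actual author comparison.

```python
import re


def normalize_authorship(authorship: str) -> str:
    """Reduce an authorship string to a comparison key."""
    key = authorship.strip()
    key = re.sub(r"\bfil\.", "f.", key)  # "L. fil." -> "L. f."
    key = re.sub(r"\s+", "", key)        # ignore spacing differences
    return key.lower()
```

With this key, "L. f." and "L. fil." compare equal, while "(Hook. f.) L. Bolus" and "(Hook. fil.) L. Bol." still differ, because resolving "Bolus" versus "Bol." needs a lookup of standard author abbreviations.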

Maybe too strict author matching

The following names match in strict=false but not strict=true because of slight author differences. Some might be considered valid differences.

  • Eragrostis brownii (Kunth) Nees ex Wight | Eragrostis brownii (Kunth) Nees
  • Linaria bipartita (Vent.) Desf. | Linaria bipartita (Vent.) Willd.
  • Modiola caroliniana (L.) G. Don f. | Modiola caroliniana (L.) G. Don fil.
  • Malva setigera F.K. Schimp. et Spenn. | Malva setigera K.F. Schimp. et Spenn.
  • Nicotiana langsdorfii J.A. Weinm. | Nicotiana langsdorfii Weinm.
  • Nonea lutea (Desr.) DC. ex Lam. et DC. | Nonea lutea (Desr.) DC.
  • Oenothera affinis Cambess. ex A. St. Hil. | Oenothera affinis Camb.
  • Veronica austriaca Jacq. | Veronica austriaca L.
  • Trisetaria panicea (Lam.) Maire | Trisetaria panicea (Lam.) Paunero

Review homonym merging by largest group in backbone builds

Thelia angulata stems from ZooBank with the genus Thelia being a known homonym genus already in CoL (insect & moss). The backbone build picks the largest genus (most children) in such cases and attaches the species to the wrong genus. Review this decision which probably should be removed. It would be better to add the new species to incertae sedis or even ignore it.

5190 [main] WARN org.gbif.checklistbank.nub.NubDb - 2 ambigous homonyms encountered for Thelia in source f4907135-435a-4adf-a48f-5c4a43311ec3, picking largest taxon

See ignored new unit test in 5add87a
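
The difference between the current behaviour and the more cautious alternative can be sketched as two selection policies. This is Python for illustration only. `GenusCandidate`, both function names, and the idea of using the source's kingdom as a hint are assumptions, not the NubDb implementation. Returning `None` stands for "place in incertae sedis or ignore".

```python
from typing import List, NamedTuple, Optional


class GenusCandidate(NamedTuple):
    key: int
    kingdom: str
    num_children: int


def pick_largest(candidates: List[GenusCandidate]) -> GenusCandidate:
    """Behaviour described above: take the homonym with the most children."""
    return max(candidates, key=lambda c: c.num_children)


def pick_conservative(
    candidates: List[GenusCandidate], source_kingdom: Optional[str]
) -> Optional[GenusCandidate]:
    """Only attach when exactly one homonym fits the source; else give up."""
    if len(candidates) == 1:
        return candidates[0]
    if source_kingdom is not None:
        fitting = [c for c in candidates if c.kingdom == source_kingdom]
        if len(fitting) == 1:
            return fitting[0]
    return None
```

For a zoological source such as ZooBank, passing "Animalia" as the hint would select the insect genus even when the moss genus has more children.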

Infraspecific names matched to incorrect infraspecific rank in strict mode

The following subspecies/variety plant names are all matched to the incorrect infraspecific rank (e.g. variety to subspecies or vice versa):

Carduus nutans L. subsp. leiophyllus (Petrovič) Arènes
Digitaria ciliaris (Retz.) Koeler subsp. ciliaris
Digitaria sanguinalis (L.) Scop. subsp. sanguinalis
Echinochloa muricata (Beauv.) Fernald subsp. muricata
Galanthus nivalis L. var. nivalis
Juncus tenuis Willd. subsp. tenuis
Mentha x piperita L. nothosubsp. piperita
Papaver somniferum L. subsp. somniferum
Physalis longifolia Nutt. subsp. subglabrata (Mack. et Bush) Cronquist
Salix purpurea L. var. purpurea
Schkuhria pinnata (Lam.) O. Kuntze ex Thell. var. pinnata
Scorpiurus muricatus L. subsp. muricatus
Vicia narbonensis L. subsp. narbonensis

Example: http://api.gbif.org/v1/species/match?verbose=False&strict=True&kingdom=Plantae&name=Vicia%20narbonensis%20L.%20subsp.%20narbonensis
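
A client-side check for this kind of mismatch could compare the rank marker in the submitted name with the rank of the returned match. The Python sketch below is hedged: the function names are hypothetical, only the markers that occur in the list above are handled, and mapping "nothosubsp." to SUBSPECIES is an assumption.

```python
import re
from typing import Optional

RANK_MARKERS = {
    "subsp.": "SUBSPECIES",
    "nothosubsp.": "SUBSPECIES",  # assumption: hybrid subspecies treated as subspecies
    "var.": "VARIETY",
}

_MARKER_RE = re.compile(r"\b(nothosubsp\.|subsp\.|var\.)(?=\s)")


def rank_from_name(name: str) -> Optional[str]:
    """Rank implied by the first infraspecific rank marker in the name."""
    m = _MARKER_RE.search(name)
    return RANK_MARKERS[m.group(1)] if m else None


def rank_mismatch(name: str, matched_rank: str) -> bool:
    """True if the name carries a rank marker that disagrees with the match."""
    expected = rank_from_name(name)
    return expected is not None and expected != matched_rank
```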

Extend SciNameNormalizer with rules from zoological code

SciNameNormalizer allows for some fuzzy matching by normalizing a name string according to very specific rules, including gender suffix changes.

A recent Taxacom thread on the wasp spider cites ICZN rules for dealing with common spelling corrections in prevailing usage. These include:

  • Article 32.5.2.1 an umlaut in a German word (or a word about which there is doubt as to its provenance) is to be deleted and an -e- is to be added.
  • Article 33.4. Use of -i for -ii and vice versa, and other alternative spellings, in subsequent spellings of species-group names. The use of the genitive ending -i in a subsequent spelling of a species-group name that is a genitive based upon a personal name in which the correct original spelling ends with -ii, or vice versa, is deemed to be an incorrect subsequent spelling, even if the change in spelling is deliberate; the same rule applies to the endings -ae and -iae, -orum and -iorum, and -arum and -iarum.

Spelling variants:

  • Argiope bruennichi
  • Argiope bruennichii
  • Argiope brunnichi
  • Argiope brunnichii
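
The two articles translate into simple string rules. The following Python sketch is not the SciNameNormalizer code. The function names are hypothetical, and it assumes a canonical name without authorship, so that the last word is the species-group epithet. It produces a matching key rather than a corrected spelling, which means epithets such as "mariae" are also shortened.

```python
# Article 32.5.2.1: delete the umlaut and add an "e"
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

# Article 33.4: treat -ii/-i, -iae/-ae, -iorum/-orum, -iarum/-arum as equal
_ENDINGS = (("iorum", "orum"), ("iarum", "arum"), ("iae", "ae"), ("ii", "i"))


def normalize_epithet(epithet: str) -> str:
    """Matching key for a species-group epithet."""
    key = epithet.lower().translate(_UMLAUTS)
    for long_form, short_form in _ENDINGS:
        if key.endswith(long_form):
            return key[: -len(long_form)] + short_form
    return key


def normalize_name(canonical_name: str) -> str:
    """Normalize only the last word of a multi-word canonical name."""
    parts = canonical_name.split()
    if len(parts) < 2:
        return canonical_name
    return " ".join(parts[:-1] + [normalize_epithet(parts[-1])])
```

These rules collapse "bruennichii" with "bruennichi" and "brunnichii" with "brunnichi". The "ue" and "u" spellings stay distinct and would need an additional rule.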

canonicalNames in taxon.txt

Feature request:

For the next version it would be nice if the canonicalName were split off from the scientificName, as it was in 2013. I have a hairy regular expression that seems to work most of the time (167 problem cases remaining; I may be able to tackle most of them), but having clients worry about this is unfortunate, given that checklistbank had its hands on this information at one point (if I understand correctly)...

Synonym chosen over doubtful?

Ambrosia artemisiifolia matches to the SYNONYM Ambrosia artemisiifolia Besser: http://www.gbif.org/species/3110596.

http://api.gbif.org/v1/species/match?verbose=False&strict=true&kingdom=Plantae&name=Ambrosia%20artemisiifolia

Rather than the DOUBTFUL Ambrosia artemisiifolia L.: http://www.gbif.org/species/8002952

According to The Plant List, the "Besser" name is illegitimate, while "L." is accepted and thus more trustworthy. The current matching will link Ambrosia artemisiifolia to Ambrosia polystachya DC., which in most cases (except "Besser") is wrong.

What is the ruleset for matching names without authors? Take SYNONYM over DOUBTFUL?

v2 "alpha" species/match API

To support the new occurrence ingestion pipelines [and in particular see issue 7] we'd like to revise the species/match API to reduce the possibility of mistakes by callers. In particular, issues exist around the handling of synonyms, especially where that spans species and subspecific ranks.

The design goals are to:

  1. Return a single result that should contain all keys and strings that can be copied verbatim into e.g. an interpreted occurrence view
  2. Enable the ability to add new ranks in the future without changing the API response format
  3. Make clearer separation of nomenclature and taxon concept
  4. Expose information explaining the reasoning behind the resulting match
  5. Raise the profile of nomenclature databases in their role in this process. Exposing source nomenclature IDs (e.g. an IPNI LSID) enables reporting and search of occurrences using their IDs.

An initial response format for a Felis concolor conccolor match might be:

{
  "synonym": true,
  "confidence": 98,

  // this is a synonym, so this effectively documents the 'name' and not the taxon concept
  "usage": {
    "key": 6164627, "rank": "SUBSPECIES", "name": "Felis concolor concolor (Kerr 1871)"
  },

  // the following documents the accepted taxon concept (it may repeat the above)
  "acceptedUsage": {
    "key": 7193927, "rank": "SUBSPECIES", "name": "Puma concolor subsp. concolor (Tim 1872)"
  },

  // holds the nomenclature ID (IPNI, IF, ZK etc)
  "nomenclature": {
    "source": "IPNI", "id": "urn:lsid:ipni.org:names:1234-1"  // obviously bogus here
  },

  // the classification includes the accepted taxon concept view
  "classification": [
    {"key": 1, "rank": "KINGDOM", "name": "Animalia"},
    {"key": 44, "rank": "PHYLUM", "name": "Chordata"},
    {"key": 359, "rank": "CLASS", "name": "Mammalia"},
    {"key": 732, "rank": "ORDER", "name": "Carnivora"},
    {"key": 9703, "rank": "FAMILY", "name": "Felidae"},
    {"key": 2435098, "rank": "GENUS", "name": "Puma Jardin 1834"},
    {"key": 2435099, "rank": "SPECIES", "name": "Puma concolor (Linneaus 1771)"},
    {"key": 7193927, "rank": "SUBSPECIES", "name": "Puma concolor subsp. concolor"}
  ],

  // intended for anyone seeking clarity on why it arrived at the result (e.g. debugging)
  "lineage": [
    "Fuzzily matched [Felis concolor subsp. conccolor] to [Felis concolor subsp. concolor]",
    "Synonym [Felis concolor subsp. concolor] to [Puma concolor subsp. concolor] provided by [Catalogue of Life]",
    "No potential homonyms detected in ancestry"
  ],

  // intended for "internal users"
  "diagnostics": {
    "status": "SYNONYM",
    "matchType": "EXACT",
    "note": "Similarity: name=110; authorship=0; classification=-2; rank=5; status=0;singleMatch=5"
  }
}
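
To illustrate design goals 1 and 2, a caller could turn such a response into an interpreted record roughly as follows. This is a Python sketch against the proposed format only. The function name and the output field names are hypothetical, and the response is assumed to be parsed already, with the comments removed.

```python
def interpreted_taxon(match: dict) -> dict:
    """Pick the accepted taxon concept and flatten the classification."""
    # acceptedUsage documents the taxon concept; fall back to usage if absent
    usage = match.get("acceptedUsage") or match["usage"]
    return {
        "taxonKey": usage["key"],
        "rank": usage["rank"],
        "scientificName": usage["name"],
        "verbatimNameKey": match["usage"]["key"],
        # iterate the list instead of reading fixed per-rank fields,
        # so ranks added later need no client change
        "classification": {c["rank"]: c["key"] for c in match.get("classification", [])},
    }
```

Because the classification is a list of rank entries, a new rank appears as one more dictionary key and does not change the shape of the result.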

Only 1 of 3 subspecies listed for Trachemys scripta

In theory Trachemys scripta Schoepff, 1792 should have 3 subspecies:

  1. Trachemys scripta subsp. troostii (Holbrook, 1836)
  2. Trachemys scripta subsp. scripta
  3. Trachemys scripta subsp. elegans (Wied, 1838)

However, only the first one is listed. The second two are directly linked to the genus (see parentKey and the fact that no speciesKey is listed):

vs.

The whole species is listed as doubtful. Any idea why these 2 subspecies are not linked to the species, but the genus? That way a search for "Trachemys scripta" does not return all subspecies occurrences.

Allow quick incremental additions to backbone

As it takes a lot to rebuild the backbone, reprocess all of our occurrences (the bulk of the work) and rebuild the solr indices, it would be good to allow for quick, small additions to the backbone.

Adding just a few names directly into postgres would not necessarily require occurrences to be reprocessed and the clb solr index could be updated for that one name too very quickly.

As our backbone data model requires a reference back to some other name usage we could offer adding small indexed checklists to the backbone by:

  • doing a strict nub match for all checklist names and only process those further that do not match
  • add missing names and potentially their implicit genus & species directly into the backbone in postgres via mybatis, thereby also updating the solr index

We could then optionally also think to:

  • reprocess all occurrences that have TAXON_MATCH_NONE or TAXON_MATCH_FUZZY
  • reprocess all name usages in clb that have TAXON_MATCH_NONE (there is no fuzzy match for checklists)
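
The first two steps can be sketched as follows in Python. `strict_match` is a hypothetical callable standing in for a strict nub match that returns a backbone key or `None`; the function names are illustrative, and canonical names without authorship are assumed.

```python
from typing import Callable, Iterable, List, Optional


def names_to_add(
    checklist_names: Iterable[str],
    strict_match: Callable[[str], Optional[int]],
) -> List[str]:
    """Keep only the names that have no strict match in the backbone."""
    return [name for name in checklist_names if strict_match(name) is None]


def implicit_parents(canonical_name: str) -> List[str]:
    """Implicit genus (and species, for infraspecific names) of a name."""
    parts = canonical_name.split()
    return [" ".join(parts[:i]) for i in range(1, min(len(parts), 3))]
```

Each implicit parent would itself need a strict match first, so that an existing genus or species is reused and not inserted twice.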

Amaranthus albus: species with and without author in backbone

Amaranthus albus L. (with author name) matches the synonym Amaranthus albus L.

http://api.gbif.org/v1/species/match?verbose=false&name=Amaranthus%20albus%20L.&strict=true

Amaranthus albus (without author) matches the accepted name Amaranthus albus L. monosepalus Thell.

http://api.gbif.org/v1/species/match?verbose=false&name=Amaranthus%20albus&strict=true

Is this intended behavior for the authorship matching? Note that L. is also part of the second name.

Elaborate Camelus ferus Przewalski, 1878 to a species on its own

Our backbone currently has the genus Camelus as:

Camelus Linnaeus, 1758
  = Camelius Bowdich, 1821
  = Camellus Molina, 1782
  = Dromedarius Gloger, 1841
  = Paracamelus Schlosser, 1903
 Camelus dromedarius Linnaeus, 1758
     = Camelus arabicus Desmoulins, 1823
     = Camelus ferus Falk, 1786
    Camelus dromedarius dromedarius
    Camelus dromedarius subsp. ferus Falk, 1786  [from ICZN]
 Camelus bactrianus Przewalski, 1878
    Camelus bactrianus bactrianus
    Camelus bactrianus ferus Przewalski, 1878
 Camelus guanicoe Müller, 1776
 Camelus ferus Przewalski, 1878

"C. bactrianus ferus" is in fact now considered a full species C. ferus (with an IUCN red list assessment: http://www.iucnredlist.org/details/63543/0).

According to Wikipedia, recent research has shown that these wild Bactrian Camels are not feral domesticated Bactrian Camels but share a more distant ancestor. Hence the name C. ferus has been elevated to be considered a third species (critically endangered), corresponding to small populations of wild two-humped camels.

Fuzzy match chosen over exact match

I have some cases where a fuzzy match is suggested instead of an exact match (issue moved here from gbif/backbone-patch#3)

Brassica rapa subsp. rapa

http://api.gbif.org/v1/species/match?verbose=false&name=Brassica%20rapa%20subsp.%20rapa&strict=true

Fuzzy matches with Brassica rapa var. laxa (Tsen & S.H.Lee) Hanelt, rather than exact matching with http://www.gbif.org/species/7225636 (an accepted name)

Petricola pholadiformis

http://api.gbif.org/v1/species/match?verbose=false&name=Petricola%20pholadiformis&strict=true

Fuzzy matches with Petricolaria pholadiformis (Lamarck, 1818) (different genus), rather than exact matching with http://www.gbif.org/species/8180054 (a synonym of the suggested name)

Hypericum maculatum maculatum

http://api.gbif.org/v1/species/match?verbose=false&name=Hypericum%20maculatum%20maculatum&strict=true

Fuzzy matches with Hypericum maculatum subsp. immaculatum (Murb.) A. Fröhlich, rather than exact matching with http://www.gbif.org/species/7330370 (an accepted variety)

I hope I'm not overlooking something this time 👓
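To make the three cases above easy to re-check, here is a minimal Python sketch that rebuilds the match URLs and flags a response that is fuzzy but does not point at the expected usage. The helper names (`match_url`, `is_unexpected_fuzzy`, `EXPECTED`) are made up for illustration, and the sample response is a hand-written placeholder, not real API output. It assumes the match response carries `matchType` and `usageKey` fields.

```python
from urllib.parse import quote, urlencode

MATCH_API = "http://api.gbif.org/v1/species/match"

# Name -> usage key the reporter expected an exact match on (taken from the issue text).
EXPECTED = {
    "Brassica rapa subsp. rapa": 7225636,
    "Petricola pholadiformis": 8180054,
    "Hypericum maculatum maculatum": 7330370,
}


def match_url(name, strict=True, verbose=False):
    """Build a species match URL; quote() encodes spaces as %20."""
    params = [
        ("verbose", str(verbose).lower()),
        ("name", name),
        ("strict", str(strict).lower()),
    ]
    return f"{MATCH_API}?{urlencode(params, quote_via=quote)}"


def is_unexpected_fuzzy(response, expected_key):
    """True if the response is a fuzzy match on a usage other than the expected one."""
    return (
        response.get("matchType") == "FUZZY"
        and response.get("usageKey") != expected_key
    )


for name in EXPECTED:
    print(match_url(name))

# Hand-written placeholder response, only to show how the check is used.
placeholder = {"matchType": "FUZZY", "usageKey": -1}
print(is_unexpected_fuzzy(placeholder, EXPECTED["Brassica rapa subsp. rapa"]))
```

Fetching each URL and passing the parsed JSON to `is_unexpected_fuzzy` would turn this into a small regression check for the cases listed.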

Counts are off - numDescendants doesn't match search counts

A lookup of http://api.gbif.org/v1/species/4 returns "numDescendants": 100630,

whereas a search for http://api.gbif.org/v1/species/search?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&highertaxonKey=4 returns "count": 131652.

The count should reflect only accepted and doubtful usages. Restricting the search accordingly brings it closer, but it is still off:
http://api.gbif.org/v1/species/search?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&highertaxonKey=4&status=ACCEPTED&status=DOUBTFUL returns "count": 100890.
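The size of the mismatch can be worked out from the figures reported above. The following is a small Python sketch, with the function name `search_url` and the variable names chosen for illustration only; it rebuilds the filtered search URL and computes the two differences without calling the API.

```python
from urllib.parse import urlencode

SEARCH_API = "http://api.gbif.org/v1/species/search"
BACKBONE_KEY = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"


def search_url(higher_taxon_key, statuses=()):
    """Build a backbone search URL, repeating the status parameter per value."""
    params = [("datasetKey", BACKBONE_KEY), ("highertaxonKey", higher_taxon_key)]
    params += [("status", status) for status in statuses]
    return f"{SEARCH_API}?{urlencode(params)}"


# Figures copied from the report above.
num_descendants = 100630
count_unfiltered = 131652
count_accepted_doubtful = 100890

print(search_url(4, ["ACCEPTED", "DOUBTFUL"]))
print(count_unfiltered - num_descendants)         # 31022
print(count_accepted_doubtful - num_descendants)  # 260
```

So even with the status filter applied, the search count exceeds numDescendants by 260 usages.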

clb importer Failed to close neo4j

This happens at the end of importing.
Imports succeed and the final message is sent, but the error is still worrisome:

INFO  [2017-08-28 14:40:47,724] [clb-importer-1] org.gbif.checklistbank.cli.importer.Importer [] Importing succeeded. 7 main, 0 subtree chunk and 0 pro parte usages synced
ERROR [2017-08-28 14:40:47,779] [clb-importer-1] org.gbif.checklistbank.neo.UsageDao [] Failed to close neo4j /home/crap/neo/3f438ffd-0acb-4805-85cb-096bf46c56f7
org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.NeoStoreDataSource$5@624c9a19' failed to transition from stopped to shutting_down. Please see the attached cause exception "Can't mark read only index.".
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:496)
	at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
	at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:895)
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:457)
	at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
	at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
	at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:457)
	at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
	at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
	at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
	at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:158)
	at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:365)
	at org.gbif.checklistbank.neo.UsageDao.closeNeo(UsageDao.java:321)
	at org.gbif.checklistbank.neo.UsageDao.close(UsageDao.java:297)
	at org.gbif.checklistbank.cli.importer.Importer.run(Importer.java:146)
	at org.gbif.checklistbank.cli.importer.ImporterService.process(ImporterService.java:63)
	at org.gbif.checklistbank.cli.importer.ImporterService.process(ImporterService.java:28)
	at org.gbif.checklistbank.cli.common.RabbitDatasetService.handleMessage(RabbitDatasetService.java:54)
	at org.gbif.checklistbank.cli.common.RabbitDatasetService.handleMessage(RabbitDatasetService.java:22)
	at org.gbif.common.messaging.MessageConsumer.handleCallback(MessageConsumer.java:101)
	at org.gbif.common.messaging.MessageConsumer.handleDelivery(MessageConsumer.java:65)
	at com.rabbitmq.client.impl.ConsumerDispatcher$4.run(ConsumerDispatcher.java:121)
	at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:76)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.kernel.impl.store.UnderlyingStorageException: Unable to force ContractCheckingIndexProxy -> OnlineIndexProxy[accessor:org.neo4j.kernel.api.impl.schema.LuceneIndexAccessor@ac5f9f8, descriptor::label[0](property[2])]
	at org.neo4j.kernel.impl.api.index.IndexingService.lambda$forceIndexProxy$3(IndexingService.java:709)
	at org.neo4j.kernel.impl.api.index.IndexMap.forEachIndexProxy(IndexMap.java:81)
	at org.neo4j.kernel.impl.api.index.IndexingService.forceAll(IndexingService.java:693)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:476)
	at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:202)
	at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:114)
	at org.neo4j.kernel.NeoStoreDataSource$5.shutdown(NeoStoreDataSource.java:929)
	at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:488)
	... 26 common frames omitted
Caused by: java.lang.UnsupportedOperationException: Can't mark read only index.
	at org.neo4j.kernel.api.impl.schema.ReadOnlyDatabaseSchemaIndex.markAsOnline(ReadOnlyDatabaseSchemaIndex.java:93)
	at org.neo4j.kernel.api.impl.schema.LuceneIndexAccessor.force(LuceneIndexAccessor.java:81)
	at org.neo4j.kernel.impl.api.index.OnlineIndexProxy.force(OnlineIndexProxy.java:135)
	at org.neo4j.kernel.impl.api.index.AbstractDelegatingIndexProxy.force(AbstractDelegatingIndexProxy.java:84)
	at org.neo4j.kernel.impl.api.index.ContractCheckingIndexProxy.force(ContractCheckingIndexProxy.java:125)
	at org.neo4j.kernel.impl.api.index.IndexingService.lambda$forceIndexProxy$3(IndexingService.java:702)
	... 33 common frames omitted
INFO  [2017-08-28 14:40:47,780] [clb-importer-1] org.gbif.checklistbank.cli.importer.Importer [] Neo database shut down.
INFO  [2017-08-28 14:40:47,788] [clb-importer-1] org.gbif.checklistbank.cli.common.RabbitDatasetService [] Sending ChecklistSyncedMessage for dataset 3f438ffd-0acb-4805-85cb-096bf46c56f7

Filter backbone extension data

We want to exclude bad content from our backbone API, e.g. Wikipedia images or descriptions should not be exposed on species pages and thus also not on the species API. All content should still be visible through the respective dataset.

For each of the extension classes, consider either a positive inclusion filter by dataset or publisher, or a negative exclusion filter:

  • Description
  • Distribution
  • Identifier
  • Multimedia
  • References
  • Types
  • Vernacular Names
  • Species Profile
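One possible shape for such a filter, as a rough Python sketch rather than a proposal for the actual implementation: each extension class gets an optional allow-list and a deny-list of dataset keys. All names here (`ExtensionFilter`, `FILTERS`, `filter_for_backbone`) and the dataset keys are hypothetical placeholders, and filtering by publisher is left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ExtensionFilter:
    # None means "no allow-list": every dataset passes unless excluded.
    include: Optional[FrozenSet[str]] = None
    exclude: FrozenSet[str] = frozenset()

    def allows(self, dataset_key: str) -> bool:
        if dataset_key in self.exclude:
            return False
        return self.include is None or dataset_key in self.include


# Hypothetical configuration; the dataset keys are placeholders.
FILTERS = {
    "Description": ExtensionFilter(exclude=frozenset({"wikipedia-dataset-key"})),
    "Multimedia": ExtensionFilter(exclude=frozenset({"wikipedia-dataset-key"})),
    "Vernacular Names": ExtensionFilter(include=frozenset({"trusted-dataset-key"})),
}


def filter_for_backbone(extension, records):
    """Keep only records whose source dataset passes the extension's filter."""
    rule = FILTERS.get(extension, ExtensionFilter())
    return [r for r in records if rule.allows(r["datasetKey"])]


records = [
    {"datasetKey": "wikipedia-dataset-key", "value": "a"},
    {"datasetKey": "trusted-dataset-key", "value": "b"},
]
print(filter_for_backbone("Description", records))
```

Applying the filter only when serving backbone usages would keep the same records visible through their source dataset, as required above.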

What happened to Dinophyta?

Not sure if this is the place to ask about this, but...

I can't find either Dinophyta = Dinoflagellata = Dinokaryota or Dinophyceae. In GBIF 2013 they were 109 and 201 respectively. They may have disintegrated, with their families put directly under Chromista, as suggested by Gymnodiniaceae http://www.gbif.org/species/7476073.

Here's a sample genus from this group in IRMNG: http://www.irmng.org/aphia.php?p=taxdetails&id=1324911
I can't find any of these things (indeed any alveolates at all) in CoL... tried searching for about 10 genus names and failed every time. GBIF 109 came from Dinophyta in CoL 2011. So maybe this is a problem with CoL?

Picea mariana lacks author in backbone

Catalogue of Life has the black spruce listed as "Picea mariana (Mill.) Britton & et al.", which we should have as the accepted name in the backbone:
http://www.gbif.org/species/112582184
http://www.catalogueoflife.org/col/details/species/id/9f203c75c68d2023f85c160f8082b5d5

Instead we have an authorless "Picea mariana" and then a synonym "Picea mariana (P. Mill.) B.S.P." from IRMNG, which is the same authorship as (Mill.) Britton & et al. or (Mill.) Britton, Sterns & Poggenburg: https://en.wikipedia.org/wiki/Picea_mariana

Infraspecific names matched to species in strict mode

The following infraspecific plant names are all matched to rank SPECIES:

Dinebra decipiens (R. Brown) P.M. Peterson & N. Snow subsp. decipiens
Dinebra panicea (Retz.) P.M. Peterson & N. Snow var. brachiata (Steud.) P.M. Peterson & N. Snow
Mentha x piperita L. nsubsp. piperita
Oxybasis glauca (L.) Fuentes & al. var. salina (Standley) Verloove
Salix x sepulcralis Simonk. nvar. chrysocoma (Dode) Meikle

Example:

http://api.gbif.org/v1/species/match?verbose=False&strict=True&kingdom=Plantae&name=Dinebra%20decipiens%20(R.%20Brown)%20P.M.%20Peterson%20&%20N.%20Snow%20subsp.%20decipiens
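One thing worth ruling out for this example: the URL contains a literal `&` inside the name, which a standard query-string parser treats as a parameter separator. The Python sketch below shows what such a parser sees and how to percent-encode the name instead. The helper names `name_param` and `encoded_match_url` are illustrative, and how the GBIF server itself parses the URL is not verified here. Not all of the listed names contain an ampersand, so this cannot explain every case.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

REPORTED_URL = (
    "http://api.gbif.org/v1/species/match?verbose=False&strict=True"
    "&kingdom=Plantae&name=Dinebra%20decipiens%20(R.%20Brown)%20P.M.%20Peterson"
    "%20&%20N.%20Snow%20subsp.%20decipiens"
)
FULL_NAME = "Dinebra decipiens (R. Brown) P.M. Peterson & N. Snow subsp. decipiens"


def name_param(url):
    """Return the name parameter as a standard query-string parser reads it."""
    return parse_qs(urlsplit(url).query)["name"][0]


def encoded_match_url(name):
    """Build the match URL with the name fully percent-encoded (& becomes %26)."""
    params = [
        ("verbose", "false"),
        ("strict", "true"),
        ("kingdom", "Plantae"),
        ("name", name),
    ]
    query = urlencode(params, quote_via=quote)
    return f"http://api.gbif.org/v1/species/match?{query}"


print(repr(name_param(REPORTED_URL)))  # name is cut off before the ampersand
print(encoded_match_url(FULL_NAME))
print(name_param(encoded_match_url(FULL_NAME)) == FULL_NAME)
```

With the reported URL, the parsed name ends at "P.M. Peterson", so the infraspecific epithet never reaches the parser's `name` value; with the encoded URL the full name round-trips.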
