taxonomic's People

Contributors

afuchs1, claire0212, jingli201802, joshuatrevor, superfeone, tarasom123, tyraeldlee

taxonomic's Issues

Finish mapping original output to TNC

There are four sheets in total.
TNC_TaxonomicName is almost finished. To do:

  • fullNameWithAuthorship needs to import the article author (when no taxonomic author is mentioned) from Tarasom's functions.
  • Find subspecies and common names.

TNC_TaxonomicNameUsage and TNC_BibliographicReference are in progress.

Correct "kindOfNameUsage" column

Instead of describing whether the name is scientific or common, this column should describe whether the name is an instance of sp. nov., comb. nov., gen. nov. etc.
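A minimal sketch of how the marker could be detected, assuming the marker appears in the text next to the name. The function name and the returned labels are illustrative, not part of the existing code, and only the three forms named above are covered.

```python
import re

# Matches "sp. nov.", "comb. nov", "gen.nov." and similar spacing/punctuation variants.
_NOVELTY = re.compile(r"\b(sp|comb|gen)\.?\s*nov\b\.?", re.IGNORECASE)


def kind_of_name_usage(text):
    """Return 'sp. nov.', 'comb. nov.' or 'gen. nov.', or None if no marker is found."""
    match = _NOVELTY.search(text)
    if match is None:
        return None
    # Normalise to one spelling so the CSV column stays consistent.
    return f"{match.group(1).lower()}. nov."
```

Rows for which this returns None would still need a rule of their own (for example, leaving the column empty).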

Constrain type detection

Currently, typification.csv records too much information and accidentally captures some type descriptions. Each type should be described in about one sentence, and this sentence should contain the gender, location and identifier.

Remap output to TNU

The client has specified an output format which uses four classes. To fit this format, the PDF output needs to be in the form of four CSVs.

Definition of done:
Executing the PDF extraction code will produce a folder with four CSVs, each CSV containing the list of instances of a particular class in the TNU schema.
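One way the final write step could look, as a sketch only. It assumes the extraction code has already collected the instances of each class as a list of dictionaries; write_class_csvs and the shape of its tables argument are hypothetical.

```python
import csv
from pathlib import Path


def write_class_csvs(output_dir, tables):
    """Write one CSV per class; `tables` maps a class name to a list of row dicts."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for class_name, rows in tables.items():
        # Header = union of all keys, in first-seen order.
        fieldnames = list(dict.fromkeys(key for row in rows for key in row))
        path = out / f"{class_name}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)  # missing keys are written as empty cells
        written.append(path)
    return written
```

Calling it with four keys (for example TaxonomicName, TaxonomicNameUsage, BibliographicReference and typification, names assumed here) produces the folder of four CSVs described above.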

Scrape citethisforme output

From the response generated by citethisforme, grab the appropriate fields from the page source and translate these into a dictionary which can be used to populate bibliography.csv.

server

Use Flask to provide a UI that accepts PDF input or XML input and returns output accordingly.

Correct TNU IDs

TNU IDs are currently generated in the TNU CSV. This is incorrect: they should be generated in TaxonomicName.csv and then passed on to the TNU CSV and typification.csv respectively.
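A sketch of the intended direction of flow, under assumptions: IDs are minted once while building the TaxonomicName rows and then copied onto dependent rows. The ID format, the column names (taxonomicName, taxonomicNameId) and both function names are hypothetical.

```python
import itertools


def assign_name_ids(names, prefix="TN"):
    """Give each distinct taxonomic name one ID; returns {name: id}."""
    counter = itertools.count(1)
    ids = {}
    for name in names:
        if name not in ids:  # repeated names reuse the ID minted the first time
            ids[name] = f"{prefix}{next(counter)}"
    return ids


def attach_name_ids(rows, name_ids, name_key="taxonomicName"):
    """Copy the ID from the TaxonomicName table onto TNU or typification rows."""
    return [{**row, "taxonomicNameId": name_ids[row[name_key]]} for row in rows]
```

Because the dependent tables only look IDs up, a name can never receive different IDs in different CSVs.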

Add clean error handling for webservice failure

If either citethisforme or gnparser cannot be accessed, the client must be made aware of this:

A) So they can tell that the issue is not an error in our program.
B) So that, if the issue persists, they can consult our documentation to understand the nature of the problem and possible solutions. (It is possible that, instead of replacing the webservice, our interface with that service just has to be adapted to changes they have made.)
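A possible shape for this, sketched with the standard library only. WebServiceUnavailable and fetch are hypothetical names, and the real code may use a different HTTP client; the point is that every network failure is converted into one clearly worded error naming the service.

```python
import urllib.request


class WebServiceUnavailable(RuntimeError):
    """Raised when an external service (e.g. citethisforme, gnparser) cannot be reached."""


def fetch(service_name, url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body, or raise WebServiceUnavailable with a readable message."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError, timeouts and connection errors are all OSError subclasses.
        raise WebServiceUnavailable(
            f"{service_name} could not be reached at {url} ({exc}). "
            "This is a problem with the external service or the network, not with "
            "the extraction itself. If it persists, see the documentation."
        ) from exc
```

The caller (command line or UI) can then catch WebServiceUnavailable in one place and show the message to the client instead of a traceback.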

Implement support for multiple name authors

Currently, even though the authorship value in GNParser's JSON output is always a list, only the first value is used (usually there is only one value).

Definition of done:
The authorship list in GNParser's output is concatenated into a single string and this string is inserted into the final CSV output of the program.
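A minimal sketch of the concatenation step. It assumes the author names have already been pulled out of the parser output as a list of strings; the separator style ("A, B & C") is an assumption, since the issue only asks for a single string.

```python
def join_authors(authors):
    """Concatenate a list of author names into one string for the CSV."""
    cleaned = [a.strip() for a in authors if a and a.strip()]
    if len(cleaned) <= 1:
        return "".join(cleaned)  # "" for no authors, the bare name for one
    return ", ".join(cleaned[:-1]) + " & " + cleaned[-1]
```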

Improve name detection

Name detection can be improved in two significant ways:

  • By accounting for punctuation in the name which should tell the program that certain words don't belong, e.g. capital letters where there shouldn't be any, unclosed brackets and misplaced full stops.

  • By checking whether GNParser returns an "unparsed tail" string which contains words adjacent to the sp./gen./comb. Because these words can be assumed to be in the name, this error means that too many words are included in front of the name (so remove the front word and try again).
    Since GNParser is a webservice, it is important to do these checks in rounds rather than one by one, as checking one by one would increase the latency immensely and may also cause gnparser to flag the IP of the program's machine as malicious.
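The round-based retry could be sketched as follows. This is not the existing implementation: parse_batch is a hypothetical callable standing in for one batched GNParser request, taking a list of candidate strings and returning, in the same order, the unparsed tail for each (empty when the whole string parsed).

```python
def refine_names(candidates, parse_batch, max_rounds=5):
    """Trim leading words from candidate names, using one batched request per round."""
    words = [candidate.split() for candidate in candidates]
    pending = list(range(len(words)))  # indexes that still need checking
    for _ in range(max_rounds):
        if not pending:
            break
        tails = parse_batch([" ".join(words[i]) for i in pending])
        still_pending = []
        for i, tail in zip(pending, tails):
            # A leftover tail suggests an extra word in front: drop it and retry next round.
            if tail and len(words[i]) > 1:
                words[i] = words[i][1:]
                still_pending.append(i)
        pending = still_pending
    return [" ".join(w) for w in words]
```

The number of requests is bounded by max_rounds regardless of how many candidates there are, which is what keeps latency and request volume down.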

Change coordinate scope

Include coordinates which are described using decimals instead of DMS, and include more room for variation of spacing, punctuation etc. in the regex.
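A sketch of what the widened patterns might look like. The exact punctuation variants found in the source PDFs are not known here, so the character classes below are assumptions; decimal values are only accepted when followed by a degree sign, to avoid matching ordinary measurements.

```python
import re

# Degrees-minutes-seconds with flexible spacing; seconds are optional.
_DMS = re.compile(
    r"(\d{1,3})\s*[°º]\s*(\d{1,2})\s*['′’]\s*"
    r"(?:(\d{1,2}(?:\.\d+)?)\s*(?:\x22|″|”|'')\s*)?([NSEW])"
)
# Decimal degrees such as "33.5°S" or "151.25° E"; the hemisphere letter is optional.
_DECIMAL = re.compile(r"(-?\d{1,3}\.\d+)\s*[°º]\s*([NSEW]\b)?")


def find_coordinates(text):
    """Return (decimal_degrees, matched_text) pairs in order of appearance."""
    found = []
    for m in _DMS.finditer(text):
        degrees, minutes, seconds, hemi = m.groups()
        value = int(degrees) + int(minutes) / 60 + float(seconds or 0) / 3600
        found.append((m.start(), -value if hemi in "SW" else value, m.group(0)))
    # Blank out DMS matches so their parts are not re-read as decimal degrees.
    remainder = _DMS.sub(lambda m: " " * len(m.group(0)), text)
    for m in _DECIMAL.finditer(remainder):
        value = float(m.group(1))
        if m.group(2) in ("S", "W"):
            value = -abs(value)
        found.append((m.start(), value, m.group(0).strip()))
    found.sort(key=lambda item: item[0])
    return [(value, raw) for _, value, raw in found]
```

South and west values are returned as negative numbers so both notations end up in one comparable form.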

Correct spacing issues caused by PyPDF2

Currently, PyPDF2 (the library used to convert PDFs to strings) adds many unnecessary and unpredictable line breaks which make it difficult to parse important information (e.g. references).

Possible solutions include:

  • Replacing PyPDF2 with a more suitable tool
  • Attempting to correct the line breaks manually
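For the second option, a heuristic sketch is shown below. It assumes that blank lines in the extracted text mark real paragraph breaks and that every other newline is spurious, which may not hold for every PDF; repair_line_breaks is a hypothetical helper.

```python
import re


def repair_line_breaks(text):
    """Heuristically undo spurious line breaks in text extracted from a PDF."""
    # Rejoin words hyphenated across a line break: "spe-\ncies" -> "species".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines so genuine paragraph breaks survive.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, treat each remaining newline as a single space.
    joined = [re.sub(r"\s*\n\s*", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in joined if p)
```

One known weakness: a genuine hyphen that happens to fall at a line end (as in "Indo-" followed by "Pacific") is removed as well, so the output should be spot-checked against the references.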
