The taxonomic repository from jingli201802

test examples by xml extractor

Continue deploying server

Make it show the updated outputs that map to the schema.
Fix PDF parts.

Finishing mapping original output to TNC

Four sheets in total.
TNC_TaxonomicName is almost finished. To do:

fullNameWithAuthorship needs to import the article author (when no taxonomic author is mentioned) from the Tarasom functions.
Find subspecies, common names.

TNC_TaxonomicNameUsage and TNC_BibliographicReference are in progress.

Correct "kindOfNameUsage" column

Rather than describing whether the name is scientific or common, this column should describe whether the name is an instance of sp. nov., comb. nov., gen. nov., etc.
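A minimal sketch of how such a label could be derived from the text around a name. The function name `kind_of_name_usage` and the normalised output strings are assumptions for illustration, and the pattern only covers the three abbreviations named above.

```python
import re

# Covers only the labels mentioned in this issue; the list is not exhaustive.
KIND = re.compile(r"\b(sp|comb|gen)\.\s*nov\b\.?", re.IGNORECASE)

def kind_of_name_usage(text):
    """Return a normalised label such as 'sp. nov.', or '' if none is found."""
    match = KIND.search(text)
    return f"{match.group(1).lower()}. nov." if match else ""
```

The empty string for "no label found" is a placeholder choice; the schema may require a different default.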

Currently typification.csv records too much information, accidentally capturing some type descriptions. Each type should be described in about one sentence, and this sentence should contain the gender, location and identifier.

csv format conversion

highlighting the extracted information in the original pdf

Working on the cases of species combination, or changing species' category (XML)

Remap output to TNU

The client has specified an output format which uses four classes. In order to fit this format the PDF output needs to be in the form of four CSVs.

Definition of done:
Executing the PDF extraction code will produce a folder with four CSVs, each containing the list of instances of a particular class in the TNU schema.
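A rough sketch of this output step, assuming the extraction result is already available as lists of dictionaries. The table names are inferred from file names mentioned elsewhere in these issues, and `write_tables` is a hypothetical helper, not existing project code.

```python
import csv
from pathlib import Path

# Assumed table names, inferred from other issues in this list.
TABLES = ("TNC_TaxonomicName", "TNC_TaxonomicNameUsage",
          "TNC_BibliographicReference", "typification")

def write_tables(records, out_dir):
    """Write one CSV per class; `records` maps table name -> list of dict rows."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for table in TABLES:
        rows = records.get(table, [])
        # Collect column names in first-seen order so the header is stable.
        fields = []
        for row in rows:
            for key in row:
                if key not in fields:
                    fields.append(key)
        path = out / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, restval="")
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```

A file is written for every table even when it has no rows, so the folder always contains four CSVs.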

github restructure

scr includes more docs; configuration ...

meeting agenda

make a kanban chart

Scrape citethisforme output

From the response generated by citethisforme, grab the appropriate fields from the page source and translate these into a dictionary output which can be used to populate bibliography.csv

XML: adding the scientific name field. Working on a new case that includes both a new species and a new genus.

Combine our name detection with GNRD

Currently our name detection is not reliable enough to use in the final release.

By aggregating our extraction result with the result of an existing web application (https://github.com/GlobalNamesArchitecture/gnrd) we will be able to obtain a more accurate and reliable list of scientific names.
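One way the aggregation could look, sketched under the assumption that both detectors have already produced plain lists of name strings. No GNRD call is made here, and `combine_names` and its output fields are hypothetical.

```python
def combine_names(own_names, gnrd_names):
    """Merge two lists of detected names, flagging which detector found each."""
    def key(name):
        # Compare case-insensitively and ignore repeated whitespace.
        return " ".join(name.split()).lower()

    own = {key(n): n for n in own_names}
    gnrd = {key(n): n for n in gnrd_names}
    merged = []
    for k in sorted(own.keys() | gnrd.keys()):
        merged.append({
            # Assumption: prefer the GNRD spelling when both sources agree.
            "name": gnrd.get(k, own.get(k)),
            "in_own": k in own,
            "in_gnrd": k in gnrd,
        })
    return merged
```

Names found by both detectors can then be treated as more reliable than names found by only one.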

Add some holotype/coordinate detection (PDF)

server

Use Flask with a UI that accepts PDF and XML input and produces the corresponding output.

Finishing mapping original output to standard schemas (XML)

server update and debug

Implement combination support

Deal with instances of comb. nov. as detailed by Haylee.

Generate a report for lessons learnt and issues in the process of extracting data.

See the handover issues in the Google Doc:
https://docs.google.com/document/d/19cB0DGzzpEuWftartd-TqGzBXE9oQOSB/edit#

Getting in contact with the shadow team to see how they built their server at ANU

Getting in contact with the shadow team to see how they built their server at ANU. If that is not applicable, research other server platforms (Heroku).

Working on different Zookeys cases of TaxPub.

Making improvements to accuracy

Attempt to increase "border word" list / Integrate other name detection (PDF)

Specify UI input and output preview display.

sending email to clients asking for specific requirements of input and output preview, if needed.

XML extraction program integration and pack outputs file into zip file
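The packaging step could be sketched with the standard library as follows. It assumes the extraction has already written its CSV files to a folder; `pack_outputs` is a hypothetical name.

```python
import zipfile
from pathlib import Path

def pack_outputs(out_dir, zip_path):
    """Pack every CSV in `out_dir` into one zip archive and return its path."""
    out_dir = Path(out_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for csv_file in sorted(out_dir.glob("*.csv")):
            # Store only the file name so the archive has no absolute paths.
            zf.write(csv_file, arcname=csv_file.name)
    return zip_path
```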

Correct TNU id's

Currently TNU IDs are generated in the TNU CSV. This is incorrect: they should be generated in TaxonomicName.csv and then passed on to the TNU CSV and typification.csv respectively.
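A sketch of the intended flow, assuming rows are dictionaries and that dependent rows can be matched to a name by a shared `name` field. The column names `id` and `taxonomicNameId`, the `TN-` prefix and both function names are illustrative assumptions.

```python
import itertools

def assign_name_ids(name_rows, prefix="TN"):
    """Give each TaxonomicName row an ID and return a name -> ID lookup."""
    counter = itertools.count(1)
    lookup = {}
    for row in name_rows:
        row["id"] = f"{prefix}-{next(counter)}"
        lookup[row["name"]] = row["id"]
    return lookup

def attach_name_ids(rows, lookup):
    """Copy the ID of the matching name onto dependent rows (TNU, typification)."""
    for row in rows:
        # Rows whose name is unknown get an empty ID rather than a fresh one.
        row["taxonomicNameId"] = lookup.get(row["name"], "")
    return rows
```

Generating IDs in one place and only looking them up elsewhere keeps the three files consistent with each other.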

Improve UI

editable output, output display

Add clean error handling for webservice failure

If either citethisforme or gnparser cannot be accessed, it is important that the client can be made aware of this:

A) So they can tell that the issue is not an error with our program
B) If the issue persists they can consult our documentation to understand the nature of the problem and possible solutions. (It is possible that instead of replacing the webservice our interface with that service might just have to be adapted to changes they have made.)
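A possible shape for this handling, using only the standard library. The exception class, the `fetch` helper and the wording of the message are assumptions; the real project may use a different HTTP client.

```python
import urllib.request

class WebServiceUnavailable(Exception):
    """Raised when an external web service cannot be reached."""

def fetch(service_name, url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body, or raise an error that names the service."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        raise WebServiceUnavailable(
            f"{service_name} could not be reached ({exc}). This is a problem "
            "with the external service or the network, not with the "
            "extraction itself. See the project documentation if it persists."
        ) from exc
```

The `opener` parameter exists only so the failure path can be exercised without network access.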

Accuracy analysis

Working on different publishers' cases of TaxPub

Risk Management

update risk register

Send request to citethisforme interface with DOI

Access citethisforme's free auto-citation page and request a citation of the article in question's DOI

quick reference data return

Remove duplicate name entries in taxonomicName.csv

When multiple entries for the same name are recorded, delete the one which contains less information.
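A minimal sketch of this rule, assuming "less information" means fewer non-empty fields and that duplicates share an identical `name` value. Both are assumptions, and `drop_sparser_duplicates` is a hypothetical helper.

```python
def drop_sparser_duplicates(rows, key="name"):
    """Keep one row per name: the one with the most non-empty fields."""
    def filled(row):
        return sum(1 for value in row.values() if value not in (None, ""))

    best = {}
    for row in rows:
        k = row[key]
        # Strict '>' keeps the earlier row when two rows are equally complete.
        if k not in best or filled(row) > filled(best[k]):
            best[k] = row
    return list(best.values())
```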

Create type coordinate column

Haylee has requested another column in the typification CSV that contains the coordinates of the type.
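A rough sketch of how coordinates might be pulled out of a type sentence. It assumes degrees-minutes(-seconds) notation with a hemisphere letter, which is only one of several formats that appear in publications; the pattern and `find_coordinates` are illustrative.

```python
import re

# Assumed format: 35°16'S or 149°07′30"E (seconds optional).
COORD = re.compile(
    r"""(\d{1,3})°\s*(\d{1,2})['′]\s*(?:(\d{1,2}(?:\.\d+)?)["″]\s*)?([NSEW])"""
)

def find_coordinates(text):
    """Return (degrees, minutes, seconds, hemisphere) tuples found in text."""
    found = []
    for deg, minute, sec, hemi in COORD.findall(text):
        found.append((int(deg), int(minute), float(sec) if sec else 0.0, hemi))
    return found
```

Decimal-degree coordinates would need a second pattern.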

Implement support for multiple name authors

Currently even though the authorship value in GNParser's JSON output is always a list, only the first value is used (usually there is only one value).

Definition of done:
The authorship list in GNParser's output is concatenated into a single string and this string is inserted into the final CSV output of the program.
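The concatenation itself could be as small as the sketch below. It takes the list of author strings directly rather than the full JSON, since the exact JSON layout is not described here; the separator and `join_authors` are assumptions.

```python
def join_authors(authors, separator=", "):
    """Concatenate a list of author names into one string for the CSV cell."""
    cleaned = [a.strip() for a in authors if a and a.strip()]
    return separator.join(cleaned)
```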

Improve name detection

Name detection can be improved in two significant ways:

A) By accounting for punctuation in the name that should tell the program certain words don't belong, e.g. capital letters where there shouldn't be any, unclosed brackets and misplaced full stops.
B) By checking whether GNParser returns an "unparsed tail" string containing words adjacent to the sp./gen./comb. marker. Because these words can be assumed to be part of the name, this error means that too many words were included at the front of the name (so remove the front word and try again).
Since GNParser is a webservice, it is important to do these checks in rounds rather than one by one, as that would increase latency immensely and may also cause GNParser to flag the IP of the program's machine as malicious.
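The round-based retry could be sketched as below. `parse_batch` is a hypothetical callable standing in for one batched GNParser request: it takes a list of strings and returns, in the same order, the unparsed tail for each (empty when there is none). As a simplification, any non-empty tail triggers a retry here, without checking which words it contains.

```python
def trim_in_rounds(candidates, parse_batch, max_rounds=5):
    """Trim leading words from candidate names, one batched request per round."""
    words = [c.split() for c in candidates]
    pending = list(range(len(words)))
    for _ in range(max_rounds):
        if not pending:
            break
        # One request for every unresolved candidate keeps latency and
        # request count low compared with checking names one by one.
        tails = parse_batch([" ".join(words[i]) for i in pending])
        still_pending = []
        for i, tail in zip(pending, tails):
            if tail and len(words[i]) > 2:
                words[i] = words[i][1:]  # drop the front word and retry
                still_pending.append(i)
        pending = still_pending
    return [" ".join(w) for w in words]
```

The number of requests is bounded by `max_rounds` regardless of how many candidates there are.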

Replacing PyPDF2 with a more suitable tool
Attempting to correct the line breaks manually
