taxonomic's Issues
test examples by xml extractor
Continue deploying server
Make it show the updated outputs that map to the schema.
Fix PDF parts.
SOW for semester 2
draft uploaded on google drive.
Finishing mapping original output to TNC
Four sheets in total.
TNC_TaxonomicName is almost finished. To do:
- fullNameWithAuthorship needs to import the article author (when no taxonomic author is mentioned) from the Tarasom functions.
- Find subspecies, common names.
TNC_TaxonomicNameUsage and TNC_BibliographicReference are in progress.
Decision-making
Meeting agenda
Correct "kindOfNameUsage" column
Rather than describing whether the name is scientific or common, this column should describe whether the name is an instance of sp. nov., comb. nov., gen. nov. etc.
Constrict type detection
Currently typification.csv records too much information, accidentally capturing some type descriptions. Each type should be described in about one sentence, and that sentence should contain the gender, location and identifier.
csv format conversion
highlighting the extracted information in the original pdf
Working on the cases of species combination, or changing species' category (XML)
Remap output to TNU
The client has specified an output format which uses four classes. In order to fit this format the PDF output needs to be in the form of four CSVs.
Definition of done:
Executing the PDF extraction code will produce a folder with four CSVs, each containing the list of instances of a particular class in the TNU schema.
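A minimal sketch of that definition of done, assuming the extraction step can hand over its results as a dictionary mapping each class name to a list of row dictionaries. The function name `write_class_csvs` and that input shape are hypothetical, not taken from the project code.

```python
import csv
from pathlib import Path


def write_class_csvs(instances_by_class, out_dir):
    """Write one CSV per class into out_dir and return the paths written.

    instances_by_class is assumed (hypothetically) to look like
    {"ClassName": [{"column": "value", ...}, ...], ...}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for class_name, rows in instances_by_class.items():
        # Use the union of keys in first-seen order so no column is dropped.
        fieldnames = []
        for row in rows:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
        path = out / f"{class_name}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            # restval fills columns that a given row does not have.
            writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

Passing a dictionary with four class names would then produce the folder of four CSVs described above.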
github restructure
scr includes more docs; configuration ...
meeting agenda
make a kanban chart
Scrape citethisforme output
From the response generated by citethisforme, grab the appropriate fields from the page source and translate these into a dictionary output which can be used to populate bibliography.csv
XML: add the scientific name fields. Working on a new case that includes both a new species and a new genus.
Combine our name detection with GNRD
Currently our name detection is not reliable enough to use in the final release.
By aggregating our extraction result with the result of an existing web application (https://github.com/GlobalNamesArchitecture/gnrd) we will be able to obtain a more accurate and reliable list of scientific names.
Add some holotype/coordinate detection (PDF)
server
use flask with ui, pdf input, xml input. output accordingly.
Finishing mapping original output to standard schemas(XML)
server update and debug
Implement combination support
Deal with instances of comb. nov. as detailed by Haylee.
Generate a report for lessons learnt and issues in the process of extracting data.
See the handover issues in the Google Doc:
https://docs.google.com/document/d/19cB0DGzzpEuWftartd-TqGzBXE9oQOSB/edit#
Get in contact with the shadow team to see how they built their server at ANU
Get in contact with the shadow team to see how they built their server at ANU. If that is not applicable, research other server platforms (e.g. Heroku).
Working on different Zookeys cases of TaxPub.
Making improvements to accuracy
Attempt to increase "border word" list / Integrate other name detection (PDF)
Specify UI input and output preview display.
Send an email to the clients asking for their specific requirements for the input and output preview, if needed.
XML extraction program integration and pack outputs file into zip file
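One way the zip step could look, as a sketch that uses only the standard library. The function name `pack_outputs` and the assumption that all outputs sit in a single folder are illustrative and not taken from the project.

```python
import zipfile
from pathlib import Path


def pack_outputs(output_dir, zip_path):
    """Zip every file under output_dir, keeping paths relative to it."""
    output_dir = Path(output_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(output_dir.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in output_dir.
            if path.is_file() and path.resolve() != zip_path.resolve():
                zf.write(path, arcname=path.relative_to(output_dir).as_posix())
    return zip_path
```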
Correct TNU IDs
TNU IDs are currently generated in the TNU CSV. This is incorrect: they should be generated in TaxonomicName.csv and then passed on to the TNU CSV and typification.csv respectively.
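The intended flow can be sketched as follows, assuming rows are held as dictionaries before being written out. The column names `name`, `id` and `taxonomicNameId`, the `TN-` prefix and both function names are hypothetical placeholders.

```python
def assign_name_ids(name_rows, prefix="TN"):
    """Give each TaxonomicName row an ID and return a name -> ID lookup."""
    id_by_name = {}
    for row in name_rows:
        name = row["name"]
        if name not in id_by_name:
            id_by_name[name] = f"{prefix}-{len(id_by_name) + 1}"
        row["id"] = id_by_name[name]
    return id_by_name


def attach_name_ids(rows, id_by_name, name_key="name", id_key="taxonomicNameId"):
    """Copy the already-generated ID onto TNU or typification rows."""
    for row in rows:
        # None marks a row whose name has no TaxonomicName entry.
        row[id_key] = id_by_name.get(row[name_key])
    return rows
```

Because the lookup is built once from the TaxonomicName rows, the TNU and typification rows can only ever reuse an existing ID, never mint their own.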
Improve UI
editable output, output display
Add clean error handling for webservice failure
If either citethisforme or gnparser cannot be accessed, it is important that the client can be made aware of this:
A) So they can tell that the issue is not an error with our program
B) If the issue persists they can consult our documentation to understand the nature of the problem and possible solutions. (It is possible that instead of replacing the webservice our interface with that service might just have to be adapted to changes they have made.)
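A minimal sketch of such error handling, assuming the services are reached over plain HTTP from Python. The exception class `WebServiceUnavailable`, the function `fetch_from_service` and the wording of the message are hypothetical; the `opener` parameter exists only so the behaviour can be exercised without network access.

```python
import urllib.error
import urllib.request


class WebServiceUnavailable(RuntimeError):
    """Raised when an external service (not our own code) cannot be reached."""


def fetch_from_service(service_name, url, timeout=10, opener=urllib.request.urlopen):
    """Return the raw response body, or raise a clearly worded error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
        # HTTPError is a subclass of URLError, so HTTP status failures land here too.
        raise WebServiceUnavailable(
            f"{service_name} could not be reached ({exc}). "
            "This is a problem with the external web service, not with the extraction program. "
            "If it persists, see the documentation for possible causes and solutions."
        ) from exc
```

The UI can catch `WebServiceUnavailable` and show its message directly, which covers both points A and B above.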
Accuracy analysis
Working on different publishers' cases of TaxPub
Risk Management
update risk register
Send request to citethisforme interface with DOI
Access citethisforme's free auto-citation page and request a citation for the DOI of the article in question.
quick reference data return
Remove duplicate name entries in taxonomicName.csv
When multiple entries for the same name are recorded, delete the one which contains less information.
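A sketch of this rule, assuming "less information" means fewer non-empty fields and that duplicates share an identical name string. The column name `name` and the function name are hypothetical.

```python
def drop_sparser_duplicates(rows, name_key="name"):
    """Keep, for each name, the row with the most non-empty fields."""

    def filled(row):
        # Count fields that actually carry a value.
        return sum(1 for value in row.values() if value not in (None, ""))

    best = {}
    for row in rows:
        key = row[name_key]
        # On a tie the earlier row is kept.
        if key not in best or filled(row) > filled(best[key]):
            best[key] = row
    return list(best.values())
```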
Create type coordinate column
Haylee has requested another column in the typification CSV containing the coordinates of the type.
Implement support for multiple name authors
Currently even though the authorship value in GNParser's JSON output is always a list, only the first value is used (usually there is only one value).
Definition of done:
The authorship list in GNParser's output is concatenated into a single string and this string is inserted into the final CSV output of the program.
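The concatenation itself could be as small as the sketch below. It takes the authorship list as a plain list of strings; how that list is reached inside GNParser's JSON is not shown here, and the separator convention (commas, with an ampersand before the last author) is an assumption rather than a client requirement.

```python
def join_authors(authors):
    """Concatenate a list of author names into a single string."""
    cleaned = [a.strip() for a in authors if a and a.strip()]
    if len(cleaned) <= 2:
        return " & ".join(cleaned)
    return ", ".join(cleaned[:-1]) + " & " + cleaned[-1]
```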
Improve name detection
Name detection can be improved in two significant ways:
- By accounting for punctuation in the name that should tell the program certain words don't belong, e.g. capital letters where there shouldn't be any, unclosed brackets and misplaced full stops.
- By checking whether GNParser returns an "unparsed tail" string containing words adjacent to the sp./gen./comb. Because these words can be assumed to be part of the name, this error means that too many words were included at the front of the name (so remove the front word and try again).
Since GNParser is a web service, it is important to do these checks in rounds instead of one by one, as checking one by one would greatly increase latency and might also cause GNParser to flag the IP of the program's machine as malicious.
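The round-based trimming can be sketched as below. The call to GNParser is deliberately left out: `parse_batch` is a hypothetical callable that takes a list of candidate strings and returns one unparsed-tail string (empty if none) per candidate, in the same order. As a simplification, any non-empty tail is treated as the signal to drop the front word.

```python
def trim_candidates_in_rounds(candidates, parse_batch, max_rounds=5):
    """Drop leading words from candidates until the parser reports no tail.

    Only one parse_batch call is made per round, however many candidates
    still need checking.
    """
    current = list(candidates)
    pending = list(range(len(current)))
    for _ in range(max_rounds):
        if not pending:
            break
        tails = parse_batch([current[i] for i in pending])
        still_pending = []
        for i, tail in zip(pending, tails):
            words = current[i].split()
            if tail and len(words) > 1:
                # Too many words at the front: remove one and retry next round.
                current[i] = " ".join(words[1:])
                still_pending.append(i)
        pending = still_pending
    return current
```

With this shape the number of requests is bounded by `max_rounds`, not by the number of candidate names.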
Change coordinate scope
Include coordinates which are described using decimals instead of DMS, and include more room for variation of spacing, punctuation etc. in the regex.
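A possible shape for the widened regex, offered as a sketch and not as the pattern the project uses. It assumes every coordinate carries a degree sign and a hemisphere letter, accepts integer or decimal degrees with optional minutes and seconds, and tolerates variable spacing and a few quote styles. Signed decimals without a hemisphere letter (such as "-35.27, 149.12") are not covered. The names `find_coordinates` and `_to_decimal` are hypothetical.

```python
import re

_NUM = r"\d{1,3}(?:\.\d+)?"
_COMPONENT = rf"""
    ({_NUM}) \s* ° \s*                    # degrees, integer or decimal
    (?: ({_NUM}) \s* ['′’] \s* )?         # optional minutes
    (?: ({_NUM}) \s* (?:"|″|”|'') \s* )?  # optional seconds
    ([NSEW])                              # hemisphere letter
"""
_COMPONENT_RE = re.compile(_COMPONENT, re.VERBOSE)
_PAIR_RE = re.compile(
    rf"(?P<lat>{_COMPONENT})[\s,;]*(?P<lon>{_COMPONENT})", re.VERBOSE
)


def _to_decimal(component):
    """Convert one matched component, DMS or decimal, to signed decimal degrees."""
    deg, minutes, seconds, hem = _COMPONENT_RE.fullmatch(component).groups()
    value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    return -value if hem in "SW" else value


def find_coordinates(text):
    """Return (first, second) decimal-degree pairs found in text.

    No check is made that the first component is a latitude.
    """
    return [
        (_to_decimal(m.group("lat")), _to_decimal(m.group("lon")))
        for m in _PAIR_RE.finditer(text)
    ]
```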
Testing with clients
Improve UI (output display)
The output needs to have a preview function.
Correct spacing issues caused by PyPDF2
Currently PyPDF2 (the library used to convert PDFs to strings) adds many unnecessary and unpredictable line breaks, which make it difficult to parse important information (e.g. references).
Possible solutions include:
- Replacing PyPDF2 with a more suitable tool
- Attempting to correct the line breaks manually
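For the second option, a heuristic clean-up pass might look like the sketch below. The rules are assumptions: a line break is kept only after sentence-ending punctuation or a colon, or where there is a blank line, and a word split with a hyphen at the end of a line is rejoined. Genuine hyphenated compounds that happen to break across lines, and abbreviations such as "sp." at the end of a line, will be handled wrongly.

```python
import re


def unwrap_lines(text):
    """Remove line breaks that appear to fall in the middle of a sentence."""
    # Rejoin words hyphenated across a line break: "taxo-\nnomic" -> "taxonomic".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace a single newline with a space unless it follows . ! ? : or
    # is part of a blank-line paragraph break.
    text = re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)
    # Collapse any runs of spaces or tabs left behind.
    return re.sub(r"[ \t]{2,}", " ", text)
```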
link different attributes within text (eg genders with holotypes, holotypes with species)
Testing accuracy
Finish the code.
Start real test.
Code updated to adapt to different xml format
Improve UI (editable output)
Implement AnyStyle.io reference parsing