Comments (13)
How do I know what file created this? Do we have a numbering system for
batches? The file should give the batch ID (which should be described in the
"about" pages in phylotree/) and then the filename.
On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust <[email protected]> wrote:
This looks like one Newick file, not a list.
On Fri, Aug 7, 2015 at 5:59 PM, Ross Mounce [email protected] wrote:
Assigned #17 to @petermr https://github.com/petermr.
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
from phylotree.
There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new comment. Still learning these mobile phone app icons...
from phylotree.
Yes. We should probably record these as garbles. This is what
public void setSpeciesPattern(Pattern speciesPattern) ;
has been created for. (`Patter
The question is whether to
- abort the CTree
- replace the tip by some reserved Newick-friendly message, e.g. "GARBLED", which cannot be a taxon.
I think we should do the latter and I'll try to code it.
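A minimal sketch of the second option, replacing a tip that fails the species pattern with the reserved marker. `TipLabelChecker` and the binomial regex are hypothetical names and assumptions for illustration; this is not the ami-phylo implementation, and only the idea of a `Pattern` and the "GARBLED" marker come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ami-phylo): flags tip labels that fail a species pattern. */
final class TipLabelChecker {
    /** Reserved marker that cannot be mistaken for a taxon name. */
    static final String GARBLED = "GARBLED";

    private final Pattern speciesPattern;

    TipLabelChecker(Pattern speciesPattern) {
        this.speciesPattern = speciesPattern;
    }

    /** Returns the label unchanged if the whole label matches, otherwise the reserved marker. */
    String check(String label) {
        return speciesPattern.matcher(label).matches() ? label : GARBLED;
    }

    /** Applies check() to every tip label, preserving order. */
    List<String> checkAll(List<String> labels) {
        List<String> checked = new ArrayList<>();
        for (String label : labels) {
            checked.add(check(label));
        }
        return checked;
    }

    public static void main(String[] args) {
        // Assumed pattern: capitalised genus, a space or underscore, lower-case epithet.
        Pattern binomial = Pattern.compile("[A-Z][a-z]+[ _][a-z]+");
        TipLabelChecker checker = new TipLabelChecker(binomial);
        System.out.println(checker.checkAll(List.of("Homo sapiens", "H0m0 $apiens")));
    }
}
```

Keeping the tip in place (rather than aborting the tree) preserves the topology, so a garbled label can still be reviewed against the source image later.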
from phylotree.
latter seems reasonable to me but perhaps too strict (?)
I was imagining perhaps just a post-OCR filter to remove all special characters. Replace with nothing (no space) or for certain special characters, e.g. $ replace with its most likely replacement candidate: S
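A rough sketch of such a post-OCR filter. `PostOcrFilter` is a hypothetical name, and the set of characters treated as "safe" is an assumption; only the `$` to `S` substitution comes from the suggestion above.

```java
import java.util.Map;

/** Hypothetical post-OCR filter: substitutes likely look-alikes and drops other special characters. */
final class PostOcrFilter {
    // Only '$' -> "S" is taken from the discussion; further entries would be guesses.
    private static final Map<Character, String> LIKELY_REPLACEMENTS = Map.of('$', "S");

    static String clean(String ocrText) {
        StringBuilder cleaned = new StringBuilder();
        for (char c : ocrText.toCharArray()) {
            if (LIKELY_REPLACEMENTS.containsKey(c)) {
                cleaned.append(LIKELY_REPLACEMENTS.get(c));
            } else if (Character.isLetterOrDigit(c) || c == ' ' || c == '_' || c == '.' || c == '-') {
                // Assumed "safe" characters are kept as they are.
                cleaned.append(c);
            }
            // Any other character is dropped and replaced with nothing (no space).
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("$almo tru#tta"));
    }
}
```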
from phylotree.
The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it.
This is a classic problem https://en.wikipedia.org/wiki/Error_detection_and_correction - we can detect errors, can we correct them? This will require heuristics.
from phylotree.
Ross>>I was imagining perhaps just a post-OCR filter to remove all special characters.
We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc.
from phylotree.
See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in norma
<garbles title="italicGarbles" characters="I/">
<!-- italic "l" -->
<garble original="a/" edited="al"/>
<garble original="e/" edited="el"/>
<garble original="i/" edited="il"/>
<garble original="I/" edited="ll"/>
<garble original="l/" edited="ll"/>
<garble original="o/" edited="ol"/>
<garble original="u/" edited="ul"/>
<garble original="y/" edited="yl"/>
<garble original="r/o" edited="rio"/>
<garble original="s/c" edited="sic"/>
<garble original="g/uc" edited="gluc"/>
<garble original="g/yc" edited="glyc"/>
<garble original="d/\[" edited="cili"/>
<garble original="f/uo" edited="fluo"/>
<garble original="d/\[" edited="cili"/>
<garble original="/\[" edited="il"/>
<garble original="t/G" edited="tic"/>
<garble original="k/n" edited="kin"/>
<garble original="lI" edited="ll"/>
<!-- these may be ambiguous -->
<garble original="C/" edited="cl"/>
<garble original="c/" edited="ci"/>
<garble original="/Us" edited="ius"/>
<!-- very ambiguous -->
<garble original="n'x" edited="rix"/>
<!-- ligatures -->
<garble original="ﬁ" edited="fi"/>
fi
</garbles>
These are believable garbles. Problem is that some images are not appropriate for analysis.
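To make the table concrete, here is a sketch of how entries like these could be applied as ordered literal replacements. `GarbleFixer` is a hypothetical class, and plain substring replacement is an assumption; norma's actual matching may work differently (for instance, some `original` values look regex-escaped).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical applier for garble entries; not norma's actual code. */
final class GarbleFixer {
    // Insertion order is preserved so longer, more specific garbles can be listed first.
    private final Map<String, String> garbles = new LinkedHashMap<>();

    GarbleFixer add(String original, String edited) {
        garbles.put(original, edited);
        return this;
    }

    /** Applies every garble in insertion order as a literal (non-regex) replacement. */
    String fix(String text) {
        String result = text;
        for (Map.Entry<String, String> garble : garbles.entrySet()) {
            result = result.replace(garble.getKey(), garble.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        GarbleFixer fixer = new GarbleFixer()
                .add("g/uc", "gluc")   // specific entries first
                .add("f/uo", "fluo")
                .add("a/", "al")       // generic italic "l" entries last
                .add("o/", "ol");
        System.out.println(fixer.fix("g/ucose"));
        System.out.println(fixer.fix("Sa/mo"));
    }
}
```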
P.
from phylotree.
The architecture has the OCR in norma. At this stage we don't know the
context (and if we are doing money-related or computer-related documents,
"$" may be intended).
The interpretation has to be at the domain level. That means ami-phylo in
this case.
P.
from phylotree.
Your comments are very useful. As a first pass we can say:
- negative lengths are a programming error
- commas are reserved characters and should be replaced
Because there is no agreed Newick validator we have to guess the rest of the rules. For example, Trex says that a Newick file must have at least 3 taxa. Is that true of all validators? Does Newick require a valid taxon name? I don't know, but I think some of my files with only node IDs have passed Trex.
The appropriate action is to compile a list of failing images and create a failing test from them, analogous to MergeTipTest.testConvertLabelsAndTreeAndMerge(). I'll incorporate them and gradually debug.
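The two first-pass rules above (negative lengths, commas in labels) could be checked roughly as follows. `NewickFirstPass` is a hypothetical helper, the regex is a simplification that ignores quoted labels, and the underscore replacement is an assumed choice; none of this is an agreed Newick validation rule.

```java
import java.util.regex.Pattern;

/** Hypothetical first-pass checks; a sketch, not a Newick validator. */
final class NewickFirstPass {
    // A branch length follows ':'; a '-' directly before the number marks it as negative.
    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":\\s*-[\\d.]");

    /** True if any branch length in the Newick string appears to be negative. */
    static boolean hasNegativeLength(String newick) {
        return NEGATIVE_LENGTH.matcher(newick).find();
    }

    /** Replaces commas in a raw tip label before it is written into Newick (underscore is assumed). */
    static String sanitizeLabel(String rawLabel) {
        return rawLabel.replace(',', '_');
    }

    public static void main(String[] args) {
        System.out.println(hasNegativeLength("(A:0.1,B:-0.2);"));
        System.out.println(sanitizeLabel("Homo,sapiens"));
    }
}
```

Checks like these could feed the list of failing images, with each failure recorded against its source file.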
from phylotree.
Trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output file. Suspect some numbers may not be correct. As before, collect just the failing ones.
from phylotree.
They are all there, all 4000+. If you are browsing via github.com with a web browser it will show you just the first 1000 files or folders. Perhaps this could be the problem? If you clone the repo locally to your hard drive everything should be there.
from phylotree.
OK, thx
Probably a good idea to select out the problem files.
from phylotree.
Which files? The source image file?
If you know the ID of it you can go directly to it without even cloning the whole repo, via a web browser: just append it to this URL, e.g.
https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ + ijs.0.000174-0-000.pbm.png
= https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png
It's all standardised structure and file names
from phylotree.
Related Issues (20)
- Consistency of Binomial and EGID
- Implement 50 images as test material HOT 1
- Add pruning for incomplete or malformed tips
- Missing edge labels
- NeXML validation issue
- Compare no. tips vs HOCR phrases
- Configuration file for Tesseract HOT 1
- Local Genus and species lookup HOT 1
- Tidying NCBI taxdump HOT 9
- Validating scientific names after `ami-phylo` parse. HOT 4
- Check tip count against HOCR output
- Untrapped Tesseract failure
- Failure to match tips to labels (implement Hungarian algorithm? )
- Incorrect line breaks from Tesseract.
- OTU labels are now entirely numerical HOT 3
- Serious design flaw in `cmine`+`ami-phylo`
- Logfile
- "Valid" Newick problems HOT 7
- How to run the new code/methods on a set of image files? HOT 2
- Binomial lookup correction not fully working HOT 1