Here's the full list of 16 erroneous Newick files: (link truncated in source)

Erroneous Newick output about phylotree HOT 13 OPEN

rossmounce commented on July 23, 2024

Erroneous Newick output

from phylotree.

Comments (13)

petermr commented on July 23, 2024

How do I know what file created this? Do we have a numbering system for
batches? The file should give the batch ID (which should be described in the
"about" pages in phylotree/) and then the filename.

On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust wrote:

This looks like one Newick file, not a list.

On Fri, Aug 7, 2015 at 5:59 PM, Ross Mounce wrote:

Assigned #17 to @petermr (https://github.com/petermr).

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069


rossmounce commented on July 23, 2024

There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new one. Still learning these mobile phone app icons...


petermr commented on July 23, 2024

Yes. We should probably record these as garbles. This is what

    public void setSpeciesPattern(Pattern speciesPattern) ;

has been created for. (`Patter…) The question is whether to

- abort the CTree
- replace the tip by some reserved Newick-friendly message, e.g. "GARBLED", which cannot be a taxon.

I think we should do the latter and I'll try to code it.
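
A minimal sketch of the second option, assuming tips are checked against a species pattern and replaced by a reserved label when they fail. The class `TipLabelChecker`, the method `checkTip` and the binomial pattern in the usage note are hypothetical illustrations, not part of phylotree; only the idea of a species `Pattern` and the "GARBLED" label come from the discussion above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not phylotree code): keeps a tip label that looks
// like a species name and replaces anything else with a reserved marker.
class TipLabelChecker {
    // Reserved label that cannot be mistaken for a taxon.
    static final String GARBLED = "GARBLED";

    private final Pattern speciesPattern;

    TipLabelChecker(Pattern speciesPattern) {
        this.speciesPattern = speciesPattern;
    }

    // Returns the label unchanged if the whole label matches the species
    // pattern, otherwise returns the reserved GARBLED marker.
    String checkTip(String label) {
        return speciesPattern.matcher(label).matches() ? label : GARBLED;
    }
}
```

With an assumed pattern such as `[A-Z][a-z]+ [a-z]+` (a plain "Genus species" binomial), `checkTip("Homo sapiens")` keeps the label, while a label containing OCR debris is replaced. The tree stays parseable, and the garbled tips can be counted afterwards.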


rossmounce commented on July 23, 2024

latter seems reasonable to me but perhaps too strict (?)

I was imagining perhaps just a post-OCR filter to remove all special characters. Replace with nothing (no space) or for certain special characters, e.g. $ replace with its most likely replacement candidate: S
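
Roughly what I have in mind, as a sketch only. The class name `PostOcrFilter` and the set of characters treated as "not special" (letters, digits, space, underscore, dot, hyphen) are assumptions for illustration; the only rule taken from the suggestion above is that `$` becomes `S` and other special characters are replaced with nothing.

```java
// Hypothetical post-OCR filter (not existing norma or phylotree code).
class PostOcrFilter {
    // Cleans one OCR'd label:
    //  - '$' is replaced by its most likely candidate 'S'
    //  - letters, digits and a few assumed-safe characters are kept
    //  - every other character is dropped (replaced with nothing)
    static String clean(String label) {
        StringBuilder sb = new StringBuilder();
        for (char c : label.toCharArray()) {
            if (c == '$') {
                sb.append('S');
            } else if (Character.isLetterOrDigit(c)
                    || c == ' ' || c == '_' || c == '.' || c == '-') {
                sb.append(c);
            }
            // anything else is silently removed
        }
        return sb.toString();
    }
}
```

For example, `PostOcrFilter.clean("$almo sal~ar")` gives `Salmo salar`. More one-to-one substitutions could be added alongside the `$` rule if other characters turn out to have an obvious candidate.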


petermr commented on July 23, 2024

The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it.
This is a classic problem https://en.wikipedia.org/wiki/Error_detection_and_correction - we can detect errors, can we correct them? This will require heuristics.


petermr commented on July 23, 2024

Ross>>I was imagining perhaps just a post-OCR filter to remove all special characters.

We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc.


petermr commented on July 23, 2024

See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in norma

<garbles title="italicGarbles" characters="I/">
  <!-- italic "l" -->
  <garble original="a/" edited="al"/>
  <garble original="e/" edited="el"/>
  <garble original="i/" edited="il"/>
  <garble original="I/" edited="ll"/>
  <garble original="l/" edited="ll"/>
  <garble original="o/" edited="ol"/>
  <garble original="u/" edited="ul"/>
  <garble original="y/" edited="yl"/>
  <garble original="r/o" edited="rio"/>
  <garble original="s/c" edited="sic"/>
  <garble original="g/uc" edited="gluc"/>
  <garble original="g/yc" edited="glyc"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="f/uo" edited="fluo"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="/\[" edited="il"/>
  <garble original="t/G" edited="tic"/>
  <garble original="k/n" edited="kin"/>

  <garble original="lI" edited="ll"/>

  <!-- these may be ambiguous -->
  <garble original="C/" edited="cl"/>
  <garble original="c/" edited="ci"/>
  <garble original="/Us" edited="ius"/>

  <!-- very ambiguous -->
  <garble original="n'x" edited="rix"/>

  <!-- ligatures -->
  <garble original="ﬁ" edited="fi"/>
</garbles>

These are believable garbles. Problem is that some images are not appropriate for analysis.
P.
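
To illustrate how such a table might be applied, here is a sketch that runs a few of the original/edited pairs over a string. It assumes plain literal replacement with longer garbles tried first; how norma actually loads and applies italicGarbles.xml is not shown in this thread, and the class `GarbleEditor` is hypothetical. Only entries without escaped characters are used, since entries like `d/\[` may be meant as regular expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration (not norma code): applies a handful of
// original -> edited pairs from the garbles list as literal replacements.
class GarbleEditor {
    // LinkedHashMap keeps insertion order, so longer garbles placed first
    // are applied before the shorter, more general ones.
    private final Map<String, String> garbles = new LinkedHashMap<>();

    GarbleEditor() {
        garbles.put("g/uc", "gluc");
        garbles.put("f/uo", "fluo");
        garbles.put("a/", "al");
        garbles.put("o/", "ol");
    }

    // Applies every garble in order and returns the edited text.
    String edit(String text) {
        String result = text;
        for (Map.Entry<String, String> e : garbles.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

With these four entries, `edit("Pseudomonas f/uorescens")` returns `Pseudomonas fluorescens` and `edit("Sa/monella")` returns `Salmonella`.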


petermr commented on July 23, 2024

The architecture has the OCR in norma. At this stage we don't know the
context, and if we are processing money-related or computer-related
documents, "$" may be intended.

The interpretation has to be at the domain level. That means ami-phylo in
this case.


petermr commented on July 23, 2024

Your comments are very useful. As a first pass we can say:

- negative lengths are a programming error
- commas are reserved characters and should be replaced

Because there is no agreed Newick validator we have to guess the rest of the rules. For example, Trex says that a Newick file must have at least 3 taxa. Is that true of all validators? Does Newick require a valid taxon name? I don't know, but I think some of my files with only node IDs have passed Trex.

The appropriate action is to compile a list of failing images and create a failing test from them, analogous to MergeTipTest.testConvertLabelsAndTreeAndMerge(). I'll incorporate them and gradually debug.
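
The first-pass rules above could be sketched as a simple check. This is an illustration under stated assumptions, not a Newick validator: the class `NewickFirstPass` and its methods are hypothetical, taxa are counted as "commas + 1" (which only holds when commas appear purely as separators), the 3-taxa minimum is the Trex rule mentioned above, and the underscore used to replace commas in labels is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical first-pass checker (not phylotree code).
class NewickFirstPass {
    // A colon followed by a minus sign and a digit marks a negative length.
    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":\\s*-\\d");

    // Returns a list of problems found; an empty list means nothing was flagged.
    static List<String> check(String newick) {
        List<String> problems = new ArrayList<>();
        if (NEGATIVE_LENGTH.matcher(newick).find()) {
            problems.add("negative branch length");
        }
        // In a tree string where commas are only separators,
        // the number of tips equals the number of commas plus one.
        long commas = newick.chars().filter(c -> c == ',').count();
        if (commas + 1 < 3) {
            problems.add("fewer than 3 taxa");
        }
        return problems;
    }

    // Commas are reserved in Newick, so replace them inside a tip label
    // before the label is written into the tree (replacement is assumed).
    static String sanitizeLabel(String label) {
        return label.replace(',', '_');
    }
}
```

A failing test built from the problem images could then assert that `check(...)` returns an empty list for each regenerated tree.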


petermr commented on July 23, 2024

Trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output file. Suspect some numbers may not be correct. As before, collect just the failing ones.


rossmounce commented on July 23, 2024

They are all there, all 4000+. If you are browsing via github.com with a web browser it will show you just the first 1000 files or folders. Perhaps this could be the problem? If you clone the repo locally to your hard drive everything should be there.


petermr commented on July 23, 2024

OK, thx
Probably a good idea to select out the problem files.


rossmounce commented on July 23, 2024

Which files? The source image file?
If you know its ID you can go directly to it in a web browser, without cloning the whole repo. Just append the file name to this URL:

https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/

For example, appending ijs.0.000174-0-000.pbm.png gives:

https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png

It's all standardised structure and file names
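
If you want to build these links in code, it is just string concatenation. A small sketch, where the class `OutputUrl` and the method `forFile` are made-up names and the base URL is the one given above:

```java
// Hypothetical helper for building a direct link to one output file.
class OutputUrl {
    static final String BASE =
        "https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/";

    // Appends the standardised file name to the base URL.
    static String forFile(String fileName) {
        return BASE + fileName;
    }
}
```

So `OutputUrl.forFile("ijs.0.000174-0-000.pbm.png")` gives the direct link for that image's output.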
