Comments (13)

petermr commented on July 23, 2024

How do I know what file created this? Do we have a numbering system for
batches? The file should give the batch ID (which should be described in the
"about" pages in phylotree/) and then the filename.

On Fri, Aug 7, 2015 at 6:49 PM, Peter Murray-Rust wrote:

This looks like one Newick file, not a list.

On Fri, Aug 7, 2015 at 5:59 PM, Ross Mounce wrote:

Assigned #17 to @petermr (https://github.com/petermr).


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

from phylotree.

rossmounce commented on July 23, 2024

There is only 1 batch. I ran through them all in a bash loop with a timeout command. Sorry, I edited your comment rather than writing a new comment. Still learning these mobile phone app icons...

petermr commented on July 23, 2024

Yes. We should probably record these as garbles. This is what

    public void setSpeciesPattern(Pattern speciesPattern) ;

has been created for. The question is whether to

  • abort the CTree
  • replace the tip by some reserved Newick-friendly message (e.g. "GARBLED") which cannot be a taxon.

I think we should do the latter and I'll try to code it.
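A minimal sketch of the second option, for discussion only. The class name, the default pattern and the `markIfGarbled` method are all hypothetical and are not the existing CTree or `setSpeciesPattern` code; the pattern simply assumes a tip looks like "Genus species" with optional trailing text.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: swaps tip labels that do not look like taxa for a reserved marker. */
final class GarbledTipMarker {
    /** Reserved label, all upper case so it cannot match a "Genus species" name. */
    static final String GARBLED = "GARBLED";

    /** Assumed shape of a plausible tip: "Genus species", optionally followed by strain text. */
    static final Pattern DEFAULT_SPECIES_PATTERN =
            Pattern.compile("[A-Z][a-z]+ [a-z]+( .*)?");

    private GarbledTipMarker() {}

    /** Returns the trimmed label if it matches the pattern, otherwise the reserved marker. */
    static String markIfGarbled(String tipLabel, Pattern speciesPattern) {
        if (tipLabel == null) {
            return GARBLED;
        }
        String trimmed = tipLabel.trim();
        return speciesPattern.matcher(trimmed).matches() ? trimmed : GARBLED;
    }
}
```

With this, `markIfGarbled("Escherichia coli", DEFAULT_SPECIES_PATTERN)` keeps the label, while a tip such as `E$cher1chia c0/i` becomes `GARBLED`, so the tree stays valid Newick and the bad tips are easy to count afterwards.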

rossmounce commented on July 23, 2024

The latter seems reasonable to me, but perhaps too strict (?)

I was imagining perhaps just a post-OCR filter to remove all special characters: replace each with nothing (no space) or, for certain special characters, with the most likely intended character, e.g. replace $ with S.
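Roughly what such a filter could look like, as a sketch only. The class name and the set of characters kept (letters, digits, space, full stop, hyphen) are assumptions; the only substitution taken from this comment is $ to S.

```java
import java.util.Map;

/** Hypothetical post-OCR filter: substitutes a few likely misreads and drops other special characters. */
final class SpecialCharacterFilter {
    /** Likely replacements; only '$' -> 'S' is from the discussion, more would be added by hand. */
    private static final Map<Character, Character> LIKELY = Map.of('$', 'S');

    private SpecialCharacterFilter() {}

    static String clean(String ocrText) {
        StringBuilder out = new StringBuilder(ocrText.length());
        for (char c : ocrText.toCharArray()) {
            Character replacement = LIKELY.get(c);
            if (replacement != null) {
                out.append(replacement.charValue()); // known misread: substitute
            } else if (isAllowed(c)) {
                out.append(c);                       // ordinary character: keep
            }
            // anything else is dropped and replaced with nothing (no space)
        }
        return out.toString();
    }

    /** Assumed whitelist for taxon-like labels. */
    private static boolean isAllowed(char c) {
        return Character.isLetterOrDigit(c) || c == ' ' || c == '.' || c == '-';
    }
}
```

For example, `clean("$treptomyces gri#seus")` gives `Streptomyces griseus`. A filter like this only makes sense where "$" can never be intended.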

petermr commented on July 23, 2024

The problem is that the OCR doesn't know what it is being used for. Maybe I can add a pre-OCR filter to limit the characters allowed. But this will have to wait. (Phylo is the only plugin that uses OCR ATM). I'll think about it.
This is a classic problem (https://en.wikipedia.org/wiki/Error_detection_and_correction): we can detect errors, but can we correct them? This will require heuristics.

petermr commented on July 23, 2024

Ross>>I was imagining perhaps just a post-OCR filter to remove all special characters.

We already have one. I'll look it up. It corrects "ch/o" to "chlo", etc.

petermr commented on July 23, 2024

See "/org/xmlcml/norma/images/ocr/italicGarbles.xml" in norma

<garbles title="italicGarbles" characters="I/">
  <!-- italic "l" -->
  <garble original="a/" edited="al"/>
  <garble original="e/" edited="el"/>
  <garble original="i/" edited="il"/>
  <garble original="I/" edited="ll"/>
  <garble original="l/" edited="ll"/>
  <garble original="o/" edited="ol"/>
  <garble original="u/" edited="ul"/>
  <garble original="y/" edited="yl"/>
  <garble original="r/o" edited="rio"/>
  <garble original="s/c" edited="sic"/>
  <garble original="g/uc" edited="gluc"/>
  <garble original="g/yc" edited="glyc"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="f/uo" edited="fluo"/>
  <garble original="d/\[" edited="cili"/>
  <garble original="/\[" edited="il"/>
  <garble original="t/G" edited="tic"/>
  <garble original="k/n" edited="kin"/>

  <garble original="lI" edited="ll"/>

  <!-- these may be ambiguous -->
  <garble original="C/" edited="cl"/>
  <garble original="c/" edited="ci"/>
  <garble original="/Us" edited="ius"/>

  <!-- very ambiguous -->
  <garble original="n'x" edited="rix"/>

  <!-- ligatures -->
  <garble original="ﬁ" edited="fi"/>
</garbles>

These are believable garbles. The problem is that some images are not appropriate for analysis.
P.
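To illustrate how a table like this gets applied, here is a sketch under stated assumptions. It is not the norma implementation: `GarbleCorrector` is a hypothetical name, it hard-codes a handful of the entries above instead of reading the XML, and it treats each `original` as a literal string (the real file may use regex syntax, as the escaped bracket suggests).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical corrector: applies garble -> edited substitutions in insertion order. */
final class GarbleCorrector {
    /** Insertion-ordered so longer, more specific garbles can be tried before short ones. */
    private final Map<String, String> garbles = new LinkedHashMap<>();

    GarbleCorrector add(String original, String edited) {
        garbles.put(original, edited);
        return this;
    }

    /** Replaces every literal occurrence of each garble with its edited form. */
    String correct(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : garbles.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    /** A small subset of the italic "l" entries, longest patterns first. */
    static GarbleCorrector italicSubset() {
        return new GarbleCorrector()
                .add("g/uc", "gluc")
                .add("f/uo", "fluo")
                .add("a/", "al")
                .add("i/", "il")
                .add("o/", "ol")
                .add("lI", "ll");
    }
}
```

For instance, `GarbleCorrector.italicSubset().correct("Pseudomonas f/uorescens")` returns `Pseudomonas fluorescens`. Order matters: if the short entries ran first, a multi-character garble could be half-rewritten before its own rule is tried.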

petermr commented on July 23, 2024

The architecture has the OCR in norma. At this stage we don't know the
context, and if we are doing money-related or computer-related documents
"$" may be intended.

The interpretation has to be at the domain level. That means ami-phylo in
this case.

P.

petermr commented on July 23, 2024

Your comments are very useful. As a first pass we can say:

  • negative lengths are a programming error
  • commas are reserved characters and should be replaced

Because there is no agreed Newick validator we have to guess the rest of the rules. For example, Trex says that a Newick file must have at least 3 taxa. Is that true of all validators? Does Newick require a valid taxon name? I don't know, but I think some of my files with only node IDs have passed Trex.

The appropriate action is to compile a list of failing images and create a failing test from them, analogous to MergeTipTest.testConvertLabelsAndTreeAndMerge(). I'll incorporate them and gradually debug.
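The two first-pass rules (negative lengths, reserved commas) could be checked along these lines. This is a sketch, not a Newick validator: the class and method names are hypothetical, the underscore used to replace commas is an assumption (the replacement character has not been decided here), and the length check is a plain text scan that ignores quoted labels.

```java
import java.util.regex.Pattern;

/** Hypothetical first-pass checks for a Newick string and its tip labels. */
final class NewickFirstPassChecks {
    /** A ':' followed directly by a minus sign and a digit, e.g. ":-0.2". */
    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":\\s*-\\d");

    private NewickFirstPassChecks() {}

    /** True if any branch length in the string is negative (treated as a programming error). */
    static boolean hasNegativeBranchLength(String newick) {
        return NEGATIVE_LENGTH.matcher(newick).find();
    }

    /** Replaces commas, which are reserved in Newick, before a label is written into a tree. */
    static String sanitizeLabel(String label) {
        return label.replace(',', '_'); // '_' is an assumed replacement
    }
}
```

`hasNegativeBranchLength("(A:0.1,B:-0.2);")` is true, while a length such as `1e-5` is not flagged because the minus sign does not follow the colon directly. A failing-image test could assert both conditions on each generated tree.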

petermr commented on July 23, 2024

Trying to check your failures. Most are not in the https://github.com/rossmounce/pluto-ONS/tree/master/testing/output directory. I suspect some numbers may not be correct. As before, please collect just the failing ones.

rossmounce commented on July 23, 2024

They are all there, all 4000+. If you are browsing via github.com with a web browser it will show you just the first 1000 files or folders. Perhaps this is the problem? If you clone the repo locally to your hard drive everything should be there.

petermr commented on July 23, 2024

OK, thx
Probably a good idea to select out the problem files.

rossmounce commented on July 23, 2024

Which files? The source image file?
If you know its ID you can go directly to it in a web browser, without even cloning the whole repo.

Just append the ID to this URL, e.g.

https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ + ijs.0.000174-0-000.pbm.png

= https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/ijs.0.000174-0-000.pbm.png

The structure and file names are all standardised.
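If it helps when collecting the failing ones, the same concatenation as a tiny sketch; the class and method names are made up for illustration, and only the base URL and the example ID come from this thread.

```java
/** Hypothetical helper: builds the browser URL for one output file from its ID. */
final class OutputUrls {
    static final String BASE =
            "https://github.com/rossmounce/pluto-ONS/tree/master/testing/output/";

    private OutputUrls() {}

    /** Appends the file ID to the base URL; no escaping is done, since the IDs are plain file names. */
    static String urlFor(String fileId) {
        return BASE + fileId;
    }
}
```

Calling `OutputUrls.urlFor("ijs.0.000174-0-000.pbm.png")` yields the full URL shown above.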
