contentmine / phylotree
A repository for ami-phylotree development
Current main loop for extracting fields from a value. value is the raw surface; editRecord (an EditList) stores the edits; extractionList stores the final extractions as Extraction name-value pairs. editedBuilder is the new surface, but should be superseded by extractionList.
package org.xmlcml.norma.editor;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternElement {

    // Excerpt only: LOG, editRecord, extractionList, editedBuilder, createPattern(),
    // getField() and insertSpace() are declared elsewhere (not shown).

    public String createEditedValueAndRecord(String value) {
        String newValue = null;
        Pattern pattern1 = createPattern();
        Matcher matcher = pattern1.matcher(value);
        editRecord = new EditList();
        extractionList = new ArrayList<Extraction>();
        // only a match of the whole value is processed; otherwise null is returned
        if (matcher.matches()) {
            editedBuilder = new StringBuilder();
            // capture group i corresponds to field i - 1
            for (int i = 1; i <= matcher.groupCount(); i++) {
                String group = matcher.group(i);
                FieldElement field = getField(i - 1);
                group = field.applySubstitutions(group);
                Extraction extraction = new Extraction(field.getNameAttribute(), group);
                extractionList.add(extraction);
                // keep the per-field edits only when something was changed
                EditList fieldRecord = field.getEditRecord();
                if (fieldRecord.size() > 0) {
                    LOG.trace("fr1 " + fieldRecord);
                    this.editRecord.add(fieldRecord);
                }
                insertSpace(field);
                editedBuilder.append(group);
            }
            newValue = editedBuilder.toString();
        }
        LOG.debug(">>" + extractionList);
        return newValue;
    }
}
A considerable number of trees have unannotated edges
<edge>
instead of
<edge source="foo" target="bar"/>
Ross Mounce to identify 2-3 such examples and commit as failing tests.
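A failing test for this could start from a check like the following. This is only a sketch: the class and method names are hypothetical and not part of ami-phylo, and it assumes the NeXML is available as a string. It counts edge elements that have no source or no target attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ami-phylo.
final class UnannotatedEdgeCheck {

    /** Counts edge elements that lack a non-empty source or target attribute. */
    static int countUnannotatedEdges(String nexml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(nexml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches edge elements in any namespace.
            NodeList edges = doc.getElementsByTagNameNS("*", "edge");
            int unannotated = 0;
            for (int i = 0; i < edges.getLength(); i++) {
                Element edge = (Element) edges.item(i);
                // getAttribute returns "" when the attribute is absent.
                if (edge.getAttribute("source").isEmpty()
                        || edge.getAttribute("target").isEmpty()) {
                    unannotated++;
                }
            }
            return unannotated;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse NeXML", e);
        }
    }
}
```

A test would then assert that the count is zero for every tree in the example set.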
The tree in Newick format should be saveable through a commandline option:
ami-phylo --ph.newick [CTree filename]
will output the Newick to a subdirectory (yet to be determined) of CTree
All successful parses will result in
<genus> <species> <strain> <egid>
These are then further validated via lookup to remove cases where
<genus+species> != <binomial looked up from egid>
All tips failing these tests should be pruned from the tree used for further processing ("validTree"). The incorrect OTUs can be retained but not used.
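The filtering step might look like the sketch below. The Tip class and the egid-to-binomial map are illustrative assumptions; the map stands in for whatever EGID lookup is eventually used. Tips whose EGID is unknown to the lookup are also dropped here, which is a design choice rather than something specified above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; Tip and the lookup map are illustrative names.
final class TipValidator {

    /** A successfully parsed tip: genus, species, strain and EGID. */
    static final class Tip {
        final String genus;
        final String species;
        final String strain;
        final String egid;

        Tip(String genus, String species, String strain, String egid) {
            this.genus = genus;
            this.species = species;
            this.strain = strain;
            this.egid = egid;
        }

        String binomial() {
            return genus + " " + species;
        }
    }

    /** Keeps only tips whose OCR binomial equals the binomial looked up from the EGID. */
    static List<Tip> validTips(List<Tip> tips, Map<String, String> binomialByEgid) {
        List<Tip> valid = new ArrayList<>();
        for (Tip tip : tips) {
            String lookedUp = binomialByEgid.get(tip.egid); // null if the EGID is unknown
            if (tip.binomial().equals(lookedUp)) {
                valid.add(tip);
            }
        }
        return valid;
    }
}
```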
ami-phylo
will run from the commandline and should, as far as possible, manage options from there. Please list here the options that should be part of the commandline.
We should try improving our OCR output by restricting tesseract to a whitelist of characters. This StackOverflow post appears to detail how this can be done very simply/easily.
http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for
I think we should NOT include these characters in the whitelist:
\ / $ % ^ & # ! ~ £
Of course we'd need to test the effect of this change. I will try and find example files that contain these types of characters in the 'raw' unmodified tesseract output. Then compare that output with the whitelist-tesseracted output.
The IJSEM phylotrees have an uncontrolled tip syntax:
genus species strain strain1? \(AAdddddd | AAAddddd | NC_dddddd\)
However we can't assume this in general. What other syntaxes are we likely to have to tackle?
genus species
or
genus
are common, but there are many others. Can we come up with heuristics that work "most of the time"? Otherwise we are going to have to define this for each tree, which defeats the automation.
If a tree is corrupted (e.g. has small gaps so splits into several trees) then much of the valid HOCR output is not assigned. This can be used to reject broken trees.
The NEXML from ami-phylo should be saveable through a commandline option:
ami-phylo --ph.nexml [CTree filename]
will output the NEXML to a subdirectory (yet to be determined) of CTree
Provide a mechanism for validating *.nwk output.
As an example http://libpll.org/api/group__newickParseGroup.html defines a valid tree as
Validate if a newick tree is a valid phylogenetic tree.
A valid tree is one where the root node is binary or ternary and all other internal nodes are binary. In case the root is ternary then the tree must contain at least another internal node and the total number of nodes must be equal to $2l - 2$, where $l$ is the number of leaves. If the root is binary, then the total number of nodes must be equal to $2l - 1$.
This implies that all multifurcations, apart from a ternary root, should be expanded into binary nodes.
Is this a satisfactory validator? Does it also validate node labels, etc.?
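The node-count part of that definition is easy to restate in code. The sketch below is not libpll's implementation; it only encodes the two counting rules quoted above, and it assumes the caller has already counted the root's children, the leaves and the total nodes. It does not check that every other internal node is binary, nor anything about labels.

```java
// Sketch of the node-count rule quoted above; names are hypothetical.
final class NewickNodeCount {

    /**
     * @param rootChildren number of children of the root (2 or 3 to be valid)
     * @param leaves       number of leaves, l
     * @param totalNodes   total number of nodes, leaves included
     */
    static boolean hasValidNodeCount(int rootChildren, int leaves, int totalNodes) {
        int internalNodes = totalNodes - leaves;
        if (rootChildren == 3) {
            // ternary root: at least one internal node besides the root, and 2l - 2 nodes
            return internalNodes >= 2 && totalNodes == 2 * leaves - 2;
        }
        if (rootChildren == 2) {
            // binary root: 2l - 1 nodes
            return totalNodes == 2 * leaves - 1;
        }
        return false;
    }
}
```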
To create a tool which removes tip nodes (primarily because they have no labels). This should be done on nexml. Any resultant poorly formed nodes (e.g. with 1 or 0 children) should be deleted/tidied.
The toplevel pages of this repository should describe the project, its methodology, its specifications, its error types, and links to other resources. Phylotree should be a showcase for other CM projects and it should be possible for someone to understand the process in abstract even if they don't understand phylogenetics in detail.
When Tesseract fails the subprocess should abort to avoid errors such as:
6059 [main] ERROR org.xmlcml.norma.image.ocr.ImageToHOCRConverter - Process failed to terminate after :60
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter - creating HTML output: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.html
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter - creating hocr.hocr name: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.hocr
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter - failed to create: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.html or /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.hocr
6059 [main] ERROR org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor - cannot run tesseract
6063 [main] DEBUG org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor - cTreeLog: [org.xmlcml.cmine.args.log.CMineLog: log]
6063 [main] DEBUG org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor - outputResultElement NYI null; need to add tree
In fact this is due to FileNotFound, which should be trapped before trying Tesseract.
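A guard of roughly this shape would trap the missing file before the subprocess is started. It is a sketch with hypothetical names, not the actual ImageToHOCRConverter code, and it only uses standard java.nio.file checks.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-flight check, to be called before launching Tesseract.
final class OcrInputCheck {

    /** True only if the image path points at an existing, readable regular file. */
    static boolean isUsableImage(Path image) {
        return image != null && Files.isRegularFile(image) && Files.isReadable(image);
    }
}
```

When the check fails the caller can log a single clear error for that CTree and skip the OCR step, instead of waiting for the 60 second timeout seen in the log above.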
The HTML from HOCR (Tesseract) can be converted to SVG and should be saveable through a commandline option:
ami-phylo --hocr.svg [CTree filename]
will output the raw SVG to a subdirectory (yet to be determined) of CTree
See https://github.com/petermr/cmine/issues/2
The main loop over multiple CTrees should include all processing. Currently ami-phylo executes runAndOutput() over multiple CTrees and then runs the NexML analysis. This means that only the last CTree is analyzed.
Solution:
(a) immediate: break processImage() in ami-phylo into per-CTree modules (driven by args.xml)
(b) medium: separate argProcessor from per-CTree analyses.
NCBI taxdump has approximate format:
9 | Buchnera aphidicola | | scientific name |
9 | Buchnera aphidicola Munson et al. 1991 | | synonym |
10 | "Cellvibrio" Winogradsky 1929 | | synonym |
10 | Cellvibrio | | scientific name |
10 | Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003 | | synonym |
11 | "Cellvibrio gilvus" Hulcher and King 1958 | | authority |
11 | Cellvibrio gilvus | | equivalent name |
11 | [Cellvibrio] gilvus | | scientific name |
13 | Dictyoglomus | | scientific name |
13 | Dictyoglomus Saiki et al. 1985 | | authority |
14 | ATCC 35947 | | type material |
14 | DSM 3960 | | type material |
14 | Dictyoglomus thermophilum | | scientific name |
14 | Dictyoglomus thermophilum Saiki et al. 1985 | | authority |
14 | strain H-6-12 | | type material |
16 | Methyliphilus | | equivalent name |
16 | Methylophilus | | scientific name |
16 | Methylophilus Jenkins et al. 1987 | | synonym |
16 | Methylotrophus | | misspelling |
17 | ATCC 53528 | | type material |
17 | DSM 46235 | | type material |
17 | LMG 6787 | | type material |
17 | Methyliphilus methylitrophus | | equivalent name |
17 | Methyliphilus methylotrophus | | equivalent name |
17 | Methylophilus methylitrophus | | equivalent name |
17 | Methylophilus methylotrophus | | scientific name |
17 | Methylophilus methylotrophus Jenkins et al. 1987 | | authority |
17 | Methylophilus sp. CBMB147 | | includes |
17 | Methylotrophus methylophilus | | synonym |
17 | NCIB 10515 | | type material |
17 | NCIMB 10515 | | type material |
17 | VKM B-1623 | | type material |
18 | Pelobacter | | scientific name |
18 | Pelobacter Schink and Pfennig 1983 | | authority |
19 | DSM 2380 | | type material |
19 | NBRC 103641 | | type material |
19 | Pelobacter carbinolicus | | scientific name |
19 | Pelobacter carbinolicus Schink 1984 | | authority |
19 | strain Gra Bd 1 | | type material |
20 | Phenylobacterium | | scientific name |
We will develop this as a local lookup for Genus and species
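As a starting point for the local lookup, the rows above can be reduced to a map from scientific name to taxon id. This is a sketch only: the class name is hypothetical, and it assumes the pipe-separated layout shown above (id, name, an unused column, name class), ignoring every row whose class is not "scientific name".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for rows such as "9 | Buchnera aphidicola | | scientific name |".
final class TaxdumpNames {

    /** Maps each scientific name to its taxon id; other name classes are skipped. */
    static Map<String, Integer> scientificNames(List<String> lines) {
        Map<String, Integer> idByName = new LinkedHashMap<>();
        for (String line : lines) {
            String[] columns = line.split("\\|", -1);
            if (columns.length < 4) {
                continue; // not a data row
            }
            if (!"scientific name".equals(columns[3].trim())) {
                continue; // synonym, authority, type material, ...
            }
            try {
                idByName.put(columns[1].trim(), Integer.valueOf(columns[0].trim()));
            } catch (NumberFormatException e) {
                // malformed id: skip the row
            }
        }
        return idByName;
    }
}
```

Genus and species lists can then be derived by splitting the keys on whitespace; note that some scientific names contain brackets, such as [Cellvibrio] gilvus above, and will need special handling.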
In some cases Tesseract splits labels into two lines. An example is Escherichia coli in /ami-plugin/src/test/resources/org/xmlcml/ami2/phylo/15goodtree/ijs.0.000364-0-004.pbm.png which is split after the s.
This is detected and probably mended by keeping track of unused phrases in labels.
In ijs_0_000364_0_003 the output appears to have a trifurcation. What does this do to downstream programs? Do they:
Here's the Newick
(((((:195.0,:135.0)NT1.25:86.0,:301.0)NT1.17:131.0,(:251.0,:364.0)NT1.15:106.0)NT1.5:12.0,(:223.0,:333.0)NT1.19:158.0)NT1.2:10.0,((:441.0,(((:295.0,((((:140.0,:164.0)NT1.28:63.0,:165.0)NT1.22:24.0,(:183.0,:185.0)NT1.26:55.0)NT1.21:78.0,(:186.0,:194.0)NT1.23:105.0)NT1.14:56.0)NT1.10:21.0,(:279.0,:302.0)NT1.16:107.0)NT1.8:8.0,((:189.0,:191.0)NT1.27:187.0,:343.0)NT1.12:55.0)NT1.6:11.0,:497.0)NT1.3:6.0,((:348.0,(:124.0,:167.0)NT1.24:205.0)NT1.7:12.0,(:484.0,((:260.0,:320.0)NT1.13:33.0,((:157.0,:181.0)NT1.20:21.0,:152.0)NT1.18:91.0)NT1.11:22.0)NT1.9:32.0)NT1.4:11.0)NT1.1:9.0)NT1.60;
(Tip labels have been lost - can be restored if necessary)
IOW does ami-phylo have to break multifurcations into binary?
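Whatever the answer, multifurcations can at least be detected directly from the Newick string. The sketch below (hypothetical names, assuming unquoted labels that contain no parentheses or commas) reports the largest number of children of any node, so a value above 2 means the tree is not strictly binary.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper; assumes unquoted labels without '(' ')' or ','.
final class NewickArity {

    /** Returns the largest number of children of any internal node. */
    static int maxChildren(String newick) {
        Deque<Integer> commasPerOpenNode = new ArrayDeque<>();
        int max = 0;
        for (char c : newick.toCharArray()) {
            if (c == '(') {
                commasPerOpenNode.push(0);
            } else if (c == ',' && !commasPerOpenNode.isEmpty()) {
                commasPerOpenNode.push(commasPerOpenNode.pop() + 1);
            } else if (c == ')') {
                if (commasPerOpenNode.isEmpty()) {
                    throw new IllegalArgumentException("unmatched ')'");
                }
                // n commas inside one pair of parentheses separate n + 1 children
                max = Math.max(max, commasPerOpenNode.pop() + 1);
            }
        }
        if (!commasPerOpenNode.isEmpty()) {
            throw new IllegalArgumentException("unmatched '('");
        }
        return max;
    }
}
```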
To throw an error if the number of tips is significantly lower than the number of tips in image.
Test case image 745
Copy 50 images of square trees from IJSEM via Ross Mounce and use as main test material. Use to determine types of processing error and corrective actions.
Image ID: 65397-0-000 (image below)
Tenacibaculum amylolyticum MBIC 4355T (AF032505)
Yet if you look up AF032505 on GenBank/ENA it's an HIV-1 isolate: "HIV-1 isolate C_1f from Baltimore, envelope glycoprotein V1V2 region (env) gene, partial cds".
The correct EGID for this tip label is almost certainly AB032505. If you search for the genus & species name, there are just 74 hits, hence it was easy for me to pick out the likely typo in this case.
List all tasks that will need to be carried out for extraction of trees, checking, validation, lookup and then creation of supertree. Create issues for each separate task. These tasks should ideally provide a record of the work as well as the design
Ross Mounce:
almost all newick files have one or more missing tip labels so today I'm just going to plough through adding in the missing labels, manually. After this is done I will do GenBank lookup to go from GB accession number -> GB taxon ID
(just to say, this is all version controlled on github, so no change will go undocumented...)
ISSUE: describe this problem precisely and attempt to formulate primary causes
R and p4 / STK2 seem to conflict over whether trees are valid or not 😞
Newick file: ijs.0.65514-0-000.pbm.nwk
((((D0062743:155.0,(M62795:103.0,(AFO78775:118.0,ABO78049:61.0)NT1.12:48.0)NT1.9:25.0)NT1.7:33.0,ABO78055:121.0)NT1.5:15.0,(A3278570:99.0,(AB264798:31.0,D12657:57.0)NT1.10:53.0)NT1.8:49.0)NT1.3:86.0,((EF407879:242.0,(AB192292:146.0,(M62798:135.0,DQ457019:206.0)NT1.6:33.0)NT1.4:17.0)NT1.2:36.0,(DQ244076:27.0,00244077:29.0)NT1.11:157.0)NT1.1:33.0)NT1.27;
This is fine for R, which can read it in and plot it, but p4 / STK2 reports an unmatched parenthesis:
***Error: failed to parse a tree in your data set.
Error parsing tree
Tree.parseNewick(), tree 't0'
Unmatched unparen.
((((D0062743:155.0,(M62795:103.0,(AFO78775:118.0,ABO78049:61.0)NT1.12:48.0)NT1.9:25.0)NT1.7:33.0,ABO78055:121.0)NT1.5:15.0,(A327
@petermr do you also have lookup code for converting accession number -> taxonID for the NeXML / Newick or should I create some? Happy either way just need to know.
The raw SVG (Tesseract) can be combined with the tree SVG and should be saveable through a commandline option:
ami-phylo --ph.svg [CTree filename]
will output the combined SVG to a subdirectory (yet to be determined) of CTree
Looking through output from my latest code test using the 50-image test set... when NeXML is output (which it isn't always), all the OTU labels are entirely numerical (if present). Are we using a digits-only tesseract whitelist dictionary? Seems like it to me. Sample NeXML output from one file (ijs.0.000653-0-000.pbm.nexml.xml) below:
Perhaps I need to edit a config file in my usr/local ... tessdata to mirror what you have on your machine, Peter?
Full output for all 50 files is on github here: https://github.com/rossmounce/pluto-ONS/tree/master/testing/50-images
<?xml version="1.0" encoding="UTF-8"?>
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<otus label="RootTaxaBlock">
<otu id="otu1"/>
<otu id="otu2">560171118 171777113 11511111 124647 03162681</otu>
<otu id="otu3">336171113 131136110113 08111 147301 0114221451</otu>
<otu id="otu4">310171115 50111363111611 0500462681</otu>
<otu id="otu5">530171118 81081011111738 138114 4851 0064361</otu>
<otu id="otu6"> 1380171113 1101167113 11116 218371 0415425121</otu>
<otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>
<otu id="otu8">3520111115 11911113 081111 13201 1130211851</otu>
<otu id="otu9">860171115 9615111111 1.11116 218801 0115513291</otu>
<otu id="otu10">3670171113 5111111115 11000 17691 01606461</otu>
<otu id="otu11">3610171113 1602111917313 1011 132851 9113730181</otu>
</otus>
<trees>
<tree id="T1">
<node id="NT1.1" label="NT1.1" x="0.0" y="457.0" otu="otu1" root="true"/>
<node id="NT1.2" label="NT1.2" x="244.0" y="375.0"/>
<node id="NT1.3" x="308.0" y="259.0" label="67"/>
<node id="NT1.4" x="411.0" y="186.0" label="82"/>
<node id="NT1.5" x="467.0" y="138.0" label="65"/>
<node id="NT1.6" label="NT1.6" x="480.0" y="93.0"/>
<node id="NT1.7" x="493.0" y="330.0" label="99"/>
<node id="NT1.8" x="553.0" y="375.0" label="85"/>
<node id="NT1.9" x="641.0" y="413.0" label="99"/>
<node id="NT1.10" label="NT1.10" x="688.0" y="132.0" otu="otu2"/>
<node id="NT1.11" label="NT1.11" x="716.0" y="336.0" otu="otu3"/>
<node id="NT1.12" label="NT1.12" x="721.0" y="439.0" otu="otu4"/>
<node id="NT1.13" x="725.0" y="55.0" label="100"/>
<node id="NT1.14" label="NT1.14" x="727.0" y="491.0" otu="otu5"/>
<node id="NT1.15" label="NT1.15" x="746.0" y="29.0" otu="otu6"/>
<node id="NT1.16" label="NT1.16" x="756.0" y="81.0" otu="otu7"/>
<node id="NT1.17" label="NT1.17" x="763.0" y="183.0" otu="otu8"/>
<node id="NT1.18" label="NT1.18" x="835.0" y="285.0" otu="otu9"/>
<node id="NT1.19" label="NT1.19" x="847.0" y="234.0" otu="otu10"/>
<node id="NT1.20" label="NT1.20" x="858.0" y="388.0" otu="otu11"/>
<edge source="NT1.15" target="NT1.13"/>
<edge source="NT1.16" target="NT1.13"/>
<edge source="NT1.10" target="NT1.6"/>
<edge source="NT1.17" target="NT1.5"/>
<edge source="NT1.19" target="NT1.4"/>
<edge source="NT1.1" target="NT1.2"/>
<edge source="NT1.11" target="NT1.8"/>
<edge source="NT1.12" target="NT1.9"/>
<edge source="NT1.18" target="NT1.7"/>
<edge source="NT1.20" target="NT1.9"/>
<edge source="NT1.14" target="NT1.2"/>
<edge source="NT1.13" target="NT1.6"/>
<edge source="NT1.6" target="NT1.5"/>
<edge source="NT1.5" target="NT1.4"/>
<edge source="NT1.4" target="NT1.3"/>
<edge source="NT1.3" target="NT1.2"/>
<edge source="NT1.3" target="NT1.7"/>
<edge source="NT1.7" target="NT1.8"/>
<edge source="NT1.8" target="NT1.9"/>
</tree>
</trees>
</nexml>
Here's the full list of 16 erroneous Newick files:
description: problem fairly obvious here. OCR has interpreted a . as a , (comma). "B,multivorans"
In Newick commas are special characters so this causes problems.
description: a very odd file with 199 empty/unlabelled tips! It makes a lot of sense when you view the source image: https://github.com/rossmounce/pluto-ONS/blob/master/testing/output/ijs.0.001149-0-003.pbm.png/ijs.0.001149-0-003.pbm.png AMI has clearly quite faithfully tried to interpret this odd image, it just isn't valid Newick.
description: this is the problem bit: ":::::::::::::47.0" unknown cause
description: comma inserted in taxon label: "X74685,D25307"
description: very poor source image, poor OCR output, comma in taxon label: "(,V/lbrloxuuLMG21346T"
description: not sure what problem is here, some single quote ' marks perhaps?
description: not sure. Perhaps the slash symbol / or the single quote marks '
description: possibly the negative length branches ":-59.0,Oxyrrhismarina" and there's a square bracket symbol [ "Chlamydomonas[noen" and single quote marks
description: taxon label with a dollar sign in it "US$433:27.0" also negative branch lengths ":-28.0,99:-16.0,:-17.0,99:-6.0"
description: at least three slashes in taxon labels "B.Vinson/isubsp"
description: loads of odd symbols ":26.0,5'5°" lots of exclamation marks "V.loge!35077:37.0" pound symbol "£31"
description: negative branch lengths "(:-2.0,:-5.0)NT1.16:-4.0" and a single quote mark
description: slash "Methanobacler/umIvanovii" single quotes and negative branch lengths "(:-1.0,:182.0)"
description: colon in taxon name "Peptostreptococcusmicro::135.0,"
description: single quote marks and double quote marks
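Several of these causes can be flagged mechanically before a file reaches R or p4. The sketch below uses hypothetical names and assumes unquoted labels made of letters, digits, underscore, period and hyphen, which is an assumption rather than a rule taken from any parser. It reports negative branch lengths, repeated colons and unexpected characters. A stray comma inside a label cannot be detected this way because it is indistinguishable from a structural comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical lint pass over a Newick string; the allowed character set is an assumption.
final class NewickLint {

    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":-[0-9]");
    private static final Pattern ODD_CHARACTER = Pattern.compile("[^A-Za-z0-9_.:,;()-]");

    /** Returns a description of each kind of problem found, or an empty list. */
    static List<String> problems(String newick) {
        List<String> found = new ArrayList<>();
        if (NEGATIVE_LENGTH.matcher(newick).find()) {
            found.add("negative branch length");
        }
        if (newick.contains("::")) {
            found.add("repeated colon");
        }
        if (ODD_CHARACTER.matcher(newick).find()) {
            found.add("unexpected character");
        }
        return found;
    }
}
```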
In PhyloTreeArgProcessor.createNexmlAndTreeFromPixels(File inputImageFile):
phyloTreePixelAnalyzer = createAndConfigurePixelAnalyzer(image);
diagramTree = phyloTreePixelAnalyzer.processImageIntoGraphsAndTree();
LOG.debug("processImageIntoGraphsAndTree finished");
The processing spends a lot of time identifying and trying to process the characters. Tesseract provides bounding boxes which could be used to remove characters before processing the tree. Alternatively we could try to identify small pixel islands and not analyse them.
Ross Mounce wrote:
##Pre-processing steps
#Put all Newick trees into one file, one per line
for i in *.nwk ; do cat $i >> testree.tre ; sed -i -e '$a\' testree.tre ; done
#Remove Newick strings that are 70 characters or less. This selects for trees containing four labelled taxa or more
awk 'length($0)>70' testree.tre > filteredtrees.tre
#Flatten from unicode to ascii text
iconv -f utf-8 -t ascii//translit filteredtrees.tre -o asciitrees.tre
# grep -P '[^\x00-\x7f]' filteredtrees.tre to see the unicode chars
#Remove hyphens
sed -i 's/-//g' asciitrees.tre
# stk/p4 does not like taxa with labels solely composed of numbers
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
# stk/p4 does not like unmatched ' symbols
sed -i 's/'\''//g' asciitrees.tre
# stk/p4 does not like unmatched " symbols
sed -i 's/[\"]//g' asciitrees.tre
# stk/p4 does not like / symbols
# stk/p4 does not like taxa beginning with . symbol
# stk/p4 does not like taxa beginning with , symbol
I'll comment with suggested actions
In some images there is very poor matching of tips to labels (e.g. ijs_0_000174_0_000). This may be due to large x-distances between tip and LHS of label.
Strategies:
In any case the joining criterion should use a rectangle of error rather than a circle.
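A rectangular criterion could be as simple as the sketch below. All names and tolerances are hypothetical, and it assumes labels sit to the right of their tips, so the allowed x-distance can be much larger than the allowed y-distance.

```java
// Hypothetical matching criterion; assumes labels lie to the right of their tips.
final class TipLabelMatcher {

    /**
     * True if the left edge of the label lies within maxDx to the right of the tip
     * and the vertical middle of the label lies within maxDy of the tip.
     */
    static boolean withinRectangle(double tipX, double tipY,
                                   double labelLeftX, double labelMidY,
                                   double maxDx, double maxDy) {
        double dx = labelLeftX - tipX;
        double dy = Math.abs(labelMidY - tipY);
        return dx >= 0 && dx <= maxDx && dy <= maxDy;
    }
}
```

With a circle of radius 10, a label 72 pixels to the right of its tip but only 2 pixels off vertically would be rejected; with maxDx = 100 and maxDy = 5 it is accepted, while a label on the neighbouring line is still excluded.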
Create a logfile recording ERROR/INFO/DEBUG for each image processed. This logfile can be used for:
The OCR process provides both a binomial and an EGID (ENAGenbankID). This issue is to devise a strategy for reconciling conflicts in interpretation.
At the OCR level the parse is checked, substituted (for incorrect types of character, e.g. punctuation), and validated against a template syntax. All discussion below relates to valid syntax (NOT necessarily valid content).
OCR
-> OCR_Binomial
-> OCR_EGID
EGID and Binomial are then looked up, and there is a matrix of:
... TBC ...
The genus and species fields can be checked against local taxdump/genus.txt and taxdump/species.txt.
(a) write generic TaxdumpLookup
(b) integrate it to ami-phylo
None of our NeXML files is quite valid at the moment, as validated against: http://www.nexml.org/nexml/phylows/validator
we currently have:
...
<otus label="RootTaxaBlock" xmlns:cmphy="http://contentmine.org/phylotree">
<otu id="otu1">Marinobacler hydrocarbonoclasticus ATCC 49840T (ABO19148)</otu>
<otu id="otu2">Microbulbifer hydrolyticus DSM 11525T (U5813138)</otu>
...
</otus>
<trees>
<tree id="T1">
<node id="NT1.1" label="NT1.1" x="119.0" y="292.0"/>
...
Instead, we need a reference from trees to the otus tag e.g.:
...
<otus about="#Tls34455" id="Tls34455" xml:base="http://purl.org/phylo/treebase/phylows/taxon/TB2:" label="RootTaxaBlock" xmlns:cmphy="http://contentmine.org/phylotree">
<otu id="otu1">Marinobacler hydrocarbonoclasticus ATCC 49840T (ABO19148)</otu>
<otu id="otu2">Microbulbifer hydrolyticus DSM 11525T (U5813138)</otu>
...
</otus>
<trees otus="Tls34455">
<tree id="T1">
<node id="NT1.1" label="NT1.1" x="119.0" y="292.0"/>
...
The primary output of ami-phylo is NEXML and the current output validates against nexml.org.
I propose we use NEXML to aggregate per-OTU logging information.
read tree format (e.g. ijsem.xml)
and generate regexes (level0= detect, level1=correct) and actions (abort, record error, etc.)
Run HOCR
Run diagramanalyzer
merge to identify tips (else we analyze other non-tip text)
foreach tip {
check text against ijsem.xml (detect)
if (ok) {
create tip label in extended NEXML
} else {
correct text against level1
if (ok) create tip, with edit record
}
if (!ok(tip)) {
action(tip)
}
}
The proposed extension will be something like
original:
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<otus label="RootTaxaBlock">
<otu id="otu7">Jonquetella anthropi E3_33 (EU840722)</otu>
...
</otus>
</nexml>
...
new:
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:cm="http://www.contentmine.org/ami-phylo">
<otus label="RootTaxaBlock">
<otu id="otu7" cm:genus="Jonquetella" cm:species="anthropi" cm:strain="E3_33" cm:ena="EU840722">Jonquetella anthropi E3_33 EU840722</otu>
...
</otus>
</nexml>
This introduces a new namespace (for contentmine) and allows us to annotate without crashing. Normal NEXML parsers will ignore our new attributes. It makes it easy to extract information using XPath, e.g. search for all genus except Homo:
//otu/@cm:genus[not(.='Homo')]
(there's a bit of XML namespace stuff to be added). This makes the search more precise than a contextless grep, for example. (more on garbles follows)
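The "XML namespace stuff" amounts to binding prefixes before the query is evaluated. A minimal sketch with the standard javax.xml.xpath API follows; the class and method names are hypothetical, the nex prefix is bound to the NeXML namespace and cm to the proposed contentmine namespace, and the predicate is written with square brackets as XPath requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical query helper for the proposed cm:genus attribute.
final class GenusQuery {

    /** Returns every cm:genus value on an otu element, except "Homo". */
    static List<String> generaExceptHomo(String nexml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(nexml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    if ("nex".equals(prefix)) return "http://www.nexml.org/2009";
                    if ("cm".equals(prefix)) return "http://www.contentmine.org/ami-phylo";
                    return "";
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            NodeList hits = (NodeList) xpath.evaluate(
                    "//nex:otu/@cm:genus[not(.='Homo')]", doc, XPathConstants.NODESET);
            List<String> genera = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                genera.add(hits.item(i).getNodeValue());
            }
            return genera;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not query NeXML", e);
        }
    }
}
```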
Ross Mounce:
I have discovered that some programs don't like "singleton nodes"; p4 may also be having this problem.
So trees like this:
((X76436:483.0,((X60646:436.0,(ABOZ1185:296.0,(D16268:208.0,(AJ542512:21.0,AJ542509:31.0)NT1.13:245.0)NT1.6:13.0)NT1.5:56.0)NT1.4:103.0,(AJ551329:342.0,(AJ422145:163.0,(EU046268:80.0,AY373018:217.0)NT1.9:88.0)NT1.8:60.0)NT1.7:185.0)NT1.3:64.0)NT1.2:244.0);
must be converted to:
(X76436:483,((X60646:436,(ABOZ1185:296,(D16268:208,(AJ542512:21,AJ542509:31)NT1.13:245)NT1.6:13)NT1.5:56)NT1.4:103,(AJ551329:342,(AJ422145:163,(EU046268:80,AY373018:217)NT1.9:88)NT1.8:60)NT1.7:185)NT1.3:64)NT1.2;
(aside from removing the decimal .0 the significant difference is removing the outermost parentheses)
for trees like the above R ape package for phylogenetics will give this error message:
MyTree <- read.tree("tree3.tre") #fail
Error in read.tree("tree3.tre") :
The tree has apparently singleton node(s): cannot read tree file.
Reading Newick file aborted at tree no. 1
there is an R script linked from here that I've tried and appears to succeed in removing these singleton nodes: https://stat.ethz.ch/pipermail/r-sig-phylo/2013-June/002783.html
PMR:
If you describe the problem precisely I can edit ami-phylo to avoid this problem. Can the singleton node be anywhere? If so I can alter:
A-
  B
  C-
    D
to
A-
  B
  D
as C has only one child.
Is that what is required? That is, universally replace every single-child node with its child.
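If that is the rule, it can be applied recursively, as in this sketch. The Node class is a hypothetical stand-in for whatever tree structure ami-phylo uses, and branch lengths are ignored here; a real implementation would have to decide whether the elided node's length is added to its child's.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree model; branch lengths are deliberately left out.
final class SingletonElider {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Returns the subtree with every one-child node replaced by its (elided) child. */
    static Node elide(Node node) {
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, elide(node.children.get(i)));
        }
        return node.children.size() == 1 ? node.children.get(0) : node;
    }
}
```

Applied to the A, B, C, D example, elide(A) returns A with children B and D. Applied to a root that has a single child, it returns that child, which corresponds to removing the outermost parentheses in the Newick above.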
Image ID: 010504-0-000 (image below)
In the image, the correct text is Ulvibacter litoralis KMM 3912T (AY243096)
but tesseract interprets the EGID as AY248096 which is another HIV-1 isolate when you look it up. A valid EGID number, but not the one that matches this tip.
A good test case for cross-matching our OCR-Binomial with EGID-looked-up-Binomial.
Current configuration for IJSEM phylogenetic tips is:
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._()
located in
/usr/local/share/tessdata/configs/phylo
on my machine.
The following regular expression/s should allow (a) validation and (b) controlled correction of errors in the creation of tip labels in ami-phylo
<patternList>
<!--
Pattern for extracting species and ID from Int. J. Syst. Evol. Microbiol. (IJSEM) publications
ideal pattern is
genus species strain id
of form
Abcdia foobarius AS013T (EO740822)
there should be exactly 4 words (space-separated) .
Any target *without* a single pair of balanced brackets is an absolute fail
We expect single letter garbles (B->8, 0->O, 1->I, etc.) and unexpected whitespace
insertion or deletion (indel)
Note:
<space/> translates to \s+
all fields are wrapped in capture brackets (...)
and concatenated to a single regex
-->
<pattern level="0">
<!-- this regex enforces the ID patterns strictly.
It will only fail when characters are garbled to the same type, e.g.
M->N , i->l, 3->8
These are undetectable at this stage
-->
<possibleSpace/>
<!-- genus starts with an uppercase letter followed by either several lowercase letters
or by an optional lowercase letter and a period (abbreviation) -->
<field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]?\.)">
</field>
<space/>
<!-- species should be only 2 or more lowercase characters -->
<field name="species" pattern="[a-z]{2,}(?:\u2019?)">
</field>
<space/>
<field name="strain" pattern="[^\s\(]+">
</field>
<space/>
<!-- ID has an alpha and numeric part EU840723 or AJ307974 or NC_002967 -->
<!-- require but strip left bracket -->
<field name="id0" pattern="(?:\()(?:[A-Z]{1,2}|NC_)">
</field>
<!-- and right bracket -->
<field name="id1" pattern="[0-9]{5,6}(?:\))">
</field>
<space/>
</pattern>
<pattern level="1">
<!-- this regex allows for common garbles (detected as an error in 0)
and error correction by "safe" correction. The correction will generate a conformant
field, but it may not be "correct". Each substitution has an error and can be logged. -->
<possibleSpace/>
<field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z02S/]?\.)">
<substitution name="zero2little_o_or_big_o" original="0" edited="[oO]"/>
<substitution name="two2little_z_or_big_z" original="2" edited="[zZ]"/>
<substitution name="big_s2little_s" original="S" edited="s"/>
<substitution name="slash2little_l_or_big_i" original="/" edited="[lI]"/>
<!-- edit more as we find them -->
</field>
<space/>
<field name="species" pattern="[a-z/]+(?:\u2019?)">
<substitution name="s_slash_c2lower_sic" original="s/c" edited="sic"/>
<substitution name="c_slash_l2lower_cil" original="d/" edited="cil"/>
<substitution name="k_slash_n2lower_kin" original="k/n" edited="kin"/>
<substitution name="r_slash_o2lower_rio" original="r/o" edited="rio"/>
<substitution name="zero2little_o" original="0" edited="o"/>
<substitution name="big_s2little_s" original="S" edited="s"/>
<substitution name="slash2lower_l" original="/" edited="l"/>
</field>
<space/>
<field name="strain" pattern="[^\s\(]+">
</field>
<space/>
<field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
<!-- big letters may be garbled to numbers -->
<substitution name="zero2big_o" original="0" edited="O"/>
<substitution name="one2big_i" original="1" edited="I"/>
<substitution name="two2big_z" original="2" edited="Z"/>
<substitution name="three2big_b" original="3" edited="B"/>
<substitution name="five2big_s" original="5" edited="S"/>
<substitution name="eight2big_b" original="8" edited="B"/>
</field>
<field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
<!-- numbers may be garbled to big letters -->
<substitution name="big_o2zero" original="O" edited="0"/>
<substitution name="big_b2eight" original="B" edited="8"/>
<substitution name="big_i2one" original="I" edited="1"/>
<substitution name="big_s2five" original="S" edited="5"/>
<substitution name="big_z2two" original="Z" edited="2"/>
</field>
<possibleSpace/>
</pattern>
</patternList>
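To make the level 0 pattern concrete, the fields can be concatenated by hand into one Java regex. This is a sketch of the idea, not the code that ami-phylo generates from the XML: the class name is hypothetical, the two ID fields are captured as one group, and the brackets are required but left outside the capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written equivalent of the level 0 pattern; not generated from the XML.
final class TipLabelLevel0 {

    private static final Pattern LEVEL0 = Pattern.compile(
            "\\s*"
            + "(\\u2018?[A-Z](?:[a-z]{2,}|[a-z]?\\.))"   // genus
            + "\\s+"
            + "([a-z]{2,}\\u2019?)"                      // species
            + "\\s+"
            + "([^\\s(]+)"                               // strain
            + "\\s+"
            + "\\(((?:[A-Z]{1,2}|NC_)[0-9]{5,6})\\)"     // id, brackets stripped
            + "\\s*");

    /** Returns {genus, species, strain, id}, or null if the label does not conform. */
    static String[] parse(String label) {
        Matcher matcher = LEVEL0.matcher(label);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] {
            matcher.group(1), matcher.group(2), matcher.group(3), matcher.group(4)
        };
    }
}
```

A label that fails here would be passed to the level 1 pattern, whose substitutions are each recorded as an edit.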
The HTML from HOCR (Tesseract) should be saveable through a commandline option:
ami-phylo --hocr.html [CTree filename]
will output the raw HTML to a subdirectory (yet to be determined) of CTree
#replace 8 or 3 -> B in 8char matches
sed -i 's/\([(,]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace 8 or 3 -> B in 8char matches
sed -i 's/\([(,][A-Z]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace 8 or 3 -> B (second character) in 8char matches
sed -i 's/\([(,][A-Z0-9]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]:\)/\1B\2/g' *.nwk
#replace 8 or 3 -> B in 6char matches
sed -i 's/\([(,]\)[83]\([0-9][0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace ^0 -> ^D in 8char matches
sed -i 's/\([(,]\)0\([A-Z0-9][A-Z0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1D\2/g' *.nwk
#replace ^0 -> ^D in 6char matches
sed -i 's/\([(,]\)0\([0-9][0-9][0-9][0-9][0-9]:\)/\1D\2/g' *.nwk
#replace DO or D0 -> DQ in 8char matches
sed -i 's/\([(,]D\)[O0]\([0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1Q\2/g' *.nwk
#replace GO or G0 -> GQ in 8char matches
sed -i 's/\([(,]G\)[O0]\([A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]:\)/\1Q\2/g' *.nwk
#replace A1 -> AJ in 8char matches
sed -i 's/\([(,]A\)1\([0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1J\2/g' *.nwk
#replace 2 -> Z in 6char matches
sed -i 's/\([(,]\)2\([0-9][0-9][0-9][0-9][0-9]:\)/\1Z\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z]\)Z\([0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)Z\([0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)Z\([0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)Z\([0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)Z\(:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)Z\([0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)Z\([0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)Z\([0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)Z\([0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)Z\(:\)/\12\2/g' *.nwk
#SAFE replace OO -> 00 in 8char matches
sed -i 's/\([(,][A-Z][A-Z]\)OO\([0-9][0-9][0-9][0-9]:\)/\100\2/g' *.nwk
#replace O -> 0
sed -i 's/\([(,][A-Z][A-Z][0-9]\)O\([0-9][0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)O\([0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)O\([0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)O\([0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)O\(:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z]\)O\([0-9][0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)O\([0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)O\([0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)O\([0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)O\(:\)/\10\2/g' *.nwk
#replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)B\([0-9][0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)B\([0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)B\([0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)B\([0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)B\(:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z]\)B\([0-9][0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)B\([0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)B\([0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)B\([0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)B\(:\)/\18\2/g' *.nwk
#replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)G\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)G\([0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)G\([0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)G\([0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)G\(:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z]\)G\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)G\([0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)G\([0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)G\([0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)G\(:\)/\16\2/g' *.nwk
#SAFE replace E -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z]\)E\([A-Z0-9][0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace S -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9A-Z]\)S\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
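The Z, O, B and G rules above share one shape: a tip label that follows ( or , and ends at : is corrected when exactly one look-alike letter sits where a digit is expected and every other position is already a digit. As a rough illustration only, the Java sketch below applies those four rules to the 6-character labels in a single pass. The class and method names are hypothetical and not part of ami-phylotree; the 8-character rules and the E and S special cases are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ami-phylotree.
class AccessionOcrFixer {
    // Look-alike letters and, at the same index, the digits they stand for.
    private static final String LETTERS = "ZOBG";
    private static final String DIGITS = "2086";

    // '(' or ',' then one capital, then five digit-or-look-alike characters, then ':'.
    private static final Pattern SIX_CHAR =
            Pattern.compile("([(,][A-Z])([0-9ZOBG]{5}):");

    static String fixSixCharLabels(String newick) {
        Matcher m = SIX_CHAR.matcher(newick);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tail = m.group(2);
            int hit = -1;
            int letters = 0;
            for (int i = 0; i < tail.length(); i++) {
                if (LETTERS.indexOf(tail.charAt(i)) >= 0) {
                    hit = i;
                    letters++;
                }
            }
            String fixed = tail;
            // Only act when a single character is suspicious, as the sed rules do.
            if (letters == 1) {
                char digit = DIGITS.charAt(LETTERS.indexOf(tail.charAt(hit)));
                fixed = tail.substring(0, hit) + digit + tail.substring(hit + 1);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + fixed + ":"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling AccessionOcrFixer.fixSixCharLabels on the contents of a .nwk file returns the corrected string; labels with two or more suspicious characters are left untouched, which mirrors the conservative intent of the SAFE rules.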
Some browsers and tools throw errors on code points > 127 even when the text is encoded as UTF-8. These code points should be escaped as numeric character references of the form &#dddd;
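One way to produce that escaping is sketched below in Java. It walks the string by code point, so a character outside the Basic Multilingual Plane becomes a single reference rather than two. The class and method names are hypothetical, and markup characters such as & and < are deliberately left alone because only code points above 127 are in scope here.

```java
// Hypothetical helper, not part of ami-phylotree.
class NumericEntityEscaper {
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 127) {
                // Decimal numeric character reference, e.g. &#233;
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```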
The NCBI taxdump contains ca. 600K records that look like:
9 | Buchnera aphidicola | | scientific name |
9 | Buchnera aphidicola Munson et al. 1991 | | synonym |
10 | "Cellvibrio" Winogradsky 1929 | | synonym |
10 | Cellvibrio | | scientific name |
10 | Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003 | | synonym |
11 | "Cellvibrio gilvus" Hulcher and King 1958 | | authority |
11 | Cellvibrio gilvus | | equivalent name |
11 | [Cellvibrio] gilvus | | scientific name |
13 | Dictyoglomus | | scientific name |
13 | Dictyoglomus Saiki et al. 1985 | | authority |
14 | ATCC 35947 | | type material |
14 | DSM 3960 | | type material |
14 | Dictyoglomus thermophilum | | scientific name |
14 | Dictyoglomus thermophilum Saiki et al. 1985 | | authority |
14 | strain H-6-12 | | type material |
16 | Methyliphilus | | equivalent name |
16 | Methylophilus | | scientific name |
16 | Methylophilus Jenkins et al. 1987 | | synonym |
16 | Methylotrophus | | misspelling |
17 | ATCC 53528 | | type material |
17 | DSM 46235 | | type material |
17 | LMG 6787 | | type material |
17 | Methyliphilus methylitrophus | | equivalent name |
17 | Methyliphilus methylotrophus | | equivalent name |
17 | Methylophilus methylitrophus | | equivalent name |
17 | Methylophilus methylotrophus | | scientific name |
17 | Methylophilus methylotrophus Jenkins et al. 1987 | | authority |
17 | Methylophilus sp. CBMB147 | | includes |
17 | Methylotrophus methylophilus | | synonym |
17 | NCIB 10515 | | type material |
17 | NCIMB 10515 | | type material |
17 | VKM B-1623 | | type material |
18 | Pelobacter | | scientific name |
18 | Pelobacter Schink and Pfennig 1983 | | authority |
19 | DSM 2380 | | type material |
19 | NBRC 103641 | | type material |
19 | Pelobacter carbinolicus | | scientific name |
19 | Pelobacter carbinolicus Schink 1984 | | authority |
19 | strain Gra Bd 1 | | type material |
20 | Phenylobacterium | | scientific name |
I have looked through this in some detail and propose that for current ami-phylo
(mainly tackling IJSEM) we use ONLY "scientific name"s.
These names have the following forms:
Genus
Genus species
Genus species qualifiers...
The qualifiers include:
sp. ddd
subsp. ddd
NTCC 1234
and many more.
[a-z]+ (results in genus only)
These are then sorted and duplicates removed, leaving only a single-word genus or a two-word binomial.
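A minimal Java sketch of this filtering follows, under stated assumptions: fields are separated by | as in the sample above, a genus is taken to match [A-Z][a-z]+, and a second word is kept only if it matches [a-z]+ (so a qualifier such as sp. yields the genus alone). Names such as [Cellvibrio] gilvus are skipped by this sketch, which may not be the desired behaviour. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ami-phylotree.
class ScientificNameFilter {
    // Assumed shapes for the two words that are kept.
    private static final Pattern GENUS = Pattern.compile("[A-Z][a-z]+");
    private static final Pattern SPECIES = Pattern.compile("[a-z]+");

    static SortedSet<String> genusAndBinomials(List<String> lines) {
        // A TreeSet sorts the names and removes duplicates.
        SortedSet<String> names = new TreeSet<>();
        for (String line : lines) {
            // Expected fields: id | name | unique name | name class |
            String[] fields = line.trim().split("\\s*\\|\\s*");
            if (fields.length < 4 || !"scientific name".equals(fields[3])) {
                continue;
            }
            String[] words = fields[1].split("\\s+");
            if (!GENUS.matcher(words[0]).matches()) {
                continue;
            }
            boolean hasSpecies = words.length > 1 && SPECIES.matcher(words[1]).matches();
            // Everything after the second word (qualifiers) is dropped.
            names.add(hasSpecies ? words[0] + " " + words[1] : words[0]);
        }
        return names;
    }
}
```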
Dendroscope (phylogenetic tree viewing software) cannot open many of the output NeXML files because they contain Unicode characters that are invalid in XML (0x18 in the error log below).
Error log from Dendroscope:
Executing: open file='/home/ross/workspace/ami-plugin/all-output/all-input/ijs.0.000364-0-003.pbm.png/ijs.0.000364-0-003.pbm.nexml.xml';
[Fatal Error] :13:19: An invalid XML character (Unicode: 0x18) was found in the element content of the document.
org.xml.sax.SAXParseException; lineNumber: 13; columnNumber: 19; An invalid XML character (Unicode: 0x18) was found in the element content of the document.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.nexml.model.DocumentFactory.parse(DocumentFactory.java:52)
at org.nexml.model.DocumentFactory.safeParse(DocumentFactory.java:62)
at dendroscope.D.A.B.A(Unknown Source)
at dendroscope.D.A.B(Unknown Source)
at dendroscope.commands.OpenFileCommand.apply(Unknown Source)
at jloda.C.A.F.A(Unknown Source)
at jloda.C.A.F.D(Unknown Source)
at dendroscope.N.B$2.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Command usage: open file=<filename>; - Opens a file
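0x18 is a control character that XML 1.0 does not allow in document content, and it cannot be written as a character reference either, so the parser rejects the file regardless of encoding. One possible guard, sketched in Java under the assumption that label text can be cleaned before the NeXML is written, is to drop every code point outside the XML 1.0 Char production. The class and method names are hypothetical and not part of ami-phylotree or the NeXML library.

```java
// Hypothetical helper, not part of ami-phylotree.
class XmlCharFilter {
    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    //                          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            }
            // Invalid code points (including unpaired surrogates) are skipped.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Silently dropping characters loses information, so logging each removal may be preferable in practice.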