phylotree's People

Contributors

petermr, rossmounce

phylotree's Issues

PMR_NOTES Extracting fields from OTUs

Current main loop for extracting fields from a value:
value is the raw surface string.
editRecord (an EditList) stores the edits.
extractionList stores the final extractions as Extraction name-value pairs.
editedBuilder builds the new surface, but should be superseded by extractionList.

package org.xmlcml.norma.editor;
public class PatternElement

    public String createEditedValueAndRecord(String value) {
        String newValue = null;
        Pattern pattern1 = createPattern();
        Matcher matcher = pattern1.matcher(value);
        editRecord = new EditList();
        extractionList = new ArrayList<Extraction>();
        if (matcher.matches()) {
            editedBuilder = new StringBuilder();
            for (int i = 1; i <= matcher.groupCount(); i++) {
                String group = matcher.group(i);
                FieldElement field = getField(i - 1);
                group = field.applySubstitutions(group);
                Extraction extraction = new Extraction(field.getNameAttribute(), group);
                extractionList.add(extraction);
                EditList fieldRecord = field.getEditRecord();
                if (fieldRecord.size() > 0) {
                    LOG.trace("fr1 "+fieldRecord);
                    this.editRecord.add(fieldRecord);
                }
                insertSpace(field);
                editedBuilder.append(group);
            }
            newValue =  editedBuilder.toString();
        }
        LOG.debug(">>"+extractionList);
        return newValue;
    }

Missing edge labels

A considerable number of trees have unannotated edges

<edge>

instead of

<edge source="foo" target="bar"/>

Ross Mounce to identify 2-3 such examples and commit as failing tests.

commandline: output Newick

The tree in Newick format should be saveable through a commandline option:

ami-phylo --ph.newick [CTree filename]

will output the Newick to a subdirectory (yet to be determined) of CTree

Add pruning for incomplete or malformed tips

All successful parses will result in

<genus> <species> <strain> <egid>

These are then further validated via lookup to remove cases where

<genus+species> != <binomial looked up from egid>

All tips failing these tests should be pruned from the tree used for further processing ("validTree"). (The incorrect OTUs can be retained but not used.)
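A minimal sketch of the pruning step described above. The names Tip, validTips and binomialByEgid are hypothetical and not part of ami-phylo; the lookup is assumed to be available as a simple map from EGID to binomial, and an EGID with no lookup result is assumed to count as a failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: Tip, validTips and the lookup map are hypothetical names, not ami-phylo API.
class TipPruner {

    /** A parsed tip holding the four fields of a successful parse. */
    static final class Tip {
        final String genus, species, strain, egid;

        Tip(String genus, String species, String strain, String egid) {
            this.genus = genus;
            this.species = species;
            this.strain = strain;
            this.egid = egid;
        }

        String binomial() {
            return genus + " " + species;
        }
    }

    /** Keeps a tip only if its parsed binomial equals the one looked up from its EGID. */
    static List<Tip> validTips(List<Tip> tips, Map<String, String> binomialByEgid) {
        List<Tip> valid = new ArrayList<>();
        for (Tip tip : tips) {
            String lookedUp = binomialByEgid.get(tip.egid);
            // Assumption: an EGID with no lookup result also counts as a failure.
            if (tip.binomial().equals(lookedUp)) {
                valid.add(tip);
            }
        }
        return valid;
    }
}
```

The rejected tips are simply not copied, so the original list can still be retained for the record.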

ami-phylo commandline

ami-phylo will run from the commandline and should, as far as possible, manage options from there. Please list here the options that should be part of the commandline.

Test the effect of using character whitelists in tesseract

We should try improving our OCR output by restricting tesseract to a whitelist of characters. This StackOverflow post appears to detail how this can be done very simply/easily.
http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

I think we should NOT include these characters in the whitelist:
\ / $ % ^ & # ! ~ £

Of course we'd need to test the effect of this change. I will try and find example files that contain these types of characters in the 'raw' unmodified tesseract output. Then compare that output with the whitelist-tesseracted output.

Syntax for tips

The IJSEM phylotrees have an uncontrolled tip syntax:

genus species strain strain1? \(AAdddddd | AAAddddd | NC_dddddd\)

However we can't assume this in general. What other syntaxes are we likely to have to tackle?

genus species

or

genus

are common, but there are many others. Can we come up with heuristics that work "most of the time"? Because otherwise we are going to have to define this for each tree, which defeats the automation.

Check tip count against HOCR output

If a tree is corrupted (e.g. has small gaps so splits into several trees) then much of the valid HOCR output is not assigned. This can be used to reject broken trees.

commandline: output Nexml

The NEXML from ami-phylo should be saveable through a commandline option:

ami-phylo --ph.nexml [CTree filename]

will output the NEXML to a subdirectory (yet to be determined) of CTree

validation of Newick output

Provide a mechanism for validating *.nwk output.

As an example http://libpll.org/api/group__newickParseGroup.html defines a valid tree as

Validate if a newick tree is a valid phylogenetic tree.

A valid tree is one where the root node is binary or ternary and all other internal nodes are binary. In case the root is ternary then the tree must contain at least another internal node and the total number of nodes must be equal to $2l - 2$, where $l$ is the number of leaves. If the root is binary, then the total number of nodes must be equal to $2l - 1$.

This implies that all multiple parentage should be expanded to binary trees apart from roots.

Is this a satisfactory validator? And does it validate node labels, etc.?
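As a starting point for discussion, here is a minimal sketch of the shape rule quoted above. It is not libpll's code: it checks child counts directly (binary or ternary root, all other internal nodes binary, a ternary root needing at least one more internal node), which implies the 2l - 1 and 2l - 2 node counts. The class and method names are hypothetical, and it assumes labels contain no quotes, comments or structural characters. It does not validate node labels.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: a structural check on a Newick string, assuming unquoted labels.
class NewickShapeCheck {

    static boolean isValidShape(String newick) {
        Deque<Integer> open = new ArrayDeque<>(); // child count of each unclosed group
        int rootChildren = -1;                    // set when the outermost group closes
        int internalNodes = 0;
        for (int i = 0; i < newick.length(); i++) {
            char c = newick.charAt(i);
            if (c == '(') {
                if (rootChildren != -1) return false; // a second top-level group
                open.push(1);                         // a group has at least one child
                internalNodes++;
            } else if (c == ',') {
                if (open.isEmpty()) return false;
                open.push(open.pop() + 1);
            } else if (c == ')') {
                if (open.isEmpty()) return false;     // unmatched close
                int children = open.pop();
                if (open.isEmpty()) {
                    rootChildren = children;
                } else if (children != 2) {
                    return false;                     // non-root internal node must be binary
                }
            }
        }
        if (!open.isEmpty()) return false;            // unmatched open
        if (rootChildren == 2) return true;
        return rootChildren == 3 && internalNodes >= 2;
    }
}
```

Under this rule a tree whose outermost parentheses wrap a single child is rejected, as is any multifurcation below the root.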

Deleting Tip nodes

Create a tool which removes tip nodes (primarily because they have no labels). This should be done on the NeXML. Any resulting poorly formed nodes (e.g. with 1 or 0 children) should be deleted or tidied.

Navigation through phylotree repository

The top-level pages of this repository should describe the project, its methodology, its specifications and its error types, and link to other resources. Phylotree should be a showcase for other CM projects, and it should be possible for someone to understand the process in abstract even if they don't understand phylogenetics in detail.

Untrapped Tesseract failure

When Tesseract fails the subprocess should abort to avoid errors such as:

6059 [main] ERROR org.xmlcml.norma.image.ocr.ImageToHOCRConverter  - Process failed to terminate after :60
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter  - creating HTML output: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.html
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter  - creating hocr.hocr name: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.hocr
6059 [main] DEBUG org.xmlcml.norma.image.ocr.ImageToHOCRConverter  - failed to create: /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.html or /Users/pm286/workspace/ami-plugin/target/junk/1439960185967/null.pbm.png.hocr.hocr
6059 [main] ERROR org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor  - cannot run tesseract
6063 [main] DEBUG org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor  - cTreeLog: [org.xmlcml.cmine.args.log.CMineLog: log]
6063 [main] DEBUG org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor  - outputResultElement NYI null; need to add tree

In fact this is due to a FileNotFound, which should be trapped before trying Tesseract.
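A minimal sketch of such a pre-flight check, to be called before the Tesseract subprocess is started. OcrPreflight and requireReadable are hypothetical names, not existing norma or ami-phylo methods.

```java
import java.io.File;
import java.io.FileNotFoundException;

// Sketch only: fail fast on a missing image instead of waiting for the OCR timeout.
class OcrPreflight {

    /** Throws if the image cannot be read, so Tesseract is never launched on it. */
    static void requireReadable(File imageFile) throws FileNotFoundException {
        if (imageFile == null || !imageFile.isFile() || !imageFile.canRead()) {
            throw new FileNotFoundException("cannot read image for OCR: " + imageFile);
        }
    }
}
```

A path ending in null.pbm.png, as in the log above, would then be reported as a missing file rather than as a Tesseract failure.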

commandline: output raw SVG

The HTML from HOCR (Tesseract) can be converted to SVG and should be saveable through a commandline option:

ami-phylo --hocr.svg [CTree filename]

will output the raw SVG to a subdirectory (yet to be determined) of CTree

Serious design flaw in `cmine`+`ami-phylo`

See https://github.com/petermr/cmine/issues/2

The main loop over multiple CTrees should include all processing. Currently ami-phylo executes runAndOutput() over multiple CTrees and then runs the NexML analysis. This means that only the last CTree is analyzed.

Solution:
(a) immediate: break processImage() in ami-phylo into per-CTree modules (driven by args.xml).
(b) medium term: separate argProcessor from per-CTree analyses.

Local Genus and species lookup

NCBI taxdump has approximate format:

9   |   Buchnera aphidicola |       |   scientific name |
9   |   Buchnera aphidicola Munson et al. 1991  |       |   synonym |
10  |   "Cellvibrio" Winogradsky 1929   |       |   synonym |
10  |   Cellvibrio  |       |   scientific name |
10  |   Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003    |       |   synonym |
11  |   "Cellvibrio gilvus" Hulcher and King 1958   |       |   authority   |
11  |   Cellvibrio gilvus   |       |   equivalent name |
11  |   [Cellvibrio] gilvus |       |   scientific name |
13  |   Dictyoglomus    |       |   scientific name |
13  |   Dictyoglomus Saiki et al. 1985  |       |   authority   |
14  |   ATCC 35947  |       |   type material   |
14  |   DSM 3960    |       |   type material   |
14  |   Dictyoglomus thermophilum   |       |   scientific name |
14  |   Dictyoglomus thermophilum Saiki et al. 1985 |       |   authority   |
14  |   strain H-6-12   |       |   type material   |
16  |   Methyliphilus   |       |   equivalent name |
16  |   Methylophilus   |       |   scientific name |
16  |   Methylophilus Jenkins et al. 1987   |       |   synonym |
16  |   Methylotrophus  |       |   misspelling |
17  |   ATCC 53528  |       |   type material   |
17  |   DSM 46235   |       |   type material   |
17  |   LMG 6787    |       |   type material   |
17  |   Methyliphilus methylitrophus    |       |   equivalent name |
17  |   Methyliphilus methylotrophus    |       |   equivalent name |
17  |   Methylophilus methylitrophus    |       |   equivalent name |
17  |   Methylophilus methylotrophus    |       |   scientific name |
17  |   Methylophilus methylotrophus Jenkins et al. 1987    |       |   authority   |
17  |   Methylophilus sp. CBMB147   |       |   includes    |
17  |   Methylotrophus methylophilus    |       |   synonym |
17  |   NCIB 10515  |       |   type material   |
17  |   NCIMB 10515 |       |   type material   |
17  |   VKM B-1623  |       |   type material   |
18  |   Pelobacter  |       |   scientific name |
18  |   Pelobacter Schink and Pfennig 1983  |       |   authority   |
19  |   DSM 2380    |       |   type material   |
19  |   NBRC 103641 |       |   type material   |
19  |   Pelobacter carbinolicus |       |   scientific name |
19  |   Pelobacter carbinolicus Schink 1984 |       |   authority   |
19  |   strain Gra Bd 1 |       |   type material   |
20  |   Phenylobacterium    |       |   scientific name |

We will develop this as a local lookup for Genus and species
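A minimal sketch of how the lookup table could be built from lines in the format shown above. It keeps only rows whose name class is "scientific name" and maps the name to its taxon ID. TaxdumpNames and scientificNames are hypothetical names, and the parser assumes four pipe-separated fields per row with surrounding whitespace to be trimmed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds a name -> taxon ID map from taxdump-style rows.
class TaxdumpNames {

    static Map<String, Integer> scientificNames(List<String> lines) {
        Map<String, Integer> idByName = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|", -1);
            if (fields.length < 4) {
                continue; // not a data row
            }
            if (!"scientific name".equals(fields[3].trim())) {
                continue; // skip synonyms, authorities, type material, etc.
            }
            // Assumption: the first field is always a decimal taxon ID.
            idByName.put(fields[1].trim(), Integer.valueOf(fields[0].trim()));
        }
        return idByName;
    }
}
```

Synonyms and misspellings could be loaded into a second map later if we want to correct tips rather than just validate them.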

Incorrect line breaks from Tesseract.

In some cases Tesseract splits labels into two lines. An example is Escherichia coli in /ami-plugin/src/test/resources/org/xmlcml/ami2/phylo/15goodtree/ijs.0.000364-0-004.pbm.png which is split after the s.

This is detected and probably mended by keeping track of unused phrases in labels.

Multiple branches in output

In ijs_0_000364_0_003 the output appears to have a trifurcation. What does this do to downstream programs? Do they:

  • throw an error?
  • process it correctly?
  • process it incorrectly?

Here's the Newick

(((((:195.0,:135.0)NT1.25:86.0,:301.0)NT1.17:131.0,(:251.0,:364.0)NT1.15:106.0)NT1.5:12.0,(:223.0,:333.0)NT1.19:158.0)NT1.2:10.0,((:441.0,(((:295.0,((((:140.0,:164.0)NT1.28:63.0,:165.0)NT1.22:24.0,(:183.0,:185.0)NT1.26:55.0)NT1.21:78.0,(:186.0,:194.0)NT1.23:105.0)NT1.14:56.0)NT1.10:21.0,(:279.0,:302.0)NT1.16:107.0)NT1.8:8.0,((:189.0,:191.0)NT1.27:187.0,:343.0)NT1.12:55.0)NT1.6:11.0,:497.0)NT1.3:6.0,((:348.0,(:124.0,:167.0)NT1.24:205.0)NT1.7:12.0,(:484.0,((:260.0,:320.0)NT1.13:33.0,((:157.0,:181.0)NT1.20:21.0,:152.0)NT1.18:91.0)NT1.11:22.0)NT1.9:32.0)NT1.4:11.0)NT1.1:9.0)NT1.60;

(Tip labels have been lost - can be restored if necessary)

IOW does ami-phylo have to break multifurcations into binary?

Implement 50 images as test material

Copy 50 images of square trees from IJSEM via Ross Mounce and use as main test material. Use to determine types of processing error and corrective actions.

List all tasks required for phylotree analysis

List all tasks that will need to be carried out for extraction of trees, checking, validation, lookup and then creation of supertree. Create issues for each separate task. These tasks should ideally provide a record of the work as well as the design

Missing tip labels

Ross Mounce:
almost all newick files have one or more missing tip labels so today I'm just going to plough through adding in the missing labels, manually. After this is done I will do GenBank lookup to go from GB accession number -> GB taxon ID
(just to say, this is all version controlled on github, so no change will go undocumented...)

ISSUE: describe this problem precisely and attempt to formulate primary causes

Conflict in Newick validity between R & p4/STK2

R and p4 / STK2 seem to conflict over whether trees are valid or not 😞

Newick file: ijs.0.65514-0-000.pbm.nwk

((((D0062743:155.0,(M62795:103.0,(AFO78775:118.0,ABO78049:61.0)NT1.12:48.0)NT1.9:25.0)NT1.7:33.0,ABO78055:121.0)NT1.5:15.0,(A3278570:99.0,(AB264798:31.0,D12657:57.0)NT1.10:53.0)NT1.8:49.0)NT1.3:86.0,((EF407879:242.0,(AB192292:146.0,(M62798:135.0,DQ457019:206.0)NT1.6:33.0)NT1.4:17.0)NT1.2:36.0,(DQ244076:27.0,00244077:29.0)NT1.11:157.0)NT1.1:33.0)NT1.27;

This is fine for R, which can read it in and plot it,
but p4 / STK2 reports an unmatched parenthesis:

***Error: failed to parse a tree in your data set.
Error parsing tree


Tree.parseNewick(), tree 't0'
    Unmatched unparen.
((((D0062743:155.0,(M62795:103.0,(AFO78775:118.0,ABO78049:61.0)NT1.12:48.0)NT1.9:25.0)NT1.7:33.0,ABO78055:121.0)NT1.5:15.0,(A327

commandline: output combined SVG

The raw SVG (Tesseract) can be combined with the tree SVG and should be saveable through a commandline option:

ami-phylo --ph.svg [CTree filename]

will output the combined SVG to a subdirectory (yet to be determined) of CTree

OTU labels are now entirely numerical

Looking through output from my latest code test using the 50-image test set... when NeXML is output (which it isn't always), all the OTU labels are entirely numerical (if present). Are we using a digits-only tesseract whitelist dictionary? Seems like it to me. Sample NeXML output from one file (ijs.0.000653-0-000.pbm.nexml.xml) below:

Perhaps I need to edit a config file in my usr/local ... tessdata to mirror what you have on your machine, Peter?

Full output for all 50 files is on github here: https://github.com/rossmounce/pluto-ONS/tree/master/testing/50-images

<?xml version="1.0" encoding="UTF-8"?>
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <otus label="RootTaxaBlock">
  <otu id="otu1"/>
  <otu id="otu2">560171118 171777113 11511111 124647 03162681</otu>
  <otu id="otu3">336171113 131136110113 08111 147301 0114221451</otu>
  <otu id="otu4">310171115 50111363111611 0500462681</otu>
  <otu id="otu5">530171118 81081011111738 138114 4851 0064361</otu>
  <otu id="otu6"> 1380171113 1101167113 11116 218371 0415425121</otu>
  <otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>
  <otu id="otu8">3520111115 11911113 081111 13201 1130211851</otu>
  <otu id="otu9">860171115 9615111111 1.11116 218801 0115513291</otu>
  <otu id="otu10">3670171113 5111111115 11000 17691 01606461</otu>
  <otu id="otu11">3610171113 1602111917313 1011 132851 9113730181</otu>
 </otus>
 <trees>
  <tree id="T1">
   <node id="NT1.1" label="NT1.1" x="0.0" y="457.0" otu="otu1" root="true"/>
   <node id="NT1.2" label="NT1.2" x="244.0" y="375.0"/>
   <node id="NT1.3" x="308.0" y="259.0" label="67"/>
   <node id="NT1.4" x="411.0" y="186.0" label="82"/>
   <node id="NT1.5" x="467.0" y="138.0" label="65"/>
   <node id="NT1.6" label="NT1.6" x="480.0" y="93.0"/>
   <node id="NT1.7" x="493.0" y="330.0" label="99"/>
   <node id="NT1.8" x="553.0" y="375.0" label="85"/>
   <node id="NT1.9" x="641.0" y="413.0" label="99"/>
   <node id="NT1.10" label="NT1.10" x="688.0" y="132.0" otu="otu2"/>
   <node id="NT1.11" label="NT1.11" x="716.0" y="336.0" otu="otu3"/>
   <node id="NT1.12" label="NT1.12" x="721.0" y="439.0" otu="otu4"/>
   <node id="NT1.13" x="725.0" y="55.0" label="100"/>
   <node id="NT1.14" label="NT1.14" x="727.0" y="491.0" otu="otu5"/>
   <node id="NT1.15" label="NT1.15" x="746.0" y="29.0" otu="otu6"/>
   <node id="NT1.16" label="NT1.16" x="756.0" y="81.0" otu="otu7"/>
   <node id="NT1.17" label="NT1.17" x="763.0" y="183.0" otu="otu8"/>
   <node id="NT1.18" label="NT1.18" x="835.0" y="285.0" otu="otu9"/>
   <node id="NT1.19" label="NT1.19" x="847.0" y="234.0" otu="otu10"/>
   <node id="NT1.20" label="NT1.20" x="858.0" y="388.0" otu="otu11"/>
   <edge source="NT1.15" target="NT1.13"/>
   <edge source="NT1.16" target="NT1.13"/>
   <edge source="NT1.10" target="NT1.6"/>
   <edge source="NT1.17" target="NT1.5"/>
   <edge source="NT1.19" target="NT1.4"/>
   <edge source="NT1.1" target="NT1.2"/>
   <edge source="NT1.11" target="NT1.8"/>
   <edge source="NT1.12" target="NT1.9"/>
   <edge source="NT1.18" target="NT1.7"/>
   <edge source="NT1.20" target="NT1.9"/>
   <edge source="NT1.14" target="NT1.2"/>
   <edge source="NT1.13" target="NT1.6"/>
   <edge source="NT1.6" target="NT1.5"/>
   <edge source="NT1.5" target="NT1.4"/>
   <edge source="NT1.4" target="NT1.3"/>
   <edge source="NT1.3" target="NT1.2"/>
   <edge source="NT1.3" target="NT1.7"/>
   <edge source="NT1.7" target="NT1.8"/>
   <edge source="NT1.8" target="NT1.9"/>
  </tree>
 </trees>
</nexml>

Erroneous Newick output

Here's the full list of 16 erroneous Newick files:

Details

  • 001123

description: problem fairly obvious here. OCR has interpreted a . as a , (comma). "B,multivorans"
In Newick commas are special characters so this causes problems.

  • 001149

description: a very odd file with 199 empty/unlabelled tips! It makes a lot of sense when you view the source image: https://github.com/rossmounce/pluto-ONS/blob/master/testing/output/ijs.0.001149-0-003.pbm.png/ijs.0.001149-0-003.pbm.png AMI has clearly quite faithfully tried to interpret this odd image, it just isn't valid Newick.

  • 003160

description this is the problem bit: ":::::::::::::47.0" unknown cause

  • 019687

description comma inserted in taxon label: "X74685,D25307"

  • 022285

description very poor source image, poor OCR output, comma in taxon label: "(,V/lbrloxuuLMG21346T"

  • 02251

description not sure what problem is here, some single quote ' marks perhaps?

  • 02303

description not sure. Perhaps the slash symbol / or the single quote marks '

  • 02328

description possibly the negative length branches ":-59.0,Oxyrrhismarina" and there's a square bracket symbol [ "Chlamydomonas[noen" and single quote marks

  • 02329

description taxon label with a dollar sign in it "US$433:27.0" also negative branch lengths ":-28.0,99:-16.0,:-17.0,99:-6.0"

  • 02770

description at least three slashes in taxon labels "B.Vinson/isubsp"

  • 02792

description loads of odd symbols ":26.0,5'5°" lots of exclamation marks "V.loge!35077:37.0" pound symbol "£31"

  • 02806

description negative branch lengths "(:-2.0,:-5.0)NT1.16:-4.0" and a single quote mark

  • 02994

description slash "Methanobacler/umIvanovii" single quotes and negative branch lengths "(:-1.0,:182.0)"

  • 63077

description colon in taxon name "Peptostreptococcusmicro::135.0,"

  • 63400

description single quote marks and double quote marks
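Most of the failures listed above come from characters in tip labels or from negative branch lengths, so a check before Newick is written might catch them. The following is only a sketch: the class and method names are hypothetical, the character set combines the Newick structural characters with the troublemakers seen in this list, and replacing them with an underscore is just one possible policy.

```java
import java.util.regex.Pattern;

// Sketch only: flags labels and branch lengths that broke the files listed above.
class NewickLabelCheck {

    // Newick structure: ( ) [ ] ' : ; , and whitespace.
    // Also seen in the failing files: " / $ ! pound sign (\u00A3) degree sign (\u00B0).
    private static final Pattern UNSAFE =
            Pattern.compile("[()\\[\\]':;,\\s\"/$!\u00A3\u00B0]");

    private static final Pattern NEGATIVE_LENGTH = Pattern.compile(":-\\d");

    /** True if the label can be written into Newick without quoting. */
    static boolean isSafe(String label) {
        return !UNSAFE.matcher(label).find();
    }

    /** Replaces every unsafe character with an underscore. */
    static String sanitize(String label) {
        return UNSAFE.matcher(label).replaceAll("_");
    }

    /** True if the Newick string contains a negative branch length. */
    static boolean hasNegativeLength(String newick) {
        return NEGATIVE_LENGTH.matcher(newick).find();
    }
}
```

Each rejection or substitution could be written to the edit record so that nothing is changed silently.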

Separate characters and trees in pixel analysis

In

PhyloTreeArgProcessor.createNexmlAndTreeFromPixels(File inputImageFile)
            phyloTreePixelAnalyzer = createAndConfigurePixelAnalyzer(image);
            diagramTree = phyloTreePixelAnalyzer.processImageIntoGraphsAndTree();
            LOG.debug("processImageIntoGraphsAndTree finished");

the processing spends a lot of time identifying and trying to process the characters. Tesseract provides bounding boxes which could be used to remove characters before processing the tree. Alternatively we could try to identify small pixel islands and not analyse them.

preprocessing of newick for STK

Ross Mounce wrote:

##Pre-processing steps

#Put all Newick trees into one file, one per line
for i in *.nwk ; do cat $i >> testree.tre ; sed -i -e '$a\' testree.tre ; done
#Remove Newick strings that are 70 characters or less. This selects for trees containing four labelled taxa or more
awk 'length($0)>70' testree.tre > filteredtrees.tre
#Flatten from unicode to ascii text
iconv -f utf-8 -t ascii//translit filteredtrees.tre -o asciitrees.tre
# grep -P '[^\x00-\x7f]' filteredtrees.tre to see the unicode chars
#Remove hyphens
sed -i 's/-//g' asciitrees.tre
# stk/p4 does not like taxa with labels solely composed of numbers
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
# stk/p4 does not like unmatched ' symbols
sed -i 's/'\''//g' asciitrees.tre
# stk/p4 does not like unmatched " symbols
sed -i 's/[\"]//g' asciitrees.tre
# stk/p4 does not like / symbols
# stk/p4 does not like taxa beginning with . symbol
# stk/p4 does not like taxa beginning with , symbol

I'll comment with suggested actions

Failure to match tips to labels (implement Hungarian algorithm? )

In some images there is very poor matching of tips to labels. (e.g. ijs_0_000174_0_000). This may be due to large x-distances between tip and LHS of label.

Strategies:

  • adaptive algorithm to re-adjust x-distances
  • Hungarian algorithm for best bipartite graph.

In any case the joining criteria should be with a rectangle of error rather than a circle.
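To make the two ideas concrete, here is a sketch that combines a rectangular error measure with the Hungarian algorithm. It is not current ami-phylo code; TipLabelMatcher, cost and assign are hypothetical names, xTol and yTol are assumed half-widths of the error rectangle, and the assignment requires that there are at least as many labels as tips.

```java
import java.util.Arrays;

// Sketch only: rectangular cost plus Hungarian assignment of tips to labels.
class TipLabelMatcher {

    /** Cost is at most 1.0 exactly when the label lies inside the error rectangle. */
    static double cost(double[] tip, double[] label, double xTol, double yTol) {
        return Math.max(Math.abs(label[0] - tip[0]) / xTol,
                        Math.abs(label[1] - tip[1]) / yTol);
    }

    /** cost[i][j] is tip i against label j; returns the label index chosen for each tip. */
    static int[] assign(double[][] cost) {
        int n = cost.length, m = cost[0].length;
        if (n > m) throw new IllegalArgumentException("need at least as many labels as tips");
        double[] u = new double[n + 1], v = new double[m + 1]; // row and column potentials
        int[] p = new int[m + 1], way = new int[m + 1];        // p[j] = row matched to column j
        for (int i = 1; i <= n; i++) {
            p[0] = i;
            int j0 = 0;
            double[] minv = new double[m + 1];
            Arrays.fill(minv, Double.POSITIVE_INFINITY);
            boolean[] used = new boolean[m + 1];
            do {
                used[j0] = true;
                int i0 = p[j0], j1 = 0;
                double delta = Double.POSITIVE_INFINITY;
                for (int j = 1; j <= m; j++) {
                    if (used[j]) continue;
                    double cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
                    if (cur < minv[j]) { minv[j] = cur; way[j] = j0; }
                    if (minv[j] < delta) { delta = minv[j]; j1 = j; }
                }
                for (int j = 0; j <= m; j++) {
                    if (used[j]) { u[p[j]] += delta; v[j] -= delta; }
                    else minv[j] -= delta;
                }
                j0 = j1;
            } while (p[j0] != 0);
            do { // augment along the alternating path
                int j1 = way[j0];
                p[j0] = p[j1];
                j0 = j1;
            } while (j0 != 0);
        }
        int[] labelOfTip = new int[n];
        for (int j = 1; j <= m; j++) {
            if (p[j] != 0) labelOfTip[p[j] - 1] = j - 1;
        }
        return labelOfTip;
    }
}
```

A large xTol with a small yTol would tolerate long horizontal gaps between tip and label while still insisting that they sit on the same line. Matches with cost above 1.0 could be rejected afterwards.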

Logfile

Create a logfile recording ERROR/INFO/DEBUG for each image processed. This logfile can be used for:

  • alerting and systematising errors
  • archival
  • metrics
  • re-running refined searches

Consistency of Binomial and EGID

The OCR process provides both a binomial and an EGID (ENAGenbankID). This issue is to devise a strategy for reconciling conflicts in interpretation.

At the OCR level the parse is checked, substituted (for incorrect types of character, e.g. punctuation), and validated against a template syntax. All discussion below relates to valid syntax (NOT necessarily valid content).

OCR
    -> OCR_Binomial
    -> OCR_EGID

EGID and Binomial are then looked up and there is a matrix of:

... TBC ...

NeXML validation issue

None of our NeXML files is quite valid at the moment, as validated against: http://www.nexml.org/nexml/phylows/validator

we currently have:

...
<otus label="RootTaxaBlock" xmlns:cmphy="http://contentmine.org/phylotree">
 <otu id="otu1">Marinobacler hydrocarbonoclasticus ATCC 49840T (ABO19148)</otu>
  <otu id="otu2">Microbulbifer hydrolyticus DSM 11525T (U5813138)</otu>
...
 </otus>
 <trees>
  <tree id="T1">
   <node id="NT1.1" label="NT1.1" x="119.0" y="292.0"/>
...

Instead, we need a reference from trees to the otus tag e.g.:

...
<otus about="#Tls34455" id="Tls34455" xml:base="http://purl.org/phylo/treebase/phylows/taxon/TB2:" label="RootTaxaBlock" xmlns:cmphy="http://contentmine.org/phylotree">
 <otu id="otu1">Marinobacler hydrocarbonoclasticus ATCC 49840T (ABO19148)</otu>
  <otu id="otu2">Microbulbifer hydrolyticus DSM 11525T (U5813138)</otu>
...
 </otus>
 <trees otus="Tls34455">
  <tree id="T1">
   <node id="NT1.1" label="NT1.1" x="119.0" y="292.0"/>
...

NEXML OTU format

The primary output of ami-phylo is NEXML and the current output validates against nexml.org.

I propose we use NEXML to aggregate per-OTU logging information.

   read tree format (e.g. ijsem.xml) 
     and generate regexes (level0= detect, level1=correct) and actions (abort, record error, etc.)
   Run HOCR
   Run diagramanalyzer 
   merge to identify tips (else we analyze other non-tip text)
   foreach tip {
      check text against ijsem.xml (detect)
      if (ok) {
         create tip label in extended NEXML
      } else {
         correct text against level1
         if (ok) create tip, with edit record
      }
      if (!ok(tip)) {
         action(tip)
      }
   }

The proposed extension will be something like

original:
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <otus label="RootTaxaBlock">

   <otu id="otu7">Jonquetella anthropi E3_33 (EU840722)</otu>
...
</otus>
</nexml>
...
new: 
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:cm="http://www.contentmine.org/ami-phylo">
 <otus label="RootTaxaBlock">

<otu id="otu7" cm:genus="Jonquetella" cm:species="anthropi" cm:strain="E3_33" cm:ena="EU840722">Jonquetella anthropi E3_33 EU840722</otu>
...
</otus>
</nexml>

This introduces a new namespace (for contentmine) and allows us to annotate without crashing. Normal NEXML parsers will ignore our new attributes. It makes it easy to extract information using XPath, e.g. search for all genera except Homo:

nexml//otu/@cm:genus[not(.='Homo')]

(there's a bit of XML namespace stuff to be added). This makes the search more precise than a contextless grep for example. (more on garbles follows)
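The namespace stuff could look roughly like this using the standard javax.xml.xpath API. This is a sketch, not ami-phylo code: GenusQuery and generaExceptHomo are hypothetical names, and the prefixes nex and cm are bound to the two namespace URIs from the proposal above. Because otu is in the default NEXML namespace, the XPath has to use a prefix for it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: namespace-aware XPath over the proposed cm:genus attribute.
class GenusQuery {

    static List<String> generaExceptHomo(String nexml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(nexml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    if ("nex".equals(prefix)) return "http://www.nexml.org/2009";
                    if ("cm".equals(prefix)) return "http://www.contentmine.org/ami-phylo";
                    return "";
                }
                // Reverse lookups are not needed for evaluating an expression.
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });
            NodeList hits = (NodeList) xpath.evaluate(
                    "//nex:otu/@cm:genus[not(.='Homo')]", doc, XPathConstants.NODESET);
            List<String> genera = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                genera.add(hits.item(i).getNodeValue());
            }
            return genera;
        } catch (Exception e) {
            throw new IllegalStateException("cannot query NEXML", e);
        }
    }
}
```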

Singleton nodes in Newick

Ross Mounce:
I have discovered that some programs don't like "singleton nodes"; p4 may also be having this problem.
So trees like this:

((X76436:483.0,((X60646:436.0,(ABOZ1185:296.0,(D16268:208.0,(AJ542512:21.0,AJ542509:31.0)NT1.13:245.0)NT1.6:13.0)NT1.5:56.0)NT1.4:103.0,(AJ551329:342.0,(AJ422145:163.0,(EU046268:80.0,AY373018:217.0)NT1.9:88.0)NT1.8:60.0)NT1.7:185.0)NT1.3:64.0)NT1.2:244.0);

must be converted to:

(X76436:483,((X60646:436,(ABOZ1185:296,(D16268:208,(AJ542512:21,AJ542509:31)NT1.13:245)NT1.6:13)NT1.5:56)NT1.4:103,(AJ551329:342,(AJ422145:163,(EU046268:80,AY373018:217)NT1.9:88)NT1.8:60)NT1.7:185)NT1.3:64)NT1.2;

(aside from removing the decimal .0 the significant difference is removing the outermost parentheses)
For trees like the first one above, the R ape package for phylogenetics will give this error message:

MyTree <- read.tree("tree3.tre") #fail
Error in read.tree("tree3.tre") : 
  The tree has apparently singleton node(s): cannot read tree file.
  Reading Newick file aborted at tree no. 1

there is an R script linked from here that I've tried and appears to succeed in removing these singleton nodes: https://stat.ethz.ch/pipermail/r-sig-phylo/2013-June/002783.html

PMR:
If you describe the problem precisely I can edit ami-phylo to avoid this problem. Can the singleton node be anywhere? If so I can alter:

A-
  B
  C-
    D

to

A- 
  B
  D

as C has only one child.
Is that what is required? Universally replace every single-child node with its child.
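If that is the rule, the edit could be sketched as below. Node here is a hypothetical stand-in for the real tree class, and branch lengths are ignored (the length of an elided node would probably need to be added to its child's).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: replaces every node that has exactly one child with that child.
class SingletonElider {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Returns the root of the tidied subtree; the root itself may be elided. */
    static Node elide(Node node) {
        while (node.children.size() == 1) {
            node = node.children.get(0); // skip over a chain of singletons
        }
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, elide(node.children.get(i)));
        }
        return node;
    }
}
```

Applied to the A, B, C, D example, C disappears and D becomes a child of A. Applied to a root with one child, the child becomes the new root, which corresponds to removing the outermost parentheses.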

Configuration file for Tesseract

Current configuration for IJSEM phylogenetic tips is:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._()

located in

/usr/local/share/tessdata/configs/phylo

on my machine.

Detecting/correcting garbles in OCR for ami-phylo

The following regular expression/s should allow (a) validation and (b) controlled correction of errors in the creation of tip labels in ami-phylo

<patternList>
  <!-- 
  Pattern for extracting species and ID from Int. J. Syst. Evol. Microbiol. (IJSEM) publications
  ideal pattern is 
          genus   species  strain    id
  of form
          Abcdia foobarius AS013T (EO740822)
  there should be exactly 4 words (space-separated).
  Any target *without* a single pair of balanced brackets is an absolute fail
  We expect single letter garbles (B->8, 0->O, 1->I, etc.) and unexpected whitespace 
  insertion or deletion (indel)

  Note:
    <space/> translates to \s+
     all fields are wrapped in capture brackets (...)
     and concatenated to a single regex  
   -->
  <pattern level="0">
    <!--  this regex enforces the ID patterns strictly.
    It will only fail when characters are garbled to the same type, e.g.
    M->N , i->l, 3->8
    These are undetectable at this stage
     -->
    <possibleSpace/>
    <!--  genus starts with an uppercase letter followed by either two or more lowercase letters
    or zero/one lowercase letters followed by a period (abbreviation) -->
    <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]?\.)">
    </field>
    <space/>
    <!-- species should be only 2 or more lowercase characters -->
    <field name="species" pattern="[a-z]{2,}(?:\u2019?)">
    </field>
    <space/>
    <field name="strain" pattern="[^\s\(]+">
    </field>
    <space/>
    <!-- ID has an alpha and numeric part EU840723 or AJ307974 or NC_002967 -->
    <!-- require but strip left bracket -->
    <field name="id0" pattern="(?:\()(?:[A-Z]{1,2}|NC_)">
    </field>
    <!--  and right bracket -->
    <field name="id1" pattern="[0-9]{5,6}(?:\))">
    </field>
    <space/>
  </pattern>

  <pattern level="1">
    <!-- this regex allows for common garbles (detected as an error in 0) 
    and error correction by "safe" correction. The correction will generate a conformant 
    field, but it may not be "correct". Each substitution has an error and can be logged.  -->
    <possibleSpace/>
    <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z02S/]?\.)">
      <substitution name="zero2little_o_or_big_o" original="0" edited="[oO]"/>
      <substitution name="two2little_z_or_big_z" original="2" edited="[zZ]"/>
      <substitution name="big_s2little_s" original="S" edited="s"/>
      <substitution name="slash2little_l_or_big_i" original="/" edited="[lI]"/>
      <!-- edit more as we find them -->
    </field>
    <space/>
    <field name="species" pattern="[a-z/]+(?:\u2019?)">
      <substitution name="s_slash_c2lower_sic" original="s/c" edited="sic"/>
      <substitution name="c_slash_l2lower_cil" original="d/" edited="cil"/>
      <substitution name="k_slash_n2lower_kin" original="k/n" edited="kin"/>
      <substitution name="r_slash_o2lower_rio" original="r/o" edited="rio"/>
      <substitution name="zero2little_o" original="0" edited="o"/>
      <substitution name="big_s2little_s" original="S" edited="s"/>
      <substitution name="slash2lower_l" original="/" edited="l"/>
    </field>
    <space/>
    <field name="strain" pattern="[^\s\(]+">
    </field>
    <space/>
    <field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
      <!-- big letters may be garbled to numbers -->
      <substitution name="zero2big_o" original="0" edited="O"/>
      <substitution name="one2big_i" original="1" edited="I"/>
      <substitution name="two2big_z" original="2" edited="Z"/>
      <substitution name="three2big_b" original="3" edited="B"/>
      <substitution name="five2big_s" original="5" edited="S"/>
      <substitution name="eight2big_b" original="8" edited="B"/>
    </field>
    <field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
      <!--  numbers may be garbled to big letters -->
      <substitution name="big_o2zero" original="O" edited="0"/>
      <substitution name="big_b2eight" original="B" edited="8"/>
      <substitution name="big_i2one" original="I" edited="1"/>
      <substitution name="big_s2five" original="S" edited="5"/>
      <substitution name="big_z2two" original="Z" edited="2"/>
    </field>
    <possibleSpace/>
  </pattern>
</patternList>
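For reference, the level 0 fields concatenated by hand into one Java regex look roughly like this. It is a sketch, not the PatternElement code: TipLabelLevel0 and parse are hypothetical names, possibleSpace is assumed to translate to \s*, and the brackets and typographic quotes are placed outside the capture groups so that they are required but stripped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the level 0 pattern as a single regex with one capture group per field.
class TipLabelLevel0 {

    static final Pattern LEVEL0 = Pattern.compile(
            "\\s*"                                                    // possibleSpace (assumed \s*)
            + "(?:\\u2018?)([A-Z](?:[a-z]{2,}|[a-z]?\\.))" + "\\s+"  // genus
            + "([a-z]{2,})(?:\\u2019?)" + "\\s+"                     // species
            + "([^\\s\\(]+)" + "\\s+"                                // strain
            + "\\(([A-Z]{1,2}|NC_)"                                  // id0, left bracket stripped
            + "([0-9]{5,6})\\)"                                      // id1, right bracket stripped
            + "\\s*");

    /** Returns {genus, species, strain, id}, or null if the label fails level 0. */
    static String[] parse(String label) {
        Matcher m = LEVEL0.matcher(label);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4) + m.group(5)};
    }
}
```

A label that fails here would then be passed to the level 1 pattern with its substitutions.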

commandline: output rawHOCR

The HTML from HOCR (Tesseract) should be saveable through a commandline option:

ami-phylo --hocr.html [CTree filename]

will output the raw HTML to a subdirectory (yet to be determined) of CTree

suggested OCR correction rules

#replace 8 or 3 -> B in 8char matches
sed -i 's/\([(,]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace 8 or 3 -> B in 8char matches
sed -i 's/\([(,][A-Z]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace 5 -> B in 8char matches
sed -i 's/\([(,][A-Z0-9]\)[38]\([A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]:\)/\1B\2/g' *.nwk
#replace 8 or 3 -> B in 6char matches
sed -i 's/\([(,]\)[83]\([0-9][0-9][0-9][0-9][0-9]:\)/\1B\2/g' *.nwk
#replace ^0 -> ^D in 8char matches
sed -i 's/\([(,]\)0\([A-Z0-9][A-Z0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1D\2/g' *.nwk
#replace ^0 -> ^D in 6char matches
sed -i 's/\([(,]\)0\([0-9][0-9][0-9][0-9][0-9]:\)/\1D\2/g' *.nwk
#replace DO or D0 -> DQ in 8char matches
sed -i 's/\([(,]D\)[O0]\([0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1Q\2/g' *.nwk
#replace GO or G0 -> GQ in 8char matches
sed -i 's/\([(,]G\)[O0]\([A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]:\)/\1Q\2/g' *.nwk
#replace A1 -> AJ in 8char matches
sed -i 's/\([(,]A\)1\([0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1J\2/g' *.nwk
#replace 2 -> Z in 6char matches
sed -i 's/\([(,]\)2\([0-9][0-9][0-9][0-9][0-9]:\)/\1Z\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z]\)Z\([0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)Z\([0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)Z\([0-9][0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)Z\([0-9]:\)/\12\2/g' *.nwk
#replace Z -> 2 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)Z\(:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)Z\([0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)Z\([0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)Z\([0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)Z\([0-9]:\)/\12\2/g' *.nwk
#SAFE replace Z -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)Z\(:\)/\12\2/g' *.nwk
#SAFE replace OO -> 00 in 8char matches
sed -i 's/\([(,][A-Z][A-Z]\)OO\([0-9][0-9][0-9][0-9]:\)/\100\2/g' *.nwk
#replace O -> 0
sed -i 's/\([(,][A-Z][A-Z][0-9]\)O\([0-9][0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)O\([0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)O\([0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)O\([0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)O\(:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z]\)O\([0-9][0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)O\([0-9][0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)O\([0-9][0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)O\([0-9]:\)/\10\2/g' *.nwk
#SAFE replace O -> 0 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)O\(:\)/\10\2/g' *.nwk
#replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)B\([0-9][0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)B\([0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)B\([0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)B\([0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)B\(:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z]\)B\([0-9][0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)B\([0-9][0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)B\([0-9][0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)B\([0-9]:\)/\18\2/g' *.nwk
#SAFE replace B -> 8 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)B\(:\)/\18\2/g' *.nwk
#replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9]\)G\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9]\)G\([0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9]\)G\([0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9]\)G\([0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]\)G\(:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z]\)G\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9]\)G\([0-9][0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9]\)G\([0-9][0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9]\)G\([0-9]:\)/\16\2/g' *.nwk
#SAFE replace G -> 6 in 6char matches
sed -i 's/\([(,][A-Z][0-9][0-9][0-9][0-9]\)G\(:\)/\16\2/g' *.nwk
#SAFE replace E -> 2 in 8char matches
sed -i 's/\([(,][A-Z][A-Z]\)E\([A-Z0-9][0-9][0-9][0-9][0-9]:\)/\12\2/g' *.nwk
#SAFE replace S -> 6 in 8char matches
sed -i 's/\([(,][A-Z][A-Z][0-9A-Z]\)S\([0-9][0-9][0-9][0-9]:\)/\16\2/g' *.nwk
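
Many of the rules above repeat one idea: a capital letter sitting in the digit part of an accession-like label is turned back into the digit it resembles. A minimal Java sketch of that one idea follows. It is not the ami-phylo implementation. The class name AccessionDigitFixer is hypothetical, and the sketch assumes 8-character labels made of two capitals and six digits, preceded by "(" or "," and followed by ":". It is looser than the sed rules because it rewrites every O, B, Z or G in the six digit positions in one pass, so it should be checked against real trees before use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessionDigitFixer {
    // '(' or ',' then two capitals, then six digit-like characters, with ':' next.
    private static final Pattern LABEL =
        Pattern.compile("([(,][A-Z]{2})([0-9OBZG]{6})(?=:)");

    /** Rewrites O, B, Z and G to 0, 8, 2 and 6 in the digit part of each matching label. */
    public static String fix(String newick) {
        Matcher m = LABEL.matcher(newick);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String digits = m.group(2)
                .replace('O', '0')
                .replace('B', '8')
                .replace('Z', '2')
                .replace('G', '6');
            // quoteReplacement guards against '$' or '\' in the replacement text
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + digits));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Labels that do not fit the assumed shape, such as species names, are left untouched.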

Escape high Unicode points

Some browsers and tools throw errors on code points above 127, even when the text is correctly encoded as UTF-8. These code points should be escaped as decimal numeric character references of the form &#dddd;
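
A minimal Java sketch of such escaping is given here. The class and method names are hypothetical and not part of ami-phylo. It walks the string by code point rather than by char, so characters outside the Basic Multilingual Plane become a single reference.

```java
public class NumericEscaper {
    /** Replaces every code point above 127 with a decimal numeric character reference. */
    public static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

For example, escapeNonAscii("Caf\u00e9") gives Caf&#233; while plain ASCII text is returned unchanged.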

Tidying NCBI taxdump

The NCBI taxdump contains ca 600K records, looking like:

9   |   Buchnera aphidicola |       |   scientific name |
9   |   Buchnera aphidicola Munson et al. 1991  |       |   synonym |
10  |   "Cellvibrio" Winogradsky 1929   |       |   synonym |
10  |   Cellvibrio  |       |   scientific name |
10  |   Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003    |       |   synonym |
11  |   "Cellvibrio gilvus" Hulcher and King 1958   |       |   authority   |
11  |   Cellvibrio gilvus   |       |   equivalent name |
11  |   [Cellvibrio] gilvus |       |   scientific name |
13  |   Dictyoglomus    |       |   scientific name |
13  |   Dictyoglomus Saiki et al. 1985  |       |   authority   |
14  |   ATCC 35947  |       |   type material   |
14  |   DSM 3960    |       |   type material   |
14  |   Dictyoglomus thermophilum   |       |   scientific name |
14  |   Dictyoglomus thermophilum Saiki et al. 1985 |       |   authority   |
14  |   strain H-6-12   |       |   type material   |
16  |   Methyliphilus   |       |   equivalent name |
16  |   Methylophilus   |       |   scientific name |
16  |   Methylophilus Jenkins et al. 1987   |       |   synonym |
16  |   Methylotrophus  |       |   misspelling |
17  |   ATCC 53528  |       |   type material   |
17  |   DSM 46235   |       |   type material   |
17  |   LMG 6787    |       |   type material   |
17  |   Methyliphilus methylitrophus    |       |   equivalent name |
17  |   Methyliphilus methylotrophus    |       |   equivalent name |
17  |   Methylophilus methylitrophus    |       |   equivalent name |
17  |   Methylophilus methylotrophus    |       |   scientific name |
17  |   Methylophilus methylotrophus Jenkins et al. 1987    |       |   authority   |
17  |   Methylophilus sp. CBMB147   |       |   includes    |
17  |   Methylotrophus methylophilus    |       |   synonym |
17  |   NCIB 10515  |       |   type material   |
17  |   NCIMB 10515 |       |   type material   |
17  |   VKM B-1623  |       |   type material   |
18  |   Pelobacter  |       |   scientific name |
18  |   Pelobacter Schink and Pfennig 1983  |       |   authority   |
19  |   DSM 2380    |       |   type material   |
19  |   NBRC 103641 |       |   type material   |
19  |   Pelobacter carbinolicus |       |   scientific name |
19  |   Pelobacter carbinolicus Schink 1984 |       |   authority   |
19  |   strain Gra Bd 1 |       |   type material   |
20  |   Phenylobacterium    |       |   scientific name |

I have looked through this in some detail and propose that for the current ami-phylo (mainly tackling IJSEM) we use ONLY the records marked "scientific name".
These names have the following forms:

Genus
Genus species
Genus species qualifiers...

The qualifiers include:

sp. ddd
subsp. ddd
NTCC 1234

and many more.

  • Proposal : trim ALL third (and later) words of a name to create only binomials
  • Proposal : trim all second words which do not fit [a-z]+ (results in genus only)

These are then sorted and duplicates removed, leaving only single-word genera or two-word binomials.
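
The proposals can be sketched in Java as follows. This is an illustration under stated assumptions, not existing ami-phylo code. The class name TaxdumpTidy is hypothetical. It assumes each input line has the "|"-separated layout shown in the sample, with the name in the second column and the name class in the fourth. The first word is passed through unchanged, so a bracketed genus such as [Cellvibrio] keeps its brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class TaxdumpTidy {
    /** Returns sorted, de-duplicated genus or binomial names from taxdump name lines. */
    public static List<String> tidy(List<String> lines) {
        TreeSet<String> names = new TreeSet<>();   // sorts and removes duplicates
        for (String line : lines) {
            String[] fields = line.split("\\|", -1);
            if (fields.length < 4) {
                continue;                          // not a complete record
            }
            if (!"scientific name".equals(fields[3].trim())) {
                continue;                          // keep only scientific names
            }
            String[] words = fields[1].trim().split("\\s+");
            if (words[0].isEmpty()) {
                continue;                          // empty name column
            }
            String name = words[0];                // genus
            if (words.length > 1 && words[1].matches("[a-z]+")) {
                name += " " + words[1];            // species epithet
            }
            names.add(name);                       // third and later words are dropped
        }
        return new ArrayList<>(names);
    }
}
```

A name whose second word is "sp." therefore collapses to the genus alone, and a name with a subspecies qualifier collapses to its binomial.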

Unicode chars in NeXML output not good for some tree viewers

Dendroscope (phylogenetic tree viewing software) cannot view many of the output NeXML files because they contain Unicode characters.

Error log from Dendroscope:

Executing: open file='/home/ross/workspace/ami-plugin/all-output/all-input/ijs.0.000364-0-003.pbm.png/ijs.0.000364-0-003.pbm.nexml.xml';
[Fatal Error] :13:19: An invalid XML character (Unicode: 0x18) was found in the element content of the document.
org.xml.sax.SAXParseException; lineNumber: 13; columnNumber: 19; An invalid XML character (Unicode: 0x18) was found in the element content of the document.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.nexml.model.DocumentFactory.parse(DocumentFactory.java:52)
    at org.nexml.model.DocumentFactory.safeParse(DocumentFactory.java:62)
    at dendroscope.D.A.B.A(Unknown Source)
    at dendroscope.D.A.B(Unknown Source)
    at dendroscope.commands.OpenFileCommand.apply(Unknown Source)
    at jloda.C.A.F.A(Unknown Source)
    at jloda.C.A.F.D(Unknown Source)
    at dendroscope.N.B$2.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Command usage: open file=<filename>; - Opens a file
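
The character reported in the log, 0x18, is a control character that XML 1.0 does not permit at all, not even as a character reference. One possible remedy is to drop such characters before the NeXML is written. The Java sketch below assumes that dropping is acceptable. The class name XmlCharFilter is hypothetical. The ranges follow the Char production of XML 1.0.

```java
public class XmlCharFilter {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes every code point that XML 1.0 does not allow. */
    public static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCharFilter::isXml10Char)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Tabs, newlines and legitimate non-ASCII letters pass through unchanged. Only the disallowed control characters and unpaired surrogates are removed.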
