ami's People

Contributors

blahah, jcmolloy, mjw99, petermr, tarrow

ami's Issues

AMI wordFrequencies doesn't recognize CSS

When running the standard AMI command from the tutorial (ami2-word --project PROJECTNAME -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt), with stopwords.txt copied from ami2-0.1-SNAPSHOT.jar, the output contains fragments of CSS from the style tag in scholarly.html.

Example of the output:
header28

Input was the scholarly.html created with norma from PMC4350396.
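
One way to keep stylesheet text out of the word counts is to skip everything inside style (and script) elements before tokenising. The following is a minimal sketch in Python, not AMI's actual implementation; the class name, the function name and the sample HTML are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <style> or <script>."""

    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested in a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html_text):
    """Count lower-cased alphabetic words in the visible text only."""
    parser = VisibleTextExtractor()
    parser.feed(html_text)
    parser.close()
    words = re.findall(r"[A-Za-z]+", " ".join(parser.chunks).lower())
    return Counter(words)


# Hypothetical sample input, not taken from PMC4350396.
sample = (
    "<html><head><style>.header28 { color: red; }</style></head>"
    "<body><p>Cells divide. Cells grow.</p></body></html>"
)
freqs = word_frequencies(sample)
print(freqs.most_common())
```

With this filter, selector names and property names from the stylesheet never reach the counter.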

ami2-regex ERROR org.xmlcml.cmine.args.DefaultArgProcessor

When running ami2-regex, I get the following errors. It nonetheless creates subdirectories for results.

...........................................................................0 [main] ERROR org.xmlcml.cmine.args.DefaultArgProcessor - Cannot create scholarlyHtmlElement: nu.xom.IllegalCharacterDataException: 0x2 is not allowed in XML content
19 [main] ERROR org.xmlcml.cmine.args.DefaultArgProcessor - Cannot create scholarlyHtmlElement: nu.xom.IllegalCharacterDataException: 0x2 is not allowed in XML content
.............................................................................................Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at org.apache.commons.io.output.StringBuilderWriter.toString(StringBuilderWriter.java:158)
    at org.apache.commons.io.IOUtils.toString(IOUtils.java:779)
    at org.apache.commons.io.IOUtils.toString(IOUtils.java:803)
    at org.xmlcml.html.HtmlFactory.parse(HtmlFactory.java:631)
    at org.xmlcml.html.HtmlFactory.parse(HtmlFactory.java:623)
    at org.xmlcml.cmine.args.DefaultArgProcessor.getScholarlyHtmlElement(DefaultArgProcessor.java:1187)
    at org.xmlcml.cmine.files.CTree.ensureScholarlyHtmlElement(CTree.java:947)
    at org.xmlcml.cmine.args.DefaultArgProcessor.extractPSectionElements(DefaultArgProcessor.java:1170)
    at org.xmlcml.ami2.plugins.AMIArgProcessor.ensureSectionElements(AMIArgProcessor.java:250)
    at org.xmlcml.ami2.plugins.AMIArgProcessor.runRunMethodsOnChosenArgOptions(AMIArgProcessor.java:221)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.ami2.plugins.regex.RegexPlugin.main(RegexPlugin.java:35)
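
The IllegalCharacterDataException comes from a control character (0x2) that XML 1.0 does not permit in content. As a possible workaround, such characters could be stripped from the input before parsing. The sketch below is in Python and only illustrates the idea; the function name is hypothetical and this is not something AMI or norma is known to do.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text):
    """Remove characters that an XML 1.0 parser must reject."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Hypothetical input containing the 0x2 control character.
dirty = "<p>alpha\x02beta</p>"
clean = strip_illegal_xml_chars(dirty)
root = ET.fromstring(clean)
print(root.text)
```

This does not address the separate OutOfMemoryError, which occurs while a large document is being read into a single string.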

Tests failure on v0.2.7

This issue tracks every failing test that we see in the latest development version:

On my gentoo box with:

Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T12:57:37+01:00)
Maven home: /usr/share/maven-bin-3.3
Java version: 1.8.0_51, vendor: Oracle Corporation
Java home: /opt/oracle-jdk-bin-1.8.0.51/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "4.0.5-gentoo", arch: "amd64", family: "unix"

We have two failures:

Failed tests:   testIdentifier(org.xmlcml.ami2.TutorialTest): results assertion failure: starts with: <results title="bio.ena"><result pre="an usher protein family domain (PFA" exact="M00577" post=") and/or flanking chaperone (PF00345, PF02753 or C" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][6]/*[local-name()='div'][1]/*[local-name()='p
  testGeneDictionary(org.xmlcml.ami2.plugins.gene.GeneArgProcessorTest): results assertion failure: starts with: <results title="hgnc" />

ami2-regex: --context influences matches

Different --context sizes influence the number of matches.

Software versions

$ getpapers -V
0.4.14

$ norma --version
norma(0.3.1)
norma(0.3.1)

$ ami2-regex --version
regex(null)
regex(null)

# (from the 0.2.24 .deb release)

Steps

$ mkdir tmp
$ getpapers -q PMCID:PMC4833924 -x -o tmp
$ norma -p . -i fulltext.xml -o scholarly.html --transform nlm2html

regex.xml

<compoundRegex title="jrc">
  <regex fields="jrc">NM[-]?\d\d\d</regex>
</compoundRegex>

Output

$ ami2-regex --project tmp --context 25 25 -i scholarly.html --r.regex regex.xml
$ more tmp/PMC4833924/results/regex/jrc/results.xml | wc -l
94

$ ami2-regex --project tmp --context 0 0 -i scholarly.html --r.regex regex.xml
$ more tmp/PMC4833924/results/regex/jrc/results.xml | wc -l
115

$ ami2-regex --project tmp --context 256721 256721 -i scholarly.html --r.regex regex.xml
$ more tmp/PMC4833924/results/regex/jrc/results.xml | wc -l
50 # note that there's text wrapping happening here too

256721 should be the number of characters in the scholarly.html, so I had expected the last output to be 1, but it seems the issue is more complex than that.
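
One possible explanation, which the source does not confirm, is that the context is captured by the same regex as the match, so the text consumed as context cannot start another non-overlapping match. The Python sketch below assumes a simplified wrapping of the form (.{0,N})(pattern)(.{0,N}); the exact wrapping AMI uses, the helper name and the sample sentence are assumptions.

```python
import re


def count_matches(text, pattern, context):
    """Count non-overlapping matches of the pattern wrapped in context groups."""
    wrapped = re.compile(r"(.{0,%d})(%s)(.{0,%d})" % (context, pattern, context))
    return len(wrapped.findall(text))


# Hypothetical sentence with three occurrences of the pattern.
text = "Ag NM-300 and NM-300K DIS were compared with NM-100 in water."
pattern = r"NM[-]?\d\d\d"

for context in (0, 25, 1000):
    print(context, count_matches(text, pattern, context))
```

In this toy case the count falls from 3 to 2 to 1 as the context grows, because earlier occurrences are swallowed by the leading context group and later ones by the trailing group. If that is what happens in AMI, matching the bare pattern first and slicing the context out of the text afterwards would make the count independent of --context.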

Github port of AMI

More of AMI has been ported from Bitbucket to GitHub. See ContentMine/cm-ucl#18, where we are using this port.
The projects/repos are:

euclid
svg
html
imageanalysis
pdf2svg
svg2xml

These have all got new version numbers and commits will need versioning.

Currently the build fails for @chartgerink (see ContentMine/cm-ucl#18) because the parent pom is not recognized.

Bad plugin definitions in pom.xml

We need to tidy the pom.xml to remove these (non-breaking) problems.

[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.apache.maven.plugins:maven-assembly-plugin @ line 243, column 12
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ line 229, column 12

ami2-gene: case does not seem to get ignored with "--w.case ignore"

This is a suggested option in the output of "ami2-gene --help"...

Symptom:

<?xml version="1.0" encoding="UTF-8"?>
<results title="human">
 <result pre="Tgfb3, Cdk1 and Ccna2 with known functions in cell growth and proliferation, and " 
    exact="P53"
    post=", Tnfsf10 and Prkar2b genes in regulation of cell death, see Supplement"
    xpath="/html[1]/body[1]/div[2]/div[3]/div[12]/p[1]"
    dictionary="hgnc"
    dictionaryCheck="false"
    name="human"/>
 <result pre="R as previously described 58. Goose glyceraldehyde-3-phosphate dehydrogenase gene ( "
    exact="GAPDH"
    post=") gene was used as an internal control. The primer sequences of the genes were listed in &lt;a href=&quot;#"
    name="human"/>
</results>

The gene (in the post of the first result) Tnfsf10 is in the HGNC dictionary, but only as upper case. I would expect this gene to be found too when ignoring casing...
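
The expected behaviour can be sketched as a lookup that normalises both the dictionary entries and the candidate token when case is ignored. This is a Python illustration only; the function name and the three-symbol excerpt are hypothetical, and the real HGNC dictionary is far larger.

```python
def lookup(token, symbols, ignore_case):
    """Return the dictionary symbol matching the token, or None."""
    if ignore_case:
        # Compare upper-cased forms, but return the canonical symbol.
        index = {symbol.upper(): symbol for symbol in symbols}
        return index.get(token.upper())
    return token if token in symbols else None


# Hypothetical excerpt of an upper-case-only gene dictionary.
hgnc_symbols = ["TNFSF10", "GAPDH", "PRKAR2B"]

print(lookup("Tnfsf10", hgnc_symbols, ignore_case=True))
print(lookup("Tnfsf10", hgnc_symbols, ignore_case=False))
```

With case ignored, Tnfsf10 resolves to the upper-case dictionary entry; without it, the mixed-case token is not found.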

Ami fails to compile

When I run "mvn package" (per README), I get this compile error:

[ERROR] Failed to execute goal on project ami2: Could not resolve dependencies for project org.xml-cml:ami2:jar:0.3.0: Failed to collect dependencies for [org.xml-cml:norma:jar:0.3.0 (compile), org.xml-cml:diagramanalyzer:jar:0.1-SNAPSHOT (compile), org.apache.lucene:lucene-core:jar:4.10.3 (compile), org.apache.lucene:lucene-analyzers-common:jar:4.10.3 (compile), com.google.guava:guava:jar:18.0 (compile), junit:junit:jar:4.11 (test), blogspot.software-and-algorithms:stern-library:jar:0.1 (compile), org.vafer:jdeb:jar:1.3 (provided), com.jayway.jsonpath:json-path:jar:2.0.0 (compile), com.univocity:univocity-parsers:jar:1.5.6 (compile)]: Failed to read artifact descriptor for org.xml-cml:norma:jar:0.3.0: Could not transfer artifact org.xml-cml:norma:pom:0.3.0 from/to ebi-repo (http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/): Failed to transfer file: http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/org/xml-cml/norma/0.3.0/norma-0.3.0.pom. Return code is: 500 , ReasonPhrase:Internal Server Error. -> [Help 1]

I would like to set up a build environment because I want to write a plugin that uses a controlled list of nanomaterial names (e.g. "NM-100") to extract their occurrences in articles...

Collating Word Clouds

Hi, I think it would be a good idea if the individual word clouds generated could be collated together to create a descriptive word cloud for all the documents. Would this be possible?

Cheers!
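
Collating could be as simple as summing the per-document word counts before rendering a single cloud. The sketch below is in Python and assumes the per-document frequencies have already been read into plain dictionaries; the document identifiers and counts are invented for illustration, and nothing here reads AMI's own output files.

```python
from collections import Counter


def collate(per_document_counts):
    """Sum word counts across documents into one overall Counter."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total


# Hypothetical per-document word frequencies.
doc_counts = {
    "doc_a": {"coral": 3, "ciliate": 1},
    "doc_b": {"coral": 2, "vibrio": 4},
}
overall = collate(doc_counts.values())
print(overall.most_common(2))
```

The combined counter can then be passed to whatever renders the individual clouds.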

ami2-regex: XPath not working

XPaths in ami2-regex results are all the same, while the actual matches aren't. It seems to be the XPath of the last match. Now that I'm looking at things: the last match doesn't actually have an XPath in the results.xml.

Software versions

$ getpapers -V
0.4.14

$ norma --version
norma(0.3.1)
norma(0.3.1)

$ ami2-regex --version
regex(null)
regex(null)

# (from the 0.2.24 .deb release)

Steps

$ mkdir tmp
$ getpapers -q PMCID:PMC4833924 -x -o tmp
$ norma -p . -i fulltext.xml -o scholarly.html --transform nlm2html
$ ami2-regex --project tmp --context 25 25 -i scholarly.html --r.regex regex.xml

regex.xml

<compoundRegex title="jrc">
  <regex fields="jrc">NM[-]?\d\d\d</regex>
</compoundRegex>

Output

PMC4833924/results/regex/jrc/results.xml

<?xml version="1.0" encoding="UTF-8"?>
<results title="jrc">
 <result pre=" 7 April 2016). Since Ag " name0="jrc" value0="NM-300" post="K was provided as dispers" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="ispersant alone, i.e. Ag " name0="jrc" value0="NM-300" post="K DIS, was assessed (NM-x" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
...
 <result pre="as. The dispersant of Ag " name0="jrc" value0="NM-300" post="K alone (that does not co" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="e’ in the BCOP assay; Ag " name0="jrc" value0="NM-300" post="K DIS was assessed as ‘no"/>
</results>

inconsistencies in results.xml

When viewing results.xml files, some differences between plugins came up:
species has attributes: pre, exact, match, post
gene has attributes: pre, exact, name, post
sequence has attributes: pre, exact, name, post, xpath
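
Differences like these can be listed mechanically by collecting the attribute names used on result elements in each file. A small Python sketch follows; the function name and the two shortened XML documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def result_attribute_names(results_xml):
    """Return the set of attribute names used on any <result> element."""
    root = ET.fromstring(results_xml)
    names = set()
    for result in root.iter("result"):
        names.update(result.attrib.keys())
    return names


# Hypothetical, shortened results.xml documents for two plugins.
species_xml = (
    '<results title="binomial">'
    '<result pre="a" exact="V. harveyi" match="Vibrio harveyi" post="b"/>'
    "</results>"
)
gene_xml = (
    '<results title="human">'
    '<result pre="a" exact="GAPDH" name="human" post="b"/>'
    "</results>"
)

species_attrs = result_attribute_names(species_xml)
gene_attrs = result_attribute_names(gene_xml)
only_in_one = species_attrs ^ gene_attrs
print(sorted(only_in_one))
```

The symmetric difference shows which attributes are not shared between the two outputs.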

ami2-regex error with multiple group returns

cannot reproduce an example like this: ( ( group1) ( group2 ) )
<regex weight="1.0" fields="sunlight amount sunshade">(([Ff]ull|[Pp]artial)\s+([Ss]hade|[Ss]un))</regex> (source)

all run with $ ami2-regex -q dinosaurs-xmls-regex/ -i scholarly.html --r.regex dinosaurs-xmls-regex/dinosaurfood.xml --context 50 50

works: ( group1 ) with one field
<regex weight="1.0" fields="food">([Ff]ood\ssupply)</regex>

does not work: ( ( group1 ) (group2 ) ) with two fields
<regex weight="1.0" fields="food supply">(([Ff]ood)\s(supply))</regex>

1225 [main] DEBUG org.xmlcml.ami2.plugins.regex.RegexComponent  - Unusual fieldList: [food, supply] in dinosaurfood; found: <regex weight="1.0" fields="food supply">(.{0,50})(([Ff]ood)\s(supply))\s+(.{0,50})</regex>
11409 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (5; [t. Nevertheless, we are aware that parental care, , food supply, food, supply, and the allocation of available energy to growth a]) does not match fieldList (2;[food, supply])

does not work: ( group1 ) ( group2 ) with two fields
<regex weight="1.0" fields="food supply">([Ff]ood)\s(supply)</regex>

1164 [main] DEBUG org.xmlcml.ami2.plugins.regex.RegexComponent  - Unusual fieldList: [food, supply] in dinosaurfood; found: <regex weight="1.0" fields="food supply">(.{0,50})([Ff]ood)\s(supply)\s+(.{0,50})</regex>
10386 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (4; [t. Nevertheless, we are aware that parental care, , food, supply, and the allocation of available energy to growth a]) does not match fieldList (2;[food, supply])

does not work: ( ( group1 ) ( group2 ) ) with three fields
<regex weight="1.0" fields="foodsupply food supply">(([Ff]ood)\s(supply))</regex>

1211 [main] DEBUG org.xmlcml.ami2.plugins.regex.RegexComponent  - Unusual fieldList: [foodsupply, food, supply] in dinosaurfood; found: <regex weight="1.0" fields="foodsupply food supply">(.{0,50})(([Ff]ood)\s(supply))\s+(.{0,50})</regex>
11297 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (5; [t. Nevertheless, we are aware that parental care, , food supply, food, supply, and the allocation of available energy to growth a]) does not match fieldList (3;[foodsupply, food, supply])

does not work: last query with three fields but without --context

does not work: last query with three fields but with --context 0 0

1222 [main] DEBUG org.xmlcml.ami2.plugins.regex.RegexComponent  - Unusual fieldList: [foodsupply, food, supply] in dinosaurfood; found: <regex weight="1.0" fields="foodsupply food supply">(.{0,0})(([Ff]ood)\s(supply))\s+(.{0,0})</regex>
8086 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (5; [, food supply, food, supply, ]) does not match fieldList (3;[foodsupply, food, supply])
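
The group counts in the ERROR lines are consistent with the two context groups being counted together with the user's own groups. The Python sketch below reproduces only that arithmetic, using the wrapped form shown in the DEBUG lines; the helper names are hypothetical, Python's re module stands in for Java's regex engine, and whether AMI is meant to subtract the context groups is an assumption.

```python
import re


def wrap_with_context(pattern, context):
    """Wrap a user pattern the way the DEBUG output above shows it."""
    return r"(.{0,%d})%s\s+(.{0,%d})" % (context, pattern, context)


def group_count(pattern):
    """Number of capturing groups in a pattern."""
    return re.compile(pattern).groups


user_pattern = r"(([Ff]ood)\s(supply))"
fields = "foodsupply food supply".split()
wrapped = wrap_with_context(user_pattern, 50)

print(len(fields), group_count(user_pattern), group_count(wrapped))
```

The user pattern has three groups for three fields, but the wrapped pattern has five, matching "groupList (5; ...) does not match fieldList (3; ...)". Comparing the field list against the group count minus the two context groups would make these cases agree.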

Enable Travis

Make travis work on this repository.

Currently the repository has been added to travis.org, and Travis will run on anything with a .travis.yml.

The build fails with:

[ERROR] /home/travis/build/ContentMine/ami-plugin/src/main/java/org/xmlcml/ami2/plugins/phylotree/PhyloTreeArgProcessor.java:[80,17] cannot find symbol
  symbol:   method CORE_LOG()
  location: class org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor
[ERROR] /home/travis/build/ContentMine/ami-plugin/src/main/java/org/xmlcml/ami2/plugins/phylotree/PhyloTreeArgProcessor.java:[130,27] cannot find symbol
  symbol:   method CORE_LOG()
  location: class org.xmlcml.ami2.plugins.phylotree.PhyloTreeArgProcessor
[ERROR] /home/travis/build/ContentMine/ami-plugin/src/main/java/org/xmlcml/ami2/plugins/ResultsAnalysis.java:[249,31] cannot find symbol
  symbol:   method setColumnHeadingList(java.util.List<org.xmlcml.cmine.util.CellRenderer>)
  location: variable dataTablesTool of type org.xmlcml.cmine.util.DataTablesTool
[ERROR] /home/travis/build/ContentMine/ami-plugin/src/main/java/org/xmlcml/ami2/plugins/CommandProcessor.java:[147,41] cannot find symbol
  symbol:   variable columnHeadingList
  location: variable dataTablesTool of type org.xmlcml.cmine.util.DataTablesTool
[ERROR] /home/travis/build/ContentMine/ami-plugin/src/main/java/org/xmlcml/ami2/plugins/CommandProcessor.java:[149,56] cannot find symbol
  symbol:   variable columnHeadingList
  location: variable dataTablesTool of type org.xmlcml.cmine.util.DataTablesTool

Results analysis fails: "Can not find plugin: null:null"

Windows error message when running cmine
......................Exception in thread "main" java.lang.RuntimeException: Can not find plugin: null:null
    at org.xmlcml.ami2.plugins.ResultsAnalysis.makeHtmlDataTable(ResultsAnalysis.java:243)
    at org.xmlcml.ami2.plugins.CommandProcessor.createDataTables(CommandProcessor.java:142)
    at org.xmlcml.ami2.plugins.CommandProcessor.main(CommandProcessor.java:174)

ami2-species Name detection improvements

Now that we're publishing FACTS to http://facts.contentmine.org/ (which is great!), it would be good to make some simple adjustments to the ami-species plugin to filter out some false positives, e.g. 'Mbp' below:

[Screenshot: 2015-09-19-144453_1258x736_scrot]

May I suggest that for two- or three-letter matches (only), these must be further checked against a whitelist of genera provided by the NCBI taxdump? There really are only a very few two- or three-letter genera so hopefully this whitelist approach for the shortest of names wouldn't be too computationally costly?

Examples of two- or three-letter genera to whitelist include:
(There are no one letter genera you'll be glad to know!)

Pan https://en.wikipedia.org/wiki/Pan_(genus)
Ia https://en.wikipedia.org/wiki/Ia_(genus)
Aa https://en.wikipedia.org/wiki/Aa_(plant)
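
The proposed check can be sketched as a filter that consults a whitelist only for names of three letters or fewer. This Python example is illustrative: the whitelist holds just the three genera named above (a real one would be built from the NCBI taxdump), and the function name and candidate list are hypothetical.

```python
# Only the three examples from this issue; a real list would come from the NCBI taxdump.
SHORT_GENUS_WHITELIST = {"Pan", "Ia", "Aa"}


def keep_candidate(name, whitelist=SHORT_GENUS_WHITELIST, max_short_len=3):
    """Keep long names unconditionally; keep short names only if whitelisted."""
    if len(name) <= max_short_len:
        return name in whitelist
    return True


# Hypothetical candidate genus matches.
candidates = ["Mbp", "Pan", "Vibrio", "Ia"]
kept = [name for name in candidates if keep_candidate(name)]
print(kept)
```

Because the set lookup only runs for the shortest names, the extra cost per candidate is small.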

Merge dev into master

We want to include the updates of dev into master. This is blocked on enabling Travis (itself blocked by #48 and #47); otherwise we risk breaking master. Alternatively, we could enable another form of CI, e.g. by changing where Jenkins looks.

Refactoring AMI and Norma into Emma

I am creating a new module which is a (currently simple) workflow for running norma and AMI jobs.

Because its function is running, the animal is an Emu (which can run 50 kph) and she's called Emma.

Emma will clean up the recent CmineParser and CMineCommand into separate modules. Currently:

  • EmmaParser
  • EmmaCommand
  • Emma the runner. This can be standalone and will run Norma as required and Ami Commands as required.

ami2-species --lookup wikipedia not working?

Tried RSU's speciesexample.sh from the Wellcome Trust workshop:
https://github.com/ContentMine/wellcome-2015-files/blob/master/02_ami/speciesexample.sh

I noticed it had a --lookup wikipedia flag, so I was wondering where the results of that go.
It seems to me that the flag didn't do anything.

The results.xml file looks like this (Wikipedia definitely has at least one of the binomials: https://en.wikipedia.org/wiki/Vibrio_harveyi):

<results title="binomial">
 <result pre="ntimicrobial activity (assessed on " exact="Vibrio harveyi" match="Vibrio harveyi" post=" cultures) was limited in both H an" name="binomial"/>
 <result pre="ia genus Vibrio, including; " exact="V. harveyi" match="Vibrio harveyi" post=" [ 14], V. mediterranei [ 7]" name="binomial"/>
 <result pre="ncluding; V. harveyi [ 14], " exact="V. mediterranei" match="Vibrio mediterranei" post=" [ 7], V. owensii [ 15] and " name="binomial"/>
 <result pre=" 14], V. mediterranei [ 7], " exact="V. owensii" match="Vibrio owensii" post=" [ 15] and V. coralliilyticus&lt;/i" name="binomial"/>
 <result pre=" [ 7], V. owensii [ 15] and " exact="V. coralliilyticus" match="Vibrio coralliilyticus" post=" [ 10]. Furthermore, Piskorska, Smi" name="binomial"/>
 <result pre="Smith [ 7], reports an increase in " exact="V. mediterranei" match="Vibrio mediterranei" post=" in the mucus of WS-affected Ech" name="binomial"/>
 <result pre="ei in the mucus of WS-affected " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=", compared to healthy samples. Howe" name="binomial"/>
 <result pre="iated with healthy and WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=", an approach that can lead to phyl" name="binomial"/>
 <result pre="ve a dominant presence in diseased " exact="A. muricata" match="A. muricata" post=" samples in a study screening the m" name="binomial"/>
 <result pre="Sweet and Bythell [ 18] also found " exact="V. harveyi" match="Vibrio harveyi" post=" to increase in diseased samples, b" name="binomial"/>
 <result pre=" healthy samples does not rule out " exact="V. harveyi" match="Vibrio harveyi" post=" as a primary causal agent in WS, a" name="binomial"/>
 <result pre="ated from WS-affected specimens of " exact="Acropora muricata" match="Acropora muricata" post=" and shown to be able to ingest the" name="binomial"/>
 <result pre="te community associated with WS in " exact="E. lamellosa" match="Echinopora lamellosa" post=" and their roles in disease aetiolo" name="binomial"/>
 <result pre="iated with healthy and WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=" in aquaria, using culture-independ" name="binomial"/>
 <result pre="ciliate communities of WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=" was analysed to determine the effi" name="binomial"/>
 <result pre="ealthy and White Syndrome-affected " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=" fragments (donated by the Horniman" name="binomial"/>
 <result pre="mining the causal agent/s of WS in " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=" through a process of elimination. " name="binomial"/>
 <result pre="WSU) and treated WS-affected (WST) " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples, n = 3 H, n = 3 ST, n = 3 " name="binomial"/>
 <result pre="apparently healthy and WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples. Each tissue sample was tr" name="binomial"/>
 <result pre="Pure cultures of " exact="Vibrio harveyi" match="Vibrio harveyi" post=", a bacterial species heavily consi" name="binomial"/>
 <result pre="viously housed WS-affected corals. " exact="V. harveyi" match="Vibrio harveyi" post=" was isolated using thiosulfate-cit" name="binomial"/>
 <result pre="triplicate control for uninhibited " exact="V. harveyi" match="Vibrio harveyi" post=" growth (97.5 μl Mueller Hinton Bro" name="binomial"/>
 <result pre=" Mueller Hinton Broth + 97.5 μl of " exact="V. harveyi" match="Vibrio harveyi" post=" culture), a triplicate control for" name="binomial"/>
 <result pre=" MHB + 5 μl 100% ethanol + 97.5 μl " exact="V. harveyi" match="Vibrio harveyi" post=" culture) and a triplicate of each " name="binomial"/>
 <result pre=" + 5 μl coral extract + 97.5 μl of " exact="V. harveyi" match="Vibrio harveyi" post="). Using 5 μl coral extract in a 20" name="binomial"/>
 <result pre="tely equal, bacterial abundance in " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples was compared across all sa" name="binomial"/>
 <result pre="nstead to determine differences in " exact="Vibrio harveyi" match="Vibrio harveyi" post=" growth inhibition between coral sa" name="binomial"/>
 <result pre="drome (WS) and treated WS-affected " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=" samples (based on DGGE band matchi" name="binomial"/>
 <result pre=" to their association with healthy " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples, these bacteria were consi" name="binomial"/>
 <result pre="ated White Syndrome-affected (WST) " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=" samples. Error bars represent SE. " name="binomial"/>
 <result pre="ies in all white syndrome-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples (treated and untreated), w" name="binomial"/>
 <result pre="3). Ciliates C1 (97% similarity to " exact="Cohnilembus verminus" match="Cohnilembus verminus" post=") and C3 (99% similarity to Pseu" name="binomial"/>
 <result pre="ound to be most closely related to " exact="Philaster digitformis" match="Philaster digitformis" post=" Morph 3 ( Fig. 6), a ciliate isola" name="binomial"/>
 <result pre="4 is proposed to be a new morph of " exact="P. digitformis" match="Philaster digitformis" post=", designated as Morph 4. Morph 4 di" name="binomial"/>
 <result pre="445 bases. C2 (98% similarity with " exact="Loxophyllum rostratum" match="Loxophyllum rostratum" post=") is only represented in two WSU an" name="binomial"/>
 <result pre="el [ 40] with an opalinid protist; " exact="Opalina ranarum" match="Opalina ranarum" post=" (AF141970) as the outgroup, as use" name="binomial"/>
 <result pre="samples of the scleractinian coral " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=", whilst also determining WS-associ" name="binomial"/>
 <result pre="6S rRNA) diversity associated with " exact="E. lamellosa" match="Echinopora lamellosa" post=" was found to increase between heal" name="binomial"/>
 <result pre=" being absent in healthy samples). " exact="Tenacibaculum maritimum" match="Tenacibaculum maritimum" post=" has previously been shown to be as" name="binomial"/>
 <result pre="iofilm on the coral surface [ 43]. " exact="T. maritimum" match="Tenacibaculum maritimum" post=" has also been shown to be responsi" name="binomial"/>
 <result pre="and tumour-affected soft tissue in " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=", which often subsequently became a" name="binomial"/>
 <result pre="ecently been associated with WS in " exact="Acropora cervicornis" match="Acropora cervicornis" post=" [ 46]. Interestingly, the ribotype" name="binomial"/>
 <result pre="usly been found in associated with " exact="Oculina patagonica" match="Oculina patagonica" post=", from which isolates showed antimi" name="binomial"/>
 <result pre="action against the coral pathogens " exact="Vibrio shiloi" match="Vibrio shiloi" post=", V. coralliilyticus and " name="binomial"/>
 <result pre="al pathogens Vibrio shiloi, " exact="V. coralliilyticus" match="Vibrio coralliilyticus" post=" and Thalassomonas loyana [ " name="binomial"/>
 <result pre="/i&gt;, V. coralliilyticus and " exact="Thalassomonas loyana" match="Thalassomonas loyana" post=" [ 22]. It could therefore be expec" name="binomial"/>
 <result pre="owever, was found to be present in " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples treated with the antibioti" name="binomial"/>
 <result pre=", previously proposed WS pathogen) " exact="Vibrio harveyi" match="Vibrio harveyi" post=". This could suggest that antimicro" name="binomial"/>
 <result pre="althy and diseased specimens, with " exact="E. lamellosa" match="Echinopora lamellosa" post=" appearing to have a naturally very" name="binomial"/>
 <result pre="ted antimicrobial capacity against " exact="Vibrio harveyi" match="Vibrio harveyi" post=". This could suggest that not all c" name="binomial"/>
 <result pre="ggest that not all corals, such as " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=", possess such antimicrobial qualit" name="binomial"/>
 <result pre=" [ 60– 62] and it is possible that " exact="E. lamellosa" match="Echinopora lamellosa" post=" features antimicrobial properties " name="binomial"/>
 <result pre="perties that do not select against " exact="Vibrio harveyi" match="Vibrio harveyi" post=". To understand the overall antimic" name="binomial"/>
 <result pre="robial capacity and selectivity of " exact="E. lamellosa" match="Echinopora lamellosa" post=" (and other species), future antimi" name="binomial"/>
 <result pre="iliates were identified in healthy " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples, thus making this the firs" name="binomial"/>
 <result pre="iates with WS in the plating coral " exact="E. lamellosa" match="Echinopora lamellosa" post=". A Philaster sp was also fo" name="binomial"/>
 <result pre="e of another Philaster sp ( " exact="P. lucinda" match="Philaster lucinda" post=") in WBD and WS [ 46, 63]. It was o" name="binomial"/>
 <result pre="ued. This promotes the theory that " exact="P. lucinda" match="Philaster lucinda" post=" is a secondary agent in WBD, which" name="binomial"/>
 <result pre="ate being found in all WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=" samples in the present study. C" name="binomial"/>
 <result pre="es between healthy and WS-affected " exact="E. lamellosa" match="Echinopora lamellosa" post=". Given its association with the fo" name="binomial"/>
 <result pre="n formation and tissue necrosis in " exact="Echinopora lamellosa" match="Echinopora lamellosa" post=". Since this bacterium has not been" name="binomial"/>
</results>

AMI file not found using chocolatey

From a ContentMine user:

"I tried to download ami on my windows 10 machine, but it could not find the file using chocolatey.
Is the link incorrect on the website, or has ami been moved?"

Deprecate `projectSnippetsTree` in favour of `pluginSnippetsTree`

The current code uses projectSnippetsTree for code and XML describing a collection of SnippetsTrees from a search or other plugin. Since a cProject can have many of these trees, the name is misleading.

Deprecate projectSnippetsTree in favour of pluginSnippetsTree by cloning the code, renaming it, and marking the old version as @Deprecated.

Also since the semantics are effectively <PluginOption>.snippets.xml we can retrieve the SnippetsTree by its PluginOption (name).

Repository is unexpectedly large

The GitHub API gives the size of the Norma repository as 362425 KB and the AMI repository as 301415 KB.

The recent experience of two new developers, both of whom needed to buy additional hardware in order to be able to clone and work with these repositories, suggests that new users or developers are unlikely to expect these repositories to be so large.

Ways of reducing the size of the repositories should be investigated. For instance, could the repositories' test corpora be factored out into a different module that can be shared, as a dependency, between Norma, AMI, and perhaps other modules in the AMI stack?

(Corresponding Norma issue: ContentMine/norma#71 .)

Installation of AMI on Windows fails

C:\WINDOWS\system32>choco install ami -s https://www.myget.org/F/contentmine/api/v2 -y
Chocolatey v0.10.7
Installing the following packages:
ami
By installing you accept licenses for the packages.
Progress: Downloading ami 0.2.24... 100%

ami v0.2.24
ami package files install completed. Performing other installation steps.
Attempt to get headers for https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2-bin.zip failed.
The remote file either doesn't exist, is unauthorized, or is forbidden for url 'https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2-bin.zip'. Exception calling "GetResponse" with "0" argument(s): "The remote server returned an error: (404) Not Found."
Downloading ami 64 bit
from 'https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2-bin.zip'
ERROR: The remote file either doesn't exist, is unauthorized, or is forbidden for url 'https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2-bin.zip'. Exception calling "GetResponse" with "0" argument(s): "The remote server returned an error: (404) Not Found."
The install of ami was NOT successful.
Error while running 'C:\ProgramData\chocolatey\lib\ami\tools\chocolateyinstall.ps1'.
See log for details.

Chocolatey installed 0/1 packages. 1 packages failed.
See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

Failures

  • ami (exited 404) - Error while running 'C:\ProgramData\chocolatey\lib\ami\tools\chocolateyinstall.ps1'.
    See log for details.

Prune unnecessary code/content

There are some directories and files that may not need to be a part of this repo. We should eliminate them early.

We can use this thread to highlight files and discuss whether they should be kept. Once a list of files to remove is drawn up, I will prune them in a pull request.

Revise argument order in norma / ami

The argument order should not affect the parsing and running of norma and ami. However at present we require:

  • the first argument should be a plugin (e.g. "-q" fails)
  • initiation (e.g. creating argumentOptions) should not have to come before parsing and running.

question: how can I make regex greedy?

I have a regular expression file:

<compoundRegex title="jrc">
    <regex fields="jrc">NM[-]?\d\d\d[K]?</regex>
</compoundRegex>

This finds NM-100, etc. However, it also finds (ctj JSON output):

    {
      "pre": " OECD reference material ",
      "name0": "jrc",
      "value0": "NM-300",
      "post": "K; nano-AgB 5–50 nm) on t",
      "pmc": "PMC3841577"
    },

Here, I would like it to "find" NM-300K, for which the regex needs to be greedy. Is there any way of specifying this in the config file above?
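
For reference, the ? quantifier is already greedy in standard regex engines, so the pattern on its own does pick up the trailing K. The Python check below uses the re module as a stand-in for Java's regex engine, and the sample sentence is adapted from the JSON output above. It does not explain why AMI reports NM-300; that may depend on how AMI wraps the pattern or on which regex file was loaded, neither of which is established here.

```python
import re

pattern = re.compile(r"NM[-]?\d\d\d[K]?")

# Sample text adapted from the "pre" and "post" fields above.
text = "OECD reference material NM-300K; nano-AgB 5-50 nm"
found = pattern.findall(text)
print(found)

# The optional parts stay optional: names without the hyphen or the K still match.
others = pattern.findall("NM-100 and NM300 were also tested")
print(others)
```

If the bare pattern behaves the same way in Java, the shortened match is more likely caused by something around the pattern than by the quantifier.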

java.lang.OutOfMemoryError: GC overhead limit exceeded

running: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]
WS: XXXXX  .............................!!.................................................................................!!...............................!!.................................................................................................Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at nu.xom.ParentNode.checkCapacity(Unknown Source)
    at nu.xom.ParentNode.fastInsertChild(Unknown Source)
    at nu.xom.NonVerifyingHandler.startElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLNamespaceBinder.handleStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLNamespaceBinder.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at nu.xom.Builder.build(Unknown Source)
    at nu.xom.Builder.build(Unknown Source)
    at org.xmlcml.xml.XMLUtil.parseXML(XMLUtil.java:366)
    at org.xmlcml.html.HtmlFactory.parseToXHTML(HtmlFactory.java:701)
    at org.xmlcml.html.HtmlFactory.parse(HtmlFactory.java:646)
    at org.xmlcml.html.HtmlFactory.parse(HtmlFactory.java:623)
    at org.xmlcml.cmine.args.DefaultArgProcessor.getScholarlyHtmlElement(DefaultArgProcessor.java:1176)
    at org.xmlcml.cmine.files.CTree.ensureScholarlyHtmlElement(CTree.java:947)
    at org.xmlcml.cmine.args.DefaultArgProcessor.extractPSectionElements(DefaultArgProcessor.java:1159)
    at org.xmlcml.ami2.plugins.AMIArgProcessor.ensureSectionElements(AMIArgProcessor.java:191)
    at org.xmlcml.ami2.plugins.AMIArgProcessor.runRunMethodsOnChosenArgOptions(AMIArgProcessor.java:162)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1100)
    at org.xmlcml.ami2.plugins.word.WordPluginOption.run(WordPluginOption.java:46)
    at org.xmlcml.ami2.plugins.CommandProcessor.runCommands(CommandProcessor.java:86)
    at org.xmlcml.ami2.plugins.CommandProcessor.processCommands(CommandProcessor.java:60)
    at org.xmlcml.ami2.plugins.CommandProcessor.main(CommandProcessor.java:176)
cmine XXXXX  3597,93s user 13,90s system 403% cpu 14:54,94 total

ami2-regex ERROR org.xmlcml.ami2.plugins.MatcherResult

The regex.xml looks like this (after the example here)

<compoundRegex title="dinosaurfood">
<regex weight="1.0" fields="food pre word post">((.{1,50})([Ff]ood)(.{1,50}))</regex>
<regex weight="1.0" fields="sustentation pre word post">((.{1,50})([Ss]ustentation)(.{1,50}))</regex>
</compoundRegex>

The errors look like this (groupList has length 6, fieldList has length 4):

548519 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (6; [a significant episode in the evolution of terrestrial biotas (125–80 Ma) in which the taxonomic div, ersification of angiosperms and the resulting new food resources spurred co-evolutionary radiations of, ersification of angiosperms and the resulting new , food,  resources spurred co-evolutionary radiations of, insects and some terrestrial vertebrates (e.g., herbivorous dinosaurs; Lloyd et al. 2008). Wilson e]) does not match fieldList (4;[food, pre, word, post])
556129 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (6; [medium-sized and large individuals, indicates important niche partitioning between these carnivorou, s dinosaurs. The top predators at the acme of the food chain were represented by three large theropods,, s dinosaurs. The top predators at the acme of the , food,  chain were represented by three large theropods,, Lourinhanosaurus, Ceratosaurus and Allosaurus, and a very large form, Torvosaurus, functionally and]) does not match fieldList (4;[food, pre, word, post])
572041 [main] ERROR org.xmlcml.ami2.plugins.MatcherResult  - groupList (6; [With a minimum length of 612 mm, the maxilla of Torvosaurus gurneyi pertains to a very large indivi, dual positioned at the apex of the food chain in the Late Jurassic ecosystem of Iberia., dual positioned at the apex of the , food,  chain in the Late Jurassic ecosystem of Iberia., The maxilla occupies 52% ( Allosaurus) to 61% ( Yangchuanosaurus) of the skull length in the larges]) does not match fieldList (4;[food, pre, word, post])
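To see where the numbers might come from, the sketch below counts the capturing groups of the first expression with plain java.util.regex. The pattern itself has four capturing groups, which agrees with the four declared fields. Judging only from the log lines above, the first and last entries of groupList look like extra surrounding context added by the tool, which would account for six entries; that reading is an assumption, not something confirmed by the AMI source. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of AMI.
class GroupCountCheck {

    // Returns group 0 (the whole match) followed by every capturing group,
    // or an empty array if the regex does not match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = "((.{1,50})([Ff]ood)(.{1,50}))";
        // Number of capturing groups in the pattern, excluding group 0.
        System.out.println(Pattern.compile(regex).matcher("").groupCount()); // prints 4
        for (String g : groups(regex, "the apex of the food chain")) {
            System.out.println("[" + g + "]");
        }
    }
}
```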

Dictionary based search fails if not in code dir

tom@pisces newstack % ami-plugin/target/appassembler/bin/ami2-species --project ./zika -i scholarly.html --sp.species --context 35 --sp.type binomial genus 
java.lang.RuntimeException: invoke runExtractSpecies fails
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:873)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:768)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:749)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.runRunMethodsOnChosenArgOptions(AMIArgProcessor.java:176)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:929)
        at org.xmlcml.ami2.plugins.species.SpeciesPlugin.main(SpeciesPlugin.java:36)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:871)
        ... 5 more
Caused by: java.lang.NullPointerException
        at org.xmlcml.ami2.dictionary.DefaultAMIDictionary.contains(DefaultAMIDictionary.java:112)
        at org.xmlcml.ami2.plugins.AMISearcher.markFalsePositives(AMISearcher.java:380)
        at org.xmlcml.ami2.plugins.AMISearcher.search(AMISearcher.java:284)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.searchSectionElements(AMIArgProcessor.java:298)
        at org.xmlcml.ami2.plugins.species.SpeciesArgProcessor.runExtractSpecies(SpeciesArgProcessor.java:57)
        ... 10 more
java.lang.RuntimeException: invoke runExtractSpecies fails
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:873)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:768)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:749)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.runRunMethodsOnChosenArgOptions(AMIArgProcessor.java:176)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:929)
        at org.xmlcml.ami2.plugins.species.SpeciesPlugin.main(SpeciesPlugin.java:36)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:871)
        ... 5 more
Caused by: java.lang.NullPointerException
        at org.xmlcml.ami2.dictionary.DefaultAMIDictionary.contains(DefaultAMIDictionary.java:112)
        at org.xmlcml.ami2.plugins.AMISearcher.markFalsePositives(AMISearcher.java:380)
        at org.xmlcml.ami2.plugins.AMISearcher.search(AMISearcher.java:284)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.searchSectionElements(AMIArgProcessor.java:298)
        at org.xmlcml.ami2.plugins.species.SpeciesArgProcessor.runExtractSpecies(SpeciesArgProcessor.java:57)
        ... 10 more
java.lang.RuntimeException: invoke runExtractSpecies fails
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:873)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:768)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:749)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.runRunMethodsOnChosenArgOptions(AMIArgProcessor.java:176)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:929)
        at org.xmlcml.ami2.plugins.species.SpeciesPlugin.main(SpeciesPlugin.java:36)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:871)
        ... 5 more
Caused by: java.lang.NullPointerException
        at org.xmlcml.ami2.dictionary.DefaultAMIDictionary.contains(DefaultAMIDictionary.java:112)
        at org.xmlcml.ami2.plugins.AMISearcher.markFalsePositives(AMISearcher.java:380)
        at org.xmlcml.ami2.plugins.AMISearcher.search(AMISearcher.java:284)
        at org.xmlcml.ami2.plugins.AMIArgProcessor.searchSectionElements(AMIArgProcessor.java:298)
        at org.xmlcml.ami2.plugins.species.SpeciesArgProcessor.runExtractSpecies(SpeciesArgProcessor.java:57)
        ... 10 more

This doesn't happen when the command is run from inside the ami-plugin directory. I assume there is a hardcoded path to a dictionary somewhere.
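One way a hardcoded relative path could produce this behaviour: in Java, a relative `File` is resolved against the current working directory (`user.dir`), so the same code finds the file from one directory and not from another. The sketch below only illustrates that mechanism. The path `dictionaries/species.xml` and the class name are hypothetical, since the path AMI actually uses is not shown in this issue.

```java
import java.io.File;

// Hypothetical demo class, not part of AMI.
class WorkingDirDemo {

    // Hypothetical relative location; the real path used by AMI is unknown here.
    static final String HYPOTHETICAL_DICTIONARY = "dictionaries/species.xml";

    public static void main(String[] args) {
        File dictionary = new File(HYPOTHETICAL_DICTIONARY);
        // The absolute form depends on the directory the JVM was started in.
        System.out.println("working dir : " + System.getProperty("user.dir"));
        System.out.println("resolved to : " + dictionary.getAbsolutePath());
        // Code that does not check for existence may fail later, for example
        // with a NullPointerException when the dictionary was never loaded.
        System.out.println("exists      : " + dictionary.exists());
    }
}
```

Running this from two different directories prints two different resolved paths, which is the pattern the report describes.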

build problems - hard to repeat

Two failures, in TestIdentifier and GeneArgProcessorTest.

This is tucked away in the actual output of the tests:

Caused by: nu.xom.IllegalCharacterDataException: 0x17 is not allowed in XML content
Stack traces of this type result in the failing tests.

I think it has something to do with the encoding of the resource files, but I'm not certain. It happens on an Ubuntu virtual machine on my laptop, but not on what I thought was an identical VM on my desktop. However, it does happen outside the VM on my Gentoo box.

It also does not seem to happen on the same VM on my laptop when working in a folder that is shared with the Windows host, but there it may be masked by another error (failure to delete test files). That error is caused by Windows trying to index the test files as they are made and keeping them open when the tests try to delete them.
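To locate the offending character in a resource file, a small scan against the XML 1.0 character ranges can help. The sketch below is a standalone check with hypothetical class and method names. It does not reproduce how XOM validates content; it only applies the `Char` production from the XML 1.0 specification, under which control characters such as 0x17 are not allowed.

```java
// Hypothetical helper class, not part of AMI or XOM.
class XmlCharScan {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first character not allowed in XML 1.0, or -1.
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample string with an embedded 0x17 control character.
        String sample = "abc" + (char) 0x17 + "def";
        System.out.println(firstIllegalIndex(sample)); // prints 3
    }
}
```

Reading each resource file with an explicit charset and passing its contents to `firstIllegalIndex` should show whether the byte is present in the file itself or only appears after decoding with a platform-default encoding.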
