dstl / baleen

Entity Extraction Text Processor

License: Apache License 2.0

Java 96.37% CSS 0.01% HTML 3.04% JavaScript 0.56% Shell 0.04%

baleen entity-extraction java

baleen's People

stuartmarsden joegarbett workware1 n- rooreynolds gitter-badger rokemanorresearch davidsoloman marksp casm-consulting keyz182 sindhuchary jonnyelliot abbywalker dalbrecht te-565 jamesfry skobets commitd neilireson seebeyond paulgallop antjkennedy mattplindsay aguyard shadowridgedev vkajen hgirgas steven-committed advaitha nationalcrimeagency jamesdbaker roppa bytearchive prolincur vasco989k bing-ok joskid uk-gov-mirror javaecosystemstudy

baleen's Issues

Does not build on openjdk 8

Fails to build on openjdk 8. I did not spend time figuring out why but switched to oracle jdk. You should make a note in the ReadMe saying that it must be Oracle and not OpenJDK.

Hello there,
I'm new to Baleen, I read most of the documentation. Baleen is running in the background. But when I run my test application, I get some runtime exceptions. I ran my test application with -verbose on so I could see all the messages. I have copied my code at the bottom.

Error1:
This comes when I link my test application only with Baleen library. I get an exception "java.lang.NoClassDefFoundError: org/apache/http/config/Lookup".

Error2:
Then I linked httpCore4.4.x jar (which I hope I'm not supposed to do), ran, then I don't get above error. But I get another new exception java.lang.NoSuchMethodError: org.apache.http.entity.ContentType.withCharset(Ljava/lang/String;)Lorg/apache/http/entity/ContentType;

I assume that I should not link the Apache libraries myself, since Baleen already references them and the copies I add may conflict with Baleen's. I'm on Windows 10 x64 and my IDE is NetBeans. I'm using Baleen 2.2.0. Could someone help me figure out what I'm missing here, please?

Following is my test program.
package testbaleen;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.ExternalResourceFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ExternalResourceDescription;
import org.apache.uima.resource.ResourceInitializationException;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import uk.gov.dstl.baleen.consumers.ElasticsearchRest;
import uk.gov.dstl.baleen.resources.SharedElasticsearchRestResource;

/**
 * @author Susantha
 */
public class TestBaleen {

    private static Path tmpDir;
    private static final String ELASTICSEARCH = "elasticsearchRest";
    protected static Client client;
    protected static JCas jCas;
    protected static AnalysisEngine ae;

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        try {
            tmpDir = Files.createTempDirectory("elasticsearch");

            String s = tmpDir.toString();

            Settings settings = Settings.builder()
                    .put("path.home", tmpDir.toString())
                    .put("http.port", "19600") // Don't use the default ports for testing purposes
                    .put("transport.tcp.port", "19300")
                    .build();

            Node node = NodeBuilder.nodeBuilder()
                    .settings(settings)
                    .data(true)
                    .local(true)
                    .clusterName("SusanthaSearch")
                    .node();

            ExternalResourceDescription erd = ExternalResourceFactory.createExternalResourceDescription(ELASTICSEARCH, SharedElasticsearchRestResource.class, SharedElasticsearchRestResource.PARAM_URL, "http://localhost:19600");
            AnalysisEngineDescription aed = AnalysisEngineFactory.createEngineDescription(ElasticsearchRest.class, ELASTICSEARCH, erd);

            try {
                System.out.println("Now creating the engine");
                ae = AnalysisEngineFactory.createEngine(aed);
            } catch (ResourceInitializationException ex) {
                System.out.println("Caught" + ex.getMessage());
            } catch (Exception e) {
                System.out.println("Caught" + e.getMessage());
            }
            client = node.client();
            System.out.println("Done and dusted...");

        } catch (IOException ex) {
            Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
            //Logger.getLogger(ContentScrapper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ResourceInitializationException ex) {
            Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
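
A NoClassDefFoundError followed by a NoSuchMethodError, as in the report above, often points to two different versions of the same library on the classpath. One way to investigate is to print where a class is actually loaded from. The sketch below is only a diagnostic aid; the class name `WhichJar` and the helper `locationOf` are hypothetical and not part of Baleen.

```java
import java.security.CodeSource;

public class WhichJar {

    /** Returns the location a class was loaded from, or a marker for JDK bootstrap classes. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return source == null ? "<bootstrap>" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments,
        // e.g. org.apache.http.entity.ContentType
        for (String name : args) {
            System.out.println(name + " -> " + locationOf(Class.forName(name)));
        }
    }
}
```

Running this with the test application's classpath and the class named in the stack trace should show which JAR supplies it, which helps spot an older copy shadowing the one Baleen expects.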

OdinParser error if Baleen JAR path contains space

The OdinParser annotator fails if Baleen is run from a JAR file with a space in its full path.
This has been raised as an issue in the clulab/processors project.

Version should be updated to 2.1.0-SNAPSHOT

The repository is no longer in line with the released 2.0.0 build, so the version numbers should be updated.

Baleen Graph doesn't build on Java 9

I did a clean pull of the GitHub repository, but when trying to build the project it failed on the baleen-graph project.

[ERROR] Failures: 
[ERROR]   EntityGraphFileTest.testGraphson:99->assertPathsEqual:64 expected:<...e":1},"value":""}],"[docId":[{"id":{"@type":"g:Int64","@value":3},"value":{"@type":"g:List","@value":["8b408a0c7163fdfff06ced3e80d7d2b3acd9db900905c4783c28295b8c996165"]}}],"isNormalised":[{"id":{"@type":"g:Int64","@value":4},"value":{"@type":"g:List","@value":[false]]}}],"longestValue":...> but was:<...e":1},"value":""}],"[isNormalised":[{"id":{"@type":"g:Int64","@value":3},"value":{"@type":"g:List","@value":[false]}}],"docId":[{"id":{"@type":"g:Int64","@value":4},"value":{"@type":"g:List","@value":["8b408a0c7163fdfff06ced3e80d7d2b3acd9db900905c4783c28295b8c996165"]]}}],"longestValue":...>
[ERROR]   EntityGraphFileTest.testGyro:117
[INFO] 
[ERROR] Tests run: 41, Failures: 2, Errors: 0, Skipped: 0

I'm building it on Ubuntu 16.04 with OpenJDK version 1.8.0_171.

UPDATE: Maven was actually using Java 9 (and not Java 8), and that was the cause of the problem. I've updated the issue title to reflect this.

$ mvn -version
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 9.0.4, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-9-oracle
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-29-generic", arch: "amd64", family: "unix"

Temporal parsing fails when timezone is not included

Against 2.3 Snapshot Release -2016-11-01

Occurred on a document that contained dates: "20 January 2014" and "20 Jan 2014"

2016-11-23 14:03:20,106 WARN  uk.gov.dstl.baleen.core.pipelines.BaleenPipeline - Pipeline ran with errors
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:893)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:575)
Caused by: java.lang.NullPointerException: null
    at java.util.TimeZone.parseCustomTimeZone(TimeZone.java:783)
    at java.util.TimeZone.getTimeZone(TimeZone.java:562)
    at java.util.TimeZone.getTimeZone(TimeZone.java:516)
    at uk.gov.dstl.baleen.annotators.regex.DateTime.processDayMonthTime(DateTime.java:127)
    at uk.gov.dstl.baleen.annotators.regex.DateTime.doProcess(DateTime.java:46)
    at uk.gov.dstl.baleen.uima.BaleenAnnotator.process(BaleenAnnotator.java:81)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
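
The trace shows the NullPointerException being thrown inside TimeZone.getTimeZone, which is consistent with a null zone ID being passed when the matched date has no timezone part. A minimal sketch of a null-safe lookup follows. The helper name `parseOrUtc` is hypothetical, and falling back to UTC is an assumption, not necessarily what Baleen does or should do.

```java
import java.util.TimeZone;

public class SafeTimeZone {

    /**
     * Returns the time zone for the given ID, or UTC when no ID was captured.
     * Defaulting to UTC is an assumption made for this sketch.
     */
    static TimeZone parseOrUtc(String id) {
        if (id == null || id.trim().isEmpty()) {
            // Avoid calling TimeZone.getTimeZone(null), which throws NullPointerException.
            return TimeZone.getTimeZone("UTC");
        }
        return TimeZone.getTimeZone(id.trim());
    }

    public static void main(String[] args) {
        System.out.println(parseOrUtc(null).getID());
        System.out.println(parseOrUtc("GMT+01:00").getID());
    }
}
```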

JavaDoc doesn't launch on Mac

Hi James,

I reinstalled Baleen 2.1.0 on my MacBook Pro and tried to launch the JavaDoc, but it isn't launching. There's an error, but that's all it is saying. Baleen 2.1.0 .JAR file is in the same directory as the JavaDoc executable and that doesn't launch either: The Terminal reports it was unable to access the .JAR file. What am I doing wrong please?

Expand REST API to provide annotator input/output information

As of Baleen 2.4, the required inputs and produced outputs of each annotator are declared in order for the pipeline orderers to function. It would be beneficial if this information could be exposed through the REST API.

The gotcha to this is that this information is only accessible once the annotator has been instantiated and configured, so it would either need to be per annotator in an existing pipeline, or allow for annotator configuration to be passed.

Error while installing on Mac

Hi James, Team

I'm running Eclipse and trying to install Baleen, but keep getting build failure errors for the Collection Readers onwards, which means it isn't building half of the tool.

Please can you help?

Cheers,

Results :

Failed tests: testMultipleDirectories(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testSubDirectoriesNonRecursive(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest): expected:</[]var/folders/n2/4x13f...> but was:</[private/]var/folders/n2/4x13f...>
testModifiedFile(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testSubDirectories(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testCreateFile(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)

Tests run: 19, Failures: 5, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Baleen ............................................ SUCCESS [ 2.281 s]
[INFO] Baleen Core ....................................... SUCCESS [ 28.455 s]
[INFO] Baleen UIMA ....................................... SUCCESS [ 10.691 s]
[INFO] Baleen Resources .................................. SUCCESS [ 44.327 s]
[INFO] Baleen Annotators ................................. SUCCESS [01:07 min]
[INFO] Baleen Collection Readers ......................... FAILURE [ 27.084 s]
[INFO] Baleen Consumers .................................. SKIPPED
[INFO] Baleen History .................................... SKIPPED
[INFO] Baleen Runner ..................................... SKIPPED
[INFO] Baleen Javadoc .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:00 min
[INFO] Finished at: 2015-10-19T23:10:03+00:00
[INFO] Final Memory: 21M/208M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.10:test (default-test) on project baleen-collectionreaders: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/User1/Desktop/baleen-master/baleen/baleen-collectionreaders/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :baleen-collectionreaders
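
The failing assertion above compares /var/folders/... with /private/var/folders/..., which look like the same directory reached through a symbolic link. One way to make such comparisons robust is to resolve both sides to their real paths first. This is a sketch only; `RealPathCompare` and `samePath` are hypothetical names, not part of the Baleen tests.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RealPathCompare {

    /** Compares two existing paths after resolving symbolic links and "." / ".." segments. */
    static boolean samePath(Path expected, Path actual) {
        try {
            return expected.toRealPath().equals(actual.toRealPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temporary directory, wrapping the checked exception. */
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("folder-reader");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        // The second path has a redundant "." segment but names the same directory.
        System.out.println(samePath(dir, dir.resolve(".")));
    }
}
```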

Building to include Javadoc from command-line

Hi,

I tried to build from the command line, but there is no -javadoc.jar file placed in the target directory. Even if I remove the comment from the baleen-javadoc/pom.xml file to ensure that a javadoc JAR file is created, it doesn't contain all the files that are expected by the Core webapp UI.

Any advice?

MongoReader collection reader not working

The MongoReader collection reader fails at line 127 due to an old version (2.2) of commons-io being built into the JAR file and not supporting this call to IOUtils.toInputStream.

This behaviour was noted having built Baleen 2.4.1-SNAPSHOT from source in Netbeans 8.2.

Tests fail when run as a super-user

The following two tests fail when running as a user with super-user privileges (e.g. sudo, root) on Linux:

AllAnnotationsJsonConsumerTest.java
EntityCountTest.java

The cause of this is that when running as a super user, you have permission to write to read-only files. But the above tests attempt to write to read-only files in order to test the error handling. The tests expect the writes to fail, but as a super-user they don't and the test therefore fails instead.
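
A test that relies on a read-only file rejecting writes could first probe whether the current user is actually bound by the read-only flag, and skip itself if not. The sketch below shows such a probe; `ReadOnlyProbe` and `readOnlyIsEnforced` are hypothetical names, and how the result is used to skip a test is left to the test framework.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadOnlyProbe {

    /**
     * Returns true if marking a file read-only makes it non-writable for this process.
     * For a super-user on Linux this is typically false.
     */
    static boolean readOnlyIsEnforced() {
        try {
            Path probe = Files.createTempFile("readonly-probe", ".tmp");
            try {
                probe.toFile().setReadOnly();
                return !Files.isWritable(probe);
            } finally {
                // Restore write permission so the probe file can be removed everywhere.
                probe.toFile().setWritable(true);
                Files.deleteIfExists(probe);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Read-only enforced: " + readOnlyIsEnforced());
    }
}
```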

Elasticsearch Mapping should specify that entities are nested

Unless the Elasticsearch mapping explicitly defines the entities array as nested, then we lose the ability to search documents for, as an example, Person entities with the value Holmes.

The mapping needs changing in AbstractElasticsearchConsumer to define entities as a nested object, as per https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-mapping.html
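
For illustration, the core of such a mapping can be sketched as a JSON fragment built in Java. The field name `entities` is taken from the issue text; the class and method names are hypothetical, and the exact wrapper around `properties` (for example a document type name) depends on the Elasticsearch version in use, so treat this as an assumption rather than the mapping Baleen ships.

```java
public class NestedEntitiesMapping {

    /** Builds a minimal mapping fragment that declares the given field as a nested type. */
    static String mappingJson(String entitiesField) {
        return "{\"properties\":{\"" + entitiesField + "\":{\"type\":\"nested\"}}}";
    }

    public static void main(String[] args) {
        System.out.println(mappingJson("entities"));
    }
}
```

With a nested mapping, each entity is indexed as its own hidden document, so a nested query can require that the type and the value match within the same entity.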

Elasticsearch and Antarctica

I was recently looking into issue #3 from the Elasticsearch side, along with the related issues elastic/elasticsearch#27832 and elastic/elasticsearch#17407.

elastic/elasticsearch#17407 (comment) also applies to your issue, as does the fix (setting "orientation": "clockwise" in the mapping).

However, it seems that you managed to simplify your Antarctica outline to one that Elasticsearch does accept in #39, even though it still looks to be oriented clockwise. It's possible that the shape that Elasticsearch has indexed is not the shape you asked for, because of the code linked in elastic/elasticsearch#27832 (comment).

I'd recommend fixing the mapping, or reversing the orientation of the Antarctica outline to make it anticlockwise as per the GeoJSON spec, to make sure that Elasticsearch indexes it correctly. If you do so, it looks like you can revert #39 and use the original higher-precision outline (suitably reversed). Meanwhile we're looking at this leniency in more detail in elastic/elasticsearch#27832.

REST API for Pipelines doesn't accept Sample YAML

On a freshly built test implementation, the YAML provided in the sample documentation (included below) fails with a 500 error when submitted via a POST with the two form parameters to http://localhost:6413/api/1/pipelines

mapping values are not allowed here
in 'string', line 1, column 26:
collectionreader: class: FolderReader folders: - ./ ...
^

Sample YAML:

mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
    - C:\baleen\data

annotators:
  - cleaners.AddGenderToPerson
  - cleaners.AddTitleToPerson
  - cleaners.CleanPunctuation
  - cleaners.CleanTemporal
  - cleaners.CollapseLocations
  - cleaners.CorefBrackets
  - cleaners.CorefCapitalisationAndApostrophe
  - cleaners.CurrencyDetection
  - cleaners.EntityInitials
  - cleaners.ExpandLocationToDescription
  - cleaners.MergeAdjacent
  - cleaners.MergeAdjacentQuantities
  - cleaners.MergeNationalityIntoEntity
  - cleaners.NaiveMergeRelations
  - cleaners.NormalizeOSGB
  - cleaners.NormalizeTemporal
  - cleaners.NormalizeWhitespace
  - cleaners.ReferentToEntity
  - cleaners.RelationTypeFilter
  - cleaners.RemoveLowConfidenceEntities
  - cleaners.RemoveNestedEntities
  - cleaners.RemoveNestedLocations
  - cleaners.RemoveOverlappingEntities
  - cleaners.SplitBrackets
  - cleaners.Surname
  - coreference.SieveCoreference
  - gazetteer.Country
  - gazetteer.File
  - class: gazetteer.Mongo
    type: Buzzword
    collection: buzzwords
  - class: gazetteer.Mongo
    type: Location
    collection: location
  - class: gazetteer.Mongo
    type: Organisation
    collection: organisations
  - class: gazetteer.Mongo
    type: Person
    collection: people
  - grammatical.NPAtCoordinate
  - grammatical.NPElement
  - grammatical.NPLocation
  - grammatical.NPOrganisation
  - grammatical.NPTitleEntity
  - grammatical.QuantityNPEntity
  - grammatical.TOLocationEntity
  - language.OpenNLP
  - class: misc.DocumentTypeByLocation
    baseDirectory: C:\baleen\data
  - misc.GenericMilitaryPlatform
  - misc.GenericVehicle
  - misc.GenericWeapon
  - misc.MentionedAgain
  - misc.NationalityToLocation
  - misc.OrganisationPersonRole
  - misc.People
  - misc.Pronouns
  - regex.Area
  - regex.BritishArmyUnits
  - regex.Callsign
  - regex.CasRegistryNumber
  - regex.Date
  - regex.DateTime
  - regex.Distance
  - regex.DocumentNumber
  - regex.Dtg
  - regex.Email
  - regex.FlightNumber
  - regex.Frequency
  - regex.Hms
  - regex.IpV4
  - regex.LatLon
  - regex.Mgrs
  - regex.Money
  - regex.Nationality
  - regex.Osgb
  - regex.Postcode
  - regex.RelativeDate
  - regex.SocialMediaUsername
  - regex.TaskForce
  - regex.Telephone
  - regex.Time
  - regex.TimeQuantity
  - regex.USTelephone
  - regex.UnqualifiedDate
  - regex.Url
  - regex.Volume
  - regex.Weight
  - class: relations.NPVNP
    onlyExisting: true
  - stats.DocumentLanguage
  - class: stats.OpenNLP
    model: models/en-ner-location.bin
    type: Location
  - class: stats.OpenNLP
    model: models/en-ner-organization.bin
    type: Organisation
  - class: stats.OpenNLP
    model: models/en-ner-person.bin
    type: Person

consumers:
  - Mongo
  - Elasticsearch

Incorrect normalisation factor for square inches

baleen/baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/regex/Area.java

Line 25 in 67976e1

public static final double IN2_TO_M2 = 0.000064516;

The above line is out by a factor of 10, and should be 0.00064516
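
The corrected value can be checked from the definition of the inch (exactly 0.0254 m), since one square inch is 0.0254 × 0.0254 m². The sketch below derives the factor rather than hard-coding it; the class name is hypothetical and this is not the code in Area.java.

```java
public class SquareInchFactor {

    /** One inch in metres; exact by international definition. */
    static final double INCH_TO_M = 0.0254;

    /** One square inch in square metres, derived from the linear factor. */
    static final double IN2_TO_M2 = INCH_TO_M * INCH_TO_M;

    public static void main(String[] args) {
        System.out.println("IN2_TO_M2 = " + IN2_TO_M2);
        // Ratio between the derived factor and the value quoted in the issue.
        System.out.println("ratio to 0.000064516 = " + (IN2_TO_M2 / 0.000064516));
    }
}
```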

BaleenCollectionReader.getContentExtractor() results in ClassNotFoundException

Using the following config which I borrowed from the baleen-runner tests:

sample_pipeline.yaml:

collectionreader:
  class: FolderReader
  folders:
    - /tmp/data

annotators:
  - class: regex.Email
  - class: regex.Url

consumers:
  - class: EntityCount

The application generates the following error in the output:

2018-03-19 21:32:32,138 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

Notice, the CNFE has the package spec repeated twice: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

I believe the bug is caused by passing a fully qualified classname AND defaultPackage="uk.gov.dstl.baleen.contentextractors" to BuilderUtils.getClassFromString() here:

https://github.com/dstl/baleen/blob/master/baleen-uima/src/main/java/uk/gov/dstl/baleen/uima/BaleenCollectionReader.java#L178

Another possible fix would be to modify BuilderUtils.getClassFromString() and test if the className parameter contains the defaultPackage:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BuilderUtils.java#L64

Lastly, another fix would be to modify BaleenDefaults.DEFAULT_CONTENT_EXTRACTOR so that it does not contain the FQ classname, here:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BaleenDefaults.java#L34
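
A further variation on the second option is to try the name as given first and only fall back to prefixing the default package. The sketch below illustrates that lookup order in isolation. `ClassResolver` and `resolve` are hypothetical names, and this is not the actual implementation of BuilderUtils.getClassFromString().

```java
import java.util.Optional;

public class ClassResolver {

    /**
     * Tries the class name as given (it may already be fully qualified),
     * then falls back to prefixing the default package.
     */
    static Optional<Class<?>> resolve(String className, String defaultPackage) {
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException notFullyQualified) {
            try {
                return Optional.of(Class.forName(defaultPackage + "." + className));
            } catch (ClassNotFoundException notFound) {
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Both forms resolve to the same class; the package is never applied twice.
        System.out.println(resolve("java.util.ArrayList", "java.util"));
        System.out.println(resolve("ArrayList", "java.util"));
    }
}
```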

Component REST API doesn't work on Java 9

The component REST API - which returns a list of available components such as annotators - doesn't work on Java 9. This appears to be an issue with the Reflections API, which is giving the following warning when running on Java 9:

WARN  org.reflections.Reflections - given scan urls are empty. set urls in the configuration

Running the same JAR on Java 8 doesn't produce this warning. The component API seems to return a null/empty object, which is therefore breaking the Plankton interface as well.

Missing jar file

I'm new to large scale Java projects so this may be a noob question but the first step in your wiki references a jar file that doesn't exist. Does the project need to be built first? Is there any documentation on building the project? I see building the javadoc but I didn't find it very helpful.

Elasticsearch doesn't like Antarctica

Processing a document with 'Antarctica' in it causes Elasticsearch to throw an exception:

Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse [entities.geoJson
Caused by: com.spatial4j.core.exception.InvalidShapeException: Self-intersection at or near point (-7.409738314942461, -71.63108011089658, NaN)

Gazetteers don't support subType

Allow gazetteers (e.g. File Gazetteer) to support subType on entities

Improve cleaning based on semantic type

uk.gov.dstl.baleen.annotators.cleaners.helpers.AbstractNestedEntities will merge based on the first entity found (or least confidence).

Perhaps it should also consider the semantic type: for the same confidence, the more specific type should win (e.g. given Entity vs Person, it should pick Person).

Money Regex doesn't pull out full entity

The MoneyRegex annotator doesn't seem to extract the full entity, and appears to be limited at 3 figures (e.g. will pull $300 instead of $3000)
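
To illustrate the expected behaviour, here is a sketch of a pattern that accepts either comma-grouped digits or an unbroken run of digits after a currency symbol. The pattern, the supported symbols and the names `MoneyMatch` and `firstAmount` are assumptions made for this example; this is not the regex used by the Baleen annotator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MoneyMatch {

    // Currency symbol ($, pound or euro), optional space, then either
    // comma-grouped thousands or a plain digit run, then optional decimals.
    private static final Pattern MONEY = Pattern.compile(
            "[$\u00a3\u20ac]\\s?(?:\\d{1,3}(?:,\\d{3})+|\\d+)(?:\\.\\d{1,2})?");

    /** Returns the first monetary amount in the text, or null if there is none. */
    static String firstAmount(String text) {
        Matcher matcher = MONEY.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount("It cost $3000 in total"));
        System.out.println(firstAmount("They paid $1,250,000.50 today"));
    }
}
```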

Landing page for Baleen server references 2.5.0-SNAPSHOT

The current landing page for the Baleen server (the page that appears when you go to http://localhost:6413) references baleen-2.5.0-SNAPSHOT.jar in the commands, but should now reference baleen-2.6.0-SNAPSHOT.jar (and just baleen-2.6.0.jar once released).

ExpandLocationToDescription is too greedy

The ExpandLocationToDescription annotator seems to eat up a lot of text. It can produce annotations which are basically the size of the document (if the location is the last word).

The regex has spaces in it; I wonder if it's looking for 'of' on its own rather than 'part of'.

HTML5 output chokes on elements with newlines

Sometimes, the HTML5 output will contain a visible HTML string in the form:

…most-of-entity-name" data-referent="" >start of entity
most-of-entity-name …

This is in the HTML source as " data-referent="" >.

From a few simple tests, this appears to happen when the tagged element contains a line-break (and hence the HTML5 output breaks it across paragraphs).

Using part of the NIST IE-ER data set (ieer-short.txt) and running it through a pipeline that uses OpenNLP results in ieer-short.html.txt.

Expected behaviour in this case is that National Convention Assembly is correctly tagged in the output without broken HTML.

[Not an issue - just seeking help] - Baleen Forum?

Hi all,
So sorry for opening this as an issue, but despite endless Googling, I can't find anywhere to communicate with developers or users of Baleen.

Is there a channel or forum anywhere?

I've set up a basic pipeline (using the html5 consumer) but despite using the OpenNLP Annotator, there are no spans being added to the HTML output.

Just seeking advice from others.

Many thanks

FastClasspathScanner is outdated -- consider porting to ClassGraph

Your project, dstl/baleen, depends on the outdated library FastClasspathScanner in the following source files:

baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/ReflectionUtils.java

FastClasspathScanner has been significantly reworked since the version your code depends upon:

a significant number of bugs have been fixed
some nontrivial API changes have been made to simplify and unify the API
FastClasspathScanner has been renamed to ClassGraph: https://github.com/classgraph/classgraph

ClassGraph is a significantly more robust library than FastClasspathScanner, and is more future-proof. All future development work will be focused on ClassGraph, and FastClasspathScanner will see no future development.

Please consider porting your code over to the new ClassGraph API, particularly if your project is in production or has downstream dependencies:

Feel free to close this bug report if this code is no longer in use. (You were sent this bug report because your project depends upon FastClasspathScanner, and has been starred by 109 users. Apologies if this bug report is not helpful.)

gazetteer.File ignores termSeparator parameter

You can set the termSeparator parameter - and it looks like it gets passed in the config to uk.gov.dstl.baleen.resources.gazetteer.FileGazetteer. However, the init() method ignores it, so you always get the default comma separator.

Expose component (e.g. annotator) parameters through REST API

It would be useful if component parameters, for example whether an annotator should be case sensitive or not, were exposed through the API. That would allow for the development of a GUI tool for building pipelines, as we could query the REST API to find the available parameters and what they do.

Error in baleen-runner.xml

Error introduced when merging PRs 69-74, so current commit 4a2f4a7 and previous commits: 49cd328 f87f295 666c139 8781ec9 3a4d380 f3a1f74 will build, but not run.

Issue with top level dependency on Maven Central

Hi,

I'd like to use Baleen on a project. I have added in the top level dependency from Maven Central to my pom file, as follows:

<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen</artifactId>
    <version>2.3.0</version>
</dependency>

but the build is failing. If you use one of the child packages it seems to work. Any ideas what I am doing wrong?

[ERROR] Failed to execute goal on project graph-loader-ejb: Could not resolve dependencies for project graph-loader-ejb:ejb:1.0-SNAPSHOT: Failure to find uk.gov.dstl.baleen:baleen:jar:2.3.0 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

Javadoc doesn't work if there are spaces in the path

The detection of Javadoc, making it available through the Baleen server, appears to fail if there are spaces in the path. Presumably, it's escaping the spaces somewhere and then is unable to find the required JAR file at the escaped path.
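
One common way this kind of bug arises in Java is taking the path of a file URL directly, which leaves a space encoded as %20. Whether this is the actual cause in Baleen is not confirmed by the report. The sketch below contrasts the two conversions; `JarPathWithSpaces` and its methods are hypothetical names.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

public class JarPathWithSpaces {

    /** Naive conversion: URL.getPath() keeps percent-escapes such as %20. */
    static String naivePath(URL url) {
        return url.getPath();
    }

    /** Safer conversion: going through a URI decodes the escapes. */
    static String decodedPath(URL url) {
        try {
            return Paths.get(url.toURI()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Builds a file URL for the given file, wrapping the checked exception. */
    static URL urlOf(File file) {
        try {
            return file.toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = urlOf(new File("my dir", "baleen.jar"));
        System.out.println("naive:   " + naivePath(url));
        System.out.println("decoded: " + decodedPath(url));
    }
}
```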

Hello! we found a vulnerable dependency in your project.

Hi! We spotted a vulnerable dependency in your project, which might threaten your software. We also found another project that uses the same vulnerable dependency in a similar way, and they have upgraded the dependency. We therefore believe that your project is highly likely to be affected by this vulnerability in a similar way. The following shows the detailed information.

Vulnerability description

CVE: CVE-2019-16943
Vulnerable dependency: com.fasterxml.jackson.core:jackson-databind:2.9.8
Vulnerable function: com.fasterxml.jackson.databind.JavaType:isEnumType()
Invocation Path:

uk.gov.dstl.baleen.consumers.LocationElasticsearch:doProcess(org.apache.uima.jcas.JCas)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,java.lang.Class)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Upgrade example

Another project also used the same dependency with a similar invocation path, and they have taken actions to resolve this issue.

Project: https://github.com/visionarts/power-jambda
Action commit: visionarts/power-jambda@c8ed091
Invocation Path:

com.visionarts.powerjambda.actions.JsonBodyActionRequestReader:readRequest(com.visionarts.powerjambda.AwsProxyRequest)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,java.lang.Class)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Therefore, you might also need to upgrade this dependency. Hope this can help you! 😄

CleanTemporal throws NPE

NullPointerException thrown by CleanTemporal if the value hasn't been set on an entity. NPE thrown on Line 102.

Baleen tests fail with OpenJDK 8

The Baleen "requirements" suggest Oracle JDK but also just say that Baleen works with Java 8.

I've tried compiling with OpenJDK, but Maven fails because of a test failure. Baleen doesn't require JavaFX anywhere, but the testComponents() unit test for AbstractComponentApiServletTest uses JavaFX for its example data. OpenJDK doesn't include JavaFX.

I've managed to build Baleen by commenting out the entire contents of that test. It would be helpful if Baleen's tests didn't depend on proprietary classes that aren't in OpenJDK when the core functionality does not require those classes.

Stream closed exception when using OdinParser

This is caused by two versions of the liblinear library in the dependency hierarchy, and the 1.8 version used by the maltparser wins.

Link to CpeManager broken on Baleen Landing page

In the quick start information on the Baleen landing page, the link to CpeManager is broken as it's been renamed in 2.2.0 to PipelineCpeBuilder.

Two digit years assume 2000-2099

Can we add a parameter to the affected annotators (any that use DateTimeFormatter I believe) to allow the configuration of the pivot point?
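
In java.time, a configurable pivot can be expressed with DateTimeFormatterBuilder.appendValueReduced, which maps a two-digit value into the 100-year window starting at a chosen base year. The sketch below shows the idea; the date pattern and the names `PivotYearParser`, `formatterWithBaseYear` and `parse` are illustrative assumptions, not Baleen's annotator code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

public class PivotYearParser {

    /** Builds a dd/MM/yy formatter whose two-digit years fall in [baseYear, baseYear + 99]. */
    static DateTimeFormatter formatterWithBaseYear(int baseYear) {
        return new DateTimeFormatterBuilder()
                .appendPattern("dd/MM/")
                .appendValueReduced(ChronoField.YEAR, 2, 2, baseYear)
                .toFormatter();
    }

    /** Parses a dd/MM/yy date using the given base year as the pivot. */
    static LocalDate parse(String text, int baseYear) {
        return LocalDate.parse(text, formatterWithBaseYear(baseYear));
    }

    public static void main(String[] args) {
        // With a base year of 1950, two-digit years map to 1950-2049.
        System.out.println(parse("20/01/75", 1950));
        System.out.println(parse("20/01/14", 1950));
    }
}
```

A base year of 2000 reproduces the 2000-2099 behaviour described in the title, so the parameter could default to that value.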

Plankton should have default content extractor selected

The default content extractor (currently StructureContentExtractor) should be selected by default when using Plankton. Also, when generating the YAML, the content extractor should not be explicitly set if it is the default.

Build fails with fresh maven cache

Build fails with the following error:

Failed to execute goal on project baleen-collectionreaders: Could not resolve dependencies for project uk.gov.dstl.baleen:baleen-collectionreaders:jar:2.5.0-SNAPSHOT: Failure to find org.apache.pdfbox:jbig2-imageio:jar:3.0.0-SNAPSHOT in https://repository.apache.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache-snapshots has elapsed or updates are forced

There's a transitive dependency on jbig2-imageio.jar-3.0.0-SNAPSHOT; however, that snapshot version is no longer available in the Apache Maven Repo.
Only jbig2-imageio.jar:3.0.1-SNAPSHOT is available. See here:
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/jbig2-imageio/

Steps to reproduce:

Remove the existing jbig2-imageio.jar-3.0.0-SNAPSHOT artifacts from your local .m2/repository
mvn clean package (or whatever goals you typically specify)

It looks like the dependency is dragged in as follows:

- uk.gov.dstl.baleen:baleen-collectionreaders:jar:2.5.0-SNAPSHOT
  - io.committed.krill:krill:jar:1.0.2
    - org.apache.tika:tika-parsers:jar:1.16
      - org.apache.pdfbox:jbig2-imageio:jar:3.0.0-SNAPSHOT

JavaDoc giving 403 forbidden

I have compiled successfully but cannot access the javadoc. I copied the file baleen-javadoc-2.2.0-SNAPSHOT.jar in to the same directory but it is not found and I get a 404. If I change the filename to baleen-2.2.0-SNAPSHOT-javadoc.jar then I can see from the log it finds the javadoc file. I then however get a 403 error when I try to go to it.

This is a bit frustrating as I need to read the javadoc to be able to figure out how to use Baleen.

This is on the latest commit c862249

Relevant Log when starting baleen

2016-03-16 22:30:40,269 INFO org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.h.ContextHandler@77be656f{/javadoc,jar:file:/home/stuart/Programming/workrelated/baleen/baleen/target/baleen-2.2.0-SNAPSHOT-javadoc.jar!/,AVAILABLE}

Pipeline initialization fails with MongoHistory and language.OpenNLP

Trying to create the following pipeline causes the pipeline initialization to fail on language.OpenNLP.

history:
   class: uk.gov.dstl.baleen.history.mongo.MongoHistory

collectionreader:
  class: FolderReader
  folders:
  - corpus

annotators:
- language.OpenNLP
- class: gazetteer.Mongo
  collection: person_gazetteer
  valueField: name
  type: Person
- class: stats.OpenNLP
  model: en-ner-person.bin
  type: Person

consumers:
- Mongo

Lines 50 to 52 in 17a18f7

    
           public static final String PARAM_OUPUT_COLLECTION = "ouputCollection"; 
        
           @ConfigurationParameter(name = PARAM_OUPUT_COLLECTION, defaultValue = "renoun_patterns")

baleen/baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/renoun/AbstractPatternDataGenerator.java

Lines 65 to 67 in bb88621

    
           public static final String PARAM_OUPUT_COLLECTION = "ouputCollection"; 
        
           @ConfigurationParameter(name = PARAM_OUPUT_COLLECTION, defaultValue = "renoun_patterns")

Clean git-clone gives build error

Just did a clone of the repo after installing the latest JDK and Maven on OSX.
After using the command mvn package -Dmaven.test.skip=true it fails with the following message:

Failed to execute goal on project baleen-resources: Could not resolve dependencies for project uk.gov.dstl.baleen:baleen-resources:jar:2.6.1-SNAPSHOT: Could not find artifact uk.gov.dstl.baleen:baleen-uima:jar:tests:2.6.1-SNAPSHOT

Also without the test option, it won't compile. Hopefully you can tell me what went wrong.

Best regards

	public static final String PARAM_OUPUT_COLLECTION = "ouputCollection";

	@ConfigurationParameter(name = PARAM_OUPUT_COLLECTION, defaultValue = "renoun_patterns")