baleen's Issues

Elasticsearch and Antarctica

I was recently looking into issue #3 from the Elasticsearch side, along with the related issues elastic/elasticsearch#27832 and elastic/elasticsearch#17407.

elastic/elasticsearch#17407 (comment) also applies to your issue, as does the fix (setting "orientation": "clockwise" in the mapping).

However, it seems that you managed to simplify your Antarctica outline to one that Elasticsearch does accept in #39, even though it still looks to be oriented clockwise. It's possible that the shape that Elasticsearch has indexed is not the shape you asked for, because of the code linked in elastic/elasticsearch#27832 (comment).

I'd recommend fixing the mapping, or reversing the orientation of the Antarctica outline to make it anticlockwise as per the GeoJSON spec, to make sure that Elasticsearch indexes it correctly. If you do so, it looks like you can revert #39 and use the original higher-precision outline (suitably reversed). Meanwhile we're looking at this leniency in more detail in elastic/elasticsearch#27832.
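The orientation check and reversal suggested above can be sketched as follows. This is a minimal illustration, not Baleen or Elasticsearch code: `RingOrientation` and its methods are hypothetical names, the ring is assumed to be a closed list of `[lon, lat]` pairs, and the test is purely planar (it ignores dateline wrapping).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; not part of Baleen or Elasticsearch. */
final class RingOrientation {

    /**
     * Twice the signed planar area of a closed ring of [lon, lat] pairs
     * (shoelace formula). A positive value means the ring is anticlockwise.
     */
    static double signedArea2(List<double[]> ring) {
        double sum = 0.0;
        for (int i = 0; i < ring.size() - 1; i++) {
            double[] a = ring.get(i);
            double[] b = ring.get(i + 1);
            sum += a[0] * b[1] - b[0] * a[1];
        }
        return sum;
    }

    static boolean isAnticlockwise(List<double[]> ring) {
        return signedArea2(ring) > 0;
    }

    /** Returns a copy of the ring, reversed if necessary so that it is anticlockwise. */
    static List<double[]> toAnticlockwise(List<double[]> ring) {
        List<double[]> copy = new ArrayList<>(ring);
        if (!isAnticlockwise(copy)) {
            Collections.reverse(copy);
        }
        return copy;
    }
}
```

Reversing the outline in the data keeps it valid under the default mapping; setting "orientation": "clockwise" in the mapping is the alternative that leaves the data untouched.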

Hello! we found a vulnerable dependency in your project.

Hi! We spotted a vulnerable dependency in your project, which might threaten your software. We also found another project that uses the same vulnerable dependency in a similar way, and it has since upgraded the dependency. We therefore believe that your project is very likely to be affected by this vulnerability too. The details are shown below.

Vulnerability description

  • CVE: CVE-2019-16943
  • Vulnerable dependency: com.fasterxml.jackson.core:jackson-databind:2.9.8
  • Vulnerable function: com.fasterxml.jackson.databind.JavaType:isEnumType()
  • Invocation Path:
uk.gov.dstl.baleen.consumers.LocationElasticsearch:doProcess(org.apache.uima.jcas.JCas)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,java.lang.Class)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Upgrade example

Another project also used the same dependency with a similar invocation path, and they have taken actions to resolve this issue.

com.visionarts.powerjambda.actions.JsonBodyActionRequestReader:readRequest(com.visionarts.powerjambda.AwsProxyRequest)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,java.lang.Class)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Therefore, you might also need to upgrade this dependency. Hope this can help you! 😄

Tests fail when run as a super-user

The following two tests fail when running as a user with super-user privileges (e.g. sudo, root) on Linux:

  • AllAnnotationsJsonConsumerTest.java
  • EntityCountTest.java

The cause of this is that when running as a super user, you have permission to write to read-only files. But the above tests attempt to write to read-only files in order to test the error handling. The tests expect the writes to fail, but as a super-user they don't and the test therefore fails instead.
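One way to make such tests robust is to probe whether read-only permissions are actually enforced for the current user before relying on a write failure. The following is a sketch only; `ReadOnlyProbe` is a hypothetical helper and is not taken from the Baleen test suite.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test helper; not part of Baleen. */
final class ReadOnlyProbe {

    /** Returns true if writing to a file marked read-only fails for the current user. */
    static boolean readOnlyIsEnforced() {
        try {
            Path probe = Files.createTempFile("readonly-probe", ".tmp");
            File file = probe.toFile();
            try {
                if (!file.setReadOnly()) {
                    return false; // could not even mark the file read-only
                }
                try {
                    Files.write(probe, "x".getBytes(StandardCharsets.UTF_8));
                    return false; // write succeeded, e.g. when running as root
                } catch (IOException expected) {
                    return true; // write was rejected, as the tests assume
                }
            } finally {
                // Restore write permission so the probe file can be deleted everywhere.
                file.setWritable(true);
                Files.deleteIfExists(probe);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a JUnit 4 test, calling `Assume.assumeTrue(ReadOnlyProbe.readOnlyIsEnforced())` at the start would mark the test as skipped rather than failed when running as a super-user.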

BaleenCollectionReader.getContentExtractor() results in ClassNotFoundException

Using the following config which I borrowed from the baleen-runner tests:

sample_pipeline.yaml:

collectionreader:
  class: FolderReader
  folders:
    - /tmp/data

annotators:
  - class: regex.Email
  - class: regex.Url

consumers:
  - class: EntityCount

The application generates the following error in the output:

2018-03-19 21:32:32,138 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

Notice that the ClassNotFoundException has the package name repeated: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

I believe the bug is caused by passing a fully qualified classname AND defaultPackage="uk.gov.dstl.baleen.contentextractors" to BuilderUtils.getClassFromString() here:

https://github.com/dstl/baleen/blob/master/baleen-uima/src/main/java/uk/gov/dstl/baleen/uima/BaleenCollectionReader.java#L178

Another possible fix would be to modify BuilderUtils.getClassFromString() and test if the className parameter contains the defaultPackage:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BuilderUtils.java#L64

Lastly, another fix would be to modify BaleenDefaults.DEFAULT_CONTENT_EXTRACTOR so that it does not contain the FQ classname, here:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BaleenDefaults.java#L34
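The second option (checking whether the class name already contains the default package) could look roughly like this. It is a sketch under assumptions: `ClassNameResolver.resolve` is a hypothetical stand-in, and the real signature and behaviour of `BuilderUtils.getClassFromString()` may differ.

```java
/** Hypothetical stand-in for the lookup logic; not the actual BuilderUtils code. */
final class ClassNameResolver {

    /**
     * Resolves a class name that may be either relative to defaultPackage
     * or already fully qualified.
     */
    static Class<?> resolve(String className, String defaultPackage) {
        // Only prepend the default package if it is not already present.
        String candidate = className.startsWith(defaultPackage + ".")
                ? className
                : defaultPackage + "." + className;
        try {
            return Class.forName(candidate);
        } catch (ClassNotFoundException first) {
            try {
                // Fall back to treating the name as fully qualified.
                return Class.forName(className);
            } catch (ClassNotFoundException second) {
                throw new IllegalArgumentException("Couldn't find class " + className, second);
            }
        }
    }
}
```

With this shape, both `StructureContentExtractor` and `uk.gov.dstl.baleen.contentextractors.StructureContentExtractor` would resolve against the same default package without the prefix being doubled.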

Component REST API doesn't work on Java 9

The component REST API - which returns a list of available components such as annotators - doesn't work on Java 9. This appears to be an issue with the Reflections API, which is giving the following warning when running on Java 9:

WARN  org.reflections.Reflections - given scan urls are empty. set urls in the configuration

Running the same JAR on Java 8 doesn't produce this warning. The component API seems to return a null/empty object, which is therefore breaking the Plankton interface as well.

gazetteer.File ignores termSeparator parameter

You can set the termSeparator parameter - and it looks like it gets passed in the config to uk.gov.dstl.baleen.resources.gazetteer.FileGazetteer. However, the init() method ignores it, so you always get the default comma separator.
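For illustration, honouring a configurable separator when splitting a gazetteer line might look like the sketch below. `GazetteerLineParser` is a hypothetical class, not the FileGazetteer implementation; it assumes the separator is a literal string rather than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; not the actual FileGazetteer code. */
final class GazetteerLineParser {

    /** Splits one gazetteer line into trimmed, non-empty terms using a literal separator. */
    static List<String> parseLine(String line, String termSeparator) {
        List<String> terms = new ArrayList<>();
        // Pattern.quote ensures separators such as "|" are not treated as regex syntax.
        for (String term : line.split(Pattern.quote(termSeparator))) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }
}
```

A non-comma separator matters for terms that themselves contain commas, for example "Smith, John|J. Smith" with "|" as the separator.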

Pipeline initialization fails with MongoHistory and language.OpenNLP

Trying to create the following pipeline causes the pipeline initialization to fail on language.OpenNLP.

history:
   class: uk.gov.dstl.baleen.history.mongo.MongoHistory

collectionreader:
  class: FolderReader
  folders:
  - corpus

annotators:
- language.OpenNLP
- class: gazetteer.Mongo
  collection: person_gazetteer
  valueField: name
  type: Person
- class: stats.OpenNLP
  model: en-ner-person.bin
  type: Person

consumers:
- Mongo

HTML5 output chokes on elements with newlines

Sometimes, the HTML5 output will contain a visible HTML string in the form:

…most-of-entity-name" data-referent="" >start of entity
most-of-entity-name …

This is in the HTML source as " data-referent="" >.

From a few simple tests, this appears to happen when the tagged element contains a line-break (and hence the HTML5 output breaks it across paragraphs).

Using part of the NIST IE-ER data set (ieer-short.txt) and running it through a pipeline that uses OpenNLP results in ieer-short.html.txt.

Expected behaviour in this case is that National Convention Assembly is correctly tagged in the output without broken HTML.

Expose component (e.g. annotator) parameters through REST API

It would be useful if component parameters, for example whether an annotator should be case sensitive or not, were exposed through the API. That would allow for the development of a GUI tool for building pipelines, as we could query the REST API to find the available parameters and what they do.

Javadoc doesn't work if there are spaces in the path

The detection of Javadoc, making it available through the Baleen server, appears to fail if there are spaces in the path. Presumably, it's escaping the spaces somewhere and then is unable to find the required JAR file at the escaped path.
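If the cause is indeed an escaped path, the difference can be demonstrated with a short sketch. It assumes the JAR location is obtained as a `file:` URL; `JarLocation` is a hypothetical class and this is not the Baleen server code.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating escaped versus decoded file locations. */
final class JarLocation {

    /** The raw path component of the URL, with spaces still percent-encoded as %20. */
    static String rawPath(String fileUrl) {
        return URI.create(fileUrl).getRawPath();
    }

    /** The decoded file-system path, suitable for opening the file. */
    static Path decodedPath(String fileUrl) {
        return Paths.get(URI.create(fileUrl));
    }
}
```

Using the raw form as a file name would look for a directory literally named with `%20`, which matches the symptom described above.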

JavaDoc giving 403 forbidden

I have compiled successfully but cannot access the Javadoc. I copied the file baleen-javadoc-2.2.0-SNAPSHOT.jar into the same directory, but it is not found and I get a 404. If I change the filename to baleen-2.2.0-SNAPSHOT-javadoc.jar, I can see from the log that it finds the Javadoc file. However, I then get a 403 error when I try to go to it.

This is a bit frustrating, as I need to read the Javadoc to be able to figure out how to use Baleen.

This is on the latest commit c862249

Relevant Log when starting baleen

2016-03-16 22:30:40,269 INFO org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.h.ContextHandler@77be656f{/javadoc,jar:file:/home/stuart/Programming/workrelated/baleen/baleen/target/baleen-2.2.0-SNAPSHOT-javadoc.jar!/,AVAILABLE}

Clean git-clone gives build error

Just did a clone of the repo after installing the latest JDK and Maven on OSX.
After using the command mvn package -Dmaven.test.skip=true it fails with the following message:

Failed to execute goal on project baleen-resources: Could not resolve dependencies for project uk.gov.dstl.baleen:baleen-resources:jar:2.6.1-SNAPSHOT: Could not find artifact uk.gov.dstl.baleen:baleen-uima:jar:tests:2.6.1-SNAPSHOT 

Also without the test option, it won't compile. Hopefully you can tell me what went wrong.

Best regards

Build fails with fresh maven cache

Build fails with the following error:

Failed to execute goal on project baleen-collectionreaders: Could not resolve dependencies for project uk.gov.dstl.baleen:baleen-collectionreaders:jar:2.5.0-SNAPSHOT: Failure to find org.apache.pdfbox:jbig2-imageio:jar:3.0.0-SNAPSHOT in https://repository.apache.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache-snapshots has elapsed or updates are forced

There's a transitive dependency on jbig2-imageio.jar-3.0.0-SNAPSHOT; however, that snapshot version is no longer available in the Apache Maven repository.
Only jbig2-imageio.jar:3.0.1-SNAPSHOT is available. See here:
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/jbig2-imageio/

Steps to reproduce:

  1. Remove the existing jbig2-imageio.jar-3.0.0-SNAPSHOT artifacts from your local .m2/repository
  2. mvn clean package (or whatever goals you typically specify)

It looks like the dependency is dragged in as follows:

- uk.gov.dstl.baleen:baleen-collectionreaders:jar:2.5.0-SNAPSHOT
  - io.committed.krill:krill:jar:1.0.2
    - org.apache.tika:tika-parsers:jar:1.16
      - org.apache.pdfbox:jbig2-imageio:jar:3.0.0-SNAPSHOT

Missing jar file

I'm new to large scale Java projects so this may be a noob question but the first step in your wiki references a jar file that doesn't exist. Does the project need to be built first? Is there any documentation on building the project? I see building the javadoc but I didn't find it very helpful.

Elasticsearch doesn't like Antarctica

If you use a document with 'Antarctica' in it, it causes Elasticsearch to throw an exception:

Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse [entities.geoJson
Caused by: com.spatial4j.core.exception.InvalidShapeException: Self-intersection at or near point (-7.409738314942461, -71.63108011089658, NaN)

FastClasspathScanner is outdated -- consider porting to ClassGraph

Your project, dstl/baleen, depends on the outdated library FastClasspathScanner in the following source files:

FastClasspathScanner has been significantly reworked since the version your code depends upon:

  • a significant number of bugs have been fixed
  • some nontrivial API changes have been made to simplify and unify the API
  • FastClasspathScanner has been renamed to ClassGraph: https://github.com/classgraph/classgraph

ClassGraph is a significantly more robust library than FastClasspathScanner, and is more future-proof. All future development work will be focused on ClassGraph, and FastClasspathScanner will see no future development.

Please consider porting your code over to the new ClassGraph API, particularly if your project is in production or has downstream dependencies:

Feel free to close this bug report if this code is no longer in use. (You were sent this bug report because your project depends upon FastClasspathScanner, and has been starred by 109 users. Apologies if this bug report is not helpful.)

Runtime exceptions

Hello there,
I'm new to Baleen, I read most of the documentation. Baleen is running in the background. But when I run my test application, I get some runtime exceptions. I ran my test application with -verbose on so I could see all the messages. I have copied my code at the bottom.

Error1:
This occurs when I link my test application against the Baleen library only. I get the exception "java.lang.NoClassDefFoundError: org/apache/http/config/Lookup".

Error2:
I then linked the httpCore 4.4.x JAR (which I hope I'm not supposed to do) and ran it again. The above error went away, but I got a new exception: java.lang.NoSuchMethodError: org.apache.http.entity.ContentType.withCharset(Ljava/lang/String;)Lorg/apache/http/entity/ContentType;

I assume that I must not link the Apache libraries myself, since Baleen already references them and the versions I link may conflict with Baleen's own. I'm on Windows 10 x64 and my IDE is NetBeans. I'm using Baleen 2.2.0. Could someone help me figure out what I'm missing here, please?


Following is my test program.
package testbaleen;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.ExternalResourceFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ExternalResourceDescription;
import org.apache.uima.resource.ResourceInitializationException;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import uk.gov.dstl.baleen.consumers.ElasticsearchRest;
import uk.gov.dstl.baleen.resources.SharedElasticsearchRestResource;

/**
 * @author Susantha
 */
public class TestBaleen {

    private static Path tmpDir;
    private static final String ELASTICSEARCH = "elasticsearchRest";
    protected static Client client;
    protected static JCas jCas;
    protected static AnalysisEngine ae;

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

       try {
           tmpDir = Files.createTempDirectory("elasticsearch");
      
           String s = tmpDir.toString();
           
           Settings settings = Settings.builder()
                   .put("path.home", tmpDir.toString())
                   .put("http.port", "19600")		//Don't use the default ports for testing purposes
                   .put("transport.tcp.port", "19300")
                   .build();
           
           Node node = NodeBuilder.nodeBuilder()
                   .settings(settings)
                   .data(true)
                   .local(true)
                   .clusterName("SusanthaSearch")
                   .node();
           
           ExternalResourceDescription erd = ExternalResourceFactory.createExternalResourceDescription(ELASTICSEARCH, SharedElasticsearchRestResource.class, SharedElasticsearchRestResource.PARAM_URL, "http://localhost:19600");
           AnalysisEngineDescription aed = AnalysisEngineFactory.createEngineDescription(ElasticsearchRest.class, ELASTICSEARCH, erd);
           
           try
           {
               System.out.println("Now creating the engine");
               ae = AnalysisEngineFactory.createEngine(aed);
           }catch(ResourceInitializationException ex)
           {
               System.out.println("Caught"+ex.getMessage());
           }catch(Exception e)
           {
               System.out.println("Caught"+e.getMessage());
           }
           client = node.client();
           System.out.println("Done and dusted...");
           
       } catch (IOException ex) {
           Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
           //Logger.getLogger(ContentScrapper.class.getName()).log(Level.SEVERE, null, ex);
       } catch (ResourceInitializationException ex) {
           Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
       }
    }
}

Plankton should have default content extractor selected

The default content extractor (currently StructureContentExtractor) should be selected by default when using Plankton. Also, when generating the YAML, the content extractor should not be explicitly set if it is the default.

JavaDoc doesn't launch on Mac

Hi James,

I reinstalled Baleen 2.1.0 on my MacBook Pro and tried to launch the JavaDoc, but it isn't launching. There's an error, but that's all it is saying. Baleen 2.1.0 .JAR file is in the same directory as the JavaDoc executable and that doesn't launch either: The Terminal reports it was unable to access the .JAR file. What am I doing wrong please?

DM

Baleen Graph doesn't build on Java 9

I did a clean pull of the GitHub repository, but when trying to build the project it failed on the baleen-graph project.

[ERROR] Failures: 
[ERROR]   EntityGraphFileTest.testGraphson:99->assertPathsEqual:64 expected:<...e":1},"value":""}],"[docId":[{"id":{"@type":"g:Int64","@value":3},"value":{"@type":"g:List","@value":["8b408a0c7163fdfff06ced3e80d7d2b3acd9db900905c4783c28295b8c996165"]}}],"isNormalised":[{"id":{"@type":"g:Int64","@value":4},"value":{"@type":"g:List","@value":[false]]}}],"longestValue":...> but was:<...e":1},"value":""}],"[isNormalised":[{"id":{"@type":"g:Int64","@value":3},"value":{"@type":"g:List","@value":[false]}}],"docId":[{"id":{"@type":"g:Int64","@value":4},"value":{"@type":"g:List","@value":["8b408a0c7163fdfff06ced3e80d7d2b3acd9db900905c4783c28295b8c996165"]]}}],"longestValue":...>
[ERROR]   EntityGraphFileTest.testGyro:117
[INFO] 
[ERROR] Tests run: 41, Failures: 2, Errors: 0, Skipped: 0

I'm building it on Ubuntu 16.04 with OpenJDK version 1.8.0_171.

UPDATE: Maven was actually using Java 9 (and not Java 8), and that was the cause of the problem. I've updated the issue title to reflect this.

$ mvn -version
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 9.0.4, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-9-oracle
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-29-generic", arch: "amd64", family: "unix"

[Not an issue - just seeking help] - Baleen Forum?

Hi all,
So sorry for opening this as an issue, but despite endless Googling, I can't find anywhere to communicate with developers or users of Baleen.

Is there a channel or forum anywhere?

I've set up a basic pipeline (using the HTML5 consumer), but despite using the OpenNLP annotator, there are no spans being added to the HTML output.

Just seeking advice from others.

Many thanks

CleanTemporal throws NPE

NullPointerException thrown by CleanTemporal if the value hasn't been set on an entity. NPE thrown on Line 102.

ExpandLocationToDescription is too greedy

The ExpandLocationToDescription annotator seems to consume a lot of text. It can produce annotations which are basically the size of the document (if the location is the last word).

The regex has spaces in it, so I wonder if it's looking for 'of' on its own rather than 'part of'.

Expand REST API to provide annotator input/output information

As of Baleen 2.4, the required inputs and produced outputs of each annotator are declared in order for the pipeline orderers to function. It would be beneficial if this information could be exposed through the REST API.

The gotcha to this is that this information is only accessible once the annotator has been instantiated and configured, so it would either need to be per annotator in an existing pipeline, or allow for annotator configuration to be passed.

Improve cleaning based on semantic type

uk.gov.dstl.baleen.annotators.cleaners.helpers.AbstractNestedEntities will merge based on the first entity found (or least confidence).

Perhaps it should also consider the semantic type: when two entities have the same confidence, the more specific type should win (e.g. given an Entity and a Person, it should pick the Person).
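A rough sketch of such a tie-break is shown below. Everything here is hypothetical: `Entity` and `Person` are stand-in classes rather than the Baleen type system, and the rule that higher confidence wins before specificity is considered is an assumption made for the example.

```java
/** Hypothetical illustration of a specificity tie-break; not Baleen code. */
final class SpecificityTieBreak {

    /** Stand-in types: Person is a more specific kind of Entity. */
    static class Entity {}
    static class Person extends Entity {}

    /**
     * Picks which of two overlapping entities to keep. Confidence decides first
     * (an assumption for this sketch); on a tie, the more specific type wins.
     */
    static Entity pick(Entity first, double firstConfidence, Entity second, double secondConfidence) {
        if (firstConfidence != secondConfidence) {
            return firstConfidence > secondConfidence ? first : second;
        }
        Class<?> a = first.getClass();
        Class<?> b = second.getClass();
        // b is strictly more specific if it is a proper subtype of a.
        if (a != b && a.isAssignableFrom(b)) {
            return second;
        }
        return first;
    }
}
```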

Temporal parsing fails when timezone is not included

Against 2.3 Snapshot Release -2016-11-01

Occurred on a document that contained dates: "20 January 2014" and "20 Jan 2014"

2016-11-23 14:03:20,106 WARN  uk.gov.dstl.baleen.core.pipelines.BaleenPipeline - Pipeline ran with errors
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:893)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:575)
Caused by: java.lang.NullPointerException: null
    at java.util.TimeZone.parseCustomTimeZone(TimeZone.java:783)
    at java.util.TimeZone.getTimeZone(TimeZone.java:562)
    at java.util.TimeZone.getTimeZone(TimeZone.java:516)
    at uk.gov.dstl.baleen.annotators.regex.DateTime.processDayMonthTime(DateTime.java:127)
    at uk.gov.dstl.baleen.annotators.regex.DateTime.doProcess(DateTime.java:46)
    at uk.gov.dstl.baleen.uima.BaleenAnnotator.process(BaleenAnnotator.java:81)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
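The trace shows the NullPointerException arising inside TimeZone.getTimeZone, which is consistent with a null zone ID being passed when the text contains no time zone. A defensive lookup might look like the sketch below; `TimeZones.orDefault` is a hypothetical helper, and the choice of fallback zone is left to the caller.

```java
import java.util.TimeZone;

/** Hypothetical helper; not the actual DateTime annotator code. */
final class TimeZones {

    /** Returns the zone for the given ID, or the fallback when no ID was captured. */
    static TimeZone orDefault(String id, TimeZone fallback) {
        if (id == null || id.trim().isEmpty()) {
            // An optional regex group that did not match yields null.
            return fallback;
        }
        return TimeZone.getTimeZone(id.trim());
    }
}
```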

Issue with top level dependency on Maven Central

Hi,

I'd like to use Baleen on a project. I have added in the top level dependency from Maven Central to my pom file, as follows:

<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen</artifactId>
    <version>2.3.0</version>
</dependency>

but the build is failing. If you use one of the child packages it seems to work. Any ideas what I am doing wrong?

[ERROR] Failed to execute goal on project graph-loader-ejb: Could not resolve dependencies for project graph-loader-ejb:ejb:1.0-SNAPSHOT: Failure to find uk.gov.dstl.baleen:baleen:jar:2.3.0 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

REST API for Pipelines doesn't accept Sample YAML

On a freshly built test implementation, the YAML provided in the sample documentation (included below) fails with a 500 error when submitted via a POST with the two form parameters to http://localhost:6413/api/1/pipelines

mapping values are not allowed here
in 'string', line 1, column 26:
collectionreader: class: FolderReader folders: - ./ ...
^

Sample YAML:

mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
    - C:\baleen\data

annotators:
  - cleaners.AddGenderToPerson
  - cleaners.AddTitleToPerson
  - cleaners.CleanPunctuation
  - cleaners.CleanTemporal
  - cleaners.CollapseLocations
  - cleaners.CorefBrackets
  - cleaners.CorefCapitalisationAndApostrophe
  - cleaners.CurrencyDetection
  - cleaners.EntityInitials
  - cleaners.ExpandLocationToDescription
  - cleaners.MergeAdjacent
  - cleaners.MergeAdjacentQuantities
  - cleaners.MergeNationalityIntoEntity
  - cleaners.NaiveMergeRelations
  - cleaners.NormalizeOSGB
  - cleaners.NormalizeTemporal
  - cleaners.NormalizeWhitespace
  - cleaners.ReferentToEntity
  - cleaners.RelationTypeFilter
  - cleaners.RemoveLowConfidenceEntities
  - cleaners.RemoveNestedEntities
  - cleaners.RemoveNestedLocations
  - cleaners.RemoveOverlappingEntities
  - cleaners.SplitBrackets
  - cleaners.Surname
  - coreference.SieveCoreference
  - gazetteer.Country
  - gazetteer.File
  - class: gazetteer.Mongo
    type: Buzzword
    collection: buzzwords
  - class: gazetteer.Mongo
    type: Location
    collection: location
  - class: gazetteer.Mongo
    type: Organisation
    collection: organisations
  - class: gazetteer.Mongo
    type: Person
    collection: people
  - grammatical.NPAtCoordinate
  - grammatical.NPElement
  - grammatical.NPLocation
  - grammatical.NPOrganisation
  - grammatical.NPTitleEntity
  - grammatical.QuantityNPEntity
  - grammatical.TOLocationEntity
  - language.OpenNLP
  - class: misc.DocumentTypeByLocation
    baseDirectory: C:\baleen\data
  - misc.GenericMilitaryPlatform
  - misc.GenericVehicle
  - misc.GenericWeapon
  - misc.MentionedAgain
  - misc.NationalityToLocation
  - misc.OrganisationPersonRole
  - misc.People
  - misc.Pronouns
  - regex.Area
  - regex.BritishArmyUnits
  - regex.Callsign
  - regex.CasRegistryNumber
  - regex.Date
  - regex.DateTime
  - regex.Distance
  - regex.DocumentNumber
  - regex.Dtg
  - regex.Email
  - regex.FlightNumber
  - regex.Frequency
  - regex.Hms
  - regex.IpV4
  - regex.LatLon
  - regex.Mgrs
  - regex.Money
  - regex.Nationality
  - regex.Osgb
  - regex.Postcode
  - regex.RelativeDate
  - regex.SocialMediaUsername
  - regex.TaskForce
  - regex.Telephone
  - regex.Time
  - regex.TimeQuantity
  - regex.USTelephone
  - regex.UnqualifiedDate
  - regex.Url
  - regex.Volume
  - regex.Weight
  - class: relations.NPVNP
    onlyExisting: true
  - stats.DocumentLanguage
  - class: stats.OpenNLP
    model: models/en-ner-location.bin
    type: Location
  - class: stats.OpenNLP
    model: models/en-ner-organization.bin
    type: Organisation
  - class: stats.OpenNLP
    model: models/en-ner-person.bin
    type: Person

consumers:
  - Mongo
  - Elasticsearch

Two digit years assume 2000-2099

Can we add a parameter to the affected annotators (any that use DateTimeFormatter I believe) to allow the configuration of the pivot point?
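For reference, java.time already supports a configurable base year through DateTimeFormatterBuilder.appendValueReduced, so a parameter could plausibly be wired through to it. The sketch below is illustrative only: `PivotYearFormats` and the dd/MM/yy layout are assumptions, not the format used by the affected annotators.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

/** Hypothetical helper showing a configurable two-digit-year pivot. */
final class PivotYearFormats {

    /**
     * Builds a dd/MM/yy formatter whose two-digit years fall in the range
     * baseYear .. baseYear + 99 (the "yy" pattern letter uses a base of 2000).
     */
    static DateTimeFormatter dayMonthTwoDigitYear(int baseYear) {
        return new DateTimeFormatterBuilder()
                .appendPattern("dd/MM/")
                .appendValueReduced(ChronoField.YEAR, 2, 2, baseYear)
                .toFormatter();
    }
}
```

With a base year of 1950, "01/02/50" parses as 1950 and "01/02/49" as 2049.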

Building to include Javadoc from command-line

Hi,

I tried to build from the command line, but there is no -javadoc.jar file placed in the target directory. Even if I remove the comment from the baleen-javadoc/pom.xml file to ensure that a Javadoc JAR file is created, it doesn't contain all the files that are expected by the Core webapp UI.

Any advice?

Does not build on openjdk 8

Fails to build on openjdk 8. I did not spend time figuring out why but switched to oracle jdk. You should make a note in the ReadMe saying that it must be Oracle and not OpenJDK.

MongoReader collection reader not working

The MongoReader collection reader fails at line 127 due to an old version (2.2) of commons-io being built into the JAR file and not supporting this call to IOUtils.toInputStream.

This behaviour was noted having built Baleen 2.4.1-SNAPSHOT from source in Netbeans 8.2.

Error while installing on Mac

Hi James, Team

I'm running Eclipse and trying to install Baleen, but keep getting build failure errors for the Collection Readers onwards, which means it isn't building half of the tool.

Please can you help?

Cheers,

DM

Results :

Failed tests: testMultipleDirectories(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testSubDirectoriesNonRecursive(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest): expected:</[]var/folders/n2/4x13f...> but was:</[private/]var/folders/n2/4x13f...>
testModifiedFile(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testSubDirectories(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)
testCreateFile(uk.gov.dstl.baleen.collectionreaders.FolderReaderTest)

Tests run: 19, Failures: 5, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Baleen ............................................ SUCCESS [ 2.281 s]
[INFO] Baleen Core ....................................... SUCCESS [ 28.455 s]
[INFO] Baleen UIMA ....................................... SUCCESS [ 10.691 s]
[INFO] Baleen Resources .................................. SUCCESS [ 44.327 s]
[INFO] Baleen Annotators ................................. SUCCESS [01:07 min]
[INFO] Baleen Collection Readers ......................... FAILURE [ 27.084 s]
[INFO] Baleen Consumers .................................. SKIPPED
[INFO] Baleen History .................................... SKIPPED
[INFO] Baleen Runner ..................................... SKIPPED
[INFO] Baleen Javadoc .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:00 min
[INFO] Finished at: 2015-10-19T23:10:03+00:00
[INFO] Final Memory: 21M/208M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.10:test (default-test) on project baleen-collectionreaders: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/User1/Desktop/baleen-master/baleen/baleen-collectionreaders/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :baleen-collectionreaders
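The expected:</[]var/...> but was:</[private/]var/...> message above suggests a symbolic-link mismatch: on macOS, /var is a symlink to /private/var, so the same directory can be reported under two different paths. One possible way for a test to compare such paths is to resolve both sides first, as in this sketch (`RealPaths` is a hypothetical helper, not part of FolderReaderTest).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;

/** Hypothetical test helper; not part of Baleen. */
final class RealPaths {

    /** Returns true if both paths resolve to the same real location, following symlinks. */
    static boolean sameLocation(Path a, Path b) {
        try {
            return a.toRealPath().equals(b.toRealPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```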

Baleen tests fail with OpenJDK 8

The Baleen "requirements" suggest Oracle JDK but also just say that Baleen works with Java 8.

I've tried compiling with OpenJDK, but Maven fails because of a test failure. Baleen doesn't require JavaFX anywhere, but the testComponents() unit test for AbstractComponentApiServletTest uses JavaFX for its example data. OpenJDK doesn't include JavaFX.

I've managed to build Baleen by commenting out the entire contents of that test. It would be helpful if Baleen's tests didn't depend on proprietary classes that aren't in OpenJDK when the core functionality does not require those classes.

Param name typos in ReNoun annotators

Typo on ouput/output

public static final String PARAM_OUPUT_COLLECTION = "ouputCollection";
@ConfigurationParameter(name = PARAM_OUPUT_COLLECTION, defaultValue = "renoun_patterns")

public static final String PARAM_OUPUT_COLLECTION = "ouputCollection";
@ConfigurationParameter(name = PARAM_OUPUT_COLLECTION, defaultValue = "renoun_patterns")
