Comments (7)

JohnDaws commented on August 24, 2024

Thank you for the detailed error report.
I have had a look this morning and cannot replicate your issue.
Please could you let me know how you are passing the pipeline to Baleen? I have tried submitting an identical (other than folder location) pipeline via a baleen config file at startup and also configured by hand using the plankton interface.
Am I right in assuming (given your previous issue) that you are running from a version built from the latest 2.5.0-SNAPSHOT code?
Sorry to not be more helpful...

from baleen.

aalsup commented on August 24, 2024

John, thanks for your quick reply. The error appears very early in the console logs, and the application continues to bootstrap and run with no further errors. Here's my configuration:

$ ls -l /tmp/baleen
-rw-r--r--  1 ahalsup  wheel  212079116 Mar 20 10:34 baleen-2.5.0-SNAPSHOT.jar
drwxr-xr-x  3 ahalsup  wheel         96 Mar 20 10:35 data
-rw-r--r--  1 ahalsup  wheel        138 Mar 20 10:38 runner.yaml
-rw-r--r--  1 ahalsup  wheel        192 Mar 20 10:38 sample_pipeline.yaml

$ ls -l /tmp/baleen/data
-rw-r--r--  1 ahalsup  wheel  95 Mar 20 10:34 data.txt

/tmp/baleen/runner.yaml

pipelines:
  - name: sample
    file: /tmp/baleen/sample_pipeline.yaml

logging:
  loggers:
    - name: console
      minLevel: DEBUG

/tmp/baleen/sample_pipeline.yaml

collectionreader:
  class: FolderReader
  folders:
    - /tmp/baleen/data

annotators:
  - class: regex.Email
  - class: regex.Url

consumers:
  - class: EntityCount
  - class: print.Entities

/tmp/baleen/data/data.txt

This is an example email [email protected] which would correspond with a URL http://example.com

Here's the command to run, and a snippet of the console output:

$ java -jar ./baleen-2.5.0-SNAPSHOT.jar runner.yaml
10:48:10.019 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen starting
10:48:10.021 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen about to run
...
2018-03-20 10:48:10,550 DEBUG uk.gov.dstl.baleen.collectionreaders.FolderReader[sample] - Starting function initialize
2018-03-20 10:48:10,553 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.collectionreaders.FolderReader:initialize'
2018-03-20 10:48:10,555 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at uk.gov.dstl.baleen.core.utils.BuilderUtils.getClassFromString(BuilderUtils.java:67)
	at uk.gov.dstl.baleen.uima.BaleenCollectionReader.getContentExtractor(BaleenCollectionReader.java:178)
	at uk.gov.dstl.baleen.collectionreaders.FolderReader.doInitialize(FolderReader.java:146)
	at uk.gov.dstl.baleen.uima.BaleenCollectionReader.initialize(BaleenCollectionReader.java:64)
...
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.contentextractors.StructureContentExtractor[sample] - Starting function initialize
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.contentextractors.StructureContentExtractor:initialize'
...
2018-03-20 10:48:11,401 INFO  org.eclipse.jetty.server.Server - Started @1548ms
2018-03-20 10:48:11,401 DEBUG org.eclipse.jetty.util.component.AbstractLifeCycle - STARTED @1548ms org.eclipse.jetty.server.Server@7eecb5b8
2018-03-20 10:48:11,401 INFO  uk.gov.dstl.baleen.core.web.BaleenWebApi - Server started
2018-03-20 10:48:11,401 INFO  uk.gov.dstl.baleen.core.manager.BaleenManager - Initialisation complete

After stopping the application with Ctrl+C, I see that the entityCount.tsv file has been created:

$ cat entityCount.tsv
/private/tmp/baleen/data/data.txt	2
/private/tmp/baleen/data/data.txt	2

JohnDaws commented on August 24, 2024

Thank you for the extra information. Now that I have set my logging minLevel to the same as yours, I am seeing the same result.

It does look like Baleen is working, despite the error, as the entityCount.tsv file is the default output for the EntityCount consumer that you are using. The output you have suggests you have run Baleen twice on your single file, so the file with its two entities has been added to the output file twice.

If you set minLevel to INFO in runner.yaml, you will suppress a lot of the output and you should also see the results of the print.Entities consumer that you are running.

2018-03-20 15:20:49,919 INFO  uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
        Value: [email protected]
        Type: uk.gov.dstl.baleen.types.common.CommsIdentifier
        Span: 25 -> 42

2018-03-20 15:20:49,920 INFO  uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
        Value: http://example.com
        Type: uk.gov.dstl.baleen.types.common.Url
        Span: 77 -> 95

This should hopefully demonstrate that Baleen is working despite the errors, so you can continue with your use. I will leave this open, as clearly something is not right nonetheless.

aalsup commented on August 24, 2024

@JohnDaws - you rock! Thanks for looking into this for me.

jamesdbaker commented on August 24, 2024

So I believe this is actually expected behaviour and not a problem (although perhaps a little inefficient). The method used to find the class first checks for the class name in the default package and then, if it can't be found there, checks the name on its own. Doing it in this order prevents someone creating a class with no package with the same name as an existing class and it being picked up by mistake.

In the BaleenDefaults class, the default ContentExtractor is fully specified for clarity. But that means when a component attempts to get the default content extractor, it first checks by prepending the default package name before trying the already-qualified (correct) name. It logs the exception for debugging purposes, but it's only there for information and is nothing to worry about.
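
To make the lookup order concrete, here is a minimal sketch of the idea. It is not Baleen's actual BuilderUtils code: the class name ClassLookupSketch, the resolve method and the attempts list are all hypothetical, and java.util stands in for the default package. It tries the name with the default package prepended first, and only then the name as given, which is why an already-qualified name produces one harmless ClassNotFoundException with a doubled package prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Baleen source.
final class ClassLookupSketch {

    // Names tried, in order, by the most recent call to resolve (kept only for illustration).
    static final List<String> attempts = new ArrayList<>();

    // Try defaultPackage + "." + name first, then fall back to name on its own.
    static Class<?> resolve(String name, String defaultPackage) {
        attempts.clear();

        String prefixed = defaultPackage + "." + name;
        attempts.add(prefixed);
        try {
            return Class.forName(prefixed);
        } catch (ClassNotFoundException e) {
            // Informational only: the fallback below may still succeed.
            System.err.println("Couldn't find class " + name + " in package " + defaultPackage);
        }

        attempts.add(name);
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            // Both lookups failed, so this one is a real error.
            throw new IllegalArgumentException("No class found for " + name, e);
        }
    }

    public static void main(String[] args) {
        // Short name: found on the first attempt, nothing is logged.
        System.out.println(resolve("ArrayList", "java.util").getName());

        // Already-qualified name: the first attempt doubles the package and fails,
        // a message is logged, and the second attempt succeeds.
        System.out.println(resolve("java.util.ArrayList", "java.util").getName());
        System.out.println(attempts);
    }
}
```

With a fully qualified default like the one in BaleenDefaults, every lookup takes the second path, so the logged exception appears once per component that asks for the default.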

So perhaps a little inefficient, but not an issue and can be safely ignored. I'd recommend closing this issue.

aalsup commented on August 24, 2024

👍 Agreed.

JohnDaws commented on August 24, 2024

Thanks, James, for providing a detailed answer. I will close the thread.

Andrewm, for info, my go-to consumer for testing purposes is the Html5 consumer.
This will write each input document to an HTML file and wrap a span around each entity with some metadata. It's a bit easier to keep track of than print.Entities once you have more than a few results.

pipeline

...
consumers:
  - class: EntityCount
  - class: print.Entities
  - class: Html5
    outputFolder: .\html_out
    css: .\email_url.css

Without a css file, the Html output will just look like plain text in a browser, but if you add the css file below to your html output folder then the entities within the text should appear 20% larger and colour coded by type. Obviously you can add the other types and be a bit more adventurous with the styling if you wish.

email_url.css

span.Url {color:red;} 
span.CommsIdentifier {color:blue;}
span.baleen {font-size:120%;}
