Comments (10)

jbaker-dstl commented on August 24, 2024

I think there might be several things going on here.

Firstly, you're right in saying that some of the Apache HTTP Components classes seem to be missing from Baleen 2.2.0 - I'm not sure why that is (it's a sub-dependency of one of Baleen's dependencies rather than something we use directly, so it might be something to do with the version we're using there). However, in the latest snapshots of Baleen 2.3.0 it does seem to be included so perhaps consider using that instead?

However, looking at the test code you've provided, I'm not sure you're using Baleen in the correct way. Instantiating a single annotator on its own almost certainly won't work, as it will be missing a lot of the additional functionality provided by Baleen and required by a Baleen annotator.

What is it you're trying to achieve? If you're trying to run Baleen inside your application, have you looked at this Wiki page: https://github.com/dstl/baleen/wiki/Run-as-a-Standalone-Application

from baleen.

susachintha commented on August 24, 2024

Thanks for the quick reply. I'm not sure where I can get 2.3.0; under Releases, 2.2.0 appears as the latest release. I'll definitely try that version.

As I came across these issues in my application, I copied part of it into a sample application to make things simpler. To give you an overview of what I'm trying to do, I have copied a few more lines of my program into this test application. Basically, I'm reading some external files (for the moment only a CSV file), reading their data (employee data), and storing it in an Elasticsearch cluster through Baleen, so that the data can easily be retrieved later when a search is performed.

The logic is something like this. (This code may not compile, as I have just copied some lines from the main application.)

public static void main(String[] args) {

        try {
            tmpDir = Files.createTempDirectory("elasticsearch");

            String s = tmpDir.toString();
            
            Settings settings = Settings.builder()
                    .put("path.home", tmpDir.toString())
                    .put("http.port", "19600")		//Don't use the default ports for testing purposes
                    .put("transport.tcp.port", "19300")
                    .build();
            
            Node node = NodeBuilder.nodeBuilder()
                    .settings(settings)
                    .data(true)
                    .local(true)
                    .clusterName("SusanthaSearch")
                    .node();
            
            ExternalResourceDescription erd = ExternalResourceFactory.createExternalResourceDescription(ELASTICSEARCH, SharedElasticsearchRestResource.class, SharedElasticsearchRestResource.PARAM_URL, "http://localhost:19600");
            AnalysisEngineDescription aed = AnalysisEngineFactory.createEngineDescription(ElasticsearchRest.class, ELASTICSEARCH, erd);
            
            try
            {
                System.out.println("Now creating the engine");
                ae = AnalysisEngineFactory.createEngine(aed);
            }catch(ResourceInitializationException ex)
            {
                System.out.println("Caught"+ex.getMessage());
            }catch(Exception e)
            {
                System.out.println("Caught"+e.getMessage());
            }
            client = node.client();
            
            String path = "RealFolderPathMustBeGiven"; // folder path where all the files reside
            Files.walk(Paths.get(path)).forEach(filePath -> {
                // Checked exceptions cannot escape the lambda, so handle them here
                try (FileInputStream fis = new FileInputStream(filePath.toString())) {
                    // Open the XLSX workbook and take its first sheet
                    XSSFWorkbook myWorkBook = new XSSFWorkbook(fis);
                    XSSFSheet sheet = myWorkBook.getSheetAt(0);
                    ArrayList<String> columnList = new ArrayList<>();

                    Iterator<Row> rowIterator = sheet.iterator();
                    if (rowIterator.hasNext()) {
                        // First row: store the header values as column names
                        Row header = rowIterator.next();
                        for (int i = 0; i < header.getLastCellNum(); i++) {
                            columnList.add(String.valueOf(header.getCell(i)));
                        }
                    }
                    while (rowIterator.hasNext()) {
                        // Second row onwards: one Employee per row
                        Employee employee = new Employee(jCas);

                        Row row = rowIterator.next();
                        for (int i = 0; i < columnList.size(); i++) {
                            if (row.getCell(i) == null) {
                                continue;
                            }
                            String value = row.getCell(i).toString();
                            if (columnList.get(i).equals("LastName")) {
                                employee.setLastName(value);
                            } else if (columnList.get(i).equals("FirstName")) {
                                employee.setFirstName(value);
                            } else if (columnList.get(i).equals("Gender")) {
                                employee.setGender(value);
                            } else if (columnList.get(i).equals("Birthday")) {
                                employee.setBirthday(value);
                            } else if (columnList.get(i).equals("BirthCountry")) {
                                employee.setBirthCountry(value);
                            }
                        }
                        // Index the employee once per row, after all of its fields are set
                        employee.addToIndexes();
                    }

                    ae.process(jCas);
                } catch (IOException | AnalysisEngineProcessException ex) {
                    Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
                }
            });
            
        } catch (IOException ex) {
        Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
            //Logger.getLogger(ContentScrapper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ResourceInitializationException ex) {
        Logger.getLogger(TestBaleen.class.getName()).log(Level.SEVERE, null, ex);
    }

}


//Then from somewhere else I can perform searches like this
{
SearchHit result = client.search(new SearchRequest()).actionGet().getHits().hits()[0];
List<Map<String, Object>> entities = (List<Map<String, Object>>) result.getSource().get("entities");
//I hope this will give all the employees
//Later I need to perform some other searches. Like, for the given parameter (lastName or Birthday), retrieving the employee.
}


Here the Employee class is similar to the Person class in Baleen, which inherits from Entity. There is also an Employee_Type, similar to Person_Type.
The idea behind using Baleen in our project is to annotate important data so that searching is comprehensive and easy. Our web application needs to perform advanced searches like the ones above, which I haven't written yet. I'm still writing the text processing, reading, and indexing into the ES cluster.

jbaker-dstl commented on August 24, 2024

To get Baleen 2.3.0-SNAPSHOT, you will need to compile it yourself (i.e. clone the repo and run mvn package).

In terms of what you're trying to achieve, it sounds like what you really want to do is develop additional components for Baleen and then use Baleen as normal, with a pipeline that includes your components. Any components that are on the classpath are automatically picked up and available for use. You can also add additional types (Employee in your case) using a similar method (i.e. making them available on the classpath).

Have you read the Development Guides included in Baleen? Based on the code above, you probably want to develop your own ContentExtractor to handle your CSV files (or possibly your own CollectionReader) to read in the data and annotate it as appropriate. Your pipeline would then look something like:

collectionreader:
  class: your.class.on.the.classpath.XlsxCollectionReader
  file: C:\your\file.xlsx

annotators:
# None required here unless you want to do additional extraction, e.g. finding e-mail addresses and phone numbers?

consumers:
- Elasticsearch

Have a look also at the following pages for some examples: https://github.com/dstl/baleen/wiki/Additional-Components

susachintha commented on August 24, 2024

1. Thanks. I'll go through the docs again.

2. Some build errors prevent Baleen from compiling.
I followed this (https://github.com/dstl/baleen/blob/master/BUILD.md), but before step 5 I get some compiler errors. The console output is copied at the bottom.

3. I'm wondering whether you could make Baleen support NetBeans too.

{{
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Baleen
[INFO] Baleen Core
[INFO] Baleen UIMA
[INFO] Baleen Resources
[INFO] Baleen Annotators
[INFO] Baleen Collection Readers
[INFO] Baleen Consumers
[INFO] Baleen Jobs
[INFO] Baleen History
[INFO] Baleen Runner
[INFO] Baleen Javadoc
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Baleen 2.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.4.201502262128:prepare-agent (default-prepare-agent) @ baleen ---
[INFO] argLine set to "-javaagent:C:\Users\Susantha\.m2\repository\org\jacoco\org.jacoco.agent\0.7.4.201502262128\org.jacoco.agent-0.7.4.201502262128-runtime.jar=destfile=D:\baleen-master\baleen-master\baleen\target\jacoco.exec"
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.4.201502262128:report (default-report) @ baleen ---
[INFO] Skipping JaCoCo execution due to missing execution data file:D:\baleen-master\baleen-master\baleen\target\jacoco.exec
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Baleen Core 2.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.4.201502262128:prepare-agent (default-prepare-agent) @ baleen-core ---
[INFO] argLine set to "-javaagent:C:\Users\Susantha\.m2\repository\org\jacoco\org.jacoco.agent\0.7.4.201502262128\org.jacoco.agent-0.7.4.201502262128-runtime.jar=destfile=D:\baleen-master\baleen-master\baleen\baleen-core\target\jacoco.exec"
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ baleen-core ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 66 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ baleen-core ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.5:testResources (default-testResources) @ baleen-core ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 22 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ baleen-core ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ baleen-core ---
[INFO] Surefire report directory: D:\baleen-master\baleen-master\baleen\baleen-core\target\surefire-reports


T E S T S

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- jacoco-maven-plugin:0.7.4.201502262128:report (default-report) @ baleen-core ---
[INFO] Analyzed bundle 'Baleen Core' with 75 classes
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ baleen-core ---
[INFO]
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ baleen-core ---
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Baleen UIMA 2.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.4.201502262128:prepare-agent (default-prepare-agent) @ baleen-uima ---
[INFO] argLine set to "-javaagent:C:\Users\Susantha\.m2\repository\org\jacoco\org.jacoco.agent\0.7.4.201502262128\org.jacoco.agent-0.7.4.201502262128-runtime.jar=destfile=D:\baleen-master\baleen-master\baleen\baleen-uima\target\jacoco.exec"
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ baleen-uima ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 9 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ baleen-uima ---
[INFO] Compiling 86 source files to D:\baleen-master\baleen-master\baleen\baleen-uima\target\classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Entity.java:[19,7] error: Entity is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Event.java:[22,7] error: Event is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\org\apache\uima\jcas\tcas\DocumentAnnotation.java:[286,60] error: incompatible types: String cannot be converted to String[]
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Relation.java:[19,7] error: Relation is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[72,68] error: cannot find symbol
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[73,20] error: cannot find symbol
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[156,31] error: cannot find symbol
[ERROR]\baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[157,40] error: incompatible types: invalid method reference
[INFO] 8 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Baleen ............................................ SUCCESS [4.146s]
[INFO] Baleen Core ....................................... SUCCESS [11.700s]
[INFO] Baleen UIMA ....................................... FAILURE [3.081s]
[INFO] Baleen Resources .................................. SKIPPED
[INFO] Baleen Annotators ................................. SKIPPED
[INFO] Baleen Collection Readers ......................... SKIPPED
[INFO] Baleen Consumers .................................. SKIPPED
[INFO] Baleen Jobs ....................................... SKIPPED
[INFO] Baleen History .................................... SKIPPED
[INFO] Baleen Runner ..................................... SKIPPED
[INFO] Baleen Javadoc .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19.299s
[INFO] Finished at: Wed Jan 04 11:07:26 GMT+05:30 2017
[INFO] Final Memory: 20M/47M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project baleen-uima: Compilation failure: Compilation failure:
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Entity.java:[19,7] error: Entity is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Event.java:[22,7] error: Event is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\org\apache\uima\jcas\tcas\DocumentAnnotation.java:[286,60] error: incompatible types: String cannot be converted to String[]
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\types\semantic\Relation.java:[19,7] error: Relation is not abstract and does not override abstract method getTypeName() in Recordable
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[72,68] error: cannot find symbol
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[73,20] error: cannot find symbol
[ERROR]\baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[156,31] error: cannot find symbol
[ERROR] \baleen-master\baleen-master\baleen\baleen-uima\src\main\java\uk\gov\dstl\baleen\uima\grammar\ParseTree.java:[157,40] error: incompatible types: invalid method reference
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :baleen-uima
}}

jbaker-dstl commented on August 24, 2024

Have you tried building from the command line rather than through Eclipse? The Eclipse and Maven integration can be somewhat buggy, so it's possible that's the issue. The other thing to try might be building it on a path without spaces. I will try doing a clean compile of the code here later, but have built it before without issues.

What JDK are you using to build Baleen?

With regard to NetBeans, the files on GitHub don't include the Eclipse-specific project files, so you should be able to import the Maven projects into NetBeans.

jbaker-dstl commented on August 24, 2024

I've just built the latest code from the command line and it worked fine, suggesting it's something with your setup (or possibly an Eclipse bug).

jamesfry commented on August 24, 2024

In your logs, the compile error is about a method not being overridden, but that method is a default method. Default methods were introduced in Java 8; are you building with JDK 7?
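
To illustrate the point about default methods, here is a minimal sketch. The Recordable and Entity names are simplified stand-ins borrowed from the build log, not Baleen's actual source, and the body of getTypeName() is an assumption made up for this example.

```java
// Hypothetical, simplified stand-ins for the types named in the build log.
interface Recordable {
    // A default method carries its body inside the interface (Java 8 and later).
    default String getTypeName() {
        return getClass().getSimpleName();
    }
}

// On Java 8 or later this class compiles without overriding getTypeName(),
// because the interface already supplies an implementation.
class Entity implements Recordable {
}

public class DefaultMethodDemo {
    public static void main(String[] args) {
        // Prints "Entity"
        System.out.println(new Entity().getTypeName());
    }
}
```

A compiler or source level older than Java 8 does not support default methods, so code written this way needs JDK 8 or later to build.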

susachintha commented on August 24, 2024

Thanks a lot. I was able to compile and build Baleen 2.3.0 without errors after fixing a couple of issues on my side.
I removed the whitespace from my folder path, which solved some of the compiler errors.
I was using the Eclipse Java EE IDE for Web Developers, which for some reason only allows Java versions up to 1.7, so I couldn't move forward with Eclipse. I was probably using the wrong version of Eclipse.

Then I imported the Maven project into NetBeans and, after removing the whitespace from the project path, I was able to build Baleen. The missing JAR errors were caused by my referencing only baleen-core; after importing into NetBeans I realized that the complete set of JAR dependencies is available in the root Baleen project, so referencing the root project solved those errors too.
What I'm going to do next is, rather than trying to load annotators on their own, use the framework and callbacks. I'll be back if I run into any more problems.

One last question: could you please send me the pipeline configuration file for this (https://github.com/jamesdbaker/Baleen-Components)? It would be handy to have a real example to look at.

I was referring to this pipeline (https://github.com/dstl/baleen/wiki/Sample-Pipeline) and am not too sure about some of the information there. For example:
annotators:
  - language.OpenNLP
  - class: misc.DocumentTypeByLocation
    baseDirectory: C:\baleen\data
  - gazetteer.Country
  - class: gazetteer.Mongo
    type: Buzzword
    collection: buzzwords
  - class: gazetteer.Mongo
    type: Location
    collection: location
  - class: gazetteer.Mongo
    type: Organisation
    collection: organisations
  - class: gazetteer.Mongo
    type: Person
    collection: people

What is the convention for these definitions?

  1. "class: gazetteer.Mongo": does it refer to a Baleen Mongo class or a user-defined class? Baleen doesn't have a Mongo class, but it has MongoGazetteer.java.
  2. "misc.DocumentTypeByLocation": is it a user-defined class? What is the baseDirectory there?
  3. What do 'type' and 'collection' mean under class: gazetteer.Mongo?
  4. I read in the documentation that annotators, consumers and collectionReader are the main sections to define in a pipeline. (I can see these are the high-level package names of some Baleen projects.) So, other than Baleen Annotators, Baleen Collection Readers and Baleen Consumers, if I need to use classes from other packages, can I define them by their high-level package name? For example, say I need to use Entity.java, which is in the baleen.types.semantic package. Similar to 'annotators:' above, can I write something like this?

types:
  - semantic.Entity

Thanks in advance for your answers.

jbaker-dstl commented on August 24, 2024

Please read the 'Running Baleen with Additional Annotators' guide included within Baleen and the Javadoc for PipelineCpeBuilder; these will answer some of your questions.

Also, consider using the Plankton tool to play around with building pipelines as this will produce the correct YAML for you. The one from that additional components project would look something like:

annotators:
  - jamesbaker.baleen.annotators.HashTag

gazetteer.Mongo refers to the Mongo class in uk.gov.dstl.baleen.annotators.gazetteer. All built-in annotators are under uk.gov.dstl.baleen.annotators, so only the end of the class name needs to be specified. If the annotator you want to load is not under this package, you need to provide the fully qualified name.
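
As a rough illustration of that naming convention, the sketch below resolves a short name against a default package and falls back to treating it as a fully qualified name. This is not Baleen's actual implementation: the ComponentNameResolver class, the resolve method, and the lookup order are assumptions made for this example, and JDK classes stand in for annotators so that it runs without Baleen on the classpath.

```java
public class ComponentNameResolver {

    /**
     * Hypothetical helper: tries defaultPackage + "." + name first, then
     * falls back to treating name as a fully qualified class name.
     */
    public static Class<?> resolve(String defaultPackage, String name)
            throws ClassNotFoundException {
        try {
            return Class.forName(defaultPackage + "." + name);
        } catch (ClassNotFoundException e) {
            // Not found under the default package, so try the name as given
            return Class.forName(name);
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // A short name relative to the default package (compare "gazetteer.Mongo")
        // Prints "java.util.concurrent.ConcurrentHashMap"
        System.out.println(resolve("java.util", "concurrent.ConcurrentHashMap").getName());

        // A full name outside the default package
        // Prints "java.lang.String"
        System.out.println(resolve("java.util", "java.lang.String").getName());
    }
}
```

With a default package of uk.gov.dstl.baleen.annotators, the same idea would turn gazetteer.Mongo into uk.gov.dstl.baleen.annotators.gazetteer.Mongo, while a name such as jamesbaker.baleen.annotators.HashTag would be used as written.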

type and collection are configuration properties for the Mongo gazetteer annotator. Information about the configuration properties for each annotator (or any component) can be found in the Javadoc.

To include new entity types, I believe they just need to be on the classpath and defined in such a way that UimaFIT will detect them. Please refer to the UimaFIT documentation for how to do this.

susachintha commented on August 24, 2024

Thank you for the references, they are very helpful.
