stepthom / lucene-lda
Using latent Dirichlet allocation (LDA) in Apache Lucene
I ran MALLET on my input data and converted the output into the 4 files required to run LDA.
However, bin/queryWithLDA requires a Lucene index (which can be created with the bin/indexDirectory command) and an LDA index. I substituted the MALLET file for the LDA index, put the 4 files in the query folder, and ran it.
After this I get:
java.lang.ClassCastException: cc.mallet.types.InstanceList cannot be cast to ca.queensu.cs.sail.lucenelda.LDAHelper
at ca.queensu.cs.sail.lucenelda.LDAQueryAllInDirectory.main(LDAQueryAllInDirectory.java:153)
Exception in thread "main" java.lang.NullPointerException
at ca.queensu.cs.sail.lucenelda.LDAQueryAllInDirectory.main(LDAQueryAllInDirectory.java:160)
Any idea on how to proceed with this?
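The trace above suggests that the file passed as the LDA index was a serialized MALLET InstanceList rather than a serialized LDAHelper, so the cast failed and the later code hit a null reference. A minimal sketch of a type-checked load follows, assuming the LDA index is a Java-serialized object. HelperStandIn and SafeHelperLoader are hypothetical names used only for illustration; they are not part of lucene-lda.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical stand-in for the real LDAHelper, so the sketch compiles on its own.
class HelperStandIn implements Serializable {
    private static final long serialVersionUID = 1L;
    final int numTopics;

    HelperStandIn(int numTopics) {
        this.numTopics = numTopics;
    }
}

class SafeHelperLoader {
    // Reads one serialized object and returns it only if it has the expected type.
    // A wrong type or an unreadable stream yields Optional.empty() instead of a
    // ClassCastException followed by a NullPointerException.
    static <T> Optional<T> load(InputStream in, Class<T> expected) {
        try (ObjectInputStream ois = new ObjectInputStream(in)) {
            Object obj = ois.readObject();
            if (expected.isInstance(obj)) {
                return Optional.of(expected.cast(obj));
            }
            System.err.println("Expected " + expected.getName() + " but found "
                    + (obj == null ? "null" : obj.getClass().getName()));
            return Optional.empty();
        } catch (IOException | ClassNotFoundException e) {
            System.err.println("Could not read serialized object: " + e);
            return Optional.empty();
        }
    }

    // Helper for trying the sketch out: serializes any object to a byte array.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With a check like this, the caller can print a message naming the type it actually found and exit, which makes a mismatched input file easier to diagnose.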
I know the whole purpose of lucene-lda is to run Lucene with LDA. However, to make the tool more general and useful, we need to gracefully accept cases when LDA is not desired, and instead only VSM indices need to be built and queried. This basic functionality works now, but we need to make sure we gracefully exit if the LDAHelper is empty.
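One way to picture the graceful fallback is a small guard that picks the query mode before any LDA data is touched. This is only a sketch: QueryModeSketch, Mode, and the use of a plain document-by-topic matrix are assumptions for illustration, not the actual LDAHelper internals.

```java
class QueryModeSketch {
    enum Mode { VSM_ONLY, VSM_AND_LDA }

    // Chooses the query mode from a hypothetical document-by-topic matrix.
    // A null or empty matrix is treated as "no LDA data available".
    static Mode chooseMode(double[][] docTopics) {
        if (docTopics == null || docTopics.length == 0) {
            return Mode.VSM_ONLY;
        }
        return Mode.VSM_AND_LDA;
    }
}
```

The query code could then skip every LDA-specific step when the mode is VSM_ONLY instead of dereferencing an empty helper.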
It throws an exception: "Exception in thread "main" java.lang.NoSuchFieldError: LUCENE_35". I then tried changing the version in the source code to 4.1 (I replaced LUCENE_35 with LUCENE_41), but after that I am unable to build. Can you suggest a proper solution?
If the filecodes option is not set (and hence no filename->integer mapping is provided by the user), we need to create an identity mapping that can be used in the query results. (I.e., instead of outputting (fileCode, relevancyScore) tuples, we should just output (fileName, relevancyScore) tuples.)
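The identity mapping could be sketched as follows. FileCodeMapping and buildLabels are hypothetical names, and falling back to the file name for any individual file that is missing from a user-supplied mapping is an extra assumption, not something the issue asks for.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FileCodeMapping {
    // Returns the label to print for each file in the query results.
    // fileCodes is null when the filecodes option was not set; in that case
    // (or when a file has no code) the file name itself is used as the label.
    static Map<String, String> buildLabels(List<String> fileNames, Map<String, Integer> fileCodes) {
        Map<String, String> labels = new LinkedHashMap<>();
        for (String name : fileNames) {
            Integer code = (fileCodes == null) ? null : fileCodes.get(name);
            labels.put(name, code == null ? name : String.valueOf(code));
        }
        return labels;
    }
}
```

Because the output code always looks labels up in one map, it does not need a separate branch for the case where no filecodes were given.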
The students in my USC CSCI 572 Search Engines class found that the following issue had to be dealt with to get this project to work:
Would you be interested in me pushing this upstream? Also what are the chances that we'll get this integrated without having to run LDA outside of this tool? Thank you!
The goal of this test is to be very simple: 3 documents, a couple of easy queries, and very intuitive LDA topics. That way, it will be easy to "verify" the query results by hand.
The 3 documents are already there; just need to run LDA to generate the LDA output.
Currently, the LDAHelper class (which encapsulates all the LDA functionality) is serialized and written to disk at index time, and then read back again at query time. This is a little clumsy, as it requires the user to specify a filepath for the serialized object at index time, and then regurgitate the same path at query time. It would be easier (and perhaps cleaner) to add all the information in the LDAHelper class to the Lucene index itself. Is this possible? How can we do this?
We were able to get lucene-lda to compile by removing the lucene-3.0 jar from the lib directory (and leaving the 3.5 jar).
However, when we tried to run the indexDirectory command on our documents, we found that, as the readme and the source code indicate, lucene-lda doesn't run MALLET by itself.
So we ran MALLET on the data first and obtained its output. However, lucene-lda doesn't recognize the MALLET output file when we try to run the queryWithLDA command. Does this need to be in some specific data format?
One of the much-needed features in lucene-lda is to compute LDA on the fly, for the cases when LDA has not been precomputed on the corpus.
One easy way to do this is to integrate with MALLET:
MALLET has API calls to run LDA and collect the output. This could all be done in the IndexDirectoryRunLDA.java class.
This may require some changes to the internals of LDAHelper, such as the representation of the matrices (if MALLET returns something different), but it should be worth it in the end.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/document/Fieldable
at ca.queensu.cs.sail.lucenelda.IndexDirectory.main(IndexDirectory.java:159)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.document.Fieldable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
I was trying to get lucene-lda to work with the lucene-core-4.10.5 jar.
It throws compilation errors when built against lucene-core-4.10.5-SNAPSHOT.jar.
lucene-core-3.5.0.jar and lucene-analyzers-3.5.0.jar were replaced with the following jars in build.xml.
jar:
[javac] Compiling 9 source files to /Users/Balaji/Development/LDA/lucene-lda/build/classes
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDASimilarity.java:29: error: cannot find symbol
[javac] import org.apache.lucene.search.DefaultSimilarity;
[javac] ^
[javac] symbol: class DefaultSimilarity
[javac] location: package org.apache.lucene.search
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDASimilarity.java:31: error: cannot find symbol
[javac] public class LDASimilarity extends DefaultSimilarity {
[javac] ^
[javac] symbol: class DefaultSimilarity
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/SimpleIndexer.java:10: warning: [deprecation] Index in Field has been deprecated
[javac] import org.apache.lucene.document.Field.Index;
[javac] ^
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/SimpleIndexer.java:12: error: cannot find symbol
[javac] import org.apache.lucene.document.NumericField;
[javac] ^
[javac] symbol: class NumericField
[javac] location: package org.apache.lucene.document
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:30: error: package org.apache.lucene.queryParser does not exist
[javac] import org.apache.lucene.queryParser.MultiFieldQueryParser;
[javac] ^
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:31: error: package org.apache.lucene.queryParser does not exist
[javac] import org.apache.lucene.queryParser.QueryParser;
[javac] ^
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:56: error: cannot find symbol
[javac] private static QueryParser parser = null;
[javac] ^
[javac] symbol: class QueryParser
[javac] location: class VSMQueryAllInDirectory
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMSimilarity.java:27: error: cannot find symbol
[javac] import org.apache.lucene.search.DefaultSimilarity;
[javac] ^
[javac] symbol: class DefaultSimilarity
[javac] location: package org.apache.lucene.search
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMSimilarity.java:29: error: cannot find symbol
[javac] public class VSMSimilarity extends DefaultSimilarity {
[javac] ^
[javac] symbol: class DefaultSimilarity
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/IndexDirectory.java:97: warning: [rawtypes] found raw type: Iterator
[javac] for (java.util.Iterator errs = config.getErrorMessageIterator(); errs
[javac] ^
[javac] missing type arguments for generic class Iterator<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/IndexDirectoryRunLDA.java:83: warning: [rawtypes] found raw type: Iterator
[javac] for (java.util.Iterator errs = config.getErrorMessageIterator(); errs.hasNext();) {
[javac] ^
[javac] missing type arguments for generic class Iterator<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDAQueryAllInDirectory.java:121: warning: [rawtypes] found raw type: Iterator
[javac] for (java.util.Iterator errs = config.getErrorMessageIterator(); errs
[javac] ^
[javac] missing type arguments for generic class Iterator<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDAQueryAllInDirectory.java:166: error: no suitable method found for open(Directory,boolean)
[javac] reader = IndexReader.open(dir, true);
[javac] ^
[javac] method IndexReader.open(Directory,int) is not applicable
[javac] (argument mismatch; boolean cannot be converted to int)
[javac] method IndexReader.open(IndexWriter,boolean) is not applicable
[javac] (argument mismatch; Directory cannot be converted to IndexWriter)
[javac] method IndexReader.open(IndexCommit,int) is not applicable
[javac] (argument mismatch; Directory cannot be converted to IndexCommit)
BUILD FAILED
/Users/Balaji/Development/LDA/lucene-lda/build.xml:45: Compile failed; see the compiler error output for details.
Total time: 1 second
Is there some other change that needs to be made that I'm missing?