datumbox-framework's Introduction

Datumbox Machine Learning Framework

The Datumbox Machine Learning Framework is an open-source framework written in Java that allows the rapid development of Machine Learning and Statistical applications. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to be able to handle large datasets.

Copyright & License

Copyright (C) 2013-2020 Vasilis Vryniotis.

The code is licensed under the Apache License, Version 2.0.

Installation & Versioning

Datumbox Framework is available on Maven Central Repository.

The latest stable version of the framework is 0.8.2 (Build 20200805). To use it, add the following snippet in your pom.xml:

    <dependency>
        <groupId>com.datumbox</groupId>
        <artifactId>datumbox-framework-lib</artifactId>
        <version>0.8.2</version>
    </dependency>

The latest snapshot version of the framework is 0.8.3-SNAPSHOT (Build 20201014). To test it, update your pom.xml as follows:

    <repository>
       <id>sonatype-snapshots</id>
       <name>sonatype snapshots repo</name>
       <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>

    <dependency>
        <groupId>com.datumbox</groupId>
        <artifactId>datumbox-framework-lib</artifactId>
        <version>0.8.3-SNAPSHOT</version>
    </dependency>

The develop branch is the development branch (the default GitHub branch), while the master branch contains the latest stable version of the framework. All stable releases are marked with tags.

The releases of the framework follow the Semantic Versioning approach. For detailed information about the various releases check out the Changelog.

Documentation and Code Examples

All the public methods and classes of the framework are documented with Javadoc comments. Moreover, for every model there is a JUnit test that clearly shows how to train and use it. Finally, for more examples on how to use the framework, check out the Code Examples or the official Blog.

Pre-trained Models

Datumbox comes with a large number of pre-trained models which allow you to perform Sentiment Analysis (Document & Twitter), Subjectivity Analysis, Topic Classification, Spam Detection, Adult Content Detection, Language Detection, Commercial Detection, Educational Detection and Gender Detection. To get the binary models check out the Datumbox Zoo.

Which methods/algorithms are supported?

The framework currently supports performing multiple parametric and non-parametric statistical tests, calculating descriptive statistics on censored and uncensored data, and performing ANOVA, Cluster Analysis, Dimension Reduction, Regression Analysis, Time Series Analysis, Sampling and calculation of probabilities from the most common discrete and continuous distributions. In addition, it provides several implemented algorithms including Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, AdaBoost, K-means, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression and PCA, plus several other techniques that can be used for feature selection, ensemble learning, linear programming and recommender systems.

Bug Reports

Although parts of the framework have been used in commercial applications, not all classes are equally used or tested. The framework is currently in alpha, so you should expect some changes to the public APIs in future versions. If you spot a bug, please submit it as an Issue on the official GitHub repository.

Contributing

The framework can be improved in many ways, so any contribution is welcome. By far the most important missing feature is the ability to use the framework from the command line or from other languages such as Python. Other important enhancements include improving the documentation, the test coverage and the examples, improving the architecture of the framework, and supporting more Machine Learning and Statistical models. If you make any useful changes to the code, please consider contributing them by sending a pull request.

Acknowledgements

Many thanks to Eleftherios Bampaletakis for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing a license for their Java Profiler and to JetBrains for providing a license for their Java IDE.

datumbox-framework's People

Contributors

datumbox, dependabot[bot], lmpampaletakis


datumbox-framework's Issues

How can I make Datumbox train on data from disk?

I already ran Datumbox successfully, using the example at https://github.com/datumbox/datumbox-framework-examples/blob/develop/src/main/java/com/datumbox/examples/TextClassification.java

But I want to build an app where I input some text and it tells me the category.

I already tried the data here: https://github.com/datumbox/datumbox-framework-zoo/tree/develop/TopicClassification

with this code:

TextClassifier textClassifier = MLBuilder.load(TextClassifier.class, "TopicClassification", configuration);
System.out.println(textClassifier.predict("Datumbox is amazing!").getYPredicted());

I get the error: Can't find any object with name 'trainingParameters'

How can I fix it?

How to use Pre-trained Models in the Datumbox Framework

Hi, I'm new to machine learning and I'm building a small project for analyzing data with Sentiment Analysis, Content Readability, Content Quality, Adult Content and Spam Detection features, so I would appreciate some help on using the pre-trained models with the framework.

Serialize Dataframe

How do I serialize a Dataframe efficiently (records in bulk) to disk with MapDB? My use case is:
I have a large dataset for text classification, and it takes a long time to deserialize and tokenize the text. I want to try out multiple experiments without having to redo the tokenization that converts the text to Record instances.

Initialize data set with existing collections

I'm interested in performing stepwise regression with Datumbox using variables which I have already significantly altered during the run time of the program in memory (a bunch of map-reduces after fetching from the DB). Thus, it is not really reasonable for me to write the data to CSV and then read it back in with Datumbox as is done in most examples.

Is there any convenient way of creating a data set with already-sorted and manipulated data? I have a one dimensional array of (double) Y values and a two dimensional array of (double) X values on which I would like to perform the stepwise regression. In other math libraries (Apache Commons, etc.) this is the standard route, so I'm hoping there's a relatively easy way to use it with Datumbox that I just haven't seen in the documentation.

mvn package build failure

I used mvn to build the project.
My system information:

$ mvn --version
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-38-generic", arch: "amd64", family: "unix"

This is the error output:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:45.186s
[INFO] Finished at: Tue Oct 21 15:06:49 CST 2014
[INFO] Final Memory: 10M/173M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project datumbox-framework: Could not resolve dependencies for project com.datumbox:datumbox-framework:jar:0.5.0: Could not find artifact lpsolve:lpsolve:jar:5.5.2.0 in central (http://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Will this work on Android

I just need the Naive Bayes classifier, and I wanted to know if there are any imports or code (I haven't checked) that would prevent it from working on Android (Dalvik or ART Java VMs).

Train Text Classifier from String array

Hi
I was wondering if it's possible to train a text classifier from a collection of Strings in memory? The thing is, I have my textual data stored in a database, so it doesn't make much sense for me to retrieve that data, store it as a text file, and then use the file to train the classifier. Could I train the classifier directly?

WordSequenceExtractor does not work with MultinomialNaiveBayes training

Right now I have a classifier working with NgramsExtractor and MultinomialNaiveBayes training. However, when I change the text extractor to WordSequenceExtractor, it fails at the fitting stage (the same happens with UniqueWordSequenceExtractor):

6819 [main] INFO com.datumbox.framework.core.machinelearning.classification.MultinomialNaiveBayes - fit()
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.ClassCastException
	at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:116)
	at com.datumbox.framework.common.concurrency.ForkJoinStream.forEach(ForkJoinStream.java:56)
	at com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes._fit(AbstractNaiveBayes.java:278)
	at com.datumbox.framework.core.machinelearning.common.abstracts.AbstractTrainer.fit(AbstractTrainer.java:125)
	at com.datumbox.framework.core.machinelearning.modelselection.Validator.validate(Validator.java:67)
	at com.avrio.AVcgclassifier.Classification.main(Classification.java:131)
Caused by: java.util.concurrent.ExecutionException: java.lang.ClassCastException
	at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:996)
	at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:112)
	... 5 more
Caused by: java.lang.ClassCastException
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:590)
	... 7 more
Caused by: java.lang.ClassCastException
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:590)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:668)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:726)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:430)
	at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:594)
	at com.datumbox.framework.common.concurrency.ForkJoinStream.lambda$forEach$0(ForkJoinStream.java:55)
	at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1393)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:283)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1603)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/java.lang.Number
	at com.datumbox.framework.common.dataobjects.TypeInference.toDouble(TypeInference.java:163)
	at com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes.lambda$_fit$1(AbstractNaiveBayes.java:284)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:747)
	... 3 more

I assume there's some format change that causes this issue?
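The root cause at the bottom of the trace is a String being cast to a Number inside TypeInference.toDouble. This is consistent with WordSequenceExtractor emitting word positions as keys and the String words themselves as values, whereas MultinomialNaiveBayes expects numeric feature values. A minimal, self-contained illustration of the failing pattern (a hypothetical simplification, not the actual Datumbox code):

```java
public class CastFailureDemo {
    // Hypothetical simplification of the pattern failing in TypeInference.toDouble:
    // the value must already be a Number, or the cast throws ClassCastException.
    static double toDouble(Object value) {
        return ((Number) value).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(toDouble(3));   // works: Integer is a Number
        try {
            toDouble("word");              // throws: String is not a Number
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the stack trace above");
        }
    }
}
```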

Created model gives a slow response?

Hi
I created my own sentiment model for Twitter. To prepare the model, I collected 4MB data files of positive, negative and neutral examples.

When I created the model, its size grew to 191MB, and when I try out sentiment analysis by loading that 191MB model, it takes 40 seconds to give the output.

In order to load that 191MB model in Tomcat, I increased the Java -Xmx value to 2GB.
I want to know why it takes 40 seconds to give the output. Also, when I created the model I did not find any fs0 folder; does that make any difference?

Please give suggestions on the following:
Do I need to look at the hardware, or do I need to decrease the size of the raw files and re-create the model?
When I check your twitter-sentiment model in the Datumbox Zoo, it is a low-size model (just 1.5MB), unlike mine.

Access output of StepwiseRegression prediction

I'm attempting to access the output of a StepwiseRegression's prediction. After adding data and training the model, which is confirmed via log output, I attempt to have it predict the independent based on new dependents, as seen in this gist.

As seen in the predict method in the gist, I create a new record with an unknown Y value (I've tried null and zero) and then call Stepwise().predict(dataset), which returns void. I cannot access the underlying regression via the datumbox.StepwiseRegression class, and, as far as I can tell, have no access to the underlying predicted value (my original unknown Dataset and Record remain unchanged).

Is there a method for accessing this value that I'm overlooking? Or is there currently no ability to create a prediction on a Stepwise object?

How do I set the configs so that I can read training data from disk?

Hello, I'm new to machine learning and Datumbox is the first ML framework I'm working with, but I did not find any documentation on setting the config properties to read training data from disk. Please share a code example of reading training datasets from disk and setting up the config properties.

Models

Does the framework come with pre-trained models, as in the REST API?

Unsupported major.minor version 52.0

Hi,
very nice framework. In our project we're currently using Java 1.7, and your Datumbox version in Maven seems to be compiled only for Java 8, without backward compatibility. Would it be possible to publish a version that is compatible with Java 1.7?

Best regards
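For context, the major version in that error maps to a JDK release: 51.0 is Java 7 and 52.0 is Java 8, so a JVM running Java 7 refuses classes compiled for Java 8. A minimal, self-contained way to check which class-file version your JVM supports:

```java
public class ClassFileVersion {
    public static void main(String[] args) {
        // The JVM exposes the highest class-file version it can load;
        // on Java 7 this prints 51.0, on Java 8 it prints 52.0.
        System.out.println(System.getProperty("java.class.version"));
    }
}
```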

Entity-based Sentiment Analysis

Hello,

Does Datumbox support entity-based sentiment analysis? If so, can you show us how? We are trying to solve the problem of sentiment analysis.

Thank you

Possible Error in Shapiro-Wilk P-Value

Hi,

I tried out your Shapiro-Wilk implementation, as I needed to calculate some values for a paper submission. I cross-referenced it with the Real Statistics Excel plugin (http://www.real-statistics.com/) as well as several online tools that support Shapiro-Wilk.

If you run it with the following values:
488.0, 486.0, 492.0, 490.0, 489.0, 491.0, 488.0, 490.0, 496.0, 487.0, 487.0, 493.0
The results should be:
W -> 0.944486
P -> 0.55826969...

However the P value is
0.44173030948

The offending line is here:

It should actually be m-y
pw=ContinuousDistributions.gaussCdf((m-y)/s)
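Note that the reported value 0.44173030948 is exactly 1 − 0.55826969..., which is consistent with a sign flip inside the normal CDF, since Φ(−z) = 1 − Φ(z). A self-contained sketch of this symmetry (gaussCdf here is a standard Abramowitz–Stegun approximation, not the Datumbox implementation):

```java
public class GaussCdfSymmetry {
    // Standard normal CDF via the Abramowitz & Stegun 26.2.17 approximation
    // (absolute error below ~7.5e-8); NOT the Datumbox implementation.
    static double gaussCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double d = 0.3989422804014327 * Math.exp(-z * z / 2.0);
        double p = d * t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        return z >= 0 ? 1.0 - p : p;
    }

    public static void main(String[] args) {
        double z = 0.1465; // an arbitrary z-score
        // Phi(z) + Phi(-z) == 1, so evaluating at (y-m)/s instead of (m-y)/s
        // yields exactly the complement of the correct P-value.
        System.out.println(gaussCdf(z) + gaussCdf(-z)); // ~1.0
    }
}
```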

SVM example for text classification

Hello,

I need to classify the text. I already tried your awesome example based on MultinomialNaiveBayes classifier https://github.com/datumbox/datumbox-framework-examples/blob/develop/src/main/java/com/datumbox/examples/TextClassification.java

I'd like to also test another algorithm - SVM.
Could you please show an example how to transform the mentioned sample class in order to use SVM?

Is it as simple as changing from:

trainingParameters.setModelerTrainingParameters(new MultinomialNaiveBayes.TrainingParameters());

to:

trainingParameters.setModelerTrainingParameters(new SupportVectorMachine.TrainingParameters());

or do I need to change something else as well?

Also, can I use the same text files for my SVM model that I used previously for MNB?

Unable to download the framework using Maven

Hi, I am trying to include the framework in my project using Maven; however, I'm getting the following error:

Failure to find com.datumbox:datumbox-framework:jar:0.8.1 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced

Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T22:59:23+05:30)
Maven home: C:\Dev\apache-maven-3.2.3\bin\..
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: C:\Dev\Java\jdk1.7.0_45\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 8.1", version: "6.3", arch: "amd64", family: "windows"
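One thing worth checking: the failing coordinate is com.datumbox:datumbox-framework, while the installation snippet in the README above uses the datumbox-framework-lib artifactId for the 0.8.x releases. Assuming that naming also applies to 0.8.1 (an assumption; verify on Maven Central), the dependency would look like:

```xml
<dependency>
    <groupId>com.datumbox</groupId>
    <artifactId>datumbox-framework-lib</artifactId>
    <version>0.8.1</version>
</dependency>
```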

Can I get a workaround for lpsolve?

For some reason I'm having a hard time resolving the following dependency:

lpsolve:lpsolve:5.5.2.0

Is there a workaround for this, like a jar I can download?
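A common Maven workaround for artifacts that are not on Central is to download the jar manually and install it into the local repository with the maven-install-plugin; a sketch, assuming you have obtained the lpsolve Java bindings jar (the filename below is illustrative):

```shell
# Install a manually downloaded lpsolve jar into the local Maven repository
# so that the lpsolve:lpsolve:5.5.2.0 coordinate resolves.
mvn install:install-file \
    -Dfile=lpsolve55j.jar \
    -DgroupId=lpsolve \
    -DartifactId=lpsolve \
    -Dversion=5.5.2.0 \
    -Dpackaging=jar
```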

How can I load a big dataset once and use multiple TextClassifiers to predict it?

I have a big dataset (a 6GB text file) that I want to load into a Dataframe and then predict with multiple TextClassifiers.

My current solution uses predict with the URI of the text file, but this reloads the 6GB text file for every TextClassifier run.

If I had a way to load the data into a Dataframe once, I could use the _predict(Dataframe newData) function from the Modeler class and avoid loading the data multiple times.

When I checked how the data is loaded in TextClassifier, I found it is done by:

    Map<Object, URI> dataset = new HashMap<>();
    dataset.put(null, datasetURI);

    TrainingParameters trainingParameters = (TrainingParameters) knowledgeBase.getTrainingParameters();

    Dataframe testDataset = Dataframe.Builder.parseTextFiles(dataset,
            AbstractTextExtractor.newInstance(trainingParameters.getTextExtractorParameters()),
            knowledgeBase.getConfiguration()
    );

My only question here is: where does that knowledgeBase come from, and how can I use it?

For my current modeling setting, I use:

    Configuration configuration = Configuration.getConfiguration(); //default configuration based on properties file
    MapDBConfiguration DBparam = new MapDBConfiguration();
    DBparam.setDirectory("/link/to/my/model/");
    configuration.setStorageConfiguration(DBparam); //use MapDB engine
    configuration.getConcurrencyConfiguration().setParallelized(true); //turn on/off the parallelization
    configuration.getConcurrencyConfiguration().setMaxNumberOfThreadsPerTask(4); //set the concurrency level

When I want to perform just the Dataframe creation step in my code, what do I need to do to get the knowledgeBase variable working?

How to use setLogPriors for a Naive Bayes model during cross-validation?

I am using cross-validation to estimate the performance of my model. Right now I use it like this:

    ClassificationMetrics vm = new Validator<>(ClassificationMetrics.class, configuration)
            .validate(new KFoldSplitter(10).split(trainingDataframe), new MultinomialNaiveBayes.TrainingParameters());

In com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes, I see there is a setLogPriors function which can probably be used to tune the model (I want to create a DET graph of the model performance by playing around with the prior probabilities). Is there a way to set the prior probabilities of the different labels during cross-validation? Thanks.

FlatDataList with null values gets an exception when trying to calculate the variance

So, as discussed in the pull request, here is a minimal piece of code that reproduces the exception:

    public void testClassifier() {
        RandomGenerator.setGlobalSeed(42L);
        Configuration configuration = Configuration.getConfiguration();
        InMemoryConfiguration memConfiguration = new InMemoryConfiguration();
        final File f = new File(Datumbox.class.getProtectionDomain().getCodeSource().getLocation().getPath());
        memConfiguration.setDirectory(f.getAbsolutePath());
        configuration.setStorageConfiguration(memConfiguration);

        // List of positive and negative sentences, for training
        List<String> positives = new ArrayList<>();
        List<String> negatives = new ArrayList<>();
        positives.add("the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .");
        positives.add("the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .");
        positives.add("effective but too-tepid biopic");
        negatives.add("simplistic , silly and tedious .");
        negatives.add("it's so laddish and juvenile , only teenage boys could possibly find it funny .");
        negatives.add("exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .");

        // Construct the training parameters for the classifier
        TextClassifier.TrainingParameters trainingParameters = new TextClassifier.TrainingParameters();
        trainingParameters.setNumericalScalerTrainingParameters(new StandardScaler.TrainingParameters());
        trainingParameters.setCategoricalEncoderTrainingParameters(new CornerConstraintsEncoder.TrainingParameters());
        trainingParameters.setFeatureSelectorTrainingParametersList(Arrays.asList(new ChisquareSelect.TrainingParameters()));
        trainingParameters.setTextExtractorParameters(new NgramsExtractor.Parameters());
        trainingParameters.setModelerTrainingParameters(new BernoulliNaiveBayes.TrainingParameters());

        // Construct list of records form lists of positives/negatives
        AbstractTextExtractor textExtractor = AbstractTextExtractor.newInstance(trainingParameters.getTextExtractorParameters());
        List<Record> records = new ArrayList<>();
        for (String positive: positives) {
            AssociativeArray xData = new AssociativeArray(textExtractor.extract(StringCleaner.clear(positive)));
            records.add(new Record(xData, "positive"));
        }
        for (String negative: negatives) {
            AssociativeArray xData = new AssociativeArray(textExtractor.extract(StringCleaner.clear(negative)));
            records.add(new Record(xData, "negative"));
        }

        // Construct training dataframe
        Dataframe trainingData = new Dataframe(configuration);
        for (Record r: records)
            trainingData.set(trainingData.size(), r);

        // Construct and train the classifier
        TextClassifier classifier = MLBuilder.create(trainingParameters, configuration);
        classifier.fit(trainingData); // Here, you can follow the trace to see the problem
    }

java.lang.OutOfMemoryError while preparing a model from my own datasets

I'm trying to prepare my own model with 20+ classes (I need to add more to create my own model, e.g. for language detection) using the text classifier Java class, and it is throwing a java.lang.OutOfMemoryError (Java heap space).
I tried increasing the Java -Xmx value to 5g, but I still get the same error.

Exception in thread "ForkJoinPool-1-worker-0"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "ForkJoinPool-1-worker-0"
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError
at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:116)
at com.datumbox.framework.common.concurrency.ForkJoinStream.forEach(ForkJoinStream.java:56)
at com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes._fit(AbstractNaiveBayes.java:269)
at com.datumbox.framework.core.machinelearning.common.abstracts.AbstractTrainer.fit(AbstractTrainer.java:125)
at com.datumbox.framework.applications.datamodeling.Modeler._fit(Modeler.java:263)
at com.datumbox.framework.core.machinelearning.common.abstracts.AbstractTrainer.fit(AbstractTrainer.java:125)
at com.datumbox.framework.applications.nlp.TextClassifier.fit(TextClassifier.java:128)
at aail.storm.datum8.TextClassification.main(TextClassification.java:109)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError
at java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:112)
... 7 more
Caused by: java.lang.OutOfMemoryError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
at java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1005)
... 8 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.ForkJoinTask.recordExceptionalCompletion(ForkJoinTask.java:471)
at java.util.concurrent.ForkJoinTask.setExceptionalCompletion(ForkJoinTask.java:491)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:291)
at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
at com.datumbox.framework.common.concurrency.ForkJoinStream.lambda$forEach$0(ForkJoinStream.java:55)
at com.datumbox.framework.common.concurrency.ForkJoinStream$$Lambda$10/1476011703.run(Unknown Source)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

Can you tell me how you prepared the model for language detection, which contains 90+ language classes? What suggestions can you give me for training my own model with 100+ classes without hitting the heap error?

Each raw class file is only around 500KB.

datumbox on maven central

Hi,

thanks for the great work. It would be great to have the package available for download from Maven Central. Do you think that's possible?

Thanks
