carrot2's Introduction

Carrot2

Carrot2 is a programming library for clustering text. It can automatically discover groups of related documents and label them with short key terms or phrases.

Carrot2 can turn, for example, search result titles and snippets into groups like these:

Figure: search result titles and snippets (left) and the corresponding cluster labels (right).

Installation

Carrot2 is a software component and typically integrates with other software as a library dependency (see the API documentation available with each release).

Binary releases are published on GitHub. They ship with an HTTP/JSON REST API service called the DCS (document clustering server) for integration with other languages.

Integration with document retrieval services is possible via the Apache Solr plugin and the Elasticsearch plugin.

Building from Sources

If you need to build the distribution from sources, run:

./gradlew -p distribution assemble

The distribution is placed under distribution/build/dist/ and a compressed version is available under distribution/build/distZip/.

Source code

Source code is at GitHub.

License

Carrot2 is licensed under the BSD license.

carrot2's People

Contributors

bayandin, chenrui333, davenorthcreek, dependabot[bot], dweiss, omegaula, quan-nh, sjvs, smillerdev, stanislawosinski, stonio

carrot2's Issues

Inaccurate cluster size reporting [CARROT-1]

The cluster tree in the carrot2-demo-webapp shows inconsistent information about cluster sizes. The root node (All results) reports the 'true' number of documents, while the other nodes report the number of document assignments.
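The difference between the two counts can be illustrated with a minimal sketch. It is not taken from the webapp code: the class and method names are hypothetical, and clusters are modelled simply as lists of document ids. A document that belongs to several clusters is counted once per cluster as an assignment, but only once as a document.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Carrot2: contrasts the two ways of counting.
final class ClusterSizes {
    /** Counts document-to-cluster assignments: a document is counted once per cluster it appears in. */
    static int assignmentCount(List<List<String>> clusters) {
        int total = 0;
        for (List<String> cluster : clusters) {
            total += cluster.size();
        }
        return total;
    }

    /** Counts distinct documents across all clusters. */
    static int uniqueDocumentCount(List<List<String>> clusters) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> cluster : clusters) {
            seen.addAll(cluster);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // "doc2" is assigned to both clusters.
        List<List<String>> clusters = List.of(
            List.of("doc1", "doc2"),
            List.of("doc2", "doc3"));
        System.out.println(assignmentCount(clusters));     // prints 4
        System.out.println(uniqueDocumentCount(clusters)); // prints 3
    }
}
```

Whichever count is chosen, using the same one for every node of the tree would remove the inconsistency described above.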


Issue: CARROT-1 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved May 13 2007

Remembering Lucene settings in the Demo Browser [CARROT-40]

In the current version of the Demo Browser, users must configure the Lucene input component every time the application is restarted, which is annoying. It would be a good idea either to let the user save the configuration explicitly, or simply to save the configuration in the background and restore it when the Demo Browser is launched again (making the process totally transparent for the user).
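One way the transparent save-and-restore variant could look is sketched below, using only java.util.Properties from the JDK. The class name, the file location and the property key used in the usage note are assumptions made for illustration; the Demo Browser's actual settings model is not shown in this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical store: persists input component settings between application runs.
final class InputSettingsStore {
    private final Path file;

    InputSettingsStore(Path file) {
        this.file = file;
    }

    /** Writes the settings to disk; meant to be called whenever the user changes them. */
    void save(Properties settings) {
        try (OutputStream out = Files.newOutputStream(file)) {
            settings.store(out, "Demo Browser input settings (hypothetical format)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads previously saved settings, or returns empty settings on the first launch. */
    Properties load() {
        Properties settings = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                settings.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return settings;
    }
}
```

On startup the application would call load() and, if a value such as "lucene.index.path" is present, configure the Lucene input with it instead of prompting the user.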


Issue: CARROT-40 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Mar 14 2007

Tip of the day [CARROT-2]

This is a result of a heuristic UI evaluation. We might consider adding "tips of the day" for advanced users, e.g. about keyboard shortcuts, how to change clustering algorithms, or how to change the results count. The tips might be added on the startup page, with an option to disable them (stored in a persistent cookie).


Issue: CARROT-2 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), updated Mar 20 2012

Annoying logging done by JDIC [CARROT-52]

When running the demo browser, JDIC is doing a lot of annoying logging. It dumps to the console something along the lines of:

native lib path F:\tmp\br\deps-carrot2-demo-browser-jar
*** Error: nspr4.dll doesn't exist under F:\tmp\br\deps-carrot2-demo-browser-jar\ielib
native lib path F:\tmp\br\deps-carrot2-demo-browser-jar

Additionally, it creates a JDIC.log file in the demo browser's main directory.

Is there any way to get rid of this kind of logging?


Issue: CARROT-52 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Apr 03 2007

Tracking of clusters the users chose to inspect [CARROT-13]

This could be an interesting source of statistics, potentially useful for improving or promoting certain clusters in the future. Implementation idea: when the user clicks on a cluster label (for the first time only?), make a background XMLHttpRequest to send to the server: the query, the cluster label (+ parent labels?), the rank of the cluster?, the number of documents in the cluster? An even more interesting feature would be to track clicks on documents (a cluster whose documents have not been clicked is worse than a cluster whose documents have been clicked).


Issue: CARROT-13 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), updated Mar 20 2012

More scalable input tabs [CARROT-8]

Currently, the number of input tabs we can have in the webapp is limited by the width of the tabs space – it's not possible to have more than about 7-8 tabs because they wouldn't fit on the screen.

One idea to solve this problem is to have a "more..." tab, shown always at the rightmost position, which would show all the inputs emphasizing those that are currently not available on tabs. After the user clicks one of such inputs, the selected input would replace the rightmost regular tab (not the "more..." tab), the new tab would be automatically activated and the cursor placed in the query text field.

In a more sophisticated scenario, each input on the "more..." tab could have a check box. Entries with the check box checked would appear as tabs. In this way, users could have as many tabs as they can fit on their screens. One problem with this is the ordering of tabs – alphabetical?

Another important factor is to remember user preferences (last added tab, set of selected tabs depending on the scenario) in a persistent cookie.
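The first idea (a "more..." tab whose selection replaces the rightmost regular tab) can be modelled with a small sketch. This is not webapp code: the class and method names are hypothetical, and rendering, cursor placement and cookie persistence are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the tab bar; the "more..." tab itself is not part of the list.
final class InputTabs {
    private final List<String> all;     // every configured input
    private final List<String> visible; // inputs currently shown as regular tabs, left to right

    InputTabs(List<String> allInputs, int capacity) {
        if (capacity < 1 || allInputs.isEmpty()) {
            throw new IllegalArgumentException("need at least one input and one tab slot");
        }
        this.all = new ArrayList<>(allInputs);
        this.visible = new ArrayList<>(allInputs.subList(0, Math.min(capacity, allInputs.size())));
    }

    List<String> visibleTabs() {
        return Collections.unmodifiableList(visible);
    }

    /** Inputs that the "more..." tab would emphasize because they have no tab of their own. */
    List<String> hiddenInputs() {
        List<String> hidden = new ArrayList<>(all);
        hidden.removeAll(visible);
        return hidden;
    }

    /** Handles a click on an input in the "more..." tab and returns the tab to activate. */
    String select(String input) {
        if (!all.contains(input)) {
            throw new IllegalArgumentException("unknown input: " + input);
        }
        if (!visible.contains(input)) {
            // Replace the rightmost regular tab with the newly selected input.
            visible.set(visible.size() - 1, input);
        }
        return input;
    }
}
```

With inputs a to e and room for three tabs, selecting e turns the visible tabs from [a, b, c] into [a, b, e], and c moves to the hidden list.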


Issue: CARROT-8 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Jun 20 2007

Stop list merging for Lingo [CARROT-37]

Occasionally it happens that search results contain snippets in different languages, which, combined with inaccurate language recognition, can lead to meaningless cluster labels consisting of e.g. stop words. Applying all known stop lists (and not only the one for the recognized language) would fix the problem. Stop word list merging could be implemented on the level of the tokenizer component, and in this way all clustering algorithms using the component would benefit.
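A minimal sketch of the merging idea follows. It does not use the real tokenizer component; the class and method names are hypothetical, and the per-language stop lists are plain sets of strings supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: merges per-language stop lists into one language-independent list.
final class MergedStopList {
    /** Returns the union of all given stop lists, lower-cased. */
    static Set<String> merge(Collection<Set<String>> stopLists) {
        Set<String> merged = new HashSet<>();
        for (Set<String> stopList : stopLists) {
            for (String word : stopList) {
                merged.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return merged;
    }

    /** Drops every token found in the stop word set, ignoring case, and keeps the original order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Because the merged set does not depend on the recognized language, a German snippet misclassified as English would still have words such as "der" and "und" filtered out. The trade-off is that a stop word in one language may be a meaningful term in another.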


Issue: CARROT-37 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Mar 15 2007

SOLR input [CARROT-17]

SOLR (http://lucene.apache.org/solr/) is gaining more and more momentum, so it might be a good idea to add an input component for it. The component is very easy to create – it's essentially a subclass of XmlInputComponent with a specific XSLT file.


Issue: CARROT-17 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Feb 20 2007

Show in clusters [CARROT-7]

Add a "Show in clusters" feature to the webapp, which will highlight all clusters in the tree to which the selected (hovered-on?) document belongs.


Issue: CARROT-7 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Feb 16 2007

Carrot2 does not build with ANT 1.7.0 [CARROT-27]

Carrot2 does not build with ANT 1.7.0 due to changes in the ANT internals. The Path class, from which Carrot2's dependency-related taskdefs inherit, has changed its internal API and now uses ResourceCollection interface methods rather than the String[] list() method.

A fix for this should not be difficult, but it also means that all the development must be moved to ANT 1.7.0 (creating a compatible version is unlikely).


Issue: CARROT-27 (migrated from JIRA), created by Dawid Weiss (@dweiss), resolved Feb 21 2007
Attachments: ant170.patch

"Always show all clusters" link in the cluster list [CARROT-21]

Some users may prefer that all clusters are shown at once, even without clicking the "all clusters" link. To support such users the cluster list could behave in the following way: after the user clicks the "all clusters" (#232) link, at the bottom of the cluster list another link appears (e.g. "always show all clusters"). After the user clicks that link, for all further searches all clusters will appear right away. When immediate showing of all clusters is enabled, at the bottom of the cluster list another link should be visible (e.g. "don't show all clusters right away"), which the user can click to restore the original behaviour (i.e. showing clusters in groups).


Issue: CARROT-21 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Jun 14 2007

"Show all clusters" link in the clusters list [CARROT-20]

It would be nice to allow users to display all clusters in one click rather than by multiple clicks on the "more..." links. To implement this, an "all clusters" link should be rendered next to the "more..." link, revealing all clusters at once.


Issue: CARROT-20 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Feb 15 2007

BeanShell script failing with a ClassNotFoundException [CARROT-3]

When this is added as an input component for the browser:

/carrot2-lingo-3g-browser/components/input-cached-lucene.bsh:

import com.dawidweiss.carrot.core.local.*;
import com.dawidweiss.carrot.core.local.impl.*;
import com.dawidweiss.carrot.core.local.clustering.*;
import com.dawidweiss.carrot.local.controller.*;
import com.carrot.input.lucene.*;
import carrot2.demo.cache.*;

/**
 * Provide a path to the Lucene index to be used below, or use the "lucene.index.path"
 * system property to do that.
 */
String indexPath = System.getProperty("lucene.index.path");

LocalComponentFactory factory = null;

if (indexPath != null)
{
    luceneFactory = new LuceneLocalInputComponentFactory(indexPath);

    factory = new LocalComponentFactoryBase()
    {
        public LocalComponent getInstance()
        {
            return new RawDocumentProducerCacheWrapper(luceneFactory.getInstance());
        }
    };
}
else
{
    RawDocument errorMessage = new RawDocumentSnippet(
        "0",
        "Please configure Lucene index first",
        "To use Lucene input, please provide a path to a Lucene index in the " + "'lucene.index.path' system property.",
        "http://www.carrot2.org", // This url could point to some more material about configuring Lucene index
        (float)0.0
    );

    factory = new LocalComponentFactoryBase()
    {
        public LocalComponent getInstance()
        {
            return new RawDocumentProducerCacheWrapper(
                new com.dawidweiss.carrot.core.local.impl.DummyLocalInputComponent(errorMessage,
                    "Please configure Lucene index"));
        }
    };
}

return new LoadedComponentFactory(
    /* id */ "input-cached-lucene", factory);

we get a weird exception:

Exception message: java.lang.NoClassDefFoundError: A class required by class: global/BlockNameSpace$4 could not be loaded: java.lang.NoClassDefFoundError: IllegalName: global/BlockNameSpace$4


Issue: CARROT-3 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Jun 14 2007

HTML control doesn't show when launching the demo browser from webstart [CARROT-38]

When running the HEAD demo browser from WebStart the HTML control (for showing search results) is not visible at all. When running the STABLE demo browser from WebStart in the same environment the Swing HTML control is used instead of JDIC (is this normal?).

OS: Windows XP Professional
JDK: Sun JRE 1.6.0 (not sure for previous JREs)
Build: HEAD


Issue: CARROT-38 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Mar 09 2007

Misleading use of LuceneSearchConfig in LuceneLocalInputComponentFactory constructor [CARROT-45]

This is a bit of a misleading API – when passing an instance of LuceneSearchConfig, its searcher and analyzer fields can/should be null, as the component factory will create them based on the path and analyzer factory. The problem is that the config class is used both when creating the component factory (where searcher and analyzer fields are irrelevant) and at runtime (where these fields are needed). Therefore, it would be nice to refactor that into two classes – one for use with the constructor and another one for internal use.


Issue: CARROT-45 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Jun 28 2007

A switch for making a Demo Browser process inactive [CARROT-43]

When preparing customized versions of the Demo Browser, it often happens that specialized processes need to be added. One way of managing this could be adding all possible input components and the corresponding process descriptors, but marking the "non-standard" processes as "inactive" (with a dedicated property). In this way, those who would like to try the customized processes could simply activate them.

One possible problem with this approach is that there is no easy way to "deactivate" (stop the controller from loading) component factories, which, in case of factories that need some specific configuration, might cause initialization exceptions.
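The "dedicated property" idea might be sketched as follows. The property name "process.active", the class name and the representation of descriptors as java.util.Properties keyed by process id are all assumptions for illustration; the real descriptor format is not shown in this issue, and the sketch does not address the component factory problem mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical filter: decides which processes the Demo Browser would offer to the user.
final class ProcessFilter {
    /** Name of the assumed descriptor property; processes without it are treated as active. */
    static final String ACTIVE_PROPERTY = "process.active";

    /** Returns the ids of all processes not explicitly marked as inactive, in map iteration order. */
    static List<String> activeProcessIds(Map<String, Properties> descriptors) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, Properties> entry : descriptors.entrySet()) {
            String flag = entry.getValue().getProperty(ACTIVE_PROPERTY, "true");
            if (Boolean.parseBoolean(flag)) {
                active.add(entry.getKey());
            }
        }
        return active;
    }
}
```

Defaulting to active keeps existing descriptors working unchanged; only the "non-standard" processes need the extra property.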


Issue: CARROT-43 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Jul 13 2007

Tuning browser window opens in background on Java 1.6 [CARROT-55]

When starting up the browser on Java 1.6, the splash screen shows up as usual, but the actual browser window opens "in background" (it's not activated, but put under other windows). This can cause a lot of irritation: 1) every time the browser is launched, the user must take an extra action (click, alt+tab) to start working with it, 2) some users might think the browser doesn't work at all (because it doesn't appear on the screen, only on the task bar). Is this behavior intentional?


Issue: CARROT-55 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), resolved Apr 03 2007

Add desired cluster phrase length customization parameters to STC [CARROT-29]

Add desired cluster phrase length customization parameters to STC. Currently these are hardcoded in:

private float calculateModifiedBaseClusterScore(final int effectivePhraseLength, final int documentCount) {
    final double SINGLE_WORD_BOOST = 0.5f;
    final int optimalPhraseLength = 3;
    final int optimalPhraseLengthDev = 2;
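
One possible shape for such parameters is a small settings object whose defaults mirror the three hardcoded values quoted above. This is only a sketch: the class name and the validation rule are assumptions, and it is not the content of the attached patches. How the score is computed from these values is not shown in the snippet, so it is not reproduced here.

```java
// Hypothetical parameter holder; defaults copy the hardcoded values from the snippet above.
final class PhraseLengthParams {
    private double singleWordBoost = 0.5;
    private int optimalPhraseLength = 3;
    private int optimalPhraseLengthDev = 2;

    double getSingleWordBoost() {
        return singleWordBoost;
    }

    int getOptimalPhraseLength() {
        return optimalPhraseLength;
    }

    int getOptimalPhraseLengthDev() {
        return optimalPhraseLengthDev;
    }

    PhraseLengthParams setSingleWordBoost(double value) {
        this.singleWordBoost = value;
        return this;
    }

    PhraseLengthParams setOptimalPhraseLength(int value) {
        // Assumed constraint: a phrase has at least one word.
        if (value < 1) {
            throw new IllegalArgumentException("optimalPhraseLength must be >= 1");
        }
        this.optimalPhraseLength = value;
        return this;
    }

    PhraseLengthParams setOptimalPhraseLengthDev(int value) {
        this.optimalPhraseLengthDev = value;
        return this;
    }
}
```

The scoring method could then read these values from an instance passed to the engine instead of declaring local constants.
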

Issue: CARROT-29 (migrated from JIRA), created by Dawid Weiss (@dweiss), resolved Mar 06 2007
Attachments: StcConstants.patch, STCEngine.patch, StcParameters.patch, StcSettingsDialog.patch

The webapp controller servlet should re-read configuration if initial failure occurred. [CARROT-54]

The webapp controller servlet should re-read its configuration if the initial load failed. One could then correct the descriptors and simply refresh the home page to reload the application. Reloading everything dynamically should also be an option.

In addition, error messages should be more descriptive when configuration loading fails for some reason.
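A minimal sketch of the retry-on-next-request behaviour, independent of the servlet API, is given below. The class name and the use of a Supplier as the loader are assumptions; the real controller's loading code is not shown in this issue.

```java
import java.util.function.Supplier;

// Hypothetical holder: loads the configuration lazily and retries after a failed attempt.
final class ReloadableConfig<T> {
    private final Supplier<T> loader;
    private T config; // stays null until a load attempt succeeds

    ReloadableConfig(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the configuration, loading it first if no earlier attempt has succeeded. */
    synchronized T get() {
        if (config == null) {
            // If the loader throws, config remains null and the next call tries again.
            config = loader.get();
        }
        return config;
    }

    /** Forces the next call to get() to re-read the configuration. */
    synchronized void reload() {
        config = null;
    }
}
```

A servlet using such a holder would call get() on each request: after the descriptors are corrected, a simple page refresh triggers a new load, and reload() covers the dynamic reloading option.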


Issue: CARROT-54 (migrated from JIRA), created by Dawid Weiss (@dweiss), resolved Apr 18 2007
