Text Processing Utilities

Language Detection

One of the major changes here (apart from some code style changes and saying bye-bye to Mr. StringBuffer) is the "unstatic'ing" of the DetectorFactory class. It now has to be created via a constructor that accepts a single parameter, shortMessages, which defines which corpus should be used:

  • If the parameter is set to true, language profiles generated from the Twitter corpus are used. These profiles should perform better on short messages.
  • If the parameter is set to false, language profiles generated from the Wikipedia corpus are used. These profiles should perform better on long messages.

The corpora are now part of the language detection JAR file. Also note that Detector instances obtained from the DetectorFactory are stateful: they keep both the evaluated text and some additional information inside.

Example usage
// Use short message corpus
final DetectorFactory detectorFactory = new DetectorFactory(true);
final Detector detector = detectorFactory.create();

detector.append("Some text to detect language for");
final String detectedLang = detector.detect();
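
Because detectors are stateful, one way to avoid mixing texts is to obtain a fresh detector for every input. The sketch below illustrates that pattern only. StatefulDetector and ToyDetector are hypothetical stand-ins so the example compiles without the library, and the toy "detection" rule is not how the real Detector works. With the library, the supplier would presumably be something like detectorFactory::create.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class PerTextDetection {

    /** Hypothetical stand-in for the library's Detector; only append/detect are mirrored. */
    interface StatefulDetector {
        void append(String text);
        String detect();
    }

    /** Toy detector for illustration only: answers "ru" once any Cyrillic letter was appended. */
    static final class ToyDetector implements StatefulDetector {
        private final StringBuilder seen = new StringBuilder(); // state kept between calls

        @Override
        public void append(String text) {
            seen.append(text);
        }

        @Override
        public String detect() {
            for (int i = 0; i < seen.length(); i++) {
                if (Character.UnicodeBlock.of(seen.charAt(i)) == Character.UnicodeBlock.CYRILLIC) {
                    return "ru";
                }
            }
            return "en";
        }
    }

    /** Detects every text with its own detector so that no state leaks between texts. */
    static List<String> detectAll(Supplier<StatefulDetector> newDetector, List<String> texts) {
        List<String> languages = new ArrayList<>();
        for (String text : texts) {
            StatefulDetector detector = newDetector.get(); // fresh instance per text
            detector.append(text);
            languages.add(detector.detect());
        }
        return languages;
    }
}
```

Reusing a single ToyDetector for a Russian text followed by an English one would report "ru" for both, which is the kind of leak a fresh instance per text avoids.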

Text Analysis

A set of tools for performing text analysis.

TermExtractionService

Uses the language detector and Apache Lucene to extract terms from an arbitrary string. Since TermExtractionService uses the DetectorFactory, its constructor also takes a shortMessages parameter. Example usage:

// Use short message corpus
final TermExtractionService service = new TermExtractionService(true);
final List<String> terms = service.getTerms("Some text to extract terms from");

TextCleanupService

Can be used both to remove unwanted characters from the given text and to extract Twitter-like entities: hashtags, cashtags, mentions and URLs. Example usage:

// Remove characters that are responsible for the text direction or
// are invisible spaces that may interfere
final TextCleanupService service = new TextCleanupService();
final String cleanedUp = service.removeDirectionAndInvisibleChars("Some text with garbage here");

// Remove Twitter-like entities from the text. If the second parameter is set to true,
// removeDirectionAndInvisibleChars() will be invoked before the entity cleanup
final String withoutEntities = service.removeTwitterEntities("@kgusarov Click me! #coolbuttons", true);

// It is also possible to get all the extracted entities with the cleaned up text
final Pair<String, List<Extractor.Entity>> withAndWithoutEntities = service.extractTwitterEntities("@kgusarov Click me! #coolbuttons", true);
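
To give a rough idea of what such a cleanup involves, here is a minimal JDK-only sketch. It is not the service's implementation: the CleanupSketch class, its method names, the chosen character ranges and the simplified entity pattern are all assumptions made for illustration, and real Twitter entity rules are considerably stricter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CleanupSketch {

    // Assumption: an illustrative subset of direction controls and invisible characters:
    // U+200B..U+200F (zero-width characters, LRM, RLM), U+202A..U+202E (embeddings and
    // overrides), U+2060 (word joiner) and U+FEFF (zero-width no-break space).
    private static final Pattern DIRECTION_AND_INVISIBLE =
            Pattern.compile("[\\u200B-\\u200F\\u202A-\\u202E\\u2060\\uFEFF]");

    // Assumption: simplified mentions, hashtags, cashtags and http(s) URLs.
    private static final Pattern ENTITY = Pattern.compile("[@#$]\\w+|https?://\\S+");

    /** Drops the direction and invisible characters listed above. */
    static String removeDirectionAndInvisible(String text) {
        return DIRECTION_AND_INVISIBLE.matcher(text).replaceAll("");
    }

    /** Returns the matched entities in order of appearance. */
    static List<String> extractEntities(String text) {
        List<String> entities = new ArrayList<>();
        Matcher matcher = ENTITY.matcher(text);
        while (matcher.find()) {
            entities.add(matcher.group());
        }
        return entities;
    }

    /** Removes the entities, then collapses the whitespace they leave behind. */
    static String removeEntities(String text) {
        return ENTITY.matcher(text).replaceAll("").replaceAll("\\s+", " ").trim();
    }
}
```

With this sketch, CleanupSketch.removeEntities("@kgusarov Click me! #coolbuttons") returns "Click me!", and extractEntities on the same input returns "@kgusarov" and "#coolbuttons".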

TransliterationService

Utilizes ICU4J for performing text transliteration. The service itself uses Commons Pool for pooling Transliterator instances. Example usage:

// Create service instance with given pool and transliterator configuration
final GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxIdle(4);
poolConfig.setMaxTotal(4);
poolConfig.setMinIdle(4);

final TransliterationService transliterationService =
    new TransliterationService("Any-Lower; SomeExampleReplacement; Any-Latin; NFD; [^\\p{Alnum}] Remove", poolConfig);

transliterationService.addTransliteratorConfiguration("SomeExampleReplacement", "ы > i;");

final String transliterated = transliterationService.transliterate("Мама мыла раму");
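
For readers without ICU4J at hand, part of the rule chain above can be approximated with the JDK alone. The sketch below is an assumption-laden illustration, not what the service does: TransliterationSketch and lowerDecomposeStrip are hypothetical names, the Any-Latin and custom replacement steps are omitted entirely, and Java's \p{Alnum} matches ASCII letters and digits only, which may differ from ICU's definition.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TransliterationSketch {

    /** Lower-cases, decomposes accents (NFD) and removes everything that is not ASCII alphanumeric. */
    static String lowerDecomposeStrip(String text) {
        // Roughly the "Any-Lower" step
        String lower = text.toLowerCase(Locale.ROOT);
        // The "NFD" step: splits accented letters into base letter plus combining mark
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        // The final "Remove" step: drops combining marks, spaces and punctuation
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }
}
```

For example, lowerDecomposeStrip("Crème Brûlée") yields "cremebrulee". Cyrillic input would be removed completely here, which is why the original chain runs Any-Latin before the removal step.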

TextAnalysisService

Performs text analysis by utilizing the other services found in this module. It accepts those services as constructor arguments and uses them to perform all the necessary actions.

final TransliterationService ts = TransliterationServiceFactory.create();
final TermExtractionService tes = new TermExtractionService(true);
final TextCleanupService tcs = new TextCleanupService();

final TextAnalysisService service = new TextAnalysisService(tes, tcs, ts);
final AnalysedText analysedText = service.analyse("Text to analyse", true);

Thanks to

Michael McCandless and his awesome language detection test, which I have honestly used.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Contributors

kgusarov, shuyo
