Ruby Natural Language Processing Resources

A collection of Natural Language Processing (NLP) Ruby libraries, tools and software. Suggestions and contributions are welcome.

Categories

APIs

Client libraries to various 3rd party NLP API services.

  • alchemy_api - provides a client API library for AlchemyAPI's NLP services
  • aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API
  • napi-ruby - a simple Ruby wrapper for the Maluuba nAPI
  • poliqarpr - Ruby client for Poliqarp text corpus server
  • wlapi - Ruby-based API for the Wortschatz Leipzig project

Bitext Alignment

Bitext alignment is the process of aligning two parallel documents on a segment-by-segment basis. For example, given a document in English and its Spanish translation, bitext alignment matches each segment of document A with its corresponding translation in document B.

  • alignment - alignment functions for corpus linguistics (Gale-Church implementation)

Classification

Classification aims to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort.

  • Classifier - a general module to allow Bayesian and other types of classifications
  • Latent Dirichlet Allocation - used to automatically cluster documents into topics
  • liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
  • linnaeus - a redis-backed Bayesian classifier
  • maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
  • Naive-Bayes - simple Naive Bayes classifier
  • nbayes - a full-featured Ruby implementation of Naive Bayes
  • stuff-classifier - a library for classifying text into multiple categories
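
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Ruby. It is not the API of any gem listed above; the class name `TinyNaiveBayes`, its methods, and the training sentences are all made up for illustration, and it assumes at least one document has been trained before `classify` is called.

```ruby
# A toy multinomial Naive Bayes classifier with add-one (Laplace) smoothing.
# Hypothetical illustration only; not taken from any library in this list.
class TinyNaiveBayes
  def initialize
    # label => { word => count }
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    # label => number of training documents
    @doc_counts = Hash.new(0)
    # set of all words seen during training
    @vocabulary = {}
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest log-probability score.
  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      score = Math.log(@doc_counts[label] / total_docs)
      words.each do |word|
        count = @word_counts[label][word]
        score += Math.log((count + 1.0) / (total_words + @vocabulary.size))
      end
      [label, score]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  # Lowercase the text and keep runs of ASCII letters and apostrophes.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyNaiveBayes.new
classifier.train(:spam, "win cash now, claim your free prize")
classifier.train(:spam, "free cash offer, click now")
classifier.train(:ham,  "meeting notes attached for tomorrow")
classifier.train(:ham,  "lunch tomorrow after the meeting")

classifier.classify("free prize, click now")  # => :spam
classifier.classify("notes from the meeting") # => :ham
```

The libraries above add what this sketch leaves out, such as persistence, better tokenization, and support for other algorithms.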

Date and Time

  • Chronic - a pure Ruby natural language date parser
  • Chronic Between - a simple Ruby natural language parser for date and time ranges
  • Chronic Duration - a simple Ruby natural language parser for elapsed time
  • Kronic - a dirt simple library for parsing and formatting human readable dates
  • Nickel - extracts date, time, and message information from naturally worded text
  • Tickle - a natural language parser for recurring events

Error Correction

  • Chat Correct - shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence
  • gingerice - Ruby wrapper for correcting spelling and grammar mistakes based on the context of complete sentences

Full-Text Search

  • ferret - an information retrieval library in the same vein as Apache Lucene
  • ranguba - a project to provide a full-text search system built on Groonga

Keyword Ranking

  • graph-rank - Ruby implementation of the PageRank and TextRank algorithms
  • highscore - find and rank keywords in text

Language Detection

  • Detect Language API Client - detects language of given text and returns detected language codes and scores
  • whatlanguage - a language detection library for Ruby that uses bloom filters for speed

Machine Learning

  • Decision Tree - a Ruby library that implements the ID3 (information gain) algorithm for decision tree learning
  • rb-libsvm - implementation of SVM, a machine learning and classification algorithm
  • RubyFann - a Ruby gem that binds to FANN (Fast Artificial Neural Network) from within a Ruby/Rails environment

Machine Translation

Miscellaneous

  • gibber - replaces text with nonsensical Latin, with a maximum size difference of +/- 30%
  • hiatus - a localization QA tool
  • Naturally - Natural (version number) sorting with support for legal document numbering, college course codes, and Unicode
  • rwordnet - a pure Ruby interface to the WordNet lexical/semantic database
  • twitter-text - gem that provides text processing routines for Twitter Tweets

Multipurpose Tools

The following are libraries that integrate multiple NLP tools or functionality.

Ngrams

  • N-Gram - N-Gram generator in Ruby
  • ngram - break words and phrases into ngrams
  • raingrams - a flexible and general-purpose ngrams library written in Ruby
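
As a rough illustration of what these libraries do, n-grams can be generated in plain Ruby with `Enumerable#each_cons`. The helper name `ngrams` below is hypothetical and does not mirror the API of any gem listed above.

```ruby
# Return every contiguous run of n items from a list of tokens.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

# Word bigrams.
words = "the quick brown fox".split
bigrams = ngrams(words, 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]

# Character trigrams of a single word.
char_trigrams = ngrams("ruby".chars, 3).map(&:join)
# => ["rub", "uby"]
```

When the input has fewer than n tokens, the result is an empty array.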

Parsers

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.

  • linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
  • rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm

Part-of-Speech Taggers

  • engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
  • rbtagger - a simple Ruby rule-based part-of-speech tagger
  • TreeTagger for Ruby - Ruby-based wrapper for the TreeTagger by Helmut Schmid

Readability

  • lingua - Lingua::EN::Readability is a Ruby module which calculates statistics on English text

Regular Expressions

Ruby NLP Presentations

Sentence Segmentation

Sentence segmentation (aka sentence boundary disambiguation, sentence boundary detection) is the problem in natural language processing of deciding where sentences begin and end. Sentence segmentation is the foundation of many common NLP tasks (machine translation, bitext alignment, summarization, etc.).
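
The sketch below shows why this is a disambiguation problem rather than a simple split on punctuation. It uses a deliberately naive rule (a sentence ends at `.`, `!` or `?` followed by whitespace and an uppercase ASCII letter); the method name `naive_sentences` is hypothetical.

```ruby
# Naive rule: split where sentence-final punctuation is followed by
# whitespace and an uppercase ASCII letter.
def naive_sentences(text)
  text.strip.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

naive_sentences("It works. Does it? Yes!")
# => ["It works.", "Does it?", "Yes!"]

# The rule breaks on abbreviations: "Dr." is treated as a full sentence.
naive_sentences("Dr. Smith arrived. He sat down.")
# => ["Dr.", "Smith arrived.", "He sat down."]
```

A real segmenter has to handle abbreviations, initials, numbers, quotations, and language-specific punctuation, which a rule this simple cannot do.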

Stemmers

Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form.

  • Ruby-Stemmer - exposes the Snowball API to Ruby
  • uea-stemmer - a conservative stemmer for search and indexing
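
For a feel of what stemming involves, here is a toy suffix-stripping sketch in plain Ruby. It is not the Snowball, Porter, or UEA algorithm; the suffix list, the minimum stem length of three characters, and the name `naive_stem` are all assumptions chosen for illustration.

```ruby
# Suffixes to strip, tried in this order (longer ones first).
SUFFIXES = %w[ingly edly ing ed ly es s].freeze

# Strip the first matching suffix, provided at least three characters remain.
def naive_stem(word)
  word = word.downcase
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0...-suffix.length] : word
end

naive_stem("jumping") # => "jump"
naive_stem("jumped")  # => "jump"
naive_stem("jumps")   # => "jump"
naive_stem("sing")    # => "sing" (too short to strip "ing")
naive_stem("running") # => "runn" (a real stemmer would handle the doubled consonant)
```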

Stop Words

  • clarifier
  • stopwords - really just a list of stopwords with some helpers
  • Stopwords Filter - a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence
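
A stopword filter of the kind described above can be sketched in a few lines of plain Ruby. The word list and the name `remove_stopwords` are hypothetical and far smaller than what the gems above ship with.

```ruby
require "set"

# A tiny illustrative stopword list; real lists contain hundreds of entries.
STOPWORDS = Set.new(%w[a an and are is of the to in that it]).freeze

# Drop every word whose lowercased, punctuation-free form is a stopword.
def remove_stopwords(sentence, stopwords = STOPWORDS)
  kept = sentence.split.reject do |word|
    stopwords.include?(word.downcase.gsub(/[^a-z']/, ""))
  end
  kept.join(" ")
end

remove_stopwords("The cat is on the mat")
# => "cat on mat"
```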

Summarization

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

  • ots - Ruby bindings to Open Text Summarizer
  • summarize - Ruby C wrapper for Open Text Summarizer

Text Extraction

  • Ruby Readability - a tool for extracting the primary readable content of a webpage
  • Yomu - a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit

Text Similarity

  • FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
  • fuzzy-string-match - fuzzy string matching library for Ruby
  • Going the Distance - contains scripts that do various distance calculations
  • hotwater - Fast Ruby FFI string edit distance algorithms
  • levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
  • TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
  • tf-idf-similarity - calculate the similarity between texts using tf*idf
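
Several of the libraries above compute string edit distance. As a reference point, here is a minimal pure-Ruby sketch of the plain Levenshtein distance (insertions, deletions, and substitutions only; it does not count transpositions as the Damerau variant does). The method name `levenshtein` is an assumption and not the API of any listed gem.

```ruby
# Plain Levenshtein distance using two rows of the dynamic programming table.
def levenshtein(a, b)
  # Distance from the empty prefix of a to each prefix of b.
  previous = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution or match
      ].min
    end
    previous = current
  end

  previous.last
end

levenshtein("kitten", "sitting") # => 3
levenshtein("ruby", "ruby")      # => 0
```

The FFI-based gems listed above implement the same kind of computation in C, which matters when comparing many or long strings.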

Tokenizers

  • Jieba - Chinese tokenizer and segmenter (JRuby)
  • MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
  • NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
  • rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
  • thailang4r - Thai tokenizer
  • tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
  • tokenizer - a simple multilingual tokenizer

Word Count

  • wc - a rubygem to count word occurrences in a given text
  • word_count - a word counter for String and Hash in Ruby
  • Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
  • WordsCounted - a highly customisable Ruby text analyser
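
Counting word occurrences is straightforward in plain Ruby, as the sketch below suggests. The method name `word_frequencies` and its tokenization rule (runs of letters and apostrophes, lowercased) are assumptions; a different rule gives different counts, which is the kind of discrepancy Word Count Analyzer looks for.

```ruby
# Count how often each lowercased word appears in the text.
def word_frequencies(text)
  text.downcase.scan(/[[:alpha:]']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

word_frequencies("The cat saw the other cat.")
# => {"the"=>2, "cat"=>2, "saw"=>1, "other"=>1}

# Hyphenated words are split under this rule, so "well-known" counts as two words.
word_frequencies("well-known")
# => {"well"=>1, "known"=>1}
```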
