methodius's Introduction

Methodius (an NGram utility)

A utility for analyzing frequency of text chunks on the web.

Supply a bit o' text to the Methodius class, and let it determine your bigrams, trigrams, ngrams, letter-frequencies, word frequencies, bigram relationships, and create ngram trees.

Hippocratic License HL3-LAW-MEDIA-MIL-SOC-SV

Example

const { Methodius } = require('methodius');
// or import { Methodius } from 'methodius';

const udhr1 = `
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
`;
const nGrams = new Methodius(udhr1);

const topLetters = nGrams.getTopLetters(10);
const topWords = nGrams.getTopWords(10);

API

Methodius

Global Class

new Methodius(text)

Parameters

name type Description
text string raw text to be analyzed

Static Members

Punctuations

Characters to ignore when analyzing text: period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, some spaces.

\\.,;:!?‽¡¿⸘()\\[\\]{}<>’'…\"\n\t\r

wordSeparators

Characters to ignore AND CONSUME when trying to find words: em-dash, period, comma, semicolon, colon, bang, question mark, interrobang, Spanish bang+, parens, bracket, brace, single quote, space.

—\\.,;:!?‽¡¿⸘()\\[\\]{}<>…"\\s

Static Methods

hasPunctuation(string)

determines if string contains punctuation

Parameters

name type Description
string string

Returns boolean

hasSymbols(string)

determines if string contains symbols

Parameters

name type Description
string string

Returns boolean

hasSpace(string)

determines if a string has a space

Parameters

name type Description
string string

Returns boolean

sanitizeText(string)

lowercases text and removes diacritics and other characters that would throw off n-gram analysis

Parameters

name type Description
string string

Returns string
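
As an illustration of the kind of cleanup described here, the following is a minimal sketch, not the library's own implementation. The helper name `sketchSanitize` is hypothetical, and the sketch assumes that Unicode NFD normalization followed by removal of combining marks is an acceptable way to drop diacritics. It does not attempt to strip any other characters.

```javascript
// Hypothetical helper for illustration only; Methodius.sanitizeText may differ.
function sketchSanitize(text) {
  return text
    .toLowerCase()
    .normalize('NFD') // decompose accented letters into base letter + combining mark
    .replace(/\p{M}/gu, ''); // drop the combining marks, keeping the base letters
}

console.log(sketchSanitize('Crème Brûlée')); // creme brulee
```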

getWords(text)

extracts an array of words from a string

Parameters

name type Description
text string

Returns Array<string>

getNGrams(text, gramSize)

gets ngrams from text

Parameters

name type Description
text string
gramSize Number Default = 2

Returns Array<string>
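
A sliding window is one common way to extract letter n-grams. The sketch below is an assumption about the general technique rather than the library's code: `sketchNGrams` is a hypothetical name, and the choice to split on whitespace and keep n-grams inside word boundaries is an assumption.

```javascript
// Hypothetical sketch: slide a window of gramSize letters across each word.
function sketchNGrams(text, gramSize = 2) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (const word of words) {
    // Words shorter than gramSize contribute nothing.
    for (let i = 0; i + gramSize <= word.length; i += 1) {
      ngrams.push(word.slice(i, i + gramSize));
    }
  }
  return ngrams;
}

console.log(sketchNGrams('the nation', 3)); // [ 'the', 'nat', 'ati', 'tio', 'ion' ]
```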

getMeanWordSize(wordArray)

Gets average size of a word

Parameters

name type Description
wordArray string[]

Returns number

getMedianWordSize(wordArray)

Gets the median (middle) size of a word

Parameters

name type Description
wordArray string[]

Returns number
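
To make the difference between the mean and the median concrete, here is a small sketch. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: an empty array yields 0, and an even number of words yields the average of the two middle sizes.

```javascript
// Hypothetical sketches; the library's handling of edge cases may differ.
function sketchMeanWordSize(wordArray) {
  if (wordArray.length === 0) return 0;
  const total = wordArray.reduce((sum, word) => sum + word.length, 0);
  return total / wordArray.length;
}

function sketchMedianWordSize(wordArray) {
  if (wordArray.length === 0) return 0;
  const sizes = wordArray.map((word) => word.length).sort((a, b) => a - b);
  const mid = Math.floor(sizes.length / 2);
  // Odd count: the middle value. Even count: average of the two middle values.
  return sizes.length % 2 === 1 ? sizes[mid] : (sizes[mid - 1] + sizes[mid]) / 2;
}

const sample = ['all', 'human', 'beings', 'are'];
console.log(sketchMeanWordSize(sample)); // 4.25
console.log(sketchMedianWordSize(sample)); // 4
```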

getWordNGrams(text, gramSize)

Gets word n-grams from text (2-word pairs by default).

Note: This doesn't use sentence punctuation as a boundary. Should it?

Parameters

name type Description
text string
gramSize number default=2

Returns Array<string>
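
The note about sentence boundaries can be seen in a small sketch. This is not the library's code: `sketchWordNGrams` is a hypothetical name, and joining each group of words with a single space is an assumption made so the result fits an array of strings.

```javascript
// Hypothetical sketch: word n-grams that ignore sentence punctuation as a boundary.
function sketchWordNGrams(text, gramSize = 2) {
  // Split on anything that is not a letter, number, or apostrophe.
  const words = text.toLowerCase().split(/[^\p{L}\p{N}']+/u).filter(Boolean);
  const grams = [];
  for (let i = 0; i + gramSize <= words.length; i += 1) {
    grams.push(words.slice(i, i + gramSize).join(' '));
  }
  return grams;
}

// 'free and' spans the period, because punctuation is not treated as a boundary.
console.log(sketchWordNGrams('Born free. And equal!'));
// [ 'born free', 'free and', 'and equal' ]
```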

getFrequencyMap(ngramArray)

converts an array of strings into a map of those strings and their number of occurrences

Parameters

name type Description
ngramArray Array<string>

Returns Map<string, number>
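
Counting occurrences into a Map might look like the following sketch. The name `sketchFrequencyMap` is hypothetical and the code is only an illustration of the idea, not the library's implementation.

```javascript
// Hypothetical sketch: count how often each string occurs.
function sketchFrequencyMap(ngramArray) {
  const counts = new Map();
  for (const ngram of ngramArray) {
    counts.set(ngram, (counts.get(ngram) ?? 0) + 1);
  }
  return counts;
}

const counts = sketchFrequencyMap(['ti', 'io', 'on', 'ti', 'io', 'on', 'na']);
console.log(counts.get('ti')); // 2
console.log(counts.get('na')); // 1
```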

getPercentMap(frequencyMap)

converts a frequency map into a map of percentages

Parameters

name type Description
frequencyMap Map<string, number>

Returns Map<string, number>
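
One way to turn counts into shares of the total is sketched below. The name `sketchPercentMap` is hypothetical, and the scale is an assumption: this sketch returns fractions between 0 and 1, while the library may express percentages differently.

```javascript
// Hypothetical sketch: each count divided by the sum of all counts.
function sketchPercentMap(frequencyMap) {
  let total = 0;
  for (const count of frequencyMap.values()) total += count;

  const percents = new Map();
  for (const [key, count] of frequencyMap) {
    percents.set(key, total === 0 ? 0 : count / total);
  }
  return percents;
}

const shares = sketchPercentMap(new Map([['e', 3], ['t', 1]]));
console.log(shares.get('e')); // 0.75
console.log(shares.get('t')); // 0.25
```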

getTopGrams(frequencyMap, limit)

filters a frequency map into only a small subset of the most frequent ones

Parameters

name type Description
frequencyMap Map<string, number>
limit number default=20

Returns Map<string, number>
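
Selecting the most frequent entries can be sketched as a sort followed by a slice. `sketchTopGrams` is a hypothetical name, and how ties are ordered is an assumption (here, ties keep their original insertion order).

```javascript
// Hypothetical sketch: keep only the `limit` entries with the highest counts.
function sketchTopGrams(frequencyMap, limit = 20) {
  const sorted = [...frequencyMap.entries()].sort((a, b) => b[1] - a[1]);
  return new Map(sorted.slice(0, limit));
}

const top = sketchTopGrams(new Map([['th', 5], ['zq', 1], ['he', 9], ['in', 3]]), 2);
console.log([...top.keys()]); // [ 'he', 'th' ]
```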

getIntersection(iterable1, iterable2)

returns an array of items that occur in both iterables

Parameters

name type Description
iterable1 `Map | Array`
iterable2 `Map | Array`

Returns Array<any> An array of items that occur in both iterables. It will compare the keys, if sent a map

getUnion(iterable1, iterable2)

Returns an array that is the union of two iterables

Parameters

name type Description
iterable1 `Map | Array`
iterable2 `Map | Array`

Returns Array<any> A union of the items from both iterables.

getDisjunctiveUnion(iterable1, iterable2)

returns an array of arrays of the unique items in either iterable

Parameters

name type Description
iterable1 `Map | Array`
iterable2 `Map | Array`

Returns Array<Array<any>> An array of arrays of the unique items. The first item holds the items unique to the first parameter, the second item those unique to the second parameter.
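
The three set operations above can be sketched with plain `Set` objects. All names here are hypothetical and this is not the library's code. The one behavior taken from the descriptions above is that a Map is compared by its keys.

```javascript
// Hypothetical sketches of intersection, union, and disjunctive union.
// A Map contributes its keys; any other iterable contributes its items.
const toItems = (iterable) => (iterable instanceof Map ? [...iterable.keys()] : [...iterable]);

function sketchIntersection(iterable1, iterable2) {
  const second = new Set(toItems(iterable2));
  return [...new Set(toItems(iterable1))].filter((item) => second.has(item));
}

function sketchUnion(iterable1, iterable2) {
  return [...new Set([...toItems(iterable1), ...toItems(iterable2)])];
}

function sketchDisjunctiveUnion(iterable1, iterable2) {
  const first = new Set(toItems(iterable1));
  const second = new Set(toItems(iterable2));
  return [
    [...first].filter((item) => !second.has(item)), // only in the first
    [...second].filter((item) => !first.has(item)), // only in the second
  ];
}

const english = ['th', 'he', 'in'];
const german = new Map([['he', 4], ['er', 2]]);
console.log(sketchIntersection(english, german)); // [ 'he' ]
console.log(sketchUnion(english, german)); // [ 'th', 'he', 'in', 'er' ]
console.log(sketchDisjunctiveUnion(english, german)); // [ [ 'th', 'in' ], [ 'er' ] ]
```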

getComparison(iterable1, iterable2)

returns a map containing various comparisons between two iterables

Parameters

name type Description
iterable1 `Map | Array`
iterable2 `Map | Array`

Returns Map<string, Array> A map containing various comparisons between two iterables. Each comparison is some kind of array (see getIntersection or getDisjunctiveUnion).

getWordPlacementForNGram(ngram, wordsArray)

determines the placement of a single ngram in an array of words

Parameters

name type Description
ngram string
wordsArray Array<string>

Returns Map<string, number> a map with the keys 'start', 'middle', and 'end' whose values correspond to how often the provided ngram occurs in this position
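
The start/middle/end tally could be computed along the lines of the sketch below. `sketchPlacement` is a hypothetical name, and the classification rules are assumptions: an occurrence at index 0 counts as 'start' (even when the ngram is the whole word), an occurrence that finishes the word counts as 'end', and everything else counts as 'middle'. The library may draw these lines differently.

```javascript
// Hypothetical sketch: tally where an ngram occurs within each word.
function sketchPlacement(ngram, wordsArray) {
  const placements = new Map([['start', 0], ['middle', 0], ['end', 0]]);
  if (!ngram) return placements; // nothing to search for

  for (const word of wordsArray) {
    let index = word.indexOf(ngram);
    while (index !== -1) {
      let position = 'middle';
      if (index === 0) position = 'start';
      else if (index + ngram.length === word.length) position = 'end';
      placements.set(position, placements.get(position) + 1);
      index = word.indexOf(ngram, index + 1); // look for further occurrences
    }
  }
  return placements;
}

const placement = sketchPlacement('on', ['on', 'nation', 'monotone', 'only']);
console.log([...placement.entries()]);
// [ [ 'start', 2 ], [ 'middle', 2 ], [ 'end', 1 ] ]
```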

getWordPlacementForNGrams(ngrams, wordsArray)

determines the placement of ngrams in an array of words

Parameters

name type Description
ngrams Array<string>
wordsArray Array<string>

Returns Map<string, Map<string, number>> a map with the key of the ngram, and the value that is a map containing start, middle, end

getNgramCollections(wordArray, ngramSize)

gets ngrams from an array of words

Parameters

name type Description
wordArray Array<string> an array of words
ngramSize number default = 2. The size of the ngrams to return

Returns Array<Array<string>> An array containing arrays of ngrams, each array corresponds to a word.

getNgramSiblings(searchText, ngramCollections, siblingSize)

using a collection returned from getNgramCollections, searches for a string and returns what comes before and after it

Parameters

name type Description
searchText string the string to search for
ngramCollections `Array<Array<string>>`
siblingSize number default = 1. How many siblings to find in front or behind

Returns Map<'before'|'after',Map<string, number>> a Map with the keys 'before' and 'after' which contain maps of what comes before and after

Example

        const words = ['revolution', 'nation'];
        const ngramCollections = Methodius.getNgramCollections(words, 2);
        const ioSiblings = Methodius.getNgramSiblings('io', ngramCollections);
        /*
        new Map([
          ['before', new Map([
            ['ti', 2]
          ])],
          ['after', new Map([
            ['on', 2]
          ])]
        ])
        */
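
To show why the example above yields 'ti' and 'on', here is a self-contained sketch of the lookup. The function names are hypothetical and this is not the library's implementation. It assumes that each word's ngrams are kept in order and that `siblingSize` means how many neighboring ngrams to collect on each side.

```javascript
// Hypothetical sketch: per-word ngram arrays, then neighbors of a search string.
function sketchCollections(wordArray, ngramSize = 2) {
  return wordArray.map((word) => {
    const grams = [];
    for (let i = 0; i + ngramSize <= word.length; i += 1) {
      grams.push(word.slice(i, i + ngramSize));
    }
    return grams;
  });
}

function sketchSiblings(searchText, ngramCollections, siblingSize = 1) {
  const before = new Map();
  const after = new Map();
  const bump = (map, key) => map.set(key, (map.get(key) ?? 0) + 1);

  for (const grams of ngramCollections) {
    grams.forEach((gram, index) => {
      if (gram !== searchText) return;
      for (let offset = 1; offset <= siblingSize; offset += 1) {
        if (index - offset >= 0) bump(before, grams[index - offset]);
        if (index + offset < grams.length) bump(after, grams[index + offset]);
      }
    });
  }
  return new Map([['before', before], ['after', after]]);
}

const collections = sketchCollections(['revolution', 'nation'], 2);
const siblings = sketchSiblings('io', collections);
console.log(siblings.get('before').get('ti')); // 2
console.log(siblings.get('after').get('on')); // 2
```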

getRelatedNgrams(words, ngrams, ngramSize)

Gets the ngrams that will occur before or after other ngrams. Useful for finding patterns of ngrams.

Parameters

name type Description
words Array<string> an array of words to evaluate
ngrams Map<string, number> a frequency map of ngrams
ngramSize number default = 2. the size of the ngram

Returns

Map<string, number> A frequency map of how often ngrams occurred before or after other ngrams

Example

This requires several steps. You'll need an array of words and a frequency map of ngrams.

    // Static methods are called on the Methodius class
    const ngrams = Methodius.getNGrams('the revolution of the nation was on television. It was about pollution and the terrible situation', 2);
    const frequencyMap = Methodius.getFrequencyMap(ngrams);
    const topNgrams = Methodius.getTopGrams(frequencyMap, 5);
    const words = ['the', 'revolution', 'of', 'the', 'nation', 'was', 'on', 'television', 'it', 'was', 'about', 'pollution', 'and', 'the', 'terrible', 'situation'];
    const relatedNgrams = Methodius.getRelatedNgrams(words, topNgrams, 2);

getNgramTreeCollection(words)

Gets a nested map of maps that breaks down unique words into their smallest ngrams

Parameters

name type Description
words Array<string> an array of words to evaluate

Returns

Map<string, Array<string> | Map<string, Array | string>> A nested map of maps that breaks down unique words into their smallest ngrams.

Instance Members

sanitizedText

lowercased text with diacritics removed

string

letters

an array of letters in the text

Array<string>

words

an array of words in the text

Array<string>

bigrams

an array of letter bigrams in the text

Array<string>

trigrams

an array of letter trigrams in the text

Array<string>

uniqueLetters

an array of unique letters in the text

Array<string>

uniqueBigrams

an array of unique bigrams in the text

Array<string>

uniqueTrigrams

an array of unique trigrams in the text

Array<string>

letterPositions

a map of placements of letters within words

Map<string, Map<string, number>>

bigramPositions

a map of placements of bigrams within words

Map<string, Map<string, number>>

trigramPositions

a map of placements of trigrams within words

Map<string, Map<string, number>>

uniqueWords

an array of unique words in the text

Array<string>

letterFrequencies

a map of letter frequencies in the sanitized text

Map<string, number>

bigramFrequencies

a map of bigram frequencies in the sanitized text

Map<string, number>

trigramFrequencies

a map of trigram frequencies in the sanitized text

Map<string, number>

wordFrequencies

a map of word frequencies in the sanitized text

Map<string, number>

letterPercentages

a map of letter percentages in the sanitized text

Map<string, number>

bigramPercentages

a map of bigram percentages in the sanitized text

Map<string, number>

trigramPercentages

a map of trigram percentages in the sanitized text

Map<string, number>

wordPercentages

a map of word percentages in the sanitized text

Map<string, number>

meanWordSize

The average size of a word

number

medianWordSize

The middle size of a word

number

ngramTreeCollection

A nested map of maps that breaks down unique words into their smallest ngrams.

Instance Methods

getLetterNGrams(size)

gets an array of letter ngrams of a customizable size from the text

Parameters

name type Description
size number default = 2 size of the n-gram to return

Returns Array<string>

getTopLetters(limit)

a map of the most used letters in the text

Parameters

name type Description
limit number default = 20 number of top letters to return

Returns Map<string, number>

getTopBigrams(limit)

a map of the most used bigrams in the text

Parameters

name type Description
limit number default = 20 number of top bigrams to return

Returns Map<string, number>

getTopTrigrams(limit)

a map of the most used trigrams in the text

Parameters

name type Description
limit number default = 20 number of top trigrams to return

Returns Map<string, number>

getTopWords(limit)

a map of the most used words in the text

Parameters

name type Description
limit number default = 20 number of top words to return

Returns Map<string, number>

compareTo(methodius)

Compare this methodius instance to another

Parameters

name type Description
methodius Methodius another Methodius instance

Returns Map<string, Map> A map of property names and their comparisons (intersection, disjunctiveUnions, etc) for a set of properties

getRelatedTopNgrams(ngramSize, limit)

Gets the ngrams that will occur before or after other ngrams based on what the most frequent ngrams are. Useful for finding patterns of ngrams.

Parameters

name type Description
ngramSize number default = 2. the size of the ngram
limit number default = 20. the number of top ngrams to use

Returns

Map<string, number> A frequency map of how often the most common ngrams occurred before or after other common ngrams

methodius's People

Contributors

paceaux

methodius's Issues

Numbers are treated as ngrams / words

When analyzing the text of the Great Gatsby, Methodius captured some things it interpreted as ngrams which were just numbers.

We need a test to eliminate numbers.

diacritic opt-in for frequencies

Right now n-grams work by stripping diacritics; all frequencies and results lose their accent marks.

We should at least offer the ability to opt in to diacritics.
Seeing confusing results on the demo:

In Portuguese, ão is listed, but we can't highlight it (a problem with the demo, but still)

With French, we have éèê to account for and it may be useful to see this separate.

Word bigrams

Create properties/ methods for getting word bigrams. Maybe?

put a getNGrams on the member

Right now the class doesn't have many methods, but one thing it lacks is the ability to choose an arbitrary ngram (e.g. quadrigram). Create a member method for this.

Certain punctuation marks being considered bigrams

After analyzing Alice in Wonderland and Huckleberry Finn, some unusual bigrams popped up:

{
    "‘e": 1,
    "‘i": 10,
    "“‘": 3,
    "‘—": 1,
    "“_": 18,
    "_e": 3,
    "—c": 3,
    "‘l": 1,
    "‘s": 1,
    "_”": 11,
    "w-": 1,
    "n—": 12,
    "—e": 10,
    "n—": 57,
    "“_": 54,
    "_m": 53,
    "e-": 115,
    "-y": 11,
    "_”": 32,
    "‘_": 1,
    "_-": 2,
    "‘t": 4,
    "m”": 2,
    "w”": 1,
    "‘h": 1,
    "l”": 1,
    "tb": 1,
    "lh": 1,
    "‘e": 1,
}
  1. Need to analyze the texts and see what the context is for these cases
  2. Determine whether we can just do some sort of normalization (which is probably the case for the quotes)
  3. Otherwise, exclude things like - and _

Remove Rollup. Consider ESBuild?

According to a few different folks, Rollup may be overkill. Remove it.

Possibly consider ESBuild? Leave Babel in place, though. Maybe.

n-gram combinations

Need ability to discover n-grams that commonly occur with other n-grams.

i.e. How would I discover "tion"

e.g.
"ion"
"tio"
nation => nat ati tio ion
vacation => vac aca cat ati tio ion
station => sta tat ati tio ion
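
One possible approach to this request, offered only as a sketch: count how often two trigrams sit next to each other in a word, and merge each overlapping pair into the four-letter chunk it spells. The name `sketchAdjacentMerges` is hypothetical and nothing here reflects how the library eventually solved it.

```javascript
// Hypothetical sketch: merge neighboring ngrams and count the merged chunks.
function sketchAdjacentMerges(words, size = 3) {
  const merges = new Map();
  for (const word of words) {
    const grams = [];
    for (let i = 0; i + size <= word.length; i += 1) {
      grams.push(word.slice(i, i + size));
    }
    for (let i = 0; i + 1 < grams.length; i += 1) {
      // Neighbors overlap by size - 1 letters, so adding the last letter
      // of the second ngram yields the combined (size + 1)-letter chunk.
      const merged = grams[i] + grams[i + 1].slice(-1);
      merges.set(merged, (merges.get(merged) ?? 0) + 1);
    }
  }
  return merges;
}

const merges = sketchAdjacentMerges(['nation', 'vacation', 'station']);
console.log(merges.get('tion')); // 3 ('tio' followed by 'ion' in every word)
console.log(merges.get('atio')); // 3
console.log(merges.get('nati')); // 1
```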
