
nlp's Introduction

Natural Language Processing

License: MIT

nlp

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing, with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.
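
As a quick taste of the API, here is a minimal sketch of the count/TF-IDF stage of the pipeline. Constructor signatures have varied between releases (older versions of NewCountVectoriser took a bool to strip stop words, as in the issue example further below), so treat the details as illustrative:

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"the cow jumped over the moon",
	}

	// Fit a vocabulary and produce a term-document count matrix,
	// then re-weight the raw counts with TF-IDF.
	vectoriser := nlp.NewCountVectoriser()
	transformer := nlp.NewTfidfTransformer()

	counts, err := vectoriser.FitTransform(corpus...)
	if err != nil {
		panic(err)
	}
	tfidf, err := transformer.FitTransform(counts)
	if err != nil {
		panic(err)
	}

	terms, docs := tfidf.Dims()
	fmt.Printf("%d terms x %d documents\n", terms, docs)
}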


Features

Planned

  • Expanded persistence support
  • Stemming, to treat words with a common root as the same, e.g. "go" and "going"
  • Clustering algorithms, e.g. hierarchical, k-means, etc.
  • Classification algorithms, e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. "Latent Semantic Indexing: An Overview." INFOSYS 240, Spring 2000.
  2. Landauer, Tom. "Latent Semantic Analysis." Scholarpedia article on LSA written by one of the creators of LSA.
  3. Thomo, Alex. "Latent Semantic Analysis (Tutorial)."
  4. "Latent Semantic Indexing." Stanford NLP course.
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms." In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC '02), 2002, p. 380.
  6. Bawa, M., Condie, T. and Ganesan, P. "LSH Forest: Self-Tuning Indexes for Similarity Search." In Proceedings of the 14th International Conference on World Wide Web (WWW '05), 2005, p. 651.
  7. Gionis, A., Indyk, P. and Motwani, R. "Similarity Search in High Dimensions via Hashing." In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), vol. 99, no. 1, 1999, pp. 518–529.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). "Random Indexing of Text Samples for Latent Semantic Analysis."
  9. Rangan, Venkat. "Discovery of Related Terms in a Corpus Using Reflective Random Indexing."
  10. Vasuki, Vidya and Cohen, Trevor. "Reflective Random Indexing for Semi-Automatic Indexing of the Biomedical Literature."
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. "Random Indexing Explained with High Probability."
  12. Foulds, James, Boyles, Levi, DuBois, Christopher, Smyth, Padhraic and Welling, Max (2013). "Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation."

nlp's People

Contributors

  • augier
  • goldbattle
  • james-bowman
  • johnfrankmorgan
  • recoilme


nlp's Issues

LDA model persistence

Thanks for this library; it seems really useful.
I have been playing around a bit with a feature-extraction pipeline of a CountVectoriser and a TF-IDF transformer feeding into an LDA transformer, but I can't seem to save the fitted pipeline to disk and reload it later to Transform new docs. Looking at the serialised pipeline in JSON, the vocabulary is there, as well as the tokeniser info and various LDA params, but I don't see the induced topics (matrices). Maybe this is a problem with the way I serialised it? If you can point to a working example of how to properly serialise a trained LDA model and re-use it later, that would be great.
Thanks again!
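
For anyone else hitting this: one likely culprit is that encoding/json only serialises exported struct fields, so any model state held in unexported fields (such as learned topic matrices) is silently dropped. A generic illustration of the failure mode (ToyModel is made up for this sketch and is unrelated to nlp's actual types):

package main

import (
	"encoding/json"
	"fmt"
)

// ToyModel is a stand-in for a fitted model: the vocabulary is an
// exported field, but the learned topic weights are unexported.
type ToyModel struct {
	Vocabulary map[string]int // survives a JSON round trip
	topics     []float64      // silently dropped by encoding/json
}

func main() {
	m := ToyModel{
		Vocabulary: map[string]int{"fox": 0, "dog": 1},
		topics:     []float64{0.7, 0.3},
	}
	b, _ := json.Marshal(m)
	fmt.Println(string(b)) // {"Vocabulary":{"fox":0,"dog":1}} and no topics
}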

Interface for Tokeniser, Allow Custom Tokenisers?

Hi James,

Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

What I'm imagining is this (note, the workflow is designed to avoid breaking API changes):

  • Convert Tokeniser to an interface, providing ForEachIn and Tokenise methods.
  • Convert NewTokeniser to a method that returns a default implementation, which would be identical to the current implementation.
  • Add a new method, NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser, which would enable easy creation of a custom tokeniser.
  • Add new constructors for CountVectoriser and HashingVectoriser to allow use of a custom Tokeniser, OR (your preference) make their vec.tokeniser field into a public field vec.Tokeniser, allowing overrides or manual construction of either.

I could probably make the required changes quickly enough, if you're interested. :)
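
To make the proposal concrete, here is a sketch of the shape described above (the method names mirror the existing implementation; the exact signatures are of course up for discussion):

// Tokeniser, as an interface rather than a concrete type.
type Tokeniser interface {
	// ForEachIn invokes f for each token found in text.
	ForEachIn(text string, f func(token string))
	// Tokenise splits text into a slice of tokens.
	Tokenise(text string) []string
}

// NewTokeniser would return the current default implementation, and
// NewCustomTokeniser(tokenPattern string, stopWordList []string) would
// build a Tokeniser from a caller-supplied pattern and stop-word list.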

Punctuation and PoS tagging

I'm having trouble with punctuation in the PoS tagger. If I remove all punctuation (fairly standard), will the PoS tagger still work?

License?

This looks very interesting, but for anyone to reuse your code, the license must be clear.

Out of memory - What are sensible upper bounds?

I am using the NLP package for similarity search.

Input is a corpus of 300,000 entries of []string, each consisting of about 10-50 words.

When I naively follow the example at https://pkg.go.dev/github.com/james-bowman/nlp#example-package, it panics with an out-of-memory error.

I reduced the set to 10,000 entries, and FitTransform still takes a very long time (minutes) to complete on my laptop.

Am I doing something wrong, or is this not the correct way of using this package?
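
As a rough worked estimate of scale (assuming, purely for illustration, a vocabulary of about 100,000 distinct terms across the 300,000 documents): the term-document counts are manageable in sparse form, but any step that densifies the matrix, such as a truncated SVD, would need roughly 100,000 × 300,000 × 8 bytes ≈ 240 GB as dense float64, far beyond any laptop. Pruning the vocabulary, reducing the document count, or hashing to a fixed feature count are the usual ways to bound this.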

Online/streaming LDA?

Is it possible to run LDA (or other processing algorithms) in a streaming/online fashion, as is done with Gensim? It seems that this would not easily support online processing, but I thought I'd bounce the question off of you since you know the internals much better.
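
Worth noting: the README's reference list includes Foulds et al. (2013), "Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation", which is a stochastic mini-batch algorithm, so online training is at least plausible at the algorithm level; the gap is an exported incremental entry point. Purely as an illustration of the shape that could take (OnlineTopicModel and PartialFit are hypothetical and not part of the package):

// OnlineTopicModel is a hypothetical interface; the current nlp API
// exposes Fit/FitTransform over a full document slice instead.
type OnlineTopicModel interface {
	PartialFit(docs ...string) error
}

// trainStreaming consumes mini-batches of documents from a channel,
// updating the model incrementally as Gensim's online LDA does.
func trainStreaming(m OnlineTopicModel, batches <-chan []string) error {
	for batch := range batches {
		if err := m.PartialFit(batch...); err != nil {
			return err
		}
	}
	return nil
}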

func CosineSimilarity() returns NaN

Dear James Bowman,
I use your library to calculate similarity. The function CosineSimilarity() returns many NaN values, so I can't continue my work.
I have only changed your vectorisers.go file; all my changes are as follows.

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) { // function begins here
	// local variable renamed from mat to m so it no longer shadows the gonum mat package
	m := sparse.NewDOK(len(v.Vocabulary), len(docs))

	for d, doc := range docs {
		v.Tokeniser.ForEachIn(doc, func(word string) {
			i, exists := v.Vocabulary[word]

			if exists {
				weight, weightExists := TrainingData.WeightMap[word]
				// normal weight value: 2, unimportant weight value: 1, important weight value: 3
				if weightExists {
					m.Set(i, d, m.At(i, d)+weight)
				} else {
					m.Set(i, d, m.At(i, d)+1)
				}
			}
		})
	}
	return m.ToCSR(), nil
}
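
For anyone hitting this: cosine similarity is a·b / (|a||b|), so it returns NaN whenever either vector has zero norm, which happens whenever a document or the query contains no words from the fitted vocabulary. A minimal guard, sketched against gonum's mat package:

// safeCosine returns 0 instead of NaN when either vector has zero
// norm, e.g. a document or query with no in-vocabulary words.
// Vector, Norm and Dot come from gonum.org/v1/gonum/mat.
func safeCosine(a, b mat.Vector) float64 {
	normA := mat.Norm(a, 2)
	normB := mat.Norm(b, 2)
	if normA == 0 || normB == 0 {
		return 0
	}
	return mat.Dot(a, b) / (normA * normB)
}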

Vectorisers.go only tokenises a-z languages

Hi,

Thanks for your hard work. I really like your code base. I have noticed that the package, as is, only works for languages that can be expressed in a-z alphabets; in addition, the hardcoded stop words make it a bit challenging even for historic or fringe English corpora. I have a fix for both, but did not want to open a PR without creating this issue first to see if you want to open up the project to non-English, historic English, and non a-z languages.

Thanks again!

Best,

Thomas
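
For context on the fix: Go's regexp package (RE2) supports Unicode character classes, so a token pattern built on \p{L} rather than [a-zA-Z] already covers most alphabets. A quick self-contained illustration:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \p{L} matches any Unicode letter and \p{N} any Unicode number,
	// so this pattern tokenises far more than a-z alphabets.
	re := regexp.MustCompile(`[\p{L}\p{N}]+`)
	fmt.Println(re.FindAllString("the quick καφέ 123 廣州", -1))
	// Output: [the quick καφέ 123 廣州]
}

Languages written without spaces between words would still need a dedicated segmenter, which is exactly where pluggable tokenisation and configurable stop-word lists help.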

Methods for large corpora?

Sort of related to #8...

You have methods in the API, like in your example, that take a slice of strings (docs).

matrix, _ := vectoriser.FitTransform(testCorpus...)

I'd like to use this for very large corpora, with tens or hundreds of millions of (not tiny) documents. Putting these all into a single slice of strings does not sound optimal.
Any chance the methods that now take a slice-of-strings parameter for the documents could be altered to take a function or interface that allows iterating over all the docs? (Or new methods that support this?)
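
To make that concrete, something like the following purely hypothetical shape (neither name exists in the package today):

// DocumentSource is a hypothetical iterator over a corpus too large
// to hold in memory as a single slice of strings.
type DocumentSource interface {
	// Next returns the next document, with ok == false once exhausted.
	Next() (doc string, ok bool)
}

// A hypothetical streaming counterpart to FitTransform could then be:
//
//	func (v *CountVectoriser) FitFrom(src DocumentSource) error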

Thanks,
Glen

Example fails, possibly due to gonum/matrix being deprecated?

Hello,

When running a slightly modified version of your example, I receive the following error:

# github.com/james-bowman/nlp
../../go/src/github.com/james-bowman/nlp/vectorisers.go:163: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
        *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/vectorisers.go:221: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
        *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:43: impossible type assertion:
        *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:62: cannot use sparse.NewDIA(m, weights) (type *sparse.DIA) as type mat64.Matrix in assignment:
        *sparse.DIA does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use t.transform (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
        mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                have T() mat64.Matrix
                want T() mat.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use mat (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
        mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                have T() mat64.Matrix
                want T() mat.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:81: cannot use product (type *sparse.CSR) as type mat64.Matrix in return argument:
        *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix

The code of my modified example is below:

package main

import (
	"fmt"

	"github.com/gonum/matrix/mat64"
	"github.com/james-bowman/nlp"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser(true)
	transformer := nlp.NewTfidfTransformer()

	// set k (the number of dimensions following truncation) to 4
	reducer := nlp.NewTruncatedSVD(4)

	// Transform the corpus into an LSI fitting the model to the documents in the process
	mat, _ := vectoriser.FitTransform(testCorpus...)
	mat, _ = transformer.FitTransform(mat)
	lsi, _ := reducer.FitTransform(mat)

	// run the query through the same pipeline that was fitted to the corpus
	// to project it into the same dimensional space
	mat, _ = vectoriser.Transform(query)
	mat, _ = transformer.Transform(mat)
	queryVector, _ := reducer.Transform(mat)

	// iterate over document feature vectors (columns) in the LSI and compare with the
	// query vector for similarity.  Similarity is determined by the difference between
	// the angles of the vectors known as the cosine similarity
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := CosineSimilarity(queryVector.(*mat64.Dense).ColView(0), lsi.(*mat64.Dense).ColView(i))
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'", testCorpus[matched])
	// Output: Matched 'The quick brown fox jumped over the lazy dog'
} 

I see that gonum/matrix was deprecated a month ago in favor of gonum/gonum and wonder if that could be related.

Thanks very much for your help!
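
For reference, the errors come from mixing the deprecated github.com/gonum/matrix/mat64 with the newer gonum.org/v1/gonum/mat that current versions of nlp (via james-bowman/sparse, as the error messages show) build against. Below is a version of the example updated along those lines, as a sketch: constructor signatures have varied between nlp releases (recent ones take an optional stop-word list rather than a bool), and pairwise.CosineSimilarity is assumed to come from the package's measures/pairwise subpackage.

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser()
	transformer := nlp.NewTfidfTransformer()
	reducer := nlp.NewTruncatedSVD(4) // k = 4 dimensions after truncation

	// Fit the pipeline to the corpus and project it into LSI space.
	matrix, _ := vectoriser.FitTransform(testCorpus...)
	matrix, _ = transformer.FitTransform(matrix)
	lsi, _ := reducer.FitTransform(matrix)

	// Run the query through the same fitted pipeline.
	matrix, _ = vectoriser.Transform(query)
	matrix, _ = transformer.Transform(matrix)
	queryVector, _ := reducer.Transform(matrix)

	// Compare the query vector against each document column by cosine similarity.
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := pairwise.CosineSimilarity(
			queryVector.(mat.ColViewer).ColView(0),
			lsi.(mat.ColViewer).ColView(i),
		)
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'\n", testCorpus[matched])
	// Output: Matched 'The quick brown fox jumped over the lazy dog'
}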

OCR

This looks really nice. Thank you for making this open source.

I am attempting to do OCR.
I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR may have made a mistake.

That way, corrections can propagate back to the OCR system so it gets better.

There is also no reason why it couldn't use the semantic meaning of a sentence to correct the OCR. It's kind of one step up from just using single words.

I don't have it up on a git repo yet, but I figured it would be interesting to you.
If you feel like commenting on this idea, that would be great.

I am also really curious where you get data sources. For semantics you need training data, right?

[Question] Optimal number of topics in LDA

Hi!
I'm planning to use the LDA functionality and, as I read (I'm very new to the matter), Gensim has a coherence score that can be used to determine that (magic) key number. Is any similar functionality implemented in this library? I have also read that perplexity may be brought into the game to help decide the number of topics, but I'm not quite sure whether that is correct or how to use it. I would really appreciate any clarification.

Thank you!
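
For context, the perplexity route needs nothing more than a held-out log-likelihood. For a held-out document set D, where document d has N_d tokens and p(w_d) is the model's likelihood of d's words:

	perplexity(D) = exp( -(sum over d of log p(w_d)) / (sum over d of N_d) )

Lower is better; the usual recipe is to fit models over a range of topic counts k and pick the k where held-out perplexity bottoms out or stops improving meaningfully. Whether this library exposes a per-document log-likelihood or a perplexity measure directly is worth checking in the Go docs; a coherence score like Gensim's would have to be computed separately.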
