
nlp's Introduction

Natural Language Processing

License: MIT

nlp

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing, with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.
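
As a quick taste of the API, here is a minimal sketch of the count/TF-IDF stage of the pipeline. Constructor signatures have varied between releases (older versions of NewCountVectoriser took a bool to strip stop words, as in the issue example further below), so treat the details as illustrative:

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"the cow jumped over the moon",
	}

	// Fit a vocabulary and produce a term-document count matrix,
	// then re-weight the raw counts with TF-IDF.
	vectoriser := nlp.NewCountVectoriser()
	transformer := nlp.NewTfidfTransformer()

	counts, err := vectoriser.FitTransform(corpus...)
	if err != nil {
		panic(err)
	}
	tfidf, err := transformer.FitTransform(counts)
	if err != nil {
		panic(err)
	}

	terms, docs := tfidf.Dims()
	fmt.Printf("%d terms x %d documents\n", terms, docs)
}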


Features

Planned

  • Expanded persistence support
  • Stemming, to treat words with a common root as the same, e.g. "go" and "going"
  • Clustering algorithms, e.g. hierarchical, k-means, etc.
  • Classification algorithms, e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. "Latent Semantic Indexing: An Overview." INFOSYS 240, Spring 2000.
  2. Landauer, Tom. "Latent Semantic Analysis." Scholarpedia article on LSA written by one of the creators of LSA.
  3. Thomo, Alex. "Latent Semantic Analysis (Tutorial)."
  4. "Latent Semantic Indexing." Stanford NLP course.
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms." In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC '02), 2002, p. 380.
  6. Bawa, M., Condie, T. and Ganesan, P. "LSH Forest: Self-Tuning Indexes for Similarity Search." In Proceedings of the 14th International Conference on World Wide Web (WWW '05), 2005, p. 651.
  7. Gionis, A., Indyk, P. and Motwani, R. "Similarity Search in High Dimensions via Hashing." In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), vol. 99, no. 1, 1999, pp. 518–529.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). "Random Indexing of Text Samples for Latent Semantic Analysis."
  9. Rangan, Venkat. "Discovery of Related Terms in a Corpus Using Reflective Random Indexing."
  10. Vasuki, Vidya and Cohen, Trevor. "Reflective Random Indexing for Semi-Automatic Indexing of the Biomedical Literature."
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. "Random Indexing Explained with High Probability."
  12. Foulds, James, Boyles, Levi, DuBois, Christopher, Smyth, Padhraic and Welling, Max (2013). "Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation."

nlp's People

Contributors

  • augier
  • goldbattle
  • james-bowman
  • johnfrankmorgan
  • recoilme


nlp's Issues

LDA model persistence

Thanks for this library; it seems really useful.
I have been playing around a bit with a feature-extraction pipeline of a CountVectoriser and a TF-IDF transformer feeding into an LDA transformer, but I can't seem to save the fitted pipeline to disk and reload it later to Transform new docs. Looking at the serialised pipeline in JSON, the vocabulary is there, as well as the tokeniser info and various LDA params, but I don't see the induced topics (matrices). Maybe this is a problem with the way I serialised it? If you can point to a working example of how to properly serialise a trained LDA model and re-use it later, that would be great.
Thanks again!
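
For anyone else hitting this: one likely culprit is that encoding/json only serialises exported struct fields, so any model state held in unexported fields (such as learned topic matrices) is silently dropped. A generic illustration of the failure mode (ToyModel is made up for this sketch and is unrelated to nlp's actual types):

package main

import (
	"encoding/json"
	"fmt"
)

// ToyModel is a stand-in for a fitted model: the vocabulary is an
// exported field, but the learned topic weights are unexported.
type ToyModel struct {
	Vocabulary map[string]int // survives a JSON round trip
	topics     []float64      // silently dropped by encoding/json
}

func main() {
	m := ToyModel{
		Vocabulary: map[string]int{"fox": 0, "dog": 1},
		topics:     []float64{0.7, 0.3},
	}
	b, _ := json.Marshal(m)
	fmt.Println(string(b)) // {"Vocabulary":{"fox":0,"dog":1}} and no topics
}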

Interface for Tokeniser, Allow Custom Tokenisers?

Hi James,

Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

What I'm imagining is this (note, the workflow is designed to avoid breaking API changes):

  • Convert Tokeniser to an interface, providing ForEachIn and Tokenise methods.
  • Convert NewTokeniser to a method that returns a default implementation, which would be identical to the current implementation.
  • Add a new method, NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser, which would enable easy creation of a custom tokeniser.
  • Add new constructors for CountVectoriser and HashingVectoriser to allow use of a custom Tokeniser, OR (your preference) make their vec.tokeniser field into a public field vec.Tokeniser, allowing overrides or manual construction of either.

I could probably make the required changes quickly enough, if you're interested. :)
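
To make the proposal concrete, here is a sketch of the shape described above (the method names mirror the existing implementation; the exact signatures are of course up for discussion):

// Tokeniser, as an interface rather than a concrete type.
type Tokeniser interface {
	// ForEachIn invokes f for each token found in text.
	ForEachIn(text string, f func(token string))
	// Tokenise splits text into a slice of tokens.
	Tokenise(text string) []string
}

// NewTokeniser would return the current default implementation, and
// NewCustomTokeniser(tokenPattern string, stopWordList []string) would
// build a Tokeniser from a caller-supplied pattern and stop-word list.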

Punctuation and PoS tagging

I'm having trouble with punctuation in the PoS tagger. If I remove all punctuation (fairly standard), will the PoS tagger still work?

License?

This looks very interesting, but for anyone to reuse your code, the license must be clear.

Out of memory - What are sensible upper bounds?

I am using the NLP package for similarity search.

Input is a corpus of 300,000 entries of []string, each consisting of about 10-50 words.

When I naively follow the example at https://pkg.go.dev/github.com/james-bowman/nlp#example-package, it panics with an out-of-memory error.

I reduced the set to 10,000 entries, and FitTransform still takes a very long time (minutes) to complete on my laptop.

Am I doing something wrong, or is this not the correct way of using this package?
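
As a rough worked estimate of scale (assuming, purely for illustration, a vocabulary of about 100,000 distinct terms across the 300,000 documents): the term-document counts are manageable in sparse form, but any step that densifies the matrix, such as a truncated SVD, would need roughly 100,000 × 300,000 × 8 bytes ≈ 240 GB as dense float64, far beyond any laptop. Pruning the vocabulary, reducing the document count, or hashing to a fixed feature count are the usual ways to bound this.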

Online/streaming LDA?

Is it possible to run LDA (or other processing algorithms) in a streaming/online fashion, as is done with Gensim? It seems that this would not easily support online processing, but I thought I'd bounce the question off of you since you know the internals much better.
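
Worth noting: the README's reference list includes Foulds et al. (2013), "Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation", which is a stochastic mini-batch algorithm, so online training is at least plausible at the algorithm level; the gap is an exported incremental entry point. Purely as an illustration of the shape that could take (OnlineTopicModel and PartialFit are hypothetical and not part of the package):

// OnlineTopicModel is a hypothetical interface; the current nlp API
// exposes Fit/FitTransform over a full document slice instead.
type OnlineTopicModel interface {
	PartialFit(docs ...string) error
}

// trainStreaming consumes mini-batches of documents from a channel,
// updating the model incrementally as Gensim's online LDA does.
func trainStreaming(m OnlineTopicModel, batches <-chan []string) error {
	for batch := range batches {
		if err := m.PartialFit(batch...); err != nil {
			return err
		}
	}
	return nil
}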

func CosineSimilarity() returns NaN

Dear James Bowman,
I use your library to calculate similarity. The function CosineSimilarity() returns many NaN values, so I can't continue my work.
I have only changed your vectorisers.go file; all my changes are as follows.

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) { // function begins here
	// local variable renamed from mat to m so it no longer shadows the gonum mat package
	m := sparse.NewDOK(len(v.Vocabulary), len(docs))

	for d, doc := range docs {
		v.Tokeniser.ForEachIn(doc, func(word string) {
			i, exists := v.Vocabulary[word]

			if exists {
				weight, weightExists := TrainingData.WeightMap[word]
				// normal weight value: 2, unimportant weight value: 1, important weight value: 3
				if weightExists {
					m.Set(i, d, m.At(i, d)+weight)
				} else {
					m.Set(i, d, m.At(i, d)+1)
				}
			}
		})
	}
	return m.ToCSR(), nil
}
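
For anyone hitting this: cosine similarity is a·b / (|a||b|), so it returns NaN whenever either vector has zero norm, which happens whenever a document or the query contains no words from the fitted vocabulary. A minimal guard, sketched against gonum's mat package:

// safeCosine returns 0 instead of NaN when either vector has zero
// norm, e.g. a document or query with no in-vocabulary words.
// Vector, Norm and Dot come from gonum.org/v1/gonum/mat.
func safeCosine(a, b mat.Vector) float64 {
	normA := mat.Norm(a, 2)
	normB := mat.Norm(b, 2)
	if normA == 0 || normB == 0 {
		return 0
	}
	return mat.Dot(a, b) / (normA * normB)
}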

Vectorisers.go only tokenises a-z languages

Hi,

Thanks for your hard work. I really like your code base. I have noticed that the package, as is, only works for languages that can be expressed in a-z alphabets; in addition, the hardcoded stop words make it a bit challenging even for historic or fringe English corpora. I have a fix for both, but did not want to open a PR without creating this issue first to see if you want to open up the project to non-English, historic English, and non a-z languages.

Thanks again!

Best,

Thomas
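
For context on the fix: Go's regexp package (RE2) supports Unicode character classes, so a token pattern built on \p{L} rather than [a-zA-Z] already covers most alphabets. A quick self-contained illustration:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \p{L} matches any Unicode letter and \p{N} any Unicode number,
	// so this pattern tokenises far more than a-z alphabets.
	re := regexp.MustCompile(`[\p{L}\p{N}]+`)
	fmt.Println(re.FindAllString("the quick καφέ 123 廣州", -1))
	// Output: [the quick καφέ 123 廣州]
}

Languages written without spaces between words would still need a dedicated segmenter, which is exactly where pluggable tokenisation and configurable stop-word lists help.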

Methods for large corpora?

Sort of related to #8...

You have methods in the API, like in your example, that take a slice of strings (docs).

matrix, _ := vectoriser.FitTransform(testCorpus...)

I'd like to use this for very large corpora, with tens or hundreds of millions of (not tiny) documents. Putting these all into a single slice of strings does not sound optimal.
Any chance the methods that now take a slice-of-strings parameter for the documents could be altered to take a function or interface that allows iterating over all the docs? (Or new methods that support this?)
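
To make that concrete, something like the following purely hypothetical shape (neither name exists in the package today):

// DocumentSource is a hypothetical iterator over a corpus too large
// to hold in memory as a single slice of strings.
type DocumentSource interface {
	// Next returns the next document, with ok == false once exhausted.
	Next() (doc string, ok bool)
}

// A hypothetical streaming counterpart to FitTransform could then be:
//
//	func (v *CountVectoriser) FitFrom(src DocumentSource) error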

Thanks,
Glen

Example fails, possibly due to gonum/matrix being deprecated?

Hello,

When running a slightly modified version of your example, I receive the following error:

# github.com/james-bowman/nlp
../../go/src/github.com/james-bowman/nlp/vectorisers.go:163: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
        *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/vectorisers.go:221: cannot use mat (type *sparse.DOK) as type mat64.Matrix in return argument:
        *sparse.DOK does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:43: impossible type assertion:
        *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:62: cannot use sparse.NewDIA(m, weights) (type *sparse.DIA) as type mat64.Matrix in assignment:
        *sparse.DIA does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use t.transform (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
        mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                have T() mat64.Matrix
                want T() mat.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:76: cannot use mat (type mat64.Matrix) as type mat.Matrix in argument to product.Mul:
        mat64.Matrix does not implement mat.Matrix (wrong type for T method)
                have T() mat64.Matrix
                want T() mat.Matrix
../../go/src/github.com/james-bowman/nlp/weightings.go:81: cannot use product (type *sparse.CSR) as type mat64.Matrix in return argument:
        *sparse.CSR does not implement mat64.Matrix (wrong type for T method)
                have T() mat.Matrix
                want T() mat64.Matrix

The code of my modified example is below:

package main

import (
	"fmt"

	"github.com/gonum/matrix/mat64"
	"github.com/james-bowman/nlp"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser(true)
	transformer := nlp.NewTfidfTransformer()

	// set k (the number of dimensions following truncation) to 4
	reducer := nlp.NewTruncatedSVD(4)

	// Transform the corpus into an LSI fitting the model to the documents in the process
	mat, _ := vectoriser.FitTransform(testCorpus...)
	mat, _ = transformer.FitTransform(mat)
	lsi, _ := reducer.FitTransform(mat)

	// run the query through the same pipeline that was fitted to the corpus
	// to project it into the same dimensional space
	mat, _ = vectoriser.Transform(query)
	mat, _ = transformer.Transform(mat)
	queryVector, _ := reducer.Transform(mat)

	// iterate over document feature vectors (columns) in the LSI and compare with the
	// query vector for similarity.  Similarity is determined by the difference between
	// the angles of the vectors known as the cosine similarity
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := CosineSimilarity(queryVector.(*mat64.Dense).ColView(0), lsi.(*mat64.Dense).ColView(i))
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'", testCorpus[matched])
	// Output: Matched 'The quick brown fox jumped over the lazy dog'
} 

I see that gonum/matrix was deprecated a month ago in favor of gonum/gonum and wonder if that could be related.

Thanks very much for your help!
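
For reference, the errors come from mixing the deprecated github.com/gonum/matrix/mat64 with the newer gonum.org/v1/gonum/mat that current versions of nlp (via james-bowman/sparse, as the error messages show) build against. Below is a version of the example updated along those lines, as a sketch: constructor signatures have varied between nlp releases (recent ones take an optional stop-word list rather than a bool), and pairwise.CosineSimilarity is assumed to come from the package's measures/pairwise subpackage.

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser()
	transformer := nlp.NewTfidfTransformer()
	reducer := nlp.NewTruncatedSVD(4) // k = 4 dimensions after truncation

	// Fit the pipeline to the corpus and project it into LSI space.
	matrix, _ := vectoriser.FitTransform(testCorpus...)
	matrix, _ = transformer.FitTransform(matrix)
	lsi, _ := reducer.FitTransform(matrix)

	// Run the query through the same fitted pipeline.
	matrix, _ = vectoriser.Transform(query)
	matrix, _ = transformer.Transform(matrix)
	queryVector, _ := reducer.Transform(matrix)

	// Compare the query vector against each document column by cosine similarity.
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := pairwise.CosineSimilarity(
			queryVector.(mat.ColViewer).ColView(0),
			lsi.(mat.ColViewer).ColView(i),
		)
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'\n", testCorpus[matched])
	// Output: Matched 'The quick brown fox jumped over the lazy dog'
}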

OCR

This looks really nice. Thank you for making this open source.

I am attempting to do OCR.
I can identify all the letters, but then I need to check them against a word list so I can pick up where the OCR may have made a mistake.

That way, corrections can propagate back to the OCR system so it gets better.

There is also no reason why it couldn't use the semantic meaning of a sentence to correct the OCR. It's kind of one step up from just using single words.

I don't have it up on a git repo yet, but I figured it would be interesting to you.
If you feel like commenting on this idea, that would be great.

I am also really curious where you get data sources. For semantics you need training data, right?

[Question] Optimal number of topics in LDA

Hi!
I'm planning to use the LDA functionality and, as I read (I'm very new to the matter), Gensim has a coherence score that can be used to determine that (magic) key number. Is any similar functionality implemented in this library? I have also read that perplexity may be brought into the game to help decide the number of topics, but I'm not quite sure whether that is correct or how to use it. I would really appreciate any clarification.

Thank you!
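
For context, the perplexity route needs nothing more than a held-out log-likelihood. For a held-out document set D, where document d has N_d tokens and p(w_d) is the model's likelihood of d's words:

	perplexity(D) = exp( -(sum over d of log p(w_d)) / (sum over d of N_d) )

Lower is better; the usual recipe is to fit models over a range of topic counts k and pick the k where held-out perplexity bottoms out or stops improving meaningfully. Whether this library exposes a per-document log-likelihood or a perplexity measure directly is worth checking in the Go docs; a coherence score like Gensim's would have to be computed separately.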
