catalyst's Introduction

catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

⚡ Features

New: Language Packages ✨

We're migrating our model repository to use NuGet packages for all language-specific data and models.

You can find all new language packages here.

The new models are trained on the latest release of Universal Dependencies v2.7.

This is technically not a breaking change yet, but our online repository will be deprecated in the near future, so you should migrate to the new NuGet packages.

When using the new model packages, you can usually remove this line from your code: Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));, or replace it with Storage.Current = new DiskStorage("catalyst-models") if you are storing your own models locally.
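Put together, the migration described above might look like the sketch below. It assumes the English language package is installed from NuGet and that `Language`, `Storage`, and `DiskStorage` come from the `Mosaik.Core` namespace; adjust the `using` directives if your setup differs.

```csharp
using Catalyst;
using Mosaik.Core; // assumption: Language, Storage and DiskStorage live here

// Before: models were fetched on demand from the online repository.
// Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));

// After: register the language package that was installed from NuGet.
Catalyst.Models.English.Register();

// Optional: only needed if you store your own models locally.
Storage.Current = new DiskStorage("catalyst-models");

var nlp = await Pipeline.ForAsync(Language.English);
```

If you do not store any models of your own, the `Storage.Current` line can usually be dropped entirely.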

We've also added the option to store and load models using streams:

// Creates and stores the model
var isApattern = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");
isApattern.NewPattern(
    "Is+Noun",
    mp => mp.Add(
        new PatternUnit(P.Single().WithToken("is").WithPOS(PartOfSpeech.VERB)),
        new PatternUnit(P.Multiple().WithPOS(PartOfSpeech.NOUN, PartOfSpeech.PROPN, PartOfSpeech.AUX, PartOfSpeech.DET, PartOfSpeech.ADJ))
));
using(var f = File.OpenWrite("my-pattern-spotter.bin"))
{
    await isApattern.StoreAsync(f);
}

// Load the model back from disk
var isApattern2 = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");

using(var f = File.OpenRead("my-pattern-spotter.bin"))
{
    await isApattern2.LoadAsync(f);
}

✨ Getting Started

Using catalyst is as simple as installing its NuGet package, registering the language packages you need, and setting the storage location. Models are lazy loaded, either from disk or, if you still use the online repository, downloaded from it on first use. Also check out the sample projects for more examples of how to use catalyst.

Catalyst.Models.English.Register(); //You need to pre-register each language (and install the respective NuGet Packages)

Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);
Console.WriteLine(doc.ToJson());
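To look at individual tokens rather than the JSON dump, something like the following sketch may help. It rests on assumptions that are not stated in this README: that a processed document can be enumerated as sentence spans, that each span can be enumerated as tokens, and that tokens expose `Value` and `POS` properties. Check the API documentation before relying on it.

```csharp
using System;
using Catalyst;
using Mosaik.Core; // assumption: Language, Storage and DiskStorage live here

Catalyst.Models.English.Register();
Storage.Current = new DiskStorage("catalyst-models");

var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

// Assumption: a document enumerates its sentence spans,
// and each span enumerates its tokens.
foreach (var sentence in doc)
{
    foreach (var token in sentence)
    {
        // Assumption: Value is the token text, POS its part-of-speech tag.
        Console.WriteLine($"{token.Value}\t{token.POS}");
    }
}
```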

You can also take advantage of C# lazy evaluation and native multi-threading support to process a large number of documents in parallel:

var docs = GetDocuments();
var parsed = nlp.Process(docs);
DoSomething(parsed);

IEnumerable<IDocument> GetDocuments()
{
    //Generates a few documents, to demonstrate multi-threading & lazy evaluation
    for(int i = 0; i < 1000; i++)
    {
        yield return new Document("The quick brown fox jumps over the lazy dog", Language.English);
    }
}

void DoSomething(IEnumerable<IDocument> docs)
{
    foreach(var doc in docs)
    {
        Console.WriteLine(doc.ToJson());
    }
}
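The lazy evaluation mentioned above is a property of C# iterator methods in general, not of catalyst. The following self-contained sketch uses plain strings instead of catalyst documents (`GetTexts` is a hypothetical stand-in for `GetDocuments`) to show that items are produced only when the consumer asks for them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a document source; no catalyst types are used.
static IEnumerable<string> GetTexts()
{
    for (int i = 0; i < 1000; i++)
    {
        Console.WriteLine($"producing text {i}");
        yield return $"The quick brown fox jumps over the lazy dog ({i})";
    }
}

// Calling the iterator method produces nothing yet; it only builds the sequence.
var texts = GetTexts();

// Items are generated one at a time as the loop pulls them.
// Take(3) stops the enumeration early, so the remaining items are never produced.
foreach (var text in texts.Take(3))
{
    Console.WriteLine(text);
}
```

The same pattern is why `GetDocuments()` above never needs to hold all 1000 documents in memory at once.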

Training a new FastText word2vec embedding model is as simple as this:

var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs())); //GetDocs() stands in for your own IEnumerable<IDocument> source
await ft.StoreAsync();

For fast embedding search, we have also released a C# version of the "Hierarchical Navigable Small World" (HNSW) algorithm on NuGet, based on our fork of Microsoft's HNSW.Net. We have also released a C# version of the "Uniform Manifold Approximation and Projection" (UMAP) algorithm for dimensionality reduction on GitHub and on NuGet.

📖 Links

Documentation
Contribute: how to contribute to the catalyst codebase
Samples: sample projects demonstrating catalyst's capabilities
Gitter: join our Gitter channel

catalyst's People

Contributors

theolivenbaum, productiverage, aorgish, pfriesch
