
TorchLanguage


TorchLanguage is the equivalent of TorchVision for Natural Language Processing. It provides text transforms (tokenization, indexing, n-grams, etc.) and datasets.


Join our community to create datasets and deep-learning models! Chat with us on Gitter and join the Google Group to collaborate with us.


This repository consists of:

  • torchlanguage.datasets : Pre-built datasets for common NLP tasks
  • torchlanguage.models : Generic pretrained models for common NLP tasks
  • torchlanguage.transforms : Common transformations for text
  • torchlanguage.utils : Tools, functions and measures for NLP

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.2.0 or newer. You can then install torchlanguage using pip:

pip install TorchLanguage

Optional requirements

If you want to use the English tokenizer from SpaCy (http://spacy.io/), you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Text transformation pipeline

The following transformations are available:

  • Character
  • Character2Gram
  • Character3Gram
  • Compose
  • DropOut
  • Embedding
  • FunctionWord
  • GensimModel
  • GloveVector
  • HorizontalStack
  • MaxIndex
  • PartOfSpeech
  • RandomSamples
  • RemoveCharacter
  • RemoveLines
  • RemoveRegex
  • Tag
  • ToFrequencyVector
  • ToIndex
  • Token
  • ToLength
  • ToLower
  • ToNGram
  • ToOneHot
  • ToUpper
  • Transformer
  • VerticalStack
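
The names above suggest that transforms are chained with Compose, each one consuming the output of the previous one. The sketch below illustrates that pattern only. It is self-contained and does not import torchlanguage: the classes Compose, ToLower, CharacterNGram and ToIndex defined here are hypothetical stand-ins written for this example, and their constructors and behavior are assumptions, not the library's actual API.

```python
from typing import Callable, List, Sequence


class Compose:
    """Stand-in: chain transforms so each output feeds the next transform."""

    def __init__(self, transforms: Sequence[Callable]):
        self.transforms = list(transforms)

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


class ToLower:
    """Stand-in: lower-case the input string."""

    def __call__(self, text: str) -> str:
        return text.lower()


class CharacterNGram:
    """Stand-in: split a string into overlapping character n-grams."""

    def __init__(self, n: int):
        self.n = n

    def __call__(self, text: str) -> List[str]:
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]


class ToIndex:
    """Stand-in: map each symbol to an integer, growing the vocabulary as needed."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, symbols: List[str]) -> List[int]:
        # A new symbol receives the next free index; a known symbol keeps its index.
        return [self.vocab.setdefault(s, len(self.vocab)) for s in symbols]


# Lower-case the text, cut it into character bigrams, then index the bigrams.
pipeline = Compose([ToLower(), CharacterNGram(2), ToIndex()])
indices = pipeline("Banana")
print(indices)  # [0, 1, 2, 1, 2]
```

Here "banana" yields the bigrams ba, an, na, an, na, so the repeated bigrams share an index. Check the examples directory for how the real transforms are constructed and combined.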

Data

The data module provides the following:

  • Ability to download and load a corpus from a directory. Each file must be named Class_Title.txt:
dataset = torchlanguage.datasets.FileDirectory(
    root='./data',
    download=True,
    download_url="http://urltozip/file.zip",
    transform=transformer
)
  • Wrapper for dataset splits (train, validation) and cross-validation:
k = 10  # number of folds (example value)
cross_val_dataset = {
    'train': torchlanguage.utils.CrossValidation(dataset, k=k),
    'test': torchlanguage.utils.CrossValidation(dataset, k=k, train=False)
}
for fold in range(k):
    # Iterate over the training samples of the current fold
    for data in cross_val_dataset['train']:
        inputs, label = data
    # end for
    # Iterate over the test samples of the current fold
    for data in cross_val_dataset['test']:
        inputs, label = data
    # end for
    # Move both splits to the next fold
    cross_val_dataset['train'].next_fold()
    cross_val_dataset['test'].next_fold()
# end for
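
To make the fold mechanics concrete, here is a minimal sketch of k-fold index splitting in plain Python. It is not the implementation of torchlanguage.utils.CrossValidation: the function k_fold_indices and its contiguous, unshuffled split are assumptions made for illustration.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int, fold: int) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for one fold of k-fold cross-validation."""
    if not 0 <= fold < k:
        raise ValueError("fold must be in [0, k)")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = fold * base + min(fold, extra)
    stop = start + base + (1 if fold < extra else 0)
    test = list(range(start, stop))
    train = [i for i in range(n_samples) if i < start or i >= stop]
    return train, test


n_samples, k = 10, 5
for fold in range(k):
    train_idx, test_idx = k_fold_indices(n_samples, k, fold)
    print(fold, test_idx)
```

Each sample lands in the test split of exactly one fold and in the training split of the other k - 1 folds, which is the property the train and test wrappers above need to share for a given fold.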

Datasets

The datasets module currently contains:

  • FileDirectory: Load a corpus from a directory
  • ReutersC50Dataset: The Reuters C50 dataset for authorship attribution
  • SFGram: A set of science-fiction magazines with five authors.

Others are planned or a work in progress:

  • Translation
  • Question answering

See the examples directory for examples of dataset usage.
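
As an illustration of the Class_Title.txt naming convention that FileDirectory expects, the sketch below splits a file name into a class label and a title. The helper parse_class_and_title is hypothetical and not part of torchlanguage, and it assumes the class is everything before the first underscore. The file name in the usage line is made up.

```python
import os
from typing import Tuple


def parse_class_and_title(path: str) -> Tuple[str, str]:
    """Split a 'Class_Title.txt' file name into (class, title)."""
    name, ext = os.path.splitext(os.path.basename(path))
    if ext != ".txt" or "_" not in name:
        raise ValueError("expected a file named Class_Title.txt, got %r" % path)
    # Assumption: the class label ends at the first underscore,
    # so any later underscores stay in the title.
    label, title = name.split("_", 1)
    return label, title


print(parse_class_and_title("./data/Asimov_Foundation.txt"))  # ('Asimov', 'Foundation')
```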

Related Work

EchoTorch is a Python framework for implementing Reservoir Computing models with PyTorch.

Authors

  • Nils Schaetti

Citing

If you find TorchLanguage useful for an academic publication, then please use the following BibTeX to cite it:

@misc{torchlanguage,
  author = {Schaetti, Nils},
  title = {TorchLanguage: Natural Language Processing with pyTorch},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nschaetti/TorchLanguage}},
}
