tweet2vec

THE project.

Preprocessing

To run preprocessing:

python preprocess.py data.txt

This runs a test that iterates a few times through the lines of data.txt and prints the raw text, clean text, and hashtags for each line.
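
As a rough illustration of what "clean text" and "hashtags" could mean for one line, here is a minimal sketch. The regular expression, the lowercasing, and the function name split_tweet are assumptions made for this example; the actual cleaning rules in preprocess.py may differ.

```python
import re

# Assumed pattern: a '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def split_tweet(raw_text):
    """Return (clean_text, hashtags) for one raw tweet line (hypothetical helper)."""
    # Collect the hashtag names without the leading '#', lowercased.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(raw_text)]
    # Remove the hashtags, then collapse the whitespace they leave behind.
    clean_text = " ".join(HASHTAG_RE.sub("", raw_text).split())
    return clean_text, hashtags


clean, tags = split_tweet("Great game tonight #MLB #baseball")
print(clean)
print(tags)
```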

Running without arguments (i.e. just python preprocess.py) defaults to looking for the file ./data/sample.csv.

If that file is not found, the script falls back to running on 4 sample tweets provided in the code.

To prepare the data for the model:

python preprocess.py --prepare data.txt

This generates the file ./models/mlb.pickle, which contains a MultiLabelBinarizer object. This object turns a list of hashtags into an encoded vector for our model: a binary vector with one position per known hashtag, set to 1 for each hashtag present.
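
To make the encoding concrete, the sketch below reproduces the idea in plain Python. It is not the project's code and does not use scikit-learn; fit_classes and encode are hypothetical helpers that only mimic what a fitted MultiLabelBinarizer does with its vocabulary.

```python
def fit_classes(hashtag_lists):
    """Collect the sorted set of distinct hashtags (the label vocabulary)."""
    return sorted({tag for tags in hashtag_lists for tag in tags})


def encode(tags, classes):
    """Return a 0/1 vector with one position per known hashtag."""
    present = set(tags)
    # Hashtags that are not in the vocabulary are simply ignored.
    return [1 if c in present else 0 for c in classes]


samples = [["mlb", "baseball"], ["nba"], ["mlb"]]
classes = fit_classes(samples)
print(classes)
print(encode(["mlb", "nba"], classes))
```

A tweet with several hashtags gets several 1s in the same vector, which is why the result is a multi-label encoding rather than a single one-hot row.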

By default it filters out hashtags that don't appear more than 10 times. To change this number to, say, 100:

python preprocess.py --prepare --threshold 100 data.txt
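
The frequency filter can be sketched as follows. This assumes, from the wording above, that a hashtag is kept only when its count is strictly greater than the threshold; frequent_hashtags and filter_tags are hypothetical names, not functions from preprocess.py.

```python
from collections import Counter


def frequent_hashtags(hashtag_lists, threshold=10):
    """Return the set of hashtags that appear more than `threshold` times."""
    counts = Counter(tag for tags in hashtag_lists for tag in tags)
    # Assumption: "more than" is read as a strict comparison.
    return {tag for tag, n in counts.items() if n > threshold}


def filter_tags(tags, keep):
    """Drop hashtags that did not pass the frequency filter."""
    return [t for t in tags if t in keep]


samples = [["mlb", "baseball"], ["mlb"], ["nba"]]
keep = frequent_hashtags(samples, threshold=1)
print(keep)
print(filter_tags(["mlb", "nba"], keep))
```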

Note: remove the ./models/mlb.pickle file before generating a new one. If an mlb.pickle already exists, the script loads it and then filters out any hashtags that aren't already in the model.

TweetIterator

The main object in preprocess.py is the TweetIterator object. It lets you iterate through the lines of a text file and yield any number of things useful for our model. This lets you process and generate features for input files in a memory-efficient way (i.e. you never have to load all of your data into memory).

The second argument determines whether to skip samples that do not have any hashtags.
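
The memory-efficient pattern described above can be sketched with a generator-based class. LineFeatureIterator is a hypothetical stand-in written for this example, not the project's TweetIterator; it supports only two features and assumes a simple hashtag pattern.

```python
import os
import re
import tempfile


class LineFeatureIterator:
    """Hypothetical stand-in for TweetIterator; not the project's code."""

    def __init__(self, path, skip_nohashtag, *features):
        self.path = path
        self.skip_nohashtag = skip_nohashtag
        self.features = features

    def __iter__(self):
        # Re-open the file on every pass so the object can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # reads one line at a time, never the whole file
                raw = line.rstrip("\n")
                hashtags = re.findall(r"#(\w+)", raw)
                if self.skip_nohashtag and not hashtags:
                    continue
                available = {"raw_text": raw, "hashtags": hashtags}
                values = tuple(available[name] for name in self.features)
                # One feature: yield the bare value; several: yield a tuple.
                yield values[0] if len(values) == 1 else values


# Small demo on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("hello #a\nno tags\n")
    demo_path = f.name

for raw, tags in LineFeatureIterator(demo_path, True, "raw_text", "hashtags"):
    print(raw, tags)

os.remove(demo_path)
```

Because each line is processed and yielded before the next one is read, memory use stays roughly constant regardless of file size.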

To iterate through the raw text:

tweet_iterator = TweetIterator('source.txt', False, 'raw_text')
for t in tweet_iterator:
    print(t)

To iterate through the hashtags:

tweet_iterator = TweetIterator('source.txt', False, 'hashtags')
for t in tweet_iterator:
    print(t)

If you have a MultiLabelBinarizer object prepared, iterate through the label vectors:

tweet_iterator = TweetIterator('source.txt', False, 'label')
for t in tweet_iterator:
    print(t)

Iterate through character matrices:

tweet_iterator = TweetIterator('source.txt', False, 'char_mat')
for t in tweet_iterator:
    print(t)
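
The README does not say how a character matrix is built. One common layout, shown here purely as an assumption, is a one-hot matrix with one row per character position and one column per symbol in a fixed alphabet. ALPHABET, max_len, and char_matrix are hypothetical and may not match what 'char_mat' actually yields.

```python
# Assumed vocabulary: lowercase letters, digits, and the space character.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def char_matrix(text, max_len=140):
    """Build a max_len x len(ALPHABET) one-hot matrix as nested lists."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    # Text longer than max_len is truncated; shorter text is zero-padded.
    for row, ch in enumerate(text.lower()[:max_len]):
        col = CHAR_INDEX.get(ch)
        if col is not None:  # characters outside the alphabet stay all-zero
            matrix[row][col] = 1
    return matrix


m = char_matrix("ab", max_len=3)
print(len(m), len(m[0]))
```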

If you have a word2vec model saved as ./models/w2v.pickle or ./models/w2v.bin, you can iterate through word2vec matrices:

tweet_iterator = TweetIterator('source.txt', False, 'word_mat')
for t in tweet_iterator:
    print(t)

You can combine any number of these:

tweet_iterator = TweetIterator('source.txt', False, 'raw_text', 'clean_text', 'hashtags', 'char_mat', 'word_mat')
for rt, ct, h, cm, wm in tweet_iterator:
    print(rt, ct, h, cm, wm)

etc...
