
goextract

Go code for extracting cooccurrence statistics from a text corpus.

WHENEVER YOU USE THIS CODE, YOU MUST ALWAYS FIRST RUN THE FOLLOWING COMMANDS:

git pull
cd extract/
go build .

i.e., check for updates, then rebuild the Go binary accordingly.

This software extracts the cooccurrence statistics within a corpus of text.

Preliminary setup.

We can think of statistics extraction in terms of five separate steps: Preparation; Unigram extraction; Unigram merging; Cooccurrence extraction; Cooccurrence merging.

  • Step 1. I have a big .txt file with two notions of separation: spaces indicate new tokens, and newlines indicate new documents. Maybe this file is 30 GB.
  • Step 1.1. I want to divide that file into smaller 1 GB pieces (this will later facilitate multiprocessing for extremely rapid extraction); use the bash command split to do this!
  • Step 1.2. I want to store things efficiently, so gzip the divided files (a sketch of both sub-steps follows this list) -- the Go code assumes it is given gzipped files anyway.
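A minimal sketch of Steps 1.1 and 1.2, assuming GNU split and a corpus file named corpus.txt (the file and directory names here are assumptions; adjust to your data):

mkdir -p divided/
# -C caps each piece at ~1 GB while keeping whole lines (i.e., documents) intact;
# -d gives numeric suffixes (part_00, part_01, ...).
split -C 1G -d corpus.txt divided/part_
gzip divided/part_*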

Unigram extraction.

  • Step 2. Now, I want to know what the unigram statistics of this corpus are -- this produces the encoder-decoder structure required for cooccurrence extraction. Suppose the divided data is in a directory divided/, and we want to store result files in the directory unigrams/; then we do:
i=0
for f in divided/*.gz; do
    i=$((i+1))
    ./extract -option unigram -e "$f" -U unigrams/$i.unigram
done

Note: this will produce N unigram files, one per file in divided/ -- if you get fewer than N files then you have run the script incorrectly (e.g., you forgot the $i.unigram). A quick sanity check is sketched below.
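To confirm the counts match (using the paths assumed above):

ls divided/*.gz | wc -l        # N input pieces
ls unigrams/*.unigram | wc -l  # should print the same N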

  • Step 3. The script above produces one sub-unigram file for each file in divided/. But we would rather have a single merged unigram file; additionally, we need to specify the desired vocabulary size -- 50,000 is often a good number! This is easy:

./extract -option unigram-merge -U unigrams/ -v 50000

Cooccurrence extraction.

  • Step 4. Now we have the good boy file unigrams/merged.unigram, which stores our vocabulary and the unigram statistics. This will be used to help us extract cooccurrences! However, when doing cooccurrence extraction there is one fundamental consideration: how do we define the context?

  • Step 4a. We could use dynamic context window weighting, like Word2vec (which samples an effective window size uniformly from 1 to W, so a cooccurrence at distance d is weighted by (W - d + 1)/W); in this case, we just use the argument -w W, where W is the desired context window size (typically in the range of 2-10). Note: the larger W is, the longer the extraction will take!

  • Step 4b. We could use a generalized context window file; e.g., perhaps we want to define an asymmetric context window with our own desired weights. This is done by passing -window /path/to/window_file.w; examples of how the .w file should be written are found in data/test_data/, which includes left and right asymmetric examples.

  • Step 4.1. Do the extraction! Suppose we are using a basic 5-token left-right context window and storing temporary .cooc files in a directory called coocs/:

i=0
for f in divided/*.gz; do
    i=$((i+1))
    ./extract -option cooc -e "$f" -U unigrams/merged.unigram -C coocs/$i.cooc -w 5
done

Note: this will produce at least N .cooc files as Go binary encodings (gobs) -- if you get fewer than N files then you have run the script incorrectly (e.g., you forgot the $i.cooc).

  • Step 5. Merge the results from extraction!

./extract -option cooc-merge -C coocs/
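To sanity-check the merge, you can peek at the first few lines, assuming the merged file uses the plain-text, line-per-pair format described under Final comments below (if it is instead a gob, skip this check):

head -n 3 coocs/merged.cooc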

Final comments.

  • We now have a file called coocs/merged.cooc. This file stores all of the cooccurrence information in the corpus according to the chosen definition of the context window, and it is meaningful only with respect to the vocabulary encoding defined by the unigram file used during extraction (unigrams/merged.unigram). Each line is structured as: term_i context_j Nij, where i and j are the integer codes defined in the unigram file, each mapping back to a vocabulary string; e.g., a (hypothetical) line 17 42 9 would mean that term 17 cooccurred with context 42 with total weight 9.
  • Concurrency pattern: instead of using a for loop to make each .cooc file one at a time, we could multiprocess this and have each process iterate over just K of the .gz files, rather than all N. This can considerably speed up the run; e.g., dividing the work across 4 simultaneous processes will reduce runtime by roughly a factor of 4 (a minimal sketch follows this list).
  • Full path pattern: in the project's current state, everything requires the full path in order to run properly; so, always use the full path to any directory or file, e.g., instead of -C coocs/ you will probably need -C /home/rldata/hilbert-data/coocs, etc.
  • RAM usage: this code will use a considerable amount of RAM during Step 4.1, and ./extract is itself highly concurrent; therefore, be careful when using it on a big server, as it will use all available cores (but will be very fast). If you have more than 32 GB of RAM you should be pretty much fine; if you have more than 64 GB of RAM then you will certainly be fine.
  • Smart usage: step 4.1 is the only expensive operation; every other operation completes in a few seconds or minutes. Therefore, when thinking about parallelizing, only consider it for step 4.1 -- it is not necessary to parallelize the unigram extraction (although you could do so with exactly the same pattern as for 4.1).
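A minimal sketch of the concurrency pattern above, assuming 4 processes and the directory layout used in the steps (the round-robin chunking here is an illustration, not a feature of goextract; per the full-path note, use absolute paths in practice):

# Launch 4 background workers; worker k handles every 4th .gz file.
# Each worker prefixes its outputs with k so filenames never collide.
for k in 0 1 2 3; do
    (
        i=0
        for f in divided/*.gz; do
            i=$((i+1))
            if [ $((i % 4)) -eq "$k" ]; then
                ./extract -option cooc -e "$f" -U unigrams/merged.unigram \
                    -C coocs/${k}_${i}.cooc -w 5
            fi
        done
    ) &
done
wait  # block until all 4 workers have finished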
