tfidf-tool

This is a Python implementation of tf-idf. The tool provides a simple and fast way to calculate tf-idf values.

why use this tool?

  • the tool calculates idf values with multiple processes, which can be about n times faster than a single-process calculation when n processes are used
  • it can calculate n-gram tf-idf values
  • it can extract keywords from documents

quick start

All the input files we use are in the 'input' directory. We will use 'wiki_head_10.txt', which contains 10 wiki documents, to train our model, and 'wiki_test.txt' to test it.

get idf value

    doc = Document('../input/wiki_head_10.txt')
    tfidf = TFIDF(
        documents=doc,
        ngram=2,
        stop_words_path='../input/stop_words.txt',
        idf_path='../output/idf.txt'
    )
    # use 2 processes, each handling 5 docs
    tfidf.multi_pro_idf(process_num=2, p_doc_num=5)

Here we calculate bigram idf values from the 10 wiki docs.
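
To illustrate the idea behind splitting the idf calculation across processes, here is a minimal sketch. It is not the tool's actual code: the helper names (`chunked`, `document_frequency`, `merge_to_idf`, `parallel_idf`) are hypothetical, each document is assumed to be already split into a list of terms, and the idf formula is assumed to be the common `log(N / df)`, which may differ from what the tool uses.

```python
import math
from collections import Counter
from multiprocessing import Pool


def chunked(docs, size):
    """Split a list of documents into consecutive chunks of at most `size`."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def document_frequency(chunk):
    """Count, for one chunk, how many documents contain each term."""
    df = Counter()
    for terms in chunk:
        df.update(set(terms))  # set(): count each term once per document
    return df


def merge_to_idf(partial_counts, total_docs):
    """Sum the per-chunk counts and convert them to idf = log(N / df).

    The formula is an assumption made for this sketch.
    """
    total = Counter()
    for df in partial_counts:
        total.update(df)
    return {term: math.log(total_docs / n) for term, n in total.items()}


def parallel_idf(docs, process_num, p_doc_num):
    """Hypothetical helper: one chunk of `p_doc_num` docs per worker task."""
    chunks = chunked(docs, p_doc_num)
    with Pool(processes=process_num) as pool:
        partial = pool.map(document_frequency, chunks)
    return merge_to_idf(partial, len(docs))
```

Calling `parallel_idf(docs, process_num=2, p_doc_num=5)` from inside an `if __name__ == "__main__":` guard would mirror the call above: 10 documents are split into 2 chunks of 5, and each chunk is counted in its own process before the counts are merged.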

TFIDF's parameters

  • documents: an instance of the Document class. It acts as a generator in which every element is a list of sentences representing one document
  • ngram: integer. 1 means unigram, 2 means bigram, 3 means trigram, and so on
  • stop_words_path: path to the stop words file. If stop words are used, any n-gram that contains a stop word is filtered out
  • idf_path: file path where the idf values are stored
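
The interaction between `ngram` and the stop words can be sketched as follows. This is an illustration only, not the tool's implementation: the function name `ngrams`, whitespace tokenization, and joining terms with a space are all assumptions.

```python
def ngrams(tokens, n, stop_words=frozenset()):
    """Return the n-grams of `tokens` that contain no stop word.

    Each n-gram is returned as a single space-joined string.
    """
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Drop the whole n-gram if any of its words is a stop word.
        if any(tok in stop_words for tok in window):
            continue
        grams.append(" ".join(window))
    return grams


tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2, stop_words={"the", "on"}))  # ['cat sat']
```

With `n=1` and no stop words the function simply returns the tokens themselves, which corresponds to the unigram case.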

get tfidf value and extract key words

    tfidf = TFIDF(
        documents=None,
        ngram=2,
        stop_words_path='../input/stop_words.txt',
        idf_path='../output/idf.txt'
    )
    # load the previously computed idf values
    tfidf.load_idf()
    doc = tfidf.read_file('../input/wiki_test.txt')
    # a dict that maps each word to its tf-idf value
    tfidf_values = tfidf.calculate_tfidf(doc)
    # extract the top 10 keywords from one document
    keywords = tfidf.find_keywords(doc, 10)
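
As a rough picture of what scoring and keyword extraction involve, the sketch below computes tf-idf values for one document and picks the top-scoring terms. It is written under assumptions and is not the tool's code: `tfidf_scores` and `top_keywords` are hypothetical names, tf is taken as the relative term frequency, and terms missing from the idf table get a default idf of 0.

```python
from collections import Counter


def tfidf_scores(terms, idf, default_idf=0.0):
    """Map each term of one document to tf * idf.

    tf is the term count divided by the document length (an assumption).
    """
    counts = Counter(terms)
    total = len(terms)
    return {t: (c / total) * idf.get(t, default_idf) for t, c in counts.items()}


def top_keywords(scores, k):
    """Return the k highest-scoring (term, score) pairs, ties broken by term."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


idf = {"cat": 2.0, "the": 0.1}
scores = tfidf_scores(["the", "cat", "the", "dog"], idf)
print(top_keywords(scores, 2))
```

Here "cat" ranks first: it appears less often in the document than "the", but its much higher idf outweighs that.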

Contributors

tigerchen52
