phrase-mining's Introduction

Phrase-Mining

Sequential Pattern Mining

This code was written for an assignment in CS 412: Intro to Data Mining (Spring 2018) at the University of Illinois at Urbana-Champaign.

The rules for the implementation are given below.

In natural language processing, it is important to identify "phrases" in text. If phrases are treated as word sequences of fixed order that occur frequently in the corpus, a sequential pattern mining algorithm such as GSP, PrefixSpan, or SPADE can be applied to the problem.

In this assignment, you will be given raw text sentences, and you need to implement GSP or another equivalent sequential pattern mining algorithm to find frequent phrases in the text.

You can assume that the input is in English only. As the first step, remove less important words (stop words) from the sentences. You can use the English stop word list below.

a an are as at by be for from has he in is it its of on that the to was were will with

Then, convert all words to lower case to remove ambiguity. Now, start building the sequence database. Specifically, process the conjunction "and" by grouping the joined words into an itemset. For example, "A, B and C" becomes the itemset (A, B, C) in the sequence transaction.
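The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `STOP_WORDS` and `build_sequence` are hypothetical, and several choices are assumptions (words are lowercased before the stop word check so that capitalized stop words are also caught, punctuation is stripped from the ends of each word, and a trailing comma is treated as joining a word to the next one, as in "A, B and C").

```python
import string

# Stop word list taken from the assignment text.
STOP_WORDS = frozenset(
    "a an are as at by be for from has he in is it its of on "
    "that the to was were will with".split()
)


def build_sequence(sentence):
    """Turn one raw sentence into a list of itemsets (sorted tuples of words)."""
    sequence = []
    join_next = False  # True when the next kept word joins the previous itemset
    for raw in sentence.split():
        word = raw.strip(string.punctuation).lower()
        if not word or word in STOP_WORDS:
            continue
        if word == "and":
            # "x and y": y goes into the same itemset as x.
            join_next = bool(sequence)
            continue
        if join_next and sequence:
            sequence[-1].add(word)
        else:
            sequence.append({word})
        # Assumption: "x, y" also places y in the same itemset as x.
        join_next = raw.endswith(",")
    return [tuple(sorted(itemset)) for itemset in sequence]


print(build_sequence("b and c d and e f"))
```

Treating every comma as a joiner is a simplification that suits the list-style sentences in the samples; general English text would need a more careful rule.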

Input Format

The input dataset is raw text sentences.

The first line of the input corresponds to the minimum support.

Each following line of the input corresponds to one sentence. Words in each transaction are separated by a space.

Please refer to the sample input below. In sample input 0, the minimum support is 3, and the dataset contains 3 sentences and 6 words (b, c, d, e, f and g).
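Reading this format might look like the sketch below. The function name `parse_input` is hypothetical, and skipping blank lines is an assumption that the assignment text does not state.

```python
def parse_input(text):
    """Split raw input into (minimum support, list of sentence strings)."""
    # Assumption: blank lines carry no sentence and can be skipped.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    min_support = int(lines[0])  # the first line holds the minimum support
    return min_support, lines[1:]


raw = "3\nb and c d and e f\nb c d e f\nb and c d b e g f\n"
min_support, sentences = parse_input(raw)
print(min_support, len(sentences))
```

In a submission, the text would typically come from standard input, for example `parse_input(sys.stdin.read())`.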

Constraints

NA

Output Format

The output consists of the longest frequent phrases mined from the input dataset. That is, report every phrase of the greatest length that satisfies the minimum support criterion.
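One way to find the longest frequent phrases is a simplified, GSP-style level-wise search, sketched below. It is an illustration under stated assumptions rather than the assignment's reference solution: the function names are hypothetical, the database is assumed to already be a list of sequences of itemsets, each pattern element is a single word (as in the sample outputs), and gaps between matched words are allowed, which is inferred from the sample outputs rather than stated in the rules. The sketch omits GSP's extra pruning step and hash-tree counting.

```python
def contains(sequence, pattern):
    """True if the pattern's words occur in order, each in a strictly later itemset."""
    pos = 0
    for word in pattern:
        # Greedily take the earliest itemset that holds the next word.
        while pos < len(sequence) and word not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next word must come from a later itemset
    return True


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in database)


def longest_frequent_phrases(database, min_support):
    """Map each longest frequent pattern (tuple of words) to its support."""
    words = {w for sequence in database for itemset in sequence for w in itemset}
    candidates = {(w,) for w in words}
    longest = {}
    while candidates:
        level = {}
        for candidate in candidates:
            count = support(database, candidate)
            if count >= min_support:
                level[candidate] = count
        if not level:
            break
        longest = level
        # GSP-style join: p and q combine when p minus its first word
        # equals q minus its last word.
        candidates = {p + q[-1:] for p in level for q in level if p[1:] == q[:-1]}
    return longest


# Sample input 0 after preprocessing, written out by hand.
database = [
    [("b", "c"), ("d", "e"), ("f",)],
    [("b",), ("c",), ("d",), ("e",), ("f",)],
    [("b", "c"), ("d",), ("b",), ("e",), ("g",), ("f",)],
]
print(longest_frequent_phrases(database, 3))
```

Note that "b c" is not frequent here: in the first sequence b and c share one itemset, so neither comes strictly after the other.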

Each line of the output should be in the format:

Support [frequent phrase]
Support [frequent phrase]
......

The frequent phrases should be ordered by support, from largest to smallest. Ties should be broken by alphabetical order.
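The ordering and line format can be sketched as below. The function name `format_phrases` and the input shape (a dict from word tuples to support counts) are assumptions made for illustration; ties are broken here by comparing the phrases as space-joined strings.

```python
def format_phrases(phrases):
    """Return output lines sorted by support (descending), then alphabetically."""
    ordered = sorted(
        phrases.items(),
        key=lambda item: (-item[1], " ".join(item[0])),
    )
    return [f"{count} [{' '.join(words)}]" for words, count in ordered]


mined = {
    ("clustering", "problems"): 4,
    ("classification", "problems"): 4,
}
for line in format_phrases(mined):
    print(line)
```

Negating the support in the sort key gives descending order by support while keeping the alphabetical tie-break ascending.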

Please refer to the sample output below. In sample output 0, the four phrases are the frequent sequential phrases calculated after converting "b and c" into the itemset (b c).

Sample Input 0

3
b and c d and e f
b c d e f
b and c d b e g f

Sample Output 0

3 [b d f]
3 [b e f]
3 [c d f]
3 [c e f]

Sample Input 1

4
Clustering and classification are important problems in machine learning.
There are many machine learning algorithms for classification and clustering problems.
Classification problems require training data.
Most clustering problems require user-specified group number.
SVM, LogisticRegression and NaiveBayes are machine learning algorithms for classification problems.
k-means, AGNES and DBSCAN are clustering algorithms.
Dimension reduction methods such as PCA are also learning algorithms for clustering problems.

Sample Output 1

4 [classification problems]
4 [clustering problems]

phrase-mining's People

Contributors

sanket1sinha
