concept-expansion-snippet

This project uses a graph propagation method over pretrained word vectors (GloVe and Word2vec) for concept extraction and concept expansion. It supports both Chinese and English.

  • Concept extraction: Extract concepts from the input text with given seed words.
  • Concept expansion: Expand concepts from given seed words using snippets from search engines (Baidu for Chinese and Google for English).
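To make the idea of graph propagation over word vectors more concrete, here is a minimal sketch. It is not the project's actual algorithm: the scoring rule, the function names cosine and propagate, and the way iter_time, decay, threshold, and max_num are used are all assumptions chosen for illustration. Candidates are nodes, cosine similarity between their vectors is the edge weight, and seed scores spread to neighbours over a fixed number of iterations.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(vectors, seeds, iter_time=3, decay=0.8, threshold=0.5, max_num=10):
    """Hypothetical propagation: seeds start at 1.0 and pass a decayed,
    similarity-weighted score to every other candidate on each iteration."""
    words = list(vectors)
    scores = {w: (1.0 if w in seeds else 0.0) for w in words}
    for _ in range(iter_time):
        updated = dict(scores)
        for w in words:
            if w in seeds:
                continue  # seed scores stay fixed
            # Strongest signal arriving from any neighbour in the previous round.
            best = max(
                (scores[n] * cosine(vectors[w], vectors[n]) for n in words if n != w),
                default=0.0,
            )
            updated[w] = max(scores[w], decay * best)
        scores = updated
    # Keep non-seed candidates above the threshold, best first, capped at max_num.
    kept = [w for w in words if w not in seeds and scores[w] >= threshold]
    kept.sort(key=lambda w: -scores[w])
    return kept[:max_num]
```

With toy two-dimensional vectors such as {"apple": [1, 0], "pear": [0.9, 0.1], "car": [0, 1]} and the seed "apple", only "pear" ends up above the default threshold, because "car" is nearly orthogonal to the seed.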

Before running the code

  1. Decompress model.zip to get the basic model files: unzip model.zip
  2. Decompress the HunPos package that matches your operating system (hunpos-1.0-linux.tgz, hunpos-1.0-macosx.tgz, or hunpos-1.0-win.zip) and rename the extracted folder to hunpos.
  3. Download the following model files, put them in folder model/ and unzip them:
  4. Modify the path lists in config.py if necessary.
  5. Install the requirements in requirements.txt: pip install -r requirements.txt

Parameters

You can run this project with python main.py plus additional command-line parameters. Parameters can be specified on the command line or in config.py.

The following parameters are available on the command line:

input_text: --text
input_seed: --seed, -s
language: --language, -l
task: --task, -t
iter_time: --iter_time, -i
max_num: --max_num, -m
threshold: --threshold, -th
decay: --decay, -d
no_seed: --no_seed, -ns
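The flags listed above could be declared with argparse roughly as follows. This is a sketch, not the project's main.py: the flag names come from the list, while the value types, the use of store_true for --no_seed, and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; types are assumed."""
    parser = argparse.ArgumentParser(description="Sketch of the documented main.py flags")
    parser.add_argument("--text", dest="input_text")
    parser.add_argument("--seed", "-s", dest="input_seed")
    parser.add_argument("--language", "-l")
    parser.add_argument("--task", "-t")
    parser.add_argument("--iter_time", "-i", type=int)      # assumed integer
    parser.add_argument("--max_num", "-m", type=int)        # assumed integer
    parser.add_argument("--threshold", "-th", type=float)   # assumed float
    parser.add_argument("--decay", "-d", type=float)        # assumed float
    parser.add_argument("--no_seed", "-ns", action="store_true")  # assumed boolean switch
    return parser
```

Any flag left off the command line parses to None here (False for --no_seed), which leaves room for a fallback to the value set in config.py.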

The following path lists can be modified in config.py:

zh_model, en_model: the word vector files
jieba_dict
en_stopwords, zh_stopwords
zh_kp_list, en_kp_list: the full lists of candidate concepts
snippets_zh, snippets_en: the sqlite3 db files to store snippets
hunpos_model, hunpos_bin: nltk hunpos tagger
input, seed, tmp, result
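Since a wrong path usually shows up only after the slow word vector load, it can help to check the configured paths first. The helper below is a hypothetical sketch: missing_paths is not part of the project, and it assumes each config entry can be reduced to a name and a single path string.

```python
import os


def missing_paths(paths):
    """Return the names (sorted) whose configured path does not exist on disk.

    `paths` maps a config name such as "en_model" to a path string.
    This mapping shape is an assumption, not the layout of config.py.
    """
    return sorted(name for name, path in paths.items() if not os.path.exists(path))
```

For example, passing a dictionary built from the names above before starting a run returns an empty list when every file and folder is in place.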

Note

To crawl Google search snippets, users in Mainland China need a VPN.

If you see the "get snippet timeout" message too many times, the crawler may have been blocked by anti-crawler measures. Rerunning the code solves this problem most of the time.
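Rerunning by hand can also be automated with a small retry wrapper. The sketch below is an assumption-laden illustration rather than the project's crawler: with_retries is a hypothetical helper, and it assumes the fetch function signals a timeout by raising TimeoutError.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` tries have timed out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    Assumes a timeout surfaces as TimeoutError (hypothetical for this project).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # back off before the next try
    raise last_error
```

The growing delay gives a rate limiter time to relax. If every attempt times out, the last error is raised again so the failure stays visible.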

You can use other word vectors by modifying the path lists in config.py or by rewriting modules/model_load.py. Loading a word vector file normally takes a long time and consumes a lot of memory.

Make sure there is at least one seed word in the input text.
