franticnerd / taxogen

taxogen's Introduction

About

This repo implements the following paper, which constructs a topic taxonomy from a text corpus.

"TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering", Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler, Michelle Vanni, Jiawei Han, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2018.

Input

The input consists of three files:

papers.txt

This data file contains all the documents (e.g., paper titles).
Every line is a sequence of processed keywords (either uni-grams or phrases).
The keywords are separated by blank spaces (words in a phrase are concatenated by '_').

keywords.txt

This data file contains all keywords extracted from the document collection (e.g., entities, noun phrases).
Every line is a keyword.

embeddings.txt

This data file contains the embeddings of all the keywords.
Every line is the embedding of a keyword.
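
The three formats above can be read with a few lines of Python. This is only a minimal sketch, not code from the repo: the function names are hypothetical, and the layout of embeddings.txt (a keyword followed by space-separated floats, with no header line) is an assumption, since the description above does not spell it out.

```python
import io


def read_documents(handle):
    """Yield each non-empty line of papers.txt as a list of keywords."""
    for line in handle:
        tokens = line.split()  # keywords are separated by blank spaces
        if tokens:
            yield tokens


def read_keywords(handle):
    """Return the set of keywords in keywords.txt (one keyword per line)."""
    return {line.strip() for line in handle if line.strip()}


def read_embeddings(handle):
    """Return a dict mapping keyword -> list of floats.

    ASSUMPTION: each line looks like "<keyword> <v1> <v2> ...".
    """
    embs = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embs[parts[0]] = [float(x) for x in parts[1:]]
    return embs


# Tiny in-memory stand-ins for the three files (made-up content).
papers = io.StringIO("data_mining clustering\ntopic_model text\n")
keywords = io.StringIO("data_mining\nclustering\ntopic_model\n")
embeddings = io.StringIO("data_mining 0.1 0.2\nclustering 0.3 0.4\n")

docs = list(read_documents(papers))
vocab = read_keywords(keywords)
embs = read_embeddings(embeddings)
```

With real data, replace the `io.StringIO` objects with `open("papers.txt")` and so on. A phrase such as `data_mining` stays a single token; call `token.split("_")` if the individual words are needed.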

The DBLP dataset used in the paper is available here:

https://drive.google.com/file/d/1GbxKrxrmFrKt5vgDHP1xe1Qr_rfvR1jh/view?usp=sharing

How to run?

You can use "python main.py" to run TaxoGen.

The full pipeline is in run.sh, which shows how we preprocess the corpus, run TaxoGen, and postprocess the results for visualization.

taxogen's Issues

Hardcoded paths

Inside "paras.py" there are some paths hardcoded to your personal folder, which causes an error when running the code.

E.g.: /shared/data/jiaming/local-embedding/sp

After changing them to a local path, the code works.
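
One way to avoid editing the file on every machine is to read the base directory from the environment. The sketch below is only an illustration of that idea: the variable name TAXOGEN_DATA_DIR and the function resolve_data_dir are hypothetical and do not exist in the repo.

```python
import os


def resolve_data_dir(default="./data"):
    """Return an absolute data directory.

    HYPOTHETICAL: reads the TAXOGEN_DATA_DIR environment variable, which is
    not something the repo defines, and falls back to `default` when unset.
    """
    return os.path.abspath(os.environ.get("TAXOGEN_DATA_DIR", default))


# Build file paths relative to the resolved directory instead of hardcoding them.
data_dir = resolve_data_dir()
papers_path = os.path.join(data_dir, "papers.txt")
```

Running `TAXOGEN_DATA_DIR=/path/to/data python main.py` would then point the code at a local folder, provided paras.py were changed to build its paths this way.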

Running taxogen on custom corpus

What would the process be to run taxogen on a new corpus?

Cluster center or result of ranking function as new child node

Reading the TaxoGen paper, I understood that new child nodes in the taxonomy are created by selecting the most representative term for a topic, using the ranking function that considers popularity and concentration. Reading through main.py and cluster.py, however, it seems to me that child nodes are created by selecting the center index of each cluster.
The center index is found here:
https://github.com/franticnerd/taxogen/blob/master/code/cluster.py#L34
The names of the center indices are returned here:
https://github.com/franticnerd/taxogen/blob/master/code/cluster.py#L70
And here they are used as new child nodes:
https://github.com/franticnerd/taxogen/blob/master/code/main.py#L63

Did I misunderstand the paper or the code? Or am I looking at the wrong version of the code? I would be grateful for any clarification.
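
To make the question concrete, here is a rough sketch of what "selecting the center index" could mean: pick the cluster member whose embedding is most similar to the cluster centroid. This is an assumption for illustration, not code copied from cluster.py, and the names cosine and center_member are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def center_member(members, centroid):
    """Return the index of the member vector most similar to the centroid.

    ASSUMPTION: similarity is cosine similarity; the repo may differ.
    """
    return max(range(len(members)), key=lambda i: cosine(members[i], centroid))


# Made-up 2-d embeddings for three terms in one cluster.
members = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centroid = [0.9, 1.0]
best = center_member(members, centroid)
```

A selection like this depends only on embedding geometry, whereas the ranking function described in the paper also uses popularity and concentration, which is why the two can pick different terms.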

the score of clus_center is smaller than filter_thre, so it doesn't occur in keywords.txt, leading to KeyError in embs[cate_ph]

Dear authors of TaxoGen,
The keyword score of clus_center is smaller than filter_thre, so the cluster center term does not occur in keywords.txt. As a result the embs dictionary has no entry for it, and the lookup embs[cate_ph] raises a KeyError. How can this be solved?
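
Until the authors answer, one possible workaround is to guard the lookup so that a filtered-out center term is skipped instead of crashing the run. The sketch below is not the authors' fix: split_by_embedding is a hypothetical helper, and only the names embs and cate_ph come from the error message above. Skipping a term changes the resulting taxonomy, so treat it as a stopgap.

```python
def split_by_embedding(embs, candidates):
    """Split candidate terms into those with and without an embedding.

    HYPOTHETICAL helper: `embs` is a dict of keyword -> vector, and
    `candidates` are the cluster-center terms about to be looked up.
    """
    kept, skipped = [], []
    for cate_ph in candidates:
        if cate_ph in embs:  # avoids the KeyError from embs[cate_ph]
            kept.append(cate_ph)
        else:
            skipped.append(cate_ph)
    return kept, skipped


# Made-up data: "rare_term" stands in for a term removed by filter_thre.
embs = {"data_mining": [0.1, 0.2], "clustering": [0.3, 0.4]}
kept, skipped = split_by_embedding(embs, ["data_mining", "rare_term"])
```

Logging the skipped terms makes it easy to see how many cluster centers the threshold removes; lowering filter_thre would be the alternative if too many are lost.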

Run script

In run.sh, word2veec should be word2vec:

gcc word2vec.c -o word2veec -lm -pthread -O2 -Wall -funroll-loops -Wno-unused-result

Could you please tell me how I can get the data?

I can't find it anywhere.
