text-essence's Introduction

Comparative corpus linguistics with embeddings

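TextEssence is a tool for comparative corpus linguistics using embeddings. It is described in the NAACL-HLT 2021 paper "TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora" (Newman-Griffis et al., 2021).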

See more at:

Setting up the web interface

(1) Set up the environment. Install Node.js, and set up a Python virtual environment with the necessary dependencies. The following (using Anaconda) should do the trick:

conda create -n textessence python=3
conda activate textessence
pip install -r requirements.txt
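Before moving on, it can help to confirm that the command-line tools used in the later steps are actually on your PATH. The following is a minimal sketch, not part of TextEssence itself; the helper name check_prerequisites is hypothetical, and the list of commands (node, npm, make, git) is simply taken from the commands used elsewhere in this README.

```python
import shutil
import sys


def check_prerequisites(commands=("node", "npm", "make", "git")):
    """Map each prerequisite to True/False depending on whether it is available.

    Hypothetical helper for illustration; TextEssence does not ship this function.
    """
    # The setup instructions create a Python 3 environment.
    status = {"python3": sys.version_info >= (3,)}
    for command in commands:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        status[command] = shutil.which(command) is not None
    return status


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING should be installed before continuing with the remaining steps.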

(2) Download pre-generated data from Zenodo. Go to our Zenodo release and download the files to the machine you'll be running TextEssence on.

Unzip the CORD-19_monthly_embeddings.zip file; this will extract all the pretrained concept embeddings from our case study on CORD-19.

Now you'll need to modify config.ini to link to the files you've downloaded. You'll do this in two steps:

Point to the DB file: Change the DatabaseFile field of PairedNeighborhoodAnalysis to point to the downloaded SQLite DB file. For example, if you downloaded the CORD-19 data into /var/textessence, your config.ini would have:

[PairedNeighborhoodAnalysis]
DatabaseFile = /var/textessence/CORD-19_analysis__2020-03__2020-10.db

Point to the pretrained embeddings: Add a section to config.ini for each of the subcorpora from the CORD-19 analysis, like the following:

[2020-03-27]
ReplicateTemplate = /var/textessence/2020-03-27/2020-03-27_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt
[2020-04-24]
ReplicateTemplate = /var/textessence/2020-04-24/2020-04-24_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt
...

This allows the Compare interface to calculate cosine similarities between embeddings on demand.
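To make the role of these settings concrete, here is a minimal sketch of how a config.ini like the one above could be read and how a cosine similarity between two embedding vectors is computed. It is an illustration under stated assumptions, not TextEssence's own code: the helper names (load_config, replicate_paths, cosine_similarity) are hypothetical, and numbering replicates from 1 is an assumption about how {REPL} is filled in.

```python
import configparser
import math

# Inline copy of part of the example config.ini shown above.
EXAMPLE_CONFIG = """
[PairedNeighborhoodAnalysis]
DatabaseFile = /var/textessence/CORD-19_analysis__2020-03__2020-10.db

[2020-03-27]
ReplicateTemplate = /var/textessence/2020-03-27/2020-03-27_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt
"""


def load_config(text):
    """Parse config.ini-style text (hypothetical helper)."""
    # interpolation=None keeps values such as {REPL} exactly as written.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    return config


def replicate_paths(config, corpus, num_replicates):
    """Expand {REPL} in a subcorpus's ReplicateTemplate.

    Assumption: replicates are numbered 1..num_replicates.
    """
    template = config[corpus]["ReplicateTemplate"]
    return [template.replace("{REPL}", str(i)) for i in range(1, num_replicates + 1)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


config = load_config(EXAMPLE_CONFIG)
print(config["PairedNeighborhoodAnalysis"]["DatabaseFile"])
for path in replicate_paths(config, "2020-03-27", num_replicates=2):
    print(path)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In a real setup you would read config.ini from disk with config.read("config.ini") and check that each expanded path exists before starting the backend.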

(3) Start the Flask backend. The back end of the web interface is implemented in Flask. A make target in makefile simplifies running the interface:

make run_dashboard

(4) Start the Svelte frontend. The front end of the web interface is implemented in Svelte.

First-time setup: Load the visualization module and install its packages, using

git submodule init
git submodule update
cd nearest_neighbors/dashboard/diachronic-concept-viewer
npm install

Main run command: Start the Node server for handling the visualization content, using

npm run autobuild

(5) Start using the system! TextEssence will now be running at http://localhost:5000.
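If the page does not load right away, the backend may still be starting. The sketch below polls a URL until something answers; it uses only the Python standard library, and the helper name wait_for_server and its retry settings are hypothetical choices for illustration rather than anything provided by TextEssence.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, attempts=10, delay=1.0, timeout=2.0):
    """Return True once `url` answers an HTTP request, False if it never does.

    Hypothetical helper for illustration.
    """
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a success status; it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet (or the request timed out); retry.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling wait_for_server("http://localhost:5000") after starting both the Flask backend and the Svelte frontend returns True as soon as the address responds.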

Contact and citation

If you have a question about TextEssence or an issue using the toolkit, please submit a GitHub issue.

If you use TextEssence in your work, please cite the following paper:

@inproceedings{newman-griffis-etal-2021-textessence,
    title = "{T}ext{E}ssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora",
    author = "Newman-Griffis, Denis  and
      Sivaraman, Venkatesh  and
      Perer, Adam  and
      Fosler-Lussier, Eric  and
      Hochheiser, Harry",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-demos.13",
    pages = "106--115",
}

For any other questions, please contact Denis Newman-Griffis at [email protected].

text-essence's People

Contributors

drgriffis, venkatesh-sivaraman

text-essence's Issues

Applying the visualization to other datasets

Hi,

Thank you for this wonderful tool.
I am wondering how I could apply this tool to another dataset to perform a similar kind of analysis.
I have the following data:

Field          Meaning
wos_id         Web of Science UID, labeled "Accession Number" in a Web of Science record (e.g. CCC:000282939200001, WOS:000246155700009)
sortdate       Date of publication
has_abstract
pubtype        Type of publication
pubyear        Year of publication
pubmonth       Month of publication
issue          Issue number of the journal
source         Name of the journal
item           Title of the publication
language       Language of the publication
heading        Research domain
subheading     Research subdomain
doctype        Type of document (Article, Review, etc.)
abstract       Abstract
keywords
keywords_plus
accession_no   Other identifier

Thanks for your help
