nlp_bulk_labelling_app's Introduction

A tool to help you label massive NLP datasets, quickly.

A hosted version of the web app is available here: https://pliploop-nlp-bulk-labelling-app-main-app-qaetiy.streamlitapp.com/

This project has been tested with Python 3.9.

Installation

We highly recommend creating a virtual environment before installing any libraries.

First, clone the repo and move into it:

git clone git@github.com:Pliploop/NLP_Bulk_labelling_app.git && cd NLP_Bulk_labelling_app/

To install all the dependencies for the project, as well as the sample datasets and necessary models, run:

make install

This installs a few sample datasets and a couple of models, along with the required Python packages. It also creates the cache folder in which plotting data will be stored. If you simply want to run the project without the models (choosing a model that is not installed will raise an error), create a virtual environment and run:

pip install -r requirements.txt

At this point, there might be an installation error concerning the gensim library, which requires Microsoft Visual C++ 14.0+ to function. It can be downloaded and installed from the following link: https://visualstudio.microsoft.com/fr/visual-cpp-build-tools/.

If you did not have Microsoft Visual C++ 14.0 installed, reattempt the installation once it is in place. Otherwise, you're good to go!

Visualizing your first dataset

To launch the app, run either of the following:

make app

streamlit run main_app.py

This will bring you to the landing page of the app.

Selecting your dataset

You can either choose one of the sample datasets downloaded with make install, or upload your own custom dataset via the file uploader. If you wish, you can name the uploaded dataset so that it is saved into the cached datasets folder for later use.
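As a rough illustration of what saving a named dataset for later use could look like, here is a minimal sketch using only the standard library. The function names, the CSV format and the folder layout are assumptions made for this example and are not taken from the app's code.

```python
import csv
import tempfile
from pathlib import Path


def save_named_dataset(rows, name, cache_dir):
    """Write rows (a non-empty list of dicts) to <cache_dir>/<name>.csv and return the path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def list_cached_datasets(cache_dir):
    """Return the names of the datasets previously saved in cache_dir."""
    return sorted(p.stem for p in Path(cache_dir).glob("*.csv"))


# Hypothetical usage: a temporary folder stands in for the cached datasets folder.
with tempfile.TemporaryDirectory() as demo_dir:
    save_named_dataset([{"id": "1", "text": "great product"}], "reviews", demo_dir)
    print(list_cached_datasets(demo_dir))
```

Naming the file after the dataset keeps the list of cached datasets easy to rebuild from the folder contents alone.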

Selecting the column you want to cluster

Next, select the column you want to label from the dropdown menu in the expander that follows.

(Optional) Select the model to be used to compute embeddings

By default, the app computes embeddings with the byte pair language model, a simplistic but rather versatile encoding framework. However, many more models are available with various strengths and weaknesses, and you are free to choose the one you deem most appropriate for your dataset, or play around with them until you get satisfactory results.
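To give a rough feel for what computing an embedding means, here is a minimal sketch that turns a text into a fixed-size vector by hashing character trigrams. This is a toy stand-in and not the byte pair model the app uses; the function names and the 16-dimension size are assumptions made for illustration.

```python
import hashlib
import math


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (a crude stand-in for subword units)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


def embed(text, dim=16, n=3):
    """Hash each n-gram into one of `dim` buckets, then L2-normalise the counts."""
    vector = [0.0] * dim
    for gram in char_ngrams(text, n):
        # md5 is used only as a stable hash, so results are identical across runs.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical usage: compare two short texts.
print(cosine(embed("refund please"), embed("please refund")))
```

Real embedding models learn their vectors from data, which is why swapping models in the app can change the cluster separation so much.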

(Optional) Select the dimension reduction framework to use

To visualize many-featured data, dimension reduction is required. Three models are provided for this. t-SNE is chosen by default but, again, you are free to experiment until you get optimal cluster separation.
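As an illustration of dimension reduction, the sketch below projects many-featured vectors down to 2D with PCA computed by power iteration. PCA is used here only because it fits in a few lines of plain Python; it is not TSNE and is not taken from the app's code, and project_2d and its helpers are hypothetical names.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iterations=200):
    """Estimate the leading principal direction by power iteration on X^T X."""
    d = len(rows[0])
    # Fixed, non-uniform start vector: deterministic, but a start that happens
    # to be orthogonal to the leading direction would not converge to it.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def project_2d(rows):
    """Project rows onto their first two principal directions."""
    data = center(rows)
    coords = []
    for _ in range(2):
        v = top_component(data)
        scores = [sum(r[j] * v[j] for j in range(len(v))) for r in data]
        coords.append(scores)
        # Deflate: remove the direction just found before looking for the next one.
        data = [[r[j] - s * v[j] for j in range(len(v))] for r, s in zip(data, scores)]
    return list(zip(coords[0], coords[1]))


# Hypothetical usage: five 3-feature points reduced to 2D coordinates.
print(project_2d([[0, 0, 0], [1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]))
```

Unlike PCA, methods such as TSNE are non-linear and can give a different layout from one run to the next, which is one reason to experiment with the available options.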

Compute embeddings

Once these steps are done, press the compute embeddings button and wait for your graph to display. This usually takes a short time, but results may vary. Next up, actual labelling and cluster suggestion!

Actually labelling

How to label clusters

Now that you have a nice visualization of your data and you're (hopefully) happy with the cluster separation, it's time to do some actual labelling. Use the lasso tool built into the UI to select a cluster of points you deem worthy of a label, then enter that label in the text input below. If the "show labeled data" checkbox is unchecked, the cluster disappears from the visualization once you validate the label. To keep it visible, check that checkbox.
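The labelling step described above can be sketched as follows. This is a minimal illustration under the assumption that labels are kept as one entry per data point, with None meaning unlabeled; apply_label and visible_points are hypothetical names, not the app's functions.

```python
def apply_label(labels, selected_indices, label):
    """Return a copy of `labels` with `label` assigned to every selected point."""
    updated = list(labels)
    for index in selected_indices:
        updated[index] = label
    return updated


def visible_points(labels, show_labeled):
    """Indices to draw: all points if show_labeled, otherwise only unlabeled ones."""
    return [i for i, lab in enumerate(labels) if show_labeled or lab is None]


# Hypothetical usage: a lasso selection picked points 1 and 3.
labels = apply_label([None] * 5, [1, 3], "complaint")
print(visible_points(labels, show_labeled=False))  # prints [0, 2, 4]
```

Hiding labeled points by default keeps the plot focused on what is still left to label.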

Say that you'd like to reset your labelling progress for the whole dataset, or maybe just for one cluster. You can do this by clicking the "clear label" button in the sidebar. Because some of the computations may be stochastic, you can also click the "clear cache" button to discard the cached results and recompute the visualization.
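Both reset actions can be pictured with a small sketch. It assumes labels are a list with None for unlabeled points and that computed layouts are memoised in a dictionary; these structures and all the names below are assumptions for illustration, not the app's internals.

```python
def clear_labels(labels, only=None):
    """Reset every label, or only the points currently carrying the label `only`."""
    return [None if only is None or lab == only else lab for lab in labels]


# Hypothetical in-memory cache for computed layouts, keyed by their settings.
_layout_cache = {}


def cached_layout(key, compute):
    """Return the cached result for `key`, calling `compute` only on a cache miss."""
    if key not in _layout_cache:
        _layout_cache[key] = compute()
    return _layout_cache[key]


def clear_cache():
    """Forget all cached layouts so the next request recomputes them."""
    _layout_cache.clear()


# Hypothetical usage: clear a single label and keep the others.
print(clear_labels(["spam", "ham", None, "spam"], only="spam"))  # prints [None, 'ham', None, None]
```

With a stochastic method, clearing the cache is what lets a recomputation produce a different layout instead of returning the stored one.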

Cluster suggestion

If you're kind of lazy and don't really feel like hovering over a vast number of points to figure out the structure of your data, we have a small feature that might help: you can ask the app to suggest clusters for you. Choose an unsupervised clustering algorithm, tweak its parameters until you get a satisfactory result, and you're set to label. In the future, we will be implementing more algorithms, as well as a potential cluster validation feature that will let you directly label a cluster with the label suggested by a TF-IDF algorithm.
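Since the TF-IDF label suggestion is described as a future feature, the following is only a sketch of how such a suggestion could work, not the app's implementation. It treats each cluster as one document, scores terms with a smoothed TF-IDF, and proposes the top term as the label; suggest_labels and tokenize are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def suggest_labels(texts, cluster_ids):
    """Return {cluster_id: top TF-IDF term}, treating each cluster as one document."""
    term_counts = {}
    for text, cid in zip(texts, cluster_ids):
        term_counts.setdefault(cid, Counter()).update(tokenize(text))

    n_clusters = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    suggestions = {}
    for cid, counts in term_counts.items():
        if not counts:
            suggestions[cid] = None  # No usable tokens in this cluster.
            continue
        total = sum(counts.values())

        def score(term):
            tf = counts[term] / total
            idf = math.log((1 + n_clusters) / (1 + doc_freq[term])) + 1.0
            return tf * idf

        # Highest score wins; ties are broken alphabetically for determinism.
        suggestions[cid] = min(counts, key=lambda t: (-score(t), t))
    return suggestions


# Hypothetical usage: two clusters of short texts.
texts = ["refund my order", "refund not received", "app crashes on login", "login page crashes"]
print(suggest_labels(texts, [0, 0, 1, 1]))  # prints {0: 'refund', 1: 'crashes'}
```

Terms shared by many clusters get a lower weight, so the suggested label tends to be a word that is specific to its cluster.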
