
A Python sentence-clustering library based on S-BERT and a variety of clustering methods. It establishes the best number of clusters for each algorithm via internal validation and selects the best algorithm via external validation.

Topics: clustering machine-learning natural-language-processing python sentence-embeddings sentence-transformer topic-modeling wordcloud

NLP-clustering for Recommendation

Introduction

This library focuses on sentence clustering, specifically clustering of recommendations for telecommunication networks, but it can be applied to any kind of sentences. It is based on the bachelor research "Textual Clustering for Telecommunication Accident Recommendations". The recommendations used in this specific case were extracted during workshops for accident analysis in the research of Wienen et al. The library focuses on finding the best unsupervised text clustering method together with the best number of clusters: first, internal validation scores are used to find the optimal number of clusters for each algorithm; second, external validation indices are used to select the best of all trained models. It also allows visual comparison of the different algorithms based on dimensionality reduction.

How it works

The following methodology is applied:

  1. Clean the sentences by removing punctuation and stop words.
  2. Quantify the textual sentences into a numerical multi-dimensional matrix (embedding), where each row represents an original sentence and each column a feature value produced by the embedding method. S-BERT is used to create the embedding.
  3. As mentioned before, the produced embedding matrix has a large number of dimensions. To speed up processing and simplify the data, the dimensionality reduction tool UMAP is used to transform the multi-dimensional matrix into two-dimensional space.
  4. After simplifying the matrix, the clustering can be conducted with a variety of machine learning methods (a sketch of the full pipeline follows this list). The following clustering methods are available in this library:
    • Manual-K:
      • K-Means
      • Agglomerative Clustering
      • Spectral Clustering
    • Auto-K:
      • Affinity Propagation
      • Mean Shift
      • HDBSCAN
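
The sketch below walks through steps 1-4 end to end. It is a minimal, illustrative version of the pipeline, not this library's exact API: the S-BERT model name, the variable names, and the choice of K are assumptions, and it requires the sentence-transformers, umap-learn, scikit-learn, and nltk packages.

```python
# Minimal sketch of steps 1-4 (illustrative; not this library's exact API).
import string

from nltk.corpus import stopwords          # needs nltk.download("stopwords") once
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap                                # package: umap-learn

STOP = set(stopwords.words("english"))

def clean_sentence(text: str) -> str:
    """Step 1: remove punctuation and stop words."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w.lower() not in STOP)

sentences = [...]  # your corpus of recommendation sentences
clean = [clean_sentence(s) for s in sentences]

# Step 2: embed the cleaned sentences with an S-BERT model
# (one row per sentence, one column per embedding dimension).
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(clean)

# Step 3: reduce the high-dimensional embeddings to 2-D with UMAP.
reduced = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# Step 4: cluster the reduced points (here K-Means with an example K).
labels = KMeans(n_clusters=2, random_state=42).fit_predict(reduced)
```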

Manual-K algorithms distinguish themselves from auto-K algorithms in that manual-K requires the optimal number of clusters as a parameter, while auto-K methods do not: they determine the number of clusters through their own internal calculations. Therefore, to find the optimal number of clusters K for the manual-K methods, the KFinder object is used. It iterates through a given range and calculates internal validation indices for each iteration to indicate which K performed best.
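
Conceptually, such a search looks like the sketch below. This is not the repo's KFinder implementation, just an illustration using internal validation indices from scikit-learn, where `reduced` is the 2-D matrix from the pipeline sketch above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

def find_k(points, k_range=range(2, 11)):
    """Score K-Means for each K; the user picks K from the scores."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, random_state=42).fit_predict(points)
        scores[k] = {
            "silhouette": silhouette_score(points, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(points, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(points, labels),        # lower is better
        }
    return scores
```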

All the aforementioned steps 1-4 are processed in the main.py file. There are further Jupyter notebooks for label similarity (accuary.ipynb), for internal validation methods to find the optimal number of clusters for a specific model (internal_validation.ipynb), and for cluster visualization and word cloud generation for each cluster (internal_validation.ipynb).

Usage

To obtain the labeled data from the clustering models, use main.py. It includes all the important data pipelines to get from plain text to text with predicted labels.

Example with the K-Means algorithm

To obtain the labels generated by K-Means, you first have to find the optimal number of clusters K for K-Means. This can be done with the KFinder object, either in main.py or in internal_validation.ipynb; the call has the form transformer_kmeans_pipeline_find_k(clean), with the cleaned sentences as the only parameter. Running this method returns the internal validation index scores over the given range of iterations; the user then has to choose K based on these scores.

After finding the optimal K (potentially also multiple K's with similar internal validation scores), train K-Means and obtain its labels with labels = transformer_kmeans_pipeline(clean, K) in main.py, and save an Excel file with the original text and the corresponding labels (see the sketch below). The required parameters depend on the algorithm used: manual-K methods like K-Means take the cleaned data and the number of clusters K, while auto-K methods take only the cleaned data.
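
Put together, a session could look like the following sketch. The pipeline function names are the ones described above; importing them from main and the exact return types are assumptions, and writing Excel files with pandas requires openpyxl.

```python
import pandas as pd

# Assumption: the pipeline functions are importable from main.py.
from main import (transformer_kmeans_pipeline,
                  transformer_kmeans_pipeline_find_k)

# 1. Inspect the internal validation scores over the searched range of K.
transformer_kmeans_pipeline_find_k(clean)   # `clean` = cleaned sentences

# 2. Choose K from the scores (K = 5 is only an example) and train K-Means.
labels = transformer_kmeans_pipeline(clean, 5)

# 3. Save the original text together with the predicted cluster labels.
pd.DataFrame({"text": sentences, "label": labels}).to_excel("labels.xlsx", index=False)
```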

Once the K-Means model is trained and has produced clusters, these will probably work sufficiently well, but of course more clustering algorithms are available. To switch to a different algorithm, the aforementioned methods can also be used with other algorithms, e.g. transformer_spectral_pipeline_find_k(clean) or transformer_agglomerative_pipeline_find_k(clean) for finding the optimal K, and transformer_spectral_pipeline(clean, K) or transformer_agglomerative_pipeline(clean, K) for producing the actual clusters. Check out the notebooks for validation and visualization methods.
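
Switching algorithms keeps the same pattern, for example (again assuming the functions are importable from main):

```python
from main import (transformer_spectral_pipeline,
                  transformer_spectral_pipeline_find_k)

transformer_spectral_pipeline_find_k(clean)       # inspect scores, choose K
labels = transformer_spectral_pipeline(clean, K)  # manual-K: pass the chosen K
```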
