Similaripy

๐Ÿ“ Approach for a better clustering built in HackNLP

What we wanted to do

Our approach was to create an N-dimensional index representing the similarity of texts and create clusters from it.

  • Take the matrix from Java
  • Create N-dimensional index
  • Write an algorithm that builds clusters
  • Implement a 3D representation
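As a rough illustration of the first step, the sketch below builds a pairwise similarity matrix in plain Python, assuming the scores are cosine similarities between text vectors. The function names and the toy vectors are hypothetical and are not taken from the project, where the matrix came from the Java side.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return a symmetric n x n list of lists with the score of every pair."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical toy vectors standing in for text embeddings.
toy_vectors = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
toy_matrix = similarity_matrix(toy_vectors)
```

The first two toy vectors point in the same direction, so their score is close to 1, while the third is orthogonal to both and scores 0.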

What we have done

We took the similarity scores of all possible pairs of vector representations of several texts (requirements) provided by the ESSI University group project; the scores are computed using cosine distance. Treating those scores as a matrix, we built an index with NMSLIB (source: https://github.com/nmslib/nmslib) and implemented a clustering algorithm that applies a threshold and selects a number of neighbours.
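One way to read "thresholding and selecting a number of neighbours" is sketched below. It is not the project's actual code: it scans the score matrix by brute force instead of querying an NMSLIB index, and the function name, the default values of `threshold` and `max_neighbours`, and the toy matrix are all assumptions made for illustration. Each item keeps links only to its closest neighbours that score at or above the threshold, and each connected group of linked items becomes one cluster.

```python
def cluster_by_threshold(matrix, threshold=0.8, max_neighbours=2):
    """Group items whose strongest similarity links pass a threshold.

    For every item, look at its `max_neighbours` most similar other items,
    keep the links scoring at least `threshold`, and treat each connected
    component of the resulting graph as one cluster.
    """
    n = len(matrix)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the smaller index as the root so the output is stable.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        ranked = sorted(others, key=lambda j: matrix[i][j], reverse=True)
        for j in ranked[:max_neighbours]:
            if matrix[i][j] >= threshold:
                union(i, j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical 4 x 4 similarity matrix: items 0 and 1 are close, as are 2 and 3.
toy_scores = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.85],
    [0.2, 0.1, 0.85, 1.0],
]
toy_clusters = cluster_by_threshold(toy_scores, threshold=0.8, max_neighbours=2)
print(toy_clusters)  # [[0, 1], [2, 3]]
```

Raising the threshold splits clusters apart, and lowering it or allowing more neighbours merges them, so both parameters are worth tuning together.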

Challenges we ran into

We did not have much time to develop our ideas. The brainstorming was a little rushed, which we are not used to. Moreover, the dataset and the method for validating our model were somewhat difficult to deal with.

What we learned

We had never used nmslib or written a clustering algorithm before, so almost everything we did was new to us.

What's next for Similaripy

Rethink how the model's accuracy is computed and experiment with several parameters to get the best result. We could also try other ways of computing the distance and similarity score instead of cosine distance.
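One candidate way to score a clustering against known duplicates is pairwise precision and recall, sketched below. This is only an assumption about what a revised evaluation could look like, not what `src.eval` does; the function names, the input shapes, and the toy requirement ids are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return every unordered pair of items that share a cluster."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, true_duplicate_pairs):
    """Compare predicted co-clustered pairs with labelled duplicate pairs."""
    predicted = co_clustered_pairs(predicted_clusters)
    truth = {tuple(sorted(pair)) for pair in true_duplicate_pairs}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical requirement ids: one predicted cluster of three and a singleton.
toy_predicted = [["r1", "r2", "r3"], ["r4"]]
toy_truth = [("r1", "r2"), ("r3", "r4")]
precision, recall, f1 = pairwise_scores(toy_predicted, toy_truth)
```

Scoring pairs rather than whole clusters penalises both over-merging (low precision) and over-splitting (low recall), which makes it easier to see how a change in threshold or neighbour count affects the result.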

Usage

Build model

python3 -m src.scripts.build_api_model data/input_buildModel_duplicates.json

Get matrix

python3 -m src.scripts.get_matrix data/input_computeClusters_duplicates.json data/score_matrix.json data/mapping.json

Build index

python3 -m src.build data

Find clusters

python3 -m src.find_clusters data 

Eval

python3 -m src.eval data/input_computeClusters_duplicates.json data/clusters.json

License

MIT © Similaripy

Contributors

adriacabeza, albertsuarez
