ws4j's Introduction

WordNet Similarity for Java

WS4J provides a pure Java API for several published semantic relatedness/similarity algorithms for, in theory, any WordNet instance. You can immediately use WS4J on Princeton's English WordNet 3.0 lexical database through MIT Java WordNet Interface 2.4.0, which is the fastest Java library for interfacing with WordNet.

The codebase is mostly a Java re-implementation of WordNet::Similarity written in Perl, using the same data files as seen in src/main/resources, with some test cases for verifying the same logic. WS4J is designed to be thread-safe.

Relatedness/Similarity Algorithms

The semantic relatedness/similarity metrics available are:

  • HSO: Hirst & St-Onge, 1998 - The Hirst & St-Onge measure is based on the idea that two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that "does not change direction too often":

HSO(s1, s2) = const_C - path_length(s1, s2) - const_k * num_of_changes_of_directions(s1, s2);

  • LCH: Leacock & Chodorow, 1998 - The Leacock & Chodorow measure relies on the length of the shortest path between two synsets for their measure of similarity:

LCH(s1, s2) = -Math.log_e(LCS(s1, s2).length / (2 * max_depth(pos)));

  • LESK: Banerjee & Pedersen, 2002 - Lesk (1985) proposed that the relatedness of two words is proportional to the extent of the overlaps in their dictionary definitions. This measure is based on the adapted Lesk of Banerjee and Pedersen (2002), which extends this notion to use WordNet as the dictionary for the word definitions:

LESK(s1, s2) = sum_{s1' in linked(s1), s2' in linked(s2)}(overlap(s1'.definition, s2'.definition));

  • WUP: Wu & Palmer, 1994 - The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS:

WUP(s1, s2) = 2 * dLCS.depth / (min_{dlcs in dLCS}(s1.depth - dlcs.depth) + min_{dlcs in dLCS}(s2.depth - dlcs.depth)), where dLCS(s1, s2) = argmax_{lcs in LCS(s1, s2)}(lcs.depth);

  • RES: Resnik, 1995 - Resnik defined the similarity between two synsets to be the information content of their lowest super-ordinate (most specific common subsumer):

RES(s1, s2) = IC(LCS(s1, s2));

  • PATH - The Path measure computes the semantic relatedness of word senses by counting the number of nodes along the shortest path between the senses in the 'is-a' hierarchies of WordNet:

PATH(s1, s2) = 1 / path_length(s1, s2);

  • JCN: Jiang & Conrath, 1997 - The Jiang & Conrath measure uses the notion of information content but in the form of the conditional probability of encountering an instance of a child synset given an instance of a parent synset:

JCN(s1, s2) = 1 / jcn_distance(s1, s2), where jcn_distance(s1, s2) = IC(s1) + IC(s2) - 2 * IC(LCS(s1, s2)); when it is 0, jcn_distance(s1, s2) = -Math.log_e((freq(LCS(s1, s2).root) - 0.01) / freq(LCS(s1, s2).root)), so that the distance is non-zero and the similarity does not become infinite;

  • LIN: Lin, 1998 - The Lin measure idea is similar to JCN with a small modification:

LIN(s1, s2) = 2 * IC(LCS(s1, s2)) / (IC(s1) + IC(s2)).

The descriptions above are extracted either from each paper or from WordNet-Similarity CPAN documentation.
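
As a rough illustration of how the closed-form measures above combine their inputs, the following sketch evaluates PATH, LCH, HSO, RES, JCN and LIN from values that are assumed to be already computed (path lengths, depths, information content). The class name MeasureFormulas and all method signatures are hypothetical and are not part of the WS4J API; WUP and LESK are left out because they need the taxonomy and the glosses themselves.

```java
/**
 * Illustrative sketch only: hypothetical helper, not a WS4J class.
 * Every input (path length, IC values, ...) is assumed to be precomputed.
 */
final class MeasureFormulas {

    private MeasureFormulas() {}

    /** PATH = 1 / path_length, with path_length counted in nodes (at least 1). */
    static double path(int pathLength) {
        return 1.0 / pathLength;
    }

    /** LCH = -ln(length / (2 * max_depth)). */
    static double lch(int pathLength, int maxDepth) {
        return -Math.log(pathLength / (2.0 * maxDepth));
    }

    /** HSO = C - path_length - k * number_of_direction_changes. */
    static double hso(double c, double k, int pathLength, int directionChanges) {
        return c - pathLength - k * directionChanges;
    }

    /** RES = IC(LCS). */
    static double res(double icLcs) {
        return icLcs;
    }

    /**
     * JCN = 1 / distance, with distance = IC(s1) + IC(s2) - 2 * IC(LCS).
     * A zero distance is replaced by -ln((rootFreq - 0.01) / rootFreq),
     * which keeps the result finite.
     */
    static double jcn(double ic1, double ic2, double icLcs, double rootFreq) {
        double distance = ic1 + ic2 - 2.0 * icLcs;
        if (distance == 0.0) {
            distance = -Math.log((rootFreq - 0.01) / rootFreq);
        }
        return 1.0 / distance;
    }

    /** LIN = 2 * IC(LCS) / (IC(s1) + IC(s2)). */
    static double lin(double ic1, double ic2, double icLcs) {
        return 2.0 * icLcs / (ic1 + ic2);
    }
}
```

For instance, MeasureFormulas.lin(4.0, 6.0, 3.0) divides 2 * 3.0 by 4.0 + 6.0, and MeasureFormulas.jcn(3.0, 3.0, 3.0, 1000.0) exercises the zero-distance fallback. The sample numbers are made up for illustration and do not come from WordNet.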

Prerequisites

By default, the requirements for compilation are:

  • JDK 8+
  • Maven

Any WordNet instance can be used in WS4J if it implements the ILexicalDatabase interface.

Built with Maven

To create a jar file with dependencies including resource files:

$ mvn install assembly:single

Using WS4J

Then start playing with the facade WS4J API:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/WS4J.java

and a simple demo class:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/demo/SimilarityCalculationDemo.java

which can be run through jar-with-dependencies from the root folder by typing into the terminal:

$ java -jar target/ws4j-1.0.2-jar-with-dependencies.jar

To use the WS4J jar package from another project, add the JitPack repository to your POM file:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

and declare this GitHub repo as a dependency:

<dependencies>
    <dependency>
        <groupId>com.github.dmeoli</groupId>
        <artifactId>WS4J</artifactId>
        <version>x.y.z</version>
    </dependency>
</dependencies>

Running the tests

To run JUnit test cases:

$ mvn test

The expected results from the test cases are compatible with the original WordNet::Similarity.

Initial Work

The original author is Hideki Shima.

License

This software is released under GNU GPL v3 License. See the LICENSE file for details.

ws4j's People

Contributors

dependabot[bot], dmeoli, pjhenning

ws4j's Issues

Potential for uncaught null exception

Hello, it is possible to construct a Concept value, via the first constructor, which gives an object with no value for the pos field:

public Concept(String synsetID) {
    this.synsetID = synsetID;
}

but code elsewhere that operates on Concept objects will assume that the pos field has been populated:

if (concept1.getPOS().equals(POS.NOUN)) maxDepth = 20;

which can result in a runtime exception.
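
One possible way to guard against this is sketched below. It assumes that pos is an enum, in which case comparing with == is safe even when the field is null. PartOfSpeech, ConceptSketch and PosChecks are hypothetical stand-ins written for this illustration, not the actual WS4J classes.

```java
import java.util.Objects;

/** Hypothetical stand-in for the part-of-speech enum. */
enum PartOfSpeech { NOUN, VERB, ADJECTIVE, ADVERB }

/** Hypothetical stand-in for Concept: pos may legitimately be null. */
final class ConceptSketch {
    private final String synsetID;
    private final PartOfSpeech pos;

    /** Mirrors the single-argument constructor: pos is left unset. */
    ConceptSketch(String synsetID) {
        this(synsetID, null);
    }

    ConceptSketch(String synsetID, PartOfSpeech pos) {
        this.synsetID = Objects.requireNonNull(synsetID, "synsetID");
        this.pos = pos;
    }

    String getSynsetID() {
        return synsetID;
    }

    PartOfSpeech getPOS() {
        return pos;
    }
}

final class PosChecks {
    private PosChecks() {}

    /** Null-safe: == on enum constants never dereferences the value. */
    static boolean isNoun(ConceptSketch concept) {
        return concept.getPOS() == PartOfSpeech.NOUN;
    }
}
```

Writing the comparison as POS.NOUN.equals(concept1.getPOS()) would avoid the exception in the same way. Rejecting a missing pos in the constructor is another option, at the cost of changing the constructor's contract.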

Illegal argument exception

I found a bug when I tried WS4J with the tokens monitor-patient: I can't run the demo using those tokens. It works on your web demo, but it fails with an error in Eclipse.

Error in similarity calculation

Hi @DonatoMeoli
Thank you for your code!
I'm using the class SimilarityCalculationDemo and I noticed it returns 0 in all similarity algorithms for some words. For example, I've tested with (car, vehicle) and (cancer, disease).

I guess the problem is at
https://github.com/DonatoMeoli/WS4J/blob/008a427123a9f25106ea81495eb256976658dd1f/src/main/java/edu/uniba/di/lacam/kdde/ws4j/util/WordSimilarityCalculator.java#L110

I think it is necessary to add the following code:
double score = relatedness.getScore();
if (score > maxScore) maxScore = score;

Do you think it is correct?
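
The proposed fix amounts to tracking a running maximum over all pairs of senses. A minimal, self-contained sketch of that idea follows; MaxScoreSketch and its generic scorer parameter are hypothetical and do not reproduce the actual WordSimilarityCalculator code. The starting value of 0.0 is an assumption chosen to match the "returns 0" symptom described above.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper, not a WS4J class: best score over all sense pairs. */
final class MaxScoreSketch {

    private MaxScoreSketch() {}

    static <T> double maxScore(List<T> senses1, List<T> senses2,
                               ToDoubleBiFunction<T, T> scorer) {
        double maxScore = 0.0; // assumed default when no pair scores higher
        for (T s1 : senses1) {
            for (T s2 : senses2) {
                double score = scorer.applyAsDouble(s1, s2);
                // Without this update the default value is always returned.
                if (score > maxScore) {
                    maxScore = score;
                }
            }
        }
        return maxScore;
    }
}
```

If the update inside the inner loop is missing, the method returns the initial value for every pair of words, which would be consistent with the behaviour reported here.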
