CISTEM

CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. This repository contains official implementations in a variety of programming languages. At the moment, the following languages are available:

  • Python
  • Java
  • C++
  • C
  • JavaScript
  • Go
  • Haskell
  • Perl
  • Swift

The code for each language includes a method for stemming and a method for segmentation; the latter returns both the stem and the stripped suffix.

Performance

We performed a comparative analysis of six publicly available German stemmers, where CISTEM achieved the best results for f-measure and state-of-the-art results for runtime.

Gold standards

The gold_standards folder contains the two gold standards we used for evaluation. Each file is a UTF-8 text file in which each line contains all the stems of one cluster, separated by single spaces. Note that we do not supply a reference stem for each cluster, because we measure stemming performance as the ability to group words with the same meaning, which is more relevant for information retrieval than the exact form of the stem. If you use these gold standards in your own research, please cite our paper, listed below.
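As a minimal sketch of how a file in this format could be read, assuming only what is described above (one cluster per line, members separated by spaces): the function name `parse_clusters` and the sample words are hypothetical and are not part of the repository.

```python
from typing import Iterable, List


def parse_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Turn gold-standard lines into a list of clusters (lists of words)."""
    clusters = []
    for line in lines:
        members = line.split()  # split on whitespace; also drops the newline
        if members:             # ignore empty lines
            clusters.append(members)
    return clusters


# Made-up sample lines in the described format, not taken from the real files.
sample = ["laufen lauf laufend\n", "haus hauses\n"]
print(parse_clusters(sample))
```

For a real file, the same function could be given a file object opened with `open(path, encoding="utf-8")`, since iterating over a text file yields its lines.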

More information on how we evaluated runtimes and stemming quality can be found in our paper:

Leonie Weißweiler, Alexander Fraser (2017). Developing a Stemmer for German Based on a Comparative Analysis of Publicly Available Stemmers. In Proceedings of the German Society for Computational Linguistics and Language Technology (GSCL), to appear.

Contributors

fkoehne, fohlen, jannikbecker, leonieweissweiler

cistem's Issues

WebAssembly?

Looking for complete stemmer implementations for German, I stumbled upon this. Why not add a WebAssembly export? I know it's very experimental, but it would open up a lot of platforms already. ;) (pretty please)

Rust translation

I made a Rust translation of CISTEM for a project of mine: cistemrs. Probably you don't want to merge it into the repo, but maybe it's useful for other people using the language.

This issue is more of a comment and can be closed immediately. Thanks for the stemmer! It works very well for information retrieval. :)

Potential Unicode normalisation problem

If the input word is not normalised to NFC, the following parts will not work:

  • transformation of äöü to aou
  • length() calculation
  • substr($original, - $rest_length)

Supporting input in arbitrary normalisation forms would be complicated.
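The effect can be illustrated outside of Perl as well. The following is a small Python sketch, not code from the repository: it shows that a decomposed (NFD) umlaut is not found by a replacement that looks for the precomposed character, that the length differs, and that normalising to NFC first avoids both problems. The helper name `to_nfc` is hypothetical; `unicodedata.normalize` is part of the Python standard library.

```python
import unicodedata


def to_nfc(word: str) -> str:
    """Normalise a word to NFC before any further processing."""
    return unicodedata.normalize("NFC", word)


nfc = "w\u00e4sser"   # precomposed ä (U+00E4): 6 code points
nfd = "wa\u0308sser"  # a + combining diaeresis (U+0308): 7 code points

print(len(nfc), len(nfd))                  # the lengths differ
print(nfd.replace("\u00e4", "a") == nfd)   # True: the precomposed ä is not found
print(to_nfc(nfd).replace("\u00e4", "a"))  # after NFC the replacement works
```

Normalising once at the entry point keeps the rest of the code working on a single form.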

Methods stem and segment should return the same stem

stem() applies the following transformations, which segment() does not:

  • removal of the prefix ge-
  • transformation of äöü to aou
  • transformation of ß to ss

Example:

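#!perl
# Compare the output of Cistem::stem and Cistem::segment for the same word.
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals (ä)

# Encode output as UTF-8 so that umlauts are printed correctly.
binmode(STDOUT,":encoding(UTF-8)");
binmode(STDERR,":encoding(UTF-8)");

# Load the Perl implementation from the repository checkout.
use lib qw(../CISTEM);

use Cistem;

my @words = qw/geheilwässert/;

for my $word (@words) {
  # Call stem() with the second argument set to 0 and then to 1.
  for my $case_sensitive (0..1) {
    print 'Cistem::stem(',$word,',',$case_sensitive,'): ',
    Cistem::stem($word,$case_sensitive),"\n";
  }

  # Call segment() the same way and join the returned parts with '-'.
  for my $case_sensitive (0..1) {
    print 'Cistem::segment(',$word,',',$case_sensitive,'): ',
    join('-',Cistem::segment($word,$case_sensitive)),"\n";
  }
}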

Which results in:

~/github/perl/CISTEM-test$ perl cistem.t
Cistem::stem(geheilwässert,0): heilwass
Cistem::stem(geheilwässert,1): heilwass
Cistem::segment(geheilwässert,0): geheilwäss-ert
Cistem::segment(geheilwässert,1): geheilwäss-ert

I would expect the same segmentation:

ge-heilwass-ert

This would also allow sharing most of the code.
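One way the shared part could look is sketched below in Python. This is not the repository's code: the function name `normalise`, the returned (prefix, rest) pair, and the rule that ge- is only split off when at least four characters remain are all assumptions made for illustration. Suffix stripping is left out entirely.

```python
import re
from typing import Tuple

# Assumed rule for this sketch: split off ge- only if 4+ characters remain.
_GE_PREFIX = re.compile(r"^ge(.{4,})$")

# Precomposed (NFC) characters, written as escapes to avoid encoding ambiguity.
_REPLACEMENTS = (
    ("\u00e4", "a"),   # ä
    ("\u00f6", "o"),   # ö
    ("\u00fc", "u"),   # ü
    ("\u00df", "ss"),  # ß
)


def normalise(word: str) -> Tuple[str, str]:
    """Return (prefix, rest) after the transformations listed in this issue."""
    word = word.lower()
    for source, target in _REPLACEMENTS:
        word = word.replace(source, target)
    match = _GE_PREFIX.match(word)
    if match:
        return "ge", match.group(1)
    return "", word


print(normalise("geheilw\u00e4ssert"))
```

If both a stemming and a segmentation function started from the same `normalise` result, the stem part of their outputs could not diverge, and only the segmentation function would need to keep the prefix and suffix.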

Typo in "cistem.cpp"

Hi!

I noticed that the first line of "cistem.cpp" says

#include "Cistem.hpp"

I assume it should be

#include "cistem.hpp"

Best,
Florian

Removal of the prefix "ge" is too aggressive

Thank you for the useful stemmer. I have stumbled upon an issue:
There should be some exceptions for removing the prefix "ge".
E.g.
"Geschlecht" is stemmed to "schlecht"
"Gesellschaft" is stemmed to "sellschaft"
"gesamt" is stemmed to "samt"
"genau" is stemmed to "nau"
etc.
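A possible shape for such exceptions, sketched in Python: the names `GE_EXCEPTIONS` and `strip_ge` are hypothetical, the prefix rule is deliberately naive, and the exception set contains only the words listed above. It is an illustration of the idea, not a proposal for the exact rule CISTEM should use.

```python
# Words from this issue whose leading "ge" is not a removable prefix.
GE_EXCEPTIONS = {"geschlecht", "gesellschaft", "gesamt", "genau"}


def strip_ge(word: str) -> str:
    """Lowercase the word and drop a leading "ge" unless the word is an exception."""
    lowered = word.lower()
    if lowered in GE_EXCEPTIONS:
        return lowered          # keep the full word
    if lowered.startswith("ge") and len(lowered) > 2:
        return lowered[2:]      # naive rule: drop the first two letters
    return lowered


for example in ("Geschlecht", "Gesellschaft", "gesamt", "genau", "gelaufen"):
    print(example, "->", strip_ge(example))
```

A whole-word lookup like this does not cover inflected forms such as "Geschlechter", so a real exception mechanism would probably have to run after suffix stripping or match on word beginnings instead.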
