

ELKI

Environment for Developing KDD-Applications Supported by Index-Structures


Quick Summary

ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. To achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and particularly welcomes contributions of new methods. ELKI aims to provide a large collection of highly parameterizable algorithms, to allow easy and fair evaluation and benchmarking of algorithms.
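To illustrate what an index accelerates, here is a plain-Java sketch (not ELKI code; all class and method names are illustrative): without an index, a k-nearest-neighbor query has to compute the distance to every point, whereas structures like the R*-tree prune most of these computations.

```java
// Illustrative baseline (NOT ELKI code): linear-scan kNN.
// A spatial index such as the R*-tree avoids visiting most points.
import java.util.PriorityQueue;

final class LinearScanKNN {
  // Returns the k points of `data` closest to the query point q.
  static double[][] query(double[] q, double[][] data, int k) {
    // Max-heap on distance: the head is the farthest of the k best so far.
    PriorityQueue<double[]> heap = new PriorityQueue<>(
        (a, b) -> Double.compare(dist(q, b), dist(q, a)));
    for (double[] p : data) {      // O(n) distance computations, always
      heap.add(p);
      if (heap.size() > k) {
        heap.poll();               // drop the farthest candidate
      }
    }
    return heap.toArray(new double[0][]);
  }

  // Squared Euclidean distance (monotone in the true distance).
  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }
}
```

The linear scan is exact but always O(n) per query; a tree index can answer the same query while inspecting only a small fraction of the data.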

Download

You can download precompiled ELKI releases from the home page, or use standard Java dependency management tools such as Gradle or Maven.

Gradle:

dependencies {
    implementation group: 'io.github.elki-project', name: 'elki', version: '0.8.0'
}

Maven:

<!-- https://mvnrepository.com/artifact/io.github.elki-project/elki -->
<dependency>
    <groupId>io.github.elki-project</groupId>
    <artifactId>elki</artifactId>
    <version>0.8.0</version>
</dependency>

Background

Data mining research leads to many algorithms for similar tasks. A fair and useful comparison of these algorithms is difficult for several reasons:

  • Implementations of competing methods are often not available.
  • When implementations by different authors are available, runtime comparisons measure the authors' skill in efficient programming as much as the algorithmic merits of the methods.

On the other hand, efficient data management tools like index-structures can show considerable impact on data mining tasks and are therefore useful for a broad variety of algorithms.

In ELKI, data mining algorithms and data management tasks are separated, allowing them to be evaluated independently. This separation makes ELKI unique among data mining frameworks such as Weka or RapidMiner and frameworks for index structures such as GiST. At the same time, ELKI is open to arbitrary data types, distance or similarity measures, and file formats. The fundamental approach is the mutual independence of file parsers and database connections, data types, distances, distance functions, and data mining algorithms. Helper classes, e.g., for algebraic or analytic computations, are available to all algorithms on equal terms.
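This separation can be sketched in plain Java (hypothetical types, not ELKI's actual interfaces): an algorithm written against a distance abstraction runs unchanged with any distance function or data type.

```java
// Hypothetical sketch of the decoupling idea (NOT ELKI's real API).
import java.util.List;

interface Distance<T> {
  double distance(T a, T b);
}

final class EuclideanDistance implements Distance<double[]> {
  public double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

// The algorithm only knows the Distance interface, so any distance
// function (and any data type T) can be plugged in without changes.
final class NearestNeighbor<T> {
  private final Distance<T> dist;

  NearestNeighbor(Distance<T> dist) {
    this.dist = dist;
  }

  T query(T q, List<T> data) {
    T best = null;
    double bestD = Double.POSITIVE_INFINITY;
    for (T o : data) {
      double d = dist.distance(q, o);
      if (d < bestD) {
        bestD = d;
        best = o;
      }
    }
    return best;
  }
}
```

Swapping Euclidean distance for, say, a geodetic distance then requires no change to the algorithm, which is the property ELKI's architecture exploits.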

With the development and publication of ELKI, we humbly hope to serve the data mining and database research community. The framework is free for scientific use ("free" as in "open source"; see License for details). If you use ELKI in scientific publications, we would appreciate credit in the form of a citation of the appropriate publication (see our list of publications), that is, the publication corresponding to the ELKI release you were using.

The people behind ELKI are documented on the Team page.

The ELKI wiki: Tutorials, HowTos, Documentation

Beginners may want to start with the HowTo documents, Examples, and Tutorials, which help with difficult configuration scenarios and with getting started in ELKI development.

This website serves as the community development hub and task tracker for bug reports, tutorials, the FAQ, general issues, and development tasks.

The most important documentation pages are: Tutorial, JavaDoc, FAQ, InputFormat, DataTypes, DistanceFunctions, DataSets, Development, Parameterization, Visualization, Benchmarking, and the list of Algorithms and RelatedPublications.

Getting ELKI: Download and Citation Policy

You can download ELKI including source code on the Releases page.
ELKI uses the AGPLv3 License, a well-known open source license.

There is a list of Publications that accompany the ELKI releases. When using ELKI in your scientific work, you should cite the publication corresponding to the ELKI release you are using, both to give credit and to improve the repeatability of your experiments. We would also appreciate it if you contributed your algorithm to ELKI, so that others can reproduce your results and compare against your algorithm (which, in turn, will likely earn you citations). We try to document every publication used in implementing ELKI: the RelatedPublications page is generated from the source code annotations.

Efficiency Benchmarking with ELKI

ELKI is quite fast (see some of our benchmark results), but the focus lies on broad coverage of algorithms and variations. We discourage cross-platform benchmarking, because it is easy to produce misleading results by comparing apples and oranges. For a fair comparison, implement all algorithms within ELKI, using the same APIs. We have also observed that JDK versions have a large impact on runtime performance. To make your results reproducible, please state the exact versions you used. See also Benchmarking.
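For same-JVM comparisons, even a minimal timing harness (a sketch, not ELKI's benchmarking tooling) should warm up the JIT compiler before measuring and keep results alive so the measured work is not optimized away:

```java
// Minimal same-JVM timing sketch; illustrative only.
import java.util.function.Supplier;

final class Bench {
  static volatile Object blackhole; // prevents dead-code elimination

  // Average wall-clock milliseconds per run of `task`.
  static <T> double timeMillis(Supplier<T> task, int warmup, int runs) {
    for (int i = 0; i < warmup; i++) {
      blackhole = task.get(); // let the JIT compile the hot path first
    }
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      blackhole = task.get(); // keep each result reachable
    }
    return (System.nanoTime() - start) / 1e6 / runs;
  }
}
```

For publishable numbers, a dedicated harness such as JMH is preferable; this sketch only illustrates the two classic pitfalls (unwarmed JIT, eliminated work).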

Bug Reports and Contact

You can browse the open bug reports or create a new bug report.

We also appreciate any comments, suggestions and code contributions.
You can contact the core development team by e-mail: elki () dbs ifi lmu de

Design Goals

  • Extensibility - ELKI has a very modular design. We want to allow arbitrary combinations of data types, distance functions, algorithms, input formats, index structures, and evaluation methods.
  • Contributions - ELKI grows only as fast as people contribute. A modular design that allows small contributions, such as a single distance function or a single algorithm, lets students and external contributors participate in the progress of ELKI.
  • Completeness - For an exhaustive comparison of methods, we aim at covering as much published and credited work as we can.
  • Fairness - It is easy to produce an unfair comparison by implementing a competitor badly. We try to implement every method as well as we can, and by publishing the source code we allow for external improvements. We try to add all proposed improvements, such as index structures for faster range and kNN queries.
  • Performance - The modular architecture of ELKI allows optimized versions of algorithms and index structures for acceleration.
  • Progress - ELKI changes with every release. To accommodate new features and enhance performance, API breakages are unavoidable. We hope to arrive at a stable API with the 1.0 release, but we are not there yet.

Building ELKI

ELKI is built using the Gradle wrapper:

./gradlew shadowJar

will produce a single executable jar file named elki-bundle-<VERSION>.jar.

Individual jar files can be built using:

./gradlew jar

A complete build (with tests and JavaDoc, it will take a few minutes) can be triggered as:

./gradlew build

Eclipse can build ELKI; the easiest way is to use the elki-bundle project as the classpath, since it includes everything enabled.

Contributors

abhisheksharma102, alanmazankiewicz, andiwg, arandomtree, arthurzimek, brauliosanchez, delead, domiakax, evgfaer, kno10, patrickkostjens, paulk-asert, pilisera, sebaruehl, sequoja, tiborgo


Issues

PAM Algorithm Cluster Centroids

I use the PAM algorithm in ELKI, where the axes represent coordinates. The visualization shows the axes, but I need the exact values of the centroids. Is it possible to compute the exact numbers?

How to fix: No 'by label' reference outlier found, which is needed for weighting!

I'm trying to visualize an R-tree, but I am getting an error:

Task failed
de.lmu.ifi.dbs.elki.utilities.exceptions.AbortException: No 'by label' reference outlier found, which is needed for weighting!
	at de.lmu.ifi.dbs.elki.application.greedyensemble.VisualizePairwiseGainMatrix.run(VisualizePairwiseGainMatrix.java:140)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$2.doInBackground(MiniGUI.java:600)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$2.doInBackground(MiniGUI.java:591)
	at javax.swing.SwingWorker$1.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at javax.swing.SwingWorker.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

(screenshot)

I tried adding a field called bylabel

(screenshot)

Null Pointer dereference in PAMInitialMeans.java

hello [Parser-ing]

& thanks for ELKI

I don't have any mighty algorithms to offer, but if you are interested in a parser for numerical MATLAB (.mat) files, I made one wrapping JMatIO.

It is based on your ARFF parser (I copied everything ;) but it works.

MaximumMatchingAccuracy Index out of Bounds Exception

Hey,
I'm new to ELKI and might be doing something wrong.
When I run:
for k in $( seq 3 40 ); do java -jar elki-bundle-0.7.6-SNAPSHOT.jar KDDCLIApplication -dbc.in data/synthetic/Vorlesung/mouse.csv -algorithm clustering.kmeans.LloydKMeans -kmeans.k $k -resulthandler ResultWriter -out.gzip -out output/k-$k ; done
I get a lot of:

Index 6 out of bounds for length 6
java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds for length 6
at elki.evaluation.clustering.MaximumMatchingAccuracy.<init>(MaximumMatchingAccuracy.java:69)
at elki.evaluation.clustering.ClusterContingencyTable.getMaximumMatchingAccuracy(ClusterContingencyTable.java:246)
at elki.evaluation.clustering.EvaluateClustering$ScoreResult.<init>(EvaluateClustering.java:245)
at elki.evaluation.clustering.EvaluateClustering.evaluteResult(EvaluateClustering.java:173)
at elki.evaluation.clustering.EvaluateClustering.processNewResult(EvaluateClustering.java:159)
at elki.evaluation.AutomaticEvaluation.autoEvaluateClusterings(AutomaticEvaluation.java:148)
at elki.evaluation.AutomaticEvaluation.processNewResult(AutomaticEvaluation.java:67)
at elki.workflow.EvaluationStep$Evaluation.update(EvaluationStep.java:106)
at elki.workflow.EvaluationStep$Evaluation.<init>(EvaluationStep.java:95)
at elki.workflow.EvaluationStep.runEvaluators(EvaluationStep.java:72)
at elki.KDDTask.run(KDDTask.java:109)
at elki.application.KDDCLIApplication.run(KDDCLIApplication.java:58)
at elki.application.AbstractApplication.runCLIApplication(AbstractApplication.java:175)
at elki.application.KDDCLIApplication.main(KDDCLIApplication.java:91)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at elki.application.ELKILauncher.main(ELKILauncher.java:80)

If I use
for k in $(seq 3 6) it works. When I go from 6 to 7, or up to 40 as above, it starts to throw the exceptions.

if -dbc.in does not exist, KDDCLIApplication will throw non-useful error

Hi everyone,

this time reporting not a big issue.
If I run ELKI as KDDCLIApplication and accidentally get the -dbc.in wrong, so that the file does not exist, I get this error:

ERROR: The following configuration errors prevented execution:
Error instantiating internal class: elki.workflow.InputStep Path component should be '/'
The following parameters were not processed: [/home/bastian/data/nonexisting_file]
Stopping execution because of configuration errors above.

I think it would be nice if that could be caught with a useful error message :)
(because it took me a while to figure out what was wrong)

Cluster of size 0 (with EM algorithm)

Hi everyone,

currently I'm trying to cluster some artificial data. While I'm failing horribly at it (due to noise, I think), I stumbled upon a weird occurrence in one of my results.
I clustered my data with the EM algorithm, and in multiple instances I get clusters of size 0.
How can that happen?
I don't think this is intended behavior, right?

I've attached 2 results, EM algorithm with k set to 7 and 8, with the former having 4 empty clusters and the latter 1 empty cluster.
I've also attached the input table which I've used to get these results.

My Elki version is 7.2 from December 4, cloned from here.
It's running with java version "1.8.0_144" on an Ubuntu 14.04 (cannot update that, it's a cluster).

Hope someone can have a look at this :).

Regards,
Bastian

EM7.tar.gz
EM8.tar.gz
absolute_counts_per_contig.csv.percentages_per_row.tar.gz

Incorrect processing of column names in NumberVectorLabelParser#getTypeInformation()

When using NumberVectorLabelParser and supplying labelIndices, getTypeInformation stops after the desired number of column names has been read, even though some columns have been skipped.
I used this data file: https://www.niss.org/sites/default/files/ScotchWhisky01.txt
And designated RowID and Distillery as label indices.
Before the change in PR #78 I see this (note the column names): (screenshot)
After the change I see this: (screenshot)

How can I get outliers from the outlierResult?

I get the outlierResult and scores. How can I decide whether a point is an outlier?
Scores:
1 1.017915661656192
2 1.0608605021777988
3 1.171651951509847
4 1.0359532383112164
5 0.9946463130241695
6 1.0021667682045214
7 1.0664994726755364
8 1.0163041670169992
9 1.0792733520499878
10 1.0654301407031426
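Outlier scores like these are a ranking rather than a yes/no answer. Assuming LOF-style scores (where values near 1.0 indicate inliers), two common heuristics are to flag everything above a chosen cutoff, or simply the top-k scores. A plain-Java sketch (not the ELKI API; the class name is hypothetical):

```java
// Hypothetical helper (NOT ELKI API): rank points by outlier score.
import java.util.Arrays;

final class OutlierSelect {
  // Returns the 0-based indices of the k largest scores,
  // ordered from most to least outlying.
  static int[] topK(double[] scores, int k) {
    Integer[] idx = new Integer[scores.length];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    // Sort indices by descending score.
    Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
    int[] out = new int[k];
    for (int i = 0; i < k; i++) {
      out[i] = idx[i];
    }
    return out;
  }
}
```

With the scores above, the point with score 1.1717 would be ranked most outlying; where to cut the ranking (top-k or a score threshold) remains an application-specific choice.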

multiple calling of LOF from Scala leads to AbortException: DBID range allocation error

Hello,
I am using the LOF implementation from Elki for some experiments. Since I work in scala, I have written a wrapper to call it from the jar (version 0.7.2). The wrapper is the following:

import de.lmu.ifi.dbs.elki.algorithm.outlier.lof.LOF
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski

import scala.collection.mutable.ListBuffer

/**
  * Created by fouchee on 04.09.17.
  */
case class ElkiLOF(k: Int) extends OutlierDetector {
  def computeScores(instances: Array[Array[Double]]): Array[(Int, Double)] = {
    val distance = new minkowski.EuclideanDistanceFunction
    val lof = new LOF(k, distance)

    val dbc = new ArrayAdapterDatabaseConnection(instances) // Adapter to load data from an existing array.
    val db = new StaticArrayDatabase(dbc, null) // Create a database (which may contain multiple relations!)
    db.initialize()
    val result = lof.run(db).getScores()

    var scoreList = new ListBuffer[Double]()
    val DBIDs = result.iterDBIDs()
    while (DBIDs.valid) {
      scoreList += result.doubleValue(DBIDs)
      DBIDs.advance()
    }


    val corrected = scoreList.map {
      case d if d.isNaN => 1.0 // Or whatever value you'd prefer.
      case d if d.isNegInfinity => 1.0 // Or whatever value you'd prefer.
      case d if d.isPosInfinity => 1.0 // Or whatever value you'd prefer.
      case d => d
    }
    corrected.toArray.zipWithIndex.map(x => (x._2, x._1))
  }
}

It works well, and the implementation is really fast, I must say. However, if I run it multiple times in parallel (and I mean with a lot of different data sets and repetitions), at some point I run into the following error:

[error] Exception in thread "main" de.lmu.ifi.dbs.elki.utilities.exceptions.AbortException: DBID range allocation error - too many objects allocated!
[error] 	at de.lmu.ifi.dbs.elki.database.ids.integer.TrivialDBIDFactory.generateStaticDBIDRange(TrivialDBIDFactory.java:72)
[error] 	at de.lmu.ifi.dbs.elki.database.ids.DBIDUtil.generateStaticDBIDRange(DBIDUtil.java:196)
[error] 	at de.lmu.ifi.dbs.elki.database.StaticArrayDatabase.initialize(StaticArrayDatabase.java:129)
[error] 	at com.edouardfouche.detectors.ElkiLOF$.computeScores(ElkiLOF.scala:29)

I don't know how to correct this error. When I run sequentially, it occurs at some point as well. It seems that the underlying TrivialDBIDFactory does not deallocate DBIDs that are no longer in use. I found the piece of code that throws the error here: http://www.massapi.com/class/de/lmu/ifi/dbs/elki/utilities/exceptions/AbortException-5.html

Any idea how to avoid that?

Thank you,
Edouard

Naive quantiles for exponentially modified gaussian

I needed 99% and 99.9% quantiles for an EMG and was able to get decent results with something as simple as

double x = emg.getMean() + emg.getStddev() - Math.log(1 - qt) / emg.getLambda();
for (int i = 0; i < 10; i++)
    x -= (emg.cdf(x) - qt) / emg.pdf(x);

I'm wondering: is that worthy of a PR, or too stupid?
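The Newton refinement above can be sanity-checked against a distribution whose quantile has a closed form. The self-contained sketch below substitutes the exponential distribution (quantile -ln(1 - q) / lambda) for the EMG, since the emg object in the snippet is not self-contained:

```java
// Self-contained check of the Newton quantile refinement, using the
// exponential distribution as a stand-in for the EMG.
final class NewtonQuantile {
  static double expCdf(double x, double lambda) {
    return 1 - Math.exp(-lambda * x);
  }

  static double expPdf(double x, double lambda) {
    return lambda * Math.exp(-lambda * x);
  }

  // Newton's method on cdf(x) - q = 0, starting from the guess x0
  // (same iteration shape as in the issue above).
  static double quantile(double q, double lambda, double x0) {
    double x = x0;
    for (int i = 0; i < 10; i++) {
      x -= (expCdf(x, lambda) - q) / expPdf(x, lambda);
    }
    return x;
  }
}
```

For the exponential with lambda = 1, the 99% quantile is -ln(0.01) ≈ 4.60517, and ten Newton steps from x0 = 1 reach it to machine precision, so the approach in the issue is at least numerically sound for well-behaved starting points.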

When will the next official version be released? Release 0.7.1 throws ClassCastException in ELKIServiceRegistry

After downloading the new release and trying to launch it via the console, a ClassCastException was thrown, caused by line 53 in ELKIServiceRegistry:

private static final URLClassLoader CLASSLOADER = (URLClassLoader) ClassLoader.getSystemClassLoader();

The issue is apparently fixed in the current repository:

private static final ClassLoader CLASSLOADER = ELKIServiceRegistry.class.getClassLoader();

Stack trace:

java.lang.ExceptionInInitializerError
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI.setupAppChooser(MiniGUI.java:247)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI.<init>(MiniGUI.java:198)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$9.run(MiniGUI.java:737)
	at java.desktop/java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:313)
	at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:770)
	at java.desktop/java.awt.EventQueue.access$600(EventQueue.java:97)
	at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
	at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:87)
	at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:740)
	at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
	at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
	at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
	at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
	at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
	at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
Caused by: java.lang.ClassCastException: java.base/jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to java.base/java.net.URLClassLoader
	at de.lmu.ifi.dbs.elki.utilities.ELKIServiceRegistry.<clinit>(ELKIServiceRegistry.java:53)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI.setupAppChooser(MiniGUI.java:247)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI.<init>(MiniGUI.java:198)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$9.run(MiniGUI.java:737)
	at java.desktop/java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:313)
	at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:770)
	at java.desktop/java.awt.EventQueue.access$600(EventQueue.java:97)
	at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
	at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:87)
	at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:740)
	at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
	at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
	at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
	at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
	at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
	at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)

Java 10, Windows 10


I would like to know when the next pre-built release will be rolled out.

ParallelGeneralizedDBSCAN missing from latest release?

As the title suggests, I was wondering whether it is expected that the parallelized DBSCAN implementation is not present in the release 0.7.1 jar file. The Javadocs on elki-project.github.io still mention it.

For reference:

$  jar tf elki-0.7.1.jar | grep -i dbscan | grep -i parallel | wc -l
       0

Implementation of ODIN does not comply with the original definition

In the original ODIN paper, ODIN is defined as a global outlier detection approach. The ODIN outlier score is calculated as the indegree of an observation in the weighted kNN graph, where the weights are the distances between the respective observations.

The implementation in ELKI, however, uses the unweighted kNN graph to calculate the indegree for the ODIN score (more precisely, all weights in the kNN graph are equal to 1/k). This change not only deviates from the original paper; it also makes ODIN a local outlier detection approach.

Source:
V. Hautamaki, I. Karkkainen and P. Franti, "Outlier detection using k-nearest neighbour graph," Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Cambridge, 2004, pp. 430-433 Vol.3.
doi: 10.1109/ICPR.2004.1334558

Clusters of size 0 (with HDBSCAN/DeLiClu/OPTICS)

Perhaps this is related to #41 and therefore expected.

However, my (naive) understanding is that for some of these algorithms (HDBSCAN) the cluster is defined by the enclosed data points. Therefore, a zero-sized cluster is meaningless.

Anderberg Hierarchical clustering - maximum array size reached

I am trying to run hierarchical clustering with around 170k vectors of size 768 each, and ELKI throws the message:

This implementation does not scale to data sets larger than 65536 instances (~16 GB RAM), at which point the Java maximum array size is reached.

my code looks like this:

java -cp "src/dependency/*:src/elki/*" de.lmu.ifi.dbs.elki.application.KDDCLIApplication -dbc.in input.tsv -algorithm clustering.hierarchical.AnderbergHierarchicalClustering

How to get the p-value of the Anderson-Darling test?

How can I get the p-value of the two-sample Anderson-Darling test?

I used

		StandardizedTwoSampleAndersonDarlingTest ad = new StandardizedTwoSampleAndersonDarlingTest();
		pi.AndersonDarlingValue = ad.unstandardized(d1, d2);		
		pi.AndersonDarlingPValue = ad.deviation(d1, d2); //p-value ???

Imprecise variance calculation in MeanVariance.java

Hi,

I ported some of your numerically stable methods to C++ and C# and noticed some weird results.

Consider the following series:

x = [
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47
]

Since x is constant the variance should be zero or very close to zero. However, using the MeanVariance class, I get:

naive variance = 5.425347222222222e-05
sample variance = 5.9185606060606055e-05

When the series values are low:

x = [
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47
]

then the precision is good:

naive variance = 6.788729774740758e-28
sample variance = 7.405887026989918e-28

Unfortunately I can't test the original java code as I don't have experience with java and my attempt at building ELKI failed, but I think this should be fairly easy to confirm. My code is basically identical to this.

I realize the values are big but they are not that big. The same issue concerning variance is also present in the PearsonCorrelation class.
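The reporter's observation is easy to reproduce in plain Java, independent of ELKI's MeanVariance class: the textbook E[x²] − E[x]² formula cancels catastrophically for large, nearly constant values, while a two-pass computation stays accurate.

```java
// Self-contained illustration of variance cancellation (not ELKI code).
final class VarianceDemo {
  // Textbook single-pass formula: E[x^2] - E[x]^2.
  // Subtracts two huge, nearly equal numbers, so precision collapses.
  static double naive(double[] x) {
    double sum = 0, sumsq = 0;
    for (double v : x) {
      sum += v;
      sumsq += v * v;
    }
    double mean = sum / x.length;
    return sumsq / x.length - mean * mean;
  }

  // Two-pass formula: compute the mean first, then sum squared
  // deviations. The deviations are small, so no cancellation occurs.
  static double twoPass(double[] x) {
    double sum = 0;
    for (double v : x) {
      sum += v;
    }
    double mean = sum / x.length;
    double ss = 0;
    for (double v : x) {
      double d = v - mean;
      ss += d * d;
    }
    return ss / x.length;
  }
}
```

On the constant series from the report, the two-pass result stays essentially zero, while the single-pass formula's error is on the order of the rounding error of x², which for values around 1.5e14 can be enormous. Welford-style updates (as in MeanVariance) sit between the two; for extreme magnitudes, shifting the data by its first value before accumulating helps further.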

Use a @cite doclet to use inline citations in Javadoc

But as we are currently targeting JDK 8, and a new API arrived in JDK 9, it does not make sense to do this yet. The next long-term Java version 11 is scheduled for end of September 2018.
So for ELKI 0.8 it is an option to target JDK 11, and use the new API then.

UnsupportedOperationException when using DBSCAN with RStarTree

I am using ELKI for clustering. I have run it more than 1k times on many datasets and it was fine :D
But when I started it on one of my files (the big one), I saw an error while initializing the tree.
The whole command and result are here:
java -jar elki-bundle-0.7.1.jar KDDCLIApplication -verbose -verbose -enableDebug true -dbc.in my_input -parser.labelIndices 0 -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory -time -algorithm clustering.DBSCAN -algorithm.distancefunction geo.LngLatDistanceFunction -geo.model SphericalHaversineEarthModel -dbscan.epsilon 50.0 -dbscan.minpts 446 -resulthandler ResultWriter,ExportVisualizations -out my_output -vis.output my_visOutput

de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 5716 ms
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.directory.capacity: 95
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.directory.minfill: 38
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.leaf.capacity: 153
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.leaf.minfill: 61
Node is not a directory node!
java.lang.UnsupportedOperationException: Node is not a directory node!
	at de.lmu.ifi.dbs.elki.index.tree.AbstractNode.addDirectoryEntry(AbstractNode.java:240)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.insertDirectoryEntry(AbstractRStarTree.java:194)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.reInsert(AbstractRStarTree.java:655)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.strategies.overflow.LimitedReinsertOverflowTreatment.handleOverflow(LimitedReinsertOverflowTreatment.java:97)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.overflowTreatment(AbstractRStarTree.java:571)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.adjustTree(AbstractRStarTree.java:676)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.adjustTree(AbstractRStarTree.java:705)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.insertLeafEntry(AbstractRStarTree.java:175)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.reInsert(AbstractRStarTree.java:649)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.strategies.overflow.LimitedReinsertOverflowTreatment.handleOverflow(LimitedReinsertOverflowTreatment.java:97)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.overflowTreatment(AbstractRStarTree.java:571)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.adjustTree(AbstractRStarTree.java:676)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.insertLeafEntry(AbstractRStarTree.java:175)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.AbstractRStarTree.insertLeaf(AbstractRStarTree.java:151)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.insert(RStarTreeIndex.java:104)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.insertAll(RStarTreeIndex.java:129)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeIndex.initialize(RStarTreeIndex.java:94)
	at de.lmu.ifi.dbs.elki.database.StaticArrayDatabase.initialize(StaticArrayDatabase.java:168)
	at de.lmu.ifi.dbs.elki.workflow.InputStep.getDatabase(InputStep.java:63)
	at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:108)
	at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
	at de.lmu.ifi.dbs.elki.application.AbstractApplication.runCLIApplication(AbstractApplication.java:194)
	at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.main(KDDCLIApplication.java:96)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at de.lmu.ifi.dbs.elki.application.ELKILauncher.main(ELKILauncher.java:60)

Installation Errors

ant -f C:\Users\336943\Documents\NetBeansProjects\Clustering -Dnb.internal.action.name=build jar
init:
Deleting: C:\Users\336943\Documents\NetBeansProjects\Clustering\build\built-jar.properties
deps-jar:
Updating property file: C:\Users\336943\Documents\NetBeansProjects\Clustering\build\built-jar.properties
Compiling 1 source file to C:\Users\336943\Documents\NetBeansProjects\Clustering\build\classes
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:111: error: method chooseInitialMeans in interface KMeansInitialization<V#2> cannot be applied to given types;
means = initializer.chooseInitialMeans(database, relation, k, getDistanceFunction());
required: Database,Relation,int,NumberVectorDistanceFunction<? super T>,Factory
found: Database,Relation<V#1>,int,NumberVectorDistanceFunction<CAP#2>
reason: cannot infer type-variable(s) T,O
(actual and formal argument lists differ in length)
where T,O,V#1,V#2 are type-variables:
T extends CAP#1 declared in method <T,O>chooseInitialMeans(Database,Relation,int,NumberVectorDistanceFunction<? super T>,Factory)
O extends NumberVector declared in method <T,O>chooseInitialMeans(Database,Relation,int,NumberVectorDistanceFunction<? super T>,Factory)
V#1 extends NumberVector declared in class SameSizeKMeansAlgorithm
V#2 extends NumberVector declared in interface KMeansInitialization
where CAP#1,CAP#2 are fresh type-variables:
CAP#1 extends NumberVector super: V#1 from capture of ? super V#1
CAP#2 extends Object super: V#1 from capture of ? super V#1
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:123: error: incompatible types: double[][] cannot be converted to List<? extends NumberVector>
means = means(clusters, means, relation);
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:130: error: incompatible types: double[] cannot be converted to Vector
result.addToplevelCluster(new Cluster<>(clusters.get(i), new MeanModel(means[i])));
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:151: error: cannot find symbol
final double d = c.dists[i] = df.distance(fv, DoubleVector.wrap(means[i]));
symbol: method wrap(double[])
location: class DoubleVector
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:237: error: cannot find symbol
c.dists[i] = df.distance(fv, DoubleVector.wrap(means[i]));
symbol: method wrap(double[])
location: class DoubleVector
C:\Users\336943\Documents\NetBeansProjects\Clustering\src\tutorial\clustering\SameSizeKMeansAlgorithm.java:340: error: incompatible types: double[][] cannot be converted to List<? extends NumberVector>
means = means(clusters, means, relation);
Note: Some messages have been simplified; recompile with -Xdiags:verbose to get full output
6 errors
C:\Users\336943\Documents\NetBeansProjects\Clustering\nbproject\build-impl.xml:930: The following error occurred while executing this line:
C:\Users\336943\Documents\NetBeansProjects\Clustering\nbproject\build-impl.xml:270: Compile failed; see the compiler error output for details.
BUILD FAILED (total time: 1 second)

[doc] missing variable c

In documentation:

// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();

int i = 0;
for(Cluster<KMeansModel> clu : c.getAllClusters()) {

The variable c in c.getAllClusters() is never defined in the example.

No data type found satisfying: NumberVector,field AND NumberVector,variable

Hi everyone,

I'm currently trying to use ELKI for clustering some rather big data... or better said, I'd like to use it, if it would let me.
I've used it before, where it worked, but now something is going wrong.
I've cut the data down to a file of 10 rows with 7 columns of float values (attached,
absolute_counts_per_contig.csv.percentages_per_row.head_10_columns_7.txt), and I am using the latest ELKI version (cloned a minute ago), and I still get an error.

The command + error is the following:

14:30:44 bastian@computer:~$ java -Xmx11G -jar /exports/mm-hpc/bacteriologie/bastian/tools/elki/elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication -dbc.in /exports/mm-hpc/bacteriologie/bastian/data/absolute_counts_per_contig.csv.percentages_per_row.head_10_columns_7.csv -out /exports/mm-hpc/bacteriologie/bastian/data/elki_results/kmeans_perc_per_row_maxiter_10000/2/ -algorithm clustering.kmeans.KMeansLloyd -kmeans.k 2 -kmeans.maxiter 10000   -evaluator clustering.internal.EvaluateDaviesBouldin,clustering.internal.EvaluatePBMIndex,clustering.internal.EvaluateSquaredErrors,clustering.internal.EvaluateVarianceRatioCriteria,clustering.internal.EvaluateSimplifiedSilhouette -parser.colsep \\t -resulthandler ResultWriter 
No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=5,maxdim=7 LabelList
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=5,maxdim=7 LabelList
        at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:123)
        at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:79)
        at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:100)
        at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:109)
        at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:58)
        at de.lmu.ifi.dbs.elki.application.AbstractApplication.runCLIApplication(AbstractApplication.java:184)
        at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.main(KDDCLIApplication.java:93)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at de.lmu.ifi.dbs.elki.application.ELKILauncher.main(ELKILauncher.java:77)

It looks like there is some issue with parsing the columns, but I really cannot see it; all columns in all rows have values.
Any advice on what could be going wrong?

Thanks,
Bastian

Distance-based cluster evaluation algorithms fail if input numbers are too large

Hi everyone,

I'm currently trying to cluster a matrix, and did some back and forth on my approach.
The values in the matrix are pretty big (the biggest is 10e+300), and the matrix is also pretty dense.
I did the clustering with k-means, which produced results, but all internal cluster evaluation algorithms failed to produce anything useful.
This is a result from k-means with k=4

Distance-based Davies Bouldin Index 0.0
Distance-based Density Based Clustering Validation NaN
Distance-based C-Index 1.0
Distance-based PBM-Index NaN
Distance-based Silhouette +-NaN NaN
Distance-based Simp. Silhouette +-NaN NaN
Distance-based Mean distance Infinity
Distance-based Sum of Squares Infinity
Distance-based RMSD Infinity
Distance-based Variance Ratio Criteria NaN
# Concordance
Concordance Gamma 0.9999772178605122
Concordance Tau 0.04571359658825246

As a workaround, I now do the clustering only on the exponents (so 10e+300 becomes 300), and I get useful output.
So I have no idea what is causing this, but I guess something should warn the user.
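For what it's worth, the failure is reproducible with plain Java doubles, independent of ELKI: squared deviations of values near 1e+300 exceed the double range (about 1.8e+308), and the resulting Infinity values then propagate into the NaN scores shown above. A minimal standalone sketch:

```java
public class OverflowDemo {
  public static void main(String[] args) {
    double v = 1e300;
    // Squared deviations overflow the double range (~1.8e308):
    double squared = v * v;
    System.out.println(squared); // Infinity
    // Sums that include Infinity stay Infinity:
    double sum = squared + 1e10;
    System.out.println(sum); // Infinity
    // Ratios of two infinite quantities, as in variance-ratio-style criteria, are NaN:
    System.out.println(squared / sum); // NaN
  }
}
```

This is why the sum-of-squares and RMSD come out as Infinity while the ratio-based indexes come out as NaN.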

Timeout on instantiating de.lmu.ifi.dbs.elki.gui.util.TreePopup

I have tried to build v0.7.1 on OS X using Java 1.8.0_74 and Maven 3.3.9, and got the error in the subject. What does it depend on?

Here is the trace:

[DEBUG] Executing command line: [java, -cp, /Users/me/Downloads/elki-release0.7.1/elki/target/classes:/Users/me/.m2/repository/net/sf/trove4j/trove4j/3.0.3/trove4j-3.0.3.jar, de.lmu.ifi.dbs.elki.application.internal.DocumentParameters, /Users/me/Downloads/elki-release0.7.1/elki/target/apidocs/parameters-byclass.html, /Users/me/Downloads/elki-release0.7.1/elki/target/apidocs/parameters-byopt.html]
Timeout on instantiating de.lmu.ifi.dbs.elki.gui.util.TreePopup
java.util.concurrent.TimeoutException
java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at de.lmu.ifi.dbs.elki.application.internal.DocumentParameters.buildParameterIndex(DocumentParameters.java:317)
    at de.lmu.ifi.dbs.elki.application.internal.DocumentParameters.main(DocumentParameters.java:149)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at de.lmu.ifi.dbs.elki.application.internal.DocumentParameters.buildParameterIndex(DocumentParameters.java:312)
    at de.lmu.ifi.dbs.elki.application.internal.DocumentParameters.main(DocumentParameters.java:149)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] ELKI Data Mining Framework - Parent Project ........ SUCCESS [  3.659 s]
[INFO] ELKI Data Mining Framework ......................... FAILURE [01:31 min]
[INFO] ELKI Data Mining Framework - Batik Visualization ... SKIPPED
[INFO] ELKI Data Mining Framework - Tutorial Algorithms ... SKIPPED
[INFO] ELKI Data Mining Framework - LibSVM based extensions SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:35 min
[INFO] Finished at: 2016-04-15T11:53:41+02:00
[INFO] Final Memory: 36M/682M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.3.2:exec (generate-javadoc-parameters) on project elki: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.3.2:exec (generate-javadoc-parameters) on project elki: Command execution failed.
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed.
    at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:303)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
    ... 20 more
Caused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:402)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:164)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:746)
    at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:292)
    ... 22 more
[ERROR] 
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :elki

PrimsMinimumSpanningTree ArrayIndexOutOfBoundsException

I'm getting an exception, probably because my data set is too small for HDBSCAN (there are only two data points in the particular data set when the exception is thrown). The data set works fine with the other clustering algorithms.

I can catch the exception in my code, but perhaps it would be best if it were caught within ELKI.

  clustering = new ELKIBuilder<>(HDBSCANHierarchyExtraction.class) //
   .with(HDBSCANHierarchyExtraction.Parameterizer.MINCLUSTERSIZE_ID, minPoints) //
   .with(HDBSCANLinearMemory.Parameterizer.MIN_PTS_ID, minPoints) //
   .with(AbstractAlgorithm.ALGORITHM_ID, HDBSCANLinearMemory.class) //
   .build().run(db);


Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at de.lmu.ifi.dbs.elki.math.geometry.PrimsMinimumSpanningTree.processDense(PrimsMinimumSpanningTree.java:170)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.HDBSCANLinearMemory.run(HDBSCANLinearMemory.java:122)
	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:87)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.HDBSCANLinearMemory.run(HDBSCANLinearMemory.java:79)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.extraction.HDBSCANHierarchyExtraction.run(HDBSCANHierarchyExtraction.java:129)
	at uk.ac.shef.wit.active10.CreateStaypoints.cluster(CreateStaypoints.java:708)
	at uk.ac.shef.wit.active10.CreateStaypoints.main(CreateStaypoints.java:1151)

Per category evaluation of a clustering

Apart from the evaluations of the complete clustering, it would be nice to be able to get the per-label statistics to understand how the individual quality of the categories affect the global clustering quality.

What do you think?

CLIQUE - Connected dense units

Inspecting the results of ELKI's CLIQUE clustering implementation revealed an issue. The original paper (AGGR98, Section 2.1) states that two dense units are connected if they have a common face (they are identical in n-1 dimensions, and in the remaining dimension they are neighboring).

According to this the following output should be impossible:

# Cluster: cluster_1
# Cluster name: cluster_1
# Cluster noise flag: false
# Cluster size: 500
# Model class: de.lmu.ifi.dbs.elki.data.model.SubspaceModel
# Cluster Mean: 0.5058922207711539, 0.5997058865585326
# Subspace: Dimensions: [1]
# Coverage: 500
# Units: 
#    d1-[0.04; 0.33[    127 objects
#    d1-[0.33; 0.62[    207 objects
#    d1-[0.62; 0.92[    166 objects

In this case all dense units are neighboring in both of the dimensions, yet the current implementation considers it one cluster.

Geodetic mindist

Hello,

I was reading your paper "Geodetic Distance Queries on R-Trees for Indexing Geographic Data" and I wasn't sure what you meant by all those subscript-360 functions, so I checked your code, defined here: https://github.com/elki-project/elki/blob/e8f3c6fdf54e1e0aa8444b94ad5374ad518dfc0c/elki-core-math/src/main/java/de/lmu/ifi/dbs/elki/math/geodesy/SphereUtil.java. It doesn't follow the pseudocode. For instance, lines 806-813:

// Determine whether going east or west is shorter.
double lngE = rminlng - plng;
if(lngE < 0) {
  lngE += MathUtil.TWOPI;
}
double lngW = rmaxlng - plng; // we keep this negative!
if(lngW > 0) {
  lngW -= MathUtil.TWOPI;
}

where I think you meant to do:

//mod360(rminlng-plng) <= mod360(plng-rmaxlng)
// Determine whether going east or west is shorter.
        double lngW = fmod(rminlng - plng, 2 * M_PI);
        if (lngW < 0)
        {
            lngW += 2 * M_PI;
        }
        double lngE = fmod(plng - rmaxlng, 2 * M_PI);
        if (lngE < 0)
        {
            lngE += 2 * M_PI;
}

Also, my theory is that the 360 subscript means we are working in radians, not degrees. In fact, when I followed the pseudocode in your paper exactly, it worked for me in C++, but I am still not sure what you meant by the sentence "In order to distinguish the other cases, we first need to test whether we are on the left or on the right side by rotating the mean longitude of the rectangle by 180° – the meridian opposite of the rectangle." Do I understand correctly that you wanted to take the angle modulo 360° (2π) to normalize it to the range 0–360° [0..2π]?
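For reference, here is the normalization to [0, 2π) that I have in mind, as a Java helper of my own (not the ELKI implementation); note that Java's % on doubles behaves like C's fmod and keeps the sign of the dividend, so a negative remainder must be shifted up:

```java
public class AngleUtil {
  // Normalize an angle in radians to the range [0, 2*PI).
  static double mod2pi(double a) {
    double r = a % (2 * Math.PI); // Java % keeps the dividend's sign, like fmod
    return r < 0 ? r + 2 * Math.PI : r;
  }

  public static void main(String[] args) {
    System.out.println(mod2pi(-0.5)); // ~5.783 (i.e. 2*PI - 0.5)
    System.out.println(mod2pi(7.0));  // ~0.717 (i.e. 7 - 2*PI)
  }
}
```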

I have also found that the haversine formula should use the absolute difference of latitudes and longitudes. Correct me if I am wrong, but if you don't use absolute values there, you may get negative distances. I attach my version of the algorithm, including the haversine formula:

static const double wgs84_radius_m = 6378137;
static const double wgs84_flattening = 1.0 / 298.257223563;
static const double earth_radius_m = wgs84_radius_m * (1 - wgs84_flattening);
static const double earth_radius_km = earth_radius_m / 1000.0;
 struct WGS84Mindist
{
    //input is in radians
    //returns distance in kilometers
    static double haversineFormulaRad(double lat1, double lon1, double lat2, double lon2)
    {
        double d_lat = abs(lat1 - lat2);
        double d_lon = abs(lon1 - lon2);

        double a = pow(sin(d_lat / 2), 2) + cos(lat1) * cos(lat2) * pow(sin(d_lon / 2), 2);

        //double d_sigma = 2 * atan2(sqrt(a), sqrt(1 - a));
        double d_sigma = 2 * asin(sqrt(a));

        return earth_radius_km * d_sigma;
    }

    //input is in radians
    //returns angular distance on unit sphere
    static double haversineFormulaRadAngular(double lat1, double lon1, double lat2, double lon2)
    {
        double d_lat = abs(lat1 - lat2);
        double d_lon = abs(lon1 - lon2);

        double a = pow(sin(d_lat / 2), 2) + cos(lat1) * cos(lat2) * pow(sin(d_lon / 2), 2);

        //double d_sigma = 2 * atan2(sqrt(a), sqrt(1 - a));
        double d_sigma = 2 * asin(sqrt(a));

        return d_sigma;
    }

    //input is in radians
    //returns angle in radians
    static double getBearingRad(double lat1, double lon1, double lat2, double lon2)
    {
        double dLon = lon2 - lon1;
        double y = sin(dLon) * cos(lat2);
        double x = cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dLon);
        double radiansBearing = atan2(y, x);

        return radiansBearing;
    }

    //start, end, point - coordinates in radians
    static double getCrossTrackDistanceRad(double lat1, double lon1, double lat2, double lon2, double lat3, double lon3)
    {
        double angDist1Q = haversineFormulaRadAngular(lat1, lon1, lat3, lon3);

        double cos_lat1 = cos(lat1);
        double sin_lat1 = sin(lat1);
        double cos_lat3 = cos(lat3);
        double cos_lat2 = cos(lat2);

        //double theta1Q = getBearing(lat1, lon1, lat3, lon3);
        // double dLon = lon3 - lon1;
        // double y = sin(dLon) * cos(lat3);
        // double x = cos(lat1) * sin(lat3) - sin(lat1) * cos(lat3) * cos(dLon);
        // double radiansBearing = atan2(y, x);
        double dLon1 = lon3 - lon1;
        double y1 = sin(dLon1) * cos_lat3;
        double x1 = cos_lat1 * sin(lat3) - sin_lat1 * cos_lat3 * cos(dLon1);
        double theta1Q = atan2(y1, x1);

        //double theta12 = getBearing(lat1, lon1, lat2, lon2);
        // double dLon = lon2 - lon1;
        // double y = sin(dLon) * cos(lat2);
        // double x = cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dLon);
        // double radiansBearing = atan2(y, x);
        double dLon2 = lon2 - lon1;
        double y2 = sin(dLon2) * cos_lat2;
        double x2 = cos_lat1 * sin(lat2) - sin_lat1 * cos_lat2 * cos(dLon2);
        double theta12 = atan2(y2, x2);

        return asin(sin(angDist1Q) * sin(theta1Q - theta12)) * earth_radius_km;
    }

    static double latlngMinDistDeg(double &plat, double &plng, double &rminlat, double &rminlng, double &rmaxlat, double &rmaxlng)
    {
        return latlngMinDistRad(GeoDistance::deg2rad(plat), GeoDistance::deg2rad(plng),
                                GeoDistance::deg2rad(rminlat), GeoDistance::deg2rad(rminlng),
                                GeoDistance::deg2rad(rmaxlat), GeoDistance::deg2rad(rmaxlng));
    }

    //returns distance in kilometers
    static double latlngMinDistRad(double plat, double plng, double rminlat, double rminlng, double rmaxlat, double rmaxlng)
    {
        // The simplest case is when the query point is in the same "slice":
        if (rminlng <= plng && plng <= rmaxlng)
        {
            if (plat < rminlat)
            {
                return (rminlat - plat) * earth_radius_km; //South of MBR
            }
            else if (plat > rmaxlat)
            {
                return (plat - rmaxlat) * earth_radius_km; //North of MBR
            }
            return 0.0; // INSIDE MBR
        }

        // Determine whether going east or west is shorter.
        double lngW = fmod(rminlng - plng, 2 * M_PI);
        if (lngW < 0)
        {
            lngW += 2 * M_PI;
        }
        double lngE = fmod(plng - rmaxlng, 2 * M_PI);
        if (lngE < 0)
        {
            lngE += 2 * M_PI;
        }

        if (lngW <= lngE)
        {
            // West of MBR
            double tau = tan(plat);

            if (lngW >= 0.5 * M_PI) // Large delta of longitude
            {
                if (tau <= tan((rmaxlat + rminlat) * 0.5) * cos(rminlng - plng))
                {
                    return haversineFormulaRad(plat, plng, rmaxlat, rminlng); //North-West
                }
                else
                {
                    return haversineFormulaRad(plat, plng, rminlat, rminlng); //South-West
                }
            }

            if (tau >= tan(rmaxlat) * cos(rminlng - plng))
            {
                return haversineFormulaRad(plat, plng, rmaxlat, rminlng); //North-West
            }

            if (tau <= tan(rminlat) * cos(rminlng - plng))
            {
                return haversineFormulaRad(plat, plng, rminlat, rminlng); //South-West
            }

            return abs(getCrossTrackDistanceRad(rminlat, rminlng, rmaxlat, rminlng, plat, plng)); // West
        }     
        else
        {
             // East of MBR
             double tau = tan(plat);

             if (lngE >= 0.5 * M_PI) // Large delta of longitude
             {
                 if (tau <= tan((rmaxlat + rminlat) * 0.5) * cos(rmaxlng - plng))
                 {
                    return haversineFormulaRad(plat, plng, rmaxlat, rmaxlng); //North-East
                 }
                 else
                 {
                    return haversineFormulaRad(plat, plng, rminlat, rmaxlng); //South-East
                 }
             }

             if (tau >= tan(rmaxlat) * cos(rmaxlng - plng))
             {
                 return haversineFormulaRad(plat, plng, rmaxlat, rmaxlng); //North-East
             }

             if (tau <= tan(rminlat) * cos(rmaxlng - plng))
             {
                 return haversineFormulaRad(plat, plng, rminlat, rmaxlng); //South-East
             }

             return abs(getCrossTrackDistanceRad(rmaxlat, rmaxlng, rminlat, rmaxlng, plat, plng)); // East
        }
    }
};

The last thing I don't understand is why, in the pseudocode, you return c * (rminlat - plat) / 360 for the N/S case, when I think it should be c * (rminlat - plat), because you say in the text that we are calculating the length of a meridian arc, where c is the radius of the Earth.
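To double-check my own reading: if c were the circumference rather than the radius, then c * Δlat / 360 with Δlat in degrees gives exactly the same meridian arc length as R * Δlat with Δlat in radians. A quick Java check (my own sketch, not ELKI code, with an assumed mean Earth radius):

```java
public class MeridianArc {
  // Meridian arc length via circumference and a latitude difference in degrees:
  static double byCircumference(double radiusKm, double deltaDeg) {
    return 2 * Math.PI * radiusKm * deltaDeg / 360.0;
  }

  // The same arc length via radius and a latitude difference in radians:
  static double byRadius(double radiusKm, double deltaDeg) {
    return radiusKm * Math.toRadians(deltaDeg);
  }

  public static void main(String[] args) {
    double R = 6371.0; // mean Earth radius in km (assumed value)
    System.out.println(byCircumference(R, 3.0));
    System.out.println(byRadius(R, 3.0));
    // Both formulations give the same arc length.
  }
}
```

So the /360 would make sense if c is the circumference and the latitudes are in degrees; my confusion is only about which convention the paper uses.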

Thank you for your answer.

Re-run of DeLiClu causes Exception

There seems to be a bug in the implementation: multiple runs of DeLiClu cause the exception "DeLiClu heap was empty when it shouldn't have been.".

This does not happen if the index is rebuilt at each iteration, and there is no such issue with the other OPTICS algorithms.

for (int minPnts : new int[]{5, 10, 15, 20}) {
    if (rebuildIndex) {
            db = new StaticArrayDatabase(dbc, Collections.singletonList(indexFactory));
            db.initialize();
            relations = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
    }

    clustering = new ELKIBuilder<>(OPTICSXi.class) //
       .with(DeLiClu.Parameterizer.MINPTS_ID, minPnts) //
       .with(OPTICSXi.Parameterizer.XI_ID, xi) //
       .with(OPTICSXi.Parameterizer.XIALG_ID, DeLiClu.class) //
       .build().run(db);
}

It seems that running DeLiClu may be altering the data index. I'm unsure if it's related, but using DeLiClu I can also get an ObjectNotFoundException with the following:

  for (Cluster<? extends Model> cluster : clustering.getAllClusters()) {
      for (DBIDIter it = cluster.getIDs().iter(); it.valid(); it.advance()) {
            try {
                double[] latlng = relations.get(it).toArray();
            }
            catch(ObjectNotFoundException e) {
                logger.error(e.getLocalizedMessage());
            }
        }
  }

Implementation of code in ELKI

I am new to ELKI. I have gone through the ELKI documentation and found a huge collection of clustering algorithms; thanks to the contributors. Since I am new to ELKI, I find it difficult to run the algorithms using the MiniGUI. Is there a quicker way to gain an understanding, so that contributions can be made faster? Please suggest one. Thank you.

INFLO does not compute RNN correctly

The RNN computed is based only on the neighbors of the point, and does not include reverse neighbors that are not among the current point's k nearest neighbors.

This is based on de.lmu.ifi.dbs.elki.algorithm.outlier.lof.INFLO#computeNeighborhoods

As currently written, the RNN will always be a subset of the kNN.
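A toy example on the real line shows why this matters: with k=1 and points at 0, 1, 3, and 7, the point 3 has 1 as its nearest neighbor, but 3 is not among 1's nearest neighbors. So the true reverse-kNN of 1 contains a point outside 1's own kNN, which the current computation would miss. A plain-Java sketch, independent of ELKI:

```java
import java.util.*;

public class RkNNDemo {
  // Brute-force kNN of the point at index qi (excluding itself).
  static Set<Integer> knn(double[] pts, int qi, int k) {
    Integer[] order = new Integer[pts.length];
    for (int j = 0; j < pts.length; j++) order[j] = j;
    Arrays.sort(order, Comparator.comparingDouble(j -> Math.abs(pts[j] - pts[qi])));
    Set<Integer> nn = new LinkedHashSet<>();
    for (int j = 0; nn.size() < k && j < pts.length; j++) {
      if (order[j] != qi) nn.add(order[j]);
    }
    return nn;
  }

  // True reverse kNN of qi: all points that have qi among their k nearest neighbors.
  static Set<Integer> rknn(double[] pts, int qi, int k) {
    Set<Integer> result = new LinkedHashSet<>();
    for (int i = 0; i < pts.length; i++) {
      if (i != qi && knn(pts, i, k).contains(qi)) result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    double[] pts = {0, 1, 3, 7};
    System.out.println("kNN(1)  = " + knn(pts, 1, 1));  // index 0 only
    System.out.println("RkNN(1) = " + rknn(pts, 1, 1)); // indexes 0 and 2
    // Index 2 (point 3) is a reverse neighbor of 1, but not in 1's own kNN.
  }
}
```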

Naming JUnit tests

It's a common convention to name JUnit test files with the suffix Test, not a prefix.

It makes the code much easier to read, as most IDEs support switching between a class and its test.

Would you accept a PR with renaming test files?

Eclipse Mars launch of MiniGUI fails

I followed the Eclipse configuration instructions for ELKI, and the console indicates the build was successful.
But when I try to run the ELKI MiniGUI from Run Configurations, it fails with:

An internal error occurred during: "Launching ELKI MiniGUI".
Model not available for elki

Fastutil >= 8.5.3 not supported

FastUtil removed the Int2Float components in 8.5.3 and 8.5.4; therefore these versions cannot be used. The latest working version is 8.5.2.

How can I cluster data using a distance matrix with the ELKI library?

I have a distance matrix and I want to use that distance matrix when clustering my data.

I've read the ELKI documentation, and it states that I can override the distance method when extending the AbstractNumberVectorDistanceFunction class.

The distance method, however, receives the coordinates, i.e. it computes from coordinate x to coordinate y. This is troublesome because the distance matrix is filled only with distance values, and we use the indexes to look up the distance from index x to index y. Here's the code from the documentation:

public class TutorialDistanceFunction extends AbstractNumberVectorDistanceFunction {
  @Override
  public double distance(NumberVector o1, NumberVector o2) {
    double dx = o1.doubleValue(0) - o2.doubleValue(0);
    double dy = o1.doubleValue(1) - o2.doubleValue(1);
    return dx * dx + Math.abs(dy);
  }
}

My question is how to correctly use the distance matrix when clustering with ELKI.
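What I have in mind is essentially an index-based lookup like this (my own sketch, not ELKI API; the class and method names are mine):

```java
public class MatrixDistance {
  private final double[][] matrix; // precomputed, symmetric distance matrix

  public MatrixDistance(double[][] matrix) {
    this.matrix = matrix;
  }

  // Distance between the objects at row/column indexes i and j:
  public double distance(int i, int j) {
    return matrix[i][j];
  }

  public static void main(String[] args) {
    double[][] d = {
        { 0.0, 2.0, 5.0 },
        { 2.0, 0.0, 3.0 },
        { 5.0, 3.0, 0.0 },
    };
    MatrixDistance dist = new MatrixDistance(d);
    System.out.println(dist.distance(0, 2)); // 5.0
  }
}
```

In ELKI terms, I assume this would mean defining the distance on database ids rather than on NumberVector coordinates, so that the row id can serve as the matrix index; which class to extend for that presumably depends on the ELKI version, so the current documentation would need to be checked.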

How can I access class "description" if GNOME error is thrown all the time?

I am trying to list the parameters which I can pass to e.g. DBSCAN, but there is no way to do that, since a GNOME error is blocking everything:

java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at de.lmu.ifi.dbs.elki.application.ELKILauncher.main(ELKILauncher.java:72)
Caused by: java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
	at java.awt.Toolkit.loadAssistiveTechnologies(Toolkit.java:807)
	at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:886)
	at de.lmu.ifi.dbs.elki.gui.GUIUtil.setLookAndFeel(GUIUtil.java:73)
	at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI.main(MiniGUI.java:497)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at de.lmu.ifi.dbs.elki.application.ELKILauncher.main(ELKILauncher.java:72)

I have my ELKI package located in src and I am running:

java -jar src/elki/elki-0.7.0.jar -description de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.DBSCAN

I also tried

java -cp "src/elki/*:src/dependency/*" -description de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN

but the description does not exist.

Task failed java.lang.OutOfMemoryError: Java heap space

I am trying to cluster word2vec vectors derived from text documents. These are vectors of 15 decimal numbers. I tried using DBSCAN, FastOPTICS, etc., but I get the error below. Can anyone help me with this? I tried parser-vector-type as SparseFloatVector, FloatVector, and the default one too, but I end up getting the error below every time.

Task failed
java.lang.OutOfMemoryError: Java heap space
	at gnu.trove.set.hash.TIntHashSet.rehash(TIntHashSet.java:410)
	at gnu.trove.impl.hash.THash.ensureCapacity(THash.java:175)
	at de.lmu.ifi.dbs.elki.database.ids.integer.TroveHashSetModifiableDBIDs.addDBIDs(TroveHashSetModifiableDBIDs.java:88)
	at de.lmu.ifi.dbs.elki.index.preprocessed.fastoptics.RandomProjectedNeighborsAndDensities.getNeighs(RandomProjectedNeighborsAndDensities.java:400)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.optics.FastOPTICS.run(FastOPTICS.java:159)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:91)
	at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
	at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
	at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
	at [...]

Extracting hierarchy from DBSCAN

I use ELKI within my Java code and have been trying to export the cluster hierarchy generated with HDBSCAN; however, this just results in a single root cluster with the child clusters all being leaves.

In order to "fix" this I changed the collectChildren method in the HDBSCANHierarchyExtraction class.
Replacing
collectChildren(temp, clustering, child, clus, flatten);
with
finalizeCluster(child, clustering, clus, flatten);

This does seem to result in a proper hierarchy, although it returns all clusters (including those with fewer than minPts data points). However, my understanding of the code is not sufficient to know whether this is in any way sensible or correct.

I use the following code to output the hierarchy:

{
        ...create clustering code...

        Relation<NumberVector> coords = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD_2D);
        List<Cluster<DendrogramModel>> topClusters = clustering.getToplevelClusters();
        Hierarchy<Cluster<DendrogramModel>> hierarchy = clustering.getClusterHierarchy();

        for (Cluster<DendrogramModel> cluster : topClusters) {
            System.out.println("---------------------------------");
            outputHierarchy(cluster, hierarchy, coords, "");
        }
}

private static void outputHierarchy(Cluster<DendrogramModel> cluster,
                                    Hierarchy<Cluster<DendrogramModel>> hierarchy,
                                    Relation<NumberVector> coords,
                                    String indent) {
    final DBIDs ids = cluster.getIDs();
    DendrogramModel model = cluster.getModel();
    System.out.format("%s%s: %d : %.3f%n", indent, cluster.getName(), ids.size(), model.getDistance());
    if (!ids.isEmpty()) {
        System.out.print(indent);
        for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
            System.out.print(Arrays.toString(coords.get(iter).toArray()));
        }
        System.out.println();
    }
    if (hierarchy != null) {
        if (hierarchy.numChildren(cluster) > 0) {
            for (It<Cluster<DendrogramModel>> iter = hierarchy.iterChildren(cluster); iter.valid();
                 iter.advance()) {
                outputHierarchy(iter.get(), hierarchy, coords, indent + "  ");
            }
        }
    }
}

Signed long overflow in Xoroshiro128NonThreadsafeRandom

I noticed a signed long overflow issue in elki-core-util/src/main/java/elki/utilities/random/Xoroshiro128NonThreadsafeRandom.java:

@Override
public void setSeed(long seed) {
  long xor64 = seed != 0 ? seed : 4101842887655102017L;
  // XorShift64* generator to seed:
  xor64 ^= xor64 >>> 12; // a
  xor64 ^= xor64 << 25; // b
  xor64 ^= xor64 >>> 27; // c
  s0 = xor64 * 2685821657736338717L;
  xor64 ^= xor64 >>> 12; // a
  xor64 ^= xor64 << 25; // b
  xor64 ^= xor64 >>> 27; // c
  s1 = xor64 * 2685821657736338717L;
}

I have traced the computed results, and the multiplications in the code above overflow a signed long.

Is this on purpose? The comment says the source is http://xoroshiro.di.unimi.it/, but I can't find the code above there. Could you provide a reference to the original code? Thanks!
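For what it's worth, signed overflow in Java `long` arithmetic is not an error condition: the Java Language Specification defines integer multiplication to wrap modulo 2^64, which reproduces bit for bit the unsigned 64-bit arithmetic that C implementations of such generators rely on. A minimal demonstration with toy values (not the generator's constants):

```java
public class OverflowDemo {
  public static void main(String[] args) {
    // Java long multiplication wraps modulo 2^64 (JLS 15.17.1),
    // so an "overflowing" multiply is well-defined, not undefined
    // behavior, and matches unsigned 64-bit C arithmetic.
    long a = Long.MAX_VALUE;   // 2^63 - 1
    long b = a * 2;            // wraps around to -2
    System.out.println(b);
    // Reinterpreted as unsigned, it is 2^64 - 2:
    System.out.println(Long.toUnsignedString(b));
  }
}
```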

EuclideanRStarTreeKNNQuery NPE

Possibly related to #46.

An NPE is thrown, I think because I only have a single data point in the data set.

java.lang.NullPointerException
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.query.EuclideanRStarTreeKNNQuery.expandNode(EuclideanRStarTreeKNNQuery.java:105)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.query.EuclideanRStarTreeKNNQuery.getKNNForObject(EuclideanRStarTreeKNNQuery.java:87)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.query.EuclideanRStarTreeKNNQuery.getKNNForObject(EuclideanRStarTreeKNNQuery.java:56)
	at de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.query.RStarTreeKNNQuery.getKNNForDBID(RStarTreeKNNQuery.java:94)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.AbstractHDBSCAN.computeCoreDists(AbstractHDBSCAN.java:110)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.HDBSCANLinearMemory.run(HDBSCANLinearMemory.java:116)
	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:87)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.HDBSCANLinearMemory.run(HDBSCANLinearMemory.java:79)
	at de.lmu.ifi.dbs.elki.algorithm.clustering.hierarchical.extraction.HDBSCANHierarchyExtraction.run(HDBSCANHierarchyExtraction.java:129)
	at uk.ac.shef.wit.active10.CreateStaypoints.cluster(CreateStaypoints.java:707)
	at uk.ac.shef.wit.active10.CreateStaypoints.main(CreateStaypoints.java:1151)

java.lang.ClassCastException when running ELKI 0.7.1 on OpenJDK

I recently switched from Oracle JDK 8.x to OpenJDK 11.x, and now my ELKI-based simulation no longer works.
The problem seems to be located in ELKIServiceRegistry:

java.lang.ClassCastException: class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and java.net.URLClassLoader are in module java.base of loader 'bootstrap')
at de.lmu.ifi.dbs.elki.utilities.ELKIServiceRegistry.&lt;clinit&gt;(ELKIServiceRegistry.java:53)

I guess I need to switch back to Oracle Java in the meantime.
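For context, the underlying cause is that since JDK 9 the system class loader is no longer a `java.net.URLClassLoader`, so any code that casts it to one fails regardless of vendor (Oracle JDK 9+ is affected the same way as OpenJDK). A quick self-contained check:

```java
public class LoaderCheck {
  public static void main(String[] args) {
    ClassLoader sys = ClassLoader.getSystemClassLoader();
    // JDK 8: sun.misc.Launcher$AppClassLoader, which extends URLClassLoader.
    // JDK 9+: jdk.internal.loader.ClassLoaders$AppClassLoader, which does not,
    // so this prints "false" and the cast in ELKIServiceRegistry throws.
    System.out.println(sys instanceof java.net.URLClassLoader);
  }
}
```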
