scify / JedAIToolkit
An open source, high scalability toolkit in Java for Entity Resolution.
Home Page: http://jedai.scify.org
License: Apache License 2.0
The second URL of every row is mapped to 0.
So all of the records in the first column of the CSV file are considered duplicates (via transitive closure).
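That transitive closure can be sketched with a tiny union-find (a minimal illustration, not JedAI code; the record ids are made up):

```java
import java.util.*;

// Minimal union-find sketch of the transitive closure: if every row pairs its
// first record with the same second record (id 0), then all first-column
// records collapse into a single duplicate cluster.
public class TransitiveClosureSketch {
    static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    // Returns the number of clusters after union-ing every duplicate pair.
    static int clusterCount(int[][] pairs, int numRecords) {
        int[] parent = new int[numRecords];
        for (int i = 0; i < numRecords; i++) parent[i] = i;
        for (int[] p : pairs) parent[find(parent, p[0])] = find(parent, p[1]);
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < numRecords; i++) roots.add(find(parent, i));
        return roots.size();
    }

    public static void main(String[] args) {
        // Records 1..3 are each paired with record 0 by the ground-truth mapping.
        System.out.println(clusterCount(new int[][]{{1, 0}, {2, 0}, {3, 0}}, 4)); // prints 1
    }
}
```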
I have my own custom CSV data files for both the dataset and the ground truth.
Can anyone help me use these files to get results?
They throw some errors when given as input:

Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
    at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:218)
    at org.scify.jedai.datareader.groundtruthreader.GtCSVReader.getDuplicatePairs(GtCSVReader.java:206)
    at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:58)
    at org.scify.jedai.workflowbuilder.Main.main(Main.java:254)

Can anyone help me?
There is a bug in the code that prevents ground truth in CSV format from being read. I tried the samples provided, and the web-based Docker image failed to load them. I downloaded the code, ran it step by step, and I think the problem is in GtCSVReader: the reading part takes strings like "thisisastring" (with the quotes) where only thisisastring should be read. I tried adding nextLine[0] = nextLine[0].substring(1, nextLine[0].length()-1); on line 200 of that file, but with no success so far. I need to make it work to test some CSV entity matchings, so maybe somebody has a fix for this issue?
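One defensive variant of that substring workaround (a sketch of the idea, not the actual GtCSVReader fix) strips the surrounding quotes only when both are actually present, so unquoted fields are left untouched:

```java
// Sketch of a defensive quote-stripping helper: remove surrounding double
// quotes only when the field actually starts and ends with one.
public class QuoteStrip {
    static String stripQuotes(String field) {
        if (field != null && field.length() >= 2
                && field.charAt(0) == '"' && field.charAt(field.length() - 1) == '"') {
            return field.substring(1, field.length() - 1);
        }
        return field; // unquoted fields pass through unchanged
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"thisisastring\"")); // prints thisisastring
        System.out.println(stripQuotes("unquoted"));          // prints unquoted
    }
}
```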
Making the blocking- and clustering-related classes serializable enables using jedai-core in applications that run on Hadoop and Spark clusters.
Having setters for the configurable fields adds more flexibility in creating the blocking and clustering objects.
I got the following error when I tried blocking with schema clusters:
java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    at org.scify.jedai.blockbuilding.AbstractBlockBuilding.lambda$parseIndex$10(AbstractBlockBuilding.java:167)
    at java.base/java.util.HashMap.forEach(HashMap.java:1336)
    at org.scify.jedai.blockbuilding.AbstractBlockBuilding.parseIndex(AbstractBlockBuilding.java:164)
    at org.scify.jedai.blockbuilding.AbstractBlockBuilding.readBlocks(AbstractBlockBuilding.java:196)
    at org.scify.jedai.blockbuilding.AbstractBlockBuilding.getBlocks(AbstractBlockBuilding.java:96)
    at org.scify.jedai.gui.utilities.WorkflowManager.runBlockBuilding(WorkflowManager.java:824)
    at org.scify.jedai.gui.utilities.WorkflowManager.runBlockingBasedWorkflow(WorkflowManager.java:896)
    at org.scify.jedai.gui.utilities.WorkflowManager.executeFullBlockingBasedWorkflow(WorkflowManager.java:393)
    at org.scify.jedai.gui.utilities.WorkflowManager.executeFullWorkflow(WorkflowManager.java:695)
    at org.scify.jedai.gui.controllers.steps.CompletedController.lambda$runAlgorithmBtnHandler$6(CompletedController.java:316)
    at java.base/java.lang.Thread.run(Thread.java:834)
There is a String split operation in the parseIndex function that is not working properly:
final String[] entropyString = key.split(CLUSTER_SUFFIX);
The delimiters used in key are equivalent to CLUSTER_PREFIX, not CLUSTER_SUFFIX, and they contain a dollar sign that has to be escaped. I worked around the issue by changing the above line to
final String[] entropyString = key.split("#\\$!cl");
I'd suggest changing the values of the prefix and suffix to something that is regex-compatible; the workaround above is less readable, after all.
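An alternative that keeps the current prefix value readable is Pattern.quote, which escapes the literal for use as a regex. A small sketch (assuming, from the workaround above, that the literal delimiter is "#$!cl"):

```java
import java.util.regex.Pattern;

// String.split takes a regular expression, so a delimiter containing '$'
// must be escaped. Pattern.quote does the escaping and makes the intent clear.
public class SplitLiteral {
    public static void main(String[] args) {
        String key = "entropy#$!clsomeCluster";
        String[] parts = key.split(Pattern.quote("#$!cl"));
        System.out.println(parts[0]); // prints entropy
        System.out.println(parts[1]); // prints someCluster
    }
}
```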
We're using jedai-core (not jedai-ui) in our application; we ran into some out-of-memory errors and started profiling. The largest chunk of memory came from SimilarityPairs. We experimented with reducing the size of the similarities from double to float, which cut the memory footprint by about 25% (630 MB -> 470 MB).
I'm assuming we don't need the extra precision afforded by double; is that correct?
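A back-of-envelope sketch of where the saving comes from (the comparison count below is hypothetical, not taken from the application above): a float is 4 bytes versus 8 for a double, so the raw similarity payload halves; the remaining structures in SimilarityPairs would explain why the observed saving was ~25% rather than 50%.

```java
// Rough payload sizing for an array of N similarity values.
public class SimilarityFootprint {
    static long payloadBytes(long n, int bytesPerValue) {
        return n * bytesPerValue;
    }

    public static void main(String[] args) {
        long n = 50_000_000L; // hypothetical number of stored comparisons
        System.out.println(payloadBytes(n, Double.BYTES) / (1024 * 1024) + " MB as double");
        System.out.println(payloadBytes(n, Float.BYTES) / (1024 * 1024) + " MB as float");
    }
}
```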
Hi, I found that the dataset sizes in this repository do not seem to match the originals, and I would like to know whether the data has been processed. For example, the original Amazon-Google has 1363 and 3226 entities and 1300 matches, but the numbers are smaller in this project.
Also, the dirty datasets seem to just mix the two tables together? Is there any other processing?
On the Similarity Join page of the UI, when the "Select attribute of Dataset 1" and "Select attribute of Dataset 2" values are given in uppercase (e.g. "INSTANCE ID"), the algorithm fails to match results. On further investigation I found that in the class AbstractSimilarityJoin, in the method getAttributeValue(String attributeName, EntityProfile profile), on line 67 attributeName should be changed to attributeName.toLowerCase() so that attribute names are handled properly; otherwise the if condition is simply never satisfied.
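The suggested fix can be sketched as a case-insensitive lookup (names and the Map-based profile below are illustrative, not the actual AbstractSimilarityJoin internals):

```java
import java.util.Map;

// Compare attribute names case-insensitively instead of assuming the
// caller already lower-cased them.
public class AttributeLookup {
    static String getAttributeValue(String attributeName, Map<String, String> profile) {
        for (Map.Entry<String, String> e : profile.entrySet()) {
            if (e.getKey().equalsIgnoreCase(attributeName)) {
                return e.getValue();
            }
        }
        return ""; // attribute not found: empty value instead of a silent skip
    }

    public static void main(String[] args) {
        Map<String, String> profile = Map.of("instance id", "rec-42");
        System.out.println(getAttributeValue("INSTANCE ID", profile)); // prints rec-42
    }
}
```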
Hello,
I am trying to run the Web-based application for a data-matching task. I have two tables in CSV format: the first contains 1.2k rows and the second contains 7k queries. I want to use JedAI to match each query with a row from the first table. When I run a blocking-based workflow, the process gets stuck while loading the tables.
I am a bit lost about how to configure the model. So far I have tried the settings in the video tutorial and some others, but the application never generates any output. I attach the tables with this message; please let me know if there is anything wrong with the way I generated them.
I am trying to download the pre-compiled version from the http://jedai.scify.org website.
When I click on "Download desktop app" for both the "Desktop application for Entity Resolution" and the "Workbench tool", I get a "Page Not Found" on GitHub.
I created an issue for this because the webpage doesn't have any contact information. :/
I tried compiling it on my machine, but the build slowed to a crawl and took over an hour, so I decided to download the precompiled JARs instead; that's why I wanted the download links.
Constructor parameters are not assigned to the class properties.
I cloned the project locally and followed the steps listed in the README, but the build fails with the error below:
git clone https://github.com/scify/JedAIToolkit.git
cd JedAIToolkit
git submodule update --init
mvn clean package
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ jedai-ui ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] jedai .............................................. SUCCESS [ 0.259 s]
[INFO] jedai-core ......................................... SUCCESS [ 59.511 s]
[INFO] jedai-ui ........................................... FAILURE [ 6.408 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:06 min
[INFO] Finished at: 2018-12-11T15:42:46-05:00
[INFO] Final Memory: 42M/406M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default) on project jedai-ui: Error reading assemblies: Error locating assembly descriptor: assembly.xml
[ERROR]
[ERROR] [1] [INFO] Searching for file location: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml
[ERROR]
[ERROR] [2] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml does not exist.
[ERROR]
[ERROR] [3] [INFO] Invalid artifact specification: 'assembly.xml'. Must contain at least three fields, separated by ':'.
[ERROR]
[ERROR] [4] [INFO] Failed to resolve classpath resource: assemblies/assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [5] [INFO] Failed to resolve classpath resource: assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [6] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\assembly.xml does not exist.
[ERROR]
[ERROR] [7] [INFO] Building URL from location: assembly.xml
[ERROR] Error:
[ERROR] java.net.MalformedURLException: no protocol: assembly.xml
[ERROR]     at java.net.URL.<init>(URL.java:593)
[ERROR]     at java.net.URL.<init>(URL.java:490)
[ERROR]     at java.net.URL.<init>(URL.java:439)
[ERROR]     at org.apache.maven.shared.io.location.URLLocatorStrategy.resolve(URLLocatorStrategy.java:54)
[ERROR]     at org.apache.maven.shared.io.location.Locator.resolve(Locator.java:81)
[ERROR]     at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.addAssemblyFromDescriptor(DefaultAssemblyReader.java:309)
[ERROR]     at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.readAssemblies(DefaultAssemblyReader.java:125)
[ERROR]     at org.apache.maven.plugin.assembly.mojos.AbstractAssemblyMojo.execute(AbstractAssemblyMojo.java:352)
[ERROR]     at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]     at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
[ERROR]     at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
[ERROR]     at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
[ERROR]     at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
[ERROR]     at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR]     at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :jedai-ui
The link for DBPedia in data/README.md doesn't work.
Hello,
In the entity matching step I'm trying to combine different bag models with similarity measures for the dirty dataset "movies" in the data folder.
Unfortunately I'm unable to get high recall and high precision, could you give a good "recipe" to get good results for that dataset?
Thank you
I found some cases where data pairs showed up in the end results as false negative and true positive simultaneously.
Its cause is in the class UnilateralDuplicatePropagation, in the following functions:
public boolean isSuperfluous(int entityId1, int entityId2) {
    final IdDuplicates duplicatePair1 = new IdDuplicates(entityId1, entityId2);
    final IdDuplicates duplicatePair2 = new IdDuplicates(entityId2, entityId1);
    if (duplicates.contains(duplicatePair1)
            || duplicates.contains(duplicatePair2)) {
        if (entityId1 < entityId2) {
            detectedDuplicates.add(duplicatePair1);
        } else {
            detectedDuplicates.add(duplicatePair2);
        }
    }
    return false;
}

public Set<IdDuplicates> getFalseNegatives() {
    final Set<IdDuplicates> falseNegatives = new HashSet<>(duplicates);
    falseNegatives.removeAll(detectedDuplicates);
    return falseNegatives;
}
Only one of two possible combinations of IDs is written to detectedDuplicates, but superfluous combinations still exist in duplicates. When removing detectedDuplicates from duplicates to create falseNegatives, those superfluous combinations remain and are exported as false negatives, while the combinations in detectedDuplicates are exported as true positives.
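One way to avoid this, under the diagnosis above, is to normalize every pair to a canonical (min, max) order before storing it in either set, so the set difference cannot leave a reversed "superfluous" combination behind. A self-contained sketch (the Pair record stands in for IdDuplicates; names are illustrative):

```java
import java.util.*;

// Normalize pairs to (min, max) order so removeAll matches both orderings.
public class NormalizedPairs {
    record Pair(int a, int b) {
        static Pair of(int x, int y) { return x <= y ? new Pair(x, y) : new Pair(y, x); }
    }

    static Set<Pair> falseNegatives(Set<Pair> duplicates, Set<Pair> detected) {
        Set<Pair> result = new HashSet<>(duplicates);
        result.removeAll(detected);
        return result;
    }

    public static void main(String[] args) {
        Set<Pair> duplicates = Set.of(Pair.of(7, 3), Pair.of(1, 2));
        Set<Pair> detected = Set.of(Pair.of(3, 7)); // same pair, reversed on input
        // Only the genuinely missed pair (1, 2) remains as a false negative.
        System.out.println(falseNegatives(duplicates, detected).size()); // prints 1
    }
}
```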
If records[k].length is less than records[candId].length, we get an array index out of bounds, since requireOverlaps is created with the k-th record's size, which might be less than records[candId].length.
The PrintToFile.toCSV() method should output the original entity URLs, and should use a format that is easier to import into a database, e.g. three columns: cluster_id, dataset, entity_url.
Hi, I was wondering if you have the dirty datasets available in CSV format? Otherwise I can just write a quick script that reads the JSON files and converts them myself, but I figured there is no harm in asking first! Thanks in advance.
It looks like the groundtruth file is wrong.
Hi,
I'm unable to build the project.
The following dependencies can't be found:
The first one can't be found at all, and the other two seem to be on an unreachable repository: http://backend1.scify.org:60004/artifactory/pub-release-local
mvn clean install -U
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] jedai [pom]
[INFO] jedai-core [jar]
[INFO] jedai-ui [jar]
[INFO]
[INFO] ---------------------------< gr.scify:jedai >---------------------------
[INFO] Building jedai 1.3 [1/3]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ jedai ---
[INFO]
[INFO] --- maven-install-plugin:2.4:install (default-install) @ jedai ---
[INFO] Installing C:\projet\JedAIToolkit\pom.xml to C:\Users\nicolas.lledo\.m2\repository\gr\scify\jedai\1.3\jedai-1.3.pom
[INFO]
[INFO] ------------------------< gr.scify:jedai-core >-------------------------
[INFO] Building jedai-core 1.3 [2/3]
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.pom
[WARNING] The POM for com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.pom
[WARNING] The POM for gr.demokritos:JInsect:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.pom
[WARNING] The POM for salvo.jesus:OpenJGraph:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for jedai 1.3:
[INFO]
[INFO] jedai .............................................. SUCCESS [ 0.452 s]
[INFO] jedai-core ......................................... FAILURE [ 1.671 s]
[INFO] jedai-ui ........................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.393 s
[INFO] Finished at: 2019-02-27T17:50:24+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project jedai-core: Could not resolve dependencies for project gr.scify:jedai-core:jar:1.3: The following artifacts could not be resolved: com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0, gr.demokritos:JInsect:jar:1.1, salvo.jesus:OpenJGraph:jar:1.1: Could not find artifact com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 in nexus.somecompany.com (http://nexus.somecompany.com/repository/maven-public/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :jedai-core
Hello,
I'm working on converting the DBPedia dataset into a format accessible without Java.
I have already converted cleanDBPedia1/2.
However, I do not understand the ground-truth format.
The profiles have attributes and a URI, while the pairs in the ground truth consist of numbers.
When I interpret these numbers as offsets into either file, I end up with non-matching pairs.
I wrote the entities into the files in the order they appeared in the deserialized Java list.
How do I find matching pairs / understand the ground truth?
Kind regards
I am looking at the source code of JedAIToolkit on GitHub, but I am not able to find the sample CSV files used for testing.
Could I get the cd_gold.csv and cd.csv files that TestGtCSVReader.java and TestEntityCSVReader.java use for testing?
Selecting Abt-Buy in C-C mode, it takes amazonProfiles as the 2nd dataset.
Selecting amazonProfiles, it takes amazonGpIdDuplicates as the ground truth.
StandardBlocking.getTokens() throws null pointer exception when input parameter is null.
We ought to stop null values from being added to the EntityProfile when reading from a database.
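The guard could look something like this (an illustrative sketch with a plain Map standing in for EntityProfile, not the actual DB-reader code):

```java
import java.util.*;

// Skip null column names/values instead of adding them to the profile, so
// downstream tokenizers (e.g. StandardBlocking.getTokens) never see null.
public class NullSafeProfile {
    static void addAttribute(Map<String, String> profile, String name, String value) {
        if (name == null || value == null) {
            return; // drop nulls read from the database
        }
        profile.put(name, value);
    }

    public static void main(String[] args) {
        Map<String, String> profile = new HashMap<>();
        addAttribute(profile, "title", "JedAI");
        addAttribute(profile, "year", null); // ignored
        System.out.println(profile.size()); // prints 1
    }
}
```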
I cannot find any documentation or examples of a standard workflow implemented in Python or Java in your repository. Do either of these exist? If so, where can I find them? If not, it would be very useful to have them: a new user of your tool, like me, currently has to go through all of the Java classes to learn how to use it, which takes a lot of time.
I had a look at the code of the SiGMa similarity in the class CharacterNGramsWithGlobalWeights, and it seems to be exactly the same code as the Generalized Jaccard similarity. Am I missing something, or is SiGMa not really implemented?
This issue arose when I attempted to reproduce the workflow in: org.scify.jedai.demoworkflows.CsvDblpAcm.java.
During the reading of the ground truth in DBLP-ACM_perfectMapping.csv (specifically the GtCSVReader.getDuplicatePairs method), the detection of connected components by the jgrapht package seems not to work.
For some reason I obtain a single cluster of size 2225 and then 5375 more clusters of size 1, which is obviously incorrect, since the CSV contains about 2225 unique pairs (which should in turn produce 2225 clusters of size 2).
Have you seen this problem before? Maybe the jgrapht package expects a different format than it did previously?
We're using jedai-core in our application, and we ran into cases where the number of executed comparisons in ComparisonIterator went over the number of total comparisons. We identified that this happens because executedComparisons and totalComparisons are floats; changing them to ints fixed the problem. (In Java, comparing floats for exact equality is generally discouraged anyway.)
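The underlying reason the float counters drift is worth a quick demonstration: above 2^24, a float cannot represent every integer, so incrementing by 1.0f eventually stops changing the value, while an int (or long) counter keeps counting exactly.

```java
// Show that a float counter silently stalls at 2^24 while an int does not.
public class FloatCounter {
    public static void main(String[] args) {
        float f = 16_777_216f; // 2^24
        System.out.println(f + 1f == f); // prints true: the increment is lost
        int i = 16_777_216;
        System.out.println(i + 1 == i);  // prints false: ints count exactly
    }
}
```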
Hi, in /maven-plugins/sitegen-maven-plugin there is a dependency, org.apache.httpcomponents:httpclient-cache:jar:4.2.6, that invokes a vulnerable method.
The affected version range of this CVE is [,4.5.13).
After further analysis, in this project the main API called is org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost
Risk method repair link: GitHub
CVE bug invocation path --
Path length: 7
org.scify.jedai.datawriter.BlocksPerformanceWriter: printDetailedResultsToSPARQL(java.util.List,java.util.List,java.lang.String,java.lang.String)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.sparql.modify.UpdateProcessRemoteForm: execute()V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: execHttpPostForm(java.lang.String,org.apache.jena.sparql.engine.http.Params,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: exec(java.lang.String,org.apache.http.client.methods.HttpUriRequest,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: execute(org.apache.http.client.methods.HttpUriRequest,org.apache.http.protocol.HttpContext)Lorg.apache.http.HttpResponse; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: determineTarget(org.apache.http.client.methods.HttpUriRequest)Lorg.apache.http.HttpHost; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost;
Dependency tree--
[INFO] org.scify:jedai-core:jar:3.2.1
[INFO] +- org.jgrapht:jgrapht-core:jar:1.4.0:compile
[INFO] | \- org.jheaps:jheaps:jar:0.11:compile
[INFO] +- net.sf.trove4j:trove4j:jar:3.0.3:compile
[INFO] +- com.esotericsoftware:minlog:jar:1.3.1:compile
[INFO] +- info.debatty:java-lsh:jar:0.11:compile
[INFO] | \- info.debatty:java-string-similarity:jar:0.12:compile
[INFO] +- org.apache.commons:commons-lang3:jar:3.4:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] +- org.apache.jena:jena-arq:jar:3.1.0:compile
[INFO] | +- org.apache.jena:jena-core:jar:3.1.0:compile
[INFO] | | +- org.apache.jena:jena-iri:jar:3.1.0:compile
[INFO] | | +- xerces:xercesImpl:jar:2.11.0:compile
[INFO] | | | \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] | | +- commons-cli:commons-cli:jar:1.3:compile
[INFO] | | \- org.apache.jena:jena-base:jar:3.1.0:compile
[INFO] | | \- com.github.andrewoma.dexx:collection:jar:0.6:compile
[INFO] | +- org.apache.jena:jena-shaded-guava:jar:3.1.0:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.2.6:compile
[INFO] | | +- org.apache.httpcomponents:httpcore:jar:4.2.5:compile
[INFO] | | \- commons-codec:commons-codec:jar:1.6:compile
[INFO] | +- com.github.jsonld-java:jsonld-java:jar:0.7.0:compile
[INFO] | | +- com.fasterxml.jackson.core:jackson-core:jar:2.3.3:compile
[INFO] | | +- com.fasterxml.jackson.core:jackson-databind:jar:2.3.3:compile
[INFO] | | | \- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO] | | \- commons-io:commons-io:jar:2.4:compile
[INFO] | +- org.apache.httpcomponents:httpclient-cache:jar:4.2.6:compile
[INFO] | +- org.apache.thrift:libthrift:jar:0.9.2:compile
[INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.20:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] | \- org.slf4j:slf4j-api:jar:1.7.20:compile
[INFO] +- org.apache.jena:jena-cmds:jar:3.1.0:compile
[INFO] | +- org.apache.jena:apache-jena-libs:pom:3.1.0:compile
[INFO] | | \- org.apache.jena:jena-tdb:jar:3.1.0:compile
[INFO] | +- org.slf4j:slf4j-log4j12:jar:1.7.20:compile
[INFO] | \- log4j:log4j:jar:1.2.17:compile
[INFO] +- com.opencsv:opencsv:jar:3.7:compile
[INFO] +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] +- org.scify:JInsect:jar:1.1:compile
[INFO] | \- org.scify:OpenJGraph:jar:1.1:compile
[INFO] +- org.rdfhdt:hdt-java-core:jar:1.1:compile
[INFO] | +- com.beust:jcommander:jar:1.32:compile
[INFO] | +- org.rdfhdt:hdt-api:jar:1.1:compile
[INFO] | \- org.apache.commons:commons-compress:jar:1.6:compile
[INFO] | \- org.tukaani:xz:jar:1.4:compile
[INFO] +- com.google.guava:guava-testlib:jar:30.1.1-jre:test
[INFO] | +- com.google.code.findbugs:jsr305:jar:3.0.2:test
[INFO] | +- org.checkerframework:checker-qual:jar:3.8.0:test
[INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.5.1:test
[INFO] | +- com.google.j2objc:j2objc-annotations:jar:1.3:test
[INFO] | +- com.google.guava:guava:jar:30.1.1-jre:test
[INFO] | | +- com.google.guava:failureaccess:jar:1.0.1:test
[INFO] | | \- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:test
[INFO] | \- junit:junit:jar:4.13.2:test
[INFO] | \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- org.hamcrest:hamcrest:jar:2.2:test
[INFO] +- org.junit.jupiter:junit-jupiter-api:jar:5.7.2:test
[INFO] | +- org.apiguardian:apiguardian-api:jar:1.1.0:test
[INFO] | +- org.opentest4j:opentest4j:jar:1.2.0:test
[INFO] | \- org.junit.platform:junit-platform-commons:jar:1.7.2:test
[INFO] \- org.junit.jupiter:junit-jupiter-engine:jar:5.7.2:test
[INFO] \- org.junit.platform:junit-platform-engine:jar:1.7.2:test
Suggested solution:
Update the dependency version.
Thank you very much.
Users of jedai-core are unable to extend the library with a custom similarity metric or entity matching method because of the enums defined in the project (e.g. SimilarityMetric, EntityMatchingMethod, BlockCleaningMethod, etc.). If these features instead used an extension mechanism (for example, java.util.ServiceLoader or something equivalent), custom features would be possible.
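A minimal sketch of what the ServiceLoader route could look like (the provider interface and its methods are hypothetical, not jedai-core API):

```java
import java.util.ServiceLoader;

// Discover similarity-metric implementations from the classpath instead of a
// closed enum. Providers register via a
// META-INF/services/MetricLoader$SimilarityMetricProvider file.
public class MetricLoader {
    public interface SimilarityMetricProvider {
        String name();
        double similarity(String a, String b);
    }

    public static void main(String[] args) {
        ServiceLoader<SimilarityMetricProvider> loader =
                ServiceLoader.load(SimilarityMetricProvider.class);
        // No providers are registered in this self-contained sketch,
        // so the loop body never runs.
        for (SimilarityMetricProvider p : loader) {
            System.out.println("found metric: " + p.name());
        }
    }
}
```

With this mechanism, a user jar could ship its own provider and have it picked up without modifying jedai-core.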
To make it easier to consume, can the project be deployed to Maven Central?
Using the library from the CLI (Linux), it raises this exception:
Please choose one of the available Clean-clean ER datasets:
1 - Abt-Buy
2 - DBLP-ACM
3 - DBLP-Scholar
4 - Amazon-Google Products
5 - IMDB-DBPedia Movies
1
Abt-Buy has been selected!
0 [main] ERROR com.esotericsoftware.minlog - Error in data reading
java.io.FileNotFoundException: data/cleanCleanErDatasets/amazonProfiles (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at org.scify.jedai.datareader.AbstractReader.loadSerializedObject(AbstractReader.java:54)
at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
Exception in thread "main" java.lang.NullPointerException
at java.util.ArrayList.addAll(ArrayList.java:581)
at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
I am using the following release, and I am trying jedaiDesktopApp-1.1.jar with the following datasets (from the samples):
abtBuyIdDuplicates (for D1)
abtBuyProfiles (for the ground-truth file)
But I get the following error:
I tried with CSV files and I also get the same error.
If another project is going to depend on jedai-core, having the transitive dependencies assembled inside jedai-core has the potential to conflict if different versions of those same transitive dependencies are needed for the other project. Since jedai-ui is already assembling transitive dependencies, removing transitive dependencies from jedai-core should not have any effect on the UI.
I get the following error after specifying input sources and then pressing "Next" button in Data Reading Step in JedAI UI:
The input files could not be read successfully.
Details: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character
(java.lang.Character cannot be cast to java.lang.String)
In the terminal of Docker's Web Application I have the following:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character
at kr.di.uoa.gr.jedaiwebapp.models.Dataset.<init>(Dataset.java:86) ~[classes!/:0.0.1-SNAPSHOT]
at kr.di.uoa.gr.jedaiwebapp.controllers.WorkflowController.validate_DataRead(WorkflowController.java:75) ~[classes!/:0.0.1-SNAPSHOT]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) ~[spring-web-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
...
How do I create a ground-truth file?
Hi, I tried some tests with the JedAI tool.
This tool is useful for my job and I think it has great potential.
I've downloaded the attached files in NT format: source.nt, target.nt.
As a first step, I successfully executed the TestRdfReader class from the test package for both datasets. Then I tried to execute the TestGtRDFReader class with the same datasets, but I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
    at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:203)
    at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.performReading(GtRDFReader.java:236)
    at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.getDuplicatePairs(GtRDFReader.java:92)
    at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:57)
    at org.scify.jedai.datareader.TestGtRDFReader.main(TestGtRDFReader.java:39)

Thanks in advance!
Hi!
I have successfully made the Web application work, and I have also made my first successful steps using JedAI with Python.
Now I want to do it programmatically with Python, without the Web application: I want to apply the full workflow using only the terminal and VS Code.
However, I couldn't find any detailed documentation on how to do blocking, cleaning, etc. programmatically.