daqcri / nadeef
A Generalized Data Cleaning System
Home Page: da.qcri.org/NADEEF/
License: Other
Hello,
I am currently running a set of experiments on the HOSP dataset:
1k to 100k tuples, with 2% to 10% noise that I introduced myself.
HOSP is provided in the NADEEF GitHub repository.
At some point I get an OutOfMemoryError: GC overhead limit exceeded:
Exception in thread "Thread-762" java.lang.OutOfMemoryError: GC overhead limit exceeded
I increased the maximum heap size and ran NADEEF on a Linux machine with the following command:
java -Xmx14G -cp out/bin/*:examples/:out/test qa.qcri.nadeef.console.Console
Any idea what else I could configure to get NADEEF running without OOM errors?
Thank you for your help.
@zyzyis from the demo should open a new window
Can we show more columns in tuple rank?
Can we navigate left-right in the ranking?
Michael B had the same request when I gave him a demo, as did the people at QSTP.
Hi,
I am trying to run the following DC rule:
{
    "source" : {
        "type" : "csv",
        "file" : ["/home/khayyzy/data/TaxB/inputDB_100000.csv"],
    },
    "rule" : [
        {
            "name" : "myDC2",
            "type" : "dc",
            "value" : ["not(t1.Salary>t2.Salary&t1.Tax<t2.Tax)"]
        }
    ]
}
on the schema:
Name VARCHAR(255),Dept INT,Salary INT,Tax INT
When I run NADEEF, it returns zero violations. I am sure that there are at least 6000 violations. Am I doing something wrong in defining the DC?
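One way to cross-check an expected violation count independently of NADEEF is a brute-force scan over a small sample. The sketch below is not NADEEF code; it only assumes the rule means "no pair of tuples (t1, t2) may have t1.Salary > t2.Salary and t1.Tax < t2.Tax", and that Salary and Tax have already been converted to integers (comparing them as strings would order them lexicographically and could hide violations). The function name and the sample rows are made up for illustration.

```python
from itertools import permutations

def count_dc_violations(rows):
    """Count ordered pairs (t1, t2) that violate
    not(t1.Salary > t2.Salary & t1.Tax < t2.Tax).

    Quadratic in the number of rows, so only suitable for a small sample.
    """
    return sum(
        1
        for t1, t2 in permutations(rows, 2)
        if t1["Salary"] > t2["Salary"] and t1["Tax"] < t2["Tax"]
    )

# Hypothetical sample following the schema Name, Dept, Salary, Tax.
sample = [
    {"Name": "a", "Dept": 1, "Salary": 100, "Tax": 5},
    {"Name": "b", "Dept": 1, "Salary": 200, "Tax": 3},  # earns more than "a" but pays less tax
    {"Name": "c", "Dept": 2, "Salary": 300, "Tax": 9},
]
violations = count_dc_violations(sample)
print(violations)
```

If a scan like this finds violations on a sample where NADEEF reports none, that would point toward the rule definition or the column types rather than the data.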
We use a special CSV header, so we need to tell users how to format and upload files correctly.
Hi,
I am trying to run NADEEF with the following FD rule:
o_custkey | c_address
I get this error when I define attribute o_custkey as INTEGER and c_address as VARCHAR(255). I am running Derby in memory.
Error: Synchronization failed.
Exception: java.sql.SQLSyntaxErrorException: Comparisons between 'INTEGER' and 'CHAR (UCS_BASIC)' are not supported. Types must be comparable. String types must also have matching collation. If collation does not match, a possible solution is to cast operands to force them to the default collation (e.g. SELECT tablename FROM sys.systables WHERE CAST(tablename AS VARCHAR(128)) = 'T1')
Error: Node has an exception during execution.
Exception: java.lang.NullPointerException: null
Hi,
I got the following output when running a DC on a large number of rows:
HScope time (ms) 0
VScope time (ms) 0
Blocks 1
Iterator time (ms) 4491737
DB load time (ms) 712
Detect time (ms) 4528734
Detect thread count 1953116
Detect tuple count -1474936480
Violation 0
Detection finished in 4528734 ms and found 0 violations.
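A count of tuples cannot be negative, so the "Detect tuple count" line above stands out. One possible explanation, offered purely as an assumption since nothing in this report confirms how NADEEF stores the counter, is a signed 32-bit integer (Java's `int`) wrapping around. The sketch below shows the two's-complement arithmetic only; `to_int32` is a hypothetical helper name.

```python
def to_int32(n):
    """Interpret n modulo 2**32 as a signed 32-bit two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# 2,820,030,816 is the smallest non-negative value that wraps to the
# reported number; the true count, if this is an overflow, could also be
# that value plus any multiple of 2**32.
wrapped = to_int32(2_820_030_816)
print(wrapped)
```

If overflow is indeed the cause, the counter alone would not explain the zero violations, so the two symptoms may need to be investigated separately.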
Test
In the upper panels, if I click on a violating tuple with id 2, the resulting search for “2” returns tons of tuples. Why not search for tid:2 or something similar?
@zyzyis should be "List of Tables"
The current interface is limited to dropping a file, which can be annoying.
Add the ability to browse the local disk, URLs, etc. to upload CSV files.
We should also be able to upload compressed CSV files.
In two cases (paolo and paolo2), when I created a new project, I got this message at the top right:
FATAL: database "null" does not exist
On the left there was only one panel: No job is running.
I could not do anything.
Now, if I open one of them again (by going back to the first screen), I get the correct panels for selecting the data source on the left, with no errors.
Tested on Chrome.
Hi,
How can I assign more memory to NADEEF?
:> load sldkfsdls
Oops, something is wrong. Please check the log in the output dir.
Exception: java.io.FileNotFoundException: null
hello,
NADEEF throws an exception if the name of an uploaded data file contains "-" (minus), like "noisy-data.csv".
thanks!
NVD3 development is lagging behind D3, and the quality is not good.
Try to get it into BCNF form.
I've added a CSV for paolo4 and saved it, but it is not shown in the main GUI.
I've tried twice, refreshed, etc.
Is there any requirement on the header?
Here are the first lines of the file:
rec_id:Varchar(600), given_name:Varchar(1000), surname:Varchar(1000), street_number:Varchar(3000), address_1:Varchar(3000), address_2:Varchar(3000), suburb:Varchar(1000), postcode:Varchar(500), state:Varchar(200), date_of_birth:varchar(1000), age:Varchar(200), phone_number:Varchar(1000), soc_sec_id:Varchar(1000), blocking_number:Varchar(1000)
rec-183-org, elisha, kerslake, 29, goldsborough close, , ringwood north, 4806, qld, 19020708, 34, 02 70611981, 5402131, 1
rec-351-dup-0, georgina, hannink, 109, knoke avenue, s/port yacht-b, leichhardt, 4207, qld, 19920216, , , 5814238, 3
rec-304-dup-0, matthew, boyes, 9, kadina crescent, , whitfield, 4060, nsw, 19660821, , 07 34459589, 3290986, 1
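To make the header question concrete, here is a minimal sketch of a checker for the `name:Type(length)` pattern visible in the first line of the sample above. The pattern is inferred only from that sample, not from NADEEF's actual parser, and `parse_typed_header` and `HEADER_FIELD` are hypothetical names. It assumes no type carries a comma inside its parentheses.

```python
import re

# One header field: column name, colon, type name, optional "(length)".
HEADER_FIELD = re.compile(r"^\s*(\w+)\s*:\s*(\w+)\s*(?:\(\s*(\d+)\s*\))?\s*$")

def parse_typed_header(line):
    """Split a 'name:Type(n), name:Type(n), ...' header into
    (name, TYPE, length-or-None) tuples; raise ValueError on a bad field."""
    columns = []
    for field in line.split(","):
        match = HEADER_FIELD.match(field)
        if match is None:
            raise ValueError(f"malformed header field: {field!r}")
        name, sql_type, length = match.groups()
        columns.append((name, sql_type.upper(), int(length) if length else None))
    return columns

header = "rec_id:Varchar(600), given_name:Varchar(1000), date_of_birth:varchar(1000)"
columns = parse_typed_header(header)
print(columns)
```

A check like this could run at upload time so that a file with an unexpected header produces an explicit message instead of silently not appearing in the GUI.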
It would be nice to have an example/template for the different rules in the panel where they can be specified.
E.g., if I click FD, it should show the syntax for an FD;
if I click Java, it should show the piece of Java I need to fill in;
etc.
If we edit a rule's code after clicking Detect, the rule description shows that the rule is updated, but the violation tab still shows the violations of the previous rule code.
There are some issues when running multiple rules together; see paolo4 with 3 rules. In the rule attribute view only 2 attributes are shown (but 4 are actually involved).
When I click on a violating tuple in the upper panels, I'd like to see the tuple of interest and its context: the other tuples involved in its violations
For example, if tuple X is in violation with tuple Y on constraint C1 and with tuple Z on constraint C2, it would be great to see the cells involved in X and Y in one background color, and the cells involved in X and Z in another color.
I believe this would greatly improve the usability of the GUI.
If someone clicks Detect on rule R1 and then switches to another rule R2, selecting R1 again does not show its violations until Detect is clicked again.
In the violation relation, can we use different edges (different colors?) for different rules?
Can we do some kind of zoom in/out or ordering?
This panel has a lot of potential, I wonder if we can do more in terms of visualisation
Can we use any graph algorithm to analyse it?
When receiving CSV files, the header should be sanitized for proper Postgres column names.
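A minimal sketch of what such sanitization could look like, assuming the goal is lowercase names made only of letters, digits, and underscores, not starting with a digit, unique within the table, and within PostgreSQL's default 63-byte identifier limit. The function names are hypothetical, and reserved words (for example `user` or `select`) are deliberately not handled here.

```python
import re

MAX_IDENT_LEN = 63  # PostgreSQL's default identifier length limit

def sanitize_identifier(raw, fallback="col"):
    """Lowercase the name and replace every character outside [0-9a-z_]
    with '_'; prefix names that are empty or start with a digit."""
    name = re.sub(r"[^0-9a-z_]", "_", raw.strip().lower())
    if not name:
        name = fallback
    elif name[0].isdigit():
        name = f"{fallback}_{name}"
    return name[:MAX_IDENT_LEN]

def sanitize_header(names):
    """Sanitize every column name and append _2, _3, ... to duplicates."""
    used = set()
    result = []
    for raw in names:
        base = sanitize_identifier(raw)
        candidate, suffix = base, 1
        while candidate in used:
            suffix += 1
            tag = f"_{suffix}"
            # Trim the base so the suffixed name still fits the limit.
            candidate = f"{base[:MAX_IDENT_LEN - len(tag)]}{tag}"
        used.add(candidate)
        result.append(candidate)
    return result

print(sanitize_header(["First Name", "first-name", "Age", "2nd col"]))
```

The same `sanitize_identifier` step could plausibly be applied to uploaded file names before they are used as table names, which would turn a name such as `noisy-data` into `noisy_data`.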