bmiller1009 / deduper

General deduping engine for JDBC sources with output to JDBC/CSV targets
License: Apache License 2.0
Investigate using a blocking queue
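A bounded blocking queue is the usual way to hand rows from a reader thread to consumer threads without unbounded memory growth. A minimal sketch using `java.util.concurrent` (the class, sentinel, and queue size are illustrative, not taken from the deduper codebase):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueSketch {
    static final String POISON = "__END__";  // sentinel telling the consumer to stop

    public static List<String> pipe(List<String> rows) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);  // bounded: producer blocks when full
        List<String> consumed = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String item = queue.take(); !item.equals(POISON); item = queue.take()) {
                    consumed.add(item);  // e.g. check the row hash against "seen" hashes here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (String row : rows) {
            queue.put(row);  // blocks instead of growing without bound
        }
        queue.put(POISON);
        consumer.join();     // join gives a happens-before edge, so reading consumed is safe
        return consumed;
    }
}
```

The poison-pill sentinel is one of several shutdown conventions; interruption or a count-down latch would work equally well.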
API should support showing items in a JNDI context
Add ability to preview a sample row of data so that user can see how the values are being hashed
Move the delete APIs for target and dupes to the JNDI Target/Dupes objects
Store string hash as one of the values in the duplicate table
Change Builder to take in Csv/Sql target JNDI objects rather than just strings
deduper should function as both a library and a CLI. Add a command-line interface with argument parsing and usage display so it can be run from the command line.
This should only be done once at the beginning of the program
deduper should allow an input parameter of a list of "seen" hashes. This will allow deduper to dedupe against other data sets as well as against "itself"
Add the ability for deduper to source hash values from a Kafka topic as well as write them to a target topic
API should support listing JNDI contexts available
Dupe Count report should contain a count of all dupes as well as unique dupes
Executor timeout setting should be set as a parameter in the Deduper Builder API with a default of 60 seconds
Allow capitalization to be either considered or ignored
Some method signatures are undocumented or incorrect after async code merge
deduper should be able to extract a full set of hashes (1 per row). This will allow deduping between multiple data sets.
Add ability to detect dupes using COUNT/HAVING SQL syntax
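Pushing dupe detection into the database means grouping on the key columns and keeping groups that occur more than once. A hedged sketch of building such a query (the table/column names and the `dupe_count` alias are illustrative placeholders, not deduper's actual SQL):

```java
public class DupeQuery {
    // Builds a query returning key values that appear more than once,
    // along with how often each one occurs.
    public static String havingDupes(String table, String... keyCols) {
        String cols = String.join(", ", keyCols);
        return "SELECT " + cols + ", COUNT(*) AS dupe_count FROM " + table
             + " GROUP BY " + cols + " HAVING COUNT(*) > 1";
    }
}
```

This trades the in-memory hash comparison for work done by the database engine, which is attractive on sources too large to stream comfortably.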
Add proper support for CI/CD of project
deduper should have the ability to add new JNDI entries on the fly without modifying the config file:
https://www.journaldev.com/2509/java-datasource-jdbc-datasource-example
Deduper should allow option for "seen" hashes to be cached on disk or in memory
Flat file output should be locked while being written to
Null values should be "stringified" and "hashed in" by default or optionally ignored
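A sketch of what the two null-handling modes could look like when a row is flattened into a hash input. The `<NULL>` sentinel, the `|` separator, and the MD5 choice are illustrative assumptions, not the engine's actual scheme:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.StringJoiner;

public class RowHasher {
    // With stringifyNulls, a null column takes part in the hash via a
    // sentinel token; without it, null columns are skipped entirely.
    public static String hashRow(Object[] row, boolean stringifyNulls) {
        StringJoiner joined = new StringJoiner("|");
        for (Object col : row) {
            if (col == null) {
                if (stringifyNulls) joined.add("<NULL>");
            } else {
                joined.add(col.toString());
            }
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(joined.toString().getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is always present on the JVM
        }
    }
}
```

Note the two modes deliberately hash differently: stringified nulls make `("a", null)` distinct from `("a")`, while ignoring nulls collapses them.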
Get some performance metrics on how fast deduper runs with different numbers of rows and different numbers of columns.
SqlPersistor transaction rollback shouldn't occur in the finally block; it should occur in the catch block
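The intended shape, sketched as a generic JDBC helper rather than the actual SqlPersistor code: rollback belongs in the catch path so it only runs on failure, while finally merely restores connection state:

```java
import java.sql.Connection;
import java.sql.SQLException;

public class TxTemplate {
    public interface SqlWork { void run() throws SQLException; }

    // Commit on success, roll back only when the work throws; the finally
    // block just restores auto-commit and never rolls back.
    public static void inTransaction(Connection conn, SqlWork work) throws SQLException {
        try {
            conn.setAutoCommit(false);
            work.run();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();   // rollback lives in the catch block
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```

Putting `rollback()` in finally would undo committed work whenever the commit itself hadn't reset the transaction, and would mask the original exception if the rollback also failed.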
Add library build instructions to README
Add input parameters to control whether duplicates and deduped data are persisted
Currently only Sql persistor accepts a parameter for persisting the json representation of a hash
Use trove4j to store the long representation of string hashes when building up the hash list in the type loop. This will save considerable space versus a mutable Map&lt;String, Long&gt;.
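trove4j's primitive collections (e.g. `TLongHashSet`) hold unboxed longs, avoiding one boxed `Long` plus map-entry overhead per row. Since trove4j is a third-party dependency, here is a stdlib approximation of the same idea: membership over a sorted primitive `long[]`. The class name and merge-on-add strategy are illustrative only:

```java
import java.util.Arrays;

public class LongHashSetSketch {
    // Stand-in for trove4j's TLongHashSet: hashes live in a primitive array,
    // so there is no per-entry boxing as with Map<String, Long>.
    private long[] hashes = new long[0];

    public void addAll(long[] values) {
        long[] merged = Arrays.copyOf(hashes, hashes.length + values.length);
        System.arraycopy(values, 0, merged, hashes.length, values.length);
        Arrays.sort(merged);   // keep sorted so lookups are O(log n)
        hashes = merged;
    }

    public boolean contains(long value) {
        return Arrays.binarySearch(hashes, value) >= 0;
    }
}
```

A real implementation would use open addressing (as trove4j does) rather than re-sorting on every batch, but the memory argument is the same: 8 bytes per hash instead of a boxed entry.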
Use the builder pattern to improve config parameter collection
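A sketch of what such a builder could look like. The class name, the JNDI setters, and the 60-second executor-timeout default (taken from the timeout item above) are illustrative assumptions, not deduper's actual API:

```java
public class DeduperConfig {
    private final String sourceJndi;
    private final String targetJndi;
    private final long executorTimeoutSeconds;

    private DeduperConfig(Builder b) {
        this.sourceJndi = b.sourceJndi;
        this.targetJndi = b.targetJndi;
        this.executorTimeoutSeconds = b.executorTimeoutSeconds;
    }

    public String sourceJndi() { return sourceJndi; }
    public String targetJndi() { return targetJndi; }
    public long executorTimeoutSeconds() { return executorTimeoutSeconds; }

    public static class Builder {
        private String sourceJndi;
        private String targetJndi;
        private long executorTimeoutSeconds = 60;  // default per the executor-timeout item

        public Builder sourceJndi(String jndi) { this.sourceJndi = jndi; return this; }
        public Builder targetJndi(String jndi) { this.targetJndi = jndi; return this; }
        public Builder executorTimeout(long seconds) { this.executorTimeoutSeconds = seconds; return this; }
        public DeduperConfig build() { return new DeduperConfig(this); }
    }
}
```

The builder keeps the config object immutable and lets optional parameters pick up defaults without telescoping constructors.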
Refactor table creation code into SqlUtils library
Sql Target should reflect null/not null of the jdbc source. Currently this property is not reflected in the target.
On large data sets, the number of seen hashes stored in memory could be too heavy. Add the ability to keep some hashes in memory and some on disk using some sort of smart caching behavior.
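One shape this caching could take: a bounded LRU map in memory, with evicted hashes spilled to a temp file that is scanned on a miss. This is only a sketch of the idea; a production version would index or bloom-filter the spill file instead of scanning it, and all names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class SpillingSeenHashes {
    private final Path spillFile;
    private final LinkedHashMap<String, Boolean> hot;  // access-order LRU

    public SpillingSeenHashes(int maxInMemory) throws IOException {
        spillFile = Files.createTempFile("seen-hashes", ".txt");
        hot = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                if (size() > maxInMemory) {
                    try {  // spill the least-recently-used hash to disk
                        Files.write(spillFile, java.util.Arrays.asList(eldest.getKey()),
                                StandardOpenOption.APPEND);
                    } catch (IOException e) { throw new RuntimeException(e); }
                    return true;
                }
                return false;
            }
        };
    }

    public void add(String hash) { hot.put(hash, Boolean.TRUE); }

    public boolean contains(String hash) throws IOException {
        if (hot.containsKey(hash)) return true;
        return Files.readAllLines(spillFile).contains(hash);  // slow path: scan the spill file
    }
}
```

Hot hashes stay cheap to check, cold ones cost a disk read, and memory use is capped at `maxInMemory` entries regardless of data-set size.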
Fill out proper README
Options such as quoting values, handling delimiters contained within values, etc. should be better handled. Take a look at OpenCsv:
compile group: 'com.opencsv', name: 'opencsv', version: '4.6'
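For illustration of what "better handled" means here, this is the RFC 4180-style quoting rule that libraries like OpenCsv implement, sketched with the stdlib (the helper name is hypothetical):

```java
public class CsvQuoting {
    // RFC 4180-style quoting: wrap a field in double quotes when it contains
    // the delimiter, a quote, or a line break; double any embedded quotes.
    public static String quote(String field, char delimiter) {
        boolean needsQuotes = field.indexOf(delimiter) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) return field;
        return '"' + field.replace("\"", "\"\"") + '"';
    }
}
```

Delegating this to OpenCsv rather than hand-rolling it also covers the reader side (unquoting, embedded newlines) for free.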
Make hash column primary key in SQL Persistor duplicate table
Flat file output in JNDI should have reasonable defaults, e.g. comma-delimited with .txt as the extension
Allow each JNDI connection to be attached to a different context
Currently no checks are being done on whether the dupe or target persistence already exists. Add an input parameter for optionally dropping/truncating existing dupe/target persistence.
There is repetitive code in the Consumer classes which could be refactored with a base Consumer class
Add Apache Drill JDBC to open up more sources to read from
Host Dokka content on git
The lock file for CSV output isn't always deleted when a run is over
Change Duplicates API to store all instances of a duplicate together. The current implementation stores each instance as a separate item, which is overkill and wastes space.
There is some new functionality that should be documented in the README after the async branch code merge
Rather than performing complex data transformations for each pair in the duplicate persistence, transform the whole collection before processing it
If the publishing thread fails, all consuming threads should be shut down
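One way to get this fail-fast behavior with an `ExecutorService`: wait on the publisher's `Future` and call `shutdownNow()` when it fails, which interrupts every consumer. The class name and the sleep-loop consumers are illustrative stand-ins, not the actual deduper threading code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FailFastPipeline {
    // One publisher plus N consumers; if the publisher throws, shutdownNow()
    // interrupts every consumer so they stop promptly instead of waiting forever.
    public static boolean run(Callable<Void> publisher, int consumers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Thread.sleep(10);  // stand-in for taking work from a queue
                    }
                } catch (InterruptedException e) {
                    // interrupted by shutdownNow(): exit cleanly
                }
            });
        }
        Future<Void> pub = pool.submit(publisher);
        boolean publisherFailed = false;
        try {
            pub.get();                   // block until the publisher finishes
        } catch (ExecutionException e) {
            publisherFailed = true;      // publisher died: tear down the consumers
        }
        pool.shutdownNow();              // in this sketch consumers are always interrupted here;
                                         // the key point is that the failure path reaches it too
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return publisherFailed;
    }
}
```

Consumers must cooperate by checking the interrupt flag (or letting blocking queue calls throw `InterruptedException`); `shutdownNow()` alone cannot stop a thread that ignores interruption.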
Add Dokka documentation
Improve and expand unit testing
Add proper logging to each class
Publish library to maven central