bmiller1009 / deduper

General deduping engine for JDBC sources with output to JDBC/CSV targets
License: Apache License 2.0
Investigate using a blocking queue
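A bounded blocking queue is the usual way to hand rows from a reader thread to consumer threads without unbounded memory growth. A minimal sketch using `java.util.concurrent` (the class, sentinel, and queue size are illustrative, not taken from the deduper codebase):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueSketch {
    static final String POISON = "__END__";  // sentinel telling the consumer to stop

    public static List<String> pipe(List<String> rows) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);  // bounded: producer blocks when full
        List<String> consumed = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String item = queue.take(); !item.equals(POISON); item = queue.take()) {
                    consumed.add(item);  // e.g. check the row hash against "seen" hashes here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (String row : rows) {
            queue.put(row);  // blocks instead of growing without bound
        }
        queue.put(POISON);
        consumer.join();     // join gives a happens-before edge, so reading consumed is safe
        return consumed;
    }
}
```

The poison-pill sentinel is one of several shutdown conventions; interruption or a count-down latch would work equally well.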
API should support showing items in a JNDI context
Add ability to preview a sample row of data so that user can see how the values are being hashed
Move the delete APIs for target and dupes to the JNDI Target/Dupes objects
Store string hash as one of the values in the duplicate table
Change Builder to take in Csv/Sql target JNDI objects rather than just strings
deduper should function as both a library and a CLI. Add a command-line interface with argument parsing and usage display so it can be run from the command line.
This should only be done once at the beginning of the program
deduper should allow an input parameter of a list of "seen" hashes. This will allow deduper to dedupe against other data sets as well as against "itself"
Add the ability for deduper to source hash values from a Kafka topic as well as write them to a target topic
API should support listing JNDI contexts available
Dupe Count report should contain a count of all dupes as well as unique dupes
Executor timeout setting should be set as a parameter in the Deduper Builder API with a default of 60 seconds
Allow capitalization to be either considered or ignored
Some method signatures are undocumented or incorrect after async code merge
deduper should be able to extract a full set of hashes (1 per row). This will allow deduping between multiple data sets.
Add ability to detect dupes using COUNT/HAVING SQL syntax
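Pushing dupe detection into the database means grouping on the key columns and keeping groups that occur more than once. A hedged sketch of building such a query (the table/column names and the `dupe_count` alias are illustrative placeholders, not deduper's actual SQL):

```java
public class DupeQuery {
    // Builds a query returning key values that appear more than once,
    // along with how often each one occurs.
    public static String havingDupes(String table, String... keyCols) {
        String cols = String.join(", ", keyCols);
        return "SELECT " + cols + ", COUNT(*) AS dupe_count FROM " + table
             + " GROUP BY " + cols + " HAVING COUNT(*) > 1";
    }
}
```

This trades the in-memory hash comparison for work done by the database engine, which is attractive on sources too large to stream comfortably.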
Add proper support for CI/CD of project
deduper should have the ability to add new JNDI entries on the fly without modifying the config file:
https://www.journaldev.com/2509/java-datasource-jdbc-datasource-example
Deduper should allow option for "seen" hashes to be cached on disk or in memory
Flat file output should be locked while being written to
Null values should be "stringified" and "hashed in" by default or optionally ignored
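A sketch of what the two null-handling modes could look like when a row is flattened into a hash input. The `<NULL>` sentinel, the `|` separator, and the MD5 choice are illustrative assumptions, not the engine's actual scheme:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.StringJoiner;

public class RowHasher {
    // With stringifyNulls, a null column takes part in the hash via a
    // sentinel token; without it, null columns are skipped entirely.
    public static String hashRow(Object[] row, boolean stringifyNulls) {
        StringJoiner joined = new StringJoiner("|");
        for (Object col : row) {
            if (col == null) {
                if (stringifyNulls) joined.add("<NULL>");
            } else {
                joined.add(col.toString());
            }
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(joined.toString().getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is always present on the JVM
        }
    }
}
```

Note the two modes deliberately hash differently: stringified nulls make `("a", null)` distinct from `("a")`, while ignoring nulls collapses them.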
Get some performance metrics on how fast deduper runs with different numbers of rows and different numbers of columns.
SqlPersistor transaction rollback shouldn't occur in the finally block; it should occur in the catch block
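The intended shape, sketched as a generic JDBC helper rather than the actual SqlPersistor code: rollback belongs in the catch path so it only runs on failure, while finally merely restores connection state:

```java
import java.sql.Connection;
import java.sql.SQLException;

public class TxTemplate {
    public interface SqlWork { void run() throws SQLException; }

    // Commit on success, roll back only when the work throws; the finally
    // block just restores auto-commit and never rolls back.
    public static void inTransaction(Connection conn, SqlWork work) throws SQLException {
        try {
            conn.setAutoCommit(false);
            work.run();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();   // rollback lives in the catch block
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```

Putting `rollback()` in finally would undo committed work whenever the commit itself hadn't reset the transaction, and would mask the original exception if the rollback also failed.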
Add library build instructions to README
Add input parameters to control whether duplicates and deduped data are persisted
Currently only Sql persistor accepts a parameter for persisting the json representation of a hash
Use trove4j to store the long representation of string hashes when building up the hash list in the type loop. This will save considerable space versus a mutable Map&lt;String, Long&gt;.
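trove4j's primitive collections (e.g. `TLongHashSet`) hold unboxed longs, avoiding one boxed `Long` plus map-entry overhead per row. Since trove4j is a third-party dependency, here is a stdlib approximation of the same idea: membership over a sorted primitive `long[]`. The class name and merge-on-add strategy are illustrative only:

```java
import java.util.Arrays;

public class LongHashSetSketch {
    // Stand-in for trove4j's TLongHashSet: hashes live in a primitive array,
    // so there is no per-entry boxing as with Map<String, Long>.
    private long[] hashes = new long[0];

    public void addAll(long[] values) {
        long[] merged = Arrays.copyOf(hashes, hashes.length + values.length);
        System.arraycopy(values, 0, merged, hashes.length, values.length);
        Arrays.sort(merged);   // keep sorted so lookups are O(log n)
        hashes = merged;
    }

    public boolean contains(long value) {
        return Arrays.binarySearch(hashes, value) >= 0;
    }
}
```

A real implementation would use open addressing (as trove4j does) rather than re-sorting on every batch, but the memory argument is the same: 8 bytes per hash instead of a boxed entry.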
Use the builder pattern to improve config parameter collection
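A sketch of what such a builder could look like. The class name, the JNDI setters, and the 60-second executor-timeout default (taken from the timeout item above) are illustrative assumptions, not deduper's actual API:

```java
public class DeduperConfig {
    private final String sourceJndi;
    private final String targetJndi;
    private final long executorTimeoutSeconds;

    private DeduperConfig(Builder b) {
        this.sourceJndi = b.sourceJndi;
        this.targetJndi = b.targetJndi;
        this.executorTimeoutSeconds = b.executorTimeoutSeconds;
    }

    public String sourceJndi() { return sourceJndi; }
    public String targetJndi() { return targetJndi; }
    public long executorTimeoutSeconds() { return executorTimeoutSeconds; }

    public static class Builder {
        private String sourceJndi;
        private String targetJndi;
        private long executorTimeoutSeconds = 60;  // default per the executor-timeout item

        public Builder sourceJndi(String jndi) { this.sourceJndi = jndi; return this; }
        public Builder targetJndi(String jndi) { this.targetJndi = jndi; return this; }
        public Builder executorTimeout(long seconds) { this.executorTimeoutSeconds = seconds; return this; }
        public DeduperConfig build() { return new DeduperConfig(this); }
    }
}
```

The builder keeps the config object immutable and lets optional parameters pick up defaults without telescoping constructors.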
Refactor table creation code into SqlUtils library
Sql Target should reflect null/not null of the jdbc source. Currently this property is not reflected in the target.
On large data sets, the number of seen hashes stored in memory could be too heavy. Add the ability to keep some hashes in memory and some on disk using some sort of smart caching behavior.
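One shape this caching could take: a bounded LRU map in memory, with evicted hashes spilled to a temp file that is scanned on a miss. This is only a sketch of the idea; a production version would index or bloom-filter the spill file instead of scanning it, and all names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class SpillingSeenHashes {
    private final Path spillFile;
    private final LinkedHashMap<String, Boolean> hot;  // access-order LRU

    public SpillingSeenHashes(int maxInMemory) throws IOException {
        spillFile = Files.createTempFile("seen-hashes", ".txt");
        hot = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                if (size() > maxInMemory) {
                    try {  // spill the least-recently-used hash to disk
                        Files.write(spillFile, java.util.Arrays.asList(eldest.getKey()),
                                StandardOpenOption.APPEND);
                    } catch (IOException e) { throw new RuntimeException(e); }
                    return true;
                }
                return false;
            }
        };
    }

    public void add(String hash) { hot.put(hash, Boolean.TRUE); }

    public boolean contains(String hash) throws IOException {
        if (hot.containsKey(hash)) return true;
        return Files.readAllLines(spillFile).contains(hash);  // slow path: scan the spill file
    }
}
```

Hot hashes stay cheap to check, cold ones cost a disk read, and memory use is capped at `maxInMemory` entries regardless of data-set size.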
Fill out proper README
Options such as quoting values, handling delimiters contained within values, etc. should be better handled. Take a look at OpenCsv:
compile group: 'com.opencsv', name: 'opencsv', version: '4.6'
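For illustration of what "better handled" means here, this is the RFC 4180-style quoting rule that libraries like OpenCsv implement, sketched with the stdlib (the helper name is hypothetical):

```java
public class CsvQuoting {
    // RFC 4180-style quoting: wrap a field in double quotes when it contains
    // the delimiter, a quote, or a line break; double any embedded quotes.
    public static String quote(String field, char delimiter) {
        boolean needsQuotes = field.indexOf(delimiter) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) return field;
        return '"' + field.replace("\"", "\"\"") + '"';
    }
}
```

Delegating this to OpenCsv rather than hand-rolling it also covers the reader side (unquoting, embedded newlines) for free.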
Make hash column primary key in SQL Persistor duplicate table
Flat file output in JNDI should have reasonable defaults, e.g. comma-delimited with .txt as the extension
Allow each JNDI connection to be attached to a different context
Currently no checks are being done on whether the dupe or target persistence already exists. Add an input parameter for optionally dropping/truncating existing dupe/target persistence.
There is repetitive code in the Consumer classes which could be refactored with a base Consumer class
Add Apache Drill JDBC to open up more sources to read from
Host Dokka content on git
The lock file for CSV output isn't always deleted when a run is over
Change Duplicates API to store all instances of a duplicate together. The current implementation stores each instance as a separate item, which is overkill and wastes space.
There is some new functionality that should be documented in the README after the async branch code merge
Rather than performing complex data transformations for each pair in the duplicate persistence, transform the whole collection before processing it
If the publishing thread fails, all consuming threads should be shut down
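One way to get this fail-fast behavior with an `ExecutorService`: wait on the publisher's `Future` and call `shutdownNow()` when it fails, which interrupts every consumer. The class name and the sleep-loop consumers are illustrative stand-ins, not the actual deduper threading code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FailFastPipeline {
    // One publisher plus N consumers; if the publisher throws, shutdownNow()
    // interrupts every consumer so they stop promptly instead of waiting forever.
    public static boolean run(Callable<Void> publisher, int consumers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Thread.sleep(10);  // stand-in for taking work from a queue
                    }
                } catch (InterruptedException e) {
                    // interrupted by shutdownNow(): exit cleanly
                }
            });
        }
        Future<Void> pub = pool.submit(publisher);
        boolean publisherFailed = false;
        try {
            pub.get();                   // block until the publisher finishes
        } catch (ExecutionException e) {
            publisherFailed = true;      // publisher died: tear down the consumers
        }
        pool.shutdownNow();              // in this sketch consumers are always interrupted here;
                                         // the key point is that the failure path reaches it too
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return publisherFailed;
    }
}
```

Consumers must cooperate by checking the interrupt flag (or letting blocking queue calls throw `InterruptedException`); `shutdownNow()` alone cannot stop a thread that ignores interruption.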
Add Dokka documentation
Improve and expand unit testing
Add proper logging to each class
Publish library to maven central