README.md

finding_duplicate_rows

[WIP] simple view to find all duplicate rows from PK python script to give more readable results

Much like maintenance_schema, this view allows to find exactly how many occurences and which tuples are duplicates. This only takes into account primary keys as entry. Need to expand to UNIQUE constraints as well.

This is how to use it.

postgres@dv400bd04g7018:~$ psql -t -d postgres -c "select * from rpt_duplicate_rows" | sed 's/"\|+//g' | psql -d postgres
 duplicate_rows | count |  pks
----------------+-------+-------
 toto           |     2 | {1,2}
(1 row)

 duplicate_rows | count | pks
----------------+-------+-----
(0 rows)

 content | duplicate_rows | count |  pks
---------+----------------+-------+-------
 toto2   |             12 |     2 | {2,5}
(1 row)

postgres@dv400bd04g7018:~$ psql
psql (9.6.11)
Type "help" for help.

postgres=# select * from dummy;
 id | content
----+---------
  1 | toto
  2 | toto
  3 | toto2
(3 rows)

postgres=# select * from dummy2;
 id | content
----+---------
(0 rows)

postgres=# select * from dummy3;
 id | content | num
----+---------+-----
  1 | toto    |  12
  2 | toto2   |  12
  3 | toto    |  14
  4 | toto2   |  14
  5 | toto2   |  12
(5 rows)

Better now, use pg_deduplicator.py

postgres@dv400bd04g7018:~$ python pg_deduplicator.py

Relation name: public.dummy
Duplicate content | number or duplicates(L) | Name of keys
('toto', 2L, [1, 2])


Relation name: public.dummy2
Duplicate content | number or duplicates(L) | Name of keys


Relation name: public.dummy3
Duplicate content | number or duplicates(L) | Name of keys
('toto2', 12, 2L, [2, 5])


Relation name: public.dummy4
Duplicate content | number or duplicates(L) | Name of keys
(14, 'toto', 2L, [3, 4])


Relation name: public.dummy5
Duplicate content | number or duplicates(L) | Name of keys

TODO:

====================

Use ctid column to keep either :

first/oldest/former
last/newest/latter
Add multiple PKs and unique Keys
Delete step
variables for connection
options : --report-onscreen (default), --report-html, --keep-oldest, --keep-newest

-- from wiki 
DELETE FROM tablename
WHERE id IN (
    SELECT
        id
    FROM (
        SELECT
            id,
            row_number() OVER w as rnum
        FROM tablename
        WINDOW w AS (
            PARTITION BY column1, column2, column3
            ORDER BY id
        )

    ) t
WHERE t.rnum > 1);
-- use ID or timestamptz

IDEAS : https://www.dataqualityapps.com/know-how/121-postgresql-deduping-data.html

Best so far : https://www.alibabacloud.com/blog/postgresql-data-deduplication-methods_596032

emerichunter / finding_duplicate_rows Goto Github PK

finding_duplicate_rows's Introduction

README.md

finding_duplicate_rows

TODO:

finding_duplicate_rows's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent