We implemented an auditing system per the code provided here. One issue we've had, though, is that the client-level info (client_query, client_addr, client_port, action_tstamp_tx, etc.) is repeated for every row updated within a given transaction. Under certain usage patterns this wouldn't matter much, but in our case it has caused our audit.logged_actions table to become extremely bloated. As an example of the kind of transaction that has caused serious bloat, a backend user may update a bunch of rows with something like the following:
UPDATE foo SET bar1='updated bar 1', bar2 = ... WHERE bar_id = 1;
UPDATE foo SET bar1='updated bar 2', bar2 = ... WHERE bar_id = 2;
...
UPDATE foo SET bar1='updated bar 100000', bar2 = ... WHERE bar_id = 100000;
Run as a single transaction, this produces 100,000 rows in the logged_actions table, each carrying the same values for client_query (i.e., the entire batch of statements above) plus repeated client_addr, client_port, transaction_id, action_tstamp_tx, etc. That redundancy is negligible for most of those columns, but the client_query column repeats the 100,000-line statement in every row. When the values being set are themselves sizable (e.g., each row has many varchar columns), that adds up to a lot of disk space consumed by redundant client_query text.
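For a rough sense of how much of that space is query text, something like the following against logged_actions gives a ballpark (the actual on-disk footprint will be somewhat smaller, since long client_query values get TOAST-compressed):

SELECT count(*) AS audit_rows,
       count(DISTINCT client_query) AS distinct_queries,
       pg_size_pretty(sum(octet_length(client_query))) AS raw_query_bytes
FROM audit.logged_actions;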
By way of explanation, we aggregate data from other live databases to which we don't have direct access or connection. Sometimes those databases modify records on their side, so we periodically (e.g., annually) update our records to reflect those changes, using a unique ID to compare and update data. Our older workflows involved users constructing update statements using Excel or some other means outside the database, then running these via pgAdmin.
I am trying to move backend users away from that workflow, so that each update or delete statement is run as its own query/transaction and that kind of duplication in the client_query column is avoided, but we're still stuck with the existing records in a logged_actions table that is around 56 GB (for the sake of comparison, that is almost half the total size of our database); a sketch of the kind of one-off cleanup we have in mind follows the example below. Even with more normal/sane usage patterns, there is still a significant DRY violation here. Where updates are done by referencing a temporary table after a CSV import, for example, the client_query still gets unnecessarily duplicated for each row that is updated as a result, e.g.,
UPDATE foo
SET
    col_1 = update_table.col_1,
    col_2 = update_table.col_2,
    col_3 = update_table.col_3,
    ...
    col_n = update_table.col_n
FROM update_table
WHERE update_table.id = foo.id;
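As for the existing history, the one-off cleanup we have been sketching looks roughly like the following. This is only a sketch: audit.client_query and the query_id column are hypothetical names, the trigger function would also need to be updated to populate query_id before the old column is dropped, and the space is only actually reclaimed after a VACUUM FULL (or pg_repack) of logged_actions.

BEGIN;

-- Lookup table with one row per distinct statement text. No unique constraint
-- on the text itself, since a btree entry cannot hold a 100,000-line query;
-- SELECT DISTINCT is enough for a one-off migration.
CREATE TABLE audit.client_query (
    query_id bigserial PRIMARY KEY,
    query    text NOT NULL
);

INSERT INTO audit.client_query (query)
SELECT DISTINCT client_query
FROM audit.logged_actions
WHERE client_query IS NOT NULL;

ALTER TABLE audit.logged_actions
    ADD COLUMN query_id bigint REFERENCES audit.client_query (query_id);

-- Joining on md5() of both sides would likely be faster at this table size,
-- but plain equality keeps the sketch simple.
UPDATE audit.logged_actions AS la
SET query_id = cq.query_id
FROM audit.client_query AS cq
WHERE la.client_query = cq.query;

ALTER TABLE audit.logged_actions DROP COLUMN client_query;

COMMIT;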
I'm wondering whether you've considered refactoring this so that the client-level info (client_query, client_addr, client_port, action_tstamp_tx, etc.) is written to one table and the actual per-row audit data (schema_name, table_name, action, row_data, changed_fields, etc.) is written to another, with the latter referencing a row in the former via an FK. It seems the trigger function could be modified to write the client-level information to that first table and return a new primary key (or, if a matching row already exists, return its existing key), then use that key as the FK in an "event"-level table that tracks the actual changes.
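To make the idea concrete, here is a rough sketch of the shape I have in mind. Everything here is a placeholder rather than a reference to the existing code (audit.logged_transactions, audit.logged_events, txn_id, if_modified_sketch, etc.), the event table carries far fewer columns than the real logged_actions, and I've used plain jsonb for row_data/changed_fields just to keep the sketch self-contained (changed_fields below is simply the whole NEW row rather than a proper diff):

-- Client-level info: one row per (transaction, statement) instead of one per audited row.
CREATE TABLE audit.logged_transactions (
    txn_id           bigserial PRIMARY KEY,
    transaction_id   bigint NOT NULL,
    action_tstamp_tx timestamptz NOT NULL,
    client_addr      inet,
    client_port      integer,
    client_query     text
);

-- Index a hash of the query text rather than the text itself; a btree entry
-- cannot hold a statement that is thousands of lines long.
CREATE UNIQUE INDEX logged_transactions_txn_query_uq
    ON audit.logged_transactions (transaction_id, md5(client_query));

-- Event-level info: one row per audited row, referencing the client-level row.
CREATE TABLE audit.logged_events (
    event_id       bigserial PRIMARY KEY,
    txn_id         bigint NOT NULL REFERENCES audit.logged_transactions,
    schema_name    text   NOT NULL,
    table_name     text   NOT NULL,
    action         text   NOT NULL CHECK (action IN ('I','U','D')),
    row_data       jsonb,
    changed_fields jsonb
);

CREATE OR REPLACE FUNCTION audit.if_modified_sketch() RETURNS trigger
LANGUAGE plpgsql AS $body$
DECLARE
    v_txn_id bigint;
BEGIN
    -- Write the client-level row once per (transaction, statement) and reuse
    -- its key for every subsequent row the same statement touches.
    SELECT txn_id INTO v_txn_id
      FROM audit.logged_transactions
     WHERE transaction_id = txid_current()
       AND md5(client_query) = md5(current_query());

    IF v_txn_id IS NULL THEN
        INSERT INTO audit.logged_transactions
               (transaction_id, action_tstamp_tx, client_addr, client_port, client_query)
        VALUES (txid_current(), transaction_timestamp(),
                inet_client_addr(), inet_client_port(), current_query())
        RETURNING txn_id INTO v_txn_id;
    END IF;

    -- The per-row event stores only the FK plus the row-level data.
    IF TG_OP = 'UPDATE' THEN
        INSERT INTO audit.logged_events (txn_id, schema_name, table_name, action, row_data, changed_fields)
        VALUES (v_txn_id, TG_TABLE_SCHEMA, TG_TABLE_NAME, 'U', to_jsonb(OLD), to_jsonb(NEW));
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO audit.logged_events (txn_id, schema_name, table_name, action, row_data)
        VALUES (v_txn_id, TG_TABLE_SCHEMA, TG_TABLE_NAME, 'D', to_jsonb(OLD));
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO audit.logged_events (txn_id, schema_name, table_name, action, row_data)
        VALUES (v_txn_id, TG_TABLE_SCHEMA, TG_TABLE_NAME, 'I', to_jsonb(NEW));
    END IF;

    RETURN NULL;
END;
$body$;

Attaching it would look the same as today, e.g. CREATE TRIGGER audit_trigger_row AFTER INSERT OR UPDATE OR DELETE ON foo FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_sketch();. With something along these lines, the 100,000-statement example above would store the batch's query text once per transaction/statement rather than once per updated row, and the temp-table UPDATE example would store it exactly once.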