diffix / pg_diffix Goto Github PK
View Code? Open in Web Editor NEWImplementation of the Open Diffix anonymization mechanism for PostgreSQL.
Home Page: https://www.open-diffix.org
License: MIT License
Implementation of the Open Diffix anonymization mechanism for PostgreSQL.
Home Page: https://www.open-diffix.org
License: MIT License
ERROR: Unsupported AID type (OID 1043)
If I write the following query SELECT count(name) FROM users
which gets rewritten to diffix_count(id, name)
I get the following RTE:
{RTE
:alias <>
:eref {ALIAS
:aliasname users
:colnames (
"id"
"name"
)
}
:rtekind 0
:relid 136662
:relkind r
:rellockmode 1
:tablesample <>
:lateral false
:inh true
:inFromCl true
:requiredPerms 2
:checkAsUser 0
:selectedCols (b 9)
:insertedCols (b)
:updatedCols (b)
:extraUpdatedCols (b)
:securityQuals <>
}
When I write SELECT diffix_count(id, name) FROM users
directly I get this RTE:
{RTE
:alias <>
:eref {ALIAS
:aliasname users
:colnames (
"id"
"name"
)
}
:rtekind 0
:relid 136662
:relkind r
:rellockmode 1
:tablesample <>
:lateral false
:inh true
:inFromCl true
:requiredPerms 2
:checkAsUser 0
:selectedCols (b 8 9)
:insertedCols (b)
:updatedCols (b)
:extraUpdatedCols (b)
:securityQuals <>
}
These are the same except for selectedCols
: :selectedCols (b 9)
vs :selectedCols (b 8 9)
.
The (b ...)
indicates a bit list. We probably need to match this as the planner could use this data to determine what it needs to look up.
This should be similar to diffix_lcf
aggregate found in pg_diffix/template/agg_lcf_a.h
. The only difference would be the final function. In this case it would return the noisy uniq count instead of true/false.
This is a continuation of the previous issue: #1
We have install instructions for how to get the extension running, but going forward it would be wonderful to have a Dockerfile where the extension is built, installed, and loaded for you.
I'm writing the int8 AID tracker, but I'm getting some second thoughts. Are we sure that we benefit enough to merge AIDs to int8?
We can get rid of the macros in other ways.
AID can be an untagged union and some function pointers can handle the rest. We determine the functions when we create the AID/contribution tracker by examining types sent to the aggregate.
By squeezing AIDs into the same shape we're losing bijection (for strings). We also lose the nice explain
features which could be a debugging and teaching utility.
Thoughts?
If an AID contributes with NULL
, do we include that AID in the seed material?
Currently, custom variables are used to set the anonymization parameters of the system, e.g.:
SET pg_diffix.default_access_level = 'publish';
SET pg_diffix.noise_seed = 'secret_seed';
SET pg_diffix.minimum_allowed_aids = 3;
I did some reading on this topic and thought about alternative solutions, and, in the end, I think the initial approach here was the correct one.
The Postgres config file recognizes custom variables, so the default values we provide can be changed at server start-up.
Using a configuration table would allow us to persistently change the settings dynamically during run-time, but I don't think that is something we'll need. It would also be more complex.
I suggest to keep using custom variables to configure the anonymization settings of the system.
Later, we can add support for different anonymization profiles, like publish, cloak or knox, and then use security labels to assign a profile to a user.
There is also the question if we should allow setting different anonymization parameters per table or per column. That should be possible using security labels, but it would add significant complexity to the system (in implementation and usage) and I don't think it makes sense to worry about it at this stage. We might never need it.
This is branching off from: #21
In his comment, @yoid2000 asked some questions. I'll add a comment to his question #2 here.
Regarding 2, can we avoid this question by always knowing what the actual AID is?
The AID would be null
if the value is missing in the dataset, rather than as an artefact of the query itself. So it will in fact not always be possible to know what an AID is. In Aircloak Insights we explicitly filtered out the data where there was no AID value present.
In practice, this will lead to severe data loss for certain datasets (I saw this a couple of times with Aircloak insights), but in all the instances that I encountered the problem would not have been solved by including this data. Had it all been represented by a null
-AID it would have been suppressed as an extreme outlier. Had it been assigning random AID values in place of the null
-value, then we would have lost all ability to make any claims about the resulting anonymity. This all leads me to think we must drop null
-AIDed data in Open Diffix too.
Some users need to have raw access and some users need to have anon access.
Swap count(*)
and count(any)
aggregates with the respective diffix_count
versions. Aggregate OIDs can be looked up in OidCache
found in "pg_diffix/oid_cache.h"
.
We load config once at startup. This has issues if we remake tables or if tables don't exist yet at time of activation.
Some possible approaches:
Current idea is an untagged union of double / int64
and a struct with functions.
If we EXPLAIN SELECT count(*) FROM sensitive_table
then this is considered as a non-anon query
AllocSetContextCreate
/ MemoryContextResetAndDeleteChildren
Once we have implicit counts we need to unroll them to one row per noisy count.
I suggest adding the count as a junk (or maybe regular) column, then nest this into a subquery and do some cross join magic to unroll rows.
We rewrite this in the post_parse_analyze
hook.
Alternatively, SRF (set-returning functions) could be used.
Currently DEBUG
is hardcoded here https://github.com/diffix/extension-prototype/blob/65e94fc021a066e05c36ff9d0733a1d282df473d/pg_diffix/utils.h#L4
We need to make it so the build process determines whether we're making a dev or prod build. Ideally an env variable should be used to indicate this.
If we change custom variables using SET x = y
, these changes apply to the current session only.
The
SET
command allows modification of the current value of those parameters that can be set locally to a session; it has no effect on other sessions. The corresponding SQL function isset_config(setting_name, new_value, is_local)
Only superusers can modify these anyway, but for a persisting solution these need to be put in the config file.
We have to determine if the config file recognizes custom variables.
We need preloading because otherwise hooks won't run and you can query sensitive tables.
We need to research the easiest way to do this and provide simple instructions in the README.
We need some reference what labels apply to what.
Tables, AIDs, and other relation config need to be managed dynamically. A config table should be created to model this information.
https://www.postgresql.org/docs/current/xoper.html
I think operators is better for optimization.
We have to recursively identify AIDs in each subquery and add those to select lists. Parent subqueries include AIDs from their tables + all child subqueries.
We consider AIDs to be nominal so we don't want deduplication and luckily that's easier to implement.
As the project is growing we get more and more files in a flat structure. I think we need to organize headers and source files logically.
Loose example of what I'm suggesting:
.
├── aggregation
│ ├── aid.h
│ ├── aid_tracker.h
│ ├── contribution_tracker.h
│ └── random.h
├── config.h
├── hooks.h
├── query
│ ├── context.h
│ ├── node_helpers.h
│ ├── oid_cache.h
│ ├── rewrite.h
│ └── validation.h
└── utils.h
Not happy with aggregation
but that was the first thing which came to mind.
Thoughts?
Maybe with a hook like https://github.com/AmatanHead/psql-hooks/blob/master/Detailed.md#ExplainOneQuery_hook
First we have to determine whether EXPLAIN is a security risk
We leave int4 and in8 as they are. For strings we use their hash as an AID.
We can have the first parameter of aggregates as ANY
, then figure out their type during state initialization.
This will greatly simplify the aggregates.
We need to move this repository before making it public
Currently aggregates start from aid_seed=0
, meaning the same set of AIDs will produce the same RNG, even for different types of aggregates.
Should each of lcf
, count(*)
, count(col)
, count(distinct aid)
have a different starting seed so that the final hash and therefore the RNG is different?
If not, we can check process_shared_preload_libraries_in_progress
and fail early.
Maybe the following is more correct:
We can register custom plan nodes with RegisterExtensibleNodeMethods
[1] [2]
We can register custom scan methods with RegisterCustomScanMethods
[1]
We can watch for reloid invalidations with CacheRegisterRelcacheCallback
We can manage extension assets in a separate schema [1]
We can check process_shared_preload_libraries_in_progress
during init.
We probably need to check IsBinaryUpgrade
& RecoveryInProgress()
during init.
We can create our cache memory context with AllocSetContextCreate
/ MemoryContextResetAndDeleteChildren
PGXN, the PostgreSQL Extension network, is a central distribution system for open-source PostgreSQL extension libraries. https://pgxn.org/
How to scan a table [1] [2] [3]
Tuplestore allows caching of tuples [1]
#if (PG_VERSION_NUM < 120000)
#define table_open(r, l) heap_open(r, l)
#define table_close(r, l) heap_close(r, l)
#endif
verify_anonymization_requirements
in validation.c
must be extended to check for currently unsupported aggregates such as SUM
, MAX
, etc. I think whitelisting is better than blacklisting.
To know which aggregates to allow we can check OidCache.postgres
.
Header catalog/pg_type_d.h
contains OIDs for basic types. We use this to identify types such as any
, int4
, text
, and so on.
If this is incompatible across versions then we need to dynamically fetch these from pg_type
.
This ensures buckets pass LCF. Aggregate OIDs can be looked up in OidCache
found in "pg_diffix/oid_cache.h"
.
What if we have a global aggregate? This shouldn't be suppressed.
With grouping sets this can get a little complicated.
@edongashi, how do you get this extension installed for testing?
What would be super neat might maybe be to have a Dockerfile in which the latest version is compiled and installed?
Would be nice for quick testing for someone who doesn't have the full toolchain installed locally.
If we cache config table OIDs we need to CacheRegisterRelcacheCallback
We're using C/C++ extension formatter, thus we can remove traces of pgindent.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.