dp-3t / documents Goto Github PK

View Code? Open in Web Editor NEW

2.2K 2.2K 180.0 261.76 MB

Decentralized Privacy-Preserving Proximity Tracing -- Documents

Shell 100.00%

documents's People

Contributors

Stargazers

Watchers

Forkers

aglime beastneedsmoretorque ppfernandez idkwim dirkx freekdeman vpnetconsult namzxc adityasivaraj fo0nikens jmuatantum yongmi282 z4ck404 goofy-mdn mediaeater ariadnimac ncase darongmean mikbry drazioti loxadim pab10v roastedmonk jai bjornw benedekh kwoot maaike binglekruger dar-scott mtwestra mediabuff ayoubhaggui nordarwin rishistyping wecacuee mmjac10 altoplano mdrenger hhy5277 antoniorossi matekatona dixonsiu yoshikazukashiwazaki jrpinteno janbussieck shimono therealpecus ezzeldinadel fiftytwodegreesnorth marado bezouwenr chmiiller miklobit ruymanreyes viaskyse dl5rcw sebastianhaas josephfrazier sylirisk bartclone tristancacqueray perry-amsterdam 00gh ronhendriks mandoggeli lparth yilinjuang sfrias valiodotch wf227 sakurann wang7203311 saferplaces jikk moneytech massimiliano76 annp1987 kraw dmnized shaunabanana jackom83 gravitytrope murmuras-tech aizonic javierleon leonschmidlin mniehus cpeeyush anogu01 m5b altermobili alpsayin jonasbardino ac-ctrl xmrio helohe ryanbnl tdiam otmanemajdane

documents's Issues

Data-protection-relevant actor missing: Mobile Network Provider

The list of
"five main data-protection-relevant actors in this system:
1/ users
2/ health authorities
3/ backend server
4/ epidemiological research projects
5/ mobile phone operating system providers"

is missing an important actor in this system,
the mobile network provider.

Mobile network provider data are - apart from the mobile phone itself -
the richest source of identity, network activity and location data,
and frequently used, in both legal and illegal ways, to correlate datasets,
e.g. by entities that the operator is obliged to share data with.

(see e.g. recent legal/tech case in Denmark, Teledatasag - need to find english source ..)

A sub-issue arising from this seems to be the fact that
specific network activity is triggered by infection events.
Such network activity will be relatively easy to detect and personalize.

While encryption can provide protection against full eavesdropping,
the fact that the very occurrence - not the content - of network activity implies an event
makes the event trackable.
(This is an identified attack vector against smart meters and IoT units.)

(Service providers are mentioned in the White Paper, under "Eavesdropping" -
which is not what this issue is about.)

Some questions about the design

When reading the paper, I had two additional questions that I think might be important to address:

What are the number of false positives and false negatives that can be expected with this technology?
Why is dummy data sent to the epidemiologists, but not to the central system?

security and data protection-relevant actor missing: Proximity Tracking industry

The Proxbook report on "Proximity marketing in airports and transportation" states the following concerning Munich airport:

The installed hardware includes 120 “bluloc Big Bays”. These “power beacons” [..] make the transmission of signals possible for distances up to 100 m. [..] The beacons also enable to“sniff” WLAN and Bluetooth-data within the reception radius. The data that can be collected includes timestamps for entering and leaving a defined zone. As a result, it is possible to measure visitor streams without the visitors using an app

This is of course just one example of existing deployments of vast arrays of Bluetooth sensors alongside other means of collection of more direct identifiers (security cameras, tickets, etc). Another example could be the deployment of "smart" advertising panels in the subway, or more exotic deployment of roving Bluetooth sensors in taxicabs in a city full of security cameras. There are very many possibilities according this New York Times report and no good reason to expect the situation is very different in Europe (see for instance the company Fluxloop and its high profile European clients, such as advertising panels JCDecaux).

These are obviously relevant as security actors, but they are also relevant to the data protection assessment because there are vast and already existing deployments of Bluetooth passive antennas.

Bluetooth iOS implementation challenges

I absolutely love this design. I just wanted to bring up a couple implementation challenges that this might face on iOS, specifically with regard to Bluetooth Low Energy (BLE).

Apple places limitations on how frequently an iOS app can send out BLE signals while it is in the background. In general, the iOS operating system is very stingy about allowing apps to perform activity in the background. If an app tries to do too much, the system will allow it to wake up less and will eventually cut it off entirely. With Bluetooth, apps can mitigate this somewhat by decreasing how often they send signals out, but as Apple says in their documentation, it can't run forever. To make sure that the application is still sending out the Bluetooth signal, designers may want to ask users to open the application before they leave the house. This would also ensure that the app has loaded more EphIDs and does not put it at risk from being punished by the operating system for doing too much in the background.

However, there may be a more fundamental challenge to this design on iOS. According to Apple documentation, it looks like an iOS app that is sending a Bluetooth signal while in the background cannot connect to an iOS app that is listening for a Bluetooth signal while also i the background. I have not tested this myself but this StackOverflow answer suggests the same thing. Apps scanning in the background can only scan for specific UUIDs (here the UUID would represent the app itself, not the EphID) and apps sending BLE data in the background can only send their UUID in what Apple calls a "special 'overflow' area" that I am fairly sure would not be recognizable to the device as a UUID.

Again, I have not been able to test this myself yet so I am not totally sure this is correct. I would love to hear from other iOS developers on this issue, or even better, see what code you all have coming down the pipeline. Outside of GitHub, you can find me at @GabeNicholas.

Asymmetric signing of EphIDs to mitigate fake contacts risk

It looks like the EphIDs are broadcasted in plain-text, while the "secret key" is just used as a seed for generation.
EphID spoofing is fairly easy in this scenario, and this could potentially lead to targeted attacks where fake contacts between an individual and known at-risk infected individuals can be simulated.

To mitigate this, each user could be assigned a private key, and broadcasted data could contain the EphID, the timestamp and a signature which can be verified with his public key.
The timestamp can be synchronized by the app itself with good tolerance of eventual device-specific offsets.

This packet signature could be an easy mitigation for fake contact attacks (and possibly lead to other good side effects), though they would still be theorically possible by instantanous recording and remote replication of broadcasted packets.

Fix claims on the PEPP-PT website to match what is announced here

One of the authors of this set of papers has publicly referred to these documents as relevant to PEPP-PT.

However the PEPP-PT website states (among other things):

Our privacy core: anything we provide is based on voluntary participation, provides anonymity, does not use personal data nor geolocation information, operates in full compliance with GDPR, and has been certified and tested by security professionals.

This does not quite square with the assessment made in the documents published today. For instance the Breyer discussion is - even without making changes due to #9 or other changes suggested by others - not quite as categorical in claiming that the centralizing entity "does not use personal data" and therefore that it magically "operates in full compliance with GDPR".

This protocol will encourage data sharing between governments at a time when:

some European governments are diverging away from their commitments to human rights, and
some are deploying very creative approaches to weakening GDPR by somewhat weaponizing precisely the jurisprudence out of Breyer (see my testimony in the UK Parliament on the topic, at the end)

In addition, as acknowledged by the authors, this protocol could be repurposed for vastly different means down the line, and overconfident "no personal data" claims could give a techno/legal mandate to circumvent usual GDPR protections.

As a consequence, the legal language has to be extremely sharp in all communications, already now.

Provide PDF source

While this repository/project asks for feedback, it provides the documents in PDF format, which does not facilitate review or commentary. Please consider providing the documents in either other format, or with the source.

Misscalculation of storage and download amount?

The decentralized design scales very well. For each infected user, the backend needs to
store a 32 bytes key for the duration of the infectious window. Storage cost at the backend is
therefore not a problem. Throughout the day, smartphones download the 32 byte keys of
newly diagnosed patients. This data is static, and can therefore be effectively served through
a content delivery network.

I don't understand why you only need 32 bytes to store the entire time period, in the Infectious period of a person. Because you generate a unique SK_t every day and you have to upload SK_t keys to the server for about 14 days (or rather starting with the first infectious day) if the person tests positive. You have to store 448 bytes, which is not a big deal, but it's more than 32 bytes.

Smartphones download a small amount of data every day. For 40.000 new infections per
day, smartphones download 1.25 MB each day. They require a few seconds of computation
time to regenerate the ephemeral keys EphID , and to check if they are included in the local
list of observed EphIDs .

In your White Paper you claim that 1.25MB are the downloaded amount of each day assuming you have 40000 new infections per day. But if you upload 32 byte keys / per infected person / per day and you report the last 14 days you end up with 17.92MB to download per day and not 1.25MB.
1.25 MB is for one day of 40.000 infections not for the total amount of day they are infectious!

voluntary data release

I understand that there are risks to data release of this type, but it could be dramatically more useful, and simpler, if people who test positive volunteer their recent location history. This could be based on augmented GPS, such as from Google or other service providers, or manually recounted in an approximate way.

This data source would be very complete and technically simple. This is the kind of model that is being used in South Korea, although there it is definitely not voluntary. The location histories are broadcast by SMS to all the residents of an entire municipality.

Thus far, BT tracking has not been successful even in places where its uptake is encouraged like Singapore. It's important to consider promoting other options. These could rely on goodwill and community support which has been so essential to our current response (basically voluntary lockdown and isolation of much of the EU).

Easy deanonymization of infected individuals

I think the scenario description of how tech-savvy people can identify infected individuals is too convoluted, and the actual operation can be much easier and effective.
It is assumed the malicious user should keep a detailed log of people he meets, possibly register multiple accounts, modify the app, but nothing of this is actually necessary.

A malicious user just needs to implement a georeferenced EphID tracing app, and then go all around all the houses at a time when people are most probably home (9-12 in the evening). With proper evaluation of RSSI values he can easily get a good estimate of where each EphID lives.

In a small community, a single user could effectively identify all infected individuals families.

Furthermore, it would be extremly easy to build a collaborative database of georeferenced EphIDs observations, and even easier to build a database of infected individuals EphIDs (as these are basically public).

I don't get how the coarse time-framing inside the app and the random EphID usage order can mitigate this in any way, as the document seems to suggest.

UN Declaration of Human Rights Article 27 concerns around wording and conflation of roles

Please revise the language used throughout the documents, avoiding the use of the word "epidemiologist" as a stand-in for the conflation of several roles:

the actor centralizing the data;
the actor building analysis tools, pipelines, protocols, etc;
the actor running those tools.

I suggest instead avoiding this conflation and using more neutral language reflecting the technical roles assumed by the different entities.

Your protocol seeks to limit the amount of data centralized in order - as stated - to "enable epidemiologists to improve their recommendations to policy makers and health authorities". This does not require that "the epidemiologists obtain an anonymized proximity graph with minimal information", but rather that some data is centralized and epidemiologists are able to push their computation to the centralized data. This enables - with no change to your protocol, but certainly some alignment on downstream analysis tooling - scenarios of deployment that might seem desirable from a security standpoint (such as at-risk-community-based computation of parameters of epidemiological relevance) or even also from a public health standpoint (since it would increase the quality of the collected data, given that a community member might have less to fear from some deployments than others, and would feel more engaged in the designing of policies to get out of the crisis).

It is a natural question to wonder which entities could then assume the role of centralizing this data. One possibility (to be adapted for civil law systems) would be bottom up data trusts, for instance, but this is clearly a separate question. In any case it would seem counterproductive to foreclose on our capacity to imagine creative solutions together due simply to poor wording.

Additionally, I would encourage anyone who thinks the word change that I am suggesting to be exotic to read a letter I recently (pre-COVID!) co-wrote in response to a Call for Contributions by the Committee on Economic, Social and Cultural Rights of the Office of the High Commissioner for Human Rights on a Draft General Comment on Science (i.e. tied to UN Declaration of Human Rights Article 27 on the right to share/participer/participar/участвовать in science):

We, as citizens, think this distinction between active participation and mere passive enjoyment of scientific advancement is crucial to our full flourishing as autonomous individuals, capable or even sometimes authorized to approach complex problems with a systematic mindset.

We each dedicate significant effort in methodologically understanding our world, our communities, our selves, and/or the interactions between these. Many of us also develop new ways for doing so, to enable active participation by others. All of us contribute to blurring the distinction between a citizen engaged in systematic discovery and a scientist employed in a traditional research institution.

We each recognize we benefit from communities and networks of peers engaging in a similar process of discovery, as well as proximate rights of access to information enabling us to learn from and spread new discoveries. For many problems the perspective of nontraditional researchers (e.g. patients and communities) is essential not only in order to fully understand our social or ecological environments, our bodies, or the practicality of solutions derived from formalized science, but also in redefining the process of science for instance in relation to diverse representation, ethics or data collection.

It would have been hard to be more prescient (while blind to COVID at the time), given that epidemiologists are now engaging in some of the most massive crowdsourcing effort on symptoms, household composition and lifestyle patterns.

Courtesy notification

Thank you for submitting this publicly.

I have been or have attempted to be in contact with some of the authors of the privacy framework presented here over the last 14 days (whom I highly respect, but didn't necessarily know were authors of this framework).

Given that up until two hours ago:

the importance of those matters,
the announced timeline for rollout of code and apps,
the content displayed on the site and the communication made available by the PEPP-PT project (for instance the lack of individual accountability behind the chosen privacy framework),

I felt it was important to signal to the privacy community potential issues around the protocol, as I could piece it together from public claims, the website, conversations with some members of the PEPP-PT team, and others.

For this reason, I have given a talk a mere hour before you published this framework to an online seminar of the OpenRightsGroup on the failures of PEPP-PT as far as I could tell from where I was. My goal was to be able to act as an exterior grain of salt in nudging what must be complex internal politics in the right direction, at the risk of making a fool of myself.

The video of my talk should hit Youtube within a couple days. I am therefore leaving a comment here as a courtesy, so you can update those who might see my video and then ask you what you make of my presentation.

I see and welcome that you are mentioning the issue of the Breyer jurisprudence in your assessment (~~as I mentioned to some of you was relevant, again not knowing you were involved~~). I am looking forward to reading the exact details, which will matter of course, and that I have not been able myself to reflect on.

~~In the unlikely event that my private comments have led to an improvement to the paper, I would appreciate mention of this in your documents.~~ (EDIT 2020.04.06 here and above. For more, see below)

In the more likely event that you decide to change the wording of some claims on your website in light of the documents presented here, I would suggest for transparency tracking the exact changes made in the content of the site.

aggregated mobility data

A differentially private mobility data set could be used by epidemiologists to help organize responses to ongoing outbreaks or flareups during long-term suppression of the pandemic.

How would this fit into the proposed plan?

It could be collected in parallel, using data directly provided by telcos. In most implementations, a trusted party is needed to generate the DP model. The resulting data could be made public without risk to individuals, and without risk to groups below a certain size, depending on how it is built.

Targeted hacking can detect 2 persons met each other

If a hacker target 2 mobiles, and find the same ephid (of a 3rd persons) in the same period, he could deduce that the 2 persons probably met each other (with a third person).

This could be partially mitigated if, instead of publishing one ephid at a time, the device would publish n IDs, and the receiver only keep one of them randomly selected.

(Not sure if it is possible with BLE)

Radio distance is not spatial distance

Bluetooth distance is radio signal distance -
not identical or even directly relatable to spatial distance
(which in turn is not identical or directly relatable to infection distance).

Bluetooth signals are not suitable for distance measurement.

Consequently, BLE beacons and such do not output distances
(other then rough classifications, "near", "medium", "far").

This fundamental fact has implications on many levels, e.g.

1/ creation of false positives and negatives
-- "two people hug over a metal fence" - negative
-- "two people close, phones in well filled backpacks" - negative
-- "two people distant to each other in front of strong reflection surface" - positive
-- "two people really close but protected by plexi glass shield" (standard shop situation) - positive

2/ error source in areas of high beacon density

3/ opening of attack surfaces
-- assuming prank or hostile motivation, easy to "infection trigger" large groups of people by simple use of strong antennas or super beacons

While this is a well known fact and generally acceptable in systems merely interested in statistics
(e.g. airport queues, shopping center heat maps),
this poses a different challenge when the system seeks to identify and classify individual events.
The "just good enough" of airports and supermarkets is not "good enough" here.
At the very least, the occurrence of false events and its implications needs to be modeled.

(Note: despite Github's note "Similar to existing issues", it is not.)

Alerting users on contact event

Maybe the users can receive an instant notification (e.g a vibration or auditory note) when they enter the personal zone of each other in a similar manner to that of car parking sensors.

The benefit of this feature is:

Sustain the culture of social distancing for long time and keep everyone conscious of it when the lockdown boundaries are gradually removed.
it will enable organizations to impose social distancing as part of workplace safety regulations, resulting in more companies and governmental vital sectors able to protect their workforce and to sustain their businesses, services respectively.
give the user an instant value of downloading the app, which in one way or another increases the popularity of the application.

This addition completely depend on proximity and is a side-effect of using it to log contacts that the application is doing already so I don't expect it to be an extra effort, except for Front-end component of the system to make it optional.

you can find an extensive description of the proposal in the enclosed Narrative Document.

Infection reveals lots, so consider using cuckoo filters

If I understand, an infected user reveals the linkage between all their ephids by revealing their sk_t. It's quite efficient solution, but revealing this linkage might discourage adoption and/or discourage disclosure. I'd think too few individuals worry about their privacy enough for this to be a serious problem.

There is however some risk that individuals might observe and publish ephids along with location information, which ties published sk_t to real movements. If done, this could harm adoption or increases disclosure refusals more than linkability concerns.

We could consider doing the hashing inside some trusted enclave, except iOS lacks this. If iOS proves unworkable for other reasons like #7 then this sounds more plausible.

Instead, I'd suggest merely giving user control over pausing and switching between sk chains whenever they like. I doubt devices could manage a few independent sk chains automatically, but appl forks could attempt to do this too.

IP issue

On page 9 of the data protection and security paper, you say that the backend server doesn't store information about IP address or time. But it means that users have to trust the implementation of the backend server and their administrator. We can mitigate this risk by using onion routing to transmit information to the backend server.
Another point is that you don't say how the secret seeds are secured on the smartphone. Do you consider encrypting it and how?

Some thoughts on possible use cases and user groups

What is the effect of not knowing the identity of the people that are infected or that are at risk? I understand completely why identity of app users is not known in the design, and I agree with it. However, we need to know the possible negative consequences of that as well in order to mitigate them. I am thinking about people not taking the warning that they are at risk serious, for example. Or, on the other hand, people becoming very scared when they hear that they are at risk. What about people that have cognitive issues leaving them unable to understand the warning? What are ways to get help to those people if we are not sure who they are?

A related issue is that we do not know who does not have the app. You mention that some extremely cautious people might not install the app. However, this is not the only group I would expect not the install the app, and I would expect the people not installing the app not to be distributed evenly throughout the population. For example, a lot of the elderly do not know how to install and use apps. Some might not regularly use a mobile phone. It is to be expected that in general people tend to interact more with people similar to them. This means that 'pockets' could exist of people not using the app where the infection can go around easily. Especially when it concerns the elderly, this might have serious consequences. To what extent will these things be a problem in practice? Would it be somehow possible to detect the 'edges' of these groups not using the app to make intervention in the groups possible when they get infected (and should we do this)?

Another question is whether people who have the app will have a false sense of safety and therefore will be taking risks. Would this be a problem and could be do something about that?

In addition, people could also claim that they have been warned because they want to get tested and tests are scarce, for example. Would it be possible to check whether someone has truly been at risk, or is it easy to fake this?

It also could be interesting to base the risk score not only on one contact, but on multiple contacts with the same or different infected persons. Also, for someone who has been in contact with more people or is in regular contact with many others (e.g., someone working at a supermarket), it might be useful to get tested with lower risk scores than people who hardly have any contact with others. Especially if tests are scarce. However, a medical specialist should probably decide whether this is useful.

Assess existing software and systems

At some point it will be nice to assess the main apps among those which mushroomed in this period, against the best practices as defined in the documents. NOYB has a nice wiki page with a growing list:
https://gdprhub.eu/index.php?title=Projects_using_personal_data_to_combat_SARS-CoV-2

Typos, general writing and grammar

White Paper:

p10 could use a couple of short sentences of introduction. It may be well known in the community what exactly goes into a threat model section, but the White Paper will probably be getting quite a bit of attention even beyond the tighter security community, and it might be good to explain that this section lists the capabilities various potential adversaries are assumed to have. Without such an introduction some parts of p10 may read like they describe weaknesses of the currently proposed system. E.g.:

(Network adversary) Can use observed network traffic to determine the state of a user (e.g., whether they are at-risk, infected, etc.)

might be read as saying that for the current model a network adversary can actually discover at-risk or infected status of a user. As I understand the system the described, blinding should make it impossible for a network adversary to determine at-risk status. Similar blinding could easily be added so that infection reports are not detectable by a network adversary.

p10:

Can observer network communication (i.e., source and destination of packages, payload, time) and/or Bluetooth BLE broadcast messages.

The health authority learns information about at-risk people only when these at-risk people themselves reach out to the health authority (e.g., after receiving a notification from their app).

p13:

This attack is inherent to any proximity-based ~~system~~ notification system, as the adversary only uses the fact that they are notified together with additional information gathered by their phone or other means.

p16:

The latter can only be changed by (1) being infected with SARS-Cov-2, and then (2) reporting somebody else’s key $SK_t$ so that that key is treated as infected.

Here it might be useful to add that $SK_t$ is only available to the attacker if the owner of the key has provided it to them. (Or it has already been published, but then there is nothing to be gained from republishing.)

Data Protection and Security:

p2:

From this data, the identity of the patient cannot be derived by the server or by the apps of other users (see below), it is nearly anonymous. Before this point, no data other than the broadcast EphIDs leaves the phone.

I think the phrase "it is nearly anonymous" does more harm than good. It makes you wonder, well, what now, is it anonymous or not?

p3:

Figure 1; Normal Operation: B should not be recording its own EBID

p4:

hard to parse sentence:

The theoretical potential of this attack is the tradeoff to obtain technical guarantees that prevent function creep and ensure limitation by design.

p5:

also hard to understand for me:

As a result, the fact that the sensitive information, including health information, has equivalent protection to genuinely anonymous data, means that it is protected from all actors by among the most technically stringent safeguards possible in a system with the functions necessary for this purpose.

Other deployment scenarios and stigmatization concerns

In the threat model, the Tech-Savvy user is described as "Blackhat/Whitehat hackers, NGOs, Academic researchers, etc". It might be worth adding explicitly three roles there: journalist, public health authority and epidemiologist.

Indeed it might also be useful for epidemiological research or public understanding to deploy sensors in one location in order to count the number of infections signaled to public authorities transiting through that location, without the need to obtain consent from the individuals (see Art 9.2.i). This might be particularly useful when done along a highway or on a train, for instance.

Note that this deployment scenario introduces new concerns around stigmatization ("this neighborhood is full of infected cases"), but I am not sure how the GDPR appreciation of this would work, as it would amount to an individual encouragement to install an app that would potentially lead to collective effects.

Using a blockchain for backend servers

I think that for backend servers you could use a smart contacts in a blockchain with ZKP (Zero known proof)

The app would be a blockchain wallet that makes transactions with other apps, and in the transaction the app knows if the other user is infected.

Is 14 days enought?

If I don't have symptoms, I may contaminate people. When the people that I contaminated will be sick, tested and will finally report their sickness. At that time, I will be detected as being at risk, I may be tested positive (or have the anticorp)

In that case, wouldn't it be very helpful to have a longer history so that the other persons that I may have contaminated earlier (that are maybe also asymptotic) can be informed that they are also at risk and should be tested?

An other reason to have longer period is if I'm still contagious longer than 14 days.

Data protection consequence of missing security actor

In #43 I give many examples of deployments of vast meshes of passive Bluetooth antennas, providing easier means of re-identification than relayed in the security analysis.

The deployment of those systems should encourage a more careful assessment around the Breyer test. On page 8 of the Overview of Data Protection and Security, you state:

To underscore the data protective nature of these measures, it is worth noting that the re-identification test set out by the CJEU in Breyer ( C-582/14) as necessary to classify this as personal data would not be met. Firstly, establishing an effective side-database would likely require breaking the law by surveilling individuals without an effective lawful basis (e.g. illegitimately using covert cameras directed outward from the person, see Ryneš (C-212/13)). In Breyer, the Court noted that the test of means reasonably likely to be used to identify a natural person would not be met ‘if the identification of the data subject was prohibited by law’. Furthermore, it is also arguable that these specialised attacks would require ‘a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant’ (Breyer) . However, as discussed, we suggest ensuring the obligations applying to personal data are still applied as good practice.

In light of the BLE deployments of #43, the threat you mention (using covert cameras) is reductive of the full threat landscape, which in its fuller extent actually nullifies the first test of Breyer: these databases already do exist, with a legal basis that is actually considered legitimate by many. In addition, these databases reduce the "efforts in terms of time, cost and man-power" so much that it no longer is true that the "risk of identification appears in reality to be insignificant" (in fact, as described above, there are commercial services performing this task). As for the prohibition by law in the Breyer test, it is a very very very thin line to rely on in the current circumstances, and certainly warrants a lot more detailed discussion in scenarios where the threat comes from state actors fetching additional data from private actors to facilitate reidentification.

It seems ill advised to rely on a gap in jurisprudence for such high stakes protocol and not be more forceful in asserting that this data would indeed consist of personal data in some deployment scenarios.

Non sequential storage of ephid

It is maybe obvious, but I don't see it mentioned.

The local storage for the collected ephid should not be sequential or use a structure from which you can derive the sequence of insertion.

That would reduce the possibility to drill down the time of infection to a finer level.

Why does an infected patient pick a new completely random key?

The DP-3T white paper mentions that «After reporting their current SK_t , the smartphone of the infected patient picks a new completely random key.» (bottom of page 7).

By doing so, do you accept that infected patients to take a new identity, without being advertised as infected?

I understand that, upon a positive infection diagnosis, the patient can share her/this secret key (SK) with the server to notify at-rist citizens. But, what happen to the infected patient and her/his app from there?

Some of them will have to stay at the hospital, thus having controller interactions with the rest of the world, but others can also stay confined at home, depending on their symptoms, if benign. The latter might therefore remain indirectly in contact with other citizens, even if the confinement procedure aims at reducing the risk.

By automatically regenerating a new SK, this means that, after reporting the current SK_t, the infected patient cannot be tracked, but she/he won't be detected by others before eventually uploading the new SK at some point in the future.

I was therefore rather thinking of keeping the same SK as long as the patient remains infected, by replacing the SK_(t-1) with the current SK_t by the end of the day, to keep detected potential at-risk new cases, and rather introducing an official procedure that allows, through an authorization given by the practicians (e.g., a qr-code), the patient to generate a new secret key after being officially claimed as recovered from COVID-19, and therefore not presenting any danger for others. This can also be used to remove the former SK from the server after some days.

Interplay between security and data protection assessments

The way this protocol is rolled out is an interesting kind of optimization within constraints:

decentralize as much as possible (security);
while ensuring no personal data is given to some actors (data protection)

It is therefore composed of two assessments: a data protection assessment on top of a security assessment.

However the paper fails to convey well how dynamic the data protection assessment should be, compared to the security assessment.

For instance, if some of the following conditions are satisfied:

one theoretical attack is demonstrated, or
some function creep does start occurring (hinted at in #12)
some new commercial actors come in market (see #9),
new deployment scenarios are considered.

the entire data protection assessment would immediately reach a different conclusion through the Breyer test.

[Improvement] Getting rid of push (FCM/APN)

In the current draft during the process of contact tracing (4. in the DP&S doc) new EphIDs of infected ppl would be pushed from backend to clients regularly, this however is the only use for push and therefore the only reason to involve Google's FCM / Apple's APN, right?

Why not rather use a regular diff-based direct pull mechanism and therefore cutting out the third actors (especially those actors)? Since only static EphIDs are being used, load issues should be solvable. In addition: The app could be FOSS and truly libre software. :)

Any comments?

Use of TEEs

The TEE mitigation (https://github.com/DP-3T/documents/blob/master/DP3T%20White%20Paper.pdf, 6ac1884, p.15) is not really motivated. TEEs would only make sense if used in a way where the backend would only communicated SKt to a user iff their app is attested to run in a TEE. However, this would break other properties of your approach as the attestation step would uniquely bind an app instance to a specific mobile device.

The White Paper seems to imply that TEEs offer merely an additional level of isolation. Whether this is true depends on the deployment scenario. In the context of TrustZone on mobile phones, for example, the TEE may be shared. The means to verify that part of an app actually executes under the TEE's protection primitives, is through attestation. Attestation effectively breaks anonymity, unless there's a trusted third party involved.

My intuition is that Trusted Computing can probably help to harden your system against a number of strong adversaries, but it won't be by means of some quick of-the-shelf "let's stick it into an enclave" approach.

Anonymising communication metadata when communicating with the backend server.

I understand that in the "Data Protection and Security" document you have mentioned that:

Network observers only observe encrypted data between the app and the backend. This
data is the same for every app installation. Therefore network observers can conclude only
that somebody just installed the app .

Because there are a lot of information that can be inferred by analysing the metadata of the communications between apps and backend, I was wondering if you would be interested in considering that users should be anonymous with regard to the backend. The property of anonymity itself is more than just providing an encrypted connection between the source and the destination of a given conversation.
Personally, I believe that this requirement would protect the app users that would prefer not to trust the backend (i.e. those fearing possible discrimination).

At the moment in fact you mention that the App would open a TLS connection between the backend and the server, but this connection could be correlated with other information by ISPs or other entities.

Some example of information that could be correlated are:

location - where the app is activated
IP addresses - where do the first calls come from

There are possible mitigations that can be considered:

Activation of the app happens at a later stage from when the user install the apps and gives consent.
The connection between the app and the backend can pass through a series of proxy servers (other users' apps). Alternatively more secure network protocols could be considered (ex: Tor).

Personally I would be interested in expanding/collaborating on these points, especially with regards to preserving anonymity of users in relation to the backend and other passive eavesdropper.

Add recomendations about open-source or not

Althought it is maybe not a technical aspect (and maybe out of scope of your design), the way the app and the server application is developed might have an impact on the adoption.

It could be usefull to have an expert advice on whether the application should be developed in opensource.

I guess some people will trust it more if it is open sources, and other would trust it less. I don't know on which side will be the majority.

Use forkeable document format, not PDF

PDF is not very cool on GitHub.

Limit fake contact event attacks by using shared hashed "handshake"

To limit fake contact event attacks we can rely on shared information between the two devices Instead of storing EphID.

We can do it like that:

Generate a random id RiD1 every X hour
Broadcast this RiD1
When detecting another device RiD2 store an EventHash = CHF(Sort(RiD2,RiD1), date(Year,Month,Day). Where CHF is a cryptographic hash function such as SHA2 and Sort just sort the _RiDs so they are in a predictable order. We also add the current day so we create one event by day.
The other devices generate and store the same EventHash

In case of infection

Publish all recent EventHash
All devices check if they have this EventHash stored

Security problem
Someone can get my RiDx and broadcast it. But it will have to do that physically at ~2m of each person.

How to mitigate the risk of gamification

Is there a risk that stupid people start to play with the app making, trying to be the first one reported at risk?

(I have no knowledge in sociology, to say if it is a real risk nor how to mitigate. Maybe using a boring interface and avoid visible ranking).

Add a license

This repository is currently proprietary, so no collaboration is possible. Please add a free license statement. The PDFs contain a "CC-BY 4.0" note so it's easier to align to that.

Clarification of the relationship between DP-3T and PEPP-PT

One of the author of the paper has posted this clarification:

https://twitter.com/mikarv/status/1246493793886580739?s=20

On PEPP-PT: this is a submission to the PEPP-PT consortium for consideration around how to make a decentralised set-up that scales, involving many of the same institutions and people. It is published to start discussion, for transparency.

The authors might want to temporarily add this context to the documents or (better) the README here, to reduce confusion.

Discard GPS and use serving mobile network MCC

In order to get the country location, use serving mobile network MCC information instead of using GPS. To do that the handset only needs to be connected to the mobile network.

You will save battery and remove privacy concerns about geolocation.

Probability of success and achievability missing as key criteria for validity of proposal

First of all great work. Very European way of tackling this problem. 👍

A core thing I missed in the white paper is the success probability of this project setup. A naive implementation (track all users, store all data centrally and allow scientists to operate on data) has resilience towards design mistakes. All data is still available. Analysis such as tracking gaps, errors etc can be done retroactively.

In this design, the developers and operators of this system have very little insight into the effectiveness of the system. (Keyword Metrics). What if the apps happen to be terminated by the OS and stop recording IDs for a few hours? These interactions would simply be lost. No one would know. What if regional differences in culture and mentality lead to increased downloads / refusal to do so?

In short, while in a perfect world, this setup should allow individuals to be informed about their risk, it keeps authorities and operators completely blind. How is a country supposed to decide on policies if they don't know how many risky interactions have been recorded?

What is therefore missing in this proposal are reasonable metrics. While the privacy and security of the citizens is of importance, so too is the ability as an operating institution to confidently claim that the system is working or not. Without insights into the systems operations, this is hardly possible.

Recommended metrics (to be expanded)

number of IDs recorded / day
metrics about app FCs or app not running in background for periods of the day
offering users to share extended data that may identify them to support developers in evaluating system operations

Expectable error scenarios

2 users should have recorded each other's IDs (because it is know that they have been in proximity) but the phones did not record this. Why?
Phone OS or scheduler may terminate process due to low memory. Can we detect this and mitigate it through software changes?
Subregion of country sees low number of risky interactions. Why? Low infection rate? Low interaction rate? Low installation base? Thick walls?

Why not using direct public key exchange?

I don't understand the advantage of generating ephIDs instead of directly broadcasting public key (and rotate those key) like https://github.com/AndreasGassmann/WeTrace

Isn't 128bites key enough as we just want messages to be secure for a short period.

Using key exchange you open a communication channel and can send a more complex message.

Involve Google -> Use existing Maps history functionality

I would find it much smarter if we could get Google involved.

Here's my idea:
1.) Google could push out a message to iPhone and Android users of Google Maps, asking them to enable their location history or add a special function that only stores the last 3 weeks and is not used for anything else but contact tracing
2.) Google could enable in the Account page a function where people can report that they got infected, and inform others who were in the same area at the same time, and that they might be at risk to be infected too

In my opinion Google could act as so called Trusted Third Party. It only has to be clear that European data does not leave the continent. The rest can be done through everybody's consent.

License: CC BY 4.0

Single encounter problem

If in given epoch one is sure they've meet only one person and later they find this person id in the public repository of infected ids - they can be sure that person was infected.
This is a privacy concern and workaround is not trivial... but I believe this is still possible, although documents states this is not possible for any proximity tracing mechanism.
I have two ideas to fix that:

We utilize collisions by design. Ids can collide and false-positives encounters can happen. This can happen id-generation time, or the published infected ids can be true ones + randomly generated ones (for possible collisions). This would also hide a true number of cases.
The user with single (or low number) risky encounter during an epoch remembers an encountered id and then may re-use it in later epoch. This way it is not trivial to know who exactly was infected. Still only persons at risk would be alarmed.

Second visit to hospital problem

Thanks a lot for releasing these documents.
I'm looking forward to the upload of the reference applications source code.

I have one remark regarding section 3, Handling infected patients, of the Data protection and Security document.
If I understood correctly, the following steps would happen:

Hospital calls patient to inform them of positive test.
Patient goes back to hospital with their phone.
Hospital generates an authorization code.
Hospital uploads this authorization code to backend.
Patient scans authorization QR code.
Patient sends (SK_t, authorization code) to backend.

This assumes that the patient will go to the hospital a second time after being tested positive.
In practice, this is rarelly the case. Patients receive the results by phone and should stay isolated and not go back to hospital unless necessary.

What I propose would be to simply change a bit the order of these steps:

Hospital generates an authorization code and attach it to the patient test.
Patient scans authorization QR code and keep it locally.
When test results arrive and only if positive:
Hospital uploads authorization code to backend.
Hospital calls patient to inform them of positive test.
Patient sends (SK_t, authorization code) to backend.

It would also be nice to have more information about the backend.
For instance, how does it identify / authenticate who can upload new authorization codes?
Do you plan to also release the source code for the backend?

Thanks and cheers!

Coordinate with NOYB

NOYB already published https://noyb.eu/en/data-protection-times-corona and https://noyb.eu/sites/default/files/2020-04/Ad%20hoc%20Paper_Corona%20Tracking_v0.2.pdf

user behaviour on notification of risk

If the computed risk score is above the provided threshold, the app shows a notification to the user that she/he has been in proximity to an infected patient. The notification contains instructions on what to do and whom to contact.
...
Hence, from the perspective of an outside observer, the phones of both at-risk persons, those who have been in contact with an infected person, and those who are not at risk, behave the same.

This depends on what is the expected user behaviour on notification of risk and if it's possible to observe contacts made as per the instructions. An attacker able to assert control over the notification service (e.g., by simply dropping traffic) can selectively notify users and observe their behaviour. Reasonable tweaks:

randomize when the risk score is computed locally instead of doing it immediately upon receiving infected EphIDs (less granularity for such as an attacker)
recommend that the notification instructions point to some commonly used means of contacting health services, such as through the regional/national healthcare portal, instead of a specific means for infected patients (less distinct network traffic to observe)

The randomization has the added benefit of distributing the load on the contact point.

Infected user tracking

https://github.com/DP-3T/documents/blob/master/DP3T%20-%20Simplified%20Three%20Page%20Brief.pdf, 6ac1884, p.3:

A tech-savvy adversary could reidentify identifiers from infected people that they have been physically close to in the past by i) actively modifying the app to record more specific identifier data and ii) collecting extra information about identities through additional means, such as a surveillance camera to record and identify the individuals. This would generally be illegal, would be spatially limited, and high effort.

A tech-savvy adversary deploying an antenna to eavesdrop on Bluetooth connections can learn which connections correspond to infected people, and then can estimate the percentage of infected people in a small radius of 50m.

https://github.com/DP-3T/documents/blob/master/DP3T%20-%20Data%20Protection%20and%20Security.pdf, 6ac1884, p. 8:

The only way to link a device across multiple broadcast identifiers is by using information held on that individual’s device. This would require access to that device, or obtaining recordings from the device.

There seems to be a contradiction between these two quotes.

You (kind of) clarify this in the White Paper when you define the abilities of the eavesdropper. But, given that even supermarkets deploy ultrasonic and WiFi tracking, this appears to be a rather unrealistic attacker model?

Or do you assume users' mobile devices to use MAC randomisation on all wireless interfaces, and have no other apps installed? Maybe what's really missing is a system model?

Radial antenna pattern

The antenna pattern on phones for Bluetooth is (mostly) radial which may pick up neighbours above/below in an appartment building. Perhaps mitigate with a combination of dwell time with a white list of home location?

Public registry of infected ids, why not the ones at risk instead?

Why the registry consists of infected ids and not ones that were encountered by infected person?
This is what makes Location traceability much easier in decentralized scenario.

Proof of at-risk status

When someone gets a notification of their at-risk status, should they be able to prove to a third party that they have been identified as at-risk? e.g. to get testing, time away from work, or priority for home deliveries of supplies?

If so, does the protocol allow that?

If not, what's the threat from someone who tries to do it anyway?

Circular reasoning around Breyer, both affecting trust

The Breyer case invites two types of circular reasoning eventually affecting trust:

For the purpose of increasing trust in the deployment of the system, new laws could be rushed through so as to curtail commercial offerings around Bluetooth tracking (see #43 and #9), at least for the duration of COVID. The opportunity is too great that I would kick myself not to mention it to the authors to slip into their paper.

For the purpose of decreasing trust in the deployment of the system, an actor might want to leverage the interplay between security and data protection (see #44) to actually carry out an attack and therefore force a shift in the data protection assessment (this would then most likely lead to political footballing around whether it would be legal to collect such data under the exact pretense used, delegitimize those deploying the system, etc). In current times, it is my belief that such a trust attack would actually be quite likely.

In other words, the very existence of this trust attack actually increases the chance that some re-identification attack is conducted, which should in turn - via Breyer again! - affect the data protection assessment in the first place.