
floc's Introduction

Replaced by the Topics API

Note that this proposal has been replaced by the Topics API.

Federated Learning of Cohorts (FLoC)

This is an explainer for a new way that browsers could enable interest-based advertising on the web, in which the companies who today observe the browsing behavior of individuals instead observe the behavior of a cohort of similar people.

Overview

The choice of what ads to show on a web page may typically be based on three broad categories of information:

  1. First-party and contextual information (e.g., "put this ad on web pages about motorcycles")
  2. General information about the interests of the person who is going to see the ad (e.g., “show this ad to Classical Music Lovers”)
  3. Specific previous actions the person has taken (e.g., "offer a discount on some shoes that you left in a shopping cart")

This document addresses category 2, ads targeting based on someone's general interests.
For personalized advertising in category 3, please check out the TURTLEDOVE proposal.

In today's web, people’s interests are typically inferred based on observing what sites or pages they visit, which relies on tracking techniques like third-party cookies or less-transparent mechanisms like device fingerprinting. It would be better for privacy if interest-based advertising could be accomplished without needing to collect a particular individual’s browsing history.

We plan to explore ways in which a browser can group together people with similar browsing habits, so that ad tech companies can observe the habits of large groups instead of the activity of individuals. Ad targeting could then be partly based on what group the person falls into.

Browsers would need a way to form clusters that are both useful and private: useful by grouping people with similar enough interests and producing labels suitable for machine learning, and private by forming clusters large enough that they don't reveal overly personal information, either when they are created or when they are used.

A FLoC cohort is a short name that is shared by a large number (thousands) of people, derived by the browser from its user’s browsing history. The browser updates the cohort over time as its user traverses the web. The value is made available to websites via a new JavaScript API:

// Ask the browser for the user's cohort, then pass it to an ad server.
const cohort = await document.interestCohort();
const url = new URL("https://ads.example/getCreative");
url.searchParams.append("cohort", cohort);
const creative = await fetch(url);

The browser uses machine learning algorithms to develop a cohort based on the sites that an individual visits. The algorithms might be based on the URLs of the visited sites, on the content of those pages, or other factors. The central idea is that these input features to the algorithm, including the web history, are kept local on the browser and are not uploaded elsewhere — the browser only exposes the generated cohort. The browser ensures that cohorts are well distributed, so that each represents thousands of people. The browser may further leverage other anonymization methods, such as differential privacy. The number of cohorts should be small, to reinforce that they cannot carry detailed information — short cohort names ("43A7") can help make that clear.

The meaning of a particular cohort should stay roughly consistent over time. As individual people's browsing behavior changes, their cohort will change too, but the algorithm that turns input features into cohort assignments should remain stable. If that cohort assignment algorithm does eventually need to change, then the migration to a new assignment algorithm will need to be clearly communicated by the API, so that consumers of the cohort signal are well informed of the need to update their usage. (See Issue #58 for more on this topic.)

Privacy and Security Considerations

There are several abuse scenarios this proposal must consider.

Revealing People’s Interests to the Web

This API democratizes access to some information about an individual’s general browsing history (and thus, general interests) to any site that opts into it. This is in contrast to today’s world, in which cookies or other tracking techniques may be used to collate someone’s browsing activity across many sites.

Sites that know a person’s PII (e.g., when people sign in using their email address) could record and reveal their cohort. This means that information about an individual's interests may eventually become public. This is not ideal, but still better than today’s situation in which PII can be joined to exact browsing history obtained via third-party cookies.

As such, there will be people for whom providing this information in exchange for funding the web ecosystem is an unacceptable trade-off. Whether the browser sends a real FLoC or a random one is user controllable.

Tracking people via their cohort

A cohort could be used as a user identifier. It may not have enough bits of information to individually identify someone, but in combination with other information (such as an IP address), it might. One design mitigation is to ensure cohort sizes are large enough that they are not useful for tracking. The Privacy Budget explainer points towards another relevant tool that FLoC could be constrained by.
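To make the bit-count concern concrete, here is a back-of-the-envelope sketch; the population and cohort-size numbers below are illustrative assumptions, not values from the proposal:

// Rough identifying power of a cohort label: log2(population / cohort size).
// All numbers here are illustrative assumptions.
const population = 3e9;    // assumed number of browsers
const cohortSize = 5000;   // assumed minimum cohort size
const cohortBits = Math.log2(population / cohortSize);
console.log(cohortBits.toFixed(1) + " bits"); // ~19.2 bits
// An IP address can contribute many more bits, so the combination may
// narrow down to very few users unless cohorts are kept large.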

Longitudinal Privacy

The expectation is that the user’s FLoC will be updated over time, so that it continues to have advertising utility. The privacy impacts of this need to be taken into consideration. For instance, multiple FLoC samples mean that more information about a user’s browsing history is revealed over time. Possible mitigations include not updating the FLoC on a site once it has been called (making it sticky), or reducing the rate of refresh.

Second, if cohorts can be used for tracking, then having more interest cohort samples for a user will make it easier to reidentify them on other sites that have observed the same sequence of cohorts for a user. Possible mitigations for this include designs in which cohorts are updated at different times for different sites, ensuring each site sees a different cohort while the semantic meaning of the cohort remains the same.

Sensitive Categories

A cohort might reveal sensitive information. As a first mitigation, the browser should remove sensitive categories from its data collection. But this does not mean sensitive information can’t be leaked. Some people are sensitive to categories that others are not, and there is no globally accepted notion of sensitive categories.

Cohorts could be evaluated for fairness by measuring and limiting their deviation from population-level demographics with respect to the prevalence of sensitive categories, to prevent their use as proxies for a sensitive category. However, this evaluation would require knowing how many individual people in each cohort were in the sensitive categories, information which could be difficult or intrusive to obtain.

It should be clear that FLoC will never be able to prevent all misuse. There will be categories that are sensitive in contexts that weren't predicted. Beyond FLoC's technical means of preventing abuse, sites that use cohorts will need to ensure that people are treated fairly, just as they must with algorithmic decisions made based on any other data today.

Opting Out of Computation

A site should be able to declare that it does not want to be included in the user's list of sites for cohort calculation. This can be accomplished via a new interest-cohort permissions policy, which defaults to allow. Any frame that is not allowed the interest-cohort permission will have a default value returned when it calls document.interestCohort(). If the main frame does not have the interest-cohort permission, the page visit will not be included in interest cohort calculation.

For example, a site can opt out of all FLoC cohort calculation by sending the HTTP response header:

Permissions-Policy: interest-cohort=()
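For instance, a minimal sketch of a server that opts all of its pages out, assuming a Node.js/Express setup (the framework choice is ours, not the proposal's):

const express = require("express");
const app = express();

app.use((req, res, next) => {
  // Deny the interest-cohort permission for this page and embedded frames.
  res.setHeader("Permissions-Policy", "interest-cohort=()");
  next();
});

app.get("/", (req, res) => res.send("This site is excluded from FLoC."));
app.listen(8080);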

Proof of Concept Experiment

As a first step toward implementing FLoC, browsers will need to perform closed experiments in order to find a good clustering method to assign users to cohorts and to analyze them to ensure that they’re not revealing sensitive information about users. We consider this the proof-of-concept (POC) stage. The initial phase will be an experiment with cohorts to ensure that they are sufficiently private to be made publicly available to the web. This phase will inform any potential additional phases which would focus on other goals.

For this initial phase of Chrome’s proof of concept, simple client-side methods will be used to calculate the user’s cohort, based on all of the sites with public IP addresses that the user visits. The qualifying subset of users who meet the criteria described below will have their cohort temporarily logged with their sync data, so that Chrome can perform the sensitivity analysis described below. The collection of cohorts will be analyzed to ensure that cohorts are of sufficient size and do not correlate too strongly with known sensitive categories. Cohorts that don’t pass the test will be concealed by the browser in any subsequent phases.

How the Interest Cohort will be calculated

This is where most of the experimentation will occur as we explore the privacy and utility space of FLoC. Our first approach involves applying a SimHash algorithm to the registrable domains of the sites visited by the user in order to cluster users that visit similar sites together. Other ideas include adding other features, such as the full path of the URL or categories of pages provided by an on-device classifier. We may also apply federated learning methods to estimate client models in a distributed fashion. To further enhance user privacy, we will also experiment with adding noise to the output of the hash function, or with occasionally replacing the user's true cohort with a random one.
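As a rough illustration of the SimHash idea only (a toy sketch, not Chrome's actual implementation): hash each visited domain, sum per-bit votes across the history, and keep the sign of each sum, so users with overlapping histories tend to land on nearby cohort values.

const crypto = require("crypto");

// Toy SimHash over registrable domains (illustrative only).
function simhash(domains, bits = 8) {
  const counts = new Array(bits).fill(0);
  for (const domain of domains) {
    const digest = crypto.createHash("sha256").update(domain).digest();
    for (let i = 0; i < bits; i++) {
      const bit = (digest[i >> 3] >> (i & 7)) & 1;
      counts[i] += bit ? 1 : -1; // per-bit vote
    }
  }
  // Majority vote per bit: similar histories give similar bit patterns.
  return counts.map(c => (c > 0 ? "1" : "0")).join("");
}

console.log(simhash(["news.example", "bikes.example", "recipes.example"]));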

During the experimentation phase, Chrome's various efforts at cohort assignment algorithms will be documented at https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc.

Qualifying users for whom a cohort will be logged with their sync data

For Chrome’s POC, cohorts will be logged with sync in a limited set of circumstances. Namely, all of the following conditions must be met:

  1. The user is logged into a Google account and opted to sync history data with Chrome
  2. The user does not block third-party cookies
  3. The user’s Google Activity Controls have the following enabled:
    1. “Web & App Activity”
    2. “Include Chrome history and activity from sites, apps, and devices that use Google services”
  4. The user’s Google Ad Settings have the following enabled:
    1. “Ad Personalization”
    2. “Also use your activity & information from Google services to personalize ads on websites and apps that partner with Google to show ads.”

Sites on which interest cohorts will be calculated

All sites with publicly routable IP addresses that the user visits when not in incognito mode will be included in the POC cohort calculation.

Excluding sensitive categories

We will analyze the resulting cohorts for correlations between cohort and sensitive categories, including the prohibited categories defined here. This analysis is designed to protect user privacy by evaluating only whether a cohort may be sensitive, in the abstract, without learning why it is sensitive, i.e., without computing or otherwise inferring specific sensitive categories that may be associated with that cohort. Cohorts that reveal sensitive categories will be blocked or the clustering algorithm will be reconfigured to reduce the correlation.
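One naive form such an over-representation check could take is sketched below; the threshold and rates are made-up placeholders, and the actual analysis is designed to avoid learning which specific category is involved:

// Flag a cohort when a sensitive category is over-represented relative to
// the general population (made-up threshold; illustrative only).
function cohortLooksSensitive(cohortRate, baselineRate, maxRatio = 2.0) {
  return cohortRate > baselineRate * maxRatio;
}

console.log(cohortLooksSensitive(0.09, 0.03)); // true: 3x baseline prevalence
console.log(cohortLooksSensitive(0.04, 0.03)); // false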

floc's People

Contributors

antoinebisch, dmarti, domenic, jkarlin, marcoscaceres, michaelkleber, xyaoinum, yoavweiss


floc's Issues

Test that FLoC cohorts are trained on site content, not accessibility or support for assistive technologies

Some sites might plan to use the FLoC cohort as part of a decision to show employment or housing advertising. In order to avoid unlawful discrimination in this situation, cohorts would need to not reflect a disability affecting users who are members of them.

Sites can vary greatly in both accessibility and in support for assistive technologies. Because the FLoC classifier is machine learning based, it is possible that it could "learn" to identify cohorts based not on site content, but on sets of sites with similar accessibility properties. (For example, an otherwise unrelated set of sites that all happen to have breakage when text is resized in the browser, or terrible screen reader support.)

Is there a way to show that in real world usage FLoC cohorts are not being trained on a11y properties of sites?

Related issues

Virtuous Incentives / Compensation to join FLoC?

Hi @michaelkleber,

This issue stems from Issue #41, raised by @fischerbach, which I believe brings up a broader, key question: "Why would any site owner want to allow FLoC computation on its sites?"

In essence, one could compare FLoC to a data-cooperation program that everyone could benefit from, regardless of how much they contribute to it. Usually, data-coops include rules like:

  • To benefit from the coop, you must contribute to the pool.
  • The value you extract out of the coop must be on par with the value you bring to it.

But we don't see any of that in the current FLoC proposal.

Moreover, the usage of top-domain only for FLoC computation (Cf. issue #43) creates a contribution asymmetry depending on domain size and specificity.

Big platforms like 'video.example', 'social.example', or 'shoppingmall.example', even if they allow the visits they get to be included in FLoC computation, won't contribute much value (as they are widely popular generic websites).

On the other hand, they will extract a lot of value and keep concentrating demand, while small websites like 'gadgets.net', 'moviereviews.com' or 'cookingwares.com' will contribute a great deal through the specific, qualified audiences on their websites, making their contextual monetization less relevant as it is siphoned off by FLoC, which makes those audiences available on bigger platforms.

We think that to make FLoC a useful, attractive and fair system, the solution should:

  • Set as a rule that FLoC IDs will be available to a site only if that site allows the visits it gets to be used in FLoC computation.
  • Make the contribution independent of website architecture, by deriving signals from the content of the page, and not only its top-level domain or URL.
  • Define rules and compensation mechanisms to ensure that each participant gets a fair share of the value produced by FLoC.

What are your thoughts on this?

Workflow to quit one flock and try to get into a better one

If a user ends up in a low-status flock, and receives mostly advertising that seems to be low value or high risk, how can the user quit their current flock and join a higher-status flock?

(For example, a user who notices a lot of ads for predatory finance, political rage-bait, and alarming sketchy tech support offers might want to switch flocks to get more mainstream brand advertising with higher production values and lower perceived risk.)

Does the lack of true federation create a trust/maintenance blocker for FLoC?

The test posted by the Google team proposes that FLoC is not usable without either a SortingLSH sorting server or a SimHash anonymity server, both of which create centralization (noted as loose in both cases, though I am not familiar enough with the related process and math to understand just how loose), which was not what I was hoping for on reading the initial proposal for FLoC.

This looks, to me, to be potentially a major source of contention or opportunity depending on how it is approached?

I think the first question is: is true federation for this proposal impossible? The initial proposal was, I understand, highly theoretical and perhaps what we are discovering is that, in practice, there is no way to provide the utility and provide a reasonable guarantee of privacy without some centralized server in the mix. The section on affinity hierarchical clustering seems to indicate that there may be some future federation option that could eliminate centralization, but it does not specify.

Do the proposers currently believe that it is likely that this product would go live with a requirement for a central server? Or is that requirement considered a blocker?

If a central server is not considered a blocker, this presents a big question:

Who provides this central server?

Assuming that the techniques described ensure that a central server receives no data which de-anonymizes users (a fair assumption based on the description), there are still trust problems:

  1. That the entity running the service is trusted to not attempt to leverage the data to shape the market in any way.
  2. That we trust that the entity will continue to exist in a form that allows it to maintain these servers.
  3. That the server's costs are supported.

From these issues I think we have some smaller questions:

Does providing the server potentially give the entity running it an opportunity to participate in the ad process (ex: accounting for ad calls to take a percentage of revenue per-ad)?

Does the user have the capacity to select servers?

Do the sites running ads have the capacity to select servers?

Could service providers hosting the servers for processing FLoC cohorts differentiate themselves based on the question of utility vs anonymity within particular limits? (Ex: can a server advertise itself as more private by requiring larger cohorts and allow users or user agents to select it on that basis?)

Thanks!

K-Anonymity Cohort>Conversion Calculator

Since no final decisions have been made on cohort/k-anonymity sizes or conversion thresholds, I created a calculator to see how different scenarios would perform, by inputting cohort sizes, ad frequency rates, click rates, landing page rates, conversion rates, and the percentage of conversions resulting from post-view. I used to use this type of math to project how advertisers' campaigns would likely perform and to make decisions on where to shift budgets.

Each campaign can vary based on objectives, as well as where it's running (programmatic, social, direct, etc.), and it appears that there could be many scenarios where conversion data would not be reported if the minimum thresholds were to be 100 conversions for an event, and if there was little to no attribution from post-view. This could be especially challenging for small to mid-size marketers.

If most of these scenarios are true, marketers would still find it valuable to see these numbers (including post-view) to help calculate CPA and ROI, since they base their decisions on which publishers to invest with and which ones not to.
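For a flavor of the funnel math the calculator performs (all rates below are hypothetical examples, not taken from the sheet):

// Hypothetical campaign funnel vs. an assumed 100-conversion threshold.
const cohortSize = 10000;
const impressionsPerUser = 4;
const clickRate = 0.01;       // clicks per impression
const landingRate = 0.8;      // clicks that reach the landing page
const conversionRate = 0.03;  // landings that convert

const clicks = cohortSize * impressionsPerUser * clickRate;  // 400
const conversions = clicks * landingRate * conversionRate;   // 9.6
console.log(conversions); // far below 100, so the data would go unreported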

Here's the link to the Google Sheets with my calculations: https://bit.ly/3pS38hq

Note: all fields highlighted can be edited, otherwise the rest of the sheet is locked. Feel free to play around with it.

Angelina Eng
VP, Measurement & Attribution @ IAB/IAB Tech Lab

Unsupervised Learning

The Problem

The FLoC proposal speaks to using federated learning to create groups of users with similar interests. An intuitive example is classifying users viewing websites selling cars as “in-market auto”. These classification systems are often built two ways, supervised and unsupervised. The example above is a naïve unsupervised approach -- websites are labeled as being related to auto sales and users are added to the audience if they have enough activity on those classified websites. A supervised model would take a label set – either people that recently purchased a car, or people that visited known auto sites – and find other similar users, based on their internet browsing behavior. An advantage to the supervised approach is that the model allows for discovery of other related behavior that may not have a similar classification (e.g., reading news about interest rates).

The current proposal lacks detail around who determines the algorithm and how it is deployed. Based on the available description, we infer that only one unsupervised model will be available; users with ‘similar’ behavior will be placed into cohorts and a cohort ID will be available during bidding. Without information about how the cohorts were made we cannot create the naïve model above -- specifically, what websites contribute to each cohort and the tradeoffs between recency of activity, frequency of activity, and volume of activity. Moreover, unless we have reporting about conversions at the flock level, we cannot use the supervised approach to discover flocks relevant to a given advertiser. While there are methods by which we can derive the contributing websites, these mappings would constantly need to evolve as the FLoC model changes.

Additionally, it is unlikely that the one-FLoC-for-all will work for all advertisers. In the auto targeting example, there is no guarantee the algorithm will associate users with similar auto-viewing behaviors into the same flock. In that case, it is likely a small percent of each flock will contain those users – rendering that method of targeting useless. Most users in a given flock will not be interested in buying a new car, resulting in a waste of advertiser money and a poor user experience. Even with robust conversion reporting and a sophisticated algorithm, advertisers will lose the ability to find their relevant audience.

Publisher & User Impact
The FLoC proposal as it stands today will favor larger publishers. Algorithms used by ad tech companies will start to index higher on publishers with higher traffic, resulting in more accurate targeting on their inventory. This will lead to a drop in revenue for the smaller publishers. To make up for the lost revenue, publishers will either have to show more ads per page or erect paywalls, neither of which are ideal outcomes for the end user or the future of the internet.

Server-side component

One question I've had about this explainer for a while is who owns the server-side component used in aggregating the deltas from clients and training the model at scale. Is the idea to have this owned by a single trusted entity in the industry? Or perhaps a consortium of entities? Or is there a world where multiple server-side components could securely/privately share their individual models to aid in the creation of one shared model?

If the latter is a possibility, it’d be great to understand more deeply what interfaces/protocols would need to be implemented to spin up one of these instances.

Thanks!

Floc numbers and creation

Hello everyone,

Following the initial PoC phase and whitepaper that Google went through, do we now have a clearer idea of the number of FLoCs we should expect, as well as their expected size?
Also, will FLoCs be common across browsers, or unique per browser? In the latter case, does it mean that DSPs and other tech providers optimizing toward FLoC will need to optimize toward a FLoC/UA pair rather than FLoC alone?

Only train on pages on which document.interestCohort is called

In order to limit inadvertent sensitive group tagging, only train the FLoC classifier on the URL or content of pages on which document.interestCohort has been called. If the owner of a page wants to make it available for training but ignore the cohort, they can ignore the return value of the function.
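In code, a page could make itself available for training while discarding its own cohort as simply as this (sketch):

// Trigger cohort-training eligibility for this page, but ignore the value.
document.interestCohort().catch(() => {});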

The web currently has more than 1.2 billion sites (including parked domains). It is impractical for even a large browser developer to test for which patterns of usage of which sites are inadvertently revealing sensitive information about a user.

For example, web history on an ordinary-looking web-based game could result in providing inputs to the machine learning algorithm that train it to recognize a set of users with a specific disability that affects their gameplay, and expose that set of users as a cohort to any site they visit -- without revealing to the users affected that their cohort reveals this sensitive information to all those sites.

Patterns of usage of a set of general-interest world history or culture sites could result in training the algorithm to recognize people with specific political or religious concerns, again without revealing to the people affected that their cohort is flagging them as a likely member of a protected or at-risk group.

Many other patterns of emergent sensitive group tagging would likely become evident only after FLoC has been deployed to real-world users with real web histories.

Source: https://news.netcraft.com/archives/category/web-server-survey/

Related issue: Publisher opt-out ? (Issue #13 covers an explicit opt-out that would remain in effect even if a script on the page later calls document.interestCohort)

(Sec-CH-Flock + Client's IP address) == Unique Identifier?

It is quite likely that the combination of (Sec-CH-Flock + Client's IP address) creates a unique identifier for most interesting devices on the planet, assuming by an interesting device we're talking about a device that has been used by its human user for some web browsing activity.

The proposal indicates that this combination might uniquely identify the user, which sounds like a far-fetched assumption to be honest (but this is something we'd need some data about).

What is the plan to ensure that just by adding this one header we do not create a unique identifier?

Publishers point of view

Most issues deal with possible leaks of users' browsing history from the user’s perspective.
I would like to draw attention to a similar problem, but from the perspective of those who maintain websites - the publishers.

Cohorts are calculated based on whether the user has been to various sites. For example, there is a publisher A who publishes a site dedicated to a niche topic such as teapots. Preparing the content and maintaining the servers are business costs that are covered by advertising revenue, i.e. the ability to reach users interested in teapots.
Since even superfans of teapots do not read sites about them very often, the number of visits may not be sufficient to display the advertisements of all interested advertisers.
Therefore, through various mechanisms, attempts are made to reach teapot fans on publisher B's mainstream sites, which are visited by the majority of the population.

In a world with third-party cookies, if publisher B wants to target the users of publisher A, the publishers must somehow come to a business agreement and implement tracking codes on A's websites.

If the current version of FLoC goes live, it will be possible to target ads to cohorts of users. If it is possible to learn about a cohort that includes visitors of publisher A (e.g. by methods described in other issues), that publisher may lose its market position to publisher B, which has a larger base. This may ultimately result in publisher A becoming unprofitable and the site being closed down.

In colloquial terms, some publishers wonder why they should bear the cost of producing content that will appeal to specific audiences when the fruits of that labour will be distributed to everyone else. This will result in an even greater flow of advertising budgets to big players like Facebook and Google, who will be able to catch literally every possible cohort.

Make Computation opt-in for websites

Currently websites must opt in to interest tracking technologies by adding code to gather the data. Computation should be opt-in too. Any site that doesn't explicitly state that it wishes to be used to indicate a user's interests should not be included.

This would significantly help reduce harm from visits to sensitive websites being used to reveal information about a user by the floc(s) they are assigned to.

Please specify privacy measures

I remain interested in the potential of FLoC as a proposal, but I have some privacy concerns that seem very limiting of any support we could give. While the proposal points to some specific browser-level constraints that would be needed, I think it needs to be clearer about the constraints on the cohorts (numbers, size, etc.) and the limits on the total count of cohorts, so we can better understand the potential privacy impact.

Training based on FQDN or registerable domain?

The README says,

Our first approach involves applying a SimHash algorithm to the domains of the sites visited by the user in order to cluster users that visit similar sites together.

Will this be the FQDN or the registerable domain? (Will store.example.com, support.example.com, www.example.com all be counted separately or all included under example.com?)
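For illustration, the difference in a naive sketch (real code should consult the Public Suffix List, since a plain "last two labels" rule breaks on domains like example.co.uk):

// Naive registrable-domain extraction (illustrative only).
function naiveRegistrableDomain(fqdn) {
  return fqdn.split(".").slice(-2).join(".");
}

["store.example.com", "support.example.com", "www.example.com"]
  .forEach(h => console.log(h, "->", naiveRegistrableDomain(h)));
// All three map to example.com, so under the registrable-domain reading
// they would be counted as one site.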

General concerns about FLoC-powered abuse

We have some significant concerns about the long-term viability of FLoC due to some social consequences it might have.
Indeed, in FLoC, the user agent (the Chrome browser) is solely responsible for assigning people "with similar interests (or behaviour)" into one cohort. This is a huge responsibility, and potentially one that could lead to some unintended, but very real, societal consequences.

Let me describe what I think is a likely bad usage of FLoC:

Let us consider an "attacker" who wants to harm a specific group (for instance, because of race, religion, sexual orientation, political views, etc.). Some members of this group will likely share a FLoC, as it is FLoC's purpose to group users of similar interests based on their browsing history. The attacker can easily emulate the browsing history of a member of the group they are willing to harm and see which FLoC they have been added to. The attacker can then target this FLoC ID in any specific way they wish, even though they don't have access to any specific user. If they did want access to a specific user, the attacker would "just" need to get it via a website holding PII that is browsed by anyone with the same FLoC ID.

This might look far fetched, but similar "artisanal" cases have already been used, for instance here https://www.pinknews.co.uk/2015/02/18/gay-dating-apps-used-by-attackers-to-trap-victims-in-ireland/

So it already exists today, in some form. But the thing is, Chrome will have done all the heavy lifting to make such attacks work "at scale". Being part of the group, instead of shielding the user from potential harm, actually puts a target on their back.

This makes this kind of attack significantly easier than with third-party cookies (where you would need to drop a cookie directly on the group's website), and you benefit from Chrome's added intelligence to do so, as Chrome groups users together and gives the FLoC IDs out to everyone.

Yes, the aforementioned threat could be reduced, but not eliminated. For example, as you proposed, not taking into account websites flagged as related in some way to marginalized communities could work on paper. But it is going to be extremely hard to set up in practice on an ever-changing web. As you stated in #27, it is possible to have bias even from a seemingly unbiased signal. Web browsing history represents the user's interests and is therefore by nature biased toward people's interests (and this includes groups suffering prejudice, or susceptible to being the target of malevolent actors). The web is extremely wide and diverse, and there is no way that no remote part of it falls through the cracks, especially in countries and cultures unfamiliar to Chrome engineers, where endangered groups might differ wildly from those in the Western Hemisphere in general, and the United States in particular.

Another point of contention with removing such groups is that it is discriminatory toward businesses with legitimate interests, and toward the people targeted by these businesses. For example, straight newlywed people will have a FLoC allowing personalization and monetization by businesses targeting them, but the same kind of business specialized in serving LGBTQ+ people (if that cohort were filtered out for the sake of sensitivity) would have no means of doing targeted advertising, and therefore could not expand fairly compared to non-LGBTQ+ businesses!

I really believe that if FLoC happens, then seeing examples such as those I listed above is not a matter of "what if", but "when". As such, it could seriously jam the long-term prospects of FLoC as an accepted marketing framework.

Could you please let us know what elements would be put in place to shield FLoC from such risk and ensure its persistence in the long term?

Lookalike targeting using FLoC?

I'm not sure if you've seen this proposal: https://github.com/w3c/web-advertising/blob/master/privacy_preserving_lookalike_audience_targeting.md

The key idea there is to use the Aggregated Reporting API to perform logistic regression on embedding vectors with boolean labels. In that proposal, the suggestion was for publishers to provide custom embedding vectors for use in this process. I am wondering if FLoCs could be used as well?

While the proposal talks about FLoCs as "cohorts", I get the sense that they are not meaningless, arbitrary numbers. Specifically this part:

The browser uses machine learning algorithms to develop a flock based on the sites that an individual visits. The algorithms might be based on the URLs of the visited sites, on the content of those pages, or other factors. The central idea is that these input features to the algorithm, including the web history, are kept local on the browser and are not uploaded elsewhere — the browser only exposes the generated flock

I assume what this means is:

  1. The browser will use "Federated Learning" to train a Machine Learning model.
  2. This model will use the user's complete browsing history as "features" in this model
  3. This model will use XXX as labels (unknown and not stated... but super important and I hope you do clarify...)
  4. The trained model will be used to produce an "embedding vector" for each browser instance that captures the concept of "similarity" between different users
  5. To preserve privacy, the full, raw embedding is not shareable (it has too much entropy and could be used as a fingerprinting vector). As such, there is a dimensionality reduction down to just 16 bits (using something like Locality-Sensitive Hashing), and possibly some kind of differential noise is added to these 16 bits after that, and probably there is some kind of server-side coordination to ensure the distribution isn't too skewed and there is a minimum number of browsers in each of the 65,536 FLoCs.

My question is:

Will step 5 render the FLoC ID useless as anything but a random "cohort ID"? Or will it maintain some kind of meaning like the original embedding vector?

Let me give a concrete example to make my question more clear.
Assume:

- Person A is in FLoC 0x1FEB (0001111111101011)
- Person B is in FLoC 0x1BEB (0001101111101011)
- Person C is in FLoC 0xE223 (1110001000100011)

Are person A and person B "more similar" than person A and person C? The Hamming distance between person A and person B is 1. The Hamming distance between person A and person C is 10.
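For reference, the distances above can be checked with a few lines (sketch):

// Hamming distance between two 16-bit FLoC IDs.
function hamming(a, b) {
  let x = a ^ b, d = 0;
  while (x) { d += x & 1; x >>>= 1; }
  return d;
}

console.log(hamming(0x1FEB, 0x1BEB)); // 1
console.log(hamming(0x1FEB, 0xE223)); // 10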

If step 5 preserves some kind of meaning (for example, we can compare the Hamming Distance between FLoCs and use this as some kind of measure of similarity) then it seems like one could potentially apply the same "Logistic Regression in MPC" approach to FLoC IDs.

Prevent browsing history detection

This problem is partly discussed by #36 and is related to #38 (comment) but I want to make the threat scenario more explicit.

The security section in the explainer mentions revealing people's interests to the web, but it's important to note that FLoC may also potentially be reverse engineered to reveal the set of specific websites visited by the user.

For example, consider a user with a clean browsing profile who in the first few days after installing the browser visits their favorite news site, social network, and bank website. This will assign the user to a cohort with a random-seeming identifier; however, an attacker can also make a guess about the set of websites visited by the user, calculate the FLoC resulting from this history pattern offline, and compare the value to the user's actual FLoC. An attacker could compute a large set of likely FLoC values based on the popularity of websites, news articles published in a given period of time, content shared on social media, etc. Given that browsing patterns are not random, a motivated attacker can likely find matches for a large fraction of users. While the FLoC value doesn't give the attacker certainty that a user has visited a specific set of sites, it can give them high confidence, especially if the attacker is willing to make some assumptions about which sites the user is likely to visit.
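A toy sketch of the offline precomputation step (the cohort function here is a stand-in hash, not the real locality-sensitive algorithm, but exact-match precomputation works the same way against any deterministic history-to-cohort map):

const crypto = require("crypto");

// Stand-in cohort function: any deterministic map from history to cohort
// can be precomputed offline by an attacker.
function cohortOf(history) {
  const h = crypto.createHash("sha256")
    .update([...history].sort().join(",")).digest();
  return h.readUInt16BE(0).toString(16).padStart(4, "0");
}

const observed = cohortOf(["news.example", "social.example", "bank-a.example"]);
const guessedHistories = [
  ["news.example", "social.example", "bank-a.example"],
  ["news.example", "social.example", "bank-b.example"],
];
for (const guess of guessedHistories) {
  if (cohortOf(guess) === observed) console.log("likely history:", guess);
}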

As mentioned in #38 (comment), the potential risk here is affected by the granularity of data taken into account during FLoC calculation. Less granularity (e.g. taking into account only the site of a visited page) reveals less information, but makes it easier to calculate a collision with the user's FLoC. More granularity (e.g. taking into account the full URL, or page contents) makes the FLoC harder to precompute, but may reveal sensitive cross-origin information if the attacker manages to find the right match.

We also know from past research that browsing preferences are relatively stable over time. This suggests that it may be easy for attackers to precompute FLoCs to find matches, and also increases the risk of reidentification: If I keep using the same bank, webmail and news site, but visit a few new viral websites linked to by my social network each week, I may get a new FLoC, but an attacker who knows which content is popular in a given week can infer what my bank & webmail websites are, linking my past and current profile.

This seems like an important problem to address. The main thing I can think of is to reduce the length of the FLoC so collisions are frequent enough to make it difficult to make inferences about the actual set of visited sites. Randomizing the FLoC (e.g. using a random seed for each user) seems unlikely to meaningfully help here because it will only require the attacker to do more work to compute the value.

Support for browser extensions

To help address concerns about possible side effects of FLoC (see issue 16), please help facilitate independent research by making the FLoC Key available for browser extensions to get and set.

The FLoC README states,

Flocks could be evaluated for fairness by measuring and limiting their deviation from population-level demographics with respect to the prevalence of sensitive categories, to prevent their use as proxies for a sensitive category. However, this evaluation would require knowing how many individual people in each flock were in the sensitive categories, information which could be difficult or intrusive to obtain.

That kind of data collection would be impractical to get consent for in general, but there is an exception. Some browser users choose to participate in NGO-led research to share information about advertising practices that affect them. The best-known example is Ad Observer - Chrome Web Store, previously Facebook Political Ad Collector.

Facilitating this kind of independent research by volunteer, opt-in browser users would be a valuable step to address concerns about discriminatory side effects of FLoC. For example, volunteers could run a research extension to gather data to help address the question: do people get different ads for jobs and housing when browsing in a flock related to assistive technology and in a flock related to some other technical interest?

Incrementality testing and optimization?

Hi,

In my opinion FLoC focuses too much on attribution-based advertising, and does not offer any solution for more "truthful" measurement like incrementality. Do you plan on supporting it? If not, I feel FLoC constitutes a step back in our goal to improve web advertising.

It is reasonable to assume that clustering people by interests as emulated through their browsing history will likely lead to a good correlation between the characteristics underlying the clustering of these people with the actual attributed outcome of the advertising event. In simpler words, a flock of "smartphone enthusiasts" (people who have read smartphone reviews, or browsed smartphone product pages on ecommerce websites etc) could end up buying a smartphone anyway (regardless of seeing an ad or not) - and a post-view or post-click attribution would give credit to ads done to this "organic buyers" group, indistinct from the group on which there was an actual causal effect.

That is a long-known flaw from attribution based models, and many marketers moved / are moving away from it to focus on incremental lift. Cookie-based targeting and measurement allow for incrementality testing and piloting of marketing campaigns, and theoretically TURTLEDOVE/FLEDGE could too (one could imagine that marketers could build their cohort with an incremental goal in mind, and measure it by deduplicating cohorts into test and control groups - not the most convenient, but it could happen).

However, I don't see how we could measure and optimize for Incrementality in FLoC:

  • the rules by which cohorts are constituted will be largely unknown (and we can assume that people are not clustered together because of similar potential incremental ad lift),
  • there is no means to exclude users from a flock in order to constitute a control group,
  • outside of the flock_id there is no variable that would allow marketers to remove "organic buyers" from these flocks (for example: people who bought a fridge online will still be in the fridge flock, but we can be pretty certain that they won't buy another one anytime soon, so there is no need to show them more ads, for their sake, the advertiser's, and the publisher's).

I understand that in the Privacy Sandbox the measurement proposals are separated from the Targeting ones, but in the incrementality context, the two pieces need to be connected for ABtesting to be possible. Do you envision support for such ABtesting capabilities in FLoC?

Link decoration in contextual/FLOC

Hi Michael,

reopening issue #28 as you closed it.

Do you mean that somehow the browser would clean / filter any 'link decoration'?

UTM_source, one of the most frequent 'link decoration' in use today, is a backbone of advertising measurement, championed most notably by Google Analytics ("GA"), the largest website traffic monitoring tool in the world. It is used by advertisers to compare performance and shift budget between different marketing channels, exposing a full range of signals, from clicks, to bounce rate, to sessions, to conversions, etc.

UTM_source is not so much a way to declare where the user came from, but rather how they arrived on the website ('Facebook_video_ad', 'criteo_consideration', 'google_retargeting', etc.). This usage does not seem to infringe any of the Privacy Sandbox principles. Sure, nothing enforces the fact that UTM_source is used in this fashion. But it is an important usage nonetheless, if only for accountability. And rather than killing it, it would be better to make it more robust.

Losing UTM_source would mean that GA becomes obsolete for the open web, as many of these signals, crucial for advertisers (especially when they use contextual advertising), are gone. Do you plan on giving enough flexibility in the Conversion Measurement API to cover all current GA metrics?

GA also extends beyond Open Web display and video impressions, to encompass a full suite of Google first party assets like Youtube, Gmail, Search, Shopping, etc. The 'level playing field' that you guys mentioned a few weeks ago during our weekly calls certainly means that GA wouldn't use link decoration or anything of the sort to track performance outside of open web, giving these proprietary channels an unfair advantage in the process, correct?

Extending the argument, will these channels be subject to noise, delay, and high level aggregation as well?

Publisher opt-out ?

Hello everyone,

Publishers might not want browsing history from their content to be used by the browser to build FLoC.
Can publishers opt out of contributing to FLoC? If so, how would opting in/out be set up, and what would be the limitations?

Access to cohort ID in the aggregated measurement reports

Regarding the recent discussions around analytics and optimization use cases, I have a quick question on the availability of cohort IDs in the aggregate reports documented in the Conversion Measurement API and Aggregate Reporting API.

The aggregate reporting API mentions the possibility of requesting data sliced demographically or by market, can you confirm if there is a plan to allow ad networks to access those reports sliced by cohort IDs too? If we consider that cohort ID is the primary signal for behavioural targeting in the future, it seems critical that this information should be available natively (as opposed to having to be passed in the impression metadata for example).

Cohorts stability

The FLoC proposal states that the FLoC will be refreshed as the user traverses the web.

In Issue #22, the typical refresh period is said to be once a day.

Advertisers need stability over time for cohort IDs in order to learn the affinities of user interests with their products, as shown in the graph below: the cohort repartition may vary from day to day, but cohort updates should be coherent with previous cohort IDs.

What kind of cohort stability does FLoC commit to?

[Graph: day-to-day cohort repartition, with cohort updates staying coherent with previous cohort IDs]

Performance measurement: ABtest-ability and incrementality

In digital marketing, A/B testing is usually performed by splitting an audience into populations based on a hash of the user ID (see the sketch after this list). By building these independent splits, the advertiser can:

  • measure the impact of a feature on users and thus improve their marketing strategy

  • measure the incrementality (the overall business uplift) brought by a marketing campaign
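A sketch of that conventional hash-based split (illustrative):

const crypto = require("crypto");

// Conventional A/B split: hash the user ID and bucket on one bit.
function assignGroup(userId) {
  const h = crypto.createHash("sha256").update(userId).digest();
  return (h[0] & 1) === 0 ? "A" : "B";
}

console.log(assignGroup("user-123"));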

Is AB testing supported by the FLoC proposal?

As in FLoC the UID is meant to be replaced by a cohort ID, one could think of taking the hash of a cohort ID to build the population splits. This is not a viable option because of:

  • the dependence of cohort IDs on the outcome: by design, cohort IDs are chosen to represent user interests. A sports retailer may simply not, statistically, have the same performance on tennis-interested users (assuming there will be a cohort for such users) as on those interested in football.

  • the loss of granularity: depending on the campaign setup, the advertiser may even be targeting a single cohort ID, which will fall into population A or B (given the result of hash(cohort_id)) but not into both.

Flocks seem problematically black-box-y

As an advertiser, I may want to know: what Flocks exist? Generally, what kind of people do they represent? Who am I spending ad dollars on and why?

As a user I may want to know: what Flocks have I been put in and why? How is this changing my browsing experience – from the ads I'm seeing to the first-party content (news/social post prioritization, products, prices) I'm being served?

Many legal and public-interest entities may want to know all of the above... what flocks exist? Why are individuals being sorted the way they are? Is this categorization fair to them as individuals, and what effects might it have on the broader society?

The proposal should make clear how these sorts of questions will be answered. Machine learning and flock names like "43A7" seem to obfuscate, rather than clarify, here.

Integrate Proprietary Cohorts Concept

Would you consider incorporating the concepts laid out in the Proprietary Cohorts proposal into FLoC, to allow first parties to choose how cohorts are generated on their sites?

I believe this proposal provides most if not all of the benefits of FLoC while giving first parties some control and choice. It also has the added benefit of providing consistent cross-browser functionality as cohort generation would not differ from browser to browser.

floc origin

The FLoC whitepaper gives two main examples of origins from which a FLoC can be derived: 1) user domain encoding, and 2) pre-existing user topic scores, probably inferred from URL text content.
I understand the first encoding is more agnostic than the second, which relies on an existing human-defined set of categories.
What is, today, the likeliest FLoC origin that Chrome will choose, or at least the one that will be chosen for the experiments to be conducted from March?
a) Domain/URL encoding without considering text content, like the first example in the whitepaper
b) Word encoding without considering a human-defined set of categories (remaining agnostic)
c) Topic category encoding, as in the last example of the whitepaper
d) Other...?

reduce dimensionality of floc set with a bloom filter to reduce privacy budget impact

It seems a reasonably compact bloom filter could potentially express the entire set of common interests while reducing the entropy. Does this seem compatible with the floc project goals and envisioned implementation?

An enhancement to the bloom filter for this purpose could be TTLs on each element, but I am not sure whether the theoretical properties of such an enhanced bloom filter, with positions expiring at different moments, have been well demonstrated.

The advantage of the bloom filter is that it also preserves privacy: if you test the bloom filter for a segment and the test returns false, the user does not have the segment; if the test returns true, the user may have the segment. In this sense the bloom filter reduces media waste for the advertiser but prevents technology companies from being certain a user has an interest. The false positive rate is determined by the number of interests that might be expressed in the bloom filter (it seems the IAB taxonomy committee has a role to play here) and the number of elements.
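A minimal sketch of the idea (the size and hash-count parameters here are arbitrary choices for illustration):

const crypto = require("crypto");

// Minimal Bloom filter over interest segments (illustrative parameters).
class InterestBloom {
  constructor(bits = 256, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.bitset = new Uint8Array(bits);
  }
  positions(item) {
    const out = [];
    for (let k = 0; k < this.hashes; k++) {
      const h = crypto.createHash("sha256").update(k + ":" + item).digest();
      out.push(h.readUInt32BE(0) % this.bits);
    }
    return out;
  }
  add(item) { for (const p of this.positions(item)) this.bitset[p] = 1; }
  // false => definitely not present; true => only *possibly* present.
  has(item) { return this.positions(item).every(p => this.bitset[p] === 1); }
}

const interests = new InterestBloom();
interests.add("autos");
console.log(interests.has("autos"));    // true
console.log(interests.has("knitting")); // false (with high probability)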


Who will be able to create a flock

Let's say a publisher or an advertiser wants to target a very specific audience (for example, people intending to buy a Mercedes SUV).

Will they be able to create this custom flock? If so, what is the workflow?

Thanks,
Ed

How does a FLoC lead to targeted interest based advertising?

I'm struggling to connect the FLoC idea with interest based advertising. Today an advertiser can target a group of users "Interested in international travel". Would a flock exist that contains users who are interested in international travel? How would these categories be chosen?

I'm likely misunderstanding, but a FLoC seems entirely different from how third-party audiences are used today. Currently advertisers target users who are interested in x (regardless of any other behavior), while a flock seems to target users who are similar to each other?

Thank you!

Google's use of Chrome user browsing history data for advertising

In the recently published blog post by Google (mentioning FLoC), it is said that:

once third-party cookies are phased out, we [Google] will not build alternate identifiers to track individuals as they browse across the web, nor will we use them in our products

Could you confirm that this means that Google will cease its use of the Chrome Login to leverage the user's browsing history (even outside of Google properties) for advertising purposes?
This question was raised a few months ago during the WICG calls but I couldn't find it in the minutes.

FLoC+Server

@jkarlin, would you entertain a conversation/thread about running cohort assembly on a trusted server in addition to assembling cohorts inside the browser's construct? I've written a proposal called FLoC+Server (FKA Gatekeeper) that seeks to copy the concepts behind cohort assembly onto a trusted, transparent server run by a not-for-profit entity. That entity is TBD and who it can be is still undefined, but conceptually, I think cohort assembly is too important to the ad tech ecosystem to be possible only inside browsers.

https://github.com/MagniteEngineering/Gatekeeper

Ability to continue measuring post-click performance in a contextual + FLOC set up

Today, post-click performance measurement and optimization is done thanks to third-party cookies, or URL decoration (either UTM sources or a click id in the decoration).

Third-party cookies are going away, but will we still be able to use link decoration for performance measurement and optimization?

I understand from the FLoC proposal that cohorts will be large enough to guarantee privacy while still permitting click decoration and UTM sources.

Can you confirm it?

SortingLSH after SimHash

For Floc generation,

  1. Is it planned to use SortingLSH after having applied SimHash, or to map all flocs with too few users to one unique global floc group?

  2. For SortingLSH, I would suggest not using lexicographical order but rather an order like the following, because lexical order implies jumping from 00111 to 01000 (in dimension 5), which changes 4 digits from one element to the next, so very different flocs can be grouped together. On the other hand, an order can be constructed so that 2 consecutive elements differ in only one digit, which would lead to better cohesion inside a group of gathered flocs.
    For example, in dimension 4, it gives an order like this: 0000,0001,0011,0010,0110,0111,0101,0100,1100,1101,1111,1110,1010,1011,1001,1000
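The suggested ordering is the reflected binary Gray code, which can be generated directly (sketch):

// Reflected binary Gray code: consecutive values differ in exactly one bit.
function grayCode(bits) {
  return Array.from({ length: 1 << bits }, (_, i) =>
    (i ^ (i >> 1)).toString(2).padStart(bits, "0"));
}

console.log(grayCode(4).join(","));
// 0000,0001,0011,0010,0110,0111,0101,0100,1100,1101,1111,1110,1010,1011,1001,1000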

How frequently can a user's FLoC change?

It's possible that a user's browsing habits will change over time, possibly moving them between cohorts. How frequently will Google allow a user's FLoC to change?

Some questions on potential cooperation on decoding Cohort-IDs

UPDATE: Are Cohort-IDs site-specific so the below will be impossible or at least much more complex?

I assume there will be a continuous race to decode the Cohort-IDs in order to offer buyers as much inventory as possible.

Will publishers be working together in this task in order to offset the massive advantages of huge sites like Google and Facebook?

Are we going to see websites that publish weekly Cohort-ID -> Interest mappings (based on voluntary submissions from millions of smaller sites)?

Is it not a potential problem if all websites have immediate knowledge of visitors' interests (from one of those aggregating mechanisms)?

Addressed Category

This document starts by saying that it will cover only category (2); I think this could be made clearer. I have the following considerations.

  1. If you would like to treat the solution as an alternative to a user ID, the document should also cover the third category, because actions are related to the user too.

  2. If you would like FLoC to be a great solution for tailored advertising based on user history, it is perfect. It covers well the targeting capabilities based on client-side information, like web history, but it does not cover the first-party user segment information that a DSP usually has server-side; for instance, an advertiser may have in their CRM each customer's propensity to churn.

How do we avoid FLoC replicating algorithmic red-lining and the discriminatory behavioral targeting we've seen in other ML systems?

One of the big concerns arising from behavioral targeting of advertising is that the red-lining of communities once done via print ads is replicated on the web with ever more precision and impact. I wrote briefly about this issue here.

I am concerned that the machine learning approach will only continue to create this issue in web advertising.

For one, it does not discriminate between particular content providers. Issue 12 discusses this with regard to some advertiser concerns, but I think there is a serious social impact that would degrade trust in this system: it would replicate the issues we see now, which force advertisers to maintain large blocklists (resource-intensive) or small allowlists (bad for new publishers trying to enter the space), and create targeting issues that I know TURTLEDOVE proposes to bridge.

However, this also allows particular content providers to potentially impact users' ad views in a way that would create discriminatory targeting (e.g., would a user who primarily reads The Root and The Baltimore Afro-American be categorized in a racially specific group as opposed to an interest-based one?). It's unclear to me what measure of intervention is available to the browser or, if they wish, the user in the application of these groups, and whether there is any built-in intervention against this type of well-known algorithmic discrimination and its inevitable after-effects. If racially specific targeting is incidentally created by the ML system, what happens when advertisers target for or against it, and who ends up responsible? Is there a central controller who can issue a correction? Do browsers become regulatory or legal targets if FLoC-based targeting turns out to be used primarily for restricting job offers, or housing offers, in a discriminatory way?

This is obviously a complex issue, and one on which there have been more technical examinations than I have given here, but I think moving forward on this proposal requires serious examination into how we can be sure it does not replicate the potentially illegal discriminatory behavior that algorithmic targeting of user behaviors has created in the past.

Converting it to a self-supervised task

The previous successful deployments (to the best of my knowledge) have framed the learning problem as a self-supervised task, because labels are not available. In the case of a model computing which cohort a user belongs to based on local browser history:
What will be used as the 'self-supervisory' signal for the model to learn to predict the cohort correctly?
It's also my understanding that the problem can't be framed as an unsupervised learning problem, because unsupervised learning is generally only used for data exploration, while this requires prediction/inference as well.

This proposal should define what is meant by a "sensitive category"

In order to evaluate how well we expect FLoC to prevent leaks of “sensitive” categories, we first need to agree on our definition of sensitive.

Are we defining sensitive as any category of information that advertisers are legally forbidden from using, whether by their own policy commitments or by government regulations? As an example, see the categories considered sensitive by Google AdSense, or government regulation on medical advertisements and advertising to children. Under this constrained definition, this seems like a difficult problem to solve. Even AdSense is not yet able to guarantee the classification of ads that fall into these categories; the AdSense page on the topic includes the disclaimer: “Our system classifies ads automatically [...]. Our technology will make its best attempt to filter ads from the categories above; however, we don't guarantee that it will block every related ad.” If we're not able to guarantee that we can filter out creatives that fall into these categories, why do we expect to be able to successfully filter out any portion of a user's browsing history that directly reveals, or is a proxy for, a sensitive category?

Moreover, I believe most users would define “sensitive” to cover any information they would feel uncomfortable sharing or any information that, by exposing it, endangers the user. If we agree that this is the definition we should be using, then ensuring that FLoC protects “sensitive categories” is impossible.

Information that users are uncomfortable sharing will vary between individuals. E.g. I might be fine with sharing my income level, but another user may not. The context of the sharing also matters. E.g., I may be willing to share information about whether I have an interest in racing cars to the local car dealerships, but not with potential car insurance advertisers.

Determining what information may endanger an individual also seems impossible to define and filter. For example, let's say we allow age range to be captured in FLoC (and filter out children). Are we sure advertisers aren't targeting scam ads to senior citizens? As another example, let's say a user in Hong Kong ends up in a flock shared by many users who have participated in protests; it's not inconceivable that protest participants share some common interests distinct from non-participants. Are we confident that simply being a member of that flock won't be abused by the Chinese media, given that they're allegedly already abusing other forms of advertising?

Adversarial attacks

What will prevent a malicious website from abusing FLoC and forcing users into an arbitrary cohort?

Some ideas:

  • Create iframes outside of the viewport and make them navigate many times to specific websites.
  • Create a top-level document far away from the user's attention using w = window.open(); w.resizeTo(w,h); w.moveTo(x,y), then make it navigate many times once the popup has been put into the background.
  • Periodically modify the full path of the URL using same-document navigations via the history API (see the sketch after this list).
  • Periodically modify the content of the page, preferably in locations the user can't see.
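As a minimal sketch of the third idea (the paths and interval are made up for illustration):

// Hypothetical sketch: repeatedly rewrite the URL path via same-document
// navigations, attempting to pollute the browser's local input features
// without any visible page load.
fakePaths = ["/motorcycles", "/coffee-shops", "/classical-music"];
i = 0;
setInterval(() => {
  history.replaceState(null, "", fakePaths[i++ % fakePaths.length]);
}, 1000);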

It would be worth documenting what is put in place to prevent this in practice.
+@arturjanc FYI.

User transparency and control

This is a fascinating proposal which offers a compelling opportunity to deliver relevant ads to users without leaking mounds of unique data about the user. That said, I'd strongly encourage this proposal to include an interface for users to see and control which clusters they're in.

Not only could this give users transparency into their segmentation and address the concerns over “sensitive segments”; it would also increase the opportunity for users to interact with relevant ads (without being creepy).

Going deeper: FLoC Index, Governance & Business logic

The purpose of this thread is to further define the universe of available FLoCs, how they are assigned based upon user behavior, and the level of transparency made available to websites (and ultimately, the ad ecosystem).

FLoC Index
Given the centralization of FLoC assignment within the browser, there is expected to be a finite, standardized set of FLoC values, i.e., the FLoC Index. It seems reasonable to begin with the IAB Taxonomy, which includes over 600 content/interest categories with up to 4 levels of hierarchy.

FLoC Governance
It is assumed that the FLoC Index will be modified over time; new FLoCs may be requested by the business community, and privacy advocates may request that certain FLoCs be removed (for example, health-related labels). Maintenance of the FLoC Index is expected to be governed by the browser and could be a new component within browser version updates. Browsers will need to allow for ongoing consumer and business community dialog related to FLoC Index changes.

Business logic
Apart from the FLoC Index, the specifics below need further definition:

  • How many FLoCs can a user belong to at a moment in time?
  • How many FLoCs can a user belong to over a period of time (e.g., 30 days)?
  • How many behaviors or page-view events are required for a user to be assigned to a specific FLoC? Is page engagement (dwell time, scroll depth) also considered in the assignment?
  • Will the time period or recency of a FLoC assignment to a user be exposed?

Granular vs. similar flocks

Hello,

Not sure whether this has been mentioned already; apologies if so.

Should we expect groups of people to be partitioned into as many flocks as possible (down to the eligibility limit), or only until there is enough similarity between them?

For the sake of example, let us say that the minimum flock size is 1,000. Now, if you have 2,000,000 people visiting one and only one website (website.example), will they be split into 2,000 different flocks with 2,000 different flock_ids, or will they all be in the same big flock with the same flock_id?

The two cases have definitely different pros and cons...

The 'similar enough' flocks will lead to bigger groups, but also mean that it is easier to reverse engineer a flock, and from it a full population that a malevolent party could try to influence one way or another (one could visit a website or engage in a particular browsing pattern, observe the flock_id, and use it to target all people doing the same thing, or physically extract flock_ids from devices during investigations, etc.).

The 'granular' flocks could be seen as too granular and, crossed with other features, could de facto become an identifier, but at least they are more robust to attacks like those suggested above. They could also reassure publishers that FLoC does not amount to a 'leakage' of their contextual data.

This proposal makes false claims about the privacy properties provided by the anonymization techniques used

Useful by collecting people with similar enough interests and producing labels suitable for machine learning, and private by forming large clusters that don't reveal information that's too personal, when the clusters are created, or when they are used.

The constraint that a user’s attribute/interest is only revealed when it is part of a sufficiently large set of users (i.e., k-anonymity) helps address the risk of user re-identification. In terms of this proposal this would mean it is difficult to determine the identity of the user given their FLoC Key.

This is a very different property from the privacy property this proposal claims to provide, i.e., preventing the exposure of information that is “too personal”. How personal a piece of information is does not depend on the number of people who share that attribute; k-anonymity does nothing to provide this property.

To give a concrete example: more than 30 million Americans (~10% of the population) suffer from diabetes. While we’re very unlikely to re-identify any of those users based solely on the knowledge that they have diabetes, I suspect we can agree that nearly every individual in this group would not want this information used by advertisers.

The number of flocks should be small, to reinforce that they cannot carry detailed information — short flock names ("43A7") can help make that clear.

Similar to the statement above, this is a misapplication of k-anonymity. A limit on the space of possible FLoC values does not provide the guarantee that the browser isn’t exposing “detailed information”. To provide another concrete example: the space of possible sexual preferences of a user is small, but would presumably be considered “detailed” by the majority of those users. The same goes for income level, age range, major health conditions, and so on.

What about authenticated traffic and Sec-CH-Flock, isn't there a privacy issue?

I am wondering if there is a privacy issue with combining authenticated traffic and the FLoC header/client hint.

Let's take a concrete example: site example.com provides some content that web users are consuming. Let's say some of them are subscribers of example.com (authenticated users). In this scenario, example.com can store an email (or login ID) alongside a Flock-ID, and start populating a database of email addresses/PII attached to Flock-IDs.
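As a minimal sketch of this concern (assuming an Express-style server; the Sec-CH-Flock header name is taken from this thread's title, and the db helper is hypothetical):

const express = require("express");
const app = express();
app.use(express.urlencoded({ extended: false }));

app.post("/login", (req, res) => {
  const flockId = req.get("Sec-CH-Flock"); // cohort from the client hint
  const email = req.body.email;            // PII from authentication
  // Joining stable PII with the cohort signal builds exactly the
  // email -> Flock-ID database described above.
  db.insert({ email, flockId, seenAt: Date.now() }); // db is a hypothetical store
  res.sendStatus(204);
});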

I have the feeling this creates a serious privacy issue.
I may be missing something, so feel free to correct me if I am.

Thanks.

Finding relevant flocks for a given intent, ensuring a level playing field

Hi,

As mentioned in #27:

If an advertiser wants to target "/Food & Dining/Coffee Shop Regulars", then an ad buying platform will need to have some way of deciding which flocks are good enough matches for that intent.

For most ad buying platforms, this will be done by looking at users' FLoC IDs on various websites, or by running trials to see which FLoC IDs bring the best performance for that task. Observing web traffic can only be done on a subset of sites, and exploring the FLoC ID space is potentially a huge endeavor, so results are likely to be sub-optimal and to lag behind any change.

I'm probably stating the obvious, but is it fair to say that Google, as an ad buying platform, will use the same methods available to other ad buying platforms to decide which flocks are good enough matches for an intent, and will not benefit from having access to the FLoC source data (through Chrome browsing-history sync) or the FLoC clustering algorithm?

Defining the level playing field | Google as a third party

While we have had a lot of lively discussion around the cohort mechanism, gatekeeper(s) or the lack of them, and the technical/theoretical aspects of these APIs in terms of privacy, it seems about time to start a conversation about the legal and user-facing aspects, and ultimately how the level playing field should look. This is a bit of a longer post to provide an initial framing.

It is certainly not the most beloved topic for engineers (meaning the formalities), but with the limited time available to me there is an urgency to also prototype these aspects with the first APIs. With https://github.com/WICG/WebID we have already been in rather in-depth discussion for a couple of weeks around this, but I would argue that we should now also start with the advertising-related APIs, as we have enough information about at least one of them.

With FloCs:

  1. being the simplest API from a functionality perspective
  2. having prototype implementations making their way into the Chromium source code https://chromium-review.googlesource.com/q/FloC

it would serve well to discuss these topics as the first example. I'm referring to the GDPR in the following, since I guess we can agree it's the most advanced regulation in this regard and the one around which we have the most policy experience in the market.

Why is the legal framing important?

The FloC mechanism works by

  1. Calculating cohorts out of the user's browsing history, that is, personal user data, which makes the processing subject to the GDPR: one needs to think about the purposes and extent of that processing, the legal ground for it (as one can see in the Chromium commits), and ultimately who is formally controlling and responsible for it.
  2. These cohorts are used to address users with interest-based advertising and are therefore shared between parties (for example, with an advertiser) which have different relations to, and knowledge about, that user (i.e., there is a relation to other personal data processing in addition).

Given that

  1. The FloC function is not a self-contained browser function like a password safe, where one could argue it is just a product feature used by the user on their own behalf and for their own benefit; it enables personal data processing beyond that.
  2. From a user's/DPA's perspective, the responsible party will be the publisher where FloC-based ads are shown and browsing history is collected, not the browser and not the advertiser.
  3. Publishers will need answers to these questions. Even if FloC IDs might theoretically be anonymous by themselves, that does not change this observation, as it relates to the full extent of the processing and leads to the display of interest-based ads to the user.

The level playing field

Unrelated to the legal framing for the processing, publishers have a reasonable demand for 100% clarity on how these APIs are and will be entangled with other Google services, with examples like the iOS 14 changes coming up and the ongoing antitrust investigations around the globe on bundling services. For now, we have a high-level alignment to establish a level playing field; with FloC we can really define it now.

Looking at the commits for the prototype, it looks like:

  1. For now, the user controls and legal grounds are bound to Google services and privacy policies; practically, it is fully bound to Google services for the PoC. That makes sense to me, given that one could not otherwise even run the PoC with real users, but it naturally raises the concern of whether it will stay that way or be removed down the road.

"Queries google to find out if user has enabled 'web and app " "activity' and 'ad personalization', and if the account type is "NOT a child account.'

  2. Secondly, it seems that FloC IDs are also synchronized to Google backend services, which again seems understandable for a PoC, but raises similar concerns.

It's a service that is supposed to (as some functions are incomplete) regularly compute the floc id by sim hashing the navigation history and log it to chrome sync

Looking Forward:

Once the FloC API is to be used to actually address users with personalised ads, one needs to answer at least these questions:

  • What is the independent legal framing for addressing a user based on FloC? To me it seems that there needs to be a means for a publisher to offer the user control, and most probably even consent, to enable this. Within TCF that would most probably be Purposes 3 and 4 (for the publisher); vendors would be less relevant here.
  • What are the UI components, and who operates them? TCF within the browser, the browser accepting a publisher consent signal, bespoke UIs?
  • Whether and how Google services are de-coupled.

My suggestion would be to also prototype these questions, separately from the engineering aspects, to get publishers and advertisers more engaged and comfortable with these APIs and the general process.
