
SimHash may leak information about aggregate traffic of specific publishers #90

bcyphers opened this issue Apr 2, 2021 · 8 comments


@bcyphers

bcyphers commented Apr 2, 2021

Short version

Since SimHash floc IDs are derived from sums of vectors that correspond to individual domains, I think this version of floc could let large actors estimate the traffic volume and aggregate demographics of visitors to other websites.
Related to #41 and #45, but I haven't seen this particular attack described yet.

Disclaimer: I am not a Chrome developer or a mathematician. If one of my assumptions here is off, please let me know!

Long version

The experimental version of FLoC uses SimHash, which is a deterministic mapping of browsing history -> floc ID. One of the project goals is to prevent sites/trackers from learning too much about any individual's browsing history. It should be impossible to use a single floc ID to determine with high likelihood whether a user visited a particular site. (longitudinal privacy is different, but leave that aside.)

But each floc ID will carry some information about the sites that are likely to make it up.

As best I can tell from here, SimHash in floc works like this (a rough Python sketch follows the list):

  • Each domain is hashed (deterministically) into a vector of Gaussian random variables. So example.com might map to <0.51, -1.21, 0.98, ... >.
  • The domains in a user's recent history are all hashed and summed into a single vector. If the user has visited N domains, and domain d_i maps to vector x_i, then the user's full floc vector is x_1 + x_2 + ... + x_N.
  • Finally, the summed vector is mapped to a bit vector - the floc ID. Any negative elements of the vector become 0, and positive elements become 1.
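
A minimal sketch of that scheme (not Chrome's actual implementation; the 8-bit ID length, the hash-to-seed step, and the RNG are my assumptions):

```python
# Toy SimHash as described above. Each domain deterministically maps to a
# Gaussian vector; a user's floc ID is the sign pattern of the summed vectors.
import hashlib
import numpy as np

DIM = 8  # number of bits in the floc ID (assumed)

def domain_vector(domain: str) -> np.ndarray:
    """Deterministically hash a domain to a vector of Gaussian samples."""
    seed = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).standard_normal(DIM)

def floc_id(history: list[str]) -> str:
    """Sum the per-domain vectors, then keep only the sign of each element."""
    total = sum(domain_vector(d) for d in history)
    return "".join("1" if x > 0 else "0" for x in total)

print(domain_vector("example.com"))  # e.g. <0.51, -1.21, 0.98, ...>
print(floc_id(["example.com", "news.example", "shop.example"]))
```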

At a high level, each site has its own floc vector. A user's floc vector is the sum of the floc vectors of all the sites they've visited, and the floc ID is a coarser version of that. You could also say each site has its own floc ID.

For each bit in a user's floc ID, and for each site they visited, there is a higher-than-50% probability that the bit in their floc ID matches the bit in the site's ID. For example, if you know a user visited a site with the 4-bit floc ID 1111, without knowing what else they visited, you know each bit in their floc ID is (slightly) more likely than not to be 1. Some sites might even have dramatic floc vectors -- with several vector values more than a couple standard deviations away from 0 -- which will have a higher impact on user floc IDs.
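
To sanity-check that claim, here's a quick simulation under the toy model sketched above (the 8-bit ID and the 10-domain history size are arbitrary assumptions):

```python
# Monte Carlo check: if a user visited site A, each bit of the user's floc ID
# should match the corresponding bit of A's own floc ID more than half the
# time. Toy model only; dimensions and history size are assumptions.
import numpy as np

DIM, N_OTHER, TRIALS = 8, 9, 100_000
rng = np.random.default_rng(0)
matches = 0
for _ in range(TRIALS):
    site_a = rng.standard_normal(DIM)                 # site A's floc vector
    rest = rng.standard_normal((N_OTHER, DIM)).sum(axis=0)
    user = site_a + rest                              # visited A plus 9 others
    matches += np.sum((user > 0) == (site_a > 0))
print(matches / (TRIALS * DIM))  # ~0.60 in this toy setup, i.e. above 0.5
```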

Now suppose you're the admin of a large site, and you see millions of floc IDs per day. You want to estimate how many of your readers also visit competitor.example. You might have an idea of competitor.example's traffic from a source like Alexa, which can serve as your prior belief.

Each floc ID you observe lets you perform a Bayesian update on your prior belief about how your readership overlaps with competitor.example. Say floc ID 11011 is slightly more likely than average to contain competitor.example, while ID 01100 is slightly less likely than average. Seeing a 11011 will boost your estimate of competitor.example's traffic, and an 01100 will deflate it. Each ID carries very little information, but millions of them could give you a pretty accurate idea of a specific site's volume.
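
In code, each observation is one likelihood-ratio step on the odds. A minimal sketch, assuming the attacker has somehow estimated per-ID likelihoods (all the numbers below are made up):

```python
# One Bayesian update per observed floc ID, as described above.
# p_visit / p_no_visit give P(floc ID | reader visits competitor.example)
# and P(floc ID | reader does not); estimating these is the hard part.
def update(belief: float, floc: str, p_visit: dict, p_no_visit: dict) -> float:
    """Posterior odds = prior odds * likelihood ratio, then back to probability."""
    odds = belief / (1 - belief)
    odds *= p_visit[floc] / p_no_visit[floc]
    return odds / (1 + odds)

p_visit = {"11011": 0.0011, "01100": 0.0009}     # over- / under-represented
p_no_visit = {"11011": 0.0010, "01100": 0.0010}

belief = 0.05  # prior: 5% of my readers also visit competitor.example
for observed in ["11011", "01100", "11011"]:
    belief = update(belief, observed, p_visit, p_no_visit)
print(belief)  # nudged slightly upward after two boosts and one deflation
```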

If this works, you could also segment your own readership to figure out cross-traffic to competitor.example among different demographics. For example, U.S. readers of your site might be twice as likely to visit your competitor as other nationalities.

This would leak information about visitorship of all sites that are included in floc calculations. You could run experiments to find out just how accurate this method would be -- maybe it's so fuzzy as to be useless, but I think it's worth looking into.

This will also be a more valuable tool for actors who observe lots of traffic in lots of different contexts. Since floc only uses information about top-level frame navigations, it will only leak information about first-party traffic. Websites that don't own ad networks will reveal information about their traffic, while actors that receive lots of third-party requests will learn more information than they expose about themselves.

@michaelkleber
Collaborator

Hi Bennett, thanks for writing this out, it's an interesting question. We posted some details of the locality-sensitive hash we're using in the first Origin Trial here: https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc.

A couple of observed statistics seem potentially relevant to the attack you're considering. First, FYI each cohort is defined by an LSH prefix whose length is between 13 and 20 bits. Second, as you already knew, each cohort has at least 2000 people in it (forced by the clustering design). But also noteworthy is that each cohort has at least 735 different sets of domains mapping to it. This number is measured, not forced by the clustering, but it gives a sense of how much hash collision is going on.

Of course you're right that a Bayesian could write down a model that used the revealed bits to update priors. My intuition is that there is very little signal left after disentangling competitor.example from the many other likely-correlated contributions in the hundreds of browsing histories in even a single flock. And of course any single other domain that's correlated with either yours or competitor.example's would introduce bias — and removing that bias would require knowing an answer to the very problem that your attack was about trying to answer in the first place. (The "among different demographics" part seems harder to believe, though — wouldn't those also correlate with new sets of noise for each slice?)

Your conclusion "maybe it's so fuzzy as to be useless, but I think it's worth looking into" seems plausible; I would be happy to read an analysis.

But overall, I suspect that you would get much better data by running a survey where you ask 1% of your users "What do you think of competitor.example?"

@bcyphers
Author

bcyphers commented Apr 2, 2021

Thanks for the reply. I'll try to game this out a little more and post something that looks more like code -- I also think there's a good chance that this wouldn't be particularly useful.

The "among different demographics" part seems harder to believe, though — wouldn't those also correlate with new sets of noise for each slice?

I'm assuming that the site doing the learning is using something other than floc to get demographics. Say you're Facebook and you already know the self-reported gender of each user. You can easily segment the whole population into M/F/other and run this analysis on each segment individually.

> But overall, I suspect that you would get much better data by running a survey where you ask 1% of your users "What do you think of competitor.example?"

You're probably right. But if this works, it's silent and free!

@millengustavo

> Hi Bennett, thanks for writing this out, it's an interesting question. We posted some details of the locality-sensitive hash we're using in the first Origin Trial here: https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc.

Hi @michaelkleber, could you please tell me where I can find the details of the chrome.1.1 algorithm?
I'm testing the FLoC demo using the https://floc.glitch.me/ instructions, and it's telling me that

> This browser's FLoC cohort version is... chrome.1.1.

Based on my tests, this version works differently from chrome.2.1; for example, I noticed that my browser received a FLoC ID after visiting only one website (from your link: "An individual browser instance's cohort is filtered if the inputs to the cohort id calculation has fewer than seven domain names.")

Is there any way to test version chrome.2.1?

@michaelkleber
Collaborator

@millengustavo What did you do to get a "chrome.1.1" flock version string? Is this based on starting up your own browser with some specialized set of command-line flags? I don't believe there is any such thing as chrome.1.1 in any experiment that Chrome is running.

If you want to start up your own browser in a way that filters cohorts until you've been on seven domains, you could change your command-line flags to include the string minimum_history_domain_size_required/7 instead of the same string ending with /1 as mentioned on floc.glitch.me. But you're still not turning on the same collection of settings that are used by the full experiment — which is good, because otherwise your flock would only be computed every 7 days, for example, which makes testing hard.
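
For reference, the adjusted feature string (with the caveat above) would look something like this:

```
--enable-features="FederatedLearningOfCohorts:update_interval/10s/minimum_history_domain_size_required/7,FlocIdSortingLshBasedComputation,InterestCohortFeaturePolicy"
```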

@millengustavo

> @millengustavo What did you do to get a "chrome.1.1" flock version string? Is this based on starting up your own browser with some specialized set of command-line flags? I don't believe there is any such thing as chrome.1.1 in any experiment that Chrome is running.

I just followed the instructions on floc.glitch.me, running Canary from the terminal with these flags:

```
--enable-blink-features=InterestCohortAPI
--enable-features="FederatedLearningOfCohorts:update_interval/10s/minimum_history_domain_size_required/1,FlocIdSortingLshBasedComputation,InterestCohortFeaturePolicy"
```

After visiting any domain, "chrome.1.1" appears as my browser's FLoC cohort version.

> If you want to start up your own browser in a way that filters cohorts until you've been on seven domains, you could change your command-line flags to include the string minimum_history_domain_size_required/7 instead of the same string ending with /1 as mentioned on floc.glitch.me. But you're still not turning on the same collection of settings that are used by the full experiment — which is good, because otherwise your flock would only be computed every 7 days, for example, which makes testing hard.

Oh, that makes sense, thanks.

@xyaoinum
Collaborator

xyaoinum commented Apr 6, 2021

Just to add more clarification: the version is another parameter that is configurable server-side when we roll out a new configuration, and we will use a distinct version for each new configuration.

To see the correct version during testing, you can specify it on the command line under the same FederatedLearningOfCohorts feature, like:
`"FederatedLearningOfCohorts:finch_config_version/2"`.

@TheMaskMaker

Sounds related to #100

@riking

riking commented Apr 28, 2021

I'd like to point out that for any traffic statistics that might be exposed through this, the browser vendor (Chrome/Google) already has access to them. This scheme is inherently extracting a pattern from the history training data, and therefore that pattern must already exist in the training data.
