Code Monkey home page Code Monkey logo

Comments (38)

Zegnat avatar Zegnat commented on June 15, 2024 6

I came upon SponsorBlock today for the first time and a browser-around let me to this issue. As I personally like to worry about privacy and would not want to send every single video ID to a SponsorBlockServer instance (unless maybe I was operating one within my own network). So I considered looking into this and maybe doing a PR. But first I would like to share some thoughts.

To start I would like to point at an implementation of k-anonimity for data queries that pre-dates Pwned Passwords: the “Safe Browsing” (Chrome) and “Block dangerous and deceptive content” (Firefox) systems. Information about it is a bit spread around, but some good resources are: the API Documentation (look for “Update API”), more links collected on the Mozilla Wiki, and a study by Gerbet et al.

I think this is a good implementation to keep in the back of our heads when continueing the discussion because of the many parallels there are. Safe Browsing is set-up to facilitate a client (web browser) wanting to check a centralised repository (Google’s collection of mallicious links) for something about a resource (website) without telling the repository about every resource the client uses. Now make the client a video player, the repository SponsorBlockServer, and the resource a YouTube video, and we are in the exact same situation.

Safe Browsing also has a number of points that address some arguments raised by this discussion. You do not need to study it before reading on, just do not be surprised if I reference it again (and again, and again) 😄

I will consider the threat vector as: the SponsorBlockServer will get to know every video the user watches. Leave other discussions for later. (If we do not trust the network, we may need obfuscation techniques like padding. If we do not trust the client … well that is a separate can of worms.)

Hashing

I would say hashing is absolutely required. It is the only way to ensure no actual information about the video ID is shared with SponsorBlockServer. Even part of a video ID is still information. Further more, are we sure YouTube video IDs are randomly distributed throughout the possible ID space? Is a - just as likely to be part of an ID as an a? I do not have a big enough dataset to run this analysis and I do not think we need to do so at all. Hashing means we eliminate that problem.

It does not really matter what hashing algorithm is chosen. As long as it upholds the default assumptions for a hash we can use it to map both value spaces to eachother: the amount of IDs known to SponsorBlockServer to the amount of IDs that can possibly exist in YouTube.

Pwned Password uses SHA-1. But Troy has also explained that he is not really hashing to seriously protect the data. Only to make sure “any personal info” like email addresses “in the source data is obfuscated”. People were also “cracking the passwords for fun” and succeeding at this. (Multiple discussions have happened around this, quotes taken from a February 2018 blog post.)

I am honestly not sure it matters much for prefixes in our specific case, but Safe Browsing uses SHA-256 and as I said might be closer to our flow. I also do not expect SponsorBlock clients to exist on many platforms that would not have access to SHA-256 primitives.

So let us say we hash every video identifier with a SHA-256 function. Let us also assume that storage and querying the database is trivial. (Because any debate about TEXT vs BLOB and optimisation of WHERE statements can be had when actual benchmarking can be done.)

What to hash

This is something that was only very shortly touched on in the discussion here by @phiresky:

[…] it allows for adding other websites later (e.g. hash whole normalized URL which is probably a better idea anyways) and if youtube id algo changes.

This is a step that once again exists in the Safe Browsing API. They do URL canonicalisation. However my recommendation would be to not do that. In my experience anything close to trying to normalise a URL leads to differences between clients. As a quick note: RFC 3986 includes steps for “Normalization”, the WHATWG URL standard includes steps for “equivalence”, Safe Browsing has steps for “Canonicalization”. If you are unlucky the language you are writing your client in will have access to anything from 0 to a 100 implementations of any of these.

Not only that, you would also end up having to specify additional steps per platform. YouTube has lots of possible URLs pointing at the same video. Sometimes from entirely different domains (youtu.be anyone?).

Instead stick to clear identifiers supplied by the platforms. The YouTube video ID is much less ambiguous than any URL will be. It is also likely that a client already has logic to extract the video ID from a bunch of different sources. Letting them work with that is better.

If we really value compatibility with other platforms simply generating a URI of our own suffices. Instruct clients to hash yt: followed by the YouTube video ID. That would allow simple future expansion without messing with URLs.

Hash prefix length

Setting the length of the prefix is a balancing act. The longer the prefix, the less privacy is given. This goes the full range from 0 to 256 bits (in the case of SHA-256).

If the client asks for a list of possibilities with a 0 bit long prefix it is really just asking for the entire database. This gives complete privacy to the user as the server gets no clues about what video was being accessed. But of course it is a bit annoying for all parties involved as the client would constantly request lots and lots of unneccessary data.

If the clients asks for a list of possibilities with a 256 bit long prefix it is really just sending the full hash to the server. This gives almost¹ no privacy to the user as the server could match this hash 1:1 with a video ID to know exactly what was being watched. So this is no win for the user.

The number we are aiming for is somewhere in the middle. I am however not convinced by @phiresky’s calculation of the prefix length. The length there seems to be argued from the data perspective and not from the user perspective. I do not even think there is any privacy difference depending on how many results the server returns. There is no relation I can see between number of results and the threat vector of not wanting to tell the server about what video I am watching.

From the server perspective I do not think we care. There is already an endpoint that allows the client to submit the exact video ID, so obviously we are OK with the 256 bit variant. The database is not secret and infact readily downloadable, so we are also OK with the 0 bit variant.

The only argument from the server side I can think of is to protect against a number of DoS problems. When the database grows, it may not always be OK to support continuous downloads of the entire thing. Server instances may want to scale up the minimum hash length from 0 to a more managable number in accordance with total database growth. Maybe in such a way that response sizes stay under a certain number of bytes.

From the client perspective I think we want to send as little information as possible. This is where the user is and where we care about the privacy. But here are also a number of platform limitations that need to be considered. Like @ajayyy mentioned before about web browsers, some platforms may put limitations on storage. Other platforms may be on limited bandwidth.

So for a first implementation I was thinking: why not leave it completely up to the client what prefix length it sends and only pick a bottom limit for the server?

We have 133417 sponsorTimes in the latest DB (just downloaded from the mirror). If we aim to never have an API response include more than a 1000 we need a minimum length that cuts our hash space in approximately 134 blocks. That is just the first 8 bits of the hash (or the first 2 hexadecimal characters).²

If the client feels this is problematic because of storage, memory, or bandwidth concerns, it can opt to send a longer prefix whilst sacrificing some of its user’s privacy.

Future of prefixes

The Safe Browsing API once again has an interesting system. The client can download a list of prefixes straight from the server. This is interesting for a number of reasons:

  1. If the client encounters a resource where the hash prefix is not in the list at all, it does not need to make any requests. 100% privacy!
  2. The server can make clear to the client exactly what prefixes it will respond to. If a certain 2 bit prefix only includes 10 videos, it does not need to split this up further. If another 8 bit prefix includes 10000 videos maybe it helps to only have 10 bit prefixes defined there.

Now this of course assumes the client has a way to store this list of prefixes and implements logic to keep it up to date. Therefor I am not entirely sure it makes sense to make this a must from day one.

Pull request

Like I said at the start, I am considering looking into this and getting started on a pull request. In the mean time I would love to hear everyone’s input on my thoughts. Whether you think I am completely missing the mark, right on it, or anywhere in between. (Shall we rate it 0 to 256 bits? 😉)

You will also find me in the SponsorBlock Discord as Zegnat if you would like to discuss any of the above with me outside of this issue.


¹: almost because if the video ID was not already known to the list, the hash needs to be reversed
²: this is just math. In reality we may have some uneven distributions through sheer randomness of the data. As our full dataset is still relatively small it would be simple to run the numbers and pick a good cut-off point.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024 4

from sponsorblockserver.

hatzel avatar hatzel commented on June 15, 2024 4

I love that you are tackling privacy issues! I would, however, like to raise one concern regarding the effectiveness of this specific approach. I am not very active/experienced in the field of privacy preserving technology so take this with a grain of salt but I am fairly certain the base idea holds merit.

Some assumptions I'll just make explicit:

  • Attack scenario: the server want's to know which videos a user (as defined by an IP) is watching
  • For simplicity: a user has a fixed/static and unique IP address

Obviously hashing doesn't prevent the server from getting the set of videos that were requested. The hash is only a way of specifying a random set of videos, including the one at interest.
What the server will get from an extended video watching session is something like this (assuming k=3):

Access Time 9:08 9:21 9:35
Video 1 id <video_id> <video_id> video_id
Video 2 id <video_id> <video_id> video_id
Video 3 id <video_id> <video_id> video_id

There are many scenarios in which it is easy to identify which sequence of videos is the correct one. For example, if part 1 and part 2 of a series are in columns 1 & 2 it is very unlikely that those were chosen based on random chance. Similarly if a request for a new video is made after 14 minutes when only one video in the previous column was 14 minutes in length there is an increased likelihood of that video being the one that is watched.
More formally this can be modelled as a sequence of conditional probabilities, i.e. what's the probability of 0we7kcmgDOw following yNEWkY9D2k4. Maximizing that probability sequence would end up with the most likely sequence of videos that a user has watched.

From a practical perspective I see that it could be argued that this is still an improvement in privacy, even if an attack like this is sometimes possible (after all it still requires a lot of effort). I would however also like to at least mention the counterpoint that implementing such a scheme could give a false confidence in the privacy afforded by this service.

Keep in mind that human intuition on probability can be very misleading so maybe this attack quickly becomes infeasible. Some actual calculations might be required to assess the feasibility of this approach.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024 4

#127

from sponsorblockserver.

girst avatar girst commented on June 15, 2024 2

from sponsorblockserver.

girst avatar girst commented on June 15, 2024 1

from sponsorblockserver.

phiresky avatar phiresky commented on June 15, 2024 1

Currently, there are 25k entries in the DB. this means that on average log2(25k) = 15 bit are needed to uniquely identify one video. If we want every request to pull e.g. 16 videos for privacy, then we need to query by a prefix of log2(25k) - log2(16) = 11 bits.

youtube video ids are base64 strings, so they have 6 bit per character. Hashing the id here doesn't matter regarding privacy I think but even though youtube video ids are pretty obviously random, it's probably still better to hash it to be safe of the distribution. Also it allows for adding other websites later (e.g. hash whole normalized URL which is probably a better idea anyways) and if youtube id algo changes.

I wouldn't use BLOB in sqlite for this since it's harder to filter by, better use hex strings.

Hex strings are log2(16)=4 bit per character.

So right now it would be pretty easy e.g. add a videoID_sha256 column then do queries by the first three characters of the hash (12bits) which would return on average 8 videos.

sqlite3 is able to use indices for prefix GLOB queries (select where videoID_sha256 GLOB 'a0e*'), so the performance impact on the server would be minimal.

For each doubling of the database size, we in theory want to increase the length of the prefix we are looking for by one bit. If we go by hex character prefixes this isn't really possible since we can only increase the prefix every 4 bits or every 16xing of db size.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024 1

@devnoname120

ajayyy/SponsorBlock#189 (comment)

from sponsorblockserver.

girst avatar girst commented on June 15, 2024 1

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

Technically, no new endpoint would be required. if the input has a hash_prefix key, query by sha1 prefix, if it has a videoID key, by video ID.

Storing hashes as binary blobs instead of strings is likely more efficient too; see for example https://stackoverflow.com/a/24011877

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Smart, I don't know how well this will work with how little videoIDs are in this DB, but it definitely could work.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Ah, true. I think a blob could work, since it could be converted from a string to blob (base 64 or something)

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

One thing to note about this is is could be quite expensive, since it doesn't just send all sponsors for one video, it uses the weighted random generator.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Go for it! I think putting setting it up to start putting sha1sums is a good start, then we can add the rest after. Is there a reason to use sha1 over sha256?

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

That makes sense. So really, the only disadvantage is more data throughput and more calculations server side.

from sponsorblockserver.

afontenot avatar afontenot commented on June 15, 2024

I was going to file an issue suggesting exactly this, but I have one issue: I don't think there are likely enough videos submitted to make this sufficiently anonymous. It seems to me, that unless and until the database gets huge, the best thing to do would be to just have each client download the whole database, and maintain it locally. It's currently only ~3MB, and I expect it will stay at a manageable size for the foreseeable future.

To update (probably done every time a video is loaded), you wouldn't need to download the whole database again, just add an API endpoint for getting all entries since timestamp.

It's not 100% a fair comparison, but this is what other ad blocking software does: you don't send every URL you visit to a remote server, you download and sync a list that tells you what to block.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Is there an easy way to have a database in the browser? I was thinking of setting up a variant of the server that does what you are describing, and then making it so the extension can pull from that local server instead.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Yea, the other issue is that localStorage does have a very low max storage usage, so it would not scale well.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

I think an easier way would be to use an all lower case search of the video IDs (not just the first letters, a random subset of letters). The seem random enough, and when you use 3 characters, you usually get a few results.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Isn't that the same probability, just that there is more precision with more characters? If someone requests for "sh3", their video could be anything that contains "Sh3", "sH3", etc.

Maybe I am just misunderstanding.

from sponsorblockserver.

phiresky avatar phiresky commented on June 15, 2024

Alternatively to using sqlite3 GLOB it would be possible to filter by an exact number of prefix bits by

(a) just storing the sha hash prefix in binary strings (pretty easy since we'll never need more than 32 prefix bit characters which would require 32bytes of storage per entry in this case)

(b) doing range queries on the hex strings: Convert the sha256 to a bigint, set all bits apart from the first or last 11 to zero (=lower bound) and then to 1 (=upper bound), e.g. with lowerbound = shaint & ((1<<11) - 1) and upperbound = lowerbound + (1 << 11), and then doing the query as
select * from sponsorTimes where videoID_sha256 >= lowerbound and videoID_sha256 < upperbound (abusing alphabetic text ordering here)

Probably neither of these is worth it and just filtering by sha256 prefix is fine, especially since many videos we're looking for aren't even in the result set so just using the first 12 bits is fine until we switch to 16 bit when db size has increased to like 128k entries.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

I think 16 is too much. Maybe there can be an option to do that, but I think the default should be lower.

from sponsorblockserver.

devnoname120 avatar devnoname120 commented on June 15, 2024

Note that this wouldn't be able to fix the call to invidio.us:
https://github.com/ajayyy/SponsorBlock/blob/1abc1b9b28d59d2020d26810156cbdfbe3e8f7fa/content.js#L441

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Invidious API usage has been removed.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Sounds great!

I'm going to repeat some things I've already said on discord to have them here more publicly.

What to hash

I think hashing just the video id is fine for now. I think the service should be added as a separate column (when needed). The service that they are getting info from isn't a large privacy concern and this would more easily allow it to be split to a different database or server in the future depending on growth.

Prefixing

Like @girst , I think there should be minimums, but allowing it to be changed client-side isn't a bad idea.

Caching

I want to implement some form of caching in the future but ideally this will hold the actual data instead of just hashes. This would only apply to videos older than some set time period.

The issue with setting up caching is that fetching the upload time actually takes time. This means that sponsor fetching would be slower for every video as it would fetch the upload time, wait for a response, then fetch times. This is why the current implementation of whitelisting sometimes misses zero second sponsors (though in the latest beta I have added an option to wait before fetching to resolve this issue in exchange for slower sponsor fetching).

I feel that caching can be dealt with separately from this.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Still have to test performance before enabling globally, so it will be just an option for now. I might try enabling it for a certain percentage of all requests (50%) by default and see how it goes.

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

@girst Yes, it's about the addon. I have noticed that the public database size doubles with this change, so I am going to look into splitting this into a separate database file. It might not be live for a bit.

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

@girst The API is now live on the main server

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

ajayyy avatar ajayyy commented on June 15, 2024

Sorry about that, should be good now

from sponsorblockserver.

girst avatar girst commented on June 15, 2024

from sponsorblockserver.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.