

Anzen (安全): Google Safe Browsing API

Anzen (安全) is a web service with a rudimentary implementation of the Google Safe Browsing API to detect and block access to unsafe domain names and URLs known to distribute malware. A web application firewall sitting in front of the website passes each HTTP request to the URL threat analyzer and denies the connection if the website is unsafe.

(Architecture diagram)

Workflow: Safe URL

  1. Client sends an HTTP request to download example.com/blog/hello-world;
  2. The HTTP proxy takes the request and sends the URL to ARIADOS for threat analysis;
  3. ARIADOS runs some basic checks on the URL before proceeding with the analysis;
  4. URL is hashed using SHA256: ca790f6428ecb8390cb8b6957c9c20c0a6ef0a259061135e8adcd05e68ad23b0;
  5. Check the URL hash against the Cuckoo Filter; if it is not found, the URL is safe;
  6. If the URL hash was found (it may be a false positive), check if it is in the LRU cache; if so, return the Threat Info associated with it. Note: I didn’t invest time implementing this step; adding an LRU cache to the project is quite trivial, so maybe later, once I finish other, more relevant things;
  7. If the URL hash is not in the LRU cache, construct the URL for the server pool using the first character of the hash sum, in this case c, like this: threat-info-{SERVERPREFIX}.Hostname (see the sketch after this list). Because the SHA256 sum is represented in hexadecimal, we will have at least sixteen (16) servers to look up the threat info. If we configure the service to select two characters, we will have a server pool of 256 servers. If we select all sixty-four (64) characters, we will have 16^64 = 2^256 ≈ 1.16 × 10^77 servers (I don’t even know how to read a number that big);
  8. DUGTRIO-{SERVERPREFIX} looks up the URL hash in the persistent storage (Redis, for example) and returns the Threat Info if available; otherwise it returns {Threat:NONE}, which means the URL appears to be safe.
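Here is a minimal Go sketch of steps 4 and 7, hashing the URL with SHA256 and deriving the lookup server from the first characters of the digest. The function and host names are hypothetical, not the ones used in this project:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// threatServerURL hashes a URL with SHA256 (step 4) and derives the
// lookup server from the first prefixLen hex characters of the digest
// (step 7). Both names here are illustrative.
func threatServerURL(rawURL string, prefixLen int, hostname string) (string, string) {
    sum := sha256.Sum256([]byte(rawURL))
    hash := hex.EncodeToString(sum[:])
    return hash, fmt.Sprintf("threat-info-%s.%s", hash[:prefixLen], hostname)
}

func main() {
    hash, server := threatServerURL("example.com/blog/hello-world", 1, "internal.example")
    fmt.Println(hash)   // hexadecimal SHA256 digest of the URL
    fmt.Println(server) // e.g. threat-info-c.internal.example for a digest starting with c
}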

Updates

Use POST /insert/THREAT_TYPE/URL to insert unsafe URLs into the probabilistic data structure and the persistent storage.
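For example, assuming the service listens on localhost:8080 (as in the benchmarks below) and using MALWARE as an illustrative THREAT_TYPE, an unsafe URL could be inserted from Go like this:

package main

import (
    "fmt"
    "net/http"
    "net/url"
)

func main() {
    // Hypothetical values: the host and port come from the benchmark
    // section, MALWARE is just one example of a THREAT_TYPE, and the
    // URL is illustrative. The URL is path-escaped, matching the
    // encoding assumption mentioned in the research notes below.
    endpoint := "http://localhost:8080/insert/MALWARE/" + url.PathEscape("example.com/bad-download")

    resp, err := http.Post(endpoint, "text/plain", nil)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}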

Deployment

Assuming you are familiar with Docker and make, the only thing you have to do is open this directory and execute make. The Docker images will be built and the containers started. When everything is ready, you can start playing.

Cleanup

Executing make clean will stop and remove all the Docker containers and images.

Unit Tests

Only the main web service contains a couple of unit tests:

cd ariados/ && go test -v ./...

Integration Tests

# Terminal one (top screen)
docker-compose logs -f

# Terminal two (bottom screen)
make test-insert
make test-retrieve

Benchmarks

Execute make benchmark-safe and make benchmark-unsafe to stress test the web service. Please keep in mind this is still a proof of concept; there is a lot of room for improvement, especially regarding the persistent storage and the implementation of the LRU cache. Right now, the web service can handle about 1454.79 reqs/sec when looking up safe URLs and about 414.07 reqs/sec when looking up unsafe URLs.

Stress testing the web service with a safe URL yields 1454.79 reqs/sec.

$ wrk -t4 -c100 -d60s --latency "http://localhost:8080/urlinfo/1/SAFE_URL"
Running 1m test @ http://localhost:8080/urlinfo/1/SAFE_URL
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.02ms   31.45ms 296.04ms   70.06%
    Req/Sec   365.69    117.20   770.00     65.62%
  Latency Distribution
     50%   66.26ms
     75%   88.04ms
     90%  111.56ms
     99%  150.65ms
  87415 requests in 1.00m, 11.25MB read
Requests/sec:   1454.79
Transfer/sec:    191.79KB

Stress testing the web service with an unsafe URL yields 414.07 reqs/sec.

$ wrk -t4 -c100 -d60s --latency "http://localhost:8080/urlinfo/1/UNSAFE_URL"
Running 1m test @ http://localhost:8080/urlinfo/1/UNSAFE_URL
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   240.97ms   79.18ms 695.40ms   71.82%
    Req/Sec   104.33     44.67   303.00     73.90%
  Latency Distribution
     50%  240.12ms
     75%  283.48ms
     90%  337.75ms
     99%  469.75ms
  24879 requests in 1.00m, 4.22MB read
Requests/sec:    414.07
Transfer/sec:     71.98KB


Research Notes: Initial Ideas

  • Use a Bloom Filter to quickly determine if a URL is benign, in which case we can finish the operation fast, otherwise run the malware identifier against it;
  • A Cuckoo Filter could be an option too. A cuckoo filter can also delete existing items, which Bloom filters do not support, and since infected URLs can be cleaned, it makes sense to offer the ability to remove them from the malware database (a sketch using a cuckoo filter library appears after this list);
  • Distributed Bloom/Cuckoo Filter? Definitely possible, as long as the size and hash functions are the same in all the nodes. Twitter effectively uses a distributed Bloom Filter, taking advantage of the monoid properties of the data structure;
  • LRU cache for the most common malicious URLs. Keep track of how many hits other malicious URLs get. Update the cache every few minutes with the list of URLs with more hits;
  • Assume the proxy is internal, which means a rate limiter is not required;
  • Assume the HTTP proxy enforces URL encoding for the resource path and query parameters. Otherwise, the HTTP router in this web service will ignore parts of the original URL and return a false positive;
  • Worst case scenario is, every URL (including query params) is different and malicious;
  • Because the query string is also taken into consideration, we may be able to split the malware check into subsets of the URL. For example, check if the hostname is blacklisted, then check if the hostname+port is blacklisted, and only then proceed to check the entire URL against the malware database. We could use a Bloom Filter here as well for the hostname (see the sketch after this list);
  • Can we use HAProxy to manage a swarm of web services? This way we can distribute the load and offer high availability. However, this doesn’t help address the memory consumption on each server;
  • Google Safe Browsing API uses gRPC Transcoding syntax for communication. For the sake of simplicity, this web service will return a JSON encoded object with some information associated to the URL.
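To make the filter ideas concrete, here is a rough Go sketch combining a cuckoo filter with the subset-key decomposition described above. It uses github.com/seiflotfy/cuckoofilter as one possible off-the-shelf implementation; the key layout and function names are illustrative, not the ones in this project:

package main

import (
    "fmt"

    cuckoo "github.com/seiflotfy/cuckoofilter"
)

// lookupKeys decomposes a URL into progressively more specific keys
// (hostname, hostname+port, full URL with query string) so that cheap
// checks can short-circuit before the full database lookup.
func lookupKeys(hostname, port, fullURL string) [][]byte {
    return [][]byte{
        []byte(hostname),
        []byte(hostname + ":" + port),
        []byte(fullURL),
    }
}

func main() {
    cf := cuckoo.NewFilter(1000000)

    // Mark a URL as unsafe. Unlike a Bloom filter, the cuckoo filter
    // supports deletion, so a cleaned URL can be removed later.
    cf.InsertUnique([]byte("example.com/downloads?id=42"))

    for _, key := range lookupKeys("example.com", "443", "example.com/downloads?id=42") {
        fmt.Printf("%s possibly unsafe: %v\n", key, cf.Lookup(key))
    }

    // The URL was cleaned; remove it from the filter.
    cf.Delete([]byte("example.com/downloads?id=42"))
}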

Research Notes: Google Safe Browsing

After some thinking, I realized that this exercise is essentially what Google Safe Browsing does. They also use a Bloom Filter in Google Chrome to determine whether or not to send a URL to the web service for a full scan, so implementing that algorithm in this exercise makes even more sense.

Some of the original engineers and managers on the Safe Browsing team, led by Niels Provos, have already moved to other companies or teams inside Google.

Nav Jagpal is the only one who is still involved in the project, so I went ahead and contacted him on LinkedIn to see if he could give me some ideas about how the web service works and the scale at which it operates. He recommended that I take a look at the implementation code in these two projects:
