
Comments (7)

directionless commented on August 28, 2024

I captured the CloudFront logs for most of a day and did some analysis. A very small number of IP addresses account for the vast majority of our served traffic. Most of these are in AWS, and they look like they keep requesting the same deb files. I would bet these are VPC gateways for large sites with frequent or ephemeral node creation.

If we recognize that most of our traffic is within AWS, then we can plan hosting around that. Notably, transfer from S3 to an endpoint within the same region is free. This leads to a possible solution.

We create a bucket per region, replicate our package content, and then bounce users to direct S3 URLs. There are, of course, implementation questions...

Possible Implementations:

| What | Pro | Con |
| --- | --- | --- |
| Lambda | Lots of control; probably inexpensive | Served from US; need to write it |
| CloudFront Function | Easily fits what we're doing today | Limited functionality |
| Lambda@Edge | Global | Need to write it; limited languages |
| S3 MRAP | AWS implementation of what I'd write | Unclear if custom URLs are supported; unclear price |
| Route53 geo load balancing | Simple | Unclear if we can do the custom URLs for S3 |
| Just use S3 from us-east-1 | Simple | Need to rename bucket; can't easily do other regions |

I suspect that the best route for me is to write the Lambda. It gets us off CloudFront, Go is supported, etc. But that's going to take more than a couple of days. So as a stopgap, I've gone and implemented a CloudFront Function to redirect users from our top 15 IP addresses to the bucket closest to them.

Getting to that has a bunch of other pieces:

  1. There are now 3 additional buckets serving packages
  2. S3 is configured to replicate between them
  3. CloudFront has a viewer-request function to generate URLs for the top N users
  4. I enabled S3 storage metrics; they look pretty cheap
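The viewer-request function described above might look roughly like the sketch below. The bucket names and IP addresses are hypothetical (the real top-15 list isn't in this thread); the shape matches the CloudFront Functions event model, where returning `event.request` passes the request through and returning a response object short-circuits it.

```javascript
// Sketch of a CloudFront viewer-request function. All IPs and bucket
// names here are made-up placeholders, not the real top-requester list.
var REGION_BUCKETS = {
  "203.0.113.10": "osquery-packages-us-east-1", // example TEST-NET IPs
  "198.51.100.7": "osquery-packages-us-west-2",
  "192.0.2.55": "osquery-packages-eu-west-1"
};

function handler(event) {
  var bucket = REGION_BUCKETS[event.viewer.ip];
  if (!bucket) {
    // Everyone else falls through to the normal CloudFront origin.
    return event.request;
  }
  // Heavy AWS downloaders get a 302 straight to the bucket in their
  // region, where the S3 transfer is free in-region.
  return {
    statusCode: 302,
    statusDescription: "Found",
    headers: {
      location: {
        value: "https://" + bucket + ".s3.amazonaws.com" + event.request.uri
      }
    }
  };
}
```

The payoff is that redirected traffic never touches CloudFront's data transfer billing; the bucket serves it directly.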

from foundation.

directionless commented on August 28, 2024

Another approach here is to use GitHub. There is some prior art in serving an apt repo from GitHub Pages:
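For context, consuming a GitHub-Pages-hosted apt repo is a one-line client-side change; the URL and keyring path below are hypothetical placeholders, not an existing repo:

```
# /etc/apt/sources.list.d/osquery.list -- URL and keyring path are hypothetical
deb [signed-by=/usr/share/keyrings/osquery.gpg] https://osquery.github.io/apt stable main
```

The main constraint would be GitHub Pages' bandwidth and soft size limits, which may or may not fit this project's traffic.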

mike-myers-tob commented on August 28, 2024

CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages Ah... maybe only for Chocolatey / NuGet: https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

directionless commented on August 28, 2024

> CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth (this is also true for packagecloud and the various other places I've seen). This may be different if my S3 redirects pan out.

> GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages ah....maybe only for Chocolatey / NuGet https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

I want this to be useful, but the supported formats are very limited.

mike-myers-tob commented on August 28, 2024

> CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

> Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth (this is also true for packagecloud and the various other places I've seen). This may be different if my S3 redirects pan out.

Yes, but the way I read it, those limits are what's guaranteed for any open-source project; under a sponsorship agreement of some kind there would be other, unstated limits (perhaps negotiable up to where we need them). Not sure if osquery is high-profile enough for them, but we could ask.

We definitely want to get this hosting bill off your wallet; it's not sustainable, and it's kind of an existential risk to the project if you suddenly have to cut out.

I think we want to encourage these top downloaders to manage their own package cache. We could write a tutorial that explains how to create private mirrors of package repos so that they can point their ephemeral VMs to that instead of constantly re-downloading from our S3 bucket. Maybe we can even pitch it as a cost saver for them too, if it means less inbound network traffic cost to them. What if we could rate limit by IP address or IP range, whichever would be effective? Not for everyone, just for the repeat downloaders. Eventually they will notice that they should be caching.
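The rate-limiting idea above could be prototyped in the same viewer-request layer as the existing redirect function. A hedged sketch, assuming a static list of repeat-downloader IPs (CloudFront Functions are stateless, so a true sliding-window per-IP limit would more likely need AWS WAF rate-based rules; this only illustrates the shape of the response):

```javascript
// Sketch only: nudge known repeat downloaders toward running their own
// cache by answering 429 instead of serving the package. IPs are
// hypothetical placeholders.
var HEAVY_DOWNLOADERS = ["203.0.113.10", "198.51.100.7"];

function rateLimit(event) {
  if (HEAVY_DOWNLOADERS.indexOf(event.viewer.ip) !== -1) {
    return {
      statusCode: 429,
      statusDescription: "Too Many Requests",
      headers: {
        // Hint that the client should back off (and ideally cache locally).
        "retry-after": { value: "3600" }
      }
    };
  }
  // Everyone else is served normally.
  return event.request;
}
```

Pairing the 429 with the private-mirror tutorial would give the heavy downloaders a clear next step rather than just an error.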

robbat2 commented on August 28, 2024

Wondering if you have more stats:

  • Size of the actual repo, in terms of distinct objects and byte count
  • From the Fastly engagement, how much origin traffic was there?

I'll send an email to @directionless making some introductions.

directionless commented on August 28, 2024

I'm not great at updating tickets...

I thought about it a bit and realized that something seemed very off. The number of requests we see (about 2.5 million/day) is extremely high for osquery packaging. So I went digging into the actual users of the data. Is it really credible that ~30 computers a second download osquery?

Anyhow, I discovered that there is a single consumer responsible for the vast bulk of the traffic. I don't know much about it, other than it's a VPC in AWS us-east-1, and it keeps downloading the Ubuntu x86 package.

Because in-region S3 data transfer within AWS is free, this leads to a simple solution: for the busy clients, we can redirect them directly to the bucket. I set up some redirect magic in CloudFront to bounce the top 10 AWS IPs to direct bucket links. Our monthly bill is now much more manageable, though probably still a bit higher than desired.

I think I can get it even lower by moving my redirect project into Lambda and moving off CloudFront completely.
