Comments (19)
I think there are some areas where kOps could be improved to reduce the cost and duration of the 5k tests:
- instance root volume size - the default is 48 GB, which seems way too much for this use case
- node registration - probably client-side throttling when so many nodes try to join at the same time
- collecting the logs and data from the nodes after tests - done sequentially, so we limited it to 500 nodes, but it can probably be done in parallel (see the sketch after this list)
- resource cleanup - there is a lot of room for optimisation there; the many AWS API requests often get throttled
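On the parallel log collection point, here is a minimal sketch of the idea, assuming a plain nodes.txt of node IPs and key-based SSH access (a hypothetical setup, not the actual kOps dump flow):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pull kubelet logs from many nodes with a bounded
# worker pool instead of one node at a time. nodes.txt and the "ubuntu"
# SSH user are assumptions, not the real test-infra configuration.
set -euo pipefail
mkdir -p logs
MAX_PARALLEL=50
xargs -P "$MAX_PARALLEL" -I NODE sh -c \
  'ssh -o StrictHostKeyChecking=no ubuntu@NODE \
     "sudo journalctl -u kubelet --no-pager | gzip -c" > "logs/NODE-kubelet.log.gz"' \
  < nodes.txt
```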
> I don't think that was true in November '22? But either way that doesn't account for using ~50% of budget on AWS and a much smaller fraction on GCP.

I think we were running 2 tests until 2 days ago; this PR moved it to one test every 24 hours.

> My understanding is that was true in the past, but in November we were not developing anymore? This was discussed in today's meeting.

I remember kicking off one-off tests while experimenting with Networking SLOs in November.

> Maybe smaller worker node volumes are in order? Or a different volume type?

We can definitely improve the cost basis of the instance types we choose; we are discussing this on Slack here - https://kubernetes.slack.com/archives/C09QZTRH7/p1701906151681289?thread_ts=1701899708.933279&cid=C09QZTRH7
> For log dumping: The GCE scale test log dumper SSHes to the nodes and then has each node push its logs out, so the local command in the e2e pod is relatively cheap/quick (plus the parallel dumping mentioned above).

IIRC the SSH path is only used for nodes where the logexporter daemonset failed: https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/logexporter-daemonset.yaml

From the scale test logs:

```
Dumping logs from nodes to GCS directly at 'gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1732082847704944640' using logexporter
```
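As a rough follow-up (my sketch, not the harness's own check), one could diff cluster node names against the per-node folders under that GCS path to see which nodes still need the SSH fallback:

```bash
# Hedged sketch: print nodes with no logexporter output. The run path is
# taken from the log line above; the folder-per-node layout is an
# assumption based on the gsutil listing later in this thread.
RUN="gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1732082847704944640"
comm -23 \
  <(kubectl get nodes -o name | sed 's|^node/||' | sort) \
  <(gsutil ls "$RUN/" | awk -F/ '{print $(NF-1)}' | sort)
```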
Dumping log and other info: 60 min
Resource cleanup: 25 min
cc @hakman @hakuna-matatah @upodroid
AFAIK, we are talking about a single test: ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2
The test runs in a single account, k8s-infra-e2e-boskos-scale-001.
Breaking down some services used:

| Service | November 2023 ($) |
|---|---|
| EC2-Instances | 118,494.91 |
| EC2-Other | 15,299.39 |
| Support (Business) | 8,603.50 |
| Elastic Load Balancing | 258.91 |
| Key Management Service | 13.66 |
| S3 | 3.16 |
So what's EC2-Other?
So EBS volumes on the EC2 instances and Data Transfer within the region.
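If someone wants to break that down further, a hedged sketch using Cost Explorer (the dates and the "EC2 - Other" service name are my assumptions about how this account's billing is grouped):

```bash
# Sketch: group November's "EC2 - Other" spend by usage type, to separate
# EBS volume-hours from intra-region data transfer.
cat > /tmp/filter.json <<'EOF'
{"Dimensions": {"Key": "SERVICE", "Values": ["EC2 - Other"]}}
EOF
aws ce get-cost-and-usage \
  --time-period Start=2023-11-01,End=2023-12-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter file:///tmp/filter.json \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
```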
- One reason was that we were running 2 periodics vs 1 periodic on GCP.
- Another reason is that we kicked off one-off tests using presubmits while making code changes and fixing issues, before we saw successful runs.
- We are also running StatefulSets in our cl2 tests; I haven't looked into whether GCP runs StatefulSets as part of its scale tests.
> One reason was that we were running 2 periodics vs 1 periodic on GCP.

I don't think that was true in November '22? But either way that doesn't account for using ~50% of budget on AWS and a much smaller fraction on GCP.

EDIT: These two jobs run 5k nodes in k8s-infra-e2e-scale-5k-project:
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-correctness
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance

> Another reason is that we kicked off one-off tests using presubmits while making code changes and fixing issues, before we saw successful runs.

My understanding is that was true in the past, but in November we were not developing anymore? This was discussed in today's meeting.

> We are also running StatefulSets in our cl2 tests; I haven't looked into whether GCP runs StatefulSets as part of its scale tests.

If this can account for a dramatic cost increase, we should consider not doing it. I doubt it, since the GCP tests have been extremely long running (like > 10h), so it's mostly down to the size and shape of the cluster under test (or something else comparable, like making sure we have reliable cleanup of resources when the test is done).

EDIT: I remembered wrong; the GCE jobs take 3-4h.
I suspect we can run smaller worker nodes or something else along those lines?
Aside: the GCP / AWS comparison is only meant to set a sense of scale; this seems to be running much more expensively than expected. I don't expect identical costs, and we're not attempting to compare platforms ... it will always be apples to oranges between kube-up on GCE and kOps on EC2, but the difference is so large that I suspect we're missing something about running these new AWS scale test jobs cost-effectively.
The more meaningful datapoint is consuming ~50% of budget on AWS just for this account, which is not expected.
> So EBS volumes on the EC2 instances and Data Transfer within the region.

> Maybe smaller worker node volumes are in order? Or a different volume type?

It looks like the spend on this account is by far mostly EC2 instance costs though, so probably we should revisit the node sizes / machine types?
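If we go that route, a minimal sketch of the kOps change, assuming a worker instance group named nodes (the field names are from the kOps InstanceGroup API; the concrete values are placeholders, not recommendations):

```bash
# Hedged sketch: shrink worker root volumes and move to gp3; machineType
# is left as a placeholder to revisit separately.
kops get instancegroup nodes -o yaml > ig.yaml
# Edit ig.yaml, e.g.:
#   spec:
#     machineType: <smaller-instance-type>   # placeholder
#     rootVolumeSize: 20                     # GB, vs the ~48 GB default noted above
#     rootVolumeType: gp3
kops replace -f ig.yaml
kops update cluster --yes
```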
> resource cleanup - there is a lot of room for optimisation there; the many AWS API requests often get throttled

Do we know how long this step takes today? I want to compare it with internal EKS 5k test run tear-down times. And what throttling are we experiencing, in terms of APIs? We can try to increase the limits if the current ones are not enough during the deletion process.
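One way to answer the "which APIs" question (a sketch on my part, not existing tooling) is to mine CloudTrail for throttled calls during a cleanup window:

```bash
# Hedged sketch: count throttled AWS API calls by service and operation.
# The time window is a placeholder; EC2 reports throttling as
# Client.RequestLimitExceeded, most other services as ThrottlingException.
aws cloudtrail lookup-events \
  --start-time 2023-12-06T00:00:00Z \
  --end-time 2023-12-06T06:00:00Z \
  --max-items 1000 --output json |
jq -r '.Events[].CloudTrailEvent | fromjson
       | select((.errorCode // "") | test("Throttling|RequestLimitExceeded"))
       | "\(.eventSource) \(.eventName)"' |
sort | uniq -c | sort -rn
```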
> Dumping log and other info: 60 min

I was wondering if it's feasible to keep just 500 nodes, clean up the other 4,500, and then dump the logs for those 500? That way we would save 60 min on 4,500 nodes? WDYT?
Could we use spot instances at all? If not, let's document why not.
> Could we use spot instances at all? If not, let's document why not.

Currently, I don't think it's possible, as the scale test does not tolerate spot instances well. That's definitely something we would like to improve, though.
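For reference, if the test ever does tolerate interruptions, kOps can request spot capacity per instance group via maxPrice (a sketch under that assumption, not something to enable today):

```bash
# Hedged sketch: turn a worker instance group into spot by setting a max
# bid; the value is a placeholder.
kops get instancegroup nodes -o yaml > ig.yaml
# Edit ig.yaml, e.g.:
#   spec:
#     maxPrice: "0.50"   # USD/hour ceiling; setting maxPrice requests spot
kops replace -f ig.yaml && kops update cluster --yes
```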
We can definitely do one of those two options 😉
For log dumping: The GCE scale test log dumper SSHes to the nodes and then has each node push its logs out, so the local command in the e2e pod is relatively cheap/quick (plus the parallel dumping mentioned above).
https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/log-dump.sh
To do this, it currently writes a file to the results with a link to where the logs will be dumped in a separate bucket, and just grants the test cluster nodes access to the scale log bucket.
In this way it is dumping all nodes:

```
$ gsutil ls gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1732445180289617920 | wc -l
5003
```
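The "grants the test cluster nodes access" step could look something like this (a hedged sketch; the service account email is hypothetical, the bucket is the one from the listing above):

```bash
# Sketch: allow the test cluster's node service account to write logs into
# the shared scale-test bucket. The account email is a placeholder.
gsutil iam ch \
  "serviceAccount:scale-test-nodes@example-project.iam.gserviceaccount.com:roles/storage.objectCreator" \
  gs://k8s-infra-scalability-tests-logs
```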
Looks like there is another improvement made by @rifelpet w.r.t. parallelizing resource dumps to save cost here; linking it as it's related to this discussion.