ovn-org / ovn-heater
Mega script to deploy/configure/run OVN scale tests.
License: Apache License 2.0
As discussed in #189 (comment), it would be great to be able to enforce this on every PR so that we ensure that coding style and correctness are maintained.
After a successful deploy and install, browbeat run fails with the following:
[root@localhost ovn-heater]# ./do.sh browbeat-run browbeat-scenarios/switch-per-node-100.yml test-base
tuple.index(x): x not in tuple
Environment is stock Fedora 32 on 13 quad core VMs.
As @mkalcok points out, these arguments give the wrong impression that users can choose random names. That's not exactly true: container names are either (partially) hardcoded in ovn-fake-multinode or can be changed via config. It's probably better to pass the config (and maybe an index or some other unique id) and do the container naming internally.
We're currently collecting stats only from the first 3 fake nodes. While all of them should be equal, we could still overlook potential issues where that's not the case. We can't just collect the data from all the nodes; the reports would become unreadable and too large for browsers to open. But we could try to plot max, avg and sum for all the components by their type.
The main challenge would be mapping data points that were not taken at exactly the same moment into a single aggregate point, potentially by interpolating the data points to fill the gaps in the grid. Another tricky part is distinguishing between different process types; e.g., we have multiple ovsdb-server processes in each central fake node and their PIDs are not always aligned.
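A minimal sketch of the aggregation idea, using linear interpolation to map per-node samples onto a common time grid (function names and data shapes below are illustrative, not ovn-tester code):

```python
from bisect import bisect_left

def interpolate(series, t):
    """Linearly interpolate a sorted [(time, value), ...] series at time t."""
    times = [p[0] for p in series]
    i = bisect_left(times, t)
    if i == 0:
        return series[0][1]
    if i == len(series):
        return series[-1][1]
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def aggregate(all_series, grid):
    """Map each node's samples onto a common time grid and reduce them."""
    out = []
    for t in grid:
        vals = [interpolate(s, t) for s in all_series]
        out.append({
            't': t,
            'max': max(vals),
            'avg': sum(vals) / len(vals),
            'sum': sum(vals),
        })
    return out
```

This only covers the interpolation half of the problem; telling apart the multiple ovsdb-server processes per node would still need some PID-to-role mapping.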
I am trying to run this using F32 and F33 and in both cases I get:
docker: Error response from daemon: Conflict. The container name "/registry" is already in use by container "12c6e71a4917bcae404903c1ba3ee69bc18d17c1644abe77c9607bf4ae542ce3". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
./do.sh: line 400: /root/ovn-heater/runtime/browbeat/.rally-ovs-venv/bin/activate: No such file or directory
./do.sh: line 441: /root/ovn-heater/runtime/browbeat/.rally-ovs-venv/bin/activate: No such file or directory
I personally find it really hard to read Python code that is formatted like C. Maybe we could use a formatting tool such as Black [0], which makes the formatting uniform, pythonic and easy to read.
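If we go this route, a minimal Black configuration could live in pyproject.toml. The values below are placeholders the project would need to agree on, not an existing ovn-heater file:

```toml
# pyproject.toml -- hypothetical settings; line length and target versions
# would need to be agreed on by the project.
[tool.black]
line-length = 79
target-version = ["py36"]
```

CI could then enforce it with `black --check .` on every PR.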
I recently did an audit of all places where the @ovn_stats.timeit decorator is used. What I found was that the only truly useful places are:

WorkerNode.wait(): After provisioning a worker node, this waits for the corresponding Chassis record to appear in the SBDB.
WorkerNode.ping_port(): This determines how long it takes before pings are successful [1].

The rest of the timed operations essentially call ovn-nbctl a bunch of times. Since we use the ovn-nbctl daemon on the central nodes, any ovn-nbctl command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the round-trip time for SSH command execution, not OVN. As an example, look at the Namespace.add_ports graph in the comment below this one. Aside from an oddity at iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here are some ovn-nbctl calls that add addresses to an address set and add ports to a port group.
The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: the timers do a poor job of illustrating the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? That could be due to a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the ovn-nbctl daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.
This problem also extends to the "useful" timers I mentioned above. When we time WorkerNode.ping_port(), this includes the SSH connection overhead, plus Python client code execution (such as multiple datetime.now() calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.
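To illustrate why this happens, here is a minimal sketch of a timeit-style decorator (not ovn-tester's actual implementation): everything between call entry and return, including SSH round trips and client-side Python work, is folded into one number.

```python
import functools
import time

def timeit(func):
    """Sketch of a decorator like ovn_stats.timeit: it measures wall-clock
    time around the whole call, so SSH round trips and Python client
    overhead all end up inside the reported duration."""
    @functools.wraps(func)
    def _timeit(*args, **kwargs):
        start = time.perf_counter()
        value = func(*args, **kwargs)
        # A real implementation would record (func.__name__, duration) into
        # the stats store; here we just attach it for inspection.
        _timeit.last_duration = time.perf_counter() - start
        return value
    return _timeit

@timeit
def ping_port():
    time.sleep(0.01)  # stand-in for SSH + ping round trips
    return True
```

Whatever ping_port() spends waiting on the network is indistinguishable, in the recorded duration, from time spent by OVN itself.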
How can we fix this? There are a few steps to take; one of them is to rely on external_ids:ovn-installed being set.

[1] The usefulness of timing pings may disappear when the asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up; instead, it uses ovn-installed-ts. The remaining use case for pings is testing ACLs, and in that case the fact that the pings succeed is what's important, not how long it takes for the pings to start succeeding.
2022-12-18 08:34:16,188 | ovn_workload |INFO| Binding lport lp-172-6 on ovn-scale-172
2022-12-18 08:34:16,310 | ovn_workload |INFO| Binding lport lp-172-7 on ovn-scale-172
2022-12-18 08:34:16,419 | ovn_workload |INFO| Binding lport lp-172-8 on ovn-scale-172
2022-12-18 08:34:16,564 | ovn_workload |INFO| Binding lport lp-172-9 on ovn-scale-172
2022-12-18 08:34:16,703 | ovsdbapp.backend.ovs_idl.vlog |INFO| ssl:192.16.0.2:6641: clustered database server is not cluster leader; trying another server
2022-12-18 08:34:16,703 | ovsdbapp.backend.ovs_idl.vlog |INFO| ssl:192.16.0.2:6641: connection closed by client
2022-12-18 08:34:16,706 | ovsdbapp.backend.ovs_idl.vlog |INFO| ssl:192.16.0.3:6641: connecting...
2022-12-18 08:34:16,710 | ovsdbapp.backend.ovs_idl.vlog |INFO| ssl:192.16.0.3:6641: connected
2022-12-18 08:34:16,717 | ovsdbapp.backend.ovs_idl.transaction |ERROR| Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 118, in run
txn.results.put(txn.do_commit())
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 92, in do_commit
command.run_idl(txn)
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/schema/ovn_northbound/commands.py", line 1535, in run_idl
raise RuntimeError("LB %s alseady exist in switch %s" % (
RuntimeError: LB 4835350f-8e4e-43c8-827f-97c405e25939 alseady exist in switch 6215d9e7-5d03-4b4e-b6ff-1bcc099fe35d
2022-12-18 08:34:16,717 | ovn_context |INFO| Waiting for the OVN state synchronization
2022-12-18 08:34:17,091 | ovn_context |INFO| Exiting context base_cluster_bringup
Traceback (most recent call last):
File "/ovn-tester/ovn_tester.py", line 382, in <module>
run_base_cluster_bringup(ovn, bringup_cfg, global_cfg)
File "/ovn-tester/ovn_tester.py", line 357, in run_base_cluster_bringup
worker.provision_load_balancers(ovn, ports, global_cfg)
File "/ovn-tester/ovn_stats.py", line 20, in _timeit
value = func(*args, **kwargs)
File "/ovn-tester/ovn_workload.py", line 401, in provision_load_balancers
cluster.load_balancer6.add_to_switches([self.switch.name])
File "/ovn-tester/ovn_load_balancer.py", line 93, in add_to_switches
self.nbctl.lb_add_to_switches(lb, switches)
File "/ovn-tester/ovn_utils.py", line 661, in lb_add_to_switches
txn.add(self.idl.ls_lb_add(s, lb.uuid))
File "/usr/lib64/python3.9/contextlib.py", line 126, in __exit__
next(self.gen)
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/api.py", line 110, in transaction
del self._nested_txns_map[cur_thread_id]
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/api.py", line 61, in __exit__
self.result = self.commit()
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 64, in commit
raise result.ex
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 118, in run
txn.results.put(txn.do_commit())
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 92, in do_commit
command.run_idl(txn)
File "/usr/local/lib/python3.9/site-packages/ovsdbapp/schema/ovn_northbound/commands.py", line 1535, in run_idl
raise RuntimeError("LB %s alseady exist in switch %s" % (
RuntimeError: LB 4835350f-8e4e-43c8-827f-97c405e25939 alseady exist in switch 6215d9e7-5d03-4b4e-b6ff-1bcc099fe35d
For example, during a test run, at a given iteration simulate that a component (e.g., ovsdb-server) crashes.
One way of doing this currently is to use the ext_cmd support in ovn-heater. It's generic enough that writing some samples showing how to simulate component failures at different times might be enough to close this issue.
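A hypothetical scenario fragment along these lines could look as follows. The field names are illustrative only and must be checked against the actual ext_cmd schema in ovn-heater:

```yaml
# Hypothetical scenario fragment -- field names are illustrative, not the
# actual ext_cmd schema; check ovn-heater's scenario documentation.
ext_cmd:
  - iteration: 50          # fire at a given test iteration
    node: ovn-central-1    # where to run the command
    cmd: "pkill -9 -f 'ovsdb-server.*ovnnb_db'"   # simulate an NB DB crash
```

Samples like this for each central component (ovn-northd, NB/SB ovsdb-server, ovn-controller) may be all that's needed.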
TODO: add more content describing the logical topology with ovn-ic.
IPv6 addresses in cluster-density tests are not bracketed, leading to the following errors:
2023-01-15T05:48:37.238Z|00047|lb|WARN|Failed to initialize LB VIP: 16::1:8080: should be an IP address and a
port number with : as a separator, 16::2:8080: should be an IP address and a port number with : as a separator,
16::3:8080: should be an IP address and a port number with : as a separator, 16::4:8080: should be an IP address
and a port number with : as a separator, 16::5:8080: should be an IP address and a port number with : as a
separator, 16::6:8080: should be an IP address and a port number with : as a separator, 16::7:8080: should be an
IP address and a port number with : as a separator, ...
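The bracketing itself is a small fix; a sketch of a helper (the function name is illustrative, not existing ovn-tester code) that formats VIPs the way OVN's load-balancer parser expects:

```python
import ipaddress

def vip_str(ip, port):
    """Format an L4 VIP: IPv6 addresses must be bracketed, otherwise the
    trailing ':port' is ambiguous with the address's own colons."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        return f"[{addr}]:{port}"
    return f"{addr}:{port}"
```

Note this would not fix the separate None issue below, where the backend IP is missing entirely before formatting.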
And ovn-heater also manages to set None as an IP address at some point:
2023-01-15T06:33:09.875Z|00144|lb|WARN|Failed to initialize LB VIP: None:8080: should be an IP address and a port
number with : as a separator, None:8080: should be an IP address and a port number with : as a separator.
2023-01-15T06:33:09.875Z|00145|socket_util|ERR|None:8080: bad IP address "None"
2023-01-15T06:33:09.875Z|00146|socket_util|ERR|None:8080: bad IP address "None"
The density-heavy test case is also broken. It supplies correctly bracketed IPs, but that doesn't help with the None issue:
2022-09-03T23:32:17.289Z|00070|ovn_util|WARN|bad ip address or port for load balancer key [None]:8080
2022-09-03T23:32:17.289Z|00071|socket_util|ERR|[None]:8080: bad IP address "None"
2022-09-03T23:32:17.289Z|00072|socket_util|ERR|[None]:8080: bad IP address "None"
2022-09-03T23:32:17.289Z|00073|socket_util|ERR|[None]:8080: bad IP address "None"
2022-09-03T23:37:00.470Z|28920|ovn_dbctl|INFO|Running command run -- create Load_Balancer name=lb_density_heavy6_4996-tcp protocol=tcp
2022-09-03T23:37:00.477Z|28921|ovn_dbctl|INFO|Running command run -- create Load_Balancer name=lb_density_heavy6_4996-udp protocol=udp
2022-09-03T23:37:00.483Z|28922|ovn_dbctl|INFO|Running command run -- create Load_Balancer name=lb_density_heavy6_4996-sctp protocol=sctp
2022-09-03T23:37:00.488Z|28923|ovn_dbctl|INFO|Running command run -- add Load_Balancer_Group 44608063-95da-4629-9532-e932b47dbbc7 load_balancer 30344b40-0296-43af-b4a6-4e9825dd3e77
2022-09-03T23:37:00.522Z|28924|ovn_dbctl|INFO|Running command run -- add Load_Balancer_Group 44608063-95da-4629-9532-e932b47dbbc7 load_balancer ed0f2773-3614-4906-9389-43940b4e01d1
2022-09-03T23:37:00.589Z|28925|ovn_dbctl|INFO|Running command run -- add Load_Balancer_Group 44608063-95da-4629-9532-e932b47dbbc7 load_balancer 4089ea3c-ac32-447d-975d-31bfe3404bf6
2022-09-03T23:37:00.624Z|28926|ovn_dbctl|INFO|Running command run -- set Load_Balancer 30344b40-0296-43af-b4a6-4e9825dd3e77 "vips:\"[109.194.0.1]:80\"=\"[None]:8080\""
2022-09-03T23:37:00.630Z|28927|ovn_dbctl|INFO|Running command run -- set Load_Balancer ed0f2773-3614-4906-9389-43940b4e01d1 "vips:\"[109.194.0.1]:80\"=\"[None]:8080\""
2022-09-03T23:37:00.637Z|28928|ovn_dbctl|INFO|Running command run -- set Load_Balancer 4089ea3c-ac32-447d-975d-31bfe3404bf6 "vips:\"[109.194.0.1]:80\"=\"[None]:8080\""
The issue can be tracked all the way back to August '22. We don't have test data before that.
We currently try to support multiple container runtimes (docker, podman, docker-ce) but that makes the code quite complex. Maybe we should try to stick to a single runtime (e.g., podman).
We should also remove the need for a local image registry so we don't rely on "random" (e.g., from Docker Hub) registry images. We can probably just ansible-copy the ovn-multi-node image that do.sh builds to all DUT nodes instead of having them pull the image from a registry running on the tester node.
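A rough sketch of what the ansible side of that could look like; task names, the image tag and the tarball path below are illustrative, not ovn-heater's actual playbooks:

```yaml
# Hypothetical playbook sketch -- names, tag and paths are placeholders.
- name: Save the locally built ovn-multi-node image to a tarball
  command: podman save -o /tmp/ovn-multi-node.tar ovn/ovn-multi-node
  delegate_to: localhost
  run_once: true

- name: Copy the image tarball to every DUT node
  copy:
    src: /tmp/ovn-multi-node.tar
    dest: /tmp/ovn-multi-node.tar

- name: Load the image on the DUT nodes
  command: podman load -i /tmp/ovn-multi-node.tar
```

This also removes the "/registry container name already in use" failure mode reported above, since no registry container is needed at all.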
CC: @igsilya
We currently provision ports of type "internal" when simulating pods:
https://github.com/dceara/ovn-heater/blob/d5ae548dceae639cfb956149742a8fe557b2e393/ovn-tester/ovn_utils.py#L25
https://github.com/dceara/ovn-heater/blob/d5ae548dceae639cfb956149742a8fe557b2e393/ovn-tester/ovn_workload.py#L300
https://github.com/dceara/ovn-heater/blob/d5ae548dceae639cfb956149742a8fe557b2e393/ovn-tester/ovn_workload.py#L313
Using veth instead would make configuration look closer to what ovn-kubernetes provisions.
This should probably be configurable (to allow side-by-side tests).
PR #70 collected information about CPU usage and poll intervals. Something similar for memory usage would be very useful. We already collect ps output at the end of the run; we just need to aggregate/parse the outputs a bit.
CC: @igsilya
Somewhere around 40K load balancers, ovn-heater aborts due to DEFAULT_VIP_SUBNET
being exhausted:
2022-09-21 00:04:30,710 | ovn_workload |INFO| Creating load balancer lb_density_heavy_79868
2022-09-21 00:04:32,356 | ovn_workload |INFO| Creating load balancer lb_density_heavy_79870
2022-09-21 00:04:34,006 | ovn_workload |INFO| Creating load balancer lb_density_heavy_79872
2022-09-21 00:04:35,558 | ovn_context |INFO| Waiting for the OVN state synchronization
2022-09-21 00:05:21,236 | ovn_context |INFO| Exiting context density_heavy_startup
Traceback (most recent call last):
File "/root/ovn-heater/ovn-tester/ovn_tester.py", line 314, in <module>
test.run(ovn, global_cfg)
File "/root/ovn-heater/ovn-tester/tests/density_heavy.py", line 78, in run
DEFAULT_VIP_SUBNET, [[ports[i]]])
File "/root/ovn-heater/ovn-tester/tests/density_heavy.py", line 50, in create_lb
vip_net = DEFAULT_VIP_SUBNET.next(len(self.lb_list))
File "/root/ovn-heater/runtime/venv/lib/python3.6/site-packages/netaddr/ip/__init__.py", line 1251, in next
ip_copy += step
File "/root/ovn-heater/runtime/venv/lib/python3.6/site-packages/netaddr/ip/__init__.py", line 1102, in __iadd__
raise IndexError('increment exceeds address boundary!')
IndexError: increment exceeds address boundary!
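Rather than hitting this IndexError mid-run, the test could check up front that the VIP subnet is large enough. A sketch using the stdlib ipaddress module rather than netaddr; the subnet and prefix values in the usage are illustrative, not ovn-heater's actual DEFAULT_VIP_SUBNET:

```python
import ipaddress

def check_vip_capacity(vip_subnet, n_load_balancers, per_lb_prefixlen):
    """Fail early if the VIP subnet cannot hold one per-LB subnet for each
    load balancer, instead of exhausting the range mid-run."""
    net = ipaddress.ip_network(vip_subnet)
    capacity = 2 ** (per_lb_prefixlen - net.prefixlen)
    if n_load_balancers > capacity:
        raise ValueError(
            f"{vip_subnet} only fits {capacity} /{per_lb_prefixlen} subnets, "
            f"need {n_load_balancers}")
    return capacity
```

For example, a /16 holds only 256 /24 per-LB subnets, nowhere near 40K, so the check would abort before provisioning starts.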
A non-exhaustive list of knobs that configure how ovn-tester provisions the test cluster is:
We should document these so a user doesn't have to decipher the code to figure out what a knob does.
The "Scenario execution with DBs in standalone mode" section should be updated to set ovn_cluster_db: "False" instead of ovn_cluster_db: False. Otherwise browbeat and rally won't interpret the parameter as the boolean value False and will try to start the DBs in clustered mode (the default).
As pointed out by @dceara in #179 (comment), an ovn-heater run should be consistent and reproducible, which is why random assignment of GW chassis to router ports in the OpenStack CMS is not the best approach.
An alternative to this could be a round-robin distribution of GWs to ports. This sounds deceptively simple, but we need to keep in mind that it's not only about evenly distributing GWs across ports; it's also about balancing the distribution of the priorities these GWs receive. We need to ensure that high priority does not accumulate on a single GW.
Example of bad distribution:
# Port # GW # Priority
port1 gw1 1
port1 gw2 2
port1 gw3 3
port2 gw1 1
port2 gw2 2
port2 gw3 3
port3 gw1 1
port3 gw2 2
port3 gw3 3
In the above example, 3 GWs are evenly distributed between 3 ports, but all the ports prioritize gw1.
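A sketch of one balanced approach: rotate the GW list by the port index before handing out priorities, so the top slot cycles through the GWs. Names are illustrative, and the numbering follows the example above, where each port's first GW gets priority 1:

```python
def assign_gateways(ports, gws):
    """Round-robin GW-to-port assignment that also rotates which GW gets
    the top priority, so no single GW accumulates it."""
    assignment = {}
    n = len(gws)
    for i, port in enumerate(ports):
        # Rotate the GW list by the port index, then hand out priorities.
        rotated = gws[i % n:] + gws[:i % n]
        assignment[port] = [(gw, prio)
                            for prio, gw in enumerate(rotated, start=1)]
    return assignment
```

With 3 ports and 3 GWs this yields a Latin-square-like layout: every GW is top priority on exactly one port, and the result is fully deterministic across runs.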
It's possible that tasks executed via ansible hang. For example, there were a few cases of podman system prune -f blocking indefinitely, likely because of a bug in the container runtime. Nevertheless, do.sh should not hang indefinitely.
We should instead add timeouts to ansible tasks. One option is to set a global timeout:
https://docs.ansible.com/ansible/latest/reference_appendices/config.html#task-timeout
It's possible though that some tasks need a longer timeout than others. This needs to be investigated further.
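The global option from the Ansible docs linked above lives in ansible.cfg; the 600-second value below is a placeholder, and tasks that legitimately run longer (e.g., image builds) could override it with the per-task timeout keyword:

```ini
# ansible.cfg -- global task timeout in seconds (value is a placeholder).
# Long-running tasks can override this with the per-task 'timeout' keyword.
[defaults]
task_timeout = 600
```

This at least turns a silent hang into a visible task failure.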
The density_heavy test currently creates a single load balancer at the beginning of the test. On each iteration, this load balancer has further VIPs/backends provisioned on it.
In actuality, we should not be creating a single monster load balancer, but instead we should be adding a new load balancer during each iteration of the test. This way, we will end up with a number of load balancers equal to the number of iterations, and each of those load balancers will have a small number of VIPs/backends.
Currently, many of our operations involve repeated calls to ovn-nbctl [1]. These can and should be combined into single transactions when possible.
First, this is the behavior that CMSes already have. ovn-tester is attempting to emulate a CMS, so ovn-tester shouldn't hamstring itself by issuing so many SSH commands.
Second, when asyncio is used, these repeated ovn-nbctl calls lead to skewed times being reported. The event loop can give control to other tasks between the ovn-nbctl calls. If we are timing the operation that makes those ovn-nbctl calls, then the time will include "idle" time when the event loop was allowing other tasks to run.
Third, fewer SSH calls generally means faster test execution.
The actual mechanics of this are left to the implementer. For now, the easiest method would be to combine the related ovn-nbctl calls into a single SSH invocation.
[1] Also ovn-sbctl and ovs-vsctl, however ovn-nbctl is the most common and what I'll refer to throughout the rest of the issue.
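The "single SSH invocation" approach can lean on ovn-nbctl's own `--` separator, which chains several subcommands into one client invocation (and, for writes, one transaction). A sketch of a helper that builds such a command line; the function name is illustrative, not existing ovn-tester code:

```python
import shlex

def combined_nbctl(commands):
    """Build one ovn-nbctl invocation from a list of subcommands using
    ovn-nbctl's '--' separator, so one SSH round trip replaces many."""
    parts = [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
    return "ovn-nbctl " + " -- ".join(parts)
```

Callers would accumulate related operations (e.g., all lsp-add calls for an iteration) and ship them in one go instead of one SSH exec per call.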
Having the DBs for offline inspection would help in troubleshooting issues.
This should be triggered on push/pr and should run at least the ovn-low-scale.yml
scenario.
Hopefully this catches errors earlier and makes PR review/testing easier.
Originally discussed in PR #183 (#183 (comment)), a possible solution was suggested in ac3cba3 but that has downsides.
Instead we should probably have two layers of caching in CI:
a. a global cache to allow re-using runtimes across jobs
b. a per job cache to allow re-using the runtime within a job (for different address family + test combinations)
This would also allow us to reduce the number of CPUs when running tests.
CC: @igsilya
Sometimes the following errors appear in the test log:
datamash: invalid numeric value in line 1 field 1: '1634ms'
datamash: invalid numeric value in line 1 field 1: '1521ms'
This means that some time values are not separated from the 'ms' suffix during log parsing, so datamash fails to use that data.
Presumably this started happening after parallel compaction was introduced, but I'm not 100% sure.
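Until the parsing is fixed properly, stripping the unit before the values reach datamash sidesteps the error. A sketch, using sample input in place of the real extracted log fields:

```shell
# Strip the trailing 'ms' unit before the values reach datamash, e.g.:
#   ... | sed 's/ms$//' | datamash max 1
printf '1634ms\n1521ms\n' | sed 's/ms$//'
```

The real fix is still to make the log parser emit the number and the unit separately.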