Comments (38)
Still not sure how Scalelite should know in advance how large a meeting will grow. What if a couple of new meetings arrive at the same time that will all become very large within a couple of minutes? How do we ensure that these do not end up on the same server?
One idea would be to take meeting age into account and factor in the uncertainty of very fresh meetings. For example: if a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago, it will not grow much more, so assume the current user count is correct. Interpolate between these two.
New meetings will now prefer servers with a low user count, but avoid servers with many new meetings for which the final user count is not yet known. This approach would prevent the issue I mentioned earlier, that too many new meetings could end up on the same server.
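The interpolation described above could be sketched roughly like this. All names (`estimated_users`, `WARMUP_SECONDS`) are illustrative, not part of Scalelite:

```ruby
# Estimate a meeting's eventual size: fresh, empty meetings are assumed to
# grow as large as the current largest meeting; after 15 minutes the actual
# user count is trusted. In between, interpolate linearly by meeting age.
WARMUP_SECONDS = 15 * 60

def estimated_users(actual_users, age_seconds, largest_meeting_users)
  return actual_users if age_seconds >= WARMUP_SECONDS
  # Trust in the actual count grows linearly from 0 (brand new) to 1 (15 min).
  w = age_seconds.to_f / WARMUP_SECONDS
  (w * actual_users + (1 - w) * largest_meeting_users).ceil
end
```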
from scalelite.
Just wanted to add a few observations
- Giving different weights to audio and video is important, but not the full story. Having a single meeting with 50 video streams is different from having 50 meetings with a single video stream each, so there should be a non-linear component.
- Adding weights to servers (I assume to express their capabilities) may not be needed when an overall system load metric is taken into account that expresses how (over)loaded the server is.
- For cloud (EC2) deployments, it may make sense to further incorporate the cpu.steal value into the decision, as it describes how overloaded the physical server hosting the virtual machine is.
Anyway, even with sophisticated metrics in place, the system as built now has some natural inertia, so it may be a bad idea to always pick the least loaded server (assume five new meetings starting within a time interval of 1 minute; it will take some time to see the impact of each new meeting in the metrics, and it will also take some time until all users have joined the meetings, and so on). So there still needs to be a round-robin-like element in the heuristic that ensures that in the above scenario not all new meetings end up on the same host. Currently, with only counting meetings, this round-robin behaviour is built into the algorithm.
If a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago
One approach is to give each meeting a "warmup period" where Scalelite gives it a minimum size of 15 users for the first 15 minutes, and thereafter uses the actual size.
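That warmup rule is simple enough to state in a few lines. `effective_size` and the constants are hypothetical names, not existing Scalelite code:

```ruby
# During the first 15 minutes, assume a meeting has at least 15 users;
# afterwards, use the actual count.
WARMUP_MINUTES = 15
MINIMUM_SIZE   = 15

def effective_size(actual_users, age_minutes)
  age_minutes < WARMUP_MINUTES ? [actual_users, MINIMUM_SIZE].max : actual_users
end
```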
We've been running Scalelite with these changes for nearly a year now for a school district in Germany. We regularly have 2000 concurrent users handled by a cluster of 15 conference nodes (12 dedicated CPU cores each). This works nicely. Servers aren't loaded over capacity, we don't overprovision too much. Without those patches things would be unusable, or we'd have to run three times the number of servers. We do not have any kind of foresight, no planned scheduling, just a weighted load based on the number of active video & audio channels instead of the number of conferences.
I think my point is that we've got a "perfect is the enemy of good" situation. Sure, bringing planned meetings into the fray might make things a bit more precise. However, the real advantage and the real improvement here is in going from "solely use the number of conferences" to "guess current load based on actual usage numbers (audio, video, listen-only)". Going from "guess current load based on actual usage numbers (audio, video, listen-only)" to "guess current load based on actual usage numbers (audio, video, listen-only) and take future conferences into account" is a much, much smaller step, maybe one that turns out to be impossible[1]. So why wait for the latter? Why not implement the former for the time being? It would obviously help out a lot of people, based on the comments posted here and in other issues & PRs.
So pleeeease, think seriously about implementing a hands-off approach of simply taking the current load weighted by active stream types into account. I'm pretty sure it'd save 90% of the problems people have here.
[1] Possible reasons why it might be impossible & why it isn't a panacea:
- Requiring users to schedule their meeting is a hurdle a lot of users won't be willing to take. I know that the teachers in our school district have way too many things to do already, a lot of them aren't tech-savvy, and requiring even more tech interaction & planning from them will simply not work in the real world. We'd have to teach each and every one of them how to estimate conference sizes in advance, how to differentiate the audio types etc.
- Even for scheduled meetings you cannot estimate the load correctly as it isn't clear who'll use video, who joins via audio and who joins as a listen-only client. Those have vastly differing load characteristics. Sure, you could implement scheduling parameters for each type, but see 1: no one will want to do that work.
- Resource blocking might lead to waste if the resources aren't needed for some reason, e.g. you schedule a meeting, you get sick, forget to cancel the meeting, and then the server will sit there doing nothing.
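The "weighted load based on active stream types" idea above can be sketched in a few lines. The weight values here are illustrative, not Scalelite defaults:

```ruby
# Weighted load: video streams cost more than voice, voice more than
# listen-only. The concrete weights are assumptions for the sketch.
VIDEO_WEIGHT       = 7
VOICE_WEIGHT       = 3
LISTEN_ONLY_WEIGHT = 1

def stream_weighted_load(video:, voice:, listen_only:)
  video * VIDEO_WEIGHT + voice * VOICE_WEIGHT + listen_only * LISTEN_ONLY_WEIGHT
end
```

The balancer would then place new meetings on the server with the lowest weighted load instead of the lowest meeting count.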
Hi @paul1278,
No, you're correct. Scalelite was designed to balance meetings equally across servers based only on the number of meetings. Ideally, multiple factors would tie into correctly load balancing across servers. The ones that come to mind are:
- Server weight
- Number of meetings
- Number of users
- Number of streams
Not quite sure if/when this will be changed, as our current focus is mostly on maintenance of our infrastructure and community support.
The load factor is a very bad metric for actual pressure on a server, and it is also not exposed via the BBB API.
Also, there are scenarios (e.g. scheduled lectures) where meetings are created in advance, within a short timespan, but actual load only increases some minutes after that. With meeting count as the only factor, this is not a problem: each new meeting increments the load factor by one and removes the server from the top of the list immediately. Meetings are distributed evenly. But if we use any other metric, there is a risk that the same server stays the 'best choice' for a while and all new meetings are created on the same server. There MUST be a random factor in server selection if we move away from the meeting-count-only metric.
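One way to add such a random factor (all names hypothetical) is to pick randomly among servers whose load is within a tolerance of the minimum, so fresh meetings don't all pile onto the momentary "best" server:

```ruby
# Choose randomly among all servers whose load is within `tolerance`
# of the least loaded one, instead of always taking the minimum.
def pick_server(servers, tolerance: 0.1)
  min_load = servers.map { |s| s[:load] }.min
  candidates = servers.select { |s| s[:load] <= min_load * (1 + tolerance) }
  candidates.sample
end
```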
Any plans on this? I kind of had this scenario today, where meetings were scheduled on a highly loaded server because there were only two very large meetings on it, while other servers had a couple of meetings with 2-3 users.
While I get that a solution including the Linux load would be nice, I doubt that this will make it into BBB soon; on the other hand, the proposal to incorporate users/video streams does not look like too big of a pull request?
#476 was merged and it is planned to be part of v1.3
Hi @farhatahmad,
oh ok, thank you for that information! Nevertheless good work!
I'll look into providing a PR with a slight improvement over the current situation. I think it should be doable, since the poller requests "getMeetings" anyway, so I'll only have to dig a little bit deeper into the XML, there's a count of Videos and Audios or something in there...
Server weight is a different story completely, because that would mean introducing a new data field into the program. Pretty sure my Ruby isn't good enough to do that 😁
That would be veeeery nice thank you so much!
Oh @einhirn, here's what I did in another script (has nothing to do with Scalelite): per meeting, there are these nodes:
- `participantCount`: the total participants in the meeting
- `listenerCount`: the number of people just listening
- `voiceParticipantCount`: the number of people listening with their microphone turned on
- `videoCount`: the number of people with video turned on
My ruby is not that good either, but if you have any problems, I am happy to help.
I guess you mean the lines here https://github.com/blindsidenetworks/scalelite/blob/master/lib/tasks/poll.rake#L34-L36 ?
Thanks @paul1278, I found out that the "status" task already queries all the fields I want to use anyway, so I just copied that code with minor adjustments 😁
Oh very nice, I forgot about that! Really appreciate your work! :D
One last thing if you have some spare time left: could you add an option so that if an environment variable like "overwrite_load_video" or "overwrite_load_participants" is set, it overwrites the default values when calculating the server load?
Just in case anybody needs that!
I guess environment variables are pretty easy to read using `ENV["whatever"].to_i`, for example. Another example:
`ENV.has_key?('overwrite_load_video') ? ENV['overwrite_load_video'].to_i : 100`
I believe that every element (not only the new ones) of the load calculus should be weighted with an external env variable, so you can choose the final formula with these values based on your needs.
thanks for your efforts
[X] done, I think. See #108
Looking good, but sorry, my fault: Travis shows that `has_key?` is the wrong function to check if a key exists..., it says https://travis-ci.com/github/blindsidenetworks/scalelite/builds/156361390#L317
But it looks like `key?` works the same way:
Seems to work now! Thank you very much!
Great !
Just a question to you all: when using server weights, do you think it's enough that the whole server load is multiplied by a number, e.g. 1 = default weight, 0.5 = the server can handle twice the load of a default server, 2 = the server can handle half the load of a default server, etc.?
I think so.
OK, I also did some work; it seems to work: #113 (load multiplier to weight servers).
Well, the CI tests don't work; will fix them.
Edit: works now
Hi @farhatahmad, thanks for your response in #108 - I can understand that you don't want to introduce something as close to the core of your product just like that.
Especially because the default values I chose off the top of my head will change the behaviour. Maybe setting the default weight values for (video, voice, meetings) to (0, 0, 1) would be a way to introduce it without directly changing the behaviour of existing installs while unconfigured. I went with much lower factors for video and voice streams (7, 3, 1) when deploying it on our farm, but I'm going to run it with my patch, so if you need some data, perhaps I can help.
Of course Paul's idea has its merits too, especially when you put your farm together from differently 'able' hardware...
Anyway, the main issue I see with load balancing this kind of dynamic workload is that you can't move a room from one server to another or even have a room split across two servers. One can't (at least in my case) know how many users or even video streams a room will have over time, so you can just place the room on the least loaded server and hope for the best...
Coming back around to visit this now, we have discussed a bunch of different ways to accomplish this. Piggybacking off of what @jodoma said, simply counting the video streams isn't enough. I need to spend some time exploring different ways to do it, but we'll probably need to somehow factor in a video_streams * users number somewhere into the load
Right, basically @jodoma said that we don't know the future and need a way to distribute the load evenly in spite of that. I've been running my patch for a while now and it seems to work quite well, but still, the round-robin component needs some work for edge cases, I guess. See a screenshot from our monitoring showing the total participants and the participants on each server. Looks nice and even on a Friday with no obvious lectures...
But even on a busy day where the lectures just started this semester (you can spot them easily) the load seems to be quite well balanced:
EDIT: Oh, the blue line on top is the total amount in both images...
I did not see that this issue was still open and posted to #108 instead.
We improved upon this idea and added separate load factors for audio/video downstreams (in addition to upstreams). Downstream counts depend on the individual meeting size and must be calculated per meeting, then summed up. But since we are iterating over all meetings anyway, that was easy to add.
Note that audio and video downstreams must be calculated differently, because BBB mixes audio into a single channel, but does not do that for video. A meeting with 10 participants, each transmitting video, has roughly the same video downstream load as a lecture with a single presenter and 100 viewers.
For an implementation, see my comment in #108
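The audio/video asymmetry described above can be sketched as follows. This is a hypothetical helper, not the actual #108 code: audio is mixed into one channel per listener, while every video stream is forwarded to each other participant individually:

```ruby
# Per-meeting downstream estimate.
def downstreams(users:, videos:)
  {
    audio: users,                 # one mixed audio channel per connected user
    video: videos * (users - 1)   # each stream goes to everyone but its sender
  }
end
```

With this model, a 10-participant meeting where everyone shares video (90 video downstreams) really does cost about as much as a 101-participant lecture with a single presenter (100 video downstreams).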
Maybe taking the real server load into account is the simplest solution. By load I mean `/proc/loadavg`, because this is the load that actually matters. A server is then configured with a `max_load` (the number of cores, i.e. `nproc`). Then you calculate `loadavg / max_load` and use this value to determine the least loaded server. Which of the 3 loadavgs is best I can't tell :)
We wouldn't need to model the load behavior of rooms with videos, listeners, talkers, ...; instead, we query the actual server load. What do you think?
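The `loadavg / max_load` calculation above could look like this. The parsing is kept pure so it's easy to test; on a real server you'd pass in the contents of `/proc/loadavg` and the core count from `nproc`:

```ruby
# /proc/loadavg starts with the 1-, 5- and 15-minute load averages;
# `which` selects one of them (0 = 1-minute).
def normalized_load(loadavg_line, cores, which: 0)
  loadavg_line.split[which].to_f / cores
end

# On a server: normalized_load(File.read('/proc/loadavg'), Integer(`nproc`))
```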
But you need to get that info back to Scalelite - it's not available via the BBB API.
Indeed, meetings created in advance wouldn't be balanced correctly when the load hits later. That's not the case with the current algorithm either. But the only way to balance them correctly before the meetings are fully utilized is to tell the balancer about the expected size beforehand. Otherwise it can't guess the expected load.
If a server stays the best choice, then that's totally OK: it is the least loaded server then. To optimize pre-distribution, we can check if multiple server loads are within some bounds, and select one of them randomly.
Exposing the Linux load via the BBB API should be very simple. And I wouldn't say it's a very bad metric; it's likely the best metric we can have for "now". It would be way better than the current round-robin balancing, which also doesn't look into the future.
If we want to take the future load into account, we have to tell Scalelite the "expected load" somehow (e.g. via meeting metadata) and select a server by that value. To distribute, we could compute `(sum(all_expected_loads_on_the_server) + linux_load) / max_load` and then select the server with the smallest value (plus the bounds-check-and-random-distribution optimization from above).
@einhirn @paul1278 did you see better output or an improvement in performance?
Scalelite should also know if the servers are processing recordings. If so, it should lower their priority via loadMultiplier.
(Beside videoCount, voiceParticipantCount, listenerCount, participantCount and Number of meetings).
Opened this issue here: #291
OK, I've finally been bitten by this. I just had a server collapse (main.js dying) during a rather important conference, because the LB scheduled another large meeting on the same server, while the other node in the cluster was basically free (~7 users vs. 200 on the active node; but 4 meetings on the empty one vs. 2 on the full one. Another meeting with 100 users was then scheduled on top of the second one.)
Is there any way we can get this prioritized higher?
Good point about the knowing-in-advance part. However, as a basic improvement, it would already help if the server with 100 times more users did not get another meeting scheduled (regardless of the new meeting's size).
Besides, I like the algorithm you are proposing.
We would also appreciate an enhancement of the load-balancing strategy. I think @defnull's proposal looks best (#99 (comment)).
Until now we have tended to oversize the number of servers in the pool behind Scalelite in order to prevent a single server from getting overloaded. It is FAR FROM COST-EFFECTIVE!
Please consider giving this issue the highest priority!
You could theoretically also track state over time on the LB, i.e., avg. max users over the past N sessions (maybe using a top-percentile approach to take 'test sessions' with 1-2 users out of the equation).
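Such a history-based estimate could be sketched as follows (a hypothetical helper, not Scalelite code): take a high percentile of past per-session peak user counts, so tiny test sessions don't drag the estimate down:

```ruby
# Estimate a room's expected size from the peak user counts of its
# past sessions, using a percentile instead of the mean.
def expected_size(past_peaks, percentile: 0.9)
  return 0 if past_peaks.empty?
  sorted = past_peaks.sort
  sorted[((sorted.length - 1) * percentile).round]
end
```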
Hi, wouldn't it be even wiser to use this metric:
Sum up for each room the product : number of users in the room * (number of active webcams + 1 (if screen sharing))
It would be roughly the number of video flows treated by the server, which is IMHO what really "loads" the server.
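The metric above can be written out directly. The room fields here are hypothetical names for the sketch:

```ruby
# Per room: users * (active webcams + 1 if screen sharing), summed over
# all rooms; roughly the number of video flows the server handles.
def video_flow_load(rooms)
  rooms.sum do |room|
    room[:users] * (room[:webcams] + (room[:screensharing] ? 1 : 0))
  end
end
```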
Our workload includes a great variation in participant numbers, which leads to the already-mentioned starvation problem when putting several "to-be-big" meetings on one server at the same time. For the moment we start the meeting in advance and then disable the server in Scalelite until enough people have joined.
We think that the KISS approach to handling this situation would be to create meetings with the expected number of participants and increase the server load in Scalelite by this number.
The expected number of participants would be taken from:
- an additional parameter of the create call,
- the `maxParticipants` parameter of the create call,
- a global constant given via the environment (medium room size),
in this order.
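That fallback chain is a few lines of code. `expectedParticipants` and `DEFAULT_ROOM_SIZE` are illustrative names; `maxParticipants` is the real BBB create parameter:

```ruby
# Medium room size fallback, configurable via the environment.
DEFAULT_ROOM_SIZE = ENV.fetch('DEFAULT_ROOM_SIZE', '25').to_i

# First an explicit create parameter, then maxParticipants, then the default.
def expected_participants(params)
  (params['expectedParticipants'] ||
   params['maxParticipants'] ||
   DEFAULT_ROOM_SIZE).to_i
end
```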
Rationale:
- Resource reservation for a conference is immediate. If the current request "loads" a server higher than all other servers, any subsequent request will go to one of them, regardless of (laggy) real-time load.
- There is no need/place for in-advance scheduling of meetings with all its hassles.
- One has to size the cluster big enough for the maximum expected load anyway (and a little bit higher); "squeezing" resources out of "unused slots" in running meetings will lead to overload every now and then.
Additionally, a max-users-per-server warn/reject limit could be introduced. This would be a function of:
- current users, with/without camera, microphone, audio, etc.
- server load
- server capability
and could therefore be derived from real-time measurement. For a start, however, we propose a fixed, per-server number.
I applied @defnull's code proposition from #108 to current master. Maybe this time the PR (#476) will be accepted.
Hmm... Maybe I'm wrong, but it doesn't seem to exist in the code base of 1.3.3.1. Oh well, I'll keep the branch from the PR available, just patch your installation if you need/want this.