
Comments (38)

defnull commented on August 12, 2024 4

Still not sure how scalelite should know in advance how large a meeting will grow. What if a couple of new meetings arrive at the same time that will all be very large in a couple of minutes? How to ensure that these do not end up on the same server?

One idea would be to take meeting age into account and factor in the uncertainty of very fresh meetings. For example: if a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago, it will not grow much more, so assume the current user count is correct. Interpolate between these two.

New meetings will now prefer servers with low user count, but avoid servers with many new meetings for which the final user count is not known yet. This approach would prevent the issues I mentioned earlier, that too many new meetings could end up on the same server.
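
A minimal sketch of this age-based interpolation, assuming a linear ramp over the 15 minutes mentioned above. The helper names and the hash keys (`:users`, `:age_minutes`) are hypothetical and not part of Scalelite:

```ruby
# Hypothetical helper, not part of Scalelite: estimate how many users a
# meeting should count as, depending on how old it is.
SETTLE_MINUTES = 15.0 # after this age the current user count is trusted

def estimated_users(current_users, age_minutes, largest_meeting_users)
  # Fraction of the settling period that has elapsed, clamped to 0..1.
  progress = (age_minutes / SETTLE_MINUTES).clamp(0.0, 1.0)
  # A brand-new meeting counts like the largest meeting (or its own size,
  # if that is already bigger); a settled meeting counts as-is.
  pessimistic = [current_users, largest_meeting_users].max
  pessimistic + (current_users - pessimistic) * progress
end

# Server load as the sum of the estimated sizes of its meetings.
def estimated_server_load(meetings, largest_meeting_users)
  meetings.sum do |m|
    estimated_users(m[:users], m[:age_minutes], largest_meeting_users)
  end
end
```

With a largest meeting of 100 users, an empty meeting counts as 100 users at creation, 50 after 7.5 minutes, and 0 after 15 minutes, so a server that just received several fresh meetings looks busy until their real size is known.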

from scalelite.

jodoma commented on August 12, 2024 2

Just wanted to add a few observations

  • Giving different weights to audio and video is important, but not the full story. Having a single meeting with 50 video streams is different from having 50 meetings with a single video stream each, so there should be a non-linear component.
  • adding weights to servers (I assume to express their capabilities) may not be needed when the overall system load metric is taken into account that expresses how (over)loaded the server is.
  • When speaking about cloud (EC2) deployments, it may make sense to further incorporate the cpu.steal value in the decision, as it describes how overloaded the physical server is that hosts the virtual machine.

Anyway, even with sophisticated metrics in place, the system as built now has some natural inertia, so it may be a bad idea to always pick the least loaded server (assume five new meetings starting within a time interval of 1 minute; it will take some time to see the impact of each new meeting in the metrics, and it will also take some time until all users have joined the meetings). So there still needs to be a round-robin-like element in the heuristic that ensures that in the above scenario not all new meetings end up on the same host. Currently, with only meetings being counted, this round-robin behaviour is built into the algorithm.


ffdixon commented on August 12, 2024 2

If a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago

One approach is to give each meeting a "warmup period" where Scalelite gives it a minimum size of 15 users for the first 15 minutes, and thereafter uses the actual size.
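
A minimal sketch of such a warmup rule, using the 15 users and 15 minutes named above. The constant and method names are hypothetical:

```ruby
# Hypothetical warmup rule, not part of Scalelite.
WARMUP_MINUTES = 15   # length of the warmup period
WARMUP_MIN_USERS = 15 # minimum size assumed during warmup

def counted_users(actual_users, age_minutes)
  # After the warmup period, the actual size is used unchanged.
  return actual_users if age_minutes >= WARMUP_MINUTES

  # During warmup, a meeting never counts as smaller than the minimum.
  [actual_users, WARMUP_MIN_USERS].max
end
```

A fresh, empty meeting counts as 15 users, a fresh meeting with 40 users counts as 40, and a 20-minute-old meeting with 3 users counts as 3.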


mbunkus commented on August 12, 2024 2

We've been running Scalelite with these changes for nearly a year now for a school district in Germany. We regularly have 2000 concurrent users handled by a cluster of 15 conference nodes (12 dedicated CPU cores each). This works nicely. Servers aren't loaded over capacity, we don't overprovision too much. Without those patches things would be unusable, or we'd have to run three times the number of servers. We do not have any kind of foresight, no planned scheduling, just a weighted load based on the number of active video & audio channels instead of the number of conferences.

I think my point is that we've got a "perfect is the enemy of good" situation. Sure, bringing planned meetings into the fray might make things a bit more precise. However, the real advantage and the real improvement here is in going from "solely use the number of conferences" to "guess current load based on actual usage numbers (audio, video, listen-only)". Going from "guess current load based on actual usage numbers (audio, video, listen-only)" to "guess current load based on actual usage numbers (audio, video, listen-only) and take future conferences into account" is a much, much smaller step, maybe one that turns out to be impossible[1]. So why wait for the latter? Why not implement the former for the time being? It would obviously help out a lot of people based on the comments posted here and in other issues & PRs.

So pleeeease, think seriously about implementing a hands-off approach of simply taking the current load weighted by active stream types into account. I'm pretty sure it'd save 90% of the problems people have here.

[1] Possible reasons why it might be impossible & why it isn't a panacea:

  1. Requiring users to schedule their meeting is a hurdle a lot of users won't be willing to take. I know that the teachers in our school district have way too many things to do already, a lot of them aren't tech-savvy, and requiring even more tech interaction & planning from them will simply not work in the real world. We'd have to teach each and every one of them how to estimate conference sizes in advance, how to differentiate the audio types etc.
  2. Even for scheduled meetings you cannot estimate the load correctly as it isn't clear who'll use video, who joins via audio and who joins as a listen-only client. Those have vastly differing load characteristics. Sure, you could implement scheduling parameters for each type, but see 1: no one will want to do that work.
  3. Resource blocking might lead to waste if the resources aren't needed for some reason, e.g. you schedule a meeting, you get sick, forget to cancel the meeting, and then the server will sit there doing nothing.


farhatahmad commented on August 12, 2024 1

Hi @paul1278,

No, you're correct. Scalelite was designed to balance meetings equally across servers based only on the number of meetings. Ideally, multiple factors would feed into correct load balancing across servers. The ones that come to mind are:

  • Server weight
  • Number of meetings
  • Number of users
  • Number of streams

Not quite sure if/when this will be changed, as our current focus is mostly on maintenance of our infrastructure and community support.


defnull commented on August 12, 2024 1

The load factor is a very bad metric for actual pressure on a server, and also not exposed via the BBB API.

Also, there are scenarios (e.g. scheduled lectures) where meetings are created in advance and within a short timespan, but the actual load only increases some minutes after that. With meeting count as the only factor, this is not a problem: each new meeting increments the load factor by one and immediately removes the server from the top of the list, so meetings are distributed evenly. But if we use any other metric, there is a risk that the same server stays the 'best choice' for a while and all new meetings are created on that server. There MUST be a random factor in server selection if we move away from the meeting-count-only metric.


ichdasich commented on August 12, 2024 1

Any plan on this? I kind of had the scenario today, where meetings were scheduled on a highly loaded server, because there were only two very large meetings on it, while other servers had a couple of meetings with 2-3 users.

While I get that the solution including linux load would be nice, I doubt that this will make it in BBB so soonish; On the other hand, the proposal to incorporate users/videostreams does not look like too big of a pull request?


jfederico commented on August 12, 2024 1

#476 was merged and it is planned to be part of v1.3


paul1278 commented on August 12, 2024

Hi @farhatahmad,

oh ok, thank you for that information! Nevertheless good work!


einhirn commented on August 12, 2024

I'll look into providing a PR with a slight improvement over the current situation. I think it should be doable, since the poller requests "getMeetings" anyway, so I'll only have to dig a little bit deeper into the XML, there's a count of Videos and Audios or something in there...
Server weight is a different story completely, because that would mean introducing a new data field into the program. Pretty sure my Ruby isn't good enough to do that 😁


paul1278 commented on August 12, 2024

That would be veeeery nice thank you so much!


paul1278 commented on August 12, 2024

Oh @einhirn, what I did on another script (has nothing to do with scalelite):
per Meeting, there is a node

  • participantCount, holding the total participants in the meeting
  • listenerCount, holding the amount of people just listening
  • voiceParticipantCount, holding the amount of people listening & with microphone turned on
  • videoCount, holding the amount of people with turned on video

My ruby is not that good either, but if you have any problems, I am happy to help.
I guess you mean the lines here https://github.com/blindsidenetworks/scalelite/blob/master/lib/tasks/poll.rake#L34-L36 ?


einhirn commented on August 12, 2024

Thanks @paul1278, I found out that the "status" task already queries all the fields I want to use anyway, so I just copied that code with minor adjustments 😁


paul1278 commented on August 12, 2024

Oh very nice forgot about that! Really appreciate your work! :D

One last thing if you have some spare time left: could you add an option, if an environment-variable like "overwrite_load_video" or "overwrite_load_participants" is set, then it overwrites the default values when calculating the server-load?

Just in case anybody needs that!

I guess environment variables are pretty easy to read when using ENV["whatever"].to_i (example)

Another example:
ENV.has_key?('overwrite_load_video') ? ENV['overwrite_load_video'].to_i : 100


rabser commented on August 12, 2024

I believe that every element (not only the new ones) of the load calculation should be weighted with an external env variable, so you can choose the final formula with these values based on your needs.
thanks for your efforts


einhirn commented on August 12, 2024

[X] done, I think. See #108


paul1278 commented on August 12, 2024

Looking good, but sorry my fault, travis shows that has_key? is the wrong function to check if a key exists..., it says https://travis-ci.com/github/blindsidenetworks/scalelite/builds/156361390#L317

But it looks like key? works the same way:
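
A minimal sketch of such a check, assuming the variable name and the default of 100 from the earlier comment. The helper names are hypothetical; `ENV.key?` and `ENV.fetch` are standard Ruby:

```ruby
# Hypothetical helper: read an integer weight from the environment and
# fall back to a default when the variable is not set.
def load_weight(name, default, env = ENV)
  env.key?(name) ? env[name].to_i : default
end

# ENV.fetch with a default value is a common shorthand for the same check.
def load_weight_fetch(name, default, env = ENV)
  env.fetch(name, default).to_i
end

video_weight = load_weight('overwrite_load_video', 100)
puts "video weight: #{video_weight}"
```

The `env` argument defaults to `ENV` and only exists so the helpers can be tried with a plain hash.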


paul1278 commented on August 12, 2024

Seems to work now! Thank you very much!


rabser commented on August 12, 2024

Great !


paul1278 commented on August 12, 2024

Just a question to you all: when using server weights, do you think it's enough that the whole server load is multiplied by a number, e.g. 1 = default weight, 0.5 = the server can handle two times the load of a default server, 2 = the server can handle half the load of a default server etc.?
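
A minimal sketch of that multiplier idea. The hash keys and method names are hypothetical and only illustrate the arithmetic:

```ruby
# Hypothetical: scale a server's raw load by its multiplier
# (1 = default, 0.5 = twice the capacity, 2 = half the capacity).
def weighted_server_load(raw_load, load_multiplier = 1.0)
  raw_load * load_multiplier
end

# Pick the server with the lowest weighted load.
def least_loaded(servers)
  servers.min_by { |s| weighted_server_load(s[:raw_load], s[:load_multiplier]) }
end

servers = [
  { id: 'default', raw_load: 100, load_multiplier: 1.0 },
  { id: 'big',     raw_load: 150, load_multiplier: 0.5 },
  { id: 'small',   raw_load: 60,  load_multiplier: 2.0 }
]
puts least_loaded(servers)[:id]
```

Here the weighted loads are 100, 75 and 120, so the server with the highest raw load is still chosen because it is assumed to have twice the capacity.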


rabser commented on August 12, 2024

I think so.


paul1278 commented on August 12, 2024

Ok, I also did some work, it seems to work #113 (Load-multiplier to weight servers)

Well, the CI-tests don't work, will fix them.

Edit: works now


einhirn commented on August 12, 2024

Hi @farhatahmad, thanks for your response in #108 - I can understand that you don't want to introduce something as close to the core of your product just like that.
Especially, because the default values I chose off the top of my head will change the behaviour. Maybe setting the default weight values for (Video, Voice, Meetings) to (0,0,1) would be a way to introduce it without directly changing the behaviour in existing installs while unconfigured. I went with much lower factors for video and voice streams (7,3,1) when deploying it on our farm, but I'm going to run it with my patch, so if you need some data, perhaps I can help.
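
A minimal sketch of how such (Video, Voice, Meetings) weights could combine into a server load. This is an illustration under assumptions, not the code from #108; the struct, method and hash keys are hypothetical:

```ruby
# Hypothetical weights for video streams, voice streams and meetings.
LoadWeights = Struct.new(:video, :voice, :meeting)

# Sum a weighted load over all meetings on one server.
def weighted_meeting_load(meetings, weights)
  meetings.sum do |m|
    weights.meeting +
      m[:video_count] * weights.video +
      m[:voice_count] * weights.voice
  end
end

meetings = [
  { video_count: 2, voice_count: 5 },
  { video_count: 0, voice_count: 1 }
]

# (0, 0, 1) reproduces the plain meeting count.
puts weighted_meeting_load(meetings, LoadWeights.new(0, 0, 1))
# (7, 3, 1) also counts video and voice streams.
puts weighted_meeting_load(meetings, LoadWeights.new(7, 3, 1))
```

With weights (0, 0, 1) the result equals the number of meetings, which is why that default would leave unconfigured installs unchanged.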

Of course Paul's idea has its merits too, especially when you put your farm together from differently 'able' hardware...

Anyway, the main issue I see with load balancing this kind of dynamic workload is that you can't move a room from one server to another or even have a room split across two servers. One can't (at least in my case) know how many users or even video streams a room will have over time, so you can just place the room on the least loaded server and hope for the best...


farhatahmad commented on August 12, 2024

Coming back around to visit this now, we have discussed a bunch of different ways to accomplish this. Piggybacking off of what @jodoma said, simply counting the video streams isn't enough. I need to spend some time exploring different ways to do it, but we'll probably need to somehow factor in a video_streams * users number somewhere into the load


einhirn commented on August 12, 2024

Right, basically @jodoma said that we don't know the future and need a way to distribute the load evenly in spite of that. I've been running my patch for a while now and it seems to work quite well, but still, the "round-robin" component needs some work for edge cases, I guess. See a screenshot from our monitoring containing the total participants and the participants on each server. Looks nice and even on a Friday with no obvious lectures...
[screenshot: total participants and participants per server on a quiet Friday]
But even on a busy day where the lectures just started this semester (you can spot them easily) the load seems to be quite well balanced:
[screenshot: total participants and participants per server on a busy lecture day]
EDIT: Oh, the blue line on top is the total amount in both images...


defnull commented on August 12, 2024

I did not see that this issue was still open and posted to #108 instead.

We improved upon this idea and added separate load factors for audio/video downstreams (in addition to upstreams). Downstream counts depend on the individual meeting size and must be calculated per meeting, then summed up. But since we are iterating over all meetings anyway, that was easy to add.
Note that audio and video downstreams must be calculated differently, because BBB mixes audio into a single channel, but does not do that for video. A meeting with 10 participants, each transmitting video, has roughly the same video downstream load as a lecture with a single presenter and 100 viewers.

For an implementation, see my comment in #108
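
The downstream reasoning above can be sketched as follows. This is not the implementation from #108: the weights are made-up placeholders, and the formulas assume that every participant receives one mixed audio stream and every video stream except their own:

```ruby
# Hypothetical per-stream weights; real values would need tuning.
STREAM_WEIGHTS = { video_up: 10, video_down: 2, audio_up: 2, audio_down: 1 }.freeze

# Estimate stream counts for a single meeting.
def meeting_streams(participants:, video_senders:, audio_senders:)
  {
    video_up: video_senders,
    # Video is not mixed: each participant receives every other video.
    video_down: video_senders * [participants - 1, 0].max,
    audio_up: audio_senders,
    # Audio is mixed into one channel: one downstream per participant.
    audio_down: participants
  }
end

# Weighted load of one meeting.
def meeting_load(meeting, weights = STREAM_WEIGHTS)
  meeting_streams(**meeting).sum { |kind, count| weights.fetch(kind) * count }
end

# Downstreams depend on meeting size, so compute per meeting, then sum.
def server_stream_load(meetings, weights = STREAM_WEIGHTS)
  meetings.sum { |m| meeting_load(m, weights) }
end
```

Under these assumptions a meeting of 10 participants who all send video has 90 video downstreams, and a lecture with one presenter and 100 viewers has 100, which matches the "roughly the same" observation.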


TheJJ commented on August 12, 2024

Maybe taking the real server load into account is the simplest solution. By load I mean /proc/loadavg, because this is the load that actually matters. A server is then configured to have a max_load (the number of cores, i.e. nproc). Then you calculate loadavg/max_load and use this value to determine the least-loaded-server. Which of the 3 loadavgs is best I can't tell :)
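
A minimal sketch of the loadavg/max_load calculation, meant to run on the conference server itself. The method names are hypothetical, and how the value would be reported back to Scalelite is left open:

```ruby
require 'etc'

# Parse the 1, 5 and 15 minute averages from a /proc/loadavg line.
def parse_loadavg(line)
  line.split.first(3).map(&:to_f)
end

# Load relative to capacity; window 0, 1 or 2 selects which of the
# three averages is used.
def relative_load(loadavg_line, max_load, window: 0)
  parse_loadavg(loadavg_line)[window] / max_load
end

# On Linux, use the number of processors as max_load, as suggested above.
if File.readable?('/proc/loadavg')
  puts relative_load(File.read('/proc/loadavg'), Etc.nprocessors)
end
```

A value near 1.0 means the run queue is about as long as the number of cores. The `window` parameter leaves the choice between the three averages open.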

We wouldn't need to model the load behavior of rooms with videos, listeners, talkers, ..., instead, we query the actual server load. What do you think?


einhirn commented on August 12, 2024

but you need to get that info back to scalelite - it's not available via BBB API.


TheJJ commented on August 12, 2024

Indeed, creating meetings in advance wouldn't be balanced when the meeting load hits later. That's not the case with the current algorithm either. But the only way to balance them correctly before the meetings are fully utilized, is to tell the balancer about the expected size beforehand. Otherwise it can't guess the expected load.

If a server stays the best choice, then that's totally OK: it is the least loaded server then. To optimize pre-distribution, we can check whether multiple server loads are within some bounds of each other and select one of them randomly.

Exposing the linux load via BBB api should be very simple. And I wouldn't say it's a very bad metric, it's likely the best metric we can have for "now". It would be way better than the current round-robin balancing, which also doesn't look into the future.

If we want to take the future load into account, we have to tell Scalelite the "expected load" (e.g. via meeting metadata) and then select a server by that value. To distribute, we could compute

(sum(all_expected_loads_on_the_server) + linux_load) / max_load

and then select the server which got the smallest value (plus the bounds-check-and-randomly-distribution-optimization from above).
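
A minimal sketch of this selection, combining the formula with the random choice among servers within some bounds. The hash keys, the method names and the tolerance of 0.05 are hypothetical:

```ruby
# (sum(all_expected_loads_on_the_server) + linux_load) / max_load
def server_score(server)
  (server[:expected_loads].sum + server[:linux_load]) / server[:max_load].to_f
end

# Choose randomly among all servers whose score is within `tolerance`
# of the best score, so one server does not stay the only choice.
def pick_server(servers, tolerance: 0.05, rng: Random.new)
  scored = servers.map { |s| [s, server_score(s)] }
  best = scored.map(&:last).min
  candidates = scored.select { |_, score| score <= best + tolerance }.map(&:first)
  candidates.sample(random: rng)
end

servers = [
  { id: 'a', expected_loads: [2, 1], linux_load: 3.0, max_load: 12 },
  { id: 'b', expected_loads: [],     linux_load: 6.2, max_load: 12 },
  { id: 'c', expected_loads: [8],    linux_load: 4.0, max_load: 12 }
]
puts pick_server(servers)[:id] # either "a" or "b"
```

Servers `a` and `b` score about 0.50 and 0.52, so both are candidates, while `c` scores 1.0 and is never picked. With `tolerance: 0` the choice falls back to the strict minimum.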


ryprfpryr commented on August 12, 2024

@einhirn @paul1278 did you see better results or an improvement in performance?

Scalelite should also know if the servers are processing recordings. If true it should lower their priority on loadMultiplier.
(Beside videoCount, voiceParticipantCount, listenerCount, participantCount and Number of meetings).

Opened this issue here: #291


ichdasich commented on August 12, 2024

OK, I have finally been bitten by this. I just had a server collapse (main.js dying) during a rather important conference, because the LB scheduled another large meeting on the same server while the other node in the cluster was basically free (~7 users vs. 200 on the active node, but 4 meetings on the empty one vs. 2 on the full one; another meeting with 100 users was then scheduled on top of the second one).

Is there any way we can get this prioritized higher?


ichdasich commented on August 12, 2024

Good point about the knowing in advance part. However, as a basic thing, it would already help if the server with 100 times more users does not get another meeting scheduled (regardless of the new meeting's size).

Besides, I like the algorithm you are proposing.


pielonet commented on August 12, 2024

We would also appreciate an enhancement in the load-balancing strategy. I think @defnull's proposal looks best (#99 (comment))
Until now we have tended to oversize the number of servers in the pool behind Scalelite in order to prevent a single server from getting overloaded. It is FAR FROM COST EFFECTIVE !
Please consider giving this issue the highest priority !


ichdasich commented on August 12, 2024

You could theoretically also track state across time on the LB, i.e., avg. max users over the past N sessions (maybe using a top N percentile approach to get 'test sessions' with 1-2 users out of the equation).
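
A minimal sketch of such a history-based estimate, assuming the balancer stores the peak user count of each past session of a room. The method name and the default of keeping the top half are hypothetical:

```ruby
# Estimate a room's expected size from the peak user counts of its past
# sessions, averaging only the largest fraction so that short test
# sessions with 1-2 users do not drag the estimate down.
def expected_size(past_peaks, top_fraction: 0.5)
  return 0 if past_peaks.empty?

  keep = [(past_peaks.size * top_fraction).ceil, 1].max
  top = past_peaks.sort.last(keep)
  top.sum / top.size.to_f
end

puts expected_size([1, 2, 80, 100, 90, 2])
```

For the peaks 1, 2, 80, 100, 90 and 2, the three largest values are averaged and the three test sessions are ignored.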


pielonet commented on August 12, 2024

Hi, wouldn't it be even wiser to use this metric :
Sum up for each room the product : number of users in the room * (number of active webcams + 1 (if screen sharing))
It would be roughly the number of video flows treated by the server, which is IMHO what really "loads" the server.
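
A minimal sketch of this metric. The hash keys and the method name are hypothetical:

```ruby
# Sum over all rooms: users * (active webcams + 1 if screen sharing).
def video_flow_load(rooms)
  rooms.sum do |room|
    sources = room[:webcams] + (room[:screen_sharing] ? 1 : 0)
    room[:users] * sources
  end
end

rooms = [
  { users: 10,  webcams: 10, screen_sharing: false }, # 10 * 10 = 100
  { users: 100, webcams: 1,  screen_sharing: true }   # 100 * 2 = 200
]
puts video_flow_load(rooms)
```

Because the product grows with both room size and the number of video sources, one large room with many webcams weighs far more than many small rooms.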


glxnet commented on August 12, 2024

Our workload includes a great variation of participant numbers - which leads to the already mentioned starvation problem when putting several "to be big" meetings on one server at a time. For the moment we start the meeting in advance and then disable the server in scalelite until enough people are in.

We think that the KISS approach to handle this situation would be to create meetings with the expected number of participants and increase the server load in Scalelite by this number.
The expected number of participants would be taken from:

  1. an additional parameter of the create call,
  2. the maxParticipants parameter of the create call,
  3. a global constant given via the environment (medium room size).

in this order.
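
A minimal sketch of that fallback order. `maxParticipants` is the existing create parameter named above; the extra parameter `expectedParticipants` and the environment variable `DEFAULT_ROOM_SIZE` are hypothetical names chosen for illustration:

```ruby
# Resolve the expected number of participants for a create call.
def expected_participants(create_params, env = ENV)
  candidates = [
    create_params['expectedParticipants'], # hypothetical extra parameter
    create_params['maxParticipants'],      # existing create parameter
    env['DEFAULT_ROOM_SIZE']               # hypothetical global constant
  ]
  # Take the first candidate that is a positive number; missing values
  # become 0 via to_i and are skipped.
  candidates.map(&:to_i).find(&:positive?) || 0
end

puts expected_participants({ 'maxParticipants' => '100' })
```

The resolved number would then be added to the chosen server's load as soon as the meeting is created, so the reservation takes effect before any user has joined.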

Rationale:

  • Resource reservation for a conference is immediate. If a current request "loads" a server higher than all other servers any subsequent request will go to one of them, regardless of (laggy) real time load.
  • There is no need/place for in-advance scheduling of meetings with all its hassles.
  • One has to size the cluster anyway big enough for the maximum load to expect (and a little bit higher), "squeezing" out resources from "unused slots" in running meetings will lead to overload every now and then.

Additionally a max users per server warn/reject limit could be introduced. This would be a function of:

  • current users, with/without camera, microphone, audio, etc.
  • server load
  • server capability

and could therefore be derived from real time measurement. For a starter, however, we propose a fixed, per server number.


einhirn commented on August 12, 2024

I applied @defnull's code proposition from #108 to current master. Maybe this time the PR (#476) will be accepted.


einhirn commented on August 12, 2024

Hmm... Maybe I'm wrong, but it doesn't seem to exist in the code base of 1.3.3.1. Oh well, I'll keep the branch from the PR available, just patch your installation if you need/want this.

