Comments (13)

vdmit11 commented on August 10, 2024

Well, the TFW_SCHED_MAX_SERVERS was added for a purpose.
A fixed-size array of servers has certain advantages over a linked list:

  • You can use binary search (a linked list can be transformed into a skip list, but that is more complicated).
  • You can allocate per-CPU arrays easily; the small value of TFW_SCHED_MAX_SERVERS allows doing that statically (see the sketch below). Dynamic per-CPU linked lists are not that easy.
  • All the memory is packed together, which is good for caching, hence better performance.
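
A minimal sketch of that per-CPU static allocation, assuming illustrative names (sched_cpu_data, sched_data) rather than actual Tempesta FW definitions; TfwServer is the Tempesta FW server type:

#include <linux/percpu.h>

#define TFW_SCHED_MAX_SERVERS 64

struct sched_cpu_data {
    size_t    n_srv;
    /* Contiguous storage keeps the hot data cache-friendly. */
    TfwServer *servers[TFW_SCHED_MAX_SERVERS];
};

/* A fixed-size array per CPU, allocated statically at compile time. */
static DEFINE_PER_CPU(struct sched_cpu_data, sched_data);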

Of course, we can re-allocate the array as needed and so on,
but to me it looks easier to have two separate modules for the two cases:

  • A module for small groups of servers that are online most of the time.
  • Another module for large groups of servers that go offline all the time.

These two cases require different implementations and involve different optimizations, so I think we really need separate modules.

krizhanovsky commented on August 10, 2024
    ....to have two separate modules for the two cases:

    A module for small groups of servers that are online most of the time.
    Another module for large groups of servers that go offline all the time.

This is not different logic, these are just different cases, so they should be handled in the same code base. Probably you can just allocate an array for a small server set and use a hash table or tree to handle thousands of servers. But the different containers should be processed by the same logic.

Or please give an example of logic which is fundamentally different between the two cases.
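
For illustration, a hedged sketch of "same logic, different containers" — the names (srv_group_ctnr, sg_nth_srv) are made up, not Tempesta FW API:

#include <linux/hashtable.h>

enum sg_ctnr_type { SG_ARRAY, SG_HASH };

struct srv_group_ctnr {
    enum sg_ctnr_type type;
    union {
        struct {                      /* small, mostly-online groups */
            TfwServer **srv;
            size_t    n;
        } arr;
        DECLARE_HASHTABLE(ht, 10);    /* thousands of servers */
    };
};

/* The scheduler calls a single accessor; only the container differs. */
static TfwServer *
sg_nth_srv(struct srv_group_ctnr *c, size_t i)
{
    if (c->type == SG_ARRAY)
        return i < c->arr.n ? c->arr.srv[i] : NULL;
    /* SG_HASH: walk the bucket for key i; elided for brevity. */
    return NULL;
}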

krizhanovsky commented on August 10, 2024

The system should dynamically establish new connections to busy upstream servers and also dynamically shrink redundant connections (also applicable to the forward proxy case).

UPD. It still makes sense to be able to change the number of connections to upstream servers. However, Tempesta FW will not support forward proxying. With wide HTTPS usage, forward proxying is limited to corporate networks and other small installations which do not process millions of requests per second; there is no ISP usage any more. So this is a completely different use case with a different environment and requirements.

UPD 2. I created a new issue #710 for the functionality, so no need to implement it this time.

krizhanovsky commented on August 10, 2024

As we've seen in our performance benchmarks, and as shown in third-party benchmarks, HTTP servers like Nginx or Apache HTTPD show quite low performance on 4 concurrent connections, so our current default of 4 server connections and the maximum of 32 are just inadequate. I'd say 32 connections as the default, with VMs running Tempesta together with a user-space HTTP server in mind, and 32768 as the maximum (USHORT_MAX - 1024, which is 64512, is the maximum number of ephemeral ports).
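
For illustration, a Tempesta FW configuration fragment with such a default (the address is made up; conns_n is the per-server connections option):

srv_group app {
    # 32 connections per server instead of the old default of 4.
    server 192.168.1.10:8080 conns_n=32;
}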

The main consequence of the issue is that all current scheduling algorithms must be reworked to support dynamically sized arrays.
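
A minimal sketch of that rework, under the assumption of a growable array (sched_grp and sched_add_srv are illustrative names):

#include <linux/slab.h>

struct sched_grp {
    TfwServer **srv;
    size_t    n, cap;
};

static int
sched_add_srv(struct sched_grp *g, TfwServer *srv)
{
    if (g->n == g->cap) {
        /* Double the capacity on demand instead of a fixed limit. */
        size_t cap = g->cap ? 2 * g->cap : 8;
        TfwServer **p = krealloc(g->srv, cap * sizeof(*p), GFP_KERNEL);

        if (!p)
            return -ENOMEM;
        g->srv = p;
        g->cap = cap;
    }
    g->srv[g->n++] = srv;
    return 0;
}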

A naive solution could be to keep scheduler data per CPU and establish a number of upstream connections equal to N * CPU_NUM. However, Tempesta FW can service thousands of weak virtualized servers, so if it runs on, say, 128-core hardware, then even N = 4 means 512 connections to every server; we would maintain too many redundant connections and cause unnecessary load on the weak servers.

The issue relates to #51, since that also updates the schedulers code.

krizhanovsky commented on August 10, 2024

While the 2-tier schedulers certainly should be modified to support dynamically sized arrays, the real performance issue is with the HTTP scheduler, which in practice must be able to process thousands of server groups. The problem is in tfw_http_match_req(), which traverses a list of thousands of rules and performs string matching against each item. The matcher must be reworked to keep the rules in a hash table, so that we can make a quick jump by a rule key. The key can be calculated from the string and the ID of the HTTP field.
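
A rough sketch of that hash-based matcher, assuming the rule key mixes the HTTP field ID into a hash of the field value (rule_ht, match_rule, rule_lookup are illustrative names, not the actual rework):

#include <linux/hashtable.h>
#include <linux/jhash.h>
#include <linux/string.h>

struct match_rule {
    int               field_id;   /* e.g. Host or URI */
    const char        *val;
    size_t            len;
    struct hlist_node node;
};

static DEFINE_HASHTABLE(rule_ht, 12);    /* 4096 buckets */

static u32
rule_key(int field_id, const char *s, size_t len)
{
    return jhash(s, len, field_id);      /* the field ID seeds the hash */
}

/* One bucket probe replaces the linear walk over thousands of rules. */
static struct match_rule *
rule_lookup(int field_id, const char *s, size_t len)
{
    struct match_rule *r;

    hash_for_each_possible(rule_ht, r, node, rule_key(field_id, s, len))
        if (r->field_id == field_id && r->len == len
            && !strncasecmp(r->val, s, len))
            return r;
    return NULL;
}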

In the current milestone these constants should be eliminated in PRs #670 and #666.

UPD. This comment was split out into a new issue #732, so it shouldn't be done in the context of #76.

vankoven commented on August 10, 2024

All the requirements are already implemented or moved to separate issues/tasks.

krizhanovsky commented on August 10, 2024

It seems the issue is done, but we still have no results from the #680 test. Let's close it if the test shows that we are really able to efficiently handle 1M hosts.

vladtcvs commented on August 10, 2024

Creating many backends, with 1 backend per server group, causes problems. Creating 16 interfaces with 64 ports per interface produces:

ERROR: start() for module 'sock_srv' returned the error: -12 - ENOMEM

8x32 (interfaces × ports): TCP: Too many orphaned sockets plus kmemleak messages.
8x128: many more TCP: Too many orphaned sockets messages and much more kmemleak output.

Backends are created with nginx, a single nginx instance per interface; each nginx config contains a server {} block for each port, as sketched below.

Ports used: 16384, 16375, etc. for each interface.
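
A hypothetical minimal nginx.conf fragment matching that setup (addresses and ports are illustrative):

http {
    # One server {} block per port, 64 such blocks per interface.
    server { listen 192.168.100.1:16384; return 200; }
    server { listen 192.168.100.1:16385; return 200; }
    # ... and so on for the remaining ports
}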

vladtcvs commented on August 10, 2024

testing: test_1M.py from vlts-680-1M

krizhanovsky commented on August 10, 2024

I didn't notice TCP: Too many orphaned sockets messages, but using the tempesta_fw.conf generated by @ikoveshnikov's script and his nginx.conf (both are attached), I see that sysctl -w net.tempesta.state=stop (on --restart) or sysctl -w net.tempesta.state=start (on --reload) takes about 20 seconds for a 1000-backend config. The call stack for the sysctl process (note synchronize_sched(), a full RCU grace period, apparently executed for every released server group):

[<ffffffffafecb533>] __wait_rcu_gp+0xc3/0xf0
[<ffffffffafecfc9c>] synchronize_sched.part.65+0x3c/0x60
[<ffffffffafecfdf0>] synchronize_sched+0x30/0x90
[<ffffffffc027f169>] tfw_sched_ratio_del_grp+0x49/0x80 [tfw_sched_ratio]
[<ffffffffc043e462>] tfw_sg_release+0x22/0x80 [tempesta_fw]
[<ffffffffc043e512>] tfw_sg_release_all+0x52/0xb0 [tempesta_fw]
[<ffffffffc0443656>] tfw_sock_srv_stop+0xb6/0xd0 [tempesta_fw]
[<ffffffffc043c19c>] tfw_ctlfn_state_io+0x19c/0x530 [tempesta_fw]
[<ffffffffb004e025>] proc_sys_call_handler+0xe5/0x100
[<ffffffffb004e04f>] proc_sys_write+0xf/0x20
[<ffffffffaffca322>] __vfs_write+0x32/0x160
[<ffffffffaffcb660>] vfs_write+0xb0/0x190
[<ffffffffaffcca83>] SyS_write+0x53/0xc0
[<ffffffffb03dd72e>] entry_SYSCALL_64_fastpath+0x1c/0xb1

scrip_cfg.tar.gz

krizhanovsky commented on August 10, 2024

After the fix 6d11ff1, perf top shows the following for a 100K-server reconfiguration:

    76.25%  [kernel]            [k] strcasecmp
    16.01%  [tempesta_fw]       [k] tfw_cfgop_begin_srv_group
     5.64%  [tempesta_fw]       [k] tfw_sg_lookup_reconfig

krizhanovsky commented on August 10, 2024

After the fix 94b18ed, the performance profile became:

    62.33%  [tempesta_fw]  [k] tfw_cfgop_begin_srv_group
     9.25%  [tempesta_fw]  [k] tfw_apm_prcntl_tmfn
     7.98%  [tempesta_fw]  [k] __tfw_stricmp_avx2

However, reloading 10K server groups takes about 30 seconds, the same as a full restart. tempesta_fw.conf for 10K servers is about 1MB, so all the parsing and server-group manipulation, e.g. tfw_cfgop_begin_srv_group(), takes time.

krizhanovsky commented on August 10, 2024

With the commit c58993a (also https://github.com/tempesta-tech/linux-4.9.35-tfw/commit/f20d5703592ce3078d3415edbc5b2703f614d9b7 for the kernel) I still cannot normally start Tempesta FW with 30K backends using the configuration from #680 (comment). (Surely it'd be better to use many IP addresses and ports to avoid lock contention on a single TCP socket.) The system hangs on softirq softlockups. Only the following patch allows Tempesta FW to start "normally":

diff --git a/tempesta_fw/apm.c b/tempesta_fw/apm.c
index b82a3ce..5f78ee1 100644
--- a/tempesta_fw/apm.c
+++ b/tempesta_fw/apm.c
@@ -1034,9 +1034,10 @@ tfw_apm_add_srv(TfwServer *srv)
 
        /* Start the timer for the percentile calculation. */
        set_bit(TFW_APM_DATA_F_REARM, &data->flags);
+       goto AK_DBG;
        setup_timer(&data->timer, tfw_apm_prcntl_tmfn, (unsigned long)data);
        mod_timer(&data->timer, jiffies + TFW_APM_TIMER_INTVL);
-
+AK_DBG:
        srv->apmref = data;
 
        return 0;
diff --git a/tempesta_fw/sock_srv.c b/tempesta_fw/sock_srv.c
index dc9e0ba..3b4e361 100644
--- a/tempesta_fw/sock_srv.c
+++ b/tempesta_fw/sock_srv.c
@@ -227,7 +227,12 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
        /* Don't rearm the reconnection timer if we're about to shutdown. */
        if (unlikely(!ss_active()))
                return;
-
+{
+       static unsigned long delta = 0;
+       timeout = 1000 + delta;
+       delta += 10;
+       goto AK_DBG_end;
+}
        if (srv_conn->recns < ARRAY_SIZE(tfw_srv_tmo_vals)) {
                if (srv_conn->recns)
                        TFW_DBG_ADDR("Cannot establish connection",
@@ -249,7 +254,7 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
                timeout = tfw_srv_tmo_vals[ARRAY_SIZE(tfw_srv_tmo_vals) - 1];
        }
        srv_conn->recns++;
-
+AK_DBG_end:
        mod_timer(&srv_conn->timer, jiffies + msecs_to_jiffies(timeout));
 }
 
@@ -2119,7 +2124,7 @@ static TfwCfgSpec tfw_srv_group_specs[] = {
        },
        {
                .name = "server_connect_retries",
-               .deflt = "10",
+               .deflt = "1", // AK_DBG "10",
                .handler = tfw_cfgop_in_conn_retries,
                .spec_ext = &(TfwCfgSpecInt) {
                        .range = { 0, INT_MAX },

The reason is #736: TIMER_SOFTIRQ is the highest-priority softirq, we set up about 60K timers for the 30K-group test, and the timer functions aren't so lightweight. So the timers just block any other activity in the system and don't allow it to make progress (the debug patch above works around this by staggering the reconnect timeouts with a growing delta).
