Before we jump full on board with FastClick I would like to get a better understanding

Differences between FastClick and Click about fastclick HOT 8 CLOSED

tbarbette commented on August 23, 2024


from fastclick.

Comments (8)

tbarbette commented on August 23, 2024

Full push simply means using only push elements; it is absolutely not mandatory. It provides a good improvement, but you certainly don't want to make the full jump in one go. In fact, even if you avoid inter-core communication, a parallel approach that still uses push-to-pull will lose performance because of the useless notifications. As for the underlying question of a parallel versus a pipeline approach: I did a deeper study in my thesis of when pipeline is better than parallel across many factors, and the answer is pretty much never. So my point is: don't change anything at first, but try to move towards a full-push/parallel approach. If you want to keep a pipelined threading model, look at the Pipeliner element, which keeps the full-push semantics but changes core at the Pipeliner element itself.
DPDK elements do allow clones and indirect references. I'm not sure where you read otherwise. DPDK has built-in support for pretty much the same feature as Click's shared packets. You can hand a DPDK buffer with a use_count of 2 to a DPDK device for transmission, and the buffer will not be freed after being sent. So you can keep buffered copies of packets that were sent (for retransmission, probably) totally for free. However, if you want to be sure you avoid the packet copy, we should double-check it against your usage. Somehow you have to convert Click's use_count == 2 to the DPDK-internal one. Packet->clone(true) will do this DPDK-internal shadow copy, but then Click will not be aware of the underlying shadow clones and you may write to the buffer concurrently.
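The reference-counting idea described above can be illustrated with a toy model. This is only a sketch: `ToyBuffer`, `toy_retain`, `toy_release` and `toy_transmit` are hypothetical names invented for the illustration and are neither DPDK nor Click APIs. It only shows why a buffer sent with a count of 2 survives transmission.

```cpp
#include <cassert>
#include <vector>

// Toy model only: this is NOT the DPDK or Click API.
// A buffer carries a reference count; storage is released on the last drop.
struct ToyBuffer {
    std::vector<unsigned char> data;
    int refcnt = 1;      // one reference held by the creator
    bool freed = false;  // set once the storage has been released
};

// Take an extra reference, e.g. to keep the packet for retransmission.
inline void toy_retain(ToyBuffer &b) { ++b.refcnt; }

// Drop one reference; release the storage when the count reaches zero.
inline void toy_release(ToyBuffer &b) {
    assert(b.refcnt > 0);
    if (--b.refcnt == 0) {
        b.data.clear();
        b.freed = true;
    }
}

// Model of a device that sends the buffer, then drops its own reference.
inline void toy_transmit(ToyBuffer &b) { toy_release(b); }
```

With one extra `toy_retain()` before `toy_transmit()`, the data is still there after the send and only goes away when the holder calls `toy_release()` itself.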
There is no change to the threading model. FromDPDKDevice and FromNetmapDevice have some auto-pinning facilities, but if you StaticThreadSched them, auto-pinning is automatically disabled. You do, however, get a lot of helpers for multi-threading. Any element can call get_passing_threads() to receive a bitvector telling which threads may pass through the element. That is fully compatible with StaticThreadSched and good old pull-push queuing, so you can tailor locks and data structures to what is actually going to happen. Dynamic scheduling, however, is not supported (it is partly supported in the Metron branch; if you want it, let's start a second issue). The MTSafe element is also able to detect when multiple threads will pass through an element that has not been marked ELEMENT_MT_SAFE in its .cc file.
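To make "tailor locks to what's going to happen" concrete, here is a minimal sketch that assumes the passing-threads information is available as a plain bitset. `ToyCounter` and `kMaxThreads` are hypothetical, and the real return type of get_passing_threads() is not modelled here.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kMaxThreads = 64;  // assumed upper bound for this sketch

// Hypothetical element state: the counter takes a lock only when more than
// one thread may pass through the element.
class ToyCounter {
public:
    explicit ToyCounter(const std::bitset<kMaxThreads> &passing)
        : needs_lock_(passing.count() > 1) {}

    void add(unsigned long n) {
        if (needs_lock_) {
            std::lock_guard<std::mutex> guard(lock_);
            total_ += n;
        } else {
            total_ += n;  // single thread: no synchronisation needed
        }
    }

    bool needs_lock() const { return needs_lock_; }
    unsigned long total() const { return total_; }

private:
    bool needs_lock_;
    std::mutex lock_;
    unsigned long total_ = 0;
};
```

The decision is taken once, at construction, so the per-packet path only pays for the lock when several threads can actually reach the element.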
Batching is compatible with pull and push. It is one of the biggest improvements of FastClick, so yes you want that in the long term. But start the transition with --disable-batch, no full jump in one go.
Yes, we converted a lot of standard elements. Actually, all the ones we ever needed have been converted for batching, and most of them for multi-threading too. You do not need to convert elements that are not in the fast path, like ICMP errors etc. When an element in the pipeline is not batch-compatible, push_batch will automatically fall back to a series of push() calls. Similarly, pull_batch(max) will build a batch from a series of up to max pull() calls.
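The fallback behaviour described above can be sketched as follows. This is a toy model, not FastClick code: `ToyPacket`, `ToyBatch`, `toy_push_batch` and `toy_pull_batch` are hypothetical names, and packets are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using ToyPacket = int;                    // stand-in for a packet
using ToyBatch = std::vector<ToyPacket>;  // stand-in for a packet batch

// Fallback for a non-batch element: hand each packet of the batch to push().
inline void toy_push_batch(const ToyBatch &batch,
                           const std::function<void(ToyPacket)> &push) {
    for (ToyPacket p : batch) push(p);
}

// Build a batch from up to `max` pull() calls. Here pull() returns false
// when the upstream has nothing left (an assumption of this sketch).
inline ToyBatch toy_pull_batch(std::size_t max,
                               const std::function<bool(ToyPacket &)> &pull) {
    ToyBatch batch;
    ToyPacket p = 0;
    while (batch.size() < max && pull(p)) batch.push_back(p);
    return batch;
}
```

If the upstream holds five packets, `toy_pull_batch(3, pull)` returns three of them and a following `toy_pull_batch(8, pull)` returns the remaining two.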
There are a lot of macros to help you make elements batch-compatible. Basically, what you'll want to do is convert all elements in the fast path, going from the input towards the output. When you launch click, you will be helped by messages that tell you where batching support stops. I'll add a TODO to somehow mark the batch-compatible elements automatically in the documentation.
Then there is auto-batching. --enable-auto-batch=port will attempt to automatically reconstruct batches at all push ports of vanilla elements. When the element calls output(0).push(), the packet is buffered in the port, and when the push function exits, all batches are flushed. auto-batch=jump will instead buffer at all downstream batch-compatible elements. Either way you keep the non-batch-compatible parts of your pipeline but re-batch when possible. To know which one is best, you have to try... The auto-batch=list mode does the same as jump but pre-builds a list of compatible elements at configuration time, which may or may not improve performance. List is what was presented in the paper. It does not work anymore, actually... My latest work was on auto-batch=port.
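A rough illustration of the port mode: single pushes are buffered at the output port and emitted as one batch when the element's push function returns. `ToyBufferedPort` and `toy_vanilla_push` are hypothetical names for this sketch and do not reflect how FastClick implements it internally.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using ToyPacket = int;                    // stand-in for a packet
using ToyBatch = std::vector<ToyPacket>;  // stand-in for a packet batch

// Hypothetical output port: buffers single pushes and emits them as one
// batch on flush().
class ToyBufferedPort {
public:
    void push(ToyPacket p) { pending_.push_back(p); }

    void flush() {
        if (pending_.empty()) return;
        emitted_.push_back(std::move(pending_));
        pending_.clear();  // bring the moved-from vector back to a known state
    }

    const std::vector<ToyBatch> &emitted() const { return emitted_; }

private:
    ToyBatch pending_;
    std::vector<ToyBatch> emitted_;
};

// A "vanilla" element that pushes packets one by one; the port is flushed
// when its push function is about to return.
inline void toy_vanilla_push(ToyBufferedPort &out, const ToyBatch &input) {
    for (ToyPacket p : input) out.push(p + 1);  // some per-packet work
    out.flush();
}
```

Downstream elements then see one batch of three packets instead of three separate pushes, even though the element itself knows nothing about batching.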
But the best is of course to convert all fast path elements to be batch compatible.
There are a lot of locking and multithreading facilities in multithread.hh, and a thread-safe HashTable in hashtablemp.hh. But it was tested on x86 only! Then you have the helpers around batching. I added support for 64-bit atomics, but in most cases the per_thread template is what you'd want. The hash table uses hierarchical locking and actually performs very well. Its semantics are a little bit different from (and actually easier to use than) Click's. I also built an RCU data structure that is available there.
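The idea behind a per-thread container, as opposed to a shared atomic, can be sketched like this. `ToyPerThread` is a hypothetical stand-in and not FastClick's per_thread template, whose actual interface is not shown in this thread. The 64-byte cache line is an assumption, and over-aligned storage in a std::vector relies on C++17 allocation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy per-thread container: each thread updates its own slot without any
// lock or atomic, and a reader aggregates the slots when needed.
template <typename T>
class ToyPerThread {
public:
    explicit ToyPerThread(std::size_t nthreads) : slots_(nthreads) {}

    // Slot owned by the given thread; throws std::out_of_range if invalid.
    T &at(std::size_t thread_id) { return slots_.at(thread_id).value; }

    // Aggregate all slots, e.g. from a read handler.
    T sum() const {
        T total{};
        for (const auto &s : slots_) total += s.value;
        return total;
    }

private:
    // One slot per assumed 64-byte cache line, to avoid false sharing.
    struct alignas(64) Slot { T value{}; };
    std::vector<Slot> slots_;
};
```

A counter element would call `at(current_thread_id) += 1` on the fast path and `sum()` only when the total is actually read.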

Gotchas :

In Click you have a pool of Packets and Buffers. In FastClick you have a pool of Packets and a pool of Packets+Buffers, i.e. "full" packets. This allows a single memory allocation with Netmap. In general you'll need one of those two, and rarely the separate items. Also, with --enable-netmap-pool or --enable-dpdk-pool those full packets are Netmap/DPDK packets, not Click buffers, so they are ready to be sent without a copy. That is needed if you have elements that call Packet::make() blindly and are not particularly tied to DPDK or Netmap.
So it did provide performance improvements. However, if you end up in a case where you allocate "full" packets, separate them yourself, and then recycle the Packet objects separately from your buffers, you'll end up with skewed allocation/release, and that's very bad. It is unlikely to happen, but in hindsight I should have made that optional. The problem is detected and shouts plenty of messages when it appears, so there is not too much to worry about for now.

Other improvements :

If you read the time a lot, you'll probably see in profiling that you spend way too much time in the gettimeofday VDSO. Look at TSCClock if that happens (my thesis is the documentation to read first). The same thing may happen with click_random. The solution would be to read /dev/urandom from time to time and use a pseudo-random generator between those reads.
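The "reseed from time to time" idea for random numbers could look like the sketch below. It is an illustration under assumptions, not what click_random does: `ToyReseedingRandom` is a hypothetical class, and `std::random_device` stands in for the occasional read from the system entropy source.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Cheap pseudo-random numbers on the fast path, reseeded from the system
// entropy source only once every `reseed_every` calls.
class ToyReseedingRandom {
public:
    explicit ToyReseedingRandom(std::uint32_t reseed_every)
        : reseed_every_(reseed_every == 0 ? 1 : reseed_every) {}

    std::uint32_t next() {
        if (left_ == 0) {
            engine_.seed(source_());  // the expensive call, done rarely
            left_ = reseed_every_;
            ++reseeds_;
        }
        --left_;
        return static_cast<std::uint32_t>(engine_());  // the cheap call
    }

    // Number of reseeds so far, exposed to make the behaviour observable.
    std::uint32_t reseeds() const { return reseeds_; }

private:
    std::random_device source_;  // stand-in for reading the entropy source
    std::mt19937 engine_;
    std::uint32_t reseed_every_;
    std::uint32_t left_ = 0;
    std::uint32_t reseeds_ = 0;
};
```

With `reseed_every` set to 4, ten calls to `next()` touch the entropy source three times; the right interval for a real pipeline would have to come from profiling.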

So the conclusion is: start exactly as you are, with --disable-batch. The first thing to do is to use DPDK or Netmap if you don't already. I tend to use DPDK these days, so it is a little better maintained. Then go for batching or full push. It may also be the occasion to review your multi-threading model, using FastClick's helpers and facilities. My final advice is to use "perf top" all the time and profile again and again. Flamegraphs are a cooler alternative :p


bcronje commented on August 23, 2024

Thank you @tbarbette for your detailed explanations, much appreciated. Our use case and requirements are probably slightly different from your usual case. Our slow path typically makes up a significant portion of the overall packets processed. So we have focused a lot on improving our slow path (compression, deduplication, streaming, etc.) and on ensuring that the fast path (typically bypassed traffic) is kept on dedicated thread(s).

That means that for us a lot of FastClick's unique features probably play a less critical role. However, I do believe that once I start digging more into the code we'll find lots of opportunities to integrate these features and improvements into our system.

I'll keep this issue open for a little while in case I or someone else has any other questions.


tbarbette commented on August 23, 2024

And what do you use for I/O? PCAP?

DPDK, even when using kernel facilities such as AF_PACKET (and soon AF_XDP) as a backend, would certainly be faster than PCAP. All the other user-level, socket-based Click I/O facilities are ageing very badly.


bcronje commented on August 23, 2024

We use PCAP at the moment. In our tests we got better throughput with PCAP than with AF_PACKET, although admittedly those tests were some time ago. Historically PCAP hasn't been a limiting factor for our specific use case and throughput targets, hence our focus has been elsewhere in the code.

That said, migration to DPDK/Netmap/AF_XDP is high on our TODO list.


ahenning commented on August 23, 2024

Thanks @tbarbette for the thorough breakdown. I have a question that may already have been answered but I just want to confirm. From the click readme:

As the DPDK EAL will handle thread management instead of Click, Click's -j/--threads argument will be disabled when --dpdk is active

I am not sure whether I understand this text from the click documentation in full context. We depend on -j threads to handle threading in the slow path. Is it possible to run FastClick / DPDK for the fast path e.g. FromDevice, Classifiers, EtherSwitch, KernelTap, ToDevice elements, but still maintain the click -j multi-threading implementation for the slow path elements?

I think from the discussion above, it seems the answer is yes, but the documentation seems to suggest no.


cffs commented on August 23, 2024

In DPDK mode, thread creation and pinning is handled by the DPDK EAL. The total number of threads is deduced from EAL arguments. For example, passing a core mask argument -c 0x1f to DPDK (i.e. 0b11111, use the first 5 logical cores) would be equivalent to passing -j 5 to Click. The -j NTHREADS argument of Click is ignored in DPDK mode (and you'll get a warning if you use it).
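The relation between the core mask and the thread count stated above is simply the number of bits set in the mask. A small sketch (the helper name `toy_threads_from_coremask` is made up for this example and is not part of Click or DPDK):

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Number of threads implied by a core mask: one per bit set.
inline unsigned toy_threads_from_coremask(std::uint64_t mask) {
    return static_cast<unsigned>(std::bitset<64>(mask).count());
}
```

For the mask 0x1f (0b11111) this returns 5, matching the -j 5 equivalence; a sparse mask such as 0x5 (cores 0 and 2) gives 2.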

Apart from that, you can still use the traditional Click machinery for thread handling, including elements like StaticThreadSched.


tbarbette commented on August 23, 2024

To give a practical example, click -j 5 CONFIG would be equivalent to
click --dpdk -c 0x1f -- CONFIG
or
click --dpdk -l 0-4 -- CONFIG
and they are totally equivalent if you don't use any DPDK element (it seems silly, but you are forced to launch this way if you compiled with --enable-dpdk).


ahenning commented on August 23, 2024

Thanks. Got it, only the -j argument is affected and not the thread scheduling elements e.g. StaticThreadSched.

