
atomic_queue's Introduction

C++14 MIT license Latest release Continuous Integrations
platform Linux x86_64 platform Linux ARM platform Linux RISC-V platform Linux PowerPC platform Linux IBM System/390

atomic_queue

C++14 multiple-producer-multiple-consumer lock-free queues based on a circular buffer and std::atomic. Designed with the goal of minimizing the latency between one thread pushing an element into a queue and another thread popping it from the queue.

It has been developed, tested and benchmarked on Linux, but should support any C++14 platform that implements std::atomic. It is reported as compatible with Windows, but the continuous integration workflows hosted by GitHub are currently set up only for the x86_64 platform on Ubuntu-20.04 and Ubuntu-22.04. Pull requests extending the continuous integration to other architectures and/or platforms are welcome.

Design Principles

When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify.

The main design principle these queues follow is minimalism, which results in such design choices as:

  • Bare minimum of atomic instructions. The push and pop functions, inlinable by default, can hardly be any cheaper in terms of CPU instruction count / L1i cache pressure.
  • Explicit contention/false-sharing avoidance for the queue and its elements.
  • Linear fixed-size ring-buffer array. No heap memory allocations after a queue object has been constructed. It doesn't get any more CPU L1d or TLB cache friendly than that.
  • Value semantics. The queues make a copy/move upon push/pop; no reference/pointer to elements in the queue can be obtained.

The impact of each of these small design choices on its own is barely measurable, but their total impact is much greater than the simple sum of the constituents' impacts. The synergy that emerges from combining these small design choices is what allows CPUs to perform at their peak capacity, least impeded.

These design choices are also limitations:

  • The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues at the price of a fixed buffer size. See Effective memory reclamation for lock-free data structures in C++ for more details. A fixed buffer size may not be much of a limitation: once the queue grows larger than the maximum expected size, that indicates a problem in itself, namely that elements aren't being consumed fast enough, and if the queue keeps growing it may eventually consume all available memory, affecting the entire system rather than only the problematic process. The only apparent inconvenience is that one has to calculate upfront the largest expected/acceptable number of unconsumed elements in the queue.
  • There are no OS-blocking push/pop functions. This queue is designed for ultra-low-latency scenarios, and using an OS blocking primitive would sacrifice push-to-pop latency. For the lowest possible latency one cannot afford blocking in the OS kernel: the wake-up latency of a blocked thread is about 1-3 microseconds, whereas this queue's round-trip time can be as low as 150 nanoseconds.

Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the throughput and latency benchmarks.

Available containers are:

  • AtomicQueue - a fixed size ring-buffer for atomic elements.
  • OptimistAtomicQueue - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full. It is AtomicQueue used with push/pop instead of try_push/try_pop.
  • AtomicQueue2 - a fixed size ring-buffer for non-atomic elements.
  • OptimistAtomicQueue2 - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full. It is AtomicQueue2 used with push/pop instead of try_push/try_pop.

These containers have corresponding AtomicQueueB, OptimistAtomicQueueB, AtomicQueueB2, OptimistAtomicQueueB2 versions where the buffer size is specified as an argument to the constructor.

Totally ordered mode is supported. In this mode consumers receive messages in the same FIFO order the messages were posted. This mode is supported for the push and pop functions, but not for the try_ versions. On Intel x86 the totally ordered mode has 0 cost, as of 2019.

Single-producer-single-consumer mode is supported. In this mode, no expensive atomic read-modify-write CPU instructions are necessary, only the cheapest atomic loads and stores. That improves queue throughput significantly.
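
For illustration, here is a hedged sketch of declaring a single-producer-single-consumer queue. The template parameter names and their order (element type, allocator, MAXIMIZE_THROUGHPUT, TOTAL_ORDER, SPSC) are taken from a user-provided snippet further down this page and should be verified against atomic_queue.h before use.

#include <atomic_queue/atomic_queue.h>
#include <memory>

// Heap-allocated buffer, capacity passed to the constructor; the trailing 'true'
// is assumed to enable single-producer-single-consumer mode.
using SpscQueue = atomic_queue::AtomicQueueB2<
    /* T = */ float,
    /* A = */ std::allocator<float>,
    /* MAXIMIZE_THROUGHPUT = */ true,
    /* TOTAL_ORDER = */ false,
    /* SPSC = */ true>;

int main() {
    SpscQueue q{1024};
    q.push(3.14f);
    return q.pop() == 3.14f ? 0 : 1;
}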

Move-only queue element types are fully supported. For example, a queue of std::unique_ptr<T> elements would be AtomicQueueB2<std::unique_ptr<T>> or AtomicQueue2<std::unique_ptr<T>, CAPACITY>.
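
A minimal sketch, assuming the class template names above and that push/pop move a move-only element into and out of the queue:

#include <atomic_queue/atomic_queue.h>
#include <memory>

int main() {
    // Fixed compile-time capacity as the second template argument.
    atomic_queue::AtomicQueue2<std::unique_ptr<int>, 1024> q;

    q.push(std::make_unique<int>(42)); // the element is moved into the queue
    std::unique_ptr<int> p = q.pop();  // and moved back out on pop
    return *p == 42 ? 0 : 1;
}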

Role Models

Several other well established and popular thread-safe containers are used for reference in the benchmarks:

  • std::mutex - a fixed size ring-buffer with std::mutex.
  • pthread_spinlock - a fixed size ring-buffer with pthread_spinlock_t.
  • boost::lockfree::spsc_queue - a wait-free single-producer-single-consumer queue from Boost library.
  • boost::lockfree::queue - a lock-free multiple-producer-multiple-consumer queue from Boost library.
  • moodycamel::ConcurrentQueue - a lock-free multiple-producer-multiple-consumer queue used in non-blocking mode. This queue is designed to maximize throughput at the expense of latency and eschewing the global time order of elements pushed into one queue by different threads. It is not equivalent to other queues benchmarked here in this respect.
  • moodycamel::ReaderWriterQueue - a lock-free single-producer-single-consumer queue used in non-blocking mode.
  • xenium::michael_scott_queue - a lock-free multi-producer-multi-consumer queue proposed by Michael and Scott (this queue is similar to boost::lockfree::queue which is also based on the same proposal).
  • xenium::ramalhete_queue - a lock-free multi-producer-multi-consumer queue proposed by Ramalhete and Correia.
  • xenium::vyukov_bounded_queue - a bounded multi-producer-multi-consumer queue based on the version proposed by Vyukov.
  • tbb::spin_mutex - a locked fixed size ring-buffer with tbb::spin_mutex from Intel Threading Building Blocks.
  • tbb::concurrent_bounded_queue - the eponymous queue from Intel Threading Building Blocks, used in non-blocking mode.

Using the library

The containers provided are header-only class templates, no building/installing is necessary.

Install from GitHub

  1. Clone the project:
git clone https://github.com/max0x7ba/atomic_queue.git
  2. Add the atomic_queue/include directory (use the full path) to the include paths of your build system.
  3. #include <atomic_queue/atomic_queue.h> in your C++ source. A minimal usage sketch follows these steps.
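
The following is a hedged getting-started sketch rather than the project's own example. It assumes the default NIL value of 0 (so only non-zero values are pushed) and that try_pop takes an output reference and returns bool, as the API summary below suggests.

#include <atomic_queue/atomic_queue.h>
#include <thread>
#include <cstdio>

int main() {
    // Compile-time capacity of 1024 unsigned elements; 0 is reserved as the NIL marker.
    atomic_queue::AtomicQueue<unsigned, 1024> q;

    std::thread producer([&q] {
        for (unsigned i = 1; i <= 100; ++i)
            q.push(i); // busy-waits if the queue is full
    });

    unsigned received = 0;
    while (received < 100) {
        unsigned v;
        if (q.try_pop(v)) // returns false when the queue is empty
            ++received;
    }
    producer.join();
    std::printf("received %u elements\n", received);
}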

Install using vcpkg

vcpkg install atomic-queue

Benchmark build and run instructions

The containers provided are header-only class templates that require only #include <atomic_queue/atomic_queue.h>, no building/installing is necessary.

Building is necessary to run the tests and benchmarks.

git clone https://github.com/cameron314/concurrentqueue.git
git clone https://github.com/cameron314/readerwriterqueue.git
git clone https://github.com/mpoeter/xenium.git
git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue
make -r -j4 run_benchmarks

The benchmarks also require the Intel TBB library to be available. The Makefile assumes it is installed in /usr/local/include and /usr/local/lib; if it is installed elsewhere, modify cppflags.tbb and ldlibs.tbb in the Makefile.

API

The queue class templates provide the following member functions:

  • try_push - Appends an element to the end of the queue. Returns false when the queue is full.
  • try_pop - Removes an element from the front of the queue. Returns false when the queue is empty.
  • push (optimist) - Appends an element to the end of the queue. Busy waits when the queue is full. Faster than try_push when the queue is not full. Optional FIFO producer queuing and total order.
  • pop (optimist) - Removes an element from the front of the queue. Busy waits when the queue is empty. Faster than try_pop when the queue is not empty. Optional FIFO consumer queuing and total order.
  • was_size - Returns the number of unconsumed elements during the call. The state may have changed by the time the return value is examined.
  • was_empty - Returns true if the container was empty during the call. The state may have changed by the time the return value is examined.
  • was_full - Returns true if the container was full during the call. The state may have changed by the time the return value is examined.
  • capacity - Returns the maximum number of elements the queue can possibly hold.

Atomic elements are those for which std::atomic<T>{T{}}.is_lock_free() returns true, and, when C++17 features are available, std::atomic<T>::is_always_lock_free evaluates to true at compile time. In other words, the CPU can natively load, store and compare-and-exchange such elements atomically. On x86-64 such elements are all the C++ standard arithmetic and pointer types.
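
A hedged sketch of checking whether an element type qualifies, mirroring the run-time assert mentioned in the issues below and adding the C++17 compile-time equivalent:

#include <atomic>
#include <cassert>

int main() {
    // C++14: run-time check that the element type is lock-free.
    assert(std::atomic<unsigned>{0u}.is_lock_free());
    assert(std::atomic<void*>{nullptr}.is_lock_free());

#if __cplusplus >= 201703L
    // C++17: the same property checked at compile time.
    static_assert(std::atomic<unsigned>::is_always_lock_free,
                  "not lock-free: use AtomicQueue2/AtomicQueueB2 instead");
#endif
    return 0;
}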

The queues for atomic elements reserve one value to serve as an empty element marker NIL; its default value is 0. The NIL value must not be pushed into a queue, and there is an assert statement in the push functions to guard against that in debug-mode builds. Pushing a NIL element into a queue in release-mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.
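
If 0 is a legitimate element value (e.g. a buffer index), a different NIL can be supplied as a template argument. This hedged sketch follows the AtomicQueue<T, SIZE, NIL> parameter order visible in a user snippet later on this page:

#include <atomic_queue/atomic_queue.h>
#include <limits>

int main() {
    // Reserve the maximum unsigned value as NIL so that 0 becomes an ordinary element.
    constexpr unsigned NIL = std::numeric_limits<unsigned>::max();
    atomic_queue::AtomicQueue<unsigned, 1024, NIL> q;

    q.push(0u);                // 0 is now a pushable value
    unsigned index = q.pop();
    return index == 0 ? 0 : 1;
}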

Note that optimism is a choice of control flow for a queue modification operation, rather than a queue type. An optimistic push is fastest when the queue is not full most of the time; an optimistic pop is fastest when the queue is not empty most of the time. Optimistic and non-optimistic operations can be mixed without restrictions. The OptimistAtomicQueues in the benchmarks use only optimistic push and pop.

See example.cc for a usage example.

TODO: full API reference.

Memory order of non-atomic loads and stores

push and try_push operations synchronize with (as defined in std::memory_order) any subsequent pop or try_pop operation on the same queue object. This means that:

  • No non-atomic load/store gets reordered past push/try_push, which is a memory_order::release operation. Same memory order as that of std::mutex::unlock.
  • No non-atomic load/store gets reordered prior to pop/try_pop, which is a memory_order::acquire operation. Same memory order as that of std::mutex::lock.
  • The effects of a producer thread's non-atomic stores preceding a push/try_push of an element into a queue become visible in the consumer thread that pops/try_pops that particular element (an illustration follows this list).
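
A hedged illustration of this guarantee, assuming a queue of pointer elements (for which the default NIL of 0 is the null pointer, so only non-null pointers are pushed):

#include <atomic_queue/atomic_queue.h>
#include <thread>

struct Order { int id; double price; };

int main() {
    atomic_queue::AtomicQueue<Order*, 64> q;
    Order order;

    std::thread producer([&] {
        order.id = 7;       // plain non-atomic stores...
        order.price = 99.5;
        q.push(&order);     // ...followed by a release operation
    });

    Order* p = q.pop();     // acquire operation, synchronizes with the push above
    bool visible = (p->id == 7 && p->price == 99.5); // the producer's stores are guaranteed visible
    producer.join();
    return visible ? 0 : 1;
}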

Implementation Notes

Ring-buffer capacity

The available queues here use a ring-buffer array for storing elements. The capacity of the queue is fixed at compile time or construction time.

In a production multiple-producer-multiple-consumer scenario the ring-buffer capacity should be set to the maximum expected queue size. When the ring-buffer gets full it means that the consumers cannot consume the elements fast enough. A fix for that is any of:

  • Increase the queue capacity in order to handle temporary spikes of pending elements in the queue. This normally requires restarting the application after re-configuration/re-compilation has been done.
  • Increase the number of consumers to drain the queue faster. The number of consumers can be managed dynamically, e.g. when a consumer observes that the number of elements pending in the queue keeps growing, that calls for deploying more consumer threads to drain the queue at a faster rate; a mostly empty queue calls for suspending/terminating excess consumer threads.
  • Decrease the rate of pushing elements into the queue. push and pop calls always incur some expensive CPU cycles to maintain the integrity of the queue state in an atomic/consistent/isolated fashion with respect to other threads, and these costs increase super-linearly as queue contention grows. Producer batching of multiple small elements, or of elements resulting from one event, into one queue message is often a reasonable solution.

Using a power-of-2 ring-buffer array size allows a couple of important optimizations:

  • The writer and reader indexes get mapped into the ring-buffer array index using the remainder operator % SIZE. The remainder operator normally generates a division CPU instruction, which isn't cheap, but a power-of-2 size turns that remainder into a single cheap bitwise AND instruction, and that is as fast as it gets.
  • The element index within the cache line gets swapped with the cache line index, so that consecutive queue elements reside in different cache lines. This massively reduces cache line contention between multiple producers and multiple consumers. Instead of N producers together with M consumers competing for subsequent elements in the same ring-buffer cache line in the worst case, it is only one producer competing with one consumer (strictly speaking, when the number of CPUs is not greater than the number of elements that fit in one cache line). This optimisation scales better with the number of producers and consumers, and with element size. With a low number of producers and consumers (up to about 2 of each in these benchmarks), disabling this optimisation may yield better throughput (but higher variance across runs). A sketch of both index mappings follows this list.
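
A hedged sketch of the two index mappings described in this list; the names and the exact remapping formula are illustrative, not the library's actual identifiers or implementation:

constexpr unsigned SIZE = 1024;                                     // power of 2
constexpr unsigned CACHE_LINE = 64;                                 // bytes
constexpr unsigned ELEMS_PER_LINE = CACHE_LINE / sizeof(unsigned);  // 16 elements per line

// 1. A power-of-2 size turns `index % SIZE` into a single bitwise AND.
constexpr unsigned ring_index(unsigned i) { return i & (SIZE - 1); }

// 2. Swap the element-within-cache-line index with the cache-line index, so that
//    consecutive queue elements land in different cache lines.
constexpr unsigned remap(unsigned i) {
    return (i % ELEMS_PER_LINE) * (SIZE / ELEMS_PER_LINE) + i / ELEMS_PER_LINE;
}

static_assert(ring_index(SIZE + 3) == 3, "wrap-around via bitwise AND");
static_assert(remap(0) / ELEMS_PER_LINE != remap(1) / ELEMS_PER_LINE,
              "consecutive elements fall into different cache lines");

int main() { return 0; }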

The containers use the unsigned type for sizes and internal indexes. On the x86-64 platform unsigned is 32 bits wide, whereas size_t is 64 bits wide. 64-bit instructions require an extra instruction prefix byte, resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit unsigned indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.

While the atomic queues can be used with any moveable element types (including std::unique_ptr), for best throughput and latency the queue elements should be cheap to copy and lock-free (e.g. unsigned or T*), so that push and pop operations complete fastest.

Lock-free guarantees

Conceptually, a push or pop operation does two atomic steps:

  1. Atomically and exclusively claims the queue slot index to store/load an element to/from. That is, producers increment the head index, consumers increment the tail index. Each slot is accessed by only one producer thread and one consumer thread.
  2. Atomically stores/loads the element into/from the slot. A producer storing into a slot changes its state to non-NIL; a consumer loading from a slot changes its state back to NIL. The slot acts as a spinlock for its single producer and single consumer threads.

These queues anticipate that a thread doing push or pop may complete step 1 and then be preempted before completing step 2.
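
A hedged, pseudocode-level sketch of these two steps for a push; head_, slots_ and the memory orders are illustrative stand-ins, not the library's actual implementation:

#include <atomic>

template<class T, unsigned SIZE, T NIL>
struct PushSketch {
    std::atomic<unsigned> head_{0};
    std::atomic<T> slots_[SIZE];

    PushSketch() {
        for (auto& s : slots_)
            s.store(NIL, std::memory_order_relaxed);
    }

    void push(T element) {
        // Step 1: exclusively claim a slot index with one atomic read-modify-write.
        unsigned slot = head_.fetch_add(1, std::memory_order_acquire) % SIZE;
        // Step 2: the slot acts as a spinlock shared by its one producer and one consumer;
        // wait until the consumer has set it back to NIL, then store the element.
        for (T expected = NIL;
             !slots_[slot].compare_exchange_weak(expected, element,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed);
             expected = NIL)
            ;
    }
};

int main() {
    PushSketch<unsigned, 16, 0u> q;
    q.push(1u);
    return 0;
}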

An algorithm is lock-free if there is guaranteed system-wide progress. These queues guarantee system-wide progress through the following properties:

  • Each push is independent of any preceding push. An incomplete (preempted) push by one producer thread doesn't affect push of any other thread.
  • Each pop is independent of any preceding pop. An incomplete (preempted) pop by one consumer thread doesn't affect pop of any other thread.
  • An incomplete (preempted) push from one producer thread affects only the one consumer thread popping an element from this particular queue slot. All other threads' pops are unaffected.
  • An incomplete (preempted) pop from one consumer thread affects only the one producer thread pushing an element into this particular queue slot while expecting it to have been consumed a long time ago, in the rather unlikely scenario that producers have wrapped around the entire ring-buffer while this consumer hasn't completed its pop. All other threads' pushes and pops are unaffected.

Preemption

Linux task scheduler thread preemption is something no user-space process should be able to affect or escape, otherwise any/every malicious application would exploit that.

Still, there are a few things one can do to minimize preemption of one's mission critical application threads:

  • Use real-time SCHED_FIFO scheduling class for your threads, e.g. chrt --fifo 50 <app>. A higher priority SCHED_FIFO thread or kernel interrupt handler can still preempt your SCHED_FIFO threads.
  • Use the same fixed real-time scheduling priority for all threads accessing the same queue objects. Real-time threads with different scheduling priorities modifying one queue object may cause priority inversion and deadlocks. Using the default scheduling class SCHED_OTHER with its dynamically adjusted priorities defeats the purpose of using these queues.
  • Disable real-time thread throttling to prevent SCHED_FIFO real-time threads from being throttled.
  • Isolate CPU cores, so that no interrupt handlers or applications ever run on them. Mission-critical applications should be explicitly placed on these isolated cores with taskset.
  • Pin threads to specific cores, otherwise the task scheduler keeps moving threads to other idle CPU cores to level voltage/heat-induced wear-and-tear across CPU cores. Keeping a thread running on the same CPU core maximizes the CPU cache hit rate, whereas moving a thread to another CPU core incurs otherwise unnecessary CPU cache thrashing. (A programmatic sketch of the last two items follows this list.)
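
A hedged C++ sketch of applying the last two items from inside the process (the programmatic equivalents of taskset and chrt). It is GNU/Linux-specific, error handling is minimal, and the real-time part requires appropriate privileges:

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one CPU core and give it a fixed SCHED_FIFO priority.
bool pin_and_make_realtime(unsigned cpu, int fifo_priority) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (int err = pthread_setaffinity_np(pthread_self(), sizeof set, &set)) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return false;
    }

    sched_param sp{};
    sp.sched_priority = fifo_priority; // e.g. 50, matching `chrt --fifo 50`
    if (int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp)) {
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
        return false;
    }
    return true;
}

int main() {
    return pin_and_make_realtime(2, 50) ? 0 : 1; // core 2, priority 50: illustrative values
}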

People often propose limiting busy-waiting with a subsequent call to std::this_thread::yield()/sched_yield/pthread_yield. However, sched_yield is the wrong tool for locking because it doesn't communicate to the OS kernel what the thread is waiting for, so the OS thread scheduler can never schedule the calling thread to resume at the right time when the shared state has changed (unless there are no other threads that can run on this CPU core, so that the caller resumes immediately). See the notes section in man sched_yield and a Linux kernel thread about sched_yield and spinlocks for more details.

On Linux, there is the mutex type PTHREAD_MUTEX_ADAPTIVE_NP, which busy-waits on a locked mutex for a number of iterations and then makes a blocking syscall into the kernel to deschedule the waiting thread. In the benchmarks it was the worst performer and I couldn't find a way to make it perform better, which is the reason it is not included in the benchmarks.

On Intel CPUs one could use the 4 debug control registers to monitor the spinlock memory region for write access and wait on it using select (and its friends) or sigwait (see perf_event_open and uapi/linux/hw_breakpoint.h for more details). A spinlock waiter could suspend itself with select or sigwait until the spinlock state has been updated. But there are only 4 of these registers, so that such a solution wouldn't scale.

Benchmarks

View throughput and latency benchmarks charts.

Methodology

There are a few OS behaviours that complicate benchmarking:

  • The CPU scheduler can place threads on different CPU cores each run. To avoid that, the threads are pinned to specific CPU cores.
  • The CPU scheduler can preempt threads. To avoid that, real-time SCHED_FIFO priority 50 is used, which disables scheduler time quantum expiry and makes the threads non-preemptable by lower-priority processes/threads.
  • Real-time thread throttling is disabled (see below).
  • Adverse address space randomisation, as well as other processes running on the system, may cause extra CPU cache conflicts. To minimise these effects, the benchmark executable is run at least 33 times. The benchmark charts show the average; the standard deviation, minimum and maximum values are shown in the chart tooltips.

I only have access to a few x86-64 machines. If you have access to different hardware, feel free to submit the output file of scripts/run-benchmarks.sh and I will include your results in the benchmarks page.

Huge pages

When huge pages are available, the benchmarks use 1x1GB or 16x2MB huge pages for the queues to minimise TLB misses. To enable huge pages, do one of:

sudo hugeadm --pool-pages-min 1GB:1
sudo hugeadm --pool-pages-min 2MB:16

Alternatively, you may like to enable transparent hugepages in your system and use a hugepage-aware allocator, such as tcmalloc.

Real-time thread throttling

By default, the Linux scheduler throttles real-time threads from consuming 100% of the CPU, which is detrimental to benchmarking. Full details can be found in Real-Time group scheduling. To disable real-time thread throttling do:

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us >/dev/null

Throughput and scalability benchmark

N producer threads push 4-byte integers into one shared queue, and N consumer threads pop the integers from the queue. The producers post 1,000,000 messages in total. The total time to send and receive all the messages is measured. The benchmark is run from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of the different queues.

Ping-pong benchmark

One thread posts an integer to another thread through one queue and waits for a reply on another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1 producer, 1 consumer, 1 element in the queue) in order to achieve and measure the lowest latency. The average round-trip time is reported.

Contributing

Contributions are more than welcome. .editorconfig and .clang-format can be used to automatically match code formatting.

Reading material

Some books on the subject of multi-threaded programming I found instructive:

  • Programming with POSIX Threads by David R. Butenhof.
  • The Art of Multiprocessor Programming by Maurice Herlihy, Nir Shavit.

Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file LICENSE.

atomic_queue's People

Contributors

andriy06, jcelerier, jhasse, jpcima, jwakely, max0x7ba, mpoeter, musicinmybrain, paulfd, redskittlefox, xeonacid, xiejiss, yvan-xy


atomic_queue's Issues

spin-locking

I'd like to raise a concern about spin-locking in atomic_queue implementation. I read README section about this, but I'm still not convinced. Let me describe an issue I saw:
An NDA platform with a non-fair OS scheduler that runs on millions of consumer devices. We had a spin-lock that was using _mm_pause/__yield instructions, same as atomic_queue, which resulted in a low but statistically significant number of deadlocks on only that platform. All cores were occupied by spinning threads, while the thread that was supposed to unlock them was never re-scheduled. The culprit here is the non-fair scheduler: it was giving a CPU core to a thread until it got blocked (e.g. on a mutex) or a higher-priority thread got ready for execution.
The fix was to add std::this_thread::yield() (not exactly but doing the same) into spin-locking. This solved the problem because while it wasn't communicating to the OS which thread it waits for, it was freeing a CPU core, and that was enough to eventually schedule the unlocking thread.
Now, std::this_thread::yield() is orders of magnitude slower than _mm_pause, at least on the multiple platforms where I had a chance to benchmark it. A typical solution is to use a hybrid approach: spin on _mm_pause/__yield for a while (hundreds of iterations) for a low-latency response and only then switch to spinning on std::this_thread::yield() as the last resort. This way it doesn't matter how slow it is.
To be clear, I don't think it's a reliable solution. I suspect it can be still vulnerable to the priority inversion problem, though I don't have data to back this claim. To avoid priority inversion, a proper blocking is required. So the most versatile solution I can think of would be:

  • spin on _mm_pause for N iterations;
  • spin on std::this_thread::yield() for another M iterations;
  • block the thread, e.g. with std::atomic<T>::wait(). Right, I remember this library is C++14, so it would rather be std::mutex.
    The cost is that the unlocking thread would need to check every time whether there's a waiting thread that needs to be notified (a relaxed load), plus 8 bytes of memory. The implementation would be a bit convoluted though. So, from a practical point of view, I'd go with just adding std::this_thread::yield() to the spinning; it's simple enough, costs nothing and solves a real problem in many real cases. (A sketch of the first two stages follows below.)
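
A hedged sketch of the first two stages of the back-off described above (the final blocking stage is omitted); the iteration count is illustrative:

#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h> // _mm_pause
#endif

inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    _mm_pause();
#else
    std::this_thread::yield(); // portable fallback for non-x86 targets
#endif
}

// Spin until 'ready' becomes true: pause-spin first, then yield as the last resort.
inline void wait_until(std::atomic<bool>& ready) {
    constexpr int PAUSE_ITERATIONS = 256; // illustrative "N"
    for (int i = 0; i < PAUSE_ITERATIONS; ++i) {
        if (ready.load(std::memory_order_acquire))
            return;
        cpu_relax();
    }
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield(); // free the core so the unlocking thread can run
}

int main() {
    std::atomic<bool> ready{true};
    wait_until(ready);
    return 0;
}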

Questions about evaluation graphs

  1. In the 'A wait-free queue as fast as fetch-and-add' paper, it shows performance of nearly 100 mops/second. But the benchmark in this repo shows 10 msgs/second (in the case of XEON).
    I know the environments differ from each other, but why do they make such a difference?

  2. What is the major difference between AtomicQueue and OptimistAtomicQueue in terms of performance? There is only a small change in the code, but it shows a large performance gap.

  3. Did you check the performance comparison with LRCQ?

  • LRCQ: Fast Concurrent Queues for x86 Processors, PPOPP '13

(image: figure from the wait-free queue paper)

Thanks...
Junghan

Ordering guarantees

I'm just about to check out your queue. I'm currently using MoodyCamel ConcurrentQueue, but in one of my projects I have now come across the limitation that there are no ordering guarantees.

Are items enqueued to atomic_queue dequeued in (almost) the same order?

could you please add some user guide for beginners?

Hi, after checking the benchmark page, I know this is another great high-performance lock-free queue.
I want to use this atomic_queue in my project, but I found it hard. It would be better if you could provide a simple guide for beginners on how to use atomic_queue, just like moodycamel::ConcurrentQueue's docs, which include some basic usage guidance.
Thanks!

Unbounded queue possible with atomic_queue?

I'm looking to replace folly::UnboundedQueue. Is it possible to orchestrate multiple atomic_queues to basically grow without a fixed size limit? Where e.g. the last item in the queue points to a new fixed-size queue which is created dynamically when pushing at the last index and thrown away when popped?

'try_push' variant for move-only types

I adapted the queue so that it's compatible with move-only types (std::unique_ptr for example)

I made minor modifications here and there (adding std::move on some assignments) and found that it would be useful to have a try_push variant for movable types, because if the push fails, for movable types, the moved object is lost. In the variant I propose, the moved object is returned to the caller when the push fails:

    template<class T>
    std::optional<T> try_push_moveable(T&& element) noexcept {
        auto head = head_.load(X);
        if(Derived::spsc_) {
            if(static_cast<int>(head - tail_.load(X)) >= static_cast<int>(static_cast<Derived&>(*this).size_))
                return {std::forward<T>(element)};
            head_.store(head + 1, X);
        }
        else {
            do {
                if(static_cast<int>(head - tail_.load(X)) >= static_cast<int>(static_cast<Derived&>(*this).size_))
                    return {std::forward<T>(element)};
            } while(ATOMIC_QUEUE_UNLIKELY(!head_.compare_exchange_strong(head, head + 1, A, X))); // This loop is not FIFO.
        }

        static_cast<Derived&>(*this).do_push(std::forward<T>(element), head);
        return {};
    }

I'll submit a full pull request when I find some time to do it.

Implementations for ppc64le and s390x?

I’ve packaged this library for Fedora Linux as part of an effort to eventually package sfizz.

Both ppc64le (little-endian 64-bit PowerPC) and s390x (Linux on IBM Z) are supported primary architectures in Fedora. It’s fine that they aren’t supported here—we can just ensure that atomic_queue and any packages that depend on it aren’t built on those architectures.

Still, I wonder—is supporting these two architectures easy or difficult?

Unfortunately, I’m not entirely confident in the exact requirements of the spin_loop_pause() function, nor am I particularly familiar with the nuances of those two architectures, so it’s not easy for me to contribute implementations myself.

Optimization proposal

Hello,

I have a suggestion to improve the performance of the queues that are dynamically sized:

Each time an element is pushed or popped, there is a call to a modulo operator to put head and tail in the [0,size) range.

This modulo might be very fast in the case where the size is a constant power of 2 known at compile time (if the compiler converts this to a bitwise operation), but when the size is set at run time, this modulo operator might be expensive...

Instead of using modulo we could simply keep head and tail in the range [0,size) (i.e. reset them to 0 when they become equal to size).
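
A hedged sketch contrasting the current remainder-based approach with the proposal above, for a size known only at run time:

// Current approach: remainder, which may compile to a division for a run-time size.
inline unsigned advance_modulo(unsigned i, unsigned size) {
    return (i + 1) % size;
}

// Proposed approach: keep the index in [0, size) with a compare and a conditional reset.
inline unsigned advance_wrap(unsigned i, unsigned size) {
    ++i;
    return i == size ? 0 : i;
}

int main() {
    return advance_modulo(9, 10) == advance_wrap(9, 10) ? 0 : 1; // both wrap 9 -> 0
}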

How to put a struct into the queue

Can anyone help?

typedef struct {
    int a;
    char b[32];
} UserData;

compile error when:

constexpr unsigned CAPACITY = 1024;
AtomicQueue<UserData, CAPACITY> q; 

C++17 is needed in xenium

Thanks for the amazing work! While following the installation steps, I found that one of the required dependencies, xenium (https://github.com/mpoeter/xenium), currently only supports C++17.

Can you suggest the previous version/commit you used for the benchmarks (with C++14)?

Could you add more apis such as Front() and PushFront()?

Dear author,
I have a requirement to inspect the first element of the queue without popping it, just like what front() does in std::vector.
I also want to push one element at the front of the queue.
Do you think these two functions, Front and PushFront, could be added? Thank you.

why compare_exchange_strong() in a loop?

Out of curiosity, why is compare_exchange_strong() used in loops where I believe compare_exchange_weak() would be sufficient? Based on benchmark results? Do you still have them or remember the details?

And while I have you here :) why the platform-dependent definition of CACHE_LINE_SIZE? Any known issues with std::hardware_destructive_interference_size?

thanks

The queue does not accept NIL value.

Hello @max0x7ba

I have a shared fixed-size buffer whose element data type is a large record. Instead of storing elements of the buffer directly in the queue, I just store indices into the buffer. One issue I have with the queue is that push() and try_push() do not accept a zero value, which is the first index of the buffer.
I want to know if there is any way to make the queue accept a zero value.

Thanks

Benchmark comments

Can you please add some of your comments on the webpage? What do you think? Which to use for each case? And maybe an explanation of the difference in results if possible?

Consider adding moodycamel::ConcurrentQueue to the benchmarks

Hi,

Currently you only have the moodycamel::ReaderWriterQueue incorporated in the benchmarks - Is there any specific reason to exclude moodycamel::ConcurrentQueue? It's also a lock-free multiple-producer-multiple-consumer queue, just like this one.

I think it would be a great addition!

Question: busy waiting?

This repo is great.
But, what happens if a thread is waiting to enqueue/dequeue for more than say 100ms?
In a busy-waiting scenario, isn't that hogging a lot of CPU resources, which could be otherwise spent by another thread?
In which case, one might argue that a mutex-cvar based queue which puts threads to sleep might be more appropriate.
Is there a way to hybridize this?
So, would there be a way for a thread to busy-wait for, say, any time less than 1 microsecond, then go to sleep for any time after that?

AtomicQueueB2 size does not work

I would expect the following :

atomic_queue::RetryDecorator<atomic_queue::AtomicQueueB2<float>> q(10);

for (size_t i = 0; i < 100; i++)
{
    cout << "enqueuing " << i << endl;
    q.push(i);
}

to only print up to 10 and then hang. But it doesn't; it goes all the way up to 100.

Throttling when idle

Is it possible to throttle CPU usage when the queue is basically idle? Like exponentially increasing the pause time up to some upper limit when no data is in flight?

Low latency programming?

I'm looking for a resource to learn low latency programming so I can write libraries such as these. Can you please recommend a book?

Please document the meaning of the NIL template parameter clearly

I tried out the atomic_queue today in the context of a bigger project. The first local tests were promising, but then some of our CI tests failed. It took me a few hours to finally figure out that the reason was that, under some circumstances, a default-constructed element was pushed into the queue, which then compared equal to NIL. I wasn't aware of the meaning of that third template argument and only found a description of it in an old issue.

I was about to revert all the changes and use another queue implementation because of this. Now after setting it to a different value, everything works fine. A small hint in the readme, explaining this important parameter or some documentation in the code would have saved me some hours of working time. Could you please document that somewhere?

Problem of using atomic_queue with debug-version of malloc and free.

The problem occurs when I use the queue in DEBUG mode with memory-leak checking in MSVC. An excerpt of the code is below:

#ifdef _DEBUG
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#define new new(_NORMAL_BLOCK, THIS_FILE, __LINE__)
#endif

class TestAtomicQueue {
	atomic_queue::AtomicQueue<int, 1, -1> m_queue;
};

int main()
{
#ifdef _DEBUG
	_CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
#endif
	auto pObject = new TestAtomicQueue;
	delete pObject; // <-- broken here
}

CPU load at 100%

Hi,

I'm writing a high-performance consumer of real-time messages and need a thread-safe, fast queue.
I've implemented the AtomicQueueB in my application and noticed that one or more CPUs are always at 100% load during runtime.
When running the example.cc with
unsigned constexpr N = std::numeric_limits<unsigned>::max() - 10; // Pass this many elements from producers to consumers.
I noticed similar behaviour.

Is this normal? If not, any ideas on how to set up the queue so it doesn't exhibit the load?
I didn't disable real-time thread throttling.

Also, if the tasks are pinned to a specific core, is it possible to disable this? My application needs to run in the cloud (only virtual cores available there...)

Compiled with g++ 9.3.1 with C++14

Bugfix : try_push blocks

Hello,

Using the following queue:

  using Queue = atomic_queue::AtomicQueueB2<
  /* T = */ float,
  /* A = */ std::allocator<float>,
  /* MAXIMIZE_THROUGHPUT */ true,
  /* TOTAL_ORDER = */ true,
  /* SPSC = */ true
  >;

when the queue is full, try_push blocks.

A fix for this behaviour is here : OlivierSohn/cpp.algorithms@37ccb8d

I verified that it solves my use case.

Feel free to use it!

Question: memory order in fetch_add()

In the push and pop functions, the memory order of fetch_add is set to memory_order_acquire if total order is not required. In that case, would it be wrong to set the memory order to relaxed? My reasoning is that fetch_add is not synchronized with any store, and also there is a data dependency between this evaluation and the do_push()/do_pop() that follows in the next line, so I don't see a risk of reading the wrong 'head' due to reordering of some read operation after fetch_add. I may be wrong and lacking some fundamental understanding of memory order, but I can’t seem to come up with any scenario where using relaxed memory order leads to error.

Elaborate in docs what "atomic elements" means.

I found this in source,
but it was not obvious from reading the docs:

assert(std::atomic<T>{NIL}.is_lock_free()); // This queue is for atomic elements only. AtomicQueue2 is for non-atomic ones.

I for one was thinking that maybe the T itself needs to be std::atomic<SomeOtherT>.

P.S. Not sure if that assert could be static_assert, IDK if there can be a type that is_lock_free but not constructible at compile time.

build failure with gcc-13

atomic_queue fails to build from source with GCC 13 on today's Debian sid. The relevant part of the error message starts like this:

In file included from /usr/include/c++/13/bits/atomic_base.h:39,
                 from /usr/include/c++/13/atomic:41,
                 from include/atomic_queue/defs.h:7,
                 from include/atomic_queue/atomic_queue.h:7,
                 from /<<PKGBUILDDIR>>/src/benchmarks.cc:5:
In function ‘std::_Require<std::__not_<std::__is_tuple_like<_Tp> >, std::is_move_constructible<_Tp>, std::is_move_assignable<_Tp> > std::swap(_Tp&, _Tp&) [with _Tp = thread::id]’,
    inlined from ‘void std::thread::swap(std::thread&)’ at /usr/include/c++/13/bits/std_thread.h:193:16,
    inlined from ‘std::thread& std::thread::operator=(std::thread&&)’ at /usr/include/c++/13/bits/std_thread.h:187:11,
    inlined from ‘{anonymous}::cycles_t {anonymous}::benchmark_throughput(atomic_queue::HugePages&, const std::vector<unsigned int>&, unsigned int, unsigned int, bool, sum_t*) [with Queue = atomic_queue::RetryDecorator<CapacityToConstructor<atomic_queue::AtomicQueueB<unsigned int, atomic_queue::HugePageAllocator<unsigned int>, 0, false, false, true>, 65536> >]’ at /<<PKGBUILDDIR>>/src/benchmarks.cc:252:43,
    inlined from ‘void {anonymous}::run_throughput_benchmark(const char*, atomic_queue::HugePages&, const std::vector<unsigned int>&, unsigned int, unsigned int, unsigned int) [with Queue = atomic_queue::RetryDecorator<CapacityToConstructor<atomic_queue::AtomicQueueB<unsigned int, atomic_queue::HugePageAllocator<unsigned int>, 0, false, false, true>, 65536> >]’ at /<<PKGBUILDDIR>>/src/benchmarks.cc:292:60:
/usr/include/c++/13/bits/move.h:198:7: error: array subscript 2 is outside array bounds of ‘long long int [1]’ [-Werror=array-bounds=]
  198 |       __a = _GLIBCXX_MOVE(__b);
      |       ^~~
In file included from /usr/include/x86_64-linux-gnu/c++/13/bits/c++allocator.h:33,
                 from /usr/include/c++/13/bits/allocator.h:46,
                 from /usr/include/c++/13/memory:65,
                 from include/atomic_queue/atomic_queue.h:13:
In member function ‘_Tp* std::__new_allocator<_Tp>::allocate(size_type, const void*) [with _Tp = long long int]’,
    inlined from ‘static _Tp* std::allocator_traits<std::allocator<_Tp1> >::allocate(allocator_type&, size_type) [with _Tp = long long int]’ at /usr/include/c++/13/bits/alloc_traits.h:482:28,
    inlined from ‘std::_Vector_base<_Tp, _Alloc>::pointer std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = long long int; _Alloc = std::allocator<long long int>]’ at /usr/include/c++/13/bits/stl_vector.h:378:33,
    inlined from ‘std::_Vector_base<_Tp, _Alloc>::pointer std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = std::thread; _Alloc = std::allocator<std::thread>]’ at /usr/include/c++/13/bits/stl_vector.h:375:7,
    inlined from ‘void std::_Vector_base<_Tp, _Alloc>::_M_create_storage(std::size_t) [with _Tp = std::thread; _Alloc = std::allocator<std::thread>]’ at /usr/include/c++/13/bits/stl_vector.h:395:44,
    inlined from ‘std::_Vector_base<_Tp, _Alloc>::_Vector_base(std::size_t, const allocator_type&) [with _Tp = std::thread; _Alloc = std::allocator<std::thread>]’ at /usr/include/c++/13/bits/stl_vector.h:332:26,
    inlined from ‘std::vector<_Tp, _Alloc>::vector(size_type, const allocator_type&) [with _Tp = std::thread; _Alloc = std::allocator<std::thread>]’ at /usr/include/c++/13/bits/stl_vector.h:554:47,
    inlined from ‘{anonymous}::cycles_t {anonymous}::benchmark_throughput(atomic_queue::HugePages&, const std::vector<unsigned int>&, unsigned int, unsigned int, bool, sum_t*) [with Queue = atomic_queue::RetryDecorator<CapacityToConstructor<atomic_queue::AtomicQueueB<unsigned int, atomic_queue::HugePageAllocator<unsigned int>, 0, false, false, true>, 65536> >]’ at /<<PKGBUILDDIR>>/src/benchmarks.cc:245:30,
    inlined from ‘void {anonymous}::run_throughput_benchmark(const char*, atomic_queue::HugePages&, const std::vector<unsigned int>&, unsigned int, unsigned int, unsigned int) [with Queue = atomic_queue::RetryDecorator<CapacityToConstructor<atomic_queue::AtomicQueueB<unsigned int, atomic_queue::HugePageAllocator<unsigned int>, 0, false, false, true>, 65536> >]’ at /<<PKGBUILDDIR>>/src/benchmarks.cc:292:60:
/usr/include/c++/13/bits/new_allocator.h:147:55: note: at offset 16 into object of size 8 allocated by ‘operator new’
  147 |         return static_cast<_Tp*>(_GLIBCXX_OPERATOR_NEW(__n * sizeof(_Tp)));
      |                                                       ^

This issue was initially reported in Debian bug#1037715. I have been unable to identify what exactly in the porting guide to GCC 13 is relevant to the error at play.

Ever increasing queue size

Hi, I am trying to use the queue in my project, where OpenCV frames are queued. I need to store a queue size of 5 (for the most recent frames), but what I observe is that if the consumer thread is not activated, the producer keeps pushing data into the queue, which leads to the size of the queue ever increasing. At a rate of 30 fps, my queue size crosses 1 GB in just a few minutes, instead of a few MB for just 5 frames.

Code snippets are as below:

  1. Declaration:
    atomic_queue::AtomicQueueB2<FramePacket> m_LiveFrameBuff{5};
  2. Producer call:
    auto temp = framePacket;
    m_LiveFrameBuff.push(temp);
  3. Consumer call:
    void FrameBuffManager::getLiveFrame(FramePacket &ret)
    {
       funcname();
       if (m_stopworkers.load())
           return;
       ret = m_LiveFrameBuff.pop();
       return;
    }
  4. Size print check:
    std::cout << m_LiveFrameBuff.was_size() << std::endl;

The readme also mentions other available queues like OptimistAtomicQueue2, but I am unable to find/use it. Could simpler use-case examples be provided for the respective available containers? It would be really helpful.

Create a queue with object ptrs as queue elements.

I am trying to create a queue with class object pointers as the queue elements, for my own exploration, and have taken sample code from src/example.cc. The example uses uint32_t and -1 as the special value that cannot be pushed/popped; what could that value be for a queue of pointers to objects?
For eg:

class file :
class Room {
    public:
        double length;
        double breadth;
        double height;

        double calculateArea(){
            return length * breadth;
        }

        double calculateVolume(){
            return length * breadth * height;
        }

};

main file : 
#include "atomic_queue/atomic_queue.h"

#include "Element.h"
#include <thread>
#include <cstdint>
#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono_literals;
    int constexpr PRODUCERS = 1; // Number of producer threads.
    int constexpr CONSUMERS = 1; // Number of consumer threads.
    unsigned constexpr N = 500; // Each producer pushes this many elements into the queue.
    unsigned constexpr CAPACITY = 400; // Queue capacity. Since there are more consumers than producers the queue doesn't need to be large.

    using Element = Room *; // Queue element type
    Element constexpr NILELEMENT = reinterpret_cast<Element>(0xdeaddead);
//    static constexpr Element invalid1 = reinterpret_cast<Element>(0xdeadead);
    //Element constexpr NILELEMENT = static_cast<Element>(&room); // Atomic elements require a special value that cannot be pushed/popped.
    using Queue = atomic_queue::AtomicQueueB<Element, std::allocator<Element>, NILELEMENT>; // Use heap-allocated buffer.

    // Create a queue object shared between all producers and consumers.
    Queue q{CAPACITY};

    // Start the consumers.
    uint64_t sums[CONSUMERS];
    std::thread consumers[CONSUMERS];
    for(int i = 0; i < CONSUMERS; ++i)
        consumers[i] = std::thread([&q, &sum = sums[i]]() {
            uint64_t s = 0; // New object with automatic storage duration. Not aliased or false-shared by construction
	    auto n = q.pop();
            while(n != NILELEMENT) { // Break the loop when 0 is pop'ed. 
                free(n);
		n = q.pop();
		s++;
		std::cout << " Consuming here " << n << " " << s << std::endl << std::flush;
		std::this_thread::sleep_for(20ms);
	    }
            // Store into sum only once because it is element of sums array, false-sharing the same cache line with other threads.
            // Updating sum in the loop above saturates the inter-core bus with cache coherence protocol messages.
            std::cout << " Done with Consumer " << std::endl << std::flush;
            sum = s;
        });

    // Start the producers.
    std::thread producers[PRODUCERS];
    for(int i = 0; i < PRODUCERS; ++i)
        producers[i] = std::thread([&q, &NILELEMENT]() {
            // Each producer pushes range [1, N] elements into the queue.
            // Ascending order [1, N] requires comparing with N at each loop iteration. Ascending order isn't necessary here.
            // Push elements in descending order, range [N, 1] with step -1, so that CPU decrement instruction sets zero/equal flag
            // when 0 is reached, which breaks the loop without having to compare n with N at each iteration.
            for(auto n = N; n > 0; --n) {
                q.push(new Room());
	    }
	    q.push(NILELEMENT);
	    std::cout << "Done with pushing " << std::endl << std::flush;
        });

    // Wait till producers have terminated.
    for(auto& t : producers)
        t.join();

    // Tell consumers to terminate by pushing one 0 element for each consumer.
    for(int i = CONSUMERS; i--;)
        q.push(nullptr);
    // Wait till consumers have terminated.
    for(auto& t : consumers)
        t.join();
    // When all consumers have terminated the queue is empty.

    // Sum up consumer's received elements sums.
    uint64_t total_sum = 0;
    for(auto& sum : sums) {
        total_sum += sum;
        if(!sum) // Verify that each consumer received at least one element.
            std::cerr << "WARNING: consumer " << (&sum - sums) << " received no elements.\n";
	else
	    std::cerr << "CONSUMER CONSUMED ALL ELEMENTS \n";
    }

    // Verify that each element has been pop'ed exactly once; not corrupted, dropped or duplicated.
    uint64_t constexpr expected_total_sum =  N * PRODUCERS;
    if(int64_t total_sum_diff = total_sum - expected_total_sum) {
        std::cerr << "ERROR: unexpected total_sum difference " << total_sum_diff << '\n';
        return EXIT_FAILURE;
    }

    std::cout << "Successfully completed " << std::endl;
    return EXIT_SUCCESS;
}

I am unable to get this compiled,

   src/example.cc: In function ‘int main()’:
   src/example.cc:18:36: error: ‘reinterpret_cast’ from integer to pointer
   18 |     Element constexpr NILELEMENT = reinterpret_cast<Element>(0xdeaddead);

I have also tried with static_cast and nothing works.

I expect this generic queue to support any data type (pointer). Any help getting this to compile is appreciated.

Thanks in advance.
