glommio's Introduction

glommio


Join our Zulip community!

If you are interested in Glommio, consider joining our Zulip community. Tell us about exciting applications you are building, ask for help, or just chat with friends 😃

What is Glommio?

Glommio (pronounced glo-mee-jow or |glomjəʊ|) is a Cooperative Thread-per-Core crate for Rust & Linux based on io_uring. Like other Rust asynchronous crates, it allows one to write asynchronous code that takes advantage of Rust async/await, but unlike its counterparts, it doesn't use helper threads anywhere.

Using Glommio is not hard if you are familiar with Rust async. All you have to do is:

use glommio::prelude::*;

LocalExecutorBuilder::default().spawn(|| async move {
    // your async code here
})
.expect("failed to spawn local executor")
.join();

For more details check out our docs page and an introductory article.

Supported Rust Versions

Glommio is built against the latest stable release. The minimum supported version is 1.65. The current Glommio version is not guaranteed to build on Rust versions earlier than the minimum supported version.

Supported Linux kernels

Glommio requires a kernel with recent enough io_uring support, at least current enough to run its discovery probes. The minimum version at this time is 5.8.

Please also note Glommio requires at least 512 KiB of locked memory for io_uring to work. You can increase the memlock resource limit (rlimit) as follows:

$ vi /etc/security/limits.conf
*    hard    memlock        512
*    soft    memlock        512

Please note that 512 KiB is the minimum needed to spawn a single executor. Spawning multiple executors may require you to raise the limit accordingly.

To make the new limits effective, you need to log in to the machine again. You can verify that the limits are updated by running the following:

$ ulimit -l
512
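
If you prefer to verify the limit programmatically, here is a minimal sketch using the libc crate (this helper is not part of Glommio's API):

use libc::{getrlimit, rlimit, RLIMIT_MEMLOCK};

// Query the memlock rlimit the same way `ulimit -l` does. Note that
// getrlimit reports bytes, while `ulimit -l` prints KiB.
fn memlock_limit() -> std::io::Result<(u64, u64)> {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: getrlimit only writes into the struct we pass to it.
    if unsafe { getrlimit(RLIMIT_MEMLOCK, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((lim.rlim_cur as u64, lim.rlim_max as u64)) // (soft, hard) in bytes
}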

Contributing

See Contributing.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

glommio's Issues

The `echo` example does not work

OS: Ubuntu 20.04
Kernel: 5.6.17-050617-generic

The echo example hangs after printing the first line. After retrying a dozen times, it starts to panic at initialization, and every Glommio application then fails to start for the same reason. I have to reboot to make them work again.

$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
Server Listening on 127.0.0.1:10000 on 1 connections
^C
$ target/debug/examples/echo
thread 'server-0' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', glommio/src/parking.rs:249:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', examples/echo.rs:97:26
$ target/debug/examples/hello_world
thread 'main' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', glommio/src/parking.rs:249:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
$ RUST_BACKTRACE=1 target/debug/examples/hello_world
thread 'main' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', glommio/src/parking.rs:249:48
stack backtrace:
   0: rust_begin_unwind
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39/library/std/src/panicking.rs:475
   1: core::panicking::panic_fmt
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39/library/core/src/panicking.rs:85
   2: core::option::expect_none_failed
             at /rustc/18bf6b4f01a6feaf7259ba7cdae58031af1b7b39/library/core/src/option.rs:1221
   3: core::result::Result<T,E>::expect
             at /home/jianghua/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:933
   4: glommio::parking::Reactor::new
             at ./glommio/src/parking.rs:249
   5: glommio::executor::LocalExecutor::new
             at ./glommio/src/executor.rs:646
   6: glommio::executor::LocalExecutorBuilder::make
             at ./glommio/src/executor.rs:472
   7: hello_world::main
             at ./examples/hello_world.rs:26
   8: core::ops::function::FnOnce::call_once
             at /home/jianghua/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Consider moving integration tests over to `/tests/integration.rs`

Rust has two styles of testing -- unit tests inside the lib crate itself, and integration tests in the /tests directory. Each integration test is a separate crate, which is linked against the library crate and has access only to the public API.

I think it makes sense to move the tests that can be integration tests over to /tests. The benefits are:

  • Integration tests don't have access to private items, so we make sure that we expose all the required API.
  • Integration tests interact with scipio exactly as a client would (scipio::foo rather than crate::foo or super::foo), so they can be used to test-drive the ergonomics.
  • Integration tests can be a good source of documentation for users of scipio.
  • By clearly separating unit and integration tests, it will be easier to detect API-breaking changes once we start caring about stability. If you change some code and some tests don't compile, that's fine for unit tests, but for integration tests it requires a major version bump in Cargo.toml.
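
For illustration, a minimal /tests/integration.rs could look like the sketch below; it exercises only the public API (the test body itself is arbitrary):

// tests/integration.rs -- compiled as a separate crate, so only the public
// API is visible here; crate:: and super:: paths cannot be used.
use glommio::LocalExecutor;

#[test]
fn executor_runs_a_task_to_completion() {
    let ex = LocalExecutor::make_default();
    let result = ex.run(async { 1 + 1 });
    assert_eq!(result, 2);
}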

produce new docs + crates.io releases regularly

We should consider a mechanism to release Glommio more frequently. Maybe this exists and I am just unaware, but I am updating the README because I noticed something out of date and realized that the only version on crates.io is kinda old, or at least older than what is living in master right now. Obviously it'll become more important to be careful with releases later (post 1.0), but right now we can probably release more frequently.

So maybe we should create a system to do the releases or at least host the docs somewhere. Does this already exist?

Thoughts?

shared_channel is destroyed with pending wakers alive

A reproducing test is as follows.

diff --git a/glommio/src/channels/shared_channel.rs b/glommio/src/channels/shared_channel.rs
index 723ddaa..6a70102 100644
--- a/glommio/src/channels/shared_channel.rs
+++ b/glommio/src/channels/shared_channel.rs
@@ -456,6 +456,7 @@ mod test {
     use super::*;
     use crate::timer::Timer;
     use crate::LocalExecutorBuilder;
+    use futures_lite::FutureExt;
     use futures_lite::StreamExt;
     use std::sync::atomic::{AtomicUsize, Ordering};
     use std::sync::Arc;
@@ -650,28 +651,33 @@ mod test {
 
     #[test]
     fn send_to_full_channel() {
-        let (sender, receiver) = new_bounded(1);
+        let (sender, receiver) = new_bounded::<u8>(1);
 
         let status = Arc::new(AtomicUsize::new(0));
         let s1 = status.clone();
 
         let ex1 = LocalExecutorBuilder::new()
             .spawn(move || async move {
-                let sender = sender.connect().await;
-                sender.send(0).await.unwrap();
-                let x = sender.try_send(1);
-                assert_eq!(x.is_err(), true);
-                s1.store(1, Ordering::Relaxed);
             })
             .unwrap();
 
         let ex2 = LocalExecutorBuilder::new()
             .spawn(move || async move {
-                let receiver = receiver.connect().await;
-
-                while status.load(Ordering::Relaxed) == 0 {}
-                let x = receiver.recv().await.unwrap();
-                assert_eq!(0, x);
+                let receiver = receiver.connect();
+                let sender = sender.connect();
+                let (receiver, sender) = futures::future::join(receiver, sender).await;
+
+                future::poll_fn(move |cx| {
+                    let mut f1 = receiver.recv().boxed_local();
+                    assert_eq!(f1.poll(cx), Poll::Pending);
+                    assert_eq!(sender.try_send(1).is_ok(), true);
+                    let r = receiver.recv_one(cx);
+                    assert_eq!(r, Poll::Ready(Some(1)));
+                    r
+                }).await;
             })
             .unwrap();
 
diff --git a/glommio/src/parking.rs b/glommio/src/parking.rs
index 5cbb568..87364e8 100644
--- a/glommio/src/parking.rs
+++ b/glommio/src/parking.rs
@@ -305,6 +305,8 @@ impl Reactor {
 
     pub(crate) fn unregister_shared_channel(&self, id: u64) {
         let mut channels = self.shared_channels.borrow_mut();
+        assert_eq!(channels.wakers_map.len(), 0);
+        assert_eq!(channels.connection_wakers.len(), 0);
         channels.check_map.remove(&id);
     }

Rust `Waker`s need to be `Send + Sync`

It is currently not possible in Rust to safely define non thread-safe Wakers. They are defined to be Send and Sync, and thereby the RawWakers must fulfill those guarantees.

This commit, as well as the general Waker implementation for glommio tasks, violates those properties.

Non-thread-safe Wakers had been part of the earlier async/await design in the form of a so-called LocalWaker. However, they were removed from the design in order to simplify it, which unfortunately makes life a bit harder (or slower) for projects like this.

In addition to this, Context is also Sync, which is something that wasn't initially planned for and probably still should be fixed.

You might want to consider a different wakeup mechanism than directly referencing the tasks via non-thread-safe pointers. The original RFC mentioned an alternative implementation using thread-local storage and task IDs, which could work. However, there is also a drawback to that design: it would never allow non-local wakeups, since the executor on the other thread would not be found (which might, however, be less of an issue for this project).
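
To make that alternative concrete, here is a minimal sketch of the thread-local-storage-plus-task-ID scheme (not glommio's implementation; the RUNNABLE registry below is a stand-in for a real run queue):

use std::cell::RefCell;
use std::collections::HashMap;
use std::task::{RawWaker, RawWakerVTable, Waker};

thread_local! {
    // Hypothetical per-thread run state: task id -> "is runnable" flag.
    static RUNNABLE: RefCell<HashMap<usize, bool>> = RefCell::new(HashMap::new());
}

unsafe fn vt_clone(data: *const ()) -> RawWaker {
    RawWaker::new(data, &VTABLE)
}

unsafe fn vt_wake(data: *const ()) {
    // Only the owning thread finds the task; a wake arriving from another
    // thread is simply lost -- the "no non-local wakeups" drawback noted above.
    let task_id = data as usize;
    RUNNABLE.with(|map| {
        if let Some(flag) = map.borrow_mut().get_mut(&task_id) {
            *flag = true;
        }
    });
}

unsafe fn vt_drop(_data: *const ()) {}

static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_wake, vt_wake, vt_drop);

fn waker_for(task_id: usize) -> Waker {
    // Memory-safe even if the Waker crosses threads: the data pointer is just
    // an integer id, and the vtable only touches thread-local state.
    unsafe { Waker::from_raw(RawWaker::new(task_id as *const (), &VTABLE)) }
}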

Unresolved doc links

I ran cargo doc against glommio and it returned 5 warnings:

warning: unresolved link to `Task::spawn_into`
    --> glommio/src/executor.rs:1136:25
     |
1136 |     /// [`spawn_into`]: Task::spawn_into
     |                         ^^^^^^^^^^^^^^^^ the struct `Task` has no field or associated item named `spawn_into`
     |
     = note: `#[warn(broken_intra_doc_links)]` on by default

warning: unresolved link to `Task::create_task_queue`
  --> glommio/src/shares.rs:18:20
   |
18 | /// [`TaskQueue`]: Task::create_task_queue
   |                    ^^^^^^^^^^^^^^^^^^^^^^^ no item named `Task` in scope

warning: unresolved link to `Task::create_task_queue`
  --> glommio/src/shares.rs:22:24
   |
22 |     /// [`TaskQueue`]: Task::create_task_queue
   |                        ^^^^^^^^^^^^^^^^^^^^^^^ no item named `Task` in scope

warning: unresolved link to `Task::create_task_queue`
  --> glommio/src/shares.rs:27:24
   |
27 |     /// [`TaskQueue`]: Task::create_task_queue
   |                        ^^^^^^^^^^^^^^^^^^^^^^^ no item named `Task` in scope

warning: unresolved link to `Task::create_task_queue`
  --> glommio/src/shares.rs:71:20
   |
71 | /// [`TaskQueue`]: Task::create_task_queue
   |                    ^^^^^^^^^^^^^^^^^^^^^^^ no item named `Task` in scope

warning: 5 warnings emitted

Provide an ergonomic way to open a file given a directory

Linux has an openat syscall that opens a file relative to a directory.
The reason we don't have the same yet is that openat overloads its meaning: if you pass an absolute path, the dir argument is ignored.

That doesn't sit well with our level of abstraction: if we have an API that opens (or creates) a file inside a directory, we should make sure that the path does not contain directories and is just a file name.

The advantages of doing that include:

  • We will be able to enhance error messages, as we'll always be able to tell the user which directory was involved in an open/create
  • We will be able to easily store information about properties of a file (for instance block alignment or preferred I/O ring, and in the future the device, for scheduling purposes) without the need to stat
  • We will be able to use io_uring's registered fd support by pre-opening a bunch of file descriptors in that directory
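
As a sketch of the kind of guard such an API could apply before calling openat (the helper name is hypothetical, not a committed design):

use std::io::{Error, ErrorKind};
use std::path::Path;

// Reject anything that isn't a bare file name, so the directory argument of
// openat(2) can never be silently ignored.
fn validate_file_name(name: &Path) -> std::io::Result<()> {
    // A bare file name has exactly one normal component; "..", "a/b", and
    // absolute paths are all rejected.
    if name.components().count() != 1 || name.file_name().is_none() {
        return Err(Error::new(
            ErrorKind::InvalidInput,
            format!("expected a bare file name, got {:?}", name),
        ));
    }
    Ok(())
}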

Memory allocation failures in "cargo test"

[penberg@nero scipio]$ cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.02s
     Running target/debug/deps/scipio-77adf7d6245cfb16

running 14 tests
test executor::create_fail_to_bind ... ok
test executor::create_and_destroy_executor ... ok
test executor::invalid_task_queue ... ok
test executor::create_and_bind ... ok
test executor::spawn_without_executor ... ok
test local_semaphore::broken_semaphore_if_close_happens_first ... ok
test executor::task_optimized_for_throughput ... FAILED
test local_semaphore::explicit_signal_unblocks_waiting_semaphore ... FAILED
test executor::ten_yielding_queues ... FAILED
test local_semaphore::broken_semaphore_returns_the_right_error ... FAILED
test local_semaphore::permit_raii_works ... FAILED
test local_semaphore::semaphore_acquisition_for_zero_unit_works ... FAILED
test local_semaphore::broken_semaphore_if_acquire_happens_first ... ok
test executor::task_with_latency_requirements ... ok

failures:

---- executor::task_optimized_for_throughput stdout ----
thread 'executor::task_optimized_for_throughput' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19

---- local_semaphore::explicit_signal_unblocks_waiting_semaphore stdout ----
thread 'local_semaphore::explicit_signal_unblocks_waiting_semaphore' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19

---- executor::ten_yielding_queues stdout ----
thread 'executor::ten_yielding_queues' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19

---- local_semaphore::broken_semaphore_returns_the_right_error stdout ----
thread 'local_semaphore::broken_semaphore_returns_the_right_error' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19

---- local_semaphore::permit_raii_works stdout ----
thread 'local_semaphore::permit_raii_works' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19

---- local_semaphore::semaphore_acquisition_for_zero_unit_works stdout ----
thread 'local_semaphore::semaphore_acquisition_for_zero_unit_works' panicked at 'cannot initialize I/O event notification: Os { code: 12, kind: Other, message: "Cannot allocate memory" }', src/parking.rs:359:19


failures:
    executor::task_optimized_for_throughput
    executor::ten_yielding_queues
    local_semaphore::broken_semaphore_returns_the_right_error
    local_semaphore::explicit_signal_unblocks_waiting_semaphore
    local_semaphore::permit_raii_works
    local_semaphore::semaphore_acquisition_for_zero_unit_works

test result: FAILED. 8 passed; 6 failed; 0 ignored; 0 measured; 0 filtered out

reduce allocations in RawTask

I ran a profiler today, and at the top of the list of allocation calls is RawTask::allocate:

MOST CALLS TO ALLOCATION FUNCTIONS
2500010 calls to allocation functions with 2.02KB peak consumption from
glommio::task::raw::RawTask$LT$F$C$R$C$S$C$T$GT$::allocate::hfcfb94f2ae2fac56
  in /home/glauber/glommio/target/release/deps/storage-ecc5486f9a5b74db
2045453 calls with 1.66KB peak consumption from:
    glommio::executor::LocalExecutor::spawn::h2480a54bc747dc84
      in /home/glauber/glommio/target/release/deps/storage-ecc5486f9a5b74db
    glommio::executor::Task$LT$T$GT$::local::h9231bd6aaf3300c5
      in /home/glauber/glommio/target/release/deps/storage-ecc5486f9a5b74db

We can make that significantly cheaper with a slab.
Rust has slab crates, but I don't like any of them because they are all backed by a vector. That can be horrible for latency,
because adding elements can cause the entire vector to reallocate.

The kernel slab is backed by pages, which is likely what we should have: allocate some memory when needed (say, 4 kB), and then partition that memory into objects as needed.

This one is particularly tricky, because there can be a lot of allocations alive if there are many tasks alive, so vector-based storage is likely to grow too big.
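
A minimal sketch of what a page-backed slab could look like (just the shape of the idea, not a drop-in replacement for RawTask's allocation):

// Fixed-size slots carved out of ~4 kB chunks. Unlike a Vec-backed slab, a
// chunk is never reallocated once created, so live objects never move and
// there is no latency spike from one huge re-allocation.
struct PageSlab<T> {
    chunks: Vec<Box<[Option<T>]>>, // each chunk holds ~4 kB worth of slots
    free: Vec<(usize, usize)>,     // free list of (chunk, slot) pairs
    slots_per_chunk: usize,
}

impl<T> PageSlab<T> {
    fn new() -> Self {
        let slot = std::mem::size_of::<Option<T>>().max(1);
        PageSlab {
            chunks: Vec::new(),
            free: Vec::new(),
            slots_per_chunk: (4096 / slot).max(1),
        }
    }

    fn insert(&mut self, value: T) -> (usize, usize) {
        if let Some((c, s)) = self.free.pop() {
            self.chunks[c][s] = Some(value);
            return (c, s);
        }
        // Grow by one chunk. The `chunks` Vec may reallocate, but it only
        // holds pointers; the slots themselves stay where they are.
        let c = self.chunks.len();
        self.chunks
            .push((0..self.slots_per_chunk).map(|_| None).collect());
        self.free
            .extend((1..self.slots_per_chunk).rev().map(|s| (c, s)));
        self.chunks[c][0] = Some(value);
        (c, 0)
    }

    fn remove(&mut self, key: (usize, usize)) -> Option<T> {
        let v = self.chunks[key.0][key.1].take();
        if v.is_some() {
            self.free.push(key); // recycle the slot
        }
        v
    }
}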

Merge read requests for random file access

Although a fully fledged I/O scheduler is probably way into the future, there is a piece of functionality that I would like to see in an early implementation that could serve as a seed for it: random read request merging.

Imagine a read workload for a datastore. It is possible, if not likely, that many independent readers may be touching the same file, in the same - or close enough - position.

If that is the case, we don't have to go to storage every time.
In preparation for such a change, in PR #72 I am changing the return type of read_dma and read_dma_aligned to ReadResult instead of DmaBuffer. The ReadResult encapsulates a buffer, and two or more reads that end up reading from the same buffer may share it.

If reads are queued at positions similar to the one we are reading from (either adjacent or within a tolerance gap), we can extend the reads. For example:

Let's say we fire a read at position 0 of file A, reading 512 bytes.
Then another reader tries to read 1024 bytes from position 1024.

We shouldn't ever delay firing reads to wait for possible merges. But let's say that when the time comes to fire the I/O, we see that those two requests are in the queue: it is better to read 2048 bytes and just give each one a slice of the buffer.
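
For illustration, here is a sketch of that merging decision over queued (position, length) pairs, with a configurable tolerance gap:

// Coalesce queued reads that are adjacent or within `gap` bytes of each
// other; each original request is later served a slice of the big buffer.
fn merge_reads(mut reqs: Vec<(u64, u64)>, gap: u64) -> Vec<(u64, u64)> {
    reqs.sort_by_key(|&(pos, _)| pos);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (pos, len) in reqs {
        match merged.last_mut() {
            // Close enough to the previous read: extend it to cover this one.
            Some((mpos, mlen)) if pos <= *mpos + *mlen + gap => {
                *mlen = (*mlen).max(pos + len - *mpos);
            }
            _ => merged.push((pos, len)),
        }
    }
    merged
}

// The example above: merge_reads(vec![(0, 512), (1024, 1024)], 512)
// yields vec![(0, 2048)].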

Unclear soundness of `PosixDmaBuffer`

Hi!

I've been casually browsing the source of scipio (I am really excited about TpC in Rust!), and I think PosixDmaBuffer might lead to undefined behavior. At least, it's not clear to me why that wouldn't be the case.

Here are my notes:

  • The PosixDmaBuffer::trim_to_size and PosixDmaBuffer::trim methods are safe. However, they don't check that sizes are in bounds, and so can be misused to cause UB. Specifically, as_ptr causes UB if data.add(self.trim) violates the in-bounds requirement (yes, mere pointer arithmetic can cause UB), and as_bytes additionally causes UB if self.size is out of bounds.
  • By itself, this isn't necessarily a problem -- if PosixDmaBuffer is not a public type, then the crate itself might just be careful enough to use it correctly. It's not obvious from the source whether the type is indeed public, but making it pub(crate) and fixing the re-exports clears this up.
  • However, there might be indirect UB exposed still. PosixDmaBuffer is used by the DmaFile, and I think that might lead to UB.
  • Specifically, I feel that something like read_dma(1, usize::MAX) might be UB. I haven't traced the code thoroughly so I might be wrong, but I think the following would happen:
    • b will be 1
    • eff_size will be 0 in release due to overflow
    • trim_front will then trim the empty buffer by 1
  • Again, this is just me eyeballing the code, so this particular problem might turn out to be no problem at all, but it's pretty hard to conclude that no such problems could exist.

My recommendations would be:

  • Make PosixDmaBuffer pub(crate) to be sure it's not accidentally exposed (this also surfaces a couple of "unused" warnings -- the compiler is better at reasoning about non-pub types).
  • Make trim and trim_front methods unsafe and add debug_asserts inside, OR add asserts.

I'd personally lean towards asserts -- this is IO, so I wouldn't expect bounds checking to be costly.
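
A self-contained illustration of the assert-based fix (the field names here are hypothetical, not scipio's actual layout):

// A buffer with the same shape of state: a base pointer plus a front trim
// and a logical size, where as_ptr()/as_bytes() offset by `trim`.
struct Buffer {
    data: *mut u8,
    trim: usize,
    size: usize,
}

impl Buffer {
    fn trim_front(&mut self, trim: usize) {
        // The assert is cheap next to the I/O this buffer backs, and it turns
        // would-be UB in the pointer arithmetic into a clean panic.
        assert!(trim <= self.size, "trim {} exceeds size {}", trim, self.size);
        self.trim += trim;
        self.size -= trim;
    }
}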

Poll executor for a while before going to sleep

Going to sleep is expensive, so it is better to poll the executor for a while before going to sleep, by calling poll_io instead of park even if we have no tasks. If we always do that, though, then our CPU usage is always 100% which is bad too.

Seastar, for instance, will default to polling for some time before going to sleep.

In our implementation, an executor may or may not be pinned to a CPU:

  • Executors that are not pinned should never poll.
  • Executors that are pinned may or may not poll, depending on the configuration.

The default should definitely be to not poll.

When polling is employed, we need to keep statistics about how much time was spent polling to inform users about "real" CPU usage vs "seen-by-Linux" CPU usage.
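
A sketch of that policy, with the reactor abstracted behind two hypothetical closures (poll_io returns how many events it gathered; park blocks until one arrives):

use std::time::{Duration, Instant};

// Spin for up to `spin_for` before parking, and account the time spent
// spinning so "real" CPU usage can be reported separately from what Linux sees.
fn spin_then_park(
    mut poll_io: impl FnMut() -> usize,
    mut park: impl FnMut(),
    spin_for: Duration,
    spin_time: &mut Duration,
) {
    let start = Instant::now();
    loop {
        if poll_io() > 0 {
            *spin_time += start.elapsed();
            return; // got events; go run tasks
        }
        if start.elapsed() >= spin_for {
            *spin_time += start.elapsed();
            park(); // spin budget exhausted; sleep until an event arrives
            return;
        }
    }
}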

error information is lost

Part of the information we encode in errors is lost when we print them.

Take for example this:


        if start_id != end_id {
            return Err(GlommioError::<()>::WouldBlock(ResourceType::File(format!(
                "Reading {} bytes from position {} would cross a buffer boundary (Buffer size {})",
                len, self.current_pos, buffer_size
            ))));
        }

This very helpful message was supposed to tell us how much we are reading, by how much we failed, etc.

However, due to the way we are implementing Display for GlommioError, the only information that makes it into the output is:

thread 'test-1' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: WouldBlock, error: "File operation would block" }',
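
A self-contained sketch of the fix, with simplified stand-in types (not glommio's actual error enum): have Display carry the detail string through instead of flattening it.

use std::fmt;

enum ResourceType {
    File(String), // carries the formatted detail message
}

enum Error {
    WouldBlock(ResourceType),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Keep the detail so "Reading N bytes from position P..." survives.
            Error::WouldBlock(ResourceType::File(detail)) => {
                write!(f, "file operation would block: {}", detail)
            }
        }
    }
}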

`LocalExecutor::make_default` leaks memory via RC cycle

The comment

    And by all I mean, seriously, even spawn_without_executor() failed [ASAN] and there's clearly nothing wrong with that one.

got me curious, so I decided to run that test under Miri.

$ MIRI_NIGHTLY=nightly-$(curl -s https://rust-lang.github.io/rustup-components-history/x86_64-unknown-linux-gnu/miri)
$ rustup toolchain add $MIRI_NIGHTLY
$ rustup component add miri --toolchain $MIRI_NIGHTLY
$ export MIRIFLAGS="-Zmiri-disable-isolation"
$ cargo +$MIRI_NIGHTLY miri test -- spawn_without_executor

Sure enough, it found a memory leak:

   Compiling scipio v0.1.0 (/home/matklad/projects/scipio/scipio)
   Compiling examples v0.0.0 (/home/matklad/projects/scipio/examples)
    Finished test [unoptimized + debuginfo] target(s) in 0.94s
     Running target/x86_64-unknown-linux-gnu/debug/deps/scipio-c56c9e43674a62a8

running 1 test
test executor::test::spawn_without_executor ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 83 filtered out

The following memory was leaked: alloc632956 (Rust heap, size: 168, align: 8) {
    0x00 │ 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 │ ................
    0x10 │ 00 00 00 00 00 00 00 00 08 00 00 00 00 00 00 00 │ ................
    ........................................
    ........................................
alloc633455 (fn: std::intrinsics::drop_in_place::<[closure@scipio/src/executor.rs:455:17: 458:18]> - shim(Some([closure@scipio/src/executor.rs:455:17: 458:18])))
alloc633456 (fn: executor::LocalExecutor::init::{closure#0})
alloc633457 (fn: executor::LocalExecutor::init::{closure#0})
alloc633458 (fn: <[closure@scipio/src/executor.rs:455:17: 458:18] as std::ops::FnOnce<()>>::call_once - shim(vtable))

error: the evaluated program leaked memory

error: aborting due to previous error

error: test failed, to rerun pass '-p scipio --lib'

Poking around closure@scipio/src/executor.rs:455:17: 458:18 revealed what seems to be an Rc-cycle memory leak. This commit adds a minimal demonstration and a crude fix via weak pointers:

matklad@9f2912f
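
The underlying pattern, reduced to a few lines (the names are illustrative, not the actual executor types):

use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Executor {
    queues: RefCell<Vec<Rc<Queue>>>,
}

struct Queue {
    // If this were Rc<Executor>, executor -> queue -> executor would form a
    // cycle and neither refcount could ever reach zero: a leak.
    owner: Weak<Executor>,
}

fn make_executor() -> Rc<Executor> {
    let ex = Rc::new(Executor { queues: RefCell::new(Vec::new()) });
    let q = Rc::new(Queue { owner: Rc::downgrade(&ex) });
    ex.queues.borrow_mut().push(q);
    ex // dropping the last Rc<Executor> now frees everything
}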

Add helpers to create many executors with placement hints.

Using the hwloc library, we are able to discover the topology of the box.

Our current API to create an executor takes a CPU number. But it would be nice if we could also have another, more expressive helper where we can express placement constraints and get that from hwloc.

Examples:

  • "Give me two executors, placing them both in the same physical CPU, just in different hyperthreads"
  • "Give me one executor in each socket in this box"

@penberg - you might find this interesting

Consider using futures-intrusive

This crate contains yet another implementation of async channels and semaphores. Like the earlier futures-style implementations, the one here needs to store an unbounded list of waiting tasks in the sync primitives. This has two downsides:

  • A heap allocation is necessary
  • This list might grow in some high-load situations, and might never be shrunk later on, leading to memory leakage

futures-intrusive demonstrated how to build async sync primitives which use intrusive linkage instead of heap-allocated task lists, and thereby don't require any dynamic memory allocation. The same approach has since been picked up by the tokio sync primitives.

You might be interested in doing the same. And by the way: futures-intrusive also offers non-Sync versions of those primitives which do not require any "real" OS mutex.

Those could be helpful to you out of the box.

Zulip invitation

The README states that the Zulip community is the place to ask questions, but it seems to be invite-only. Perhaps update the README, or open up the Zulip community?

I do have some questions regarding Scipio, but I do not wish to pollute the issue tracker :)

executor::test::test_ping_pong_yield is flaky

This test fails for me once in a while:

$ cargo t -q --lib
running 95 tests
..................F............................................................................
failures:

---- executor::test::test_ping_pong_yield stdout ----
thread 'executor::test::test_ping_pong_yield' panicked at 'assertion failed: counter.get() < 10', scipio/src/executor.rs:1609:9

Asan not returning cleanly

@Laa hit this recently: our cargo test with ASAN does not return cleanly.

Command line:

cargo +nightly test -Zbuild-std --target x86_64-unknown-linux-gnu

with flags:

export RUSTFLAGS=-Zsanitizer=address

They all seem related to Rc/Arc lifetimes, but some will be tricky to fix.
In particular, the way those dependencies are usually broken is by moving things to Weak. But we don't want to do this with the Arc we use for eventfd: passing a Weak pointer would force us to upgrade it before use, and that is full of atomics, which would tank the shared-channel performance.

The others I found are related to the submission queue (I tried moving the EnqueuedSource to Weak and it didn't fix it, so this needs more investigation) and to the memory allocator for registered buffers.

alignment requirements can be different across different systems

For the sake of simplicity we are assuming the write-buffer alignment to be 4096 bytes and the read-buffer alignment to be 512 bytes. Those are likely good across the majority of systems, but some disks do not support 512-byte access.

Linux will tell us in sysfs what the preferred alignment is. We should use that.
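
A sketch of reading those values from sysfs (paths assume a whole-disk device such as "sda"; error handling kept minimal):

use std::fs;

// Read the device's logical and physical block sizes. The logical size
// bounds reads; the physical size is a reasonable floor for writes.
fn block_sizes(dev: &str) -> std::io::Result<(u64, u64)> {
    let read = |file: &str| -> std::io::Result<u64> {
        let path = format!("/sys/block/{}/queue/{}", dev, file);
        Ok(fs::read_to_string(path)?.trim().parse().unwrap_or(512))
    };
    Ok((read("logical_block_size")?, read("physical_block_size")?))
}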

timer may not be destroyed if not yet fired when executor is destroyed

The following code gets ASAN screaming at us:

    #[test]
    fn asan() {
        use crate::LocalExecutorBuilder;
        test_executor!(async move {
            let action = TimerActionOnce::do_in(Duration::from_millis(100), async move {
                println!("hello");
            });
        });
    }

Note how we create a TimerActionOnce, but because we never join it, the executor exits right away.
The timer within never fires, and its Drop function is not called either. So we leak the timer's memory.
I added debugging information to the task queue's drop: indeed when we are dropping this we have zero tasks in the task queue.

The timer, however, hasn't fired, so we will not have called its waker, which is why the task queue is empty.

source reuse is possessed by the devil

Technically we should be able to reuse a Source by taking its results and adding a new SourceType.
I did that for timers in commit 4c6460f.

However, that had to be reverted because it heavily regressed the shared channel benchmark.

Lo and behold, I spent an entire day of my precious life today on a problem that I eventually tracked down to the same thing:
While reworking the tcp data structures, I wanted to work with reusable sources from the beginning. And then all hell broke loose.

The symptom was that every time we would sleep waiting to receive from a socket, waking up would take a bit longer.
It would grow linearly, initially taking around 20us (the expected cost of a receive) and eventually ballooning into milliseconds.

Once I reverted to using single-purpose allocatable sources it all worked again. Common between the two issues is that we sleep with the reusable source re-sent to the ring.

However inspecting this I found no obvious leaks. io_uring is not aware of what our Sources are, and they all seem to be consumed correctly.

This bug already caused a major trauma for my 2 year old son, who was deprived of playing with his daddy today. He may never recover.

What is the recommended way to distribute requests across multiple threads?

This is probably mostly a documentation thing... The first question that comes to my mind when thinking about building a server with TPC design and glommio is: how should I distribute requests across multiple threads (probably via channels)?

There are probably three different approaches to this:

  1. Spread accepted TCP connections across threads/executors (this is possible using glommio::net::AcceptedTcpStream).
  2. Manage all accepted TCP connections in one thread, but divide and spread individual tasks invoked by client requests across multiple threads.
  3. Like the first approach, but combine it with "work-stealing", like rayon (see also https://docs.rs/crossbeam-deque)...

The first approach has the advantage that the latency is probably lower, because it doesn't need to perform much cross-thread communication, but it has the disadvantage that one connection could starve, impact, or reduce the performance of other connections "running" in the same thread. Example: a server accepts 8 connections and spreads them evenly across 4 threads, with one executor per thread and one thread per core. All but two connections, which unluckily end up in the same thread, are terminated (successfully or unsuccessfully, it doesn't matter) soon after. If the remaining two connections both incur a huge load on the server, then only one core is utilized. This is bad.

The second approach has the advantage that it doesn't suffer from the same starvation problem as much (it only suffers from it if most of the data is transferred across the network via that thread, and only extremely small amounts of work can be offloaded to other threads). It has the disadvantage that the latency is probably higher, especially if a huge amount of data is transferred over the network and not much preprocessing is done on the server.

The third approach is probably the most appropriate one for many generic workloads (although the other approaches are probably completely fine for short-lived connections and other special cases). It doesn't suffer from the starvation issue, as other executors would notice that they have nothing to do and would try to steal the connection that is currently not being processed (e.g., assume one connection is currently handled by an executor, which is retrieving data from it; other threads then "steal" the currently waiting connection). Thus, this would repeatedly rebalance the connections across the threads as soon as one thread becomes idle. I think that glommio should support that use case somehow.
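
To make the first approach concrete, here is a sketch using plain std types (threads and mpsc channels standing in for executors and shared channels):

use std::net::{TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:10000")?;
    let workers: Vec<mpsc::Sender<TcpStream>> = (0..4)
        .map(|id| {
            let (tx, rx) = mpsc::channel::<TcpStream>();
            thread::spawn(move || {
                // In a thread-per-core design this thread would be pinned and
                // would run a local executor owning its connections.
                for conn in rx {
                    let _ = (id, conn); // handle the connection here
                }
            });
            tx
        })
        .collect();

    // Round-robin is cheap, but as described above, a hot connection can
    // skew all the load onto one core.
    for (i, conn) in listener.incoming().enumerate() {
        workers[i % workers.len()].send(conn?).unwrap();
    }
    Ok(())
}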

Implementation of write_all/read_exact in DMA/buffered files

Hi,
I am only a beginner in this area, but as far as I understand, the write/read calls of the file implementations can write/read the provided data only partially (I suppose the behaviour of the async counterparts is the same as the sync ones), and if developers want to be sure that all the data is read/written, they need to write their own implementation. I understand that such functionality can be implemented using the futures crate and the provided streams, but that implies the overhead of copying data into a DmaBuffer. So it would probably be good to provide such methods as part of the API?
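
For reference, the requested semantics sketched with std's sync Write trait (an async version would loop over short writes the same way):

use std::io::{self, Write};

// Keep issuing writes until the whole buffer is written, handling short
// writes and retrying on EINTR -- the loop a write_all helper would hide.
fn write_all_loop<W: Write>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match w.write(buf) {
            Ok(0) => return Err(io::Error::new(io::ErrorKind::WriteZero, "write returned 0")),
            Ok(n) => buf = &buf[n..],
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}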

DmaFile::close is unsound

This code is unsound for two reasons:

https://github.com/DataDog/scipio/blob/35c116c8db5ee4bd432968fb8417799198526de7/scipio/src/io/dma_file.rs#L521-L523

  • first, -1 is an invalid fd, and this is going to be directly exploited soon: rust-lang/rust#74699
  • second, after we close the file via the ring, the usual Drop for the file field runs and closes the fd a second time.

I think the best fix here would be to make close take self rather than &mut self, but this changes the overall API. I think it is a good change (in general, I think Rc is overused), but I need to study the usages better to be sure. A draft fix is in matklad@503d7d2
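
The shape of that fix, illustrated on a hypothetical fd wrapper using the libc crate (not the actual DmaFile code):

use std::mem::ManuallyDrop;
use std::os::unix::io::RawFd;

struct File {
    fd: RawFd,
}

impl File {
    // Taking `self` by value means the fd can be neither used nor
    // double-closed after this call.
    fn close(self) -> std::io::Result<()> {
        let this = ManuallyDrop::new(self); // suppress Drop: close exactly once
        if unsafe { libc::close(this.fd) } < 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }
}

impl Drop for File {
    fn drop(&mut self) {
        // Fallback for files dropped without an explicit close.
        unsafe { libc::close(self.fd) };
    }
}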

Implement an error system similar to the approach of the io::Error structure

This issue was discussed in the community chat. The main issues with io::Error are:

  1. The mapping between errors that happen inside the glommio library and their presentation in io::Error is fuzzy and cannot precisely describe the root cause of the issue.
  2. io::Error lacks useful information, such as the path of the file where the error happened. The error system should provide the means to expose such information without allocation overhead in no-error situations. In general, error types should be very thin and should not require a substantial amount of memory.

Despite the apparent simplicity of the task, it requires a deep investigation of current implementations of error handling in Rust and a good understanding of the possible set of errors which could (and will) happen inside the library.

Improve the Tcp benchmark to show multi-executor performance

The "bandwidth over 100 sockets" test is only able to use two threads. Therefore it will not scale as well as Tokio and other frameworks. This is totally fine because a real application should use more cores but it would still be nice to see that scalability in the benchmark.

Aspects to consider when dealing with this:

  • Is it better to have separate executors for accept, or to have accept + connect task queues in each executor? I suspect the latter, but we also need to make sure the connections don't all become CPU-local, making for an unfair benchmark.

My intuition tells me that there has to be a good, ergonomic way to spawn similar work on all or a subset of executors. An outstanding implementation will do that first as a stepping stone, but it's not a must.

"Previously named Glommio"

In glommio/src/lib.rs, the documentation states:

This crate was previously named Glommio but was renamed after a trademark dispute. We are removing this message soon but it is now here to avoid confusion.

I assume you meant to write Scipio here?

Use io_uring's native support for networking

As we inherited code from async-io, we are still doing networking with non-blocking sockets that are polled, plus read/write syscalls.

io_uring supports reading directly from blocking sockets which is supposed to be simpler and faster, not to mention there is support for fancy things like buffer selection.

I'd love to see the networking code changed to use io_uring's native capabilities. To do that we'll probably no longer be able to write things like Async<TcpStream>, as Async represents a pollable fd.

hit "Attempting to release a source with pending waiters!"

In Rust, futures don't have to be polled to completion. Probably the canonical example is a timeout, which can be implemented by wrapping a future and simply not polling the inner future after some time.

This creates an impedance mismatch with io_uring, which requires buffers to stay around until an operation completes. The attached program has no unsafe code but hits a panic, and even with that avoided by blocking, I think it could still hit a use-after-free.

For more information see: https://without.boats/blog/io-uring/

hello_world.rs.txt
log.txt

Shared channel benchmark intermittently fails or hangs

I wrote a benchmark for shared channels, which sometimes fails with free(): invalid pointer:

$ cargo +nightly bench --bench shared_channel
    Finished bench [optimized] target(s) in 0.14s
     Running target/release/deps/shared_channel-013a11e62adf9291
Shared channel (size: 10): 877.422145ms, 877.4221ns
Shared channel (size: 100): 387.61481ms, 387.6148ns
free(): invalid pointer
error: process didn't exit successfully: `/project/target/release/deps/shared_channel-013a11e62adf9291 --bench` (signal: 6, SIGABRT: process abort signal)

It also has a probability (lower than that of the invalid free) of hanging:

$ cargo +nightly bench --bench shared_channel
    Finished bench [optimized] target(s) in 0.02s
     Running target/release/deps/shared_channel-013a11e62adf9291
Shared channel (size: 10): 815.309948ms, 815.30994ns
Shared channel (size: 100): 384.314961ms, 384.31497ns
prod waiting
recv waiting
recv waiting
prod waiting
^C

Code for the benchmark:

use std::time::{Duration, Instant};

use glommio::channels::shared_channel;
use glommio::prelude::*;
use glommio::timer::sleep;

fn main() {
    const RUNS: usize = 1_000_000;

    bench_shared_channel(RUNS, 10);
    bench_shared_channel(RUNS, 100);
    bench_shared_channel(RUNS, 1000);
    bench_shared_channel(RUNS, 10000);
}

fn bench_shared_channel(runs: usize, channel_size: usize) {
    let (sender, receiver) = shared_channel::new_bounded(channel_size);
    let t = Instant::now();

    let producer = LocalExecutorBuilder::new()
        .pin_to_cpu(1)
        .spawn(move || async move {
            let sender = sender.connect();

            let mut timeout = Local::local(async {
                sleep(Duration::from_secs(2)).await;
                println!("prod waiting")
            })
            .detach();

            for i in 0..runs {
                if (i + 1) % 10000 == 0 {
                    timeout.cancel();
                    timeout = Local::local(async {
                        sleep(Duration::from_secs(2)).await;
                        println!("prod waiting")
                    })
                    .detach();
                }
                sender.send(1).await.unwrap();
            }

            timeout.cancel();
        });

    let consumer = LocalExecutorBuilder::new()
        .pin_to_cpu(2)
        .spawn(move || async move {
            let receiver = receiver.connect();

            let mut timeout = Local::local(async {
                sleep(Duration::from_secs(2)).await;
                println!("recv waiting")
            })
            .detach();

            for i in 0..runs {
                if (i + 1) % 10000 == 0 {
                    timeout.cancel();
                    timeout = Local::local(async {
                        sleep(Duration::from_secs(2)).await;
                        println!("recv waiting")
                    })
                    .detach();
                }
                receiver.recv().await.unwrap();
            }

            timeout.cancel();
        });

    producer.unwrap().join().unwrap();
    consumer.unwrap().join().unwrap();

    let label = format!("Shared channel (size: {})", channel_size);
    print_elapsed(label.as_ref(), t, runs);
}

fn print_elapsed(label: &str, start: Instant, runs: usize) {
    let elapsed = start.elapsed();
    let avg = elapsed.as_nanos() as f32 / runs as f32;
    println!("{}: {:?}, {:?}ns", label, elapsed, avg);
}

crates.io?

It seems that this library is not on crates.io. Any plan to publish it?

Implement native support for UDP

Same as we did for TcpStream recently, we can benefit from a native UDP implementation as well.

A good contribution will come with tests and a benchmark. A comparison with Tokio, as we did for TCP, is welcome as well.

Program hangs

use glommio::{LocalExecutor, Local};

fn main() {
  LocalExecutor::make_default().run(async {
    Local::local(async { println!("hello") }).detach().await;

    // The program hangs if the following line is uncommented.
    // Local::local(async { println!("hello") }).detach().await;

    println!("world")
  });
}

Advanced options on networking are not exposed, but should be

If we look at the APIs defined by net::TcpStream, net::UdpSocket, and others, there are many options we are not implementing (setting timeouts, multicast, etc.).

They shouldn't be hard to implement (essentially just passing options down to the sockets), but I will require that they come with unit tests unless that is really not possible.

provide a specialized logging macro set

Rust has the very helpful macros from log: info!, warn!, etc.

It would be helpful to have our own versions that include the executor name or ID.
That's because oftentimes executors are all performing the same tasks, only on different parts of the dataset. It is helpful to know exactly where the issues are coming from.

The challenge in this issue is not to implement the macros, but to do it in an ergonomic way that existing applications can adapt without too much trouble.
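
A sketch of the idea, assuming some current_executor_id() helper exists (the hard part, as noted, is the ergonomics, not the macro itself):

// Hypothetical macro: same shape as log's info!, with the executor id
// prepended so you can tell which shard a message came from.
macro_rules! exec_info {
    ($($arg:tt)*) => {
        log::info!("[executor {}] {}", current_executor_id(), format_args!($($arg)*))
    };
}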

Consider adding a forum/chat?

If we intend to grow a community around scipio, we will need some community-space chat-like thing eventually. And if we'll need it eventually, we might want to set it up sooner rather than later, because why not? Besides, I already have a couple of questions/thoughts that are ill-fitted for github issues ^^

There are several common choices in this area; I personally have a strong preference towards Zulip or Discourse.

For spinning executors, have a version of `poll` that doesn't mess with timers.

For executors that are configured to spin, we just continue the loop if we decide to keep spinning, meaning we will go into poll_io to gather I/O.

This code is correct but it could be more efficient:

  • poll_io takes a duration parameter and it will reset the preempt timer for that duration. That means that we will have to do that timer dance every time we call poll_io.

If we are just spinning, we don't have to worry about the preempt timer. Plus, if the natural result of spinning is that we give up and go to sleep once the spin period ends, we will want to sleep with the timer disabled anyway.

So it would be very helpful if the Reactor provided new companion versions of poll_io and park, such that:

  • when we detect that we start spinning (IOW, there is a preempt timer registered), we disable it, and while we are in spin mode we don't register it again.
  • when we stop spinning (IOW, events are generated from the poll), we register the preempt timer again.

Technically this could be done with logic inside poll_io alone, but it would probably be cleaner to have a specialized version.
