As this filesystem aims to provide ACID functionality, I tested if it can handle a ful

zbox is unable to successfully handle a machine failure about zbox HOT 9 OPEN

zboxfs commented on June 19, 2024

zbox is unable to successfully handle a machine failure

from zbox.

Comments (9)

burmecia commented on June 19, 2024

Thank you very much @fin-ger, that's an interesting test.

I ran your test script and can see that it stopped at the 7th round, but I cannot reproduce the error you posted. The crash I saw happened while writing a file, not while opening the repo. From your error log, it looks like the error happened when reading the super block, which takes place when opening an existing repo. Could you share how you built the executable on Alpine Linux? I tried building one on Ubuntu, but it cannot run on Alpine.

Also, the published zbox v0.6.1 is quite old. The latest code on the master branch has been refactored a lot, with many bug fixes and performance improvements. Could you test using the latest code instead? Just use the dependency line below in your Cargo.toml:

zbox = { git = "https://github.com/zboxfs/zbox.git", features = ["storage-file"] }

Another tip: you can turn on zbox debug output by setting an environment variable in the file run-test.exp:

RUST_LOG=zbox=trace

Looking forward to seeing more results, thanks.

fin-ger commented on June 19, 2024

The error happens when running zbox-fail-test --file data check in the VM that was previously forcefully stopped (run-check.exp).

The executable was built automatically by the travis-ci configuration. I am using the official alpine:edge docker container:

docker run --rm -v $(pwd):/volume alpine:edge /bin/sh -c 'cd /volume && apk add rust cargo libsodium-dev && export SODIUM_LIB_DIR=/usr/lib && export SODIUM_STATIC=true && cargo build --target x86_64-alpine-linux-musl'

The executable can be found in ./target/x86_64-alpine-linux-musl/debug/zbox-fail-test.

I will now create a new version of my test that uses the latest master of zbox and the RUST_LOG configuration.

The error only happens when running the test inside a VM that gets forcefully stopped. Afterwards (after booting the VM again), the repository is checked for differences against the previously generated data file (the check command).

fin-ger commented on June 19, 2024

I built a new version (0.4.0) that uses zbox from the current git master and added RUST_LOG=zbox=trace to the run and check actions of the test (run-test.exp, run-check.exp).

burmecia commented on June 19, 2024

Thanks @fin-ger. From what I found, it looks like QEMU didn't flush the written data to its driver. After the test crashed in the 7th round, the repo folder looked like this:

zbox:~# ls -l zbox-fail-test-repo
total 8
drwxr-xr-x 2 root root 4096 Apr 4 15:15 data
drwxr-xr-x 4 root root 4096 Apr 4 15:15 index
-rw-r--r-- 1 root root 0 Apr 4 15:15 super_blk.1

So you can see there is only one super block, and it is empty. The wal folder was not created at all. The super block and the wal must be guaranteed to be persisted to disk. A correct repo should look like this:

/vol # ls -l zbox-fail-test-repo
total 16
drwxr-xr-x 5 root root 160 Apr 4 11:26 data
drwxr-xr-x 5 root root 160 Apr 4 11:26 index
-rw-r--r-- 1 root root 8192 Apr 4 11:26 super_blk.0
-rw-r--r-- 1 root root 8192 Apr 4 11:26 super_blk.1
drwxr-xr-x 8 root root 256 Apr 4 11:26 wal

That means QEMU tells zbox that write() and flush() have completed when they actually have not. A possible reason is that no cache mode was specified when starting the QEMU VM. You can try adding one in run-test.exp, line 10:

-drive file=qemu/zbox.img,format=raw,cache=directsync

An explanation of the different cache modes can be found here. I've tried some of them, but I still cannot see that the files are guaranteed to be written to disk.
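
For reference, the durability requirement described above (the super block and wal must reach the disk before a write counts as complete) can be sketched with the Rust standard library. This is only a minimal illustration assuming a Linux guest, not zbox's actual storage code; write_durably and the ".tmp" suffix are hypothetical names made up for this sketch. Even with these calls, the data is only as safe as the layers below, so a hypervisor that acknowledges a flush without performing it can still lose the file.

```rust
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of zbox): write `data` to `path` and ask the
/// OS to persist both the file contents and the directory entry.
pub fn write_durably(path: &Path, data: &[u8]) -> io::Result<()> {
    // Write to a temporary sibling first so a crash never leaves a
    // half-written file under the final name.
    let mut tmp_name: OsString = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(data)?;
    // `flush()` on a `File` does not force data to the device;
    // `sync_all()` asks the OS to sync data and metadata (fsync on Linux).
    file.sync_all()?;
    drop(file);

    // Atomically move the fully written file into place.
    fs::rename(&tmp, path)?;

    // Sync the parent directory so the rename itself is persisted
    // (assumes a Unix-like system where directories can be opened).
    let parent = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(parent)?.sync_all()?;
    Ok(())
}
```

Calling write_durably(Path::new("zbox-fail-test-repo/super_blk.0"), &bytes) would be one way to exercise it; a 0-byte file after a forced stop then points at a layer below the guest ignoring the sync request rather than at the application.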

fin-ger commented on June 19, 2024

Okay, so if this is a QEMU issue, then it is not relevant for zbox. Have you tested failures of real machines with zbox?

burmecia commented on June 19, 2024

Honestly, I haven't tested real machine failures because I can't find a good, reproducible way to do that test. But I did some random I/O error fuzz tests using a special faulty storage. That storage generates I/O errors, and the fuzzer reopens the repo randomly but deterministically.

Your test makes me think that maybe I can use QEMU for fuzz crash testing, just like this guy did for OS testing, but I still need to figure out how to make writes persistent in QEMU first.
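
To illustrate what "randomly but deterministically" can mean for a faulty storage, here is a rough sketch of a writer that injects I/O errors from a seeded generator, so a failing run can be replayed with the same seed. It is not the fuzzer used in zbox; XorShift64, FaultyWriter, and failure_pattern are hypothetical names invented for this example.

```rust
use std::io::{self, Write};

/// Small xorshift64 generator: the same seed always yields the same sequence.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical wrapper that fails roughly one in `fail_one_in` writes.
pub struct FaultyWriter<W: Write> {
    inner: W,
    rng: XorShift64,
    fail_one_in: u64,
}

impl<W: Write> FaultyWriter<W> {
    pub fn new(inner: W, seed: u64, fail_one_in: u64) -> Self {
        // Clamp to 1 to avoid a division by zero in `write`.
        FaultyWriter { inner, rng: XorShift64::new(seed), fail_one_in: fail_one_in.max(1) }
    }
}

impl<W: Write> Write for FaultyWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.rng.next_u64() % self.fail_one_in == 0 {
            return Err(io::Error::new(io::ErrorKind::Other, "injected I/O error"));
        }
        self.inner.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Record which of `attempts` one-byte writes fail for a given seed.
pub fn failure_pattern(seed: u64, fail_one_in: u64, attempts: usize) -> Vec<bool> {
    let mut writer = FaultyWriter::new(Vec::new(), seed, fail_one_in);
    (0..attempts).map(|_| writer.write(b"x").is_err()).collect()
}
```

Because the error positions depend only on the seed, a crash found by the fuzzer can be reproduced by rerunning with that seed, which is the property a forced VM stop does not offer out of the box.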

fin-ger commented on June 19, 2024

I tested whether a dd if=dd-test-src of=dd-test-dst status=progress would also produce a dd-test-dst of 0 bytes. And indeed, no matter which -drive ...,cache=something I provided, it was always 0 bytes in size. Then I tried

dd if=dd-test-src of=dd-test-dst status=progress iflag=direct oflag=direct

without any cache flag provided for QEMU, and then dd-test-dst has roughly the size reported by the dd progress. I also tried oflag=dsync,nocache and it also worked. I am currently trying to set up an expect script for the dd command. The VM needs coreutils to run the above command, as the dd provided by Alpine does not support iflag and oflag. I am also looking into comparing the dst and src files but have not come up with a good solution yet.

fin-ger commented on June 19, 2024

I have added run-dd-test.exp and run-dd-check.exp. The test writes a generated string file (can be diffed 😅) with dd and oflag=direct to dd-test-dst, and the check looks for any lines in dd-test-dst that are not in dd-test-src. The expected result is only one additional line (the one not completed during the write) in the dd-test-dst file. So it looks like QEMU handles the I/O correctly with dd, or maybe just "better". I will look into the dd source code later!
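
The check described above can be sketched as follows. This is just an illustration of the idea in Rust, not the content of run-dd-check.exp; lines_missing_from_src and is_expected_after_crash are hypothetical names.

```rust
use std::collections::HashSet;

/// Return the lines of `dst` that do not occur anywhere in `src`.
pub fn lines_missing_from_src<'a>(src: &str, dst: &'a str) -> Vec<&'a str> {
    let known: HashSet<&str> = src.lines().collect();
    dst.lines().filter(|line| !known.contains(*line)).collect()
}

/// After a forced stop, at most one line (the one cut off mid-write)
/// is expected to differ from the source file.
pub fn is_expected_after_crash(src: &str, dst: &str) -> bool {
    lines_missing_from_src(src, dst).len() <= 1
}
```

With the contents of dd-test-src and dd-test-dst read into strings, is_expected_after_crash returning true corresponds to the "only one additional line" result. Note that this only checks set membership of lines, so it would not notice reordered or duplicated lines.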

burmecia commented on June 19, 2024

QEMU file I/O looks quite tricky. I might test dd with different images later on.
