
Comments (37)

Zygo commented on June 7, 2024

OK, so the sections are named after paths? Probably a good idea to have something like

[name]
  path = /mnt/for/bees

[name2]
  path = /mnt/for/more-bees

so that you simplify string quoting rules when you get users with cases like "/mnt/data/My Anime Drive [1960-1999]"
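
That way the awkward directory name is just an ordinary value, e.g. (sketch; the section name is made up):

[anime]
  path = /mnt/data/My Anime Drive [1960-1999]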

Also I have a wishlist for glob expansion in paths so a bunch of directories can be set up at once.

kakra commented on June 7, 2024

I've updated the branch. My C++ is a bit rusty these days; I'm mostly doing Ruby on Rails now. I used to write some Qt programs in the past, did some more intense C and Haskell programming back in my university days, and before that used Pascal, BASIC, and even assembler back in the DOS days. ;-)

So I could use some reviews every now and then, with recommendations.

Yes, the basic idea is to treat every section starting with / as a section specific to that mount path. I'm planning to use realpath() to normalize it first. There's one special section called global; other non-slashed sections could work as include blocks, but that's currently out of my scope.

I think the string quoting rules are very simple: every section header is enclosed by []. We'll have no problem with unquoted [ and ] inside the outer square brackets; we can just ignore unbalanced use of [ and ]. I don't think we need to be picky about that, do you?
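
A sketch of one way to implement that rule (assuming the header is everything between the first [ and the last ] on a line; a greedy regex gives us that for free):

sed -n 's/^\[\(.*\)\]$/\1/p' <<'EOF'
[/mnt/data/My Anime Drive [1960-1999]]
EOF
# prints: /mnt/data/My Anime Drive [1960-1999]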

Adding globbing later should be easily possible but the first implementation would not support that.

kakra commented on June 7, 2024

The parsing itself would be two-pass. The first pass just scans through the config files, identifies the current section, tokenizes each configuration option, and inserts or replaces the value in a list. Config files would support drop-in directories, i.e. if the config is /etc/bees/bees.conf, it would also read all files in /etc/bees/bees.conf.d in alphanumeric order. If bees.conf itself does not exist, the .d drop-in directory alone would also do fine.
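
In shell terms, the collection order could look like this (sketch; the real implementation would be C++):

conf=/etc/bees/bees.conf
files=()
[ -f "$conf" ] && files+=("$conf")
for f in "$conf.d"/*; do           # the glob already sorts alphanumerically
  [ -f "$f" ] && files+=("$f")
done
[ "${#files[@]}" -gt 0 ] && cat "${files[@]}"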

In the second pass, each section is then initialized from the global section, then overwritten by the values specific to this section. We could also support include blocks at that phase.

Your BeesRoot class could then receive a pointer to its configuration section. The cmdline parser will be enhanced to operate directly on the Configuration class interface. Thus, the Configuration classes would also act as a registry for all sorts of runtime configuration, fixing some of the open issues along the way and giving us one place to put configuration, i.e. prefix-timestamps etc.

The basic idea is:

  1. Start first pass
  2. Initialize (reserved) cmdline section in class Configuration
  3. Parse the command line, getting the conf filename if given
  4. Read configuration files which initializes the global section (and all FS sections)
  5. Initialize reserved environment section with environment variables if defined
  6. Start second pass
  7. Initialize each section with global data, then FS specific data, then cmdline section, then environment section (this actually defines the order of preference)

At first glance, it may sound complicated, but if done right, it should be very streamlined, and adding more configuration options later is a piece of cake.

It basically comes down to:

  1. Collect everything we have
  2. Parse it in correct order
  3. Do syntax and sanity checks on the way

Then it exposes all possible configurations through one API to the rest of the code.
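
A bash sketch of that precedence merge (illustrative option names; the real code would live in the Configuration class):

declare -A global=( [scan-mode]=0 [thread-factor]=1.5 )
declare -A fs_section=( [thread-factor]=4.0 )
declare -A cmdline=( [scan-mode]=1 )
declare -A environment=( )                 # from environment variables, if set

declare -A merged=( )
for layer in global fs_section cmdline environment; do   # later layers win
  unset -n src
  declare -n src="$layer"
  for key in "${!src[@]}"; do merged["$key"]="${src[$key]}"; done
done
for key in "${!merged[@]}"; do echo "$key = ${merged[$key]}"; done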

kakra commented on June 7, 2024

Thinking about your path = suggestion: it should be easy to implement it that way, too. I could simply reuse the path internally as the section name. I'll make up my mind about that during the implementation. We have a testing framework in place and can easily try different implementations.

kakra commented on June 7, 2024

Addendum: Having the path directly as the configuration section would allow something like awk '/^\[\/.*\]$/ { print }' in the beesd wrapper script to extract all the paths for mounting. Of course the same would work with path=, and with some awk-fu we can return a section name which the user must name by UUID. Maybe we can create a small binary which loads the config and then simply returns either a parse error or all paths for the beesd wrapper to use. After all, config files will be a bit more complex than just running them through awk.
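
For example (sketch; the greedy match means embedded brackets survive):

# slash-style section names:
sed -n 's/^\[\(\/.*\)\]$/\1/p' /etc/bees/bees.conf
# the path= variant:
sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' /etc/bees/bees.conf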

nefelim4ag commented on June 7, 2024

Just curious,
how do you see the complete workflow for setting up bees on a machine with such configs?

example:
I have a machine with several FS, all mounted with subvol=<..>
Currently, I just create a config file with the FS UUID -> start the service with the FS UUID -> profit.

@kakra, can you explain your vision?

Thanks.

kakra commented on June 7, 2024

It will work similarly. I want to keep it working with your beesd wrapper script. No worries here. And that's not what the config file is about: It's not going to be a replacement for the wrapper script. Bees itself won't start mounting directories or resolve UUIDs. I think @Zygo wants to keep such dependencies out of bees, and I agree with that idea.

In the long run it allows per-volume configurations, and probably also exclude/include patterns as @Zygo noted somewhere else (if those ever get implemented).

Zygo commented on June 7, 2024

If you used the UUID as the section name, it could look like this:

[default]
# Options apply to any filesystem unless it matches a UUID below
no-timestamps
thread-factor=1.5
scan-mode = 0

[3b9853cc-6695-4cb5-8914-2ddff41a9250]
scan-mode = 0
bees-home = /var/lib/beeshome/off-disk-beeshome
# fast SSD, lots of threads
thread-factor = 4.0

[07c97603-554e-4dc9-93ee-2b88fedea738]
scan-mode=1
# paths relative to root
bees-home = .beeshome
# slow spinning rust, low IO latency
thread-factor = 0.5

then you don't have to worry about canonicalizing or comparing paths; you simply match UUIDs from the config file to the filesystem presented to bees. We already look up the UUID at startup (legacy code left over from shared BEESHOMEs).
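
If a wrapper ever needs to go the other way (mounted path to UUID), that's cheap too (sketch, assuming util-linux findmnt):

uuid=$(findmnt -n -o UUID --target /mnt/for/bees)
# then match "[$uuid]" against the section headers in the config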

Zygo commented on June 7, 2024

Something is going to start bees, and that thing is very likely not going to be bees. There are lots of environmental things to set up (namespaces, chroots, cgroups, watchdogs, rlimits, ionice, filesystem mounting, restarting on crashes, etc) for each instance. It should really be one bees process per filesystem, and some other program (systemd, a shell script wrapper, etc) is going to handle all that before bees starts.
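
For illustration only, a minimal systemd template unit could carry most of that per instance (unit name, paths, and limits are made up):

# bees@.service (hypothetical sketch)
[Unit]
Description=bees dedup agent on %i

[Service]
ExecStart=/usr/sbin/beesd %i
Restart=on-failure
IOSchedulingClass=idle
MemoryMax=2G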

Zygo commented on June 7, 2024

Assuming we get the root path first, we can also support a config file in beeshome, especially for things like exclude patterns which an admin might not want to expose on a separate (possibly unencrypted) /etc filesystem. (How do we know where beeshome is until we've read the config file? The global config file could tell us that, or we use the compiled-in default.)

I'm OK with deprecating the command-line options and environment variables once the config files are working. It gets rid of two levels of precedence.

Zygo commented on June 7, 2024

Side note: include/exclude path support might be expensive.

Early bees prototypes did a find-style tree-walk which made path exclusions really easy because we could check a filename against a pattern before opening it, and skip the whole file or subtree.

Current bees looks up partial filenames mid-way through processing, but it never knows the full path at any single point during an open--this is why name_fd looks outside bees in /proc for filenames to use in log messages. Current bees can still easily exclude entire subvols by simply pretending they don't exist, though the dedup logic isn't smart enough to cope with that restriction well (it will do a lot of IO and not free any space, or even consume extra space). Pattern-matches on partial filenames (e.g. "exclude files ending in .qcow2") could be done with the same caveat.

Future bees will get the subvols later in the scanning process, so we could possibly still exclude on those without too much penalty. Looking up filenames just to exclude them will likely take more amortized time than simply deduping everything (unless there are so many duplicate extents that we're always opening every file in the filesystem--or a few really huge files, which we can blacklist by inode number instead of file name). By the time the new dedup logic discovers an extent is referenced by an excluded filename, it may have already performed some copy or dedup operations on the extent (the new dedup logic is lazy and parallel).

kakra commented on June 7, 2024

Yes, I was thinking about having the UUID in the section name. I just haven't decided on the final syntax because I want to support a "beesd wrapper"-esque runtime path somewhere. But I have an idea which should work well; I'm still testing.

That way we could support both styles: manually mounted paths, and UUID-style paths automounted by the wrapper script.

kakra commented on June 7, 2024

@Zygo Okay, do I read you right that you want to deprecate running multiple FS in one bees instance?

Zygo commented on June 7, 2024

There's little compelling advantage to running multiple FS in one process. You can have multiple BeesContexts in one process, but that's mostly just a result of using C++ classes for state. If you want to stop one of the filesystems or add another one, you have to kill the bees process for all of them at once. Similarly, if one crashes they all go. They also share an FD namespace, memory arenas, and worker thread pools; we have to embed UUIDs into various places (e.g. crawl and status filenames), and the parts of bees that are sharable between contexts need context pointers embedded in them, making them somewhat larger in memory.

At one point I was trying to share a single hash table between multiple filesystems. The hash table entries would need to be tagged so that one BeesContext doesn't use another BeesContext's dedup hits, and each bit used for hash table tagging is one less bit you can use for hash table content. A shared hash table could use RAM more efficiently if there are many filesystems being deduped on the same machine.

It didn't work as well as I hoped--generally the larger or more active filesystems push out all the table entries for the smaller filesystems, so only one of the filesystems sharing the cache works well. It turned out to be better to undersize the hash tables a little and just keep them all separate.

Right now there is partial support for multiple filesystems in one process, but it's only there because it's harmless and it would be extra work to remove it.

kakra commented on June 7, 2024

Okay, then I'll design my config proposal further around that idea, and when in conflict, I'll ignore running multiple FS in one instance in favor of running one FS per instance.

Massimo-B commented on June 7, 2024

At one point I was trying to share a single hash table between multiple filesystems.

I would appreciate this, as I'm planning to run 3 separate bees instances for 3 btrfs filesystems. But since the beeshash.dat of each btrfs is preferably stored on that btrfs itself (it could be a mobile btrfs processed by different machines), it may be better to keep the hash tables separate, as Zygo said. For memory savings, especially when multiple btrfs are "beed", the discussion is #36, or the next topic:

Could the configuration get a global option to make bees --watch (or --daemonize) as it does right now, or alternatively terminate after "Crawl master ran out of data..."? This would make it easier to run bees(d) from a nightly cronjob alone, without requiring another cronjob or some other method to terminate it.

Then I would like a global option to process all sections in parallel (one bees process per section) or only one at a time. With --watch, that would mean starting bees for the first section, terminating it and starting the next section, and returning to the first at the end, i.e. periodically switching between the sections even if no writes have been detected. Without --watch, like a single run, it would just process all sections in sequence and terminate.

I also preferred [uuid] as the section name, which was my first idea, letting beesd do the default mounting under /run/bees...

kakra commented on June 7, 2024

I like the idea of having an option to let bees automatically quit once it thinks there is no further work to do. But this itself should not be part of this issue, otherwise we end up with an issue that never closes. I think it's better to open a new request for it, referencing this one with "once issue XX is implemented...".

Also, the parallel vs. sequential processing thing should really be a follow-up to that new issue, maybe just as part of its description.

OTOH, I don't see anything[*] that prevents implementing an auto-quit feature right now, even without configuration parsing integrated into the daemon itself. You should open a new issue for that request; I think there's already a similar request open. It totally does not depend on configuration files. The only part these requests would share is that configuration files would gain the ability to also set this cmdline option. But that is the future. Let's get one thing done first, and per your request that's the auto-quit feature - as a new issue. :-)

[*]: Except that Zygo currently has no code to cleanly shut down the daemon; it is just killed. You could watch the log output via script and send a kill once the line you're looking for shows up. tee and awk could be your friends here: tee duplicates stdin to a file and stdout (so you can still have a log file), and awk can execute commands once a pattern matches. Here's a quick Google lookup for you: https://unix.stackexchange.com/questions/12075/best-way-to-follow-a-log-and-execute-a-command-when-some-text-appears-in-the-log
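
An untested sketch of that approach (mount path and log location are illustrative):

# kill every process named bees once the crawler goes idle --
# fine on a single-instance host, too blunt otherwise
bees /mnt/for/bees 2>&1 | tee /var/log/bees.log | \
  awk '/Crawl master ran out of data/ { system("pkill -x bees"); exit }'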

Zygo commented on June 7, 2024

When I do time trials I do grep the logs for 'Crawl master ran out of data' and kill the bees process. On the other hand, when I do time trials I don't care that bees hasn't saved any data because the next thing that happens is the filesystem gets reformatted.

If you have a specific time window for bees to run, you can just run it for the entire time window and kill it at the end.

I haven't found the ability to share hash tables useful. On hosts with multiple filesystems I simply divide the available memory for bees among the processes, proportional to the filesystem sizes. This makes bees miss some duplicates, but the hit rate isn't linearly proportional to the hash table size, so I still get 90% of the duplicates (and 100% of duplicates that are written near the same time).

I should make a graph or something to show how that works.
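
In the meantime, back-of-the-envelope (illustrative numbers: a 2 GiB budget split across a 1 TiB and a 3 TiB filesystem):

total=2048                # MiB of RAM budgeted for all hash tables
sizes=(1024 3072)         # filesystem sizes in GiB
sum=0; for s in "${sizes[@]}"; do sum=$((sum + s)); done
for s in "${sizes[@]}"; do
  echo "$((total * s / sum)) MiB"     # prints "512 MiB" and "1536 MiB"
done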

Zygo commented on June 7, 2024

Aside from the non-saving of data when the bees process is killed, there's another problem with exiting at the "Crawl master ran out of data" stage: bees isn't really done on the first pass.

During the first pass bees creates new extents that are ideally sized to be replaced by existing larger extents, but it doesn't remove those existing extents at that time. On the next pass those new extents are scanned, and bees then dedupes them. This repeats until all the extents are either entirely unique or removed.

I don't think that matters very much if the intended use case is to run bees on several filesystems sequentially, since that would eventually run another bees pass and only the timing is changed.

It does matter if you're doing time trials, though, because bees isn't done at the end of each pass. It's only really done if a pass triggers no temporary extent copies.

Zygo commented on June 7, 2024

Another problem is that when there are multiple subvols, the subvol scans all cycle independently of each other. If there are a few large subvols and rapid snapshot cycles, it means that there's always some subvol that is not in the idle state.

Granted, that problem goes away if extent-tree (tree 2) scanning is implemented, since there is only one tree to scan (no concurrent start/stop issue) and after each scan of that tree, all of the extents are fully deduped (i.e. there is no need for the next scan to finish anything).

Massimo-B commented on June 7, 2024

Not sure if this was mentioned already. Having /etc/bees/usbdata.conf I would like to call 'beesd usbdata' instead of 'beesd <btrfs_uuid>'. The UUID is set in the config file itself.

kakra commented on June 7, 2024

Could make sense... We'll look into it. I'll keep it in my notes for the feature. But I'm still busy with some other projects at the moment.

Massimo-B commented on June 7, 2024

Another idea for config files and beesd: it could be interesting to have one beesd processing all UUIDs in parallel. For now I would need to call beesd <uuid> (or later beesd <label>) for every existing config file, but it could be interesting to call beesd <label1> <label2> <label3>. The only disadvantage would be having the stdout merged. But for logging it would be preferable anyway to have /var/log/beesd/<label>.log with some less verbose debug level.

kakra commented on June 7, 2024

@Massimo-B @Zygo
I don't think it's Zygo's intention to revert bees back to that behavior (multiple FS in one instance)...

Anyway, something similar can become possible later: bees reads the cmdline and then forks multiple children which each act as a single instance.

Massimo-B commented on June 7, 2024

I did not mean bees processing multiple FS, I meant beesd forking multiple bees, just like I do when I run multiple beesd. So this would be something very simple, like:

for bees_uuid in $(grep UUID= /etc/bees/*.conf | cut -d= -f2); do beesd "$bees_uuid" & done

Zygo commented on June 7, 2024

something similar can become possible later: bees reads cmdline and then forks multiple children which acts as a single instance each.

The shell script can do that now. Nothing stops you from having multiple bees instances. I'd prefer something to do that from outside of bees because everybody has their own process-launcher framework they want to plug bees into, instead of having the bees binary fork a bunch of children.

If you're really sneaky (and maybe just a tiny bit evil ;) you could run it as a systemd response to mount events...

kakra commented on June 7, 2024

Yes, but we should consider/decide one thing first: what happens if one child dies? As bees now acts as a single instance, should it simply restart what failed? Or should it also kill the other children and return with an exit code?

That for-loop idea above is missing at least a call to "wait" below it. But it should also trap signals to catch errors in children and act accordingly (which is yet to be defined).
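
An untested sketch with both pieces added (UUIDs reused from the example config above; the die-together policy is just one possible choice):

trap 'trap - TERM; kill 0' INT TERM    # take all children down with us
pids=()
for u in 3b9853cc-6695-4cb5-8914-2ddff41a9250 07c97603-554e-4dc9-93ee-2b88fedea738; do
  beesd "$u" & pids+=("$!")
done
for p in "${pids[@]}"; do
  wait "$p" || { echo "beesd (pid $p) died" >&2; kill 0; }
done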

Zygo commented on June 7, 2024

The for-loop could fork while-loops...

for x in a b c; do
  while ! [ -e /please/stop ]; do bees $x; done &
done
wait

If systemd puts the whole thing into its own cgroup, then you can just rely on the systemd kill-the-whole-cgroup behavior to stop everything when requested. If running on a system that has cgroups but not systemd you can still do it that way:

kill -9 $(cat /sys/fs/cgroup/bees/tasks)
for x in a b c; do
    sh -c 'echo $$ > /sys/fs/cgroup/bees/tasks; while :; do bees "$1"; done' -- $x &
done
wait

This seems to me to be clearly a policy decision, and shouldn't be burned into the binary without a compelling advantage for doing it that way (contrast with the loadavg stuff, where there was an advantage for managing load from within the process).

kakra commented on June 7, 2024

I'd still prefer if we just leave it up to the init system to do that, both systemd and openrc allow multi-instance services. Not sure how other distributions than Gentoo handle this. But it looks most clean to me that way. We should probably prevent any feature creep. Process management is already perfectly implemented outside the scope of bees.

Massimo-B commented on June 7, 2024

Not sure if this was mentioned already. Having /etc/bees/usbdata.conf I would like to call 'beesd usbdata' instead of 'beesd <btrfs_uuid>'. The UUID is set in the config file itself.

Hi, any new plans to implement something like this? For my separate configuration files

# ls /etc/bees/
localdata.conf  mobiledata.conf  root.conf

I would like to be able to call beesd root or beesd mobiledata. For now I always need to remember the UUID :)

Massimo-B commented on June 7, 2024

I have created a fork to support config file names: f341b59, but I did not figure out how to create a pull request. Do you think supporting both UUIDs and config names like this would be appropriate?

kakra commented on June 7, 2024

Without having looked at your branch: UUIDs are machine-specific and difficult to deploy, as the NixOS guys pointed out. We should try to find a different solution, or have the option to work with both, like fstab does (it supports labels, UUIDs and device nodes).

Massimo-B commented on June 7, 2024

If I got you right, this is not about the layer that specifies the device by UUID= or LABEL=, i.e. how the block device is specified in fstab or in the bees configuration.
My change is about how to tell beesd which setup to use. Currently this is done via UUID, followed by a grep for the configuration file containing that UUID. I added support for passing the configuration name itself, which you can name whatever you like, e.g. /etc/bees/kakra-main.conf.

kakra commented on June 7, 2024

Ah okay, sounds reasonable.

Massimo-B commented on June 7, 2024

I have created a fork to support config file names: f341b59, but I did not figure out how to create a pull request. Do you think supporting both UUIDs and config names like this would be appropriate?

Hi, I noticed I'm still using my old fork of the beesd script locally and lost the latest changes...
Is there any way today to use beesd with the config name, or is beesd still just called with a UUID?

Massimo-B commented on June 7, 2024

btw, I just tried the current beesd script, starting it like beesd ..: after cancelling with CTRL+C I always need to umount /run/bees/mnt/bab5... before the next run. Is that intended?

tlaurion commented on June 7, 2024

@Massimo-B I saw your fork, and the configuration file is not parsing a lot of options. Are you planning on submitting a PR upstream?

https://github.com/Zygo/bees/blob/28ee2ae1a88c811e2e5faae6b40ef63a48324a5d/docs/options.md

For some reason I thought configuration files were supported; I worked on a wrapper to do proper calculations and create configs, only to realize that the most important options are not parsed from the configuration...

@Zygo @Massimo-B is this still planned, or should I create a wrapper around the systemd unit to dynamically calculate all I need to fit the QubesOS packaging use case? https://github.com/tlaurion/qubes-bees
