
zfs-inplace-rebalancing's Introduction

zfs-inplace-rebalancing

Simple bash script to rebalance pool data between all mirrors when adding vdevs to a pool.

asciicast

How it works

This script recursively traverses all the files in a given directory. Each file is copied with a .balance suffix, retaining all file attributes. The original is then deleted and the copy is renamed back to the name of the original file. When copying a file ZFS will spread the data blocks across all vdevs, effectively distributing/rebalancing the data of the original file (more or less) evenly. This allows the pool data to be rebalanced without the need for a separate backup pool/drive.
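The copy, verify, delete, rename cycle described above can be sketched roughly like this (a minimal illustration using a throwaway temp file rather than a real pool path, and not the script itself; GNU coreutils `cp -a` and `md5sum` are assumed):

```shell
#!/usr/bin/env bash
# Rough sketch of one rebalance cycle: copy with a .balance suffix,
# verify the copy, then delete the original and rename the copy back.
# A temp file stands in for a real pool file.
set -euo pipefail

workdir=$(mktemp -d)
file="$workdir/example.bin"
head -c 1048576 /dev/urandom > "$file"   # stand-in for a pool file

cp -a "$file" "${file}.balance"          # copy, preserving attributes (GNU cp)
orig_sum=$(md5sum "$file" | cut -d ' ' -f 1)
copy_sum=$(md5sum "${file}.balance" | cut -d ' ' -f 1)

if [ "$orig_sum" = "$copy_sum" ]; then
    rm "$file"                           # original is removed only after verification
    mv "${file}.balance" "$file"         # the copy takes over the original name
else
    echo "checksum mismatch, keeping original" >&2
    exit 1
fi
```

On a real pool the freshly written copy is what lands on the new vdev layout; the temp-file version above only demonstrates the ordering of the steps.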

The way ZFS distributes writes is not trivial, which makes it hard to predict how effective the redistribution will be.

Note that this process is not entirely "in-place", since a file has to be fully copied before the original is deleted. The term is used to make it clear that no additional pool (and therefore hardware) is necessary to use this script. However, this also means that you have to have enough space to create a copy of the biggest file in your target directory for it to work.

At no point are both the original file and its copy gone at the same time: the original is deleted only after the copy has been fully written. To make sure file attributes, permissions and file content are maintained, all attributes and the file checksum are compared before removing the original file (if not disabled using --checksum false).

Since file attributes are fully retained, it is not possible to verify if an individual file has been rebalanced. However, this script keeps track of rebalanced files by maintaining a "database" file in its working directory called rebalance_db.txt (if not disabled using --passes 0). This file contains two lines of text for each processed file:

  • One line for the file path
  • and the next line for the current count of rebalance passes
/my/example/pool/file1.mkv
1
/my/example/pool/file2.mkv
1
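Because the database uses this two-lines-per-file layout, a pass count can be looked up with a fixed-string, whole-line grep for the path followed by the line after it (a sketch using the example entries above; the script itself may do this differently):

```shell
# Look up the rebalance pass count for one file in rebalance_db.txt:
# match the path line literally (-F) and exactly (-x), include the
# following line (-A 1), and keep only that last line.
db=$(mktemp)
printf '%s\n' '/my/example/pool/file1.mkv' '1' '/my/example/pool/file2.mkv' '1' > "$db"

passes=$(grep -A 1 -Fx '/my/example/pool/file1.mkv' "$db" | tail -n 1)
echo "$passes"   # → 1
```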

Prerequisites

Balance Status

To check the current balance of a pool use:

> zpool list -v

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bpool                                            1.88G   113M  1.76G        -         -     2%     5%  1.00x    ONLINE  -
  mirror                                         1.88G   113M  1.76G        -         -     2%  5.88%      -    ONLINE  
    ata-Samsung_SSD_860_EVO_500GB_J0NBL-part2        -      -      -        -         -      -      -      -    ONLINE  
    ata-Samsung_SSD_860_EVO_500GB_S4XB-part2         -      -      -        -         -      -      -      -    ONLINE  
rpool                                             460G  3.66G   456G        -         -     0%     0%  1.00x    ONLINE  -
  mirror                                          460G  3.66G   456G        -         -     0%  0.79%      -    ONLINE  
    ata-Samsung_SSD_860_EVO_500GB_S4BB-part3         -      -      -        -         -      -      -      -    ONLINE  
    ata-Samsung_SSD_860_EVO_500GB_S4XB-part3         -      -      -        -         -      -      -      -    ONLINE  
vol1                                             9.06T  3.77T  5.29T        -         -    13%    41%  1.00x    ONLINE  -
  mirror                                         3.62T  1.93T  1.70T        -         -    25%  53.1%      -    ONLINE  
    ata-WDC_WD40EFRX-68N32N0_WD-WCC                  -      -      -        -         -      -      -      -    ONLINE  
    ata-ST4000VN008-2DR166_ZM4-part2                 -      -      -        -         -      -      -      -    ONLINE  
  mirror                                         3.62T  1.84T  1.78T        -         -     8%  50.9%      -    ONLINE  
    ata-ST4000VN008-2DR166_ZM4-part2                 -      -      -        -         -      -      -      -    ONLINE  
    ata-WDC_WD40EFRX-68N32N0_WD-WCC-part2            -      -      -        -         -      -      -      -    ONLINE  
  mirror                                         1.81T   484K  1.81T        -         -     0%  0.00%      -    ONLINE  
    ata-WDC_WD20EARX-00PASB0_WD-WMA-part2            -      -      -        -         -      -      -      -    ONLINE  
    ata-ST2000DM001-1CH164_Z1E-part2                 -      -      -        -         -      -      -      -    ONLINE  

and have a look at the difference in the CAP value (the ALLOC to SIZE ratio) between vdevs.
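To make that comparison easier to eyeball, the top-level vdev rows and their CAP column (field 8 in the layout above) can be filtered out with a small awk helper (an editor's sketch, not part of the script):

```shell
# Print only the top-level vdev rows of `zpool list -v` output together
# with their CAP column, so per-vdev imbalance stands out.
vdev_cap() {
    awk 'NR > 1 && $1 ~ /^(mirror|raidz|draid)/ { printf "%-10s %s\n", $1, $8 }'
}
```

For example, `zpool list -v vol1 | vdev_cap` would print one line per mirror with its capacity percentage.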

No Deduplication

Due to the working principle of this script, which essentially creates a duplicate file on purpose, deduplication will most definitely prevent it from working as intended. If you use deduplication you probably have to resort to a more expensive rebalancing method that involves additional drives.

Data selection (cold data)

Due to the working principle of this script, it is crucial that you only run it on data that is not actively accessed, since the original file will be deleted.

Snapshots

If you do a snapshot of the data you want to balance before starting the rebalancing script, keep in mind that ZFS now has to keep track of all of the data in the target directory twice. Once in the snapshot you made, and once for the new copy. This means that you will effectively use double the file size of all files within the target directory. Therefore it is a good idea to process the pool data in batches and remove old snapshots along the way, since you probably will be hitting the capacity limits of your pool at some point during the rebalancing process.

Installation

Since this is a simple bash script, there is no package. Simply download the script and make it executable:

curl -O https://raw.githubusercontent.com/markusressel/zfs-inplace-rebalancing/master/zfs-inplace-rebalancing.sh
chmod +x ./zfs-inplace-rebalancing.sh

Dependencies:

  • perl - it should be available on most systems by default

Usage

ALWAYS HAVE A BACKUP OF YOUR DATA!

You can print a help message by running the script without any parameters:

./zfs-inplace-rebalancing.sh

Parameters

Name              Description                                                     Default
-c, --checksum    Whether to compare attributes and content of the copied file
                  using an MD5 checksum. Technically this is a redundant check
                  and consumes a lot of resources, so think twice.                true
-p, --passes      The maximum number of rebalance passes per file. Setting this
                  to infinity by using a value <= 0 might improve performance
                  when rebalancing a lot of small files.                          1
--skip-hardlinks  Skip rebalancing hardlinked files, since rebalancing them
                  would only create duplicate data.                               false

Example

Make sure to run this script with a user that has rw permission to all of the files in the target directory. The easiest way to achieve this is by running the script as root.

sudo su
./zfs-inplace-rebalancing.sh --checksum true --passes 1 /pool/path/to/rebalance

To keep track of the balancing progress, you can open another terminal and run:

watch zpool list -v

Log to File

To write the output to a file, simply redirect stdout and stderr to a file (or separate files). Since this redirects all output, you will have to follow the contents of the log files to get realtime info:

# one shell window:
tail -F ./stdout.log
# another shell window:
./zfs-inplace-rebalancing.sh /pool/path/to/rebalance >> ./stdout.log 2>> ./stderr.log

Things to consider

Although this script does have progress output (files as well as percentage), it might be a good idea to try a small subfolder first, or to process your pool folder layout in manually selected batches. This can also limit the damage done if anything bad happens.

When aborting the script midway through, be sure to check the last lines of its output. When cancelling before or during the renaming step, a ".balance" file might be left behind, and you have to rename (or delete) it manually.

Although the --passes parameter can be used to limit the maximum number of rebalance passes per file, it is only meant to speed up aborted runs. Individual files will not be processed multiple times automatically. To reach multiple passes you have to run the script on the same target directory multiple times.

Dockerfile

To increase portability, this script can also be run using docker:

sudo docker run --rm -it -v /your/data:/data ghcr.io/markusressel/zfs-inplace-rebalancing:latest ./data

Contributing

GitHub is for social coding: if you want to write code, I encourage contributions through pull requests from forks of this repository. Create GitHub tickets for bugs and new features and comment on the ones that you are interested in.

Attributions

This script was inspired by zfs-balancer.

Disclaimer

This software is provided "as is" and "as available", without any warranty.
ALWAYS HAVE A BACKUP OF YOUR DATA!

zfs-inplace-rebalancing's People

Contributors

brandishwar, dependabot[bot], johnpyp, karlhenselin, kiler129, markusressel, miladiir


zfs-inplace-rebalancing's Issues

not working on xigmanas (FreeBsd)

lsattr and perl are not installed on the XigmaNAS platform.

lsattr can be installed on xigmanas (login via ssh) by:
pkg install e2fsprogs

perl is only used in the script to calculate the percentage.

just replace line 72:
progress_percent=$(perl -e "printf('%0.2f', ${current_index}*100/${file_count})")
with
progress_percent=$((${current_index}*100/${file_count}))

Filename exceeded maximum length

When creating a copy of certain files, if the maximum length of the new filename (with .balance appended to it) exceeds the system's maximum defined filename length, the rebalancing operation fails. Would it be possible to modify the script to skip files like these?

line 60: bc: command not found

Hey there,

Trying to run this on my TrueNAS Scale system. When doing so I just get "./zfs-inplace-rebalancing.sh: line 60: bc: command not found"

Seems to be related specifically to the "progress_percent=$(echo "scale=2; ${current_index}*100/${file_count}" | bc)" line in the code for the script.

Any ideas? Thanks!
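A possible workaround (an editor's sketch, not from this thread) is to compute the percentage with awk, which ships in the TrueNAS SCALE base system, instead of bc:

```shell
# Same two-decimal percentage as the bc pipeline, but computed with awk.
current_index=375   # example values
file_count=1000
progress_percent=$(awk -v i="$current_index" -v n="$file_count" \
    'BEGIN { printf "%0.2f", i * 100 / n }')
echo "$progress_percent"   # → 37.50
```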

Automatic multi-pass

Keep track of the lowest pass count in the db and run the script in a loop internally until the target count is reached.

Filenames with square brackets are not checked for skipping

In the case that a filename is something like /pool/projects/[2019] cool project/file.txt, it will constantly be rebalanced, even if a prior run had balanced it.
This rebalancing will often be done silently, but depending on the filename, it will also print an error from grep relating to an invalid regular expression.
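The pitfall can be reproduced outside the script (an editor's sketch): grep parses `[2019]` as a character class, so a regex-based lookup of the db line never matches, while a fixed-string (`-F`), whole-line (`-x`) match compares literally:

```shell
# Demonstrate why bracketed filenames break a regex-based db lookup.
db=$(mktemp)
path='/pool/projects/[2019] cool project/file.txt'
printf '%s\n1\n' "$path" > "$db"

regex_hits=$(grep -c "^$path$" "$db" || true)   # 0 — brackets act as a character class
fixed_hits=$(grep -cFx "$path" "$db")           # 1 — literal whole-line comparison
```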

Ignore missing files

Files might be deleted while the script runs. Since this should not affect the rebalancing, it should be checked to avoid an error. Note that the db item has to be removed in that case or it might lead to an infinite loop when #2 is implemented.

Handle Hardlinks

Hi all,

I would like to discuss, and expand my knowledge, about potentially enhancing the script to handle hardlinks, rather than simply ignoring them.

In #22 johnpyp mentions "(data) can't be trivially un-hardlinked after without knowledge of which path is in the "balance target" vdev."
I don't quite understand this, could you please expand? Let's say we don't know the path of the balance-target vdev and pick a path at random; in the end, would it not average out?

Let's say I have 2 vdevs, and they are 60% populated with files that each have one or more hardlinks, balanced 50-50 in data usage.
I now add a new empty equal size vdev to the pool and wish to rebalance.

We look at file_1; it's got two hardlinks (50%-50%-0).
Could I copy file_1 --> file_1_tmp
Delete the two hardlinks & file_1
Create two new hardlinks to file_1_tmp
Rename file_1_tmp

What does the end data result look like? (15%-15%-70% or so?)

I understand I could be quite naive/crude in this approach, but I wish to understand and hope to resolve this, as I'm sure a lot of people with similar *arr media setups would appreciate such a feature.

Thanks,
NickyD

Help is no help

root@test-box-not-quiz-box:~# ./zfs-inplace-rebalancing.sh
Usage: zfs-inplace-rebalancing -checksum true -passes 1 /my/pool

See anything strange there?

It appears that the usage message shows one dash as part of the parameter requirements, not two. That is to say, "-checksum" instead of "--checksum" (dash vs dash dash).

I would very respectfully request that this issue be remediated.

Stuart

[Question] would this work even without any zfs disk remodeling?

Would this script work even without any disk remodeling?

I started my Raidz1 with a recordsize of 128k and my files are all huge (11GB up to 1.5TB) I changed it recently to recordsize=1M and the new files read 20x faster than old ones, so I wanted to rewrite the whole 65TB worth of data.

Thanks for this.

Running in TrueNAS Scale: bc missing

./zfs-inplace-rebalancing.sh: line 60: bc: command not found

My guess it is something just missing in the base install of TrueNAS Scale.
What are your thoughts running the script in a Docker container that has the missing component?

Doesn't work on folder that has windows-written files in it

I'm using this on truenas Scale (if that matters), trying to rebalance a directory (that I changed checksuming on) and get the following error:

 cp -adxp foo.tif foo.tmp
cp: preserving permissions for ‘foo.tmp’: Operation not permitted

-a and -p don't seem to work, my guess possibly because of windows permissions?

-dx work but this creates new timestamps. The weird thing is, even with the "operation not permitted" message, the file is created with the same timestamp.

-rwxrwxr-x  1 evan root 122878676 Jul 16  2017  foo.tmp
-rwxrwxr-x  1 evan root 122878676 Jul 16  2017  foo.tif.balance
-rwxrwxr-x  1 evan root 122878676 Jul 16  2017  foo.tif

Syntax Error with TrueNAS 13.0U6

If I try to execute this script in a standard GUI root shell, it fails with the following error:

/root/zfs-inplace-rebalancing.sh: 28: Syntax error: "(" unexpected

Is this expected? What can I do to mitigate the issue? Other .sh scripts here execute without issues. This version of TrueNAS runs on FreeBSD 13.1-RELEASE-p9.

Pointer to resume script if SSH connection times out

I've never gotten the job to complete because if my SSH connection times out in putty or my PC sleeps, when I resume, the script starts from the beginning of the file list, even though it doesn't reprocess the files it's already rebalanced.

I'd like to request an enhancement, which is an option to write a flag for each file that's moved and then have the script resume from that point, rather than starting from the beginning. I realize that doing so would break rebalancing for multiple passes, which is why I suggest it be an option to enable.

md5 claims failure, seems to not handle fsacls?

Start rebalancing Mon Jul 1 01:48:22 PM MDT 2024:
Copying 'blah' to 'blah.balance'...
Comparing copy against original...
MD5 FAILED: ---------------------- -rw-rwxr--+ user media 5f7973ff9e4152827994d4149d8af39d != ---------------------- -rw-rwxr--+ user user 5f7973ff9e4152827994d4149d8af39d

Tried on a few datasets; they all error immediately. Just wondering if it's because of fsacls, as the MD5s do match.

Also feature request to clean up .balance files on exit.

Thank you

Would rewriting blocks work too?

Reading through some of the information here on this project, I began to contemplate whether there might be any utility in doing something like using dd to read and rewrite each individual block in a file, to accomplish the same mission intended by your code.

If so, this would mean that instead of needing enough space to hold the largest file in your directory, you would just need one free block worth of space. Moreover, one could set a number of blocks (if this idea would work), like read 10 blocks write 10 blocks, as it walked through the entire file.

Stuart
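For what it's worth, the idea can be sketched with dd on an ordinary file (an editor's sketch of the suggestion, not part of the script; GNU `stat -c` and `dd status=none` are assumed). On a copy-on-write filesystem like ZFS each overwritten chunk is allocated fresh, which is what would redistribute the data, but note that this forfeits the copy-then-verify safety of the script: a crash mid-write can corrupt the file.

```shell
#!/usr/bin/env bash
# Rewrite a file in place, one fixed-size chunk at a time, so only one
# chunk of scratch space is ever needed. A temp file stands in for a
# real pool file; there is no crash safety here.
set -euo pipefail

file=$(mktemp)
head -c $((3 * 1024 * 1024)) /dev/urandom > "$file"

bs=$((1024 * 1024))                      # 1 MiB chunks
size=$(stat -c %s "$file")               # GNU stat
chunks=$(( (size + bs - 1) / bs ))

before=$(md5sum "$file" | cut -d ' ' -f 1)
for ((i = 0; i < chunks; i++)); do
    # read chunk i, then write it back to the same offset without truncating
    dd if="$file" of="$file" bs="$bs" skip="$i" seek="$i" count=1 \
       conv=notrunc status=none
done
after=$(md5sum "$file" | cut -d ' ' -f 1)
```

The content is unchanged after the loop; whether ZFS actually spreads the rewritten blocks as well as a full copy does would need testing on a real pool.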

Stop after given number of files processed (or better, bytes processed)

TL;DR: What I need is to be able to tell the script to rebalance a directory, but stop after X number of files or, better yet, stop after Y number of bytes. Another command line switch, perhaps?

Read on, for my specific situation that prompted this request...

I just doubled the size of my main pool by adding a second, same-sized vdev to it. As part of the process, the first vdev ended up at a little more than 94% utilization. Not ideal, but hoped to be temporary. So, after adding the new vdev, I was sitting at 94% usage on the first vdev and 0% on the second.

After using du a few times, to look for good candidates for the script to rebalance, I'm currently sitting at 76% / 20% balance. That's certainly good enough. I'm not looking for 47% / 47% balance, here. I know this whole rebalancing process is a bit of a feel good measure, but 94% / 0% just feels icky. What I'm looking for is more like 70% / 24%.

So now I'm facing a dilemma. The only directory I have left that's big enough is my "Movies" directory, at 13 TB. That directory is filled with movies, each in a subdirectory, one level down. I must either rebalance the entire Movies directory (13 TB is unnecessarily much) or find some other large directory to rebalance (there isn't one).

I need what the TL;DR says.

permission issue causing MD5 FAILED

It often stops (not always though) with this issue

Comparing copy against original...
MD5 FAILED:
---------------------- -rw-r--r--+ naser wheel 27ca390baf807f9728af92442217769b
!=
---------------------- -rwxr--r--+ naser wheel 27ca390baf807f9728af92442217769b

It seems to me that the MD5s match; only the x permission is different.
I sometimes have issues with permissions (umask 022).

Hardlinked files... How are they handled?

Hi, Mark

I have a simple question:
I have lots of hardlinked files in ZFS, they have been deduped with Czkawka.

How are they handled? Here's my use case:
/mnt/Zpool/downloads/media/some.ext
/mnt/Zpool/Media Library/Movies/Movie Name (xxxx)/correctname.ext (hardlink to some.ext)

Because I recently expanded my pool from 2x (12x 1.2TB SAS RAID-Z2) to 3x (12x 1.2TB SAS RAID-Z2), my storage is 60+% on the first 2 vdevs (90+% full each) and the last vdev is sitting empty...
