burntsushi / walkdir

Rust library for walking directories recursively.

License: The Unlicense

Rust 99.30% C 0.45% Python 0.25%

walkdir's Introduction

walkdir

A cross-platform Rust library for efficiently walking a directory recursively. It comes with support for following symbolic links, controlling the number of open file descriptors, and efficiently pruning entries in the directory tree.

Dual-licensed under MIT or the UNLICENSE.

Documentation

docs.rs/walkdir

Usage

To use this crate, add walkdir as a dependency to your project's Cargo.toml:

[dependencies]
walkdir = "2"

Example

The following code recursively iterates over the directory given and prints the path for each entry:

use walkdir::WalkDir;

for entry in WalkDir::new("foo") {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Or, if you'd like to iterate over all entries and ignore any errors that may arise, use filter_map. (For example, the code below will silently skip directories that the owner of the running process does not have permission to access.)

use walkdir::WalkDir;

for entry in WalkDir::new("foo").into_iter().filter_map(|e| e.ok()) {
    println!("{}", entry.path().display());
}

Example: follow symbolic links

The same code as above, except follow_links is enabled:

use walkdir::WalkDir;

for entry in WalkDir::new("foo").follow_links(true) {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Example: skip hidden files and directories efficiently on unix

This uses the filter_entry iterator adapter to skip hidden files and directories efficiently, without descending into hidden directories:

use walkdir::{DirEntry, WalkDir};

fn is_hidden(entry: &DirEntry) -> bool {
    entry.file_name()
         .to_str()
         .map(|s| s.starts_with("."))
         .unwrap_or(false)
}

let walker = WalkDir::new("foo").into_iter();
for entry in walker.filter_entry(|e| !is_hidden(e)) {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Minimum Rust version policy

This crate's minimum supported rustc version is 1.34.0.

The current policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if crate 1.0 requires Rust 1.20.0, then crate 1.0.z for all values of z will also require Rust 1.20.0 or newer. However, crate 1.y for y > 0 may require a newer minimum version of Rust.

In general, this crate will be conservative with respect to the minimum supported version of Rust.

Performance

The short story is that performance is comparable with find and glibc's nftw on both a warm and cold file cache. In fact, I cannot observe any performance difference after running find /, walkdir / and nftw / on my local file system (SSD, ~3 million entries). More precisely, I am reasonably confident that this crate makes as few system calls and close to as few allocations as possible.

I haven't recorded any benchmarks, but here are some things you can try with a local checkout of walkdir:

# The directory you want to recursively walk:
DIR=$HOME

# If you want to observe perf on a cold file cache, run this before *each*
# command:
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

# To warm the caches
find $DIR

# Test speed of `find` on warm cache:
time find $DIR

# Compile and test speed of `walkdir` crate:
cargo build --release --example walkdir
time ./target/release/examples/walkdir $DIR

# Compile and test speed of glibc's `nftw`:
gcc -O3 -o nftw ./compare/nftw.c
time ./nftw $DIR

# For shits and giggles, test speed of Python's (2 or 3) os.walk:
time python ./compare/walk.py $DIR

On my system, the performance of walkdir, find and nftw is comparable.

walkdir's Issues

Remove re-export of is_same_file

The is_same_file function doesn't need to be re-exported in walkdir anymore now that there's a dedicated same_file crate. The plan is:

Deprecate the is_same_file re-export for a non-breaking release
Remove the is_same_file re-export for the next breaking release

Recursing over multiple directories

I think it would be useful if WalkDir could take multiple Paths and build a single iterator containing every file and folder once.

E.g.

a
 \a2
b
 \b2
  \b3

let walker = WalkDir::new("a");
walker.add("b3");
walker.add("b2");
let iter = walker.into_iter();
// iter["a", "a2", "b2", "b3"]

Add example for contents_first

Add an example that demonstrates the contents_first method. Maybe this can just be another example in the crate root docs. With a comment like:

// When contents_first is true outputs:
// `dir/first_file`
// `dir/last_file`
// `dir`

Does anyone have any thoughts on how the docs could illustrate the results of certain filters better?

Add Error docs to methods that return Result

Relevant API Guideline

Add an Errors section to methods that return Results:

Iter (for the Iterator::next() method)
IterFilterEntry (for the Iterator::next() method)
DirEntry::metadata()

How would I emulate the following find command?

I'm trying to find all git repos in a directory.

find . -name .git -type d -prune

It searches for directories named .git and stops recursing once one has been found.

So I tried something like this with filter_entry:

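use std::fs;

use walkdir::{DirEntry, WalkDir};

fn is_git_repo(entry: &DirEntry) -> bool {
    let path = entry.path();
    // TODO: Handle errors
    let files = fs::read_dir(path).unwrap();
    files
        .map(|r| r.unwrap())
        // Comparing against Some(..) avoids a panic on non-UTF-8 file names.
        .any(|s| s.file_name().to_str() == Some(".git"))
}

fn main() -> Result<(), walkdir::Error> {
    // `p` was not defined in the original snippet; the current directory
    // is assumed here as a placeholder.
    let p = ".";
    let walker = WalkDir::new(p).into_iter();

    for entry in walker.filter_entry(|e| {
        // Only give directories
        e.file_type().is_dir()
    }) {
        let entry = entry?;

        if is_git_repo(&entry) {
            let path = entry.path();
            println!("{}", path.display());
        }
    }
    Ok(())
}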

It works, but it's not efficient.
I want to stop recursing after a git repo is found, but I can't seem to code that logic in filter_entry.

Support more metadata for sort_by

Possibly makes #44 irrelevant.

Is it possible to support more file metadata in the sort_by function? If the file metadata is already around by the time sort_by is run this could be worthwhile, otherwise it might not be practical.

Add links to other walkdir items in WalkDir docs

Relevant API Guideline

Add reference links in the WalkDir docs prose when mentioning other walkdir items:

WalkDir::filter_entry in WalkDir
WalkDir::min_depth in WalkDir

Endless loop when ReadDir returns an error

The ReadDir documentation states that it may return an Err if there's some sort of intermittent IO error during iteration.
However, it seems that there are some cases in which the error will be persistent, for instance, some directories in a /proc entry of a zombie process on Linux.

The entry won't be popped in this case, resulting in an endless loop:

$ cargo run --example walkdir /proc/2226/
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/examples/walkdir /proc/2226/`
/proc/2226/
/proc/2226/task
/proc/2226/task/2226
/proc/2226/task/2226/fd
ERROR: IO error for operation on /proc/2226/task/2226/fd: Permission denied (os error 13)
/proc/2226/task/2226/fdinfo
ERROR: IO error for operation on /proc/2226/task/2226/fdinfo: Permission denied (os error 13)
/proc/2226/task/2226/ns
ERROR: IO error for operation on /proc/2226/task/2226/ns: Permission denied (os error 13)
/proc/2226/task/2226/net
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
...

Change OsString args in sort_by to OsStr

The WalkDir::sort_by method should take OsStr arguments instead of OsStrings.

Update Minimal Rust Version Support

We've got some PRs that are stumbling over the minimum Rust version being 1.10:

#52 for ? (1.13)
#59 for some Debug impls in std (not sure what version they came in yet)

Is it worth bumping the minimum version to 1.13? Even if that doesn't give us the Debug impls for std::fs::ReadDir it might be worthwhile for ?.

cc: @budziq @tmccombs @BurntSushi

Use `?` in docs instead of unwrapping

Relevant API Guideline

Instead of calling unwrap in the doc examples, we should use the ? operator:

examples in crate root
examples on WalkDir type
examples on WalkDirIterator.skip_current_dir
examples on WalkDirIterator.filter_entry

"symbolic_link" vs "symlink"

To follow the precedent set by stdlib, DirEntry::path_to_symbolic_link needs to be renamed to DirEntry::path_to_symlink.

Join efforts with Walker

I'm the author / publisher of the Walker crate. Its signature is the same as the recursive directory walker that was in the pre-1.0 stdlib. A brief search of the crates index suggests it is currently the only such library usable in stable.

Unless you envision walkdir as a very different library, how about we join these two up, and provide a seamless upgrade for all who are already using Walker into a better algorithm, more efficient implementation, etc?

Move DirEntry::ino method to an extension trait

The unix only DirEntry::ino method should be moved to a platform-specific extension trait, like DirEntryExt in the standard library. So essentially we need to:

Add a unix module with a trait called DirEntryExt
Move the ino method to DirEntryExt
Implement DirEntryExt for DirEntry

Add links to other walkdir items in DirEntry docs

Relevant API Guideline

Add reference links in the DirEntry docs prose when mentioning other walkdir items:

filter_entry in root skip hidden files and directories efficiently on unix example
path, file_name and follow_links in DirEntry
WalkDir::new in DirEntry::path
follow_links in DirEntry::path_is_symbolic
follow_links in DirEntry::metadata
follow_links in DirEntry::file_type

Implement Debug for WalkDir, Iter and IterFilterEntry

Relevant API Guideline

Implement the Debug trait for:

WalkDir
Iter
IterFilterEntry

It may not be appropriate to derive Debug, but at least it should be non-empty.

Document why unwraps won't fail

We should add some inline comments to the few calls to unwrap so it's clear to readers why it's safe to call unwrap in those places. All unwraps:

<Iter as Iterator>::next
Iter::get_deferred_dir
Iter::push

We might also want to change these to expects with a simple message that the panic is a bug in walkdir.

`DirEntry`'s `FileType` can have different equality properties depending on code path

I'm trying to create a form of zipper walkdir iterators in my revision to dir-diff. In the original code path, we just did equality checks on the FileType and everything worked. When I manually look up the FileType, my tests fail.

---- easy_good stdout ----
FileType(FileType { mode: 32768 }) != FileType(FileType { mode: 33206 })
true != true
false != false
false != false

32768 -> 0x8000
33206 -> 0x81B6
(seems like Debug for FileType should output hex)

It looks like the extra information is S_IMODE

>>> hex(stat.S_IMODE(a.st_mode))
'0x1b6'

I've narrowed it down to read_dir vs metadata.

The reason I bring this up here is that DirEntry seems to have a way to populate its ty from either fs::DirEntry or metadata, I'm just not hitting that case in my code. This could cause some surprising behavior for walkdir users.

Add option to iterate in a fixed order

Currently the sort order depends on the filesystem's readdir() ordering. It would be useful for reproducible builds if we could fix the order.

Add an option to stay on the same filesystem

This could be implemented with the dev() function of Metadata on Unix and dwVolumeSerialNumber on Windows.
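The Unix half of that idea can be sketched with the standard library alone. The helper `same_device` is hypothetical and Unix-only; a walker could compare each directory against the root this way before descending.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Returns true when both paths live on the same device (Unix only).
fn same_device(a: &Path, b: &Path) -> io::Result<bool> {
    // `dev()` is the ID of the device containing the file.
    Ok(fs::metadata(a)?.dev() == fs::metadata(b)?.dev())
}
```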

Add html_root_url attribute

Relevant API Guideline

Add an html_root_url attribute to the crate root that points to https://docs.rs/walkdir/$version.

Option for breadth-first search?

In typical directory hierarchies, breadth-first search is less efficient and has less predictable memory usage, but it could be useful for streaming searches in huge directories.

Example: a fuzzy directory finder.

approaching 1.0

walkdir has seen quite a bit of use but little iteration on the API. There are a few open PRs/issues mostly related to additional functionality, but no significant design changes. Because of that, I'd like to move walkdir to 1.0 soon. I'm thinking within the next few weeks unless there's a good reason not to.

Rename IterFilterEntry to FilterEntry

Relevant API Guideline

The IterFilterEntry type is produced by the filter_entry method, so it should be called FilterEntry instead of IterFilterEntry. As an example, see collections::btree_map::Keys.

Option to process parent before/after contents

According to the docs:

Results are returned in depth first fashion, with directories yielded before their contents

Would it be possible to have an option to yield directories after their contents? That would be useful, e.g., for recursively deleting a directory.

"File name too long" error at 4096 bytes

WalkDir cannot handle long paths that find handles fine.

extern crate walkdir;
use std::fs::create_dir;
use std::env::{current_dir, set_current_dir};

fn main() {
    let dir = current_dir().unwrap();
    let name = "a";
    for i in 0..2200 {
        if i % 100 == 0 {
            println!("Create dir at level {}.", i);
        }
        current_dir().unwrap(); // this line shows that rust can handle it
        create_dir(name).unwrap();
        set_current_dir(name).unwrap();
    }

    for r in walkdir::WalkDir::new(dir) {
        let entry = r.unwrap(); // this gives an error for long paths
        let len = entry.path().to_string_lossy().len();
        if len > 4090 {
            println!("{}", len);
        }
    }
}

...
Create dir at level 2100.
4091
4093
4095
4097
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { depth: 2042, inner: Io { path: Some("/home/walkdir/a/a/.../a/a/a"), err: Error { repr: Os { code: 36, message: "File name too long" } } } }', src/libcore/result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Make skip_current_dir and filter_entry inherent methods

The WalkDirIterator trait isn't pulling its weight, so we should make them inherent methods and remove the trait. The plan is:

Make the skip_current_dir and filter_entry methods inherent methods on Iter and IterFilterEntry
Deprecate the WalkDirIterator trait for a non-breaking release
Remove the WalkDirIterator trait for the next breaking release

explore traversal that yields directory entry twice

This is about issue #18 and PR #19 . Aside: the more I dug in, the more I learned, and I'm grateful to all involved for the work on walkdir.

@mcharsley, I played around with contents_first, and for the most part, liked it.

However, it doesn't seem to work with filter_entry() or skip_current_dir(): the yielded entries appear to skip the remainder of the directory that contains the skipped directory, rather than the skipped directory itself.

Additionally, I was looking for an option which also included the initial directory.

I see @michaelsproul proposed something which yields both edges, and I played around with a modified version of that and seemed to be able to do what I was looking for.

I also played around with writing an iterator adapter, which also works okay but didn't feel quite right, as it didn't work to implement WalkDir, so skip_current_dir was outside of that.

I'm torn between a simpler walkdir plus another crate for an adapter controlling when and how often a directory is listed, and an integrated version, making use of the stacks that are already there. I could see either way, either rolling back the contents_first commit or adding to it to list the directory twice and have it working with filtering.

Update build configurations

On Travis, the failure appears to come from an old version of Rust, and the Windows build tries to build something for an hour.

Rename Iter to IntoIter

Relevant API Guideline

The Iter type is produced by the IntoIterator trait, so it should be called IntoIter instead of Iter. As an example, see vec::IntoIter.

Add links to other walkdir items in Iter and IterFilterEntry docs

Relevant API Guideline

Add reference links in the Iter and IterFilterEntry docs prose when mentioning other walkdir items:

WalkDir in Iter
WalkDir::min_depth in IterFilterEntry
WalkDir::max_depth in IterFilterEntry

Make WalkDir Send + Sync

Relevant API Guideline

Add Send + Sync bounds to the sort_by function box, so that WalkDir can be Send + Sync. To make sure it doesn't regress, we should also add an assert_send and assert_sync macro (see the API guideline for inspiration) and test WalkDir, Iter and IterFilterEntry.

Implement Clone for WalkDir

Relevant API guideline

From the discussion: The traits to implement for WalkDir are:

Clone

Raised by @epage on reddit. The walkdir type should implement some common traits. They might not all make sense, but I think:

Debug

Clone

Eq was also mentioned. I'm not sure what the use-case for this one is though, maybe @epage can elaborate more.

Any other thoughts on this one?

Add links to other walkdir items in WalkDirIterator docs

Relevant API Guideline

Add reference links in the WalkDirIterator docs prose when mentioning other walkdir items:

WalkDirIterator::filter_entry in WalkDirIterator::skip_current_dir
WalkDirIterator::skip_current_dir in WalkDirIterator::filter_entry
WalkDir::min_depth in WalkDirIterator::filter_entry
WalkDir::max_depth in WalkDirIterator::filter_entry

Tracking issue for libz blitz evaluation of walkdir

This is the tracking issue for the evaluation performed by the libs team this week.

add example for extracting the underlying I/O error

The Error type that walkdir exposes is a light wrapper around a std::io::Error. The purpose of the custom error type is to provide better error messages by default (for example, by including the file path name that caused a specific error). However, it is sometimes desirable to access the underlying I/O error. While there is no concrete method for accessing the underlying I/O error, one can convert a walkdir::Error into a std::io::Error using io::Error::from(err) where err is a walkdir::Error. This is because walkdir provides an impl From<walkdir::Error> for std::io::Error.

Once you have a std::io::Error, you could then extract its ErrorKind and do case analysis there.

It would be nice to have an example for this. We might also consider adding an io_error method that returns a &std::io::Error.
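The case-analysis step could be sketched like this, using only the standard library. `classify` is a hypothetical helper; its input would come from `io::Error::from(err)` as described above, and the set of kinds handled is just an illustration.

```rust
use std::io;

/// Maps an I/O error to a short description based on its `ErrorKind`.
fn classify(err: &io::Error) -> &'static str {
    match err.kind() {
        io::ErrorKind::NotFound => "not found",
        io::ErrorKind::PermissionDenied => "permission denied",
        // Everything else is lumped together in this sketch.
        _ => "other I/O error",
    }
}
```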

readdir() performance on large directories

See this article: http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

ls and practically every other method of listing a directory (including python os.listdir, find .) rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (i.e. 500M of directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() syscall directly, rather than helper methods from libc.

I'm just mentioning this since you have a comparison to find, and using a buffer larger than 32 KB (~400 files per syscall) might improve walking performance further.

error: specified package has no binaries

This happens when running cargo install walkdir

Document that `Iter` and `IterFilterEntry` are the result of trait methods

As an example, see the docs on the Zip type in the standard library.

The Iter and IterFilterEntry types should note that they're the result of calling into_iter and filter_entry on WalkDir and Iter respectively.

Correct errors in WalkDir type docs

The docs on WalkDir have a few inaccuracies about the behaviour that should be corrected:

If contents_first is true, then directories aren't emitted before their contents
The order isn't unspecified if a sort_by function is given. Or does 'unspecified' mean something different here?

Add option to yield empty directories

Currently, with the structure:

test
├── a
├── b
├── d_a/
│   ├── c
│   └── d
└── d_b/

where d_a and d_b are directories, WalkDir will not yield d_b. It would be useful (to me, at least) to be able to optionally iterate over empty directories.

I may have a chance to look at implementing it in the future, if this is a desired feature.

Add categories to Cargo.toml

Relevant API Guideline

Add some categories to the Cargo.toml so walkdir has better discoverability on crates.io. Maybe add the filesystem category. I'm not sure whether others might also be applicable; crates.io maintains a list of available categories.

How to find the path relative to the walkdir root

Hello,

I'm looking for an easy way to find the relative path between the directory given to WalkDir::new and the files found. The docstring below from DirEntry says that path() is formed by joining the WalkDir::new path with the relative path I'm interested in, called "file name of this entry" below. But file_name() logically returns only the file name, not the directories in between...

fn path(&self) -> &Path

The full path that this entry represents.

The full path is created by joining the parents of this entry up to the root initially given to WalkDir::new with the file name of this entry.
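Since path() starts with the root that was passed to WalkDir::new, one way to recover the relative part is `Path::strip_prefix` from the standard library. The helper below is a hypothetical sketch; with walkdir it would be called as `relative_to_root(root, entry.path())`.

```rust
use std::path::Path;

/// Returns `full` relative to `root`, or `None` if `full` is not under `root`.
fn relative_to_root<'a>(root: &Path, full: &'a Path) -> Option<&'a Path> {
    full.strip_prefix(root).ok()
}
```

For the root entry itself the result is an empty path, which callers may want to handle separately.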

Documentation link should point to docs.rs

The documentation link currently points to http://burntsushi.net/rustdoc/walkdir/, which seems to have an outdated copy of the documentation. You may want to point to https://docs.rs/walkdir/ instead.

Rayon support for walkdir iterators

I'd love to use Rayon parallel iterators with walkdir. Walkdir's iterators don't implement the necessary traits to do so. @nikomatsakis provided a different pattern that provided some thread-pool-based parallelism, but it would be nice to directly use par_iter / into_par_iter with walkdir. walkdir itself could actually run the walk in parallel in that case, processing multiple directories in parallel.

Release 2.0 on crates.io

The libz blitz evaluation in #47 has finished (thank you to everyone that contributed!).

So I thought I'd open this issue to track any other things you'd like to do before releasing 2.0.

Final pass over documentation
Sort out #22?

Any ideas @BurntSushi?

Link references to std in docs

Relevant API Guideline

Turn references to items in the standard library in docs into references to that item in https://doc.rust-lang.org/stable/std/, so for example:

Reference to [`std::io::Error`][IoError] in markdown docs.

[IoError]: https://doc.rust-lang.org/stable/std/io/struct.Error.html

    for entry in WalkDir::new("/tmp/").into_iter().filter_entry(|e| e.file_type().is_file()) {
        let entry = entry.unwrap();
        println!("{}", entry.path().display());
    }

I expected all the files in the "/tmp/" dir to be printed out, but
it just skips the root path (because it is not a file) and prints nothing.

Is this a mistake in my code, or have I missed something?

Thanks for help.
