walkdir's Introduction

walkdir

A cross platform Rust library for efficiently walking a directory recursively. Comes with support for following symbolic links, controlling the number of open file descriptors and efficient mechanisms for pruning the entries in the directory tree.

Dual-licensed under MIT or the UNLICENSE.

Documentation

docs.rs/walkdir

Usage

To use this crate, add walkdir as a dependency to your project's Cargo.toml:

[dependencies]
walkdir = "2"

Example

The following code recursively iterates over the directory given and prints the path for each entry:

use walkdir::WalkDir;

for entry in WalkDir::new("foo") {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Or, if you'd like to iterate over all entries and ignore any errors that may arise, use filter_map. (For example, the code below silently skips directories that the owner of the running process does not have permission to access.)

use walkdir::WalkDir;

for entry in WalkDir::new("foo").into_iter().filter_map(|e| e.ok()) {
    println!("{}", entry.path().display());
}

Example: follow symbolic links

The same code as above, except follow_links is enabled:

use walkdir::WalkDir;

for entry in WalkDir::new("foo").follow_links(true) {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Example: skip hidden files and directories efficiently on unix

This uses the filter_entry iterator adapter to avoid yielding hidden files and directories efficiently:

use walkdir::{DirEntry, WalkDir};

fn is_hidden(entry: &DirEntry) -> bool {
    entry.file_name()
         .to_str()
         .map(|s| s.starts_with("."))
         .unwrap_or(false)
}

let walker = WalkDir::new("foo").into_iter();
for entry in walker.filter_entry(|e| !is_hidden(e)) {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

Minimum Rust version policy

This crate's minimum supported rustc version is 1.34.0.

The current policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if crate 1.0 requires Rust 1.20.0, then crate 1.0.z for all values of z will also require Rust 1.20.0 or newer. However, crate 1.y for y > 0 may require a newer minimum version of Rust.

In general, this crate will be conservative with respect to the minimum supported version of Rust.

Performance

The short story is that performance is comparable with find and glibc's nftw on both a warm and cold file cache. In fact, I cannot observe any performance difference after running find /, walkdir / and nftw / on my local file system (SSD, ~3 million entries). More precisely, I am reasonably confident that this crate makes as few system calls and close to as few allocations as possible.

I haven't recorded any benchmarks, but here are some things you can try with a local checkout of walkdir:

# The directory you want to recursively walk:
DIR=$HOME

# If you want to observe perf on a cold file cache, run this before *each*
# command:
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

# To warm the caches
find $DIR

# Test speed of `find` on warm cache:
time find $DIR

# Compile and test speed of `walkdir` crate:
cargo build --release --example walkdir
time ./target/release/examples/walkdir $DIR

# Compile and test speed of glibc's `nftw`:
gcc -O3 -o nftw ./compare/nftw.c
time ./nftw $DIR

# For shits and giggles, test speed of Python's (2 or 3) os.walk:
time python ./compare/walk.py $DIR

On my system, the performance of walkdir, find and nftw is comparable.

walkdir's People

Contributors

andygauge, budziq, burntsushi, byron, eh2406, exphp, guillaumegomez, ignatenkobrain, jackpot51, jasongrlicky, jchlapinski, jcsoo, jeremielate, kimundi, lo48576, lukaskalbertodt, mcharsley, meven, nabijaczleweli, nivkner, opilar, psbarrett, ruuda, ryman, s3rvac, tmccombs, tshepang, vandenoever, vsuryamurthy, yufengwng

walkdir's Issues

Remove re-export of is_same_file

The is_same_file function doesn't need to be re-exported in walkdir anymore now that there's a dedicated same_file crate. The plan is:

  • Deprecate the is_same_file re-export for a non-breaking release
  • Remove the is_same_file re-export for the next breaking release

Recursing over multiple directories

I think it would be useful if WalkDir could take multiple Paths and build a single iterator containing every file and folder once.

For example, given this tree and a hypothetical add method:

a
 \a2
b
 \b2
  \b3

let walker = WalkDir::new("a");
walker.add("b3");
walker.add("b2");
let iter = walker.into_iter();
// iter["a", "a2", "b2", "b3"]

Add example for contents_first

Add an example that demonstrates the contents_first method. Maybe this can just be another example in the crate root docs. With a comment like:

// When contents_first is true outputs:
// `dir/first_file`
// `dir/last_file`
// `dir`

Does anyone have any thoughts on how the docs could illustrate the results of certain filters better?

How would I emulate the following find command?

I'm trying to find all git repos in a directory:

find . -name .git -type d -prune

It searches for directories named .git and stops recursing once one has been found.

So I tried something like this with filter_entry:

use std::fs;
use walkdir::{DirEntry, WalkDir};

// Returns true if the directory behind `entry` directly contains a `.git` entry.
fn is_git_repo(entry: &DirEntry) -> bool {
    let path = entry.path();
    // TODO: Handle errors
    let files = fs::read_dir(path).unwrap();
    files
        .map(|r| r.unwrap())
        .any(|s| s.file_name().to_str().unwrap() == ".git")
}

fn main() -> Result<(), walkdir::Error> {
    // Placeholder: `p` is the directory to search.
    let p = ".";
    let walker = WalkDir::new(p).into_iter();

    for entry in walker.filter_entry(|e| {
        // Only give directories
        e.file_type().is_dir()
    }) {
        let entry = entry?;

        if is_git_repo(&entry) {
            let path = entry.path();
            println!("{}", path.display());
        }
    }
    Ok(())
}

It works, but it's not efficient.
I want to stop recursing after a git repo is found, but I can't seem to code that logic in filter_entry.

Support more metadata for sort_by

Possibly makes #44 irrelevant.

Is it possible to support more file metadata in the sort_by function? If the file metadata is already around by the time sort_by is run this could be worthwhile, otherwise it might not be practical.

Endless loop when ReadDir returns an error

The ReadDir documentation states that it may return an Err if there's some sort of intermittent IO error during iteration.
However, it seems that there are some cases in which the error will be persistent, for instance, some directories in a /proc entry of a zombie process on Linux.

The entry won't be popped in this case, resulting in an endless loop:

$ cargo run --example walkdir /proc/2226/
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/examples/walkdir /proc/2226/`
/proc/2226/
/proc/2226/task
/proc/2226/task/2226
/proc/2226/task/2226/fd
ERROR: IO error for operation on /proc/2226/task/2226/fd: Permission denied (os error 13)
/proc/2226/task/2226/fdinfo
ERROR: IO error for operation on /proc/2226/task/2226/fdinfo: Permission denied (os error 13)
/proc/2226/task/2226/ns
ERROR: IO error for operation on /proc/2226/task/2226/ns: Permission denied (os error 13)
/proc/2226/task/2226/net
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
ERROR: Invalid argument (os error 22)
...

Update Minimal Rust Version Support

We've got some PRs that are stumbling over the minimum Rust version being 1.10:

  • #52 for ? (1.13)
  • #59 for some Debug impls in std (not sure what version they came in yet)

Is it worth bumping the minimum version to 1.13? Even if that doesn't give us the Debug impls for std::fs::ReadDir it might be worthwhile for ?.

cc: @budziq @tmccombs @BurntSushi

"symbolic_link" vs "symlink"

To follow the precedent set by stdlib, DirEntry::path_to_symbolic_link needs to be renamed to DirEntry::path_to_symlink.

Join efforts with Walker

I'm the author / publisher of the Walker crate. Its signature is the same as the recursive directory walker that was in the pre-1.0 stdlib. A brief search of the crates index suggests it is currently the only such library usable in stable.

Unless you envision walkdir as a very different library, how about we join these two up, and provide a seamless upgrade for all who are already using Walker into a better algorithm, more efficient implementation, etc?

Add links to other walkdir items in DirEntry docs

Relevant API Guideline

Add reference links in the DirEntry docs prose when mentioning other walkdir items:

  • filter_entry in root skip hidden files and directories efficiently on unix example
  • path, file_name and follow_links in DirEntry
  • WalkDir::new in DirEntry::path
  • follow_links in DirEntry::path_is_symbolic
  • follow_links in DirEntry::metadata
  • follow_links in DirEntry::file_type

Document why unwraps won't fail

We should add some inline comments to the few calls to unwrap so it's clear to readers why it's safe to call unwrap in those places. All unwraps:

  • <Iter as Iterator>::next
  • Iter::get_deferred_dir
  • Iter::push

We might also want to change these to expects with a simple message that the panic is a bug in walkdir.

`DirEntry`s `FileType` can have different equality properties depending on code path

I'm trying to create a form of zipper walkdir iterators in my revision to dir-diff. In the original code path, we just did equality checks on the FileType and everything worked. When I manually look up the FileType, my tests fail.

---- easy_good stdout ----
FileType(FileType { mode: 32768 }) != FileType(FileType { mode: 33206 })
true != true
false != false
false != false

32768 -> 0x8000
33206 -> 0x81B6
(seems like Debug for FileType should output hex)

It looks like the extra information is S_IMODE

>>> hex(stat.S_IMODE(a.st_mode))
'0x1b6'

I've narrowed it down to read_dir vs metadata.

The reason I bring this up here is that DirEntry seems to have a way to populate its ty from either fs::DirEntry or metadata, I'm just not hitting that case in my code. This could cause some surprising behavior for walkdir users.
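One way to sidestep the mismatch, independent of how walkdir populates the type, is to compare only the kind of file rather than relying on FileType equality. The following std-only sketch illustrates the idea; same_kind is a hypothetical helper and ignores platform-specific file kinds such as sockets or FIFOs.

```rust
use std::fs::FileType;

/// Compares two file types by kind only (directory, regular file, symlink),
/// ignoring any extra bits the platform representation may carry.
fn same_kind(a: &FileType, b: &FileType) -> bool {
    a.is_dir() == b.is_dir()
        && a.is_file() == b.is_file()
        && a.is_symlink() == b.is_symlink()
}
```

A caller would pass one FileType obtained from a directory listing and one obtained from metadata, and get the same answer regardless of which code path produced each value.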

Option for breadth-first search?

In typical directory hierarchies, it's less efficient with a less predictable memory usage but it could be useful for streaming searches in huge directories.

Example: a fuzzy directory finder.
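To make the trade-off concrete, here is a minimal std-only sketch of a breadth-first walk. It is not how walkdir works internally; walk_breadth_first is a hypothetical function, it does not follow symbolic links, and it stops at the first I/O error.

```rust
use std::collections::VecDeque;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Visits `root` and everything below it level by level.
fn walk_breadth_first(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut visited = vec![root.to_path_buf()];
    // Directories waiting to be listed; this queue is what makes memory
    // usage less predictable than in a depth-first walk.
    let mut queue = VecDeque::new();
    queue.push_back(root.to_path_buf());

    while let Some(dir) = queue.pop_front() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // file_type() does not follow symlinks, so linked directories
            // are reported but not descended into.
            if entry.file_type()?.is_dir() {
                queue.push_back(path.clone());
            }
            visited.push(path);
        }
    }
    Ok(visited)
}
```

Every directory at one depth is queued before any of them is listed, so shallow matches surface early, at the cost of holding all pending directories in memory.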

approaching 1.0

walkdir has seen quite a bit of use but little iteration on the API. There are a few open PRs/issues mostly related to additional functionality, but no significant design changes. Because of that, I'd like to move walkdir to 1.0 soon. I'm thinking within the next few weeks unless there's a good reason not to.

Option to process parent before/after contents

According to the docs:

Results are returned in depth first fashion, with directories yielded before their contents

Would it be possible to have an option to yield directories after their contents? That would be useful, e.g., for recursively deleting a directory.

"File name too long" error at 4096 bytes

WalkDir cannot handle long paths that find handles fine.

extern crate walkdir;
use std::fs::create_dir;
use std::env::{current_dir, set_current_dir};

fn main() {
    let dir = current_dir().unwrap();
    let name = "a";
    for i in 0..2200 {
        if i % 100 == 0 {
            println!("Create dir at level {}.", i);
        }
        current_dir().unwrap(); // this line shows that rust can handle it
        create_dir(name).unwrap();
        set_current_dir(name).unwrap();
    }

    for r in walkdir::WalkDir::new(dir) {
        let entry = r.unwrap(); // this gives an error for long paths
        let len = entry.path().to_string_lossy().len();
        if len > 4090 {
            println!("{}", len);
        }
    }
}
...
Create dir at level 2100.
4091
4093
4095
4097
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { depth: 2042, inner: Io { path: Some("/home/walkdir/a/a/.../a/a/a"), err: Error { repr: Os { code: 36, message: "File name too long" } } } }', src/libcore/result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Make skip_current_dir and filter_entry inherent methods

The WalkDirIterator trait isn't pulling its weight, so we should make them inherent methods and remove the trait. The plan is:

  • Make the skip_current_dir and filter_entry methods inherent methods on Iter and IterFilterEntry
  • Deprecate the WalkDirIterator trait for a non-breaking release
  • Remove the WalkDirIterator trait for the next breaking release

explore traversal that yields directory entry twice

This is about issue #18 and PR #19 . Aside: the more I dug in, the more I learned, and I'm grateful to all involved for the work on walkdir.

@mcharsley, I played around with contents_first, and for the most part, liked it.

However, it doesn't seem to work with filter_entry() or skip_current_dir(): the yielded entries look to be skipping the remainder of the directory the skipped directory is in, as opposed to the skipped directory.

Additionally, I was looking for an option which also included the initial directory.

I see @michaelsproul proposed something which yields both edges, and I played around with a modified version of that and seemed to be able to do what I was looking for.

I also played around with writing an iterator adapter, which also works okay but didn't feel quite right: it didn't work to implement WalkDir, so skip_current_dir was outside of that.

I'm torn between a simpler walkdir plus another crate for an adapter controlling when and how often a directory is listed, and an integrated version, making use of the stacks that are already there. I could see either way, either rolling back the contents_first commit or adding to it to list the directory twice and have it working with filtering.

Update build configurations

On Travis, the failure looks to come from an old version of Rust, and the Windows build spends an hour trying to build something.

Make WalkDir Send + Sync

Relevant API Guideline

Add Send + Sync bounds to the sort_by function box, so that WalkDir can be Send + Sync. To make sure it doesn't regress, we should also add an assert_send and assert_sync macro (see the API guideline for inspiration) and test WalkDir, Iter and IterFilterEntry.
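The regression check could be sketched as follows. This std-only example uses generic functions rather than macros, and Walker is a hypothetical stand-in for WalkDir with a simplified sorter signature; only the Send + Sync bounds on the boxed closure are the point.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a walker that stores a boxed comparison closure.
/// Without the `Send + Sync` bounds on the box, the assertions below would
/// fail to compile.
#[allow(dead_code)]
struct Walker {
    sorter: Option<Box<dyn FnMut(&str, &str) -> Ordering + Send + Sync + 'static>>,
}

// These functions compile only if `T` satisfies the bound.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

/// Compile-time check: calling this does nothing at runtime.
fn walker_is_send_and_sync() {
    assert_send::<Walker>();
    assert_sync::<Walker>();
}
```

Placing a call like walker_is_send_and_sync() in a unit test turns any future loss of Send or Sync into a compile error.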

Implement Clone for WalkDir

Relevant API guideline

From the discussion: The traits to implement for WalkDir are:

  • Clone

Raised by @epage on reddit. The walkdir type should implement some common traits. They might not all make sense, but I think:

  • Debug
  • Clone
  • Eq was also mentioned. I'm not sure what the use-case for this one is though, maybe @epage can elaborate more.

Any other thoughts on this one?

Add links to other walkdir items in WalkDirIterator docs

Relevant API Guideline

Add reference links in the WalkDirIterator docs prose when mentioning other walkdir items:

  • WalkDirIterator::filter_entry in WalkDirIterator::skip_current_dir
  • WalkDirIterator::skip_current_dir in WalkDirIterator::filter_entry
  • WalkDir::min_depth in WalkDirIterator::filter_entry
  • WalkDir::max_depth in WalkDirIterator::filter_entry

Tracking issue for libz blitz evaluation of walkdir

add example for extracting the underlying I/O error

The Error type that walkdir exposes is a light wrapper around a std::io::Error. The purpose of the custom error type is to provide better error messages by default (for example, by including the file path name that caused a specific error). However, it is sometimes desirable to access the underlying I/O error. While there is no concrete method for accessing the underlying I/O error, one can convert a walkdir::Error into a std::io::Error using io::Error::from(err) where err is a walkdir::Error. This is because walkdir provides an impl From<walkdir::Error> for std::io::Error.

Once you have a std::io::Error, you could then extract its ErrorKind and do case analysis there.

It would be nice to have an example for this. We might also consider adding an io_error method that returns a &std::io::Error.

readdir() performance on large directories

See this article: http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

ls and practically every other method of listing a directory (including python os.listdir, find .) rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (i.e. 500M of directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() syscall directly, rather than helper methods from libc.

I'm just mentioning this since you have a comparison to find and using a larger buffer size than 32kByte (~400 files per syscall) might improve walking performance further.

Correct errors in WalkDir type docs

The docs on WalkDir have a few inaccuracies about the behaviour that should be corrected:

  • If contents_first is true, then directories aren't emitted before their contents
  • The order isn't unspecified if a sort_by function is given. Or does 'unspecified' mean something different here?

Add option to yield empty directories

Currently, with the structure:

test
├── a
├── b
├── d_a/
│   ├── c
│   └── d
└── d_b/

where d_a and d_b are directories, WalkDir will not yield d_b. It would be useful (to me, at least) to be able to optionally iterate over empty directories.

I may have a chance to look at implementing it in the future, if this is a desired feature.

How to find the path relative to the walkdir root

Hello,

I'm looking for an easy way to find the relative path between dir given to WalkDir::new and found files. The docstring below from DirEntry says that the path() is formed by joining the WalkDir::new-path with the relative path I'm interested in, below called "file name of this entry". But the file_name() logically only returns the filename, not the dirs between...

fn path(&self) -> &Path

The full path that this entry represents.

The full path is created by joining the parents of this entry up to the root initially given to WalkDir::new with the file name of this entry.
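Since path() starts with the root given to WalkDir::new, one way to recover the relative part is Path::strip_prefix from the standard library. A minimal sketch, where relative_to_root is a hypothetical helper:

```rust
use std::path::{Path, PathBuf};

/// Returns `full` relative to `root`, or `None` if `full` is not under `root`.
/// With walkdir, `full` would typically be `entry.path()` and `root` the
/// path originally passed to `WalkDir::new`.
fn relative_to_root(root: &Path, full: &Path) -> Option<PathBuf> {
    full.strip_prefix(root).ok().map(Path::to_path_buf)
}
```

For the root entry itself the result is an empty path, because nothing remains after the prefix is removed.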

Rayon support for walkdir iterators

I'd love to use Rayon parallel iterators with walkdir. Walkdir's iterators don't implement the necessary traits to do so. @nikomatsakis provided a different pattern that provided some thread-pool-based parallelism, but it would be nice to directly use par_iter / into_par_iter with walkdir. walkdir itself could actually run the walk in parallel in that case, processing multiple directories in parallel.

Release 2.0 on crates.io

The libz blitz evaluation in #47 has finished (thank you to everyone that contributed!).

So I thought I'd open this issue to track any other things you'd like to do before releasing 2.0.

  • Final pass over documentation
  • Sort out #22?

Any ideas @BurntSushi?

Link references to std in docs

Relevant API Guideline

Turn references to items in the standard library in docs into references to that item in https://doc.rust-lang.org/stable/std/, so for example:

Reference to [`std::io::Error`][IoError] in markdown docs.

[IoError]: https://doc.rust-lang.org/stable/std/io/struct.Error.html

Unexpected `filter_entry` behavior

If I want to get all the files in a WalkDir root path, here is the code:

    for entry in WalkDir::new("/tmp/").into_iter().filter_entry(|e| e.file_type().is_file()) {
        let entry = entry.unwrap();
        println!("{}", entry.path().display());
    }

I expected all the files in the "/tmp/" directory to be printed,
but it just skips the root path (because it is not a file) and prints nothing.

Is this a mistake in my code, or have I missed something?

Thanks for help.
