
split-folders's Introduction

split-folders

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

This should get you started on some serious deep learning with your data. There are good reasons to split your data into three different sets.

  • Split files into a training set and a validation set (and optionally a test set).
  • Works on any file types.
  • The files get shuffled.
  • A seed makes splits reproducible.
  • Allows randomized oversampling for imbalanced datasets.
  • Optionally group files by prefix.
  • (Should) work on all operating systems.
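The shuffling and seeding described in the list above can be illustrated with a short standard-library sketch. It is not the package's own implementation; the function name `split_by_ratio` and the rounding rule (the last part takes the remainder) are assumptions made for illustration.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle `files` deterministically and cut them into len(ratio) parts."""
    files = sorted(files)  # fixed starting order, independent of directory listing order
    random.Random(seed).shuffle(files)  # same seed -> same permutation
    parts, start = [], 0
    for i, share in enumerate(ratio):
        # Assumption: the last part takes whatever is left, so rounding drops no file.
        is_last = i == len(ratio) - 1
        end = len(files) if is_last else start + int(share * len(files))
        parts.append(files[start:end])
        start = end
    return parts


names = [f"img{i}.jpg" for i in range(10)]
train, val, test = split_by_ratio(names)
print(len(train), len(val), len(test))
```

Calling the function twice with the same seed returns the same split, which is the property the seed is meant to provide.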

Install

This package is Python only and there are no external dependencies.

pip install split-folders

Optionally, you may install tqdm to get a progress bar when moving files.

pip install split-folders[full]

Usage

You can use split-folders as a Python module or as a Command Line Interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise choose fixed. NB: oversampling is turned off by default. Oversampling is only applied to the train folder, since having duplicates in val or test would be considered cheating.
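To make the idea of randomized oversampling concrete, here is a minimal sketch that pads the training files of each class with random duplicates until all classes match the largest one. The helper name `oversample` and the choice of the largest class as the target are assumptions; the package may do this differently. Every class is assumed to contain at least one file.

```python
import random


def oversample(train_files_per_class, seed=1337):
    """Pad every class with random duplicates up to the size of the largest class."""
    rng = random.Random(seed)
    target = max(len(files) for files in train_files_per_class.values())
    balanced = {}
    for name, files in train_files_per_class.items():
        # Draw the missing items at random (with replacement) from the class itself.
        extra = [rng.choice(files) for _ in range(target - len(files))]
        balanced[name] = list(files) + extra
    return balanced


balanced = oversample({"cats": ["c1", "c2", "c3", "c4"], "dogs": ["d1"]})
print({name: len(files) for name, files in balanced.items()})
```

Only the training files are passed in here, which mirrors the rule that validation and test sets must stay free of duplicates.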

Module

import splitfolders

# Split with a ratio.
# To only split into training and validation set, pass a two-element tuple as `ratio`, e.g. `(.8, .2)`.
splitfolders.ratio("input_folder", output="output",
    seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values

# Split val/test with a fixed number of items, e.g. `(100, 100)`, for each set.
# To only split into training and validation set, pass a single number as `fixed`, e.g. `10`.
# Set 3 values, e.g. `(300, 100, 100)`, to limit the number of training values.
splitfolders.fixed("input_folder", output="output",
    seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, move=False) # default values

Occasionally, a sample may comprise more than a single file (e.g. picture (.png) + annotation (.txt)). splitfolders lets you split files into equally-sized groups based on their prefix. Set group_prefix to the length of the group (e.g. 2). Note that every file must then be part of a group.
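The grouping idea can be sketched with the standard library alone. The example below treats the file name without its extension as the prefix, checks that every group has the expected length, and then shuffles whole groups so that an image and its annotation always land in the same set. The helper names `group_by_stem` and `split_groups` are hypothetical and do not reflect the package's internals.

```python
import random
from collections import defaultdict
from pathlib import PurePath


def group_by_stem(filenames, group_len):
    """Group files that share a stem, e.g. img1.png + img1.txt."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[PurePath(name).stem].append(name)
    for stem, members in groups.items():
        if len(members) != group_len:
            raise ValueError(
                f"group {stem!r} has {len(members)} files, expected {group_len}"
            )
    return list(groups.values())


def split_groups(groups, val_share=0.2, seed=1337):
    """Shuffle whole groups and return (train_groups, val_groups)."""
    groups = list(groups)
    random.Random(seed).shuffle(groups)
    n_val = int(len(groups) * val_share)
    return groups[n_val:], groups[:n_val]


files = [f"img{i}{ext}" for i in range(5) for ext in (".png", ".txt")]
train_groups, val_groups = split_groups(group_by_stem(files, group_len=2))
print(len(train_groups), len(val_groups))
```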

Set move=True if you want to move the files instead of copying.

CLI

Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
    --output        path to the output folder. defaults to `output`. Is created if it does not exist.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
                    Set 3 values, e.g. `300 100 100`, to limit the number of training values.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
    --move          move the files instead of copying
Example:
    splitfolders --ratio .8 .1 .1 -- folder_with_images

Because of some Python quirks you have to add -- after the values passed to --ratio.
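The need for `--` can be reproduced with a small stand-in parser built on Python's `argparse`. The parser below only mimics the shape of the usage line (a list-valued `--ratio` followed by one positional folder); it is an assumption for illustration, not the parser shipped with split-folders.

```python
import argparse


def build_parser():
    # Stand-in parser: one option that takes a list of floats, one positional.
    parser = argparse.ArgumentParser(prog="splitfolders-sketch")
    parser.add_argument("--ratio", nargs="+", type=float)
    parser.add_argument("input")
    return parser


parser = build_parser()

# With `--`, the float list ends there and the folder is parsed as the positional.
args = parser.parse_args(["--ratio", ".8", ".2", "--", "folder_with_images"])
print(args.ratio, args.input)

# Without `--`, argparse keeps feeding values to --ratio and fails on the folder name.
rejected = False
try:
    parser.parse_args(["--ratio", ".8", ".2", "folder_with_images"])
except SystemExit:
    rejected = True
print("rejected without --:", rejected)
```

An option declared with `nargs="+"` consumes every following value until the next option or `--`, so the folder name is handed to the float converter unless the separator is present.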

Instead of the command splitfolders you can also use split_folders or split-folders.

Development

Install and use poetry.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

MIT

split-folders's People

Contributors

andife, dependabot[bot], ghltshubh, jfilter, mariusmez, nicholastzx


split-folders's Issues

Split files with same prefix together

I have some experiments where I crop the image into small tiles. All of them share the same prefix. Is it possible to keep the tiles from the same source image together in either train or valid?

Thanks!

split_folders not working

I have a directory containing two classes, 0 and 1, consisting of subfolders with wave files. The structure is:
main folder
--class 1
----speaker1
------wavefile1
------wavefile2
------wavefile3
----speaker2
........
--class 2
----speaker1
----speaker2
........
split_folders is not working when using the syntax given in the example.

Error while specifying only ratio and image folder with CLI

splitfolders --ratio .8 .2 img_folder throws the following error

usage: splitfolders [-h] [--output OUTPUT] [--seed SEED]
                    [--ratio RATIO [RATIO ...]] [--fixed FIXED [FIXED ...]]
                    [--oversample] [--group_prefix GROUP_PREFIX]
                    input
splitfolders: error: argument --ratio: invalid float value: 'img_folder'

Error

I don't know why this doesn't work, help me please.
image

There is still an error in split_class_dir_fixed

Hi! I've seen that you found the mistake in the split_class_dir_fixed function and changed the comparison from ">" to ">=", but I think you haven't published the update, so when I ran pip install split-folders and used it, I got this error with the ">" sign in the function. I hope you will publish it in the future (:

AttributeError: 'PosixPath' object has no attribute 'rstrip'

import pathlib
import splitfolders
input_folder=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/train")
print(input_folder)
output=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/Train_val_split")
print(output)
# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
splitfolders.ratio(input_folder, output=output,
    seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values

I am using the above code for splitting images
and getting below error

Copying files: 0 files [00:00, ? files/s]

AttributeError Traceback (most recent call last)
in ()
8 # To only split into training and validation set, set a tuple to ratio, i.e, (.8, .2).
9 splitfolders.ratio(input_folder, output=output,
---> 10 seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values

4 frames
/usr/lib/python3.7/shutil.py in _basename(path)
524 # Thus we always get the last component of the path, even for directories.
525 sep = os.path.sep + (os.path.altsep or '')
--> 526 return os.path.basename(path.rstrip(sep))
527
528 def move(src, dst, copy_function=copy2):

AttributeError: 'PosixPath' object has no attribute 'rstrip'
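One possible workaround, assuming the failure comes from a `pathlib.Path` object reaching code that calls string methods on it (as the `shutil` frame in the traceback suggests), is to convert the paths to plain strings before passing them on. The helper `as_str_paths` is hypothetical and not part of splitfolders; the paths are placeholders.

```python
import os
from pathlib import Path


def as_str_paths(*paths):
    """Return each path as a plain string (hypothetical helper)."""
    return tuple(os.fspath(p) for p in paths)


input_folder, output = as_str_paths(
    Path("data") / "train",
    Path("data") / "train_val_split",
)
# Both values are now str objects, so methods such as rstrip() are available.
print(type(input_folder).__name__, input_folder.rstrip(os.path.sep))
```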

What if I want to take exactly 200 files from each folder for training and rest for validation/test

Hey @jfilter ,

First of all, let me express my admiration for your efforts in making our lives so easy by developing this very handy tool. However, there is a problem with imbalanced numbers of files in folders. In my case I want to keep exactly 200 files for training and the rest for test/validation, regardless of how many files are left for test/validation. If I do so following your code guidelines:

To only split into training and validation set, use a single number to fixed, i.e., 10.

I got the following error at the assertion. Could you please help me resolve this issue?

TypeError Traceback (most recent call last)
in ()
7
8 #no_files
----> 9 sp.fixed(data_dir, output=output_dir, seed=13, fixed=200, oversample=False, group_prefix=None) # default values

/usr/local/lib/python3.6/dist-packages/splitfolders/split.py in fixed(input, output, seed, fixed, oversample, group_prefix)
96 fixed = fixed
97
---> 98 assert len(fixed) in (1, 2)
99
100 if tqdm_is_installed:

TypeError: object of type 'int' has no len()
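The traceback shows `len()` being called on a plain integer. A small sketch of how a caller could sidestep that particular TypeError is to wrap a single number in a tuple first. The helper `normalize_fixed` is hypothetical, and whether the installed version then interprets the value as intended is not something this sketch can confirm.

```python
def normalize_fixed(fixed):
    """Return `fixed` as a tuple so that len() and sum() work on it."""
    if isinstance(fixed, int):
        return (fixed,)
    return tuple(fixed)


print(normalize_fixed(200))
print(normalize_fixed((100, 100)))
```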

Feature request: ratio() classes structure option argument

The ratio() function assumes there is already a folder class structure (input_dir/class1, input_dir/class2, etc.). It would be helpful if the ratio function had an argument that makes this assumption optional (e.g. splitfolders.ratio(classes=None)).

splitfolders.ratio fail without error message

I have correctly called splitfolders.ratio(input=input_folder, output=output_folder, seed=1337, ratio=(.9, 0.1)).

There is no error message. However, the files are not split and remain in the input folder, and the output folder does not change either. This happens with Python 3.9.

Splitting all numbers of files of a folder into train val test using fixed will throw an error

If my folder has 4567 files and I want to split all of them into train, val and test folders with desired numbers:
splitfolders.fixed(input, output=output, seed=1337, fixed=(3000, 1000, 567))

it will show an error message:

if not len(files) > sum(fixed):
        raise ValueError(
            f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
        )

Suggested fix: use >= instead of >, so that the split can proceed without error when the fixed numbers sum to exactly the number of files:

if not len(files) >= sum(fixed):
        raise ValueError(
            f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
        )

Split folders to work without a "class" hierarchy

Since splitting data into (test, train, validation) sets is relevant to all data types, not just ones that are related to different classes, having the option to use split-folders on a general folder, i.e. one that contains actual data and does not comply with the subdir ('class1', 'class2',...) hierarchy, would make this package relevant to a much larger crowd.

Custom folder names

Awesome library, it would be awesome if it's possible to add custom folder names. Great work!

Assertion Error

I get this error when i use the fixed attribute.
Error:

AssertionError Traceback (most recent call last)
in
----> 1 split_folders.fixed('C:/Users/TEXVNQA/Downloads/TL_DATA/', output="C:/Users/TEXVNQA/Downloads/GTS_Torch Format/", seed=1337, fixed=(135,135), oversample=False) # default values

c:\torch\lib\site-packages\split_folders\split.py in fixed(input, output, seed, fixed, oversample)
69 lens = []
70 for class_dir in dirs:
---> 71 lens.append(split_class_dir_fixed(class_dir, output, fixed, seed))
72
73 if not oversample:

c:\torch\lib\site-packages\split_folders\split.py in split_class_dir_fixed(class_dir, output, fixed, seed)
105 files = setup_files(class_dir, seed)
106
--> 107 assert len(files) > sum(fixed)
108
109 split_train = len(files) - sum(fixed)

AssertionError:

📜 XML copy support

Hello 🙋‍♀️,

  • 🤔 I suggest adding .xml file copy support behind a flag like --copy_xml, to support datasets in VOC structure.
  • 👀 You could modify the copy_files method along the following lines:
import pathlib
import shutil
from os import path


def copy_files(files_type, class_dir, output, prog_bar, copy_xml=False):
    """Copies the files from the input folder to the output folder
    """
    # get the last component of the class directory path (the class name)
    class_name = path.split(class_dir)[1]
    for (files, folder_type) in files_type:
        full_path = path.join(output, folder_type, class_name)

        pathlib.Path(full_path).mkdir(parents=True, exist_ok=True)
        for f in files:
            if prog_bar is not None:
                prog_bar.update()
            extension = path.splitext(path.split(f)[-1])[-1].lower()
            # splitext returns extensions with the leading dot
            if extension in [".jpg", ".png", ".bmp", ".jpeg", ".gif"]:
                shutil.copy2(f, full_path)
                if copy_xml:
                    # copy the annotation that shares the image's base name
                    xml_f = path.splitext(f)[0] + ".xml"
                    shutil.copy2(xml_f, full_path)

AttributeError: 'PosixPath' object has no attribute 'rfind'

I did a split_folders with --ratio .8 .2, and got the following error:

Traceback (most recent call last):
File "/root/.virtualenvs/data-manipulation/bin/split_folders", line 27, in
split_folders.ratio(args.input, args.output, args.seed, args.ratio)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
split_class_dir_ratio(class_dir, output, ratio, seed)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
copy_files(li, class_dir, output)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
class_name = path.split(class_dir)[1]
File "/root/.virtualenvs/data-manipulation/lib/python3.5/posixpath.py", line 103, in split
i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'

the same problem happens with python 2.7.12 and python 3.5.2

Feature request

I would like splitfolders to be able to split CSV files that contain fake and real labels into train_fake.csv and test_fake.csv files, and Train_real.csv and Test_real.csv files, respectively.

Feature request: Input folder path

Hi, thanks for developing this library. I just wonder if it would make sense and be possible to allow giving as input_folder the complete path to the folder that contains the files. For example, I'm working with images and annotations and I have the following structure:

data\
- images\
  - im_1.jpg
  - im_2.jpg
  - ...
- annotations\
  - im_1.xml
  - im_2.xml
  - ...

The problem is that the ratio function asks for the input_folder to be the path to a directory, which in my case would be data. But this would also split the annotations, which would be great if the split for the annotations mirrored the split for the images, but apparently it does not. It seems that the split for images is independent from the split of annotations; for example, I can find im_1.jpg in the train folder and im_1.xml in the validation folder.

Thanks, and keep on the excellent work that you are doing.

split_folders.ratio() to have "oversample" parameter.

I wanted to split the images into train, val and test.
split_folders.ratio('data/images', output="data/images_new", seed=1337, ratio=(.8, .1, .1))
1. I realized that there are many copies for the classes with few samples. Does the ratio method automatically balance the dataset?
2. If yes, would it also be possible to add an "oversample" parameter to the ratio method?

Need to split time series data into train/val/test, but cannot shuffle

As I am working with time series data, the information cannot be shuffled.
The data volume is too big, so I will use keras.train_on_batch, since loading the full dataset into RAM is not feasible.
Splitting the data into train, test and validation folders would be helpful. I am trying to use the split-folders library, but I could not work out how to avoid the shuffling step. Is it possible to deactivate the shuffle?
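As a sketch of what an order-preserving split could look like outside the library, the function below sorts the file names and cuts the list into consecutive parts, so the earliest files go to training and the latest to testing. It assumes that sorting the names yields chronological order (for example zero-padded indices or timestamps); `split_in_order` is a hypothetical name.

```python
def split_in_order(files, ratio=(0.8, 0.1, 0.1)):
    """Cut a sorted list into consecutive train/val/test parts without shuffling."""
    files = sorted(files)  # assumption: name order equals time order
    n_train = int(len(files) * ratio[0])
    n_val = int(len(files) * ratio[1])
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]  # the remainder, so no file is dropped
    return train, val, test


series = [f"t{i:03d}.csv" for i in range(10)]
train_part, val_part, test_part = split_in_order(series)
print(train_part[-1], val_part, test_part)
```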

Specify the exact number of items for training/validation/test sets

I understand that by version 0.4.3 it is possible to specify the exact number of items for the validation and test sets by using the flag --fixed, however, as the documentation states:

The remaining items constitute the training set. e.g. for train/val/test 100 100 or for train/val 100.

This means that you can currently specify the number of items for the validation and test sets but not for the training set. In a scenario where only a given range of images is needed (i.e. a subset of a larger dataset), it would be useful to be able to specify a fixed number of items for each of the sets.

More than an issue, a feature request. Thanks.

AttributeError: 'PosixPath' object has no attribute 'rfind'

I get the following error, at this line : split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))

Traceback (most recent call last):
File "image_classification_2.py", line 53, in <module>
split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
split_class_dir_ratio(class_dir, output, ratio, seed)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
copy_files(li, class_dir, output)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
class_name = path.split(class_dir)[1]
File "/home/ubuntu/anaconda3/lib/python3.5/posixpath.py", line 103, in split
i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'

group_by_prefix function finding multiple matches

I have a dataset with images in this format: image_1.png, image_2.png, ..., image_130.png, ..., image_1301.png, image_1302.png, ..., and my label files follow the same convention.

When I use group_prefix = 2, I get an error message saying it has found multiple matches for image_130.png. I am not a Python expert, but looking at your group_by_prefix function in splitfolders/split.py on line 190, it checks whether the file name starts with the prefix instead of checking for an exact match on the file name before the file extension. So for image_130.png, the function is going to find a match for image_130.png, image_1300.png, image_1301.png etc.
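The difference between the two matching strategies described in this report can be shown with a few file names. This is an illustration of the reported behaviour, not the code from splitfolders/split.py; both helper names are hypothetical.

```python
from pathlib import PurePath

names = ["image_130.png", "image_130.txt", "image_1300.png", "image_1301.png"]


def match_startswith(stem, filenames):
    """Prefix matching: also catches longer names that begin with the stem."""
    return [n for n in filenames if n.startswith(stem)]


def match_exact_stem(stem, filenames):
    """Exact matching on the name before the extension."""
    return [n for n in filenames if PurePath(n).stem == stem]


loose = match_startswith("image_130", names)
exact = match_exact_stem("image_130", names)
print(len(loose), exact)
```

With prefix matching all four names are returned, while the exact comparison keeps only the image and its label file.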

Error importing the module

import split_folders

Traceback (most recent call last):
File "", line 1, in
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/__init__.py", line 1, in
from .split import *
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/split.py", line 109
raise ValueError(f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.')
^
SyntaxError: invalid syntax

Any help appreciated. Thanks in advance!

Split by Ratio

Hi dude,
When splitting with the ratio .7 .2 .1, the ratios sum to 0.9999999999999999 instead of 1.
See the printout.
image
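The reported sum is a consequence of binary floating-point arithmetic rather than of the ratios themselves. A tolerance-based check, sketched below with the standard library, accepts such ratios; `ratio_is_valid` is a hypothetical helper and not necessarily how the package validates its input.

```python
import math


def ratio_is_valid(ratio):
    """Accept ratios whose sum is 1 up to floating-point rounding."""
    return math.isclose(sum(ratio), 1.0, rel_tol=1e-9)


# Adding the three values step by step may not compare equal to 1.0 exactly.
stepwise = 0.7 + 0.2 + 0.1
print(stepwise == 1.0, ratio_is_valid((0.7, 0.2, 0.1)))
```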

Show progress bar

Hi, nice and useful little project.

May I suggest adding a tool like tqdm (https://tqdm.github.io) to show the progress of the copy for big datasets that take time to process?

I think it's easy to add and I could do a PR if you want, but I don't know the philosophy of your project (are external dependencies allowed?).
I'll wait for your reply.

Does not work with relative paths

import splitfolders
splitfolders.ratio("./input_folder", output="./output", seed=1337, ratio=(.8, .1, .1), group_prefix=None)

Error:

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'input_folder'

My folder

  • root
    • input_folder
      • Covid-19
      • No_findings
    • output
    • split_folder_script.ipynb

Ratio not found

import splitfolders

splitfolders.ratio("/Users/mavaylon/Research/Research_Gambier/Data_P/BP", output="/Users/mavaylon/Research/Research_Gambier/Data_P/output", seed=1337, ratio=(.7, .3), group_prefix=None) # default values

This just reports that ratio is not an attribute.

warn user if folders are not in the right format

(cv2020) hduser@slave1:~/computer_vision_2020/staff_detecion_using_icard/data$ split_folders images --ratio .8 .2
Copying files: 0 files [00:00, ? files/s]
When I run it, it says it is copying 0 files.
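A check along the following lines could give such a warning. It is a sketch using only the standard library: it counts the files inside each class subfolder and raises an error when the input folder has no class subfolders containing files, for example when images sit directly in the input folder. The function names are hypothetical, and the demo builds a throwaway directory for illustration.

```python
import tempfile
from pathlib import Path


def count_files_per_class(input_dir):
    """Map each class subfolder of `input_dir` to the number of files directly inside it."""
    counts = {}
    for entry in sorted(Path(input_dir).iterdir()):
        if entry.is_dir():
            counts[entry.name] = sum(1 for f in entry.iterdir() if f.is_file())
    return counts


def check_layout(input_dir):
    """Raise ValueError if there is nothing to split in the expected layout."""
    counts = count_files_per_class(input_dir)
    if not counts or all(n == 0 for n in counts.values()):
        raise ValueError(f"{input_dir} has no class subfolders containing files")
    return counts


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "class1").mkdir()
    (Path(tmp) / "class1" / "img1.jpg").touch()
    (Path(tmp) / "loose.jpg").touch()  # files outside a class folder are not counted
    demo_counts = check_layout(tmp)

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "img1.jpg").touch()  # flat folder: no class subfolders
    try:
        check_layout(tmp)
        flat_rejected = False
    except ValueError:
        flat_rejected = True

print(demo_counts, flat_rejected)
```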

Can this library work with txt files?

Hi, thanks for the library. I have an issue when splitting txt files.

image

This is the result after splitting the files. There should be one txt file here, but I'm getting this desktop.ini file instead.

Thanks .
