jfilter / split-folders

🗂 Split folders with files (i.e. images) into training, validation and test (dataset) folders

License: MIT License

Python 100.00%

validation training test dataset splitting machine-learning deep-learning oversampling python python-package

split-folders's Issues

split_folders.ratio() to have "oversample" parameter.

I wanted to split the images into train, val and test.
split_folders.ratio('data/images', output="data/images_new", seed=1337, ratio=(.8, .1, .1))
~~1) I realized that there are many copies for the classes with small samples. Would like to find out whether using the ratio method will automatically balance the data set?~~

If yes, would it also be possible to add a parameter "oversample" for ratio method?

Suggestion: Make default seed random and allow for Cross Validation folders

Hello!

According to https://pypi.org/project/split-folders/ , the default seed is 1337, so I assume the train-val-test split is always the same?

Furthermore, it would be a good idea to include an option to divide the train folder into X folds for cross-validation!

Split files with same prefix together

I have some experiments where I crop each image into small tiles. All tiles from one image share the same prefix. Is it possible to keep the tiles from the same source image together in either train or valid?

Thanks!

Split by Ratio

Hi dude,
When I check the ratio with .7 .2 .1, the ratios sum to 0.9999999999999999 instead of 1.
See the print.
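The rounding reported above is ordinary binary floating-point behaviour rather than something specific to this package. A minimal sketch of a tolerance-based check follows; `ratio_is_valid` is a hypothetical helper for illustration, not a function of split-folders.

```python
import math

def ratio_is_valid(ratio, tolerance=1e-9):
    """Return True if the ratio entries add up to 1 within a small tolerance."""
    total = 0.0
    for part in ratio:
        total += part  # plain left-to-right float addition
    return math.isclose(total, 1.0, abs_tol=tolerance)

print(0.7 + 0.2 + 0.1)                   # 0.9999999999999999 (binary rounding)
print(ratio_is_valid((0.7, 0.2, 0.1)))   # True
print(ratio_is_valid((0.7, 0.2)))        # False
```

Comparing with a tolerance instead of `== 1` accepts ratios such as (.7, .2, .1) while still rejecting ratios that clearly do not sum to 1.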

Assertion Error

I get this error when I use the fixed function.
Error:

AssertionError Traceback (most recent call last)
in
----> 1 split_folders.fixed('C:/Users/TEXVNQA/Downloads/TL_DATA/', output="C:/Users/TEXVNQA/Downloads/GTS_Torch Format/", seed=1337, fixed=(135,135), oversample=False) # default values

c:\torch\lib\site-packages\split_folders\split.py in fixed(input, output, seed, fixed, oversample)
69 lens = []
70 for class_dir in dirs:
---> 71 lens.append(split_class_dir_fixed(class_dir, output, fixed, seed))
72
73 if not oversample:

c:\torch\lib\site-packages\split_folders\split.py in split_class_dir_fixed(class_dir, output, fixed, seed)
105 files = setup_files(class_dir, seed)
106
--> 107 assert len(files) > sum(fixed)
108
109 split_train = len(files) - sum(fixed)

AssertionError:

There is still an error in split_class_dir_fixed

Hi! I've seen that you found the mistake in the split_class_dir_fixed function and changed the comparison from ">" to ">=", but I think you haven't published the update: when I ran pip install split-folders and used it, I got this error, with the ">" sign still in the function. I hope you will publish it in the future (:

What if I want to take exactly 200 files from each folder for training and rest for validation/test

Hey @jfilter ,

First of all, let me say how much I admire your effort in making our lives easier by developing this very handy tool. However, there is a problem with imbalanced numbers of files in folders. In my case I want to keep exactly 200 files for training and the rest for test/validation, regardless of how many files are left for test/validation. I followed your usage guideline:

To only split into training and validation set, use a single number to `fixed`, i.e., `10`.

I got the following error. Could you please help me resolve this issue?

TypeError Traceback (most recent call last)
in ()
7
8 #no_files
----> 9 sp.fixed(data_dir, output=output_dir, seed=13, fixed=200, oversample=False, group_prefix=None) # default values

/usr/local/lib/python3.6/dist-packages/splitfolders/split.py in fixed(input, output, seed, fixed, oversample, group_prefix)
96 fixed = fixed
97
---> 98 assert len(fixed) in (1, 2)
99
100 if tqdm_is_installed:

TypeError: object of type 'int' has no len()
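The traceback shows `len()` being called on the integer `200`. A minimal sketch of a workaround is to wrap a single number in a tuple before the call; `normalize_fixed` is a hypothetical helper, and whether the resulting split matches the "200 for training" intent depends on how the library interprets `fixed`, which this traceback does not settle.

```python
def normalize_fixed(fixed):
    """Hypothetical helper: accept an int or a sequence and return a tuple."""
    if isinstance(fixed, int):
        return (fixed,)  # a one-element tuple has a length, unlike a bare int
    return tuple(fixed)

fixed = normalize_fixed(200)
# This is the check that failed in the traceback above.
assert len(fixed) in (1, 2)
print(fixed)  # (200,)
```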

Can this library work with txt files?

Hi, thanks for the library. I have an issue when splitting txt files.

This is the result after splitting the files: there should be one txt file here, but I'm getting this desktop.ini file instead.

Thanks .

I am unable to create train/test/val folders

splitfolders.ratio fail without error message

I have correctly implemented splitfolders.ratio(input= input_folder,
output=output_folder, seed=1337, ratio=(.9, 0.1))

There is no error message. However, the files are not split and remain in the input folder. The output folder does not change either. This happens on Python 3.9.

split_folders not working

I have a directory containing two classes, 0 and 1, each consisting of subfolders with wave files. The structure is:
main folder
-- class 1
---- speaker1
------ wavefile1
------ wavefile2
------ wavefile3
---- speaker2
------ ...
-- class 2
---- speaker1
---- speaker2
---- ...
Splitfile is not working when using the syntax given in the example.

Feature request: ratio() classes structure option argument

The ratio() function assumes there is already a folder class structure (input_dir/class1, input_dir/class2, etc.). It would be helpful if there were an argument for the ratio function that makes this assumption optional (e.g. splitfolders.ratio(classes=None)).

AttributeError: 'PosixPath' object has no attribute 'rfind'

I get the following error, at this line : split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))

Traceback (most recent call last):
  File "image_classification_2.py", line 53, in <module>
    split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
    split_class_dir_ratio(class_dir, output, ratio, seed)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
    copy_files(li, class_dir, output)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
    class_name = path.split(class_dir)[1]
  File "/home/ubuntu/anaconda3/lib/python3.5/posixpath.py", line 103, in split
    i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'

Feature request

I would like splitfolders to be able to split CSV files whose rows are labelled fake or real into train_fake.csv and test_fake.csv files, and Train_real.csv and Test_real.csv files, respectively.

Feature request: Input folder path

Hi, thanks for developing this library. I just wonder if it would make sense, and be possible, to allow passing as input_folder the complete path to the folder that contains the files. For example, I'm working with images and annotations and I have the following structure:

data\
- images\
- im_1.jpg
- im_2.jpg
- ...
- annotations\
- im_1.xml
- im_2.xml
- ...

The problem is that the ratio function expects input_folder to be the path to a directory of subfolders, which in my case would be data. But this also splits the annotations. That would be great if the split for the annotations mirrored the split for the images, but apparently it does not. The split for images seems to be independent of the split for annotations: for example, I can find im_1.jpg in the train folder and im_1.xml in the validation folder.
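One way to keep an image and its annotation in the same split is to split the shared file stems once and derive both paths from each stem. The following is a minimal sketch under that assumption; `split_stems` and `paired_files` are hypothetical helpers and not part of split-folders, and the folder names follow the structure shown above.

```python
import random
from pathlib import Path

def split_stems(stems, ratio=(0.8, 0.2), seed=1337):
    """Shuffle the stems once, then cut them into consecutive groups by ratio."""
    stems = sorted(stems)
    random.Random(seed).shuffle(stems)
    groups, start = [], 0
    for i, part in enumerate(ratio):
        # the last group takes whatever remains, so no stem is dropped
        end = len(stems) if i == len(ratio) - 1 else start + int(part * len(stems))
        groups.append(stems[start:end])
        start = end
    return groups

def paired_files(stem, image_ext=".jpg", annotation_ext=".xml"):
    """Return the image path and annotation path that share one stem."""
    return Path("images") / (stem + image_ext), Path("annotations") / (stem + annotation_ext)

train, val = split_stems([f"im_{i}" for i in range(1, 11)])
for stem in val:
    print(paired_files(stem))  # both files of a pair land in the same split
```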

Thanks, and keep on the excellent work that you are doing.

AttributeError: 'PosixPath' object has no attribute 'rstrip'

import pathlib
import splitfolders
input_folder=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/train")
print(input_folder)
output=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/Train_val_split")
print(output)
# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
splitfolders.ratio(input_folder, output=output,
    seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values

I am using the above code to split images
and get the error below.

Copying files: 0 files [00:00, ? files/s]

AttributeError Traceback (most recent call last)
in ()
8 # To only split into training and validation set, set a tuple to ratio, i.e, (.8, .2).
9 splitfolders.ratio(input_folder, output=output,
---> 10 seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values

4 frames
/usr/lib/python3.7/shutil.py in _basename(path)
524 # Thus we always get the last component of the path, even for directories.
525 sep = os.path.sep + (os.path.altsep or '')
--> 526 return os.path.basename(path.rstrip(sep))
527
528 def move(src, dst, copy_function=copy2):

AttributeError: 'PosixPath' object has no attribute 'rstrip'
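Both the `rfind` and the `rstrip` tracebacks in this list show a string method being called on a `PosixPath`. A minimal sketch of a caller-side workaround is to convert paths to plain strings before handing them over; `as_str_path` is a hypothetical helper, and the commented call simply reuses the arguments from the report above.

```python
import pathlib

def as_str_path(p):
    """Return a plain string for either a str or a pathlib path."""
    return str(p)

input_folder = pathlib.PurePosixPath("/content/drive/MyDrive/StanfordCarsDataset/train")
print(as_str_path(input_folder))         # a str, so .rstrip() and .rfind() exist
print(as_str_path(input_folder).rstrip("/"))

# Assumed usage, mirroring the call in the report:
# splitfolders.ratio(as_str_path(input_folder), output=as_str_path(output),
#                    seed=1337, ratio=(.8, .2), group_prefix=None, move=True)
```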

AttributeError: 'PosixPath' object has no attribute 'rfind'

I ran split_folders with --ratio .8 .2, and got the following error:

Traceback (most recent call last):
File "/root/.virtualenvs/data-manipulation/bin/split_folders", line 27, in
split_folders.ratio(args.input, args.output, args.seed, args.ratio)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
split_class_dir_ratio(class_dir, output, ratio, seed)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
copy_files(li, class_dir, output)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
class_name = path.split(class_dir)[1]
File "/root/.virtualenvs/data-manipulation/lib/python3.5/posixpath.py", line 103, in split
i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'

The same problem happens with Python 2.7.12 and Python 3.5.2.

Cannot be used for more than 2 classes

Feature Request: Stratify Train/Test by Class

I would like to have equal ratio of classes in the training and test set. Can we add this feature?

📜 XML copy support

Hello 🙋‍♀️,

🤔 I suggest adding .xml file copy support behind a flag like --copy_xml, to support datasets in VOC structure.
👀 You could modify the copy_files method along the following lines:

import pathlib
import shutil
from os import path


def copy_files(files_type, class_dir, output, prog_bar, copy_xml=False):
    """Copies the files from the input folder to the output folder.

    If copy_xml is set, also copies the .xml file sharing the image's base name.
    """
    # get the last part of the class directory path, i.e. the class name
    class_name = path.split(class_dir)[1]
    # every extension needs its leading dot to match path.splitext output
    image_extensions = (".jpg", ".jpeg", ".png", ".bmp", ".gif")
    for (files, folder_type) in files_type:
        full_path = path.join(output, folder_type, class_name)

        pathlib.Path(full_path).mkdir(parents=True, exist_ok=True)
        for f in files:
            if prog_bar is not None:
                prog_bar.update()
            extension = path.splitext(f)[1].lower()
            if extension in image_extensions:
                shutil.copy2(f, full_path)
                if copy_xml:
                    xml_f = path.splitext(f)[0] + ".xml"
                    # skip images that have no annotation file
                    if path.exists(xml_f):
                        shutil.copy2(xml_f, full_path)

Split folders to work without a "class" hierarchy

Since splitting data into (test, train, validation) sets is relevant to all data types, not just ones that are related different classes, having the option to use split-folder on a general folder, i.e. one that contains actual data and does not comply with the subdir ('class1', 'class2',...) hierarchy, would make this package relevant to a much larger crowd.

group_by_prefix function finding multiple matches

I have a dataset with images named in this format: image_1.png, image_2.png, ..., image_130.png, ..., image_1301.png, image_1302.png, ..., and my label files follow the same convention.

When I use group_prefix = 2, I get an error message saying it has found multiple matches for image_130.png. I am not a Python expert, but looking at your group_by_prefix function in splitfolders/split.py on line 190, it checks whether the file name starts with the prefix (startswith) instead of checking for an exact match on the file name before the file extension. So for image_130.png, the function is going to find matches for image_130.png, image_1300.png, image_1301.png, etc.
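The difference between the two matching strategies described above can be sketched as follows. This is an illustration only and not the package's `group_by_prefix` code; both function names and the sample file names are assumptions.

```python
from pathlib import PurePath

names = ["image_130.png", "image_130.txt", "image_1300.png", "image_1301.png"]

def prefix_matches(target, candidates):
    """Match by startswith on the stem: also catches image_1300, image_1301."""
    stem = PurePath(target).stem
    return [c for c in candidates if c.startswith(stem)]

def exact_matches(target, candidates):
    """Match only files whose name before the extension equals the stem."""
    stem = PurePath(target).stem
    return [c for c in candidates if PurePath(c).stem == stem]

print(prefix_matches("image_130.png", names))  # all four names
print(exact_matches("image_130.png", names))   # ['image_130.png', 'image_130.txt']
```

With exact stem comparison, an image and its label file still group together, while longer names that merely begin with the same characters are left out.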

warn user if folders are not in the right format

(cv2020) hduser@slave1:~/computer_vision_2020/staff_detecion_using_icard/data$ split_folders images --ratio .8 .2
Copying files: 0 files [00:00, ? files/s]
When I run it, it says it is copying 0 files.
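A "0 files" result like the one above is consistent with an input folder that holds files directly instead of one subfolder per class, which is the layout other issues in this list describe. A minimal sketch of a pre-flight check follows; `find_class_dirs` and `layout_warning` are hypothetical helpers, not part of the package.

```python
from pathlib import Path

def find_class_dirs(input_dir):
    """Return the immediate subfolders of input_dir, treated here as classes."""
    return sorted(p for p in Path(input_dir).iterdir() if p.is_dir())

def layout_warning(input_dir):
    """Return a warning string if no class subfolders exist, else None."""
    if find_class_dirs(input_dir):
        return None
    return (
        f"No class subfolders found in '{input_dir}'. "
        "Expected a layout like input/class1/img.jpg, input/class2/img.jpg."
    )
```

Calling `layout_warning("images")` before splitting would surface the problem instead of silently copying nothing.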

Need to split time series data into train/val/test, but cannot shuffle

As I am working with time series data, the information cannot be shuffled.
The data volume is too big to load the full dataset into RAM, so I will use keras.train_on_batch.
Splitting the data into train, test and validation folders would be helpful. I am trying to use the split-folders library, but I could not work out how to avoid the shuffling step. Is it possible to deactivate the shuffle?
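For ordered data, a split can be made by cutting the sorted file list at fixed positions instead of shuffling. The sketch below assumes that sorting the file names reproduces the time order; `ordered_split` is a hypothetical helper and not an option of split-folders.

```python
def ordered_split(files, ratio=(0.8, 0.1, 0.1)):
    """Split files into train/val/test by position, without any shuffling."""
    files = sorted(files)  # assumption: name order equals time order
    n = len(files)
    train_end = int(ratio[0] * n)
    val_end = train_end + int(ratio[1] * n)
    # the test set takes the remainder, so rounding never drops a file
    return files[:train_end], files[train_end:val_end], files[val_end:]

train, val, test = ordered_split([f"t_{i:03d}.csv" for i in range(10)])
print(train[-1], val, test)  # t_007.csv ['t_008.csv'] ['t_009.csv']
```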

Feature Request: output filepaths in lists without moving or copying files

Hi,

First off, I really like this function. It would, however, be nice to have a feature that just splits and outputs the file paths for train, val and test without actually moving or copying any files.

Feature Request: Train/Test split by file

Some datasets provide a train_test_split doc, which may have image number to train/test mapping. Can we have a feature to do this?

It would be better to have option of moving images rather than copying

Error while specifying only ratio and image folder with CLI

splitfolders --ratio .8 .2 img_folder throws the following error

usage: splitfolders [-h] [--output OUTPUT] [--seed SEED]
                    [--ratio RATIO [RATIO ...]] [--fixed FIXED [FIXED ...]]
                    [--oversample] [--group_prefix GROUP_PREFIX]
                    input
splitfolders: error: argument --ratio: invalid float value: 'img_folder'
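The error is what argparse produces when an option that accepts several values is followed directly by the positional argument: the option consumes `img_folder` as another ratio. The parser below is reconstructed from the usage message above as an assumption, not taken from the package source, and it shows that putting the input folder first avoids the problem.

```python
import argparse

def build_parser():
    # Reconstructed from the usage message; not the package's actual code.
    parser = argparse.ArgumentParser(prog="splitfolders")
    parser.add_argument("--output", default="output")
    parser.add_argument("--ratio", nargs="+", type=float)
    parser.add_argument("input")
    return parser

parser = build_parser()
# Positional first: --ratio then only consumes the floats that follow it.
args = parser.parse_args(["img_folder", "--ratio", ".8", ".2"])
print(args.input, args.ratio)  # img_folder [0.8, 0.2]
```

On the command line this corresponds to `splitfolders img_folder --ratio .8 .2`.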

Error importing the module

import split_folders

Traceback (most recent call last):
File "", line 1, in
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/__init__.py", line 1, in
from .split import *
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/split.py", line 109
raise ValueError(f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.')
^
SyntaxError: invalid syntax

Any helps appreciated. Thanks in advance !

To speed up, use "ln -s" instead of "cp" ?

It's too slow to copy many images
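A minimal sketch of the symlink idea, assuming the operating system and file system allow symbolic links (on Windows this may need extra privileges). `link_files` is a hypothetical helper and not an option of the package.

```python
import os
from pathlib import Path

def link_files(files, target_dir):
    """Create one symlink per file in target_dir instead of copying the data."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    links = []
    for f in files:
        src = Path(f).resolve()  # absolute source, so the link stays valid
        dst = target / src.name
        os.symlink(src, dst)
        links.append(dst)
    return links
```

Links are created almost instantly regardless of file size, but they break if the original files are later moved or deleted.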

Feature request: Split with Cross Validation

It seems the project does not support a cross-validation split, or does it? It would be nice to implement it.
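A k-fold split could be sketched as below. This works on indices into a file list and is an illustration only; `k_fold_indices` and `train_val_for_fold` are hypothetical names, not functions of split-folders.

```python
import random

def k_fold_indices(n_items, k, seed=1337):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_val_for_fold(folds, fold_index):
    """Use one fold for validation and all remaining folds for training."""
    val = folds[fold_index]
    train = [i for j, fold in enumerate(folds) if j != fold_index for i in fold]
    return train, val

folds = k_fold_indices(10, 5)
train, val = train_val_for_fold(folds, 0)
print(len(train), len(val))  # 8 2
```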

How to access file once created

I can see that splitting folders works, but I do not know how to load the train, test and validation files for use after they are created.
Can anyone help? Or add code to the example at https://pypi.org/project/split-folders/
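A minimal sketch of reading the result back, assuming the output folder contains one subfolder per split (for example `train`, `val`, `test`) and, inside each, one subfolder per class. `list_split` is a hypothetical helper, and the split folder names are an assumption.

```python
from pathlib import Path

def list_split(output_dir, split="train"):
    """Map each class folder under output_dir/split to its sorted file paths."""
    split_dir = Path(output_dir) / split
    return {
        class_dir.name: sorted(p for p in class_dir.iterdir() if p.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }

# Assumed usage once the split has been created:
# train_files = list_split("output", "train")
# for class_name, paths in train_files.items():
#     print(class_name, len(paths))
```

The returned paths can then be passed to whatever loader the training code uses.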

splitfolders is not copying files but execute without error

This is my code snippet

`import splitfolders
import os

os.makedirs('output')
images = 'img'
splitfolders.ratio(images, output= 'output', seed=42, ratio = (.7,.2,.1))`

Feature Request : Specify file format(s)

There are cases when we have a pickle file, annotated files or other file formats in the same directory and we don't wish to split including them.

Custom folder names

Awesome library, it would be awesome if it's possible to add custom folder names. Great work!

Specify the exact number of items for training/validation/test sets

I understand that by version 0.4.3 it is possible to specify the exact number of items for the validation and test sets by using the flag --fixed, however, as the documentation states:

The remaining items constitute the training set. e.g. for train/val/test 100 100 or for train/val 100.

This means that you can currently specify the number of items for the validation and test sets but not for the training set. In a scenario where only a given range of images is needed (i.e. a subset of a larger dataset), it would be useful to be able to specify a fixed number of items for each of the sets.

More than an issue, a feature request. Thanks.

AttributeError: module 'split_folders' has no attribute 'ratio'

Hi, I ran "split_folders.ratio('img_data/', output="output", seed=1500, ratio=(.8, .1, .1))" and got AttributeError: module 'split_folders' has no attribute 'ratio'

My dataset has 15 sub-folders containing 100 images each.

Splitting all numbers of files of a folder into train val test using fixed will throw an error

If my folder has 4567 files and I want to split all of them into train, val and test folders with desired numbers:
splitfolders.fixed(input, output=output, seed=1337, fixed=(3000, 1000, 567))

it will show an error message:

if not len(files) > sum(fixed):
        raise ValueError(
            f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
        )

Suggested fix: use >= instead of >, so that if the fixed numbers sum to exactly the number of files, the split proceeds without error:

if not len(files) >= sum(fixed):
        raise ValueError(
            f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
        )
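The boundary case can be checked in isolation. The sketch below restates the proposed comparison as a standalone function; `check_enough_files` is a hypothetical name, and the numbers come from the report above.

```python
def check_enough_files(n_files, fixed):
    """Raise ValueError only when fewer files exist than the fixed counts need.

    Returns the number of files left over after the fixed counts are taken.
    """
    required = sum(fixed)
    if n_files < required:  # same condition as "not n_files >= required"
        raise ValueError(
            f"Only {n_files} files available but fixed={fixed} "
            f"requires at least {required}."
        )
    return n_files - required

print(check_enough_files(4567, (3000, 1000, 567)))  # 0, no error at the boundary
```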

Show progress bar

Hi, nice and useful little project.

May I suggest adding a tool like TQDM https://tqdm.github.io to show the progress of the copy for big datasets that take time to process?

It's easy to add, I think, and I could do a PR if you want, but I don't know the philosophy of your project (are external dependencies allowed?).
I'll wait for your reply.

Feature Request : Automatically set an optimal default for fixed when oversampling

The current default value for fixed is 100, but sometimes the samples can be even less than 100, causing it to raise an error.

Would it be better to set it to the smallest class count?

Does not work with relative paths

import splitfolders
splitfolders.ratio("./input_folder", output="./output", seed=1337, ratio=(.8, .1, .1), group_prefix=None)

Error:

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'input_folder'

My folder

root
- input_folder
  - Covid-19
  - No_findings
- output
- split_folder_script.ipynb

Error

I don't know why this doesn't work, help me please.

Ratio not found

import splitfolders

splitfolders.ratio("/Users/mavaylon/Research/Research_Gambier/Data_P/BP", output="/Users/mavaylon/Research/Research_Gambier/Data_P/output", seed=1337, ratio=(.7, .3), group_prefix=None) # default values

This just reports that ratio is not an attribute.
