jfilter / split-folders
Split folders with files (i.e. images) into training, validation and test (dataset) folders
License: MIT License
I wanted to split the images into train, val and test.
split_folders.ratio('data/images', output="data/images_new", seed=1337, ratio=(.8, .1, .1))
1) I noticed that there are many copies for the classes with small samples. I would like to find out whether using the ratio method automatically balances the data set.
Hello!
According to https://pypi.org/project/split-folders/, the default seed is 1337, so I'm assuming this always makes the train-val-test split the same?
Furthermore, a good idea would be to include an option to divide the train folder into X folds for cross-validation!
I have some experiments where I crop each image into small tiles. All tiles from one image share the same prefix. Is it possible to keep the tiles from the same source image together, either in train or in valid?
Thanks!
I get this error when I use the fixed function.
Error:
AssertionError Traceback (most recent call last)
in
----> 1 split_folders.fixed('C:/Users/TEXVNQA/Downloads/TL_DATA/', output="C:/Users/TEXVNQA/Downloads/GTS_Torch Format/", seed=1337, fixed=(135,135), oversample=False) # default values
c:\torch\lib\site-packages\split_folders\split.py in fixed(input, output, seed, fixed, oversample)
69 lens = []
70 for class_dir in dirs:
---> 71 lens.append(split_class_dir_fixed(class_dir, output, fixed, seed))
72
73 if not oversample:
c:\torch\lib\site-packages\split_folders\split.py in split_class_dir_fixed(class_dir, output, fixed, seed)
105 files = setup_files(class_dir, seed)
106
--> 107 assert len(files) > sum(fixed)
108
109 split_train = len(files) - sum(fixed)
AssertionError:
Hi! I've seen that you found the mistake in the split_class_dir_fixed function and changed the comparison from ">" to ">=", but I think you haven't published the update: when I ran pip install split-folders and used it, I got this error with the ">" sign still in the function. I hope you will publish it in the future (:
Hey @jfilter ,
First of all, let me thank you for your efforts in making our lives easier by developing this very handy tool. However, there is a problem with imbalanced numbers of files in folders. In my case I want to keep exactly 200 files for training and use the rest for test/validation, regardless of how many files are left for test/validation. If I do so following your code guidelines (fixed, i.e., 10), I get the following error. Could you please help me resolve this issue?
TypeError Traceback (most recent call last)
in ()
7
8 #no_files
----> 9 sp.fixed(data_dir, output=output_dir, seed=13, fixed=200, oversample=False, group_prefix=None) # default values
/usr/local/lib/python3.6/dist-packages/splitfolders/split.py in fixed(input, output, seed, fixed, oversample, group_prefix)
96 fixed = fixed
97
---> 98 assert len(fixed) in (1, 2)
99
100 if tqdm_is_installed:
TypeError: object of type 'int' has no len()
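The traceback above shows len(fixed) being evaluated, so in that version fixed apparently had to be a sequence such as (200,) rather than a bare integer. A minimal sketch of normalizing the value before the call; as_fixed_tuple is a hypothetical helper, not part of splitfolders:

```python
def as_fixed_tuple(fixed):
    """Wrap a bare int in a one-element tuple; return sequences as tuples."""
    if isinstance(fixed, int):
        # The trailing comma matters: (fixed) without it is still an int.
        return (fixed,)
    return tuple(fixed)


print(as_fixed_tuple(200))
print(as_fixed_tuple([100, 100]))
```

The result could then be passed as fixed=as_fixed_tuple(200). Note that the documentation quoted in a later issue on this page says the fixed numbers describe the validation and test sets, with the remaining items forming the training set.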
I have called splitfolders.ratio(input=input_folder, output=output_folder, seed=1337, ratio=(.9, 0.1)) as documented.
There is no error message. However, the files are not split and remain in the input folder, and the output folder does not change either. This happens with Python 3.9.
I have a directory containing two classes, 0 and 1, each consisting of subfolders with wave files. The structure is:
main folder
-- class 1
---- speaker1
------ wavefile1
------ wavefile2
------ wavefile3
---- speaker2 ........
-- class 2
---- speaker1
---- speaker2
........
Splitfolders is not working when using the syntax given in the example.
The ratio() function assumes there is already a folder class structure (input_dir/class1, input_dir/class2, etc.). It would be helpful if there were an argument for the ratio function that makes this assumption optional (e.g. splitfolders.ratio(classes=None)).
I get the following error at this line: split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))
Traceback (most recent call last):
  File "image_classification_2.py", line 53, in <module>
    split_folders.ratio('non_dup', output="data_", seed=42, ratio=(.8, .2))
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
    split_class_dir_ratio(class_dir, output, ratio, seed)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
    copy_files(li, class_dir, output)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
    class_name = path.split(class_dir)[1]
  File "/home/ubuntu/anaconda3/lib/python3.5/posixpath.py", line 103, in split
    i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'
I would like splitfolders to be able to split CSV files whose rows are labeled fake or real, producing train_fake.csv and test_fake.csv files as well as Train_real.csv and Test_real.csv files respectively.
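Nothing in this thread indicates that splitfolders handles CSV rows, so the following is only a standard-library sketch of the requested behavior. The column name label, the 80/20 ratio, and the function name split_csv_by_label are assumptions made for illustration.

```python
import csv
import os
import random
from collections import defaultdict


def split_csv_by_label(csv_path, out_dir, label_column="label",
                       train_ratio=0.8, seed=1337):
    """Write train_<label>.csv and test_<label>.csv for each label value."""
    rows_by_label = defaultdict(list)
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_label[row[label_column]].append(row)

    os.makedirs(out_dir, exist_ok=True)
    rng = random.Random(seed)  # seeded so the split is reproducible
    counts = {}
    for label, rows in rows_by_label.items():
        rng.shuffle(rows)
        cut = int(len(rows) * train_ratio)
        for name, part in (("train", rows[:cut]), ("test", rows[cut:])):
            out_path = os.path.join(out_dir, "{}_{}.csv".format(name, label))
            with open(out_path, "w", newline="") as out:
                writer = csv.DictWriter(out, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(part)
            counts[(name, label)] = len(part)
    return counts
```

Each label is split on its own, so the fake/real proportions are the same in the train and test files.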
Hi, thanks for developing this library. I just wonder whether it would make sense, and be possible, to allow giving as input_folder the complete path to the folder that contains the files. For example, I'm working with images and annotations and I have the following structure:
data\
  - images\
    - im_1.jpg
    - im_2.jpg
    - ...
  - annotations\
    - im_1.xml
    - im_2.xml
    - ...
The problem is that the ratio function asks for input_folder to be the path to a directory, which in my case would be data. But this also splits the annotations. That would be great if the split for the annotations mirrored the split for the images, but apparently it does not. It seems that the split for images is independent of the split for annotations: for example, I can find im_1.jpg in the train folder and im_1.xml in the validation folder.
Thanks, and keep on the excellent work that you are doing.
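One way to keep an image and its annotation in the same split is to split on the shared file stem rather than on each folder independently. The sketch below uses only the standard library and is not a splitfolders feature; paired_split and its parameters are hypothetical, and it assumes every image has an .xml file with the same stem.

```python
import random
from pathlib import Path


def paired_split(images_dir, annotations_dir, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Return {'train'|'val'|'test': [(image_path, xml_path), ...]}."""
    images = {p.stem: p for p in Path(images_dir).iterdir() if p.is_file()}
    stems = sorted(images)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(stems)

    n_train = int(len(stems) * ratio[0])
    n_val = int(len(stems) * ratio[1])
    parts = {
        "train": stems[:n_train],
        "val": stems[n_train:n_train + n_val],
        "test": stems[n_train + n_val:],
    }
    # Pair every image with the annotation that shares its stem.
    return {
        name: [(images[s], Path(annotations_dir) / (s + ".xml")) for s in part]
        for name, part in parts.items()
    }
```

The function only returns path pairs; copying them into train/val/test folders (for example with shutil.copy2) is left to the caller.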
import pathlib
import splitfolders
input_folder=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/train")
print(input_folder)
output=pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/Train_val_split")
print(output)
# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
splitfolders.ratio(input_folder, output=output,
seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values
I am using the above code for splitting images and getting the error below:
AttributeError Traceback (most recent call last)
in ()
8 # To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
9 splitfolders.ratio(input_folder, output=output,
---> 10 seed=1337, ratio=(.8, .2, ), group_prefix=None, move=True) # default values
4 frames
/usr/lib/python3.7/shutil.py in _basename(path)
524 # Thus we always get the last component of the path, even for directories.
525 sep = os.path.sep + (os.path.altsep or '')
--> 526 return os.path.basename(path.rstrip(sep))
527
528 def move(src, dst, copy_function=copy2):
AttributeError: 'PosixPath' object has no attribute 'rstrip'
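The traceback above fails when rstrip, a string method, is called on a PosixPath, which suggests that the code path involved expected plain strings. A possible workaround, offered as a sketch rather than a confirmed fix, is to convert the paths with str() before passing them on. The commented-out call mirrors the snippet above and is an assumption about how the result would be used.

```python
import pathlib

input_folder = pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/train")
output = pathlib.Path("/content/drive/MyDrive/StanfordCarsDataset/Train_val_split")

# Convert to plain strings so that string methods such as rstrip() exist.
input_str = str(input_folder)
output_str = str(output)

print(type(input_str).__name__, type(output_str).__name__)

# Assumed usage, mirroring the call in the issue:
# splitfolders.ratio(input_str, output=output_str, seed=1337, ratio=(.8, .2), move=True)
```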
I did a split_folders with --ratio .8 .2, and got the following error:
Traceback (most recent call last):
File "/root/.virtualenvs/data-manipulation/bin/split_folders", line 27, in
split_folders.ratio(args.input, args.output, args.seed, args.ratio)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 58, in ratio
split_class_dir_ratio(class_dir, output, ratio, seed)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 126, in split_class_dir_ratio
copy_files(li, class_dir, output)
File "/root/.virtualenvs/data-manipulation/lib/python3.5/site-packages/split_folders/split.py", line 148, in copy_files
class_name = path.split(class_dir)[1]
File "/root/.virtualenvs/data-manipulation/lib/python3.5/posixpath.py", line 103, in split
i = p.rfind(sep) + 1
AttributeError: 'PosixPath' object has no attribute 'rfind'
The same problem happens with Python 2.7.12 and Python 3.5.2.
I would like to have equal ratio of classes in the training and test set. Can we add this feature?
Hello,
It would be helpful to have .xml file copy support behind a flag like --copy_xml, to support datasets in VOC structure. The copy_files method could be changed in some way like the following:
import pathlib
import shutil
from os import path


def copy_files(files_type, class_dir, output, prog_bar, copy_xml=False):
    """Copies the files from the input folder to the output folder
    """
    # get the last part of the class directory path (the class name)
    class_name = path.split(class_dir)[1]
    for (files, folder_type) in files_type:
        full_path = path.join(output, folder_type, class_name)
        pathlib.Path(full_path).mkdir(parents=True, exist_ok=True)
        for f in files:
            if prog_bar is not None:
                prog_bar.update()
            extension = path.splitext(path.split(f)[-1])[-1].lower()
            # splitext keeps the leading dot, so every entry needs one
            if extension in [".jpg", ".png", ".bmp", ".jpeg", ".gif"]:
                shutil.copy2(f, full_path)
                if copy_xml:
                    # copy the annotation that shares the image's base name
                    xml_f = path.splitext(f)[0] + ".xml"
                    shutil.copy2(xml_f, full_path)
Since splitting data into (test, train, validation) sets is relevant to all data types, not just ones that are related to different classes, having the option to use split-folders on a general folder, i.e. one that contains the actual data and does not comply with the subdir ('class1', 'class2', ...) hierarchy, would make this package relevant to a much larger crowd.
I have a dataset with images in this format: image_1.png, image_2.png, ..., image_130.png, ..., image_1301.png, image_1302.png, ..., and my label files follow the same convention.
When I use group_prefix = 2, I get an error message saying it has found multiple matches for image_130.png. I am not a Python expert, but looking at your group_by_prefix function in splitfolders/split.py on line 190, it checks whether the file name starts with the prefix instead of checking for an exact match on the file name before the file extension. So for image_130.png, the function is going to find a match for image_130.png, image_1300.png, image_1301.png, etc.
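The difference between prefix matching and exact matching on the stem can be illustrated with a small standalone sketch. It does not reproduce the library's group_by_prefix; group_by_exact_stem is a hypothetical helper, and the .txt label extension is an assumption.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_exact_stem(filenames):
    """Group file names whose stem (name without extension) is identical."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    return dict(groups)


names = ["image_130.png", "image_130.txt", "image_1300.png", "image_1301.png"]

# Prefix matching: "image_130" also matches image_1300 and image_1301.
prefix_matches = [n for n in names if n.startswith("image_130")]
print(len(prefix_matches))  # 4

# Exact-stem matching keeps image_130 separate from image_1300.
print(group_by_exact_stem(names)["image_130"])  # ['image_130.png', 'image_130.txt']
```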
(cv2020) hduser@slave1:~/computer_vision_2020/staff_detecion_using_icard/data$ split_folders images --ratio .8 .2
Copying files: 0 files [00:00, ? files/s]
When it finished, it reported copying 0 files.
As I am working with time series data, the data cannot be shuffled.
The data volume is too big to load the full dataset into RAM, so I will use keras.train_on_batch.
Splitting the data into train, test and validation folders would be helpful. I am trying to use the split-folders library, but I could not work out how to avoid the shuffling step. Is it possible to deactivate the shuffle?
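For ordered data, a split can be made by cutting the sorted sequence into contiguous parts instead of shuffling. The following is a standalone sketch, not a splitfolders option; ordered_split and the day_NNN.csv file names are made up for illustration.

```python
def ordered_split(items, ratio=(0.8, 0.1, 0.1)):
    """Split an already ordered sequence into contiguous train/val/test parts."""
    items = list(items)
    n_train = int(len(items) * ratio[0])
    n_val = int(len(items) * ratio[1])
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# Zero-padded names sort chronologically when sorted as strings.
files = sorted("day_{:03d}.csv".format(i) for i in range(10))
train, val, test = ordered_split(files)
print(train[-1], val, test)
```

Because no shuffling happens, every training file precedes every validation file, which in turn precedes every test file.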
Hi,
first off, I really like this function. It would, however, be nice to have a feature that just splits and outputs the file paths for train, val and test without actually moving or copying any files.
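A paths-only split of this kind could look roughly like the sketch below. It is written against the standard library only and assumes the input_dir/class_name/files layout described elsewhere on this page; split_paths is a hypothetical name, not an existing splitfolders function.

```python
import random
from pathlib import Path


def split_paths(input_dir, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Return {'train': [...], 'val': [...], 'test': [...]} without touching files."""
    result = {"train": [], "val": [], "test": []}
    class_dirs = sorted(p for p in Path(input_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        random.Random(seed).shuffle(files)  # same seed gives the same split
        n_train = int(len(files) * ratio[0])
        n_val = int(len(files) * ratio[1])
        result["train"] += files[:n_train]
        result["val"] += files[n_train:n_train + n_val]
        result["test"] += files[n_train + n_val:]
    return result
```

The returned lists could be written to text files or fed directly to a data loader.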
Some datasets provide a train_test_split doc, which may have image number to train/test mapping. Can we have a feature to do this?
splitfolders --ratio .8 .2 img_folder
throws the following error
usage: splitfolders [-h] [--output OUTPUT] [--seed SEED]
[--ratio RATIO [RATIO ...]] [--fixed FIXED [FIXED ...]]
[--oversample] [--group_prefix GROUP_PREFIX]
input
splitfolders: error: argument --ratio: invalid float value: 'img_folder'
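The usage text shows --ratio accepting one or more values, so the error is consistent with img_folder being consumed as a third ratio value. The sketch below rebuilds a stand-in parser from that usage text alone (it is not the project's actual CLI code) and shows that placing the positional input first avoids the ambiguity.

```python
import argparse

# Hypothetical stand-in for the real CLI, based only on the usage text above.
parser = argparse.ArgumentParser(prog="splitfolders")
parser.add_argument("--ratio", nargs="+", type=float)
parser.add_argument("input")

# Positional argument first, so --ratio only sees the numbers.
args = parser.parse_args(["img_folder", "--ratio", ".8", ".2"])
print(args.input, args.ratio)
```

On the command line this ordering would correspond to: splitfolders img_folder --ratio .8 .2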
import split_folders
Traceback (most recent call last):
File "", line 1, in
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/__init__.py", line 1, in
from .split import *
File "/home/linux/Ishan_work/tf_venv/lib/python3.5/site-packages/split_folders/split.py", line 109
raise ValueError(f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.')
^
SyntaxError: invalid syntax
It's too slow to copy many images
It seems the project does not support a cross-validation split, or does it? It would be nice to implement it.
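Until such a feature exists, k-fold partitions can be generated separately. This is a minimal standard-library sketch of k-fold index generation; k_fold_indices is a hypothetical helper and is not part of splitfolders.

```python
import random


def k_fold_indices(n_items, k, seed=1337):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


for train, val in k_fold_indices(10, k=5):
    print(len(train), len(val))
```

Each index appears in exactly one validation fold, and the indices can be used to select files from a sorted file list.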
I can see that splitting the folders works, but I do not know how to load the train, test and validation files after they have been created.
Can anyone help, or add code to the example at https://pypi.org/project/split-folders/?
This is my code snippet
`import splitfolders
import os
os.makedirs('output')
images = 'img'
splitfolders.ratio(images, output= 'output', seed=42, ratio = (.7,.2,.1))`
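Assuming the output folder ends up with the layout output/train/<class>/..., output/val/<class>/... and output/test/<class>/... (which is what the project description of splitting into training, validation and test folders suggests), the files of each split can be collected with pathlib. list_split is a hypothetical helper written for this sketch.

```python
from pathlib import Path


def list_split(output_dir, split):
    """Return {class_name: [file paths]} for one split folder, e.g. 'train'."""
    split_dir = Path(output_dir) / split
    return {
        class_dir.name: sorted(p for p in class_dir.iterdir() if p.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


# Only runs if the folders from the snippet above actually exist.
for split in ("train", "val", "test"):
    if (Path("output") / split).is_dir():
        for class_name, files in list_split("output", split).items():
            print(split, class_name, len(files))
```

Frameworks that read one folder per class can usually be pointed at output/train, output/val and output/test directly.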
There are cases when we have a pickle file, annotated files or other file formats in the same directory and we don't wish to split including them.
Awesome library, it would be awesome if it's possible to add custom folder names. Great work!
I understand that as of version 0.4.3 it is possible to specify the exact number of items for the validation and test sets by using the flag --fixed. However, as the documentation states:
The remaining items constitute the training set. e.g. for train/val/test 100 100 or for train/val 100.
This means that you can currently specify the number of items for the validation and test sets but not for the training set. In a scenario where only a given range of images (i.e. a subset of a larger dataset) should be used, it would be useful to be able to specify a fixed number of items for each of the sets.
More than an issue, a feature request. Thanks.
Hi, I ran "split_folders.ratio('img_data/', output="output", seed=1500, ratio=(.8, .1, .1))" and got AttributeError: module 'split_folders' has no attribute 'ratio'
My dataset has 15 sub-folders containing 100 images each.
If my folder has 4567 files and I want to split all of them into train, val and test folders with the desired numbers:
splitfolders.fixed(input, output=output, seed=1337, fixed=(3000, 1000, 567))
it will show an error message:
if not len(files) > sum(fixed):
    raise ValueError(
        f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
    )
Suggestion to fix the bug: use >= instead of >, so that if the fixed numbers sum to exactly the number of files, it can proceed without error:
if not len(files) >= sum(fixed):
    raise ValueError(
        f'The number of samples in class "{class_dir.stem}" are too few. There are only {len(files)} samples available but your fixed parameter {fixed} requires at least {sum(fixed)} files. You may want to split your classes by ratio.'
    )
Hi, nice and useful little project.
May I suggest adding a tool like tqdm (https://tqdm.github.io) to show the progress of the copy for big datasets that take time to process?
I think it's easy to add, and I could do a PR if you want, but I don't know the philosophy of your project (are external dependencies allowed?).
I'll wait for your reply.
The current default value for fixed is 100, but sometimes the samples can be even less than 100, causing it to raise an error.
Would it be better to set it to the smallest class count?
import splitfolders
splitfolders.ratio("./input_folder", output="./output", seed=1337, ratio=(.8, .1, .1), group_prefix=None)
Error:
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'input_folder'
My folder
import splitfolders
splitfolders.ratio("/Users/mavaylon/Research/Research_Gambier/Data_P/BP", output="/Users/mavaylon/Research/Research_Gambier/Data_P/output", seed=1337, ratio=(.7, .3), group_prefix=None) # default values
This just returns ratio is not an attribute.