Comments (20)
Don't want you to think I dropped off this one. My personal goal is to have this in good shape before the month's end.
Sent with GitHawk
from pyjanitor.
Thanks, sounds great. Just to add: the idea generally came to me because, when I use R, I find myself using the readbulk library. And yes, I do like your idea for read_csvs; maybe later we can work on read_xlsxs.
Sent with GitHawk
I like where the proposed implementations are going, however I would like to point out what was mentioned by @szuckerman again. Although janitor favors method chaining, it does not seem intuitive to me that the function is implemented as df.read_csvs(...).
In the current implementation the df argument is not used and is overwritten on return.
I propose moving (as suggested) the function to a janitor.io module.
This is my current implementation proposal:
import os
from glob import glob

import pandas as pd


def read_csvs(filespath: str, seperate_df: bool = False, **kwargs):
    """
    :param filespath: The string pattern matching the CSV files. Accepts
        glob patterns, with or without the .csv extension.
    :param seperate_df: If False (default), returns a single DataFrame
        with the concatenation of the CSV files. If True, returns a
        dictionary of separate DataFrames for each CSV file.
    :param kwargs: Keyword arguments to pass into the original pandas
        `read_csv`.
    """
    # Sanitize input
    assert filespath is not None
    assert len(filespath) != 0

    # Check if the original filespath contains .csv
    if not filespath.endswith(".csv"):
        filespath += ".csv"

    # Read the CSV files
    dfs = {os.path.basename(f): pd.read_csv(f, **kwargs) for f in glob(filespath)}

    # Check if any dataframes have been read
    if len(dfs) == 0:
        raise ValueError("No CSV files to read with the given filespath")

    # Concatenate the dataframes if requested (default)
    if seperate_df:
        return dfs
    try:
        return pd.concat(list(dfs.values()), ignore_index=True, sort=False)
    except ValueError:
        raise ValueError("Input CSV files cannot be concatenated")
It takes a single argument for the file path, which accepts glob patterns (via the glob module).
It is not a pandas DataFrame method, so it cannot be chained (that would not make sense logically).
By default it concatenates the CSV files into a single DataFrame.
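As a concrete illustration of that default behavior, here is a minimal, self-contained sketch (the temporary directory, file names, and column are invented for the demo):

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small throwaway CSV files into a temporary directory.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(tmpdir, "a.csv"), index=False)
pd.DataFrame({"x": [3, 4]}).to_csv(os.path.join(tmpdir, "b.csv"), index=False)

# Glob-match every CSV and concatenate into a single DataFrame,
# as the default (seperate_df=False) path does.
matched = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
combined = pd.concat(
    [pd.read_csv(f) for f in matched], ignore_index=True, sort=False
)
```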
Looks good!
A few comments:
1.

if not filespath.endswith(".csv"):
    filespath += ".csv"

I'm not sure we need to append ".csv" to every file. There are many instances where multiple files may not have a .csv filename, but will be comma delimited. It's more common for tab-separated files, though. In that case, I would propose adding a sep argument, similar to how you can do pd.read_csv('file', sep="\t") to read a tab-delimited file.
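To illustrate, sep is already accepted by pd.read_csv, so it could simply ride along through **kwargs (an in-memory buffer stands in for a real file here):

```python
import io

import pandas as pd

# A tab-delimited "file" simulated with an in-memory buffer.
tsv = io.StringIO("a\tb\n1\t2")

# Forwarding sep (e.g. from **kwargs) to read_csv handles the delimiter.
df = pd.read_csv(tsv, sep="\t")
```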
2.

dfs = {
    os.path.basename(f): pd.read_csv(f, **kwargs)
    for f in glob(filespath)
}
I like that you want to keep the filename to reference the DataFrame, but what if someone doesn't know all the filenames that are in there? It will be a bit difficult to traverse a dictionary without knowing what all the keys are. Obviously one can iterate over dfs.keys(), but that gets a bit tedious. Maybe return it as a namedtuple that has filename and data arguments, so people can access the DataFrames in a list but also have access to a filename descriptor.
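A rough sketch of that namedtuple idea (the CsvFile name and the in-memory buffers are invented for illustration):

```python
import io
from collections import namedtuple

import pandas as pd

# Hypothetical container pairing a filename with its DataFrame.
CsvFile = namedtuple("CsvFile", ["filename", "data"])

# Stand-ins for files on disk: two small in-memory CSV buffers.
sources = {
    "customers.csv": io.StringIO("id\n1\n2"),
    "sales.csv": io.StringIO("id\n3"),
}

# Return a list, so results can be traversed without knowing the keys,
# while each entry still carries a filename descriptor.
results = [CsvFile(name, pd.read_csv(buf)) for name, buf in sources.items()]
```

Users could then write results[0].filename and results[0].data without first discovering the dictionary keys.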
@jcvall this sounds like a great idea! Before you go too deep into this, have you seen any nice implementations of the functionality? For example, from a cursory search, I saw this implementation, which looks nice!
As for the function signature, what do you think about the following design?
import pandas as pd
from typing import Union
from pathlib import Path


def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str,
    filetype: str,
    **kwargs
):
    """
    :param df: A pandas DataFrame.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
I chose read_csvs because it is a plural version of read_csv, and hence easily carries the meaning of the original function, and I think read_csv is the most commonly used file I/O function in pandas (at least for me, it is). For a starter version of the function, it'd probably also be enough scope; I think we can wait for the case where others need read_xlsxs or read_hdfs to have a contribution from them.
What are your thoughts? Naturally, happy to leave the implementation details to you. Don't forget that a test for such a function would require multiple dummy csv files (which could be really dummy, 3 rows per file kind of data).
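For a picture of what such a test could look like, here is a sketch using pytest's tmp_path fixture; the inlined read_csvs stand-in is an assumption for the demo, not the final pyjanitor API:

```python
from glob import glob

import pandas as pd


def read_csvs(filespath, **kwargs):
    # Minimal stand-in for the proposed function (assumed behavior:
    # concatenate every matched CSV into one DataFrame).
    return pd.concat(
        [pd.read_csv(f, **kwargs) for f in sorted(glob(filespath))],
        ignore_index=True,
        sort=False,
    )


def test_read_csvs(tmp_path):
    # Three really dummy CSV files, 3 rows per file.
    for i in range(3):
        pd.DataFrame({"a": [i, i, i]}).to_csv(
            tmp_path / f"file_{i}.csv", index=False
        )
    df = read_csvs(str(tmp_path / "*.csv"))
    assert len(df) == 9
    assert df["a"].tolist() == [0, 0, 0, 1, 1, 1, 2, 2, 2]
```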
Ok! Looking forward to your contribution. Thanks for being active with the project!
Sorry I have been away for a while. I am happy to say I was hired as a data analyst just last week and will be coding in Python and R full time. I am catching up on the conversations and want to let you know I will go with what you think is best for releases. I have been working on the read_csvs() function and will submit it soon. Should I do anything special for the branch when I do?
Sent with GitHawk
I am happy to say I have been hired as a data analyst just last week and will be coding in python and r full time.
Congratulations! This is a wonderful opportunity to continue in the data world.
Should I do anything special for the branch when I do?
Yes, be sure to update your fork!
There's a few ways to do this - the easiest is to delete your fork (on GitHub and locally) and then fork from my master again.
We recently made a few changes; I hope you've been keeping up with them. @zbarry updated the contribution guide. The key changes are:
- We now PR into dev by default. (Don't worry, GitHub will take care of this for you.) master is reserved for releases.
- Each function should be tested with at least one test, in its own test_<function_name>.py.
- We started using test "fixtures" (i.e. pre-built dataframes, basically, that can be used). (For this function, I think you don't have to worry about it.)
Anyways, we'll knock out what needs to be done when we get to that stage.
For now, congrats again on the new job! And looking forward to seeing your contribution!
Something to think about... is the end goal for this method to read all the csvs into one DataFrame, or make separate DataFrames from all the csvs?
I thought it was the former, but I've noticed that I perform the latter constantly and it might be useful. For example, a directory with [customers.csv, sales.csv] would be useful to just turn into customers_df and sales_df automatically (probably with a list that gets returned to let users know how many DataFrames actually got created).
Note: For the former implementation, unfortunately there's going to need to be a ton of error checking on the files to ensure that one of the csvs isn't malformed and destroys the entire DataFrame.
@ericmjl I think this function actually brings up an interesting issue for pyjanitor in the sense that it's not really a method on a DataFrame; it's more top-level than that. Since pyjanitor is based on methods on instantiated DataFrames, this function would either need to be implemented like this:

import janitor
pd.DataFrame().read_folderfiles()

or like this:

import janitor
janitor.read_folderfiles()
@ericmjl I think this function actually brings up an interesting issue for pyjanitor in the sense that it's not really a method on a DataFrame; it's more top-level than that. Since pyjanitor is based on methods on instantiated DataFrames, this function would either need to be implemented like this:
Yes, that's definitely true, @szuckerman.
@jcvall, I'd probably add a new io.py (I/O = input/output) module, and then add the read_csvs function there. That way, one can do what Sam's 2nd example is like, and cleanly return a pandas DataFrame.
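A sketch of what that janitor/io.py module could contain (the body is an assumption based on the earlier proposal in this thread, not the actual pyjanitor source):

```python
# Contents of a hypothetical janitor/io.py: a top-level function that
# needs no DataFrame argument and cleanly returns a pandas DataFrame.
from glob import glob

import pandas as pd


def read_csvs(filespath: str, **kwargs) -> pd.DataFrame:
    """Read and concatenate all CSV files matching ``filespath``."""
    return pd.concat(
        [pd.read_csv(f, **kwargs) for f in sorted(glob(filespath))],
        ignore_index=True,
        sort=False,
    )
```

Callers would then use Sam's second style: import janitor, then janitor.io.read_csvs("data/*.csv").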
Sounds good. Hope to get you something this weekend to look over.
Sent with GitHawk
Wanted your thoughts. Certainly not finished, but I just want some feedback.
import glob
import os
from pathlib import Path
from typing import Union

import pandas as pd


def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    sep: str = ",",
    skiprows: int = None,
    compression: str = "infer",
    encoding: str = "latin1",
    low_memory: bool = True,
    seperate_df: bool = False,
    filetype: str = "csv",
    **kwargs
):
    """
    :param df: A pandas DataFrame.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param sep: Delimiter; default is ",".
    :param skiprows: Line numbers to skip (0-indexed) or number of lines
        to skip (int) at the start of the file.
    :param compression: For on-the-fly decompression of on-disk data.
        If 'infer' and filepath_or_buffer is path-like, detect compression
        from the following extensions: '.gz', '.bz2', '.zip', or '.xz'
        (otherwise no decompression). If using 'zip', the ZIP file must
        contain only one data file to be read in. Set to None for no
        decompression.
    :param encoding: Encoding to use for UTF when reading/writing
        (e.g. 'utf-8'). Default is 'latin1'.
    :param low_memory: Internally process the file in chunks, resulting in
        lower memory use while parsing, but possibly mixed type inference.
        To ensure no mixed types, either set False, or specify the type
        with the dtype parameter.
    :param seperate_df: If True, returns a dictionary of separate
        DataFrames for each CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    read_opts = dict(
        sep=sep,
        compression=compression,
        low_memory=low_memory,
        encoding=encoding,
        skiprows=skiprows,
    )
    files = glob.glob(os.path.join(directory, pattern))
    if seperate_df:
        dfs = {os.path.basename(f): pd.read_csv(f, **read_opts) for f in files}
        print("Use dfs.get(key) to get dataframes.")
        print("List of keys:")
        for key in dfs:
            print(key)
        return dfs
    else:
        return pd.concat(
            [
                pd.read_csv(f, **read_opts).assign(filename=os.path.basename(f))
                for f in files
            ],
            ignore_index=True,
            sort=False,
        )
One of the thoughts I have on the pattern feature is that it will usually take ".csv", but it may have to change if the file is compressed; in that case the user would use ".gz", for example. I wanted to keep some of the features of the original pd.read_csv, as I often run into errors with encoding, skiprows or low_memory that need to be addressed. I am still new to this but will be learning in earnest as I go.
As a side note, I really wanted to add the tqdm_notebook() status bar, from the tqdm package, and have a progress bar included as a feature. I loved this, as it gave me an estimated time and progress bar as each file was loaded, getting closer to 100%. It looks like you need a JavaScript notebook extension, which is not a big deal, but I was afraid it would turn people off. If you want me to add it I can, just to see what it looks like, and if you don't like it I can take it out.
Thanks.
Nice work, @jcvall! Looking carefully at the code you wrote, I think there's some places that can be shortened.
First thing I noticed was that you had kwargs specified in there. I think that's a great start! We could condense the read_csvs kwargs into there, and simply let them pass through to the read_csv call.
Second thing I did was remove the printing (I'm guessing you may have been using them as a debugging statement).
Third thing I did was manually apply some Black-style formatting.
def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    seperate_df: bool = False,
    **kwargs
):
    """
    :param df: A pandas DataFrame.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param seperate_df: If True, returns a dictionary of separate
        DataFrames for each CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    if seperate_df:
        dfs = {
            os.path.basename(f): pd.read_csv(f, **kwargs)
            for f in glob.glob(os.path.join(directory, pattern))
        }
        return dfs
    else:
        df = pd.concat(
            [
                pd.read_csv(f, **kwargs).assign(filename=os.path.basename(f))
                for f in glob.glob(os.path.join(directory, pattern))
            ],
            ignore_index=True,
            sort=False,
        )
        return df
On using tqdm, I think you can take a look at chemistry.py, which contains an example usage of tqdm. Because the primary use of pyjanitor has been in the notebook (for me at least), I made it an optional kwarg by setting a default value.
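Following that pattern, an optional progressbar kwarg could be sketched like this (the kwarg name and the lazy import are assumptions, not the chemistry.py code):

```python
from glob import glob

import pandas as pd


def read_csvs(filespath: str, progressbar: bool = False, **kwargs):
    # Wrap the file list in tqdm only when the caller opts in, so the
    # dependency stays optional and scripts are not cluttered by default.
    files = sorted(glob(filespath))
    if progressbar:
        from tqdm import tqdm  # lazy import: only needed when requested

        files = tqdm(files)
    return pd.concat(
        [pd.read_csv(f, **kwargs) for f in files],
        ignore_index=True,
        sort=False,
    )
```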
I would like to work on this task!
Sure!
Sent with GitHawk
For inspiration: dask already has this functionality, in that you can use wildcards to read all the files with similar filenames.
The difference with dask, though, is that since it processes on a distributed system, all the files "live separately" when running a function on the dask dataframe. In our case they would need to be combined into one DataFrame. (Or, maybe have an option to return one DataFrame or a list of DataFrames?)
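A plain-pandas sketch of that "one DataFrame or a list" option (in dask itself the wildcard read would be dd.read_csv("data/*.csv"); the helper name here is invented):

```python
from glob import glob

import pandas as pd


def read_wildcard(pattern: str, combine: bool = True):
    # Mimic dask's wildcard reading with plain pandas: either combine all
    # matched files into one DataFrame, or return them as separate frames.
    frames = [pd.read_csv(f) for f in sorted(glob(pattern))]
    if combine:
        return pd.concat(frames, ignore_index=True, sort=False)
    return frames
```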
@dave-frazzetto, would you be kind enough to help me regain context here: has a PR been made, and if not, would you like to put one in for this issue?
Ah, I just realized, the io module is available! Closing!