TableIO


A small "glue" package for reading and writing tabular data. It provides a uniform API for reading tabular data from and writing it to multiple sources. The package is "intelligent" in the sense that it automatically selects the appropriate reading / writing method based on the file extension.
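
Under the hood this means that read_table and write_table! inspect the file extension and pick the matching format handler. A minimal sketch of the general idea, for illustration only (this is not TableIO's actual implementation):

# illustration of extension-based selection, not TableIO's real code
function reader_for(filename)
    ext = lowercase(splitext(filename)[2]) # e.g. ".csv"
    ext == ".csv" && return :csv
    ext == ".jdf" && return :jdf
    error("unsupported file extension: $ext")
end
reader_for("my_data.csv") # -> :csv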

Supported Formats

The underlying packages are not direct dependencies of TableIO and are therefore not installed automatically with it. This keeps the installation size small and the package load time short.

To install the Python requirements for Pandas HDF5 support, use the following Julia commands:

ENV["PYTHON"] = "" # to use a separate Conda environment for Julia
using Pkg
Pkg.add(["PyCall", "Conda", "Pandas"])
Pkg.build("PyCall")
using Conda
Conda.add("pandas")
Conda.add("pytables")

Installation

using Pkg
Pkg.add("TableIO")
using TableIO

Before using a specific format, the corresponding package needs to be installed and imported:

] add JDF
import JDF # or using JDF

If the format-specific package is not installed or not imported, an error is raised.

Reading Data

The function read_table reads a data source (file or database) and returns an object implementing the Tables.jl interface, e.g. for constructing a DataFrame.
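
Because the returned object implements the Tables.jl interface, it can be consumed by any Tables.jl sink, not only DataFrame. A small sketch of an alternative sink (file name is illustrative):

using TableIO, Tables, CSV
tbl = read_table("my_data.csv")
nt = Tables.columntable(tbl) # materialize as a NamedTuple of column vectors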

using TableIO, DataFrames

CSV Format:

using CSV
df = DataFrame(read_table("my_data.csv"); copycols=false) # Keyword arguments can be passed to the CSV reader (CSV.jl)
df = DataFrame(read_table("my_data.zip"); copycols=false) # zipped CSV format (assuming there is only 1 file in the archive)

JSON Format:

using Dates, JSONTables
df = read_table("my_data.json") |> DataFrame # note that DataFrame(...; copycols=false) gives wrong column types here!
df.my_date_col = Date.(df.my_date_col) # dates are imported as strings by default and need to be converted manually

df = read_table("my_data.zip", "my_data.json") |> DataFrame

Binary Formats:

using JDF
df = DataFrame(read_table("my_data.jdf"); copycols=false) # JDF (compressed binary format)

using Parquet
df = DataFrame(read_table("my_data.parquet"); copycols=false) # Parquet

using Arrow
df = DataFrame(read_table("my_data.arrow"); copycols=false) # Apache Arrow

import Pandas # using gives a naming conflict
df = DataFrame(read_table("my_data.hdf", "key"); copycols=false) # HDF5 (via Pandas)

Excel:

using XLSX
df = DataFrame(read_table("my_data.xlsx"); copycols=false) # imports 1st sheet
df = DataFrame(read_table("my_data.xlsx", "MyAwesomeSheet"); copycols=false) # imports named sheet

SQLite:

using SQLite
df = DataFrame(read_table("my_data.db", "my_table"); copycols=false) # SQLite from file, table name must be given

Alternatively, SQLite database objects could be used:

using SQLite
sqlite_db = SQLite.DB("my_data.db")
df = DataFrame(read_table(sqlite_db, "my_table"); copycols=false) # SQLite from database connection, table name must be given

PostgreSQL:

using LibPQ, CSV # CSV is required here because `write_table!` for PostgreSQL depends on CSV
postgres_conn = LibPQ.Connection("dbname=postgres user=postgres")
df = DataFrame(read_table(postgres_conn, "my_table"); copycols=false) # reading from Postgres connection

StatFiles.jl integration:

using StatFiles
df = DataFrame(read_table("my_data.dta"); copycols=false) # Stata
df = DataFrame(read_table("my_data.sav"); copycols=false) # SPSS
df = DataFrame(read_table("my_data.sas7bdat"); copycols=false) # SAS

For data formats supporting multiple tables inside a file, the function list_tables returns an alphabetically sorted list of table names.

table_names = list_tables(filename)
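
For example, for an SQLite database file (file and table names are illustrative):

using TableIO, DataFrames, SQLite
table_names = list_tables("my_data.db") # e.g. ["my_table", "other_table"]
df = DataFrame(read_table("my_data.db", first(table_names)); copycols=false)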

Writing Data

The function write_table! writes a Tables.jl-compatible data source into a file or database.

using TableIO, DataFrames

CSV Format:

using CSV
write_table!("my_data.csv", df)

write_table!("my_data.zip", df) # zipped CSV. If no "inner" file name is given, the table is always stored in csv format with the same file name as the zip archive

JSON Format:

using JSONTables
write_table!("my_data.json", df) # default: dictionary of arrays
write_table!("my_data.json", df, orientation=:objecttable) # dictionary of arrays (explicit)
write_table!("my_data.json", df, orientation=:arraytable) # array of dictionaries

write_table!("my_data.zip", "my_data.json", df) # need to explicitly give a file name inside zip archive, otherwise csv format is used

Binary Formats:

using JDF
write_table!("my_data.jdf", df) # JDF (compressed binary format)

using Parquet
write_table!("my_data.parquet", df) # Parquet - note that Date element type is not supported yet

using Arrow
write_table!("my_data.arrow", df) # Apache Arrow

import Pandas # using gives a naming conflict
write_table!("my_data.hdf", "key", df) # HDF5 (via Pandas)

Excel:

write_table!("my_data.xlsx", "test_sheet_42", df) # creates sheet with defined name

SQLite:

using SQLite
write_table!("my_data.db", "my_table", df) # SQLite from file, table must not exist

sqlite_db = SQLite.DB("my_data.db")
write_table!(sqlite_db, "my_table", df) # SQLite from database connection

PostgreSQL:

using LibPQ, CSV
postgres_conn = LibPQ.Connection("dbname=postgres user=postgres")
write_table!(postgres_conn, "my_table", df) # table must exist and be compatible with the input data

StatFiles.jl integration: write_table! is not supported.

Additionally, it is possible to export tabular data into Julia code (.jl files):

write_table!("my_data.jl", "my_table", df)

To read this data, the corresponding Julia source code file can be included:

include("my_data.jl")
@assert DataFrame(my_table) == df

Conversions

It is possible to pass the output of read_table directly as input to write_table! for converting tabular data between different formats:

# requires the format packages used below, e.g. using CSV, JDF, XLSX, SQLite
name1 = joinpath(testpath, "test.zip") # zipped CSV; testpath is a directory already containing test.zip
name2 = joinpath(testpath, "testx.jdf") # binary
name3 = joinpath(testpath, "testx.xlsx") # Excel
name4 = joinpath(testpath, "testx.db") # SQLite

write_table!(name2, read_table(name1))
write_table!(name3, read_table(name2))
write_table!(name4, "my_table", read_table(name3))

df_recovered = DataFrame(read_table(name4, "my_table"); copycols=false)

PlutoUI Integration

In a Pluto.jl notebook, TableIO can be used directly on a PlutoUI.jl FilePicker output.

Example (run in a Pluto.jl notebook):

using PlutoUI, TableIO, DataFrames
@bind f PlutoUI.FilePicker() # pick any supported file type
df = DataFrame(read_table(f); copycols=false)

This functionality works for all supported file formats, provided the corresponding format packages are installed. They do not need to be imported manually; this is done automatically.

Testing

The PostgreSQL component requires a running PostgreSQL database for unit tests. This database can be started using the following command:

docker run --rm --detach --name test-libpqjl -e POSTGRES_HOST_AUTH_METHOD=trust -p 5432:5432 postgres
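
With the database container running, the test suite can be executed via the standard Pkg tooling:

using Pkg
Pkg.test("TableIO")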

Disclaimer

If you encounter warnings like

┌ Warning: Package TableIO does not have CSV in its dependencies:
│ - If you have TableIO checked out for development and have
│   added CSV as a dependency but haven't updated your primary
│   environment's manifest file, try `Pkg.resolve()`.
│ - Otherwise you may need to report an issue with TableIO
└ Loading CSV into TableIO from project dependency, future warnings for TableIO are suppressed.

please ignore them.

TableIO purposely does not include the format-specific libraries as its own dependencies.

tableio.jl's People

Contributors: github-actions[bot], lungben, pallharaldsson, xiaodaigh

tableio.jl's Issues

List items in a file

Several formats, like xlsx, SQLite, HDF, allow storage of multiple tables in one file.
Add a function list_tables(filename) to list all table names inside a file.

Shall I implement read_table for ODBC query?

Hi there,

I have a project that reads tables from disk (in various file formats) and from databases.
These data sources are specified in a TOML file as part of a pipeline, so TableIO is very helpful in minimising the boilerplate.

I'm interested in implementing read_table for SQL queries over ODBC. Something like this:

read_table("MyDSN.odbc", sql_query::String)

Does this format suit TableIO?
Shall I go ahead with the implementation?

Cheers,
Jock

Referencing Table Inside a Schema

I'm working with a database that has multiple schemas. I'd like to get a DataFrame from the table "Reference"."Date".

When I run list_tables(conn), I get "Date" in the list of tables, but when I run read_table(conn, "Date"), it errors because there is no such table. When I try "Reference.Date" or ""Reference"."Date"", I get an error because table_name may only contain alphanumeric characters (I think it is choking on the period).

I'm relatively new to Julia and TableIO, so I'm sure I'm just missing something. I'd appreciate any help getting a solution for this.

Add HDF support

Ideally, it should be possible to read HDF files created e.g. by Pandas' to_hdf().
Reading without relying on PyCall is probably only possible for C data types, not for pickled Python objects.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

align interface for file formats containing multiple tables

Some file formats, e.g. xlsx, sqlite, hdf, allow storage of multiple tables inside one file.
Currently, the interfaces for these file formats are inconsistent.

Proposed common api:

Define two methods for read_table: one with a table name and one without.
For the latter, return the content of the alphabetically first table, but log a warning if the file contains more than one table.

For write_table!, the table name should always be given; no "default" table name is used.
For zipped files, giving the file name inside the zip archive should stay optional; if it is not given, the data is stored in CSV format with a file name equal to the archive name.
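
A minimal sketch of how the method without a table name could fall back to the documented two-argument method (warning text is illustrative):

function read_table(filename::AbstractString)
    names = list_tables(filename) # alphabetically sorted
    length(names) > 1 && @warn "File contains more than one table; reading the alphabetically first one." filename
    return read_table(filename, first(names))
end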
