tobgu / qframe Goto Github PK

Immutable data frame for Go

License: MIT License

qframe's Introduction

QFrame is an immutable data frame that supports filtering, aggregation and data manipulation. Any operation on a QFrame results in a new QFrame; the original QFrame remains unchanged. This can be done fairly efficiently since much of the underlying data is shared between the two frames.

The design of QFrame has mainly been driven by the requirements of qocache, but it is in many respects a general-purpose data frame. Suggestions for added or improved functionality to support a wider scope are always of interest, as long as they don't conflict with the requirements of qocache! See Contribute.

Installation

go get github.com/tobgu/qframe

Usage

Below are some examples of common use cases. The list is not exhaustive in any way. For a complete description of all operations including more examples see the docs.

IO

QFrames can currently be read from and written to CSV, record-oriented JSON, and any SQL database supported by a Go database/sql driver.

CSV Data

Read CSV data:

input := `COL1,COL2
a,1.5
b,2.25
c,3.0`

f := qframe.ReadCSV(strings.NewReader(input))
fmt.Println(f)

Output:

COL1(s) COL2(f)
------- -------
      a     1.5
      b    2.25
      c       3

Dims = 2 x 3

SQL Data

QFrame supports reading data from and writing data to databases through the standard library database/sql drivers. It has been tested with SQLite, Postgres, and MariaDB.

SQLite Example

Load data to and from an in-memory SQLite database. Note that this example requires you to have go-sqlite3 installed prior to running.

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3"
	"github.com/tobgu/qframe"
	qsql "github.com/tobgu/qframe/config/sql"
)

func main() {
	// Create a new in-memory SQLite database.
	db, _ := sql.Open("sqlite3", ":memory:")
	// Add a new table.
	db.Exec(`
	CREATE TABLE test (
		COL1 INT,
		COL2 REAL,
		COL3 TEXT,
		COL4 BOOL
	);`)
	// Create a new QFrame to populate our table with.
	qf := qframe.New(map[string]interface{}{
		"COL1": []int{1, 2, 3},
		"COL2": []float64{1.1, 2.2, 3.3},
		"COL3": []string{"one", "two", "three"},
		"COL4": []bool{true, true, true},
	})
	fmt.Println(qf)
	// Start a new SQL Transaction.
	tx, _ := db.Begin()
	// Write the QFrame to the database.
	qf.ToSQL(tx,
		// Write only to the test table
		qsql.Table("test"),
		// Explicitly set SQLite compatibility.
		qsql.SQLite(),
	)
	// Create a new QFrame from SQL.
	newQf := qframe.ReadSQL(tx,
		// A query must return at least one column. In this 
		// case it will return all of the columns we created above.
		qsql.Query("SELECT * FROM test"),
		// SQLite stores boolean values as integers, so we
		// can coerce them back to bools with the CoercePair option.
		qsql.Coerce(qsql.CoercePair{Column: "COL4", Type: qsql.Int64ToBool}),
		qsql.SQLite(),
	)
	fmt.Println(newQf)
	fmt.Println(newQf.Equals(qf))
}

Output:

COL1(i) COL2(f) COL3(s) COL4(b)
------- ------- ------- -------
      1     1.1     one    true
      2     2.2     two    true
      3     3.3   three    true

Dims = 4 x 3
true

Filtering

Filtering can be done either by applying individual filters to the QFrame or by combining filters using AND and OR.

Filter with OR-clause:

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
newF := f.Filter(qframe.Or(
    qframe.Filter{Column: "COL1", Comparator: ">", Arg: 2},
    qframe.Filter{Column: "COL2", Comparator: "=", Arg: "a"}))
fmt.Println(newF)

Output:

COL1(i) COL2(s)
------- -------
      1       a
      3       c

Dims = 2 x 2

Grouping and aggregation

Grouping and aggregation are done in two distinct steps. The function used in the aggregation step takes a slice of elements and returns a single element. For floats this function signature matches many of the statistical functions in Gonum, so these can be applied directly.

intSum := func(xx []int) int {
    result := 0
    for _, x := range xx {
        result += x
    }
    return result
}

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 2, 3, 3}, "COL2": []string{"a", "b", "c", "a", "b"}})
f = f.GroupBy(groupby.Columns("COL2")).Aggregate(qframe.Aggregation{Fn: intSum, Column: "COL1"})
fmt.Println(f.Sort(qframe.Order{Column: "COL2"}))

Output:

COL2(s) COL1(i)
------- -------
      a       4
      b       5
      c       2

Dims = 2 x 3
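For a float column, the aggregation function has the shape func([]float64) float64. The sketch below uses only the standard library; the name mean and the choice to return 0 for an empty slice are assumptions made for illustration and are not part of qframe. A function of this shape would be passed as Fn in the same way as intSum above.

```go
package main

import "fmt"

// mean is a hypothetical aggregation function for float columns.
// It takes a slice of elements and returns a single element.
// Returning 0 for an empty slice is an assumption of this sketch.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	fmt.Println(mean([]float64{1.5, 2.25, 3.0}))
}
```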

Data manipulation

There are two different functions by which data can be manipulated, Apply and Eval. Eval is slightly more high level and takes a more data-driven approach, but in the end it boils down to a series of Apply calls.

Example using Apply to string concatenate two columns:

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Apply(
    qframe.Instruction{Fn: function.StrI, DstCol: "COL1", SrcCol1: "COL1"},
    qframe.Instruction{Fn: function.ConcatS, DstCol: "COL3", SrcCol1: "COL1", SrcCol2: "COL2"})
fmt.Println(f.Select("COL3"))

Output:

COL3(s)
-------
     1a
     2b
     3c

Dims = 1 x 3

The same example using Eval instead:

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Eval("COL3", qframe.Expr("+", qframe.Expr("str", types.ColumnName("COL1")), types.ColumnName("COL2")))
fmt.Println(f.Select("COL3"))

More usage examples

Examples of the most common operations are available in the docs.

Error handling

All operations that may result in errors will set the Err variable on the returned QFrame to indicate that an error occurred. The presence of an error on the QFrame will prevent any future operations from being executed on the frame (i.e. it follows a monad-like pattern). This allows for smooth chaining of multiple operations without having to explicitly check errors between each operation.
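The pattern can be illustrated with a small self-contained sketch. The frame type and its divideBy and add methods are hypothetical stand-ins, not qframe's API; only the idea of an Err value that short-circuits later operations is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// frame is a hypothetical stand-in for a data frame; it is not qframe's type.
type frame struct {
	values []int
	Err    error
}

// divideBy returns a new frame with every value divided by d.
// If the receiver already carries an error it is returned untouched.
func (f frame) divideBy(d int) frame {
	if f.Err != nil {
		return f
	}
	if d == 0 {
		return frame{Err: errors.New("divideBy: division by zero")}
	}
	out := make([]int, len(f.values))
	for i, v := range f.values {
		out[i] = v / d
	}
	return frame{values: out}
}

// add returns a new frame with n added to every value,
// or the receiver unchanged if it already carries an error.
func (f frame) add(n int) frame {
	if f.Err != nil {
		return f
	}
	out := make([]int, len(f.values))
	for i, v := range f.values {
		out[i] = v + n
	}
	return frame{values: out}
}

func main() {
	f := frame{values: []int{10, 20, 30}}

	// Every step succeeds, so the error is checked once at the end.
	ok := f.divideBy(10).add(1)
	fmt.Println(ok.values, ok.Err)

	// The first step fails; the later steps are skipped.
	failed := f.divideBy(0).add(1).divideBy(2)
	fmt.Println(failed.values, failed.Err)
}
```

The caller inspects Err once, after the whole chain, instead of after each call.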

Configuration parameters

API functions that require configuration parameters make use of functional options to allow more options to be easily added in the future in a backwards compatible way.
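In the examples above, qsql.Table("test") and groupby.Columns("COL2") are options of this kind. The sketch below shows the general shape of the functional options pattern using only the standard library. All names in it (csvConfig, csvOption, withDelimiter, withEmptyAsNull, newCSVConfig) are hypothetical and do not correspond to qframe's configuration packages.

```go
package main

import "fmt"

// csvConfig is a hypothetical configuration struct.
type csvConfig struct {
	delimiter   byte
	emptyAsNull bool
}

// csvOption is a functional option: a function that modifies the config.
type csvOption func(*csvConfig)

// withDelimiter sets the column delimiter.
func withDelimiter(d byte) csvOption {
	return func(c *csvConfig) { c.delimiter = d }
}

// withEmptyAsNull controls whether empty fields are treated as null.
func withEmptyAsNull(b bool) csvOption {
	return func(c *csvConfig) { c.emptyAsNull = b }
}

// newCSVConfig starts from defaults and applies the options in order.
// New options can be added later without changing this signature.
func newCSVConfig(opts ...csvOption) csvConfig {
	c := csvConfig{delimiter: ','}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newCSVConfig())
	fmt.Printf("%+v\n", newCSVConfig(withDelimiter(';'), withEmptyAsNull(true)))
}
```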

Design goals

Performance
- Speed should be on par with, or better than, Python Pandas for corresponding operations.
- No or very little memory overhead per data element.
- Performance impact of operations should be straightforward to reason about.

API
- Should be reasonably small and low ceremony.
- Should allow custom, user-provided functions to be used for data processing.
- Should provide built-in functions for most common operations.

High level design

A QFrame is a collection of columns which can be of type int, float, string, bool or enum. For more information about the data types see the types docs.

In addition to the columns there is also an index, which controls which rows of the columns are part of the QFrame and the sort order of those rows. Many operations on QFrames only affect the index; the underlying data remains the same.

Many functions and methods in qframe take the empty interface as a parameter, for example for functions to be applied or for string references to internal functions. These always correspond to a union/sum type with a fixed set of valid types that are checked at runtime through type switches (hardly any reflection is used in QFrame, for performance reasons). Which types are valid depends on the function called and the column type that is affected. Modelling this statically is hard or impossible in Go, hence the dynamic approach. If you plan to use QFrame with datasets that have a fixed layout and fixed types, it should be a small task to write tiny wrappers for the types you are using to regain static type safety.

Limitations

- The API can still not be considered stable.
- The maximum number of rows in a QFrame is 4294967296 (2^32).
- The CSV parser only handles ASCII characters as separators.
- Individual strings cannot be longer than 268 MB (2^28 bytes).
- A string column cannot contain more than a total of 34 GB (2^35 bytes).
- At the moment you cannot rely on any of the errors returned to fulfill anything other than the Error interface. In the future this will hopefully be improved to provide more help in identifying the root cause of errors.
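The numeric limits above follow directly from the stated powers of two, as the short check below shows. The constant names are hypothetical and are not identifiers from the qframe codebase.

```go
package main

import "fmt"

// Hypothetical constants mirroring the limits listed above.
const (
	maxRows        uint64 = 1 << 32 // rows per QFrame
	maxStringBytes uint64 = 1 << 28 // bytes per individual string
	maxColumnBytes uint64 = 1 << 35 // total bytes per string column
)

func main() {
	fmt.Println(maxRows)
	fmt.Printf("%.1f MB\n", float64(maxStringBytes)/1e6)
	fmt.Printf("%.1f GB\n", float64(maxColumnBytes)/1e9)
}
```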

Performance/benchmarks

There are a number of benchmarks in qbench comparing QFrame to Pandas and Gota where applicable.

Other data frames

The work on QFrame has been inspired by Python Pandas and Gota.

Contribute

Want to contribute? Great! Open an issue on GitHub and let the discussions begin! Below are some instructions for working with the QFrame repo.

Ideas for further work

Below are some ideas of areas where contributions would be welcome.

- Support for more input and output formats.
- Support for additional column formats.
- Support for using the Arrow format for columns.
- General CPU and memory optimizations.
- Improve documentation.
- More analytical functionality.
- Dataset joins.
- Improved interoperability with other libraries in the Go data science ecosystem.
- Improve string representation of QFrames.

Install dependencies

make dev-deps

Tests

Please contribute tests together with any code. The tests should be written against the public API to avoid lockdown of the implementation and internal structure which would make it more difficult to change in the future.

Run tests: make test

This will also trigger code to be regenerated.

Code generation

The codebase contains some generated code to reduce the amount of duplication required for similar functionality across different column types. Generated code is recognized by file names ending with _gen.go. These files must never be edited directly.

To trigger code generation: make generate

qframe's Issues

Extract column as a []float64?

I'm hoping to extract a column as a []float64, or a set of columns as a *mat.Dense. Is that possible?

Error in ReadCSV

Got an error when reading from a CSV file. The same file can be read by the read_csv function in pandas without problems. A quick visual check also confirms there's no column without a name in the file.

New (CheckName: column name must not be empty)

Passing a struct or a map into the df creation

hi guys

The below passes into the df but I get the error 'createColumn: unknown column data type "string" for column "Age"'

Would you be able to help me please?


        to_df := map[string]interface{}{
			"CreatedAt": CreatedDate,
			"Age": strconv.FormatFloat(rounded, 'f', 6, 64),
			"FavoriteCount": strconv.Itoa(FavoriteCount),
			"RetweetCount": strconv.Itoa(RetweetCount),
			"hour": hour,
			"day": day,
		}
		f := qframe.New(to_df)
		fmt.Println(f)

Is there any plan to support time columns/types?

I can see that there is this internal package to deal with column types.
I can not find a time.Time implementation for the icolumn.View stuff, is there any plan to support this?

Would it be just a matter of adding the relevant code to the package internal or are there other places to change/complement?

I am basically trying to apply functions to a time column to change time-zones, then being able to filter the rows according to these time-zoned columns.

Support for multiple aggregations

Is there a way to accomplish this?

f = f.GroupBy(groupby.Columns("day")).Aggregate(
		qframe.Aggregation{Fn: sum, Column: "amount"},
		qframe.Aggregation{Fn: "count", Column: "amount"},
		qframe.Aggregation{Fn: max, Column: "amount"})

Error message: Aggregate: cannot aggregate on column that is part of group by or is already an aggregate: amount

Increasing Column width

In my dataframe, I can hardly read the data in the columns. Can I increase the column width to improve readability?

Iterate Over the given qframe

Is there an example that shows how to iterate over the rows of a QFrame?

vet: Go idioms

as noted in https://www.reddit.com/r/golang/comments/8lmv8l/qframe_a_performant_immutable_data_frame_for_go/

there are a few exposed functions which could follow more Go-idiomatic naming.

How to convert many CSV files into a qframe?

I have many data source files like data_1.csv, data_2.csv, data_3.csv.
In "gota", I can use the "RBind" method:
if fileN == 0 { dfAll = df.Copy() } else { dfAll = dfAll.RBind(df) }
but qframe.ReadCSV is a static function.
How can I make a QFrame read from more than one CSV file?

Improve Eval

It would be nice to implement some stronger typing around the Eval system in QFrame. There are currently many empty interfaces passed around which can make it challenging for end users to understand how to use the library.

I've included an example of some sample code below to illustrate this.

package main

import (
	"crypto/md5"
	"fmt"

	"github.com/tobgu/qframe"
	"github.com/tobgu/qframe/config/eval"
	"github.com/tobgu/qframe/types"
)

// Create a new column containing an MD5 hash
// of the data in each preceding column.
func main() {
	qf := qframe.New(map[string]interface{}{
		"COL1": []string{"2001", "2002", "2003"},
		"COL2": []int{1, 2, 3},
		"COL3": []float64{3, 4, 5},
	})
	ctx := eval.NewDefaultCtx()
	// Add a new "hash" function to the default EvalContext.
	ctx.SetFunc("hash", func(data *string) *string {
		hash := fmt.Sprintf("%X", md5.Sum([]byte(*data)))
		return &hash
	})
	// Things get a bit awkward here..
	// Range over each column and create
	// an expression that coerces the
	// underlying datatypes to a string.
	var toStrings []interface{}
	for _, name := range qf.ColumnNames() {
		// We've lost all typing here and it is unclear what "str"
		// does. To find out we need to lookup the logic in config/eval/context
		// and follow those definitions into the function package. Maybe we
		// could pass the functions in directly?
		toStrings = append(toStrings, qframe.Expr("str", types.ColumnName(name)))
	}
	// Concatenate each string together and then
	// pass the result to our custom "hash" function above.
	// Again we've lost strong typing and clarity here,
	// since we just defined "hash", that one is obvious.
	// The unary operators are convenient but a type
	// that I can jump to in my editor would be nicer.
	expr := qframe.Expr("hash", qframe.Expr("+", toStrings...))
	// md5sum = hash(concat(string(T)...))
	qf = qf.Eval("md5sum", expr, eval.EvalContext(ctx))
	fmt.Println(qf)
}

/*
COL1(s) COL2(i) COL3(f) md5sum(s)
------- ------- ------- ---------
   2001       1       3 0E5DDD...
   2002       2       4 E185FD...
   2003       3       5 A3CE02...

Dims = 4 x 3
*/

Zero urgency behind this issue, the code above works just fine!

Thanks for all your work on this awesome library.

how to achieve multi index ?

hi there,

Hope someone can help me. How can I achieve a multi index similar to https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

Any help is much appreciated.

Regards,
Julio

Possibly add a qframe.ReadParquet method

I just wrote a blog post on how to convert a CSV to Parquet with Go and thought about how wonderful it would be to play with Parquet files in a Go DataFrame ;)

It doesn't look like the readCSV code is that complex.

Do you think parquet-go can be used to add this feature relatively easily?

Apply function issues

Hey there

Do you know what is wrong with the below?

I get the error: apply1 (int.Apply1: cannot apply type (func([]int) string)(0x574120) to column)

package main

import (
    "fmt"
    "io/ioutil"
    "strconv"
    "strings"

    "github.com/tobgu/qframe"
)

func main() {
    //define filepath
    filepath := "emp.csv"

    //open the file
    openfile, _ := ioutil.ReadFile(filepath)

    //cast byte slice as string
    contents := string(openfile)

    //get rid of whitespace
    contents = strings.TrimSpace(contents)

    f := qframe.ReadCSV(strings.NewReader(string(contents)))
    fmt.Println(f)

    converter := func(x int) string {
        y := x * 2
        return strconv.Itoa(y) // convert int to string
    }

    qf2 := f.Apply(qframe.Instruction{Fn: converter, DstCol: "strings", SrcCol1: "AGE"})
    fmt.Println(qf2)
}

How do you add a new column / overwrite an existing column

Hi,

how do you overwrite a column of a qframe object?

I have a data-frame qdf.

Based on the examples I used a view to iterate over the column. Within the for loop I apply a conversion (in this case a Epoch to human-readable timestamp conversion).

I'd like to collect all the results into the slice presult and add it to the columns of a new qdf

view := qdf.MustStringView("EdgeStartTimestamp")
presult := make([]string, 1)
for i:=0; i < view.Len(); i++ {
	item := view.ItemAt(i)
	if citem, err  := strconv.Atoi(*item); err == nil {
		presult = append(presult,
			time.Unix(int64(citem) / 1000000000, 0).String())
	}
}

I can apply functions to one column, but I cannot add a new column:

qdf = qdf.Apply(qframe.Instruction{Fn: function.IntF, DstCol: "EdgeStartTimestamp", SrcCol1: "EdgeStartTimestamp" })

In this particular case I convert the column EdgeStartTimestamp. My goal would be to have a column timestamp with the contents of "EdgeStartTimestamp" in the replaced qdf object.

Is there a way to do this at the moment?

Best,
Marius

Unable to iterate through the GroupBy Grouper object

I need to write to different CSV files based on a GroupBy, so I want to iterate over the GroupBy indices and write each group to CSV. The Grouper object does not allow iteration.

Please give example code if the functionality is present.

INSERT INTO table (id, name, age) VALUES(1, "A", 19) ON DUPLICATE KEY UPDATE

Is it possible to save a qframe, where we update primary keys if they exist?

Not able to read CSV with multiple empty (duplicate column names)

Not able to read CSV with multiple empty column names.
Expected behaviour: rename columns with a numeric suffix such as empty1, empty2

example:

name, lastname , , , address
jay, patel, 12,123,12,3 na na

Error

ReadCsv: Duplicate columns detected: [                       ]

Quotes not handled correctly when <CR><LF> is used as line delimiter

If the last column in a row is quoted the last quote is always included when the \r\n is used as line delimiter (as opposed to only \n which works fine)

COL1\r\n
"a"\r\n

Results in COL1 = {"a\""}

install issues

Hey there

I am trying to do go get on my server and I get the below.

Any ideas?

root@sotrics:~/gcode# go get github.com/tobgu/qframe
# github.com/tobgu/qframe/internal/hash
../go/src/github.com/tobgu/qframe/internal/hash/memhash.go:9:6: missing function body
# github.com/tobgu/qframe/internal/ryu
../go/src/github.com/tobgu/qframe/internal/ryu/ryu32.go:270:12: undefined: bits.Mul64
../go/src/github.com/tobgu/qframe/internal/ryu/ryu64.go:396:16: undefined: bits.Mul64
../go/src/github.com/tobgu/qframe/internal/ryu/ryu64.go:397:13: undefined: bits.Mul64

Referencing columns from QF

Hey there

Thanks so much for your last response. I have one more question (sorry!!!)

I have a CSV which I ingest using the below

The problem I am having is, the names in the file are
name
age
city

But when I print f, they show like name (s), age (i)

But in the groupby I can't actually reference them like that or like they are in the file. How can I tackle that

Thanks

  f := qframe.ReadCSV(csvfile)
  fmt.Println(f)
  g := f.GroupBy(groupby.Columns("name (s)"))
  f = g.Aggregate(qframe.Aggregation{Fn: "count", Column: "age (i)"}).Sort(qframe.Order{Column: "name (s)"})

Can't extend Column type

Hi, the dataframe library is wonderful, but I have run into a problem. A column can be of type int, string or float, but that is not enough for me. I found the Column interface, but it is located in an internal package, so I can't implement it to add more types.

What should I do if I want to store more types?

Detailed examples/Documentation

Can you please provide detailed documentation on how to get an aggregate sum?
Suppose we have a QFrame of employees with salaries, ages and ids. How do I get the total sum of salaries where the salary is greater than 30?

Joining dataframes

Hi, thank you for writing this library. Are there any plans to add joins? If I were to add them, at least for myself, what would be the smartest place and way to do so? I am not that experienced a Go developer, and I doubt it would be on par with your standards.

Btw, not totally related, but I am making an interpreter in Go and I will probably use qframe for its dataframe implementation. It could be a nice solution for interactive data exploration/cleanup. I will show you once the language is more developed.

Serialize to and deserialize from Apache Arrow format

I am using arrow and it uses flat buffers internally which are very fast.

I would be interested in extending qframe to work with flat buffers.

There is also a special schemaless flat buffers called "flexible" which does not enforce a schema. I expect this is what you want to use for qframe.

"row" operations

Suppose we have a frame

  A   B   C
  1   2   3
 11  22  33
111 222 333

and we need to create new column S and put there some result across a row (for example S=A+B+C) so resulting frame should be as follows:

  A   B   C   S
  1   2   3   6
 11  22  33  66
111 222 333 666

How to achieve this?

Load CSV with no header

There are cases where a CSV file has no header and the column names are stored in another file (e.g. LOAD DATA statements). In pandas this is supported as follows:

fileDf = pd.read_csv(fileName, header=None, names=header)

It would be great if qframe could support something like that.

Group by not working

Hey there,

First of all - awesome library

I have the below code directly from your documents, but I keep getting this error

./main.go:14:19: undefined: groupby
compiler exit status 2

Any thoughts?

package main

import "fmt"
import "log"
import "os"
import "github.com/tobgu/qframe"

func main() {
  qf := qframe.New(map[string]interface{}{
      "COL1": []string{"a", "b", "a", "b", "b", "c"},
      "COL2": []float64{0.1, 0.1, 0.2, 0.4, 0.5, 0.6},
  })

  g := qf.GroupBy(groupby.Columns("COL1"))
  qf = g.Aggregate(qframe.Aggregation{Fn: "count", Column: "COL2"}).Sort(qframe.Order{Column: "COL1"})

  fmt.Println(qf)

  }

Add equivalent of `pandas`.`read_html`

To get more feature parity with Pandas, integrate with https://github.com/nfx/go-htmltable.

Groupby error

Running the following example in gophernotes,

import (
	"fmt"
	"math"
	"strings"

	"github.com/tobgu/qframe"
	"github.com/tobgu/qframe/config/groupby"
	"github.com/tobgu/qframe/config/newqf"
	"github.com/tobgu/qframe/function"
	"github.com/tobgu/qframe/types"
)
intSum := func(xx []int) int {
    result := 0
    for _, x := range xx {
        result += x
    }
    return result
}

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 2, 3, 3}, "COL2": []string{"a", "b", "c", "a", "b"}})
f = f.GroupBy(groupby.Columns("COL2")).Aggregate(qframe.Aggregation{Fn: intSum, Column: "COL1"})
fmt.Println(f.Sort(qframe.Order{Column: "COL2"}))

Got the following error message

repl.go:10:5: invalid qualified type, expecting packagename.identifier, found: f.GroupBy(groupby.Columns("COL2")).Aggregate <*ast.SelectorExpr>

and if I take out the aggregation and simply do a groupby, it gave me

cannot use <github.com/tobgu/qframe/config/groupby.ConfigFunc> as <github.com/tobgu/qframe/config/groupby.ConfigFunc> in argument to f.GroupBy

Not sure what's going on here

time.Time column type

Hello! I have a use case where I'd like to sort/filter/group based on time. While in practice, I could likely use an int/string column to represent my time and convert it (if necessary) within an operation to time.Time - I'm not inclined to think this will perform as well as a solution which stores the time internally.
Curious if you would:

Think it makes sense to add a new column type vs piggybacking on int or string
Be open to a pull request for a time.Time column,
Have info on all the places to look to add a new column type/have a high level description of what would be required?

Thanks in advance!

Grouping by multiple columns and creating a computed column

Would like to do the following logic (expressed as sql) using this library

select column1, column2, count(*) as count
from table
group by column1, column2

What I tried

dataFrame.
  GroupBy(groupby.Columns("column1", "column2")).
  Aggregate(qframe.Aggregation{
	  Fn: count,
	  Column: "count",
  })

Result

Aggregate: unknown column: "count"
