polymathorg / dataframe

DataFrame in Pharo - tabular data structures for data analysis

License: MIT License

Smalltalk 98.97% StringTemplate 1.03%

pharo smalltalk gsoc data-science data-analysis data-frame tabular-data data-visualization statistics pharo-smalltalk

dataframe's Introduction

Pharo DataFrame

DataFrame is a tabular data structure for data analysis in Pharo. It organizes and represents data in a tabular format, resembling a spreadsheet or database table. It is designed to handle structured data and offer various functionalities for data manipulation and analysis. DataFrames are used as visualization tools for Machine Learning and Data Science related tasks.

Installation

To install the latest stable version of DataFrame (pre-v3), go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

EpMonitor disableDuring: [ 
    Metacello new
      baseline: 'DataFrame';
      repository: 'github://PolyMathOrg/DataFrame:pre-v3/src';
      load ].

Use this script if you want the latest version of DataFrame:

EpMonitor disableDuring: [ 
    Metacello new
      baseline: 'DataFrame';
      repository: 'github://PolyMathOrg/DataFrame/src';
      load ].

Note: EpMonitor serves to deactivate Epicea, a Pharo code recovery mechanism, during the installation of DataFrame.

How to depend on it?

If you want to add a dependency on DataFrame to your project, include the following lines into your baseline method:

spec
  baseline: 'DataFrame'
  with: [ spec repository: 'github://PolyMathOrg/DataFrame/src' ].

If you are new to baselines and Metacello, check out the Baselines tutorial on Pharo Wiki.

What are data frames?

Data frames are one of the essential parts of the data science toolkit. They are specialized data structures for tabular data sets that provide us with a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use.

A data frame is like a database inside a variable. It is an object which can be created, modified, copied, serialized, debugged, inspected, and garbage collected. It allows you to communicate with your data quickly and effortlessly, using just a few lines of code. The DataFrame project is similar to the pandas library in Python or the built-in data.frame class in R.

Very simple example

In this section I show a very simple example of creating and manipulating a little data frame. For more advanced examples, please check the DataFrame Booklet.

Creating a data frame

weather := DataFrame withRows: #(
  (2.4 true rain)
  (0.5 true rain)
  (-1.2 true snow)
  (-2.3 false -)
  (3.2 true rain)).

	1	2	3
1	2.4	true	rain
2	0.5	true	rain
3	-1.2	true	snow
4	-2.3	false	-
5	3.2	true	rain

Removing the third row of the data frame

weather removeRowAt: 3.

	1	2	3
1	2.4	true	rain
2	0.5	true	rain
4	-2.3	false	-
5	3.2	true	rain

Adding a row to the data frame

weather addRow: #(-1.2 true snow) named: 6.

	1	2	3
1	2.4	true	rain
2	0.5	true	rain
4	-2.3	false	-
5	3.2	true	rain
6	-1.2	true	snow

Replacing the data in the first row and third column with 'snow'

weather at: 1 at: 3 put: #snow.

	1	2	3
1	2.4	true	snow
2	0.5	true	rain
4	-2.3	false	-
5	3.2	true	rain
6	-1.2	true	snow

Transpose of the data frame

weather transposed.

	1	2	4	5	6
1	2.4	0.5	-2.3	3.2	-1.2
2	true	true	false	true	true
3	snow	rain	-	rain	snow

Documentation and Literature

Data Analysis Made Simple with Pharo DataFrame - a booklet that serves as the main source of documentation for the DataFrame project. It describes the complete API of DataFrame and DataSeries data structures, and provides examples for each method.

Oleksandr Zaytsev, Nick Papoulias and Serge Stinckwich. Towards Exploratory Data Analysis for Pharo. In Proceedings of the 12th edition of the International Workshop on Smalltalk Technologies, pp. 1-6. 2017.

dataframe's Issues

Some tests are failing because they are dependent on built-in datasets

Some tests were dependent on the built-in datasets. Now that the datasets have been moved to DataFrame-Tools, several tests in the DataFrame-Core-Tests package fail.

Make the column widths of FastTable draggable in the inspector

It would be great to have a context menu item "size to fit". This could be potentially expensive to calculate, so it would be user-initiated.

Typo: Arythmetic --> Arithmetic

Present in max, stddev etc.

Message copied several times, can be factored.

NeoCSV not loaded by the configuration out of the box

dataseries #sorted: gives weird results and misses data

Try sorting a DataSeries of associations.

Documentation errors

Hi
There are a couple of errors with the main page documentation:
a) The add column example uses the keyword "row:"
b) The add: item example does not work as the method does not exist
BTW, I think this is a great contribution!
Regards,
Graham

The inspector always shows row numbers and not the supplied name

When column names are changed, the FastTable in the inspector view is still showing row numbers.

Example:

df := DataFrame fromRows: #(
   ('Barcelona' 1.609 true)
   ('Dubai' 2.789 true)
   ('London' 8.788 false)).

df rowNames: #(A B C).

Feature Request: addRow:

It would be great to have an addRow:
capability which would simply append a new row to the dataFrame.
I tried trivially adding one, but it seems the data structure is fixed size and one may have to copy
the whole frame to achieve this, something one wants to avoid for performance reasons.
Is there an easy way to add in the implementation?
Thanks
Graham

Request rowsCollect: and columnsCollect: methods

DataFrame should allow us to collectRows: by applying some block to each row and collecting the results into another DataFrame. The number of columns of the resulting DataFrame should depend on the block.

For example:

df collectRows: [ :row | row * 2 ].
df collectRows: [ :row | row range ].

The same should be done for columns.

Remove DataFrame>>select:where: and others

The idea of method DataFrame>>select:where: was to resemble SQL and introduce a small DSL for filtering DataFrame. Names of block arguments should correspond to column names and every argument x is replaced with (self column: #x) - column of DataFrame with the same name.

"Find all the rows that satisfy the condition that value at column #species 
equals #setosa and value at column #sepalWidth equals 3.
Then return only columns #petalWidth #petalLength "

irisDataFrame
    select: #(petalWidth petalLength)
    where: [ :species :sepalWidth | species = #setosa and: sepalWidth = 3 ].

This idea was very interesting, but it no longer works (all tests are failing), and I think that instead of fixing it we should just remove it, because the same can be done with the standard Smalltalk select: message:

(irisDataFrame
    select: [ :row | (row atKey: #species) = #setosa and: (row atKey: #sepalWidth) = 3 ])
    where: #(petalWidth petalLength).

It's longer, but it looks much cleaner to me, because the select: part of select:where: was often confused with the traditional select:, which does the same as where:. So all those names are inconsistent.

In the future we may go back to the idea of DSL for querying. The implementation of DataFrame>>select:where: can be found by checking out previous commits.

Create empty DataFrame with column names

We can create an empty DataFrame with no rows and no columns.

DataFrame new. "a DataFrame (0@0)"

But there is no way to create a DataFrame with several named columns and no rows (specified features but no observations) in order to fill it later.

   |  Name  Age  Nationality  
---+--------------------------

If we try creating an empty DataFrame and setting the column names, the number of columns will remain 0, and the names will be saved (which also makes no sense - I will make a separate issue for that).

df := DataFrame new.
df numberOfColumns. "0"
df columnNames. "#()"
df columnNames: #(Name Age Gender).
df numberOfColumns. "0"
df columnNames. "#(Name Age Gender)"

We also have the API for creating an empty DataFrame with specified number of rows and columns.

DataFrame new: 2@3.

   |    1    2    3  
---+-----------------
1  |  nil  nil  nil  
2  |  nil  nil  nil

But it throws SubscriptOutOfBounds if we want to have 0 rows

DataFrame new: 0@3. "SubscriptOutOfBounds: 1"

How to deal with missing data for inducing types?

If you have a column with missing data (nil values), which is often the case, the DataTypeInductor is not able to deduce the type of the column. How do we deal with that?

Move packages into src/ folder

Add DataFrame to Catalog Browser

Provide sorting

A feature I missed during my first experiments with DataFrame is sorting. I'd expect the most general sorting operation to be sort: aBlock, like for SequenceableCollection, but the practically more frequent operation would be sorting by column.
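As a rough illustration of the requested behaviour, here is a minimal sketch of sorting rows by one column. It uses plain Arrays as stand-ins for data frame rows and only the standard collection message sorted:, so it makes no assumption about the DataFrame API itself. The sample rows are borrowed from another issue on this page.

```smalltalk
| rows byPopulation |
"Plain Array rows stand in for DataFrame rows; no DataFrame API is used."
rows := #(
    #('Barcelona' 1.609 true)
    #('Dubai' 2.789 true)
    #('London' 8.788 false)).

"Sort by the second column, largest value first."
byPopulation := rows sorted: [ :a :b | (a at: 2) >= (b at: 2) ].

"Names in sorted order: London, Dubai, Barcelona."
byPopulation collect: [ :row | row first ].
```

A block-based sort: on DataFrame could be built on the same idea, with sorting by column as a convenience that generates the comparison block.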

FastTable moves all columns to the left

DataFrame is initialized correctly, but FastTable moves everything to one side and displays this mess

DataFrameInternal uses deprecated class Matrix

The Matrix class has been deprecated in Pharo 7. Better to use Array2D?

Missing dependency on Tabular package in DataFrame-Tools

DataFrame-Tools uses the XLSXImporter class from the Tabular package. However, the dependency is not specified in the baseline, so the package is not loaded.

This dependency should be added to the BaselineOfDataFrame.

As a temporary solution you can load Tabular from World Menu > Tools > Catalog Browser.

Use Tonel as file format

Remove all visualizations. DataFrame should not visualize itself

Sort dataseries not keeping original index of the data, so cannot retrieve the row from the df

e.g.

distances := (1 to: rows) collect: [ :row |  row -> (alpha - (lut at: row at: 4)) abs].

df addColumn: distances named: #Distance.

distancesDataSerie := df column: #Distance.

distancesDataSerie sorted: [ :a :b | a value < b value ].

But then there is no way to find back where each ds entry came from in the df.

It would be even better to be able to sort the df by a given column, a set of such columns, or an arbitrary composite criterion (read: a block).
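One possible workaround, sketched here with standard collections only, is to pair each value with its original index before sorting. The sample values are hypothetical and stand in for the contents of the #Distance column.

```smalltalk
| values indexed sortedPairs |
"Hypothetical sample values standing in for the #Distance column."
values := #(0.7 0.2 0.9 0.4).

"Pair each value with its original position: index -> value."
indexed := (1 to: values size) collect: [ :i | i -> (values at: i) ].

"Sort by value; the key of each association still holds the original row."
sortedPairs := indexed sorted: [ :a :b | a value <= b value ].

"Original row indices in ascending order of value."
sortedPairs collect: [ :each | each key ].
```

The keys of the sorted associations can then be used to look the rows up again in the data frame.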

DataFrame can not be initialized with missing values

It would be useful to initialize a DataFrame with the following "incomplete" array and have it automatically insert nil or Float nan in place of the missing values. It should also be possible to handle these values later.

DataFrame fromRows: #(
    (1 2 3)
    (4 5)).
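Until this is supported, the rows can be padded by hand before the data frame is created. The following is a minimal sketch that uses standard collections only and fills the gaps with nil. Whether nil or Float nan is the better placeholder is left open.

```smalltalk
| rows width padded |
rows := #( #(1 2 3) #(4 5) ).

"Width of the widest row."
width := rows inject: 0 into: [ :widest :row | widest max: row size ].

"Append as many nils as each row is short of the full width."
padded := rows collect: [ :row |
    row , (Array new: width - row size) ].
```

Here padded holds the rows (1 2 3) and (4 5 nil), which all have the same length.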

Profile select methods and make them faster

select: aBlock with: testedColumnNames
	
	| blockStr |
	blockStr := aBlock asString.
	
	"Remove parameters: '[ :x | x > 3 ]' to '[ x > 3 ]'"
	blockStr := (blockStr
		copyFrom: (blockStr findString: '|')
		to: blockStr size).
		
	testedColumnNames do: [ :eachColName |
		blockStr := blockStr
			copyReplaceAll: eachColName
			with: ('(row atKey: #', eachColName, ')') ].
		
	blockStr := '[ :row ', blockStr.
	^ self select: (Compiler evaluate: blockStr)

Change the script to load DataFrame

Proposed Metacello groups:

default (core DataFrame)
#Visualizations (loads Roassal)
#Examples (loads GTDocumenter)
#Files (loads NeoCSV)

DataFrame Core

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load.

Core + Files

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load: #Files.

Core + Files + Examples

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load: #(Files Examples).

Add ability to tabulate: like in Matrix

ut := Matrix rows: rows columns: columns tabulate: [ :r :c |
    | v |
    c = 2 ifTrue: [ v := r \\ 10. v = 0 ifTrue: [ v := 10 ]].
    c = 1 ifTrue: [ v := r // 10.  (r \\ 10) = 0 ifTrue: [ v := v - 1] . v:= v + 1].
    c = 3 ifTrue: [ v := 0 ].
    c = 4 ifTrue: [ v := 0 ].

    v ].

so, DataFrame rows: 100 columns: 3 tabulate: [:rowIndex :colIndex | ... ]
Code could be better, but you get the point
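For comparison, a similar effect can already be had with nested collect: over intervals. This is only a sketch with standard collections: it produces an Array of row Arrays, not a DataFrame, and the cell formula is a made-up example.

```smalltalk
| table |
"Build 3 rows of 4 cells; each cell is computed from its row and column index."
table := (1 to: 3) collect: [ :r |
    (1 to: 4) collect: [ :c | r * 10 + c ] ].

"Cell at row 2, column 3."
(table at: 2) at: 3.
```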

Last column name not displayed in inspector

df columnNames: #(Numerator Denumerator Fraction AsFloat).

But the "AsFloat" entry is not showing.

fromCSV: should detect data types

If we read a CSV file where some columns are strings and others are integers, floats, or dates, DataFrame reads all of them as strings. All these values should be parsed and converted to the most appropriate data types, which should be detected automatically.

Inefficient implementation of DataSeries>>#unique

unique
	| unique |
	unique := LinkedList new.

	self collect: [ :each |
	(unique includes: each)
		ifFalse: [ unique add: each ] ].
	
	unique := unique asDataSeries.
	unique name: self name.
	^ unique

collect: is actually building a collection of all items just to throw it away.

Can't we just do something like asBag? We'd have the counts as well.
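A minimal sketch of both ideas, using plain collections in place of a DataSeries: do: avoids building the throwaway collection, a Set makes the membership test cheap, and a Bag gives the counts. Note that a Bag alone does not preserve the order of first appearance, which the OrderedCollection does here.

```smalltalk
| values seen unique counts |
values := #(3 1 3 2 1).

"Unique values in order of first appearance."
seen := Set new.
unique := OrderedCollection new.
values do: [ :each |
    (seen includes: each) ifFalse: [
        seen add: each.
        unique add: each ] ].

"Counts per value."
counts := values asBag.
counts occurrencesOf: 3.
```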

Confusing method names atColumn:put: and columnAt:put:

We have the following two methods for adding an array as a column to DataFrame:

DataFrame >> atColumn: aColumnName put: anArray
DataFrame >> columnAt: aNumber put: anArray

Those signatures are very confusing

Update baseline

Roassal2 has migrated to GitHub, but DataFrame is still loading it from SmalltalkHub, which sometimes causes conflicts. In general, the current baseline has to be improved.

Missing dependency error for GTExample

Loading DataFrame with this script raises a missing dependency error for GTExample.

Metacello new
    baseline: 'DataFrame' ;
    repository: 'github://PolyMathOrg/DataFrame';
    load .

Write examples for DataFrame

Missing dependency for NeoCSV

After DataFrame was loaded, NeoCSV has to be installed manually from Catalog Browser. This dependency must be added to the BaselineOfDataFrame.

It doesn't raise an error when DataFrame is loaded. But if you call DataFrame >> fromCSV:, you'll get an error saying that "on: was sent to nil"

select:with: is failing on Pharo 7

DataFrame>>select:with: uses the Compiler class, which has been removed or renamed in Pharo 7.

select: aBlock with: testedColumnNames
	
	| blockStr |
	blockStr := aBlock asString.
	
	"Remove parameters: '[ :x | x > 3 ]' to '[ x > 3 ]'"
	blockStr := (blockStr
		copyFrom: (blockStr findString: '|')
		to: blockStr size).
		
	testedColumnNames do: [ :eachColName |
		blockStr := blockStr
			copyReplaceAll: eachColName
			with: ('(row atKey: #', eachColName, ')') ].
		
	blockStr := '[ :row ', blockStr.
	^ self select: (Compiler evaluate: blockStr)

empty CSV file loading gives an issue instead of an empty frame

Ability to add DataSeries as columns to a DataFrame

Cannot add a new DataSeries to a DataFrame as a named column.

Ability to add an empty column to be later filled with data

Like with:

addEmptyColumnNamed: columnName
	self addColumn: (Array new: self size) named: columnName.

Row names and column names should be stored in OrderedCollection not Array

Because we need to add and remove rows and columns and we can only copy arrays.

Ability to transpose a dataframe

A dataframe made out of columns and rows should be transposable into rows and columns

dfTransposed := df transposed.

Rewrite DataFrame>>select: (it's insane)

select: aBlock 
	"Evaluate aBlock with each of the receiver's elements as the argument. 
	Collect into a new collection like the receiver, only those elements for 
	which aBlock evaluates to true. Answer the new collection."

	| selectedRowNumbers df |
	
	selectedRowNumbers := LinkedList new.
	
	1 to: self numberOfRows do: [ :i | 
		(aBlock value: (self rowAt: i)) 
			ifTrue: [ selectedRowNumbers add: i ] ].
	
	df := self class new:
		(selectedRowNumbers size @ self numberOfColumns).
		
	df rowNames: (selectedRowNumbers collect: [ :i |
		self rowNames at: i ]) asArray.
	
	df columnNames: self columnNames.
		
	selectedRowNumbers doWithIndex: [ :rowNumber :i |
		df rowAt: i put: (self rowAt: rowNumber) ].
	
	^ df

DataFrame class(Object) doesNotUnderstand: #Rows

Thank you for this wonderful contribution. I'm hitting a very basic error. After installing per the directions, I then try the example to create with #rows. Unfortunately, I get the error "DataFrame class(Object) doesNotUnderstand: #Rows".

DataFrame >> rows:

shows up in the SystemBrowser (although in the protocol "not implemented", despite showing what looks like completed code.)

Below is what I input via the Playground and the error I received. I've tried in both the 32 and 64 bit versions of Pharo 6.1 (on Mac OS 10.12), freshly downloaded. No signs of any other problems with my images or system. Please let me know if there is other information I can provide or things I can try to help with debugging.

Thank you again.

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load.

df := DataFrame rows: #(
   ('Barcelona' 1.609 true)
   ('Dubai' 2.789 true)
   ('London' 8.788 false)).

Author: GlennHoetker

UndefinedObject(Object)>>doesNotUnderstand: #rows:
UndefinedObject>>DoIt
OpalCompiler>>evaluate
RubSmalltalkEditor>>evaluate:andDo:
RubSmalltalkEditor>>highlightEvaluateAndDo:
[ textMorph textArea editor highlightEvaluateAndDo: ann action.
textMorph shoutStyler style: textMorph text
] in [ textMorph textArea
	handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
		textMorph shoutStyler style: textMorph text
		]
] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate:
RubEditingArea(RubAbstractTextArea)>>handleEdit:
[ textMorph textArea
	handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
		textMorph shoutStyler style: textMorph text
		]
] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate:
WorldState>>runStepMethodsIn:
WorldMorph>>runStepMethods
WorldState>>doOneCycleNowFor:
WorldState>>doOneCycleFor:
WorldMorph>>doOneCycle
WorldMorph class>>doOneCycle
[ [ WorldMorph doOneCycle.
Processor yield.
false
] whileFalse: [  ]
] in MorphicUIManager>>spawnNewProcess
[ self value.
Processor terminateActive
] in BlockClosure>>newProcess

The last column label isn't displayed

Setting column names does not check the number of columns

If we have a DataFrame with 2 columns

df := DataFrame fromRows: #(
	(Chile Santiago)
	(France Paris)).
df columnNames: #(Country Capital).
df numberOfColumns. "2"

   |  Country  Capital   
---+---------------------
1  |  Chile    Santiago  
2  |  France   Paris

We can set the column names with an array of size 3 and get no error.

df columnNames: #(Country Capital Population).

Besides, the number of columns will remain 2.

df numberOfColumns. "2"

But the column names will now contain 3 elements, which makes no sense.

df columnNames. "#(#Country #Capital #Population)"

The same happens if we assign fewer column names than there are columns.

df columnNames: #(Country).
df numberOfColumns. "2"
df columnNames. "#(#Country)"

And in both cases, if we try to print a string table, we will get an error.

df asStringTable. "SubscriptOutOfBounds"