dataframe's Introduction

Pharo DataFrame

DataFrame is a tabular data structure for data analysis in Pharo. It organizes and represents data in a tabular format, resembling a spreadsheet or database table. It is designed to handle structured data and offer various functionalities for data manipulation and analysis. DataFrames are used as visualization tools for Machine Learning and Data Science related tasks.

Installation

To install the latest stable version of DataFrame (pre-v3), go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

EpMonitor disableDuring: [ 
    Metacello new
      baseline: 'DataFrame';
      repository: 'github://PolyMathOrg/DataFrame:pre-v3/src';
      load ].

Use this script if you want the latest version of DataFrame:

EpMonitor disableDuring: [ 
    Metacello new
      baseline: 'DataFrame';
      repository: 'github://PolyMathOrg/DataFrame/src';
      load ].

Note: EpMonitor serves to deactivate Epicea, a Pharo code recovery mechanism, during the installation of DataFrame.

How to depend on it?

If you want to add a dependency on DataFrame to your project, include the following lines in your baseline method:

spec
  baseline: 'DataFrame'
  with: [ spec repository: 'github://PolyMathOrg/DataFrame/src' ].

If you are new to baselines and Metacello, check out the Baselines tutorial on Pharo Wiki.
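As a rough sketch, the dependency lines above might sit inside a complete baseline method like the one below. The class BaselineOfMyProject and the package name 'MyProject-Core' are placeholders invented for illustration; only the DataFrame repository URL comes from this page.

```smalltalk
"Method of a hypothetical BaselineOfMyProject class (a subclass of BaselineOf)."
baseline: spec
	<baseline>
	spec for: #common do: [
		"Declare DataFrame as an external dependency."
		spec
			baseline: 'DataFrame'
			with: [ spec repository: 'github://PolyMathOrg/DataFrame/src' ].
		"'MyProject-Core' is a placeholder package that needs DataFrame loaded first."
		spec
			package: 'MyProject-Core'
			with: [ spec requires: #('DataFrame') ] ]
```

With `requires:` in place, Metacello should load DataFrame before the package that uses it.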

What are data frames?

Data frames are one of the essential parts of the data science toolkit. They are specialized data structures for tabular data sets that provide us with a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use.

A data frame is like a database inside a variable. It is an object which can be created, modified, copied, serialized, debugged, inspected, and garbage collected. It allows you to communicate with your data quickly and effortlessly, using just a few lines of code. The DataFrame project is similar to the pandas library in Python or the built-in data.frame class in R.

Very simple example

In this section I show a very simple example of creating and manipulating a little data frame. For more advanced examples, please check the DataFrame Booklet.

Creating a data frame

weather := DataFrame withRows: #(
  (2.4 true rain)
  (0.5 true rain)
  (-1.2 true snow)
  (-2.3 false -)
  (3.2 true rain)).
   |  1     2      3     
---+---------------------
1  |  2.4   true   rain  
2  |  0.5   true   rain  
3  |  -1.2  true   snow  
4  |  -2.3  false  -     
5  |  3.2   true   rain  

Removing the third row of the data frame

weather removeRowAt: 3.
   |  1     2      3     
---+---------------------
1  |  2.4   true   rain  
2  |  0.5   true   rain  
4  |  -2.3  false  -     
5  |  3.2   true   rain  

Adding a row to the data frame

weather addRow: #(-1.2 true snow) named: 6.
   |  1     2      3     
---+---------------------
1  |  2.4   true   rain  
2  |  0.5   true   rain  
4  |  -2.3  false  -     
5  |  3.2   true   rain  
6  |  -1.2  true   snow  

Replacing the data in the first row and third column with 'snow'

weather at: 1 at: 3 put: #snow.
   |  1     2      3     
---+---------------------
1  |  2.4   true   snow  
2  |  0.5   true   rain  
4  |  -2.3  false  -     
5  |  3.2   true   rain  
6  |  -1.2  true   snow  

Transpose of the data frame

weather transposed.
   |  1     2     4      5     6     
---+---------------------------------
1  |  2.4   0.5   -2.3   3.2   -1.2  
2  |  true  true  false  true  true  
3  |  snow  rain  -      rain  snow  

Documentation and Literature

  1. Data Analysis Made Simple with Pharo DataFrame - a booklet that serves as the main source of documentation for the DataFrame project. It describes the complete API of DataFrame and DataSeries data structures, and provides examples for each method.

  2. Oleksandr Zaytsev, Nick Papoulias and Serge Stinckwich. Towards Exploratory Data Analysis for Pharo. In Proceedings of the 12th edition of the International Workshop on Smalltalk Technologies, pp. 1-6. 2017.

dataframe's People

Contributors

atharvakhare, balajig2000, ctskennerton, evd995, hernanmd, jecisc, jordanmontt, josh0306, joshua-dias-barreto, mabdi, olekscode, sergestinckwich, svenvc, tinchodias, tomooda

dataframe's Issues

Documentation errors

Hi
There are a couple of errors with the main page documentation:
a) The add column example uses the keyword "row:"
b) The add: item example does not work as the method does not exist
BTW, I think this is a great contribution!
Regards,
Graham

Feature Request: addRow:

It would be great to have an addRow: capability which would simply append a new row to the DataFrame.
I tried trivially adding one, but it seems the data structure is fixed size and one may have to copy the whole frame to achieve this, something one wants to avoid for performance reasons.
Is there an easy way to add in the implementation?
Thanks
Graham

Request rowsCollect: and columnsCollect: methods

DataFrame should allow us to collectRows: by applying some block to each row and collecting the results into another DataFrame. The number of columns of the resulting DataFrame should depend on the block.

For example:

df collectRows: [ :row | row * 2 ].
df collectRows: [ :row | row range ].

The same should be done for columns.

Remove DataFrame>>select:where: and others

The idea of method DataFrame>>select:where: was to resemble SQL and introduce a small DSL for filtering DataFrame. Names of block arguments should correspond to column names and every argument x is replaced with (self column: #x) - column of DataFrame with the same name.

"Find all the rows that satisfy the condition that value at column #species 
equals #setosa and value at column #sepalWidth equals 3.
Then return only columns #petalWidth #petalLength "

irisDataFrame
    select: #(petalWidth petalLength)
    where: [ :species :sepalWidth | species = #setosa and: sepalWidth = 3 ].

This idea was very interesting, but now it doesn't work (all tests are failing), and I think that instead of fixing it we should just remove it, because the same can be done with the standard Smalltalk select: message:

(irisDataFrame
    select: [ :row | (row atKey: #species) = #setosa and: [ (row atKey: #sepalWidth) = 3 ] ])
    where: #(petalWidth petalLength).

It's longer, but it looks much cleaner to me, because the select: part of select:where: was often confused with the traditional select:, which does the same as where:. So all those names are inconsistent.

In the future we may go back to the idea of DSL for querying. The implementation of DataFrame>>select:where: can be found by checking out previous commits.

Create empty DataFrame with column names

We can create an empty DataFrame with no rows and no columns.

DataFrame new. "a DataFrame (0@0)"

But there is no way to create a DataFrame with several named columns and no rows (specified features but no observations) in order to fill it later.

   |  Name  Age  Nationality  
---+--------------------------

If we try creating an empty DataFrame and setting the column names, the number of columns will remain 0, and the names will be saved (which also makes no sense - I will make a separate issue for that).

df := DataFrame new.
df numberOfColumns. "0"
df columnNames. "#()"
df columnNames: #(Name Age Gender).
df numberOfColumns. "0"
df columnNames. "#(Name Age Gender)"

We also have an API for creating an empty DataFrame with a specified number of rows and columns.

DataFrame new: 2@3.
   |    1    2    3  
---+-----------------
1  |  nil  nil  nil  
2  |  nil  nil  nil  

But it throws SubscriptOutOfBounds if we ask for 0 rows:

DataFrame new: 0@3. "SubscriptOutOfBounds: 1"

Provide sorting

A feature I missed during my first experiments with DataFrame is sorting. I'd expect the most general sorting operation to be sort: aBlock, like for SequenceableCollection, but the practically more frequent operation would be sorting by column.

Missing dependency on Tabular package in DataFrame-Tools

DataFrame-Tools uses the XLSXImporter class from the Tabular package. However, the dependency is not specified in the baseline, so the package is not loaded.

This dependency should be added to the BaselineOfDataFrame.

As a temporary solution you can load Tabular from World Menu > Tools > Catalog Browser.

Sort dataseries not keeping original index of the data, so cannot retrieve the row from the df

e.g.

distances := (1 to: rows) collect: [ :row |  row -> (alpha - (lut at: row at: 4)) abs].

df addColumn: distances named: #Distance.

distancesDataSerie := df column: #Distance.

distancesDataSerie sorted: [ :a :b | a value < b value ].

But then there is no way to find back where each ds entry came from in the df.

It would be even better to be able to sort the df by a given column, a set of such columns, or an arbitrary composite criterion (read: a block).
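One way to keep track of where each sorted entry came from is to sort the row numbers rather than the values. The sketch below uses plain Arrays as stand-in rows (the city data is made up) and only standard collection messages; it is not DataFrame API.

```smalltalk
| rows order sortedRows |
"Stand-in rows: a name and a distance. Not DataFrame objects."
rows := #( #(Paris 2.1) #(Santiago 5.6) #(Oslo 0.7) ).

"Sort the row numbers by the value in column 2, so original positions survive."
order := (1 to: rows size) asArray
	sorted: [ :i :j | ((rows at: i) at: 2) <= ((rows at: j) at: 2) ].

"Rebuild the rows in sorted order from the original row numbers."
sortedRows := order collect: [ :i | rows at: i ].

order. "#(3 1 2)"
```

Here `order` maps each sorted position back to its original row. Any two-argument block could replace the comparison, for example one that compares several columns.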

DataFrame can not be initialized with missing values

It would be useful to initialize a DataFrame with the following "incomplete" array and have it automatically insert nil or Float nan in place of the missing values. It should also be possible to handle these values later.

DataFrame fromRows: #(
    (1 2 3)
    (4 5)).
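Until something like this is supported, the rows could be padded before they are handed to DataFrame. The following is a minimal sketch on plain Arrays, assuming nil is the desired placeholder; it does not show how DataFrame itself would treat the padded rows.

```smalltalk
| rows width padded |
rows := #( #(1 2 3) #(4 5) ).

"The widest row decides the number of columns."
width := rows inject: 0 into: [ :widest :row | widest max: row size ].

"Append as many nils as each row is missing (Array new: n is filled with nil)."
padded := rows collect: [ :row | row , (Array new: width - row size) ].

padded. "#(#(1 2 3) #(4 5 nil))"
```

To pad with Float nan instead, `Array new: width - row size withAll: Float nan` could replace the inner expression.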

Profile select methods and make them faster

select: aBlock with: testedColumnNames
	
	| blockStr |
	blockStr := aBlock asString.
	
	"Remove parameters: '[ :x | x > 3 ]' to '[ x > 3 ]'"
	blockStr := (blockStr
		copyFrom: (blockStr findString: '|')
		to: blockStr size).
		
	testedColumnNames do: [ :eachColName |
		blockStr := blockStr
			copyReplaceAll: eachColName
			with: ('(row atKey: #', eachColName, ')') ].
		
	blockStr := '[ :row ', blockStr.
	^ self select: (Compiler evaluate: blockStr)

Change the script to load DataFrame

Proposed Metacello groups:

  • default (core DataFrame)
  • #Visualizations (loads Roassal)
  • #Examples (loads GTDocumenter)
  • #Files (loads NeoCSV)

DataFrame Core

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load.

Core + Files

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load: #Files.

Core + Files + Examples

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load: #(Files Examples).

Add ability to tabulate: like in Matrix

ut := Matrix rows: rows columns: columns tabulate: [ :r :c |
    | v |
    c = 2 ifTrue: [ v := r \\ 10. v = 0 ifTrue: [ v := 10 ]].
    c = 1 ifTrue: [ v := r // 10.  (r \\ 10) = 0 ifTrue: [ v := v - 1] . v:= v + 1].
    c = 3 ifTrue: [ v := 0 ].
    c = 4 ifTrue: [ v := 0 ].

    v ].

so, DataFrame rows: 100 columns: 3 tabulate: [:rowIndex :colIndex | ... ]
Code could be better, but you get the point
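To make the request concrete, here is a small sketch of what tabulate could compute, written as a block over plain Arrays. The block name and its argument order are assumptions for illustration, not an existing DataFrame method.

```smalltalk
| tabulate table |
"Hypothetical helper: build rowCount rows of columnCount cells from aBlock."
tabulate := [ :rowCount :columnCount :aBlock |
	(1 to: rowCount) collect: [ :r |
		(1 to: columnCount) collect: [ :c | aBlock value: r value: c ] ] ].

table := tabulate value: 3 value: 2 value: [ :r :c | r * 10 + c ].

table. "#(#(11 12) #(21 22) #(31 32))"
```

The resulting array of rows has the same shape as the literal arrays passed to the row-based constructors shown elsewhere on this page.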

fromCSV: should detect data types

If we read a CSV file where some columns are strings and others are integers, floats, or dates, DataFrame reads all of them as strings. All these values should be parsed and converted to the most appropriate data types, which should be detected automatically.

Inefficient implementation of DataSeries>>#unique

unique
	| unique |
	unique := LinkedList new.

	self collect: [ :each |
	(unique includes: each)
		ifFalse: [ unique add: each ] ].
	
	unique := unique asDataSeries.
	unique name: self name.
	^ unique

collect: is actually building a collection of all items just to throw it away.

Can't we just do something like asBag? We'd have the counts as well.
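A possible alternative, sketched on a plain Array rather than a DataSeries: iterate with do: so no throwaway collection is built, and use a Set for the membership test while an OrderedCollection keeps first-seen order. The sample values are made up.

```smalltalk
| items seen unique |
items := #(rain rain snow rain snow).
seen := Set new.
unique := OrderedCollection new.

"do: only iterates; the Set makes includes: a hash lookup instead of a list scan."
items do: [ :each |
	(seen includes: each) ifFalse: [
		seen add: each.
		unique add: each ] ].

unique asArray. "#(#rain #snow)"

"asBag, as suggested above, also gives the counts."
items asBag occurrencesOf: #rain. "3"
```

asSet alone would be shorter but does not preserve the order in which values first appear.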

Update baseline

Roassal2 has migrated to GitHub, but DataFrame is still loading it from SmalltalkHub, which sometimes causes conflicts. More generally, the current baseline needs to be improved.

Missing dependency error for GTExample

Loading DataFrame with this script raises a missing dependency error for GTExample.

Metacello new
    baseline: 'DataFrame' ;
    repository: 'github://PolyMathOrg/DataFrame';
    load .

Missing dependency for NeoCSV

After DataFrame was loaded, NeoCSV has to be installed manually from Catalog Browser. This dependency must be added to the BaselineOfDataFrame.

It doesn't raise an error when DataFrame is loaded. But if you call DataFrame >> fromCSV:, you'll get an error saying that "on: was sent to nil"

select:with: is failing on Pharo 7

DataFrame>>select:with: uses the Compiler class, which was removed or renamed in Pharo 7:

select: aBlock with: testedColumnNames
	
	| blockStr |
	blockStr := aBlock asString.
	
	"Remove parameters: '[ :x | x > 3 ]' to '[ x > 3 ]'"
	blockStr := (blockStr
		copyFrom: (blockStr findString: '|')
		to: blockStr size).
		
	testedColumnNames do: [ :eachColName |
		blockStr := blockStr
			copyReplaceAll: eachColName
			with: ('(row atKey: #', eachColName, ')') ].
		
	blockStr := '[ :row ', blockStr.
	^ self select: (Compiler evaluate: blockStr)
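As far as I know, recent Pharo versions expose the compiler through `Smalltalk compiler`, so the last line might be ported as sketched below. The block source here is a made-up stand-in for the rewritten blockStr, and rows are plain Arrays rather than DataFrame rows.

```smalltalk
| blockStr block |
"Stand-in for the string produced by the rewriting step above."
blockStr := '[ :row | (row at: 1) > 3 ]'.

"Evaluate the source to get a real block, without referring to the Compiler class."
block := Smalltalk compiler evaluate: blockStr.

block value: #(5 2). "true"
block value: #(1 2). "false"
```

This only replaces the evaluation step; the string rewriting of column names remains as fragile as before.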

Rewrite DataFrame>>select: (it's insane)

select: aBlock 
	"Evaluate aBlock with each of the receiver's elements as the argument. 
	Collect into a new collection like the receiver, only those elements for 
	which aBlock evaluates to true. Answer the new collection."

	| selectedRowNumbers df |
	
	selectedRowNumbers := LinkedList new.
	
	1 to: self numberOfRows do: [ :i | 
		(aBlock value: (self rowAt: i)) 
			ifTrue: [ selectedRowNumbers add: i ] ].
	
	df := self class new:
		(selectedRowNumbers size @ self numberOfColumns).
		
	df rowNames: (selectedRowNumbers collect: [ :i |
		self rowNames at: i ]) asArray.
	
	df columnNames: self columnNames.
		
	selectedRowNumbers doWithIndex: [ :rowNumber :i |
		df rowAt: i put: (self rowAt: rowNumber) ].
	
	^ df
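One direction for the rewrite, sketched on plain Arrays instead of a DataFrame: let select: on the interval of row numbers collect the matching indices, then derive both rows and row names from them, which avoids the LinkedList and the index-by-index copying. The sample rows and names are made up.

```smalltalk
| rows rowNames selectedRowNumbers selectedRows selectedNames |
rows := #( #(5.1 setosa) #(7.0 versicolor) #(4.9 setosa) ).
rowNames := #(a b c).

"Row numbers whose row satisfies the condition."
selectedRowNumbers := (1 to: rows size)
	select: [ :i | ((rows at: i) at: 2) = #setosa ].

"Rows and their names, looked up by the same numbers."
selectedRows := selectedRowNumbers collect: [ :i | rows at: i ].
selectedNames := selectedRowNumbers collect: [ :i | rowNames at: i ].

selectedRowNumbers. "#(1 3)"
selectedNames. "#(#a #c)"
```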

DataFrame class(Object) doesNotUnderstand: #Rows

Thank you for this wonderful contribution. I'm hitting a very basic error. After installing per the directions, I then try the example to create with #rows. Unfortunately, I get the error "DataFrame class(Object) doesNotUnderstand: #Rows".

DataFrame >> rows:

shows up in the SystemBrowser (although in the protocol "not implemented", despite showing what looks like completed code.)

Below is what I input via the Playground and the error I received. I've tried in both the 32 and 64 bit versions of Pharo 6.1 (on Mac OS 10.12), freshly downloaded. No signs of any other problems with my images or system. Please let me know if there is other information I can provide or things I can try to help with debugging.

Thank you again.


Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load.

df := DataFrame rows: #(
   ('Barcelona' 1.609 true)
   ('Dubai' 2.789 true)
   ('London' 8.788 false)).
Author: GlennHoetker

UndefinedObject(Object)>>doesNotUnderstand: #rows:
UndefinedObject>>DoIt
OpalCompiler>>evaluate
RubSmalltalkEditor>>evaluate:andDo:
RubSmalltalkEditor>>highlightEvaluateAndDo:
[ textMorph textArea editor highlightEvaluateAndDo: ann action.
textMorph shoutStyler style: textMorph text
] in [ textMorph textArea
	handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
		textMorph shoutStyler style: textMorph text
		]
] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate:
RubEditingArea(RubAbstractTextArea)>>handleEdit:
[ textMorph textArea
	handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
		textMorph shoutStyler style: textMorph text
		]
] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate:
WorldState>>runStepMethodsIn:
WorldMorph>>runStepMethods
WorldState>>doOneCycleNowFor:
WorldState>>doOneCycleFor:
WorldMorph>>doOneCycle
WorldMorph class>>doOneCycle
[ [ WorldMorph doOneCycle.
Processor yield.
false
] whileFalse: [  ]
] in MorphicUIManager>>spawnNewProcess
[ self value.
Processor terminateActive
] in BlockClosure>>newProcess

Setting column names does not check the number of columns

If we have a DataFrame with 2 columns

df := DataFrame fromRows: #(
	(Chile Santiago)
	(France Paris)).
df columnNames: #(Country Capital).
df numberOfColumns. "2"
   |  Country  Capital   
---+---------------------
1  |  Chile    Santiago  
2  |  France   Paris     

We can set the column names with an array of size 3 and get no error.

df columnNames: #(Country Capital Population).

Besides, the number of columns will remain 2.

df numberOfColumns. "2"

But the column names will now contain 3 elements, which makes no sense.

df columnNames. "#(#Country #Capital #Population)"

The same happens if we assign fewer column names than there are columns.

df columnNames: #(Country).
df numberOfColumns. "2"
df columnNames. "#(#Country)"

And in both cases, if we try to print a string table, we will get an error.

df asStringTable. "SubscriptOutOfBounds"
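The missing check could look roughly like the sketch below. It is written as a standalone block so it can be tried in a Playground; the block name and the error message are assumptions, and in DataFrame the test would presumably live inside columnNames: itself.

```smalltalk
| checkColumnNames result |
"Hypothetical guard: answer the names only if their count matches."
checkColumnNames := [ :names :numberOfColumns |
	names size = numberOfColumns ifFalse: [
		Error signal: 'Expected ' , numberOfColumns printString ,
			' column names, got ' , names size printString ].
	names ].

"Matching size passes through unchanged."
checkColumnNames value: #(Country Capital) value: 2.

"Mismatched size signals an error instead of silently storing the names."
result := [ checkColumnNames value: #(Country Capital Population) value: 2 ]
	on: Error
	do: [ :error | error messageText ].

result. "'Expected 2 column names, got 3'"
```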

Review API for consistency

The methods of DataFrame were created at different times over 2 years. Because of that, the API is sometimes inconsistent. We need to review all methods, remove the redundant ones, and rename those with confusing names. Protocols also have to be reviewed.
