lux-org / lux
Automatically visualize your pandas dataframe via a single print!
License: Apache License 2.0
Currently, the to_Altair functionality relies on chart.code, which is manually created via string concatenation. This approach is hard to maintain across changes. We can rewrite this by leveraging the inspect.getsource function to obtain the exported code.
For example:
import inspect

def change_color_add_title(chart):
    chart = chart.configure_mark(color="green")  # change mark color to green
    chart.title = "Custom Title"  # add title to chart
    return chart

print(inspect.getsource(change_color_add_title))
For high-cardinality bar charts, we show a text indicator for the "X more ..." bars that are not displayed; however, the text mark's color:
text = alt.Chart(visData).mark_text(
    x=155,
    y=142,
    align="right",
    color="#ff8e04",
    fontSize=11,
    text="+ 230 more ...",
)
is overridden by the color encoding:
chart = chart.encode(color=alt.Color('occupation',type='nominal'))
which is generated from AltairChart's encode_color. The color encoding is applied last, since it is only added for visualizations that have a color attribute.
This example can be found in the census dataset.
df = pd.read_csv("../lux-datasets/data/census.csv")
df.set_intent(["education"])
df
df = pd.read_csv("../../lux/data/state_timeseries.csv")
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index(["Date"])  # set_index returns a copy, so assign it back
This is happening because the executor expects a flat table and pre_aggregate is inferred as False for this table.
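As a possible workaround (a toy sketch with made-up data, not the actual state_timeseries file), flattening the index back into a column gives the executor the flat table it expects:

```python
import pandas as pd

# Toy stand-in for the state_timeseries data (hypothetical values)
df = pd.DataFrame({"Date": pd.date_range("2020-03-01", periods=3),
                   "Cases": [10, 20, 35]})
indexed = df.set_index("Date")  # note: set_index returns a copy
flat = indexed.reset_index()    # executor-friendly flat table again
```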
There is a bug related to quantitative attributes that have low cardinality. For example, if we create a column where every value is 4.0, a divide-by-zero warning is raised and the resulting histogram is degenerate, since the range of the computed min/max field is zero. This issue can be resolved by avoiding histogram computation for quantitative attributes with low cardinality.
import pandas as pd
import lux
df = pd.read_csv("../../lux/data/car.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetime
df["Units"] = 4.0
df
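A minimal sketch of the proposed guard (hypothetical helper, not Lux's actual histogram code): bail out before computing bin widths when the min/max range is zero.

```python
def safe_histogram(values, bins=10):
    # Hypothetical guard: skip histogram computation when the min/max
    # range is zero (e.g., a constant column of 4.0), which otherwise
    # triggers a divide-by-zero in the bin-width computation.
    lo, hi = min(values), max(values)
    if hi == lo:
        return None  # caller can fall back to a bar chart / single value
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts
```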
In Lux, we detect attributes that look like an ID and avoid visualizing them.
There are several issues related to the current type detection mechanisms: check_if_id_like needs to be improved so that we do not rely too heavily on the attribute_contain_id check; even if the attribute name does not contain "ID" but the values look like IDs, we should still label the attribute as an ID. The cardinality check almost_all_vals_unique is a good signal, since most ID fields are largely unique. Another check we could implement is whether the values are spaced at a regular interval (e.g., 200, 201, 202, ...), though this is a somewhat weak signal, since regular spacing is not a necessary property of IDs.
BUG: We currently trigger ID detection only if the attribute's data type is detected as an integer (source). We should fix this bug so that string attributes that are ID-like (e.g., a CustomerID in the Churn dataset such as "7590-VHVEG") are also detected as IDs.
Some test data can be found here, feel free to find your own on Kaggle or elsewhere. For a pull request, please include tests to try out the bugfix on several datasets to verify that ID fields are being detected and that non-ID fields are not detected.
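The signals above could be combined roughly like this (a hypothetical detector, not Lux's actual check_if_id_like; thresholds are illustrative):

```python
import re

def looks_like_id(values, name=""):
    # Hypothetical detector combining: name hint, near-total uniqueness,
    # regular integer spacing, and ID-like string patterns.
    n = len(values)
    if n == 0:
        return False
    unique_ratio = len(set(values)) / n
    if "id" in name.lower() and unique_ratio > 0.9:
        return True  # attribute_contain_id-style name hint
    if unique_ratio < 0.99:
        return False  # almost_all_vals_unique fails
    if all(isinstance(v, int) for v in values):
        s = sorted(values)
        diffs = {b - a for a, b in zip(s, s[1:])}
        return len(diffs) == 1  # evenly spaced, e.g. 200, 201, 202, ...
    if all(isinstance(v, str) for v in values):
        # catches string IDs such as "7590-VHVEG" in the Churn dataset
        return all(re.fullmatch(r"[A-Z0-9-]{6,}", v) is not None for v in values)
    return False
```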
The current sampling strategy is crude and largely based on random sampling. We should investigate Lux's performance degradation across various large datasets to select better sampling strategies, as well as expose tunable parameters in the API so that users can adjust the sampling strategy. We should also add a ReadTheDocs page explaining the default sampling strategy and how to tune it in Lux.
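A sketch of what the tunable API could look like (hypothetical function and parameter names, not an existing Lux interface):

```python
import pandas as pd

def downsample(df, threshold=10_000, cap=30_000, random_state=0):
    # Hypothetical tunable API: sample only when the frame exceeds
    # `threshold`; `cap` bounds the sample size, `random_state` makes
    # the sample reproducible. All three would be user-adjustable.
    if len(df) <= threshold:
        return df
    return df.sample(n=min(len(df), cap), random_state=random_state)
```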
Incorporate pattern similarity into a registered action, or show an example of how a user-defined action can easily support a custom VisList and scoring function. Adapt the existing code in lux/action/similarity.py.
Currently, autocomplete for LuxDataFrame is supported for methods of both pandas and LuxDataFrame. However, autocomplete for class properties (i.e., class attributes) is not showing up. The @property decorator approach doesn't work, since it leads to a maximum-recursion error, possibly because of subclassing.
There is a placeholder value df used to generate the exported code. This breaks if we name the dataframe anything other than df, for example:
data = pd.DataFrame({"label": ["a", "b", "c", "d", "e"],
                     "value2": [25, 30, 35, 40, 45],
                     "value": [25, 30, 35, 40, 45]})
data
vis = data.get_exported()[0]
print(vis.to_Altair())
This issue should be resolved if we resolve issue #37.
When a dataframe is filtered down to a fairly small result (e.g., fewer than about 15 datapoints), we should highlight the difference between the previous dataframe (e.g., df._prev) and the items in the resulting table.
TODOS:
Currently, the stacked bar chart is used when 2 categorical variables and 1 quantitative variable are specified. This works when the aggregate function is sum, but the chart is not very interpretable when another aggregate type is used.
While the heatmap is the more space-efficient alternative, it is a bit harder to interpret compared to the grouped bar chart. We will implement the grouped bar chart, but will need to modify it to share axes, similar to matplotlib's shared-axes grouped bar chart.
Again for compactness reasons, we may choose to apply this only when the color attribute has fewer than 3 distinct values.
We should display a better warning message when the user specifies an intent that indicates more than one item, but puts it into Vis instead of VisList.
For example, if I do:
Vis(["AverageCost","SATAverage", "Geography=?"],df)
The error that I get is not very interpretable. We should warn users that such an intent should be used with a VisList. If possible, we might even map the intent down to a VisList so that we display something to the user.
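A sketch of what the friendlier validation could look like (hypothetical function name; assumes a wildcard clause is spelled with "=?", as in the example above):

```python
def warn_multi_vis_intent(intent):
    # Hypothetical validation: a clause with a wildcard ("=?") describes
    # multiple visualizations, which a single Vis cannot represent.
    for clause in intent:
        if isinstance(clause, str) and "=?" in clause:
            raise TypeError(
                f"Clause {clause!r} specifies multiple values; "
                "use VisList instead of Vis, or drop the wildcard."
            )
```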
Currently, __setitem__ computes the metadata when called:
def __setitem__(self, key, value):
super(LuxDataFrame, self).__setitem__(key, value)
self.compute_stats()
self.compute_dataset_metadata()
However, when _repr_html_ is called, the metadata is computed many more times than necessary.
Removing this metadata computation causes Lux tests to fail.
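One possible direction, as a toy sketch with hypothetical names (not Lux's code): have __setitem__ merely mark the metadata as stale, and let _repr_html_ recompute at most once, no matter how many writes happened in between.

```python
class WriteInvalidation:
    # Sketch: writes only flag metadata as stale; display recomputes once.
    def __init__(self):
        self._stale = True
        self.recompute_count = 0  # stands in for compute_stats() calls

    def __setitem__(self, key, value):
        self._stale = True  # cheap: no metadata recomputation here

    def _repr_html_(self):
        if self._stale:
            self.recompute_count += 1  # recompute metadata on demand
            self._stale = False
        return "<widget>"
```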
We should have a settings panel to hide away ID-like fields (fields whose cardinality is close to the number of datapoints, or whose values are very evenly spaced). Incorporating ID-like fields in Correlation is problematic because everything involving the ID field would not be very informative.
This is an example from the instacart-market-basket-analysis Kaggle notebook.
df.set_context([lux.Spec(attribute = "Horsepower"),lux.Spec(attribute = "Horsepower")])
df
Right now, we penalize views that have duplicate attributes with an interestingness score of -1, which is why we don't see Enhance and Filter here. This would actually be one of the few places where Pivot might be helpful for users to "get unstuck".
When a dataframe is pre-aggregated, our cardinality-based type detection often fails to detect the type correctly. For example, when the dataset size is small (often the case when data is pre-aggregated), nominal fields get recognized as a quantitative type.
df = pd.read_csv("lux/data/car.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetime
a = df.groupby("Cylinders").mean()
a.data_type
As a related issue, we should also support the detection of types for named indexes; in this case, Cylinders is an index, so its data type is not being computed.
Currently, we only support automatic detection of datetime if the variable name contains ["month", "year","day","date","time"]. We should automatically detect if a string data type is datetime-like and convert it to temporal type.
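A possible value-based check (a sketch with assumed names and thresholds, not Lux's detection code): try parsing a sample of the string values, and treat the column as temporal when most of them parse.

```python
import pandas as pd

def is_datetime_like(series, sample=100, threshold=0.9):
    # Hypothetical check: parse a sample of the values as dates;
    # if at least `threshold` of them parse, treat the column as temporal.
    vals = series.dropna().astype(str).head(sample)
    parsed = pd.to_datetime(vals, errors="coerce")
    return bool(parsed.notna().mean() >= threshold)
```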
Currently, all bars in the bar chart have the same color, which makes comparison across items challenging. In particular, users need to read the bar labels to realize that the bars are sorted to match other charts for comparison. We should assign a consistent set of colors for low-cardinality categorical attributes to facilitate comparison. We can use a standard color palette for the particular visualization library (Vega, matplotlib).
The updated bars would look something like this, where all values of Cylinders=4 are green, all values of Cylinders=8 are yellow, etc.
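A minimal sketch of a stable category-to-color assignment (hypothetical helper; the palette values are placeholders, not Lux's defaults): sorting the categories first keeps the mapping consistent across charts.

```python
PALETTE = ["#4c78a8", "#f58518", "#e45756", "#72b7b2"]  # placeholder colors

def color_map(categories):
    # Hypothetical: stable category -> color assignment shared by all
    # charts, so Cylinders=4 is always the same color everywhere.
    return {cat: PALETTE[i % len(PALETTE)]
            for i, cat in enumerate(sorted(categories))}
```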
Currently, text attributes are recognized as categorical attributes and displayed as bar charts. We should add a new data type for text attributes, perhaps distinguishing between short (1-2 words, more like labels or categories) and long (2+ words, more like reviews, sentences, etc.). In addition, we could add an action that visualizes text fields, such as word clouds, N-grams, etc.
This article documents some possible visualization types for text data.
Support vis renderers other than Altair, such as matplotlib or plotly.
Run the Pandas test suite to ensure that metadata and recommendations are maintained across a variety of Pandas functions.
Vis objects are currently exported as code via to_Altair or to_VegaLite, but the Lux syntax is not exposed. As suggested by @adityagp, we should extend Lux with a feature that allows users to copy the code that generates the intent via a UI button, so that users can paste and edit their intent.
Currently, our bar chart visualization tries to balance showing the overall distribution against highlighting the min/max individual values; the issue is that the charts end up summarized or squashed.
Pro: easy to distinguish the min & max items.
Con: charts are summarized/squashed.
Other alternatives include 1) overplotting, 2) truncating to top-k, and 3) a scrollable design. Each of these comes with its own pros and cons.
One possible design is having a truncated top-k view that can be "expanded" into a scrollable view.
Any Sphinx-autogenerated link to a Lux module gets hyperlinked to the bottom of the page.
For example:
At this link, if you press lux.action, you get directed to the bottom of the page, specifically here. For some reason, all links to modules go directly to the "Module contents" section of a page, which happens to be at the bottom. This issue was brought up in this StackOverflow post, but the fix there is for the sphinx library command rather than something we can add to conf.py or one of the .rst files. However, that may be a good place to start when looking for a fix.
This issue was opened in light of trying to embed lux widgets in the documentation. We've tried a variety of solutions listed below, but none of them were able to either import or embed the widgets.
Our most recent version can be found on this branch, and was based on Altair's documentation. We were able to show a code block, but not a chart (might be worth investigating more later on).
We have tried various approaches on this front, along with @westernguy2 and @jrdzha, including the .. ipywidgets-display:: directive.
As a note for the future, we might need to look into ways to make a static rendering of the widget without needing a Jupyter backend. This requires packaging all the current dependencies into the export. It will also help with embedding in HTML and sharing Lux widgets.
There is a bug when using Lux with this example from Datashader.
import pandas as pd
import numpy as np
from collections import OrderedDict as odict
num=10000
np.random.seed(1)
dists = {cat: pd.DataFrame(odict([('x', np.random.normal(x, s, num)),
                                  ('y', np.random.normal(y, s, num)),
                                  ('val', val),
                                  ('cat', cat)]))
         for x, y, s, val, cat in
         [( 2,  2, 0.03, 10, "d1"),
          ( 2, -2, 0.10, 20, "d2"),
          (-2, -2, 0.50, 30, "d3"),
          (-2,  2, 1.00, 40, "d4"),
          ( 0,  0, 3.00, 50, "d5")]}
df = pd.concat(dists,ignore_index=True)
df["cat"]=df["cat"].astype("category") # If commented, the df.intent=vis line doesn't break
df #Select and export the scatterplot vis (big circular blob)
vis = df.exported[0]
df.intent = vis
df # This errors on the key 'cat' which is a categorical
We should look into supporting Categorical data types in Pandas. More importantly, we should look into whether bugs show up when we perform astype operations for other data types.
Lux currently only supports a single-level index and falls back to the default Pandas display when there are two or more index levels. A MultiIndex can be created via crosstab or explicitly via pd.MultiIndex.
Example 1:
# Example from http://www.datasciencemadesimple.com/cross-tab-cross-table-python-pandas/
d = {
'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}
df = pd.DataFrame(d,columns=['Name','Exam','Subject','Result'])
pd.crosstab([df.Exam,df.Subject],df.Result)
Example 2:
# Example borrowed from https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
some_other_vals = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.DataFrame(populations, columns=["population"], index=index)
index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)
pop
This ends up creating many possible visualizations and is better visualized as a faceted chart. Altair also does not currently support MultiIndex, so we would need to explicitly reformat the data to render the vis.
The console gives me "ModuleNotFoundError: No module named 'luxwidget'" when I try to run "jupyter nbextension install --sys-prefix --symlink --overwrite --py luxwidget" in cmd.
Currently, when multiple filters are applied, it is unclear whether an OR or an AND is applied; not enough explanation is displayed on the visualization to indicate this. For example, even though clauses are combined via conjunction, in this example it seems like only one of the filters is being applied.
df.intent = ["Region=New England", "Region=Southeast", "Region=Far West"]
df
More work needs to be done to extend the language for supporting OR.
When properties are specified in the Vis and fed into the VisList, refresh_source goes through the parser, validator, and compiler. At the compiler stage, the Vis properties get overridden by the automatically determined values. We should allow specified title, mark, and other Vis properties to override the automatically determined values.
Here is an example test that evaluates this; the last two lines fail because the title and mark are overwritten by the automatically determined values. This example should be added to the test suite once we resolve this issue.
df = pd.read_csv("lux/data/olympic.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetime
from lux.vis.VisList import VisList
from lux.vis.Vis import Vis
vcLst = []
for attribute in ['Sport','Year','Height','HostRegion','SportType']:
vis = Vis([lux.Clause("Weight"), lux.Clause(attribute)],title="overriding dummy title",mark="line")
vcLst.append(vis)
vc = VisList(vcLst,df)
for v in vc:
assert v.title=="overriding dummy title" # AssertionError
assert v.mark=="line" # AssertionError
Scatterplots with a large number of datapoints take a disproportionately long time to render on the frontend, due to the large number of points we have to draw. We should explore faster rendering options, even if that means the visualizations become non-interactive and static (e.g., just an image). For portability, we should try other rendering options in Altair first. Our current rendering mechanism is Canvas, and it's not entirely clear whether SVG would be faster or slower (see [1], [2]). If not, we could migrate to matplotlib as an alternative rendering backend and create static images that are then sent to the frontend.
Occasionally, the test suite can fail when it fetches data from the same URL too frequently; we should replace the URL link with a local version of cars.csv if possible. In addition, any remote fetch of data should catch HTTP 429 errors and add a sleep timer, so that the test doesn't fail when the query rate is too high. We could also look into dataframe reuse opportunities across tests, so that the dataframe doesn't need to be reloaded every single time.
Currently, the Lux widget is rendered through the IPython.display module, which is not directly tied to the output; this makes the widget show up before the Out[...] in Jupyter. There is a way to display the Jupyter widget through the Output display. However, since LuxDataFrame is overriding Pandas, we would have to suppress the Pandas output in the __repr__ function.
This issue and #56 are related and should be addressed together.
We should create a LuxSeries object to take on the sliced version of the LuxDataFrame, following the guidelines for subclassing DataFrames. We need to pass the _metadata from LuxDataFrame to LuxSeries so that it is preserved across operations (and therefore doesn't need to be recomputed); this is related to #65. Currently, this code is commented out, since LuxSeries causes issues compared to the original pd.Series.
class LuxDataFrame(pd.DataFrame):
....
@property
def _constructor(self):
return LuxDataFrame
@property
def _constructor_sliced(self):
def f(*args, **kwargs):
# adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
return LuxSeries(*args, **kwargs).__finalize__(self, method='inherit')
return f
class LuxSeries(pd.Series):
# _metadata = ['name','_intent','data_type_lookup','data_type',
# 'data_model_lookup','data_model','unique_values','cardinality',
# 'min_max','plot_config', '_current_vis','_widget', '_recommendation']
def __init__(self,*args, **kw):
super(LuxSeries, self).__init__(*args, **kw)
@property
def _constructor(self):
return LuxSeries
@property
def _constructor_expanddim(self):
from lux.core.frame import LuxDataFrame
# def f(*args, **kwargs):
# # adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
# return LuxDataFrame(*args, **kwargs).__finalize__(self, method='inherit')
# return f
return LuxDataFrame
In particular, the original name property of the Series is lost when we implement LuxSeries; see test_pandas.py:test_df_to_series for more details.
Example:
df = pd.read_csv("lux/data/car.csv")
df._repr_html_()
series = df["Weight"]
series.name # returns None (BUG!)
series.cardinality # preserved
We should also add a repr to print out the basic histogram for Series objects.
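A toy sketch of what such a text histogram repr could produce (hypothetical function name and layout, not an existing Lux feature):

```python
def ascii_histogram(values, bins=5, width=20):
    # Hypothetical text-based histogram for a Series __repr__ fallback
    lo, hi = min(values), max(values)
    if hi == lo:
        return f"{lo}: {'#' * width} ({len(values)})"  # constant column
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / step), bins - 1)] += 1
    peak = max(counts)
    rows = [f"{lo + i * step:8.2f} | {'#' * (width * c // peak)} ({c})"
            for i, c in enumerate(counts)]
    return "\n".join(rows)
```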
We have basic checks inside correlation.py and univariate.py to make sure that we don't generate recommendations when there are fewer than 5 points. We should generate a quick overview visualization of the dataframe when users call head or tail.
For the Filter action, when the filter operation involves an inequality, we should support better alternatives. For example, when a user asks for MilesPerGal<15, we could also show MilesPerGal>15 or bin the attribute into quartiles.
I used the following commands to install the Lux widget:
pip install git+https://github.com/lux-org/lux-widget
jupyter nbextension install --py luxWidget
jupyter nbextension enable --py luxWidget
They succeed, but the widget does not appear in the Jupyter notebook. My environment is conda 4.8.5 and Python 3.7.8.
__repr__ is currently empty to facilitate a better display(..) of the luxWidget. We should support a string __repr__ of LuxDataFrame for string operations on the LuxDataFrame. One toy example of where this shows up is when dataframes are put into a list: the string __repr__ is triggered and prints nothing. Ideally, we should have a string version that summarizes the dataframe instead.
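A toy sketch of what such a string __repr__ could return (hypothetical function name and format):

```python
import pandas as pd

def lux_repr(df):
    # Hypothetical plain-text fallback, shown when the widget cannot
    # render (e.g., when the dataframe's repr is triggered inside a list)
    return f"<LuxDataFrame ({df.shape[0]} rows x {df.shape[1]} columns)>"
```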
When no recommendations are generated (e.g., when the dataframe is small but not pre-aggregated, and possibly other cases), we should display a warning that explains why the Lux view is not showing up. Add an advanced ReadTheDocs page explaining the default recommendation logic, including when recommendations are not displayed.
Originally posted by @akanz1 in #110 (comment)
Hey Team Lux,
I just installed Lux and the nbextension as described in the Quick Install section to explore its capabilities, and ran into the following error message after hitting the "Toggle Pandas/Lux" button. import lux runs without any complaint, and running lux.version_info returns (0, 2, 0). Also, when installing, I got Validating: OK for both nbextension commands. I'm running this in the latest version of Chrome, with Python 3.7.8 and pandas 1.1.3.
Occasionally, there might be mistakes in the data type that Lux automatically detects. We should let users override the inferred data type and persist this setting throughout the session (even if the metadata has to be recomputed).
Colored bars and lines currently have a fixed score of 0.2, which is perhaps a bit too high. There might be an issue with the interestingness calculation for non-colored bars, since their values are all close to zero. Ideally, the non-colored bars should be prioritized, since they are simpler and more interpretable.
Some scatterplots have a large number of datapoints and suffer from occlusion (it is hard to distinguish patterns, since datapoints are too dense). We should change the default chart setting to lower the opacity based on the number of datapoints.
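The heuristic could be as simple as a step function over the point count (thresholds and values below are illustrative placeholders, not Lux's defaults):

```python
def default_opacity(n_points):
    # Hypothetical heuristic: reduce opacity as point density grows,
    # so overlapping regions remain distinguishable
    if n_points < 1_000:
        return 1.0
    if n_points < 10_000:
        return 0.5
    return 0.2
```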
We currently rely on Pandas's __finalize__ to propagate the _metadata (including computed metadata, recommendations, and other stored properties). However, dataframes don't always stay the same type throughout various operations: for example, in this issue, when we do a groupby we end up with a GroupBy object, and the _metadata is lost. Similar issues occur when going from a Pandas DataFrame to a Series.
We should find a better strategy for metadata maintenance that doesn't require explicitly calling __finalize__, or ensure that retrieving a metadata property triggers recomputation if it is stale (so that the system is slightly slower but doesn't break when metadata fields are accessed).
df = pd.read_csv("lux/data/car.csv")
groupby_result = df.groupby("Origin").agg("sum")
intermediate = groupby_result.reset_index()
intermediate.cardinality # this returns None
df = intermediate.__finalize__(df)
df.cardinality # cardinality is now populated again
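The "recompute if stale" option could look roughly like this (a toy sketch with hypothetical class and method names): the accessor recomputes metadata on demand instead of returning None after an operation dropped it.

```python
class MetadataMixin:
    # Sketch: accessors trigger recomputation when metadata was lost,
    # e.g., after a groupby, instead of returning None.
    def __init__(self):
        self._cardinality = None

    def compute_metadata(self):
        self._cardinality = {}  # placeholder for the real computation

    @property
    def cardinality(self):
        if self._cardinality is None:  # stale or lost during an operation
            self.compute_metadata()
        return self._cardinality
```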
The parent call in __setitem__ seems to trigger a SettingWithCopyWarning.
# First cell
df = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/olympic.csv?raw=True")
df["Year"] = pd.to_datetime(df["Year"],format="%Y")
from lux.vis.Vis import Vis
vis = Vis(["Weight","Height"],df)
vis
# Next cell
df = vis.data
df["xBin"] = pd.cut(df["Weight"], bins=30)
df["yBin"] = pd.cut(df["Height"], bins=30)
df
Warning:
lux/core/frame.py:56: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
super(LuxDataFrame, self).__setitem__(key, value)