lux-org / lux Goto Github PK

Automatically visualize your pandas dataframe via a single print! 📊 💡

License: Apache License 2.0

Makefile 0.03% Python 99.91% JavaScript 0.06%

visualization-tools jupyter python data-science exploratory-data-analysis visualization pandas

lux's Introduction

A Python API for Intelligent Visual Discovery

Lux is a Python library that facilitate fast and easy data exploration by automating the visualization and data analysis process. By simply printing out a dataframe in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset. Visualizations are displayed via an interactive widget that enables users to quickly browse through large collections of visualizations and make sense of their data.

Here is a 1-min video introducing Lux, and slides from a more extended talk.

Check out our notebook gallery with examples of how Lux can be used with different datasets and analyses.
Or try out Lux on your own in a live Jupyter Notebook!

Getting Started

To start using Lux, simply add an extra import statement along with your Pandas import.

import lux
import pandas as pd

Lux can be used without modifying any existing Pandas code. Here, we use Pandas's read_csv command to load in a dataset of colleges and their properties.

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/college.csv")
df

When the dataframe is printed out, Lux automatically recommends a set of visualizations highlighting interesting trends and patterns in the dataset.

Voila! Here's a set of visualizations that you can now use to explore your dataset further!

Next-step recommendations based on user intent:

In addition to dataframe visualizations at every step in the exploration, you can specify to Lux the attributes and values you're interested in. Based on this intent, Lux guides users towards potential next-steps in their exploration.

For example, we might be interested in the attributes AverageCost and SATAverage.

df.intent = ["AverageCost","SATAverage"]
df

The left-hand side of the widget shows the current visualization, i.e., the current visualization generated based on what the user is interested in. On the right, Lux generates three sets of recommendations, organized as separate tabs on the widget:

Enhance adds an additional attribute to the current selection, essentially highlighting how additional variables affect the relationship of AverageCost and SATAverage. We see that if we breakdown the relationship by FundingModel, there is a clear separation between public colleges (shown in red) and private colleges (in blue), with public colleges being cheaper to attend and with SAT average of lower than 1400.
Filter adds a filter to the current selection, while keeping attributes (on the X and Y axes) fixed. These visualizations show how the relationship of AverageCost and SATAverage changes for different subsets of data. For instance, we see that colleges that offer Bachelor's degree as its highest degree show a roughly linear trend between the two variables.
Generalize removes an attribute to display a more general trend, showing the distributions of AverageCost and SATAverage on its own. From the AverageCost histogram, we see that many colleges with average cost of around $20000 per year, corresponding to the bulge we see in the scatterplot view.

See this page for more information on additional ways for specifying the intent.

Easy programmatic access and export of visualizations:

Now that we have found some interesting visualizations through Lux, we might be interested in digging into these visualizations a bit more or sharing it with others. We can save the visualizations generated in Lux as a static, shareable HTML or programmatically access these visualizations further in Jupyter. Selected Vis objects can be translated into Altair, Matplotlib, or Vega-Lite code, so that they can be further edited.

Learn more about how to save and export visualizations here.

Quick, on-demand visualizations with the help of automatic encoding:

We've seen how Viss are automatically generated as part of the recommendations. Users can also create their own Vis via the same syntax as specifying the intent. Lux is built on the philosophy that users should always be able to visualize anything they want, without having to think about how the visualization should look like. Lux automatically determines the mark and channel mappings based on a set of best practices. The visualizations are rendered via Altair into Vega-Lite specifications.

from lux.vis.Vis import Vis
Vis(["Region=New England","MedianEarnings"],df)

Powerful language for working with collections of visualizations:

Lux provides a powerful abstraction for working with collections of visualizations based on a partially specified queries. Users can provide a list or a wildcard to iterate over combinations of filter or attribute values and quickly browse through large numbers of visualizations. The partial specification is inspired by existing work on visualization query languages, including ZQL and CompassQL.

For example, we are interested in how the AverageCost distribution differs across different Regions.

from lux.vis.VisList import VisList
VisList(["Region=?","AverageCost"],df)

To find out more about other features in Lux, see the complete documentation on ReadTheDocs.

Installation & Setup

Note: Lux's official package name is lux-api (not lux). After installing the package, remember to run the setup instructions for your notebook IDE, e.g., jupyter notebook and jupyter lab.

To get started, please follow both the installation and setup instructions in your command line. lux-api can be installed through PyPI or conda-forge.

pip install lux-api

If you use conda, you can install lux-api via:

conda install -c conda-forge lux-api

Both the PyPI and conda installation include includes the Lux Jupyter widget frontend, lux-widget.

Setup in Jupyter Notebook, VSCode, JupyterHub

To use Lux with any Jupyter notebook-based frontends (e.g., Jupyter notebook, JupyterHub, or VSCode), activate the notebook extension:

jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget

If the installation happens correctly, you should see two - Validating: OK after executing the two lines above. Note that you may have to restart the Jupyter Notebook server to ensure that the widget is displaying correctly.

Setup in Jupyter Lab

Lux is compatible with both Jupyter Lab version 2 and 3. To use Lux in Jupyter Lab, activate the lab extension:

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install luxwidget

Note that JupyterLab and VSCode is supported only for lux-widget version >=0.1.2, if you have an earlier version, please upgrade to the latest version of lux-widget. Lux has only been tested with the Chrome browser.

If you encounter issues with the installation, please refer to this page to troubleshoot the installation. Follow these instructions to set up Lux for development purposes.

Support and Resources

Lux is undergoing active development. If you are using Lux, we would love to hear from you!

Drop us a note here to share your experiences — any feedback, suggestions, and contributions are welcome!

Links to additional resources:

Follow us on Twitter or sign up to our mailing list to stay tuned for upcoming releases and updates.
Visit ReadTheDoc for more detailed documentation.
Try out these hands-on exercises or tutorials on Binder. Or clone and run lux-binder locally.
Join our community Slack to discuss and ask questions.
Check out our paper for more details on how Lux works under the hoods.
Report any bugs, issues, or requests through Github Issues.

lux's People

Contributors

Stargazers

Watchers

Forkers

thyneb19 jaywoo123 caitlynachen westernguy2 19thyneb smritim dorisjlee prastogi11 jinimukh justinrwong justinyiwang cjachekang kmr20 leo23 allensmile jlian401 demomo-20191204 rahmansunny071 jerrysong1324 akanz1 kgzaker ccubc vishalbelsare lenamax2355 jrdzha qutubkhan piyushg9794 jbdatascience mrzombie69232 rishirelan hieuqtran georgi-petkov maxcodextc sahar90o ratanboddu newdevops2020 andreluispinto markusbuchholz rajesh16702 scbasaraner forkedrepositories doubianimehdi coulbe ajayakumar1983 trawely gunjanrt04 chandrachud23 anhmike vineethraj510 hamed225 ahlas milan-chicago aqua-regia m7013 touji2j abraich tawabshakeel alfredogoni adityaem ui-frontend alcefilho aliborji piresluciana maybeee18 aegp17 joseph-x-li jay-chakalasiya xrosliang shoman2 qqq-tech timestap dj-khandelwal wztt-xh micahtyong cmudig donghang11 priyansdesai seanlin2000 anupambaranwal mantejpanesar shrinivas-io gilliardmorandim yumeone patrickocr linuodoc specialuse whmz wenxiang-li jorson-chen weiplanet sophiahhuang mu-l paes-dawrlog echinoids csolitaire jgpic uzbekdev1 moh-yakoub aryaamadhur projectafey

lux's Issues

Add Pivot action to support identity case

df.set_context([lux.Spec(attribute = "Horsepower"),lux.Spec(attribute = "Horsepower")])
df

Right now, we penalize views that have duplicate attributes with an interestingness score of -1, which is why we don't have Enhance and Filter here. This would actually be one of the few places where Pivot might be helpful to help users to "get unstuck".

Supporting MultiIndex (level >2)

Lux currently only supports 1st level index and falls back to default Pandas display if there is an index level higher than 2. MultiIndex can be created via crosstab and explicitly via pd.MultiIndex.

Example 1:

# Example from http://www.datasciencemadesimple.com/cross-tab-cross-table-python-pandas/
d = {
    'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
            'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
    'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
            'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
     
    'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
               'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
   'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}
 
df = pd.DataFrame(d,columns=['Name','Exam','Subject','Result'])

pd.crosstab([df.Exam,df.Subject],df.Result)

Example 2:

# Example borrowed from https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
some_other_vals = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.DataFrame(populations, columns=["population"], index=index)
index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)
pop

This ends up creating many possible visualizations and is better visualized as a faceted chart. Altair also does not currently support MultiIndex, so we would need to explicitly reformat the data to render the vis.

High-cardinality bar chart text color overridden by color channel encoding

For high cardinality bar chart, we have a text indicator that shows "X more..." bars, however the text attribute color:

text = alt.Chart(visData).mark_text(
	x=155, 
	y=142,
	align="right",
	color = "#ff8e04",
	fontSize = 11,
	text=f"+ 230 more ..."
)

is overridden by the color encoding:

chart = chart.encode(color=alt.Color('occupation',type='nominal'))

which is generated from AltairChart:encode_color, the encode color is called last since that is only added for visualizations that have a color attribute.

This example can be found in the census dataset.

df = pd.read_csv("../lux-datasets/data/census.csv")
df.set_intent(["education"])
df

Show widget in Out[...] to conform to Pandas semantics

Currently, the Lux widget is rendered through the IPython.display module, which is not directly tied to the output. This makes the widget show up before the Out[...] in Jupyter.

There is a way to display the Jupyter widget through the Output display. However, since the LuxDataframe is overriding Pandas. We have to suppress the Pandas output in the __repr__ function.

This issue and #56 are related and should be addressed together.

API support for type override

Occasionally, there might be mistakes in the data type that is automatically detected in Lux, we should support the ability to let users override the inferred data type and persist this inferred setting throughout the session (even if the metadata has to be recomputed again).

Avoid showing histograms for low cardinality quantitative attributes

There is a bug related to quantitative attributes that have low cardinality. For example, we create a column of values where all the values are 4.0. There is a warning message resulting from a divide-by-zero error and a histogram that looks, since the range of the computed min/max field is zero. This issue will be resolved if we avoid computing histograms for quantitative values that have low cardinality.

import pandas as pd
import lux
df = pd.read_csv("../../lux/data/car.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetype
df["Units"] = 4.0
df

Supporting export/copyable intent

Vis objects are currently exported as code via to_Altair or to_VegaLite, but the Lux syntax is not exposed. As suggested by @adityagp, we should extend Lux with a feature that allows users to copy the code that generates the intent via a UI button so that users can paste, edit their intent.

Better data type detection for pre_aggregated, indexed dataframes

When a dataframe is pre-aggregated, our type detection based on cardinality often fail to detect the type correctly. For example, when the dataset size is small (often the case when data is pre-aggregated), nominal fields would get recognized as a quantitative type.

df = pd.read_csv("lux/data/car.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetype
a = df.groupby("Cylinders").mean()

a.data_type

As a related issue, we should also support the detection of types for named index, for example, in this case, Cylinders is an index, so its data type is not being computed.

Embedding Sphinx Extension

This issue was opened in light of trying to embed lux widgets in the documentation. We've tried a variety of solutions listed below, but none of them were able to either import or embed the widgets.
Our most recent version can be found on this branch, and was based on Altair's documentation. We were able to show code block, but not a chart (might be worth investigating more later on).

We have tried various approaches on this front, along with @westernguy2 and @jrdzha .

nbsphinx approach:

Code in our nbsphinx branch
(extension to display entire notebook)
Pandas was able to display correctly
Save widgets with notebook
Nbsphinx_widgets_path
-also limited to embedding full ipython notebooks

jupyter-sphinx approach:

Code in our jupyter-sphinx branch
Tried using old version (to use 'jupyter_sphinx.embed_widgets', .. ipywidgets-display::)
- Seems to be able to build (no errors)
- Shows input, but no output
limited documentation, so hard to reproduce

Other ipywidgets that also manually building sphinx ext

bokeh plot: compiler was specific to bokeh plot, so we decided not to go for this one
nbinteract: Generate Interactive Web Pages - From Jupyter Notebooks: unable to build custom widgets

Manually exporting the lux widgets in order to display it on docs (html)

unable to compile/import lux-widget

As a note for the future, we might need to look into ways that we can make a static rendering of the widget without the need of a Jupyter backend. This requires us to package all the current dependencies into the export. This will also help with embedding to HTML or sharing of Lux widgets.

LuxDataFrame's setitem super call throws SettingWithCopyWarning

The parent call for __setitem__ seems to trigger a SettingWithCopyWarning.

# First cell
df = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/olympic.csv?raw=True")
df["Year"] = pd.to_datetime(df["Year"],format="%Y")

from lux.vis.Vis import Vis
vis = Vis(["Weight","Height"],df)
vis

# Next cell
df = vis.data
df["xBin"] = pd.cut(df["Weight"], bins=30)
df["yBin"] = pd.cut(df["Height"], bins=30)
df

Warning:

lux/core/frame.py:56: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super(LuxDataFrame, self).__setitem__(key, value)

Supporting static graphics rendering

Scatterplots with large number of data points take a disproportionately long time to render on the frontend, due to the large number of points we have to draw. We should explore other rendering options that are faster, even if that means that the visualizations are non-interactive and static (e.g., just an image). For portability, we should try other rendering options in Altair first. Our current rendering mechanism is Canvas, and it's not entirely clear whether SVG would be faster or slower (see [1],[2]). If not, we could migrate to matplotlib as an alternative rendering backend and create static images that is then sent to the frontend.

Replace Stacked Barchart with Group Bar chart with shared axes

Currently the stacked bar chart is used when 2 categorical variables and 1 quantitative variable are specified. This works when the aggregate function is sum, but the chart is not very interpretable when another aggregate type is used.

Current Chart used

Alternatives

While the heatmap chart is the more space efficient alternative, it is a bit harder to interpret compared to the grouped bar chart. We will implement the grouped bar chart, but will need to modify the grouped bar chart to share axes, similar to matplotlib's shared axes grouped bar chart.
Again for compactness reasons, we may only chose to apply this when the color attribute has less than 3 distinct values.

Lux errors when entire column is NaN

import pandas as pd
import lux
df = pd.read_csv("../../lux/data/college.csv")
import numpy as np
df["test"]=np.nan

Assign unique color to different values in unsorted bar charts

Currently, all bars in the bar chart has the same color, this makes comparison across items challenging. In particular, users need to read the bar labels to realize that the bars are sorted to match other charts for comparison. We should assign a consistent set of colors for low cardinality categorical attributes to facilitate ease of comparison. We can use a standard color pallette for the particular visualization library (Vega, matplotlib).

The update bars would look something like this, where all values of Cylinder=4 is green, all values of Cylinders of 8 is yellow, etc.

Remove excessive computation of metadata in setitem

Currently, __setitem__ computes the metadata when called

def __setitem__(self, key, value):
	super(LuxDataFrame, self).__setitem__(key, value)
	self.compute_stats()
	self.compute_dataset_metadata()

However, when _repr_html_ is called, the metadata is computed many times more than is necessary.

Removing this metadata computation causes Lux tests to fail.

Autocomplete for LuxDataFrame functions

Currently, autocomplete for LuxDataFrame is supported for methods for both pandas and LuxDataFrame. However, the autocomplete for class properties (i.e. class attributes) are not showing up. The @property decorator approach doesn't work since it leads to maxrecursionerror, possibly because of subclassing.

VisList properties overidden upon `refresh_source` initialization

When properties are specified in the Vis and fed inside the VisList, the refresh_source goes through the parser, validator and compiler. At the compiler stage, the Vis properties gets overridden by the automatically determined values. We should tolerate specified title, mark and other Vis properties to override the automatically determined values.

This is an example test that evaluates this, the last two line fails because the title and mark takes on the new values.
This example should be added to the test suite once we resolve this issue.

df = pd.read_csv("lux/data/olympic.csv")
df["Year"] = pd.to_datetime(df["Year"], format='%Y') # change pandas dtype for the column "Year" to datetype
from lux.vis.VisList import VisList
from lux.vis.Vis import Vis
vcLst = []
for attribute in ['Sport','Year','Height','HostRegion','SportType']: 
    vis = Vis([lux.Clause("Weight"), lux.Clause(attribute)],title="overriding dummy title",mark="line")
    vcLst.append(vis)
vc = VisList(vcLst,df)
for v in vc: 
    assert v.title=="overriding dummy title" # AssertionError
    assert v.mark=="line" # AssertionError

Incorporate pattern similarity

Incorporate pattern similarity into registered action, or show an example of how user-defined action can easily support custom VisList and scoring functions. Adapt existing code in lux/action/similarity.py.

Display warning when no recommendations are generated

When no recommendations are generated (e.g., when dataframe is small but not preaggregated, possibly other cases), we should display a warning that explains why the Lux view is not showing up.

Add an advanced ReadTheDoc page explaining default recommendation logic, including when recommendations are not displayed.

Originally posted by @akanz1 in #110 (comment)

Better visualization for high cardinality bar charts

Currently, our bar chart visualization tries to balance showing the overall distribution and min/max over individual values, the issue is that the charts are summarized or squashed.

✅ easily distinguish min & max items
❌ charts summarized/squashed

Other alternatives include 1) overplotting, 2) truncating to top-k, and 3) scrollable design.
Each of them come with their own pros/cons.

One possible design is having a truncated top-k view that can be "expanded" into a scrollable view.

Unsupported Category dtype as_type

There is a bug when using Lux with this example from Datashader.

import pandas as pd
import numpy as np
from collections import OrderedDict as odict

num=10000
np.random.seed(1)

dists = {cat: pd.DataFrame(odict([('x',np.random.normal(x,s,num)), 
                                  ('y',np.random.normal(y,s,num)), 
                                  ('val',val), 
                                  ('cat',cat)]))      
         for x,  y,  s,  val, cat in 
         [(  2,  2, 0.03, 10, "d1"), 
          (  2, -2, 0.10, 20, "d2"), 
          ( -2, -2, 0.50, 30, "d3"), 
          ( -2,  2, 1.00, 40, "d4"), 
          (  0,  0, 3.00, 50, "d5")] }

df = pd.concat(dists,ignore_index=True)
df["cat"]=df["cat"].astype("category") #  If commented, the df.intent=vis line doesn't break

df  #Select and export the scatterplot vis (big circular blob)

vis = df.exported[0]

df.intent = vis 
df # This errors on the key 'cat' which is a categorical

We should look into supporting Categorical data types in Pandas. More importantly, we should look into whether bugs show up when we perform astype operations for other data types.

Lux Errors when `set_index`

df = pd.read_csv("../../lux/data/state_timeseries.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.set_index(["Date"])

This is happening because the executor expects a flat table and pre_aggregate is inferred as False for this table.

Date Visualization Bug in Flights Dataset

X axis for temporal attribute looks off for the flights dataset.

Interestingness Ranking of colored v.s. no colored bars

Colored bars and lines currently have a fixed score of 0.2, which is perhaps a bit too high. There might be an issue with the interestingness calculation of the no colored bars since their values are all close to zero. Ideally, the no colored bars should be prioritized since they are simpler and more interpretable.

Modify code for exported Altair vis code to use `inspect.getsource`

Currently, the to_Altair functionality relies on chart.code, which is manually created via string concatenation. This approach is hard to maintain across changes. We can rewrite this by leveraging the inspect.getsource function to obtain the exported code.
For example:

def change_color_add_title(chart):
    chart = chart.configure_mark(color="green") # change mark color to green
    chart.title = "Custom Title" # add title to chart
    return chart

import inspect
print (inspect.getsource(change_color_add_title))

Better repr for LuxDataFrame

__repr__ is currently empty to facilitate better display(..) of the luxWidget. We should support string __repr__ of LuxDataFrame for string operations on the LuxDataFrame.
One toy example of where this shows up is when dataframes are put into a list, the string __repr__ is triggered:

Ideally, we should have a string version that looks more like this:

Add mypy to Lux for static checking

https://github.com/python/mypy

Automatic binned ranges for inequality filters

For the Filter action, when the filter operation involves inequality, we should support better alternatives. For example when a user as for MilesPerGal<15, we could also show MilesPerGal>15 or bin it to different quartiles.

Better warning message for Vis and VisList

We should display a better warning message when the user specifies an intent that indicates more than one item, but puts it into Vis instead of VisList.

For example, if I do:

Vis(["AverageCost","SATAverage", "Geography=?"],df)

The error that I get is not very interpretable. We should warn users that the intent should be used with a VisList as shown in the screenshot below. If possible, we might even map the intent down to a VisList so that we display something to the user.

Automatic Type Detection for Datetime

Currently, we only support automatic detection of datetime if the variable name contains ["month", "year","day","date","time"]. We should automatically detect if a string data type is datetime-like and convert it to temporal type.

Metadata propagation not reliable if object type changes

We currently rely on Pandas's finalize to propagate the _metadata (including computed metadata, recommendations, and other stored properties). However, there are situations where dataframes don't always stay the same type throughout various operations. For example, in this issue, when we do group by. We end up with a GroupBy object, and the _metadata is lost. Similar issues will occur when going from Pandas Dataframe --> Series.
We should find a better strategy for metadata maintenance without having to explicitly pass in finalize, or ensure that when metadata properties are retrieved that they trigger a metadata recomputation if unfresh (so that the system is slightly slower but doesn't break when the metadata fields are accessed).

df = pd.read_csv("lux/data/car.csv")
groupby_result = df.groupby("Origin").agg("sum")
intermediate = groupby_result.reset_index()
intermediate.cardinality # this returns None
df = intermediate.__finalize__(df)
df.cardinality # cardinality is now populated again

Extend to different visualization renderer

Supporting vis renderer other than Altair, such as matplotlib or plotly

Dataframe differences when filter to small table

When dataframes are filtered to a resulting table that is somewhat small (e.g. less than 15 datapoints or so), then highlight the difference in the previous dataframe (e.g. df._prev) and the items in the resulting table.

TODOS:

Record dataframe history upon a filter operation (iloc, loc, filter, and other filtering operation in Pandas). Identify which functions to override, note that not all of these functions are inside frame.py (e.g., _LocIndexer inside indexing.py)
Highlight syntax for instances in a dataframe (#84)

Multiple Filter Support

Currently when multiple filters are applied it is unclear if an OR or AND is applied. Not enough explanations and information is displayed on the visualization to indicate this.
For example, even though clauses are combined via conjunction, in this example, it seems like only one of the filters is being applied.

df.intent = ["Region=New England", "Region=Southeast", "Region=Far West"]
df

More work needs to be done to extend the language for supporting OR.

Lux recommendations not displayed for df.head and tail

We have basic checks inside correlation.py and univariate.py to make sure that we don't generate recommendations when there is less than 5 points. We should generate a quick overview visualization of the dataframe when users do head and tail.

Running Pandas test suite to ensure Pandas function coverage

Running Pandas test suite to ensure that metadata and recommendations maintained for a variety of Pandas functions.

Exported code static placeholder `df` variable breaks if dataframe not named df

There is a placeholder value df to generate the exported code.

This breaks if we name the dataframe as anything other than df, for example:

data = pd.DataFrame({"label" : ["a","b","c","d","e"], 
                     "value2" : [25,30,35,40,45],
                     "value" : [25,30,35,40,45]})
data

vis = data.get_exported()[0]
print (vis.to_Altair())

This issue should be resolved if we resolve issue #37.

Improve type detection mechanism for IDs

In Lux, we detect attributes that look like an ID and avoid visualizing them.

There are several issues related to the current type detection mechanisms:

The function check_if_id_like needs to be improved so that we are not relying on attribute_contain_id check too much, i.e. even if the attribute name does not contain ID but looks like an ID, we should still label it as an ID. The cardinality check almost_all_vals_unique is a good example since most ID fields are largely unique. Another check we could implement is checking that the ID is spaced by a regular interval (e.g., 200,201,202,...), this is somewhat of a weak signal, since it not a necessary property of ID.

BUG: We only trigger ID detection currently if the data type of the attribute is detected as an integer (source). We should fix this bug so that string attributes that are ID like (e.g., a CustomerID in the Churn dataset like "7590-VHVEG") are also detected as IDs.

Some test data can be found here, feel free to find your own on Kaggle or elsewhere. For a pull request, please include tests to try out the bugfix on several datasets to verify that ID fields are being detected and that non-ID fields are not detected.

Error displaying widget: model not found

Hey Team Lux,

i just installed Lux and the nbextension as described in the Qiuck install section to explore its capabilities and ran into the following error message after hitting the "Toggle Pandas/Lux" button.

import lux runs without any complaint and running lux.version_info returns (0, 2, 0). Also, when installing i got Validating: OK for both nbextension commands.

I'm running this in the latest version of chrome, python 3.7.8 and pandas 1.1.3.

No module named 'luxwidget' when installing and activating the api

The console leaves me a "ModuleNotFoundError: No module named 'luxwidget'" when i try to run "jupyter nbextension install --sys-prefix --symlink --overwrite --py luxwidget" in cmd.

Better default opacity setting to prevent scatterplot datapoint occlusion

Some scatterplots have large number of datapoints and suffer from the problem of occlusion (hard to distinguish patterns since datapoints are too dense). We should change the default chart setting to make opacity lower based on number of datapoints.

Better support for Pandas.Series

We should create a LuxSeries object to take on the sliced version of the LuxDataframe, following guidelines for subclassing DataFrames. We need to pass the _metadata from LuxDataFrame to LuxSeries so that it is preserved across operations (and therefore doesn't need to be recomputed), related to #65. Currently, this code is commented since LuxSeries is causing issues compared to the original Pd.Series.

class LuxDataFrame(pd.DataFrame):
        ....
	@property
	def _constructor(self):
		return LuxDataFrame

	@property
	def _constructor_sliced(self):
		def f(*args, **kwargs):
			# adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
			return LuxSeries(*args, **kwargs).__finalize__(self, method='inherit')
		return f

class LuxSeries(pd.Series):
	# _metadata =  ['name','_intent','data_type_lookup','data_type',
	# 			 'data_model_lookup','data_model','unique_values','cardinality',
	# 			'min_max','plot_config', '_current_vis','_widget', '_recommendation']
	def __init__(self,*args, **kw):
		super(LuxSeries, self).__init__(*args, **kw)
	@property
	def _constructor(self):
		return LuxSeries

	@property
	def _constructor_expanddim(self):
		from lux.core.frame import LuxDataFrame
		# def f(*args, **kwargs):
		# 	# adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
		# 	return LuxDataFrame(*args, **kwargs).__finalize__(self, method='inherit')
		# return f
		return LuxDataFrame

In particular the original name property of the Lux Series is lost when we implement LuxSeries, see test_pandas.py:test_df_to_series for more details.
Example:

df = pd.read_csv("lux/data/car.csv")
df._repr_html_()
series = df["Weight"]
series.name # returns None (BUG!)
series.cardinality # preserved

We should also add a repr to print out the basic histogram for Series objects.

New Action: Text visualizations

Currently, text attributes are recognized as categorical attributes and displayed as bar charts. We should add a new datatype for text attributes, perhaps distinguishing between short: 1~2 words (more like labels, categories), long: 2+ (more like reviews, sentences, etc). In addition, we can add an action that visualizes text fields, such as word clouds, N-grams, etc.
This article documents some possible visualization types for text data.

Widget can not show in jupyter notebook

I used follow commend to instal lux widget.
pip install git+https://github.com/lux-org/lux-widget
jupyter nbextension install --py luxWidget
jupyter nbextension enable --py luxWidget

It's ok but the widget is not appear in Jupyter notebook. My environment is conda 4.8.5. and Python is 3.7.8.

Dealing with ID-like fields

We should have a settings panel to hide away ID-like fields (fields with cardinality that is close to the number of datapoints or are very evenly spaced). Incorporating ID-like fields in Correlation is problematic because everything involving the ID field would not be very informative.
This is an example from the instacart-market-basket-analysis Kaggle notebook.

HTTP Error 429: too many requests causes tests to fail

Occasionally, the test suite can fail when it fetches the data from the same URL too frequently, we should replace the URL link with local versions of the cars.csv if possible. In addition, any remote fetches of the data should catch the HTTP 429 error and add a sleeping timer to ensure that the test doesn't fail when the query rate is too high.
We could also look into dataframe reuse opportunities across tests so that the dataframe doesn't need to be reloaded every single time.

Highlight syntax for intent specification

Add capability in intent to highlight a list of specified data values/tuples.

Sphinx Autodoc doesn't create correct hyperlinks for modules

Any Sphinx autogenerated links to Lux modules gets hyperlinked to the bottom of the page.

For example:

At this link, if you press lux.action, you get directed to the bottom of the page, specifically here.

For some reason, all links to modules go directly to the "Module contents" section of a page, which happens to be at the bottom.

This issue was brought up at this StackOverflow post, but the fix is for the sphinx library command rather than something we can add to conf.py or one of the rst files. However, that may be a good place to start when looking for a fix.

Improve general sampling strategy in PandasExecutor

The current sampling strategy is crude and largely based on random sampling. We should investigate the performance degradation of Lux across various large datasets to select better sampling strategies, as well as exposing tunable parameters in the API to allow users to adjust for different sampling parameters and strategies. Add ReadTheDoc page explaining default sampling strategy and how to tune them in Lux.

Incorporate code style formatters in dev workflow

Update development workflow and CONTRIBUTING.md to incorporate flake8 and black for refactoring and improving code style.