rbavishi / databutler

The DataButler Research Project

License: BSD 2-Clause "Simplified" License

Python 96.68% JavaScript 0.22% CSS 0.01% Jupyter Notebook 0.10% TypeScript 0.82% Shell 0.44% HTML 1.73%

databutler's Issues

Change documentation style to Google

reStructuredText can be limiting, especially when it comes to documenting class attributes.
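As a rough illustration of what the Google style offers for class attributes, here is a minimal sketch. The class and its fields are hypothetical and are not taken from the codebase.

```python
class SynthesisResult:
    """Holds the outcome of one synthesis request (hypothetical example class).

    Attributes:
        code: The generated source code as a string.
        score: Confidence assigned to the result, between 0 and 1.
    """

    def __init__(self, code: str, score: float) -> None:
        self.code = code
        self.score = score

    def is_confident(self, threshold: float = 0.5) -> bool:
        """Reports whether the score reaches a threshold.

        Args:
            threshold: Minimum score considered confident.

        Returns:
            True if ``score`` is at least ``threshold``.
        """
        return self.score >= threshold
```

The `Attributes:` section sits in the class docstring next to the summary, so attribute documentation does not need separate per-field directives.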

Create a cookiecutter for making custom widgets

It is imperative to be able to quickly prototype Jupyter widgets to support the functionality of the project, especially once AutoPandas and Gauss are integrated.

Upgrade attrs to latest version

PAT was originally written with attrs==19.1.0. The API has changed a lot since then, mostly for the better.

The backward compatibility is great, and PAT still functions correctly with the latest version 21.4.0, but the usage of the API does not reflect best practices in many places. Upgrade API usage in PAT to ensure consistency with the rest of the code.

Configure Logging

It may be worthwhile to use the loguru library, but the benefits need to be fully assessed. It does have better coloring and formatting support, plus rotation file-logging.

Add a task description parameter for code2nl and nl2code API

The get_nl and get_code functions should take an optional argument that will be added to the top of the prompt.
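One way the optional argument could be threaded into prompt construction is sketched below. The helper `build_prompt`, its parameters, and the `Input:`/`Output:` layout are assumptions made for illustration; the actual get_nl and get_code signatures are not shown in this issue.

```python
from typing import List, Optional, Tuple


def build_prompt(
    few_shot: List[Tuple[str, str]],
    query: str,
    task_description: Optional[str] = None,
) -> str:
    """Assembles a few-shot prompt, optionally led by a task description."""
    parts = []
    if task_description:
        # The description goes at the very top, before any examples.
        parts.append(task_description.strip())
    for source, target in few_shot:
        parts.append(f"Input:\n{source}\nOutput:\n{target}")
    # The final entry leaves the output open for the model to complete.
    parts.append(f"Input:\n{query}\nOutput:")
    return "\n\n".join(parts)
```

Defaulting the argument to `None` keeps existing callers unchanged, since the prompt is identical to the old one when no description is given.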

Integrate Gauss

Integrate Gauss, the interaction-based synthesis engine for Pandas (https://github.com/rbavishi/gauss-oopsla-2021), as an API that can be invoked in a notebook environment.

An ongoing issue with the original Gauss is high memory usage. Use on-demand loading of the knowledge base (oracle) to mitigate this.

Add framework for generating and describing code-changes

The framework must allow for easy addition of different classes or strategies of code changes such as removing keyword args, removing function calls, removing assignments etc.
It should be easy to write new code change strategies.
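A minimal sketch of such a framework is given below, assuming a registry of strategy classes that each yield changed code together with a description. All names here (`STRATEGIES`, `register`, `CodeChangeStrategy`, `RemoveKeywordArg`) are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import copy
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Tuple, Type

# Registry of strategy name -> strategy class (hypothetical design).
STRATEGIES: Dict[str, Type["CodeChangeStrategy"]] = {}


def register(name: str):
    """Class decorator that adds a strategy to the registry."""
    def wrap(cls):
        STRATEGIES[name] = cls
        return cls
    return wrap


class CodeChangeStrategy(ABC):
    @abstractmethod
    def generate(self, code: str) -> Iterator[Tuple[str, str]]:
        """Yields (changed_code, description) pairs for the given code."""


@register("remove_keyword_arg")
class RemoveKeywordArg(CodeChangeStrategy):
    def generate(self, code: str) -> Iterator[Tuple[str, str]]:
        tree = ast.parse(code)
        calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
        for call_idx, call in enumerate(calls):
            for kw_idx, kw in enumerate(call.keywords):
                if kw.arg is None:  # skip **kwargs entries
                    continue
                # Work on a copy so that each variant removes exactly one keyword.
                variant = copy.deepcopy(tree)
                target = [n for n in ast.walk(variant) if isinstance(n, ast.Call)][call_idx]
                del target.keywords[kw_idx]
                yield ast.unparse(variant), f"Remove keyword argument '{kw.arg}'"
```

With this layout, adding a new class of change (removing function calls, removing assignments) would mean writing one more subclass and decorating it, without touching the code that enumerates strategies.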

Finalize structure of the corpus and a corpus member

Need to finalize how to represent the corpus of code variants as a graph. This should be able to store changes along with their descriptions, and the vanilla description of a code snippet. Note that we want to represent the corpus implicitly, as something that can be generated, rather than explicitly.

Add campaign functionality for corpus creation

Once the corpus organization is finalized, the next step is to write robust functionality for undertaking campaigns for generating the corpus.

The campaign must support the following:

Create a campaign directory.
Allow the user to create and fill in few-shot examples, along with errors and warnings about mistakes / missing examples.
Use bidirectional consistency for generating high-fidelity descriptions for changes and code snippets.
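The bidirectional-consistency step could look roughly like the sketch below. It assumes that "consistent" means a description, when fed back through code generation, reproduces the original code up to formatting; the issue does not spell out the criterion. The two callables stand in for the code2nl and nl2code APIs and are hypothetical.

```python
import ast
from typing import Callable, Iterable, Optional


def normalized(code: str) -> str:
    """Returns a formatting-insensitive form of the code via its AST."""
    return ast.dump(ast.parse(code))


def consistent_description(
    code: str,
    code_to_nl: Callable[[str], Iterable[str]],
    nl_to_code: Callable[[str], str],
) -> Optional[str]:
    """Returns the first candidate description that round-trips to `code`."""
    want = normalized(code)
    for description in code_to_nl(code):
        try:
            regenerated = nl_to_code(description)
            if normalized(regenerated) == want:
                return description
        except SyntaxError:
            continue  # the regenerated code did not parse; try the next one
    return None
```

Comparing AST dumps rather than raw strings is a deliberately strict but whitespace-tolerant check; a looser notion of equivalence (for example, comparing execution results) may be preferable in practice.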

Speed up CI

Caching could improve build times, as described in this article: https://medium.com/ai2-blog/python-caching-in-github-actions-e9452698e98d

Integrate AutoPandas

Integrate AutoPandas, the I/O example-based synthesis engine for Pandas, as a library (https://github.com/rbavishi/autopandas).

The main challenge will be in handling the neural models.

Add Code Processor for Variable-name and Dataframe-Column Optimization

We need a code-processor that can optimize away unnecessary variable names and dataframe columns. For example, convert

def f(df, col1):
   import seaborn as sns
   df1 = df.dropna()
   sns.distplot(df1[col1])

to

def f(df, col1):
   import seaborn as sns
   df = df.dropna()
   sns.distplot(df[col1])

and

def f(df, col1):
   import seaborn as sns
   df["NewCol"] = df[col1].dropna()
   sns.distplot(df["NewCol"])

to

def f(df, col1):
   import seaborn as sns
   df[col1] = df[col1].dropna()
   sns.distplot(df[col1])

This optimization keeps code transformations independent of one another, which is necessary for representing our corpus of code variants efficiently (linear vs exponential space).
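The variable-name half of this processor might be sketched as follows, assuming the intended result is to reuse an existing name (such as `df`) in place of a freshly introduced one (such as `df1`). The names `RenameVariable` and `reuse_variable` are hypothetical, and the safety analysis that decides when a rename is valid is deliberately left out.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Renames every occurrence of one variable name to another."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def reuse_variable(code: str, old: str, new: str) -> str:
    """Rewrites `code` so that `old` is replaced by the existing name `new`.

    Whether the rename is safe (for example, that `new` is not read again
    with its earlier value) is NOT checked here.
    """
    tree = RenameVariable(old, new).visit(ast.parse(code))
    return ast.unparse(tree)  # requires Python 3.9+
```

A real processor would first need a liveness check on `new` before applying the rewrite; the same applies to the dataframe-column case, which would match string subscripts rather than names.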

Add abstractions for using multiple library versions for executing code in a different process

Running mined code often requires specific versions of various libraries to function properly. Additionally, the regular dev code relies on libraries whose latest versions are often incompatible with the ones required for running the mined code. Thus, there is a need to be able to run the mined code without disturbing the primary install.

So far, the best idea is to install the other versions to a directory using something like pip install --target=<path-to-dir>. Then to execute the mined code, we spawn a new process and modify sys.path to have this directory at the front so the correct version is picked up.

There are other methods like importlib, but they are hard to get right for big libraries like pandas, which themselves depend on other libraries such as numpy.
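The subprocess approach described above can be sketched as follows. The function name `run_with_alternate_libs` and its parameters are hypothetical; `lib_dir` is assumed to be the directory previously populated with pip install --target.

```python
import subprocess
import sys


def run_with_alternate_libs(code: str, lib_dir: str, timeout: float = 60.0) -> str:
    """Runs `code` in a fresh interpreter that prefers packages in `lib_dir`."""
    # The bootstrap puts lib_dir first on sys.path before the mined code runs,
    # so imports resolve to the versions installed there.
    bootstrap = (
        "import sys\n"
        f"sys.path.insert(0, {lib_dir!r})\n"
        f"exec(compile({code!r}, '<mined>', 'exec'))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", bootstrap],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the mined code fails
    )
    return result.stdout
```

Because the path change happens only inside the child interpreter, the primary install and the parent process's already-imported modules are left untouched.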

Add framework for code processors

Set up an easy-to-extend framework for adding code processing transforms, such as unnecessary variable removal, keyword-arg normalization, etc.
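One possible shape for such a framework is a small base class plus a pipeline runner, sketched below. `CodeProcessor`, `SortKeywordArgs`, and `run_pipeline` are hypothetical names, and sorting keywords alphabetically is only one assumed reading of "keyword-arg normalization".

```python
import ast
from abc import ABC, abstractmethod
from typing import List


class CodeProcessor(ABC):
    """Base class: a processor maps an AST to a (possibly rewritten) AST."""

    @abstractmethod
    def process(self, tree: ast.AST) -> ast.AST:
        ...


class SortKeywordArgs(CodeProcessor, ast.NodeTransformer):
    """One possible keyword-arg normalization: order keywords by name.

    Note that reordering changes the evaluation order of the argument
    expressions, which matters only if they have side effects.
    """

    def process(self, tree: ast.AST) -> ast.AST:
        return self.visit(tree)

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # normalize nested calls first
        # Leave calls containing **kwargs untouched, since order can matter there.
        if all(kw.arg is not None for kw in node.keywords):
            node.keywords.sort(key=lambda kw: kw.arg)
        return node


def run_pipeline(code: str, processors: List[CodeProcessor]) -> str:
    """Parses the code once, applies each processor in order, and unparses."""
    tree = ast.parse(code)
    for processor in processors:
        tree = processor.process(tree)
    return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+
```

Passing the AST between processors, rather than source strings, avoids reparsing at every step; a new transform only has to implement `process`.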

Integrate VizSmith as a Mode

Provide the existing VizSmith interface as one mode in the overall Datana design, using Codex as the NL metadata provider rather than the existing comment-mining mechanism.
