rbavishi / databutler
The DataButler Research Project
License: BSD 2-Clause "Simplified" License
reStructuredText can be limiting, especially when it comes to documenting class attributes.
It is imperative to be able to quickly prototype Jupyter widgets to support the functionality of the project, especially once AutoPandas and Gauss are integrated.
PAT was originally written with attrs==19.1.0. The API has changed a lot since then, mostly for the better.
Backward compatibility is great, and PAT still functions correctly with the latest version, 21.4.0, but the usage of the API does not reflect best practices in many places. Upgrade API usage in PAT to ensure consistency with the rest of the code.
It may be worthwhile to use the loguru library, but the benefits need to be fully assessed. It does have better coloring and formatting support, plus rotating file logs.
get_nl and get_code should each take an optional argument that will be added to the top of the prompt.
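A minimal sketch of the prefixing behavior, assuming prompts are plain strings. The helper name with_prompt_prefix and its parameters are hypothetical and are not part of the existing get_nl or get_code signatures; the idea is only that both functions could route their prompt through one shared helper.

```python
from typing import Optional


def with_prompt_prefix(prompt: str, prefix: Optional[str] = None) -> str:
    """Return `prompt` with `prefix` placed on top of it (hypothetical helper)."""
    if not prefix:
        # No prefix supplied: leave the prompt untouched.
        return prompt
    # Strip trailing whitespace so exactly one newline separates the two parts.
    return f"{prefix.rstrip()}\n{prompt}"
```

Keeping the argument optional means existing callers that pass no prefix see no change in the generated prompt.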
Integrate the interaction-based synthesis engine for Pandas - Gauss (https://github.com/rbavishi/gauss-oopsla-2021) as an API to invoke in a notebook environment.
An ongoing issue with the original Gauss is high memory usage. Use on-demand loading of the knowledge base (oracle) to mitigate this.
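One way on-demand loading could look is sketched below. It assumes, purely for illustration, that the knowledge base can be split into one JSON file per shard inside a directory; the actual storage format used by Gauss may differ, and the class name LazyKnowledgeBase is hypothetical.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Any


class LazyKnowledgeBase:
    """Loads a shard of the knowledge base from disk only when it is requested."""

    def __init__(self, root: str, max_cached_shards: int = 8) -> None:
        self._root = Path(root)
        # Bound the cache so memory usage tracks max_cached_shards
        # rather than the size of the whole knowledge base.
        self._load = lru_cache(maxsize=max_cached_shards)(self._load_shard)

    def _load_shard(self, name: str) -> Any:
        # Assumed layout: <root>/<name>.json (hypothetical).
        with open(self._root / f"{name}.json", "r", encoding="utf-8") as f:
            return json.load(f)

    def get(self, name: str) -> Any:
        """Return the shard called `name`, reading it from disk on first use."""
        return self._load(name)
```

The least recently used shards are evicted once the cache is full, which trades some repeated disk reads for a bounded memory footprint.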
Need to finalize how to represent the corpus of code variants as a graph. This should be able to store changes along with their descriptions, and the vanilla description of a code snippet. Note that we want to represent the corpus implicitly, as something that can be generated, rather than explicitly.
Once the corpus organization is finalized, the next step is to write robust functionality for undertaking campaigns for generating the corpus.
The campaign must support the following:
There is the possibility of using caching to improve build times, as laid out by the article here - https://medium.com/ai2-blog/python-caching-in-github-actions-e9452698e98d
Integrate AutoPandas, i.e., the I/O example-based synthesis engine for Pandas, as a library (https://github.com/rbavishi/autopandas).
The main challenge will be in handling the neural models.
We need a code-processor that can optimize away unnecessary variable names and dataframe columns. For example, convert
def f(df, col1):
    import seaborn as sns
    df1 = df.dropna()
    sns.distplot(df1[col1])
to
def f(df, col1):
    import seaborn as sns
    sns.distplot(df.dropna()[col1])
and
def f(df, col1):
    import seaborn as sns
    df["NewCol"] = df[col1].dropna()
    sns.distplot(df["NewCol"])
to
def f(df, col1):
    import seaborn as sns
    df[col1] = df[col1].dropna()
    sns.distplot(df[col1])
This optimization allows for easy independence of code transformations, which is necessary for representing our corpus of code variants efficiently (linear vs exponential space).
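The space argument can be illustrated with a small sketch. It assumes each change is an independent string-to-string transformation with a description attached; the Change type and iter_variants function are hypothetical names, not part of the existing code. Only the n changes are stored, while the 2^n variants are generated lazily.

```python
from itertools import combinations
from typing import Callable, Iterator, List, Sequence, Tuple

# A change is a (description, transformation) pair; hypothetical representation.
Change = Tuple[str, Callable[[str], str]]


def iter_variants(
    base_code: str, base_desc: str, changes: Sequence[Change]
) -> Iterator[Tuple[str, List[str]]]:
    """Lazily yield (code, descriptions) for every subset of `changes`."""
    for k in range(len(changes) + 1):
        for subset in combinations(changes, k):
            code = base_code
            descs = [base_desc]
            # Changes are assumed independent, so applying them in their
            # stored order is valid for any subset.
            for desc, transform in subset:
                code = transform(code)
                descs.append(desc)
            yield code, descs


changes: List[Change] = [
    ("fill missing values with 0", lambda c: c.replace("df.dropna()", "df.fillna(0)")),
    ("add a title", lambda c: c + "\nplt.title('Distribution')"),
]
variants = list(iter_variants("sns.distplot(df.dropna()[col1])", "plot a distribution", changes))
```

With two stored changes the generator yields four variants; if the changes interfered with each other (for example through a shared temporary variable), each combination would have to be stored explicitly instead.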
Running mined code often requires specific versions of various libraries to function properly. Additionally, the regular dev code relies on libraries whose latest versions are often incompatible with the ones required for running the mined code. Thus, there is a need to be able to run the mined code without disturbing the primary install.
So far, the best idea is to install the other versions to a directory using something like pip install --target=<path-to-dir>. Then, to execute the mined code, we spawn a new process and modify sys.path to have this directory at the front so the correct version is picked up.
There are other methods, such as importlib, but they are hard to get right for big libraries like pandas, which themselves depend on other libraries such as numpy.
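The subprocess approach could be sketched as follows. This assumes the alternate versions were already installed into target_dir with pip install --target; the function name run_with_alt_packages and its parameters are hypothetical.

```python
import subprocess
import sys
import textwrap


def run_with_alt_packages(
    code: str, target_dir: str, timeout: float = 60.0
) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with `target_dir` first on sys.path."""
    # The bootstrap runs in the child process. With `python -c`, sys.argv is
    # ['-c', target_dir, code], so the arguments are read from positions 1 and 2.
    bootstrap = textwrap.dedent(
        """
        import sys
        sys.path.insert(0, sys.argv[1])
        exec(compile(sys.argv[2], "<mined-code>", "exec"), {"__name__": "__main__"})
        """
    )
    return subprocess.run(
        [sys.executable, "-c", bootstrap, target_dir, code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Because the path is changed only inside the child process, the primary install used by the dev code is left untouched. Packages in target_dir shadow the primary ones, but dependencies missing from target_dir would still resolve against the primary install.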
Set up an easy-to-extend framework for adding code-processing transforms, such as unnecessary variable removal, keyword-arg normalization, etc.
Provide the existing VizSmith interface as one mode in the overall Datana design, using Codex as the NL metadata provider rather than the existing comment-mining mechanism.