datajoint / datajoint-python
Relational data pipelines for the science lab
Home Page: https://datajoint.com/docs
License: GNU Lesser General Public License v2.1
Table definitions are given in Relvar class docstrings. Port the latest table declaration parser. If a relvar object cannot find its table in the database, it shall declare the table upon the first query to it.
... and make sure it agrees with the licenses of the libraries we use.
At the moment we are at 47%, which is clearly unacceptable.
Just as in the MATLAB implementation, having a colon (:) in enum values causes declaration errors, but only when a default value is given for the field.
The current implementation will fail if a newly defined table has a dependency on a table that already exists in a schema bound to a module but does not have a corresponding class definition in that module. This is a rather convoluted but certainly possible situation that must be addressed.
For a concrete example, if you are defining a new table called Experiment in module mouse that refers to another existing table called microscopes in the database mouse_setups, which is already bound to a module called setups, as here:
definition = """
mouse.Experiment (manual)
-> setups.Microscopes
"""
then this will fail to create an appropriate table relation representing setups.Microscopes.
Automatically populated tables should never be inserted into from outside a populate call. Calling populate should enable insert; exiting populate should prohibit insert.
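A minimal sketch of how this gating could work, using hypothetical names (_insert_allowed, _make_tuples) rather than the actual implementation:

class AutoPopulated:
    # toy model: inserts are allowed only while populate() is running
    def __init__(self):
        self._insert_allowed = False  # hypothetical flag; prohibited by default
        self.rows = []

    def insert(self, row):
        if not self._insert_allowed:
            raise PermissionError(
                "cannot insert into an auto-populated table outside populate()")
        self.rows.append(row)

    def populate(self, keys):
        self._insert_allowed = True       # calling populate enables insert
        try:
            for key in keys:
                self._make_tuples(key)    # user code calls self.insert(...)
        finally:
            self._insert_allowed = False  # exiting populate prohibits insert

    def _make_tuples(self, key):
        self.insert({'key': key})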
restricted delete without cascading
dj.Relational.__str__ should display a representation of the contents of the relation.
port the latest implementation of the automatic table population from the MATLAB version.
We should have this. Other ideas on the topic are:
Some handling of data requires us to perform operations like project and join on the fetched data structure together with a data structure passed in by the user. Our current go-to data structure is the numpy record array, which has no such methods available (at least by default). On the MATLAB side, join is also not available by default but is provided by datajoint as dj.struct.join. We could do the same for record arrays, but the pandas package already provides the DataFrame object with all such methods implemented. I believe pandas is a pretty standard data analysis package in Python numerical computing, so maybe it wouldn't be a bad idea for us to use pandas and its data structures (DataFrame and perhaps Series) directly in our implementation.
It would be nice if we could use that for fetching.
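As a hedged sketch of the idea (field names below are made up; the arrays stand in for fetch() results):

import numpy as np
import pandas as pd

mice = np.array([(1, 'C57BL/6'), (2, 'DBA/2')],
                dtype=[('animal_id', 'i4'), ('genetic_line', 'U16')])
sessions = np.array([(1, '2015-05-01'), (1, '2015-05-02'), (2, '2015-05-03')],
                    dtype=[('animal_id', 'i4'), ('session_date', 'U10')])

# join on the shared key -- the operation dj.struct.join provides in MATLAB
joined = pd.DataFrame(mice).merge(pd.DataFrame(sessions), on='animal_id')

# project down to selected attributes
print(joined[['animal_id', 'session_date']])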
Once all the new features are added, I think datajoint should be submitted to PyPI, so it can be installed by just typing pip install datajoint.
More information on how to submit a package to the PyPI repository can be found here.
This will also require us to think about how we are going to version the Python interface of datajoint. I guess it would be easiest to just use the same major and minor version numbers as in MATLAB to indicate SQL data structure compatibility.
Implementation of features for AutoPopulate depends on the implementation of Base. In particular, a class that derives from AutoPopulate (which is an abstract class now) will function only if it is also a Base derivative - basically, we expect the subclass to inherit from Base and AutoPopulate simultaneously. This, I think, is conceptually messy. Since we really don't expect the subclass not to be a Base derivative for the AutoPopulate subclass to function, I think it'd make more sense to simply make AutoPopulate itself an abstract subclass of the Base class. This way, when one wants to implement a table with populate functionality, the class has to inherit only from AutoPopulate.
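A minimal sketch of the proposed hierarchy (names are illustrative, not the actual implementation):

import abc

class Base:
    def insert(self, row):
        pass  # table access functionality lives here

class AutoPopulate(Base, abc.ABC):
    @abc.abstractmethod
    def _make_tuples(self, key):
        pass  # compute and insert the tuples for one key

    def populate(self):
        pass  # drives _make_tuples over the keys that still need computing

# user classes then need only one parent:
class FilteredTrace(AutoPopulate):
    def _make_tuples(self, key):
        pass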
References to other tables in the same module (and thus the same schema) should allow the module name to be skipped in the foreign key definition. So
->own_module.TableB
can be specified simply as
->TableB
As I worked through the massive changes in the logic introduced by the last merge, I noticed that the new declaration logic doesn't work well with the previous logic used for resolving module references (e.g. foreign key references) in the table definition string.
When a table definition refers to another table in another schema, as in the following case:
definition = """
A.Subject (manual) # subjects defined in schema A
-> B.Setups
"""
Then this reference to module B has to be resolved appropriately. Unless B is a module known by that very name to the Python interpreter (that is, you actually have B.py at the top level, not inside any package), B is not descriptive enough to be resolved down to a module that holds the definition of the target table B.Setups. To get around this problem, I previously introduced a three-step resolution strategy that utilizes information about the module holding this definition (in this case, the module A). The steps are as follows:
1. Inside the module A, look to see if there is a module imported by the name B. This means that if inside the module A I have a statement like import v1_project.schemata.B as B (and therefore there is a local variable B that is a module object), then the module name B in the table definition string resolves to the module v1_project.schemata.B, and the table Setups is looked up there.
2. If A actually resides in a package (so that A is really package.A), look for a module named B in the same immediate parent package. Thus it'll check for package.B and, if found, use that.
3. Assume B is a globally accessible module name and attempt to import it. From an organization and distribution point of view, I think it is simply unreasonable to assume that all modules with table definitions are top-level modules.
The point here is that the resolution of any reference to a module other than the one containing the table definition (i.e. -> B.Setups) requires the information provided by the module holding the definition (and the package it belongs to).
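To make the three steps concrete, here is a sketch of the strategy as a standalone function; the helper names are assumptions and the real declaration code differs:

import importlib
import sys
import types

def resolve_module(ref_name, defining_module_name):
    # resolve e.g. 'B' from '-> B.Setups' relative to the defining module
    defining = sys.modules[defining_module_name]
    # step 1: a module imported under that name inside the defining module
    candidate = getattr(defining, ref_name, None)
    if isinstance(candidate, types.ModuleType):
        return candidate
    # step 2: a sibling module in the same immediate parent package
    if defining.__package__:
        try:
            return importlib.import_module('.' + ref_name, defining.__package__)
        except ImportError:
            pass
    # step 3: assume a globally accessible module name
    return importlib.import_module(ref_name)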
An alternative, of course, is to make the module reference in the definition explicit, so that in the above case, rather than -> B.Setups, it would read -> v1_project.schemata.B.Setups instead. The problem with this is that not only is it wordy and tedious to write, but it binds the location of the module to a strict package structure. The beauty of the previous implementation was that you could make relative references to another schema (module), with the ability to make the target explicit via an import, if preferred.
The move of making table declaration into a separate function really breaks the above logic, because it assumes that table definitions can exist independently from the module in which they are found. Unlike in MATLAB, where you can pull any class name up to the top level by adding it to the path, Python modules inside a package really require the package names to reach the module, and as far as I know there is no easy way to make a module in a package accessible at the top namespace throughout the Python interpreter; even if there were, I'd think it would be rather awkward.
Given these, I suggest that we place the declare and related functionality back into the Base class. I could of course make the declare function take in the name of the module that holds the definition, but if the most common use of declare is by Base derivatives, which thus pretty much always need information about the module in which the derived class is defined, I really don't see a whole lot of benefit in separating declare into an independent function.
Complete the implementation of the dependency loading mechanism in the connection object.
Names of methods, functions, and variables are currently largely in camelCase and thus do not adhere to Python's style guide, PEP 8, which advocates the use of underscores in function and variable names. I think we should refactor these names to make the code more Pythonic.
Now that the documentation generation system is in place, we should go and massively add documentation for all public interfaces of DataJoint.
In the state the code is currently in, we only support Python 3 (e.g. because we use "nonlocal" in our class design).
Do you think it is worth supporting Python 2? In the case of nonlocal, for example, there is no simple solution (see this for more details).
A nice start into bilingual Python can be found here.
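For reference, the nonlocal issue in a nutshell: a closure that rebinds a variable in the enclosing scope has no direct Python 2 equivalent (the usual workaround is a mutable container):

def make_counter():
    count = 0
    def increment():
        nonlocal count  # SyntaxError in Python 2
        count += 1
        return count
    return increment

counter = make_counter()
print(counter(), counter())  # 1 2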
The current implementation doesn't support creating a table from the docstring in a module defined inside a package, because the module's full name is of the form package.subpackage.module, whereas the first line of the declaration is expected to be module.classname.
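One possible fix, sketched under the assumption that matching only the last component of the full module name is acceptable:

def module_matches(declared_module, full_module_name):
    # match 'module' from a 'module.classname' declaration against
    # a full name like 'package.subpackage.module'
    return full_module_name.split('.')[-1] == declared_module

print(module_matches('schema1', 'v1_project.schemata.schema1'))  # True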
Implement data definition functionality (e.g. insert, drop table, etc.) into the Base class.
Because good software needs good tests
Should we allow the table definition to contain references to other databases directly via the database name rather than the module name? For example, should the following definition be allowed in Base derivatives?
definition = """
schema1.Subjects (manual) # list of subjects
-> `database2`.Experimenter
"""
where database2 is the actual name of the database under which a table named experimenter actually exists. Such a referencing style is currently allowed if you directly instantiate a Table object, passing a definition string to the constructor.
I was thinking that since we expect all dj.Base derivatives to reference each other via the module.class naming convention, it would make sense to actually prohibit direct references to a database from within definitions for dj.Base derivatives.
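If we do prohibit it, a sketch of the check (the regex and error message are assumptions):

import re

def check_reference(line):
    if re.match(r'\s*->\s*`', line):
        raise ValueError("direct database references are not allowed in "
                         "Base derivatives; use module.Class instead")

check_reference('-> schema1.Experimenter')      # fine
# check_reference('-> `database2`.Experimenter')  # would raise ValueError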
Currently, blobs can only store n-dimensional numerical arrays. The data serialization is done using the mym protocol to keep the data compatible with the MATLAB side of things. Mym supports all MATLAB data structures: structures, structure arrays, cells, objects, etc. In practice, we only store numerical arrays in blobs; everything else is usually normalized into its own attributes. So I am okay with deviating from the mym protocol for serializing objects other than numerical arrays. This is low priority for now since this feature is not in strong demand.
In 90% of cases, the pop_rel is the unrestricted join of the primary dependencies of the table. Should we set that as the default value and let users override it?
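A sketch of that default, with primary_parents() as a hypothetical accessor for the primary dependencies and * as the relational join operator:

class AutoPopulate:
    @property
    def pop_rel(self):
        # default: the unrestricted join of all primary dependencies;
        # subclasses can still override this property
        parents = self.primary_parents()  # hypothetical helper
        rel = parents[0]
        for parent in parents[1:]:
            rel = rel * parent
        return rel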
If Connection.bind does not find the specified database on the server, it should create it automatically. This will be the user's way of creating schemas.
Also, how should we drop schemas? Maybe dropping all tables should trigger the dropping of the database.
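A minimal sketch of the auto-creation step, assuming a PyMySQL-style cursor (the actual Connection internals may differ):

def bind(connection, dbname):
    # create the database on first use -- this becomes the user's way of
    # creating schemas; dropping the last table could later trigger the
    # symmetric DROP DATABASE
    with connection.cursor() as cursor:
        cursor.execute("CREATE DATABASE IF NOT EXISTS `{}`".format(dbname))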
port latest relational algebra operators from the MATLAB version
Make datajoint.Base.drop cascade. See dj.Table/drop on the MATLAB side for reference.
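A sketch of the cascading behavior, where children() is a hypothetical lookup of the tables with foreign keys into this one:

def drop(table):
    for child in table.children():  # hypothetical: dependent tables
        drop(child)                 # drop dependents first
    table.connection.query(
        'DROP TABLE `{}`.`{}`'.format(table.dbname, table.table_name))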
I noticed that in the conn_container definition in connection.py, the part
input('Enter datajoint server address >> ')
was commented out. I'm guessing this was done when configuring TravisCI, but is it still necessary?
Generate documentation for DataJoint semi-automatically from in-code documentation. The generated documentation should be hosted on the datajoint-python project site.
The logic for creating and manipulating graphs that represent the ERD is pretty much there and can be found in the erd branch. I will keep this branch as the feature implementation branch. There are a few issues that I'm running into as I try implementing plotting support:
- The pygraphviz library that was once used in our project for plotting the ERD does not support Python 3 natively yet. Explicitly using its beta branch with pip install pygraphviz==1.3rc2 allows us to use it in Python 3, but I'm not sure we would like to introduce a dependency on a beta version of the package, especially as this is a very specific dependency.
- pydot is an alternative package that may be used, but again there is no official Python 3 support. There appears to be a port, pydot3k.
- The above two were the preferred methods for working with the graphviz graph plotting engine, which yields very nicely arranged graphs. Alternatively, I could try working with graphviz directly (see the sketch below), but to my knowledge the library can only render files such as .pdf at the end, and it takes some effort to display this in matplotlib.
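For the last option, a hedged sketch using the plain graphviz package (table names are made up; rendering requires the Graphviz binaries to be installed):

from graphviz import Digraph

erd = Digraph('erd', format='pdf')
erd.node('setups.Microscopes')
erd.node('mouse.Experiment')
erd.edge('setups.Microscopes', 'mouse.Experiment')  # foreign key dependency
erd.render('erd')  # writes erd.pdf; getting this into matplotlib takes extra work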
Implement an organized testing scheme. Thinking about using nose.
This was previously implemented using the networkx module. Make this work with the new implementation, probably as a method of Connection.
re-implement insert
see dj.Relvar/del on the MATLAB side.
I think the name would fit better. I would then use __iter__ for iterating through all tuples of a relation and return them as tuples. This would have the advantage of being able to use for loops like
for animal_id, date_of_birth, genetic_line in mice_relation:
...
What do you think?
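A sketch of that __iter__, with the cursor access left hypothetical:

class Relation:
    def __iter__(self):
        # fetch the relation's contents and yield each row as a plain tuple,
        # enabling exactly the unpacking loop shown above
        for row in self._fetch_rows():  # hypothetical fetch helper
            yield tuple(row)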