pbiecek / archivist
A set of tools for datasets and plots archiving
Home Page: http://pbiecek.github.io/archivist/
I found an R package called saves that saves the columns of a data.frame in separate .rdatas files and enables very fast loading of those columns when needed. The author claims this is faster than a regular load or SQL statements.
Should we consider adding this functionality to the archivist package?
Add a new attribute to the object <- restore the attribute from the object and save it to backpack.db listOfTags + add info to the documentation and maybe an example
So far only one class (by default class(artifact)[1]) is archived for an artifact. Would it be a good solution to store every class?
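A minimal sketch of what storing every class could look like, using only base R. The helper name classTags is hypothetical and not part of archivist; it simply builds one class: tag per element of class(artifact).

```r
# Hypothetical helper (not part of archivist): one "class:" tag per class
classTags <- function(artifact) {
  paste0("class:", class(artifact))
}

# A glm object inherits from both "glm" and "lm"
model <- glm(mpg ~ wt, data = mtcars)
classTags(model)
```

With tags like these, a search for class:lm would also find glm models, which may or may not be the desired behaviour.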
Every GitHub-mode function now takes that argument as the address of GitHub:
.GithubURL <- "https://raw.githubusercontent.com/"
which is created in the zzz.r file.
Right now the 'tag' entry supports only exact search.
Would it be possible to add regexp search or partial search?
It would be useful to use
searchInLocalRepo( tag = "name:", dir = myRepo)
or
searchInLocalRepo( tag = "name:*", dir = myRepo)
to find all artifacts in the database.
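The difference between exact and pattern search can be sketched in base R as follows. The tags vector is a toy stand-in for the tags stored in a repository, and searchTags is a hypothetical helper, not the archivist function.

```r
# Toy stand-in for the tags stored in a repository (assumed layout)
tags <- c("name:iris", "class:data.frame", "name:myplot123", "class:ggplot")

# Hypothetical helper: exact match when fixed = TRUE, regular expression otherwise
searchTags <- function(pattern, tags, fixed = TRUE) {
  if (fixed) {
    tags[tags == pattern]
  } else {
    tags[grepl(pattern, tags)]
  }
}

searchTags("name:iris", tags)              # exact search
searchTags("^name:", tags, fixed = FALSE)  # every tag that starts with "name:"
```

Note that "name:*" read as a regular expression means "name" followed by any number of colons. If wildcard syntax is preferred, utils::glob2rx() can translate a wildcard pattern into a regular expression first.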
When installing directly from GitHub
under macOS
I got the following warning
Warning: /private/var/folders/.../archivist/man/createEmptyRepo.Rd:15: unknown macro '\pck'
and error
Error: contains a blank line
[likely to be a problem with devtools]
Up for discussion,
but a call can be long and can contain characters like '%.
Right now, when lm is called through do.call, saveToRepo fails with an error.
Possible solutions: escape the call (as in RODBCext)
or remove the call from the list of tags to be stored.
I've written a new function that will be very useful when working with artifacts created with chained code.
Why does this vignette look like this on GitHub instead of having beautiful colours and layout?
I've tried to replace repeated code lines like:
sqlite <- dbDriver( "SQLite" )
conn <- dbConnect( sqlite, paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )
dbUnloadDriver( sqlite )
with something like:
conn <- dbConnect( get( "sqlite", envir = .ArchivistEnv ), paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )
where "sqlite" is a driver created in zzz.r on load
library( "RSQLite" )
assign( x = "sqlite", value = dbDriver( "SQLite" ), envir = .ArchivistEnv )
But I received an error:
Error in sqliteNewConnection(drv, ...) :
RS-DBI driver: (invalid SQLiteManager)
9 sqliteNewConnection(drv, ...)
8 is(object, Cl)
7 is(object, Cl)
6 .valueClassTest(standardGeneric("dbConnect"), "DBIConnection",
structure("dbConnect", package = "DBI"))
5 dbConnect(get("sqlite", envir = .ArchivistEnv), paste0(dir, "backpack.db")) at createEmptyRepo.R#112
4 FUN(c("labelx:gp", "labely:y", "data:c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3) c(-1.68058147697706, -0.25858551283198, 0.0695991957014381, -0.305351769954381, -0.0865442037369759, -0.49046203366745, 1.39493781861695, 0.60949965253791, -1.2555366629927, -0.0232679119291622, 0.347457532299437, 1.19498196922739, -1.52858438778783, -1.92154071088541, -0.617682575458965, 0.873087620089834, -0.63566999796332, -0.548280098008737, 0.668236740598881, -1.08011100356851, 0.468561888689557, 0.725540030207352, -1.18398787537976, -0.18325333179287, -0.271565440671143, -0.0421176210575606, \n0.369259996211449, -0.303006267521281, -0.353903273939233, -0.623758789064422)",
"class:ggplot", "name:myplot123", "date:2014-08-18 16:44:14")[[1L]],
...)
3 lapply(X = X, FUN = FUN, ...)
2 sapply(extractedTags, addTag, md5hash = md5hash, dir = repoDir) at saveToRepo.R#210
1 saveToRepo(myplot123, repoDir = exampleRepoDir)
I've changed lines only in the addTag function, located in the createRepo.R file, so the error is still visible on the saveToRepo() call.
Pushing with that commit
I'm not sure it is a good idea to raise an error with the default force=FALSE setting of saveToRepo().
It is good to let the user know that the object is already in the repository,
but an error might break some larger computation, which is not a pleasant experience.
Suggestions: raise a warning instead of an error, or make force=TRUE the default.
Also, instead of cat() calls, I think it is more convenient to raise a warning when force=FALSE.
But of course it's up for discussion.
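The warning-based behaviour could be sketched like this. The repository is reduced to a character vector of stored md5 hashes, and saveHash is a hypothetical helper used only to show the control flow, not the real saveToRepo().

```r
# Hypothetical stand-in: the "repository" is just a vector of stored md5 hashes
saveHash <- function(md5hash, stored, force = FALSE) {
  if (md5hash %in% stored && !force) {
    # A warning informs the user but does not abort a longer computation
    warning("Artifact ", md5hash, " is already in the repository; skipping.",
            call. = FALSE)
    return(stored)
  }
  union(stored, md5hash)
}

stored <- saveHash("abc", character(0))
stored <- saveHash("abc", stored, force = TRUE)  # silent, hash stays unique
```

Calling saveHash("abc", stored) with the default force = FALSE would emit the warning and return the repository unchanged.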
Suggestions:
prcomp
coxph
princomp
lars
boosting
bagging
rpart
hclust
Change the word object to artifact in the documentation.
Delete the "data" tag from extractData and add a new database.db
I'm not 100% sure that the issue is with
parent.frame(2)
in
if (rememberName) {
save(file = paste0(repoDir, "gallery/", md5hash, ".rda"),
ascii = TRUE, list = objectName, envir = parent.frame(2))
}
but it looks like the most probable source of the problem.
The problem:
when saveToRepo() is called within a function and rememberName=TRUE, saveToRepo() reports an error
'Error in save(file = paste0(repoDir, "gallery/", md5hash, ".rda"), ascii = TRUE, :
object ‘output’ not found'
It's likely that the object is in parent.frame(1), but maybe it should be referred to in some other way.
Suggestion: let's try parent.frame(1) or use ls() to find the right environment.
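A small base R sketch of the parent.frame(1) idea: the helper looks the object up in the environment of its direct caller instead of a fixed, deeper frame. saveByName and analysis are hypothetical names used only for this illustration.

```r
# Hypothetical helper: saves an object by name from the caller's environment
saveByName <- function(objectName, file, envir = parent.frame()) {
  force(envir)  # resolve the caller's frame before anything else runs
  save(list = objectName, file = file, envir = envir)
  invisible(file)
}

# The object `output` exists only inside this function
analysis <- function() {
  output <- c(1, 2, 3)
  saveByName("output", tempfile(fileext = ".rda"))
}

# Load the saved file into a fresh environment to check what was stored
restored <- new.env()
load(analysis(), envir = restored)
restored$output
```

Exposing the environment as an argument also lets a caller pass a different one when the object lives elsewhere.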
The archivist package is a set of tools for archiving datasets and plots in R.
Every object can be archived together with its unique Tags, which depend on the object's class. Tags are attributes of an object, such as a class, a name, names of the object's parts, etc. The list of objects supported so far is presented below, divided thematically. The list of Tags varies across object classes.
Tags are stored in the Repository. If data is extracted from an object, a special Tag named relationWith is created; it specifies which object the data is related to.
In my opinion shiny search is an extremely useful function and should be advertised in its own use case. What do you think about that?
In the lattice package the developers found a way to order the documentation. They simply add letters in alphabetical order to the beginning of each .Rd file name. This order can suggest how to read through the manual for the best understanding. Manual is here.
Try to make a website like this in free time: http://kzps.github.io/info/ .
After
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
I've got
(iris_md5hash <- searchInLocalRepo( tag = "name:iris", dir = myRepo))
[1] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[3] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[5] "ff575c261c949d073b2895b05d1097c3"
Maybe it is worth warning the user that the given artifact already exists.
The output of searchInLocalRepo should either be passed through unique() or come with additional attributes, such as the creation date.
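The unique() variant can be sketched in base R as follows. dedupeHashes is a hypothetical post-processing helper, not an archivist function; the input mimics the repeated result shown above.

```r
# Same hash returned five times, as in the report above
found <- rep("ff575c261c949d073b2895b05d1097c3", 5)

# Hypothetical post-processing: warn about repeats and return each hash once
dedupeHashes <- function(md5hashes) {
  if (anyDuplicated(md5hashes) > 0) {
    warning("Some artifacts were saved more than once.", call. = FALSE)
  }
  unique(md5hashes)
}

suppressWarnings(dedupeHashes(found))
```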
Shouldn't cache() be a basic function in archivist?
We could also create a GitHub version of this function if it turns out to be efficient :)
Maybe we should think about creating a multiSearch function that takes many conditions about an artifact as a parameter, searches for the md5hashes matching each condition, and in the end intersects those md5hashes to get the ones that fulfill all of the conditions.
Example:
multiSearchInRepo <- function( conditions, repoDir, fixed ){
  # search separately for each condition (fixed[i] belongs to conditions[i])
  md5hashes <- Map( function( condition, fix ) searchInRepo( condition, repoDir, fix ),
                    conditions, fixed )
  # keep only the md5hashes that fulfill every condition
  return( Reduce( intersect, md5hashes ) )
}
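A self-contained sketch of the same idea that runs without a repository. The tagTable data frame is a made-up stand-in for the tag table (one row per md5hash and tag pair), and searchOne and multiSearch are hypothetical helpers.

```r
# Toy stand-in for the tag table of a repository (assumed layout: one row per tag)
tagTable <- data.frame(
  md5hash = c("a1", "a1", "b2", "b2", "c3"),
  tag     = c("class:lm", "name:fit1", "class:lm", "name:fit2", "class:ggplot"),
  stringsAsFactors = FALSE
)

# Hypothetical single-condition search: exact or regular-expression match
searchOne <- function(pattern, tagTable, fixed = TRUE) {
  hit <- if (fixed) tagTable$tag == pattern else grepl(pattern, tagTable$tag)
  unique(tagTable$md5hash[hit])
}

# Search for every condition, then intersect the per-condition results
multiSearch <- function(patterns, tagTable, fixed = TRUE) {
  perCondition <- lapply(patterns, searchOne, tagTable = tagTable, fixed = fixed)
  Reduce(intersect, perCondition)
}

multiSearch(c("class:lm", "name:fit2"), tagTable)
multiSearch(c("^class:", "fit"), tagTable, fixed = FALSE)
```

Reduce(intersect, ...) works for any number of conditions, so the conditions do not have to be written out one by one.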
After
saveToRepo(iris, dir = myRepo)
rmFromRepo( md5hash = "ff575c261c949d073b2895b05d1097c3", dir = myRepo)
I still have a miniature in the gallery folder
Add examples to Tags.r
Various statistics can be calculated for objects that are in the repository. Like: calendar plot with timestamps of objects from repo, barplot for classes of these objects, etc.
summaryRepo() can present such statistics.
I apologize for any inconvenience caused by reverting the same commit today. I thought there was a bug, but there was not.
The call to createEmptyRepo(dir) fails if dir does not exist.
The following message is displayed:
Error in sqliteNewConnection(drv, ...) :
RS-DBI driver: (could not connect to dbname:
unable to open database file
)
Either the message should be changed or the desired dir should be created first.
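The second option, creating the directory first, could be sketched in base R like this. ensureRepoDir is a hypothetical helper, not part of archivist; it would run before the database file is created.

```r
# Hypothetical helper: make sure the directory exists before a repository is created in it
ensureRepoDir <- function(dir) {
  if (!dir.exists(dir)) {
    created <- dir.create(dir, recursive = TRUE)
    if (!created) {
      stop("Could not create directory: ", dir, call. = FALSE)
    }
  }
  invisible(dir)
}

repoDir <- file.path(tempdir(), "newRepo", "nested")
ensureRepoDir(repoDir)
dir.exists(repoDir)
```

If creation fails, the error names the directory, which is easier to act on than the low-level SQLite message.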
Are such functions necessary?
It wouldn't take long to prepare them.
To make the function call as short and easy as possible, it would be good to have arguments without defaults before those with defaults.
This is not the case for summaryLocalRepo(method = "md5hashes", repoDir) and possibly some others.
Future issue for vignettes. By now every commit about vignette can be tagged with this issue.
Do you know why the package does not install properly?
> install_github("pbiecek/archivist")
Installing github repo archivist/master from pbiecek
Downloading master.zip from https://github.com/pbiecek/archivist/archive/master.zip
Installing package from C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa/master.zip
Installing archivist
"C:/PROGRA~1/R/R-31~1.1/bin/x64/R" --vanilla CMD INSTALL \
"C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa\devtools15fc6b22ef1\archivist-master" \
--library="C:/Users/Marcin/Documents/R/win-library/3.1" \
--install-tests
* installing *source* package 'archivist' ...
** R
** demo
** preparing package for lazy loading
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
* DONE (archivist)
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’
> library(archivist)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: shiny
Loading required package: lubridate
Loading required package: jsonlite
Attaching package: ‘jsonlite’
The following object is masked from ‘package:shiny’:
validate
The following object is masked from ‘package:utils’:
View
Welcome to the archivist package (ver 1.2).
Warning message:
replacing previous import by ‘shiny::validate’ when loading ‘archivist’
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’
After adding the force argument to a few functions, some of the examples might not work properly. Also, now that the order of parameters in summaryRepo has changed, the examples could be simplified.
In the saveToRepo() function there is a
md5hash <- digest( object )
but then digest( object ) is called four or more times, which is not efficient.
Once md5hash has been calculated there is no need to repeat the calculation (reusing it should speed up saving).
Right now there are two competing arguments with search criteria: tag and regex.
The user has to set 'method="regexp"' to make the second argument work.
This is neither consistent with base R nor necessary.
Please consider two approaches:
It would be good to set not only user, repository and branch but also a directory for archivist gallery on github.
Then one can have a single github project with many repositories in different directories.
The only two required arguments in saveToRepo() are: object and dir.
Maybe they should be listed as first two arguments.
Then one can use the function without specification of argument names, e.g.
saveToRepo(iris, demoDir)
Instead of
saveToRepo(object, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, dir, rememberName = TRUE)
use
saveToRepo(object, dir, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, rememberName = TRUE)
Also, consider other names for the 'dir' argument. Maybe: repo, repoDir, gallery or something more 'archivist specific'.
Since we have createEmptyRepo() it would be nice to have deleteRepo() as well.
In the future, please consider a function copyToRepo that copies a list of artifacts from one repo to another.
Those two functions should complete the functionality for version 1.0.
So far the extractData function gets variables from the Global Environment.
Since saveToRepo may be used inside any other function, maybe it is better to change the environment to something like parent.frame(1) or parent.frame(2)?
Problem:
Proposed solution:
Example:
Takes all objects of class lm from the repository, extracts R2 for them and stores the values as R2: tags
md5hashes <- searchInLocalRepo(repoDir, "class:lm")
addTag(md5hashes, repoDir, function(x) paste0("R2:", summary(x)$r.squared))
download.file is used in loadFromGithubRepo() and searchInGithubRepo()
but it does not work with the https protocol.
Consider using RCurl or devtools (see devtools:::github_get_conn)
Change its name to a different one, maybe ifPaste
I believe those are the hot lines:
# in case there existed an object in GlobalEnv this function will not delete him
NotDelete <- as.logical(sapply( name , exists, envir = .GlobalEnv))
I will remove vignettes in the CRAN build,
examples and rda files are large,
let them stay on github.
For empty repo summaryLocalRepo() returns
List of 5
$ artifactsNumber: int 0
$ dataSetsNumber : int 0
$ classesNumber : Named list()
$ savesPerDay : Named list()
$ classesTypes : chr(0)
and then print.repository fails when attempting to
names(classes) <- "Number"
(classes is empty as well)
Error in names(classes) <- "Number" :
'names' attribute [1] must be the same length as the vector [0]
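One possible guard, sketched in base R: skip the naming step when there is nothing to name. printClasses is a hypothetical stand-in for the relevant part of print.repository, and the message text is an assumption.

```r
# Hypothetical stand-in for the failing part of the print method
printClasses <- function(classes) {
  if (length(classes) == 0) {
    # Nothing to name or print for an empty repository
    cat("No artifacts in the repository.\n")
    return(invisible(classes))
  }
  names(classes) <- "Number"
  print(classes)
  invisible(classes)
}

printClasses(list())               # empty repository: no error
printClasses(list(c(lm = 2)))      # non-empty: named and printed as before
```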
Goal:
have as few dependencies as possible
Proposition: limit the Depends section to
digest, RCurl, httr
If the user requests a miniature then more packages need to be loaded,
like ggplot2 for ggplot objects,
but such additional packages may be loaded on demand in the function that creates the miniature.
That will cut down the number of required packages (otherwise we need to load all packages that are supported by the data extracting functions: survival, MASS, ggplot2 and others)
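Loading on demand could be sketched with requireNamespace() from base R. needsPackage is a hypothetical helper, and the package name in the second call is deliberately one that should not exist.

```r
# Hypothetical helper: check for an optional package only when a feature needs it
needsPackage <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    warning("Package '", pkg, "' is not installed; skipping this step.",
            call. = FALSE)
    return(FALSE)
  }
  TRUE
}

needsPackage("stats")                                        # ships with R
suppressWarnings(needsPackage("surelyNotInstalledPkg12345")) # missing: FALSE
```

Packages used this way would typically be listed under Suggests rather than Depends in the DESCRIPTION file.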
Hello, first I'd like to thank you for creating such a useful package to manage my various R workspaces in an innovative way! I noticed that when I load archivist I receive the message The following object is masked from 'package:utils': View immediately after the jsonlite package is loaded. I use RStudio as my IDE for R, and normally when I click on the name of a data frame in my environment panel, it opens a new tab in my editor with the contents of the data frame (since it calls the View function from the utils package), and the name of the tab is the name of the data frame object. Since the jsonlite package contains its own version of the View function, when I do the same thing after loading archivist the data do appear in a separate tab, but the tab name is always x instead of the name of the data frame, which means I cannot have more than one data frame displayed separately. I realize this stems from a different package; however, I searched your function documentation for any calls like @import jsonlite or @importFrom jsonlite and did not find any. Is there a way to remove jsonlite from the package dependencies? Or did I miss where you are using functions from that package?
Until now, since artifacts could return only their own call and not the real data, we tried to get the data from parent.frame(1), or previously from .GlobalEnv. But if a user specifies data as data=as.numeric(Train[,-3]), every overloaded version of extractData would try get( "as.numeric(Train[,-3])", envir = parent.frame(1)). While there might be a Train object in R, there will probably never be an object named as.numeric(Train[,-3]).
My solution is to add a warning saying that the data could not be archived. That's what I've done in this commit.
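The failure and one possible alternative can be sketched in base R. resolveData is a hypothetical helper: it uses get() for plain names and otherwise evaluates the expression text, falling back to a warning. Evaluating arbitrary text has its own risks, so this is an illustration rather than a recommendation.

```r
Train <- matrix(1:9, ncol = 3)
dataCall <- "as.numeric(Train[, -3])"

# get() looks up names only, so an expression string is never found
exists(dataCall)

# Hypothetical helper: plain names via get(), expressions via eval(parse())
resolveData <- function(dataCall, envir = parent.frame()) {
  force(envir)
  if (exists(dataCall, envir = envir)) {
    return(get(dataCall, envir = envir))
  }
  tryCatch(
    eval(parse(text = dataCall), envir = envir),
    error = function(e) {
      warning("Data could not be archived: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

resolveData(dataCall)
```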
To synchronize with Travis I need admin permissions on this repository, so I'll fork this repo, sync the fork with Travis on my account, and then send a pull request to this repo.
TO DO: synchronize with travis
https://travis-ci.org/getting_started
http://docs.travis-ci.com/user/build-configuration/#.travis.yml-file%3A-what-it-is-and-how-it-is-used
> exampleRepoDir2 <- tempdir()
> createEmptyRepo( exampleRepoDir2 )
> md5hashes2 <- searchInGithubRepo( pattern= "relation", user = "pbiecek", repo = "archivist", fixed = FALSE )
> copyGithubRepo( repoTo = exampleRepoDir2, md5hashes2,
+ user= "pbiecek", repo = "archivist")
Error in file(con, "wb") : invalid 'description' argument
7 file(con, "wb")
6 writeBin(fileFromGithub, paste0(to, file)) at copyToRepo.R#150
5 FUN(X[[1L]], ...)
4 lapply(X = X, FUN = FUN, ...)
3 sapply(filesToDownload, cloneGithubFile, repo = repo, user = user,
branch = branch, to = repoTo) at copyToRepo.R#137
2 copyRepo(repoTo = repoTo, repoFrom = Temp, md5hashes = md5hashes,
local = FALSE, user = user, repo = repo, branch = branch) at copyToRepo.R#79
1 copyGithubRepo(repoTo = exampleRepoDir2, md5hashes2, user = "pbiecek",
repo = "archivist")
No idea where the bug is.