pbiecek / archivist

A set of tools for datasets and plots archiving

Home Page: http://pbiecek.github.io/archivist/

R 2.86% HTML 97.14%

archivist's People

eliotmcintire gitter-badger adamryczkowski nemochina2008 zoudj predictiveecology stjordanis greengrassblueocean

archivist's Issues

saves package

I found an R package called saves that can save the columns of a data.frame in separate .RData files and enables TurboFast loading of only those columns that are needed. The author guarantees this is faster than a regular load or SQL statements.

Should we consider adding this functionality to the archivist package?

User can add their own tags to backpack.db

Add a new attribute to the object, then restore the attribute from the object and save it to backpack.db (listOfTags). Also add info to the documentation and maybe an example.

Artifacts' classes

So far only one class (by default class(artifact)[1]) is archived for an artifact. Would it be a good idea to store every class of the artifact?

Github Version Functions' GithubURL Argument

Every github-mode function now takes that argument as an address to github

.GithubURL <- "https://raw.githubusercontent.com/"

which is created in zzz.r file.

Could searchInLocalRepo() accept regexps?

Right now there is an exact search with the 'tag' entry.
Would it be possible to do regexp search or partial search?
It would be useful to use
searchInLocalRepo( tag = "name:", dir = myRepo)
or
searchInLocalRepo( tag = "name:*", dir = myRepo)

to find all artifacts in database.

unknown macro '\pck'

When installing directly from GitHub
under macOS
I got the following warning

Warning: /private/var/folders/.../archivist/man/createEmptyRepo.Rd:15: unknown macro '\pck'

and error

Error: contains a blank line
[likely to be a problem with devtools]

remove or escape call from tag list

Up for discussion,
but the call can be long and can contain characters like ' or %.
Right now, when lm is called through do.call, saveToRepo fails with an error.

Possible solutions: escape the call (like in RODBCext)
or remove the call from the list of tags to be stored
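One way to read the "escape the call" option, as a minimal base R sketch: double every single quote before the tag text is pasted into an SQL statement. The helper name escapeTag and the table and column names in the query are assumptions made for illustration, not archivist's actual code.

```r
# Hypothetical helper (not part of archivist): make a tag safe to place
# inside a single-quoted SQL string literal.
escapeTag <- function(tag) {
  # SQL escapes a single quote inside a string literal by doubling it
  gsub("'", "''", tag, fixed = TRUE)
}

# A call tag that contains single quotes
callTag <- "call:lm(formula = y ~ x, data = d[d$g == 'a', ])"

# Assumed table and column names, used only to show where the tag ends up
query <- paste0("INSERT INTO tag (tag) VALUES ('", escapeTag(callTag), "')")
```

Binding the tag as a query parameter would avoid manual escaping altogether, but that depends on the database interface in use and is not shown here.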

returnTag function

I've written a new function that will be very useful when working with artifacts created with chained code.

Can't find the way to present raw version of DOCCO-style Vignette

Why does this vignette look like this on GitHub instead of having beautiful colours and layout?

Replacing repeatable code lines

I've tried to replace repeated code lines like:

sqlite <- dbDriver( "SQLite" )
conn <- dbConnect( sqlite, paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )
dbUnloadDriver( sqlite )

with something like:

conn <- dbConnect( get( "sqlite", envir = .ArchivistEnv ), paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )

where "sqlite" is a connection made in zzz.r on load

library( "RSQLite" )
assign( x = "sqlite", value = dbDriver( "SQLite" ), envir = .ArchivistEnv )

But I received an error:

Error in sqliteNewConnection(drv, ...) : 
  RS-DBI driver: (invalid SQLiteManager) 
9 sqliteNewConnection(drv, ...) 
8 is(object, Cl) 
7 is(object, Cl) 
6 .valueClassTest(standardGeneric("dbConnect"), "DBIConnection", 
    structure("dbConnect", package = "DBI")) 
5 dbConnect(get("sqlite", envir = .ArchivistEnv), paste0(dir, "backpack.db")) at createEmptyRepo.R#112
4 FUN(c("labelx:gp", "labely:y", "data:c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3) c(-1.68058147697706, -0.25858551283198, 0.0695991957014381, -0.305351769954381, -0.0865442037369759, -0.49046203366745, 1.39493781861695, 0.60949965253791, -1.2555366629927, -0.0232679119291622, 0.347457532299437, 1.19498196922739, -1.52858438778783, -1.92154071088541, -0.617682575458965, 0.873087620089834, -0.63566999796332, -0.548280098008737, 0.668236740598881, -1.08011100356851, 0.468561888689557, 0.725540030207352, -1.18398787537976, -0.18325333179287, -0.271565440671143, -0.0421176210575606, \n0.369259996211449, -0.303006267521281, -0.353903273939233, -0.623758789064422)", 
"class:ggplot", "name:myplot123", "date:2014-08-18 16:44:14")[[1L]], 
    ...) 
3 lapply(X = X, FUN = FUN, ...) 
2 sapply(extractedTags, addTag, md5hash = md5hash, dir = repoDir) at saveToRepo.R#210
1 saveToRepo(myplot123, repoDir = exampleRepoDir)

I've changed lines only in the addTag function, located in the createRepo.R file, so the error is still visible on the saveToRepo() call.

Pushing with that commit

saveToRepo() with force = FALSE may break some pipelines

I'm not sure it is a good idea to raise an error with the default force=FALSE setting of saveToRepo().
It is good to let the user know that the object is already in the repository,
but an error might break some larger computations, which is not a pleasant experience.

Suggestions: raise a warning instead of an error, or make force=TRUE the default.
Also, instead of cat() calls, I think it is more convenient to raise a warning when force=FALSE.

But of course it's up for discussion.
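The warning-instead-of-error suggestion could look roughly like the sketch below. It uses an in-memory environment as a stand-in for the repository; saveSketch, its arguments and the placeholder hash are hypothetical and do not reflect saveToRepo()'s real signature.

```r
# In-memory stand-in for a repository: an environment keyed by md5hash.
repo <- new.env()

# Hypothetical function (not archivist's API) showing the proposed behaviour.
saveSketch <- function(object, md5hash, repo, force = FALSE) {
  if (exists(md5hash, envir = repo, inherits = FALSE) && !force) {
    # Warn and carry on, so a larger computation is not interrupted
    warning("artifact ", md5hash, " is already in the repository; not saved again")
    return(invisible(md5hash))
  }
  assign(md5hash, object, envir = repo)
  invisible(md5hash)
}

saveSketch(iris, "placeholderhash", repo)   # first save stores the object
```

A second call with the same hash would emit a warning and return the hash, leaving the stored object untouched.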

New objects in next version

Suggestions:

prcomp
coxph
princomp
lars
boosting
bagging
rpart
hclust

Vocabulary

Change word object to artifact in documentation.

an issue with rememberName in saveToRepo when it's called outside GlobalEnvir

I'm not 100% sure that the issue is with
parent.frame(2)
in
if (rememberName) {
save(file = paste0(repoDir, "gallery/", md5hash, ".rda"),
ascii = TRUE, list = objectName, envir = parent.frame(2))
}

but it looks like the most probable source of problem.

The problem:
when saveToRepo() is called within a function and rememberName=TRUE then the function saveToRepo() reports an error
'Error in save(file = paste0(repoDir, "gallery/", md5hash, ".rda"), ascii = TRUE, :
object ‘output’ not found'

it's likely that the object is in parent.frame(1), but maybe it should be referred in some other way.

Suggestion: let's try parent.frame(1) or use ls() to find the right environment.
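The difference between the two frames can be reproduced in base R without archivist. In the sketch below (all function names are hypothetical), an object created inside a wrapper function is visible one frame up from the saving function, but not two frames up:

```r
# Hypothetical stand-in for the lookup done before save(): does an object
# with the caller's variable name exist in the chosen frame?
findByName <- function(object, framesUp) {
  objectName <- deparse(substitute(object))
  env <- parent.frame(framesUp)
  exists(objectName, envir = env, inherits = FALSE)
}

# The artifact lives only inside this function, not in the global environment
wrapper <- function() {
  output <- 1:3
  c(one_up = findByName(output, 1), two_up = findByName(output, 2))
}

res <- wrapper()
```

Here parent.frame(1) is the body of wrapper(), where output lives, while parent.frame(2) is whatever called wrapper(), which matches the 'object not found' error reported above.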

Supported objects and tags

archivist package - Tags

The archivist package is a set of tools for archiving datasets and plots in R.
Every object can be archived with its unique Tags, which depend on the object's class. Tags are attributes of an object, i.e., a class, a name, names of the object's parts, etc. The list of objects supported so far is presented below. Objects are grouped thematically. The list of Tags varies across object classes.

Tags are stored in the Repository. If data is extracted from an object, a special Tag named relationWith is created, which specifies the object this data is related to.

Regression Models

lm

- coefname - class - call - name - data - date

glmnet

- date - name - class - call - beta - lambda

survfit

- date - name - class - call - strata - type

Plots

ggplot

- name - class - date - data - labelx - labely

trellis

- date - name - class - call

Results of Agglomeration Methods

twins

which is a result of agnes, diana or mona functions

- date - name - class - ac - merge - data

partition

which is a result of `pam`, `clara` or `fanny` functions

- date - name - class - call - data - objective

lda

- date - name - class - call

qda

- date - name - class - call - terms

Statistical tests

htest

- alternative - method - date - name - class

When none of the above applies, the default tags are assigned

default

- name - class - date

data.frame

- name - class - date - varname

shinySearch use Case

In my opinion shiny search is an utterly useful function and should be advertised in its own use case. What do you think about that?

R topics documented in order

In the lattice package the developers found a way to order the documentation. They simply add letters in alphabetical order to the beginning of each .Rd file name. This order can suggest how to read through the manual to optimize understanding. Manual is here.

github webpage for repository

Try to make a website like this in free time: http://kzps.github.io/info/ .

saveToRepo adds a new entry for an artifact even if an object with the given md5 already exists

After
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)

I've got
(iris_md5hash <- searchInLocalRepo( tag = "name:iris", dir = myRepo))
[1] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[3] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[5] "ff575c261c949d073b2895b05d1097c3"

Maybe it is worth warning the user that the given artifact already exists.

The output from searchInLocalRepo should either be passed through unique() or carry additional attributes, like the creation date.

cacheLocal and cacheGithub

Shouldn't cache() be a basic function in archivist?
We also can create github version of this function if it'll be efficient :)

multiSearchInRepo

Maybe we should think about creating a multiSearch function that takes many conditions about an artifact as a parameter, searches for the md5hashes matching each condition, and finally intersects those md5hashes to get the ones that fulfil all of them.

Example:

multiSearchInRepo <- function( conditions, repoDir, fixed ){

md_1 <- searchInRepo( conditions[1], repoDir, fixed[1])
md_2 <- searchInRepo( conditions[2], repoDir, fixed[2])
...
md_15 <- searchInRepo( conditions[15], repoDir, fixed[15])

# intersect() takes two sets at a time, so fold it over all results
return( Reduce( intersect, list( md_1, md_2, .... , md_15 ) ) )
}
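For an arbitrary number of conditions the same idea can be written with lapply() and Reduce(). The sketch below runs against a toy tag table with made-up hashes instead of a real repository; searchSketch and multiSearchSketch are hypothetical names, and only exact matching is shown.

```r
# Toy tag table standing in for the repository database (made-up hashes).
tags <- data.frame(
  md5hash = c("a1", "a1", "b2", "b2", "c3"),
  tag     = c("class:lm", "name:fit1", "class:lm", "name:fit2", "class:ggplot"),
  stringsAsFactors = FALSE
)

# Hypothetical single-condition search: exact match on the tag.
searchSketch <- function(condition, tags) {
  unique(tags$md5hash[tags$tag == condition])
}

# Run one search per condition, then keep the hashes common to all results.
multiSearchSketch <- function(conditions, tags) {
  hits <- lapply(conditions, searchSketch, tags = tags)
  Reduce(intersect, hits)
}

both <- multiSearchSketch(c("class:lm", "name:fit2"), tags)
```

Folding intersect() over a list avoids naming one variable per condition, so the function works for any number of conditions.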

rmFromRepo() does not remove miniature

After
saveToRepo(iris, dir = myRepo)
rmFromRepo( md5hash = "ff575c261c949d073b2895b05d1097c3", dir = myRepo)

I still have a miniature in the gallery folder

example to Tags

Add examples to Tags.r

TODO: repo statistics

Various statistics can be calculated for objects that are in the repository. Like: calendar plot with timestamps of objects from repo, barplot for classes of these objects, etc.
summaryRepo() can present such statistics.

Reverting changes

I apologize for any inconvenience caused by repeatedly reverting the same commit today. I thought there was a bug, but there was not.

createEmptyRepo fails if dir does not exist

The call to createEmptyRepo(dir) fails if dir does not exist.
Following message is displayed:

Error in sqliteNewConnection(drv, ...) :
RS-DBI driver: (could not connect to dbname:
unable to open database file
)

Either the message should be changed or desired dir should be created first.
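The second option (create the directory first) needs only base R. A minimal sketch, assuming a hypothetical helper ensureRepoDir that would run before the database file is opened:

```r
# Hypothetical pre-check (not archivist code): create the repository
# directory if it is missing, and fail with a readable message otherwise.
ensureRepoDir <- function(dir) {
  if (!dir.exists(dir)) {
    # recursive = TRUE also creates missing parent directories
    created <- dir.create(dir, recursive = TRUE)
    if (!created) stop("Could not create repository directory: ", dir)
  }
  invisible(dir)
}

repoDir <- file.path(tempdir(), "sketchRepo", "nested")
ensureRepoDir(repoDir)
```

Calling the helper again on an existing directory is a no-op, so it is safe to run on every repository creation.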

zip/tar and unzip/untar repo

Are such functions necessary?
It wouldn't take long to prepare them.

Arguments with defaults should go after those without defaults

To make the function call as short and easy as possible it would be good to have arguments without defaults before those with defaults.

This is not the case for summaryLocalRepo(method = "md5hashes", repoDir) and possibly some others.

Vignette

An issue for future vignette work. From now on, every commit about vignettes can be tagged with this issue.

Install from github problem

Do you know why the package does not install properly?

> install_github("pbiecek/archivist")
Installing github repo archivist/master from pbiecek
Downloading master.zip from https://github.com/pbiecek/archivist/archive/master.zip
Installing package from C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa/master.zip
Installing archivist
"C:/PROGRA~1/R/R-31~1.1/bin/x64/R" --vanilla CMD INSTALL  \
  "C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa\devtools15fc6b22ef1\archivist-master"  \
  --library="C:/Users/Marcin/Documents/R/win-library/3.1"  \
  --install-tests 

* installing *source* package 'archivist' ...
** R
** demo
** preparing package for lazy loading
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
* DONE (archivist)
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’
> library(archivist)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: shiny
Loading required package: lubridate
Loading required package: jsonlite

Attaching package: ‘jsonlite’

The following object is masked from ‘package:shiny’:

    validate

The following object is masked from ‘package:utils’:

    View


 Welcome to the archivist package (ver 1.2).
Warning message:
replacing previous import by ‘shiny::validate’ when loading namespace ‘archivist’ 
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’

Correct examples

After adding the force argument to a few functions, some of the examples might not work properly. Also, since the order of parameters in summaryRepo changed, the examples might be simplified.

Avoid unnecessary calls to digest function

In the saveToRepo() function there is a
md5hash <- digest( object )

but then digest( object ) is called four or more times, which is not efficient.
Once md5hash has been calculated there is no need to repeat the calculation (that should speed up the saving)

Regexps in searchInLocalRepo()

Right now there are two competing arguments with search criteria: tag and regex.
The user has to set 'method="regexp"' to make the second argument work.
This is neither consistent with base R nor necessary.

Please consider two approaches:

recognise whether tag is set; if not, assume that regexp holds the required match criteria (like agnes() recognises whether it was given a data frame or a distance matrix)
the approach used in grep(): if fixed=TRUE (let's use fixed instead of method to be more consistent with base R) then the first argument is matched exactly; if fixed=FALSE then use the LIKE % scheme. Instead of 'tag', maybe 'pattern'?
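The second approach can be sketched in base R on a plain character vector of tags. The function name searchTagsSketch is hypothetical, and substring matching with grepl() is used here as a stand-in for the SQL LIKE '%pattern%' scheme:

```r
# Made-up tags standing in for the tag column of the repository database.
tags <- c("name:iris", "class:data.frame", "name:myplot123", "class:ggplot")

# Hypothetical signature following grep(): one 'pattern' argument plus 'fixed'.
searchTagsSketch <- function(pattern, tags, fixed = TRUE) {
  if (fixed) {
    # exact match on the whole tag
    tags[tags == pattern]
  } else {
    # substring match, comparable to LIKE '%pattern%'
    tags[grepl(pattern, tags, fixed = TRUE)]
  }
}

exactHits   <- searchTagsSketch("name:iris", tags)
partialHits <- searchTagsSketch("name:", tags, fixed = FALSE)
```

With a single pattern argument the caller never has to choose between two competing parameters; only the fixed flag changes how the pattern is interpreted.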

non root directory for github repository

It would be good to set not only user, repository and branch but also a directory for archivist gallery on github.

Then one can have a single github project with many repositories in different directories.

Order of arguments in saveToRepo

The only two required arguments in saveToRepo() are: object and dir.
Maybe they should be listed as first two arguments.
Then one can use the function without specification of argument names, e.g.
saveToRepo(iris, demoDir)

Instead of
saveToRepo(object, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, dir, rememberName = TRUE)

use
saveToRepo(object, dir, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, rememberName = TRUE)

Also, consider other names for the 'dir' argument. Maybe: repo, repoDir, gallery or something more 'archivist specific'

deleteRepo()

Since we have createEmptyRepo() it would be nice to have deleteRepo() as well.

In the future, please consider a function copyToRepo which will copy a list of artifacts from one repo to another.

Those two functions should round out the functionality for version 1.0

extractData gets variables from Global Environment

Currently the extractData function gets variables from the Global Environment.
Since saveToRepo may be used inside any other function, maybe it is better to change the environment to something like parent.frame(1) or parent.frame(2)?

New function: addTag - for update of list of tags

Problem:

after an artifact is added there is no way to add/update its tags
sometimes one discovers that it would be nice to have additional properties exposed as tags

Proposed solution:

addTag(md5hashes, repoDir, FUN, tags) function, that will take list of md5hashes
if user specifies tags, then these tags will be added to all md5hashes
if user specifies FUN, then FUN will be executed on each md5hash object, FUN returns a character vector (=new tags), new tags are added to database

Example:
Take all objects of class lm from the repository, extract R2 for them and store it as R2: tags

md5hashes <- searchInLocalRepo(repoDir, "class:lm")
addTag(md5hashes, repoDir, function(x) paste0("R2:", summary(x)$r.squared))
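The FUN part of the proposal can be tried out without a repository. In the sketch below the models are kept in a named list whose names stand in for md5hashes (they are placeholders, not real hashes), and r2Tag is a hypothetical name for the tag-producing function:

```r
# Two small models on a built-in dataset; list names stand in for md5hashes.
models <- list(
  hashA = lm(mpg ~ wt, data = mtcars),
  hashB = lm(mpg ~ wt + hp, data = mtcars)
)

# Hypothetical FUN for the proposed addTag(): returns a new tag for one artifact.
r2Tag <- function(x) paste0("R2:", summary(x)$r.squared)

# Apply FUN to every artifact; the result maps each hash to its new tag.
newTags <- vapply(models, r2Tag, character(1))
```

In the proposed addTag() each element of newTags would then be written to the database next to the corresponding md5hash.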

download.file {utils} does not work with https

download.file is used in loadFromGithubRepo() and searchInGithubRepo()
but it does not work with https protocols.

Consider use of RCurl or devtools (see devtools:::github_get_conn)

paste argument

Change its name to a different one, maybe ifPaste.

load function loads extra objects while working in returns=TRUE mode

I believe those are the hot lines:

 # in case there existed an object in GlobalEnv this function will not delete him
NotDelete <- as.logical(sapply( name , exists, envir = .GlobalEnv))

Are we ready for 1.1?

I will remove vignettes in the CRAN build,
examples and rda files are large,
let them stay on github.

print.repository fails with empty repo

For empty repo summaryLocalRepo() returns

List of 5
$ artifactsNumber: int 0
$ dataSetsNumber : int 0
$ classesNumber : Named list()
$ savesPerDay : Named list()
$ classesTypes : chr(0)

attr(*, "class")= chr "repository"

and then print.repository fails when attempting to
names(classes) <- "Number"

(classes is empty as well)

Error in names(classes) <- "Number" :
'names' attribute [1] must be the same length as the vector [0]
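A length check before the assignment would avoid this error. A minimal sketch, assuming classes normally holds a single element (labelClasses is a hypothetical helper, not the actual print.repository code):

```r
# Hypothetical guard for the labelling step: only name a non-empty result.
labelClasses <- function(classes) {
  if (length(classes) > 0) {
    names(classes) <- "Number"
  }
  classes
}

emptyResult <- labelClasses(list())    # empty input passes through unchanged
namedResult <- labelClasses(list(3))   # one element gets the name "Number"
```

For the empty case print.repository could additionally print a short note that the repository contains no artifacts.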

Dependencies

Goal:
have as few dependencies as possible
Proposition: limit the Depends section to
digest, RCurl, httr

If the user requests a miniature then more packages need to be loaded,
like ggplot2 for ggplot objects,
but these additional packages may be loaded on demand in the function that creates the miniature.

That will cut down the number of required packages (otherwise we need to load all packages that are supported by the data-extracting function: survival, MASS, ggplot2 and others)

Is jsonlite a necessary dependency?

Hello, first I'd like to thank you for creating such a useful package to manage my various R workspaces in an innovative way! I noticed that when I load archivist I receive a message The following object is masked from 'package:utils': View immediately after the jsonlite package is loaded. I use RStudio as my IDE for R, and in normal situations when I click on the name of a data frame in my environment panel, it generates a new tab in my editor with the contents of my data frame (since it calls the View function from the utils package) and the name of the tab is the name of the data frame object. Since the jsonlite package contains its own version of the View function, when I do the same thing after loading archivist the data do appear in a separate tab but the tab name is always x instead of the name of the data frame, which means I cannot have more than one data frame displayed separately. I realize this stems from a different package, however I searched your function documentation for any calls like @import jsonlite or @importFrom jsonlite and did not find any lines with these types of calls. Is there a way to remove jsonlite from the package dependencies? Or did I miss where you are using functions from that package?

extractData bug/problem

Until now, since artifacts could only return their own call and not the real data, we tried to get the data from parent.frame(1), or previously from .GlobalEnv. But if a user specifies data as data=as.numeric(Train[,-3]), every overloaded version of extractData would try get( "as.numeric(Train[,-3])", envir = parent.frame(1)). Although there might be a Train object in R, there will probably never be an object named as.numeric(Train[,-3]).

My solution is to add a warning saying that the data could not be archived. That's what I've done in this commit.
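The behaviour described above could be sketched as follows. getDataSketch is a hypothetical helper, not the code from the commit: it looks the data up by name and falls back to a warning when no object with that name exists.

```r
# Hypothetical helper: fetch data by name, or warn and return NULL.
getDataSketch <- function(dataName, envir = parent.frame()) {
  if (!exists(dataName, envir = envir)) {
    warning("Data '", dataName, "' could not be found and will not be archived")
    return(NULL)
  }
  get(dataName, envir = envir)
}

Train <- matrix(1:6, ncol = 3)
found <- getDataSketch("Train")   # a plain object name is found

# "as.numeric(Train[,-3])" is an expression, not an object name, so the
# following call would warn and return NULL:
# getDataSketch("as.numeric(Train[,-3])")
```

Evaluating the expression text instead of looking it up by name would be a different technique and is deliberately not shown here.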

Travis support CI

To synchronize with travis I need administration permission on this repository, so I'll fork this repo and sync with travis on my account and then I'll send pull request to this repo.

TO DO: synchronize with travis
https://travis-ci.org/getting_started
http://docs.travis-ci.com/user/build-configuration/#.travis.yml-file%3A-what-it-is-and-how-it-is-used

copyGithubRepo bug

> exampleRepoDir2 <- tempdir()
> createEmptyRepo( exampleRepoDir2 )
> md5hashes2 <- searchInGithubRepo( pattern= "relation", user = "pbiecek", repo = "archivist", fixed = FALSE )
> copyGithubRepo( repoTo = exampleRepoDir2,  md5hashes2,
+ user= "pbiecek", repo = "archivist")
 Error in file(con, "wb") : invalid 'description' argument 
7 file(con, "wb") 
6 writeBin(fileFromGithub, paste0(to, file)) at copyToRepo.R#150
5 FUN(X[[1L]], ...) 
4 lapply(X = X, FUN = FUN, ...) 
3 sapply(filesToDownload, cloneGithubFile, repo = repo, user = user, 
    branch = branch, to = repoTo) at copyToRepo.R#137
2 copyRepo(repoTo = repoTo, repoFrom = Temp, md5hashes = md5hashes, 
    local = FALSE, user = user, repo = repo, branch = branch) at copyToRepo.R#79
1 copyGithubRepo(repoTo = exampleRepoDir2, md5hashes2, user = "pbiecek", 
    repo = "archivist")

No idea where the bug is.

Releases

It is possible to tag the date on which a new version (e.g. 1.1) was released.
See this and that.