archivist's Issues

saves package

I found an R package called saves that provides the functionality of saving the columns of a data.frame in separate .RData files and enables very fast loading of those columns when needed. The author claims this is faster than a regular load() or SQL statements.

Should we consider adding this functionality to the archivist package?

Artifacts' classes

So far only one class (by default class(artifact)[1]) is archived for an artifact. Would it be a better solution to store every possible class?
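A minimal sketch, assuming the "class:..." tag format used elsewhere in these issues, of storing every class of an artifact instead of only the first one:

# hypothetical replacement for the single class(artifact)[1] tag
classTags <- paste0( "class:", class( artifact ) )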

Could searchInLocalRepo() accept regexps?

Right now there is only an exact search on the 'tag' entry.
Would it be possible to do a regexp search or a partial search?
It would be useful to be able to call
searchInLocalRepo( tag = "name:", dir = myRepo)
or
searchInLocalRepo( tag = "name:*", dir = myRepo)

to find all artifacts in the database.

unknown macro '\pck'

When installing directly from GitHub
under macOS
I got the following warning

Warning: /private/var/folders/.../archivist/man/createEmptyRepo.Rd:15: unknown macro '\pck'

and the error

Error: contains a blank line
[likely a problem with devtools]

remove or escape call from tag list

Up for discussion,
but since call can be long and can contain characters like ' or %,
right now, when lm is called through do.call(), saveToRepo() fails with an error.

Possible solutions: escape the call (like in RODBCext)
or remove call from the list of tags to be stored.
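A minimal sketch of the escaping variant, assuming a hypothetical callString that holds deparse(object$call); single quotes are doubled so the tag survives the SQLite INSERT statement:

escapedCall <- gsub( "'", "''", callString, fixed = TRUE )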

returnTag function

I have written a new function that will be very useful when working with artifacts created with chained code.

Replacing repeatable code lines

I've tried to replace repeated code lines like:

sqlite <- dbDriver( "SQLite" )
conn <- dbConnect( sqlite, paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )
dbUnloadDriver( sqlite ) 

with something like:

conn <- dbConnect( get( "sqlite", envir = .ArchivistEnv ), paste0( dir, "/backpack.db" ) )
...
dbDisconnect( conn )

where "sqlite" is a connection made in zzz.r on load

library( "RSQLite" )
assign( x = "sqlite", value = dbDriver( "SQLite" ), envir = .ArchivistEnv )

But I received an error:

Error in sqliteNewConnection(drv, ...) : 
  RS-DBI driver: (invalid SQLiteManager) 
9 sqliteNewConnection(drv, ...) 
8 is(object, Cl) 
7 is(object, Cl) 
6 .valueClassTest(standardGeneric("dbConnect"), "DBIConnection", 
    structure("dbConnect", package = "DBI")) 
5 dbConnect(get("sqlite", envir = .ArchivistEnv), paste0(dir, "backpack.db")) at createEmptyRepo.R#112
4 FUN(c("labelx:gp", "labely:y", "data:c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3) c(-1.68058147697706, -0.25858551283198, 0.0695991957014381, -0.305351769954381, -0.0865442037369759, -0.49046203366745, 1.39493781861695, 0.60949965253791, -1.2555366629927, -0.0232679119291622, 0.347457532299437, 1.19498196922739, -1.52858438778783, -1.92154071088541, -0.617682575458965, 0.873087620089834, -0.63566999796332, -0.548280098008737, 0.668236740598881, -1.08011100356851, 0.468561888689557, 0.725540030207352, -1.18398787537976, -0.18325333179287, -0.271565440671143, -0.0421176210575606, \n0.369259996211449, -0.303006267521281, -0.353903273939233, -0.623758789064422)", 
"class:ggplot", "name:myplot123", "date:2014-08-18 16:44:14")[[1L]], 
    ...) 
3 lapply(X = X, FUN = FUN, ...) 
2 sapply(extractedTags, addTag, md5hash = md5hash, dir = repoDir) at saveToRepo.R#210
1 saveToRepo(myplot123, repoDir = exampleRepoDir) 

I changed lines only in the addTag function located in the createRepo.R file, so the error is still visible on the saveToRepo() call.

I'm pushing with that commit.

saveToRepo() with force = FALSE may break some pipelines

I'm not sure it is a good idea to raise an error with the default force = FALSE setting of saveToRepo().
It is good to let the user know that the object is already in the repository,
but an error might break some larger computations, which is not a pleasant experience.

Suggestions: raise a warning instead of an error, or make the default force = TRUE.
Also, instead of cat()'s I think it is more convenient to raise a warning when force = FALSE.

But of course it's up for discussion.
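A minimal sketch of the warning-based behaviour; force, md5hash and the existence check are hypothetical internals of saveToRepo():

if ( !force && alreadyInRepo ) {
  warning( "Artifact already exists in the repository; nothing was saved." )
  return( invisible( md5hash ) )
}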

Vocabulary

Change the word 'object' to 'artifact' in the documentation.

Tags

Delete "data" tag from extracData and add new database.db

an issue with rememberName in saveToRepo when it's called outside the Global Environment

I'm not 100% positive that the issue is with
parent.frame(2)
in

if (rememberName) {
  save(file = paste0(repoDir, "gallery/", md5hash, ".rda"),
       ascii = TRUE, list = objectName, envir = parent.frame(2))
}

but it looks like the most probable source of the problem.

The problem:
when saveToRepo() is called within a function and rememberName = TRUE, then saveToRepo() reports the error
'Error in save(file = paste0(repoDir, "gallery/", md5hash, ".rda"), ascii = TRUE, :
object 'output' not found'

It's likely that the object is in parent.frame(1), but maybe it should be referenced in some other way.

Suggestion: let's try parent.frame(1) or use ls() to find the right environment.
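A minimal sketch of the ls()-style idea, assuming a hypothetical objectName that holds the name of the artifact to be saved; the helper walks up the call stack until it finds an environment that actually contains the object:

findObjectEnv <- function( objectName ) {
  for ( n in 1:sys.nframe() ) {
    candidate <- parent.frame( n )
    if ( exists( objectName, envir = candidate, inherits = FALSE ) ) return( candidate )
  }
  globalenv()
}

# then inside saveToRepo():
# save( file = ..., ascii = TRUE, list = objectName, envir = findObjectEnv( objectName ) )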

Supported objects and tags

archivist package - Tags

The archivist package is a set of tools for archiving datasets and plots in R.
Every object can be archived with its unique Tags, which depend on the object's class. Tags are attributes of an object, e.g. its class, its name, the names of the object's parts, etc. The list of supported objects is presented below, grouped thematically. The set of Tags varies across object classes.

Tags are stored in the Repository. If data is extracted from an object, a special Tag named relationWith is created, which specifies which object the data is related to.

Regression Models

lm

- coefname - class - call - name - data - date

glmnet

- date - name - class - call - beta - lambda

survfit

- date - name - class - call - strata - type

Plots

ggplot

- name - class - date - data - labelx - labely

trellis

- date - name - class - call

Results of Agglomeration Methods

twins

which is a result of the `agnes`, `diana` or `mona` functions
- date - name - class - ac - merge - data

partition

which is a result of `pam`, `clara` or `fanny` functions
- date - name - class - call - data - objective

lda

- date - name - class - call

qda

- date - name - class - call - terms

Statistical tests

htest

- alternative - method - date - name - class

When none of the above classes applies, the following tags are assigned by default

default

- name - class - date

data.frame

- name - class - date - varname
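A hedged usage sketch of these Tags; the argument names (dir, tag) follow the calls shown in the other issues:

model <- lm( Sepal.Length ~ Species, data = iris )
saveToRepo( model, dir = myRepo )                    # stores tags such as "class:lm", "coefname:...", "date:..."
searchInLocalRepo( tag = "class:lm", dir = myRepo )  # returns the model's md5hash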

shinySearch use Case

In my opinion shinySearch is an extremely useful function and should be advertised in its own use case. What do you think?

R topics documented in order

In the lattice package the developers came up with a way to order the documentation: they simply add letters in alphabetical order to the beginning of the .Rd file names. This ordering can suggest a path through the manual that optimizes understanding. The manual is here.

saveToRepo adds a new entry even if an object with the given md5hash already exists

After
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)
saveToRepo(iris, dir = myRepo)

I've got
(iris_md5hash <- searchInLocalRepo( tag = "name:iris", dir = myRepo))
[1] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[3] "ff575c261c949d073b2895b05d1097c3" "ff575c261c949d073b2895b05d1097c3"
[5] "ff575c261c949d073b2895b05d1097c3"

Maybe it is worth warning the user that the given artifact already exists.

The output from searchInLocalRepo() should either be passed through unique() or carry additional attributes, like the creation date.
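A minimal sketch of the unique() variant:

iris_md5hash <- unique( searchInLocalRepo( tag = "name:iris", dir = myRepo ) )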

cacheLocal and cacheGithub

Shouldn't cache() be a basic function in archivist?
We could also create a GitHub version of this function if it turns out to be efficient :)

multiSearchInRepo

Maybe we should think about creating a multiSearch function that takes many conditions about an artifact as a parameter, searches for the md5hashes matching each condition, and in the end intersects those md5hashes to get the ones that fulfill all of the conditions.

Example:

multiSearchInRepo <- function( conditions, repoDir, fixed ){
  # search for the md5hashes matching each condition separately ...
  md5hashesList <- Map( function( cond, fx ) searchInRepo( cond, repoDir, fx ),
                        conditions, fixed )
  # ... and keep only the hashes that satisfy all conditions
  Reduce( intersect, md5hashesList )
}
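A hypothetical usage example of the sketch above:

multiSearchInRepo( c( "class:lm", "name:model1" ), repoDir = myRepo, fixed = c( TRUE, TRUE ) )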

rmFromRepo() does not remove miniature

After
saveToRepo(iris, dir = myRepo)
rmFromRepo( md5hash = "ff575c261c949d073b2895b05d1097c3", dir = myRepo)

I still have a miniature in the gallery folder

TODO: repo statistics

Various statistics can be calculated for the objects in the repository, e.g. a calendar plot with the timestamps of objects from the repo, a barplot of the classes of these objects, etc.
summaryRepo() could present such statistics.
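A minimal sketch of the class barplot, assuming backpack.db contains a tag table with a tag column (as the other issues suggest):

library( RSQLite )
conn <- dbConnect( SQLite(), file.path( myRepo, "backpack.db" ) )
classTags <- dbGetQuery( conn, "SELECT tag FROM tag WHERE tag LIKE 'class:%'" )$tag
dbDisconnect( conn )
barplot( table( sub( "^class:", "", classTags ) ), main = "Artifacts per class" )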

Reverting changes

I apologize for any inconvenience caused by today's reverting of the same commit. I thought there was a bug, but there wasn't.

createEmptyRepo fails if dir does not exist

The call to createEmptyRepo(dir) fails if dir does not exist.
The following message is displayed:

Error in sqliteNewConnection(drv, ...) :
RS-DBI driver: (could not connect to dbname:
unable to open database file
)

Either the message should be changed or the desired dir should be created first.
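A possible fix, sketched; file.exists() is used so the check also works on older R versions:

createEmptyRepo <- function( dir ) {
  if ( !file.exists( dir ) ) dir.create( dir, recursive = TRUE )
  # ... then open the SQLite connection as before
}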

Vignette

Future issue for vignettes. For now, every commit about the vignette can be tagged with this issue.

Install from github problem

Do you know what might be the reason the package does not install properly?

> install_github("pbiecek/archivist")
Installing github repo archivist/master from pbiecek
Downloading master.zip from https://github.com/pbiecek/archivist/archive/master.zip
Installing package from C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa/master.zip
Installing archivist
"C:/PROGRA~1/R/R-31~1.1/bin/x64/R" --vanilla CMD INSTALL  \
  "C:\Users\Marcin\AppData\Local\Temp\Rtmp86dDLa\devtools15fc6b22ef1\archivist-master"  \
  --library="C:/Users/Marcin/Documents/R/win-library/3.1"  \
  --install-tests 

* installing *source* package 'archivist' ...
** R
** demo
** preparing package for lazy loading
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Warning: replacing previous import by 'shiny::validate' when loading 'archivist'
* DONE (archivist)
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’
> library(archivist)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: shiny
Loading required package: lubridate
Loading required package: jsonlite

Attaching package: 'jsonlite'

The following object is masked from ‘package:shiny’:

    validate

The following object is masked from ‘package:utils’:

    View


 Welcome to the archivist package (ver 1.2).
Warning message:
replacing previous import by 'shiny::validate' when loading namespace 'archivist'
> help(packages="archivist")
Error in loadNamespace(name) : there is no package called ‘NA’

Correct examples

After adding the force argument to a few functions, some of the examples might not work properly. Also, after changing the order of parameters in summaryRepo, the examples might be changed to simpler ones.

Avoid unnecessary calls to the digest function

In the saveToRepo() function there is a
md5hash <- digest( object )

but then digest( object ) is called four or more times, which is not efficient.
With the already calculated md5hash there is no need for repeated calculations (reusing it should speed up saving).

Regexps in searchInLocalRepo()

Right now there are two competing arguments with search criteria: tag and regex.
The user has to set method = "regexp" to make the second argument work.
It's neither consistent with base R nor necessary.

Please consider two approaches:

  • recognise whether tag is set; if not, assume that regexp holds the required match criteria (similarly to how agnes() recognises whether it is given a data frame or a distance matrix)
  • the approach used in grep(): if fixed = TRUE (let's use fixed instead of method to be more consistent with base R) then the first argument is an exact match, if fixed = FALSE then use the LIKE '%...%' scheme; a sketch follows this list. Instead of 'tag' maybe 'pattern'?
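A minimal sketch of the grep()-style interface, assuming the backpack.db tag table layout suggested by the other issues:

searchInLocalRepo <- function( pattern, repoDir, fixed = TRUE ) {
  conn <- dbConnect( SQLite(), file.path( repoDir, "backpack.db" ) )
  on.exit( dbDisconnect( conn ) )
  query <- if ( fixed ) {
    sprintf( "SELECT artifact FROM tag WHERE tag = '%s'", pattern )
  } else {
    sprintf( "SELECT artifact FROM tag WHERE tag LIKE '%%%s%%'", pattern )
  }
  unique( dbGetQuery( conn, query )$artifact )
}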

non root directory for github repository

It would be good to be able to set not only the user, repository and branch, but also a directory for the archivist gallery on GitHub.

Then one can have a single GitHub project with many repositories in different directories.

Order of arguments in saveToRepo

The only two required arguments of saveToRepo() are object and dir.
Maybe they should be listed as the first two arguments.
Then one can use the function without specifying argument names, e.g.
saveToRepo(iris, demoDir)

Instead of
saveToRepo(object, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, dir, rememberName = TRUE)

use
saveToRepo(object, dir, ..., archiveData = TRUE, archiveTags = TRUE,
archiveMiniature = TRUE, rememberName = TRUE)

Also, consider other names for the 'dir' argument. Maybe: repo, repoDir, gallery, or something more 'archivist-specific'.

deleteRepo()

Since we have createEmptyRepo(), it would be nice to have deleteRepo() as well.

In the future, please also consider a copyToRepo function, which would copy a list of artifacts from one repo to another.

Those two functions should round out the functionality for version 1.0.

extractData gets variables from the Global Environment

Currently, the extractData function gets variables from the Global Environment.
Since saveToRepo() may be used inside any other function, maybe it would be better to change the environment to something like parent.frame(1) or parent.frame(2)?

New function: addTag - to update the list of tags

Problem:

  • after an artifact is added there is no way to add/update its tags
  • sometimes one discovers that it would be nice to have additional properties exposed as tags

Proposed solution:

  • an addTag(md5hashes, repoDir, FUN, tags) function that takes a list of md5hashes;
    if the user specifies tags, these tags are added to all md5hashes;
    if the user specifies FUN, FUN is executed on each md5hash's object; FUN returns a character vector (= new tags), and the new tags are added to the database

Example:
take all objects of class lm from the repository, extract R2 for each of them and store the values as R2: tags

md5hashes <- searchInLocalRepo(repoDir, "class:lm")
addTag(md5hashes, repoDir, function(x) paste0("R2:", summary(x)$r.squared))
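A minimal sketch of the proposed function, assuming a recent DBI/RSQLite, the backpack.db tag table layout suggested elsewhere, and that loadFromLocalRepo() can return the artifact itself:

addTag <- function( md5hashes, repoDir, FUN = NULL, tags = NULL ) {
  conn <- dbConnect( SQLite(), file.path( repoDir, "backpack.db" ) )
  on.exit( dbDisconnect( conn ) )
  for ( hash in md5hashes ) {
    # either take the user-supplied tags or compute new tags from the stored artifact
    newTags <- if ( is.null( FUN ) ) tags else FUN( loadFromLocalRepo( hash, repoDir, value = TRUE ) )
    for ( tag in newTags ) {
      dbExecute( conn, "INSERT INTO tag (artifact, tag, createdDate) VALUES (?, ?, ?)",
                 params = list( hash, tag, as.character( Sys.time() ) ) )
    }
  }
  invisible( md5hashes )
}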

Are we ready for 1.1?

I will remove the vignettes in the CRAN build;
the examples and rda files are large,
so let them stay on GitHub.

print.repository fails with an empty repo

For an empty repo summaryLocalRepo() returns

List of 5
$ artifactsNumber: int 0
$ dataSetsNumber : int 0
$ classesNumber : Named list()
$ savesPerDay : Named list()
$ classesTypes : chr(0)
- attr(*, "class")= chr "repository"

and then print.repository fails when attempting

names(classes) <- "Number"

(classes is empty as well)

Error in names(classes) <- "Number" :
'names' attribute [1] must be the same length as the vector [0]
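A possible guard, sketched:

if ( length( classes ) > 0 ) names( classes ) <- "Number"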

Dependencies

Goal:
have as few dependencies as possible.
Proposition: limit the Depends section to
digest, RCurl, httr

If the user requests a miniature then more packages need to be loaded,
like ggplot2 for ggplot objects,
but these additional packages may be loaded on demand in the function that creates the miniature.

That will cut down the number of required packages (otherwise we need to load all packages that are supported by the data-extracting functions - survival, MASS, ggplot2 and others).
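A sketch of the on-demand loading idea for the miniature code; the file name is hypothetical:

if ( inherits( object, "ggplot" ) ) {
  if ( !requireNamespace( "ggplot2", quietly = TRUE ) )
    stop( "Install 'ggplot2' to create miniatures of ggplot objects." )
  ggplot2::ggsave( "miniature.png", plot = object )
}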

Is jsonlite a necessary dependency?

Hello, first I'd like to thank you for creating such a useful package to manage my various R workspaces in an innovative way!

I noticed that when I load archivist I receive the message The following object is masked from 'package:utils': View immediately after the jsonlite package is loaded. I use RStudio as my IDE for R, and normally when I click on the name of a data frame in my environment panel, it opens a new tab in my editor with the contents of the data frame (since it calls the View function from the utils package), and the name of the tab is the name of the data frame object. Since the jsonlite package contains its own version of the View function, when I do the same thing after loading archivist the data do appear in a separate tab, but the tab name is always x instead of the name of the data frame, which means I cannot have more than one data frame displayed separately.

I realize this stems from a different package; however, I searched your function documentation for any calls like @import jsonlite or @importFrom jsonlite and did not find any lines with these types of calls. Is there a way to remove jsonlite from the package dependencies? Or did I miss where you are using functions from that package?

extractData bug/problem

Until now, when artifacts could only return the call to themselves and not the real data, we tried to get those data from parent.frame(1), or previously from .GlobalEnv. But if a user specifies the data as data = as.numeric(Train[,-3]), every overloaded version of extractData will try get( "as.numeric(Train[,-3])", envir = parent.frame(1) ). Even if there is a Train object in R, there will probably never be an object named as.numeric(Train[,-3]).

My solution is to add a warning saying that the data could not be archived. That's what I've done in this commit.
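A minimal sketch of that fallback, assuming a hypothetical dataName that holds the deparsed data argument:

data <- tryCatch( get( dataName, envir = parent.frame(1) ),
                  error = function( e ) {
                    warning( "Data '", dataName, "' could not be archived." )
                    NULL
                  } )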

copyGithubRepo bug

> exampleRepoDir2 <- tempdir()
> createEmptyRepo( exampleRepoDir2 )
> md5hashes2 <- searchInGithubRepo( pattern= "relation", user = "pbiecek", repo = "archivist", fixed = FALSE )
> copyGithubRepo( repoTo = exampleRepoDir2,  md5hashes2,
+ user= "pbiecek", repo = "archivist")
 Error in file(con, "wb") : invalid 'description' argument 
7 file(con, "wb") 
6 writeBin(fileFromGithub, paste0(to, file)) at copyToRepo.R#150
5 FUN(X[[1L]], ...) 
4 lapply(X = X, FUN = FUN, ...) 
3 sapply(filesToDownload, cloneGithubFile, repo = repo, user = user, 
    branch = branch, to = repoTo) at copyToRepo.R#137
2 copyRepo(repoTo = repoTo, repoFrom = Temp, md5hashes = md5hashes, 
    local = FALSE, user = user, repo = repo, branch = branch) at copyToRepo.R#79
1 copyGithubRepo(repoTo = exampleRepoDir2, md5hashes2, user = "pbiecek", 
    repo = "archivist") 

No idea where the bug is.

Releases

It is possible to tag the date on which a new version (i.e. 1.1) was released.
See this and that.
