nanxstats / r-base-shortcuts

⚡ Base R shortcuts: A collection of lesser-known but powerful idioms and coding patterns for writing concise and fast R code

Home Page: https://nanx.me/blog/post/r-base-shortcuts/

design-patterns ergonomics idiomatic idioms r-base r-language rstats idiomatic-r

r-base-shortcuts

A collection of lesser-known but powerful base R idioms and shortcuts for writing concise and fast base R code, useful for beginner to intermediate R developers.

Please help me improve and extend this list. See contributing guide and code of conduct.

Why?

From 2012 to 2022, I answered thousands of R questions in the online community Capital of Statistics. These recipes are distilled from recurring patterns in those frequently asked questions, focusing on answers that are less commonly known.

Object creation
Object transformation
Conditions
Vectorization
Functions
- Specify formal argument lists with alist()
- Use internal functions without :::
Side-effects
- Return invisibly with invisible() for side-effect functions
- Use on.exit() for cleanup
Numerical computations
- Create step functions with stepfun()
Further reading

Object creation

Create sequences with `seq_len()` and `seq_along()`

seq_len() and seq_along() are safer than 1:length(x) or 1:nrow(x) because they avoid the unexpected result when x is of length 0:

# Safe version of 1:length(x)
seq_len(length(x))
# Safe version of 1:length(x)
seq_along(x)
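To make the failure mode concrete, here is a minimal sketch with a hypothetical empty vector `x` (the variable names are illustrative only). It compares what `1:length(x)` and the safer alternatives produce when there is nothing to iterate over:

```r
x <- character(0)  # hypothetical empty input

# 1:length(x) is 1:0, which counts down from 1 to 0,
# so a loop over it would run twice
bad <- 1:length(x)

# seq_along() and seq_len() return an empty sequence instead
good <- seq_along(x)
also_good <- seq_len(length(x))

# A loop over the safe sequence is skipped entirely
n_iter <- 0
for (i in seq_along(x)) n_iter <- n_iter + 1
```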

Repeat character strings with `strrep()`

When you need to repeat a string a certain number of times, instead of using the tedious pattern of paste(rep("foo", 10), collapse = ""), you can use the strrep() function:

strrep("foo", 10)

strrep() is vectorized over both arguments: you can pass vectors, and the shorter one is recycled to the length of the longer one:

fruits <- c("apple", "banana", "orange")
strrep(c("*"), nchar(fruits))
strrep(c("-", "=", "**"), nchar(fruits))

Create an empty list of a given length

Use the vector() function to create an empty list of a specific length:

n <- 5 # Desired length (example value)
x <- vector("list", n)

Create and assign S3 classes in one step

Avoid creating an object and assigning its class separately. Instead, use the structure() function to do both at once:

x <- structure(list(), class = "my_class")

Instead of:

x <- list()
class(x) <- "my_class"

This makes the code more concise when returning an object of a specific class.

Assign names to vector elements or data frame columns at creation

The setNames() function allows you to assign names to vector elements or data frame columns during creation:

x <- setNames(1:3, c("one", "two", "three"))
x <- setNames(data.frame(...), c("names", "of", "columns"))

Use `I()` to include objects as is in data frames

The I() function allows you to include objects as is when creating data frames:

df <- data.frame(x = I(list(1:10, letters)))
df$x
#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10
#>
#> [[2]]
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
#> [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

This creates a data frame with one column x that is a list of vectors.

Generate factors using `gl()`

Create a factor with specific levels with gl() by specifying the number of levels and the number of repetitions:

gl(n = 2, k = 5, labels = c("Low", "High"))
#> [1] Low  Low  Low  Low  Low  High High High High High
#> Levels: Low High

The gl() function is particularly useful when setting up experiments or simulations that involve categorical variables.

Object transformation

Insert elements into a vector with `append()`

When you need to insert elements into a vector at a specific position, use append(). Its after argument specifies the position after which the new elements are inserted, defaulting to the length of the vector being appended to.

For example, to insert the numbers 4, 5, 6 between 1, 2, 3 and 7, 8, 9:

x <- c(1, 2, 3, 7, 8, 9)
append(x, 4:6, after = 3)
#> [1] 1 2 3 4 5 6 7 8 9

Without append(), the solution would be more verbose and less readable:

c(x[1:3], 4:6, x[4:length(x)])
#> [1] 1 2 3 4 5 6 7 8 9

When after is set to 0, the new values are "appended" to the beginning of the input vector:

append(x, 4:6, after = 0)
#> [1] 4 5 6 1 2 3 7 8 9

Use `[` and `[[` as functions in apply calls

When you need to extract the same element from each item in a list or list-like object, you can leverage [ and [[ as functions (they actually are!) within lapply() and sapply() calls.

Consider a list of named vectors:

lst <- list(
  item1 = c(a = 1, b = 2, c = 3),
  item2 = c(a = 4, b = 5, c = 6),
  item3 = c(a = 7, b = 8, c = 9)
)

# Extract named element "a" using `[[`
element_a <- sapply(lst, `[[`, "a")

lst <- list(
  item1 = c(1, 2, 3),
  item2 = c(4, 5, 6),
  item3 = c(7, 8, 9)
)

# Extract first element using `[`
first_element <- sapply(lst, `[`, 1)

Sum all components in a list

Use the Reduce() function with the infix function + to sum up all components in a list:

x <- Reduce("+", lst)
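As a small worked sketch, assume a hypothetical list `lst` of equal-length numeric vectors. Reduce() adds the components pairwise from left to right, and accumulate = TRUE keeps the intermediate sums:

```r
# Hypothetical list of equal-length numeric vectors
lst <- list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300))

# Element-wise sum across all components
total <- Reduce(`+`, lst)

# accumulate = TRUE returns the running sums as a list
running <- Reduce(`+`, lst, accumulate = TRUE)
```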

Bind multiple data frames in a list

The do.call() function with the rbind argument allows you to bind multiple data frames in a list into one data frame:

df_combined <- do.call("rbind", list_of_dfs)

Alternatively, more performant solutions for such operations are offered in data.table::rbindlist() and dplyr::bind_rows(). See this article for details.
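A minimal sketch of the base R approach, using a hypothetical `list_of_dfs` with two small data frames. Note that rbind() expects the data frames to share the same column names:

```r
# Hypothetical list of data frames with identical columns
list_of_dfs <- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3:4, value = c("c", "d"))
)

# Equivalent to rbind(list_of_dfs[[1]], list_of_dfs[[2]])
df_combined <- do.call(rbind, list_of_dfs)
```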

Use `modifyList()` to update a list

The modifyList() function allows you to easily update values in a list without a verbose syntax:

old_list <- list(a = 1, b = 2, c = 3)
new_vals <- list(a = 10, c = 30)
new_list <- modifyList(old_list, new_vals)

This can be very useful for maintaining and updating a set of configuration parameters.
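The configuration use case might look like the sketch below. The names `defaults` and `overrides` are hypothetical. It also shows two documented behaviors: a NULL value removes the element (with the default keep.null = FALSE), and names not present in the original list are added:

```r
# Hypothetical default settings and user overrides
defaults <- list(width = 800, height = 600, title = "Plot")
overrides <- list(height = 400, title = NULL, dpi = 300)

# height is replaced, title is dropped, dpi is added
config <- modifyList(defaults, overrides)
```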

Run-length encoding

Run-length encoding is a simple form of data compression in which sequences of the same element are replaced by a single instance of the element followed by the number of times it appears in the sequence.

Suppose you have a vector with many repeating elements:

x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)

You can use rle() to compress this vector and decompress the result back into the original vector with inverse.rle():

x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)

(y <- rle(x))
#> Run Length Encoding
#>   lengths: int [1:5] 3 2 4 3 2
#>   values : num [1:5] 1 2 3 2 1

inverse.rle(y)
#> [1] 1 1 1 2 2 3 3 3 3 2 2 2 1 1
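Beyond compression, the lengths and values components are handy for run analysis. The following sketch, using the same vector, locates the longest run and its position; the variable names are illustrative:

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 1)
r <- rle(x)

# Index of the longest run (the first one, if there are ties)
longest <- which.max(r$lengths)
run_value <- r$values[longest]
run_length <- r$lengths[longest]

# End positions of all runs are the cumulative sums of the lengths
run_end <- cumsum(r$lengths)[longest]
run_start <- run_end - run_length + 1
```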

Conditions

Use `inherits()` for class checking

Instead of using the class() function in conjunction with ==, !=, or %in% operators to check if an object belongs to a certain class, use the inherits() function.

if (inherits(x, "class"))

This will return TRUE if "class" is one of the classes from which x inherits. This replaces the following more verbose forms:

if (class(x) == "class")

if (class(x) %in% c("class1", "class2"))

It is also more reliable because it checks for class inheritance, not just the first class name (R supports multiple classes for S3 and S4 objects).
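A short sketch of why this matters, using a hypothetical object with two S3 classes. Comparing class(x) with == yields one result per class name, which if() rejects as a condition in R 4.2.0 and later, while inherits() always returns a single logical value:

```r
# Hypothetical object with two S3 classes
x <- structure(list(), class = c("child", "parent"))

# One comparison result per class name: not a valid if() condition
eq_result <- class(x) == "parent"

# A single TRUE/FALSE, regardless of how many classes x has
is_parent <- inherits(x, "parent")

# which = TRUE reports the position of each match (0 means no match)
positions <- inherits(x, c("parent", "other"), which = TRUE)
```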

Replace multiple `ifelse()` with `cut()`

For a series of range-based conditions, use cut() instead of chaining multiple if-else conditions or ifelse() calls:

categories <- cut(
  x,
  breaks = c(-Inf, 0, 10, Inf),
  labels = c("negative", "small", "large")
)

This assigns each element in x to the category that corresponds to the range it falls in.
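One detail worth checking is how boundary values are assigned. By default cut() uses right-closed intervals, so with the breaks above a value of exactly 0 lands in the first category. The sketch below, with hypothetical input values, contrasts this with right = FALSE:

```r
x <- c(-5, 0, 3, 10, 25)  # hypothetical values, including the break points
brk <- c(-Inf, 0, 10, Inf)
lab <- c("negative", "small", "large")

# Default right = TRUE: (-Inf, 0], (0, 10], (10, Inf)
closed_right <- cut(x, breaks = brk, labels = lab)

# right = FALSE: [-Inf, 0), [0, 10), [10, Inf)
closed_left <- cut(x, breaks = brk, labels = lab, right = FALSE)
```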

Simplify recoding categorical values with `factor()`

When dealing with categorical variables, you might need to replace or recode certain levels. This can be achieved using chained ifelse() statements, but a more efficient and readable approach is to use the factor() function:

x <- c("M", "F", "F", NA)

factor(
  x,
  levels = c("F", "M", NA),
  labels = c("Female", "Male", "Missing"),
  exclude = NULL # Include missing values in the levels
)

Save the number of `if` conditions with upcasting

Sometimes, the number of conditions checked in multiple if statements can be reduced by cleverly using the fact that in R, TRUE is upcasted to 1 and FALSE to 0 in numeric contexts. This can be useful for selecting an index based on a set of conditions:

i <- (width >= 960) + (width >= 1140) + 1
p <- p + facet_wrap(vars(class), ncol = c(1, 2, 4)[i])

This does the same thing as the following code, but in a much more concise way:

if (width >= 1140) p <- p + facet_wrap(vars(class), ncol = 4)
if (width >= 960 && width < 1140) p <- p + facet_wrap(vars(class), ncol = 2)
if (width < 960) p <- p + facet_wrap(vars(class), ncol = 1)

This works because the condition checks in the parentheses result in a TRUE or FALSE, and when they are added together, they are upcasted to 1 or 0.

Use `findInterval()` for many breakpoints

If you want to assign a variable to many different groups or intervals, instead of using a series of if statements, you can use the findInterval() function. Using the same example above:

breakpoints <- c(960, 1140)
ncols <- c(1, 2, 4)
i <- findInterval(width, breakpoints) + 1
p <- p + facet_wrap(vars(class), ncol = ncols[i])

The findInterval() function finds which interval each number in a given vector falls into and returns a vector of interval indices. It's a faster alternative when there are many breakpoints.
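As a quick check that the two approaches agree, the sketch below applies both to a hypothetical vector of widths, including the break points themselves. With its defaults, findInterval() treats intervals as closed on the left, which matches the >= comparisons:

```r
widths <- c(500, 960, 1000, 1140, 1600)  # hypothetical widths
breakpoints <- c(960, 1140)
ncols <- c(1, 2, 4)

# Index from summing logical comparisons
by_sum <- (widths >= 960) + (widths >= 1140) + 1

# Index from findInterval()
by_interval <- findInterval(widths, breakpoints) + 1

chosen <- ncols[by_interval]
```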

Vectorization

Use `match()` for fast lookups

The match() function can be faster than which() for looking up values in a vector:

index <- match(value, my_vector)

This code sets index to the position of the first occurrence of value in my_vector, or NA if there is no match.
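The sketch below uses hypothetical data to show the difference from which(), which returns every matching position, and a common lookup-table idiom built on match():

```r
my_vector <- c("b", "a", "c", "a")

# First match only; NA when the value is absent
idx <- match(c("a", "z"), my_vector)

# which() returns all matching positions
all_a <- which(my_vector == "a")

# Lookup-table idiom: translate codes to labels
codes <- c("M", "F", "M")
lookup <- c(F = "Female", M = "Male")
labels <- unname(lookup[match(codes, names(lookup))])
```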

Use `mapply()` for element-wise operations on multiple lists

mapply() applies a function over a set of lists in an element-wise fashion:

mapply(sum, list1, list2, list3)
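Here is a runnable sketch with three hypothetical lists. The first call sums the first elements together, then the second elements, and so on; the second call shows MoreArgs, which passes a constant to every call instead of iterating over it:

```r
# Hypothetical input lists of equal length
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)
list3 <- list(100, 200, 300)

# sum(1, 10, 100), sum(2, 20, 200), sum(3, 30, 300)
totals <- mapply(sum, list1, list2, list3)

# k is held fixed across calls via MoreArgs
scaled <- mapply(function(x, y, k) (x + y) * k,
                 list1, list2, MoreArgs = list(k = 2))
```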

Simplify element-wise min and max operations with `pmin()` and `pmax()`

To compare two or more vectors element-wise and get the minimum or maximum of each set of elements, use pmin() and pmax().

vec1 <- c(1, 5, 3, 9, 5)
vec2 <- c(4, 2, 8, 1, 7)

# Instead of using sapply() or a loop:
sapply(seq_along(vec1), function(i) min(vec1[i], vec2[i]))
sapply(seq_along(vec1), function(i) max(vec1[i], vec2[i]))

# Use pmin() and pmax() for a more concise and efficient solution:
pmin(vec1, vec2)
pmax(vec1, vec2)

pmin() and pmax() perform these operations much more efficiently than alternatives such as applying min() and max() in a loop or using sapply(). This can lead to a noticeable performance improvement when working with large vectors.
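Because scalars are recycled, the same functions give a compact way to clamp values to a range. This is a sketch with hypothetical data and bounds of 0 and 1; missing values propagate unless na.rm = TRUE is set:

```r
x <- c(-3, 0.2, 0.7, 1.5, NA)  # hypothetical values

# Cap at 1 from above, then at 0 from below
clamped <- pmax(0, pmin(x, 1))
```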

Apply a function to all combinations of parameters

Sometimes we need to run a function on every combination of a set of parameter values, for example, in grid search. We can use the combination of expand.grid(), mapply(), and do.call() + rbind() to accomplish this.

Suppose we have a simple function that takes two parameters, a and b:

f <- function(a, b) {
  result <- a * b
  data.frame(a = a, b = b, result = result)
}

Create a grid of a and b parameter values to evaluate:

params <- expand.grid(a = 1:3, b = 4:6)

We use mapply() to apply f to each row of our parameter grid. We will use SIMPLIFY = FALSE to keep the results as a list of data frames:

lst <- mapply(f, a = params$a, b = params$b, SIMPLIFY = FALSE)

Finally, we bind all the result data frames together into one final data frame:

do.call(rbind, lst)

Generate all possible combinations of given characters

To generate all possible combinations of a given set of characters, expand.grid() and do.call() with paste0() can help. The following snippet produces all possible three-character strings consisting of lowercase letters and digits:

x <- c(letters, 0:9)
do.call(paste0, expand.grid(x, x, x))

Here, expand.grid() generates a data frame where each row is a unique combination of three elements from x. Then, do.call(paste0, ...) concatenates each combination together into a string.
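The full example yields 36^3 = 46656 strings, so a smaller sketch with a hypothetical two-letter alphabet makes the ordering easier to see. The first column of expand.grid() varies fastest:

```r
x <- c("a", "b")  # hypothetical two-letter alphabet

# 2^3 = 8 combinations; stringsAsFactors = FALSE keeps plain characters
grid <- expand.grid(x, x, x, stringsAsFactors = FALSE)
combos <- do.call(paste0, grid)
```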

Vectorize a function with `Vectorize()`

If a function is not natively vectorized (it has arguments that only take one value at a time), you can use Vectorize() to create a new function that accepts vector inputs:

f <- function(x) x^2
lower <- c(1, 2, 3)
upper <- c(4, 5, 6)

integrate_vec <- Vectorize(integrate, vectorize.args = c("lower", "upper"))

result <- integrate_vec(f, lower, upper)
unlist(result["value", ])

The Vectorize() function works internally by leveraging the mapply() function, which applies a function over two or more vectors or lists.

Pairwise computations using `outer()`

The outer() function is useful for applying a function to every pair of elements from two vectors. This can be particularly useful for U-statistics and other situations requiring pairwise computations.

Consider two vectors of numeric values for which we wish to compute a custom function for each pair:

x <- rnorm(5)
y <- rnorm(5)

outer(x, y, FUN = function(x, y) x + x^2 - y)
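Note that outer() calls FUN once with two long vectors rather than once per pair, so FUN must be vectorized. For a scalar-only function, wrapping it in Vectorize() is one option. The sketch below uses a hypothetical function `f_scalar` whose if() condition only works on single values:

```r
x <- 1:3
y <- 1:4

# Hypothetical scalar-only function
f_scalar <- function(a, b) if (a > b) a - b else 0

# Vectorize() makes it usable with outer()
m <- outer(x, y, FUN = Vectorize(f_scalar))
```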

Functions

Specify formal argument lists with `alist()`

The alist() function can create lists where some elements are intentionally left blank (or are "missing"), which can be helpful when we want to specify formal arguments of a function, especially in conjunction with formals().

Consider this scenario. Suppose we are writing a function that wraps another function, and we want our wrapper function to have the same formal arguments as the original function, even if it does not use all of them. Here is how we can use alist() to achieve that:

original_function <- function(a, b, c = 3, d = "something") a + b

wrapper_function <- function() {
  # Forward the wrapper's arguments to the original function
  original_function(a = a, b = b, c = c, d = d)
}

# Give the wrapper the same formals as the original using `alist()`
formals(wrapper_function) <- alist(a = , b = , c = 3, d = "something")

Now, wrapper_function() has the same formal arguments as original_function(), and any arguments passed to wrapper_function() are forwarded to original_function(). The formals are assigned after the wrapper is defined, because an assignment inside the function body would only modify a local copy of the function. This way, even if wrapper_function() does not use all the arguments, it can still accept them, and code that uses wrapper_function() can be more consistent with code that uses original_function().

The alist() function is used here to create a list of formals where some elements are missing, which represents the fact that some arguments are required and have no default values. This would not be possible with list(), which cannot create lists with missing elements.
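A compact way to see the missing-element behavior is to build a function directly from an alist(), following the pattern shown in the as.function() help page. In this sketch the last element of the list becomes the function body:

```r
# a has no default (missing), b defaults to 2, the body is a + b
f <- as.function(alist(a = , b = 2, a + b))

result <- f(3)
arg_names <- names(formals(f))
```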

Use internal functions without `:::`

To use internal functions from packages without using :::, you can use utils::getFromNamespace():

f <- utils::getFromNamespace("f", ns = "package")
f(...)
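As a concrete sketch, the example below retrieves head.default from the utils namespace. The choice of function is only for illustration, and it assumes head.default is available in the namespace under that name, which I believe holds for current R versions:

```r
# Retrieve a function by name from a package namespace
head_default <- utils::getFromNamespace("head.default", ns = "utils")

first_three <- head_default(letters, 3)
```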

Side-effects

Return invisibly with `invisible()` for side-effect functions

R functions always return a value. However, some functions are primarily designed for their side effects. To suppress the automatic printing of the returned value, use invisible().

f <- function(x) {
  print(x^2)
  invisible(x)
}

The value of x can be used later when the result is assigned to a variable or piped into the next function.
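To confirm what invisible() does, withVisible() reports both the returned value and its visibility flag. The sketch below repeats the function above so that it runs on its own:

```r
f <- function(x) {
  print(x^2)
  invisible(x)
}

# The value is still returned, but flagged as not visible
res <- withVisible(f(3))

# Assignment works as usual
y <- f(3)
```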

Use `on.exit()` for cleanup

on.exit() is a useful function for cleaning up side effects, such as deleting temporary files or closing opened connections, even if a function exits early due to an error:

f <- function() {
  temp_file <- tempfile()
  on.exit(unlink(temp_file))

  # Do stuff with temp_file
}

f <- function(file) {
  con <- file(file, "r")
  on.exit(close(con))
  readLines(con)
}

The first function creates a temporary file and ensures it gets deleted when the function exits, regardless of why it exits. The second closes the connection in the same way. Note that on.exit() replaces previously registered expressions by default: use add = TRUE to keep them, and after to control whether the new expression runs after (the default) or before the existing ones.
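The effect of add and after can be sketched by recording the order in which handlers run. The helper names below (`cleanup_order`, `record`, `worker`) are hypothetical, and the after argument requires R 3.5.0 or later:

```r
cleanup_order <- function() {
  events <- character(0)
  record <- function(msg) events <<- c(events, msg)

  worker <- function() {
    on.exit(record("registered first"))
    # add = TRUE keeps the handler above instead of replacing it
    on.exit(record("registered second"), add = TRUE)
    # after = FALSE places this handler in front of the existing ones
    on.exit(record("registered last, runs first"), add = TRUE, after = FALSE)
    invisible(NULL)
  }

  worker()
  events
}

order_run <- cleanup_order()
```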

Numerical computations

Create step functions with `stepfun()`

The stepfun() function is an effective tool for creating step functions, which can be particularly handy in survival analysis. For instance, say we have two survival curves generated from Kaplan-Meier estimators, and we want to determine the difference in survival probabilities at a given time.

Create the survival curves using survfit():

library("survival")

fit_km <- survfit(Surv(stop, event == "pcm") ~ 1, data = mgus1, subset = (start == 0))
fit_cr <- survfit(Surv(stop, event == "death") ~ 1, data = mgus1, subset = (start == 0))

Convert these survival curves into step functions:

step_km <- stepfun(fit_km$time, c(1, fit_km$surv))
step_cr <- stepfun(fit_cr$time, c(1, fit_cr$surv))

With these step functions, it becomes straightforward to compute the difference in survival probabilities at specific times:

t <- 1:3 * 1000
step_km(t) - step_cr(t)
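For readers without the survival package, the mechanics can be sketched with made-up knots and heights. stepfun() takes n jump locations and n + 1 heights, where the first height applies to the left of the first knot. With the default right = FALSE the function is right-continuous, so the value at a knot is the height to its right:

```r
# Hypothetical step function: jumps at 1, 2, 3
sf <- stepfun(x = c(1, 2, 3), y = c(0, 10, 20, 30))

# Evaluate before the first knot, at knots, and between knots
vals <- sf(c(0.5, 1, 2.5, 3))

# Jump locations can be recovered with knots()
jump_points <- knots(sf)
```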

r-base-shortcuts's Issues

New R Code Snippet Suggestions

There are so many programmatic capabilities and secrets in R that it remains to be seen whether any R developer who has no more than a fundamental knowledge of the tool's 31 baseline packages, commonly referred to as the tidyverse, would ever discover.

R currently maintains nearly 20,000 active packages on its network, called CRAN.

Consequently, I arguably possess one of the most extensive collections of R code snippets in the world. These code snippets are principally targeted to address challenges supporting Data Science proper, Software Engineering, Machine Learning, and Generative Art.

To that end, I have hundreds of R code snippets that I have either captured or developed over the last 8 years.

This repository is so large that I had to develop a separate database and code library to manage it.

While I had most of the code snippets that you posted in your R Short Cuts, there were a few items that I didn't have.

So, as a quid pro quo to the effort that you so kindly shared, I thought I would share a few of the code snippets and ideas that I have collected over the years with you and your audience.

This code will be provided under what I refer to as the "R-Insight" series under the "Issues" section of your GitHub repository.

It is my sincere hope that you decide to share them either in their original form, or redacted as you interpret them.

In terms of credibility as a viable source in this space, I wrote a book on R which was published in July 2021 titled, "Conquering R Basics." This work can be found on Amazon Books.

If you are interested in a particular topic or interest about which I earlier referenced, reach out as I most likely have an article or two that describes it. -BR

R-Insight: Run-Length Encoding (RLE)

The rle function found in base R is obsolete and should not be used. A much better option is the subSeq function found in the doBy package which captures a series of data points related to an RLE, including the following:

First position in the sequence
Last position in the sequence
Sequence length
The midpoint position of a repeating sequence
The value being examined

RLE results are captured in a data frame. A Dot Plot has been added to facilitate the visualization of an RLE. In this example, binary values are examined:
library(broman)
library(doBy)
set.seed(7854)
y = sample(x = 0:1, size = 200, replace = TRUE)

# Returns a comprehensive RLE analysis in a data frame
yrle = subSeq(y)
dotplot(group = yrle$value, y = yrle$slength, main = "RLE Binary Dot Plot", xlab = "Value", ylab = "Run Length", jiggle = "fixed", bg = "red")

To get a table summary of the RLE analysis, apply the following code:
table(yrle$value, yrle$slength)
     1  2  3  4  6
  0 34 12  7  4  0
  1 31 12  8  5  1

Two facts are quickly discernible from the RLE analysis:

In the yrle data frame record 102, position 174-179, the RLE analysis shows the longest run-length pattern of (6) 1-based values. The corresponding Dot Plot supports this finding. If one was looking for an outlier pattern this is it.
Considering all run-length patterns in vector y, there are no consecutive patterns of (5) values for either 0 or 1.

Final Comments on R-Insight Series

With all due respect, I find you to be excessively myopic in your interpretive application of R functions. Your position is that you are only interested in base R functions for the creation and development of production code? That is my interpretation based on your email responses to my R-Insight series.

And you [REDACTED]? I hope your [REDACTED].

You probably don't know this but there is a separate R package that actually provides improved functionality for many of the Base R functions that you are interested in promoting. The functions provided in Base R are nearly a quarter century old but that is what you are promoting on your r-base-shortcuts page? Wow.

I believe your current mindset is grossly marinated in a state of ignorance and static thinking. R has radically changed over the last decade in both its function and its form.

New ways of thinking are being instituted in R on nearly a weekly basis. I am monitoring these changes in near real-time.

I will no longer be contributing ideas to this page as the thinking supporting it is grossly misinformed. The good news is that I will never be competing for a job or a project against you or anyone who thinks like you in such a limited fashion about technology. Anyone who thinks more broadly about these technologies has an extraordinary edge on those who think like you. Your thinking on this matter needs to be seriously adjusted.

And keep promoting the rle function as the function to use for run-length encoding because that function nets you nothing in terms of RLE analysis.

Stay myopic and statically defined by using 25 year-old functions in R.

Note: @nanxstats edited this comment to remove content that violated this project's code of conduct.

R-Insight: Unique Identifiers

Unique Identifiers are important in data analysis. They are used to uniquely identify a record within a dataset. It may be necessary to create unique identifiers beyond the sequential order of traditional datasets in R where the numeric sequence begins with 1.

There are other ways to create unique, more complex identifiers but this solution provides an economy of code to achieve the task.

This code snippet showcases how to generate unique identifiers using two different methods:

This example uses R's ability to generate temporary file names as a means to extract and generate Unique Identifiers:

Example 1:
library(easyr)
x = right(replicate(basename(tempfile()), n = 5), 8)

NOTE1: The n argument within the replicate function determines the number of Unique Identifiers to generate.

NOTE2: Using this method (Example 1), up to 8 characters can be used to create unique identifiers. The example generates five 8-character results.

Example 2:
library(wakefield)
x = string(n = 5, length = 8)

The n argument determines the number of Unique Identifiers to generate and the length argument controls the number of characters comprising each Unique Identifier.

NOTE: Example 2 is superior to Example 1 in terms of flexibility because the identifier length can be customized. Example 1 provides Unique Identifiers that cannot exceed 8 characters in length.

R-Insight: Return a Listing of all Functions in a Target R Package

The R code in this snippet allows the user to directly write console output in RStudio to a pre-designated text file. The purpose of this code snippet is to provide an alternative for capturing R data results when such data cannot be stored within a data object. List object data produced by various R functions, for example, are very difficult to convert to data frames, which is necessary if the data needs to be externally accessed and used. The following R code resolves this issue:

This example returns a listing of all R package functions by name and by structure, converting the output to a text file:

library(dplyr)
fp = paste0(path.expand("~"), "/dplyr_fcn_list.txt")
sink(fp, append = FALSE, split = FALSE)
lsf.str(pos = "package:dplyr")
sink()
closeAllConnections()

NOTE1: Before executing the code, the target package must be locally installed AND loaded.

NOTE2: The lsf.str function is an abbreviation for List Functions as a String.

R-Insight: expand.grid Obsolete

The expand.grid function found in Base R is largely obsolete. A better alternative to use is the vec_expand_grid function found in the vctrs package. It substantially expands on the expand.grid function by executing improved type-set rules:

Increased process performance
Produces sorted output by default
Never converts strings to factors
Does not add additional attributes
Drops NULL inputs
Can expand any vector type, including data frames and records

A more advanced example of a cross-balanced dataset shows three dimensions of data that are organized and connected from within a combinatorial structure of job positions, code provisions, and position categories. Simulated job titles were generated from the charlatan package:

library(charlatan)
library(vctrs)
set.seed(32491)
jb = ch_job(n = 10)
cd = paste0(sample(100:300, size = 3, replace = TRUE), ".", sample(1:8, size = 3, replace = TRUE))
ct = c("TRNG", "ONBRD", "HR")
ds = vec_expand_grid(job = jb, code = cd, cat = ct)

NOTE: When using the vec_expand_grid function, all arguments must be preceded by an argument name whether it is the default x and y parameter or as field names. If argument names are not defined, the function will crash.

R-Insight: Save All Data Objects in a Workspace to a Single File

When working in the R environment, data is created, collected, and used in the form of various list objects, vectors, models, and datasets. By applying the save.image function found in Base R, the entire active workspace can be saved to a single data file. By convention, the file extension is .RData. In RStudio, the workspace can also be saved to an .RData file from the Environment tab.

save.image(file = "my_work_space.RData")

To load the entire workspace back into an RStudio active session, apply the following code:
load("my_work_space.RData")

NOTE: If no directory is given in the file argument, the file is saved in the current working directory (see getwd()).
