centerforassessment / randomnames

Function to generate random gender- and ethnicity-correct first and/or last names. Names are chosen proportionally, based on their probability of appearing in a large-scale database of real names.

Home Page: https://centerforassessment.github.io/randomNames

License: Other

R 100.00%

r random-names random-name-generators cran

randomnames's Introduction

randomNames

Overview

The randomNames package contains a single function, randomNames, which creates random gender- and ethnicity-correct first and/or last names. Names are sampled in proportion to their frequency in a large-scale database.

Installation

From CRAN

To install the latest stable release of randomNames from CRAN:

> install.packages("randomNames")

From GitHub

To install the development release of randomNames from GitHub:

> devtools::install_github("CenterForAssessment/randomNames")

Usage

> randomNames(5) ## 5 last, first names
[1] "Mossberg, Cassie"  "Mendiaz, Victoria" "Miner, Cassidy"    "Austin, Brook"     "Babcock, Lloyd"

> randomNames(5, gender=1) ## 5 female last, first names
[1] "Bruckner, Birva"   "Caringer, Madelyn" "Mendoza, Rebecca"  "el-Haque, Jaleela" "Williams, Miranda"

> randomNames(5, gender=0) ## 5 male last, first names
[1] "al-Salam, Rida"    "Debus, Kai"        "al-Aly, Jaabir"    "Garces, Markus"    "Robertson, Trevor"

> randomNames(5, gender=0, ethnicity=3) ## 5 African American, male last, first names
[1] "Bashir, Shaquille" "Ursery, Keilan"    "Marlow, Marvin"    "Bell, Daishavon"   "Hammond, Kyle"

> randomNames(5, gender=1, ethnicity=6, which.names="first") ## 5 Middle Eastern, female first names
[1] "Jawhara"  "Raaniya"  "Ghaada"   "Ghazaala" "Raabia"

Resources

Contributors

The randomNames Package is crafted with ❤️ by:

Damian Betebenner

I love feedback and am happy to answer questions.

randomnames's People

Contributors


dbetebenner adamvi ywang115 estebahr joshwlambert

randomnames's Issues

Plans to expand database of names?

Is there any plan to increase the number of names? I often exceed the number of names (when sampling without replacement) and it would be great to know if there are plans to add more, or are you open to accepting contributions (for example for ethnicities currently not included)?

Please add argument `initial.letter =`

Hello, I really like this package; it has automated a lot of my work. It would be perfect if I could choose random names that start with a specific letter through an argument of the function, for example:

randomNames(n=5, gender=1, ethnicity = 4, which.names="first", initial.letter="M")
# [1] "Maria" "Magdalena" "Margarita" "Margot" "Milagros"

Thank you, and looking forward to an answer. Greetings from Peru.
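A possible workaround for the request above is to generate more names than needed and keep only those with the wanted initial. This is a minimal base R sketch: `pick_by_initial` is a hypothetical helper, not part of randomNames, and the `candidates` vector stands in for the output of a call such as `randomNames(200, gender = 1, ethnicity = 4, which.names = "first")`.

```r
# Hypothetical helper: keep up to `n` distinct names starting with `letter`.
pick_by_initial <- function(candidates, letter, n) {
  cleaned <- trimws(candidates)                      # guard against stray whitespace
  hits <- cleaned[startsWith(toupper(cleaned), toupper(letter))]
  head(unique(hits), n)                              # preserve order, drop repeats
}

# Stand-in for generated names (assumed data, for illustration only).
candidates <- c("Maria", "Ana", "Margot", "Luz", "Milagros", "Maria")
picked <- pick_by_initial(candidates, "M", 2)
print(picked)
```

If fewer than `n` matches come back, the candidate pool was too small and a larger draw is needed.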

Handle NAs in randomNames()'s gender argument

When calling randomNames(gender = g) on a gender vector that contains NAs, the generated names no longer represent gender correctly:

# Define gender vector
> (g <- rep(0:1, each = 3))
[1] 0 0 0 1 1 1

# Gender is correctly represented
> randomNames::randomNames(gender = g, which.names = "first")
[1] "Samuel"   "Carlos"   "Theodore" "Emlynn"   "Briana"   "Deborah" 

# Include NA in gender vector
> g[3] <- NA
> randomNames::randomNames(gender = g, which.names = "first")
[1] "Maleeha" "Sean"    "Sad"     "Sang"    "Labeeba" "Carter" 

# First gender is 0 (male)
> g[1]
[1] 0

# "Maleeha" is not in any of the male first-name lists
> fn_male <- grep("^first.*g0$", names(randomNames::randomNamesData))
> sapply(fn_male, function(i)  "Maleeha" %in% names(randomNames::randomNamesData[[
+     names(randomNames::randomNamesData)[i]
+     ]])
+ )
[1] FALSE FALSE FALSE FALSE FALSE FALSE
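One way to sidestep the behaviour reported above is to generate names only for the non-missing entries and leave NA in the remaining positions. The sketch below is a workaround under assumptions, not a fix inside the package: `names_skip_na` and `toy_generator` are hypothetical, and in real use the generator would be something like `function(g) randomNames::randomNames(gender = g, which.names = "first")`.

```r
# Hypothetical wrapper: call `generator` only on non-missing gender codes.
names_skip_na <- function(gender, generator) {
  out <- rep(NA_character_, length(gender))
  ok <- !is.na(gender)
  if (any(ok)) out[ok] <- generator(gender[ok])
  out
}

# Toy stand-in so the sketch runs without the package.
toy_generator <- function(g) ifelse(g == 1, "Briana", "Samuel")

g <- c(0, 0, NA, 1, 1, 1)
result <- names_skip_na(g, toy_generator)
print(result)
```

Positions with a missing gender stay NA, so the output remains aligned with the input vector.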

Partial argument matching in `rep()`

First and foremost, just to say that I really like the {randomNames} R package. It has been really useful for a package we're developing in the Epiverse-TRACE initiative: {simulist}.

Within that package we use a testing setup option that automatically checks whether we are using partial argument matching (i.e. only matching some of the argument name within a function which is then matched to the full argument name by R). This check also picks up if dependencies used within the package are using partial argument matching, and it has detected some in {randomNames}.

Specifically, the rep() function, where currently, length is being used and partially matched to length.out. It would be great if this could be updated to avoid partial matching. I've added some links at the bottom of this issue to why using partial matching can make code brittle.

I'm happy to make a PR from a fork of the package to make the recommended changes.
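For readers unfamiliar with the check described above, base R can warn about partial argument matching through the `warnPartialMatchArgs` option. The sketch below uses a hypothetical closure `repeat_value` with a `length.out` argument to show the mechanism; it does not reproduce the exact calls inside randomNames.

```r
# Hypothetical function with a `length.out` argument, used only for the demo.
repeat_value <- function(x, length.out) rep(x, length.out = length.out)

old <- options(warnPartialMatchArgs = TRUE)   # ask R to warn on partial matches

# `length` partially matches `length.out`; capture the warning text if raised.
partial_msg <- tryCatch(
  { repeat_value(1, length = 3); "no warning" },
  warning = function(w) conditionMessage(w)
)

# Spelling the argument out in full raises no warning.
full_msg <- tryCatch(
  { repeat_value(1, length.out = 3); "no warning" },
  warning = function(w) conditionMessage(w)
)

options(old)                                  # restore the previous setting
print(partial_msg)
print(full_msg)
```

Writing the full argument name keeps code working if a function later gains another argument that starts with the same letters.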

Links:

New package version & CRAN release?

Thank you for your collaboration in getting #82 merged and in responding to the issues I've raised. Would it be possible to make a new version release and submit this new version to CRAN?

This would assist getting a package I am working on onto CRAN (see epiverse-trace/simulist#1).

I've completed the reverse dependency check and there were no errors and everything passed. I can paste the output logs of this revdep_check() on this issue if you would like.

I was unable to run the R CMD check due to the inst/doc directory.

Please let me know if you would be open to the idea of a new release and a submission of this new release to CRAN, I am happy to assist in any way possible.

Leading whitespaces in some names

When running the command
randomNames(1, which.names = "first", gender = 1)
some names are returned which have leading whitespaces. E.g. " Huda" and " Muzna".

Even though this can be fixed easily with trimws it is unexpected and maybe problematic to some people if these names are also returned without whitespace sometimes.
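The `trimws` workaround mentioned above looks like this. The input vector is made-up sample data that mirrors the reported names; in practice it would be the output of `randomNames()`.

```r
# Sample values mirroring the report: two names with a leading space.
raw_names <- c(" Huda", " Muzna", "Huda")

# trimws() strips leading and trailing whitespace by default.
clean_names <- trimws(raw_names)
print(clean_names)

# After trimming, " Huda" and "Huda" collapse to one distinct name.
print(unique(clean_names))
```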

Anonymize name across multiple records

Hi! I am successfully using this great package in data frames with one record per person. I would like to use it when I have multiple records.
Here's a quick example:

ID	NAME	YEAR	GRADE	ANONYMIZED NAME
45	Sue	2023	5	Beth
45	Sue	2024	6	Kayla

Is there a method to assign the same anonymized name to a person with the same unique identifier such as ID?
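One approach that needs no change to the package is to draw a single name per unique ID and join it back onto the records. In this base R sketch the data frame and the `aliases` vector are illustrative assumptions; with the package installed, `aliases` could come from something like `randomNames::randomNames(length(ids), which.names = "first")`.

```r
# Example records: ID 45 appears twice, ID 46 once (made-up data).
records <- data.frame(
  ID   = c(45, 45, 46),
  NAME = c("Sue", "Sue", "Ann"),
  YEAR = c(2023, 2024, 2023)
)

ids <- unique(records$ID)

# One alias per unique ID (stand-in for generated names).
aliases <- c("Beth", "Kayla")

# Named lookup vector: ID (as character) -> alias.
lookup <- setNames(aliases, ids)
records$ANONYMIZED_NAME <- unname(lookup[as.character(records$ID)])
print(records)
```

Every row with the same ID receives the same alias. Drawing the aliases without replacement would additionally keep two different people from sharing one.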

firstnames weighted by birth year?

Hi - this is a neat package for a specific purpose. One possible nice feature: could you add parameters birth_year_start = 1960 (default) and birth_year_end = 2000 (default)? The user could then change these parameters to get first names appropriate for people born between 1880 and 1890, or between 2000 and 2010. Ideally this would use the weighted frequency of each first name by gender for the included range of birth years.

data source?

Hi, this is very useful. Could you let me know what was the source dataset(s) used to create the name distributions?

regards
Lucas

offer option to sample without replacement

... to avoid duplicates, sometimes people might want to have unique names.

Error message populating out even though the package runs fine

Installing the package works fine, and so does loading it, but library(randomNames) always prints what looks like an error message:
randomNames 1.4-0.0 (3-7-2019). For help: >help("randomNames") or visit https://centerforassessment.github.io/randomNames

Uninformative error message when exhausting names

It seems that when the available names are exhausted in randomNames() (with sample.with.replacement = FALSE), it gives an uninformative error message about sampling. It would be great if the {randomNames} package could provide the user with a custom, informative error message when the requested number of names is too large. This error message could also suggest setting sample.with.replacement to TRUE.

Here is a reprex to show an example

library(randomNames)
set.seed(1)
gender <- rep(c("M", "F"), 2525)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
str(names)
#>  chr [1:5050] "Sebastian Clayton" "Melisa White" "Eli Jackson" "Malisse Ha" ...

gender <- rep(c("M", "F"), 3000)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Created on 2024-01-18 with reprex v2.0.2
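The kind of check requested in this issue can be sketched in base R. The function name `sample_names_checked` and the toy pool are hypothetical, and this is not a description of how randomNames samples internally; it only illustrates validating the request before calling `sample()`.

```r
# Hypothetical helper: fail early with a message that says what to change.
sample_names_checked <- function(pool, n) {
  if (n > length(pool)) {
    stop(sprintf(
      "Requested %d unique names but only %d are available; reduce n or sample with replacement.",
      n, length(pool)
    ), call. = FALSE)
  }
  sample(pool, n)   # sample() draws without replacement by default
}

pool <- c("Clayton", "White", "Jackson")

ok <- sample_names_checked(pool, 2)
err <- tryCatch(sample_names_checked(pool, 5), error = function(e) conditionMessage(e))
print(err)
```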

Random Error with large samples without replacement

I want to generate many (9000) unique names to replace human unfriendly uuid-numbers. I wanted to extend the random name generator to enable infinite unique names by simply adding an integer when the maximum number of random names is reached.

While writing the function, I realised that I can't find a hard upper limit on the number of names I can generate without replacement: I get an error at different values of n.

For example below, the function returns an error on the first run, but is successful on the second, third and fourth try.

Can you elaborate on this?

library(randomNames)
#> Warning: package 'randomNames' was built under R version 3.6.3

set.seed(1)
one <- randomNames(5000, sample.with.replacement = FALSE)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population
#>  when 'replace = FALSE'
two <- randomNames(5000, sample.with.replacement = FALSE)
thr <- randomNames(5000, sample.with.replacement = FALSE)
fou <- randomNames(5000, sample.with.replacement = FALSE)

Created on 2020-03-18 by the reprex package (v0.3.0)
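The idea of appending an integer once names start repeating can be approximated with base R's `make.unique()`, applied to names drawn with replacement. The `drawn` vector below is made-up sample data standing in for the output of `randomNames(9000)`.

```r
# Stand-in for names sampled with replacement (contains repeats).
drawn <- c("Smith, Ann", "Lee, Bo", "Smith, Ann", "Smith, Ann")

# make.unique() leaves the first occurrence alone and appends
# sep + a running number to every later duplicate.
labels <- make.unique(drawn, sep = " ")
print(labels)
```

Because sampling with replacement never exhausts the pool, this route avoids the error entirely, at the cost of numbered suffixes on repeated names.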

randomNames(0) returns more than 'n' random first and/or last names.

The documentation states that the argument of randomNames(n) indicates "how many names to produce". In fact, the function always returns at least one name, which contradicts the documentation. This affects the correctness of the following examples:

names <- append(names, randomNames::randomNames(num - length(names)))

test_that("test number of generated names", {
  expect_equal(10, length(randomNames(10)))
  expect_equal(1,  length(randomNames(1)))
  expect_equal(0,  length(randomNames(0)))
})

As of randomNames 0.1-0.0, the documentation says

       n: OPTIONAL. Integer indicating how many name to produce. Best
          to use when no gender or ethnicity data is provided and one
          simple wants ‘n’ random first and/or last names.
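Callers who need the n = 0 case to behave as documented can guard the call themselves. This is a minimal sketch: `n_names` and `toy_generator` are hypothetical, and in real use the generator would be `randomNames::randomNames`.

```r
# Hypothetical guard: return an empty character vector when n is not positive.
n_names <- function(n, generator) {
  if (n <= 0) return(character(0))
  generator(n)
}

# Toy stand-in so the sketch runs without the package.
toy_generator <- function(n) rep("Doe, Jane", n)

lengths_seen <- c(
  length(n_names(10, toy_generator)),
  length(n_names(1, toy_generator)),
  length(n_names(0, toy_generator))
)
print(lengths_seen)
```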

set.seed() only works for the first row of a dataframe.

Neat package.

One minor thing, running this will result in a different person for the second row of the dataframe. The seed is only respected for the first run.

set.seed(1842)
randomNames(2, which.names = "both", return.complete.data = T)

The workaround:
library(dplyr)  # provides bind_rows()
set.seed(1842)
df1 <- bind_rows(
  randomNames(1, gender = TRUE, ethnicity = TRUE, which.names = "both", return.complete.data = TRUE),
  randomNames(1, gender = TRUE, ethnicity = TRUE, which.names = "both", return.complete.data = TRUE),
  randomNames(1, gender = TRUE, ethnicity = TRUE, which.names = "both", return.complete.data = TRUE)
)

Any thoughts on allowing us to set the seed so we can always reproduce the same set of names?

`sample.with.replacement = FALSE` across ethnicities/ genders

I have a possibly annoying feature request: would it be possible to make sample.with.replacement = FALSE work across ethnicities/genders?

I wanted a list of randomly generated, unique names, but had to use a workaround with unique() to get it to work.

library(randomNames)                                          
set.seed(7)                                                   
 
# expected unique names, but some are duplicated                                                             
random_names <- randomNames(100, which.names = 'first',       
sample.with.replacement = FALSE)                              
any(duplicated(random_names))                                 
#> [1] TRUE
                                                              
# by contrast, it works for a single ethnicity/ gender        
unique_random_names <- randomNames(100, which.names = 'first',
sample.with.replacement = FALSE, ethnicity = 1, gender = 1)   
any(duplicated(unique_random_names))                          
#> [1] FALSE
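The `unique()` workaround mentioned in this issue can be wrapped in a small top-up loop that keeps drawing until enough distinct names have been collected. This is a sketch under assumptions: `draw_unique` and `toy_generator` are hypothetical, and with the package the generator could be `function(k) randomNames::randomNames(k, which.names = "first")`.

```r
# Hypothetical helper: draw, de-duplicate, and top up until `n` unique names exist.
draw_unique <- function(n, generator, max_tries = 100) {
  out <- character(0)
  for (i in seq_len(max_tries)) {
    out <- unique(c(out, generator(n - length(out))))
    if (length(out) >= n) return(out[seq_len(n)])
  }
  stop("could not collect ", n, " unique names in ", max_tries, " tries", call. = FALSE)
}

# Toy generator that samples with replacement from a small pool.
toy_generator <- function(k) sample(c("Ana", "Bo", "Cy", "Di", "Ed"), k, replace = TRUE)

set.seed(7)
unique_names <- draw_unique(4, toy_generator)
print(any(duplicated(unique_names)))
```

The `max_tries` cap keeps the loop from running forever when the request exceeds the number of distinct names the generator can produce.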

Are the first and second names independent or associated

Sorry I couldn't figure this out from the documentation. Are the first and surnames conditionally independent (after accounting for ethnicity) or are they associated as per the original data?
