datalowe / synr

An R package for handling synesthesia consistency test data. Explore, validate and summarize data.

Home Page: https://datalowe.github.io/synr/

License: Other

R 100.00%

synesthesia data-cleaning

synr's Introduction

synr

This is an R package for working with data resulting from grapheme-color synesthesia-related consistency tests. synr provides tools for exploring test data, including visualizing a single participant's data, and applying summarizing functions such as calculating color variation/consistency scores or classifying participant data as valid or invalid.

Installation

synr is available on CRAN, meaning you can simply run:

install.packages('synr')

Note that this also installs the packages that synr depends on (dbscan, data.table and ggplot2), unless you already have them.

Usage

Once data are in an appropriately formatted data frame/tibble ('long format'; see the vignettes for more information), everything starts with rolling up participant data into a 'ParticipantGroup' object with create_participantgroup:

library(synr)

pgroup <- create_participantgroup(
    formatted_df, # data frame/tibble to use, with data in 'long format'
    n_trials_per_grapheme=3, # number of trials per grapheme
    id_col_name="participant_id", # name of column which holds participant IDs
    symbol_col_name='symbol', # name of column which holds grapheme symbol strings
    color_col_name='color', # name of column which holds response color HEX codes
    color_space_spec = "Luv" # color space to use for all calculations with participant group
)
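For illustration, a data frame in the long format used above (one row per trial) could be built as in the sketch below. The column names match the call above, but all values are made up, and the assumption that a missing response is coded as NA should be checked against the vignettes.

```r
# Hypothetical example data: 2 participants x 2 graphemes x 3 trials each.
# Every row is a single trial: who responded, to which symbol, with what color.
formatted_df <- data.frame(
  participant_id = rep(c("p1", "p2"), each = 6),
  symbol = rep(rep(c("A", "7"), each = 3), times = 2),
  color = c(
    "#FF0000", "#F50A0A", "#FF1100", # p1, grapheme "A"
    "#0000FF", "#1100EE", "#0A0AF5", # p1, grapheme "7"
    "#00FF00", "#FFFF00", "#00FFFF", # p2, grapheme "A"
    "#333333", NA, "#3A3A3A"         # p2, grapheme "7" (one missing response)
  ),
  stringsAsFactors = FALSE
)

# Each participant/symbol combination should occur n_trials_per_grapheme times
table(formatted_df$participant_id, formatted_df$symbol)
```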

Using the resulting object (pgroup), you can call various methods. A few examples follow.

Example group-level method: get_mean_consistency_scores

pgroup$get_mean_consistency_scores(symbol_filter=LETTERS) would return a vector of CIELUV-based consistency scores, using only data from trials involving capital letters.
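To give a rough idea of what a CIELUV-based consistency score is, the sketch below sums the pairwise Euclidean distances between one grapheme's response colors after converting them to CIELUV with base R. The helper name consistency_score is hypothetical, and synr's own calculation may differ in details such as the reference white, so this is an illustration of the idea and not the package's implementation.

```r
# Hypothetical helper: consistency score for one grapheme's response colors.
# Lower scores mean more similar (more consistent) color responses.
consistency_score <- function(hex_codes) {
  # col2rgb() returns a 3 x n matrix (rows: red, green, blue) on a 0-255 scale
  rgb_mat <- t(grDevices::col2rgb(hex_codes)) / 255
  # Convert sRGB values to CIELUV; one row per color
  luv_mat <- grDevices::convertColor(rgb_mat, from = "sRGB", to = "Luv")
  # Sum of Euclidean distances between all pairs of response colors
  sum(stats::dist(luv_mat, method = "euclidean"))
}

consistency_score(c("#FF0000", "#FF0000", "#FF0000")) # identical responses give 0
consistency_score(c("#FF0000", "#00FF00", "#0000FF")) # very different responses give a large score
```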

Example group-level method: check_valid_get_twcv_scores

pgroup$check_valid_get_twcv_scores(symbol_filter=0:9) would return a data frame which describes classifications of all participant data, where each data set is classified as 'invalid' or 'valid', based largely on DBSCAN clustering. This may be used to identify participants who varied their responses too little, e.g. by responding with an orange color on every trial.

Example participant-level method: get_plot

pgroup$participants[[1]]$get_plot(symbol_filter=LETTERS) would produce a bar plot of per-grapheme consistency scores for a single participant, using only data from trials involving capital letters.

Detailed usage information

More details on required data format and how to use the above functions and more can be found in the package's vignettes, some of which are also included in the package itself (run help(synr) to find them). Additional information is available in the following article:

Wilsson, L., van Leeuwen, T.M. & Neufeld, J. synr: An R package for handling synesthesia consistency test data. Behav Res 55, 4086–4098 (2023). https://doi.org/10.3758/s13428-022-02007-y

Feedback

If you have any suggestions for improvements, you are very welcome to raise issues or contribute code improvements directly in the GitHub repository at https://github.com/datalowe/synr.

synr's Issues

Rename `synr_example_full` and convert it to long format

The 'small' example data frames synr_exampledf_long_small and synr_exampledf_wide_small provide example data in both long and wide formats. The 'full' example data frame should be in long format to emphasize that this is the 'default' data format for synr (esp. since 'wide' format might come to be deprecated and then not supported in later versions), and the resulting data frame should be named synr_exampledf_full.

Add arguments for specifying background/foreground colors in plotting-related functions

Currently, the Participant get_plot method always uses a white background and a black foreground color (graph bars, axis texts, etc.). This means that graphemes in very light colors are hard to discern. Users should be allowed to choose which colors to use, so that they can tweak the presentation as they wish.

Functions/methods relying on get_plot, such as Participant$save_plot, also need to be updated after enhancement has been applied.

Add references about excluding invalid participant data

In the 'Validating participant color response data' vignette, add references in the 'Introduction' section to articles which discuss or mention excluding invalid participant response data.

Remove 'proportion color'-related functions/methods

Methods like 'Participant$get_prop_color' should be removed from synr as they are bound to cause confusion, especially with the 'color labels' that they've provided and because they only use RGB color space. DBSCAN-based validation is a better tool for what they were designed for.

Apart from the code itself, the associated tests and examples (e.g. in the 'main tutorial' vignette) also need to be removed.

Documentation: Add vignette/update existing vignettes to include guidance on participant data validation

ParticipantGroup$check_valid_get_twcv_scores should be discussed in one or more vignettes. One suggestion is to add a separate vignette about it, and to apply it in the vignette that uses data from Dingemanse and colleagues.

Only count noise cluster if it includes at least `dbscan_min_pts` points

The safe_num_clusters parameter of the check_valid_get_twcv function specifies the number of clusters that is enough to classify a participant's data as valid, so long as the data points are non-tight-knit. Currently, the DBSCAN 'noise' cluster always counts towards this tally, regardless of how many points the noise cluster contains.

This might mean that a participant has 20 green points (cluster G), 20 blue points (cluster B), 20 red points (cluster R), and just one black point (noise cluster). This would be considered by check_valid_get_twcv to reach a safe_num_clusters = 4 criterion, regardless of what the dbscan_min_pts parameter is set to.

It would make more intuitive sense, and probably be more useful, if the noise cluster counted toward the 'safe_num_clusters' tally only if it consists of at least min_pts points, just as other potential clusters don't count if they have fewer than min_pts points.
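The proposed counting rule can be sketched in base R as follows. The helper name count_clusters is hypothetical, and the sketch assumes cluster labels in the style of dbscan::dbscan()$cluster, where 0 marks noise points; it is not the code used inside check_valid_get_twcv.

```r
# Hypothetical helper: count clusters toward the 'safe_num_clusters' tally.
# cluster_labels: integer labels, with 0 marking DBSCAN noise points (assumption).
count_clusters <- function(cluster_labels, min_pts) {
  # Regular clusters are all distinct non-zero labels
  n_regular <- length(unique(cluster_labels[cluster_labels != 0]))
  # The noise 'cluster' only counts if it holds at least min_pts points
  n_noise_points <- sum(cluster_labels == 0)
  n_regular + as.integer(n_noise_points >= min_pts)
}

# The scenario described above: 20 green, 20 blue, 20 red and 1 black (noise) point
labels <- c(rep(1, 20), rep(2, 20), rep(3, 20), 0)
count_clusters(labels, min_pts = 4) # 3: the single noise point no longer counts
```

With this rule, the example participant would fall short of a safe_num_clusters = 4 criterion unless the noise cluster held at least min_pts points.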

The documentation for Participant and ParticipantGroup methods using check_valid_get_twcv would probably need to also be updated after the improvement is applied.

Suggestion: Add identified number of clusters to output data of `get_valid_twcv`

Currently, get_valid_twcv returns a list with components valid, reason_invalid and twcv. In order to better mirror the input parameters, it would make sense to also include a component num_identified_clusters, which says how many clusters were counted toward the safe_num_clusters-related tally. Related Participant and ParticipantGroup methods would also need to be updated to pass this information on to the user.
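As a sketch of the suggestion, the returned list could take the shape below. The component names are taken from the description above; every value is made up for illustration.

```r
# Hypothetical return value after the suggested change (all values are invented)
validation_result <- list(
  valid = FALSE,
  reason_invalid = "few_clusters_low_twcv",
  twcv = 120.5,
  num_identified_clusters = 3 # proposed new component
)

# Callers could then inspect the tally alongside the classification
validation_result$num_identified_clusters
```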

Participant summary plot: Add indicators of whether responses are missing or not

The plots produced by Participant$get_plot() use a white background by default. If a participant has responded with white colors, these are not immediately visible unless one changes the background color. To prevent this issue, some kind of indicator should be drawn next to or around color responses, e.g. an enclosing box or a dot next to a grapheme/word.

Fail gracefully if internet resource not available

The CRAN team reported the following issue with synr: "Packages which use Internet resources should fail gracefully with an informative message if the resource is not available (and not give a check warning nor error).".

AFAIU the issue stems from the "Dingemanse sample data vignette", where it calls read.csv to read in data from GitHub.
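One possible way to address this is to wrap the download in tryCatch, so that a failure produces an informative message and a NULL result that the vignette can check for. This is a minimal sketch using a hypothetical helper name (read_remote_csv), not the fix actually applied in synr.

```r
# Hypothetical helper: read a CSV from a URL (or path) without raising
# an error or warning if the resource is unavailable.
read_remote_csv <- function(url, ...) {
  on_failure <- function(cond) {
    message("Could not read data from '", url, "': ", conditionMessage(cond))
    NULL # signal failure to the caller without stopping execution
  }
  # Both warnings and errors are treated as failures here, since an
  # unreachable resource typically triggers a warning before the error
  tryCatch(
    utils::read.csv(url, ...),
    warning = on_failure,
    error = on_failure
  )
}

# Usage: skip the remaining steps if the data could not be fetched
example_data <- read_remote_csv(file.path(tempdir(), "no_such_file.csv"))
if (is.null(example_data)) {
  message("Data unavailable, skipping the rest of the example.")
}
```

A design trade-off in this sketch is that any warning from read.csv, including a harmless one, also leads to a NULL result.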

Participant plot 'implodes' if all of participant's graphemes are perfectly consistent

Steps to reproduce:

library(synr)

p <- Participant$new()
g <- Grapheme$new(symbol='A')
g$set_colors(c("#000000", "#000000", "#000000"), "Luv")
p$add_grapheme(g)
for (lett in LETTERS[2:length(LETTERS)]) {
  g <- Grapheme$new(symbol=lett)
  g$set_colors(c("#FF0000", "#FF0000", "#FF0000"), "Luv")
  p$add_grapheme(g)
}
p$get_plot()

Move `na.rm` to ends of parameter lists

Many methods in synr include an na.rm argument, but this currently goes at the beginning of parameter lists. For example, ParticipantGroup includes this method:

    get_mean_consistency_scores = function(
      na.rm = FALSE,
      symbol_filter = NULL,
      method="euclidean"
    )

To stay in line with R conventions, na.rm should always come at the end of parameter lists.
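The convention can be illustrated with a small stand-alone function. mean_score is a hypothetical helper, not a synr method; it only shows na.rm placed last, after the arguments that callers are more likely to pass.

```r
# Hypothetical helper illustrating the parameter order: na.rm comes last
mean_score <- function(scores, symbol_filter = NULL, na.rm = FALSE) {
  # Optionally restrict the named score vector to the requested symbols
  if (!is.null(symbol_filter)) {
    scores <- scores[names(scores) %in% symbol_filter]
  }
  mean(scores, na.rm = na.rm)
}

scores <- c(A = 100, B = NA, C = 200)
mean_score(scores)                             # NA, since na.rm defaults to FALSE
mean_score(scores, na.rm = TRUE)               # 150
mean_score(scores, symbol_filter = c("A", "C")) # 150
```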

Add option to validation functions to only include data from complete graphemes

When calculating consistency scores, oftentimes only data from graphemes with a full set of valid color responses are considered. Thus, get_valid_twcv and related functions/methods should offer the option of only considering data from such graphemes when classifying the data set as valid or not. That is, only color data points from 'complete' graphemes would be included when applying DBSCAN clustering and related calculations.
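The filtering step described above could look roughly like this on long-format data. The data frame, its column names and the NA coding of missing responses are assumptions made for the example; inside synr the filtering would operate on Grapheme objects instead.

```r
# Hypothetical long-format responses for one participant, 3 trials per grapheme
responses <- data.frame(
  symbol = rep(c("A", "B"), each = 3),
  color = c("#FF0000", "#FF0000", "#EE0000", "#0000FF", NA, "#0000EE"),
  stringsAsFactors = FALSE
)
n_trials_per_grapheme <- 3

# Count non-missing color responses per grapheme
n_valid <- tapply(!is.na(responses$color), responses$symbol, sum)

# Keep only graphemes that have a full set of valid responses
complete_symbols <- names(n_valid)[n_valid == n_trials_per_grapheme]
complete_only <- responses[responses$symbol %in% complete_symbols, ]

complete_only # only the three rows for grapheme "A" remain
```

Only the points in complete_only would then be passed on to DBSCAN clustering and the related calculations.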

How should very light color responses, varying fairly much in hue, be handled?

Steps to illustrate issue:

library(synr)
library(tidyr) # provides pivot_longer() and re-exports %>%

githuburl <- 'https://raw.githubusercontent.com/mdingemanse/colouredvowels/master/BRM_colouredvowels_voweldata.csv'
dingemanse_voweldata <- read.csv(githuburl, sep=' ')

cvow_long <- dingemanse_voweldata %>% 
  pivot_longer(
    cols=c('color1', 'color2', 'color3',
      'timing1', 'timing2', 'timing3'),
    names_to=c(".value", "trial"),
    names_pattern="(\\w*)(\\d)",
    values_to=c('color', 'timing')
  )

pg <- create_participantgroup(
  raw_df=cvow_long, # CHANGE THIS
  n_trials_per_grapheme=3,
  id_col_name="anonid",
  symbol_col_name="item",
  color_col_name="color",
  time_col_name="timing",
  color_space_spec="Luv"
)

validity_df <- pg$check_valid_get_twcv_scores(
  min_complete_graphemes = 7,
  dbscan_eps = 30,
  dbscan_min_pts = 4,
  max_var_tight_cluster = 100,
  max_prop_single_tight_cluster = 0.6,
  safe_num_clusters = 4,
  safe_twcv = 250,
  symbol_filter = NULL
)
validity_df$id <- pg$get_ids()

print(validity_df[validity_df$id=='d47c0e32-e3e2-4acf-84d0-08bf7375308b', ])

The above shows that the particular participant is classified as having invalid data, reason 'few_clusters_low_twcv'.

The corresponding plot for the participant can be produced as follows:

pg$participants$`d47c0e32-e3e2-4acf-84d0-08bf7375308b`$get_plot(grapheme_size = 4)

The participant did use very light colors throughout the whole experiment. On the other hand, the colors varied a fair bit in hue. Then again, if naively applying a consistency score calculation with a threshold of 135.3 (as suggested by Rothen, Seth, Witzel & Ward, 2013), this participant would be considered a synesthete, even though their responses do not appear very consistent to a human observer.

Participant$save_plot() produces error for all-NA response participants

Participant$save_plot() produces the error message "Error: Discrete value supplied to continuous scale" in some cases, seemingly when a participant only has graphemes with all-NA response color matrices.

(This needs to be unit-tested/isolated first, however. The error may have been produced due to how Participant instances were created by create_participantgroup(), as the error occurred with a participant created this way.)