
hiddenmeta's Introduction

hey 👋 I'm Gosha


hiddenmeta's People

Contributors

gerasy1987, macartan, till-tietz


Forkers

yihuiviv

hiddenmeta's Issues

Make table of required parameters

The table should have:

  • Columns for studies
  • Rows for variables
  • Plus a column with a brief definition of the variable

X's indicate that we expect that variable from that study; see the example below.
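
For example (study names and X entries here are hypothetical):

   variable  definition                          study_1  study_2  study_3
   hidden    Hidden group membership indicator   X        X        X
   rds       Sampled through RDS                 X                 X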

Implement estimation of proportion from RDS

Some of the studies are actually going to sample a non-hidden population through RDS (e.g. sex workers aged 18-21 in Recife, Brazil) and then estimate the share of the hidden population within the RDS sample (e.g. sex workers who were underage when they entered sex work). @macartan and I previously believed this wouldn't be the case.

It seems that the current implementation in the package can already handle this (but it is complicated):

  • Define two "hidden" populations (e.g. 1. sex workers and 2. sex workers who were underage at entry)
  • Draw an RDS sample of population 1
  • Use the RDS package's Sequential Sampling estimator (?RDS::RDS.SS.estimates) on the indicator of membership in population 2 (see the sketch after this list)
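
A hedged sketch of that workflow using the RDS package (the data frame df and its column names are hypothetical):

library(RDS)

# RDS sample of population 1 (e.g. sex workers aged 18-21), with a logical
# column flagging membership in population 2 (underage at entry)
rds_df <- as.rds.data.frame(
  df,
  id           = "id",
  recruiter.id = "recruiter_id",
  network.size = "degree"
)

# Sequential Sampling estimator of the share of population 2 within the
# RDS sample; N is the assumed size of population 1
RDS.SS.estimates(
  rds.data         = rds_df,
  outcome.variable = "hidden_2",
  N                = 1000
)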

What we need:

  • Check that this is correct
  • Implement RDS estimators for categorical variables in the study-level estimators
  • Add an example of this to the vignette

estimands

Add prevalence: the share of the total population that is in the hidden population.
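
With the data layout from get_required_data, this reduces to (a minimal sketch, assuming a full population frame pop with the logical hidden column):

# share of the total population that is in the hidden population
prevalence <- mean(pop$hidden)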

Simulations for meta-analysis

The idea is to

  1. Implement a simple function that creates a meta-analysis-level dataset with given biases and run a diagnosis on it
  2. Implement the most straightforward study designs (high visibility and low homophily) and diagnose them systematically to see which features of the population/design drive the bias

Inverse wave weighting for sampling in sample_rds()

I realized that we already have weights implemented when drawing from the pool of eligible nodes in RDS. The weighting is currently done according to the inverse of the wave (i.e. those sampled closer to the initial seed in the network are more likely to be drawn):

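# eligible nodes from earlier waves get larger weights (wave 1: weight 1, wave 2: 1/2, ...)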
new <- dplyr::slice_sample(eligible, n = 1, replace = FALSE, weight_by = 1/wave)

Do we want to change this?

get_required_data: add data type / example columns

add two columns here:

   variable           label                                                  Type       Example       
 1 name               Respondent ID                                          Integer    8452, 1431, 1221
 2 hidden             Hidden group member                                    Logical    TRUE, FALSE, TRUE
 3 hidden_visible_out Knows how many members of hidden group                 Integer    6, 9, 12 
 4 hidden_visible_in  Known to be member of hidden group to how many people  Integer    0, NA, 5
 5 rds                RDS: Sampled                                           Logical    FALSE, FALSE, TRUE 

purrr or other required?

hpop_estimands(df)
Error: Problem with mutate() input degree.
x could not find function "map_int"
i Input degree is map_int(links, ~length(.x)).
Run rlang::last_error() to see where the error occurred.
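
A likely fix is to namespace the call so purrr only needs to be in Imports, or to drop the dependency entirely with a base equivalent (df and links as in the error above):

# namespaced call: works without attaching purrr
df <- dplyr::mutate(df, degree = purrr::map_int(links, ~ length(.x)))

# base-R equivalent with no purrr dependency
df <- dplyr::mutate(df, degree = vapply(links, length, integer(1)))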

Stan

My RStan stopped working while debugging, so seeking help:

  • we currently have a draft of get_meta_estimators that can be run on data generated by draw_data(meta_population + meta_inquiry + meta_sample), as shown at the end of the vignette
  • it seems close to working, but we need to figure out how to get the Stan program generated by get_meta_stan inside get_meta_estimators to compile
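
One way to isolate the compilation problem (a minimal check, where stan_code stands in for the program text produced by get_meta_stan):

# compile the generated program on its own, outside get_meta_estimators
mod <- rstan::stan_model(model_code = stan_code)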

Tests need to be updated

Right now this is not causing too much trouble, but test coverage is pretty bad, and a lot of new functionality has been implemented since I last updated the tests.

add flexibility to RDS sampling

Several studies follow a strategy where RDS referral chains are followed until they die out, and more seeds are enrolled if the sample size is still too small. sample_rds() needs to support this re-seeding, which is especially essential for sparse networks. A sketch of the logic follows.
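
A hedged sketch of the re-seeding loop (draw_new_seeds() and update_eligible() are hypothetical helpers, not package functions):

while (nrow(rds_sample) < target_n) {
  if (nrow(eligible) == 0) {
    # referral chains died out before reaching target_n: enroll fresh seeds
    eligible <- draw_new_seeds(population, n_seeds)
  }
  new <- dplyr::slice_sample(eligible, n = 1, replace = FALSE, weight_by = 1 / wave)
  rds_sample <- rbind(rds_sample, new)
  eligible <- update_eligible(eligible, new)
}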

Add parameters to get_study_population()

Need to add:

  • Parameters related to TLS (probability of showing up at a particular location at a particular time, set of locations)
  • Other group memberships whose prevalence we might want to estimate and/or that will define the probability of being selected as a seed for RDS

link-tracing sampling

  • we need to collect all nominations from the final RDS wave (imagine an unlimited number of coupons for the final wave)

check the weights in sample_tls()

My sense is that the weights constructed by the sample_tls() function are incorrect and need to be checked, since estimates from TLS samples are off even in fairly straightforward cases.

Need to add visibility to sample_pps()

Currently the function draws according to the proportions of all groups in the population. This might be unrealistic: hidden population members, say, might be less likely to be sampled through PPS, which introduces bias.
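
A hedged sketch of one possible adjustment (the visibility factor and column names are illustrative):

# hidden members get down-weighted by a visibility factor < 1
visibility  <- 0.5
pps_weights <- ifelse(pop$hidden, visibility, 1)
drawn       <- dplyr::slice_sample(pop, n = n_sample, weight_by = pps_weights)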

Allow more than two group memberships

We need to be able to generate more than two (possibly pairwise-correlated) vectors of group memberships in get_study_population(). Currently the functions only support two (one known and one hidden).
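
A hedged sketch of one way to do this, thresholding a multivariate normal (prevalences and correlation are illustrative; assumes MASS):

library(MASS)

n    <- 1000
prev <- c(known = 0.30, hidden = 0.10, extra = 0.05)  # target prevalences
K    <- length(prev)

# exchangeable correlation of 0.3 between the latent traits
Sigma <- matrix(0.3, K, K)
diag(Sigma) <- 1

# threshold each latent column so that column k has prevalence prev[k]
z <- mvrnorm(n, mu = rep(0, K), Sigma = Sigma)
memberships <- sweep(z, 2, qnorm(1 - prev), `>`)  # n x K logical matrix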

link_tracing parallelism

Move parallel sampling from R to C++ via OpenMP pragmas. Check the efficiency of pass-by-reference calls in the helpers.

link_tracing optimization

current median run time for a 1000-sample chain is ~5 sec

potential optimization in the lambda-generating code block:

  • sampled units don't change during chain iterations!!!

since we re-index sampled units from 0 to n_i - 1:

  • we don't need a set difference to get the non-sampled units (they are just end iterator of sampled + 1 : n)
  • we only need to get strata for sampled units once (not in every iteration)
  • we can simply push_back the strata of non-sampled units onto the stratum vector (we don't need random-access insertion with int_vec_insert)
  • we don't need to reconstruct the link matrix from scratch in each iteration >> we fill links between known units once and add links between unknown units in each iteration
  • we don't need to generate unknown pairs with a set difference (simply apply combn_cpp to the already generated vector of non-sampled units)

general optimization:

  • we largely use dynamic memory allocation via push_back (which induces bounds checks) >> when vector sizes are known we can allocate memory up front and access via x[i] to decrease overhead

check why meta estimation produces errors occasionally

The main suspect here is the Stan fit code. In general errors do not arise often, but when running the meta-example vignette they occur in roughly 5% of simulations. My sense is that the Stan fit fails and there is then no object to output.
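
If so, a guard inside get_meta_estimators would let the diagnosis treat failed runs as missing rather than erroring out (a hedged sketch):

fit <- tryCatch(
  rstan::stan(model_code = stan_model_meta, data = stan_data, iter = 4000),
  error = function(e) NULL
)
if (is.null(fit)) return(NULL)  # caller records this run as missing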

Rewrite study population simulation function to allow more network structure flexibility

Instead of relying on igraph::sample_pref() to simulate the network structure, we can accept a network simulation handler as an argument to the study population, which would use either custom or pre-existing network simulation functions. This would make it simple to incorporate more complex models like ERGMs on top of block models.
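
For example, a handler could be any function that returns an igraph object, with the current block-model behavior as the default (a sketch; the handler name and arguments are hypothetical):

network_handler <- function(block_sizes, pref_matrix) {
  igraph::sample_pref(
    nodes       = sum(block_sizes),
    types       = length(block_sizes),
    type.dist   = block_sizes / sum(block_sizes),
    pref.matrix = pref_matrix,
    directed    = FALSE
  )
}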

For details on both types of models, see Mark Hoffman's course on social networks and Kolaczyk, Eric D. 2009. Statistical Analysis of Network Data. New York, NY: Springer.

create package website with pkgdown

Proposing to move the "book" currently located here to a pkgdown website for the package.

  • Create the pkgdown website and deploy it on GitHub Pages
  • Move the contents of the "book" into a theory article on the website
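
A minimal setup sketch (assumes usethis and pkgdown are installed):

usethis::use_pkgdown()                 # adds _pkgdown.yml and build ignore rules
pkgdown::build_site()                  # renders the site locally
usethis::use_pkgdown_github_pages()    # configures deployment to GitHub Pages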

networkreporting dependency issue

networkreporting is a dependency but has been removed from CRAN. We can either include the archived binary builds (somewhat annoying and error-prone) or link to the GitHub version (not sure how that gels with CRAN policies if we intend to put hiddenmeta on CRAN).
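
For the GitHub option, the DESCRIPTION would look roughly like this (the repo path is assumed to be dfeehan/networkreporting); note that CRAN does not accept Remotes fields, so this would indeed block a CRAN release:

Imports: networkreporting
Remotes: dfeehan/networkreporting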

potential memory leak in link tracing gibbs sampler

  • the C++ implementation of lt_permute crashes the R session after ~5000 iterations
  • likely due to memory issues (switching from Numeric to Integer matrices pushed valid iterations from ~1000 to ~5000)
  • switching the loop end condition from i < var to i < var - 1 seems to fix the issue (likely due to the smaller size of the return object, since assignment to a null pointer seems unlikely)
  • repeated calls to Rcpp::sample may create overhead; try switching to Armadillo's sample

Meta-analysis proposal

Draft structure

  1. Get all estimates produced for each of the studies (depending on the design) into a single data frame that includes the estimate and standard error for each estimator (or potentially for each sampling-estimator pair)
  2. Then we can feed this data frame into a Stan model akin to the one below
    a. Use the data frame to produce the vectors of study indices observed for each estimator/sampling-estimator pair, observed*
    b. Use the data frame produced by the study designs to produce arrays of observed estimates and corresponding standard errors (with dummy values where missing), est* and est*_sd
    c. Assume that the deviation from the true parameter of interest (the error) is driven by the estimator used
  3. We then declare the relevant estimands and run the diagnosis

stan_model_meta <- " 
  data {
    int<lower=0> N;   // number of studies 
    int<lower=0> K;   // max number of estimators (estimator-sampling pairs) 
    // number of studies with estimator (estimator-sampling pair)
    int<lower=0,upper=N> N1;
    int<lower=0,upper=N> N2;
    int<lower=0,upper=N> N3;
    // ids of studies with specific estimator (estimator-sampling pair)
    int<lower=0,upper=N> observed1[N1];   
    int<lower=0,upper=N> observed2[N2];
    int<lower=0,upper=N> observed3[N3];
    // parameter estimates
    real<lower=0,upper=1> est1[N1];   
    real<lower=0,upper=1> est2[N2];
    real<lower=0,upper=1> est3[N3];
    // estimated standard errors of parameter estimates
    real<lower=0> est1_sd[N1]; 
    real<lower=0> est2_sd[N2];
    real<lower=0> est3_sd[N3];
  }
  parameters {
    // (additive) error factor for each estimator/estimator-sampling pair
    real<lower=-1,upper=1> error[K]; 
    // prevalence estimate for each study
    vector<lower=0,upper=1>[N] alpha;
    // need to add Sigma to allow for interdependence of errors across estimators
    // or studies
  }

  model {
    target += normal_lpdf(est1 | error[1] + alpha[observed1], est1_sd);
    target += normal_lpdf(est2 | error[2] + alpha[observed2], est2_sd);
    target += normal_lpdf(est3 | error[3] + alpha[observed3], est3_sd);
  }
  "

get_meta_estimands <- function(data) {
  
  data.frame(estimand_label = paste0("prevalence_", seq_len(nrow(data))),
             estimand = data[, 1],
             stringsAsFactors = FALSE)
}

get_meta_estimators <- function(data) {
  
  stan_data <- list(N = nrow(data),
                    K = (ncol(data) - 1) / 2)
  
  for (k in 1:stan_data$K) {
    stan_data[[ paste0("observed", k) ]] <- 
      which(!is.na(data[, 2 * k]))
    stan_data[[ paste0("N", k) ]] <- 
      length(stan_data[[ paste0("observed", k) ]])
    stan_data[[ paste0("est", k) ]] <- 
      data[stan_data[[ paste0("observed", k) ]], 2 * k]
    stan_data[[ paste0("est", k, "_sd") ]] <- 
      data[stan_data[[ paste0("observed", k) ]], 1 + 2 * k]
  }
  
  # compile from the model code string; `fit =` expects an existing stanfit
  fit <- rstan::stan(model_code = stan_model_meta, 
                     data = stan_data, 
                     iter = 4000)
  
  draws <- rstan::extract(fit)
  
  data.frame(estimator_label = paste0("prev_", 1:stan_data$N),
             estimate = apply(draws$alpha, 2, mean),
             sd = apply(draws$alpha, 2, sd),
             # must match the labels produced by get_meta_estimands
             estimand_label = paste0("prevalence_", 1:stan_data$N),
             # largest Rhat across parameters as a rough convergence flag
             big_Rhat = max(summary(fit)$summary[, "Rhat"]))
}

Meta declaration


pop_args <- 
  list(study_1 = study_1$pop,
       study_2 = study_2$pop,
       study_3 = study_3$pop,
       study_4 = study_4$pop)

sample_args <- 
  list(study_1 = study_1$sample,
       study_2 = study_2$sample,
       study_3 = study_3$sample,
       study_4 = study_4$sample)

estimators_args <- 
  list(study_1 = study_1$estimators,
       study_2 = study_2$estimators,
       study_3 = study_3$estimators,
       study_4 = study_4$estimators)

estimands_args <- 
  list(study_1 = get_study_estimands,
       study_2 = get_study_estimands,
       study_3 = get_study_estimands,
       study_4 = get_study_estimands)


study_populations <- 
  declare_population(handler = get_study_populations, handler_args = pop_args)

study_samples <- 
  declare_sampling(handler = get_study_samples, handler_args = sample_args) 

study_estimands <- 
  declare_estimand(handler = get_study_estimands, handler_args = estimands_args) 

study_estimators <- 
  declare_estimator(handler = get_study_estimators, handler_args = estimators_args) 

meta_switch <- 
  declare_step(handler = prep_study_estimators, handler_args = estimators_args)

meta_estimands <- 
  declare_estimand(handler = get_meta_estimands)

meta_estimators <- 
  declare_estimator(handler = get_meta_estimators)

meta_design <- 
  study_populations +
  study_samples +
  study_estimands +
  study_estimators +
  meta_switch +
  meta_estimands +
  meta_estimators
 

Implementation To-Do

  • Need to write helper functions that allow performing single-study declarations on multiple studies with fixed labels
    • population
    • sampling strategies
    • estimators
    • estimands
  • Need a transformation function for the meta_switch step
  • Need to double-check and enhance the Stan code for meta-analyses

link_tracing lambda parameter

Return lambda estimates and add an input option for the stratum of interest (i.e. which stratum is the hidden population) >> return the estimated proportion of the stratum of interest.
This may require a method-specific sampling method from the stochastic block model (currently we use the RDS sampling method).

Shiny app for input of parameters

It would be useful to create a Shiny app that collects and merges the parameters of the different studies, and later perhaps also allows analysis of those parameters.
