
hiddenmeta's Introduction

hey 👋 I'm Gosha


hiddenmeta's People

Contributors

gerasy1987, macartan, till-tietz


Forkers

yihuiviv

hiddenmeta's Issues

Make table of required parameters

The table should have:

  • Columns for studies
  • Rows for variables
  • Plus a column with a brief definition of the variable

X's indicate that we expect that variable from that study; see the example below.
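
For example (study names and X entries here are hypothetical):

   variable  definition                          study_1  study_2  study_3
   hidden    Hidden group membership indicator   X        X        X
   rds       Sampled through RDS                 X                 X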

Implement estimation of proportion from RDS

Some of the studies are actually going to sample a non-hidden population through RDS (e.g. sex workers aged 18-21 in Recife, Brazil) and then estimate the share of the hidden population within the RDS sample (e.g. sex workers who were underage when they entered sex work). @macartan and I previously believed this wouldn't be the case.

It seems that the current implementation in the package can already handle this (but it is complicated):

  • Define two "hidden" populations (e.g. 1. sex workers and 2. sex workers who were underage at entry)
  • Draw an RDS sample of population 1
  • Use the RDS package's Sequential Sampling estimator (?RDS::RDS.SS.estimates) on the indicator of membership in population 2 (see the sketch after this list)
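
A hedged sketch of that workflow using the RDS package (the data frame df and its column names are hypothetical):

library(RDS)

# RDS sample of population 1 (e.g. sex workers aged 18-21), with a logical
# column flagging membership in population 2 (underage at entry)
rds_df <- as.rds.data.frame(
  df,
  id           = "id",
  recruiter.id = "recruiter_id",
  network.size = "degree"
)

# Sequential Sampling estimator of the share of population 2 within the
# RDS sample; N is the assumed size of population 1
RDS.SS.estimates(
  rds.data         = rds_df,
  outcome.variable = "hidden_2",
  N                = 1000
)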

What we need:

  • Check that this is correct
  • Implement RDS estimators for categorical variables in the study-level estimators
  • Add an example of this to the vignette

estimands

Add prevalence: the share of the total population that is in the hidden population.
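
With the data layout from get_required_data, this reduces to (a minimal sketch, assuming a full population frame pop with the logical hidden column):

# share of the total population that is in the hidden population
prevalence <- mean(pop$hidden)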

Simulations for meta-analysis

The idea is to

  1. Implement a simple function that creates a meta-analysis-level dataset with given biases and run a diagnosis on it
  2. Implement the most straightforward study designs (high visibility and low homophily) and diagnose them systematically to see which features of the population/design drive the bias

Inverse wave weighting for sampling in sample_rds()

I realized that we already have weights implemented when drawing from the pool of eligible nodes in RDS. The weighting is currently done according to the inverse of the wave (i.e. those sampled closer to the initial seed in the network are more likely to be drawn):

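# eligible nodes from earlier waves get larger weights (wave 1: weight 1, wave 2: 1/2, ...)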
new <- dplyr::slice_sample(eligible, n = 1, replace = FALSE, weight_by = 1/wave)

Do we want to change this?

get_required_data: add data type / example columns

add two columns here:

   variable           label                                                  Type       Example       
 1 name               Respondent ID                                          Integer    8452, 1431, 1221
 2 hidden             Hidden group member                                    Logical    TRUE, FALSE, TRUE
 3 hidden_visible_out Knows how many members of hidden group                 Integer    6, 9, 12 
 4 hidden_visible_in  Known to be member of hidden group to how many people  Integer    0, NA, 5
 5 rds                RDS: Sampled                                           Logical    FALSE, FALSE, TRUE 

purrr or other required?

hpop_estimands(df)
Error: Problem with mutate() input degree.
x could not find function "map_int"
i Input degree is map_int(links, ~length(.x)).
Run rlang::last_error() to see where the error occurred.
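
A likely fix is to namespace the call so purrr only needs to be in Imports, or to drop the dependency entirely with a base equivalent (df and links as in the error above):

# namespaced call: works without attaching purrr
df <- dplyr::mutate(df, degree = purrr::map_int(links, ~ length(.x)))

# base-R equivalent with no purrr dependency
df <- dplyr::mutate(df, degree = vapply(links, length, integer(1)))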

Stan

My RStan stopped working while debugging, so seeking help:

  • we currently have a draft of get_meta_estimators that can be run on data generated by draw_data(meta_population + meta_inquiry + meta_sample), as shown at the end of the vignette
  • it seems close to working, but we need to figure out how to get the Stan program generated by get_meta_stan inside get_meta_estimators to compile
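
One way to isolate the compilation problem (a minimal check, where stan_code stands in for the program text produced by get_meta_stan):

# compile the generated program on its own, outside get_meta_estimators
mod <- rstan::stan_model(model_code = stan_code)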

Tests need to be updated

Right now this is not causing too much trouble, but test coverage is pretty bad, and a lot of new functionality has been implemented since I last updated the tests.

add flexibility to RDS sampling

Several studies follow a strategy where RDS referral chains are followed until they die out, and more seeds are enrolled if the sample size is still too small. sample_rds() needs to support this re-seeding, which is especially essential for sparse networks. A sketch of the logic follows.
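
A hedged sketch of the re-seeding loop (draw_new_seeds() and update_eligible() are hypothetical helpers, not package functions):

while (nrow(rds_sample) < target_n) {
  if (nrow(eligible) == 0) {
    # referral chains died out before reaching target_n: enroll fresh seeds
    eligible <- draw_new_seeds(population, n_seeds)
  }
  new <- dplyr::slice_sample(eligible, n = 1, replace = FALSE, weight_by = 1 / wave)
  rds_sample <- rbind(rds_sample, new)
  eligible <- update_eligible(eligible, new)
}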

Add parameters to get_study_population()

Need to add:

  • Parameters related to TLS (probability of showing up at a particular location at a particular time, set of locations)
  • Other group memberships whose prevalence we might want to estimate and/or that will define the probability of being selected as a seed for RDS

link-tracing sampling

  • we need to collect all nominations from the final RDS wave (imagine an unlimited number of coupons for the final wave)

check the weights in sample_tls()

My sense is that the weights constructed by the sample_tls() function are incorrect and need to be checked, since estimates from TLS samples are off even in fairly straightforward cases.

Need to add visibility to sample_pps()

Currently the function draws according to the proportions of all groups in the population. This might be unrealistic: hidden population members, say, might be less likely to be sampled through PPS, which introduces bias.
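
A hedged sketch of one possible adjustment (the visibility factor and column names are illustrative):

# hidden members get down-weighted by a visibility factor < 1
visibility  <- 0.5
pps_weights <- ifelse(pop$hidden, visibility, 1)
drawn       <- dplyr::slice_sample(pop, n = n_sample, weight_by = pps_weights)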

Allow more than two group memberships

We need to be able to generate more than two (possibly pairwise-correlated) vectors of group memberships in get_study_population(). Currently the functions only support two (one known and one hidden).
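
A hedged sketch of one way to do this, thresholding a multivariate normal (prevalences and correlation are illustrative; assumes MASS):

library(MASS)

n    <- 1000
prev <- c(known = 0.30, hidden = 0.10, extra = 0.05)  # target prevalences
K    <- length(prev)

# exchangeable correlation of 0.3 between the latent traits
Sigma <- matrix(0.3, K, K)
diag(Sigma) <- 1

# threshold each latent column so that column k has prevalence prev[k]
z <- mvrnorm(n, mu = rep(0, K), Sigma = Sigma)
memberships <- sweep(z, 2, qnorm(1 - prev), `>`)  # n x K logical matrix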

link_tracing parallelism

Move parallel sampling from R to C++ via OpenMP pragmas. Check the efficiency of pass-by-reference calls in the helpers.

link_tracing optimization

current median run time for a 1000-sample chain is ~5 sec

potential optimization in the lambda-generating code block:

  • sampled units don't change during chain iterations!!!

since we re-index sampled units from 0 to n_i - 1:

  • we don't need a set difference to get the non-sampled units (they are just end iterator of sampled + 1 : n)
  • we only need to get strata for sampled units once (not in every iteration)
  • we can simply push_back the strata of non-sampled units onto the stratum vector (we don't need random-access insertion with int_vec_insert)
  • we don't need to reconstruct the link matrix from scratch in each iteration >> we fill links between known units once and add links between unknown units in each iteration
  • we don't need to generate unknown pairs with a set difference (simply apply combn_cpp to the already generated vector of non-sampled units)

general optimization:

  • we largely use dynamic memory allocation via push_back (which induces bounds checks) >> when vector sizes are known we can allocate memory up front and access via x[i] to decrease overhead

check why meta estimation produces errors occasionally

The main suspect here is the Stan fit code. In general errors do not arise often, but when running the meta-example vignette they occur in roughly 5% of simulations. My sense is that the Stan fit fails and there is then no object to output.
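
If so, a guard inside get_meta_estimators would let the diagnosis treat failed runs as missing rather than erroring out (a hedged sketch):

fit <- tryCatch(
  rstan::stan(model_code = stan_model_meta, data = stan_data, iter = 4000),
  error = function(e) NULL
)
if (is.null(fit)) return(NULL)  # caller records this run as missing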

Rewrite study population simulation function to allow more network structure flexibility

Instead of relying on igraph::sample_pref() to simulate the network structure, we can accept a network simulation handler as an argument to the study population, which would use either custom or pre-existing network simulation functions. This would make it simple to incorporate more complex models like ERGMs on top of block models.
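
For example, a handler could be any function that returns an igraph object, with the current block-model behavior as the default (a sketch; the handler name and arguments are hypothetical):

network_handler <- function(block_sizes, pref_matrix) {
  igraph::sample_pref(
    nodes       = sum(block_sizes),
    types       = length(block_sizes),
    type.dist   = block_sizes / sum(block_sizes),
    pref.matrix = pref_matrix,
    directed    = FALSE
  )
}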

For details on both types of models, see Mark Hoffman's course on social networks and Kolaczyk, Eric D. 2009. Statistical Analysis of Network Data. New York, NY: Springer.

create package website with pkgdown

Proposing to move the "book" currently located here to a pkgdown website for the package.

  • Create the pkgdown website and deploy it on GitHub Pages
  • Move the contents of the "book" into a theory article on the website
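
A minimal setup sketch (assumes usethis and pkgdown are installed):

usethis::use_pkgdown()                 # adds _pkgdown.yml and build ignore rules
pkgdown::build_site()                  # renders the site locally
usethis::use_pkgdown_github_pages()    # configures deployment to GitHub Pages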

networkreporting dependency issue

networkreporting is a dependency but has been removed from CRAN. We can either include the archived binary builds (somewhat annoying and error-prone) or link to the GitHub version (not sure how that gels with CRAN policies if we intend to put hiddenmeta on CRAN).
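
For the GitHub option, the DESCRIPTION would look roughly like this (the repo path is assumed to be dfeehan/networkreporting); note that CRAN does not accept Remotes fields, so this would indeed block a CRAN release:

Imports: networkreporting
Remotes: dfeehan/networkreporting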

potential memory leak in link tracing gibbs sampler

  • the C++ implementation of lt_permute crashes the R session after ~5000 iterations
  • likely due to memory issues (switching from Numeric to Integer matrices pushed valid iterations from ~1000 to ~5000)
  • switching the loop end condition from i < var to i < var - 1 seems to fix the issue (likely due to the smaller size of the return object, since assignment to a null pointer seems unlikely)
  • repeated calls to Rcpp::sample may create overhead; try switching to Armadillo's sample

Meta-analysis proposal

Draft structure

  1. Get all estimates produced for each of the studies (depending on the design) into a single data frame that includes the estimate and standard error for each estimator (or potentially for each sampling-estimator pair)
  2. Then we can feed this data frame into a Stan model akin to the one below
    a. Use the data frame to produce the vectors of study indices observed for each estimator/sampling-estimator pair, observed*
    b. Use the data frame produced by the study designs to produce arrays of observed estimates and corresponding standard errors (with dummy values where missing), est* and est*_sd
    c. Assume that the deviation from the true parameter of interest (the error) is driven by the estimator used
  3. We then declare the relevant estimands and run the diagnosis

stan_model_meta <- " 
  data {
    int<lower=0> N;   // number of studies 
    int<lower=0> K;   // max number of estimators (estimator-sampling pairs) 
    // number of studies with estimator (estimator-sampling pair)
    int<lower=0,upper=N> N1;
    int<lower=0,upper=N> N2;
    int<lower=0,upper=N> N3;
    // ids of studies with specific estimator (estimator-sampling pair)
    int<lower=0,upper=N> observed1[N1];   
    int<lower=0,upper=N> observed2[N2];
    int<lower=0,upper=N> observed3[N3];
    // parameter estimates
    real<lower=0,upper=1> est1[N1];   
    real<lower=0,upper=1> est2[N2];
    real<lower=0,upper=1> est3[N3];
    // estimated standard errors of parameter estimates
    real<lower=0> est1_sd[N1]; 
    real<lower=0> est2_sd[N2];
    real<lower=0> est3_sd[N3];
  }
  parameters {
    // (additive) error factor for each estimator/estimator-sampling pair
    real<lower=-1,upper=1> error[K]; 
    // prevalence estimate for each study
    vector<lower=0,upper=1>[N] alpha;
    // need to add Sigma to allow for interdependence of errors across estimators
    // or studies
  }

  model {
    target += normal_lpdf(est1 | error[1] + alpha[observed1], est1_sd);
    target += normal_lpdf(est2 | error[2] + alpha[observed2], est2_sd);
    target += normal_lpdf(est3 | error[3] + alpha[observed3], est3_sd);
  }
  "

get_meta_estimands <- function(data) {
  
  data.frame(estimand_label = paste0("prevalence_", seq_len(nrow(data))),
             estimand = data[, 1],
             stringsAsFactors = FALSE)
}

get_meta_estimators <- function(data) {
  
  stan_data <- list(N = nrow(data),
                    K = (ncol(data) - 1) / 2)
  
  for (k in 1:stan_data$K) {
    stan_data[[ paste0("observed", k) ]] <- 
      which(!is.na(data[, 2 * k]))
    stan_data[[ paste0("N", k) ]] <- 
      length(stan_data[[ paste0("observed", k) ]])
    stan_data[[ paste0("est", k) ]] <- 
      data[stan_data[[ paste0("observed", k) ]], 2 * k]
    stan_data[[ paste0("est", k, "_sd") ]] <- 
      data[stan_data[[ paste0("observed", k) ]], 1 + 2 * k]
  }
  
  # compile from the model code string; `fit =` expects an existing stanfit
  fit <- rstan::stan(model_code = stan_model_meta, 
                     data = stan_data, 
                     iter = 4000)
  
  draws <- rstan::extract(fit)
  
  data.frame(estimator_label = paste0("prev_", 1:stan_data$N),
             estimate = apply(draws$alpha, 2, mean),
             sd = apply(draws$alpha, 2, sd),
             # must match the labels produced by get_meta_estimands
             estimand_label = paste0("prevalence_", 1:stan_data$N),
             # largest Rhat across parameters as a rough convergence flag
             big_Rhat = max(summary(fit)$summary[, "Rhat"]))
}

Meta declaration


pop_args <- 
  list(study_1 = study_1$pop,
       study_2 = study_2$pop,
       study_3 = study_3$pop,
       study_4 = study_4$pop)

sample_args <- 
  list(study_1 = study_1$sample,
       study_2 = study_2$sample,
       study_3 = study_3$sample,
       study_4 = study_4$sample)

estimators_args <- 
  list(study_1 = study_1$estimators,
       study_2 = study_2$estimators,
       study_3 = study_3$estimators,
       study_4 = study_4$estimators)

estimands_args <- 
  list(study_1 = get_study_estimands,
       study_2 = get_study_estimands,
       study_3 = get_study_estimands,
       study_4 = get_study_estimands)


study_populations <- 
  declare_population(handler = get_study_populations, handler_args = pop_args)

study_samples <- 
  declare_sampling(handler = get_study_samples, handler_args = sample_args) 

study_estimands <- 
  declare_estimand(handler = get_study_estimands, handler_args = estimands_args) 

study_estimators <- 
  declare_estimator(handler = get_study_estimators, handler_args = estimators_args) 

meta_switch <- 
  declare_step(handler = prep_study_estimators, handler_args = estimators_args)

meta_estimands <- 
  declare_estimand(handler = get_meta_estimands)

meta_estimators <- 
  declare_estimator(handler = get_meta_estimators)

meta_design <- 
  study_populations +
  study_samples +
  study_estimands +
  study_estimators +
  meta_switch +
  meta_estimands +
  meta_estimators
 

Implementation To-Do

  • Need to write helper functions that allow performing single-study declarations on multiple studies with fixed labels
    • population
    • sampling strategies
    • estimators
    • estimands
  • Need a transformation function for the meta_switch step
  • Need to double-check and enhance the Stan code for meta-analyses

link_tracing lambda parameter

Return lambda estimates and add an input option for the stratum of interest (i.e. which stratum is the hidden population) >> return the estimated proportion of the stratum of interest.
This may require a method-specific sampling method from the stochastic block model (currently we use the RDS sampling method).

Shiny app for input of parameters

It would be useful to create a Shiny app that collects and merges the parameters of the different studies, and later perhaps also allows analysis of those parameters.
