The survivalanalysis from latlio

feedback, rough draft

Great job on the project so far!!! It is really a pleasure to read. You two seem to have learned a ton, and you've really gone above and beyond with respect to additional analyses on the dataset.

I do have a few questions about some of those extended analyses and a few other clarifications. Here goes ... :)

When you give a test and p-value (e.g., the middle of page 9 of the pdf, where chi-sq is 91.05), be specific about what you are testing (that is, give the null hypothesis, the conclusion, or both). Right after that, you do another chi-sq test, but this one seems to be a nested test, whereas the one above seems to be a full-model test. Either way, be specific.
Not absolutely necessary, but do you have a good way to explain what is being measured on the "residual" axis? Something minus time? And is there any sense of what a "big" residual is? Or, for that matter, a "big" dfbeta?
Bias in parameters (e.g., sample max vs true max) is different from bias in overfitting. I'm not sure that bootstrapping can estimate the bias in overfitting. You can estimate bias in overfitting from cross validation... but bootstrapping still has all the same data, so I'm not sure what it can tell you about new data.
Can you be more specific about the definition of D_xy? How is the probability of concordance calculated? Also, how is "optimism" technically calculated?
Figure 8: I think I know what you did because of the conversation we had. But you need to explain how you have hundreds of SEs by which you can make boxplots. And also, can you contextualize why the boxplots are so much closer to the blue dot than the red dot? It makes sense to me, and I don't find it worrisome, I just think it is worth mentioning because the plot makes it seem like the red is off.
Figure 9: I love this plot! But can you add a caption? Also, my guess is that you did +/- 2*SE for the CIs? Say so, because there are many different ways to create CIs, and it isn't clear what you did.
Your comment on Fig 9: you actually have NO idea whether the beta estimates ("b") are in the center or the tail of the TRUE sampling distribution. What the plots tell you is that using normal theory seems like a reasonable thing to do. If this particular comment isn't clear, I'll draw you a picture.
Figure 10 ... First of all, I'm a little bit confused about the layout of your report. Why are we back to the model building? Maybe just because the bootstrap validated it? Also, what is being plotted? It seems like log(HR) is on the y-axis in (a) and on the x-axis in (b) and (c)? And then maybe there are some SE bars? Can you describe how those plots were created? (And what is the ribbon for cd4? How did you get the SE of the log(HR) as a function of cd4?)
Figure 11... I'm not even totally clear how one calculates the predicted survival. Don't you need to know/estimate h(t) in order to get S(t)? Or is there some way to estimate the change in survival without knowing h(t)?
XGBoost: interestingly, one comment here is related to both the main analysis and Lathan's bootstrapping: what is a residual? I don't actually think that you use residuals in the end, but you do discuss the idea of a residual when you introduce XGBoost, and residuals are a pretty tough concept in CoxPH. Maybe reflecting on why/how residuals are tough to think about will help you understand why they had to do the other complicated stuff.
You use F to mean two different things. Er, maybe you do?? I'd like to walk through your work together. Any chance we can do that? It's hard for me to give feedback in the GitHub text box ...
There are a few places where a careful proofread would help the language. I think you say "my" instead of "may". And in some places you say "I" while in others you say "we". Spell check. Etc.
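
To make the D_xy question concrete, here is a minimal sketch of one common definition: Harrell's C for right-censored data, with D_xy = 2(C - 0.5). It is not necessarily what your software computes. The function names, the toy data, and the choice to skip pairs with tied times are all my assumptions for illustration.

```python
import numpy as np


def concordance(time, event, risk):
    """Harrell's C: among usable pairs, the fraction in which the subject
    with the shorter observed time has the higher risk score.
    Ties in risk count as 1/2; pairs with tied times are skipped (assumption)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            # If the shorter time is censored, we cannot order the pair.
            continue
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


def somers_dxy(time, event, risk):
    """Rescale C from [0, 1] to [-1, 1]; 0 means no discrimination."""
    return 2.0 * (concordance(time, event, risk) - 0.5)


# Toy data (hypothetical): event = 1 is an observed event, 0 is censored.
time = [2, 4, 6, 8, 10]
event = [1, 1, 0, 1, 0]
risk = [2.0, 0.5, 0.3, 1.0, 0.1]  # e.g. a linear predictor from a fitted model

c_index = concordance(time, event, risk)
d_xy = somers_dxy(time, event, risk)
print(c_index, d_xy)
```

A pair only counts when the shorter time is an observed event, which is how censoring enters the calculation. Writing out the analogous steps for "optimism" in the report would answer the second half of that question.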

Nice job, team! Thanks for the really great work and for sharing it with me.

feedback, part 2

@latlio @MadHobbs

Wow, you two have done a great job with this project already! Here are some thoughts I have along the way...

Does the pairs / cor plot actually tell you anything? It seems hard to read given the binary nature of most of the variables. Is there a better way to present the same information?
Interesting idea to calculate VIF. Note that VIF treats each explanatory variable in turn as the response and regresses it linearly on the others, so we don't have to worry about censored information (that is, VIF is the same with CoxPH as with LinReg). However, most of your explanatory variables are binary. What would VIF mean in that case?
Bootstrapping: have you thought at all about how you are resampling? Certainly one way to do it is to just resample from the data. But I think there is some literature out there which isn't too technical and gives other methods, for example, sampling events and censored observations differently and/or sampling from residuals. I also really like the comparisons of the SEs. Can you expand that section a bit? I mean, I know what you are doing, but a reader just coming to the project might not.
CoxPH / XGBoost ... I'm looking forward to hearing more about how the splits are related to the CoxPH model. Let me know if you get to a place where you want to go through any of the details together.
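
On the VIF question above, a minimal sketch of the calculation may help: each explanatory variable is regressed on the others, and VIF = 1 / (1 - R^2). The simulated binary covariates, the seed, and the helper name `vif` are assumptions for illustration, not your data or code.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.
    Column k is regressed (with an intercept) on all other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for k in range(p):
        y = X[:, k]
        others = np.delete(X, k, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[k] = 1.0 / (1.0 - r2)
    return out


# Simulated binary covariates (hypothetical): c agrees with a 90% of the time,
# b is independent of both.
rng = np.random.default_rng(0)
n = 500
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
c = np.where(rng.random(n) < 0.9, a, 1 - a)
X = np.column_stack([a, b, c])

vifs = vif(X)
print(vifs)  # order: a, b, c
```

The survival times and censoring indicators never enter the computation, which is why VIF is the same for CoxPH and linear regression. With 0/1 columns, each auxiliary regression is a linear probability model, so R^2 here measures how well one indicator is linearly predicted by the others.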

Just FYI (I'm reminding everyone), here are the instructions about the "new" part of the assignment:

Each individual should have some analysis that goes beyond a Cox PH model. For your analysis, you should give details of what is going on, how it is relevant, what are the assumptions, what are the conclusions, etc. Your analysis should indicate a sense that you understand and that you can communicate the results to a possible client.
