
sr--oshita--ifixit's Introduction

Summer Research, 2017

Student: Lisa Oshita

Faculty advisor: Shannon Pileggi

Industry advisor: Anthony Pileggi

Objective

The objective of this summer research is to use survival analysis techniques to model the time until a question posted on iFixit's Q&A forum (https://www.ifixit.com/Answers) is answered.

Specific Aims

  1. Utilize GitHub to collaborate on project materials and updates.
  2. Write all R code according to Hadley Wickham's Style Guide (http://adv-r.had.co.nz/Style.html). All R code should be reproducible, so that it executes when applied to a new data set.
  3. Thoroughly explore the iFixit Answers forum in conjunction with the data to understand what the data represents.
  4. Perform a literature review to determine if other researchers have identified characteristics associated with either likelihood or timeliness of a question being answered.
  5. Identify potential parametric distributions suitable for the time to event data. Compare pros and cons of parametric versus nonparametric models.
  6. Create new variables necessary to model time until a question is answered. For example, this may involve classifications of product type or parsing text strings to identify if a question mark is present.
  7. Research the incorporation and interpretation of potential time-varying covariates. Present an argument for the best treatment of time-varying covariates, if appropriate.
  8. Create a model to predict time until question is answered.
     • Turn this model into a function that can be applied to any new data set.
     • Allow users to calculate or sort through predictions for specific questions.
     • Write a user manual on how to use the function and interpret the results.
  9. Take some DataCamp courses to learn new R skills (listed in order of priority).
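
Aim 6 mentions parsing text strings to check whether a question mark is present. A minimal sketch of that idea in base R follows. The data frame `questions` and its `title` column are hypothetical stand-ins, not names taken from the project data.

```r
# Hypothetical example data; the column name `title` is an assumption.
questions <- data.frame(
  title = c("Why won't my phone turn on?",
            "Screen cracked",
            "How do I replace the battery?"),
  stringsAsFactors = FALSE
)

# TRUE when the title contains a literal question mark.
# fixed = TRUE treats "?" as a plain character rather than a regex quantifier.
questions$has_question_mark <- grepl("?", questions$title, fixed = TRUE)

# Number of characters in the title, another simple text-derived variable.
questions$title_length <- nchar(questions$title)

print(questions)
```

Both new columns could then enter a model as ordinary covariates.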

sr--oshita--ifixit's People

Contributors

lisaoshita, shannonpileggi

sr--oshita--ifixit's Issues

When was the data downloaded?

I made the time-to-event variable, but there are quite a few NAs. I'll use the date the data was downloaded as the right-censored value for those observations.
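
The censoring approach described above can be sketched as follows. This is only an illustration: the column names `post_date` and `first_answer_date` and the value of `download_date` are assumptions, since the actual download date is exactly what this issue asks about.

```r
# Hypothetical data; `post_date` and `first_answer_date` are assumed column names.
questions <- data.frame(
  post_date = as.POSIXct(
    c("2017-04-01 08:00:00", "2017-04-02 12:00:00", "2017-04-03 09:30:00"),
    tz = "UTC"
  ),
  first_answer_date = as.POSIXct(
    c("2017-04-01 10:00:00", NA, "2017-04-04 09:30:00"),
    tz = "UTC"
  )
)

# Placeholder download time; replace with the real value once it is known.
download_date <- as.POSIXct("2017-04-10 00:00:00", tz = "UTC")

# Event indicator: 1 if answered, 0 if still unanswered at download (right-censored).
questions$answered <- as.integer(!is.na(questions$first_answer_date))

# End of follow-up: the answer time when observed, otherwise the download time.
end_time <- questions$first_answer_date
end_time[is.na(end_time)] <- download_date

# Time to event (or to censoring) in hours.
questions$time_until_answer <- as.numeric(
  difftime(end_time, questions$post_date, units = "hours")
)

print(questions[, c("answered", "time_until_answer")])
```

The pair (`time_until_answer`, `answered`) is the form that survival model functions such as `survival::Surv()` expect.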

reproducible descriptive stats report

Please start working towards making your descriptive stats report reproducible. You can embed results from R code inline with your text; see the snippet of code below. This way, when we get new data, your descriptive stats report will update automatically.

Also, you may want to check out this tutorial for other things you can do in R Markdown: http://rmarkdown.rstudio.com/lesson-1.html

The data set is cars.

data(cars)
head(cars)
mean(cars$speed)

The average speed is `r mean(cars$speed)`.

device categorization

Before you start writing code to do device categorization, think of a plan for the categorization. How broad or specific do we want the categories to be? For example, do you want a category for phones in general, or specifically for iphones? Outline some options, and maybe Anthony can chime in with some ideas too.
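
One way to compare broad and specific schemes before committing to either is to keep each scheme as a named vector of patterns and run both over the same values. The sketch below assumes made-up device names and illustrative patterns; the helper `categorize()` is hypothetical and not part of the project code.

```r
# Hypothetical device names; real values would come from the device variable.
device <- c("iPhone 6", "Samsung Galaxy S7", "MacBook Pro", "iphone 5s", "Xbox One")

# Option A: broad categories. Patterns are illustrative, not a vetted list.
broad_patterns <- c(
  phone   = "iphone|galaxy|phone",
  laptop  = "macbook|laptop",
  console = "xbox|playstation"
)

# Option B: a more specific scheme that separates iPhones from other phones.
specific_patterns <- c(
  iphone        = "iphone",
  android_phone = "galaxy",
  laptop        = "macbook|laptop",
  console       = "xbox|playstation"
)

# Return the name of the first matching pattern for each value, or "other".
categorize <- function(x, patterns) {
  out <- rep("other", length(x))
  # Walk the patterns in reverse so earlier entries win when several match.
  for (name in rev(names(patterns))) {
    out[grepl(patterns[[name]], x, ignore.case = TRUE)] <- name
  }
  out
}

print(table(broad = categorize(device, broad_patterns)))
print(table(specific = categorize(device, specific_patterns)))
```

Comparing the two tables shows how many observations each scheme leaves in small or "other" groups, which may help decide how specific the categories should be.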

Having problems with regex function

I’m practicing with product classification, and I wanted to search through the device variable and put phones in one category. But the cases aren’t consistent (some are uppercase and some are lowercase), so I wanted to use the regex function to allow for case-insensitive matching. But calling regex in R doesn’t seem to work for me. I’ve copied and pasted the code used in the DataCamp lesson into R, and that doesn’t produce the same output either: it doesn’t recognize any patterns. I can’t figure this one out.

This is the code that I copied and pasted from DataCamp. R doesn't match any patterns with it:
library(stringr)  # provides str_view() and regex()
x <- c("Cat", "CAT", "cAt")
str_view(x, regex("cat", ignore_case = TRUE), match = TRUE)
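
If the stringr call keeps misbehaving, case-insensitive matching is also available in base R. This is a possible workaround, not a diagnosis of the problem above; the extra "dog" element is added here only to show a non-match.

```r
x <- c("Cat", "CAT", "cAt", "dog")

# grepl() returns one logical per element;
# ignore.case = TRUE makes the match case-insensitive.
is_cat <- grepl("cat", x, ignore.case = TRUE)
print(is_cat)

# Equivalent approach: lower-case the input first, then match a fixed string.
is_cat_lower <- grepl("cat", tolower(x), fixed = TRUE)
print(identical(is_cat, is_cat_lower))
```

Because `grepl()` returns a logical vector, the result can be used directly to assign a category to the matching rows.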

Plotting predictions

I finished streamlining my code and also wrote up the functions to fit the model and predict. All of those are in the functions_to_predict file in the oshitar repository if you can take a look and let me know if there's anything I can work on!

I'm also having some trouble thinking of a way to plot the predictions. One way that I was thinking of is to plot the failure experience for one question/row of data. The function would take one question, predict its failure probability at intervals, and then plot those probabilities against time. The question that is input would allow the user to specify the levels of each of the predictors and see what happens to the failure predictions if they change those levels around. If I go about it this way, I'd need to think of a way to make specifying these levels easy.
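
The single-question idea could look roughly like the sketch below, which uses the survival package rather than rms. Everything about the data is an assumption: the values are simulated, and `hours` and `answered` are hypothetical column names. The failure probability is taken as one minus the predicted survival probability.

```r
library(survival)

set.seed(1)
# Simulated stand-in data; `hours` and `answered` are assumed column names.
n <- 200
sim <- data.frame(
  new_category = factor(sample(c("phone", "laptop", "other"), n, replace = TRUE)),
  hours = rexp(n, rate = 0.01),
  answered = rbinom(n, 1, 0.8)
)

fit <- coxph(Surv(hours, answered) ~ new_category, data = sim)

# One hypothetical question whose predictor levels the user chooses.
new_question <- data.frame(
  new_category = factor("phone", levels = levels(sim$new_category))
)

# Predicted survival curve for that question, evaluated at fixed intervals.
times <- seq(0, 300, by = 50)
curve_fit <- summary(survfit(fit, newdata = new_question),
                     times = times, extend = TRUE)

# Failure probability = 1 - survival probability = P(answered by time t).
failure <- data.frame(
  hour = curve_fit$time,
  p_answered = 1 - as.numeric(curve_fit$surv)
)
print(failure)

plot(failure$hour, failure$p_answered, type = "s",
     xlab = "hour", ylab = "Predicted probability of being answered")
```

Changing the values in `new_question` and rerunning the last steps shows how the predicted curve shifts with the predictor levels.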

There is also the survplot function from the rms package, which lets you plot the estimated survival curves for a Cox model. You can get plots for specific levels of the predictors, adjusting for the other predictors in the model that are not specified in the function. But it's not the prettiest or easiest to understand. Here's an example:

rms::survplot(final, new_category, time.inc = 100, col = 1:length(levels(x$new_category)), xlab = "hour", label.curves=list(keys=1:length(levels(x$new_category))))

This creates the plot survplot_output.pdf. It shows the predicted survival curves for each level in new_category. I'm still playing around with some of the arguments to change the way it looks. I would also need to figure out how to get this function to plot the reverse of the survival probability (that is, the probability of failure).

Overall, what do you think the best way to plot these predictions would be?

DataCamp courses

How are questions displayed on the forum?

Are the questions posted to the forum in the order they were asked? Or do some questions get moved up in the forum if someone commented on them recently? I think that how visible the questions are (for example, whether they are at the top of the forum) might play a role in how quickly they are answered.

Reputation of the user

I'm thinking that it would be worth considering the reputation of the user. Since you gain reputation points whenever your question or answer is up-voted, it probably motivates users to ask better questions, which could get answers faster. Can you add a variable for reputation to the data? Thank you!

Frequent terms

I’m still kind of struggling to understand what we’re doing with the frequent terms. Just to make sure: we’re trying to find the most frequently used terms among unanswered and answered questions. But since there was quite a bit of overlap between the most frequently used terms for answered and unanswered questions (e.g., screen, phone), we need to find words that are unique to each category. So we’re calculating the proportion of times each term shows up in its answered/unanswered category, then finding the ratio to see which group (answered/unanswered) the word is most likely to belong to. To pick out the words I’ll use, I need to determine min and max proportion thresholds, so that any words between those proportions are the ones I’ll include as most frequently used. And determining those thresholds is kind of up to me, based on the ratios and the distribution of the proportions? I feel like I'm almost understanding this; I just keep losing it lol
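
The steps described above can be sketched on toy data. The terms, the threshold values, and the small constant added before dividing are all placeholder assumptions for illustration, not values from the project.

```r
# Hypothetical toy data: one row per term occurrence, flagged by question status.
terms <- data.frame(
  term = c("screen", "screen", "phone", "battery",
           "screen", "phone", "water", "water"),
  answered = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Count how often each term appears within each group.
counts <- table(terms$term, terms$answered)

# Convert counts to within-group proportions (each column sums to 1).
props <- prop.table(counts, margin = 2)

summary_df <- data.frame(
  term = rownames(props),
  prop_answered = as.numeric(props[, "TRUE"]),
  prop_unanswered = as.numeric(props[, "FALSE"]),
  stringsAsFactors = FALSE
)

# Ratio > 1: relatively more common among answered questions; < 1: unanswered.
# The 0.01 offset avoids division by zero and is an arbitrary assumption.
summary_df$ratio <- (summary_df$prop_answered + 0.01) /
  (summary_df$prop_unanswered + 0.01)

# Placeholder thresholds: keep terms whose larger proportion falls in this range.
min_prop <- 0.3
max_prop <- 0.9
largest <- pmax(summary_df$prop_answered, summary_df$prop_unanswered)
frequent <- summary_df[largest >= min_prop & largest <= max_prop, ]

# Sort so the terms most tied to answered questions come first.
frequent <- frequent[order(-frequent$ratio), ]
print(frequent)
```

In this toy example "screen" is kept and leans towards answered questions, while "water" is kept and leans towards unanswered ones. Looking at the distribution of `largest` and `ratio` on the real data is one way to choose the thresholds.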

NA's in the category variable

In the out data frame, 2022 of the 8089 observations have NA in the category variable. How should I work with this? Also, is it okay if I subset the data to only deal with observations in English?
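
Since 2022 of 8089 is roughly a quarter of the observations, dropping those rows would discard a lot of data. One possible treatment, sketched below, keeps them by giving missing categories their own level. The data here is a made-up stand-in, and `langid` is an assumed name for a language column.

```r
# Hypothetical stand-in for the out data frame; `langid` is an assumed column name.
out <- data.frame(
  category = c("Phone", NA, "Mac", NA, "Tablet"),
  langid = c("en", "en", "es", "en", "en"),
  stringsAsFactors = FALSE
)

# How many categories are missing, and what share of rows is that?
n_missing <- sum(is.na(out$category))
share_missing <- mean(is.na(out$category))
print(c(n_missing = n_missing, share_missing = share_missing))

# One option: keep the rows and make missingness its own level.
out$category_filled <- ifelse(is.na(out$category), "Unknown", out$category)

# Subset to English-language questions.
out_en <- out[out$langid == "en", ]

print(table(out_en$category_filled))
```

Whether "Unknown" is a sensible level depends on why the category is missing, which is worth checking against the forum itself.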
