The sr--oshita--ifixit from shannonpileggi

26 negative values in the time_until_answer variable

All of the negative values for the time_until_answer variable correspond to questions that were answered, where the post_date is later than the first_answer_date. Should I filter these out of the dataset?
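Before dropping anything, it may help to look at the affected rows directly. Here is a minimal sketch, assuming a data frame with the post_date and first_answer_date columns described above and time measured in hours; the name `questions` and the toy values are invented for illustration:

```r
# Toy stand-in for the real data; values are invented for illustration.
questions <- data.frame(
  post_date = as.POSIXct(c("2017-04-01 10:00:00", "2017-04-02 09:00:00",
                           "2017-04-03 12:00:00"), tz = "UTC"),
  first_answer_date = as.POSIXct(c("2017-04-01 12:00:00", "2017-04-02 08:30:00",
                                   NA), tz = "UTC")
)

# Hours from posting to first answer (NA when there is no answer yet).
questions$time_until_answer <- as.numeric(
  difftime(questions$first_answer_date, questions$post_date, units = "hours")
)

# Flag rows whose answer is recorded before the question was posted.
negative <- !is.na(questions$time_until_answer) & questions$time_until_answer < 0

sum(negative)                      # how many rows are affected
questions[negative, ]              # inspect them before deciding
cleaned <- questions[!negative, ]  # drop them only if they look like data errors
```

Unanswered questions (NA) are kept on purpose here, since they are still needed as censored observations.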

When was the data downloaded?

I made the time-to-event variable, but there are quite a few NAs. I'll use the date the data was downloaded as the right-censored value for those observations.
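One possible way to set this up is sketched below. It assumes the same post_date and first_answer_date columns and time in hours; the download date shown is a placeholder, not the real one, and the `answered` indicator is a name made up for this example:

```r
# Toy data; the download date below is a placeholder, not the real one.
questions <- data.frame(
  post_date = as.POSIXct(c("2017-04-01 10:00:00", "2017-04-03 12:00:00"),
                         tz = "UTC"),
  first_answer_date = as.POSIXct(c("2017-04-01 12:00:00", NA), tz = "UTC")
)
download_date <- as.POSIXct("2017-04-05 12:00:00", tz = "UTC")  # hypothetical

# 1 = answered (event observed), 0 = unanswered at download (right-censored).
questions$answered <- as.integer(!is.na(questions$first_answer_date))

# End of follow-up: the first answer if there is one, else the download date.
end_time <- questions$first_answer_date
end_time[is.na(end_time)] <- download_date

# Time to event (or to censoring) in hours.
questions$time_until_answer <- as.numeric(
  difftime(end_time, questions$post_date, units = "hours")
)
questions
```

The resulting pair of columns is in the shape that survival::Surv(time_until_answer, answered) expects.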

reproducible descriptive stats report

Please start working towards making your descriptive stats report reproducible. You can embed results from R code inline with your text. See the snippet of code below. This way, when we get new data, your descriptive stats report will update automatically.

Also, you may want to check out this tutorial for other things you can do in R Markdown: http://rmarkdown.rstudio.com/lesson-1.html

The data set is cars.

data(cars)
head(cars)
mean(cars$speed)

The average speed is `r mean(cars$speed)`

Literature Review

Here's one paper we found a while back -- http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42287.pdf

Read the paper
Summarize main points

device categorization

Before you start writing code to do device categorization, think of a plan for the categorization. How broad or specific do we want the categories to be? For example, do you want a category for phones in general, or specifically for iPhones? Outline some options, and maybe Anthony can chime in with some ideas too.

Having problems with regex function

I’m practicing with product classification, and I wanted to search through the device variable and classify phones in one category. But the cases aren’t consistent: some are uppercase and some are lowercase, so I wanted to use the regex function to allow for case-insensitive matching. But calling regex in R doesn’t seem to work for me. I’ve copied and pasted the code used in the DataCamp lesson into R, and that doesn’t produce the same output either: it doesn’t recognize any patterns. I can’t figure this one out.

This is the code that I copied and pasted from DataCamp. R doesn't match any patterns with this.
library(stringr)  # str_view() and regex() come from the stringr package
x <- c("Cat", "CAT", "cAt")
str_view(x, regex("cat", ignore_case = TRUE), match = TRUE)
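If stringr keeps misbehaving, base R can do the same case-insensitive matching. A rough sketch follows, with made-up device strings and a deliberately broad two-level split; the real categories would come from the categorization plan:

```r
# Hypothetical device strings with inconsistent capitalisation.
device <- c("iPhone 6", "IPHONE 5s", "Galaxy S7 phone", "MacBook Pro", "xbox one")

# grepl() with ignore.case = TRUE matches regardless of case.
is_phone <- grepl("phone", device, ignore.case = TRUE)

# Start with a broad split; more patterns can be added later.
device_category <- ifelse(is_phone, "phone", "other")

data.frame(device, device_category)
```

The stringr equivalent of the matching step would be str_detect(device, regex("phone", ignore_case = TRUE)).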

Build an R package

Let's keep all the functions built during this project in a package.

Here's how I learned -- http://r-pkgs.had.co.nz/intro.html

I can work with you on this next time you are at iFixit.

Plotting predictions

I finished streamlining my code and also wrote up the functions to fit the model and predict. All of those are in the functions_to_predict file in the oshitar repository if you can take a look and let me know if there's anything I can work on!

I'm also having some trouble thinking of a way to plot the predictions. One way that I was thinking of is to plot the failure experience for one question/row of data. The function would take one question and predict its failure probability at intervals, and then plot those probabilities against time. The question that is input would allow the user to specify the levels of each of the predictors, and see what happens to the failure predictions if they change those levels around. If I go about it this way, I'd need to think of a way to make specifying these levels easy.

There is also the survplot function from the rms package, which plots the estimated survival curves for a Cox model. You can get plots for specific levels of the predictors, adjusting for the other predictors in the model not specified in the function. But it's not the prettiest or easiest to understand. Here's an example:

rms::survplot(final, new_category, time.inc = 100, col = 1:length(levels(x$new_category)), xlab = "hour", label.curves=list(keys=1:length(levels(x$new_category))))

This creates the plot survplot_output.pdf, which shows the predicted survival curves for each level in new_category. I'm still playing around with some of the arguments to change the way it looks. I would also need to figure out how to get this function to plot the reverse of the survival probability, that is, the probability of failure.
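For the single-question idea, one option might be to skip survplot and build the curve from survfit in the survival package. The sketch below is only an illustration: it fits a stand-in Cox model on the built-in lung data, and `new_row` is a made-up name for the one-row data frame holding the chosen predictor levels. The project's own model and predictors would replace these.

```r
library(survival)

# Stand-in model on the built-in lung data; the project's model and
# predictors (e.g. new_category) would take the place of these.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# One hypothetical "question": a single row giving a level for each predictor.
new_row <- data.frame(age = 60, sex = 1)

# Predicted survival curve for that row.
sf <- survfit(fit, newdata = new_row)

# Failure probability is one minus the survival probability.
failure_prob <- 1 - sf$surv

# Step plot of predicted failure probability against time.
plot(sf$time, failure_prob, type = "s",
     xlab = "Time", ylab = "Predicted probability of failure")
```

Changing the values in `new_row` and re-running shows how the predictions move with the predictor levels. plot(sf, fun = "event") draws the same one-minus-survival curve directly.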

Overall, what do you think the best way to plot these predictions would be?

Data camp courses

Take some DataCamp courses to learn new R skills (listed in order of priority).

Since the paper we posted had some machine learning topics, if you want to learn more about machine learning these courses would help give you perspective:

And just for fun, if you want to learn SQL:

https://www.datacamp.com/courses/intro-to-sql-for-data-science

How are questions displayed on the forum?

Are the questions displayed on the forum in the order they were asked? Or do some questions get moved up in the forum if someone commented on them recently? I think that how visible the questions are (like if they are at the top of the forum) might play a role in how quickly they are answered.

Reputation of the user

I'm thinking that it would be worth considering the reputation of the user. Since you gain reputation points whenever your question/answer is up-voted, it probably motivates users to ask better questions, which could get answers faster. Can you add a variable for reputation to the data? Thank you!

Frequent terms

I’m still kind of struggling to understand what we’re doing with the frequent terms. Just to make sure, we’re trying to find the most frequently used terms among unanswered and answered questions. But since there was quite a bit of overlap between the most frequently used terms for answered/unanswered questions (ex: screen, phone), we need to find words that are unique to each category. So we’re calculating the proportion of times each term shows up in its answered/unanswered category, then finding the ratio to see which group (answered/unanswered) the word is most likely to belong to. To pick out the words I’ll use, I need to determine min and max proportion thresholds, so that any words between those proportions are what I’ll include as most frequently used. And to determine those thresholds, it’s kind of up to however I see fit based off of the ratios and the distribution of the proportions? I feel like I'm almost understanding this, I just keep losing it lol
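To make the steps above concrete, here is a small sketch with invented counts. It assumes the proportion is each term's count divided by the total term count in its group, which may not be exactly how the project defines it; the thresholds are placeholders too:

```r
# Toy term counts; the numbers are invented for illustration.
terms <- data.frame(
  term = c("screen", "phone", "solder", "warranty"),
  n_answered = c(400, 350, 60, 10),
  n_unanswered = c(200, 180, 5, 30),
  stringsAsFactors = FALSE
)

# Proportion of each group's term occurrences accounted for by each term.
terms$prop_answered <- terms$n_answered / sum(terms$n_answered)
terms$prop_unanswered <- terms$n_unanswered / sum(terms$n_unanswered)

# Ratio > 1: relatively more common among answered questions;
# ratio < 1: relatively more common among unanswered questions.
terms$ratio <- terms$prop_answered / terms$prop_unanswered

# Hypothetical thresholds, shown here for the answered group only.
min_prop <- 0.01
max_prop <- 0.30
in_range <- terms$prop_answered >= min_prop & terms$prop_answered <= max_prop

# Terms inside the thresholds, most "answered-leaning" first.
candidates <- terms[in_range, ]
candidates <- candidates[order(candidates$ratio, decreasing = TRUE), ]
candidates
```

In this toy data, screen and phone have ratios close to 1 because they are common in both groups, which is the overlap problem described above. They also fall above the max threshold, so they are excluded.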

NA's in the category variable

In the out dataframe, there are 2022 NAs in the category variable (out of the 8089 observations). How should I work with this? Also, is it okay if I subset the data to only deal with observations in English?
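One option, sketched below on toy data, is to keep the rows and treat missing categories as their own level rather than dropping a quarter of the data. The column name `langid` and the label "Unknown" are assumptions made up for this example, not names taken from the out dataframe:

```r
# Toy stand-in for the out dataframe; langid is a hypothetical column name.
out <- data.frame(
  category = c("Phone", NA, "Mac", NA),
  langid = c("en", "en", "es", "en"),
  stringsAsFactors = FALSE
)

# Count and share of missing categories.
sum(is.na(out$category))
mean(is.na(out$category))

# Keep the rows, but give missing categories an explicit level.
out$category_filled <- ifelse(is.na(out$category), "Unknown", out$category)

# Subset to English-language observations only.
english <- out[out$langid == "en", ]
english
```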


sr--oshita--ifixit's Introduction

Summer Research, 2017

Student: Lisa Oshita

Faculty advisor: Shannon Pileggi

Industry advisor: Anthony Pileggi

Objective

Specific Aims
