
helsinki-open-data-science's Issues

Tests not working

Chapter 3:
In "Creating a factor variable", the tests cause errors and need to be modified. Currently no tests are used in the chapter.

IODS-error-in-MCA-code-and-SOLUTION-by-Kimmo

From my own notes:


For some reason the MCA code in 5. Dimensionality reduction techniques did not work anymore. The problem is in the last exercise of the chapter.

I spent some time (last year) digging and testing, and found the solution!

The FactoMineR package had been updated (based on the GitHub commit history) last autumn:

cran/FactoMineR@5056929

and there are some changes in the function plot.MCA:

https://github.com/cran/FactoMineR/blob/master/R/plot.MCA.R

I noticed that the argument graph.type is handled a bit carelessly (it may also be a new argument). Its values are "ggplot" and "classic". Focusing on that, I found that this code of ours (which halts DataCamp!):

plot.MCA(mca, invisible=c("ind"), habillage = "quali")

will work perfectly, as soon as it is updated as follows:

plot.MCA(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

I tested this both with plain R and in the DataCamp window of IODS.

So, it seems this would be a very small fix, with an immediate positive result.

Check also the instructions in the DataCamp exercise and the RStudio exercise.
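A minimal sketch of the fix in context, using FactoMineR's bundled tea data set (an assumption for illustration; the course data and variable selection differ):

```r
# Reproduce the fix with FactoMineR's example data.
library(FactoMineR)

data(tea)
mca <- MCA(tea[, c("Tea", "How", "sugar", "where")], graph = FALSE)

# Old call, which halts in DataCamp after the FactoMineR update:
# plot.MCA(mca, invisible = c("ind"), habillage = "quali")

# Fixed call: explicitly request the classic base-graphics output.
plot.MCA(mca, invisible = c("ind"), habillage = "quali",
         graph.type = "classic")
```

Setting graph.type = "classic" restores the pre-update base-graphics behaviour; the new default produces a ggplot object, which the DataCamp environment apparently cannot render.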

IODS-update-needed-with-tidyverse-in-last-chapter

This is an update that should be done ASAP (although the code still works).

It is all about the last chapter, 6. Analysis of longitudinal data,
where we convert the data sets between long and wide formats.

Since creating this chapter (it was added later than the other ones and coded by Petteri Mäntymaa, one of the assistants and stats students at the time), the tidyverse functions have been re-written, see:

https://tidyr.tidyverse.org/

Especially this first point on that page ("Getting started"):

  • “Pivotting” which converts between long and wide forms. tidyr 1.0.0 introduces pivot_longer() and pivot_wider(), replacing the older spread() and gather() functions. See vignette("pivot") for more details.

So, our code was built before tidyr 1.0.0, using spread() and gather(). Those functions should be replaced by the pivoting functions above.
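The replacement can be sketched with a toy data frame (not the course data; column names here are made up for illustration):

```r
library(tidyr)

# A small wide-format toy data set.
wide <- data.frame(id = 1:2, week1 = c(5, 7), week2 = c(6, 8))

# Old: long <- gather(wide, key = "week", value = "value", week1, week2)
long <- pivot_longer(wide, cols = c(week1, week2),
                     names_to = "week", values_to = "value")

# Old: back <- spread(long, key = week, value = value)
back <- pivot_wider(long, names_from = "week", values_from = "value")
```

pivot_longer() and pivot_wider() are drop-in conceptual replacements, but the argument names differ (names_to/values_to instead of key/value), so the DataCamp instructions need matching updates.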

This should be fairly straightforward, I think. The DataCamp instructions must be revised, too (and check the RStudio exercise).

Mistake in k-means

Chapter 3:

The usage of the dist() function needs to be removed from the k-means sections, because the kmeans() function already computes distances internally from the raw data.
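A minimal sketch of the correction, using the built-in iris data rather than the course data set:

```r
# kmeans() expects the (scaled) data matrix itself, not a distance matrix.
x <- scale(iris[, 1:4])

# Wrong (what the chapter currently does): clustering a dist() object
# km <- kmeans(dist(x), centers = 3)

# Right: pass the data directly; kmeans() handles distances internally.
set.seed(123)
km <- kmeans(x, centers = 3)
table(km$cluster)
```

Passing dist(x) does not error in R (the object is coerced), which is why the mistake went unnoticed, but it clusters the rows of the distance matrix rather than the observations.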

IODS-error-in-joining-datasets-and-SOLUTION-by-Reijo-Sund

This concerns the Logistic regression chapter:


Joining the datasets has not worked perfectly. Reijo Sund noticed this. See the detailed solution given by Reijo in his GitHub. This should be corrected in the DataCamp code and the instructions + in the RStudio Exercise.
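The shape of the corrected join can be sketched with toy stand-ins for the two source data sets (column names and data are assumptions for illustration; see Reijo's create_alc.R for the authoritative fix):

```r
library(dplyr)

# Toy stand-ins for the mat and por data sets.
mat <- data.frame(school = c("GP", "GP"), sex = c("F", "M"),
                  age = c(17, 16), G3 = c(10, 12))
por <- data.frame(school = c("GP", "GP"), sex = c("F", "M"),
                  age = c(17, 16), G3 = c(11, 13))

# In the real data, ALL common fixed variables must be listed here;
# joining on a subset can pair up different students.
join_cols <- c("school", "sex", "age")

joined <- inner_join(mat, por, by = join_cols, suffix = c(".m", ".p"))
# joined now has G3.m and G3.p, the grades from the two courses.
```

The suffix argument keeps the course-specific variables (here G3) apart, which is what produces the .m/.p pairs discussed in the forum thread below.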

Some messages from the IODS2020 forum:

alc.txt data - Exercise 3
Anne P - Monday, 9 November 2020, 14:18
Number of replies: 4
Hi,

we were told today by Reijo "Please note that for Exercise #3 in Datacamp the joining of datasets is not perfect. Please see the following code to see that there are actually 370 unique individuals instead of 382 in the datasets"

If I take the data for the analysis from:

http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt
then there are 382 obs. of 35 variables. Is it okay to use that data?
I did the data wrangling part, but I am not sure whether I did it correctly, so I would like to use the data that is actually correct for the analysis :)


Re: alc.txt data - Exercise 3
Reijo Sund - Monday, 9 November 2020, 16:38
If you want to use the data with 370 observations, do the wrangling part as shown in https://github.com/rsund/IODS-project/raw/master/data/create_alc.R. That script actually creates an Excel file, so you may want to save the data as a .txt or .csv instead, or read the Excel file in the analysis part with the readxl::read_excel() function.

If you want to read the wrangled data directly, use the data available in https://github.com/rsund/IODS-project/raw/master/data/alc.csv.

You can also directly load the data in R:
alc <- readr::read_csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")

Please note that for variables failures, paid, absences, G1, G2, and G3 there are also variables with extra .p and .m in their names containing the original values from both datasets and you may consider if there is better way to combine those than to calculate means (or taking the first values).


Re: alc.txt data - Exercise 3
Anne P - Monday, 9 November 2020, 17:52
Thank you for the answer!


Re: Re: alc.txt data - Exercise 3
Andrei K - Wednesday, 11 November 2020, 09:52
Hi!
I joined the two data sets by creating a unique ID based on the variables given in the task ("school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet").
The same procedure was used for both the por and mat data sets. Then I excluded duplicates within each data set and merged the two sets; if a student appeared twice, both observations were removed.

This yields only 358 observations, not 370! If I do NOT exclude duplicates, the number is 382, which matches the task, but is not correct according to the Monday meeting and your e-mail.

If we assume that your R code is correct, why is "join_cols" different from the joining variables given in the task?

The mess in the tasks and DataCamp code is consuming time =(


Re: Re: Re: alc.txt data - Exercise 3
Reijo Sund - Monday, 16 November 2020, 09:25

Read the metadata related to the datasets. There are some free variables and then common fixed variables. To join the two datasets correctly, you need to take all common fixed variables into account, because there may be duplicate values in subsets of the common fixed variables. That is why unique identifiers, such as the Finnish personal identity code, its pseudonymized version, or a research number, would help a lot in joining datasets.

And in data wrangling it is very common that you need to deal with messy datasets. Unfortunately, the DataCamp exercises were constructed before the problem was detected during last year's course. But the variables that should be used in joining the datasets certainly should be corrected in the task description.

For the actual logistic regression part, any version of the data will be allowed (of course you get slightly different results, but still reasonably close to each other). Actually, it would be an interesting task to compare how much the results change between the different versions of the wrangled data.

