

Helsinki Open Data Science

Welcome to the Helsinki Open Data Science repository! This repository includes the code for both the DataCamp course and the presentation slides related to the University of Helsinki course "Introduction to Open Data Science" (IODS), taught by prof. Kimmo Vehkalahti. You can click on the 'Course on DataCamp' link above to go to the course page.

Both the DataCamp course and the presentation slides are created by Tuomo Nieminen and Emma Kämäräinen.

Course slides

The presentation slides for the IODS course are published on a GitHub web page enabled by this repository. The slides have been created by Tuomo Nieminen and Emma Kämäräinen using R Presentation. They can be found at the following link:

IODS slides.

The R Presentation source files are included in the 'docs' folder. The index.html file in the 'docs' folder is used to enable the GitHub web page. See 'instructions.Rmd' in the 'docs' folder for more information.

DataCamp course creation

Changes made to this GitHub repository are automatically reflected in the linked DataCamp course. This means that you can enjoy all the advantages of GitHub: version control, collaboration, issue handling, and more.

Workflow

  1. Edit the markdown and yml files in this repository. You can use GitHub's online editor or use git locally and push your changes.
  2. Check out your build attempts on the Teach Dashboard.
  3. Check out your automatically updated course on DataCamp.

Getting Started

A DataCamp course consists of two types of files:

  • course.yml, a YAML-formatted file that's prepopulated with some general course information.
  • chapterX.md, a markdown file with:
    • a YAML header containing chapter information.
    • markdown chunks representing DataCamp Exercises.

To learn more about the structure of a DataCamp course, check out the documentation.
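As a rough illustration of the chapter-file structure described above (the field names here are an assumption for illustration only; the DataCamp documentation is authoritative), a chapter file's YAML header might look like this:

```yaml
# chapterX.md header -- illustrative sketch; field names are assumptions,
# check the DataCamp teach documentation for the actual schema
---
title       : Regression and model validation
description : Simple regression, multiple regression, model diagnostics
---
```

The markdown chunks containing the actual exercises then follow this header.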

Every DataCamp exercise consists of different parts; read up about them here. An important part of DataCamp exercises is providing automated, personalized feedback to students. In R, these so-called Submission Correctness Tests (SCTs) are written with the testwhat package. Check out the wiki pages of its GitHub repository for more information and examples.

For more information check out the documentation on teaching at DataCamp.

Datasets

The data found in the 'datasets' folder of this repository are used in the DataCamp exercises. Files with data-related filename extensions in the 'datasets' folder are automatically uploaded to Amazon S3 servers.

The links to the currently used data files can be seen in the chapterX.Rmd files. The links to new files can be seen in the course build log under datacamp.com/teach. There is also information about uploading assets in the DataCamp teach documentation.


helsinki-open-data-science's People

Contributors

emmakamarainen, emmakarolina, gdrouard, tuomonieminen


helsinki-open-data-science's Issues

IODS-update-needed-with-tidyverse-in-last-chapter

This is an update that should be done ASAP (although the code still works).

It concerns the last chapter, 6. Analysis of longitudinal data, where we convert the data sets between long and wide formats.

Since creating this chapter (it was added later than the others and coded by Petteri Mäntymaa, one of the assistants and statistics students at the time), the tidyverse functions have been rewritten; see:

https://tidyr.tidyverse.org/

Especially this first point on that page ("Getting started"):

  • “Pivotting” which converts between long and wide forms. tidyr 1.0.0 introduces pivot_longer() and pivot_wider(), replacing the older spread() and gather() functions. See vignette("pivot") for more details.

So, our code was built before tidyr 1.0.0, using spread() and gather(). Those functions should be replaced by the pivoting functions above.

This should be fairly straightforward, I think. The DataCamp instructions must be revised, too (and the RStudio exercise checked).
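The change can be sketched with a toy data frame (the data and column names here are made up for illustration; requires tidyr >= 1.0.0):

```r
library(tidyr)

# Toy wide-format data: one row per subject, one column per week
wide <- data.frame(id = c(1, 2), week1 = c(5.1, 4.8), week2 = c(5.4, 5.0))

# Old (pre-1.0.0) style, now superseded:
# long <- gather(wide, key = "week", value = "value", week1:week2)

# New style: pivot_longer() replaces gather()
long <- pivot_longer(wide, cols = starts_with("week"),
                     names_to = "week", values_to = "value")

# And pivot_wider() replaces spread() for the reverse conversion
wide_again <- pivot_wider(long, names_from = "week", values_from = "value")
```

The pivot functions take the column mappings as explicit `names_to`/`values_to` and `names_from`/`values_from` arguments, which makes the direction of the conversion easier to read than the old `key`/`value` pairs.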

IODS-error-in-MCA-code-and-SOLUTION-by-Kimmo

From my own notes:


For some reason the MCA code in 5. Dimensionality reduction techniques did not work anymore. The problem is in the last exercise of the chapter.

I spent some time (last year) digging and testing, and found the solution!

The FactoMineR package had been updated (based on the GitHub commit history) last autumn:

cran/FactoMineR@5056929

and there are some changes in the function plot.MCA:

https://github.com/cran/FactoMineR/blob/master/R/plot.MCA.R

I noticed that the argument graph.type is handled a bit carelessly (it might also be a new argument). Its values are "ggplot" or "classic". I focused on that and noticed that this code of ours (which halts DataCamp!):

plot.MCA(mca, invisible=c("ind"), habillage = "quali")

will work perfectly, as soon as it is updated as follows:

plot.MCA(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

I tested this both with plain R and in the DataCamp window of IODS.

So, it seems this would be a very small fix, with an immediate positive result.

Check also the instructions in the DataCamp exercise and the RStudio exercise.

IODS-error-in-joining-datasets-and-SOLUTION-by-Reijo-Sund

This concerns the Logistic regression chapter:


Joining the datasets has not worked perfectly, as Reijo Sund noticed. See the detailed solution given by Reijo in his GitHub repository. This should be corrected in the DataCamp code and instructions, and in the RStudio exercise.

Some messages from the IODS2020 forum:

alc.txt data - Exercise 3
Anne P - Monday, 9 November 2020, 14:18
Number of replies: 4
Hi,

we were told today by Reijo "Please note that for Exercise #3 in Datacamp the joining of datasets is not perfect. Please see the following code to see that there are actually 370 unique individuals instead of 382 in the datasets"

If I take the data for the analysis from:

http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt
then there are 382 obs. of 35 variables. Is it okay to use that data?
I did the data wrangling part, but I am not sure if I did it correctly, so I would like to use the data that is actually correct for the analysis :)


Re: alc.txt data - Exercise 3
Reijo Sund - Monday, 9 November 2020, 16:38
If you want to use the data with 370 observations, do the wrangling part as shown in https://github.com/rsund/IODS-project/raw/master/data/create_alc.R. Actually that creates an Excel file, so you may want to save it as a .txt or .csv instead, or read the Excel file in the analysis part with the readxl::read_excel() function.

If you want to read the wrangled data directly, use the data available in https://github.com/rsund/IODS-project/raw/master/data/alc.csv.

You can also directly load the data in R:
alc <- readr::read_csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")

Please note that for the variables failures, paid, absences, G1, G2, and G3 there are also variables with extra .p and .m in their names, containing the original values from both datasets. You may consider whether there is a better way to combine those than calculating means (or taking the first values).


Re: alc.txt data - Exercise 3
Anne P - Monday, 9 November 2020, 17:52
Thank you for the answer!


Re: alc.txt data - Exercise 3
Andrei K - Wednesday, 11 November 2020, 09:52
Hi!
I joined the two data sets by creating a unique ID based on the variables given in the task ("school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet").
The same approach was used for the por and mat data sets. Then I excluded replicates within each data set and merged the two sets. If a student was present twice, both observations were removed.

This yields only 358 observations, not 370! If I do NOT exclude replicates, the number is 382, which matches the task but is not correct, according to the Monday meeting and your e-mail.

If we assume that your R code is correct, why is "join_cols" different from the joining variables given in the task?

The mess in the tasks and DataCamp code is consuming time =(


Re: alc.txt data - Exercise 3
Reijo Sund - Monday, 16 November 2020, 09:25

Read the metadata related to the datasets. There are some free variables and then common fixed variables. To join two datasets correctly, you need to take into account all common fixed variables, because there may be duplicate values in subsets of the common fixed variables. That is why unique identifiers, such as the personal identity code in Finland, its pseudonymized version, or a research number, would help a lot in joining datasets.

And in data wrangling it is very common that you need to deal with messy datasets. Unfortunately the DataCamp exercises were constructed before the problem was detected during last year's course. But that (the variables that should be used in the joining of the datasets) certainly should be corrected in the task description.

For the actual logistic regression part, any version of the data will be allowed (of course you get slightly different results, but still reasonably close to each other). Actually, it would be an interesting task to compare how much the results change between different versions of the wrangled data.
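The joining principle discussed in this thread can be sketched in base R (the toy data and column names below are made up for illustration; the real student data has many more common fixed variables):

```r
# Two toy "datasets" sharing fixed background variables, plus one
# free variable each (grades from the two courses)
por <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                  age = c(17, 16, 18), G3.p = c(11, 13, 10))
mat <- data.frame(school = c("GP", "MS"), sex = c("F", "F"),
                  age = c(17, 18), G3.m = c(12, 9))

# Join on ALL common fixed variables, not just a subset: a subset may
# contain duplicate value combinations and match different students
join_cols <- c("school", "sex", "age")
alc <- merge(por, mat, by = join_cols)
```

The same idea applies with dplyr's inner_join(); the essential point from the thread is the choice of `join_cols`, not the joining function.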


Mistake in k-means

Chapter 3:

The usage of the dist() function needs to be removed from the k-means chapters, because the kmeans() function already calculates distances.
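A minimal base-R sketch of the corrected call (using the built-in iris data for illustration; the chapter uses a different dataset):

```r
# Standardize the numeric columns before clustering
x <- scale(iris[, 1:4])

# Wrong: passing dist(x) makes kmeans() cluster the rows of the
# N x N pairwise-distance matrix instead of the observations
# km_bad <- kmeans(dist(x), centers = 3)

# Right: pass the (scaled) data matrix directly; kmeans() computes
# the distances to the cluster centers internally
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 20)
```

Note that kmeans(dist(x), ...) does not error out, which is why the mistake is easy to miss: as.matrix() silently turns the dist object into a full distance matrix, and the clustering quietly operates on the wrong input.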

Tests not working

Chapter 3:
In "Creating a factor variable" the tests cause errors; they need to be modified. Currently no tests are used in the chapter.
