Code Monkey home page Code Monkey logo

Comments (5)

thierrygosselin avatar thierrygosselin commented on June 5, 2024

Thanks Chris, I've made some change over the week end, might have broken something. Will check this afternoon.
Cheers
Thierry

from radiator.

thierrygosselin avatar thierrygosselin commented on June 5, 2024

Hi Chris,

It works on my end, but then I have the latest version of radiator from this week end (0.0.6).

I've also made a recent push. So might want to have this latest commit.

Best
Thierry

from radiator.

chollenbeck avatar chollenbeck commented on June 5, 2024

I took the most recent version this morning to be sure, but I'm still having the problem. I've now identified the source of the error in my dataset (see below). The issue involves the change_alleles() function with multiallelic datasets are input, particularly when the genotypes in the input data frame do not have a separator between the alleles (e.g. 001001).

I've also proposed a fix as well if this makes sense to you. I can submit a PR, but this function deals with so many input types that I wanted to make sure with you that I'm not fixing one case but breaking others.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Warning: package 'dplyr' was built under R version 3.4.2
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats


# A tribble that captures the state of the input df at this point in the function
input <- tibble::tribble(
  ~INDIVIDUALS, ~POP_ID, ~MARKERS, ~GT_HAPLO,
  "IND1",    "POP1",  "loc1", "000000",
  "IND2",    "POP1",  "loc1", "001002",
  "IND3",    "POP1",  "loc1", "003004"
)

# This is the original bit of code - lines 303-312 (0.0.6)
input <- input %>%
  dplyr::mutate(
    GT_VCF_NUC = dplyr::if_else(
      stringi::stri_detect_fixed(
        str = GT_HAPLO, pattern = "/"),
      GT_HAPLO,
      stringi::stri_join(GT_HAPLO, GT_HAPLO, sep = "/")),
    GT_VCF_NUC = stringi::stri_replace_na(str = GT_VCF_NUC, replacement = "./.")
  ) %>%
  dplyr::select(-GT_HAPLO)

# Note that the alleles are not split. Instead the entire 6 character genotype is treated as an allele.
# Also, missing data '000000' is not converted to NA.
input
#> # A tibble: 3 x 4
#>   INDIVIDUALS POP_ID MARKERS    GT_VCF_NUC
#>         <chr>  <chr>   <chr>         <chr>
#> 1        IND1   POP1    loc1 000000/000000
#> 2        IND2   POP1    loc1 001002/001002
#> 3        IND3   POP1    loc1 003004/003004

# Reset the input
input <- tibble::tribble(
  ~INDIVIDUALS, ~POP_ID, ~MARKERS, ~GT_HAPLO,
  "IND1",    "POP1",  "loc1", "000000",
  "IND2",    "POP1",  "loc1", "001002",
  "IND3",    "POP1",  "loc1", "003004"
)

# A modification that converts missing data to NA and splits the string into the two alleles
input <- input %>%
  dplyr::mutate(
    GT_HAPLO = replace(GT_HAPLO, which(GT_HAPLO == "000000"), NA),
    GT_VCF_NUC = dplyr::if_else(
      stringi::stri_detect_fixed(
        str = GT_HAPLO, pattern = "/"),
      GT_HAPLO,
      stringi::stri_join(stringi::stri_sub(str = GT_HAPLO, from = 1, to = 3), stringi::stri_sub(str = GT_HAPLO, from = 4, to = 6), sep = "/")),
    GT_VCF_NUC = stringi::stri_replace_na(str = GT_VCF_NUC, replacement = "./.")
  ) %>%
  dplyr::select(-GT_HAPLO)

# This is the desired output, and the function performs as it should after this point
input
#> # A tibble: 3 x 4
#>   INDIVIDUALS POP_ID MARKERS GT_VCF_NUC
#>         <chr>  <chr>   <chr>      <chr>
#> 1        IND1   POP1    loc1        ./.
#> 2        IND2   POP1    loc1    001/002
#> 3        IND3   POP1    loc1    003/004

from radiator.

thierrygosselin avatar thierrygosselin commented on June 5, 2024

Hi Chris, ok now I better understand the problem...
Most of these functions were recently modified but my test files for multi allelic datasets comes exclusively from RADseq data analyzed by DArT or within stacks (haplotypes file and vcf haplotype files) and all of these file format have nucleotide info, not 6 digits integers. But I used to be able to do it, as last year I was working a bit with microsatellite data.

Will have a look at your code and will also make some test with more input file, as you said they are many input file type, probably too many

Best
Thierry

from radiator.

thierrygosselin avatar thierrygosselin commented on June 5, 2024

Hi Chris, it should work now. I tested the function with a variety of multiallelic data set with/without nucleotide coding, with/without REF/ALT fields.

Let me know if this fixed your problem.
Best
Thierry

from radiator.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.