
fsspark's Introduction

bigbio documentation

This is the project documentation for all the pipelines and tools developed by the bigbio stack (bigbio.xyz) group. It documents each component in detail.

fsspark's People

Contributors

enriquea, ypriverol

Forkers

enriquea

fsspark's Issues

Benchmarking using the single-cell dataset

  • Benchmark the single-cell dataset again with the Feature selection R-package (feseR) previously developed.

    • Test on a single machine.
  • Benchmark the single-cell dataset on the following infrastructures using fsspark:

    • Single machine benchmark (preferably on a user laptop).
    • Spark cluster on a single node with multiple processors; benchmark with different processor counts (10, 20, 50, 100?).
    • Spark cluster with multiple nodes.
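
The processor-count comparison above could be driven by a small timing harness. The following is a minimal sketch under stated assumptions, not part of fsspark: `benchmark` and `dummy_task` are hypothetical names, and `dummy_task` stands in for a real feature selection run. For a single-node Spark run, the task would presumably create its session with a master such as `local[10]` for 10 cores.

```python
import time


def benchmark(task, core_counts, repeats=3):
    """Time `task(cores)` for each core count and keep the best of `repeats` runs.

    `task` is any callable that accepts the number of cores to use.
    Returns a dict mapping core count -> best wall-clock time in seconds.
    """
    results = {}
    for cores in core_counts:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            task(cores)
            timings.append(time.perf_counter() - start)
        results[cores] = min(timings)
    return results


def dummy_task(cores):
    # Hypothetical placeholder: a real task would run a feature selection
    # workflow on a Spark session configured with `cores` processors.
    sum(range(10_000))


# Processor counts taken from the issue text (10, 20, 50, 100).
results = benchmark(dummy_task, [10, 20, 50, 100])
for cores, seconds in results.items():
    print(f"{cores} cores: {seconds:.6f} s")
```

Taking the minimum over several repeats reduces the influence of one-off delays such as session start-up; reporting the mean or median instead is an equally reasonable choice.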

Multiple small issues from meeting 12/01/2024.

  • The main priority is the creation of the CPTAC dataset #2 #3 using the phosphoproteomics and acetylome data.
  • Review all the algorithms and see which ones load everything in memory and which ones parallelize all the computation.
  • Review other libraries that provide FS methods in Spark and reuse some of the algorithms.

Small issues:

  • Rename response to label in the file format.
  • The example should contain letters such as A, B, C rather than binary notation.
  • Make sure the small example always refers to gene expression and not protein expression. It is a GEO dataset.
  • Annotate, for every algorithm, whether it is provided by Spark or implemented by us.

Update project README

  • Create a README file in the repository where the dataset format and structure are described.
    - The description should include the structure of the Tab-Separated Values (TSV) file as the primary dataset input structure.
    - It should also cover the interface with the Spark DataFrame used for feature selection (the internal data structure used by the tool).
  • Add to the README the link to the single-cell example dataset we have been using to benchmark the algorithms.
    - Single-cell dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE156793
    - Include the formatted matrix used as input by the tool.
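
To make the README description concrete, the TSV layout could be illustrated with a tiny example. The sketch below is an assumption, not the confirmed fsspark format: the column names `sample_id` and `label`, the gene columns, and the helper `parse_fs_tsv` are all hypothetical. It assumes one row per sample, letter labels (A, B, C), and one column per feature, and it parses the file with Python's standard `csv` module.

```python
import csv
import io

# Hypothetical TSV layout: one row per sample, an identifier column,
# a class label column, and one column per feature (gene expression).
EXAMPLE_TSV = (
    "sample_id\tlabel\tGENE1\tGENE2\tGENE3\n"
    "s1\tA\t0.5\t1.2\t0.0\n"
    "s2\tB\t2.1\t0.3\t4.4\n"
    "s3\tC\t1.7\t0.9\t3.2\n"
)


def parse_fs_tsv(text, id_col="sample_id", label_col="label"):
    """Parse TSV text into (feature_names, rows).

    Each row is a dict with keys 'id', 'label' and 'features' (list of floats).
    The column names are assumptions made for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    for required in (id_col, label_col):
        if required not in header:
            raise ValueError(f"missing required column: {required}")
    # Every column that is not the identifier or the label is a feature.
    feature_names = [c for c in header if c not in (id_col, label_col)]
    rows = []
    for record in reader:
        rows.append({
            "id": record[id_col],
            "label": record[label_col],
            "features": [float(record[name]) for name in feature_names],
        })
    return feature_names, rows


feature_names, rows = parse_fs_tsv(EXAMPLE_TSV)
print(feature_names)
print(rows[0])
```

In PySpark, a file with this layout could presumably be loaded into a DataFrame with `spark.read.csv(path, sep="\t", header=True, inferSchema=True)`; how fsspark wraps that DataFrame internally is not specified here.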

Generate a dataset for feature selection using CPTAC phospho data.

We have tested the library and algorithms using single-cell data. However, we may use other types of data to see if the algorithms perform well.

  • Contact the CPTAC team to see if we can get help from them @ypriverol. If the CPTAC team can't help us, we should try to generate the dataset ourselves.
  • Explore other sources of phosphorylation information, including quantms.
  • Create the file format needed to benchmark the algorithms and the FS workflows.

Spark feature selection library for bigdata multiomics

The Spark feature selection library for bigdata multiomics is an evolution of a previous R-package developed by Enrique et al. The major steps to finalize the library are:

  • Create a README file in the repository where the dataset format and structure are described.
  • Add to the README the link to the single-cell example dataset we have been using to benchmark the algorithms.
  • Benchmark the single-cell dataset again with the Feature selection R-package previously developed.
  • Benchmark the single-cell dataset on the following infrastructures:
    • Single machine benchmark (preferably on a user laptop).
    • Spark cluster on a single node with multiple processors; benchmark with different processor counts (10, 20, 50, 100?).
    • Spark cluster with multiple nodes.
  • Contact the CPTAC team to get the list of phospho-sites with spectral counts across the different cancer and tumor types. @ypriverol #3
    • Perform the same benchmarks previously done for single-cell dataset.
  • Create a readthedocs for the project.
  • Implement the framework of algorithms:
    • Implement the independent feature selection algorithms: RF, correlation analysis, PCA.
    • Implement different workflows combining multiple FS algorithms. Name them.
    • Provide a group of command-line tools that enable access to some of the workflows for a given dataset file.
  • Discuss the given results and write a publication.
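
The idea of workflows that combine multiple FS algorithms can be sketched as a chain of filter steps. This is a minimal pure-Python illustration under stated assumptions, not fsspark's API and not a Spark implementation: `variance_filter`, `correlation_filter` and `run_workflow` are hypothetical names, and a simple variance filter is swapped in as the first step in place of the RF and PCA methods listed above.

```python
from statistics import mean, pvariance


def variance_filter(columns, threshold):
    """Keep features whose population variance is greater than `threshold`."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_filter(columns, threshold):
    """Drop a feature if it is highly correlated with an already kept one."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept


def run_workflow(columns, steps):
    """Apply each step in order; every step maps a feature dict to a feature dict."""
    for step in steps:
        columns = step(columns)
    return columns


# Toy feature matrix stored column-wise (feature name -> values per sample).
data = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],  # nearly collinear with g1
    "g3": [5.0, 5.0, 5.0, 5.0],  # constant, removed by the variance step
    "g4": [4.0, 1.0, 3.0, 2.0],
}

selected = run_workflow(data, [
    lambda c: variance_filter(c, 0.0),
    lambda c: correlation_filter(c, 0.9),
])
print(list(selected))
```

Running the variance step first matters here: it removes constant features, for which the Pearson correlation in the second step would be undefined.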

Create a readthedocs for the project

  • The readthedocs for the project should include:
    • General description.
    • Description of the data structures.
    • Description of the methods for data pre-processing (e.g., imputation and normalization).
    • Univariate feature selection methods supported.
    • Multivariate feature selection methods supported.
    • Machine learning algorithms available.
    • Description of predefined FS workflows.
    • Examples (HOWTO).
