bigbio documentation

This is the project documentation for all the pipelines and tools developed by the bigbio stack (bigbio.xyz) group. It describes each component in detail.

fsspark's Issues

Benchmarking using the single-cell dataset

Benchmark the single-cell dataset again with feseR, the feature selection R package developed previously.
- Test on a single machine.
Benchmark the single-cell dataset with fsspark on the following infrastructures:
- Single machine (preferably a user laptop).
- Spark cluster with a single node and multiple processors; benchmark with several processor counts (10, 20, 50, 100?).
- Spark cluster with multiple nodes.
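
To keep timings comparable across the setups listed above, a small timing harness could wrap each run. The following is a minimal sketch using only the Python standard library; `benchmark` and `placeholder_feature_selection` are hypothetical names introduced here for illustration and are not part of feseR or fsspark.

```python
import statistics
import time


def benchmark(func, *args, repeats=3, **kwargs):
    """Run func several times and summarize wall-clock timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "repeats": repeats,
        "min_s": min(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }


def placeholder_feature_selection(n_features):
    # Hypothetical stand-in for a real feature selection run;
    # it only sorts indices so that there is some work to time.
    return sorted(range(n_features), key=lambda i: -i)[:10]


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, benchmark(placeholder_feature_selection, n))
```

In a real benchmark the placeholder would be replaced by the actual pipeline call, and the same harness would be run once per infrastructure so that the reported numbers share one definition.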

Multiple small issues from meeting 12/01/2024.

The main priority is the creation of the CPTAC dataset #2 #3 using the phosphoproteomics and acetylome data.
Review all the algorithms and determine which ones load everything into memory and which ones parallelize all of the computation.
Review other libraries that provide FS methods in Spark and reuse some of their algorithms.

Small issues:

The response field in the file format must be renamed to label.
The example should contain letters such as A, B, C rather than binary notation.
Make sure the small example always refers to gene expression and not protein expression; it is a GEO dataset.
Annotate, for every algorithm, whether it is provided by Spark or implemented by us.
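
As an illustration of the first two items, the sketch below renames a header column from response to label in a small TSV whose class labels are letters. It assumes that response is a column name in the header row; the sample IDs, gene names and values are made up for the example and do not come from the GEO dataset.

```python
import csv
import io


def rename_response_to_label(tsv_text):
    """Return the TSV text with a 'response' header column renamed to 'label'."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    # Only the header row is touched; data rows are written back unchanged.
    rows[0] = ["label" if name == "response" else name for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative rows only: letter labels (A, B, C) and gene expression columns.
example = (
    "sample_id\tresponse\tGENE_1\tGENE_2\n"
    "s1\tA\t5.1\t0.0\n"
    "s2\tB\t3.2\t1.4\n"
    "s3\tC\t0.7\t2.9\n"
)
converted = rename_response_to_label(example)
print(converted)
```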

Update project README

Create a README file in the repository that describes the dataset format and structure.
- The description should include the structure of the Tab-Separated Values (TSV) file, which is the primary input dataset structure.
- It should also describe the interface with the Spark DataFrame used for feature selection (the internal data structure used by the tool).
Add to the README dataset file the link to the single-cell example we have been using to benchmark the algorithms.
- Single-cell dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE156793
- Include the formatted matrix used as input by the tool.
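
The README could illustrate the TSV layout with a short reader. The sketch below assumes one row per sample, an identifier column named sample_id, a class column named label, and one numeric column per feature; these column names are assumptions for illustration, not the confirmed fsspark format, and the reader uses only the standard library rather than a Spark DataFrame.

```python
import csv
import io


def read_fs_tsv(handle, id_col="sample_id", label_col="label"):
    """Parse a TSV into sample IDs, labels, feature names and a numeric matrix."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column that is not the ID or the label is treated as a feature.
    features = [c for c in reader.fieldnames if c not in (id_col, label_col)]
    ids, labels, matrix = [], [], []
    for row in reader:
        ids.append(row[id_col])
        labels.append(row[label_col])
        matrix.append([float(row[f]) for f in features])
    return ids, labels, features, matrix


# Made-up rows for illustration; real input would be the formatted matrix.
tsv_example = io.StringIO(
    "sample_id\tlabel\tGENE_1\tGENE_2\n"
    "s1\tA\t5.1\t0.0\n"
    "s2\tB\t3.2\t1.4\n"
)
ids, labels, features, matrix = read_fs_tsv(tsv_example)
print(ids, labels, features, matrix)
```

Separating identifiers, labels and the feature matrix at load time mirrors what a DataFrame-based interface would need: a label column for supervised methods and a fixed, named set of feature columns.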

Generate a dataset for feature selection using CPTAC phospho data.

We have tested the library and algorithms using single-cell data. However, we may use other types of data to see if the algorithms perform well.

Contact the CPTAC team to see if we can get help from them (@ypriverol). If the CPTAC team can't help us, we should try to generate the dataset ourselves.
Explore other sources of phosphorylation information, including quantms.
Create the file format needed to benchmark the algorithms and the FS workflows.

Spark feature selection library for bigdata multiomics

The Spark feature selection library for bigdata multiomics is an evolution of a previous R package developed by Enrique et al. The major steps to finalize the library are:

Create a readthedocs for the project

The readthedocs for the project should include:
- General description.
- Description of the data structures.
- Description of the methods for data pre-processing (e.g., imputation and normalization).
- Univariate feature selection methods supported.
- Multivariate feature selection methods supported.
- Machine learning algorithms available.
- Description of predefined FS workflows.
- Examples (HOWTO).
