Code Monkey home page Code Monkey logo

valencia's Introduction

Tutorial for VALENCIA

Author: Michael France

Please contact the author if you have any questions or suggestions. Thank you and Enjoy!

  VALENCIA is a nearest-centroid based algorithm for the classification of human vaginal microbial community into state types based on their taxonomic composition. Samples are individually assigned to community state types (CSTs) based on their similarity to a set of thirteen reference centroids. The assignments provided by VALENCIA are reproducible and comparable across studies. Reference centroids were identified and defined using a dataset of >13,000 vaginal microbial taxonomic compositions from >1,900 North American women. We have done fairly extensive work to validate the usage of VALENCIA on vaginal microbial communities (see publication) including those that were derived from sequencing the V1V3, V3V4 and V4 regions. We have also looked at how it performs on samples from adolescent girls and post menopausal women, as well as on reproductive age African women.


Required dependencies

Python 3.0+ with pandas, numpy, and sys modules

Taxa names need to match used by VALENCIA for proper CST assignment. We use species level assignments for the Lactobacillus, Gardnerella, Prevotella, Atopobium and Sneathia. These appear as “Genus_species” format (e.g. “Lactobacillus_crispatus”. All other taxa are summarized to the genus or higher level. These appear as “g_Genus” or “f_Family” (e.g. “g_Bifidobacterium”). It is important that the major vaginal taxa’s names match the reference. This includes any critical in the definition of a CST, which appear below. Prevotella often causes problems due variations in naming conventions. If you are having difficulty with this taxa it is okay to combine all the data into "g_Prevotella", just make sure to use the reference centroid with the matching change .


Running VALENCIA

VALENCIA takes as input:

  • (1) the reference centroids file (provided)

  • (2) CSV file of the sample dataset with column 1 named “sampleID” containing unique sample names and column 2 named “read_count” containing each sample’s total read count. The remaining columns are expected to be taxa read counts, with the taxa name as the header.

The script is run as follows:

python /path/to/Valencia_v1.py /path/to/CST_profiles_012920.csv /path/to/test_dataset.csv

The output will be a file named something like “test_dataset_VALENCIA_CST.csv”, all of the added CST information will appear as the final columns of the dataset.


CST architecture

-CST I communities are dominated by L. crispatus and include subtypes

  • CST I-A almost completely L. crispatus
  • CST I-B less L. crispatus but still majority

-CST II communities are dominated by L. gasseri

-CST III communities are dominated by L. iners and include subtypes

  • CST III-A almost completely L. iners
  • CST III-B less L. iners but still majority

-CST IV communities that have a low relative abundance of Lactobacillus spp. and include subtypes

  • CST IV-A contain a high to moderate relative abundance of BVAB1 and G. vaginalis

  • CST IV-B contain a high to moderate relative abundance of G. vaginalis

  • CST IV-C do not contain G. vaginalis, BVAB1, or Lactobacillus spp. and include

    • CST IV-C0 relatively even community with Prevotella spp.
    • CST IV-C1 dominated by Streptococcus spp.
    • CST IV-C2 dominated by Enterococcus spp.
    • CST IV-C3 dominated by Bifidobacterium spp.
    • CST IV-C4 dominated by Staphylococcus spp.

-CST V communities are dominated by L. jensenii

valencia's People

Contributors

michaelfrance avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.