biochat's Introduction

Installation (for developers only)

In addition to Quicklisp, you'll need to clone crawlik into your home directory.

To use PubMed word vectors, evaluate (pushnew :use-pubmed *features*) before loading the biochat system.
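The order matters because feature keywords are usually tested at read time. The sketch below only illustrates that mechanism: word-vector-source is a hypothetical helper, not a Biochat function, and how Biochat itself checks :use-pubmed is an assumption here.

```lisp
;; Push the feature first, so the reader sees it when later forms are read.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (pushnew :use-pubmed *features*))

;; Hypothetical helper (not part of Biochat): a read-time conditional keeps
;; one branch depending on whether :use-pubmed is present in *features*.
(defun word-vector-source ()
  #+use-pubmed :pubmed
  #-use-pubmed :none)

;; With Quicklisp installed and the system visible to ASDF, loading would
;; then look like this:
;; (ql:quickload :biochat)
```

If the feature were pushed only after the system had been loaded, any code guarded this way would already have been read without it.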

NLP in action (command-line example for developers)

Here is a sample Biochat run that takes record #10 from the Gene Expression Omnibus (GEO) database as input:

#S(GEO-REC
   :ID 10
   :TITLE "Type 1 diabetes gene expression profiling"
   :SUMMARY "Examination of spleen and thymus of type 1 diabetes nonobese diabetic (NOD) mouse, four NOD-derived diabetes-resistant congenic strains and two nondiabetic control strains."
   :ORGANISM "Mus musculus")

Here is the output using two separate approaches (vec-closest-recs and tree-closest-recs, both discussed in the section How it works):

B42> (subseq (vec-closest-recs (? *geo-db* 0)) 0 3)
(#S(GEO-REC
     :ID 5167
     :TITLE "Type 2 diabetic obese patients: visceral adipose tissue CD14+ cells"
     :SUMMARY "Analysis of visceral adipose tissue CD14+ cells isolated from obese, type 2 diabetic patients. Obesity is marked by changes in the immune cell composition of adipose tissue. Results provide insight into the molecular basis of proinflammatory cytokine production in obesity-linked type 2 diabetes."
     :ORGANISM "Homo sapiens")
  #S(GEO-REC
     :ID 4191
     :TITLE "NZM2410-derived lupus susceptibility locus Sle2c1: peritoneal cavity B cells"
     :SUMMARY "Analysis of peritoneal cavity B cells (B1a) and splenic B (sB) cells from B6.Sle2c1 mice. Sle2 induces expansion of the B1a cell compartment, a B cell defect consistently associated with lupus. Results provide insight into molecular mechanisms underlying susceptibility to lupus in the NZM2410 model."
     :ORGANISM "Mus musculus")
  #S(GEO-REC
     :ID 437
     :TITLE "Heart transplants"
     :SUMMARY "Examination of immunologic tolerance induction achieved in cardiac allografts from BALB/c to C57BL/6 mice by daily intraperitoneal injection of anti-CD80 and anti-CD86 monoclonal antibodies (mAbs)."
     :ORGANISM "Mus musculus"))

B42> (subseq (tree-closest-recs (? *geo-db* 0)) 0 3)
(#S(GEO-REC
    :ID 471
    :TITLE "Malaria resistance"
    :SUMMARY "Examination of molecular basis of malaria resistance. Spleens from malaria resistant recombinant congenic strains AcB55 and AcB61 compared with malaria susceptible A/J mice."
    :ORGANISM "Mus musculus")
 #S(GEO-REC
    :ID 4258
    :TITLE "THP-1 macrophage-like cells response to W-Beijing Mycobacterium tuberculosis strains: time course"
    :SUMMARY "Temporal analysis of macrophage-like THP-1 cell line infected by Mycobacterium tuberculosis (Mtb) W-Beijing strains and H37Rv. Mtb W-Beijing sublineages are highly virulent, prevalent and genetically diverse. Results provide insight into host macrophage immune response to Mtb W-Beijing strains."
    :ORGANISM "Homo sapiens")
 #S(GEO-REC
    :ID 4966
    :TITLE "Active tuberculosis: peripheral blood mononuclear cells"
    :SUMMARY "Analysis of PBMCs isolated from patients with active pulmonary tuberculosis (PTB) and latent TB infection (LTBI). Results provide insight into identifying potential biomarkers that can distinguish individuals with PTB from LTBI."
    :ORGANISM "Homo sapiens"))

Record #10 ("Type 1 diabetes gene expression profiling") is a mouse diabetes record from spleen and thymus, which are organs where immunological tolerance is frequently studied. Even though no explicit mention of "immunological tolerance" is made in record #10, Biochat correctly pairs it with record #437 (where "immunological tolerance" is explicitly stated in the Summary). Likewise, record #10 is nicely paired with record #5167 ("Type 2 diabetic obese patients: visceral adipose tissue CD14+ cells"), which is from a different model organism (human) but involves an immunological study (CD14+ cells) from diabetic patient samples.

How it works

The data is obtained by web scraping using the project crawlik, which should be cloned from GitHub prior to loading Biochat. The crawled data from GEO is stored as text files in the data/GEO/GEO_records directory and in memory in the variable *geo-db*. Here's an example record:

TITLE
Na,K-ATPase alpha 1 isoform reduced expression effect on hearts

SUMMARY
Expression profiling of hearts from 8 to 16 week old adult males lacking one copy of the Na,K-ATPase alpha 1 isoform.  Na,K-ATPase alpha 1 isoform expression is reduced by half in heterozygous null mutants.  Results provide insight into the role of the Na,K-ATPase alpha 1 isoform in the heart.

ORGANISM
Mus musculus
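A record file in this layout can be read into a structure with the same fields as the ones printed in the sample run. The following is a minimal sketch, not Biochat's actual loader: the struct definition, split-lines, and parse-geo-text are hypothetical names introduced for illustration.

```lisp
;; Hypothetical struct mirroring the fields printed in the sample run.
(defstruct geo-rec id title summary organism)

(defun split-lines (text)
  "Split TEXT into a list of whitespace-trimmed lines."
  (with-input-from-string (in text)
    (loop for line = (read-line in nil)
          while line
          collect (string-trim '(#\Space #\Tab #\Return) line))))

(defun parse-geo-text (id text)
  "Parse TEXT in the TITLE/SUMMARY/ORGANISM layout into a GEO-REC."
  (let ((fields (make-hash-table :test 'equal))
        (current nil))
    (dolist (line (split-lines text))
      (cond ((member line '("TITLE" "SUMMARY" "ORGANISM") :test #'string=)
             ;; A header line switches the field being filled.
             (setf current line))
            ((and current (plusp (length line)))
             ;; Join multi-line field bodies with a single space.
             (setf (gethash current fields)
                   (let ((old (gethash current fields)))
                     (if old (concatenate 'string old " " line) line))))))
    (make-geo-rec :id id
                  :title (gethash "TITLE" fields)
                  :summary (gethash "SUMMARY" fields)
                  :organism (gethash "ORGANISM" fields))))
```

The record ID is passed in separately because the example file shown above does not contain it.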

The purpose of this tool is to find related/similar records using different approaches. This is implemented in the generic function geo-group that processes the GEO database into a number of groups of related records. It has a number of methods:

  1. Match based on the same histone (the list of known histones is read from a text file).
  2. Match based on the same organism.
  3. Match based on synonyms obtained from the biological PubData wordnet database (read from a JSON file).
  4. Other possible simple match methods may be implemented.
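The "one generic function, several methods" design can be sketched as follows for the organism case. This is an illustration under assumptions: the real signature of geo-group is not shown in this document, so the sketch uses a hypothetical generic function group-records and a hypothetical rec struct.

```lisp
;; Hypothetical record type; Biochat's real records carry more fields.
(defstruct rec id title organism)

(defgeneric group-records (criterion records)
  (:documentation "Split RECORDS into groups of related records by CRITERION."))

(defmethod group-records ((criterion (eql :organism)) records)
  "Group records that share the same organism string."
  (let ((groups (make-hash-table :test 'equal)))
    (dolist (r records)
      (push r (gethash (rec-organism r) groups)))
    ;; Return an alist of (organism . records), restoring input order.
    (loop for organism being the hash-keys of groups
            using (hash-value members)
          collect (cons organism (reverse members)))))
```

A further criterion, such as a shared histone, would be added as one more method specialized on a different keyword, without touching the existing ones.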

Another approach to matching is via vector space representations. Each record is transformed into a vector using the pre-calculated vectors for each word in its description (either all fields, or just summary, or summary + title). The vectors used are PubMed vectors.

The combination of individual word vectors may be performed in several ways. The most straightforward approach (implemented in the library) is direct aggregation, in which a document vector is a normalized sum of the vectors for its words. Additional weighting may be applied to words from different parts of the document (summary, title, ...). Another possible aggregation approach is to use the doc2vec PV-DM algorithm. The function text-vec produces an aggregated document vector from individual PubMed vectors.
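Direct aggregation can be illustrated with a small sketch. It is not the text-vec implementation: the toy three-dimensional vectors, *toy-vectors*, normalize, and aggregate-words are all hypothetical, and per-field weighting is left out.

```lisp
;; Toy 3-dimensional word vectors; real PubMed vectors are much larger.
(defparameter *toy-vectors*
  (let ((table (make-hash-table :test 'equal)))
    (setf (gethash "diabetes" table) #(1.0 0.0 0.0)
          (gethash "spleen" table)   #(0.0 1.0 0.0)
          (gethash "mouse" table)    #(0.0 0.0 2.0))
    table))

(defun normalize (vec)
  "Return VEC scaled to unit length (or unchanged if it is all zeros)."
  (let ((norm (sqrt (reduce #'+ (map 'vector (lambda (x) (* x x)) vec)))))
    (if (zerop norm)
        vec
        (map 'vector (lambda (x) (/ x norm)) vec))))

(defun aggregate-words (words &key (dim 3) (vectors *toy-vectors*))
  "Sum the vectors of the known WORDS and normalize the result."
  (let ((sum (make-array dim :initial-element 0.0)))
    (dolist (word words)
      (let ((vec (gethash word vectors)))
        (when vec                       ; skip out-of-vocabulary words
          (map-into sum #'+ sum vec))))
    (normalize sum)))
```

For example, (aggregate-words '("diabetes" "spleen")) yields a unit vector pointing equally along the first two axes. Normalizing keeps long and short descriptions on the same scale.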

The obtained document vectors may be matched using various similarity measures. The most common are cosine similarity (cos-sim) and Euclidean distance-based similarity (euc-sim). Unlike geo-group, vector-space modeling results in a continuous space, in which it is unclear how to separate individual groups of related vectors. That's why an alternative approach is taken: arrange record vectors in terms of proximity to a given record. This is done with the functions:

  • vec-closest-recs sorts the aggregated document vectors directly by the similarity measure (cos-sim, euc-sim, etc.).
  • tree-closest-recs finds the closest records based on the pre-calculated hierarchical clustering (performed with the UPGMA algorithm using the cosine similarity measure). The results of clustering are stored in a text file.
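The direct-sorting idea behind vec-closest-recs can be sketched as below. The function names are hypothetical and deliberately differ from the library's cos-sim and euc-sim. The 1 / (1 + distance) form of the Euclidean similarity is an assumed convention, not necessarily the one Biochat uses.

```lisp
(defun dot (a b)
  "Dot product of two vectors of equal length."
  (reduce #'+ (map 'vector #'* a b)))

(defun cosine-similarity (a b)
  "Cosine of the angle between A and B; 0 if either is a zero vector."
  (let ((denom (* (sqrt (dot a a)) (sqrt (dot b b)))))
    (if (zerop denom) 0.0 (/ (dot a b) denom))))

(defun euclidean-similarity (a b)
  "Assumed convention: 1 / (1 + Euclidean distance), so identical vectors score 1."
  (let ((diff (map 'vector #'- a b)))
    (/ 1.0 (+ 1.0 (sqrt (dot diff diff))))))

(defun closest (query candidates &key (measure #'cosine-similarity))
  "Return CANDIDATES, a list of (id . vector) pairs, sorted most similar first."
  ;; COPY-LIST because SORT is destructive.
  (sort (copy-list candidates) #'>
        :key (lambda (pair) (funcall measure query (cdr pair)))))
```

Sorting every record against the query is linear in the size of the database for each lookup. A pre-calculated clustering, as used by tree-closest-recs, trades that cost for a one-time computation.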

Contact

You are welcome to:

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Acknowledgements

This software exists thanks to the amazing work done by MANY people in the open source community of biological databases (GEO, PubMed, etc.). Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

Citation

https://doi.org/10.1101/480020

biochat's People

Contributors

bohdan-khomtchouk, vseloved

biochat's Issues

"Look only in these library strategies:" panel

Remove the "SELEX", "WGS", and the empty artifact. Check how many records are in SELEX and WGS first before removing and post this info here (with a date/timestamp when you retrieved these records).

Replace the capitalizations and change to nicely formatted:

"Microarray", "Bisulfite-seq", "ChIP-seq", "RNA-seq", "RIP-seq", "DNase-seq", "ATAC-seq", "miRNA-seq", "MNase-seq", "Tn-seq", "FAIRE-seq", "MeDIP-seq", "ncRNA-seq", "MRE-seq", "MBD-seq", "Hi-C", "ssRNA-seq", "FL-cDNA"

Remove "Same library strategy" as redundant

Just keep "Same sequencing type" in the "Choose additional filters:" section, and rename "Look only in these library strategies" to "Look only in these sequencing types".

GDS: "Technology type: in situ oligonucleotide" --> microarray

Most of the GDS entries are "Technology type: in situ oligonucleotide", which means it's a microarray sample. We should add this to the metadata field. Just scrape the Technology type metadata field for all GDS entries and replace "in situ oligonucleotide" with "microarray" in the UI.

code indentation

Why is the code indented the way it is? Is an automatic indenting editor used?
Just a question.
