cvisb / cvisb_data
Data portal and API for Center for Viral Systems Biology (CViSB) data
Home Page: https://data.cvisb.org/home
License: MIT License
Currently, when the user navigates away from the patient page to another page, the filters persist. Either clear the filters on navigation or reflect them in the URL.
Also change/parse the URL on filter application.
Create dataset, data downloads, experiments, and patient metadata records of SARS-CoV-2 systems serology data from the Alter lab to upload into CViSB data portal
Will require @juliamullen to save SARS-CoV-2 .fasta sequence data on the backend and @flaneuse to revamp the /dataset front-end.
Related to #63
Facet size limit is set to 10,000; we may begin to approach it when querying for all patient IDs, etc.
Should have HLA? https://data.cvisb.org/patient/G15-756333
https://data.cvisb.org/patient/S-2077930
sero data not listed? https://data.cvisb.org/patient/G11-552352
Create dataset, data downloads, experiments, and patient metadata records of SARS-CoV-2 longitudinal repertoire sequencing data from the Briney lab to upload into CViSB data portal (related to: https://science.sciencemag.org/content/early/2020/06/15/science.abc7520)
If we don't allow search engines to crawl the page content, I think there is no reason to have them in our sitemap.xml? Thinking specifically of these lines...
<url><loc>https://data.cvisb.org/sample</loc></url>
<url><loc>https://data.cvisb.org/upload</loc></url>
<url><loc>https://data.cvisb.org/upload/dataset</loc></url>
<url><loc>https://data.cvisb.org/upload/patient</loc></url>
<url><loc>https://data.cvisb.org/upload/sample</loc></url></urlset>
Easy change to make of course, but just want to be sure I'm not missing something...
Periodically (when private --> public function called?) generate static stats for HLA data on the backend; store and serve to the front-end. Stats will be generated by an R script.
Right now, the public-ification script has to be called manually to synchronize the public data with the private data. This should be scheduled to sync automatically.
Aggregation within Elasticsearch is much more powerful than what is currently available within the Biothings package; the first useful thing to port over would be COUNT DISTINCT functionality. Averages, medians, etc. might also be useful.
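For reference, Elasticsearch exposes COUNT DISTINCT as the `cardinality` aggregation (with `avg` / `percentiles` covering averages and medians). A minimal sketch of the request body, assuming a keyword field like `patientID.keyword` (the field name is illustrative, not necessarily the real mapping):

```python
def count_distinct_body(field: str) -> dict:
    """Build an ES _search body that returns only the distinct count for `field`."""
    return {
        "size": 0,  # no hits, just the aggregation result
        "aggs": {
            "distinct_count": {
                "cardinality": {"field": field}
            }
        },
    }

# The body would be POSTed to <es-host>/<index>/_search; the distinct count
# then comes back under response["aggregations"]["distinct_count"]["value"].
body = count_distinct_body("patientID.keyword")
```

Note that `cardinality` is approximate (HyperLogLog++ under the hood), which is usually fine for facet-style counts.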
Just starting a thread to track notes on whether CViSB datasets are being indexed on Google Dataset Search.
Currently, there are five datasets on data.cvisb.org (all listed in https://data.cvisb.org/assets/sitemap.xml):
Two datasets are indexed (SARS-CoV-2, HLA) (https://datasetsearch.research.google.com/search?query=site%3Adata.cvisb.org)
Google Search Console reports 1 error, 0 "valid with warning" and 0 "valid" (https://search.google.com/search-console/datasets?resource_id=https%3A%2F%2Fdata.cvisb.org%2F). Oddly, the one error is for the HLA dataset (one of the successfully-indexed datasets). The error relates to having an object of type `Organization` under `Citation`.
Using the Rich Results Testing tool, that error shows up for 3 datasets (Ebola, Lassa, HLA) -- of those three, HLA is successfully indexed in Google Dataset Search. Two datasets (SARS-CoV-2 and systems serology) show up as "Page is eligible for rich results", but only systems serology is successfully indexed. The URL inspection tool on Google Search Console confirms that the datasets are successfully detected -- I just requested re-indexing in the hopes that those datasets will show up in Google Dataset Search (but I seem to recall doing this before).
And one last note: at different times, I have seen all five datasets successfully indexed and also three datasets successfully indexed. As far as I know, we have not changed anything on our end that would explain those changes. From now on, I'll try to track that here...
Currently, schema_conversion.py cannot validate Experiment:data. Ideal behavior: Experiment:data will be of type HLAData OR ViralSeqData OR PiccoloData...
Right now, it's confused because it doesn't know which schema to validate against; it successfully validates against the generic Data schema and against PiccoloData (since PiccoloData has no required properties).
Solution: require within each data schema one `@type` property with a unique value for each type, so each data instance will in effect route to the proper schema for validation. Remove the generic Data schema.
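The routing idea above can be sketched in a few lines: pick the schema by the instance's `@type` and validate against that schema only. The schema names and required properties below are hypothetical placeholders, not the real CViSB schemas:

```python
# Hypothetical registry mapping each unique @type value to its schema.
DATA_SCHEMAS = {
    "HLAData": {"required": ["@type", "allele"]},
    "ViralSeqData": {"required": ["@type", "virus"]},
    "PiccoloData": {"required": ["@type"]},
}

def validate_data(instance: dict) -> str:
    """Route `instance` to the schema named by its @type; check required props."""
    dtype = instance.get("@type")
    schema = DATA_SCHEMAS.get(dtype)
    if schema is None:
        raise ValueError(f"Unknown @type: {dtype!r}")
    missing = [p for p in schema["required"] if p not in instance]
    if missing:
        raise ValueError(f"{dtype} instance missing required properties: {missing}")
    return dtype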
Check `releaseDate` in /experiment data; if null or after today, pop that record from the public index.
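A minimal sketch of that check, assuming each experiment record carries an ISO-formatted `releaseDate` string (the field name comes from this issue; the exact date format is an assumption):

```python
from datetime import date

def publicly_releasable(records, today=None):
    """Keep only records whose releaseDate is non-null and not in the future."""
    today = today or date.today()
    keep = []
    for rec in records:
        release = rec.get("releaseDate")
        if release is None:
            continue  # no release date: keep it out of the public index
        if date.fromisoformat(release) > today:
            continue  # embargoed until a future date
        keep.append(rec)
    return keep
```

Run periodically (e.g. as part of the private-to-public sync), this would also pick up records whose embargo has just lapsed.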
Currently, schema_conversion.py only validates public fields.
https://data.cvisb.org/api/patient/query?q=__all__&size=10&experimentQuery=includedInDataset:hla
https://data.cvisb.org/api/patient/query?q=__all__&size=10&experimentQuery=includedInDataset:viral-seq AND includedInDataset:hla
When you apply a filter to data.cvisb.org/patient, the URL should change to a permanent URL reflecting those filters.
Uploading large chunks of data is a pain, since there's no good way to queue the data to be uploaded, and due to the complexity of the .json validation before ES insertion, 300 records take ~5 min to upload.
There are at least a few limits to queuing large amounts of data.
Ideally, we could queue a bunch of records and let it do its thing overnight. This may involve moving away from the front-end interface, but we'll still have problems with the multiprocessing inserting duplicates.
Create dataset, data downloads, experiments, and patient metadata records of SARS-CoV-2 sequences to upload into CViSB data portal
main-es2015.6ad4d7d983597843effc.js:1 ERROR TypeError: Cannot read properties of undefined (reading 'blurry_vision')
Steps to reproduce:
For SARS-CoV-2 seq patients, we need to be able to display state / city info.
Complication: within `homeLocation`, we can't make the admin3 / city data available to the public. Requires rewriting either the schema or the private-to-public Python function.

Cross-link the SARS-CoV-2 datasets into outbreak.info resources metadata.
All links on the home page "cards" (https://data.cvisb.org/home) are broken. e.g., the link to https://data.cvisb.org/home/dataset should be https://data.cvisb.org/dataset (to match the correct link in the header). Assuming this has to do with the reassignment of the home page to "/home" so the root will redirect to cvisb.org/data...
Related to #38 (front-end problems with big uploads). Occasionally, when there are a large number of documents to be added to the backend, the upload will fail, returning initially a `520` error followed by some `503` ones. It becomes challenging to decipher which records were successfully uploaded and which failed, as well as to understand why the upload failed.
Happens only for DRB4 and DRB5
Google's crawler is apparently getting confused between our production site at data.cvisb.org and our dev site at dev.cvisb.org. Specifically, when it finds two of the same datasets under the same domain name, it is not predictable which dataset URL actually gets indexed (see screenshot below). The solution (hopefully) will be to create dev.cvisb.org/robots.txt and disallow all crawlers.
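The disallow-all robots.txt for the dev site would be just two lines:

```text
User-agent: *
Disallow: /
```

Production (data.cvisb.org) keeps its normal robots.txt so it stays crawlable.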
related to #67
Currently, if there are multiple cross-endpoint queries, only the second one gets executed. Requires rewriting the query parser to separate and combine queries. At least at first, queries will probably all be AND'd.
Returns 1832 results as of 2 December 2019:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag AND elisa.ELISAresult.keyword:negative AND elisa.timepoint.keyword:"patient admission"]]
Returns 369 results as of 2 December 2019
https://data.cvisb.org/api/patient/query?q=__all__&experimentQuery=includedInDataset:hla
ERROR: combined query only returns second query:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa%20AND%20elisa.assayType.keyword:Ag%20AND%20elisa.ELISAresult.keyword:negative%20AND%20elisa.timepoint.keyword:%22patient%20admission%22]]&experimentQuery=includedInDataset:hla
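The combining step could be sketched roughly as follows, assuming each cross-endpoint parameter arrives as its own query string and that (at least at first) they're all AND'd together; the wrapping in parentheses keeps each sub-query's internal ANDs/ORs from bleeding into its neighbors:

```python
def combine_queries(queries):
    """AND together multiple query strings into one Lucene-style query_string."""
    parts = [f"({q})" for q in queries if q]
    return " AND ".join(parts)

# e.g. the two cross-endpoint queries from the failing URL above
# (the ELISA sub-query is abbreviated here for readability):
combined = combine_queries([
    "includedInDataset:hla",
    "elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag",
])
# -> "(includedInDataset:hla) AND (elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag)"
```

This only covers the flat case; the nested ELISA queries would still need their special parsing before being combined.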
Rather than copying/pasting the public field names into config_cvisb_endpoints.py, save the output of schema_conversion.py to a config file that gets referenced in the endpoints config file (or something else that avoids manual copy/pasting).
Currently, ELISA data is attached to /patient data; however, it makes more sense to define ELISA data as separate Experiments in /experiment. The cross-endpoint `&elisa` queries will need to be modified to reference the /experiment endpoint.
NOTE: ELISA queries are nested queries and require special parsing in order to execute properly.
Fix the public script so that private variables that are arrays become null in the public form, rather than this ugly array of objects: `"Symptoms": [{"symptoms": {}}]`
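A minimal sketch of the nulling step, assuming the script knows which fields are private and array-valued (the field list below is hypothetical):

```python
# Hypothetical list of private, array-valued fields to scrub.
PRIVATE_ARRAY_FIELDS = {"symptoms"}

def scrub_private_arrays(record: dict) -> dict:
    """Replace private array-valued fields with None instead of [{...: {}}]."""
    public = dict(record)  # shallow copy; don't mutate the private record
    for field in PRIVATE_ARRAY_FIELDS:
        if field in public:
            public[field] = None
    return public
```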
If `authenticated: true` for a property that is attached to a root Class (Patient, DataDownload, Dataset, Sample, Experiment, DataCatalog), it won't get added to the schema in schema_conversion.py. As a result, it doesn't get validated.
When executing a query limiting the patient IDs to a subset, in certain cases where there's a large number of IDs (> 50), the query doesn't execute since the URL string is too long. At the moment, this is only a problem for a small number of queries, so not high priority.
Raw data now shows all types in one list. Maybe adding a filter/search for individual types like `fasta` or `BAM` would be helpful.
Document API for internal and external use.
Right now, only the `patientID` gets transferred over to `alternateIdentifier`. Any connections to the other IDs will be lost.
Access API via Python and/or R.
When not logged in, https://data.cvisb.org/patient and https://data.cvisb.org/sample redirect to the login page https://data.cvisb.org/login. I assume https://data.cvisb.org/dataset should behave the same way, but it currently just displays a blank page.
For newly added data; https://data.cvisb.org/patient/G12-658061 -- related to #82
Angular/rxjs issue... queries with `fetch_all=true` get confused. Is `rxjs` `expand` currently executing asynchronously?
Neither of these pages seems to load content for me:
https://data.cvisb.org/dataset/ebola-virus-seq
https://data.cvisb.org/dataset/lassa-virus-seq