datacarpentry / organization-genomics
Project Organization and Management for Genomics
Home Page: https://datacarpentry.org/organization-genomics
License: Other
03-project-planning and 04-tidiness appear to be largely duplicated content.
Please consider adding a lesson topic to the repository. To do so, you can follow the help on how to add topics to the repository. Check out the topics that the Genomics R intro lesson has been given, and add any others that may be relevant to this lesson.
This will help people to know which repositories are lessons and also could be used to automate analysis of the repositories.
In 02-organization we start using shell commands without opening or introducing the terminal, and the commands are then used without being explained. It is unclear to me whether we expect learners to already know the shell commands: the text says we will introduce these commands, but the lesson then reads as if learners have done the shell lesson before.
I suggest introducing:
mkdir before using it
ls before using it
nano before using it
OR
In 01_tidiness_datasheet_example_messy.png, there are "description" columns. These columns have spaces because the contents must NOT be critical for bcl2fastq. So these are metadata columns, and later in the cleaned spreadsheet they continue to have spaces. (Note: as far as I am aware, our Illumina sequencing instrument submission sheets do not allow metadata.) I propose we change the column names from "Study_Description" to "Study_Metadata" and "Biosample_Description" to "BioSample_Metadata". The "Sample_Owner" must also be a metadata column, and could be called "Owner_Metadata". Without these changes, learners would likely replace all spaces in these columns with underscores as well. This is an opportunity to make it clear that metadata can exist in a spreadsheet that also contains data, and so should be labeled clearly.
The link encoded here does not lead to the expected place
1. Access the Tenaillon dataset from the provided link: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605.
episode 2 links to the cloud lesson but the link 404s
Hi!
The screenshot example data under "Structuring data in spreadsheets" is a great example of the good habits described above it. However, I think it could be misleading for the first two rows to be global headers (versus column names), because that could interfere with reading the table directly into R. I know there are parameters to skip some rows, but a table without those rows may be easier for beginners. Another guideline that could be added is that the first column should hold unique sample identifiers and the first row should hold unique column names describing ... the descriptions!
I would also suggest moving the definition of metadata to a more obvious area, since it could easily be missed under the discussion box and it is the dominant concept of the page.
Looking forward to using the genomics lesson!
Cheers,
-Frances
Notes for instructors link is empty.
The file describing the setup should be present at http://www.datacarpentry.org/organization-genomics/setup.md
Maybe we only need one setup for all genomics lessons (something to think about and discuss).
The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.
This repository has now been converted to use the standard set of labels.
If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:
SWC legacy labels | New 'The Carpentries' labels |
---|---|
bug | type:bug |
discussion | type:discussion |
enhancement | type:enhancement |
help-wanted | help wanted |
newcomer-friendly | good first issue |
template-and-tools | type:template and tools |
work-in-progress | status:in progress |
The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.
The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.
The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.
-- The Lesson Infrastructure subcommittee
PS: we will close this issue in 30 days if there is no activity.
EMBL-EBI is misspelt in the section heading of 03-ncbi-sra.
I think it would be nice to add some setup notes on ssh.
In Lesson 3, the instructions to get to REL4541B are no longer valid. I just looked and it appears that pull request #120 also contains a proposed solution to this issue. My proposal is slightly different, and might still reliably point the students to the REL4541B (SRR2591054) run that the lesson currently intends them to examine, which might be a bonus if that particular run is meaningful to the lesson.
Where the current lesson instructs
Click on the Run Number of the first entry (REL4541B). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
Modified instructions that should work (at least until the next visual redesign by NCBI) might read like this:
Scroll down to the list of Runs in this SRA Project. Let's try to find a particular run in this large project. Look for a run with Library Name "REL4541B." Try searching for the Library Name in the search box with the orange tag on it, and click Run SRR2591054 from the two results returned.
Metadata standards should be out on its own, not in a sub-box.
"Data about the experiment is usually collected in spreadsheets, like Excel."
This should be moved below "Metadata standards" and be at the start of the actual region discussing the spreadsheets.
It also might be appropriate to introduce other spreadsheet programs to those who do not know that other spreadsheet software exists. For example: "Data about the experiment is usually collected in spreadsheets, such as Microsoft Excel, LibreOffice Calc, or Gnumeric."
In this lesson we make some recommendations around project and data organization. We will likely want to stick with recommendations, because every project is different, but maybe we could have more of a list of guidelines, or some examples of how projects are organized.
We get comments that people are "still trying to get their head around how to organize data"
Next to the section that describes the command history:
What if we mention the two redirection operators ">" and ">>" instead of mentioning the latter only? Using the command "$ history > dc_workshop_log_xxxx_xx_xx.sh", we can create a new file with the name dc_workshop_log_xxxx_xx_xx.sh which is not appendable as we move on to use more commands.
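To illustrate the difference between the two operators, here is a minimal sketch. It uses echo in place of history so that it also runs inside a script (where history typically prints nothing), and the file name demo_log.sh and the temporary directory are placeholders, not names from the lesson.

```shell
# Work in a throwaway directory so nothing real is overwritten.
demo_dir=$(mktemp -d)
log="$demo_dir/demo_log.sh"   # hypothetical file name, for illustration only

# '>' creates the file, or replaces its contents if it already exists.
echo "first command" > "$log"

# '>>' adds to the end of the file and keeps what is already there.
echo "second command" >> "$log"
appended_lines=$(wc -l < "$log" | tr -d ' ')

# Using '>' again discards the two lines written above.
echo "third command" > "$log"
overwritten_lines=$(wc -l < "$log" | tr -d ' ')

echo "lines after '>>': $appended_lines"
echo "lines after a second '>': $overwritten_lines"
```

In an interactive session the same pattern would apply to history: `history > file` starts the log file over each time, while `history >> file` keeps adding to it.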
In http://www.datacarpentry.org/organization-genomics/01-tidiness/ the messy spreadsheet example is missing
In the 01-tidiness lesson, we have an example of a spreadsheet and ask learners to find some things that are wrong with it. The example spreadsheet is field data. It would be better to have some metadata that is more like what people would be using in a genomics experiment. So, we could create a more relevant messy spreadsheet for this exercise.
@bvreede noticed, when helping one of the participants, that downloading the run selector table does not work with Internet Explorer.
@jessicamizzi and I thought it would be nice to show a png from the Tenaillon et al. paper (where the data for the wrangling-genomics lessons was first sequenced) where the SRA accession or project number can be found. This would demonstrate one way to interact with data: read a paper, find the accession number in the paper, look the paper up on the SRA or ENA. We were unsure of copyright infringement on the original paper though.
Thoughts?
Lesson 02-organization needs some cleanup of language.
'You should approach your sequencing project in a very similar way to how you do a biological experiment, and ideally, begins with experimental design.', this sentence is confusing right now, I'd suggest 'You should approach your sequencing project similarly to how you do a biological experiment and this ideally begins with experimental design.'
'Genomics projects can quickly accumulates hundreds of files across tens of folders.', accumulates --> accumulate
'Similarly, you probably won’t remember whether your best alignment results were in Analysis1, AnalysisRedone, or AnalysisRedone2; or which quality cutoff you used.' I suggest more options e.g. best alignment results, quality cutoff, version of software, settings for the software you used, etc. But this isn't really a necessary change.
Also typo here:
‘^X’, needs to have the '' removed.
The links for the Blount et al 2012 paper and supplementary are broken in both 01-introduction.md and 05-ncbi-sra.md
In the updated version of the SRA Run Selector Page, the downloaded SraRunTable.txt is actually now a comma-delimited file rather than a tab-delimited file as stated on the current version of the "Examining Data on the NCBI SRA Database" page.
Students will need to specify that their spreadsheet program interpret it as comma-delimited, so I suggest the following language: "Using your choice of spreadsheet program, open the SraRunTable.txt file. You may need to tell the program that this is a comma-delimited file in order to have the data separated properly into columns."
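For anyone who wants to check the delimiter from the shell before opening the file, here is a rough sketch. The column names and rows below are made-up stand-ins written to a temporary file, not the real contents of SraRunTable.txt, and the check simply assumes that whichever of comma or tab appears more often in the header line is the delimiter.

```shell
# Build a tiny stand-in file; a real SraRunTable.txt has many more columns.
demo_dir=$(mktemp -d)
table="$demo_dir/SraRunTable.txt"
printf 'Run,LibraryName,Bases\nSRR0000001,sampleA,100\nSRR0000002,sampleB,200\n' > "$table"

# Count commas and tabs in the header line.
header=$(head -n 1 "$table")
commas=$(printf '%s' "$header" | tr -cd ',' | wc -c | tr -d ' ')
tabs=$(printf '%s' "$header" | tr -cd '\t' | wc -c | tr -d ' ')

if [ "$commas" -gt "$tabs" ]; then
  delimiter="comma"
else
  delimiter="tab"
fi
echo "header looks $delimiter-delimited"

# With a comma-delimited file, cut needs the separator spelled out.
second_column=$(cut -d ',' -f 2 "$table")
echo "$second_column"
```

Splitting on every comma like this breaks if a field contains a quoted comma, which is one more reason to let a spreadsheet program do the parsing in the lesson itself.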
The following pages are rendering improperly:
Setup
Reference
Code of conduct
@ErinBecker fixed a similar issue by editing the _config.yml file in the wrangling-genomics lesson.
Glossary section of reference page is "FIXME"
In 00_intro_organization.md
In the episode "Planning for NGS project", the sub-heading "Retrieving samples from the facility" may be factually confusing. For clarity, I suggest "Retrieving sample sequencing data from the facility". The change covers both the sequence files and their metadata; as written, the heading implies that we are getting the samples themselves back.
The link to download the file in the last exercise of 01-data-tidiness is missing. The file is located at https://github.com/datacarpentry/organization-genomics/blob/gh-pages/files/Ecoli_metadata_composite_messy.xlsx
To keep from confusing learners with multiple pages of set-up instructions, it would be ideal to have only one "point of truth" for setup instructions for the whole workshop. That page is the setup page in the workshop overview repo.
We can include text like:
This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All the software and data used in the workshop are
hosted on an Amazon Machine Image (AMI). For information about how to
use the workshop materials, see the
setup instructions on the main workshop page.
The information about installing LibreOffice should first be added to the main setup page.
This issue is for things people have learned on this lesson
Comments
"PIs should require not just end results, but the whole path and parameters to it"
"I wish I had taken this workshop earlier. Deciding how to save and manage dat is one big lesson learned the hard way!"
The CSS isn't working at the following link:
https://datacarpentry.org/organization-genomics/setup/
If I check with a CSS validation tool (http://jigsaw.w3.org/css-validator/validator?uri=https%3A%2F%2Fdatacarpentry.org%2Forganization-genomics%2Fsetup%2F&profile=css3svg&usermedium=all&warning=1&vextwarning=&lang=en), I see the following errors:
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/bootstrap.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/bootstrap-theme.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/lesson.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/syntax.css: Not Found
Hello!
I noticed that the workflow for students here: https://github.com/datacarpentry/organization-genomics/blob/gh-pages/_episodes/03-ncbi-sra.md
for the "Download the Lenski SRA data from the SRA Run Selector Table" portion of the lesson is now outdated due to the recent NCBI upgrade. See:
https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605&o=acc_s%3Aa
-Dave
If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.
To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:
When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
Arizona BBQ Team: There may be some opportunities for combining these two sections. The spreadsheet with the current data set is good, and should be used so we don't need a second spreadsheet exercise (sample submission) - maybe some of those discussion questions can be moved over.
The link for the ENA has an extra parenthesis, causing the markdown link to not render properly.
Navigate to the [ENA]((https://www.ebi.ac.uk/ena).
The opening page for this repo has a lot of place holders for
Sorry if I missed an issue that covers this.
Arizona BugBBQ - The SRA lesson is too much and the subject matter is too deep to cover well. We suggest showing an SRA submission spreadsheet in the tidiness section. Learners could browse this in a short exercise and be made aware that this is probably metadata they will need to collect.
In this issue I'm proposing a reorganization of this module and some changes in the lessons.
Organizing a project that involves sequencing involves many components. There's the start of the experiment, with the records of the experimental setup and conditions, as well as the sequencing information and the records of the bioinformatics analyses. It's an extension of your lab notebook and freezer samples to digital data and analyses. In this lesson, we'll go through the project organization and documentation that will make your current life more organized and easier for future you to understand what was done.
In this lesson you will learn:
With this structure, I'm proposing to re-order and expand some of the existing lessons
Before working on this re-configuration, I wanted to get thoughts on this idea from other maintainers and genomics folks. Thanks!
Workshop Overview says:
This lesson assumes no prior experience with the tools covered in the workshop. However, learners are expected to have some familiarity with biological concepts, including nucleotide abbreviations and the concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.
Here says:
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to insure the proper setup of tools for an efficient workflow.
These lessons assume no prior knowledge of the skills or tools.
Prerequisites
This lesson requires a spreadsheet program, such as Excel or OpenOffice, and a web browser.
To most effectively use these materials, please make sure to install everything before working through this lesson.