
community-archive's People

Contributors

93boy, ayghal, dhananjaya93, ltcrod, mnortoft, nevrome, scarlhoff, smpeltola, stschiff, suzannefreilich, tclamnidis


community-archive's Issues

prior2018_Boston_Datashare typos

Hi!
While extracting some individuals from the prior2018_Boston_Datashare_published package, I noticed that the new PLINK and EIGENSTRAT genotypes and the janno file inherited some mistakes from the original data share.

Group | Individual
Vanuatu_2900BP_all_published | I1369_al_published (should be I1369_all_published)
Nepal_Samzdong_1500BP.SG (should be Nepal_Samdzong_1500BP.SG) | S35.SG
Nepal_Samzdong_1500BP.SG (should be Nepal_Samdzong_1500BP.SG) | S41.SG

As the typos are consistent across all files this is not urgent, but it would be good to fix them at some point, for consistency within the groups we use for analysis. Thanks!

Checklists for contributors and reviewers

I believe we should go back (we had prototypes once) to a set of checklists that document, step by step and in unnerving detail, how the following operations in our public repository are done:

  • Submitting a new package with a PR
  • Applying a change to various components of a package with a PR
  • Reviewing a submission

The complexity of this system is simply too big to remember all of it, our automatic tests cannot cover everything (for technical and conceptual reasons, e.g. #84), and it is currently difficult to communicate these processes to new and inexperienced users. I think the current write-up we have here is not detailed enough. A very good inspiration might be the extensive guides of the rOpenSci community for package submission and review, e.g. https://devguide.ropensci.org/building.html

Add accession numbers

Mikkel Nørtoft provided us with a list of accession numbers for most publications here (see the Slack workspace). These should be systematically added to the .janno files (#9).

Idea: Check for unattached files

We could add a simple script to our GitHub Action that creates a list of all files in the repository (excluding the POSEIDON.yml files) and then reads all the POSEIDON.yml files to check if all files are referenced exactly once. That would probably be a fairly easy check to prevent accidentally unattached files (e.g. the .bib files, which are easily forgotten).
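
A minimal sketch of such a check (assuming PyYAML is available, and treating every string value in a POSEIDON.yml that names an existing file as a file reference):

import os
from collections import Counter

import yaml  # PyYAML

referenced = Counter()

def collect(node, root):
    # Recursively walk the parsed YAML; count every string value
    # that points to an existing file relative to the package directory.
    if isinstance(node, dict):
        for value in node.values():
            collect(value, root)
    elif isinstance(node, list):
        for value in node:
            collect(value, root)
    elif isinstance(node, str):
        path = os.path.normpath(os.path.join(root, node))
        if os.path.isfile(path):
            referenced[path] += 1

for root, _, files in os.walk("."):
    if "POSEIDON.yml" in files:
        with open(os.path.join(root, "POSEIDON.yml")) as f:
            collect(yaml.safe_load(f), root)

# Now report every file that is not referenced exactly once.
for root, dirs, files in os.walk("."):
    dirs[:] = [d for d in dirs if d != ".git"]
    for name in files:
        if name == "POSEIDON.yml":
            continue
        path = os.path.normpath(os.path.join(root, name))
        if referenced[path] != 1:
            print(f"{path} is referenced {referenced[path]} times")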

Wrong genotype ploidy for 2019_Biagini_Spain package

The data from 2019_Biagini_Spain are diploid genotypes, and not haploid as the janno file suggests.

$ head 2019_Biagini_Spain.geno ## In EIGENSTRAT
001000020119000110001100000011000011100000101001000011000010110100100010100100011900000100101021201000020002112101110020
000000000011010000100010010010001110001000111010011101010110101002010021011001101000111100000100010110101000011100000011

Heterozygotes (1s) cannot exist in haploid datasets.
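
For reference, a minimal sketch of how to count heterozygous calls per individual (in EIGENSTRAT .geno files each line is one SNP and each character one individual; 0/1/2 count the alleles and 9 marks missing data):

het_counts = None
with open("2019_Biagini_Spain.geno") as geno:
    for line in geno:
        row = line.strip()
        if het_counts is None:
            het_counts = [0] * len(row)
        for i, genotype in enumerate(row):
            if genotype == "1":  # heterozygous call
                het_counts[i] += 1

# Any non-zero count contradicts a haploid Genotype_Ploidy entry.
print(het_counts)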

The genotype data of 2020_Nakatsuka_SouthPatagonia is broken

forge fails between SNPs 980000 and 990000 with:

Issues in genotype data parsing: SeqFormatException "Error while parsing: not enough input. Error occurred when trying to parse this chunk: \"U\\SOH\""

plink reports this:

$ plink1.9 --bfile Nakatsuka_SouthPatagonia --recode tab --out test
PLINK v1.90b6.22 64-bit (3 Nov 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test.log.
Options in effect:
  --bfile Nakatsuka_SouthPatagonia
  --out test
  --recode tab

15814 MB RAM detected; reserving 7907 MB for main workspace.
1233013 variants loaded from .bim file.
20 people (12 males, 8 females) loaded from .fam.
Error: Invalid .bed file size (expected 6165068 bytes).

@AyGhal @93Boy The broken data was added in 125b16b. I think this should be fixed ASAP, with the highest priority.

trident validate did not find the issue, because it only ever looks at the first 100 SNPs (for performance reasons). We should probably add a full validation to our review process, @stschiff (poseidon-framework/poseidon-hs#223).

Enable and fill Data_Preparation_Pipeline_URL (?)

This column is defined as follows:

The column Data_Preparation_Pipeline_URL should eventually store a URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager) by which the sample data was processed. One solution to document and publish such a computational workflow might be protocols.io.

Filling it would be possible for data produced at the MPI-EVA, but it requires a protocols.io pipeline describing the respective data preparation steps. This can probably only be created by the respective author of a paper, so it's nearly impossible to obtain for legacy data.

Genotype data validation

As demonstrated by #5, we need validation of the genotype data. Maybe the GitHub Action could fetch everything and then attempt a complete forge. Or we teach validate some new tricks.

CHANGELOG.md files aren't proper .md files

Right now our CHANGELOG files usually look like this:

V X.X.X: What changed there
V X.X.X: What changed here

That is not proper Markdown, because there are no explicit line breaks. We should either add line breaks with trailing spaces as defined in Markdown (bad idea), make the entries part of a list by prepending - (slightly better idea), or just rename the files to .txt. I like the last option best, because it is simple to automate (see the sketch below).

This issue also requires a change in trident update, which generated the wrong files in the first place.
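
A minimal sketch of that automation (assuming the file name appears verbatim in each package's POSEIDON.yml):

import os

for root, _, files in os.walk("."):
    if "CHANGELOG.md" in files and "POSEIDON.yml" in files:
        # Rename the file itself ...
        os.rename(os.path.join(root, "CHANGELOG.md"),
                  os.path.join(root, "CHANGELOG.txt"))
        # ... and update the reference in the package definition.
        yml_path = os.path.join(root, "POSEIDON.yml")
        with open(yml_path) as f:
            text = f.read()
        with open(yml_path, "w") as f:
            f.write(text.replace("CHANGELOG.md", "CHANGELOG.txt"))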

Insufficient consent for present-day genomes from Taiwan

The genetic data for the Amis and Atayal published in Lazaridis et al. (2014) was produced from cell lines for which the original samples were collected without proper informed consent. Any derived data is therefore controversial in Taiwan and should not be used for further analyses. We suggest removing the respective genotypes from the package, with further explanation in the README.

Some questions on validation checks

I have some questions about the automatic workflow validate.

  1. Does the checkout that happens within that check contribute to our big-data bandwidth?
  2. If we already make a full checkout for this action, why do we pass --ignoreGeno in the call to trident validate? Is it possible that this is a historic leftover from the time when we didn't have the genotype data checked in?
  3. I think validate currently doesn't check BibTeX keys... at least mismatches are not showing up, as in some of the recent pull requests. I will take a look myself.

ToDo

  • Fill missing Tissue_Type fields
  • Add accession IDs at least on a project/paper level to all packages
  • Fill missing Date_Type fields (#25)
  • Check and improve completeness of spatiotemporal context information (coordinates, dating) for all packages
  • Add context information for papers with very low .janno file coverage
    • 2014_MalaspinasCurrentBiology
    • 2014_RaghavanScience
    • 2020_Nakatsuka_SouthPatagonia
    • 2021_Kilinc_northeastAsia
    • ...
  • Add meaningful description sentences to all .yml files
  • Maybe: Move the "taken from the Reich lab" disclaimer from the description (yaml) into the package README file

Fill in relationships from group_name

We currently have lots of information about genetic relationships between individuals encoded in the group names (a hack invented in the AADR, I believe). We need to move this over to the new Poseidon v2.5 schema columns Relation_To, Relation_Degree and Relation_Note.

Points to check and fix (?) in 2022_Gretzinger_AngloSaxons

When we added the 2022_Gretzinger_AngloSaxons package, we made a few decisions that need to be checked by the first author:

  • Our supplement gives contamination estimates without error bars. Poseidon requires them if the estimates themselves are set. I have now set them all to a dummy value of 0.001, but it would of course be good if you could fill in the correct ones, if you still have them.
  • The dates for GRO004, GRO006, GRO015, GRO016 and GRO020 were given only as an upper bound in our supplement. I have now set their lower bound to 900 in all of these. Please check whether that is appropriate.
  • All HIDXXX samples were given only a date fixed at 400 CE. I have now set this to 300-500. Please check whether that is appropriate.

Finally, I noticed that the Supplementary Tables of the paper give the same C14 date ID, SUERC-20250, for samples I0790 (LintonSk352) and I0791 (Linton351). They are from the same grave, so the common ID indicates that only one measurement was made, which is fine, but we need the uncalibrated date. For now, I have switched the Date_Type to "contextual", as the validator doesn't allow the type to be C14 without an uncalibrated date given.

Improve POSEIDON.yml description fields

  • Add meaningful description sentences to all .yml files
  • Maybe: Move the "taken from the Reich lab" disclaimer from the description (yaml) into the package README file

RaghavanNature

2013_RaghavanNature and 2014_RaghavanNature have to be merged. 2014_RaghavanNature is the correct name.

Add citation for the Reich Lab data release

A significant portion of the data in Poseidon was copied and modified from the Reich Lab data release V37.2. This should be pointed out in the respective packages.

One solution would be to add or modify the package description with a sentence like this:

This package was initially constructed based on the Allen Ancient DNA Resource (AADR) (@AADR_37.2).

And then add the following BibTeX entry to the respective .bib file:

@misc{AADR_37.2,
	title        = {Allen Ancient {DNA} Resource ({AADR}): Downloadable genotypes of present-day and ancient {DNA} data, version 37.2},
	year         = 2019,
	url          = {https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data},
	howpublished = {\url{https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data}}
}

List of papers to be added

(Update on August 14, 2024 by @stschiff)
Special thanks to @mnortoft who compiled much of this list!

Preprints to watch

  • Watanabe et al. bioRxiv Cold adaptation in Upper Paleolithic hunter-gatherers of eastern Eurasia
  • Saag et al. bioRxiv North Pontic crossroads: Mobility in Ukraine from the Bronze Age to the early modern period
  • Purnomo et al. bioRxiv The genetic origins and impacts of historical Papuan migrations into Wallacea
  • Zagorc et al. bioRxiv Bioarchaeological Perspectives on Late Antiquity in Dalmatia: Paleogenetic, Dietary, and Population Studies of the Hvar - Radošević burial site
  • Lazaridis et al. bioRxiv The Genetic Origin of the Indo-Europeans
  • Ravasini et al. bioRxiv The Genomic portrait of the Picene culture: new insights into the Italic Iron Age and the legacy of the Roman expansion in Central Italy
  • McColl et al. bioRxiv Steppe Ancestry in western Eurasia and the spread of the Germanic Languages
  • Scheib et al. bioRxiv Local population structure in Cambridgeshire during the Roman occupation

Published papers

2024

  • Hui et al. 2024 Genetic history of Cambridgeshire before and after the Black Death
  • Antonio et al. 2024 Stable population structure in Europe since the Iron Age, despite high mobility
  • Allentoft et al. 2024 Population genomics of post-glacial western Eurasia
  • Parasayan et al. 2024 Late Neolithic collective burial reveals admixture dynamics during the third millennium BCE and the shaping of the European genome
  • Sasso et al. 2024 Capturing the fusion of two ancestries and kinship structures in Merovingian Flanders
  • Seersholm et al. 2024 Repeated plague infections across six generations of Neolithic Farmers
  • Alves et al. 2024 Human genetic structure in Northwest France provides new insights into West European historical demography
  • Barquera et al. Nature (Chichén Itzá)
  • Ghalichi et al. in prep.
  • Gnecchi-Ruscone Nature
  • ([x] Orfanou et al. 2024 Biomolecular evidence for changing millet reliance in Late Bronze Age central Germany) <- does not contain any DNA samples
  • Sirak et al. 2024 Medieval DNA from Soqotra points to Eurasian origins of an isolated population at the crossroads of Africa and Arabia
  • Penske et al. 2024 Kinship practices at the early bronze age site of Leubingen in Central Germany

2023

  • Bennett et al. 2023 Genome sequences of 36,000- to 37,000-year-old modern humans at Buran-Kaya III in Crimea
  • (in process) Carlhoff et al. 2023 Genomic portrait and relatedness patterns of the Iron Age Log Coffin culture in northwestern Thailand
  • Serrano et al. 2023 The genomic history of the indigenous people of the Canary Islands
  • Villa-Islas et al. 2023 Demographic history and genetic structure in pre-Hispanic Central Mexico
  • Chyleński et al. 2023 Patrilocality and hunter-gatherer-related ancestry of populations in East-Central Europe during the Middle Bronze Age
  • Aqil et al. 2023 A paleogenome from a Holocene individual supports genetic continuity in Southeast Alaska
  • Wang et al. 2023 High-coverage genome of the Tyrolean Iceman reveals unusually high Anatolian farmer ancestry
  • Mattila et al. 2023 Genetic continuity, isolation, and gene flow in Stone Age Central and Eastern Europe
  • Brielle et al. 2023 Entwined African and Asian genetic roots of medieval peoples of the Swahili coast
  • Wang et al. 2023 Human genetic history on the Tibetan Plateau in the past 5100 years
  • Brami et al. 2023 Investigating the prehistory of Luxembourg using ancient genomes
  • Gerber et al. 2023 Interdisciplinary Analyses of Bronze Age Communities from Western Hungary Reveal Complex Population Histories
  • Stolarek et al. 2023 Genetic history of East-Central Europe in the first millennium CE
  • Simões et al. 2023 Northwest African Neolithic initiated by migrants from Iberia and Levant
  • Ferraz et al. 2023 Genomic history of coastal societies from eastern South America
  • (in process) Posth et al. 2023 Palaeogenomics of upper palaeolithic to neolithic European hunter-gatherers
  • Wang et al. 2023 Middle Holocene Siberian genomes reveal highly connected gene pools throughout North Asia
  • Skourtanioti et al. 2023 Ancient DNA reveals admixture history and endogamy in the prehistoric Aegean
  • Begg et al. 2023 Genomic analyses of hair from Ludwig van Beethoven
  • Peltola et al. 2023 Genetic admixture and language shift in the medieval Volga-Oka interfluve
  • Penske et al. 2023 Early contact between late farming and pastoralist societies in southeastern Europe
  • (in process) Rivollat et al. 2023 Extensive pedigrees reveal the social organization of a Neolithic community
  • Villalba-Mouco et al. 2023 A 23,000-year-old southern Iberian individual links human groups that lived in Western Europe before and after the Last Glacial Maximum

2022

  • Kumar et al. 2022 Bronze and Iron Age population movements underlie Xinjiang population history
  • Gelabert et al. 2022 Genomes from Verteba cave suggest diversity within the Trypillians in Ukraine
  • Marchi et al. 2022 The genomic origins of the world’s first farmers
  • Ariano et al. 2022 Ancient Maltese genomes and the genetic geography of Neolithic Europe
  • Gelabert et al. 2022 Northeastern Asian and Jomon-related genetic structure in the Three Kingdoms period of Gimhae, Korea
  • Liu et al. 2022 Ancient DNA reveals five streams of migration into Micronesia and matrilocality in early Pacific seafarers
  • Changmai et al. 2022 Ancient DNA from Protohistoric Period Cambodia indicates that South Asians admixed with local populations as early as 1st–3rd centuries CE
  • Morez et al. 2022 Imputed genomes and haplotype-based analyses of the Picts of early medieval Scotland reveal fine-scale relatedness between Iron Age, early medieval and the modern people of the UK
  • Lazaridis et al. 2022 trio of papers The genetic history of the Southern Arc: A bridge between West Asia and Europe (all three papers have the same ENA Project accession, so are to be considered one package)
  • Charlton et al. 2022 Dual ancestries and ecologies of the Late Glacial Palaeolithic in Britain
  • 4 2022_Pedersen_Lice: Ancient Human Genomes and Environmental DNA from the Cement Attaching 2,000-Year-Old Head Lice Nits
  • 6 2022_Lipson_SubSaharanAfrica: Ancient DNA and deep population structure in sub-Saharan African foragers
  • 201 2022_Kumar_Xinjiang: Bronze and Iron Age population movements underlie Xinjiang population history
  • 20 2022_Kennett_Maya: South-to-north migration preceded the advent of intensive farming in the Maya region
  • 66 2022_Gnecchi-Ruscone_Avars: Ancient genomes reveal origin and rapid trans-Eurasian migration of 7th century Avar elites
  • 49 2022_Fischer_Gauls: Origin and mobility of Iron Age Gaulish groups in present-day France revealed through archaeogenomics
  • 29 2022_Dulias_Orkney: Ancient DNA at the edge of the world: Continental immigration and the persistence of Neolithic male lineages in Bronze Age Orkney

2021

2020

  • 4 2020_Sjogren_BellBeaker: Kinship and social organization in Copper Age Europe. A cross-disciplinary analysis of archaeology, DNA, isotopes, and anthropology from two Bell Beaker cemeteries
  • 2 2020_Pugach_Guam: Ancient DNA from Guam and the peopling of the Pacific
  • 11 2020_Parker_MedievalGermany: A systematic investigation of human DNA preservation in medieval skeletons
  • 2020_Naegele_Caribbean: Genomic insights into the early peopling of the Caribbean
  • 1 2020_Mizuno_WilliamAdams: A biomolecular anthropological investigation of William Adams, the first SAMURAI from England
  • 214 2020_Jeong_EasternSteppe: A Dynamic 6,000-Year Genetic History of Eurasia’s Eastern Steppe
  • 2020_Burger_GermanyTollense: Low Prevalence of Lactase Persistence in Bronze Age Europe Indicates Ongoing Strong Selection over the Last 3,000 Years
  • 6 2020_Bongers_Inca: Integration of ancient DNA with transdisciplinary dataset finds strong support for Inca resettlement in the south Peruvian coast

Validation server idea

The discussion around #83 gave me an idea. It would actually be fairly easy to build an HTTP server (deployed on the same cloud server where the other server currently lives) that validates new branches, PRs and commits for us.

The server would be a simple HTTP server program with a REST API. I can think of the following APIs:

/validate/<branch_name> -> This would trigger the server to pull the branch and perform a validation. Bandwidth usage would be minimal, because it would only have to pull big files if they changed relative to master or the latest pull.

/get_validation_results/<branch_name> -> This would either respond with some kind of "pending" if it's still running, or "done" with the validation results.

We then use a GitHub webhook to trigger /validate/<branch_name> whenever a new PR gets opened or an existing PR gets new commits. A GitHub Action would then query /get_validation_results/<branch_name> for as long as the result is pending, and show the result when it's done.

I'm arguing a bit from an armchair here, knowing that the HTTP server itself needs maintenance and code updates, and I don't know when I'll find the time to do it. But conceptually it's easy, and it would solve our bandwidth problems and finally also check genotype data.
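
A minimal sketch of the two endpoints (Python/Flask purely for illustration; the real server would need proper git handling, concurrency and error reporting):

import subprocess
import threading

from flask import Flask, jsonify

app = Flask(__name__)
results = {}  # branch name -> {"status": "pending"/"done", ...}

def run_validation(branch):
    # Fetch only what changed, check the branch out, then validate fully.
    subprocess.run(["git", "fetch", "origin", branch], check=True)
    subprocess.run(["git", "checkout", branch], check=True)
    proc = subprocess.run(["trident", "validate", "-d", "."],
                          capture_output=True, text=True)
    results[branch] = {"status": "done",
                       "output": proc.stdout + proc.stderr}

@app.route("/validate/<branch>")
def validate(branch):
    results[branch] = {"status": "pending"}
    threading.Thread(target=run_validation, args=(branch,)).start()
    return jsonify(results[branch])

@app.route("/get_validation_results/<branch>")
def get_validation_results(branch):
    return jsonify(results.get(branch, {"status": "unknown"}))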

Wrong AD end date on one sample KIN003

Dear Poseidon team,

Thanks for all your great work.
I noticed that one sample, KIN003 (from WangGoldstein2020ScienceAdvances), has a wrong date in the Date_BC_AD_Stop column.
In the original publication the calibrated date is "1665-present", which may explain the error: the "present" part seems to have become 0 in your Date_BC_AD_Stop column, although that is of course not the calendar year 0 (which does not exist). Since this is a C14 date, a calibrated end date of 1950 for "present" would probably make more sense.

2020_BarqueraCurrentBiology has its large files not in LFS

When updating the repository with git pull I get errors:

Encountered 2 file(s) that should have been pointers, but weren't:
	2020_BarqueraCurrentBiology/BarqueraCurrentBiology.bed
	2020_BarqueraCurrentBiology/BarqueraCurrentBiology.bim

This is because these two files were committed as ordinary files instead of using the git lfs extension, probably because the committer did not run git lfs install. Probably an easy fix, but I need to play a bit with git rm --cached and the like. I'll work on this.

Integrating links to raw data

We agreed in our Poseidon meetings that we would soon upgrade our schema to allow for an additional optional file for Poseidon packages, named sequencingSourceFile. The file will be a tab-separated table, with a number of columns necessary to access and process the raw data behind the genotype data (i.e. fastq or bam files).

@jfy133 kindly provided some help on how to get this information from the ENA. The easiest way to get started with the ENA data links is simply to use the TSV export feature on the ENA webpage. Example:

  • Go to https://www.ebi.ac.uk/ena/browser/home
  • Type PRJNA429081 into the accession search box
  • Click on "Column Selection". Choose columns that you need (this will have to be agreed upon in the Big Data meeting), but crucial ones would be "fastq_bytes", "fastq_md5" and "fastq_ftp"
  • Click on "TSV", which will download a table with all columns you've selected.

Ultimately this will be a joined file with project-, sample-, experiment- (some weird intermediate level) and run-level IDs. The run level then corresponds to the actual files you have (the files corresponding to libraries sequenced on a single run).

James has also written an R script providing a function that takes a project accession ID as input and returns a table conforming to https://github.com/SPAAM-community/AncientMetagenomeDir, which might end up being quite similar to what we want for Poseidon.
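
The same report can also be fetched programmatically via the ENA portal API; a minimal sketch (the field list here is just an example and would have to follow whatever the Big Data meeting agrees on):

import requests

resp = requests.get(
    "https://www.ebi.ac.uk/ena/portal/api/filereport",
    params={
        "accession": "PRJNA429081",
        "result": "read_run",
        "fields": "run_accession,sample_accession,fastq_bytes,fastq_md5,fastq_ftp",
        "format": "tsv",
    },
)
resp.raise_for_status()
print(resp.text)  # the same TSV the website export produces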

Typo in column names in: `2022_GnecchiRuscone_CarpathianBasin`

In a group call today, Matthis thor Straten reported that the package 2022_GnecchiRuscone_CarpathianBasin seems to have a typo in the column names.

A custom column named Nr_SNPCoverage_on_Target_SNPs exists, which is likely meant to be two columns: Nr_SNPs and Coverage_on_Target_SNPs. We should check if and how this has affected the content of other columns in the package.

Critical bug in trident caused a significant portion of Date_C14_Labnr columns to be filled with wrong values

A critical bug in trident (poseidon-framework/poseidon-hs@df72486), only now fixed in v0.17.4, caused the janno column Date_C14_Labnr to be overwritten by Date_C14_Uncal_BP. Unfortunately this affects a significant portion of all packages in published_data, because they went through trident when we split up the former Reich dataset.

We will have to go through all packages and reintroduce the respective lab numbers.

The most practical approach might be to copy the respective entries from the large Boston_datashare janno file we had before b9d0654 into the new janno files. So, yet again, boring manual labour. Alternative ideas are very welcome.

We should also check the other janno files, those not extracted from the Reich dataset; maybe some of them went through trident as well. Probably the easiest solution for that is to make the labnr data type in poseidon-hs more specific, so that labnrs get validated: they have a specific structure and can be checked accordingly.
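
Such a check could be as simple as a regular expression (a sketch; the exact pattern is an assumption, based on common lab codes like SUERC-20250 or OxA-1234):

import re

# Lab code (letters), a dash, and a running number.
LABNR_PATTERN = re.compile(r"^[A-Za-z]+-[0-9]+$")

def looks_like_labnr(value: str) -> bool:
    return bool(LABNR_PATTERN.match(value))

assert looks_like_labnr("SUERC-20250")
assert not looks_like_labnr("3625")  # an uncalibrated date, not a lab number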

Mismatches between janno and ena tables

It is sometimes the case that individuals that appear in the janno of a package do not appear in the ENA table for the package (if, say, some of the data was not properly uploaded to the ENA), or vice versa (e.g. when individuals were excluded from analyses and supplementary tables of a paper, but the sequencing data was still uploaded to the ENA).

This will pose a challenge for automatic processing of packages in the future.

1000 Genomes data should be haploid, but has some heterozygous positions

A new feature of validate, introduced in version 1.4.0.0, checks whether Genotype_Ploidy from the janno file is consistent with the genotype data.

When running this on the community-archive, I noticed that the sample HG01804.SG in the package 2015_1000Genomes_1240K_haploid_pulldown does have heterozygous positions, even though it should be a haploid pulldown. We need to check whether this error is already present in the AADR or not.

For now, in order to have this package validate, I marked Genotype_Ploidy for this one sample as missing. But we need to set it back to haploid once we have figured out what's going on.

CHANGELOG missing, server breaks

The latest server restart failed because of a missing CHANGELOG.md file in the package 2020_Jeong_EurasiaEasternSteppe. I fixed it locally on the server for now by removing the line from the POSEIDON.yml, but I think we should add a test for the existence of these files to our CI.

@dhananjaya93 can you add a CHANGELOG file to that package?

Missing Countries in four packages

The following four packages have missing Country entries:

trident list --individuals -d . -j Country --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
  4 2014_RaghavanScience
  20 2020_Nakatsuka_SouthPatagonia
  40 2021_Kilinc_northeastAsia
 383 2021_Wang_EastAsia
   4 Reference_Genomes

Obviously, the last one should be n/a, but the others should have proper countries. This should be easy to fix by checking the original papers. @dhananjaya93 (@93Boy), perhaps you could get to that. Thanks.

Submission template/bot/process

We decided to improve the submission process by guiding it with bot actions or PR templates, so that submitters get a bit more hand-holding when submitting valid packages to the PCA.

Fill uncalibrated Dates for 2021_PattersonNature

As Clemens found in #97, the package 2021_PattersonNature has 826 entries with calibrated dates but no uncalibrated dates. We should get these filled in. @93Boy, could you have a look whether there is a supplementary table in the paper that lists the raw uncalibrated dates?

Missing Date_Types in 12 packages

Lots of packages contain missing Date_Types in the janno file. In my view, a lot of those should be easy to fill (see the sketch after the listing below):

a. If there are entries in the C14-type columns, set Date_Type to C14.
b. If there are entries in the calibrated columns, but not in the C14 columns, set Date_Type to contextual.
c. If the samples are modern, set it to modern.
d. If the sample is ancient but there is no date at all, keep n/a for now; of course those should also be filled soon, at least as a contextual range, which should always be possible by looking into the paper.

published_data % trident list --individuals -d . -j Date_Type --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
   5 2020_Brunel_France
   1 2020_Cassidy_IrishDynastic
  12 2020_Furtwaengler_Switzerland
  20 2020_Nakatsuka_SouthPatagonia
  30 2020_Ning_China
   1 2020_Wang_subSaharanAfrica
  24 2020_Yang_China
  40 2021_Kilinc_northeastAsia
 826 2021_PattersonNature
  18 2021_Saag_EastEuropean
  22 2021_SaupeCurrBiol
 383 2021_Wang_EastAsia
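
A minimal sketch of rules a and b applied to a single .janno file (tab-separated, "n/a" for missing values, column names as in the Poseidon janno schema; rules c and d need per-sample judgement):

import pandas as pd

janno = pd.read_csv("2021_PattersonNature.janno",
                    sep="\t", dtype=str, na_values="n/a")

def infer_date_type(row):
    if pd.isna(row["Date_Type"]):
        if pd.notna(row["Date_C14_Uncal_BP"]):
            return "C14"         # rule a
        if pd.notna(row["Date_BC_AD_Start"]) or pd.notna(row["Date_BC_AD_Stop"]):
            return "contextual"  # rule b
    return row["Date_Type"]      # rules c and d: decide manually

janno["Date_Type"] = janno.apply(infer_date_type, axis=1)
janno.to_csv("2021_PattersonNature.janno", sep="\t", index=False, na_rep="n/a")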

Update 2021_Wang_EastAsia

This package should be updated to the final published version (WangNature2021 in the AADR). We added it when it was still on bioRxiv.

Overall Repository versioning

Shall we introduce overall repository versioning? As we are now introducing schema changes for Poseidon 2.6.0, we should package up the old repository as one big tar.gz file and provide it for download somewhere.

Shall we start with 1.0.0 at this point, and put it into some archive, like Zenodo?

Missing coordinates

A lot of samples across packages lack latitude and longitude coordinates. Maybe this information can easily be recovered from the respective publications.

Data candy

With the growing data collection in this repository and the tools we wrote to access it, we could fairly easily set up automatic pipelines to construct useful derived data products. One way to do this would be a GitHub repo that gets updated automatically by a clever GitHub Action whenever the master branch of published_data changes.

Some ideas:

  • Pairwise-distance matrices with multiple distance measures. This is especially important given that many individuals are represented multiple times in this dataset, and so far we do not offer a workflow to remove duplicates (or biologically related individuals).
  • An MDS with all ancient individuals.
  • Various data quality and completeness indices.

Theoretically we could also produce figures and interactive toys - then the sky is the limit. I would suggest sticking to the basic necessities, though.

Date_Type missing for >500 individuals

I checked how Date_Type is currently filled, with something equivalent to

trident list --remote --individuals -j Date_Type | cut -f4 | sort | uniq -c

yielding

3181 C14
2356 contextual
7404 modern
 534 n/a

Apparently we have 534 missing values in that column, which needs fixing.

A Complication in ENA to Poseidon data matching

While matching ENA data with Poseidon data I have run into a problem: in some packages I found multiple Poseidon IDs for one individual, like I10871.DG, I10871.SG and I10871_published. What should the procedure be to map these onto the ENA data? And if I add multiple Poseidon_IDs to such fields, will that make them harder to read for trident or other tools?

2020_Nakatsuka_SouthPatagonia has missing Publications in Janno file

trident list --individuals -j Publication -d ~/dev/poseidon-framework/published_data --raw | awk '$4=="n/a"'
trident v1.1.3.1 for poseidon v2.5.0
https://poseidon-framework.github.io

[Info]    Searching POSEIDON.yml files... 
[Info]    168 found
[Info]    Checking Poseidon versions... 
[Info]    Initializing packages... 
[Info]    Packages loaded: 168
[Info]    Preparing output table
[Info]    found 15052 individuals/samples
2020_Nakatsuka_SouthPatagonia	I8575	Aonikenk_SouthContinent_CerroJohnny_400BP	n/a
2020_Nakatsuka_SouthPatagonia	I8576	Selknam_FaroMendez_100BP	n/a
2020_Nakatsuka_SouthPatagonia	I12364	Selknam_NorthTierradelFuego_Grouped_500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12366	Selknam_NorthTierradelFuego_Grouped_500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12357	Haush_MitrePeninsula_Grouped_700BP	n/a
2020_Nakatsuka_SouthPatagonia	I12365	Argentina_Tierra_del_Fuego_brother.I12367	n/a
2020_Nakatsuka_SouthPatagonia	I12359	Haush_MitrePeninsula_Grouped_700BP	n/a
2020_Nakatsuka_SouthPatagonia	I12361	Haush_MitrePeninsula_Grouped_700BP	n/a
2020_Nakatsuka_SouthPatagonia	I12355	Yamana_BeagleChannel_Grouped_1900-500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12362	Argentina_NorthTierradelFuego_LaArcillosa2_6000BP	n/a
2020_Nakatsuka_SouthPatagonia	I12376	Argentina_LagunaToro_2400BP	n/a
2020_Nakatsuka_SouthPatagonia	I12360	Haush_MitrePeninsula_Grouped_700BP	n/a
2020_Nakatsuka_SouthPatagonia	I12363	Selknam_NorthTierradelFuego_Grouped_500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12367	Selknam_NorthTierradelFuego_Grouped_500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12941	Yamana_BeagleChannel_Grouped_1900-500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12943	Yamana_BeagleChannel_Grouped_1900-500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12942	Yamana_BeagleChannel_Grouped_1900-500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12356	Haush_MitrePeninsula_Grouped_700BP	n/a
2020_Nakatsuka_SouthPatagonia	I12354	Selknam_NorthTierradelFuego_Grouped_500BP	n/a
2020_Nakatsuka_SouthPatagonia	I12358	Haush_MitrePeninsula_Grouped_700BP	n/a
