petrelharp / local_pca

Methods for examining PCA locally along the genome.


local_pca's Introduction

Local PCA/population structure (lostruct)

If you use this method, please cite Li & Ralph 2019:

Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.


@article {li2019local,
	author = {Li, Han and Ralph, Peter},
	title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
	volume = {211}, number = {1}, pages = {289--304}, year = {2019},
	doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}

Note: a prototype Python implementation by Joseph Guhlin is available at github:jguhlin/lostruct-py.

Installation

To install the package, make sure you have devtools (install it with install.packages("devtools") if needed), and then run

```
install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)
```

Note: the library is called lostruct.

Using the R package

The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:

A quick example : in four lines of code, reads in chromosome 22 from a TPED, and does local PCA.
Setting up data : after documenting where the data are from, does local PCA on a small subset of the whole dataset, to establish how the functions work.
Drop in your own data and make a report with the scripts provided in the templated/ directory, modified from the Medicago analysis.

Prerequisites

To use the functions to read in windows from a VCF or BCF file, you will need bcftools.
To compile the example report, you probably want templater.

Quick start

The three steps are:

Compute the local PCA coordinates - done with eigen_windows().
Compute the distance matrix between windows - done with pc_dist() on the output of eigen_windows().
Visualize - whatever you want; MDS is implemented in R with cmdscale().
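The three steps can be sketched in base R on simulated data. This is a minimal illustration of the idea, not the package's implementation: the helper names `window_pca()` and `low_rank()`, the window size, and the simulated genotypes are assumptions made for the example, and the distance used here (Frobenius distance between rank-k approximations of each window's covariance matrix, with eigenvalues rescaled to sum to one) is one reasonable choice that may differ in detail from `pc_dist()`.

```r
set.seed(1)
n_samples <- 20; n_snps <- 500; win_size <- 50; k <- 2

# Simulated genotypes: one row per variant, one column per sample (values 0, 1, 2).
geno <- matrix(rbinom(n_snps * n_samples, size = 2, prob = 0.3),
               nrow = n_snps, ncol = n_samples)

# Step 1: top-k eigenvalues and eigenvectors of the sample covariance in one window.
window_pca <- function(x, k) {
  covmat <- cov(x)                       # samples are columns: n_samples x n_samples
  e <- eigen(covmat, symmetric = TRUE)
  list(values = e$values[1:k], vectors = e$vectors[, 1:k, drop = FALSE])
}
starts <- seq(1, n_snps, by = win_size)
pcs <- lapply(starts, function(s) window_pca(geno[s:(s + win_size - 1), ], k))

# Step 2: distance between windows, via rank-k approximations V diag(w) V^T.
low_rank <- function(p) {
  w <- p$values / sum(p$values)          # rescale eigenvalues to sum to one
  p$vectors %*% (w * t(p$vectors))
}
approx <- lapply(pcs, low_rank)
n_win <- length(approx)
pcdist <- matrix(0, n_win, n_win)
for (i in seq_len(n_win)) for (j in seq_len(n_win)) {
  pcdist[i, j] <- sqrt(sum((approx[[i]] - approx[[j]])^2))
}

# Step 3: visualize with classical MDS.
mds <- cmdscale(pcdist, k = 2)
plot(mds, xlab = "MDS 1", ylab = "MDS 2")
```

Working with the rank-k matrices rather than the eigenvectors themselves makes the comparison insensitive to the arbitrary sign of each eigenvector.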

Data formats:

The function eigen_windows() basically wants your data in a numeric matrix, with one row per variant and one column per sample (so that x[i,j] is the number of alleles that sample j has at site i). If your data are already in this form, then you can use it directly.
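As an illustration of that layout, here is a sketch that converts diploid genotype strings to allele counts in base R. The input strings, the helper name `count_alt()`, and the choice to count copies of the alternate allele are assumptions for the example; `read_tped()` and `read_vcf()` may encode genotypes differently.

```r
# Hypothetical genotype calls: rows are variants, columns are samples.
gt <- matrix(c("0/0", "0/1", "1/1",
               "0|1", "./.", "0/0"),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("site1", "site2"), c("sampA", "sampB", "sampC")))

# Count copies of the alternate allele ("1"); calls with a missing allele (".") become NA.
count_alt <- function(g) {
  alleles <- strsplit(g, "[/|]")[[1]]
  if (any(alleles == ".")) return(NA_integer_)
  sum(alleles == "1")
}

# x[i, j] is the allele count of sample j at site i.
x <- apply(gt, c(1, 2), count_alt)
print(x)
```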

We also have two methods to get data in from standard formats, TPED and VCF. Neither is extensively tested: double-check what you are getting out of them.

TPED: the read_tped() function will read in a tped file and output a numeric matrix like the above. For instance:
```
snps <- read_tped("mydata.tped")
```
VCF, in memory: the read_vcf() function does the same. For instance:
```
snps <- read_vcf("mydata.vcf")
```
BCF, not in memory: instead of a matrix, eigen_windows() can take a function that, when called, returns the submatrix corresponding to the appropriate window (see the documentation). Since only one window needs to be in memory at a time, this reduces the memory footprint. We use bcftools to extract the windows, so you need bcftools installed, and your VCF file must be converted to BCF (or bgzipped) and indexed. To do this, for instance, run:
```
bcftools convert -O b mydata.vcf > mydata.bcf
bcftools index mydata.bcf
```
Once you have this, the function vcf_windower() will create the window extractor function. For instance,
```
snps <- vcf_windower("mydata.bcf",size=1e3,type='snp')
snps(10)
```
will return the 10th window of 1,000 SNPs in the file mydata.bcf.

In any case, the next step is:

```
pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)
```

which gives you pcs, a matrix whose rows give the first two eigenvalues and eigenvectors for each window, and pcdist, the pairwise distance matrix between those windows.

Standalone code

Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is

recode : turn bases into numbers
PCA : find local PCs
distance : compute distance matrix between windows from local PCs
MDS : visualize the result

There are standalone examples for each of the three datasets studied:

POPRES (Homo sapiens, SNP chip data from a few worldwide populations)

Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.

POPRES_SNPdata_recode12.R : recodes the TPED as numeric
POPRES_cov.R : computes covariance matrix for the entire chromosome 1
POPRES_PCA_win100.R : computes local PCs
POPRES_jackknife_var.R : estimates SE of local PCs
POPRES_distance.R : computes distance matrix from local PCs
POPRES_MDS.R : finds and plots MDS visualization of distance matrix

DPGP (Drosophila melanogaster population genome project)

Chromosome 3L is the example given.

DPGP_recode_and_cov.R : recodes data as numeric, removes individuals with more than 8% missing data, sites with more than 20% missing data, and computes whole-chromosome covariance matrix
DPGP_PCA_plot.R : plots PCs for entire 3L
DPGP_PCA_win103.R : computes local PCs along 3L in windows of 1000 SNPs
DPGP_var_between_win.R : computes variance of PCs between windows
DPGP_jackknife_var.R : does jackknife estimate of SE for PCs on windows of 1000 SNPs
DPGP_distance.R : computes distance matrix from local PCs
DPGP_MDS_1d.R : computes and plots MDS plots from distance matrix
DPGP_get_extreme_points.R : identifies extreme points (with interaction)
DPGP_combine_extremes_and_get_cov.R : combines each of three sets of extreme windows and computes covariances for each

Medicago (Medicago truncatula hapmap)

For Medicago, the pairwise distances are computed for all 8 chromosomes together; MDS is then applied to the combined distance matrix, and the subset of the MDS result belonging to each chromosome is used.

Medicago_VCF_recode.py : recodes VCF file as numeric
Medicago_recode_and_cov.R : computes covariance matrix for (entire) chromosome 1
Medicago_PCA_win104.R : computes local PCs for chromosome 1
Medicago_distance_all_chr.R : computes a distance matrix from PC information
Medicago_MDS.R : computes and plots MDS plots from the distance matrix

A note on implementation:

This method works through the genome doing something (PCA on the covariance matrix) one window at a time. Because of this, it can be frustratingly slow to first load the entire dataset into memory. There are several methods implemented here to avoid this; for instance, vcf_windower(), which is used to compute PCs for the Medicago data. The interface is a function that takes an integer, n, and returns a data frame of the genomic data in the nth window.
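The interface can be illustrated with a small closure over an in-memory matrix. This is only a sketch of the shape of such a function: `make_windower()` is a hypothetical helper, not part of lostruct, and a real extractor such as the one built by `vcf_windower()` would read each window from disk rather than from a matrix that is already loaded.

```r
# Build a function that maps a window index n to the data for the nth window.
make_windower <- function(x, size) {
  n_windows <- floor(nrow(x) / size)     # a trailing partial window is dropped
  function(n) {
    if (n < 1 || n > n_windows) stop("window index out of range")
    rows <- ((n - 1) * size + 1):(n * size)
    as.data.frame(x[rows, , drop = FALSE])
  }
}

geno <- matrix(seq_len(60), nrow = 12, ncol = 5)   # 12 made-up variants, 5 samples
win <- make_windower(geno, size = 4)
second <- win(2)                                   # rows 5 to 8 of geno
print(second)
```

Because the caller only ever asks for one window at a time, the function behind the interface is free to fetch that window however it likes.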


local_pca's Issues

The effect of sample number and relationship on local PCA results

I recently used local PCA to detect potential haplotype regions in our SNP data. The program ran normally and found some outliers.

But I wonder whether the number of samples in each population and the relatedness between samples will bias the result. Our data include two populations, one with 40 samples and the other with 17 samples, and some samples are very closely related. If you could kindly provide some advice or code, I would appreciate it.

Thanks

SNP filtering recommendations

I'm working with a data set that we want to try Local_PCA with. I'm using a VCF file that was filtered by a collaborator, who used these VCFtools filtering options:

--remove-indels --max-missing 1 --minDP 5 --maxDP 500 --minGQ 30 --plink --mac 2 --remove-filtered-all

And they only kept the chromosome-level scaffolds. This resulted in 256,649 SNPs divided among the chromosomes, which is many fewer SNPs than I'm used to working with. In the past, I've had many more SNPs on a single chromosome than in this whole assembly. The chromosome with the most SNPs has 14,364 SNPs and the one with the fewest has 3,148 SNPs. Do you have any recommendations for filtering differently, or is this fine as is?

I know the human data set also had a low number of SNPs and 100 SNPs/window was used. If I keep it as is, I'm assuming "-t snp -s 100" would work for 100 SNPs/window, as with the human dataset.

Thanks for any help you can provide
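As a rough guide to what 100-SNP windows would give with the counts quoted above, the number of complete windows per chromosome is the SNP count divided by the window size, rounded down. This is only arithmetic on the numbers in the question; whether that many windows is enough depends on the data.

```r
window_size <- 100
snps_per_chrom <- c(most = 14364, fewest = 3148)   # counts quoted in the question

# Integer division gives the number of complete windows on each chromosome.
n_windows <- snps_per_chrom %/% window_size
print(n_windows)   # 143 and 31 complete windows
```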

Is vcf_windower compatible with the last versions of bcftools?

Hello,
I have been using lostruct successfully, including the function to load a BCF window by window, with the following command on my university's server, which has bcftools 1.8 (loaded with module load bcftools):
snps <- vcf_windower("capelin_NWA_sorted.bcf",size=100,type='snp', sites= vcf_positions("capelin_NWA_sorted.bcf"))

However, I am now trying to run it on an AWS server on which we have installed bcftools 1.9, and I am running into errors. I managed to work around the sites and samples arguments by providing a matrix and a vector directly instead of using the vcf_positions function, but then the function fails to process the windows and I get the error
" Error. Is bcftools installed?\n"

Do you have any idea why that is? How to solve that? Is it a problem of version of bcftools? permissions?
I am using exactly the same vcf/bcf files and the same code.
This is for a course about adaptation genomics.

Thank you for your help
Claire

re-run medicago with distance instead of distance^2

... after checking which version went into the paper.

github repo not installing with devtools

Hi guys!

I love lostruct - and I use it often! My computer died recently, so I'm reinstalling favourite packages, but I'm having a problem with the current tar release

I have the following versions of data.table and devtools:
data.table_1.14.2 devtools_2.4.2

My error when installing lostruct is:

> devtools::install_github("petrelharp/local_pca/lostruct")
Downloading GitHub repo petrelharp/local_pca@HEAD
Error in utils::download.file(url, path, method = method, quiet = quiet, :
download from 'https://api.github.com/repos/petrelharp/local_pca/tarball/HEAD' failed

Thanks!
Josie

document medicago methods

run_on_medicago.R takes in a .json
and then summarize_run.Rmd makes a report

chromosome names gives me problems

Hi,

thanks for this awesome tool! We have tried it on our datasets and find it very useful.
I am currently trying to run lostruct on a dataset in which the first nine chromosomes are named 01, 02, 03, etc., and this seems to cause problems. I think the chromosome naming is the cause, since it worked fine for chromosomes 10 and onwards, but I am not entirely sure.

I'm running the templated run_lostruct.R script.
The issue seems to appear when, or before, the program generates the pca.csv file, which turns out full of NAs. I had a look at the regions.csv file and saw that the chromosome name in the "chrom" column is "1" rather than "01", etc.

Any help/insight on what might be the issue and solution would help me greatly.
thank you so much!

all the best
Siv H
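One common way a name like `01` turns into `1` in R is type guessing when a table is read back in: a column of digit-only strings is parsed as numeric. The sketch below shows the effect with base R's `read.csv()` and how fixing the column class avoids it. Whether this is what happens inside `run_lostruct.R` is an assumption; the file contents here are made up for the example.

```r
# Made-up contents of a regions file with zero-padded chromosome names.
csv_text <- "chrom,start,end\n01,1,5000\n02,1,5000\n10,1,5000\n"

# Default type guessing parses the chrom column as integer.
guessed <- read.csv(text = csv_text)

# Declaring the column as character keeps the names exactly as written.
kept <- read.csv(text = csv_text, colClasses = c(chrom = "character"))

print(guessed$chrom)   # leading zeros are lost
print(kept$chrom)      # names are preserved as text
```

If a lookup by chromosome name is done after such a round trip, "1" no longer matches "01", which would be consistent with NAs appearing only for the first nine chromosomes.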

Unable to plot PCs of the corners

Receiving this error and not sure why. Any ideas?

`Here are all pairwise plots of the first 4 PCs for each of the three corners:

layout(t(1:3))
for (i in 1:(corner.npc-1)) {
    for (j in (i+1):corner.npc) {
        for (k in 1:ncol(mds.corners)) {
            vectors <- matrix( corner.pca[[k]][-(1:(1+corner.npc))], ncol=corner.npc )[,c(i,j)]
            colnames(vectors) <- paste("PC", c(i,j))
            par(mgp=c(0.7,0.7,0), mar=c(2,2,2,0)+.1)
            plot(vectors, pch=pop.pch[samps$population],
                 col=pop.cols[samps$population],
                 xaxt='n', yaxt='n' )
            if (i==1 && j==2) {
                mtext(paste("corner",k),side=3)
            }
        }
        if (do.pdfs) { pdf_copy(plot.id=paste(i,j,sep="_")) }
    }
}

Error in matrix(corner.pca[[k]][-(1:(1 + corner.npc))], ncol = corner.npc): 'data' must be of a vector type, was 'NULL'`

Displaying results error

I'm using local_pca with 41 closely related species. The reference that everything is mapped to has 32 chromosomes, so I used tabix to divide the VCF file into separate files for each of the largest scaffolds, e.g.:
tabix -h selectedSNPswREFSNPOnly.vcf.gz HiC_scaffold_1 > selectedSNPswREFSNPOnly.HiC_scaffold_1.vcf

Next I converted the VCF files to BCF, e.g.:
bcftools convert -O b selectedSNPswREFSNPOnly.HiC_scaffold_1.vcf > selectedSNPswREFSNPOnly.HiC_scaffold_1.bcf

Finally, I indexed the files, e.g.:
bcftools index selectedSNPswREFSNPOnly.HiC_scaffold_1.bcf

I now had 32 bcf and bcf.csi files in my data directory. I also created a sample_info.tsv file with only "ID" and "population" columns and put it in the data directory. As a test, I decided to work with the four smallest bcf files, representing snps from scaffolds 29, 30, 31, and 32.

Previously, I was working with the medicago data and input my data instead, so here are the stats for these four scaffolds:

| | HiC_scaffold_29 | HiC_scaffold_30 |
|---|---|---|
| nsnps | 1,945,193 | 2,044,694 |
| nbp.HiC_scaffold_1 | 28266565 | 27710915 |
| spacing.mean | 14 | 13 |
| spacing.5% | 1 | 1 |
| spacing.25% | 2 | 2 |
| spacing.50% | 4 | 4 |
| spacing.75% | 10 | 10 |
| spacing.95% | 43 | 43 |
| spacing.max | 23964 | 23423 |

| | HiC_scaffold_31 | HiC_scaffold_32 |
|---|---|---|
| nsnps | 1,660,526 | 1,496,728 |
| nbp.HiC_scaffold_1 | 26516423 | 23710698 |
| spacing.mean | 15 | 15 |
| spacing.5% | 1 | 1 |
| spacing.25% | 2 | 2 |
| spacing.50% | 5 | 5 |
| spacing.75% | 12 | 11 |
| spacing.95% | 54 | 47 |
| spacing.max | 13397 | 16212 |

In the Medicago example, I think an -s of 5,000 was chosen because there were 5 million SNPs in chromosome 1. I chose an -s of 1,500 because there were roughly 1.5 million SNPs in my smallest scaffold. Here was my command:
/Applications/local_pca/templated/run_lostruct.R -i data1 -t snp -s 1500 -I data1/sample_info_genus.tsv

and to visualize my results, I ran this command:
Rscript -e 'templater::render_template("/Applications/local_pca/templated/summarize_run.Rmd",output="lostruct_results/type_snp_size_1500_weights_none_jobid_023972/run_summary.html",change.rootdir=TRUE)'

For some reason, each "plot_corner_pca" PDF shows only 3 plots for my results even though I'm working with 4 chromosomes/scaffolds. Additionally, mds_pairplot-1.png is blank and no PDF was created for that plot. I will share the HTML file here so you can take a look at my results. Any help would be greatly appreciated. Thanks

file:///Users/albertlab/Desktop/Tembusu/local_PCA/Tembusu/lostruct_results/type_snp_size_1500_weights_none_jobid_023972/run_summary.html

run_summary.html.txt

Using lostruct package on pca data coming from other programs

I read your article on local PCA with much interest, and I would like to use this method on my dataset. Could I ask your advice on how to integrate my kind of data into the lostruct pipeline?

I am working with seaweed flies and using genomic data to detect structural variation and population structure. As we have hundreds of genomes at low coverage, we tend to use probabilistic methods to infer covariance and PCA (as implemented in angsd and pcangsd), and I was wondering how to apply the method that you developed to that sort of data.

For now, I have obtained a covariance matrix between the individuals and/or a PCA matrix along the genome (by splitting the initial beagle data into windows of X SNPs and applying pcangsd to each). First of all, have you ever tested that, or considered implementing it in the analyses? Or would you consider an intermediate step where the user could feed the eigen_windows function a set of per-window covariance matrices?

The simplest way is perhaps to transform each window's PCA into a matrix to feed to the function pc_dist. Then I would be able to use the functions that you implemented to calculate distances and MDS between windows. To do so, I'd like to be sure of the format. 1) How can I obtain the first value (the sum of squares of the covariance matrix)? 2) If I have understood the format correctly, there are then k eigenvalues (this is OK), and then 3) k eigenvectors (are they the coordinates of the n individuals along pc_1, then pc_2, ..., pc_k, or rotation vectors?)

Thanks a lot for your attention and help!
Claire Mérot
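One way to do what the question proposes is to build, for each window, a row with the layout it describes: the sum of squared entries of the covariance matrix, then the top k eigenvalues, then the top k eigenvectors (each of length n, one entry per individual). The sketch below does this in base R for precomputed covariance matrices. The helper name `cov_to_row()` and the simulated matrices are assumptions for the example, and the layout should be checked against the output of `eigen_windows()` on a small test case before the result is passed to `pc_dist()`.

```r
# Pack one window's covariance matrix into a single row:
# (sum of squares, k eigenvalues, k eigenvectors of length n each).
cov_to_row <- function(covmat, k) {
  if (anyNA(covmat)) return(rep(NA_real_, 1 + k + k * nrow(covmat)))
  e <- eigen(covmat, symmetric = TRUE)
  c(sum(covmat^2), e$values[1:k], e$vectors[, 1:k])
}

set.seed(2)
n_ind <- 6; k <- 2
# Stand-ins for per-window covariance matrices produced by another program.
cov_list <- lapply(1:3, function(i) cov(matrix(rnorm(40 * n_ind), ncol = n_ind)))

# One row per window, 1 + k + k * n_ind columns.
pcs <- t(sapply(cov_list, cov_to_row, k = k))
print(dim(pcs))
```

The eigenvectors here come from the individual-by-individual covariance matrix, so each has one entry per individual.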

Questions regarding bcftools

Hi everyone,

I am very interested in local PCA but am a bit hesitant about installing bcftools on my Windows 8 machine.

I have never done any bioinformatics on Windows, so I am wondering whether I need to set up a Linux/Python environment first or whether I can just install it. Thank you very much if you can help me with this super naive question!

Best regards,
Han Xiao

Medicago example out of date

Dear all,
I could not find the function named query_genotypes, which appears in "medicago_data_setup.html".
I checked "medicago_data_setup.RMD", where it has been changed to vcf_query, which I also failed to find.
Would you mind giving me some suggestions?

Best regards,
Sandy

Error in computing distances

I am using run_lostruct.R to run lostruct over 14 chromosomes contained in .bcf files contained in a directory full_data/ with the following command:

Rscript run_lostruct.R -i full_data/ -t bp -s 5000 -I full_data/sample_info.tsv

It successfully finds PCs, however I am getting an error during the step when it computes distances between them. It returns the following after running for a fairly long time (I'm running with 120G of mem):

Error in pmax(0, (out + t(out))/2) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
Calls: system.time -> pc_dist -> pmax
Timing stopped at: 1.484e+05 1505 1.502e+05
Execution halted

If I use a window size of 10kb or larger, it runs until completion. Is there a workaround or a solution to get this to run on a smaller window size? Thanks! Just let me know if you need any more information from my end.
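A possible explanation, offered as an assumption rather than a diagnosis: the distance matrix has one row and one column per window, so its size grows with the square of the number of windows, and base R vectors with more than 2^31 - 1 elements are "long vectors", which some functions do not accept. The arithmetic below uses a hypothetical total genome length (the issue does not state one) to show how halving the window size can cross that limit.

```r
genome_bp <- 3e8                      # hypothetical total length across all chromosomes
limit <- .Machine$integer.max         # 2^31 - 1 = 2147483647

# Number of entries in the square window-by-window distance matrix.
n_entries <- function(window_bp) {
  n_windows <- ceiling(genome_bp / window_bp)
  n_windows^2
}
entries_5kb <- n_entries(5000)        # 60000 windows, 3.6e9 entries
entries_10kb <- n_entries(10000)      # 30000 windows, 9.0e8 entries
print(c(entries_5kb > limit, entries_10kb > limit))

# Largest number of windows whose squared count still fits under the limit.
max_windows <- floor(sqrt(limit))
print(max_windows)                    # 46340
```

Under that assumption, keeping the total number of windows below roughly 46,000 per run (for example by analysing chromosomes in smaller groups) would avoid the limit.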

detect when data are not ACGT and give warning suggesting 'recode=FALSE'

If the alleles are all 0/1, then it returns NA unless recode=FALSE.

Perhaps this should be an option to vcf_windower.

add sqrt to pc_dist

Now it computes squared distance. Need to fix tests and check usage of the function.

error on the read in

I'm using the stand-alone R script (although the same issue arises using the R package) to run local PCAs on windows across a single scaffold in my genome with the command:

Rscript run_lostruct.R -i . -t bps 1000 -I sampleinfo

I get the following error immediately on the read in:
Finding PCs for ... and writing out to ...pca.csv and ...csv Taking input= as a system command ('bcftools query -f '%CHROM\t%POS\n' /ohta/julia.kreiner/waterhemp/data/fixed_assembly/reveal_psuedoassembly/TSR_work/lostruct/Scaffold_11_qualdpmissing.bcf') and a variable has been used in the expression passed to input=. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure envionment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.

Thanks!

cmdscale issue

I've been running lostruct based on the Medicago documentation. I have an issue where cmdscale won't run at a smaller SNP window size due to missing data, but I don't know how to get it to work around that. If I increase the window size it will work, so it has to do with windows that have too much missing data, I assume.

Here's my code:
window_size <- 400
bcf.file <- paste(data_directory, "/", data_name,".bcf",sep="")
samples <- read_tsv(paste(data_directory, "/", data_name,".samplelist.txt",sep=""),col_names = F)
colnames(samples) <- c("sample")
sites <- vcf_positions(bcf.file)
win.fn.snp <- vcf_windower(bcf.file, size=window_size, type="snp", sites=sites)
system.time( snp.pca <- eigen_windows(win.fn.snp,k=2, mc.cores=10) )
system.time( pcdist <- pc_dist( snp.pca ) )
na.inds <- is.na( snp.pca[1,] )
k_kept <- 20
mds <- cmdscale( pcdist[!na.inds,!na.inds], eig=TRUE, k=k_kept )

And the error message:

Error in cmdscale(pcdist[!na.inds, !na.inds], eig = TRUE, k = k_kept) :
NA values not allowed in 'd'
Any ideas?
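A hedged observation on the code above: `is.na( snp.pca[1,] )` tests the entries of the first row, whereas the README describes the output of `eigen_windows()` as having one row per window, so the first column (`snp.pca[,1]`) may be what was intended. Independently of that, windows can be screened directly on the distance matrix. The sketch below uses a small made-up matrix and a hypothetical helper `drop_na_windows()`; it is one way to guarantee that `cmdscale()` sees no `NA`, not a statement about how the templated scripts handle it.

```r
drop_na_windows <- function(d) {
  # A window whose entire row is NA could not be compared with any other window.
  bad <- rowSums(!is.na(d)) == 0
  d_ok <- d[!bad, !bad, drop = FALSE]
  if (anyNA(d_ok)) stop("NA distances remain between retained windows")
  list(dist = d_ok, kept = which(!bad))
}

set.seed(3)
pts <- matrix(rnorm(12), ncol = 2)            # six made-up windows in two dimensions
pcdist <- as.matrix(dist(pts))
pcdist[4, ] <- NA; pcdist[, 4] <- NA          # pretend window 4 had too much missing data

clean <- drop_na_windows(pcdist)
mds <- cmdscale(clean$dist, k = 2)
print(clean$kept)                             # indices of the windows that were kept
```

Keeping the `kept` indices makes it possible to map the MDS coordinates back to the original window positions.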

error when running on multiple scaffolds

Hello,

I am running local PCA on either one scaffold or all the large scaffolds (>10M bp) of my whole genome data.
For each scaffold run individually, the program runs fine and I get all the results including the PCA of the extracted corners.
When I use a bcf including all the large scaffolds, the different steps are running fine until making the PCA plots from the extracted corners.

I get the following error message with windows of 100 SNPs:
#Warning in sweep(f(k), 2, colm, "-"): STATS is longer than the extent of 'dim(x)
#[MARGIN]'
#Error in array(STATS, dims[perm]): 'dims' cannot be of length 0

Or this error with windows of 1000 SNPs
#Taking input= as a system command ('bcftools query -f '[ %GT]\n' -r MRVK01000135.1:19528169-20462752 NA') and a variable has been used in the expression passed to input=. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
#Error in colMeans(x, na.rm = TRUE): 'x' must be an array of at least two dimensions

I have attached the results in pdf.

Could you please help on how I could resolve it?

Many thanks,

Marie

all_scaffolds_100SNPs.pdf
all_scaffolds_1000SNPs.pdf

run_lostruct.R error

Hi, I have local_pca installed and the main functions (popres_example.R) running without error. I'm getting stuck on an error in the templated script though, which would be very useful for the more detailed outputs:

templated % Rscript run_lostruct.R -t snp -s 20 -i joined_data -I joined_data/sample_info.tsv
Finding PCs for chromsome chr1 in file /Users/ryan/local_pca-master/templated/joined_data/all_chrs.bcf and writing out to lostruct_results/type_snp_size_20_weights_none_jobid_269154/chr1.pca.csv and lostruct_results/type_snp_size_20_weights_none_jobid_269154/chr1.regions.csv
Error in n < 1 || n > max(chrom.breaks) :
'length = 49' in coercion to 'logical(1)'
Calls: ->
In addition: Warning messages:
1: In any(chrom.wins) : coercing argument of type 'double' to logical
2: In vcf_windower_snp(file = file, sites = sites, size = size, samples = samples) :
Trimming from chromosome ends: 1: 14 SNPs.
Execution halted

I get the same error with -i templated/data

Thanks
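The message `'length = 49' in coercion to 'logical(1)'` is what R 4.3.0 and later report when `||` is given a vector longer than one; earlier versions used only the first element. Which R version is in use here, and why `n` has length 49, are not stated in the issue, so the following is only a sketch of the general pattern with made-up values: wrap the vectorised comparison in `any()` so that the condition has length one.

```r
chrom.breaks <- c(0, 10, 25)        # made-up cumulative window counts
n <- c(3, 12, 30)                   # a vector of window indices, not a single index

# `n < 1 || n > max(chrom.breaks)` is an error in R >= 4.3.0 because length(n) > 1.
# The element-wise `|` combined with any() gives a single TRUE/FALSE instead.
out_of_range <- any(n < 1 | n > max(chrom.breaks))

# Per-element check, useful for finding which indices are the problem.
which_bad <- which(n < 1 | n > max(chrom.breaks))
print(out_of_range)
print(which_bad)
```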

Issues with | in contig names

Hi Peter,

Thanks for another great tool. I'm having issues running lostruct with a new dataset that has contig names along the lines of ${numeric_id}|${assembly}, e.g. 32351|arrow. This breaks the logic used to extract variant data as the pipe is unescaped in the call to system(), and therefore the bcftools query command gets piped into the (thankfully non-existent) arrow command (i.e. bcftools query -f '[ %GT]\n' -r 000000F|arrow:280707-300706 -s KA17BG005,Cala2,...).

While "then don't use | in your contig names" is probably valid advice, I'm not able to change it in this dataset without breaking a lot of other people's work, so I'll have to come up with a more robust workaround.

I know of the python implementation, whose use of cyvcf2 should fix this, though my impression was that it's not quite ready for production use. Is that correct?

As it happens, I have an R library called WindowLickR that does exactly this type of V/BCF query using an Rcpp version of the logic you have here, calling htslib rather than system("bcftools ..."). For now I will hack something together myself using windowlickr to provide a "window function" that takes n and returns the genotypes as your code would. I'm happy to try fixing the issue for everyone by upstreaming this solution using windowlicker, if a dependency on that is acceptable to you.

Best,
Kevin
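For the specific failure described here, base R's `shQuote()` is one way to keep the shell from interpreting `|` in a region string before it is passed to `system()`. The sketch below only builds and prints the command string, using the region from the example above and a placeholder file name; it does not run bcftools and is not a patch to lostruct.

```r
region <- "000000F|arrow:280707-300706"
bcf_file <- "mydata.bcf"                      # placeholder file name

# Unquoted: the shell would treat everything after "|" as a second command.
unsafe_cmd <- paste("bcftools query -f '[ %GT]\\n' -r", region, bcf_file)

# Quoted: the region and file name each reach bcftools as a single argument.
safe_cmd <- paste("bcftools query -f '[ %GT]\\n' -r",
                  shQuote(region, type = "sh"), shQuote(bcf_file, type = "sh"))

cat(unsafe_cmd, "\n")
cat(safe_cmd, "\n")
```

Quoting every interpolated value, not only the region, also covers file paths that contain spaces.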

improve documentation for win.fn in vcf_windower

NA values in PCA and MDS results

Dear Dr. Ralph,

I am attempting to run your lostruct program, and it appears that it is working. However, I am getting a large portion of windows with NA values in the PC and MDS output (up to 30% of windows on a chromosome). At first, I thought this may be because I was using 50kbp windows and there wasn't enough variation in some windows, but I still get a few NA values even when using the program with windows of 10,000 SNPs. I am using a dataset that has no missing data and 22 individuals. I am running these analyses using the templated directory method and the command line.

Any thoughts on what may be causing these results or what I could do to troubleshoot?

Thanks,
Joe Manthey

Error in cmdscale(pc.distmat[!na.inds, !na.inds], k = opt$nmds) : NA values not allowed in 'd'

Dear Peter,

Sorry to bother you again. I restarted the analysis on a slurm cluster with a bash script like this:

#SBATCH -t 3-23:59:59
# Define partition
#SBATCH --partition=long
# Set number of nodes to run
#SBATCH --nodes=1
# Set number of cpus
#SBATCH -c 16
# Set memory
#SBATCH --mem=128G
# Define email for script execution
#SBATCH [email protected]
# Define type notifications
#SBATCH --mail-type=ALL
###################################################################

echo "Load module"
module purge
module load r/4.3.1
module load bcftools/1.15.1

echo "run Local PCA for 5kb windows"
Rscript --vanilla run_lostruct.R -i data -t bp -s 5000 -I data/sample_info.tsv > lostruct-${SLURM_JOB_ID}.Rout 2>&1

As you can see, I used the -t bp and -s 5000 options. I did not get the error described below when I previously used the -t snp -s 1000 options. The run_lostruct.R script remains unchanged.

After 15 hours, the job stops with the error message:

Error in cmdscale(pc.distmat[!na.inds, !na.inds], k = opt$nmds) : 
  NA values not allowed in 'd'
Calls: cbind -> cbind -> data.frame -> cmdscale
Execution halted

It seems to me, however, that the NAs are managed by the script and that the MDS calculation is performed without them, no? Where do you think the problem comes from?

Here's a link to download the *.pca.csv and regions.csv files, as well as the config.json file: https://filesender.renater.fr/?s=download&token=d83c40ce-8178-4fff-9d89-9cde1b3d3b2a

Thanks in advance for your help.

Best regards.

Error generating PCs of corners

Hi Peter!

I'm back running lostruct on some new data. I'm working from the Medicago example and am at the step of generating PCs from the corners. I run corner.pcs <- eigen_windows( data=corner.winfn, k=2, do.windows=1:ncol(mds.corners), mc.cores=10) but am getting an error (pasted below) and I'm not sure why. Thanks in advance for your help!

Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent)

error in reading VCF

Hello, first time I'm using lostruct.

I'm trying to read a VCF file using this code after installing the package in RStudio:
snps <- read_vcf("my_vcf.vcf.gz")

and I'm getting this error:

snps <- read_vcf("my_vcf.vcf.gz")
|--------------------------------------------------|
|==================================================|
Error in dim(haps) <- c(2, dim(dips)) :
dims [product 270590588] do not match the length of object [268886609]

I also tried converting the VCF to BCF as in the example in the manual, but I still get the same error.

I would really appreciate some help,

Thanks

yael

Error when I try to plot MDS

Dear Peter,
I am working on an Anopheles dataset and trying to do a local PCA using the great templated scripts. My first question is about the input files. Is it better to use a single BCF containing all the chromosomes, or one BCF file per chromosome?

The run_lostruct.R works perfectly and generates the pca.csv, regions.csv, config.json and mds_coords.csv files
config.json
mds_coords.csv
Bwambae.AgamP4ROAST.flt.snp.inv.annot.pass.GQ20.norm.chr2L.HRUN.sort.norep.indflt70.genflt80.maf.regions.csv
Bwambae.AgamP4ROAST.flt.snp.inv.annot.pass.GQ20.norm.chr2L.HRUN.sort.norep.indflt70.genflt80.maf.pca.csv

When I run the script from summarize_run.Rmd,

Rscript -e 'templater::render_template("summarize_run.Rmd",output="lostruct_results/type_snp_size_1000_weights_none_jobid_591404/run_summary.html",change.rootdir=TRUE)'

I get two errors:

The first is that the mds_pairplot-1.pdf is not generated

The second error occurs at the step "# Plot corners and MDS along the chromosome", more precisely at line 177, where the function chrom.plot is called. The output is

Error in xy.coords(x, y): 'x' and 'y' lengths differ

Do you have any idea where the problem might be coming from? I'm also attaching the run_summary.md because for some reason I can't attach the summary in HTML format.
run_summary.md

Best regards

Error in `colnames<-`(`tmp`, value = out.colnames)

Hi there,

I am using lostruct and the test data https://github.com/petrelharp/local_pca/blob/master/lostruct/tests/testthat/test.bcf from https://github.com/petrelharp/local_pca/tree/master/lostruct/tests/testthat

The code I used is

bcf.file <- 'test.bcf'
sites <- vcf_positions(bcf.file)
win.fn.snp <- vcf_windower(bcf.file, size=100, type="snp", sites=sites) 
snp.pca <- eigen_windows(win.fn.snp,k=2)

I got the following error messages. Do you know what happened? I tried to look into it, and suspect the error is due to lines 77-78 of the script https://github.com/petrelharp/local_pca/blob/master/lostruct/R/eigen_windows.R. However, I still have no idea what exactly happened.

Error in `colnames<-`(`*tmp*`, value = out.colnames) : 
  attempt to set 'colnames' on an object with less than two dimensions

Thanks in advance!

Best,
Y. L.