dportik / dadi_pipeline Goto Github PK

An accessible and flexible tool for fitting demographic models with dadi using custom or published models (available here), conducting goodness of fit tests, and plotting.

License: GNU Lesser General Public License v3.0

Python 99.00% R 1.00%

dadi_pipeline's People

Contributors

Stargazers

Watchers

dadi_pipeline's Issues

about asymmetric migration and adjiacent asymmetric migration models

Dear dportik
I have used dadi for a long time, now I used three populations to calculate the best models, now I need Divergende, asymmetric migration and Divergence adjacent asymmetric migration models, that were not concluded in the script. you can send to my Email:[email protected]. Thankyou!

And another problem, I have encountered is that, when Round1 finished I get several tables. should I choose the maximum log-likelyhood or the minimum ones for the parameter to use. For example,

Model	param_set	Replicate	log-likelihood	theta	AIC	optimized_params
Refugia 2 with secondary contact	parameter set = [nu1, nuA, nu2, nu3, m1, m2, T1, T2]	18	-64414289	711737.1	1.29E+08	1.2559	0.6359	6.987	0.1385	0.0677	9.1648	7.3236
Refugia 2 with secondary contact	parameter set = [nu1, nuA, nu2, nu3, m1, m2, T1, T2]	46	-19011234	526348.7	38022484	1.3546	2.3947	29.5382	0.8774	0.3592	19.6276	4.8849

which line I should choose, Thanyou !

pop1 or pop2

Hello and thank you for developing this useful program! I'm determining demographic history between two divergent populations by using stairway plot with their respective SFS as well as a joint 2DSFS. Both sets of SFS were generated in ANGSD and realSFS command. Anyway, after modeling no_mig and no_mig_size in dadi, favoring no_mig, I've calculated Ne1 and Ne2, which are very similar to those estimated by stairway except they're flipped. As in the estimate from dadi for pop1 is much closer to that of pop2 from stairway. Furthermore, the difference between pop1 and pop2 is around 4 fold. I'm having trouble diagnosing how this "flip" occurred. Here is the joint SFS and the realSFS command used to generate. Thank you!

./angsd/misc/realSFS pop1_sfs.saf.idx pop2.saf.idx -P 24 -fold 1 >> 2d_Aug25.sfs

#joint sfs

1992768.151735 49159.185835 13568.706837 5144.128084 1913.596299 999.519330 205.482639 317.066135 20.735284 37.404963 69.785701 34.876663 0.057500 15.633086 28.633686 0.000000 0.000000 0.000000 0.000000 0.000000 6.317693 7.210363 0.000000 0.000000 0.000000 0.000000 38878.062382 20352.036277 9480.901455 4017.042444 2384.410338 395.734227 787.859164 136.130698 181.978437 103.240704 66.932155 0.000000 0.000000 3.707234 19.490405 0.000000 2.666079 6.410877 0.000004 0.000000 0.000000 0.000000 0.000000 0.016741 0.197921 0.000007 13176.123802 10293.696090 6750.689010 4360.456236 1260.212058 1934.486998 325.789728 393.768625 2.452084 0.232651 13.352794 85.345057 0.000000 0.160121 29.779686 2.186559 36.045907 0.000000 0.000000 0.000000 0.000000 0.000000 0.051804 4.477613 0.000000 0.000000 4733.461799 4144.621370 4835.960650 2195.747697 2574.639735 445.190697 550.637866 587.236537 525.537440 302.970953 11.219059 15.770535 35.735601 22.092268 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2140.306066 2668.668947 1674.437278 3372.508653 1197.347168 1643.737942 1723.931919 481.344369 190.733005 51.974370 78.733445 9.665332 47.180923 1.614863 0.003064 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 832.276756 1063.934277 1986.509981 88.099352 2156.664247 648.125382 531.768348 240.451802 391.700429 424.623079 85.777707 63.141629 194.847438 167.000549 0.009280 0.037480 18.773547 0.000000 0.000000 0.000000 0.000000 0.000954 7.968417 0.000000 0.000000 0.000000 739.604460 629.468934 423.212702 1776.325351 155.096740 517.812247 369.436148 770.468255 81.238149 222.710023 391.115134 33.517590 167.502818 24.537579 7.698388 99.847062 4.316497 0.000000 10.966750 0.000000 0.000000 10.217667 4.406112 0.000000 0.000000 0.000000 37.974083 148.534267 1002.414782 394.728825 600.268579 898.312924 564.789979 157.242107 1083.413795 6.000964 531.881782 15.462590 11.609458 0.248786 25.616397 100.979360 0.572815 0.000000 0.311193 0.000000 0.000000 0.499740 0.012581 0.000000 0.000000 0.000000 65.583895 356.504100 48.435549 100.705813 710.264225 582.189103 466.975016 317.604143 137.877517 146.310430 60.815252 65.132446 39.219711 0.964052 41.290089 12.159369 0.890080 0.000000 0.000008 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 280.351937 76.053207 14.792154 690.783181 160.884652 400.878739 719.555662 542.700832 265.625716 569.583987 367.542732 56.303517 309.836651 7.157761 132.896794 20.242474 74.024130 0.000000 0.000104 13.806615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 26.166161 0.084342 78.630046 83.936783 148.797783 59.864401 41.395521 245.395559 218.331637 280.571417 245.956991 212.876902 56.000214 256.813893 6.314497 21.573326 27.683155 0.002035 0.048725 49.278511 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.209926 0.008622 103.470051 28.412793 55.447948 189.654951 190.070071 34.126611 529.432590 433.663302 163.452780 127.750070 42.452156 180.250336 14.067964 130.855168 4.547521 0.127857 0.393179 0.308944 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 33.892208 0.529200 86.555434 94.957193 13.206530 244.676654 138.749181 103.011653 41.957844 125.369902 110.495783 149.753720 95.892346 78.798231 40.210882 152.607802 22.027314 0.567482 120.548154 0.501512 0.015848 0.000000 0.000000 0.000000 0.000000 0.000000 57.504125 55.802606 3.514069 6.097744 1.803927 242.849581 135.744904 366.046997 21.047931 33.703644 299.428843 381.726250 402.853409 151.273238 60.314593 129.770789 82.173396 57.089250 74.780582 0.006226 33.286141 0.057329 0.665927 0.000000 0.000000 0.000000 16.770003 22.760802 0.028306 8.965116 16.914629 89.614913 4.515001 26.291564 153.070598 361.663293 83.250935 106.498500 62.599476 262.648926 96.730943 28.560368 131.546492 38.824921 38.627045 7.889028 21.472512 59.431930 24.514336 0.148538 0.000010 0.000000 0.433045 0.009227 0.000842 57.404641 20.106025 1.485078 181.654651 99.985972 111.267726 58.919322 22.685005 39.551974 31.813405 201.139792 158.896916 105.849952 87.844511 138.254852 2.500197 29.970280 0.554229 0.071983 0.008672 22.303755 3.940354 0.000001 0.002929 0.000001 0.008032 74.444486 40.691182 0.001031 4.265159 12.311178 58.711206 82.566326 81.775446 83.319485 192.032526 112.411820 202.742810 27.777605 19.767527 36.328568 0.234738 75.123197 0.917092 0.000000 0.000000 1.869007 4.079902 0.000004 0.001853 0.000000 0.383542 0.391963 5.563357 0.000011 0.051695 0.348889 25.927175 9.254005 194.639757 109.704397 107.209612 37.574982 105.507996 41.590609 86.725656 71.394120 1.124972 106.548455 1.463686 0.000000 0.010401 0.132531 2.893621 0.000001 7.488652 0.000000 7.262298 0.000001 0.097736 0.000830 0.004157 4.071843 65.346136 1.877036 95.328365 69.697578 217.000244 25.466528 238.627492 25.210326 131.334381 230.132121 81.965808 106.780689 38.712011 0.334674 6.139749 1.965220 8.406371 0.000000 4.631695 2.379329 2.350995 0.000000 0.000000 29.168283 0.146997 3.583515 185.820498 0.397605 96.510176 35.340741 146.424199 3.979446 82.389971 118.410992 69.127401 43.019561 408.457062 52.627223 9.742085 96.259896 50.430773 16.201770 3.599595 0.000000 0.000000 7.468630 1.598952 0.000000 0.000000 18.062332 3.631159 0.114993 13.333394 0.060626 57.234638 21.589903 40.200792 3.680328 87.093076 41.306965 46.036296 82.799656 63.496949 119.192167 0.517299 42.741642 39.162238 56.844064 4.213102 0.000000 0.000000 0.000000 0.172942 0.000000 0.000000 0.000000 14.641175 0.000017 0.000639 0.019176 5.902029 0.614084 13.254259 51.035969 154.759617 105.433197 202.657164 10.082911 13.230453 6.421261 0.104979 63.612889 6.363969 46.137592 1.614277 0.000000 0.000000 0.000000 0.000070 0.000000 0.000000 0.000000 0.019008 0.000000 0.000000 0.318192 1.471147 3.367861 9.863580 40.276493 128.573828 65.340182 158.248501 58.799708 27.297875 155.084749 190.553413 82.754570 39.763432 131.167471 4.831462 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.004363 0.000000 12.232847 4.454641 9.646836 13.346731 0.220822 34.584083 21.834435 9.857928 164.860644 64.441342 177.068152 209.549961 292.830862 112.898308 95.391257 75.596053 0.000000 0.000001 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 36.155355 0.031164 27.017660 3.323254 41.169045 20.031561 0.020472 0.267904 6.467312 3.328968 55.180750 49.728680 244.161912 13.227688 173.461895 40.135761 16.678386 14.294364 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.079693 0.004631 0.000010 0.439000 24.163084 7.763451 0.004276 1.914963 18.786225 0.290164 1.088806 40.700870 59.991381 0.187427 75.931245 141.797836 9.513128 0.806444 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.933836 8.387877 0.025039 0.341367 12.786395 1.608408 11.514196 16.790264 122.690641 0.645593 55.723883 48.229917 56.197295 3.491971 0.000005 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.003778 14.347139 0.222802 29.294626 32.750169 10.553146 37.563229 23.807834 24.586892 34.740750 5.138429 135.442734 344.337475 104.496460 0.034712 0.000000 0.000000 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000004 22.737971 1.501783 59.774668 18.434579 1.395610 149.582179 2.282020 9.063602 134.657855 164.046699 304.018879 450.830369 720.402890 1731.442683

dadi_pipeline.py


#===========================================================================
# Import data to create joint-site frequency spectrum
#===========================================================================

#**************
 
#Create python dictionary from snps file
#dd = dadi.Misc.make_data_dict(snps)
dadi.Misc.make_data_dict_vcf("/media/molecularecology/Summit/CHTC/staging/Modern_Museum/fastqs/beagle/West_EAST_624_impute.vcf.vcf.gz", "/media/molecularecology/Summit/88_fastq/cleaned/AUG_GAMS/easySFS/easyp_pop2")
#dd = dadi.Misc.make_data_dict(fs)
fs = dadi.Spectrum.from_file("/media/Summit/CHTC/staging/Modern_Museum/fastqs/2d_Aug25.sfs")#, ['West', 'East'], projections = [30, 25], polarized = False)

#**************
#pop_ids is a list which should match the populations headers of your SNPs file columns
pop_ids=["West", "East"]

#**************
#projection sizes, in ALLELES not individuals should be 56, 50
proj = [30, 25]

#Convert this dictionary into folded AFS object
#[polarized = False] creates folded spectrum object
#fs = dadi.Spectrum.from_data_dict(dd, pop_ids=pop_ids, projections = proj, polarized = False)

#print some useful information about the afs or jsfs
print("\n\n============================================================================")
print("\nData for site frequency spectrum:\n")
print("Projection: {}".format(proj))
print("Sample sizes: {}".format(fs.sample_sizes))
print("Sum of SFS: {}".format(numpy.around(fs.S(), 2)))
print("\n============================================================================\n")

#================================================================================
# Calling external 2D models from the Models_2D.py script
#================================================================================
'''
 We will use a function from the Optimize_Functions.py script for our optimization routines:
 
 Optimize_Routine(fs, pts, outfile, model_name, func, rounds, param_number, fs_folded=True, 
                        reps=None, maxiters=None, folds=None, in_params=None, 
                        in_upper=None, in_lower=None, param_labels=None, optimizer="log_fmin")
 
   Mandatory Arguments =
    fs:  spectrum object name
    pts: grid size for extrapolation, list of three values
    outfile:  prefix for output naming
    model_name: a label to help label the output files; ex. "no_mig"
    func: access the model function from within 'moments_Run_Optimizations.py' or from a separate python model script, 
          ex. after importing Models_2D, calling Models_2D.no_mig
    rounds: number of optimization rounds to perform
    param_number: number of parameters in the model selected (can count in params line for the model)
    fs_folded: A Boolean value (True or False) indicating whether the empirical fs is folded (True) or not (False).

   Optional Arguments =
     reps: a list of integers controlling the number of replicates in each of the optimization rounds
     maxiters: a list of integers controlling the maxiter argument in each of the optimization rounds
     folds: a list of integers controlling the fold argument when perturbing input parameter values
     in_params: a list of parameter values 
     in_upper: a list of upper bound values
     in_lower: a list of lower bound values
     param_labels: list of labels for parameters that will be written to the output file to keep track of their order
     optimizer: a string, to select the optimizer. Choices include: "log" (BFGS method), "log_lbfgsb" (L-BFGS-B method), 
                "log_fmin" (Nelder-Mead method), and "log_powell" (Powell's method).

Below, I give all the necessary information to call each model available in the
Models_2D.py script. I have set the optimization routine to be the same for each
model using the optional lists below, which are included as optional arguments for
each model. This particular configuration will run 4 rounds as follows:
Round1 - 10 replicates, maxiter = 3, fold = 3
Round2 - 20 replicates, maxiter = 5, fold = 2
Round3 - 30 replicates, maxiter = 10, fold = 2
Round4 - 40 replicates, maxiter = 15, fold = 1

If this script was run as is, each model would be called and optimized sequentially;
this could take a very long time. For your actual analyses, I strongly recommend
creating multiple scripts with only a few models each and running them
independently. It is also not a good idea to mix models from the Diversification Set
and the Island Set, as each was meant to be mutually exclusive.

'''


#create a prefix based on the population names to label the output files
#ex. Pop1_Pop2
prefix = "825".join(pop_ids)

#**************
#make sure to define your extrapolation grid size (based on your projections)
pts = [200,300,400]

#**************
#Set the number of rounds here
rounds = 4

#define the lists for optional arguments
#you can change these to alter the settings of the optimization routine
reps = [10,20,30,40]
maxiters = [3,5,10,15]
folds = [3,2,2,1]

#**************
#Indicate whether your frequency spectrum object is folded (True) or unfolded (False)
fs_folded = True


'''
Diversification Model Set

This first set of models come from the following publication:

    Portik, D.M., Leache, A.D., Rivera, D., Blackburn, D.C., Rodel, M.-O.,
    Barej, M.F., Hirschfeld, M., Burger, M., and M.K.Fujita. 2017.
    Evaluating mechanisms of diversification in a Guineo-Congolian forest
    frog using demographic model selection. Molecular Ecology 26: 5245-5263.
    doi: 10.1111/mec.14266

'''

# Split into two populations, no migration.
Optimize_Functions.Optimize_Routine(fs, pts, prefix, "no_mig", Models_2D.no_mig, rounds, 3, fs_folded=fs_folded,
                                        reps=reps, maxiters=maxiters, folds=folds, param_labels = "NY1, WI2, T")


...etc.

Optimization is too slow

Thanks a lot for this useful tool which makes dadi much more convenient to use.

I used EasySFS.py to make a folded frequency spectrum of about 600,000 SNPs and wanted to use dadi-pipeline to inference the demographic history. While I stuck in the optimization step and almost no result came out after two days. I don't understand why this problem occurs, because the number of SNPs used in many papers is always greater than 1,000,000. Or is this an acceptable speed?

import dadi
import numpy
import Optimize_Functions

path_sfs = 'HANN-HANS-TIB.sfs'
fs = dadi.Spectrum.from_file(path_sfs)

from Three_Population_Pipeline.Models_3D import split_nomig

prefix = "V1"

pts_l = [180,190,200]
p_labels = 'nu1, nuA, nu2, nu3, T1, T2'
reps = [10, 10, 10]
maxiters = [5, 5, 5]
folds = [3, 2, 1]
Optimize_Functions.Optimize_Routine(fs=fs, 
                                    pts=pts_l, 
                                    outfile=prefix, 
                                    model_name="split_nomig", 
                                    func=split_nomig, 
                                    rounds=3, 
                                    param_number=6, 
                                    fs_folded=True, 
                                    param_labels=p_labels,
                                    optimizer="log",
                                    reps=reps,
                                    maxiters=maxiters,
                                    folds=folds)

HANN-HANS-TIB.sfs.zip

about plotting

Dear dportik
I used the example data of two populations, with the command "python dadi_2D_04_plotting_functions.py", and got the result:
Traceback (most recent call last):
File "dadi_2D_04_plotting_functions.py", line 373, in
vic_no_mig = Optimize_Functions.Optimize_Single(pts, fs, "vic_no_mig", vic_no_mig_params)
File "/home/data/dadi_new/dadi_pipeline-master/Two_Populations/Optimize_Functions.py", line 6906, in Optimize_Single
sim_model = func_exec(params, fs.sample_sizes, pts)
File "/home/share/software/anaconda2/lib/python2.7/site-packages/dadi-1.6.3-py2.7-linux-x86_64.egg/dadi/Numerics.py", line 261, in extrap_func
result_l = map(partial_func, pts_l)
File "/home/data/dadi_new/dadi_pipeline-master/Two_Populations/Models_2D.py", line 564, in vic_no_mig
nuA, nu1, nu2, T, s = params
ValueError: need more than 0 values to unpack
I donot know what's wrong with this software.
thankyou!

Fixing theta

Dear Daniel,

I am trying to set up theta explicitly as explained by Ryan Gutenkunst here.

I defined an explicit function that has theta :

def growthfixedtheta ( params , ns , pts ):
    nu ,T, theta = params
    xx = dadi.Numerics.default_grid (pts)
    phi = dadi.PhiManip.phi_1D (xx)
    nu_func = lambda t: numpy.exp ( numpy.log (nu) * t/T)
    phi = dadi.Integration.one_pop (phi , xx , T, nu_func,theta0= theta  )
    fs = dadi.Spectrum.from_phi(phi, ns, (xx,))
    return fs

And then I modified optimise functions at 3 different lines.

line 118:

 theta =params_opt[2] #My fixed theta, instead of dadi.Inference.optimal_sfs_scaling(sim_model, fs)

at line 107 no multinom:

ll = dadi.Inference.ll(sim_model, fs)

and at line 242, multinom =False:

params_opt = dadi.Inference.optimize_log_fmin(params_perturbed, fs, func_exec, pts, lower_bound=lower_bound, upper_bound=upper_bound, verbose=1, multinom=False,maxiter=maxiters_list[r], output_file = "{}.log.txt".format(model_name))

I am a tiny bit more confused because when I try to run it, the theta that I fix is always very slightly different from the theta I get back.

Could you tell me wether that seems correct or wether I missed something important? That would be so helpful! I understand if that is not the purpose of your code and you don't really have time to look into it tho.

If that seems appropriate I am happy to take the time to make it into options of your repos if you want it to be of course.

Kind Regards

Ludo

dadi plot error

Dear Mr.Daniel,

Thanks for writing such helpful pipeline and I am sorry to bother you. I had some questions about the plotting of the results when I used the make_plots.py. I modified the parameters according to my results, but showed:
Traceback (most recent call last):
File "/usr/section2/chenqing/software/dadi_pipeline-master/Plotting/Make_Plots.py", line 178
, in
Plotting_Functions.Plot_2D(fs, model_fit, prefix, "anc_asym_mig")
File "/usr/section2/chenqing/software/dadi_pipeline-master/Plotting/Plotting_Functions.py",
line 112, in Plot_2D
dadi.Plotting.plot_2d_comp_multinom(model_fit, fs, resid_range = 3)
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/dadi/Plott
ing.py", line 306, in plot_2d_comp_multinom
plot_2d_comp_Poisson(model, data, vmin=vmin, vmax=vmax,
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/dadi/Plott
ing.py", line 469, in plot_2d_comp_Poisson
plot_single_2d_sfs(masked_data, vmin=vmin, vmax=vmax,
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/dadi/Plott
ing.py", line 181, in plot_single_2d_sfs
cb = ax.figure.colorbar(mappable, extend=extend, format=format)
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/figure.py", line 1277, in colorbar
cb = cbar.Colorbar(cax, mappable, **cb_kw)
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/_api/deprecation.py", line 384, in wrapper
return func(*inner_args, **inner_kwargs)
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/colorbar.py", line 380, in init
self._reset_locator_formatter_scale()
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/colorbar.py", line 1165, in _reset_locator_formatter_scale
self._process_values()
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/colorbar.py", line 1103, in _process_values
b = self.norm.inverse(b)
File "/usr/section2/chenqing/miniconda3/envs/dadienv/lib/python3.10/site-packages/matplotlib
/colors.py", line 1708, in inverse
raise ValueError("Invalid vmin or vmax")
ValueError: Invalid vmin or vmax

I tried to change the value of vmin, but the error kept showing. I used your example_data without any problem.
I don't know what the problem is. Can you give me some suggestions?

Best wishes
Chen

AIC

Dear Mr.Daniel,
I am sorry to bother you in your busy schedule. When I run the script 'dadi_Run_2D_Set.py' , the result file 'optimized.txt' shows:
log-likelihood AIC chi-squared theta optimized_params(nu1, nu2, T)
0.0 nan nan 0.0 0.3052, 0.1408, 0.2376.
The AIC value is nan , but I can not find the reason why it is. The optimized params have numerical values. I checked the data and didn't find the problem. Could you tell me about it?

VCF2snp

hello,when i want to use the pipeline ,i didn't find the script to generate the snp file

nu1,nu2,nu3

how did i transform the nu1 to real Ne?,i found that nu1<1, and nu2>1,but i found that nu1+nu2=nuA, so in three population, nu1,nu2,nuA,nu3,how did i transform ?

nan AIC on edited models

I'm running 2 separate runs of dadi with data from 2 closely related populations: one containing all samples from both populations (31 and 24, respectiveley) and one containing a subset of 10 samples from each population with higher coverage (trying to mitigate the impact of missing data).

Both populations suffer from higher than average levels of inbreeding, with one of them having an absurd level of it. So I decided to include it in the models changing the fs in the models to fs = dadi.Spectrum.from_phi_inbreeding(phi, ns, (xx, xx), (F1, F2), (2, 2)), as describeb in dadi's manual. After making all changes into the scripts I've started running both of them.

For both dataset, unfortunalety, the summaries are returning AICs as nan. I thought it could be a result of the models being edited, but running with the default models return the same results.

Unlike the other issue placed here, ALL my AIC are nan and I don't get and error message.

I assume it must be my data, but I don't know why.

Any advice would be appreciated.

Cheers,
Manuel Escalona

Goodness of Fit test suggests model overfit

Hello Daniel,

I have used the dadi pipeline for many different projects in the past two years, it simplifies the workflow considerably. Thank you for developing and maintaining such a useful tool!

In a last step to finish up one of my analyses, I performed GOF for a rather simple 2 populations assymetric migration model. I have 24 haplotypes in each population, and in order to estimate the best model, I downprojected them to [10,16]. Now, when I run the GOF, my empirical likelihood sits outside of the simulated/fitted likelihoods histogram, but on the right site, meaning that none of the simulations and subsequent fittings reached a likelihood as "good" as the empirical. I read on the user page that this phenomenon ("This prevents odd behavior, such as when the empirical data fit better than the simulated data.") was observed before an update that I indeed did not use for the GOF. However, after downloading and running the updated scripts, I still get an overfit.

Do you have any idea if this is related to my code (see below)? Or if this happens sometimes when one model just fits exceptionally well ?

dd = dadi.Misc.make_data_dict_vcf("2pops.vcf", "popfile_txt")

# create folded SFS
fs = dadi.Spectrum.from_data_dict(dd, ["pop1", "pop2"], projections=[10, 16], polarized=False)

## max projections
## Here, since the model was fit to a downprojected SFS, I used the same numbers as in projections above.
max_proj = [10,16]

#Convert this dictionary into folded AFS object based on
#MAXIMUM projection sizes
#[polarized = False] creates folded spectrum object
max_fs = dadi.Spectrum.from_data_dict(dd, pop_ids=["pop1", "pop2"], projections = max_proj, polarized = False)

## grid size
pts = [50,60,70]

## model
asym_mig_func = Models_2D.asym_mig

##best fitting parameters
emp_params = [12.8641 ,1.6077,1.5714, 2.0899, 5.3618] ## asym_mig

## folded
fs_folded = True

## optimize empirical
fs_for_sims = Optimize_Functions_GOF.Get_Empirical(fs, max_fs, pts, "Empirical", "asym_mig", asym_mig_func, emp_params, [10,16], fs_folded=fs_folded)

## simulations

#number of simulations
sims = 100

#**************
#number of parameters
pnum = 5

#**************
#number of rounds
rounds = 4

#number of reps
reps = [15,10,5,5]

#number of maximum iterations per round
maxiters = [5,5,10,10]

#parameter fold
folds = [3,2,1,1]

## labels
p_labels = "nu1, nu2, m12, m21, T"

Optimize_Functions_GOF.Perform_Sims(sims, fs_for_sims, pts, "asym_mig", asym_mig_func, rounds, pnum, fs_folded=fs_folded,
 reps=reps, maxiters=maxiters, folds=folds, param_labels=p_labels)

Thank you in advance!

Cheers,
Bernhard

founder_nomig_admix_two_epoch model argument T2 not used

On line 912 of Models_2D.py, the founder_nomig_admix_two_epoch model has:

phi = Integration.two_pops(phi, xx, T1, nu1, nu2, m12=0, m21=0)

I suspect it should instead read:

phi = Integration.two_pops(phi, xx, T2, nu1, nu2, m12=0, m21=0)

Since T2 is the time between the admixture event and the present. If I understand correctly, the model as it stands assumes equal time between the founding and admixture events and the admixture event and the present.

does the pipeline incorporates correction for ancestral state misidentification?

Hi,

Thank you for writing the pipeline. I find it straightforward and easy to use. I am trying to fit a simple 1D model to one of my populations and want to correct for the ancestral state misidentification. I was not sure if the pipeline is built to correct for the misidentification of the ancestral state. If not, is it possible to incorporate this correction into the pipeline, and what would be a good way to do so? I am new to dadi and python and any help would be greatly appreciated.

Thank you again!
Upasana

How to transform the parameter to true parameter

For example, we used the admix_origin_no_mig model in 3 population, but i got the best AIC result which is [4.3764,4.0015,1.4507,0.0146,0.0117,1.3506],["nu1, nu2, nu3, T1, T2 , f"), why the f >0? and how did the T1,T2 transform to true time?

Summarize_Outputs.py

Hey @dportik!

Thanks for making these pipelines! As someone who is very new to python, these pipelines are very helpful for use of dadi! I am receiving an error when I run: python2 Summarize_Outputs.py

"
Examining file WILD_LANDRACE.asym_mig.optimized.txt
Traceback (most recent call last):
File "Summarize_Outputs.py", line 59, in
content.sort(key=lambda x: float(x[3]))
File "Summarize_Outputs.py", line 59, in
content.sort(key=lambda x: float(x[3]))
ValueError: could not convert string to float: AIC
"

Could you please explain what is going on?

Also, have you created a script that allows the inclusion of ancestral states? I noticed the scripts were all set up to be folded, but I do have ancestral state information for each site I am using.

Question about theta, Nref, and nuA

Hi there,

Thanks for writing this pipeline - I find it incredibly helpful! I'm trying to interpret my results from the island models and I'm having trouble with the relationship between the Nref estimated from theta - which, as I understand it from the dadi-users Google group, is generally the ancestral population - and nuA. What Nref is nuA defined in relation to?

Thanks,

Default parameter boundaries

Hi Dan,

I have another question. If I don't set the optional bounds on the parameters, what are the bounds for them? Can I look for behavior like parameters that chase the upper/lower bound? Or are they able to take any value?

Thanks,

Joanna

running time

I used dadi to run the asymigration model of three populations, it last for a week, but only go to Round 2 , are there any ways to speed up. I tried to use more threads, but failed. Thankyou!

Question about the simulations

Hi Daniel,

First, thank you for this excellent pipeline! I'm currently running simulations for a 2D model and have two questions:

I've noticed that there is a lot of variability in the analysis time for the simulations, ranging anywhere from 1.5 hours to over 24 hours. Is this expected behavior for the program?
If I need to stop the simulations partway through (e.g., after 30 simulation runs out of 100) is it possible to start them back up and re-start the numbering? Or do I need to move the initial ~30 runs to another directory and start the simulations again, but only run 70 simulations? Or is it not recommended to interrupt the simulations at all?

All the best,
Sarah

Folded spectrum vs. unfolded spectrum

I am able to run the program when polarized = False, but when setting polarized = True I get the following error.

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/dadi-1.7.0-py2.7-macosx-10.9-x86_64.egg/dadi/Numerics.py:171: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)]instead ofarr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)]`, which will result either in an error or a different result.
Traceback (most recent call last):
File "dadi_Run_2D_Set.py", line 249, in
Optimize_Functions.Optimize_Routine(fs, pts, prefix, "asym_mig_twoepoch", Models_2D.asym_mig_twoepoch, rounds, 8, reps=reps, maxiters=maxiters, folds=folds, param_labels = "nu1, nu2, m12a, m21a, m12b, m21b, T1, T2")
File "/Users/peterstokes/Documents/Science/Blackman_Lab/dadi/dadi_data/Optimize_Functions.py", line 237, in Optimize_Routine
rep_results = collect_results(fs, sim_model, params_opt, roundrep)
File "/Users/peterstokes/Documents/Science/Blackman_Lab/dadi/dadi_data/Optimize_Functions.py", line 121, in collect_results
chi2 = numpy.sum((folded_sim_model - fs)**2/folded_sim_model)
File "", line 3, in sub
File "build/bdist.macosx-10.9-x86_64/egg/dadi/Spectrum_mod.py", line 1878, in _check_other_folding
ValueError: Cannot operate with a folded Spectrum and an unfolded one.

Do you see anything that would cause errors in the first bit of my script?

#===========================================================================

Import data to create joint-site frequency spectrum

#===========================================================================

#**************
snps = "/Users/peterstokes/Documents/Science/Blackman_Lab/dadi/dadi_data/python_scripts/angsd_joined_LRW_60_REF_ANC_included_ready.txt"

#Create python dictionary from snps file
dd = dadi.Misc.make_data_dict(snps)

#**************
#pop_ids is a list which should match the populations headers of your SNPs file columns
pop_ids=["WILD", "LANDRACE"]

#**************
#projection sizes, in ALLELES not individuals
proj = [14,14]

#Convert this dictionary into folded AFS object
#[polarized = False] creates folded spectrum object
fs = dadi.Spectrum.from_data_dict(dd, pop_ids=pop_ids, projections = proj, polarized = True)

#print some useful information about the afs or jsfs
print "\n\n============================================================================\nData for site frequency spectrum\n============================================================================\n"
print "projection", proj
print "sample sizes", fs.sample_sizes
sfs_sum = numpy.around(fs.S(), 2)
print "Sum of SFS = ", sfs_sum, '\n', '\n'

#================================================================================

Calling external 2D models from the Models_2D.py script

#================================================================================
'''
We will use a function from the Optimize_Functions.py script for our optimization routines:

Optimize_Routine(fs, pts, outfile, model_name, func, rounds, param_number, reps=None, maxiters=None, folds=None, in_params=None, in_upper=None, in_lower=None, param_labels=" ")

Mandatory Arguments =
fs: spectrum object name
pts: grid size for extrapolation, list of three values
outfile: prefix for output naming
model_name: a label to help label the output files; ex. "no_mig"
func: access the model function from within 'moments_Run_Optimizations.py' or from a separate python model script, ex. after importing Models_2D, calling Models_2D.no_mig
rounds: number of optimization rounds to perform
param_number: number of parameters in the model selected (can count in params line for the model)
Optional Arguments =
reps: a list of integers controlling the number of replicates in each of the optimization rounds
maxiters: a list of integers controlling the maxiter argument in each of the optimization rounds
folds: a list of integers controlling the fold argument when perturbing input parameter values
in_params: a list of parameter values
in_upper: a list of upper bound values
in_lower: a list of lower bound values
param_labels: list of labels for parameters that will be written to the output file to keep track of their order
Below, I give all the necessary information to call each model available in the
Models_2D.py script. I have set the optimization routine to be the same for each
model using the optional lists below, which are included as optional arguments for
each model. This particular configuration will run 4 rounds as follows:
Round1 - 10 replicates, maxiter = 3, fold = 3
Round2 - 20 replicates, maxiter = 5, fold = 2
Round3 - 30 replicates, maxiter = 10, fold = 2
Round4 - 40 replicates, maxiter = 15, fold = 1
If this script was run as is, each model would be called and optimized sequentially;
this could take a very long time. For your actual analyses, I strongly recommend
creating multiple scripts with only a few models each and running them
independently. It is also not a good idea to mix models from the Diversification Set
and the Island Set, as each was meant to be mutually exclusive.
'''

#create a prefix based on the population names to label the output files
#ex. Pop1_Pop2
prefix = "_".join(pop_ids)

#**************
#make sure to define your extrapolation grid size (based on your projections)
pts = [50,60,70]

#**************
#Set the number of rounds here
rounds = 4

#define the lists for optional arguments
#you can change these to alter the settings of the optimization routine
reps = [10,20,30,40]
maxiters = [3,5,10,15]
folds = [3,2,2,1]`

Originally posted by @pstokespmb in #3 (comment)

Question about input file

Dear Mr.Daniel,
Thank you for providing such a good pipeline! I'm trying to run a 2D model, but I have a problem. If I already have an SFS file, how do I modify the script "dadi_Run_2D_Set.py". I defined "fs" directly as my SFS file, but I found this didn't work.

Thanks,
Jing-fang Guo

dportik / dadi_pipeline Goto Github PK

dadi_pipeline's People

Contributors

Stargazers

Watchers

Forkers

dadi_pipeline's Issues

dadi_pipeline.py

Import data to create joint-site frequency spectrum

Calling external 2D models from the Models_2D.py script

Recommend Projects

Recommend Topics

Recommend Org