Runtime for large datasets. #84
Comments
187 thousand samples!?
Ok, I'll give that a try. Thank you.
Is there any procedure where I can merge the intermediate files of the sample subsets so as to generate results for the whole sample set?
Do you mean for peddy? No. I would just use somalier on 20K samples at a time. To compare all pairwise combinations of your samples, you would need 187K × 187K = 34,969,000,000 pairwise comparisons, and multiple matrices with that many entries. With somalier, you might be able to process all samples at once on a machine with 1 TB of memory.
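For a rough sense of the scale described above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not how peddy or somalier actually allocates memory: the function names are hypothetical, and the 4 bytes per matrix entry is an assumption chosen for illustration.

```python
import math


def full_matrix_entries(n_samples: int) -> int:
    """Entries in a full n x n pairwise matrix (the 187K * 187K figure)."""
    return n_samples * n_samples


def unique_pairs(n_samples: int) -> int:
    """Distinct unordered sample pairs, excluding self-comparisons."""
    return n_samples * (n_samples - 1) // 2


def matrix_gigabytes(n_samples: int, bytes_per_entry: int = 4) -> float:
    """Size of one full matrix in GB.

    bytes_per_entry=4 is an assumption (e.g. a 32-bit value per pair),
    not a documented property of peddy or somalier.
    """
    return full_matrix_entries(n_samples) * bytes_per_entry / 1e9


def n_batches(n_samples: int, batch_size: int) -> int:
    """Number of subsets needed when processing batch_size samples at a time."""
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    n = 187_000
    print(f"full matrix entries: {full_matrix_entries(n):,}")
    print(f"unique pairs:        {unique_pairs(n):,}")
    print(f"GB per matrix:       {matrix_gigabytes(n):.1f}")
    print(f"batches of 20K:      {n_batches(n, 20_000)}")
```

Note that splitting into batches only compares samples within the same batch, so cross-batch pairs are not checked; that is the trade-off for fitting in memory.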
Hi,
I am trying to generate some cohort metrics for QC steps via peddy. My sample size is about 187,000. I have provided the gzip-compressed VCF and the fam (PLINK format) file for these samples as input. On running the command for the QC plots, all sample IDs are listed and the terminal output "ped_check" appears. However, there is no progress beyond this stage, and the process continues to run beyond the 24-hour mark.
I have executed this run on an HPC node:
Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
RAM: 180 GB.
Is there a limitation on the input sample size?