Runtime for large datasets. #84

sdsilva10 · 2021-09-29T12:36:32Z

Hi,

I am trying to generate some cohort metrics for QC using peddy. My sample size is about 187,000. I provided the gzip-compressed VCF and the .fam (PLINK format) file for these samples as input. When I run the command for the QC plots, all sample IDs are listed and the terminal prints "ped_check". However, there is no progress beyond this stage, and the process is still running past the 24-hour mark.

I ran this on an HPC node:
Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
RAM: 180 GB

Is there a limitation on the input sample size?

brentp · 2021-09-29T13:29:45Z

187 thousand samples!?
That is too big for peddy. You might try somalier on batches of ~20 thousand at a time.
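
For illustration, splitting a sample list into batches of about 20 thousand could be sketched like this. This is only a minimal sketch: the `batches` helper and the `sample_{i}` IDs are hypothetical placeholders, not part of peddy or somalier, and real IDs would come from the .fam file or the VCF header.

```python
from typing import Iterator, List


def batches(sample_ids: List[str], size: int = 20_000) -> Iterator[List[str]]:
    """Yield consecutive slices of sample_ids with at most `size` entries each."""
    for start in range(0, len(sample_ids), size):
        yield sample_ids[start:start + size]


# Hypothetical sample IDs standing in for the real cohort.
ids = [f"sample_{i}" for i in range(187_000)]
chunks = list(batches(ids))

# 187,000 samples in batches of 20,000: nine full batches plus one of 7,000.
print(len(chunks), len(chunks[-1]))
```

Note that pairs of samples that land in different batches are never compared with each other, so relatedness across batches would not be detected this way.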

sdsilva10 · 2021-10-01T06:52:57Z

Ok, I'll give that a try. Thank you.

sdsilva10 · 2021-10-01T09:25:56Z

Is there any procedure where I can merge the intermediate files of the sample subsets so as to generate results for the whole sample set?

brentp · 2021-10-01T10:43:55Z

do you mean for peddy? no.

i would just use somalier for 20K at a time. in order to compare all pairwise combinations for your samples, you'd need to do (187K*187K) 34,969,000,000 pairwise comparisons and hold multiple matrices with that many entries. You might be able to do all samples at once with somalier on a machine with 1TB of memory.
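
The arithmetic above can be checked with a rough back-of-the-envelope sketch. The 4 bytes per entry is an assumption made for illustration (for example a 32-bit value per sample pair); the thread does not say what entry size or how many matrices somalier or peddy actually use, and `pairwise_matrix_gib` is a hypothetical helper.

```python
def pairwise_matrix_gib(n_samples: int, bytes_per_entry: int = 4) -> float:
    """Size in GiB of one dense n x n matrix (bytes_per_entry is an assumption)."""
    entries = n_samples * n_samples
    return entries * bytes_per_entry / 2**30


n_all = 187_000
n_batch = 20_000

entries_all = n_all * n_all               # 34,969,000,000 entries
gib_all = pairwise_matrix_gib(n_all)      # roughly 130 GiB for a single matrix
gib_batch = pairwise_matrix_gib(n_batch)  # roughly 1.5 GiB for a single matrix

print(f"{entries_all:,} entries, {gib_all:.1f} GiB per matrix (all samples)")
print(f"{gib_batch:.2f} GiB per matrix (one batch of {n_batch:,})")
```

Under that assumption, a single full matrix is already close to the 180 GB on the node described above, and several such matrices would exceed it, which is consistent with the suggestion of batching or a machine with around 1TB of memory.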

sdsilva10 closed this as completed Oct 1, 2021

sdsilva10 reopened this Oct 1, 2021
