perl one-liner to pick random sequences from fasta file

In an earlier post we learned how to use Bio::SeqIO module to process fasta files with one-liner. Let’s do more with this capability. What about selecting random sequences from a fasta file?

To achieve that, we’ll load the fasta file contents into a hash and then utilize the fact that rand(@array) returns index of a random element from that array.

Let’s pick 100 random sequences from a fasta file with one-liner:

.. fasta file stream .. | perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN);while ($myseq=$seq->next_seq){ $hash{$myseq->id}=$myseq->seq }; END{@ids = keys %hash; foreach (1..100){my $index=rand(@ids); print ">",$ids[$index],"\n",$hash{$ids[$index]},"\n" } }'

UPDATE: If this one-liner throws problem about first sequence, please indicate the format of the input. Since read ahead is not possible in a pipe, the format might not be guessed correctly. So, please update the one-liner with this: $seq=Bio::SeqIO->new(-fh => \*STDIN, -format=>"fasta")

Avatar
Alper Yilmaz
Assist.Prof.Dr. Alper YILMAZ

My research interests include genome grammar and NGS analysis.

Related

comments powered by Disqus