Perl one-liner to process sequence files in a stream

Need a practical way to process FASTA files with the Bio::SeqIO module? The code below prints each sequence's ID and length, separated by a tab, one record per line.

perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN);while ($myseq=$seq->next_seq){print $myseq->id,"\t",$myseq->length,"\n";}' < filename

The same one-liner also works at the end of a pipe:

cat filename | perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN);while ($myseq=$seq->next_seq){print $myseq->id,"\t",$myseq->length,"\n";}'

There are many more methods to use from Bio::Seq, such as revcom, translate, subseq(start,end), primary_id, desc, etc.
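As a rough illustration of a few of these methods, here is a minimal sketch written as a short script rather than a one-liner. The helper name summarize_record and the 10-base cutoff are my own choices for the example, and it assumes nucleotide FASTA records on STDIN (revcom will refuse protein sequences).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# summarize_record is a hypothetical helper, not part of BioPerl.
# Assumes a nucleotide record with at least one residue.
sub summarize_record {
    my ($seq) = @_;
    # Take the first 10 bases, or the whole sequence if it is shorter.
    my $end = $seq->length < 10 ? $seq->length : 10;
    return join("\t",
        $seq->id,
        $seq->desc // '',         # description may be undefined
        $seq->subseq(1, $end),    # subseq returns a plain string (1-based, inclusive)
        $seq->revcom->seq,        # revcom returns a new sequence object, so ask it for ->seq
    );
}

my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');
while (my $seq = $in->next_seq) {
    next unless $seq->length;     # skip empty records
    print summarize_record($seq), "\n";
}
```

Saved under a name of your choice (say summarize.pl), it can be run the same way as the one-liner: perl summarize.pl < filename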

The piped file does not need to be in FASTA format; Bio::SeqIO can parse many other formats successfully (they are listed in the Bio::SeqIO documentation).
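Because the reader and writer formats are independent, the same streaming pattern can also convert between formats. The sketch below assumes GenBank input on STDIN and writes FASTA to STDOUT; convert_stream is a hypothetical helper name, and the format strings should be swapped for whatever your data actually uses.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# convert_stream is a hypothetical helper, not part of BioPerl.
# It reads records in one format from $in_fh and writes them
# in another format to $out_fh, returning the number of records.
sub convert_stream {
    my ($in_fh, $in_format, $out_fh, $out_format) = @_;
    my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => $in_format);
    my $out = Bio::SeqIO->new(-fh => $out_fh, -format => $out_format);
    my $count = 0;
    while (my $seq = $in->next_seq) {
        $out->write_seq($seq);
        $count++;
    }
    return $count;
}

# Assumed example: GenBank records in, FASTA records out.
convert_stream(\*STDIN, 'genbank', \*STDOUT, 'fasta');
```

With hypothetical file names, usage would look like: perl convert.pl < input.gb > output.fasta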

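UPDATE: If you are using this one-liner in a pipe, you might need to declare the format explicitly so that the stream is processed correctly, since there is no file name from which the format could be inferred. Also, Bio::Seq methods such as translate and revcom return a new sequence object rather than a string, so please call “->seq” on the result to access the final sequence.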

Considering all these updates, the one-liner should look like this:

perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN,-format=>"fasta");while ($myseq=$seq->next_seq){print $myseq->id,"\t",$myseq->length,"\t",$myseq->seq,"\t",$myseq->translate->seq,"\n";}'
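
For anyone who prefers something easier to read and extend, the same logic can be spelled out as a short script. This is only a sketch under the same assumptions as the one-liner (nucleotide FASTA on STDIN); format_record is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# format_record is a hypothetical helper, not part of BioPerl.
# Returns id, length, sequence and translation as one tab-separated line.
sub format_record {
    my ($seq) = @_;
    my $protein = $seq->translate;    # a new sequence object, not a string
    return join("\t", $seq->id, $seq->length, $seq->seq, $protein->seq);
}

# Declare the format explicitly, since the data arrives on a stream.
my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fasta');
while (my $seq = $in->next_seq) {
    print format_record($seq), "\n";
}
```
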
Alper Yilmaz
Assist.Prof.Dr. Alper YILMAZ

My research interests include genome grammar and NGS analysis.

