bioinformatics

Extract upstream region sequence with bedtools

Soon after SAM/BAM became the standard format for short-read alignment software, high-caliber tools began to emerge that can process it. bedtools is one of them: it is easy to use, flexible and, most importantly, works well in command-line pipes. In this post, I’ll describe how to extract upstream region sequences with the help of bedtools. I’ll be using the following files in my example:
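The files and commands from the full post are not part of this excerpt. As a rough sketch of the coordinate arithmetic behind an upstream extraction (what a strand-aware `bedtools flank` followed by `bedtools getfasta` would compute), here is a small Python version. The function names, the toy genome and the window sizes are all hypothetical, and BED-style 0-based, half-open coordinates are assumed.

```python
# Minimal sketch, not the post's method: helper names are made up for illustration.
# Coordinates are assumed to be BED-style (0-based, half-open).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def upstream_interval(start, end, strand, chrom_len, size):
    """Return (start, end) of the window `size` bp upstream of a feature."""
    if strand == "-":
        # Upstream of a minus-strand feature lies past its end coordinate.
        return end, min(chrom_len, end + size)
    # Plus strand: upstream lies before the start; clamp at the chromosome start.
    return max(0, start - size), start


def upstream_sequence(genome, chrom, start, end, strand, size):
    """Slice the upstream window and reverse-complement it on the minus strand."""
    seq = genome[chrom]
    up_start, up_end = upstream_interval(start, end, strand, len(seq), size)
    region = seq[up_start:up_end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Toy genome (an assumption, only for the demo).
genome = {"chr1": "AAACCCGGGTTT"}
print(upstream_sequence(genome, "chr1", 6, 9, "+", 3))  # CCC
print(upstream_sequence(genome, "chr1", 3, 6, "-", 6))  # AAACCC
```

Clamping to the chromosome bounds matters for features near either end, where the upstream window would otherwise run off the sequence.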

perl one-liner to pick random sequences from fasta file

In an earlier post we learned how to use the Bio::SeqIO module to process fasta files with a one-liner. Let’s do more with this capability. What about selecting random sequences from a fasta file? To achieve that, we’ll load the fasta file contents into a hash and then rely on the fact that rand(@array) returns a random number below the array’s length, which Perl truncates to a valid index when it is used as a subscript. Let’s pick 100 random sequences from a fasta file with a one-liner:
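The one-liner itself is not included in this excerpt. The same idea (load the records into a hash, then draw random keys) can be sketched in Python with only the standard library. This is an illustration rather than a translation of the post's Perl: `read_fasta` and `pick_random` are hypothetical names, the parser is deliberately simple, and `random.sample` is swapped in for the `rand(@array)` index trick so that no sequence is drawn twice.

```python
import io
import random


def read_fasta(handle):
    """Parse FASTA lines into a dict mapping sequence id -> sequence."""
    chunks = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # The id is the first word of the header line.
            current = fields[0] if fields else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def pick_random(records, n, seed=None):
    """Return n records chosen at random without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(records), min(n, len(records)))
    return {seq_id: records[seq_id] for seq_id in chosen}


# Toy input (an assumption); a real run would pass an open file or sys.stdin.
fasta_text = ">seq1 first\nACGT\nAC\n>seq2\nGGGG\n>seq3\nTTTTT\n"
records = read_fasta(io.StringIO(fasta_text))
for seq_id, seq in pick_random(records, 2, seed=1).items():
    print(f">{seq_id}\n{seq}")
```

Passing `n=100` mirrors the post's example. If the file holds fewer records than requested, this sketch returns all of them instead of raising an error.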

Visualize Circos images with Seadragon

Circos is a very powerful tool for visualizing different types of data (expression, homology, etc.) in a circular layout. The software can produce very large images if desired, suitable for posters. We can also create large images for viewing online, since it’s trivial to view them with Seadragon. Below is an example from the Circos tutorial (I modified the config file to obtain a large image). (EDIT: Since the Seadragon page was very slow to respond, I just included the embed URL.)

perl one-liner to process sequence files in stream

Need a practical way to process fasta files with the Bio::SeqIO module? The code below prints the sequence id and sequence length, separated by a tab, one record per line.

perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN);while ($myseq=$seq->next_seq){print $myseq->id,"\t",$myseq->length,"\n";}' < filename

OR

cat filename | perl -MBio::SeqIO -e '$seq=Bio::SeqIO->new(-fh => \*STDIN);while ($myseq=$seq->next_seq){print $myseq->id,"\t",$myseq->length,"\n";}'

There are many more methods available from Bio::Seq, such as revcom, translate, subseq(start,end), primary_id, desc, etc. The piped file does not need to be in fasta format: SeqIO can parse many other formats (listed here), although when reading from STDIN you will generally need to name the format with the -format argument to new().