perl one-liner to change table layout with Hash-of-Hash-of-Array

I really like one-liners which can do a lot in a single line.. I wanted to share one I just used to arrange a big table.

In the list of proteins below, only two proteins are shown, one protein has multiple attributes for 4 categories (InterPro, Cellular Component, Biological Process and Molecular Function). The other thing to notice is that, not all proteins have to have all the attributes, for instance, one protein might miss BiologicalProcess attribute.

protein1 Interpro kinase
protein1 BiologicalProcess protein folding
protein1 BiologicalProcess metabolic process
protein1 MolecularFunction DNA binding
protein1 CellularComponent membrane
protein2 Interpro transferase
protein2 Interpro Methyltransferase
protein2 CellularComponent membrane
protein2 CellularComponent integral to membrane

Out of this table, I’m trying to get the following table:

ProteinID InterPro Cellular Component Biological Process Molecular Function
protein1 kinase membrane protein folding; metabolic process DNA binding
protein2 transferase; Methyltransferase membrane; integral to membrane

Do you think this is possible with perl one-liner? Yes, it is..

Below is the code (suppose that Table 1 is in file called GeneCategories.txt

perl -F"\t" -ane 'chomp($F[2]); push @{$hash->{$F[0]}->{$F[1]}},$F[2]; END {foreach $id (sort keys %$hash){print $id,"\t"; foreach $field qw(Interpro CellularComponent BiologicalProcess MolecularFunction){print join ";",@{$hash->{$id}->{$field}}; print "\t";}; print "\n"; } }' GeneCategories.txt

Let’s breakdown the code now. As we know, you can run perl code within terminal in this format:

perl -e 'code'

If you want to run your code in a loop, then -n option should be used. In that case, either a filename should be provided or data should be piped to perl. Auto split can be turned on by -a option which will assign split elements to an array named @F.

If I don’t indicate that TAB is the separator, then SPACE or TAB is considered as separator. Since my data contains SPACE, I should specifically indicate that TAB is the separator by -F option.

One more thing about running perl in commandline with -n option. Suppose you wish to run additional code before and/or after the loop, then you should use the following format:

perl -ne 'BEGIN {code1}; code2; END {code3}' filename

In this particular example, code1 will run before looping thru lines of filename and code3 will run after loop ended.

Okay, now the meaning of the actual code:

chomp($F[2])

Last column contains newline character at the end, I am removing it so that final output is not bad.

push @{$hash->{$F[0]}->{$F[1]}},$F[2]

This is the core part where one protein can have multiple categories (Hash of hash) and one category can hold multiple values in an array (Hash of hash of array). Whatever is in third column is pushed into an array referred by hash of hash $hash->{ProteinNo}->{Category}

After loop ended, hash structure is printed and mission accomplished..

Avatar
Alper Yilmaz
Assist.Prof.Dr. Alper YILMAZ

My research interests include genome grammar and NGS analysis.

Related

comments powered by Disqus