perl one-liner to change table layout with Hash-of-Hash-of-Array

Mar 25, 2010

I really like one-liners that can do a lot in a single line. I wanted to share one I just used to rearrange a big table.

The list below shows only two proteins. Each protein can have multiple attributes in 4 categories (InterPro, Cellular Component, Biological Process and Molecular Function). Note also that not every protein has attributes in every category; for instance, a protein might lack a BiologicalProcess attribute.

protein1	Interpro	kinase
protein1	BiologicalProcess	protein folding
protein1	BiologicalProcess	metabolic process
protein1	MolecularFunction	DNA binding
protein1	CellularComponent	membrane
protein2	Interpro	transferase
protein2	Interpro	Methyltransferase
protein2	CellularComponent	membrane
protein2	CellularComponent	integral to membrane

From this table, I’m trying to get the following table:

ProteinID	InterPro	Cellular Component	Biological Process	Molecular Function
protein1	kinase	membrane	protein folding; metabolic process	DNA binding
protein2	transferase; Methyltransferase	membrane; integral to membrane

Do you think this is possible with a Perl one-liner? Yes, it is.

Below is the code (assuming the first table is in a file called GeneCategories.txt):

perl -F"\t" -ane 'chomp($F[2]); push @{$hash->{$F[0]}->{$F[1]}},$F[2]; END {foreach $id (sort keys %$hash){print $id,"\t"; foreach $field (qw(Interpro CellularComponent BiologicalProcess MolecularFunction)){print join "; ",@{$hash->{$id}->{$field}}; print "\t";}; print "\n"; } }' GeneCategories.txt

Let’s break down the code now. As we know, you can run Perl code from the terminal in this format:

perl -e 'code'

If you want to run your code in a loop over every input line, use the -n option. In that case, either a filename should be provided or data should be piped to perl. Auto-split is turned on with the -a option, which splits each line and assigns the pieces to an array named @F.

If I don’t indicate that TAB is the separator, -a splits on any whitespace (spaces as well as tabs). Since my data contains spaces inside the values, I have to state explicitly that TAB is the separator with the -F option.

One more thing about running perl on the command line with the -n option: if you wish to run additional code before and/or after the loop, use the following format:

perl -ne 'BEGIN {code1}; code2; END {code3}' filename

In this particular example, code1 runs before looping through the lines of filename and code3 runs after the loop has ended.

Okay, now the meaning of the actual code:

chomp($F[2])

The last column still carries the newline character from the end of the line; I remove it so that it doesn't break up the final output.

push @{$hash->{$F[0]}->{$F[1]}},$F[2]

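This is the core part: one protein can have multiple categories (hash of hash), and one category can hold multiple values in an array (hash of hash of array). Whatever is in the third column is pushed onto the array referenced by $hash->{ProteinNo}->{Category}. Perl creates the intermediate hashes and the array automatically the first time they are used (autovivification), so nothing has to be initialized beforehand.

After the loop has ended, the END block prints the hash structure, one row per protein, and mission accomplished.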


Alper Yilmaz

Assist.Prof.Dr. Alper YILMAZ

My research interests include genome grammar and NGS analysis.
