PERL subroutines and references Andrew Emerson, High Performance Systems, CINECA

Consider the following code:
# Counts Gs in various bits of DNA
$dna="CGGTAATTCCTGCA";
$G_count=0;
for ($pos=0; $pos < length $dna; $pos++) {
    $base=substr($dna,$pos,1);
    ++$G_count if ($base eq 'G');
} # end for
# ... do something else
$new_dna = <DNA_FILE>;
$G_count=0;
for ($pos=0; $pos < length $new_dna; $pos++) {
    $base=substr($new_dna,$pos,1);
    ++$G_count if ($base eq 'G');
} # end for
It is inconvenient to repeat the same piece of code every time
we need it.
It would be better if we could write something like:
# Counts Gs in various bits of DNA
# improved version (PSEUDO PERL)
#
Main program
$dna="CGGTAATTCCTGCA";
count_G using $dna;
.
. do something else
$new_dna = <DNA_FILE>;
count_G using $new_dna;
.
.
count_G subroutine
for ($pos=0; $pos < length $dna; $pos++) {
    $base=substr($dna,$pos,1);
    ++$G_count if ($base eq 'G');
}
Subroutines
- Pieces of code used in this way are usually called
subroutines. They are common to all programming
languages, but under different names:
    - subroutines (Perl, FORTRAN)
    - functions (C, C++*, FORTRAN, Java*)
    - procedures (PASCAL)
- Essential for procedural or structured
programming.
* object-oriented programming languages
Advantages of using subroutines
- saves typing → fewer lines of code → fewer chances
to make a mistake
- re-usable:
    - if a subroutine needs to be modified, it is
    changed in only one place
    - other programs can use the same subroutine
    - can be tested separately
- makes the overall structure of the program
clearer
Program design using subroutines
Conceptual flow: subroutines can use other subroutines to
make more complex and flexible programs.
Program design using subroutines - pseudo code
#
# Main program
# pseudo-code
..set variables
.
call sub1
.
call sub2
.
call sub3
.
exit program

sub 1
# code for sub 1
exit subroutine

sub 2
# code for sub 2
exit subroutine

sub 3
# code for sub 3
call sub 4
exit subroutine

sub 4
# code for sub 4
exit subroutine
Using subroutines in Perl
Example 1.
# Program to count Gs in DNA sequences
# (valid perl)
# Main program
$dna="GGCCTAACCTCCGGT";
count_G();
print "no. of G in $dna=$number_of_g\n";
# subroutines
sub count_G {
    for ($pos=0; $pos < length $dna; $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
}
Subroutines in Perl
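Defined using the sub keyword:
sub name {
...
}
Called from the main program or another subroutine using
its name followed by parentheses:
name();
The parentheses can be dropped (name;) only if the subroutine
has already been defined or declared before the call.
In old Perl programs (like mine) you will sometimes see
&name;
but the & is optional in modern Perl.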
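The calling forms can be compared side by side. This is a minimal sketch; greet is a hypothetical subroutine invented for the illustration, and it is defined before the calls so that the bare-name form is also accepted.

```perl
use strict;
use warnings;

# Hypothetical subroutine, defined BEFORE it is called.
sub greet {
    return "hello";
}

my $r1 = greet();    # name plus parentheses: works wherever the sub is defined
my $r2 = &greet();   # older style with &, still valid
my $r3 = greet;      # bare name: accepted only because greet is defined above

print "$r1 $r2 $r3\n";
```

If the definition of greet were moved to the end of the file, only the first two forms would still call it.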
Subroutines in Perl
Subroutines can be placed anywhere in the program, but it is
best to group them at the end:
# main program
.
.
exit;   # not strictly necessary, but makes it clear that
        # we want to leave the program here
# subroutines defined here
sub sub1 {
...
}
sub sub2 {
...
}
sub sub3 {
...
}
Return to example 1: why is this bad?
# Program to count Gs in DNA sequences
# Main program
$dna="GGCCTAACCTCCGGT";
count_G();                                 # What does count_G need?
print "no. of G in $dna=$number_of_g\n";   # Where did $number_of_g come from?
exit;
# subroutines
sub count_G {
    for ($pos=0; $pos < length $dna; $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
}
Perl subroutines - passing parameters
The inputs and outputs of a subroutine are specified
using parameters (or arguments) given when the
subroutine is called:
$no_of_G = count_g($dna);
It is now clear what the subroutine expects as input
and what is returned as output.
Other examples:
$day_in_yr = calc_day($day,$month);
$num_sequences = read_sequence_file(@database);
Input parameters
All the parameters passed to a Perl subroutine (including the
elements of any arrays) end up in a single array called @_
Therefore in the code:
#
$pos = find_motif($motif,$protein);
.
sub find_motif {
    $a = $_[0];   # $a takes the value of $motif
    $b = $_[1];   # $b takes the value of $protein
    ...
}
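A runnable version of this pattern might look as follows. It is only a sketch: the body of find_motif is a stand-in (the slide elides it) that uses the built-in index function, and the sequences are made-up test data.

```perl
use strict;
use warnings;

# Stand-in body (an assumption): returns the 0-based position of
# $motif inside $protein, or -1 if the motif is not present.
sub find_motif {
    my ($motif, $protein) = @_;   # same values as $_[0] and $_[1]
    return index($protein, $motif);
}

my $pos = find_motif("KDEL", "MKTAYIAKDELQ");
print "motif found at position $pos\n";   # prints 7
```

Unpacking @_ into named variables on the first line of the sub documents what the subroutine expects.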
Subroutine output (return values)
A subroutine does not have to explicitly return something
to the main program:
print_title();
sub print_title {
    print "Sequence Manipulation program\n";
    print "-----------------------------\n";
    print "Written by: A.Nother \n";
    print "Version 1.1: \n";
}
but often it does, even if only to signal whether the procedure
went well or gave an error.
Subroutine return values
By default the subroutine returns the value of the last
expression evaluated, but you can use the return statement
to make this explicit:
sub count_G {
    $dna=$_[0];                  # input
    for ($pos=0; $pos < length $dna; $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g;         # output
}
return also exits the subroutine
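The difference between the implicit and the explicit form can be sketched as below. The names gc_implicit and gc_explicit are hypothetical, and the counting uses the tr operator (which returns the number of matching characters) instead of the loop from the slides, purely to keep the example short.

```perl
use strict;
use warnings;

# Implicit: the value of the last expression evaluated is returned.
sub gc_implicit {
    my ($dna) = @_;
    $dna =~ tr/GC//;              # tr returns the number of G and C found
}

# Explicit: the same result, but the output is stated with return.
sub gc_explicit {
    my ($dna) = @_;
    my $count = ($dna =~ tr/GC//);
    return $count;
}

print gc_implicit("GGCAT"), " ", gc_explicit("GGCAT"), "\n";   # prints 3 3
```

When a sub ends with a loop, as count_G does, the implicit value is not useful, so an explicit return is the safer habit.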
Return Values
You can return from more than one point in the sub, which
can be useful for signalling errors:
sub count_G {
    $dna=$_[0];
    if ($dna eq "") {
        return "No DNA was given"; # exit with error message
    } else {
        ..
    } # end if
    return $number_of_G;
} # end sub
In that case the calling program or sub should check the
return value before continuing. In general, however, it is best
to return from only one point; otherwise the logic is difficult
to follow.
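How the caller might check the return value is sketched below. Note one substitution: this version signals the error by returning undef rather than the error string used on the slide, because undef cannot be confused with a valid count. That choice is an assumption of the sketch, not the author's method.

```perl
use strict;
use warnings;

sub count_G {
    my ($dna) = @_;
    # Early exit: an empty or missing sequence is an error.
    # Returning undef here is this sketch's convention.
    if (!defined $dna || $dna eq "") {
        return;
    }
    my $number_of_G = ($dna =~ tr/G//);   # tr counts the Gs
    return $number_of_G;
}

for my $seq ("GGCCTAACCTCCGGT", "") {
    my $n = count_G($seq);
    if (defined $n) {
        print "no. of G in $seq = $n\n";
    } else {
        print "error: no DNA was given\n";
    }
}
```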
Return values
A sub can also return multiple scalars, arrays, etc., but
just as for the input, everything ends up in a single
list:
@DNA = read_file($filename);
.
.
($nG,$nC,$nT,$nA) = count_bases($dna);
sub count_bases {
    ...
    return ($num_G,$num_C,$num_T,$num_A);   # note ( and ) for the list
} # end sub
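One possible body for count_bases is sketched here. The slide elides the implementation, so the hash-based counting is an assumption made for illustration; only the call and the returned list follow the slide.

```perl
use strict;
use warnings;

sub count_bases {
    my ($dna) = @_;
    my %count = (G => 0, C => 0, T => 0, A => 0);
    for my $base (split //, $dna) {
        $count{$base}++ if exists $count{$base};   # ignore anything else
    }
    return ($count{G}, $count{C}, $count{T}, $count{A});   # a list of 4 scalars
}

my ($nG, $nC, $nT, $nA) = count_bases("CGGTAATTCCTGCA");
print "G=$nG C=$nC T=$nT A=$nA\n";   # prints G=3 C=4 T=4 A=3
```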
Counting bases - Attempt 2
# Program to count Gs in DNA sequences
# using input/output parameters
# Main program
$dna="GGCCTAACCTCCGGT";
$num_g = count_G($dna);
print "no. of G in $dna=$num_g\n";
exit;
# subroutines
sub count_G {
    $dna=$_[0];
    for ($pos=0; $pos < length $dna; $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g; # return value of sub
}
Better, but all the variables inside the sub are also visible
outside the sub, not only the parameters!
The need for variable scoping
A subroutine written this way is in danger of overwriting a
variable used elsewhere in the program. Remember that a
subroutine should work like a black box: apart from
well-defined inputs/outputs it should not affect the rest of the
program.
[Figure: input → sub → output]
Apart from input/output, all vars needed by the sub should
appear and disappear within the sub.
This also allows us to use the same names for vars outside
and inside the sub without conflict.
Variable scoping in Perl
- By default, all variables defined outside a sub
are visible within it and vice versa: all variables
are global.
- Therefore the sub can change variables used
outside the sub.
Solution?
Restrict the scope of the variables by making them local
to the subroutine → this eliminates the risk of altering a
variable present outside the sub. It also makes it clear
what the subroutine needs to function.
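The danger is easy to reproduce. The sketch below is deliberately written without my and without use strict, so that every variable is global; the value 42 given to $pos is just an arbitrary marker.

```perl
# Deliberately no "use strict" and no "my": all variables are global.
$pos = 42;                               # the main program's own $pos
$n   = count_G("GGCCTAACCTCCGGT");
print "G count = $n, pos is now $pos\n"; # pos is now 15, not 42

sub count_G {
    $dna = $_[0];
    $number_of_g = 0;
    # This loop counter is the SAME variable as the caller's $pos.
    for ($pos = 0; $pos < length $dna; $pos++) {
        ++$number_of_g if substr($dna, $pos, 1) eq 'G';
    }
    return $number_of_g;
}
```

The main program never asked for $pos to change, yet after the call it holds the length of the sequence.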
Variable scoping in Perl
In Perl, variables are made local to a subroutine (or a block)
using the my keyword. For example,
my $variable1;                  # simple declaration
my $dna="GGTTCACCACCTG";        # with initialization
my ($seq1,$seq2,$seq3);         # more than 1
Caution: you must use () if you declare multiple vars per line.
my $seq1, $seq2;
means
my $seq1;
$seq2;
which is valid Perl, so without use strict the compiler won't
give an error.
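The effect of my on visibility can be seen in a small sketch. The sub name shorten and the sequences are invented for the example; the point is that three separate variables, all called $dna, coexist without interfering.

```perl
use strict;
use warnings;

my $dna = "GGTTCACCACCTG";      # visible in the rest of the file

sub shorten {
    my $dna = "GG";             # a new $dna, local to this sub
    return $dna;
}

{
    my $dna = "TTTT";           # a new $dna, local to this bare block
    print "inside block: $dna\n";
}

print "from sub: ", shorten(), "\n";
print "outside: $dna\n";        # still GGTTCACCACCTG
```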
Subroutines with local variables
# Program to count Gs in DNA sequences - final version
# Main program
$dna="GGCCTAACCTCCGGT";
$num_g = count_G($dna);
print "no. of G in $dna=$num_g\n";
exit;
# subroutines
sub count_G {
    my $dna=$_[0];
    my ($pos,$base);
    my $number_of_g=0;
    for ($pos=0; $pos < length $dna; $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g; # return value
}
Remember that my $dna=$_[0] makes a copy of the value passed in.
Declaring variables like this is called lexical scoping in the
Perl man pages.
Other examples of subroutine use
sub find_motif {
    my $motif=shift;      # shifts the @_ array
    my $protein=shift;    # (avoids $_[0], etc)
    ...
}
sub count_C_and_G {
    my @fasta_lines=@_;
    ...
    return ($num_C,$num_G);   # returns a list
}
sub reverse_seq {
    use strict;
    $seq=$_[0];   # the strict pragma enforces use of my,
                  # so this line will give an error (even
                  # if $seq is set, without my, in the main program)
    ...
}
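What use strict does with an undeclared variable can be checked without stopping the script. This is a sketch: strict_error is a hypothetical helper that compiles a snippet with a string eval and returns the error message, or an empty string if the snippet is accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: compile $code under strict and report any error.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? "" : $@;
}

my $err = strict_error('$seq = "ACGT"');
print "without my: ", ($err ? "rejected" : "accepted"), "\n";

$err = strict_error('my $seq = "ACGT"');
print "with my:    ", ($err ? "rejected" : "accepted"), "\n";
```

In normal programs use strict is placed once at the top of the file, so that it covers the main program and all the subroutines.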
Question: What if we want to pass two arrays?
$seq=compare_seqs(@seqs1,@seqs2);
Remember that everything arrives in the sub in a single
array. Likewise for return values:
(@annotations,@dna) = parse_genbank(@dna);
...
sub parse_genbank {
...
return (@annotations,@dna);
}
In the first example what will be in the special array @_ ?
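The flattening can be checked directly. In this sketch count_args and two_lists are hypothetical helpers and the sequences are made-up data.

```perl
use strict;
use warnings;

my @seqs1 = ("ACGT", "GGCC");
my @seqs2 = ("TTAA", "CCGG", "ATAT");

# Reports how many elements arrived in @_.
sub count_args {
    return scalar @_;
}

# Returns both arrays, which are flattened into one list.
sub two_lists {
    return (@seqs1, @seqs2);
}

print "elements received: ", count_args(@seqs1, @seqs2), "\n";   # 5

my (@first, @second) = two_lists();
print "first has ", scalar(@first), ", second has ", scalar(@second), "\n";   # 5 and 0
```

The sub cannot tell where @seqs1 stopped and @seqs2 began, and on the way back the first array swallows every returned element.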
Solution: Use references
subroutines - call by value
Consider
$i=2;
$j=add_100($i);
print "i=$i\n";     # $i is unaffected by the subroutine
sub add_100 {
    my $i=$_[0];
    $i=$i+100;
    return $i;
}
This is called Call by Value because a copy is made of
the parameter passed (here by my $i=$_[0]) and whatever
happens to this copy in the subroutine doesn't affect the
variable in the main program
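One detail worth knowing: in Perl the copy is made by the assignment to a my variable, because the elements of @_ themselves are aliases for the caller's variables. The sketch below contrasts the two cases; add_100_copy and add_100_in_place are hypothetical names.

```perl
use strict;
use warnings;

sub add_100_copy {
    my $i = $_[0];          # copy the value: the caller's variable is safe
    $i = $i + 100;
    return $i;
}

sub add_100_in_place {
    $_[0] = $_[0] + 100;    # $_[0] is an alias: this changes the caller's variable
    return $_[0];
}

my $i = 2;
my $j = add_100_copy($i);
print "i=$i j=$j\n";        # prints i=2 j=102

add_100_in_place($i);
print "i=$i\n";             # prints i=102
```

Copying the parameters into my variables at the top of every sub is what gives the call-by-value behaviour described above.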
What are references?
References can be considered to be an identifier
or some other description of the objects rather
than the objects themselves.
[Figure: a shopping list ("To buy: bananas, beer, fruit, pasta, frozen pizza, more beer") that can be handed over either as a reference or as a copy]
References
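In computing, references are often
addresses of objects (scalars, arrays, ...) in
memory:
[Figure: memory diagram with addresses 100, 200 and 300, showing the scalar $dna, the array @genbank, and references holding the address of $dna and the address of @genbank]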
References
References (sometimes called pointers in other
languages) can be more convenient because in Perl
they are scalars → often much smaller than the object
they refer to (e.g. an array or hash).
Array references can be passed around and copied very
efficiently, often also using less memory.
Being scalars, they can be used to make complicated
data structures such as arrays of arrays, arrays of hashes
and so on..
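An array of arrays, for instance, is an ordinary array whose elements are array references. The sketch below uses made-up sequences and a hypothetical helper, total_bases, that receives the references as its parameters.

```perl
use strict;
use warnings;

my @seq1 = qw(G G T);
my @seq2 = qw(A C);
my @database = (\@seq1, \@seq2);    # each element is a (scalar) reference

# Adds up the number of elements in every array it is given a reference to.
sub total_bases {
    my @refs = @_;
    my $total = 0;
    $total += scalar @$_ for @refs;
    return $total;
}

print "second base of first sequence: ", $database[0][1], "\n";   # G
print "total bases: ", total_bases(@database), "\n";              # 5
```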
References in Perl
The simplest way to create a reference in Perl is with \
$scalar_ref = \$sequence;       # reference to a scalar
$dna_ref = \@DNA_list;          # reference to an array
$hash_ref = \%genetic_code;     # reference to a hash
To get back the original object the reference needs to be
dereferenced:
$scalar = $$scalar_ref;         # for scalars just add $
@new_dna = @$dna_ref;           # for arrays just add @
%codon_lookup = %$hash_ref;     # similarly for hashes
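Single elements can also be reached through a reference without copying the whole array or hash, using the arrow operator. A small sketch with made-up data follows; first_element is a hypothetical helper.

```perl
use strict;
use warnings;

my @DNA_list     = ("ACGT", "GGCC");
my %genetic_code = (ATG => "M", TGG => "W");

my $dna_ref  = \@DNA_list;
my $hash_ref = \%genetic_code;

# Returns element 0 of the array that $ref points to.
sub first_element {
    my ($ref) = @_;
    return $ref->[0];
}

print "first sequence: ", first_element($dna_ref), "\n";        # ACGT
print "same element:   ", ${$dna_ref}[0], "\n";                 # ACGT
print "ATG codes for:  ", $hash_ref->{ATG}, "\n";               # M
print "array length:   ", scalar(@$dna_ref), "\n";              # 2
print "kind of ref:    ", ref($dna_ref), "\n";                  # ARRAY
```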
Passing two arrays into a sub using references
# compare two databases, each held as an array
#
$results = compare_dbase(\@dbase1,\@dbase2);   # supply refs
...
sub compare_dbase {
    my ($db1_ref,$db2_ref) = @_;   # params are refs to arrays
    my @db1 = @$db1_ref;           # dereference (local copy)
    my @db2 = @$db2_ref;           # dereference (local copy)
    ...
    # now use @db1,@db2
    return $results;
}
Similarly we can return 2 or more arrays by the same
method.
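Returning two arrays might be sketched like this. The sub split_fasta and its rule (lines starting with ">" are annotations, everything else is sequence) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical splitter: takes a reference to an array of lines and
# returns two array references, so the two results stay separate.
sub split_fasta {
    my ($lines_ref) = @_;
    my (@annotations, @dna);
    for my $line (@$lines_ref) {
        if ($line =~ /^>/) {
            push @annotations, $line;
        } else {
            push @dna, $line;
        }
    }
    return (\@annotations, \@dna);
}

my @lines = (">seq1 test", "ACGT", "GGCC", ">seq2 test", "TTAA");
my ($ann_ref, $dna_ref) = split_fasta(\@lines);
print "annotations: ", scalar(@$ann_ref),
      ", sequence lines: ", scalar(@$dna_ref), "\n";   # 2 and 3
```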
References – final words
Caution: Calling by reference can change the
original variables:
@dna1=('G','G','T','C','T','G');
@dna2=('A','A','A','A','A');
add_seqs(\@dna1,\@dna2);
print "dna1=@dna1 \n dna2=@dna2 \n";
sub add_seqs {
    my ($seq1,$seq2) = @_;
    push(@$seq1,@$seq2);
}
OUTPUT
dna1=G G T C T G A A A A A
dna2=A A A A A
If you don't want this behaviour then create local copies of
the arrays as in the previous example.
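A version that leaves the caller's arrays alone could look like the sketch below. The name add_seqs_copy is hypothetical; the only change is that the sub works on a local copy and returns the joined list.

```perl
use strict;
use warnings;

sub add_seqs_copy {
    my ($seq1_ref, $seq2_ref) = @_;
    my @combined = @$seq1_ref;        # local copy of the first array
    push @combined, @$seq2_ref;       # only the copy grows
    return @combined;
}

my @dna1 = qw(G G T C T G);
my @dna2 = qw(A A A A A);
my @joined = add_seqs_copy(\@dna1, \@dna2);

print "dna1=@dna1\n";       # dna1=G G T C T G (unchanged)
print "joined=@joined\n";   # joined=G G T C T G A A A A A
```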
subroutines - summary
- subroutines, defined with sub, are the main
tool for structuring programs in Perl.
- variables used only by the subroutine should be
declared with my, to prevent conflict with
external variables (lexical scoping)
- parameters passed in to the sub end up in the
single array @_; similarly, return values come
back as a single list
- array references need to be used to pass two or
more arrays in to (call by reference) or out of a
sub.