Monday, December 28, 2009

Perl regex, substitution, file reading/writing

Learning by doing is the best motto. With that in mind, while exploring Perl I wrote a few scripts to extract and reuse parts of the Perl learning materials on the web.

Just because there are a lot of resources out there on the web doesn't mean they will put you on a fast track to learning Perl. Much of the material is redundant and not worth going over, and often we don't know how to cut to the chase.

I used to go to this site often: http://perldoc.perl.org

There is a lot of material there. In this article I would like to explore pattern matching and substitution: reading from a file, substituting certain parts of its lines, and writing the result to another file.

The program below covers the fundamentals of:

1. Opening a file (a web page saved from http://perldoc.perl.org).
2. Reading from the file line by line.
3. Substituting within each line to get rid of uninteresting material.
4. Writing to a file and to the console.

If you open http://perldoc.perl.org/perlrequick.html, you will see that the page is a tutorial on pattern matching. It is good material, but personally I am only interested in things like:

1. "Hello World" =~ /World/; # matches

2. /[^a]at/; # doesn't match 'aat' or 'at', but matches all other 'bat', 'cat', '0at', '%at', etc.

3. /[^0-9]/; # matches a non-numeric character

4. /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary

Looking at lines like the above, it is easy to grasp the idea quickly without having to read all of the explanation.
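
Comments like these can also be checked rather than taken on trust. The following is a minimal sketch, not part of the tutorial: the sample strings and the helper name `report` are made up for illustration, and the patterns are the three character-class examples listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: says whether a string matches a pattern.
sub report {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/ ? "matches" : "doesn't match";
}

# Sample strings chosen only for illustration.
my @checks = (
    [ 'bat', '[^a]at' ],   # matches: 'b' is not 'a'
    [ 'aat', '[^a]at' ],   # doesn't match: the char before 'at' is 'a'
    [ 'at',  '[^a]at' ],   # doesn't match: nothing comes before 'at'
    [ 'x',   '[^0-9]' ],   # matches: 'x' is a non-numeric character
    [ '42',  '[^0-9]' ],   # doesn't match: digits only
    [ '^at', '[a^]at' ],   # matches: '^' is ordinary when not first in []
);

for my $check (@checks) {
    my ($string, $pattern) = @$check;
    printf "%-7s %-10s %s\n", "'$string'", "/$pattern/", report($string, $pattern);
}
```

Each row prints the string, the pattern, and the verdict, so a surprising result stands out at a glance.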

So, as I continue to believe in "learning by doing", I wrote a couple of scripts to extract only the info I wanted from such a tutorial.


--The following script reads the input text file we created by copying/pasting from the URL: http://perldoc.perl.org/perlrequick.html

--looks for only those lines that start with a numbered pattern, as in "1. blah", possibly after leading whitespace

--writes the matched lines from the input file into a separate file and to the console.

*To run the program successfully, you will have to select the page at http://perldoc.perl.org/perlrequick.html
and copy/paste it into a text file (in my case the text file is called perl_regex.txt, located in my personal directory
on my Mac, which is "/Users/my_name/perlscripts/").
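
Before touching any files, the line-selection pattern can be tried on its own. This is a minimal sketch under stated assumptions: the sample lines and the variable names are invented for illustration, and only the regular expression is the one used by the script in this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The line-selection pattern: a leading digit and dot, or leading
# whitespace followed somewhere by a digit and a dot.
my $numbered = qr/^(\s.*[1-9]|[1-9])\./;

# Sample lines invented for illustration.
my @lines = (
    '1. "Hello World" =~ /World/; # matches',
    '   2. /[^0-9]/; # matches a non-numeric character',
    'This paragraph of explanation is skipped.',
    '10. a two-digit number is skipped as well',
);

# grep keeps only the lines that match the pattern.
my @kept = grep { $_ =~ $numbered } @lines;

print "$_\n" for @kept;
```

Only the first two sample lines are kept. The last one shows a limit of the pattern: it expects a single digit from 1 to 9 right before the dot, so a line that starts with "10." is not picked up.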

#########################################################

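#!/usr/bin/perl -w
use strict;

# Replace this with the location/name of your file.
my $data_file = "/Users/my_name/perlscripts/perl_regex.txt";
my $out_file  = "out_file.txt";

open(FHAND, "<", $data_file) or die "Cannot open $data_file: $!";
open(OUTHAND, ">", $out_file) or die "Cannot open $out_file: $!";

# Read the input one line at a time; each line lands in $_.
while (<FHAND>) {
    # Keep lines that start with a digit and a dot ("1."), or with
    # whitespace followed somewhere by a digit and a dot.
    if ($_ =~ /^(\s.*[1-9]|[1-9])\./) {
        # $_ still ends with its own newline, so the extra "\n"
        # leaves a blank line after each match.
        print "$_\n";
        print OUTHAND "$_\n";
    }
}

close(FHAND);
close(OUTHAND);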

#########################################################


If you can run the above successfully, you will be greatly relieved (at least I was) to see the pattern-matching concepts in a simple, readable form.

The output looks something like this:

#########################################################

1. print "It matches\n" if "Hello World" =~ /World/;

1. print "It doesn't match\n" if "Hello World" !~ /World/;

1. $greeting = "World";

2. print "It matches\n" if "Hello World" =~ /$greeting/;

1. $_ = "Hello World";

2. print "It matches\n" if /World/;

1. "Hello World" =~ m!World!; # matches, delimited by '!'

2. "Hello World" =~ m{World}; # matches, note the matching '{}'

3. "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',

4. # '/' becomes an ordinary char

1. "Hello World" =~ /world/; # doesn't match, case sensitive

#########################################################

Still, it is a little unreadable because of the numbering and spacing. Those dotted numbers don't make sense. So?
Guess what, we write another script to get rid of them, as follows:



This script reads the earlier output file, removes the numbering and the space after it, and writes the extracted text into a new file:

#########################################################
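#!/usr/bin/perl -w
use strict;

# Output of the first script; replace with the location of your file.
my $read = "/users/nirmalksingh/perlscripts/out_file.txt";

open(READHANDLE, "<", $read) or die "Cannot open $read: $!";
open(WRITEHANDLE, ">", "out_file2.txt") or die "Cannot open out_file2.txt: $!";

# Read the first script's output one line at a time into $_.
while (<READHANDLE>) {
    # Earlier attempt, left commented out:
    # $_ =~ s/^([1-9]/.)\s.*/\s/g;
    # Strip a leading "1." style number and the whitespace after it.
    $_ =~ s/^([1-9]\.)\s+//g;
    print WRITEHANDLE $_;
    # $_ keeps its newline, so the console shows an extra blank line.
    print "$_\n";
}

close READHANDLE;
close WRITEHANDLE;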

#########################################################
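
The substitution at the heart of this script can also be tried in isolation. The sketch below assumes a few sample lines invented for illustration and applies the same `s///` expression to a copy of each one, so it is easy to see which lines change and which are left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines invented for illustration.
my @lines = (
    '1. "housekeeper" =~ /keeper/; # matches',
    q{2.    /cat/; # matches 'cat'},
    'no number here, so nothing changes',
);

my @stripped;
for my $line (@lines) {
    # Work on a copy so the original list stays untouched.
    (my $copy = $line) =~ s/^([1-9]\.)\s+//g;
    push @stripped, $copy;
}

print "$_\n" for @stripped;
```

Because the pattern is anchored with `^`, only a number at the very start of a line is removed, and `\s+` swallows however many spaces follow it.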

Now, in essence, the output file/console looks more or less as follows:

#########################################################

'cathouse' =~ /cat$foo/; # matches

'housecat' =~ /${foo}cat/; # matches

"housekeeper" =~ /keeper/; # matches

"housekeeper" =~ /^keeper/; # doesn't match

"housekeeper" =~ /keeper$/; # matches

"housekeeper\n" =~ /keeper$/; # matches

"housekeeper" =~ /^housekeeper$/; # matches

/cat/; # matches 'cat'

#########################################################

That's all.

As I went along, I kept modifying the input, doing other substitutions and writing the results to file/console accordingly.
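
One such variation, sketched here only as a possibility and not as one of the scripts I actually ran, is to filter and strip in a single pass. To keep the example self-contained it assumes a short made-up input and uses in-memory filehandles instead of files on disk. The variable names are hypothetical too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the pasted tutorial text; invented for illustration.
my $input = <<'END';
Some explanation that should be dropped.
1. "Hello World" =~ /World/; # matches
More explanation.
2. "Hello World" =~ /world/; # doesn't match, case sensitive
END

my $output = '';

# Opening a reference to a scalar gives an in-memory filehandle,
# so the sketch needs no files on disk.
open(my $in,  '<', \$input)  or die "Cannot read input: $!";
open(my $out, '>', \$output) or die "Cannot write output: $!";

while (my $line = <$in>) {
    # Filter: keep only lines that begin with a digit and a dot.
    next unless $line =~ /^[1-9]\./;
    # Substitute: strip the number, the dot and the whitespace after it.
    $line =~ s/^[1-9]\.\s+//;
    print $out $line;
}

close($in);
close($out);

print $output;
```

Swapping the two scalar references for real file names turns this back into an ordinary file-to-file script.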

It can get really, really addictive, because with Perl it is always DWIS (do what I say), and that's very satisfying for programmers.

Thanks for visiting and reading. Please feel free to write to me directly if you have any questions; I will be glad to answer.

~nirmal