Inverted Index Formatting

Part 1 of Perl Weekly Challenge 024 was to create an Inverted Index. Here I will describe how I created the Inverted Index and also how I displayed the results using format.

The Code

Sample Run

$ perl perl5/

--------------------------------------------------------------------
|Word                     |               Documents                |
|abbey                    |arrow.txt                               |
|abolishing               |eighty.txt                              |
|about                    |catriona.txt, eighty.txt                |
|address                  |balloon.txt                             |
|adventures               |kidnapped.txt                           |

What I Did

sub index_contents reads the files and stores the index in a hash, where each index word is a key and the value for each key is an array reference of document names. The documents themselves are short excerpts of text taken from novels in the public domain and published on Project Gutenberg. See the References section below for the books used. The text is cleaned up by removing non-ASCII characters and punctuation.
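To make the data structure concrete, here is a minimal sketch of building such an index. It is not the code from this post: build_index is a hypothetical helper that takes document text directly instead of reading files, and lowercasing the words is my assumption based on the sample output.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of document name => text pairs
# (the code in the post reads files from disk instead) and returns
# a hash reference mapping word => array reference of document names.
sub build_index {
    my (%documents) = @_;
    my %index;
    for my $name (sort keys %documents) {
        my $text = lc $documents{$name};   # assumption: words are lowercased
        $text =~ s/[^[:ascii:]]//g;        # remove non-ASCII characters
        $text =~ s/[[:punct:]]//g;         # remove punctuation
        my %seen;
        for my $word (split ' ', $text) {
            next if $seen{$word}++;        # list each document once per word
            push @{ $index{$word} }, $name;
        }
    }
    return \%index;
}

# Made-up sample text, only for illustration.
my $index = build_index(
    'arrow.txt'  => 'The abbey, about noon.',
    'eighty.txt' => 'About abolishing it!',
);
print "$_: @{ $index->{$_} }\n" for sort keys %$index;
```

Sorting the document names before indexing keeps the order of each array reference stable from run to run.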

sub print_index takes the index and prints it in a nice tabular way using Perl's format capability. This functionality is seldom used these days. Currently, most reports end up being formatted for display on a web page or simply arranged in a comma-delimited way for use in spreadsheets. Also, there are so many more advanced text formatting modules available on CPAN that the more basic formatting capabilities of format are not the preferred solution. Still, Perl has this ability, and despite being rarely used it still has its applications.

Note: Formats specify how output is to be sent to a file handle. The default file handle is STDOUT. It is expected that the file handle name and the format name match, although this can be changed by setting the special variable $~ to be the name of the format to be used for the current file handle.
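As a small illustration of that note, the sketch below declares one format named STDOUT, which is picked up automatically, and one with a different name, which is only used after $~ is set. The format name ALTERNATE and the variables $word and $count are made up for this example.

```perl
use strict;
use warnings;

# Package variables for the formats to read (illustrative names).
our ($word, $count);

# A format named STDOUT is used by default for writes to STDOUT.
format STDOUT =
default:   @<<<<<<<<< @>>>
$word, $count
.

# A format with another name is only used once $~ points at it.
format ALTERNATE =
alternate: @<<<<<<<<< @>>>
$word, $count
.

($word, $count) = ('abbey', 1);
write;                 # uses format STDOUT

$~ = 'ALTERNATE';      # change the format for the currently selected handle
write;                 # uses format ALTERNATE
```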

I actually created three formats: a header and a footer for the table, in addition to a format for the index itself. The file handle is always the default, STDOUT. I first set $~ = "INDEX_HEADER" and follow it with a write to print the header. The format is then changed to the one for the index, and the contents of the index hash are printed. Finally, the format is changed to the footer and the footer is printed.
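The header, index, footer sequence could be sketched roughly as follows. Only the name INDEX_HEADER comes from this post; INDEX_ROW, INDEX_FOOTER, the column widths, and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as the index:
# word => array reference of document names.
my %index = (
    abbey => ['arrow.txt'],
    about => ['catriona.txt', 'eighty.txt'],
);

# Package variables read by the row format (illustrative names).
our ($word, $documents);

format INDEX_HEADER =
-------------------------------------------------
|Word           |           Documents           |
-------------------------------------------------
.

format INDEX_ROW =
|@<<<<<<<<<<<<<<|@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<|
$word,          $documents
.

format INDEX_FOOTER =
-------------------------------------------------
.

$~ = 'INDEX_HEADER';   # select the header format, then print it
write;

$~ = 'INDEX_ROW';      # one write per index entry
for my $key (sort keys %index) {
    ($word, $documents) = ($key, join ', ', @{ $index{$key} });
    write;
}

$~ = 'INDEX_FOOTER';   # close the table
write;
```

The @<<< fields left-justify their values and truncate anything wider than the field, so a long list of documents would need a wider column or a multi-line field.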

Note: Formats have a lot of capabilities not used in this particular example, and I did not describe in detail the templating aspect per se. These capabilities are very well documented elsewhere:




References

The Call of the Wild by Jack London

White Fang by Jack London

The Black Arrow: A Tale of Two Roses by Robert Louis Stevenson

Kidnapped by Robert Louis Stevenson

Catriona by Robert Louis Stevenson

Around the World in Eighty Days by Jules Verne

Five Weeks in a Balloon by Jules Verne
