adamcrussell

Perl Weekly Challenge 005

Anagrams!

This weeks challenge involves anagrams. 

The problem statements are short and are included in the first comment of the code. The code blocks shown link to GitHub Gists.

The Approach

For both parts I used the same general approach. 

By the Fundamental Theorem of Arithmetic every integer greater than 1 is either a prime number itself or can be represented as the unique product of prime numbers.

Using this information the idea used in both Part 1 and Part 2 is to have each letter mapped to a prime number and each word represented by the product of each its letter's corresponding numbers. For example let's consider the word "live":

    live = l * i * v * e = 31 * 11 * 73 * 2 = 49786.

Any word also spelled with an 'l' an 'i' a 'v' and an 'e' will have the same product of 49786 even if the letters are arranged differently. In this way we have a straightforward path to determining anagrams.

Assumptions:

  1. Only single word anagrams are considered
  2. Words shall contain only the 26 letters of the English alphabet
  3. Only lower case letters are used.
  4. The values assigned to each letter are inverse to their overall frequency. 
  5. No bad actors: all inputs are safe and clean and need not be checked.

That last point is a minor detail. I thought that to minimize the overall size of the numbers generated I would assign smaller numbers to the most frequently used letters. Letter frequency uses the Lewand ordering described in the wikipedia article. It seems that this was unnecessary for the test corpus used (the standard Unix dictionary file) but I decided to leave it in the final code anyway as it was an interesting design point nonetheless.

Part 1

The code for Part 1 takes a  test word from the command line and computes it's product. The product for every word in the standard Unix dictionary file is also computed and stored in a hash. That hash is then searched for matches against the test word and the results printed. The line 

    delete($word_product{$test_word});

is just a simple way to make sure that the test word itself is omitted from the printed results. Interestingly there are exactly two words in the dictionary file which contain a hyphen! Those two entries are "Jean-Christophe" and "Jean-Pierre". Because of these I make sure to remove hyphens.

Sample Run

$ perl ch-1.pl live
live: vlei levi vile veil evil

Part 2

Again using the same approach Part 2 words are read in from stdin and the word/product hash created. The values in the hash (the products) are counted and the sequence of letters corresponding to the product with the highest number of entries is printed.

Sample Run

$ perl ch-2.pl < /usr/share/dict/words
e a s l p

Note: In the case there is a tie the first one in the sorted list is taken. Co-incidentally in the dictionary file there is a tie! In addition to the one shown above is a o n r g which also has ten anagrams.

$ perl ch-1.pl easlp

easlp: salep speal pales spale saple slape lapse elaps sepal lepas

$ perl ch-1.pl aonrg

aonrg: ronga orang angor grano goran argon nagor groan rogan organ

Comments for this post were locked by the author