08/24/2015 dev1up
Today I found myself in need of something that would count the occurrences of every word in a document so that I could see which were the most relevant. This is a fairly common task in search engine optimization and ad campaigning, and while you could tally up every word manually, that would be quite tedious. While I was searching, I came across this pretty nifty bash script. It did almost everything I wanted; I just had to refine it to get the type of results I was looking for.
How Does It Work?
If you’ve ever used a bash script before, you’ve probably come across cat as well. cat concatenates and prints files. We will use it in conjunction with a pipe, |, to “pipe” the contents of the file into the rest of our script.
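As a small illustration of cat feeding a pipe, here is a sketch that is not taken from the original script. The temporary sample file and the use of wc -l as the receiving command are my own assumptions, chosen only to show the mechanics.

```shell
# Write a two-line sample file to a temporary path (hypothetical data)
# so the example is self-contained.
sample=$(mktemp)
printf 'The quick brown fox\njumps over the lazy dog\n' > "$sample"

# cat prints the file to standard output; the pipe hands that output
# to wc -l, which counts the lines it receives.
cat "$sample" | wc -l
```

Any command that reads standard input can sit on the right-hand side of the pipe, which is what lets the rest of the script be chained together one step at a time.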
tr is a useful tool and is prevalent throughout this script. It can translate, squeeze, and delete characters. First, it translates all uppercase characters to lowercase. This is useful because uniq, explained below, is case-sensitive, which means that “the” and “The” would count as two separate words. In addition to translating uppercase to lowercase, tr also replaces all spaces with newlines, so that each word ends up on its own line, and deletes punctuation. It is important to remove punctuation so that uniq treats something like “end” and “end.” or “email” and “e-mail” as the same word.
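The three tr steps described above can be sketched roughly as follows. The function name normalize_words is hypothetical, and the exact flags and ordering in the original script may differ; this only shows one way the steps could be chained.

```shell
# normalize_words (hypothetical name): reads text on standard input and
# writes one lowercase, punctuation-free word per line.
normalize_words() {
  tr '[:upper:]' '[:lower:]' |  # translate uppercase to lowercase
    tr ' ' '\n' |               # replace each space with a newline
    tr -d '[:punct:]'           # delete punctuation characters
}

printf 'The end. An E-mail, the END\n' | normalize_words
```

With this input, “The”, “the” and “END”, “end.” collapse to the same spellings, and “E-mail” becomes “email”. Note that two spaces in a row would produce an empty line, which is why empty lines need to be dealt with in a later step.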
grep is our simple search tool: it prints the lines that match a pattern or, when inverted, the lines that do not. In this script it is used to find empty lines and remove them.
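A minimal sketch of removing empty lines with grep might look like this. The function name drop_empty_lines and the exact pattern are my assumptions; the original script may use a different expression.

```shell
# drop_empty_lines (hypothetical name): -v inverts the match, so grep
# prints only the lines that are NOT empty or whitespace-only.
drop_empty_lines() {
  grep -v '^[[:space:]]*$'
}

printf 'alpha\n\nbeta\n   \ngamma\n' | drop_empty_lines
```

The pattern anchors at both the start and the end of the line, so a line is dropped only if it contains nothing but whitespace.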
sort does exactly what you would expect: it sorts its input. In this case, this is important for uniq to work correctly. After uniq is run, sort is used again to order the results so that our keywords run from most to least common.
Once our input is sorted, we can then ask uniq to count the number of times a line is repeated. It is important to use sort first, because uniq relies on repeated lines being next to each other.
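To make the sort, uniq, sort chain concrete, here is a short sketch. The function name count_words is hypothetical, and the -nr flags on the final sort (numeric, reversed) are my choice for getting a most-to-least-common order; the original script may use different flags.

```shell
# count_words (hypothetical name): expects one word per line on stdin.
count_words() {
  sort |        # group identical words on adjacent lines
    uniq -c |   # collapse each group and prefix it with its count
    sort -nr    # order numerically by count, largest first
}

# Without the first sort, the two "the" lines are not adjacent,
# so uniq reports "the" twice with a count of 1 each.
printf 'the\nfox\nthe\n' | uniq -c

# With the full chain, "the" is reported once with a count of 2,
# followed by "fox" with a count of 1.
printf 'the\nfox\nthe\n' | count_words
```

The amount of padding uniq -c puts in front of each count varies between implementations, so it is safer to read the output by fields than by column position.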
- One thing I noticed while using this script is that tr -d [:punct:] does not remove special characters, such as bullets (not asterisks).
- It would also be beneficial to have a blacklist of common words to exclude from the count, such as “the”.
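Both notes above could be addressed along these lines. This is only a sketch under stated assumptions: the function name word_counts, the temporary stop-word file, and its contents are all hypothetical. Keeping only ASCII letters, digits, and newlines does remove bullets, but it also strips accented letters, so it is a blunt fix that suits English text best.

```shell
# Hypothetical blacklist of common words, one per line.
stopwords=$(mktemp)
printf 'the\na\nof\n' > "$stopwords"

# word_counts (hypothetical name): reads text on stdin and prints
# "count word" lines, most common first, skipping blacklisted words.
word_counts() {
  tr '[:upper:]' '[:lower:]' |
    tr ' ' '\n' |
    LC_ALL=C tr -cd '[:alnum:]\n' |    # keep only ASCII letters, digits, newlines
    grep -v '^$' |                     # drop empty lines
    grep -v -x -F -f "$stopwords" |    # drop lines that exactly equal a stop word
    sort | uniq -c | sort -nr
}

# \342\200\242 is the UTF-8 encoding of a bullet character.
printf '\342\200\242 The fox. The dog of a fox\n' | word_counts
```

Here -F treats each blacklist entry as a fixed string rather than a pattern, and -x requires the whole line to match, so a stop word like “a” does not knock out every word that merely contains the letter a.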