Reducing a word to its base form through stemming and lemmatization is part of a technique called canonicalisation. Stemming tries to reduce a word to its root form, while lemmatization tries to reduce it to its lemma. Both the root and the lemma are base forms of the inflected word; only the method used to arrive at them differs.
There are some cases that neither stemming nor lemmatization can handle. You need another preprocessing step before you can stem or lemmatize such words effectively.
For example, suppose the corpus contains two misspelled versions of the word ‘disappearing’: ‘dissappearing’ and ‘dissapearing’. After you stem these words, you’ll have two different stems, ‘dissappear’ and ‘dissapear’, so you still have the problem of redundant tokens. Lemmatization, on the other hand, won’t work on these two words at all: it relies on correct dictionary spellings, so it will return the words unchanged.
To deal with different spellings that arise from different pronunciations, we use phonetic hashing, which helps you canonicalise different versions of the same word to a single base word.
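To see the redundant-token problem concretely, here is a toy suffix-stripping stemmer. It is a deliberately simplified stand-in: the name `toy_stem` and its suffix list are made up for this illustration, and it is not the Porter stemmer or any library's implementation. It only shows that stripping suffixes cannot repair a misspelled root.

```python
def toy_stem(word):
    """Strip one common English suffix (a toy rule set, assumed for illustration)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long root would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


words = ["disappearing", "dissappearing", "dissapearing"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['disappear', 'dissappear', 'dissapear']
```

All three tokens refer to the same word, yet they end up as three distinct stems.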
There are certain words which have different pronunciations in different languages. As a result, they end up being spelt differently. Examples of such words include names of people, city names, names of dishes, etc. Take, for example, New Delhi. Delhi is also pronounced as Dilli in Hindi. Hence, it is not surprising to find both variants in an uncleaned text corpus.
Phonetic hashing places words that sound alike (that is, words made up of similar phonemes) into a single bucket and gives all these variations the same hash code. Hence, the words ‘Dilli’ and ‘Delhi’ will have the same code.
Phonetic hashing can be done using the Soundex algorithm. Soundex looks only at the letters of a word, not at the language the word comes from, so as long as two words are spelt with similar-sounding consonants, they will get the same hash code.
Now, let’s see it through an example: the Soundex code of the word ‘Mississippi’. The steps to calculate the hash code are as follows:
- The Soundex code is a four-character code. Its first character is the first letter of the input word, which is retained as is. So the first character of the phonetic hash is ‘M’. The remaining letters of the word now need to be transformed.
- The second step is to map all the consonants (except the first letter) to digits. Vowels are written as is, and ‘H’, ‘Y’ and ‘W’ are left unencoded (that is, they are removed from the word). Here ‘S’ maps to 2 and ‘P’ maps to 1, so after mapping the consonants, the code becomes MI22I22I11I.
- The third step is to remove all the vowels. ‘I’ is the only vowel here, and removing every ‘I’ gives M222211. Next, merge consecutive duplicate digits into a single digit: all the ‘2’s are merged into one ‘2’ and all the ‘1’s into one ‘1’. The code is now M21.
- The fourth step is to force the code to be exactly four characters long. If it is shorter than four characters, pad it with zeroes; if it is longer, truncate it from the right. Since M21 has fewer than four characters, you pad it with one ‘0’ at the end. The final code is M210.
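The four steps above can be traced in code. The sketch below returns the intermediate string after each step so that it can be compared with the worked example. The consonant-to-digit table is the one commonly used with Soundex and is an assumption here, since the steps only show ‘S’ and ‘P’; the function name `soundex_stages` is made up for this illustration, and the input is assumed to be a non-empty alphabetic word.

```python
# Consonant-to-digit table commonly used with Soundex (assumed here).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex_stages(word):
    """Return the intermediate strings produced by the four steps."""
    word = word.upper()
    # Steps 1 and 2: keep the first letter, map the other consonants to
    # digits, drop H, W and Y, and leave the vowels in place for now.
    mapped = word[0] + "".join(
        CODES.get(ch, ch) for ch in word[1:] if ch not in "HWY"
    )
    # Step 3a: remove the vowels (the first letter is never removed).
    no_vowels = mapped[0] + "".join(ch for ch in mapped[1:] if ch not in "AEIOU")
    # Step 3b: merge runs of the same digit into one digit.
    merged = no_vowels[0]
    for ch in no_vowels[1:]:
        if ch != merged[-1]:
            merged += ch
    # Step 4: pad with zeroes, then cut to exactly four characters.
    final = (merged + "000")[:4]
    return mapped, no_vowels, merged, final


print(soundex_stages("Mississippi"))
# ('MI22I22I11I', 'M222211', 'M21', 'M210')
```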
Let’s put the theory to work and derive the Soundex code for some words. A short Python function is enough to get the Soundex code of any word.
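A minimal sketch of such a function is given below. It follows the order of steps described in this article (drop vowels and ‘H’, ‘W’, ‘Y’ first, merge duplicate digits second) and assumes the commonly used consonant-to-digit table and a non-empty alphabetic input. The helper names are illustrative, not taken from any library.

```python
from itertools import groupby

# Consonant-to-digit table commonly used with Soundex (assumed here).
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word):
    """Four-character phonetic hash, following the steps in this article."""
    word = word.upper()
    # Keep only the letters that have a digit; this drops vowels, H, W and Y.
    digits = [_CODES[ch] for ch in word[1:] if ch in _CODES]
    # Merge runs of identical digits into a single digit.
    collapsed = "".join(digit for digit, _ in groupby(digits))
    # First letter + digits, padded with zeroes and cut to four characters.
    return (word[0] + collapsed + "000")[:4]


for w in ["Mississippi", "Delhi", "Dilli", "dissappearing", "dissapearing"]:
    print(w, soundex(w))
# Mississippi M210
# Delhi D400
# Dilli D400
# dissappearing D216
# dissapearing D216
```

Note that many implementations of American Soundex merge duplicates before dropping vowels, and those return M221 for ‘Mississippi’, so codes produced by different tools may not match.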
This way you can find all the words in the corpus that have the same hash code and replace them with the correct word. Another method for spelling correction is to measure the ‘distance between words’ using the concept of edit distance, which we will cover in the upcoming article.
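As a rough sketch of that replacement step, the snippet below maps every token to a preferred spelling that shares its hash code. The list of correct words, the sample tokens and the function name `canonicalise` are hypothetical, and the compact `soundex` helper is an assumed implementation of the steps in this article, included only so the snippet runs on its own.

```python
from itertools import groupby

# Commonly used Soundex consonant table (assumed here).
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word):
    """Compact phonetic hash: drop uncoded letters, merge duplicates, pad."""
    word = word.upper()
    digits = [_CODES[ch] for ch in word[1:] if ch in _CODES]
    return (word[0] + "".join(d for d, _ in groupby(digits)) + "000")[:4]


def canonicalise(tokens, correct_words):
    """Replace each token with the correct word that shares its hash code."""
    lookup = {soundex(w): w for w in correct_words}
    # Tokens whose code matches no correct word are left unchanged.
    return [lookup.get(soundex(tok), tok) for tok in tokens]


tokens = ["Dilli", "Delhi", "dissappearing", "dissapearing", "city"]
print(canonicalise(tokens, ["Delhi", "disappearing"]))
# ['Delhi', 'Delhi', 'disappearing', 'disappearing', 'city']
```

One caveat with this approach: unrelated words can share a code (for example, ‘Robert’ and ‘Rupert’ both hash to R163), so it is worth reviewing the buckets before replacing anything automatically.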