Fuzzy Matching with Soundex and Levenshtein

Publisher: Lingk

Description
This recipe demonstrates fuzzy matching in Spark using Soundex and Levenshtein distance. Soundex encodes a word by how it sounds, so it is often used to match first names that are spelled differently (for example, "Jon" and "John"). Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another, which makes it useful when joining two DataFrames without requiring exact string matches. Fuzzy joins tend to produce false positives, so run several tests and join on multiple columns to improve results. To run this recipe, choose your environment and click Run!
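
The recipe itself runs in Spark, and its code is not shown here. To illustrate what the two comparisons compute, the following is a minimal sketch in plain Python with no Spark dependency. The function names (soundex, levenshtein, fuzzy_join), the sample records, and the max_distance threshold are assumptions made for illustration, not part of the recipe. The Soundex variant shown is the common American one, and other implementations may differ in edge cases.

```python
# Illustrative sketch only: plain-Python versions of the two comparisons.
# All names and sample data here are hypothetical, not taken from the recipe.

# Soundex digit for each consonant group; vowels, Y, H, and W have no digit.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code, or '' if no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a, b):
    """Return the edit distance between a and b (insert, delete, substitute)."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def fuzzy_join(left, right, max_distance=2):
    """Pair (first, last) records whose first names share a Soundex code
    and whose last names are within max_distance edits of each other."""
    matches = []
    for l_first, l_last in left:
        for r_first, r_last in right:
            same_sound = soundex(l_first) == soundex(r_first)
            close = levenshtein(l_last.lower(), r_last.lower()) <= max_distance
            if same_sound and close:
                matches.append(((l_first, l_last), (r_first, r_last)))
    return matches


left = [("Jon", "Smith"), ("Catherine", "Jonson")]
right = [("John", "Smyth"), ("Kathryn", "Johnson"), ("Robert", "Smith")]
print(fuzzy_join(left, right))
```

Requiring both conditions is one way to apply the advice about joining on multiple columns: "Jon Smith" pairs with "John Smyth" but not with "Robert Smith", even though the last names match exactly. The sketch also shows a limit of Soundex, which keeps the first letter as-is, so "Catherine" and "Kathryn" receive different codes and are not paired. In Spark, the built-in soundex and levenshtein column functions can play the same roles inside a join condition.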