Deduping Data

Duplicate Data

Duplicate data causes considerable issues in business operations and is challenging to “clean”. Removing duplicate data is commonly referred to as deduping and can mean many things.

This post considers the human-readable aspect of duplicate data, not the dedupe processes used to improve storage utilization and network data transfers.

Duplicate data includes:

  • Exact match - where the same data occurs numerous times.
  • Simple variation - where differences in upper and lower case make otherwise identical entries distinct. (John Smith; john smith; JOHN SMITH)
  • Form Variation - where components of the data are included or excluded at times. (John P. Smith, John P Smith, John Paul Smith)
  • Template Variation - Consider computer or network system logs where there are defined fields but the input data can vary slightly.
  • Cultural Variation - where replacements or variations are common but do not make sense programmatically. (Names often have diminutive forms: John can be the legal name for someone known as Jack)
  • Abbreviations or like-kind replacement - where one can shorten a word or use a different word interchangeably. (Drive = Dr., Street = St. - people often enter their address interchanging these)
  • Intent - where words or basic statements vary but the intent is the same. (Great Condition; Awesome Condition; Like New; Never Used)
  • Gaming the System - where people want to slip in changes to bypass the screening. (Bob’s painting and handyman services; Let Bob’s contracting help you with painting; …)

To make things even more challenging, data fields may incorporate some or all of these issues.

Operational Issues

For most organizations, the service or purpose is based on a person, so getting the person right is very important.

  • A customer calls to cancel all their services, but they purchased numerous services over time from different sales associates. The sales associates entered the customer’s name and address into the system with slight variations each time a purchase was made. The customer service representative asks the customer for their name and address to cancel, and the lookup pulls up only three of the seven services purchased. Next month the customer gets charged for the four services that were not cancelled and calls the credit card company to cancel all charges from the company. Now the company has an angry ex-customer, and the credit card companies will threaten to stop executing transactions if numerous customers have the same issue.
  • A customer enters a variant of their information when purchasing more ads on a company’s website.
    • One possibility - the customer gets two different bills each month. (Hopefully there are not any discounts for total purchases!)
    • Other possibility - the customer is noted to be an existing customer so billing information is not requested, but the variation does not allow for a match on the billing identity. The company fulfills the ad order but never charges the customer.
  • A service requires that certified mail is sent to a customer before sending the account to debt collection, but three variations of the person are found. This customer gets three certified mail notifications, which is quite expensive.
  • Searching maintenance data or logs for similar events or conditions misses entries recorded with slight variations.

These are just a sampling of problems witnessed firsthand from data that should have been deduped.

Robust Methods of Deduping

Basic Approach

For many of the issues outlined above, straightforward approaches can help reduce the variation in the data. It should be noted that business processes and established workflows are very important. From a programmatic perspective, it is always easier to enforce proper data entry than to attempt to match entries later. Below are some approaches to cleaning and processing data that can help reduce duplicate data.

  • Convert all data to a string
  • Convert to lowercase
  • Remove punctuation
  • Maintain word mappings (simple and specific to the field and industry)
    • Example: drive = drive, dr
    • Example: annie = ann, anna, anne, hannah, annette
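The cleaning steps above can be sketched in a few lines of Python. The mapping table here is an illustrative stub, not a complete field- or industry-specific vocabulary:

```python
import string

# Illustrative mapping table; a real deployment would maintain one
# per field and per industry.
WORD_MAP = {
    "dr": "drive",
    "st": "street",
    "ann": "annie", "anna": "annie", "anne": "annie",
    "hannah": "annie", "annette": "annie",
}

def normalize(value):
    """Reduce simple variation: stringify, lowercase, strip punctuation,
    and collapse known word variants to a canonical form."""
    text = str(value).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [WORD_MAP.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("123 Main St."))   # -> "123 main street"
print(normalize("Anne P. Smith"))  # -> "annie p smith"
```

With this in place, many exact-match, case, and abbreviation variants collapse to the same string before any matching logic runs.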

The approach outlined above is for assessing and processing data after it has been created. This issue could be handled much better with formal unique identification procedures and form data entry rules. It should be noted that if one is using machine learning or similar approaches, the steps above are conducted as part of the model or models being used.

Advanced Approaches

So far deduping data has been approached with standard coding approaches, but there are more advanced ways to look for similar data. This is where the term deduping expands on the idea of what data is and how it is considered to be similar. As discussed earlier many form input duplications can be reduced at the time of input with rules and processing, but there are many inputs that are too open ended to use these approaches. Examples are the description field on a form, comments on a post, product reviews, ad blurbs, news articles, social media posts - just to name a few.

Locality Sensitive Hashing

Locality Sensitive Hashing is a common method of analyzing data to find similarities and differences. LSH avoids having to compare every pair of data samples in a large dataset in order to find the nearest similar neighbors for the different data samples. With LSH there is a high probability that a data sample and the closest similar neighbors will be hashed into the same bucket. By treating the data samples placed in the same bucket as candidates for similarity checking, it significantly reduces the computational burden associated with finding nearest neighbors in large datasets.
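The notion of “similar” that LSH typically approximates is Jaccard similarity over sets. A minimal sketch, using character shingles as the sets (the names and values here are illustrative):

```python
def shingles(text, k=3):
    """Break text into overlapping k-character shingles (n-grams)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two shingle sets."""
    return len(a & b) / len(a | b)

s1 = shingles("Joe's Pizza")
s2 = shingles("Joseph's Pizzeria")
print(round(jaccard(s1, s2), 2))  # shared shingles push this well above zero
```

Comparing every pair this way is O(n²) in the number of records; LSH exists to avoid exactly that cost by only comparing records that hash into the same bucket.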

Basic Steps of LSH

  1. Convert each record into a feature vector (for example, a set of shingles)
  2. MinHash the vectors to approximate Jaccard distance
  3. Conduct k-nearest-neighbor clustering on the candidate buckets
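A pure-Python sketch of the MinHash and banding steps, assuming records have already been reduced to sets of tokens. The parameters and helper names are illustrative; a real deployment would use a library such as datasketch or Spark MLlib:

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # number of hash functions in each signature
BANDS = 16     # 16 bands x 4 rows per band

def _hash(token, seed):
    """Deterministic seeded hash of a token (md5-based for portability)."""
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash(tokens):
    """MinHash signature: for each seed, keep the minimum hash over the tokens."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(NUM_PERM))

def lsh_buckets(signatures):
    """Band each signature; records sharing any band land in the same bucket."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for band in range(BANDS):
            chunk = sig[band * rows:(band + 1) * rows]
            buckets[(band, chunk)].append(key)
    return buckets

docs = {
    "a": {"john", "smith", "main", "street"},
    "b": {"john", "smith", "main", "st"},
    "c": {"completely", "different", "record"},
}
sigs = {k: minhash(v) for k, v in docs.items()}
candidates = {tuple(sorted(v)) for v in lsh_buckets(sigs).values() if len(v) > 1}
print(candidates)  # "a" and "b" are likely candidates; "c" is unlikely to collide
```

Only records that share a bucket are compared in full, which is what keeps the overall cost well below all-pairs comparison.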

LSH in Practice - Data Lake

AWS Lake Formation offers a service called FindMatches that uses LSH to help identify duplicate data. It uses Spark MLlib, which includes LSH, to hash and find nearest neighbors. The service goes further than just clustering and uses natural language processing and a linking process to “learn” the user’s data.

Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning Transform called FindMatches for deduplication and finding matching records. For example, use Lake Formation’s FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe’s Pizza” at “121 Main St.” and another shows a “Joseph’s Pizzeria” at “121 Main”. You don’t need to know anything about machine learning to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching”. The system will then learn your criteria for calling a pair of records a “match” and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.

LSH in Practice - Uber Fraud

Uber uses LSH to detect fraud on their platform.

With five million plus Uber trips taken daily worldwide, it is important for Uber engineers to ensure that data is accurate. If used correctly, metadata and aggregate data can quickly detect platform abuse, from spam to fake accounts and payment fraud. … Amplifying the right data signals makes detection more precise and thus, more reliable.

… To address this challenge in our systems and others, Uber Engineering and Databricks worked together to contribute Locality Sensitive Hashing (LSH) to Apache Spark 2.1. The primary LSH use case at Uber is detecting similar trips based on their spatial properties, a method of identifying fraudulent drivers.

LSH can be a powerful tool for all kinds of applications.

Deploying LSH

The examples cited above require considerable infrastructure to use LSH. If one does not have AWS Lake Formation or Uber’s data platforms it is still possible to use LSH within the existing system regardless of size or scale.

  • Batch - If a dataset needs to be deduped, clustered by similarity, or searched for content similar to a specific record, a model can be run locally (or in the cloud) on the extracted dataset. This is generally a data dump of files (csv, xlsx, json, tsv, …) that are analyzed, and the final product is a report with an indexed dataset.
  • Streamlined Workflow - Use a similar approach to the batch model, but run it inside the system periodically (scheduled or when new data is available).
  • Plugins & Add-ons - Certain databases and data stores have LSH plugins or add-ons that will cluster and index the data.
  • Platform - LSH built into the system, as with AWS Lake Formation or Uber’s data platform.
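The batch option can be as simple as normalizing a data dump and reporting duplicate groups. This hypothetical sketch reads a CSV export and indexes exact matches after normalization, the simplest case:

```python
import csv
import io
import string
from collections import defaultdict

def normalize(value):
    """Lowercase and strip punctuation so simple variants collide."""
    text = str(value).lower()
    return " ".join(text.translate(str.maketrans("", "", string.punctuation)).split())

def dedupe_report(csv_text, key_fields):
    """Group rows whose normalized key fields match; return {key: [row indexes]}."""
    groups = defaultdict(list)
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        key = "|".join(normalize(row[f]) for f in key_fields)
        groups[key].append(i)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical customer export with two variant spellings of one customer.
dump = """name,address
John P. Smith,121 Main St.
john p smith,121 main st
Jane Doe,9 Oak Ave.
"""
print(dedupe_report(dump, ["name", "address"]))
# -> {'john p smith|121 main st': [0, 1]}
```

The same shape of job can swap in MinHash bucketing for the grouping key to catch fuzzier duplicates, and the resulting index becomes the deliverable report.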

Beyond LSH

LSH is a powerful method and can be combined with other techniques to offer more advanced features. Examples include fuzzy matching with Levenshtein distance, Word2Vec, and some very specific natural language processes that help identify entities and potentially context. Advances in deep learning, neural networks, and related techniques have provided some truly amazing capabilities, and it is a constantly evolving field. Utilizing LSH will give a considerable advantage and set one on a path to building even more advanced models for use in operations.
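As one example of a complementary technique, a minimal Levenshtein (edit) distance function can score the candidate pairs that an LSH pass surfaces:

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("john smith", "jon smith"))  # -> 1
```

Running an edit-distance check only on LSH candidates keeps the expensive pairwise comparison confined to records that are already likely matches.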

Code Examples

Before an organization decides to invest in a large project, it is good practice to conduct a proof of concept (POC). For LSH, developing a POC is not too complicated. One can quickly develop a POC using one of the traditional data science languages (Python, R, Scala, and Spark all have documentation outlining approaches). Below is a series of Python code examples that outline the basics of LSH to help in developing a POC.

A couple of Jupyter Python 3 notebooks on GitHub provide code examples of some of the basic components of an LSH. The repository can be found at LINK: GitHub data-deduplication.

  • shingling.ipynb walks through ngram shingling with a Jaccard distance measurement.
  • lsh_basic_example.ipynb builds on the shingle example and incorporates LSH - this uses NumPy and math.
  • datasketch_minhash_exmple.ipynb uses MinHash and MinHashLSH on the same dataset as above. The datasketch MinHash approach scales better and could be used on larger data.
  • data_process_kaggle_russian_ads.ipynb This is just a quick data processing file for a Kaggle data set of Russian Ads. This may be a good source of data if testing a model, but not expanded on here.

The outlined code offers just basic examples, but there are numerous posts, blogs, and reports available online that explain more advanced use cases. Kaggle also has quite a few competitions that use similar ideas, with data and code to reference.

Note: The author of this post, Jesse, has developed LSH models for a variety of systems and continues to work with LSH. If you have any suggestions, recommendations, or questions please direct message Jesse on Twitter!