Resolving duplicate beneficiary data for integrated CSR management

Developing a platform to automate data cleansing and identity resolution using Zingg.

Services

Data Science

8.12%

Duplicates cleaned up

Cost wastage identified

Location

India

Industry

Community-based startup

Employees

20+

About client

Our client, a community-based startup, specializes in managing end-to-end CSR activities for enterprises. With their smart approach and efficient fieldwork, they alleviate the CSR workload for companies while ensuring their investments effectively benefit underprivileged groups, from women to students to children.

Challenges

The client is working toward becoming fully technology-driven and digitally enabled.

After finishing a recent CSR program on learning and upskilling, they identified the following shortcomings.

Lots of duplicate beneficiary entries: Our client found many discrepancies between the beneficiary data in the registration records and on the LMS platform, in particular many near-identical entries. Names, emails, and phone numbers varied slightly from one entry to the next, which made it unclear whether the entries were real, distinct people or not. The attendance data from the LMS also didn’t match the initial registration data.

No reliable KYC for identification: Volunteers collected beneficiary data without two-factor authentication. The process did involve KYC verification, but volunteers attached someone else’s documents to entries. This left no way to double-check the real identity behind an entry or map it to one person. The sheer volume of data made it impossible to verify every detail manually.

Fund wastage on duplicate entries: The trainer of the CSR program was paid based on the number of beneficiaries. Duplicate entries, together with the mismatch in the training attendance register, therefore meant that funds were wasted.

Incorrect metrics to prove the campaign’s effectiveness: Our client couldn’t show the socio-economic impact of their investments. Even though they ran the campaign successfully and recorded metrics, the underlying data was incorrect, leading to wrong conclusions and unreliable results.

Lack of long-term solutions to prevent this in the future: The duplicate records came from a single campaign, but they became a barrier to implementing a long-term strategy for successful and transparent CSR campaigns. The client’s semi-digital, disparate systems and documentation had no room for such a strategy.

How did datakulture solve this?

The goal was to clean up the data from this campaign, quantify the cost wastage, and lay the foundation for a smart, automated beneficiary platform. For this, we chose identity resolution as the solution.

Zingg to perform probabilistic matching: After basic cleansing and standardization, we used Zingg, an open-source tool that is effective for identity resolution on large datasets. Our team built the matching model using training data set aside from the standardized data.
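To make the "basic cleansing and standardization" step concrete, here is a minimal Python sketch of the kind of normalization that typically precedes fuzzy matching. It is an illustration only, not the code used in this project: the field names (`name`, `email`, `phone`), the helper functions, and the default country code are all assumptions.

```python
import re
import unicodedata


def standardize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace.

    Assumption: names are recorded in Latin script; other scripts
    would need a different rule than the a-z filter below.
    """
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def standardize_email(email: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return email.strip().lower()


def standardize_phone(phone: str, default_country_code: str = "91") -> str:
    """Keep digits only and prefix 10-digit numbers with a country code.

    Assumption: "91" (India) is the default; adjust for other regions.
    """
    digits = re.sub(r"\D", "", phone).lstrip("0")
    if len(digits) == 10:
        digits = default_country_code + digits
    return digits


def standardize_record(record: dict) -> dict:
    """Apply all field-level rules to one hypothetical beneficiary record."""
    return {
        "name": standardize_name(record.get("name", "")),
        "email": standardize_email(record.get("email", "")),
        "phone": standardize_phone(record.get("phone", "")),
    }
```

Standardizing first means the matching model spends its effort on genuine variation (spelling differences, swapped digits) rather than on formatting noise such as capitalization or phone number separators.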

Labeling: The model grouped near-similar records, which our team marked as duplicate or non-duplicate until the model had learned enough to classify on its own. It then generated matching scores between 0 and 1. We set 0.7 as the duplication threshold, treating scores at or above it as duplicates, which helped identify and resolve duplicate data.
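The way a score threshold turns pairwise matches into groups of duplicates can be sketched as follows. This is not Zingg's API or output format: the input shape (record ids plus `(id_a, id_b, score)` tuples) and the function name are hypothetical, and treating the 0.7 threshold as inclusive is an assumption.

```python
from collections import defaultdict

# Threshold from the case study; inclusive comparison is an assumption.
DUPLICATE_THRESHOLD = 0.7


def cluster_duplicates(record_ids, scored_pairs, threshold=DUPLICATE_THRESHOLD):
    """Group records linked by pair scores at or above the threshold.

    Uses a simple union-find so that matches chain together:
    if A matches B and B matches C, all three land in one cluster.
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster root, compressing the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for id_a, id_b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(id_a), find(id_b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores for four records.
example_clusters = cluster_duplicates(
    ["r1", "r2", "r3", "r4"],
    [("r1", "r2", 0.92), ("r2", "r3", 0.75), ("r3", "r4", 0.40)],
)
```

In this made-up example, `r1`, `r2`, and `r3` end up in one cluster and `r4` stays on its own, because the only pair involving `r4` scores below the threshold.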

Results: We summarized our results based on the model’s predictions. We derived the total number of beneficiaries who registered and attended, along with the cost wastage on training and our recommended next best course of action. We also accounted for cases such as whether duplicate records were real people attending different training slots or whether trainers had deliberately marked them present, which helped the client identify the root cause.
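Because the trainer is paid per beneficiary, the simplest form of the wastage estimate is the number of surplus records multiplied by the per-beneficiary fee. The sketch below shows only that arithmetic. The function name, the input shape (a list of duplicate clusters), and the fee value are hypothetical, and it ignores the attendance and training-slot checks described above.

```python
def summarize_wastage(clusters, fee_per_beneficiary):
    """Estimate wastage, assuming each cluster is one real beneficiary.

    clusters: list of lists of record ids judged to be the same person.
    fee_per_beneficiary: hypothetical trainer fee per counted beneficiary.
    """
    total_records = sum(len(members) for members in clusters)
    unique_beneficiaries = len(clusters)
    duplicate_records = total_records - unique_beneficiaries
    duplicate_rate_pct = (
        round(100 * duplicate_records / total_records, 2) if total_records else 0.0
    )
    return {
        "total_records": total_records,
        "unique_beneficiaries": unique_beneficiaries,
        "duplicate_records": duplicate_records,
        "duplicate_rate_pct": duplicate_rate_pct,
        "estimated_wastage": duplicate_records * fee_per_beneficiary,
    }


# Made-up input: four records that resolve to two people, fee of 500 per head.
summary = summarize_wastage([["r1", "r2", "r3"], ["r4"]], fee_per_beneficiary=500)
```

With these made-up numbers, two of the four records are surplus, so the estimated wastage is 2 × 500 = 1000. A real analysis would first exclude clusters that turn out to be distinct people.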

Developing an identity resolution platform: As a part of their digitization effort, our client wanted to automate this process and create a platform. This way, they could ensure that they have reliable beneficiary records, each tied to the right identity. So, we automated the data cleansing and identity resolution using Airflow. Our team uploads files from their database to the platform, which standardizes the data, runs deduplication, and shares the cleaned data with them.

The platform speeds up their work and leaves them with reliable, duplicate-free records on which they can build their beneficiary ecosystem.

Summing up

The client has envisioned tech and data-powered growth for their venture. They are on their way toward achieving this by developing an end-to-end CSR management system. But this also required looking back and setting past records right. This way, they can improve relationships with their vendors and clients without the stress and fuss of maintaining cumbersome documentation.