Data masking
What is data masking?
Data masking is a security measure that hides sensitive data by replacing it with dummy characters or fictional values.
Example of data masking: johnsmith@gmail.com -> jxxxxxxxx@gxxxx.com
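The email example above can be reproduced with a few lines of Python. This is a minimal sketch, not a production masking routine: the function name `mask_email` and the choice to keep the first character of each part and the top-level domain are assumptions made to match the example.

```python
def mask_email(email: str, mask_char: str = "x") -> str:
    """Keep the first character of the local part and of the domain name; mask the rest."""
    local, _, domain = email.partition("@")
    # Split "gmail.com" into ("gmail", ".", "com"); the top-level domain stays readable.
    name, dot, tld = domain.rpartition(".")
    masked_local = local[:1] + mask_char * (len(local) - 1)
    masked_name = name[:1] + mask_char * (len(name) - 1)
    return f"{masked_local}@{masked_name}{dot}{tld}"


print(mask_email("johnsmith@gmail.com"))  # jxxxxxxxx@gxxxx.com
```

The masked value keeps the length and shape of an email address, so it still looks plausible in a test dataset, but the original characters are gone.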
Data masking is mainly used to protect sensitive data and hide personally identifiable information (PII) from users who need to work with financial, healthcare, or customer record datasets but are not authorized to see the real values.
Done correctly, data masking reduces the risk and impact of data breaches and other unauthorized access in an organization.
Data masking and encryption might sound similar, but their purpose, method, and use cases differ. Data masking creates a usable but altered version of sensitive data, whereas encryption secures data by converting it into unreadable ciphertext that can only be decrypted with the right key.
Data encryption is reversible with the decryption key, whereas data masking is often irreversible, meaning you can’t recover the original data after masking.
Types of data masking
Here are the main types of data masking.
1. Static data masking creates an irreversibly masked copy of production data to prepare it for non-production environments such as testing and debugging. Because the masking cannot be undone, it is applied to a copy, and the original production data is left untouched.
2. Dynamic data masking masks data only when required: at query time, depending on the role of the user who accesses it. The stored data itself is not changed.
3. Randomized masking replaces characters with random values when the data needs to be anonymized.
4. On-the-fly masking masks data while it is being transferred from one environment to another. Example: data migration projects.
5. Format-preserving masking preserves the original format of the data while its characters are masked. Example: a phone number in which the digits are replaced with other digits, while the length and separators stay the same.
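To make two of these types concrete, the sketch below combines dynamic masking (the decision is made at read time, based on the caller's role) with format-preserving masking of a phone number. The role names, the `PRIVILEGED_ROLES` set, and both function names are hypothetical; real databases usually implement dynamic masking through their own policy features rather than application code like this.

```python
import random

# Hypothetical roles that are allowed to see unmasked values.
PRIVILEGED_ROLES = {"admin", "compliance"}


def mask_phone_preserving_format(phone: str, rng: random.Random) -> str:
    """Replace every digit with a random digit; keep separators and length."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in phone)


def read_phone(record: dict, role: str) -> str:
    """Dynamic masking: the stored record is never modified, only the returned view."""
    if role in PRIVILEGED_ROLES:
        return record["phone"]
    return mask_phone_preserving_format(record["phone"], random.Random())


record = {"name": "Jane Doe", "phone": "+1 (555) 010-9999"}
print(read_phone(record, "admin"))    # original value
print(read_phone(record, "analyst"))  # same layout, random digits
```

Static masking would instead apply `mask_phone_preserving_format` once to a copy of the dataset and hand out only that copy.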
Data masking techniques
Depending on data sensitivity and use case, one or more of the following data masking techniques is selected.
Substitution: Substitution is the most common data masking method, in which data values are replaced with fictional values. It lets non-production users see how the data is formatted without seeing the real values.
Tokenization: Tokenization replaces a piece of sensitive information with a non-sensitive token, usually a string of random characters. It’s mainly used in the payment industry to protect payment information such as card numbers, account balances, and transaction statements. Tokenization is one of the reversible data masking techniques: the original data still resides in a token vault, where the token can be mapped back to it.
Nulling out: Nulling out overwrites sensitive values with null values, particularly in datasets that contain both sensitive and non-sensitive fields. Example: replacing date of birth and payment details with ‘NULL’.
Blurring or generalization: Generalization is used when you have to retain realistic aspects of the data without showing the precise value. Example: converting the salary $100,502 into a range, $100k to $110k.
Shuffling: Shuffling rearranges existing data, either the characters within a value or, more commonly, the values within a column. The data structure and pattern remain the same, but the original value is hard to trace back.
Encryption: Data encryption uses an encryption algorithm to turn data into unintelligible ciphertext. The data can only be decrypted and read with the corresponding key. Encryption is used for data masking when the data must be stored or transmitted without being readable by anyone who intercepts it.
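Several of the techniques above are short enough to sketch in plain Python. The following is illustrative only: all function and class names are hypothetical, shuffling is shown at the column level (one common variant), and the in-memory `TokenVault` stands in for what would normally be a hardened, access-controlled service.

```python
import random
import secrets


def null_out(record: dict, fields: set) -> dict:
    """Nulling out: overwrite the listed fields with None (NULL)."""
    return {k: (None if k in fields else v) for k, v in record.items()}


def generalize_salary(salary: int, bucket: int = 10_000) -> str:
    """Generalization: replace an exact salary with the range that contains it."""
    low = (salary // bucket) * bucket
    return f"${low // 1000}k to ${(low + bucket) // 1000}k"


def shuffle_column(values: list, rng: random.Random) -> list:
    """Shuffling: reorder the values of one column so rows no longer line up."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


class TokenVault:
    """Tokenization: swap a value for a random token; the vault keeps the mapping."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value not in self._value_to_token:
            token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can reverse a token.
        return self._token_to_value[token]


print(null_out({"name": "Jane", "dob": "1990-01-01"}, {"dob"}))
print(generalize_salary(100_502))  # $100k to $110k

vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(token, vault.detokenize(token) == "4111 1111 1111 1111")
```

Of these, only tokenization is reversible, and only through the vault; nulling out, generalization, and shuffling discard or scramble the information needed to get the original value back.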
Why is data masking important?
Data masking is important for the following reasons.
Ensures compliance with data regulations and governance: many industries and organizations must abide by data protection laws and frameworks such as GDPR, HIPAA, PCI DSS, and CCPA. Data masking helps you adhere to these regulations by preserving the confidentiality of data beyond production environments. It also comes in handy during audits and compliance checks.
Applies least privilege: data masking lets you give users who work with sensitive data (developers, third-party vendors, etc.) only the minimum access they need, so they cannot view or misuse the real values.
Protects the confidentiality of customers and employees: it’s the duty of an organization to safeguard the personal data it holds and to pseudonymize confidential information that would otherwise be accessible to everyone.
Mitigates internal and external risks: masking data helps reduce the risk of security breaches and insider threats, while making sure that everyone has the realistic datasets needed for their work.
Data masking in ETL for secure data processing: ETL collects and transforms data from various sources, which means sensitive data is transferred across environments, too. Using data masking in ETL minimizes risk and improves data security while keeping the data usable for analytical purposes.
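As a rough illustration of the ETL point, the sketch below masks sensitive columns in the transform step, before anything is loaded into the target. The column names, the `SENSITIVE_COLUMNS` set, the sample row, and the key are all hypothetical, and a keyed hash (HMAC) is used here as one possible pseudonymization choice; in practice the key would come from a secrets manager, not from source code.

```python
import hashlib
import hmac

# Hypothetical classification of which columns hold sensitive data.
SENSITIVE_COLUMNS = {"email", "ssn"}


def extract() -> list:
    """Stand-in for reading rows from a source system."""
    return [
        {"email": "johnsmith@gmail.com", "ssn": "123-45-6789", "order_total": 42.5},
        {"email": "johnsmith@gmail.com", "ssn": "123-45-6789", "order_total": 13.0},
    ]


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed one-way hash: the same input gives the same output, so joins still work."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep the demo output short


def transform(rows: list, key: bytes) -> list:
    """Mask sensitive columns; pass analytical columns through unchanged."""
    return [
        {k: pseudonymize(v, key) if k in SENSITIVE_COLUMNS else v for k, v in row.items()}
        for row in rows
    ]


def load(rows: list, target: list) -> None:
    """Stand-in for writing rows to a warehouse table."""
    target.extend(rows)


warehouse = []
load(transform(extract(), key=b"demo-key-not-for-production"), warehouse)
print(warehouse[0])
```

Because both rows carry the same email, they receive the same pseudonym, so per-customer aggregates still work in the warehouse even though the real address never leaves the source environment.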
Does data masking fit big data use cases?
Like other data processing techniques, data masking is evolving to fit the modern needs of businesses and keep up with data privacy requirements. Encryption-based and token-based data masking techniques are available. AI-driven data masking is also in use; it detects sensitive data automatically and masks it.
Also, dynamic data masking (DDM) is evolving fast to support real-time masking of high data volumes, adapted to different roles and access levels.
So, the gap between data protection and on-demand data availability is steadily being bridged by self-service platforms, context-driven automatic masking techniques, and alignment with changing regulatory requirements.