Culture of Data Protection: Data Quality, Privacy, and Security
As I explained in previous posts on building a culture of data protection, we in the technology world must embrace data protection by design.
To Reward, We Must Measure.
How do we fix this? We start rewarding people for data protection activities. To reward people, we need to measure their deliverables.
That calls for an enterprise-wide security policy and framework that includes specific measures at the data category level:
- Encryption design, starting with the data models
- Data categorization and modeling
- Test design that includes security and privacy testing
- Proactive recognition of security requirements and techniques
- Data profiling testing that discovers unprotected or under-protected data
- Data security monitoring and alerting
- Issue management and reporting
Traditionally, we relied on security features embedded in applications to protect our data. But in modern data stories, data is used across many applications and end-user tools. This means we must help ensure our data is protected as close as possible to where it persists. That means in the database.
Before we can properly protect data, we have to know what data we steward and what protections we need to give it. That means we need a data inventory and a data categorization/cataloging scheme. There are two ways that we can categorize data: syntactically and semantically.
When we evaluate data items syntactically, we look at the names of tables and columns to understand the nature of the data. For this to be even moderately successful, we must have reliable and meaningful naming standards. I can tell you from my 30+ years of looking at data architectures that we aren’t good at that. Tools that start here do 80% of the work, but it’s the last 20% that takes much more time to complete. Add to this our shameful habit of changing the meaning of a column or data item without updating its name, and we have a lot of manual work to do to properly categorize data.
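To make the syntactic approach concrete, here is a minimal Python sketch that categorizes a column purely from its name. The category names and keyword lists are hypothetical placeholders, not taken from any particular tool; a real implementation would draw them from your own naming standards.

```python
import re

# Hypothetical keyword lists; replace with terms from your own naming standards.
CATEGORY_KEYWORDS = {
    "payment_card": {"card", "cc", "pan"},
    "contact": {"email", "phone", "address"},
    "government_id": {"ssn", "sin", "passport"},
}


def tokenize(name):
    """Split a name such as 'CustCardNum' or 'cust_card_num' into lowercase tokens."""
    # Put a space at each lowercase-to-uppercase boundary so camelCase splits too.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def categorize_by_name(column_name):
    """Return the sorted categories whose keywords appear in the column name."""
    tokens = set(tokenize(column_name))
    return sorted(cat for cat, words in CATEGORY_KEYWORDS.items() if tokens & words)


if __name__ == "__main__":
    for name in ["CustCardNum", "email_address", "attr7"]:
        print(name, categorize_by_name(name))
```

Notice that a column named `attr7` gets no category at all, even if it is full of card numbers. That is the gap poor naming leaves, and it is why name matching alone isn’t enough.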
Semantic data categorization involves looking at both item names and actual data via data profiling. Profiling data allows us to examine the nature of data against known patterns and values. If I showed you a column of fifteen- or sixteen-digit numbers that all had a first digit of three, four, five, or six, you’d likely be looking at credit card data. How do I know this? Because these numbers have an established standard that follows those rules. Sure, they might not be credit card numbers. But knowing this pattern means you know you need to focus on this column.
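Here is a rough Python sketch of that kind of profiling check. It tests each value against the pattern described above (fifteen or sixteen digits, starting with three through six). It also applies the Luhn checksum that payment card numbers carry, which is my addition and not something discussed above. The function names are illustrative only.

```python
import re

# Fifteen or sixteen digits, first digit 3 through 6.
CARD_SHAPE = re.compile(r"[3-6]\d{14,15}")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card(value):
    """Check one value against the card-number shape and checksum."""
    digits = re.sub(r"[ -]", "", str(value))  # tolerate spaces and dashes
    return bool(CARD_SHAPE.fullmatch(digits)) and luhn_ok(digits)


def card_match_ratio(values):
    """Fraction of non-empty values in a column that look like card numbers."""
    values = [v for v in values if v is not None and str(v).strip()]
    if not values:
        return 0.0
    return sum(looks_like_card(v) for v in values) / len(values)
```

A high ratio doesn’t prove the column holds card numbers. It tells you which columns deserve a closer look first.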
Ideally we’d use special tools to help us catalog our data items, plus we’d throw in various types of machine learning and pattern recognition to find sensitive data, record what we found, and use that metadata to implement data protection features.
The metadata we collect and design during data categorization should be managed in both logical and physical data models. Most development projects capture these requirements in user stories or spreadsheets. These formats make these important characteristics hard to find, hard to manage, and almost impossible to share across projects.
Data models are designed to capture and manage this type of metadata from the beginning. They form the data governance deliverables around data characteristics and design. They also allow for business review, commenting, iteration, and versioning of important security and privacy decisions.
In a model-driven development project, they allow a team to automatically generate database and code features required to protect data. It’s like magic.
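As a loose illustration of the idea, and not a depiction of any real modeling tool, here is a Python sketch in which column-level protection metadata lives in one structure and a list of protection tasks is derived from it. `ColumnSpec`, its fields, and the `last_four` mask name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical protection metadata recorded for one column in a data model."""
    table: str
    column: str
    category: str          # e.g. "payment_card" or "public"
    encrypt: bool = False
    mask: str = ""         # name of a reusable mask pattern, empty if none


def protection_tasks(model):
    """Derive the protection work implied by the model, in model order."""
    tasks = []
    for spec in model:
        target = f"{spec.table}.{spec.column}"
        if spec.encrypt:
            tasks.append(f"encrypt {target}")
        if spec.mask:
            tasks.append(f"mask {target} using {spec.mask}")
    return tasks


if __name__ == "__main__":
    model = [
        ColumnSpec("customer", "card_number", "payment_card", encrypt=True, mask="last_four"),
        ColumnSpec("customer", "city", "public"),
    ]
    for task in protection_tasks(model):
        print(task)
```

The point of the sketch is that the decisions are recorded once, in a form that can be reviewed and versioned, and everything downstream is generated from them.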
As I mentioned in my first post in this series, for years, designers were afraid to use encryption due to performance trade-offs. However, in most current privacy and data breach legislation, the use of encryption is a requirement. At the very least, it significantly lowers the risk that data is actually disclosed to others.
Traditionally, we used server-level encryption to protect data. But this type of encryption only protects data at rest. It does not protect data in motion or in use. Many vendors have introduced end-to-end encryption to offer data security between storage and use. In SQL Server, this feature is called Always Encrypted. It works with the .NET Framework to encrypt data at the column level, and it provides protection from disk to end use. Because it’s managed as a framework, applications do not have to implement any additional features for this to work. I’m a huge fan of this holistic approach to encryption because we don’t have a series of encryption/decryption processes that leave data unencrypted between steps.
There are other encryption methods to choose from, but modern solutions should focus on these integrated approaches.
Data masking obscures data at presentation time to help protect the privacy of sensitive data. It’s typically not a true security feature because the data isn’t stored as masked values, although it can be. In SQL Server, Dynamic Data Masking allows a designer to specify a standard, reusable mask pattern for each type of data. Remember that credit card column above? There’s an industry standard for masking that data: all but the last four characters are masked with stars or Xs. This standard exists because the other digits in a credit card number have meanings that could be used to guess or social-engineer information about the card and card holder.
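To show what that mask pattern produces, here is a small Python sketch. It only illustrates the "all but the last four" rule; it is not how Dynamic Data Masking is implemented, and the function name and parameters are my own.

```python
def mask_card_number(value, mask_char="X", visible=4):
    """Mask every digit except the last `visible` ones.

    Spaces, dashes, and any other non-digit characters are dropped first.
    If there are no more than `visible` digits, everything is masked.
    """
    digits = "".join(c for c in str(value) if c.isdigit())
    if len(digits) <= visible:
        return mask_char * len(digits)
    return mask_char * (len(digits) - visible) + digits[-visible:]


if __name__ == "__main__":
    print(mask_card_number("4111 1111 1111 1111"))  # XXXXXXXXXXXX1111
```

Writing the rule once, as a single reusable function, is the same idea as defining one mask pattern per data type: everyone who reads the value sees it masked the same way.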
Traditionally, we have used application or GUI logic to implement masks. That means that we have to manage all the applications and client tools that access that data. It’s better to set a mask at the database level, giving us a mask that is applied everywhere, the same way.
There are many other methods for data protection (row-level security, column-level security, access permissions, etc.), but I wanted to cover the design approaches that have changed recently to better protect our data. In my future posts, I’ll talk about why these are better than the traditional methods.