It’s a truism that data is at the heart of AI, and it’s becoming equally clear that data governance is at the core of AI governance. There is an inherent tension between mining user data for actionable insights and preserving privacy, commonly called the utility-privacy tradeoff. But it is possible to strike a balance between the two.
Many organizations state the goal of becoming data driven. They want to open their data and analytics applications more widely and empower their employees. In practice, however, this goal of data democratization is rarely realized, because of data protection and data privacy concerns. Many data sets contain users’ personal data, and organizations worry that sharing data more widely raises the chance of leaking personally identifiable information or exposing new attack vectors.
To prevent violations of privacy regulations and requirements, access to such data is typically restricted to teams such as IT or analytics. In other words, data privacy fears are hindering data democratization. Because of these concerns, such data sets are often not made available to machine learning teams for training AI models, which can reduce the efficacy and usefulness of those models and applications.
How AI and data privacy can coexist within a framework
A privacy-by-design approach helps overcome these limitations, letting AI and data privacy coexist as two parts of the AI lifecycle. Essential to this approach are data anonymization techniques that preserve privacy without sacrificing the data’s usefulness in AI applications.
-De-identification. Here, personal identifiers and sensitive attributes are masked with non-sensitive placeholder values. Masking rules range from simple, such as hiding all but the last few digits of a Social Security or credit card number, to complex, such as random tokenization that replaces an original value with a seemingly unrelated string.
-K-anonymization. Here, individual privacy is protected by generalizing identifying attributes so that each record is indistinguishable from those of other individuals with similar attributes. This technique is sometimes described as “hiding in the crowd,” since no record can be uniquely linked to an individual. For example, an individual’s exact age or income is replaced with an age or income bracket. Sometimes, certain attributes may even be dropped entirely.
-Differential privacy. It is possible to infer what the inputs to an AI model were by analyzing its outputs. Differential privacy aims to curb such leaks by adding calibrated statistical “noise” to the data without losing the “signals” (i.e., useful predictive characteristics) in it. Differential privacy techniques are appropriate when privacy requirements are higher and the data is more sensitive. Several data governance tools help implement differential privacy.
-Federated machine learning. Here, model training happens iteratively across decentralized devices, with only model updates (not raw data) shared, instead of aggregating data centrally. This is not an anonymization technique per se, but it improves privacy.
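The first three techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the two records, the bracket width, and the sensitivity and epsilon values are all hypothetical, and a real deployment would use a vetted library rather than hand-rolled noise.

```python
import random

# Hypothetical two-record data set; every name and value is illustrative.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "age": 34, "income": 72000},
    {"name": "Bob Jones", "ssn": "987-65-4321", "age": 37, "income": 68000},
]

def mask_ssn(ssn: str) -> str:
    """De-identification: mask all but the last four digits."""
    return "***-**-" + ssn[-4:]

def generalize_age(age: int, width: int = 10) -> str:
    """K-anonymization-style generalization: replace an exact age with a
    bracket so records 'hide in the crowd'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def laplace_noise(scale: float) -> float:
    """Differential-privacy-style perturbation: a Laplace draw, sampled as
    the difference of two i.i.d. exponential variates."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def anonymize(record: dict, sensitivity: float = 1000, epsilon: float = 0.5) -> dict:
    """Drop the direct identifier (the name), mask the SSN, bracket the age,
    and perturb the income with noise scaled to sensitivity / epsilon.
    A smaller epsilon means stronger privacy and more noise."""
    return {
        "ssn": mask_ssn(record["ssn"]),
        "age": generalize_age(record["age"]),
        "income": round(record["income"] + laplace_noise(sensitivity / epsilon)),
    }

anonymized = [anonymize(r) for r in records]
```

Note the deliberate attribute suppression: the name field is dropped outright, reflecting the point above that some attributes may simply be removed rather than transformed.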
Businesses have several techniques at their disposal to help improve AI and data privacy practices. In addition to data privacy protection in the context of AI governance, prudence is warranted throughout the entire data supply chain — from data collection and storage to processing, access and sharing. Therefore, IT/engineering, business and legal teams all have roles to play here as well.