Law 25: Applying artificial intelligence to privacy protection

The arrival of Law 25 on September 22, 2023, has led companies to adopt new standards in analytics and marketing operations. Certain guiding principles are now de rigueur: offering transparency regarding the use of personal data, ensuring security for customers, asking for explicit consent to collect data and facilitating the right to be forgotten.

To understand the impact of Law 25, take a look at our webinar that covers a number of issues and mitigation measures.

One point that is rarely discussed is that an organization may use deidentified data without user consent for internal research and statistical purposes. This use is an exception under the law. Note, however, that companies risk being fined if these data are susceptible to reidentification. Performing a privacy impact assessment (PIA) is also recommended to ensure that the company has taken all necessary measures to minimize the risk of reidentification. Whenever possible, the Commission d’accès à l’information du Québec favours obtaining consent, hence the importance of the PIA.

The preceding paragraph does not constitute legal advice. Please consult your legal team to determine the risks associated with such an initiative.

This leads us to the question: what is deidentified data?

The importance of deidentification

According to Law 25, a piece of information is deidentified if it no longer allows for identification of an individual and the risks of reidentification are negligible. It is the company’s responsibility to take all necessary measures to prevent reidentification. Deidentification is different from anonymization. In recent years, the traditional anonymization techniques used by organizations have been shown to be flawed.

Examples of frequently used, but flawed, anonymization strategies:

Hiding the person’s name or email (hidden identifiers)
Replacing the name or email with an arbitrary number (randomized identifiers)
Aggregating data to prevent granular monitoring (aggregation)
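
To make the weakness of these strategies concrete, here is a minimal sketch of how a randomized identifier can be undone. Everything in it is invented for illustration: the customer records, the `public_profiles` list an outsider might hold, and the choice of postal code, birth year and gender as linking fields. It is not a description of any real attack, only a way to see that removing the email does not remove the person.

```python
# Hypothetical customer records (all values invented for illustration).
customers = [
    {"email": "ana@example.com", "postal_code": "H2X 1Y4", "birth_year": 1984, "gender": "F", "spend": 1200},
    {"email": "ben@example.com", "postal_code": "H2X 1Y4", "birth_year": 1991, "gender": "M", "spend": 310},
    {"email": "cho@example.com", "postal_code": "G1R 4P5", "birth_year": 1984, "gender": "F", "spend": 95},
]

# Fields that are not names or emails but still narrow down who a row is about.
QUASI_IDENTIFIERS = ("postal_code", "birth_year", "gender")


def pseudonymize(records):
    """Drop the email and replace it with an arbitrary number."""
    released = []
    for new_id, record in enumerate(records, start=1):
        row = {key: value for key, value in record.items() if key != "email"}
        row["id"] = new_id
        released.append(row)
    return released


def link(released, known_people):
    """Match released rows to known people using quasi-identifiers only."""
    matches = []
    for person in known_people:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        hits = [row for row in released if tuple(row[q] for q in QUASI_IDENTIFIERS) == key]
        if len(hits) == 1:  # a unique match reidentifies that row
            matches.append((person["name"], hits[0]))
    return matches


# Assumed outside knowledge, e.g. a public profile list (hypothetical).
public_profiles = [
    {"name": "Ana", "postal_code": "H2X 1Y4", "birth_year": 1984, "gender": "F"},
]

released = pseudonymize(customers)
reidentified = link(released, public_profiles)
print(reidentified)
```

In this toy dataset the released rows contain no email, yet the single public profile is enough to recover one customer's spending. Real datasets are larger, but the same linking logic applies whenever a combination of fields is unique.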

Two recognized techniques to actually deidentify personal information are synthetic data and differential privacy. We’ll briefly cover these two techniques in the following sections.

Synthetic data

Synthetic data are artificial data generated by an artificial intelligence (AI) model. These artificial data conform to the statistical attributes of the original data. Synthetic data generation models are often considered to be generative AI models (like ChatGPT). Thanks to recent advances in artificial intelligence, these models are now capable enough to support useful, practical applications.

The image below is a simplified representation of synthetic data generated by an AI model. The statistical properties of all of the original data are respected despite the fact that the individual data points (in the synthetic copy) are fictional.

The synthetic data generator (the AI model) is called a synthesizer. A synthesizer is an application that can be integrated into your existing data architecture, such as a data warehouse, data lake or operational database.
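
As a rough illustration of what a synthesizer does, the sketch below fits a very simple statistical model (a two-variable Gaussian) to a made-up dataset of ages and spending, then samples brand-new rows from it. This is an assumption-laden toy: production synthesizers use far richer models, the column meanings and the helper names `fit` and `synthesize` are hypothetical, and a model this simple does not by itself guarantee privacy.

```python
import math
import random


def fit(rows):
    """Learn the means, standard deviations and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in rows) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in rows) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / n
    return {"mean_x": mean_x, "mean_y": mean_y,
            "std_x": std_x, "std_y": std_y,
            "rho": cov / (std_x * std_y)}


def synthesize(model, n, rng):
    """Sample n fictional rows that follow the learned statistics."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = model["mean_y"] + model["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Made-up "original" data: (age, yearly spend), with spend tied to age.
original = []
for _ in range(500):
    age = rng.gauss(40, 10)
    spend = 20 * age + rng.gauss(0, 100)
    original.append((age, spend))

model = fit(original)
synthetic = synthesize(model, 5000, rng)
check = fit(synthetic)

print("original :", {k: round(v, 2) for k, v in model.items()})
print("synthetic:", {k: round(v, 2) for k, v in check.items()})
```

The two printed summaries should be close to each other even though no synthetic row is copied from the original data, which is the property the simplified image describes.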

In the next section, we’ll cover differential privacy, which is a very important technique in the quest for confidentiality.

Differential privacy

Differential privacy is a mathematical framework for protecting individuals within a dataset through the injection of noise.

The following image illustrates a dataset in which noise has been introduced through a differential privacy algorithm.

In the image, noise takes the form of data points removed from or added to the dataset, so that individual records are harder to single out. Differential privacy is not a new technique: it was introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 2006. It is used in the medical field to protect individuals’ privacy when performing research. Depending on the type of algorithm, the amount of noise introduced in the dataset can be controlled. One of the challenges of differential privacy is calibrating that noise: there must be enough to make the information private, but not so much that the dataset loses its usefulness or the information it contains is corrupted.
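
The trade-off between privacy and usefulness can be sketched with the Laplace mechanism, a common textbook form of differential privacy. Note the difference from the illustration above: instead of adding or removing data points, this version adds random noise to the answer of a counting query. The patient records, the `private_count` helper and the epsilon values are assumptions chosen for illustration; a smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


def is_senior(patient):
    return patient["age"] >= 65


rng = random.Random(42)

# Hypothetical patient records (ages invented for illustration).
patients = [{"age": rng.randint(20, 90)} for _ in range(200)]
true_count = sum(1 for p in patients if is_senior(p))

# Repeat the query many times to see how much the answers vary.
noisy_eps_01 = [private_count(patients, is_senior, 0.1, rng) for _ in range(1000)]
noisy_eps_1 = [private_count(patients, is_senior, 1.0, rng) for _ in range(1000)]


def mean_abs_error(answers):
    return sum(abs(a - true_count) for a in answers) / len(answers)


print("true count:", true_count)
print("typical error at epsilon=0.1:", round(mean_abs_error(noisy_eps_01), 2))
print("typical error at epsilon=1.0:", round(mean_abs_error(noisy_eps_1), 2))
```

The error at epsilon 0.1 should be roughly ten times the error at epsilon 1.0, which is the calibration problem in miniature: more privacy costs accuracy. In practice a single noisy answer is released, not a thousand; the repetition here only serves to measure the spread.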

Synthetic data and differential privacy: A winning combination

The previous sections showed how synthetic data and differential privacy can be powerful methods for deidentifying personal information. Just imagine if you could combine these two techniques to create a whole new dataset… in fact, this is very much a reality!

The architecture below (source: Microsoft) represents an algorithm that combines both techniques:

The creation of a synthetic dataset with differential privacy represents one of the highest levels of privacy for protecting consumers’ personal information. It makes the reidentification of individuals extremely difficult and, in some cases, practically impossible (as with large datasets).
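
One simple way to picture the combination is a differentially private histogram used as a synthesizer: count the records in each category, add noise to the counts, then sample fictional records from the noisy counts. The sketch below is only that, a minimal illustration. It is not the Microsoft architecture referenced above, the age bands and epsilon value are hypothetical, and it assumes the list of categories is fixed in advance and that two neighbouring datasets differ by adding or removing one person.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_histogram(values, bins, epsilon, rng):
    """Count values per bin, then add Laplace noise to every count."""
    counts = {b: 0 for b in bins}
    for value in values:
        counts[value] += 1
    # Each person falls in exactly one bin, so one person changes the
    # histogram by at most 1 in total: the noise scale is 1 / epsilon.
    # Clamping negatives to zero is post-processing and keeps the guarantee.
    return {b: max(0.0, c + laplace_noise(1 / epsilon, rng)) for b, c in counts.items()}


def sample_synthetic(noisy_counts, n, rng):
    """Generate n fictional records in proportion to the noisy counts."""
    bins = list(noisy_counts)
    weights = [noisy_counts[b] for b in bins]
    return rng.choices(bins, weights=weights, k=n)


rng = random.Random(7)

# Hypothetical original data: one age band per customer.
AGE_BANDS = ["18-34", "35-54", "55+"]
original = ["18-34"] * 400 + ["35-54"] * 450 + ["55+"] * 150

noisy_counts = dp_histogram(original, AGE_BANDS, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy_counts, 1000, rng)

original_share = {b: c / len(original) for b, c in Counter(original).items()}
synthetic_share = {b: synthetic.count(b) / len(synthetic) for b in AGE_BANDS}
print("original :", original_share)
print("synthetic:", synthetic_share)
```

Only the noisy counts ever touch the original data, so the synthetic records inherit the privacy guarantee of the histogram. With hundreds of people per category the noise barely changes the proportions, which hints at why large datasets are easier to protect without losing usefulness.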

The advantages of using synthetic data

Gartner estimates that most AI algorithms will be trained on synthetic data (see chart below).

There are many advantages to using synthetic data:

Generation of data for analysis (including machine learning) and testing at low cost
Significant reduction in risks of reidentifying individual people
Simplification of data sharing between organizations while respecting privacy
Acceleration of development process for data products and training of AI models

The industries that are most likely to use synthetic data are banking, insurance, medicine and telecommunications. These industries have an enormous amount of sensitive data on patients and consumers.

Synthetic data are used to train models that estimate the probability of losing a customer, the risk of fraud and the factors that affect customer satisfaction. They are also used to perform tests and to share data with partners. In a privacy protection context, integrating synthetic data into your data strategy is highly recommended in order to remain innovative and competitive.

The biggest challenge related to synthetic data is ensuring that the AI model (the synthesizer) does not introduce bias. A biased algorithm will produce biased synthetic data and, as a result, poor results and bad decisions by users. Researchers often cite this danger as one of the hazards of poorly calibrated AI models.
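
One basic check for this kind of bias is to compare how often each group appears in the original data and in the synthetic copy. The sketch below assumes a single categorical column and a 5-point tolerance; the `representation_gaps` helper, the threshold and the example groups are all hypothetical, and a real audit would look at many more properties than group frequency.

```python
from collections import Counter


def proportions(values):
    """Share of each group in a list of group labels."""
    counts = Counter(values)
    total = len(values)
    return {group: count / total for group, count in counts.items()}


def representation_gaps(original, synthetic, threshold=0.05):
    """Return groups whose share shifted by more than the threshold."""
    before, after = proportions(original), proportions(synthetic)
    gaps = {}
    for group in set(before) | set(after):
        gap = after.get(group, 0.0) - before.get(group, 0.0)
        if abs(gap) > threshold:
            gaps[group] = gap
    return gaps


# Hypothetical example: the synthesizer under-represents rural customers.
original = ["urban"] * 700 + ["rural"] * 300
synthetic = ["urban"] * 850 + ["rural"] * 150

gaps = representation_gaps(original, synthetic)
for group, gap in sorted(gaps.items()):
    print(f"{group}: share changed by {gap:+.0%}")
```

A non-empty result is a signal to revisit the synthesizer before any model is trained on its output.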

If you would like to learn more about this topic, don’t hesitate to get in touch with us!