TL;DR: Behavioral data is the new gold. Go Synthetic to unleash its potential – and make sequential behavioral data your ultimate test case for choosing your preferred synthetic data solution!
In one of our earlier blog posts we demonstrated our synthetic data platform using a well studied, publicly available dataset of over 50’000 historic diamond sales with 10 recorded data points each. It served as an educational example to introduce the idea of synthetic data, as well as to showcase the unparalleled accuracy of our technology. With the click of a button, users of our platform can forge an unlimited number of precious, highly realistic, highly representative synthetic data diamonds.
But, to be fair, the example didn’t do justice to the type and scale of real-world behavioral data assets encountered in today’s industry, whether it’s financial, telecommunication, healthcare or other digital services. Organizations operate at a different order of magnitude, as they serve millions of customers and record thousands of data points over time for each one of those. Whether these recorded sequences of events represent transactions, visits, clicks or other actions, it is so important that these rich behavioral stories of customers are understood, analyzed, and leveraged at scale in order to provide smarter services with the best possible user experience for each customer.
But despite the immense growth in volume over recent years, captured behavioral data remains a vastly untapped opportunity. Over and over again, we see two key obstacles at play within organizations:
1) Behavioral data is primarily sequential and constantly evolving, rather than static and fixed – and with its thousands of data points per individual, there is a nearly unlimited number of potential temporal inter-dependencies and contextual correlations to look for. To put it simply: it’s a fundamentally different kind of beast than what is taught in Statistics 101. Existing business intelligence tools, as well as regression or tree-based models, struggle to make sense of this type of data at scale. Thus it is no surprise that only the most data-savvy organizations come out on the winning side by knowing how to leverage their immense behavioral data assets to gain a competitive edge with hyper-personalized customer experiences.
2) Behavioral data remains primarily locked up, because with thousands of available data points per customer the re-identification of individual subjects becomes increasingly easy. Existing anonymization techniques (e.g. data masking), which were developed to work for a handful of sensitive attributes per subject, stand no chance of protecting privacy while retaining the utility of this type of data at a granular level. This disillusionment is by now also broadly understood and recognized by the public.
As it turns out, these are two reinforcing effects: Without safe data sharing, you can’t establish data literacy around behavioral data. Without data literacy, you will not see the growing demand for behavioral data in your organization. However, only some companies will remain stuck in their inertia, while others are able to identify and thus address the dilemma by turning towards synthetic data, which allows them to offer smart, adaptive, and data-driven services to win the hearts of the consumers (as well as the markets).
The Curse of Dimensionality
Let’s look at an example to illustrate the complexity of sequential behavioral data. Within retail banking, each account will have a sequence of transactions recorded. But even if we discard any personally identifiable information on the customer, and even if we limit the amount of information per transaction to 5 distinct transaction amounts and 20 distinct transaction categories, the number of behavioral stories quickly explodes with the length of the sequences. While a single transaction seems innocuous with its 5*20 = 100 possible outcomes, two transactions will already yield 100*100 = 10’000 outcomes. For a sequence of three transactions we are at 100^3 = 1 million outcomes per customer, and at forty recorded transactions we already have 100^40 = 10^80 possible outcomes, on the order of the estimated number of atoms in the observable universe! No wonder that these digital traces are highly identifying and near impossible to obfuscate. No wonder that making sense of this vast sea of data, and detecting patterns and nuances therein, poses such a huge challenge.
This combinatorial explosion, the exponential growth of outcomes with the number of records per subject, is also referred to as the curse of dimensionality. There is no person like another, everyone is different, everyone is unique. It’s a curse for analytics, it’s a curse for protecting privacy. But, at the same time, it’s a blessing for customer-centric organizations that are willing to embrace a rich, diverse world of individuals, and that recognize this as an opportunity to differentiate on top of these otherwise hidden behavioral patterns.
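The arithmetic behind this explosion is easy to reproduce. The following is a minimal sketch using the toy numbers from the example above (5 amounts, 20 categories); the constant and function names are our own and purely illustrative.

```python
# Count the possible behavioral "stories" for sequences of n transactions,
# using the toy numbers from the text: 5 amounts x 20 categories per transaction.
AMOUNTS = 5
CATEGORIES = 20
OUTCOMES_PER_TRANSACTION = AMOUNTS * CATEGORIES  # 100 outcomes for one transaction


def possible_sequences(n_transactions: int) -> int:
    """Number of distinct transaction sequences of the given length."""
    return OUTCOMES_PER_TRANSACTION ** n_transactions


for n in (1, 2, 3, 40):
    # Scientific notation keeps the very large counts readable.
    print(f"{n:>2} transactions: {possible_sequences(n):.0e} possible sequences")
```

Because the count grows as 100^n, every additional recorded transaction multiplies the number of possible stories by one hundred.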
AI-Generated Synthetic Data to the Rescue
The power of synthetic data continues to be recognized as THE way forward for privacy-preserving data sharing. While there are various approaches and levels of sophistication, ranging from simple rule-based to more advanced model-based generators, our focus at Mostly AI has always been on offering the world’s most accurate solution based on deep neural network architectures. These are high-capacity, state-of-the-art machine learning models that can reliably and automatically pick up and retain complex hidden patterns at scale, in particular for the type of sequential data that is so prevalent among an organization’s behavioral data assets. These models make few a priori assumptions and require no manual feature engineering by domain experts. They are the very same models that have revolutionized so many fields over the past couple of years, like image classification, speech recognition, text translation, and robotics, and that are now about to change privacy-preserving big data sharing once and for all.
And ultimately, it is the accuracy and representativeness of the synthetic data that is the key driver of its value. This is what will determine whether use cases go beyond mere testing & development, and expand towards advanced analytics and machine learning tasks as well, where synthetic data can be relied on in lieu of the actual privacy-sensitive customer data. And just as classic learning algorithms continue to be superseded by deep learning in the presence of big data, one can already observe a similar evolution for the market of synthetic data solutions for behavioral data assets.
This was the first part of our mini-series on sequential data, setting the stage for next week’s post. There we will present a handful of empirical case studies to showcase the power of our synthetic data platform, in particular with respect to the important domain of behavioral data – so make sure that you don’t miss out on it!
In the previous part of this series, we discussed two risks entailed in the rise of digitalization and artificial intelligence: the violation of the privacy and fairness of individuals. We also outlined our approach to mitigate privacy and fairness risks with bias-corrected synthetic data: this allows for privacy-preserving data sharing and also aids the fair treatment of customers (data subjects) in downstream analysis and machine-learning tasks. (By the way, if you would like to experiment with fair synthetic data yourself, you can download the datasets we created at the bottom of the page.)
If you would like to dig deeper into algorithmic fairness and the potential risks in machine learning systems, we can highly recommend The Ethical Algorithm book by M. Kearns and A. Roth. For a more technical viewpoint, check out fairmlbook.org to find lecture notes, videos, and other great resources.
In this blog post, we take a deeper dive into our approach to de-bias synthetic data. For now, we focus on statistical parity as a fairness measure and show in detail the effects of our approach in two settings:
Let’s start with a quick reminder: a data set or algorithm being unfair usually refers to some kind of imbalance. A rather intuitive measure for such an imbalance is the so-called statistical or demographic parity. In mathematical terms, we can describe it as follows: consider a population that can be split into groups by a sensitive attribute S, such as gender, skin color, age or any other property. Then consider another target attribute T that contains sensitive information on the population such as income, whether or not people spent time in prison or credit history.
In the Adult data set, we select the sensitive attribute (S) gender, either “female” or “male”, and the target attribute (T) income, which is either “>50k” or “<=50k”.
In this example, statistical parity is satisfied when the number of females that earn more than 50K divided by the total count of females equals the number of males earning more than 50K divided by the total number of males:
In other words, the probability that a randomly chosen male is a high earner should be the same as the probability of a random female being a high earner. Also, note that when these two fractions are equal for the high-income segment (“>50K”) then this automatically holds true for the low-income segment (“<=50K”) as well.
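Written out, the condition described above can be stated as follows. The notation is ours, reusing S for the sensitive attribute and T for the target attribute as introduced earlier; it is a reconstruction from the verbal definition rather than the original typeset formula.

```latex
\frac{\#\{S=\text{female},\; T={>}50\text{K}\}}{\#\{S=\text{female}\}}
\;=\;
\frac{\#\{S=\text{male},\; T={>}50\text{K}\}}{\#\{S=\text{male}\}}
\quad\Longleftrightarrow\quad
P\left(T={>}50\text{K} \mid S=\text{female}\right)
= P\left(T={>}50\text{K} \mid S=\text{male}\right)
```

Since the two income segments are complementary, each side's low-income probability is one minus its high-income probability, which is why equality for “>50K” implies equality for “<=50K”.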
In the real world, unfortunately, the equality above does not hold true. A simple visualization of the data set reveals a strong imbalance between females and males (Fig 2).
Only 10.96% of women are in the high-income range while among men, the fraction is 30.79%, almost three times higher. In the remainder of the blog post, we show how to create a fair, synthetic version of the Adult data set that removes the income gap between these two gender groups.
Though being intuitive, parity has limitations especially in the context of fair algorithmic decision making. We are aware of these shortcomings, some of which we mentioned in our fairness definitions post already, and we will discuss alternative fairness measures at the end of this blog post.
Generating a Fair Dataset
One of the first ideas to try when creating a fair data set for machine learning is to drop the sensitive column. In the presented case that’s the “sex” attribute. At first sight, this sounds like a good and easy-to-implement solution but, unfortunately, it can actually cause more harm than good. What makes this approach fail are so-called proxy or hidden proxy columns. Imagine we know which neighborhood a person lives in, the brand and model of the person’s mobile phone, the car this person drives, where this person buys her/his clothes, etc. Given some of the above information, we humans can make a pretty educated guess on this person’s sex, skin color, and other attributes. And since algorithms are better than humans at analyzing patterns like this, they are likely to detect these correlations and exploit them, leading again to unfair predictions and decisions. We could actually go one step further and say that leaving the “sex” column in the data set is better for fairness because it offers a clear handle to enforce fairness constraints such as statistical parity. To give another example from criminal justice, women on average are less likely to commit future violent crimes than men with similar criminal records. So, a gender-neutral assessment can overestimate a woman’s recidivism risk.
Our Synthetic Data Platform, Mostly GENERATE, leverages deep neural networks to produce synthetic data. In order to generate fair synthetic data, we add a fairness constraint to the model parameter optimization during training. Sticking with the Adult data set, we penalize the violation of statistical parity within every mini-batch by increasing the training loss by a number that is proportional to the difference between the fraction of women and the fraction of men in the high-income segment. A very similar approach for training fair classifiers is described in a paper by P. Manisha and S. Gujar and an implementation can be found at Y. Shavit’s github repo.
In more general terms, adding the fairness constraint expands the objective of our software from generating accurate and private synthetic data to generating accurate, private, and fair synthetic data.
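To make the idea of a parity penalty concrete, here is a minimal sketch in plain Python. It is not the implementation used in our platform: the function names, the fairness_weight parameter, and the use of an absolute difference are illustrative assumptions, and a real training loop would compute the same quantity with a differentiable tensor library.

```python
from typing import Sequence


def parity_penalty(p_high: Sequence[float], is_female: Sequence[bool]) -> float:
    """Absolute gap between the groups' high-income fractions in one mini-batch.

    p_high[i] is the (hypothetical) model probability that record i falls
    into the ">50K" segment; is_female[i] flags the sensitive attribute.
    """
    female = [p for p, f in zip(p_high, is_female) if f]
    male = [p for p, f in zip(p_high, is_female) if not f]
    if not female or not male:
        return 0.0  # the gap is undefined when one group is absent from the batch
    return abs(sum(female) / len(female) - sum(male) / len(male))


def total_loss(accuracy_loss: float,
               p_high: Sequence[float],
               is_female: Sequence[bool],
               fairness_weight: float = 1.0) -> float:
    """Accuracy loss plus a penalty proportional to the parity violation."""
    return accuracy_loss + fairness_weight * parity_penalty(p_high, is_female)


# Toy mini-batch: two females and two males.
batch_p_high = [0.1, 0.2, 0.3, 0.4]
batch_is_female = [True, True, False, False]
print(total_loss(1.0, batch_p_high, batch_is_female, fairness_weight=2.0))
```

A larger fairness_weight pushes the optimizer to close the gap more aggressively, at the cost of giving the accuracy term relatively less influence.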
Private and Fair Synthetic Data
After feeding the Adult data set to our software and training it with the additional parity fairness constraint in place, we generate a synthetic fair version of the Adult data set. Once we evaluate the income distribution, we see a major change: the income gap almost disappeared (Fig 3).
Actually, we repeated the whole process 50 times and the plotted numbers are the average ratios over these independent runs. The income-ratios slightly varied across the 50 experiments but this variance (rooted in the stochastic nature of our training and generation process) was quite small: 1.2% and 1.3% for the Male and Female ratios, respectively. As apparent from the plots, the synthetization corrected the income gap: 25% of the synthetic males are high earners (instead of the real 30%) and 22% of the synthetic females are high earners (while the original value was 11% only).
With regard to parity, it is common to compare not just the difference but also the ratio of the high-income female fraction to the high-income male fraction (that is, we divide the two sides of the above equation). This ratio is called the disparate impact, and it is an industry standard to ask for at least 0.8, the so-called four-fifths rule. In the original data set, this ratio is roughly 10/30 = 0.33, quite a severe disparate impact violation, but the bias-corrected synthetic data is at 22/25 = 0.88, well over the threshold.
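The disparate-impact check above boils down to a single division. Here is a minimal sketch using the rates quoted in this post; the function name and the threshold constant are our own illustrative choices. With the unrounded percentages (10.96% and 30.79%) the original ratio comes out at about 0.36 rather than the rounded 0.33, which does not change the verdict.

```python
def disparate_impact(rate_disadvantaged: float, rate_advantaged: float) -> float:
    """Ratio of the two groups' positive-outcome rates (here: share of high earners)."""
    return rate_disadvantaged / rate_advantaged


FOUR_FIFTHS = 0.8  # threshold of the four-fifths rule

original = disparate_impact(0.1096, 0.3079)  # real Adult data: females vs. males
synthetic = disparate_impact(0.22, 0.25)     # bias-corrected synthetic data

for name, value in (("original", original), ("synthetic", synthetic)):
    verdict = "passes" if value >= FOUR_FIFTHS else "violates"
    print(f"{name}: disparate impact {value:.2f} {verdict} the four-fifths rule")
```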
The additional parity constraint during model training does not diminish the quality and accuracy of the synthetic data. Univariate distributions of the synthetic-data attributes almost perfectly match their original counterparts (in Figure 4, we show only a selection). Please note that, while parity is modified to a large degree, both the population-wide male-to-female ratio, as well as the high-earner-to-low-earner ratios, are preserved.
The bivariate correlations of the synthetic data also seem, at first sight, to be in excellent agreement with the original data (Figure 5).
A closer look, however, reveals some detailed changes due to the inner workings of the fairness constraint. Given the statistical parity definition, “income” must not depend on “sex”, which means these two attributes should not be correlated.
While in the original data, there is a clear “sex”-”income” correlation (red circle in the left plot in Figure 5) this dependency is almost reduced to noise level in the fair, synthetic data (red circle in the right plot in Figure 5). Apart from the “sex”-”income” pair, no other correlation seems to be altered by applying the fairness constraint, at least not strong enough to show a visible effect on the correlation plot.
But what about proxy attributes, columns in the data set that are correlated with “sex” and “income”? Can they introduce unfairness through a backdoor, as they are not explicitly mentioned in the parity constraint? Recall that the “parity equation” (see Equation 1) contains the attributes “sex” and “income” only.
To visualize the effect of the parity constraint on proxy attributes, we add an artificial feature to the Adult data set named “proxy”. We generated this column so that it is strongly correlated with the attribute “sex”. For females, “proxy” equals to 1 in 90% of all cases and equals to 0 for the remaining 10%. For males, the percentages are swapped. Looking at this new data set, we see, first, the strong correlation between “sex” and “proxy” (the black arrow on the left-hand side plot of Figure 6). Second, as these two attributes are strongly linked, also their correlation to “income” is comparable (the red arrow on the left-hand side plot of Figure 6). Now, when we run our synthetic data solution with the fairness constraint in place on “sex” only, we find that in the fair synthetic data both correlations “sex”-”income” and “proxy”-”income” are almost reduced to noise level (the red arrow on the right-hand side plot of Figure 6). The latter finding shows that the parity constraint works as intended and accounts for (hidden) proxy attributes.
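For readers who want to repeat this experiment, an artificial proxy column of this kind can be generated in a few lines. This is a minimal sketch under stated assumptions: the toy list of sexes stands in for the real “sex” column, and the label spellings, seed, and helper name are hypothetical.

```python
import random


def make_proxy(sex: str, rng: random.Random) -> int:
    """Return 1 with probability 0.9 for females and with probability 0.1 for males."""
    p_one = 0.9 if sex == "Female" else 0.1
    return 1 if rng.random() < p_one else 0


rng = random.Random(42)  # fixed seed so the sketch is reproducible
sexes = ["Female"] * 5000 + ["Male"] * 5000  # toy stand-in for the "sex" column
proxy = [make_proxy(s, rng) for s in sexes]

share_female = sum(p for s, p in zip(sexes, proxy) if s == "Female") / 5000
share_male = sum(p for s, p in zip(sexes, proxy) if s == "Male") / 5000
print(f"proxy == 1 for {share_female:.1%} of females and {share_male:.1%} of males")
```

Because the column is drawn from the sex attribute alone, any correlation between “proxy” and “income” in the resulting data is inherited entirely through “sex”.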
In the Adult data set, gender is not the only sensitive attribute: if we train our synthetic engine with “race” as a sensitive attribute, we get similarly impressive corrections (for this task, we used a simplified version of the data set filtering for Black/White subjects). In the original data, the share of high earners in the White population is twice as high as in the African American population, but the ratios are almost exactly equal in our adjusted synthetic data (Figure 7).
In summary, the introduction of (parity) fairness to our software solution shows very promising results. The quality and accuracy of the synthetic data remain high, the privacy of data subjects is protected, and parity fairness is closely approximated. All these properties make private and fair synthetic data readily available for further application.
Mitigating Bias on More Than One Feature
It is also possible to turn on the fairness loss for multiple sensitive attributes at the same time, which we did for race and gender. In this case, one must be careful about which ratios to optimize: if we were to simply put fairness losses independently on race and gender, then the algorithm might fall into the trap of “fairness gerrymandering”. That is, the new data set would look fair with respect to both gender and race individually, but we would see high imbalances when looking at combinations of gender and race (Figure 8).
Taking this into account, our solution yields synthetic data with much more balanced high-income ratios across the four groups given by race and gender (Figure 9).
It is apparent that we did not achieve complete parity but this difference can be further lowered by giving higher weight to the fairness loss against the accuracy loss.
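The gerrymandering pitfall mentioned above is easiest to see with a small, deliberately constructed example. The counts below are made up for illustration and do not come from the Adult data set; the helper name is ours.

```python
# (gender, race) -> (number of high earners, group size); made-up counts
groups = {
    ("Female", "Black"): (40, 100),
    ("Female", "White"): (10, 100),
    ("Male", "Black"): (10, 100),
    ("Male", "White"): (40, 100),
}


def high_income_rate(selector) -> float:
    """High-earner share among all subgroups whose key matches the selector."""
    high = sum(h for key, (h, n) in groups.items() if selector(key))
    total = sum(n for key, (h, n) in groups.items() if selector(key))
    return high / total


# Marginal rates look perfectly fair ...
by_gender = {g: high_income_rate(lambda k, g=g: k[0] == g) for g in ("Female", "Male")}
by_race = {r: high_income_rate(lambda k, r=r: k[1] == r) for r in ("Black", "White")}
# ... while the intersectional rates are far apart.
by_both = {key: h / n for key, (h, n) in groups.items()}

print(by_gender)
print(by_race)
print(by_both)
```

Every marginal rate is 25%, yet the four intersectional groups range from 10% to 40%, which is why the fairness loss has to be defined on the combined groups rather than on each attribute separately.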
Fair Synthetic Data in Downstream Machine Learning Tasks
In the previous post, we introduced a scenario in which Got Big Data Company generates a fair synthetic dataset. This data set is handed to an external vendor, SmartUP AI, to develop new predictive models. As the data set is fair and synthetic, SmartUP AI does not need to take specific privacy measures nor does it need to apply any bias correction so they can work with standard, out-of-the-box models.
We demonstrate this with the Adult census data by fitting a simple linear model, logistic regression, which predicts the income level, high versus low, based on the other attributes. As we mentioned, there is no point in removing gender as an explanatory variable since the data set can contain other hidden proxies. We train two models, one on the original data set and one on the bias-corrected synthetic data. Both models are then tested on a holdout from the original data. Moreover, we repeated the model training procedure 50 times with independently generated synthetic data.
The charts in Figure 10 show the mean performance of the real and synthetic models over these experiments. The synthetically fitted models have very competitive performance and generalize well to the unseen real data. Also, we observed only minimal variance across the experiments (2%, 2% and 2.5% in Accuracy, AUC-ROC, and F1-score, respectively).
Moreover, the models trained on the synthetic data treat the classes of the sensitive attribute (gender, in this case) nearly equally. These predictive models output the probability of being high-income for any data point, so we can look at how these probabilities are distributed. Since there are more low-income samples, we expect these probabilities to be concentrated close to 0, both for females and males. However, for the model fitted on the original data, we see below that there is a much higher number of around-0 probabilities for females than males (Figure 11).
On the other hand, with the predictors trained on the synthetic data these distributions are brought very close together. This is exactly the group fairness that parity is designed to capture. The important thing to keep in mind though is that the predictive-model training itself did not involve any type of optimization to fairness and the evaluation is also on the biased original data. So this fair outcome is solely due to using bias-corrected synthetic data for the training.
Our results align with the findings of research conducted at Carnegie Mellon University into fair representations of data. We see that our fairness-constrained synthetic data solution learns to represent data points in a way that removes the dependencies between the sensitive and target attribute while preserving other relationships.
Correcting the Compas Data Set
We return briefly to the ProPublica study on algorithmic justice and the corresponding Compas data set (see our introductory fairness post). This data set contains information about defendants together with their predicted risk to re-offend, the so-called Compas score. We generate a parity-fair synthetic version of this data set with “race” as the sensitive attribute and the Compas score as the target variable. The original data set is heavily biased against African Americans, a bias that is almost entirely corrected in our synthetic data.
In the original Compas data set, the ratio of individuals with high Compas scores is 59% and 35% for African Americans and Caucasians, respectively. Quite impressively, our bias mitigated data reduced this gap to merely 1%, settling the values in the middle at 49% and 48%, respectively.
In the subsequent prediction task, we can achieve almost perfect equality between the predicted probabilities for high Compas score between the two classes of the sensitive attribute “race” (Figure 13).
Looking at the classifier’s performance, this parity-correction comes with minimal compromise in predictive accuracy (Figure 14).
Alternative Fairness Definitions
While demographic parity is a very intuitive notion, it has certain limitations. Compared to other fairness definitions, there is a worse trade-off between satisfying parity and having high accuracy for the generated data. If your original data had a class imbalance between the groups (that is, different base rates), then the parity-mitigated synthetic data, or a classifier that is forced to satisfy parity, cannot achieve the same level of accuracy as a predictor with no parity loss. In fact, the base-rate difference yields a provable lower bound on the error, and thus a limit on the achievable accuracy. Moreover, parity is a notion of group fairness: it equalizes outcomes across classes, while other approaches optimize for individual fairness, focusing on treating similar individuals similarly. S. Corbett-Davies and S. Goel argue that all these approaches suffer from serious shortcomings and advocate a risk-based assessment that could better serve policymaking.
Since parity only considers the sensitive attribute and a single other variable, it is not designed to handle a situation involving both predictions and a ground-truth label (three variables all together). So, in a more nuanced approach to fairness, one aims to have a predictor model that makes the same mistakes with the same chance across the sensitive attribute classes.
Such notions include equal opportunity and equalized odds which we also tested in our synthetization process: our experiments showed that if we generate synthetic data sets with these fairness constraints then they also give rise to fair classifiers with respect to these stronger notions. We will share the details of these more technical results in a subsequent article.
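As a rough illustration of what these stronger notions measure, the sketch below computes the gaps in true-positive and false-positive rates between the two classes of a binary sensitive attribute. Equal opportunity asks for a zero true-positive-rate gap, and equalized odds asks for both gaps to be zero. The function names and toy labels are our own, and the code assumes each group contains at least one positive and one negative example.

```python
def rates(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives  # assumes both counts are non-zero


def fairness_gaps(y_true, y_pred, group):
    """Absolute TPR and FPR gaps between the two values found in `group`."""
    a, b = sorted(set(group))  # assumes exactly two distinct group values

    def subset(g):
        idx = [i for i, x in enumerate(group) if x == g]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    tpr_a, fpr_a = rates(*subset(a))
    tpr_b, fpr_b = rates(*subset(b))
    return abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)


# Toy example: four females followed by four males.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["f"] * 4 + ["m"] * 4
tpr_gap, fpr_gap = fairness_gaps(y_true, y_pred, group)
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```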
Conclusions and How We Will Continue After #FairnessWeek
The notion of fairness (in particular, statistical parity) and synthetic data go together very well. Not only can we generate highly accurate synthetic data but we can also steer the generation to almost perfectly mitigate strong biases in the original data sets. The additional fairness constraint in the training loss of our generative models fine-tunes the correlation structure between attributes such that these biases are strongly reduced. Privacy and (parity) fairness are further preserved in downstream tasks: an out-of-the-box classifier model when trained on fair synthetic data makes fair predictions even on biased input.
Statistical parity has limitations, and, on a more general note, there is no concept of fairness or silver-bullet solution that is applicable to all possible use cases. While this was our last post of #FairnessWeek, we will definitely continue our work on fair synthetic data and mitigating bias in Artificial Intelligence. In an upcoming study, we will extend our approach to other fairness measures, such as equal false-positive rates and equalized odds. So if you want to make sure that you don’t miss part 6 of our Fairness Series, sign up for our (monthly) newsletter below!
In the age of digitalization and the rise of artificial intelligence, more and more tasks in public and private organizations are managed or supported by computers and machine-learning algorithms. These include tasks such as data analysis, automated decision making, customer interaction services such as automated emails or chatbots, and recommendation systems. In general, we believe this is a good thing, as machine learning algorithms are fast, scalable, and can analyze way more complex data structures than humans. For example, there are studies showing that the adoption of automated underwriting in mortgage lending contributed to the increase of approval rates for minority and low-income applicants by 30% while improving the overall accuracy of default predictions.
However, machine learning algorithms typically require lots of training data and when this data contains sensitive information about real people, the stakes become extremely high. Two risks involve the violation of privacy and fairness: disclosing sensitive personal information and treating people unjustly during the decision-making process.
There are many well-documented cases of biased decision making that triggered an ongoing discussion about algorithmic fairness. A famous example is Google’s hate speech-detection algorithm that discriminated against African Americans. Researchers at the University of Washington found that the algorithm was more likely to label tweets written by African Americans as “hateful” or “offensive”. Not only was it biased against people of color, but also, as another study demonstrated, against well-known drag queens. Another case of bias in Artificial Intelligence was Amazon’s HR algorithm. The system was fed with 10 years’ worth of records of previous – and predominantly male – Amazon employees and thereby learned that being female correlated poorly with being a suitable candidate for a job at the tech company.
Now, in the cases above, algorithms systematically discriminate against a group based on its gender, race, or sexual orientation. If not addressed, these systemic biases end up in data sets that decision-making algorithms are trained on. Subsequently, the biased algorithms make unfair decisions, perpetuating, and actually amplifying the biases in our society.
We at Mostly AI believe in the positive powers of artificial intelligence to foster research and innovation. We will demonstrate that bias-corrected synthetic data can address both privacy and fairness concerns to allow for utilizing and democratizing big data assets while keeping the risks at a minimum. The current post will give a high-level overview of our work and in post 5 of our Fairness Series, we will discuss more technical aspects of our results as well as make our fair synthetic data sets available.
From Privacy Protection To Promoting Fairness in AI
Our Synthetic Data Platform enables organizations to generate highly accurate, statistically representative synthetic data at scale, such as synthetic customer records along with purchase histories. The software functions as an unlimited source of artificial individuals who have interacted with your business the same way as real people did historically. The synthetic data, however, can be shared safely without privacy concerns since these artificial people do not really exist and the privacy of your actual customers, the real data subjects, remains protected. (If you would like to learn more about synthetic data, watch our mini video series.)
Synthetic data generation doesn’t need to stop at privacy protection though. As we generate the data from scratch, we can model and shape it to fit different needs. A beautiful example of this is NVIDIA’s StyleGAN, where a conditional generation of synthetic images allows for adding smiles or sunglasses to faces, or changing hair and skin color.
In this blog post, we want to leverage the possibility of modeling and shaping synthetic data to mitigate the second risk mentioned in the introduction: violation of fairness. The result is fair synthetic data that is fully anonymous and de-biased (in accordance with a specific fairness definition).
To Get Fair Synthetic Data You Need To Start With A Fairness Definition
Imagine a perfect world without any biases or discrimination, where attributes such as skin color or sex do not influence people’s lives in either a good or a bad way. In such a world, the fraction of women in top management positions would equal that of men. Similarly, the fraction of women earning more than $50,000 per year would equal that of men, and the fraction of African Americans in US prisons would be the same as the fraction of Caucasians. This property goes by the name of statistical or demographic parity. The plot below shows how demographic parity is violated in the Adult US census data set with respect to gender and income.
Statistical parity is a very intuitive fairness measure and, in a perfect world with equal opportunities for everybody, it would be satisfied. There are many other, equally viable metrics but keep in mind that there is no single equation or approach that will perfectly fit vastly different scenarios. To truly address and derive actionable insight against bias, one needs a deep understanding of the underlying issues in each use-case. What we developed here is a flexible framework to generate synthetic data that satisfies fairness with respect to a given metric, focusing on parity for now and exploring other measures in a subsequent study.
How To Create A Fair Synthetic Dataset?
There are three points in the machine learning life cycle where you can mitigate bias: at the source, by changing your input data; during the modeling phase by using additional fairness constraints; and as a post-processing step, by revising the algorithm’s decisions in favor of a sensitive group. Naive data-level techniques, such as oversampling methods, have the risk of skewing important data distributions when mitigating imbalances. Our approach is a sort of hybrid, using fairness constraints on a generative model to produce fair synthetic data.
Now, the main objective of our Synthetic Data Platform is to generate new, synthetic data that is as accurate and as representative as the original data set. Under the hood, the software leverages deep neural networks that are trained to optimize an accuracy loss: this simply measures how well our model is reproducing the statistical distributions of the real data. Now, in order to get fair data, we can add a fairness constraint to this optimization step. To stick with the income example, for every mini-batch of data that enters during training, we penalize the violation of statistical parity by a number that is proportional to the difference between the fraction of women and the fraction of men in the high-income segment. We then adapt the model parameters with the objective to minimize both the accuracy loss and fairness constraint.
Using this approach, we successfully removed the income inequality with respect to gender from the synthetic version of the Adult data set. We did this with very little compromise on other aspects of data accuracy: for example, we preserved the original Male/Female ratio perfectly.
How Organizations Can Benefit From Private And Fair Synthetic Data
One of our main motivations in working on fair synthetic data generation is the following scenario: imagine Got Big Data Company, a conscientious organization that aims to develop a new predictive model. To do so, they ask for the help of a third-party vendor, SmartUp AI. Until recently, such collaborations involved allowing access to their sensitive database. Moreover, if Got Big Data Company wanted to address data bias, then it required rather special know-how on the developer’s side. Here enters fair synthetic data: Got Big Data Company first generates a synthetic and hence private version of their original data set which is also fair with respect to the modeling task at hand. Next, the vendor, SmartUp AI, develops the predictive model on the synthetic data, just as they would for any task, without having to be concerned about bias correction on their end. Then, these models are handed back to Got Big Data Company for use on actual customer data.
We find that out-of-the-box predictive models trained on fair synthetic data treat the classes of the sensitive attribute (e.g., female and male) nearly equally. This fair outcome is solely due to using parity-corrected synthetic data; there are no fairness constraints on the predictive models themselves. In the next article, we will release our parity-corrected synthetic data and dive into the technical details of our approach and analysis of the generated data.
There are many inherent risks in automated decision making and in the use of data sets that do not reflect the world we strive to live in. Historical and measurement biases skew predictive models which in turn affect millions of people who are applying for loans or submitting job applications. As data scientists, engineers, and business leaders, we are responsible to address these issues as best as we can. At Mostly AI, we offer a two-in-one tool to utilize data sets that are often sensitive and biased at the same time. First, our fair synthetic data can be safely shared without leaking personal information. Second, having addressed bias-mitigation at the synthetic data generation phase, it enables organizations to utilize existing analytics and modeling pipelines without the need for costly anti-discrimination modifications. To learn more about how fair synthetic data is generated, continue with part 5 of our Fairness Series.
“One of the major challenges in making algorithms fair lies in deciding what fairness actually means,” said Dr. Chris Russell, who is leading the safe and ethical AI group at the Alan Turing Institute, in an interview with Wired. “Trying to understand what fairness means, and when a particular approach is the right one to use is a major area of ongoing research.”
Fairness is a vastly complex concept, and as people tend to have different values, their interpretations of fairness differ as well. A mother might think it is fair if both of her children receive two pieces of chocolate. But instead of having two happy kids eating their chocolate, they start to quarrel. The older one’s argument? He is much bigger and thus should have received a piece more than his brother. The little one’s opinion? It was him who helped dad do the dishes yesterday evening – therefore he is the one deserving more chocolate.
Equal Treatment Versus Equal Access
In private as well as in business contexts, we oftentimes strive to achieve fairness by treating everybody exactly the same. An equal amount of chocolate. An equal amount of time to finish a test in school and – in an ideal world – also equal pay regardless of gender or race. The concept behind this is called equality. But what it fails to take into account is that not every one of us starts from the same place and that some might need different support than others do. Imagine three people of different heights trying to get to the beautifully ripe, red apples on an apple tree. If you were to give a small pedestal to everyone, it would not really improve the situation for the smaller individuals.
However, if everyone received exactly what they need to get to the fruits, you would have leveled the playing field. This condition can be described as equity. In contrast to equality, it does not aim to promote fairness by treating everybody the same, but by giving everybody equal access to the same opportunity.
Fair AI Requires A Mathematical Fairness Definition
In order to build fair machine learning systems, we need to precisely define and quantify what we mean by a fair outcome. There are several mathematical definitions that do just that and on a high level, these notions fall into two categories: group and individual fairness.
Group fairness and parity constraints aim to achieve the same outcomes across different demographics, or more generally, a set of protected population classes. In other words, the population that receives a given assessment by the algorithm (let it be positive or negative) should reflect the whole population and its demographics. We can furthermore require that the types of mistakes the model makes and the severity of these errors should be evenly distributed across the population. These requirements are intuitive, easily applied across domains, and hence are the most widely used, and studied.
At the same time, being fair with respect to parity can seem highly unfair from a single individual’s viewpoint. Individual fairness therefore advocates treating similar individuals similarly. The “Fairness through Awareness” approach is built on first defining a task-specific similarity measure between pairs of data subjects and then using it to bound how far apart the predictions of a randomized algorithm may be for two individuals. There are ways to combine group and individual fairness, such as learning fair representations (abstract transformations of the data points into high-dimensional numeric vectors) that could be used in downstream modeling tasks. Yet another approach develops individual risk scores and uses a thresholding policy to treat similarly risky individuals the same way.
The list of fairness definitions goes on and on, but in all cases, one aims to find the most accurate model that still satisfies a given fairness constraint. But who exactly selects the protected classes and the requirements that should be met? We know that certain parity requirements are impossible to satisfy simultaneously. On the other hand, finding the right metric and risk scores for individual assessments can be very challenging and needs to be done on a case-by-case basis. As Hanna Wallach from Microsoft Research puts it “[…] issues relating to fairness and machine learning are fundamentally socio-technical, and they are not going to be addressed by computer scientists or developers alone”. So it is of utmost importance to include a diverse set of stakeholders in these decisions with an insight into the whole decision-making process.
Demographic Parity – A Group-Fairness Measure
To define and explain group-fairness measures in more detail, let’s revisit David Weinberger’s tomato factory example. In this hypothetical factory, tomatoes are processed to end up in spaghetti sauce. An integral part of the factory is a machine learning algorithm that automatically analyzes tomatoes on the conveyor belt and classifies them into fresh and bad (or rotten) tomatoes. Fresh tomatoes are transferred into the “Acceptable” bin and ultimately end up in the spaghetti sauce; rotten tomatoes end up in the “Discard” bin and are thrown away. Suppose there exist only two kinds of tomatoes worldwide: 80% of all tomatoes are red tomatoes and 20% of them are yellow. Apart from their appearance, there is no difference between red and yellow tomatoes. They taste the same, have the same shape, grow equally fast, and need equal amounts of care. They also have the same storage life, which means they start to rot after the same time span.
One of the most intuitive definitions of fairness is demographic (or statistical) parity. If the tomato-sorting algorithm satisfies demographic parity, we expect the “Acceptable” bin in the spaghetti factory to contain about 80% red and 20% yellow tomatoes. In other words, we expect the fractions of red and yellow tomatoes in the global population to be reflected in the “favorable” group of “Acceptable” tomatoes in the factory. An unfair algorithm that “favors” red tomatoes and discriminates against yellow ones would fill the “Acceptable” bin with more than 80% red tomatoes. In this example, demographic parity is a perfectly fine measure.
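The parity check described above can be sketched in a few lines of Python. The function names and the toy batch of 100 tomatoes are hypothetical and only serve to illustrate the definition: the color mix inside the “Acceptable” bin is compared with the color mix of the whole population.

```python
from collections import Counter


def accepted_shares(colors, accepted):
    """Share of each color among the tomatoes placed in the Acceptable bin."""
    counts = Counter(c for c, a in zip(colors, accepted) if a)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {color: n / total for color, n in counts.items()}


def parity_gap(colors, accepted):
    """Largest absolute difference between a color's share in the Acceptable
    bin and its share in the whole population (0.0 means perfect parity)."""
    population = Counter(colors)
    shares = accepted_shares(colors, accepted)
    return max(
        abs(shares.get(color, 0.0) - n / len(colors))
        for color, n in population.items()
    )


# Toy population: 80 red and 20 yellow tomatoes.
colors = ["red"] * 80 + ["yellow"] * 20

# Sorter A accepts 40 red and 10 yellow tomatoes (an 80/20 mix in the bin).
fair_sorter = [True] * 40 + [False] * 40 + [True] * 10 + [False] * 10
# Sorter B accepts 45 red and 5 yellow tomatoes (a 90/10 mix in the bin).
biased_sorter = [True] * 45 + [False] * 35 + [True] * 5 + [False] * 15

fair_gap = parity_gap(colors, fair_sorter)
biased_gap = parity_gap(colors, biased_sorter)
```

Sorter A reproduces the population mix exactly, so its gap is zero, while sorter B over-represents red tomatoes in the bin by roughly ten percentage points.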
However, as Dwork and co-workers pointed out, the notion of demographic parity has shortcomings and needs to be applied with great care. Imagine that our two hypothetical tomato sorts do differ in that yellow tomatoes tend to rot a little faster than red ones on their way to the factory. In that case, the fraction of red tomatoes in the “Acceptable” bin should be larger than 80% as more of the yellow tomatoes need to be discarded. Enforcing demographic parity in this scenario leads to two problems. First, it actually introduces some unfairness. To achieve demographic parity, say for a one-day batch of tomatoes processed in the factory, the algorithm needs to put some rotten yellow tomatoes into the “Acceptable” bin while, at the same time, preventing some of the perfectly fresh red tomatoes from going in there. The second shortcoming is related to the tension between accuracy and fairness. If the tomato sorting algorithm were what is called a perfect classifier (in practice a perfect classifier does not exist, but for the sake of the argument let’s assume it does), it would not make any mistakes and would place all tomatoes in the correct bin. As such, this algorithm is fair as it treats every single tomato the way the tomato “deserves”. Enforcing demographic parity on this perfect classifier would actually detune it – which clearly shows that there is a misalignment between optimizing a classifier and satisfying demographic parity. As a result, demographic parity usually leads to larger costs in accuracy and thus costs an organization more money than other fairness measures.
Equality of False-negatives And Equalized Odds
The core problem of demographic parity is that it does not take into account the ground truth. It does not care whether or not tomatoes are “Acceptable”, it just requires the fractions of red and yellow tomatoes in the global population being represented in the “Acceptable” bin. There is a group of fairness measures that do take into account the ground truth by, for example, balancing or equalizing the errors the sorting algorithm makes for both sorts of tomatoes. One of the simplest examples in this context is the so-called equality of false-negatives measure that enforces constant false-negative rates across groups. In our tomato example, this means that fresh tomatoes – irrespective of their color – have the same probability of falsely ending up in the “Discard” bin. This measure only amends the errors made in the group of fresh tomatoes as only they can falsely end up in the “Discard” bin.
An even stronger fairness notion that also mitigates errors in the group of rotten tomatoes is called equalized odds. It requires constant false-negative as well as true-negative rates across groups. This means that the chances of rotten tomatoes ending up in the “Discard” bin are also equal for red and yellow tomatoes. One big advantage of these types of fairness measures is that they allow for perfect decisions. For a perfect classifier, the false-negative and true-negative rates across all groups are 0% and 100%, respectively. This does not mean that the accuracy of a real-world classifier is not limited by an additional fairness constraint, but it shows that, for example, equalized odds is usually better aligned with optimization than demographic parity.
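As a rough illustration of these error-based measures, the following Python sketch computes the false-negative and true-negative rate per tomato color. The function name and the toy data are made up for this example, and the sketch assumes that every color group contains at least one fresh and one rotten tomato.

```python
def group_rates(colors, is_fresh, accepted):
    """Return {color: (false_negative_rate, true_negative_rate)}.

    A false negative is a fresh tomato that lands in the Discard bin;
    a true negative is a rotten tomato that lands in the Discard bin.
    Assumes each color has at least one fresh and one rotten tomato.
    """
    rates = {}
    for color in sorted(set(colors)):
        rows = [(f, a) for c, f, a in zip(colors, is_fresh, accepted) if c == color]
        fresh = [a for f, a in rows if f]
        rotten = [a for f, a in rows if not f]
        fnr = sum(1 for a in fresh if not a) / len(fresh)
        tnr = sum(1 for a in rotten if not a) / len(rotten)
        rates[color] = (fnr, tnr)
    return rates


# Toy batch: 10 red tomatoes (8 fresh, 2 rotten), 8 yellow (4 fresh, 4 rotten).
colors = ["red"] * 10 + ["yellow"] * 8
is_fresh = [True] * 8 + [False] * 2 + [True] * 4 + [False] * 4

# An imperfect sorter that makes more mistakes on yellow tomatoes.
accepted = [True] * 7 + [False] * 3 + [True] * 3 + [False] + [True] + [False] * 3

imperfect = group_rates(colors, is_fresh, accepted)
# A perfect classifier accepts exactly the fresh tomatoes.
perfect = group_rates(colors, is_fresh, is_fresh)
```

Equality of false-negatives only compares the first number of each pair across colors, whereas equalized odds requires both numbers to match. The perfect classifier satisfies both criteria automatically, which is the alignment with accuracy that the paragraph above describes.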
The Accuracy Versus Fairness Trade-Off
Fairness always comes at a cost: as we put an additional constraint on the model, we introduce a trade-off with accuracy. This is not a new phenomenon and a stark example relates to a fatal Uber accident in Tempe, Arizona. The autonomous vehicle system detected the pedestrian in time to stop but the developers had tweaked the emergency braking system in favor of not braking too much, balancing a trade-off between jerky driving and safety.
Going back to bias: compared with a model that maximizes total revenue, a fairness-constrained model will probably promise less profit. You can explore these concepts with an interactive threshold classifier, including demographic parity and equal opportunity (that is, equal true positive rates), in a post from Google Research. By setting various global or group-aware thresholds for giving out hypothetical loans, you can see how the bank’s profit and the distribution of loans across the population change.
Call for action
We have seen that there are numerous differences between fairness definitions and one is not necessarily better than the other. In our forthcoming posts about fair synthetic data, we shall focus on group fairness and on statistical parity first. But before that, we are curious: how does your company approach fairness in AI? Do you have measures in place already or are you just starting to evaluate possible data bias and your modeling pipelines? If so, what resources do you consider most useful and what resources do you wish you had? We would love to hear your thoughts.
You may remember that back in 2015 Google was called out for its photo app that mistakenly labeled pictures of people with darker skin color as “gorillas”. As you can imagine, it was a PR disaster back then. Of course, the company publicly apologized, said that such a result is unacceptable and promised to fix the mistake. But apparently – as Wired uncovered three and a half years later – they somehow never got to truly fixing the underlying issue. What they did instead was implement a workaround: they blocked their AI from identifying gorillas (and some other primates) altogether to prevent another miscategorization.
A Google spokesperson confirmed to Wired that certain image categories were and remained blocked after the incident in 2015 and added that “Image labeling technology is still early and unfortunately it’s nowhere near perfect”. But what does this mean for other companies and their chances of success in fighting bias in AI, if one of the biggest tech companies – which employs some of the brightest AI experts – was not able to come up with a better solution? Firstly, it definitely proves the point that it is extremely difficult to mitigate bias in machine learning models. Secondly, it may raise the question of whether Google really was not capable of resolving the issue, or whether they just were not willing to dedicate the necessary resources. But before we look into how companies could tackle bias in Artificial Intelligence (and whether AI regulations could be the motivational factor to do so), let us start with a (non-exhaustive) list of reasons why algorithms are biased.
Reason #1: Insufficient Training Data
As mentioned in part 1 of our Fairness Series, a major cause of bias in AI is that not enough training data was collected. Or more precisely, that only limited data is available for certain demographic groups or groups with extraordinary characteristics. The consequences of insufficiently diverse data can easily be observed with facial recognition technology. A study showed that models performed significantly better on pictures of white males (99% accuracy) than on pictures of black females (65%), because the majority of images used in model training showed white men.
Reason #2: Humans Are Biased – And So Is The Data That AI Is Trained On
Whether we like it or not, as humans we all carry our (un)conscious biases that are reflected in the data we collect about our world. As this is the exact same data that is used to train AI models, it is not surprising that these biases find their way into algorithms. Imagine a hiring algorithm that is trained on existing U.S. employment data. Last year, women accounted for only 5% of CEOs in the top 500 companies. They also held significantly fewer senior management positions than their male co-workers. What would this mean for the algorithm? Quite likely, it would pick up that being female correlates poorly with being a CEO. And if hiring managers were to look for the ideal candidate to fill an open senior management position, the system would probably show mainly the résumés of male applicants.
Another common problem with human bias occurs in the context of supervised machine learning, where humans oftentimes label the data that is used to train a model. Even if they are well-intentioned and do not mean any harm, their unconscious biases could sneak into the training sample.
Reason #3: De-Biasing Data Is Exceptionally Hard To Do
If you wanted to have a fair algorithm – but we have just established that historical data is biased – what if you cleaned the data to make it fair? One approach that has been tried is removing sensitive attributes, for example, a person’s race. Unfortunately, research has shown that this does not prevent models from becoming biased. Why? Because of correlated attributes that can be used as proxies.
Think about a neighborhood that is known to be home to predominantly black people. Even if the race column were excluded from the training data, the ZIP code of this neighborhood would serve as a proxy that indicates the race of a person. It has been shown that even if sensitive columns were removed, proxies allowed for systematic discrimination against minorities, for example, the denial of bank loans or of access to Amazon’s same-day purchase delivery option.
To counteract this, some researchers advise actually keeping the sensitive columns in the dataset, as they could serve as a more straightforward lever to mitigate bias. For example, if you aim for a model that treats males and females equally, you can use the gender column to directly monitor and correct for potential violations of your desired equality criteria during model training.
Reason #4: De-Biasing AI Models Is Very Difficult Too
There is a multitude of reasons why it is difficult to develop a machine learning model that is free of bias. One aspect to consider is that there are many decisions involved in the construction of a model that could potentially introduce bias – but the downstream impacts of them oftentimes do not become apparent until much later. For example, the choices AI researchers made in regards to how speech was analyzed and modeled led to the outcome that a speech recognition algorithm performed significantly worse for female speakers as opposed to male ones.
Another aspect that has been criticized is that common practices in deep learning are not designed to help with bias detection. Even though models are usually tested before they are deployed, this oftentimes happens with a holdout sample from the training dataset. While this certainly helps with the evaluation of an algorithm’s accuracy, it does not help with bias detection since the data that is used for testing is as biased as the data that is used during training.
Lastly, building an unbiased model requires expert knowledge that not every AI engineer may have obtained (yet). Especially as more and more “off the shelf”-algorithms become available nowadays, that can be used by non-experts, this becomes an additional point of concern.
Reason #5: Diversity Amongst AI Professionals Is Not As High As It Should Be
Another quite peculiar incident that can be attributed to non-diversity happened to a South Korean woman, who was sleeping on her floor until a robotic vacuum cleaner “attacked” and ingested her hair. Rest assured that firefighters managed to rescue her (minus about 10 strands of her hair). But if the product development team had consisted of a more diverse group of people – with different cultural backgrounds – somebody might have raised the question of whether all future users tend to sleep in beds. If something is not part of a person’s reality, it is hard to think about it, consider it, and ask the necessary questions.
Reason #6: Fairness Comes At A Cost (That Companies May Not Be Willing To Pay)
Depending on what is most important for a company that is developing an AI algorithm, it could optimize the model to maximize profits, to increase revenue, or to grow the number of customers. No matter what they decide on, their main objective will be to improve the model’s accuracy.
However, what would happen if the company decided that it wanted to have a fair model as well? Then the model would be forced to balance two conflicting objectives. Achieving fairness would inevitably come at the cost of maximum accuracy.
In our economy companies often tend to optimize for profit. Thus, it can be put into question how many businesses would voluntarily decide to take the path of fairness or whether regulations would be required to
Reason #7: External Audits Could Help – If Privacy Would Not Be An Issue
Especially in scenarios where AI applications are used in high-stakes environments, voices have been raised that external audits should be used to systematically vet algorithms to detect potential biases. This may be an excellent idea – if privacy were not an issue. To thoroughly evaluate an algorithm, access not only to the model but also to the training data would be beneficial. But if a company shared the privacy-sensitive customer data it used to develop its model, it would quickly come into conflict with GDPR, CCPA, and other privacy regulations.
Synthetic data – a new approach to big data anonymization – could provide a solution for this issue. Synthetic data tools make it possible to generate fully anonymous, yet completely realistic and representative datasets. Their high accuracy enables an organization to train its machine learning models directly on them, while the strong privacy protection properties make it possible to share synthetic datasets with external auditors without infringing people’s privacy.
Reason #8: Fairness Is Hard To Define
In the 1970s only 5% of musicians in the top five orchestras were female. Blind auditions increased the percentage of women to 30%, which certainly is an improvement – but many people would agree that this is not yet fair. However, it would be much harder to reach an agreement about what would be fair. Should there be 50% women in the orchestra, because roughly half of our world’s population is female? Or would it be fairer if the same percentage of female and male applicants gets accepted? For example, 20% each. Considering that many modern orchestras employ approximately 100 full-time musicians, this could mean that 40 seats go to female musicians and 60 to male ones (if 200 women and 300 men were to apply). Others might argue that due to centuries of injustice (and overrepresentation of male musicians in orchestras) employing significantly more women would be fairest.
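The arithmetic behind the two competing rules can be written out in a few lines of Python. The applicant numbers and the 100 seats come from the example above; the variable names are made up for this sketch.

```python
applicants = {"women": 200, "men": 300}
seats = 100

# Rule A: the same acceptance rate for both groups.
acceptance_rate = seats / sum(applicants.values())  # 100 seats / 500 applicants
equal_rate_seats = {
    group: round(count * acceptance_rate) for group, count in applicants.items()
}

# Rule B: the orchestra mirrors a 50/50 population split.
equal_share_seats = {group: seats // 2 for group in applicants}
implied_acceptance = {
    group: equal_share_seats[group] / count for group, count in applicants.items()
}

print(equal_rate_seats)    # seats per group under rule A
print(implied_acceptance)  # acceptance rate per group under rule B
```

Under rule A the seats split 40 to 60, while rule B gives each group 50 seats but implies an acceptance rate of 25% for women and roughly 17% for men. Both rules sound fair in isolation, yet they cannot be satisfied at the same time for this applicant pool.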
As you can see, it is pretty hard to define fairness. One reason is that different people have different values. Another one is that there are so many different ways to define fairness – in general, as well as mathematically (Arvind Narayanan, an associate professor of computer science at Princeton, even compiled an astonishing list of 21 fairness definitions).
Reason #9: What Was Fair Yesterday Can Be Biased
Do you remember Microsoft’s “Tay”? The innocent AI chatbot started as a harmless experiment and was intended to learn from conversations with Twitter users – which it did (but probably not as imagined). In less than a day, Tay became misogynistic and racist:
Microsoft quickly decided to take Tay offline. What remains is a cautionary example that even if you take measures to mitigate bias during the initial training phase, many algorithms are designed to learn continuously and are thus potentially vulnerable to becoming biased over time. (This is an issue where external audits would also be stretched to their limits if they are not designed for constant monitoring.)
Reason #10: The Vicious Bias Cycle (Biased AI Will Lead To More Bias)
If bias in AI is not successfully addressed, it will perpetuate and potentially even amplify biases in our society. A good example of this is Google’s search algorithm that was accused of showing racist images. When users searched for terms like “black hands”, the algorithm predominantly showed pictures of black hands working in the earth and other derogatory depictions. It can be assumed that more users will click on the top search results as opposed to potentially more neutral images which are not shown on the first page. Consequently, the algorithm is much more likely to display them more often and thus would further contribute to perpetuating biases.
How To Achieve Fairness In AI?
Precisely answering this question would go beyond the scope of this post. However, a first and very important step would be that society demands fairness in AI and puts it on the agenda of regulators (which subsequently will improve the chance that de-biased AI also makes it on the priority list of conscientious companies). Once the relevant stakeholders have decided that they want to have anti-discriminatory algorithms, a next step would be to define fairness and to establish a shared understanding about which outcomes would be considered ethically acceptable.
Thirdly, researchers and AI practitioners should continue to collaborate on the development of solutions to reduce AI bias. As diverse training data takes on such an important role in the mitigation of bias, it is imperative that companies start to collect data that reflects the full spectrum of human diversity. Additionally, removing bias already at its source – namely, in existing datasets – would allow developing models without having to be concerned about bias correction anymore. Fair Synthetic Data seems to be particularly promising in this regard, but we will dive deeper into this topic in part 4 of our Fairness Series.
Another important aspect is diversity in teams. However, achieving it may not be the easiest task considering not only the gender gap but also the underrepresentation of ethnic minorities in data science. Thus, companies might not be able to tackle this issue alone but will also be dependent on universities and governments undertaking efforts to make AI and data science education more attractive and inclusive.
At this point, we would love to hear your thoughts and ideas on how fairness in AI could be achieved. Message us anytime for a fun conversation or a friendly debate. Tomorrow, part 3 of our Fairness Series will follow where we shine the spotlight on the concept of fairness and how it can be defined.
In November 2019 a Danish Apple Card user uncovered that Apple’s AI algorithm granted him 20 times the credit limit that his wife received. This disparity came as a major surprise as the couple shared assets and she actually had a higher credit score than he did. Apple and Goldman Sachs, who partnered on this financial product, repeatedly assured that this wasn’t a case of discrimination. But what started with a viral tweet brought more and more cases of bias to the surface – and ultimately led to an investigation by the New York State Department of Financial Services.
Another high-profile issue was investigated by ProPublica: an algorithm used by the US criminal justice system that predicts the risk of defendants re-offending once they have served their sentences. Based on the algorithm’s results, judges determine which defendants are eligible for probation or treatment programs. In ProPublica’s study, the authors demonstrate that the algorithm is biased with respect to ethnicity:
“Afro-Americans are almost twice as likely as whites to be labeled a higher risk but not actually re-offend. It [the algorithm] makes the opposite mistake among whites: They are much more likely than blacks to be labeled lower risk but go on to commit other crimes”.
While Artificial Intelligence has remarkable potential to analyze and identify patterns even in very large datasets and to help humans make more informed decisions, the examples mentioned above are by far not the only incidents in which customers or researchers have been concerned about machine learning models exhibiting discriminatory behavior based on gender, race or
But Why Is AI Biased?
Simply put, there is not one root cause for bias in AI – and that is why it is so difficult to get rid of it. One of the major problems is insufficient training data where some demographic groups are missing or underrepresented. For example, one study by MIT researchers revealed that facial recognition technology had higher error rates for minorities – particularly if they were female. Another study found that a facial recognition system was 99% accurate in detecting male faces, but was only capable of correctly recognizing a black woman two out of three times. This difference in performance originated from the dataset that was used to train the model: there were 75% male faces but only 25% female ones in the training sample, and 80% of all images showed white persons. Naturally, the algorithm learned to identify those categories better for which significantly more data was available.
Similar issues can be observed when AI is used for recruiting purposes without making sure that a substantial amount of training data for a diverse group of people is used during model training. Thus, concerns have been raised that a remote video interview software that evaluates employability based on facial expression, speech patterns, and tone of voice could unfairly disqualify disabled people. Possibly even more concerning (and potentially fatal) are scenarios where AI is applied in healthcare. One example is the British health app “Babylon” that was accused of putting female users at risk. It suggested that sudden pain in the left arm and nausea may be due to depression or a panic attack and advised women to go see a doctor in the next few days. In contrast, males showing the same symptoms were advised to immediately visit an emergency department based on the diagnosis of a possible heart attack.
According to Prof. Dr. Sylvia Thun, director of eHealth at Charité of the Berlin Institute of Health, “there are huge data gaps regarding the lives and bodies of women”. In a recent Forbes interview, she explained that medical algorithms are oftentimes based on U.S. military data – an area where women, in some cases, only represent 6% of the total personnel. Thus she emphasized the importance of making sure that medical apps take relevant data not only from men but also from women into account.
Artificial Intelligence Is A Human-made Problem
However, AI bias is not always a consequence of limited training data. To a certain extent, all humans carry (un)conscious biases and behave accordingly. Thereby our human biases find their way into the historical data that is used to train algorithms – and thus it is not surprising that they get picked up by machine learning models. An example of this is Amazon’s recruiting algorithm, which learned that – historically – the majority of technical roles were filled by males and thus penalized a resume if it included the word “women”. Another is Google’s discriminatory job ads algorithm, which disproportionately showed high-paying job ads to men but not to women.
Another example comes from the U.S. health care system where AI is deployed to guide healthcare decisions. A study found that a widely used algorithm discriminated against black people by estimating the level of care that is needed based on health costs. Due to the fact that more money is spent on white patients, the algorithm concluded that black patients are healthier and don’t require the same amount of extra care.
Human bias in historical data is an issue that needs to be addressed when developing AI algorithms. But it also makes apparent that simply refraining from applying AI in our day-to-day lives wouldn’t solve our society’s problem of discrimination and unfair treatment of minorities. Long before AI found its way into today’s economy, researchers documented cases of injustice due to race, gender, or sexual orientation. Examples range from hiring managers who invited people with white-sounding names to a job interview 50% more often than people with black-sounding names, to women, who in general are 47% more likely to suffer a serious injury and 17% more likely to die in a car accident because seatbelts and other safety features in cars were designed based on crash dummies with male physiques.
In fact, quite the opposite might be true: As AI algorithms require us to be completely clear and to precisely define which outcomes we would consider fair and ethically acceptable, there might be the inherent potential that this new technology will enable us to better mitigate bias in our society. This rationale is also shared by Sendhil Mullainathan, Professor at the University of Chicago, who authored several studies on bias (in people and in AI). In a recent New York Times article he stated that:
“Changing algorithms is easier than changing people: software on computers can be updated; the “wetware” in our brains has so far proven much less pliable. None of this is meant to diminish the pitfalls and care needed in fixing algorithmic bias. But compared with the intransigence of human bias, it does look a great deal simpler.”
Why Should Business Leaders Care About Bias in AI?
As a society, we should strive to develop AI technology that is effective and fair for everyone. But businesses – which will become increasingly reliant on machine learning algorithms – will also benefit if they proactively tackle bias in AI. One of the more obvious reasons is that the mere suspicion of an underlying algorithm being biased could be enough to turn customers against a product or a company. Moreover, having researchers expose the discriminatory nature of a proprietary AI application constitutes a significant reputational risk that could be hard to recover from.
Another point to consider is that developing an algorithm that accurately performs on the whole spectrum of human diversity is also much more likely to deliver superior value to a broader and varied group of potential customers. But not only customers would benefit from unbiased AI. According to Gartner, “By 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them. This is not just a problem for gender inequality – it also undermines the usefulness of AI”.
Ultimately, the ongoing discussion about anti-bias regulations, the issued Ethical AI guidelines and the call for AI certifications could be motivating factors to approach this topic proactively, especially since the European Commission emphasized the importance of having “requirements to take reasonable measures aimed at ensuring that [the] use of AI systems does not lead to outcomes entailing prohibited discrimination” in its white paper on Artificial Intelligence, which was just released in February 2020.
handing over the CEO role to me and will continue to shape the long-term vision of Mostly AI as Chief Strategy Officer. I’m incredibly excited and humbled as I step into Michael’s footsteps in order to continue growing Mostly AI together with him and the two co-founders Klaudius and Roland.
Mostly AI 3 years ago and were among the first to fully grasp the enormous potential that lies in utilizing generative AI for creating privacy-preserving structured synthetic data. 3 years ago this was just that: a visionary idea, unclear if technically possible. Today we know how powerful this is and how well it works – and it works really well! I know from first-hand experience how difficult it is to start a company and I can’t emphasize enough how impressed I am by what these three pioneers of synthetic data have built since joining forces. Taking this company to the next level is an enormous responsibility and I’m truly thankful for the trust the founders and investors are putting in me.
It was in the fall of 2017, 2.5 years ago, when I spoke with Michael at a data science meetup, where he shared for the first time what they were building at Mostly AI. My initial reaction – and forgive me, for I was a layman then – was: what’s wrong with traditional ways to anonymize data? But it only took Michael a minute to convince me of the shortcomings of classical ways of anonymizing data in the era of big data, and I was shocked once I understood the problem. What I remember very vividly was my instant fascination with Michael’s vision. He was able to depict a future world where insights from a multitude of data sources could be fully leveraged while protecting the privacy of individuals.
I met Klaudius and Roland in 2018 and as the conversations got deeper, my desire to become a part of Mostly AI grew steadily. In May 2019 I started working for Mostly AI, first as an external consultant and soon thereafter as the company’s Chief Operating Officer. The last year has really flown by and it’s impossible to adequately recap the many highlights and accomplishments we have had at Mostly AI without going beyond the scope of this post. Let me just share that on a personal level it has been deeply satisfying and impactful. In my career of almost 15 years, I’ve held many different positions and worked for a variety of fantastic organizations. I don’t need to think for a second to say that the year at Mostly AI has been the most thrilling and fulfilling professional year for me so far.
My previous roles were always centered around starting new or building and growing businesses. I’ve either done this as a consultant, intrapreneur, or many years ago with my own start-up. It’s great to see how all these experiences have now put me in a sweet spot to help take Mostly AI to the next level. We have a multitude of things we are working on including feature upgrades to our core product Mostly GENERATE and the launch of our SaaS offering which I’m particularly excited about. We can’t wait to share all of this with the world in the months and years ahead.
I very much look forward to working for many more years with the outstanding and growing team at Mostly AI, our clients, partners, and investors on shaping the synthetic data revolution together. While times right now are challenging for everyone, it’s amazing to see how well we cope with the situation as a team and company.
PS: I also look forward to writing on this blog more frequently. Watch out for posts about the business side of synthetic data, our company, and all things scaling a deep-tech start-up.
What are the most important cybersecurity trends in 2020? How can CISOs identify the right security and privacy solutions for their company? And is it really possible to securely anonymize the location data that is currently being shared to combat the spread of COVID-19?
To answer these and more questions, SOSA’s Global Cyber Center (GGC) invited our CEO Michael Platzer to join them on their Cyber Insights podcast for an interview. For those of you who don’t know SOSA: it’s a leading global innovation platform that helps corporates and governments alike to build and scale their open innovation efforts. What follows is a transcript of the podcast episode, but if you prefer to watch the video you can access it on the GGC blog.
William (Senior Analyst at GGC): Tell us about Mostly AI! What do you do and how do you do it?
Michael: Thanks for having me! We are Mostly AI and we are a deep-tech startup founded here in Europe while preparing for GDPR. Very early on, we had this realization that synthetic data will offer a fundamentally new approach to data anonymization. The idea is quite simple. Rather than aggregating, masking or obfuscating existing data, you would allow the machine to generate new data or fake data. But we rather prefer to say “AI-generated synthetic data”. And the benefit is, that you can retain all the statistical information of the original data, but you break the 1:1 relationship to the original individuals. So you cannot re-identify anymore – and thus it’s not personal data anymore, it’s not subject to privacy regulations anymore. So you are really free to innovate and to collaborate on this data – but without putting your customers’ privacy at risk. It’s really a fundamental game-changer that requires quite a heavy lifting on the AI-engineering side. But we are proud to have an excellent team here and to really see that the need for our product is growing fast.
William: Wonderful, now Michael, when you think about the broad array of cybersecurity trends that are unfolding today – ranging from new threats to new regulations – what is really top of mind for you in 2020?
Michael: Of course, the current situation is heavily dominated by the corona crisis. But we see two trends at the moment. For one, the value of big data is broadly recognized. Most of the media coverage, the decision-makers, the politicians, everyone is talking about data, data, data. There is an increased and broader understanding of the value of the data throughout the whole society to combat the crisis. And for good reason, because this is where you’ll find the answers. But you also need to be able to share data, to share data across borders, to share data between the regions, between cities and so forth to make the best out of the situation, to solve the crisis fast. Secondly, we also observe that organizations are forced to go into a remote mode. The lines blur a lot more between whether employees are sitting at the office or at home, whether you work together with internal team members or external partners. But this actually only increases the need for privacy and security solutions. So, organizations need to have privacy by design. And they need to have that fast, now that we are in an increasingly remote collaboration mode.
William: Very interesting! Now, we know that location data is among our most accessible PII – we kind of give it out all the time via our mobile device. In the wake of the coronavirus, we are seeing calls to use our location data to track the spread of this pandemic. Is it possible to really effectively anonymize and secure our location data? Or can this data just be reverse engineered? Could using synthetic data help?
Michael: Yes definitely, and we are also engaging with decision-makers at this moment in this crisis. Location data is incredibly difficult to anonymize. There have been enough studies that show how easy it is to re-identify location traces. So what organizations end up with is only sharing highly aggregated count statistics. For example, how many people are at which time at which location. But you lose the dimension at the individual level. And this is so important if you want to figure out what type of socio-demographic segments are adapting to these new social distancing measures, and for how long they do that. And is it 100% of the population that’s adapting, are social contacts reducing by 60% or is it maybe a tiny fragment of segments that is still spreading the virus? To get to this kind of level of intelligence you need to work at a granular level. So not on an aggregated level, but on a granular level. Synthetic data allows you to retain the information on a granular level but break the tie to us individually. We just, coincidentally, in February wrote a blog post on synthetic location traces – so before the corona crisis started – because we were researching this for the last year. It’s on our company blog and I can only invite people to read it. Super exciting new opportunities now to anonymize location traces!
William: That is exciting – and it sounds as if it could be very helpful, especially given what we are all going through! Now, Michael, there is an expanding list of techniques to protect data today: encryption schemes, tokenization, anonymization, etc. Should CISOs look at the landscape as a “grocery shelf” with ingredients to be selected and combined or should they search for one technique to rule them all?
Michael: Well, I don’t believe that there is a one-size-fits-all solution out there. And those different solutions really serve different purposes. It’s important to understand that encryption allows you to safely share data with people that you trust – or you think that you trust. Whether that’s people or machines, at the end, there is someone sitting who is decrypting the data and then has access to the full data. And you hope that you can trust the person. Now, synthetic data allows you to share data with people where you don’t necessarily need to rely on trust, because you have controlled for the risk of a privacy leak. It’s still super valuable, highly relevant information. It contains your business secrets, it contains all the structure and correlations that are available to run your analytics, to train your machine learning algorithms. But you have zeroed out your privacy risk! In that sense, synthetic data and encryption serve two different purposes. So every CISO needs to see what their particular challenge and problem is that needs to be overcome.
William: Michael, we’re coming up on our time here. Are there any concluding remarks or anything you would like to add before we hang up?
Michael: Well, we just closed our financing round so we’re set for further growth both in Europe as well as the US. We’re excited about the growing demand for data anonymization solutions, also for our solution. Happy to collaborate with innovative companies, who take privacy seriously. And of course, I wish everyone best of health and that we get – also as a global community – just stronger out of the current crisis.
Whether you’ve just heard about Synthetic Data at the last conference you attended or are already evaluating how it could help your organization to innovate with your customer data in a privacy-friendly manner, this mini video series will cover everything you need to know about Synthetic Data:
What is it? (Pssst, spoiler: a fundamentally new approach to big data anonymization)
Why is it needed?
Why classic anonymization fails for big data (and how relying on it puts your organization at risk)
How Synthetic Data helps with privacy protection
Why it is important that it is AI-generated Synthetic Data & how to differentiate between different types
And lastly, Synthetic Data use cases and insights on how some of the largest brands in the world are already using Synthetic Data to fuel their digital transformation
1. Introduction to Synthetic Data
2. Synthetic Data vs. Classic Anonymization
3. What is Synthetic Data? And why is AI-generated Synthetic Data superior?
4. Why Synthetic Data helps with Big Data Privacy
5. How Synthetic Data fuels AI & Big Data Innovation
It’s 2020, and I’m reading a 10-year-old report by the Electronic Frontier Foundation about location privacy that is more relevant than ever. Seeing how prevalent the bulk collection of location data would become, the authors discussed the possible threats to our privacy as well as solutions that would limit this unrestricted collection while still allowing us to reap the benefits of GPS-enabled devices.
Some may have the opinion that companies and governments should stop collecting sensitive and personal information altogether. But that is unlikely to happen and a lot of data is already in the system, changing hands, getting processed, and analyzed over and over again. This can be quite unnerving but most of us do enjoy the benefits of this information sharing like receiving tips on short-cuts through the city during the morning rush hour. So how can we protect the individual’s rights and make responsible use of location data?
Until very recently, the main approach was to anonymize these data sets: hide rare features, add noise, aggregate exact locations into rough regions or publish only summary statistics (for a great technical but still accessible overview, I can recommend this survey). Cryptography also offers tools to keep most of the sensitive information on-device and only transmit codes that compress use-case relevant information. However, these techniques make a big trade-off between privacy, accuracy, and utility of the modified data. Even after this preprocessing, if the original data retains any of its utility then the risk of successfully re-identifying an individual is extremely high.
“D.N.A. is probably the only thing that’s harder to anonymize than precise geolocation information.”
Paul Ohm, law professor and privacy researcher at the Georgetown University Law Center. Source: NYT
Time and again, we see how subpar anonymization can lead to high-profile privacy leaks and this is not surprising, especially for mobility data: in a paper published in Scientific Reports, researchers showed that 95% of the population can be uniquely identified just from four time-location points in a data set with hourly records and spatial resolution given by the carrier’s antenna. In 2014, the publication of supposedly safe pseudonymized taxi trips allowed data scientists to find where celebrities like Bradley Cooper or Jessica Alba were heading by querying the data based on publicly available photos. Last year, a series of articles in the New York Times highlighted this issue again: from a data set with anonymized user IDs, the journalists captured the homes and movements of US government officials and easily re-identified and tracked even the president.
Mostly AI’s solution for privacy-preserving data publishing is to go synthetic. We develop AI-based methods for modeling complex data sets and then generate statistically representative artificial records. The synthetic data contains only made-up individuals and can be used for test and development, analysis, AI training, or any downstream tasks really. Below, I will explain this process and showcase our work on a real-world mobility data set.
The Difficulties With Mobility Data
What is a mobility data set in the first place? In its simplest form, we might be talking about single locations only where a record is a pair of latitude and longitude coordinates. In most situations though, we have trips or trajectories which are sequences of locations.
The trips are often tagged by a user ID and the records can include timestamps, device information, and various use-case specific attributes.
The aim of our new Mobility Engine is to capture and reproduce the intricate relationship between these attributes. But there are numerous issues that make mobility data hard to model.
Sparsity: any given latitude/longitude pair appears only once or a few times in the data set, especially in high-granularity records.
Noise: GPS recordings can include a fair amount of noise so even people traveling the exact same route can have quite different trajectories recorded.
Different scales: the same data set could include pedestrians walking in a park and people taking cabs from the airport to the city, so the distance between consecutive data points can vary widely.
Sampling rate: making modeling even more difficult, even short trips can contain hundreds of recordings and long trips might sample very infrequently.
Size of the data set: the most useful data sets are often the largest. Any viable modeling solution should handle millions of trips with a reasonable turnaround time.
Our solution can overcome all these difficulties without the need to compromise on accuracy, privacy or utility. Let me demonstrate.
Generating Synthetic Trajectories
The Porto Taxi data set is a public collection of a year’s worth of trips by 442 cabs running in the city of Porto, Portugal. The records are trips, sequences of latitude/longitude coordinates recorded at 15-second intervals, with some additional metadata (such as driver ID or time of day) that we won’t consider now. There are short and long trips alike and some of the trajectories are missing a few locations so there could be rather big jumps in them.
Given this data, we had our Mobility Engine sift through the trajectories multiple times and learn the parameters of a statistical process that could have generated such a data set. Essentially, our engine is learning to answer questions like
“What portion of the trips start at the airport?” or
“If a trip started at point A in the city and turned left at intersection B, what is the chance that the next location is recorded at C?”
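To make these two questions a bit more tangible, here is a minimal sketch of how such probabilities could be estimated by plain counting over discretized trips. This is a deliberately simplified illustration and not a description of how the Mobility Engine works internally; the cell labels, the toy trips, and the function names (`fit_transition_counts`, `start_share`, `next_probability`) are all hypothetical.

```python
from collections import Counter, defaultdict

def fit_transition_counts(trips):
    """Count start cells and cell-to-cell transitions over a list of trips.

    Assumption: each trip is already discretized into a sequence of
    hashable cell labels (e.g. intersections or grid cells).
    """
    starts = Counter()
    transitions = defaultdict(Counter)
    for trip in trips:
        if not trip:
            continue
        starts[trip[0]] += 1
        for current, nxt in zip(trip, trip[1:]):
            transitions[current][nxt] += 1
    return starts, transitions

def start_share(starts, cell):
    """Portion of trips that start in the given cell."""
    total = sum(starts.values())
    return starts[cell] / total if total else 0.0

def next_probability(transitions, current, nxt):
    """Estimated chance that the location after `current` is `nxt`."""
    total = sum(transitions[current].values())
    return transitions[current][nxt] / total if total else 0.0

# Hypothetical toy trips over labeled cells
trips = [
    ["airport", "A", "B", "C"],
    ["airport", "A", "B", "D"],
    ["center", "A", "B", "C"],
    ["center", "B", "C"],
]
starts, transitions = fit_transition_counts(trips)
airport_share = start_share(starts, "airport")        # 2 of 4 trips
p_b_to_c = next_probability(transitions, "B", "C")    # 3 of 4 moves out of B
```

Note that this sketch conditions only on the current cell, whereas the second question above also conditions on where the trip started. Capturing such longer histories is exactly where simple counting stops scaling and a learned model becomes necessary.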
You can imagine that if you are able to answer a few million of these questions then you have a good idea about what the traffic patterns look like. At the same time, you would not learn much about a single real individual’s mobility behavior. Similarly, the chance that our engine is reproducing exact trips that occur in the real data set, which in turn could hurt one’s privacy, is astronomically small.
For this post, we trained on 1.5 million real trajectories and then had the model generate synthetic trips. We produced 250’000 artificial trips for the following analysis, but with the same trained model, we could just as easily have generated 250 million trips.
First, for a quick visual, we plotted 200 real trips recorded by real taxi drivers (on the left, in blue) and 200 of the artificial trajectories that our model generated (on the right, in green). As you can see, the overall picture is rather convincing with the high-level patterns nicely preserved.
Looking more closely, we see very similar noise in the real and synthetic trips on the street level.
In general, we keep three main things in mind when evaluating synthetic data sets.
Accuracy: How closely does the synthetic data follow the distribution of the real data set?
Utility: Can we get competitive results using the synthetic data in downstream tasks?
Privacy: Is it possible that we are disclosing some information about real individuals in our synthetic data?
As for accuracy, we compare the real and synthetic data across several location- and trip-level metrics. First, we require the model to accurately reproduce the location densities, the ratio of recordings at a given spatial area, and hot-spots at different granularity. There are plenty of open-source spatial analysis libraries that can help you work with location data, such as skmob, geopandas, or Uber’s H3, which we used to generate the hexagonal plots below. The green-yellow-red transition marks how the city center is visited more frequently than the outskirts with a clear hot-spot in the red region.
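As a rough illustration of what comparing location densities can look like, the sketch below bins coordinates into cells and measures how much the two resulting distributions differ. It makes several assumptions that are not from our actual evaluation: it uses a plain square grid as a stand-in for H3 hexagons, the cell size of 0.01 degrees is arbitrary, the total variation distance is just one possible summary, and the coordinates are made up.

```python
import math
from collections import Counter

def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell (a stand-in for H3 hexagons)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def density(points, cell_deg=0.01):
    """Share of recordings that fall into each grid cell."""
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in points)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

def total_variation(p, q):
    """Total variation distance: 0 = identical densities, 1 = disjoint."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)

# Made-up recordings around Porto, for illustration only
real = [(41.155, -8.615), (41.155, -8.615), (41.165, -8.625), (41.175, -8.635)]
synthetic = [(41.156, -8.614), (41.154, -8.616), (41.166, -8.624), (41.185, -8.645)]

tv = total_variation(density(real), density(synthetic))
print(f"total variation distance: {tv:.2f}")
```

Repeating the comparison with several cell sizes gives a feel for whether hot-spots are preserved at different granularities.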
From the sequences of latitude and longitude coordinates, we derive various features such as trip duration and distance traveled, origin-destination distance, and a number of jump length statistics.
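A minimal sketch of how such trip-level features could be derived from raw coordinates is shown below. It assumes great-circle (haversine) distances and the 15-second sampling interval of the Porto data; the function names and the example trip are hypothetical, and a duration estimated from the number of points will be off for trips with missing recordings.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def trip_features(trip, interval_s=15):
    """Derive simple features from a trip given as a list of (lat, lon)."""
    jumps = [haversine_km(a, b) for a, b in zip(trip, trip[1:])]
    return {
        # Assumes one recording every `interval_s` seconds without gaps
        "duration_s": interval_s * (len(trip) - 1),
        "distance_km": sum(jumps),
        "od_distance_km": haversine_km(trip[0], trip[-1]),
        "max_jump_km": max(jumps, default=0.0),
        "mean_jump_km": sum(jumps) / len(jumps) if jumps else 0.0,
    }

# Hypothetical three-point trip
example_trip = [(41.15, -8.61), (41.15, -8.60), (41.16, -8.60)]
features = trip_features(example_trip)
print(features)
```

Computing these features for every real and every synthetic trip yields the distributions that are then compared side by side.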
The plots above show two of these distributions both for the real and synthetic trips. The fact that these distributions overlap almost perfectly shows that our engine is spot-on at reproducing the hidden relationships in the data. To capture a different aspect of behavior, we also consider geometric properties of paths: the radius of gyration measures how far on average the trip is from its center of mass and the straightness index is the ratio of the origin-destination distance to the full traveled distance. So, for a straight line, the index is exactly 1 and for more curvy trips it takes lower values with a round trip corresponding to straightness index 0. We again see that the synthetic data follows the exact same trends as the real data, even mimicking the slight increase in the straightness distribution from 0.4 to 1. I should stress that we get this impressive performance without ever optimizing the model for these particular metrics and so we would expect similarly high accuracy for other so far untested features.
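The two geometric measures are easy to compute yourself. The sketch below assumes that the trip has already been projected to planar x/y coordinates (for example in kilometers), uses the common root-mean-square form of the radius of gyration, and returns 0 for a trip that never moves, which is a convention chosen here. The function names are illustrative.

```python
import math

def radius_of_gyration(points):
    """Root-mean-square distance of the points from their center of mass."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def straightness_index(points):
    """Origin-destination distance divided by the full traveled distance."""
    traveled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if traveled == 0:
        return 0.0  # convention chosen here for trips without movement
    return math.dist(points[0], points[-1]) / traveled

straight = [(0, 0), (1, 0), (2, 0)]
round_trip = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

print(straightness_index(straight))    # straight line: index 1
print(straightness_index(round_trip))  # back at the origin: index 0
print(radius_of_gyration(straight))
```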
Regarding utility, one way we can test-drive our synthetic mobility data is to use it in practical machine learning scenarios. Here, we trained three models to predict the duration of a trip based on the origin and destination: a Support Vector Machine, a K-nearest neighbor regressor, and an LGBM regressor.
We trained the synthetic models only on synthetic data and the real models on the original trajectories. The scores in the plot came from testing all the models against a holdout from the original, real-life data set. As expected, the synthetically trained models performed slightly worse than the ones that have seen real data but still achieved a highly competitive performance.
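The evaluation protocol itself (train on synthetic, test on real) fits in a few lines. The sketch below is a toy version under loud assumptions: the data comes from a made-up generator rather than from the Porto data set, the "synthetic" training set is simply a noisier draw from the same toy process, and the only model is a small k-nearest-neighbor regressor written from scratch rather than the three models we actually used.

```python
import math
import random

def make_trips(n, rng, noise_s):
    """Toy generator: duration grows with origin-destination distance."""
    trips = []
    for _ in range(n):
        ox, oy, dx, dy = (rng.uniform(0, 10) for _ in range(4))
        duration = 120 * math.dist((ox, oy), (dx, dy)) + rng.gauss(0, noise_s)
        trips.append(((ox, oy, dx, dy), duration))
    return trips

def knn_predict(train, x, k=5):
    """Average the durations of the k training trips closest to x."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_absolute_error(train, test, k=5):
    return sum(abs(knn_predict(train, x, k) - y) for x, y in test) / len(test)

rng = random.Random(0)
real_train = make_trips(300, rng, noise_s=30)
real_holdout = make_trips(100, rng, noise_s=30)
# Hypothetical stand-in for generated data: same process, slightly noisier
synthetic_train = make_trips(300, rng, noise_s=45)

# Both models are scored against the same real holdout
mae_real = mean_absolute_error(real_train, real_holdout)
mae_synth = mean_absolute_error(synthetic_train, real_holdout)
print(f"trained on real: {mae_real:.1f}s, trained on synthetic: {mae_synth:.1f}s")
```

The important design choice is that the holdout is always real data that neither model has seen, so the two error values are directly comparable.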
Moreover, imagine you are an industry-leading company and shared the safe synthetic mobility data in a competition or hackathon to solve this prediction task. If the teams came up with the above-shown solutions and got the green error rates on the public, synthetic data, then you can rightly infer that the winning solution would also do best on your real but sensitive data set that the teams have never seen.
The privacy evaluation of the generated data is always a critical and complicated issue. Already at the training phase, we have controls in place to stop the learning process and thus prevent overfitting. This ensures that the model is learning the general patterns in the data rather than memorizing exact trips or overly specific attributes of the training set. Second, we also compare how far our synthetic trips fall from the closest real trips in the training data. In the plot below, we have a single synthetic trip in red, with the purple trips being the three closest real trajectories, then the blue and green paths sampled from the 10-20th and 50-100th closest real trips.
To quantify privacy, we actually look at the distribution of closest distances between real and synthetic trips. In order to have a baseline distribution, we repeat this calculation using a holdout of the original but so far unused trips instead of the synthetic ones. If these real-vs-synthetic and real-vs-holdout distributions differ heavily, in particular if the synthetic data falls closer to the real samples than what we would expect based on the holdout, that could indicate an issue with the modeling. For example, if we simply add noise to real trajectories, then comparing these two distributions will clearly flag this data set for a privacy leak. However, the data generated by our Mobility Engine passed these tests with flying colors.
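The idea behind this test can be sketched as follows. The code is a simplified illustration under several assumptions that are not part of our actual evaluation: trips are planar and of equal length, the trip distance is the mean point-wise distance, the two distributions are compared via their medians only, and the flagging threshold `ratio=0.5` is arbitrary. All names and data are hypothetical.

```python
import math
import statistics

def trip_distance(a, b):
    """Mean point-wise distance between two equal-length trips.

    Assumption: real trips would first be resampled to a common length.
    """
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def closest_distances(candidates, training):
    """Distance from each candidate trip to its nearest training trip."""
    return [min(trip_distance(c, t) for t in training) for c in candidates]

def looks_too_close(synthetic, holdout, training, ratio=0.5):
    """Flag synthetic data that sits much closer to training than holdout does."""
    syn = statistics.median(closest_distances(synthetic, training))
    ref = statistics.median(closest_distances(holdout, training))
    return syn < ratio * ref

def vertical_trip(x):
    return [(x, 0.0), (x, 1.0), (x, 2.0)]

training = [vertical_trip(x) for x in (0, 2, 4, 6, 8)]
holdout = [vertical_trip(x) for x in (1, 3, 5, 7, 9)]
# Real trips with a little noise added: should be flagged
noisy_copies = [[(x + 0.01, y) for x, y in trip] for trip in training]
# Trips about as far from training as the holdout: should pass
fresh = [vertical_trip(x + 0.9) for x in (0, 2, 4, 6, 8)]

print(looks_too_close(noisy_copies, holdout, training))
print(looks_too_close(fresh, holdout, training))
```

In practice one would compare the full distributions rather than a single summary statistic, but the holdout baseline is what makes the numbers interpretable in the first place.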
We should be loud about not compromising on privacy, no matter the benefits offered by sharing our personal information. Governments, operators, service providers, and others alike need to take privacy seriously and invest in technology that protects the individual’s rights during the whole life cycle of the data. We at Mostly AI believe that synthetic data is THE way forward for privacy-preserving data sharing. Our new Mobility Engine allows organizations to fully utilize sensitive locational data by producing safe synthetic data at a so-far unseen level of accuracy.
This research and development on synthetic mobility data is supported by a grant of the Vienna Business Agency (Wirtschaftsagentur Wien), a fund of the City of Vienna.