Being Open in the Era of Privacy

When Netflix announced a $1 million challenge for improving its recommendation engine in 2006, together with the release of 100 million movie ratings, little did they know what would happen next. Netflix had been working on recommender systems for years by then, were recognized for their business innovation as well as their technical excellence, had managed to hire the smartest data engineers and machine learning experts alike, and certainly knew all the ins and outs of movies, genres and popular actors. Still, they were eager to learn more.

The competition began on October 2nd, 2006. Within a mere six days, the first contestant succeeded in beating Netflix's existing solution. Six days! From getting access to the data and taking a first look, to building a movie recommender algorithm from scratch, all the way to making more accurate predictions for over a million ratings than anyone before. Six days, from zero to world class.

It didn’t stop there. Within a year, over 40,000 teams from 186 countries had entered the competition, all trying to improve on Netflix’s algorithm. Contestants increasingly started collaborating, sharing their learnings and taking lessons from others, forming bigger teams and ensembling ever more powerful models. Hardly ever before had such a rich and large dataset on consumer behavior been openly available. Together with a clearly stated and measurable objective, it provided a challenging, yet safe and fair sandbox for a worldwide community of intellectually driven engineers, all working together on advancing science while, as a more than welcome side effect, also helping Netflix improve its core algorithm.

The Case for #OpenBigData

Being open enables you to pick the brains of a wider group, to bring fresh perspectives to existing challenges, to build upon the creative minds of the many. And with data being the lingua franca of today’s business world, the common denominator across departments, across corporations, across industries, being open is really about sharing data, about sharing granular-level data at scale!

Openly sharing customer data at scale in 2006 was a bold move. But it was no coincidence that it was Netflix who did it. Already in their early years they had successfully established a culture of excellence, curiosity and courage, well documented in “one of Silicon Valley’s most important PowerPoint decks” [1]. Sharing data broadly takes courage, but it is even more so a sign of curiosity and a drive for excellence. No holding back, no hiding out, no making excuses. Netflix was never afraid of their competitors. They were afraid of no longer striving to be the best.

Openly sharing customer data at scale in 2019 is an (unl)awful move. Over the past years, the explosion in data volumes met a poorly regulated market with few sanctions being imposed, which allowed excessive misuse of personal data. The tide, though, has turned: both regulators and corporations are acknowledging privacy as a fundamental human right, one that is to be defended [2, 3]. This is indeed a new era of privacy.

Unfortunately, this plays into the hands of modern-day corporate gatekeepers: those decision makers who have never really been fond of transparency or of being challenged, and who were thus reluctant to share “their” data in the first place. It turns out they have found a new ally in defending their corporate data silos: privacy.

The Case for #SyntheticData

This is the point where one needs to tell the lesser-known part of the Netflix Prize story: as successful as the competition was for the company overall, they also had to pay a price in court. In fact, they were forced to cancel their second machine learning challenge, which was planned for 2011 [3]. Netflix had misjudged the anonymization measures they had put in place [4]. Even though they limited the data to movie ratings and their dates, merely linked to a scrambled user ID, that proved insufficient to prevent re-identification. It took only 16 days after the data was released for outsiders (with no superhuman hacking skills) to link these user IDs to freely available public data – with enough time left at hand to write up a whole paper on de-anonymization [5]. Netflix had unintentionally exposed the full movie history of part of their customer base, with no way to undo that privacy infringement. A decade later, Facebook had to learn the same painful lesson. Once the data is out and you have failed to properly anonymize it, no matter how good your intentions might have been, you will have a hard time undoing your actions.

This risk of re-identification in large-scale data is by now well understood by privacy and security experts [6], yet still widely underestimated by the corporate world. That’s why these experts face a challenging role within organizations: they need to educate their colleagues that most anonymization attempts for big data in fact fail to keep their customers safe. And these experts are forced to say NO more often than YES to a new initiative or a new innovation project, in order to keep privacy safe and secure.

the underestimated risk of re-identification

So, this is the big quest of our time: How to be open, while being private at the same time? How to put big data to good use, while still protecting each and everyone’s right to privacy? How to foster data-driven, people-centric and innovative societies and organizations, all at the same time, while not giving up an inch on safeguarding privacy?

We at Mostly AI set out to solve this challenge and developed a one-of-a-kind technical solution to this long-standing problem: an AI-based synthetic data generator. One that learns from actual behavioral data to generate statistically representative synthetic personas and their data: synthetic data that can be broadly shared, internally as well as externally, without exposing any individual. It’s all the value of the original data, but without the privacy risk.

This is 2019. It’s time to protect privacy, as well as to embrace the power of open again. It’s time to #GoSynthetic!

Synthetic Data Diamonds

My colleague Mitch likens synthetic data to synthetic diamonds when pitching our value proposition to potential customers. And it’s indeed a fitting analogy. Synthetic diamonds are nearly indistinguishable from actual diamonds to the human eye: they have the same structure, and they bear the same valuable characteristics, like hardness, purity and thermal conductivity. Yet they take 3 billion years less to form, come at a fraction of the cost and are considered more ethical in terms of sourcing.

Along the same lines, synthetic data can likewise be nearly indistinguishable from actual data, can have the same structure, and can retain all the same valuable properties (i.e. the statistical information) of the actual data. Yet machines can generate synthetic data in unlimited quantities, and more importantly, synthetic data allows big data to be utilized and shared without putting anyone’s privacy at risk. Synthetic data has the potential to become the new risk-free and ethical norm for leveraging customer data at scale. Finally, there is a solution for big data privacy!

However, just like with diamonds, the process of generating high-quality synthetic data is anything but trivial. At Mostly AI we have mastered the automated generation of synthetic structured data over the past two years by leveraging state-of-the-art generative AI, and we are confident in claiming that we offer the world’s most advanced synthetic data engine.

As a simple demonstration, as part of this first post, let’s apply our solution to a publicly available diamonds dataset with 53’940 records and 10 attributes. As can be seen, the attributes are a mix of categorical and numerical variables.

first 8 records of the original diamonds dataset
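
For readers who want to follow along: the numbers match the well-known diamonds dataset that ships with the ggplot2 package in R, so a copy can be loaded and exported locally. A minimal sketch (the file name diamonds.csv is simply the one used in the commands further below):

# the diamonds dataset ships with the ggplot2 package
library(ggplot2)

dim(diamonds)      # 53940 rows, 10 columns
str(diamonds)      # ordered factors (cut, color, clarity) plus numeric attributes
head(diamonds, 8)  # the first 8 records, as shown above

# export to CSV, to be fed into the synthesization engine below
write.csv(diamonds, "diamonds.csv", row.names = FALSE)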

Generating an equally sized, structurally identical, yet synthetic version of this dataset is, thanks to the flexibility of our engine, as simple as:

> mostly train diamonds.csv
> mostly generate -n 53940

This will create a new data file with 53’940 records, where none of the generated records has any direct relationship to the records in the original dataset anymore. Hence the information, and thus the privacy, of any individual diamond is protected (in case they care :)), while the structure of the population is retained.

first 8 records of a synthetically generated diamonds dataset

Furthermore, the statistical properties of the various attributes are successfully retained. Here is a side-by-side comparison of the frequencies of the three categorical variables clarity, cut and color. All of these univariate distributions are matched to near perfection.

side-by-side comparison of distributions of categorical variables
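
Comparisons like these can be reproduced with a few lines of R. A sketch for one of the categorical variables, assuming the synthetic output was saved as diamonds_synthetic.csv (a hypothetical file name chosen for this example):

library(ggplot2)
library(dplyr)

# original data, with factors converted to plain characters for easy stacking
orig <- diamonds %>%
  mutate(across(where(is.factor), as.character), source = "original")

# synthetic data, assuming it was saved as diamonds_synthetic.csv (hypothetical name)
syn <- read.csv("diamonds_synthetic.csv", stringsAsFactors = FALSE) %>%
  mutate(source = "synthetic")

combined <- bind_rows(orig, syn)

# relative frequencies of a categorical variable, original vs. synthetic
freq <- combined %>%
  count(source, clarity) %>%
  group_by(source) %>%
  mutate(share = n / sum(n))

ggplot(freq, aes(x = clarity, y = share, fill = source)) +
  geom_col(position = "dodge") +
  labs(y = "relative frequency", title = "clarity: original vs. synthetic")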

Similarly, the engine is also able to retain the distributions of the numerical variables, and nicely captures location, skew, the tails as well as the multiple modes.

side-by-side comparison of price percentiles
side-by-side comparison of histograms for price, x, y and z
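
The numerical attributes can be checked the same way, for instance by comparing percentiles and overlaying histograms. Again just a sketch, reusing orig, syn and combined from the previous snippet:

# price percentiles, original vs. synthetic
probs <- seq(0.1, 0.9, by = 0.1)
rbind(
  original  = quantile(orig$price, probs),
  synthetic = quantile(syn$price, probs)
)

# overlaid histograms for price
ggplot(combined, aes(x = price, fill = source)) +
  geom_histogram(bins = 50, position = "identity", alpha = 0.5)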

More importantly, the (non-linear) interdependencies between the various variables are properly captured as well. These statistical multivariate relationships are the key value of any dataset, as they help to gain a deeper understanding of the underlying domain.

As can be seen, the patterns of the synthetic dataset mimic the patterns found in the original dataset to a very high degree. By analyzing the synthetic data we can gain various insights into the diamond market, whether that’s the relationship between market prices and carats, or the dominance of the “Ideal” cut among diamonds with clarity “IF”. Even unexpected patterns, like the lack of “just-a-little-less-than-2-carat” diamonds, are perfectly captured as well, if you take a close look at the upper chart.
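
Both of these patterns can be inspected directly in the synthetic data; once more just a sketch, reusing the combined data frame from before:

# price vs. carat, original next to synthetic
ggplot(combined, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1, size = 0.3) +
  facet_wrap(~ source)

# share of cuts among diamonds with clarity "IF"
combined %>%
  filter(clarity == "IF") %>%
  count(source, cut) %>%
  group_by(source) %>%
  mutate(share = n / sum(n))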

Apart from descriptive statistics, the synthetic data also turns out to be of immense value for training machine learning algorithms and for predictive tasks. This opens up a whole range of opportunities to power next-generation AI, as more data can be gathered across different sources, across borders and industries, without infringing on individuals’ privacy.

Let’s now take a subset of 40k actual records, and generate 40k synthetic records based on that. We will need the remaining 13’940 actual records as a fair holdout set to benchmark our two models for diamond prices: one model trained on actual data, and another one trained on synthetic data. As expected, there is a loss in accuracy, as the data synthesis results in an information loss (to protect privacy), but the loss is within a rather small margin: in terms of mean absolute error, the model trained on synthetic data achieves $350 vs. $296 for the model trained on actual data. Still, the $350 is a far cry from a naive model, which would in this case result in an error of $3036 if no data were available at all.

R code for benchmarking models trained on actual vs. synthetic data
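
For illustration, a benchmark along these lines might be sketched as follows. The model choice (a random forest) and the file name of the 40k synthetic records are assumptions made for this sketch, not necessarily what produced the numbers above:

library(randomForest)   # model choice is an assumption for this sketch
set.seed(42)

# drop the "ordered" attribute of the factor columns so that all three data
# frames share plain factors with identical levels
prep <- function(df) {
  for (col in c("cut", "color", "clarity")) {
    df[[col]] <- factor(as.character(df[[col]]), levels = levels(diamonds[[col]]))
  }
  df
}

idx     <- sample(nrow(diamonds), 40000)
train   <- prep(diamonds[idx, ])
holdout <- prep(diamonds[-idx, ])
# 40k synthetic records, generated from the 40k training subset (hypothetical file name)
syn40k  <- prep(read.csv("diamonds_synthetic_40k.csv", stringsAsFactors = FALSE))

mae <- function(actual, predicted) mean(abs(actual - predicted))

fit_actual    <- randomForest(price ~ ., data = train,  ntree = 100)  # ntree kept small for speed
fit_synthetic <- randomForest(price ~ ., data = syn40k, ntree = 100)

mae(holdout$price, predict(fit_actual, holdout))      # model trained on actual data
mae(holdout$price, predict(fit_synthetic, holdout))   # model trained on synthetic data
mae(holdout$price, mean(train$price))                 # naive baseline: always predict the mean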

In an upcoming post we will dive deeper into an example of synthetic sequential data, which actually turns out to be the prevalent form when it comes to privacy-sensitive user data. But that is a story for another day. For now, let’s simply generate a million more synthetic data diamonds and wallow in them for the time being.

> mostly generate -n 1000000
> mostly wallow :)