Whether you’ve just heard about Synthetic Data at the last conference you attended or are already evaluating how it could help your organization innovate with your customer data in a privacy-friendly manner, this mini video series will cover everything you need to know about Synthetic Data:
What it is (psst, spoiler: a fundamentally new approach to big data anonymization)
Why it is needed
Why classic anonymization fails for big data (and how relying on it puts your organization at risk)
How Synthetic Data helps with privacy protection
Why it matters that Synthetic Data is AI-generated, and how to differentiate between the different types
And lastly, Synthetic Data use cases and insights on how some of the largest brands in the world are already using Synthetic Data to fuel their digital transformation
1. Introduction to Synthetic Data
2. Synthetic Data vs. Classic Anonymization
3. What is Synthetic Data? And why is AI-generated Synthetic Data superior?
4. Why Synthetic Data helps with Big Data Privacy
5. How Synthetic Data fuels AI & Big Data Innovation
It’s 2020, and I’m reading a 10-year-old report by the Electronic Frontier Foundation about location privacy that is more relevant than ever. Foreseeing how prevalent the bulk collection of location data would become, the authors discussed the possible threats to our privacy as well as solutions that would limit this unrestricted collection while still allowing us to reap the benefits of GPS-enabled devices.
Some may hold the opinion that companies and governments should stop collecting sensitive and personal information altogether. But that is unlikely to happen, and a lot of data is already in the system, changing hands, getting processed and analyzed over and over again. This can be quite unnerving, but most of us do enjoy the benefits of this information sharing, such as tips on shortcuts through the city during the morning rush hour. So how can we protect individuals’ rights and make responsible use of location data?
Until very recently, the main approach was to anonymize these data sets: hide rare features, add noise, aggregate exact locations into rough regions, or publish only summary statistics (for a great technical but still accessible overview, I can recommend this survey). Cryptography also offers tools to keep most of the sensitive information on-device and transmit only codes that compress the use-case-relevant information. However, these techniques impose a harsh trade-off between the privacy, accuracy, and utility of the modified data. Even after such preprocessing, if the data retains any of its utility, the risk of successfully re-identifying an individual is extremely high.
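To make these classic techniques concrete, here is a minimal sketch of two of them, coordinate generalization and perturbation. The grid resolution, noise scale, and example point are my own illustrative choices, and the uniform jitter is a simple stand-in for properly calibrated noise:

```python
import random

def generalize(lat, lng, decimals=2):
    # Rounding to 2 decimal places snaps a point to a roughly 1.1 km grid cell,
    # aggregating exact locations into rough regions.
    return (round(lat, decimals), round(lng, decimals))

def perturb(lat, lng, scale=0.01, rng=None):
    # Uniform jitter in degrees; a stand-in for properly calibrated noise.
    rng = rng or random.Random(0)
    return (lat + rng.uniform(-scale, scale), lng + rng.uniform(-scale, scale))

point = (41.15794, -8.62910)  # a hypothetical point in central Porto
print(generalize(*point))     # (41.16, -8.63)
```

The trade-off is visible even here: the coarser the grid or the stronger the jitter, the safer the record, and the less useful it becomes for any fine-grained analysis.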
“D.N.A. is probably the only thing that’s harder to anonymize than precise geolocation information.”
Paul Ohm, law professor and privacy researcher at the Georgetown University Law Center. Source: NYT
Time and again, we see how subpar anonymization can lead to high-profile privacy leaks and this is not surprising, especially for mobility data: in a paper published in Scientific Reports, researchers showed that 95% of the population can be uniquely identified just from four time-location points in a data set with hourly records and spatial resolution given by the carrier’s antenna. In 2014, the publication of supposedly safe pseudonymized taxi trips allowed data scientists to find where celebrities like Bradley Cooper or Jessica Alba were heading by querying the data based on publicly available photos. Last year, a series of articles in the New York Times highlighted this issue again: from a data set with anonymized user IDs, the journalists captured the homes and movements of US government officials and easily re-identified and tracked even the president.
Mostly AI’s solution for privacy-preserving data publishing is to go synthetic. We develop AI-based methods for modeling complex data sets and then generate statistically representative artificial records. The synthetic data contains only made-up individuals and can be used for test and development, analysis, AI training, or any downstream tasks really. Below, I will explain this process and showcase our work on a real-world mobility data set.
The Difficulties With Mobility Data
What is a mobility data set in the first place? In its simplest form, we might be talking about single locations only, where each record is a pair of latitude and longitude coordinates. In most situations, though, we have trips or trajectories, which are sequences of locations.
The trips are often tagged by a user ID and the records can include timestamps, device information, and various use-case specific attributes.
The aim of our new Mobility Engine is to capture and reproduce the intricate relationship between these attributes. But there are numerous issues that make mobility data hard to model.
Sparsity: any given latitude/longitude pair typically appears only once or a handful of times in the data set, especially in high-granularity records.
Noise: GPS recordings can include a fair amount of noise so even people traveling the exact same route can have quite different trajectories recorded.
Different scales: the same data set could include pedestrians walking in a park and people taking cabs from the airport to the city, so the spacing between consecutive data points can vary widely.
Sampling rate: short trips can contain hundreds of recordings while long trips might be sampled only infrequently, which makes modeling even more difficult.
Size of the data set: the most useful data sets are often the largest. Any viable modeling solution should handle millions of trips with a reasonable turnaround time.
Our solution can overcome all these difficulties without the need to compromise on accuracy, privacy or utility. Let me demonstrate.
Generating Synthetic Trajectories
The Porto Taxi data set is a public collection of a year’s worth of trips by 442 cabs running in the city of Porto, Portugal. The records are trips, sequences of latitude/longitude coordinates recorded at 15-second intervals, with some additional metadata (such as driver ID or time of day) that we won’t consider now. There are short and long trips alike and some of the trajectories are missing a few locations so there could be rather big jumps in them.
Given this data, we had our Mobility Engine sift through the trajectories multiple times and learn the parameters of a statistical process that could have generated such a data set. Essentially, our engine is learning to answer questions like
“What portion of the trips start at the airport?” or
“If a trip started at point A in the city and turned left at intersection B, what is the chance that the next location is recorded at C?”
You can imagine that if you are able to answer a few million of these questions then you have a good idea about what the traffic patterns look like. At the same time, you would not learn much about a single real individual’s mobility behavior. Similarly, the chance that our engine is reproducing exact trips that occur in the real data set, which in turn could hurt one’s privacy, is astronomically small.
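The kind of conditional question above can be illustrated with a toy next-location model. The trips and place names below are made up, and this simple transition-counting scheme is only a sketch of the idea, not the actual Mobility Engine:

```python
from collections import Counter, defaultdict
import random

# Hypothetical toy trips over named places; the real engine works on
# fine-grained coordinates, but the kind of question is the same.
trips = [
    ["airport", "A", "B", "center"],
    ["airport", "A", "C", "center"],
    ["A", "B", "center"],
]

# Count observed transitions to estimate P(next location | current location).
transitions = defaultdict(Counter)
for trip in trips:
    for here, nxt in zip(trip, trip[1:]):
        transitions[here][nxt] += 1

def p_next(here, nxt):
    counts = transitions[here]
    return counts[nxt] / sum(counts.values())

def sample_trip(start, length, rng):
    # Generate a synthetic trip by repeatedly sampling the next location.
    trip = [start]
    for _ in range(length - 1):
        counts = transitions[trip[-1]]
        if not counts:
            break
        trip.append(rng.choices(list(counts), weights=list(counts.values()))[0])
    return trip

print(p_next("A", "B"))  # 2 of 3 observed trips through A continue to B
print(sample_trip("airport", 4, random.Random(0)))
```

Once such conditional probabilities are learned, sampling from them produces plausible new trips rather than copies of observed ones.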
For this post, we trained on 1.5 million real trajectories and then had the model generate synthetic trips. We produced 250,000 artificial trips for the following analysis, but with the same trained model, we could just as easily have built 250 million.
First, for a quick visual, we plotted 200 real trips recorded by real taxi drivers (on the left, in blue) and 200 of the artificial trajectories that our model generated (on the right, in green). As you can see, the overall picture is rather convincing with the high-level patterns nicely preserved.
Looking more closely, we see very similar noise in the real and synthetic trips on the street level.
In general, we keep three main things in mind when evaluating synthetic data sets.
Accuracy: How closely does the synthetic data follow the distribution of the real data set?
Utility: Can we get competitive results using the synthetic data in downstream tasks?
Privacy: Is it possible that we are disclosing some information about real individuals in our synthetic data?
As for accuracy, we compare the real and synthetic data across several location- and trip-level metrics. First, we require the model to accurately reproduce the location densities, that is, the share of recordings falling in a given spatial area, and hot-spots at different levels of granularity. There are plenty of open-source spatial analysis libraries that can help you work with location data, such as skmob, geopandas, or Uber’s H3, which we used to generate the hexagonal plots below. The green-yellow-red transition marks how the city center is visited more frequently than the outskirts, with a clear hot-spot in the red region.
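The density computation behind such plots amounts to counting recordings per spatial cell. As a minimal sketch (with the grid resolution and example points being my own choices), I approximate the hexagonal cells with a simple rounding grid; with H3 you would map each point to a hex cell ID instead:

```python
from collections import Counter

def location_density(points, decimals=2):
    # Share of all recordings that fall into each (rounded) grid cell.
    cells = Counter((round(lat, decimals), round(lng, decimals))
                    for lat, lng in points)
    total = sum(cells.values())
    return {cell: n / total for cell, n in cells.items()}

# Hypothetical recordings: three near the city center, one further out.
points = [(41.158, -8.629), (41.162, -8.631), (41.158, -8.630), (41.20, -8.61)]
print(location_density(points))
```

Comparing these per-cell shares between the real and the synthetic data is what the hexagonal plots visualize.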
From the sequences of latitude and longitude coordinates, we derive various features such as trip duration and distance traveled, origin-destination distance, and a number of jump length statistics.
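For illustration, these trip-level features can be derived with a standard haversine distance. The example trajectory is made up; the 15-second interval matches the Porto recordings described above:

```python
import math

def haversine_km(p, q):
    # Great-circle distance between two (lat, lng) points in kilometres.
    lat1, lng1, lat2, lng2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def trip_features(trip, seconds_per_step=15):
    # Jump lengths between consecutive recordings, plus trip-level summaries.
    jumps = [haversine_km(a, b) for a, b in zip(trip, trip[1:])]
    return {
        "duration_s": (len(trip) - 1) * seconds_per_step,
        "distance_km": sum(jumps),
        "od_distance_km": haversine_km(trip[0], trip[-1]),
        "max_jump_km": max(jumps),
    }

trip = [(41.1579, -8.6291), (41.1585, -8.6300), (41.1600, -8.6310)]
print(trip_features(trip))
```

Comparing the distributions of such features between real and synthetic trips is exactly what the plots below do.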
The plots above show two of these distributions for both the real and synthetic trips. The fact that these distributions overlap almost perfectly shows that our engine is spot-on at reproducing the hidden relationships in the data. To capture a different aspect of behavior, we also consider geometric properties of paths: the radius of gyration measures how far on average a trip is from its center of mass, and the straightness index is the ratio of the origin-destination distance to the full traveled distance. So for a straight line the index is exactly 1, and for more curvy trips it takes lower values, with a round trip corresponding to a straightness index of 0. We again see that the synthetic data follows the exact same trends as the real data, even mimicking the slight increase in the straightness distribution from 0.4 to 1. I should stress that we get this performance without ever optimizing the model for these particular metrics, so we would expect similarly high accuracy on other, so far untested, features.
Regarding utility, one way we can test-drive our synthetic mobility data is to use it in practical machine learning scenarios. Here, we trained three models to predict the duration of a trip based on the origin and destination: a Support Vector Machine, a K-nearest neighbor regressor, and an LGBM regressor.
We trained the synthetic models only on synthetic data and the real models on the original trajectories. The scores in the plot came from testing all the models against a holdout from the original, real-life data set. As expected, the synthetically trained models performed slightly worse than the ones that have seen real data but still achieved a highly competitive performance.
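The train-on-synthetic, test-on-real procedure can be sketched with a toy regressor on hypothetical one-dimensional data. Everything here is invented for illustration: the tiny k-NN model stands in for the SVM/KNN/LGBM models, and the noisier "synthetic" set mimics a slightly less faithful copy of the real data:

```python
import random

def knn_predict(train, x, k=3):
    # Minimal k-nearest-neighbour regressor on a single feature.
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mae(train, test, k=3):
    # Mean absolute error of the model fitted on `train`, scored on `test`.
    return sum(abs(knn_predict(train, x, k) - y) for x, y in test) / len(test)

rng = random.Random(0)
# Hypothetical "real" data, a slightly noisier "synthetic" copy, and a holdout.
real      = [(x, 2 * x + rng.gauss(0, 0.10)) for x in (rng.uniform(0, 10) for _ in range(200))]
synthetic = [(x, 2 * x + rng.gauss(0, 0.15)) for x in (rng.uniform(0, 10) for _ in range(200))]
holdout   = [(x, 2 * x) for x in (rng.uniform(0, 10) for _ in range(50))]

# Both models are scored against the same real-data holdout.
print(mae(real, holdout), mae(synthetic, holdout))
```

The key design point is that the holdout comes from the real distribution in both cases, so the two scores are directly comparable.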
Moreover, imagine you are an industry-leading company and have shared the safe synthetic mobility data in a competition or hackathon to solve this prediction task. If the teams came up with the solutions shown above and obtained the green error rates on the public, synthetic data, then you can rightly infer that the winning solution would also do best on your real but sensitive data set that the teams have never seen.
The privacy evaluation of the generated data is always a critical and complicated issue. Already at the training phase, we have controls in place to stop the learning process and thus prevent overfitting. This ensures that the model is learning the general patterns in the data rather than memorizing exact trips or overly specific attributes of the training set. Second, we also compare how far our synthetic trips fall from the closest real trips in the training data. In the plot below, we have a single synthetic trip in red, with the purple trips being the three closest real trajectories, then the blue and green paths sampled from the 10-20th and 50-100th closest real trips.
To quantify privacy, we actually look at the distribution of closest distances between real and synthetic trips. In order to have a baseline distribution, we repeat this calculation using a holdout of the original but so far unused trips instead of the synthetic ones. If these real-vs-synthetic and real-vs-holdout distributions differ heavily, in particular, if the synthetic data falls closer to the real samples than what we would expect based on the holdout, that could indicate an issue with the modeling. For example, if we simply add noise to real trajectories then the latter distributions will clearly flag this data set for a privacy leak. However, the data generated by our Mobility Engine passed these tests with flying colors.
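Hedging that this is a deliberately simplified stand-in, with trips reduced to single numbers and the median as the only summary statistic, the distance-to-closest-record comparison works like this:

```python
def dcr(candidates, training, dist):
    # Distance from each candidate record to its closest training record.
    return sorted(min(dist(c, t) for t in training) for c in candidates)

dist = lambda a, b: abs(a - b)          # trip distance reduced to 1-D for the sketch
training  = [0.0, 1.0, 2.0, 3.0]        # records the generator was trained on
holdout   = [0.4, 1.6, 2.5]             # real but unused records: the baseline
synthetic = [0.5, 1.4, 2.6]
copies    = [1.001, 2.0001, 3.0002]     # near-duplicates of training records

median = lambda xs: xs[len(xs) // 2]
baseline = median(dcr(holdout, training, dist))
print(median(dcr(synthetic, training, dist)) >= 0.5 * baseline)  # True: looks safe
print(median(dcr(copies, training, dist)) >= 0.5 * baseline)     # False: red flag
```

If the synthetic records sit much closer to the training data than the holdout does, as the `copies` set does here, the generator has likely memorized real records and the data set fails the privacy test.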
We should be loud about not compromising on privacy, no matter the benefits offered by sharing our personal information. Governments, operators, service providers, and others alike need to take privacy seriously and invest in technology that protects individuals’ rights during the whole life cycle of the data. We at Mostly AI believe that synthetic data is THE way forward for privacy-preserving data sharing. Our new Mobility Engine allows organizations to fully utilize sensitive location data by producing safe synthetic data at a so-far-unseen level of accuracy.
This research and development on synthetic mobility data is supported by a grant of the Vienna Business Agency (Wirtschaftsagentur Wien), a fund of the City of Vienna.
As of January 1, 2020, the California Consumer Privacy Act (CCPA) went into effect, forcing many U.S. companies to improve their privacy standards. Thanks to the European Union’s General Data Protection Regulation (GDPR), some organizations were prepared for this change and well equipped to handle the shift in compliance requirements. While other organizations scrambled to meet the new privacy standards, some turned to AI for a helping hand, which we’ll get to later. But first, let’s define the privacy law everyone’s talking about.
The CCPA In a Nutshell
Generally speaking, the CCPA impacts any company, regardless of physical location, that serves California residents and:
Brings in annual revenue of $25 million or more,
Buys, sells, or receives the personal data of at least 50,000 people, or
Earns more than half its revenue selling personal data.
To be clear, these companies do not need to be based in California
or even the United States, to fall under this law.
Facebook’s Polarizing Reaction to the CCPA
Although a few exceptions to the rule above exist, it appears Facebook would be governed by the CCPA. Facebook brings in billions of dollars in revenue each year, collects personal data from over two billion people (with millions of users living in California, the same state where it’s headquartered), and earns the majority of its revenue through targeted advertising to its users.
Despite that, Facebook is taking a coy yet combative stance with regard to the CCPA’s application. In a December blog post, Facebook states:
“We’re committed to clearly explaining how our products work, including the fact that we do not sell people’s data.”
Facebook is adamant it does not sell user data and, as such, is encouraging advertisers and publishers that use Facebook services “to reach their own decisions on how to best comply with the law.” Facebook is attempting to shift responsibility to the companies that use its platform for advertising purposes. To some, this may seem a bit surprising; however, the position aligns with arguments Facebook has made in privacy lawsuits dating back to 2015.
Facebook is drawing a line in the sand. My interpretation of Facebook’s blog post is this: We do not sell the data we collect. We merely offer a service to help businesses advertise. As such, we’re exempt and the CCPA does not apply. Regardless, we’re CCPA compliant and have invested in technology to help our users manage their data.
Recall that I mentioned above there are a few exceptions to the rule. Well, per the CCPA, some “service providers” are exempt. Service providers require data from others to offer their services, and in these situations, the transfer of data is acceptable. Facebook doesn’t technically “sell” user data; Facebook simply makes it easy for companies to advertise to targeted individuals on its platform by leveraging the personal data it collects.
Facebook is taking a clever stance, but legal experts are not convinced it will hold up in a court of law. Some believe Facebook does not fall under the “service provider” exemption because it uses the data it collects for other business purposes in addition to providing advertising services. Therefore, Facebook would not be exempt, and the CCPA should apply.
Potential Penalties for Facebook
In 2019 alone, according to a GDPR fine tracker, 27 companies, including Google, Uber and Marriott, received penalties totaling over €428 million. With the CCPA now in effect, fines from both the GDPR and CCPA are expected to reach billions in 2020.
If a business fails to cure a CCPA violation within 30 days of notification of noncompliance, the business may face up to $2,500 per violation, or $7,500 per intentional violation. For example, if a company intentionally fails to delete the personal data of 1,000 users who made deletion requests, the resulting penalty could be $7.5 million.
So, what could this look like for Facebook? Well, according to estimates and calculations by Nicholas Schmidt, a lawyer and former research fellow of the IAPP Westin Research Center, Facebook “could face a rough full maximum penalty of $61.6 billion for an unintentional violation affecting each of its users and up to $184.7 billion for an intentional violation.” That’s a whole lot of cash, even for a company worth over $500 billion. This may explain why Facebook has taken steps to help people access, download and delete their data, in accordance with the CCPA.
Despite Claiming CCPA Exemption, Facebook Makes Effort to Comply
The CCPA provides consumers with new data privacy rights, including the right to data collection transparency, the right to opt out, and the right to be forgotten.
Essentially, consumers have the right to know what data of theirs is being collected, if their data has been shared with others, and who has seen or received their data. Consumers also have the right to opt-out, or in other words, prevent the transfer or “sale” of their data. And even if they opt-out, they still have the right to receive the same pricing or services as those who have not. Equally as important, consumers have the right to access their own personal data and request its deletion.
Privacy Law Fun Fact:
The CCPA borrowed the “Right to be Forgotten,” aka the “Right to Erasure,” from the GDPR and applies a similar variation as the “Right to Request Deletion.”
When I prepared this blog post, I learned that Instagram (which is owned by Facebook) has made it incredibly easy for people to review and download their personal data. Rather than forcing users to jump through hoops, send email requests, and wait 45 days for a reply, Instagram has streamlined the request and data review process required under the CCPA. It took me only a minute to send a request to download data and to exercise my “right to request deletion”. Viewing my account and advertising data was simple and straightforward too:
Facebook is going above and beyond in an attempt to meet CCPA compliance for all users, regardless of where those users reside. Based on its updated user experience for CCPA inquiries and data reporting, it seems Facebook is aware it falls under the law, regardless of what it states on its blog.
Lessons Learned from Facebook and How To Use AI for Compliance
The CCPA influences how businesses collect, store and share data, which continues to create ripple effects of improved policies and procedures nationwide.
In its effort to meet CCPA compliance, Facebook provides a great example of how to embrace artificial intelligence and self-service tools to help individuals manage privacy. Despite Facebook’s position on the CCPA, we’ve seen how it has updated its “help center” and automation flow to meet the requirements of this new law. Some companies, however, are handling privacy requests manually, one email at a time, resulting in additional work for their employees. Others are turning to artificial intelligence to handle these tasks while improving security, compliance and data protection.
To prevent fraud, organizations are using AI to verify user identity before allowing people to download their personal data. Some businesses are using AI-powered chatbots to efficiently answer privacy related questions and manage data access. In addition to fraud detection and customer support, companies are using AI to generate synthetic data for CCPA compliance. This saves companies hundreds of hours of time spent masking personally identifiable information and empowers organizations to train machine learning algorithms regardless of how much customer data has already been (or will be) deleted.
Artificially manufactured synthetic data appears just as real to the naked eye but contains no personal information from actual users. Since it’s generated by an algorithm that learns the patterns and correlations of the raw data and then uses this knowledge to create new, highly realistic synthetic data, it’s completely anonymous and untraceable to any individual. Therefore, synthetic data can be freely shared among teams without the risk of reverse-engineering any user’s identity or personal information. It’s with this advancement in technology that companies can protect the privacy of their users and continue to make data-driven decisions.
Organizations around the globe, including Amazon, Apple, Google and Uber, are now using synthetic data. It’s an incredible tool to train AI and leverage data previously collected without putting any of your customers at risk. It’s GDPR and CCPA compliant, and rumor has it, Facebook is already using synthetic data too.
If you focus on compliance, AI, or anywhere in between, I’d love to hear how you’ve embraced the CCPA. You can message me here, anytime, for a fun conversation or friendly debate.
“Knowledge is power” should be a familiar quote for most of us. But did you know that in the earliest documented occurrence of this phrase, from the 7th century, the original sentence finishes with “…and it can command obedience”? If we apply this idea to personal information, it’s easy to understand why privacy matters. Nobody wants to be at the receiving end.
Today privacy is recognized as one of the fundamental human rights by the UN and protected through international and national level legislation. Not complying with privacy regulations can be very costly if not a business-threatening risk for organizations working with their customers’ data.
Should We Trust Blindly?
Never before has it been as difficult as it is now to preserve individuals’ privacy. People have lost track and control of what kind of personal data, how much of it, why, and where it is being collected, stored, and even sold on. From “1,000 songs in our pockets” (Apple’s first iPod slogan) we have moved to an endless number of apps and trackers in our pockets recording our daily activities. In our homes as well as in public spaces, devices with sensors and meters record every step and action we take.
As an individual, it’s wise to question if the consent given to a data controller is compliant with current privacy regulations. Because sometimes data controllers turn to the Dark Side by collecting unnecessary data, sharing data with third parties, failing to protect data appropriately or using it in a discriminatory way against the data subjects.
The Industry’s Dirty Secret About Privacy
But even the data controllers who are not crossing the legal red lines are struggling more and more to protect their customers’ data at rest as well as in motion. Their life would be much easier if customer data could be locked up in a safe. But data, if not the new oil of our economy, is for sure the lubricant necessary to keep the business machine running. This means that every company needs to expose its customers’ data for purposes such as analytics, development and testing, and machine learning.
Often it’s not bad intention but the simple inability of data controllers to comply with privacy regulations, because the instruments at hand cannot securely anonymize data without destroying all of its utility. I remember many discussions with data security officers at different companies admitting that their current processes aren’t compliant, but as long as there is no technical fix, they can’t stop the machine that’s running.
What do those data security officers mean when they say that there’s no technical fix for privacy compliance? Although there are many different methods to protect data like cryptography, tokenization and anonymization, all of them have some weak points when it comes to data protection.
Sometimes the weak point is the human: if cryptographic keys are compromised, privacy is lost. Sometimes, as in the case of classic anonymization techniques, the weak point is the technology. What more and more people realize is that classic anonymization (randomization, generalization, …) is at once an optimization problem and an uphill battle. When original data is modified, it loses some of its information value and therefore its utility. The more protection, the less utility, and vice versa. The data anonymization engineer is continuously balancing the legal requirements to protect the privacy of data subjects against the business requirements to extract value from the data.
But in the age of big data with millions of records and thousands of attributes this optimization is getting almost impossible to achieve. The more data points are being collected about individuals, the easier it gets to re-identify individuals even in large databases. For example, 80% of all credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed.
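A back-of-the-envelope uniqueness check illustrates why a handful of points suffice for re-identification. The people and transaction records here are entirely made up:

```python
from itertools import combinations

def unique_fraction(people, k):
    # Fraction of individuals for whom SOME k of their transaction points
    # match no one else in the data set, i.e. who are re-identifiable.
    unique = 0
    for name, points in people.items():
        for combo in combinations(points, k):
            holders = [n for n, pts in people.items() if set(combo) <= set(pts)]
            if holders == [name]:
                unique += 1
                break
    return unique / len(people)

# Hypothetical (merchant, day) transaction records.
people = {
    "alice": [("cafe", "mon"), ("gym", "tue"), ("shop", "wed")],
    "bob":   [("cafe", "mon"), ("gym", "tue"), ("bar", "thu")],
}
print(unique_fraction(people, 2))  # 1.0: two points single out everyone here
```

The more attributes each record carries, the more such distinguishing combinations exist, which is exactly why big data makes re-identification easier, not harder.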
The anonymization trade-off has shifted to a paradox state where only useless data is sufficiently protected data. So if you have been using anonymized, granular level data for analytics or testing, the odds are high that a significant fraction of this data is actually re-identifiable and the privacy of your customers isn’t adequately protected.
This dirty secret is causing some sleepless nights in legal and security departments across various industries. What’s even scarier is that some organizations aren’t even aware of it. And the only thing worse than not protecting your customer’s privacy is acting with the false impression that you are doing it, which leads to organizations taking their guard down to freely share poorly anonymized information internally and externally.
Trouble is on the horizon. In 2019, GDPR fines totaled almost 500 million euros, and it wouldn’t surprise me if that amount doubled in 2020. But the legal fines aren’t the biggest trouble awaiting data controllers.
What Happens When Privacy Is Breached?
Nowadays, data breaches happen more frequently even when access to sensitive data is restricted. According to IBM, the likelihood of experiencing a data breach within the next two years has risen 31% over the last five years, to 29%. Data snoopers are the bank robbers of the 21st century. The average total cost of a data breach runs into millions of euros, and regulatory fines are only one part of it. The biggest cost factor of a data breach is the loss of reputation, which directly leads to the loss of business.
This Sunday a story was published in British media holding all the components of a perfect privacy nightmare: a data breach of a governmental database containing personal information of about 28 million children. The breach happened through the access that the UK government has granted to an educational and training provider.
Just imagine what horrible things could be done with that data. Now go through your diabolical list and strike out “selling children’s private data to betting firms” – this task has already been successfully accomplished. How successfully? According to UK media, betting companies were able to significantly increase the number of children passing their identity checks and have used the stolen data to increase the number of young people who gamble online. The house always wins.
And there’s no sign that the winning streak of data adversaries will break anytime soon. Every year, the number of data breaches and the volume of exposed sensitive data keep rising. With a total of 550 million leaked records last year, the 28 million data records of the recent breach in the UK will hardly break into the top five data breaches of 2019.
The types of breaches range from negligence (accidental exposure, employee error, improper disposal, or loss) to intentional acts such as hacking, insider theft, unauthorized access, and even physical data theft.
On the dark web, the data is offered to other criminals, or even to legal entities like the betting companies in the example above, as well as to social media organizations that use this data, among other things, to check whether their own users’ passwords were indirectly exposed.
What Are The Alternatives?
The list of things that can be done to reduce privacy risk is long. When data is in motion, for example during business process operations, data controllers will continue to use cryptography and tokenization as the preferred methods to protect data in operative applications. For data at rest used for data analytics, development and testing, machine learning, and open innovation, there are new, innovative ways to truly anonymize data and fix the privacy vs. utility trade-off once and for all.
AI-generated synthetic data is THE way forward to preserve the utility of data while protecting the privacy of every data subject. This innovative approach was only possible thanks to the progress in the field of artificial intelligence. Generative deep neural networks can be trained on the original structured data to then be used to generate new synthetic data points. These new synthetic datasets preserve all the correlations, structures and time dependencies present in the original data. Customer-related events like financial transactions, online-clicks, movements etc. can be synthesized and all the important insights contained in the original data can be preserved in the synthetic dataset. And at the same time there’s no way to re-identify the original customer.
With synthetic data, the way to do machine learning with anonymous, “full privacy, full utility” data is wide open. The evaluation results obtained by doing machine learning on synthetic data are very similar to those obtained on the original data. Application testing with synthetic data can cover the edge cases that are normally found only in the original data. And performance tests of applications are no longer a problem, because synthetic test data, in contrast to original data, isn’t scarce: millions of synthetic records can be produced with the click of a button.
With AI-generated synthetic data being used more and more, data security and legal departments will finally be able to sleep easier. At the same time, the data scientists and development & testing teams will be able to focus on more productive tasks and won’t get distracted by legal and security requirements. Trust is good, control is better, but no dependency on trust is the best.
And what about data adversaries? What would happen if they manage to steal 28 million synthetic records of UK children like in the story above?
These data adversaries would be sitting on terabytes of artificially generated data: nothing more than a high-quality look-alike dataset. They’d be painfully disappointed to discover that re-identifying any of the 28 million UK kids is simply not possible. And you can bet on that.
Experience the world’s most accurate Synthetic Data Generator today
We proudly present a light demo version of our powerful enterprise solution Mostly GENERATE and warmly welcome you to try it out for free.
What Does Mostly GENERATE Do?
Mostly GENERATE enables you to automatically transform your privacy-sensitive big data assets into highly realistic and accurate synthetic datasets. The benefit is that synthetic data is fully anonymous and thus exempt from data protection regulations. The result is as-good-as-real data that is free to use, share or monetize. So with Mostly GENERATE, you are finally able to freely innovate with one of your most valuable resources, all while avoiding the financial, regulatory or reputational risks of a privacy violation!
How Does Mostly GENERATE Work?
Mostly GENERATE leverages state-of-the-art generative deep neural networks with built-in privacy mechanisms to automatically learn the patterns, structure and variation of an existing dataset. Once training is completed, it allows you to simulate an unlimited number of highly realistic and representative – but completely anonymous – synthetic customers. Thereby, you retain all the valuable information in your data assets while rendering the re-identification of any individual impossible.
How Does The Demo Compare?
Although the Mostly GENERATE Demo is offered for free, it comes with the same powerful core technology as the enterprise version: our fully automated Synthetic Data Engine. This enables you to generate synthetic data based on an actual dataset you provide and to experience the magic of generative AI in action. Try it out today and see for yourself its quality, its flexibility and its ease of use.
But keep in mind that it is a free version and that some usage restrictions apply. While the enterprise version of Mostly GENERATE supports multiple tables and extremely large datasets with millions of rows and hundreds of columns, we limit the input for the free version to a single table with 500,000 rows and 50 columns. Furthermore, you won’t have the same flexibility as with our enterprise version to configure the training or the generation process. And as we operate this demo on a low-cost cloud infrastructure, the compute time will be significantly longer than for production setups.
Lastly, since it is a demo version, you must not upload any personal or sensitive data. But no worries – if you don’t have a suitable dataset at hand, just go to Kaggle and choose a publicly available dataset you like. Or, for even more convenience, simply select one of the datasets already provided on the demo site to start your first synthetization run in a matter of seconds!
With that being said, we warmly welcome you to try out this service and to start your own personal journey with synthetic data. We are very much looking forward to your valued feedback and are here to answer your questions, should any arise.
When Netflix announced a 1-million-dollar challenge for improving their recommendation engine, together with releasing 100 million movie ratings in 2006, little did they know what would happen next. Netflix had been working on recommender systems for years by then, was recognized for business innovation as well as technical excellence, had managed to hire the smartest data engineers and machine learning experts alike, and certainly knew all the ins and outs of movies, genres and popular actors. Still, they were eager to learn more.
The competition began on October 2nd, 2006. Within a mere six days a first contestant succeeded in beating their existing solution. Six days! From getting access to the data and taking a first look, to building a movie recommender algorithm from scratch, all the way to making more accurate predictions for over a million ratings than anyone before. Six days, from zero to world class.
It didn’t stop there. Within a year, over 40,000 teams from 186 countries entered the competition, all trying to improve on Netflix’s algorithms. And contestants increasingly started collaborating, sharing their learnings and taking lessons from others, forming bigger teams and ensembling ever more powerful models. Hardly ever before had such a rich and big dataset on consumer behavior been openly available. Together with a clearly stated and measurable objective, it provided a challenging, yet safe and fair sandbox for a worldwide community of intellectually driven engineers. All working together on advancing science while, as a more than welcome side effect, also helping Netflix improve its core algorithm.
Being open enables you to pick the brains of a wider group, to bring fresh perspectives to existing challenges, to build upon the creative minds of the many. And with data being the lingua franca of today’s business world – the common denominator across departments, across corporations, across industries – being open is really about sharing data, about sharing granular-level data at scale!
Openly sharing customer data at scale in 2006 was a bold move. But it was no coincidence that it was Netflix who made it. Already in their early years they successfully established a culture of excellence, curiosity and courage, well documented in “one of Silicon Valley’s most important PowerPoint decks”. Sharing data broadly takes courage, but it is even more so a sign of curiosity and a drive for excellence. No holding back, no hiding out, no making excuses. Netflix was never afraid of their competitors. They were afraid of no longer striving for the best.
Openly sharing customer data at scale in 2019 is an (unl)awful move. Over the past years, the explosion in data volumes met a poorly regulated market with few sanctions imposed, allowing excessive misuse of personal data. The tide, though, has turned: both regulators and corporates are acknowledging privacy as a fundamental human right, one that is to be defended [2,3]. This is indeed a new era of privacy.
Unfortunately, this plays into the hands of modern-day corporate gatekeepers: those decision makers who’ve never really been fond of being transparent and being challenged, and are thus reluctant to share “their” data in the first place. It turns out they have found a new ally in defending their corporate data silos: privacy.
This is the point where one needs to tell the lesser-known part of the Netflix Prize story: as successful as the competition was for the company overall, they also had to pay a price in court. In fact, they were forced to cancel their second machine learning challenge, which was planned for 2011. Netflix had misjudged the anonymization measures they had put in place. Even though they limited the data to movie ratings and their dates, merely linked to a scrambled user ID, this proved insufficient to prevent re-identification. It took only 16 days after the data was released for an outsider (with no superhuman hacking skills) to link these user IDs with freely available public data – with enough time left to write up a whole paper on de-anonymization. Netflix had unintentionally exposed the full movie history of part of their customer base, with no chance of undoing that privacy infringement. A decade later, Facebook had to learn the same painful lesson. Once the data is out and you have failed to properly anonymize it, no matter how good your intentions might have been, you will have a hard time undoing your actions.
This risk of re-identification in large-scale data is by now well understood by privacy and security experts, yet still widely underestimated by the corporate world. That’s why these experts face a challenging role within organizations: they need to educate their colleagues that most anonymization attempts for big data in fact fail to keep customers safe. And these experts are forced to say NO more often than YES to a new initiative or a new innovation project in order to keep privacy safe and secure.
So, this is the big quest of our time: How to be open while being private at the same time? How to put big data to good use while still protecting everyone’s right to privacy? How to foster data-driven, people-centric and innovative societies and organizations, all at the same time, without giving up an inch on safeguarding privacy?
We at Mostly AI set out to solve this challenge and developed a one-of-its-kind technical solution to this long-standing problem: an AI-based synthetic data generator. One that learns from actual behavioral data to generate statistically representative synthetic personas and their data. Synthetic data that can be broadly shared, internally as well as externally, without exposing any individual. It’s all the value of the original data, but without the privacy risk.
This is 2019. It’s time to protect privacy, as well as to embrace the power of open again. It’s time to #GoSynthetic!
My colleague Mitch likens synthetic data to synthetic diamonds when pitching our value proposition to potential customers. And it’s indeed a fitting analogy. Synthetic diamonds are nearly indistinguishable from actual diamonds to the human eye: they have the same structure, and they bear the same valuable characteristics, like hardness, purity and thermal conductivity. Yet they take 3 billion years less to form, come at a fraction of the cost, and are considered more ethical in terms of sourcing.
Along the same lines, synthetic data can be nearly indistinguishable from actual data, can have the same structure, and can retain all the same valuable properties (i.e., statistical information) of the actual data. Yet machines can generate synthetic data in unlimited quantities, and more importantly, synthetic data allows big data to be utilized and shared without putting anyone’s privacy at risk. Synthetic data has the potential to become the new risk-free and ethical norm for leveraging customer data at scale. Finally, there is a solution for big data privacy!
However, just like with diamonds, the process of generating high quality synthetic data is anything but trivial. At Mostly AI we’ve mastered the automated generation of synthetic structured data over the past 2 years by leveraging state-of-the-art generative AI, and are confident to claim that we offer the world’s most advanced synthetic data engine.
As a simple demonstration in this first post, let’s apply our solution to a publicly available diamonds dataset with 53,940 records and 10 attributes. As can be seen, the attributes are a mix of categorical and numerical variables.
Generating an equally sized, structural identical, yet synthetic version of this dataset is, thanks to the flexibility of our engine, as simple as:
This will create a new data file with 53,940 records, where none of the generated records has a direct relationship to any record in the original dataset. Hence, the information and thus the privacy of any individual diamond is protected (in case they care :)), while the structure of the population is retained.
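To make the idea concrete, here is a deliberately naive, self-contained stand-in for such a generator (the function name and the toy table are hypothetical; this sketch is not the Mostly GENERATE API): it samples each attribute independently from its empirical distribution, so it preserves univariate statistics but, unlike a real generative model, loses all cross-column relationships.

```python
import random
from collections import Counter

def synthesize(records, n, seed=42):
    """Naive stand-in for a synthetic data generator: samples each
    attribute independently from its empirical distribution.
    Preserves univariate statistics only; a real generative model
    would also capture the relationships between columns."""
    rng = random.Random(seed)
    columns = {key: [r[key] for r in records] for key in records[0]}
    return [{key: rng.choice(vals) for key, vals in columns.items()}
            for _ in range(n)]

# Tiny toy "diamonds" table (hypothetical values for illustration).
original = [
    {"cut": "Ideal", "carat": 0.3},
    {"cut": "Premium", "carat": 0.7},
    {"cut": "Ideal", "carat": 1.1},
]
synthetic = synthesize(original, 1000)
print(Counter(r["cut"] for r in synthetic))  # roughly 2:1 Ideal vs. Premium
```

No synthetic record is a copy of an original row; each is assembled from the learned (here: merely resampled) value distributions.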
Furthermore, the statistical properties of the various attributes are successfully retained. Here are side-by-side comparisons of the frequencies of the three categorical variables clarity, cut and color. All of these univariate distributions are matched to near perfection.
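One simple way to quantify how closely two such categorical distributions match is the total variation distance between their empirical frequencies. A small sketch (the clarity counts below are made-up illustrative numbers, not the actual dataset):

```python
from collections import Counter

def tv_distance(col_a, col_b):
    """Total variation distance between two empirical categorical
    distributions: 0 means identical frequencies, 1 means disjoint."""
    pa, pb = Counter(col_a), Counter(col_b)
    na, nb = len(col_a), len(col_b)
    cats = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in cats)

# Hypothetical clarity columns for original vs. synthetic data:
orig = ["IF"] * 10 + ["VS1"] * 30 + ["SI2"] * 60
synt = ["IF"] * 11 + ["VS1"] * 29 + ["SI2"] * 60
print(round(tv_distance(orig, synt), 3))  # → 0.01
```

Values near zero indicate that the synthetic frequencies are practically indistinguishable from the original ones.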
Similarly, the engine is also able to retain the distributions of the numerical variables, and nicely captures location, skew, the tails as well as the multiple modes.
More importantly, the (non-linear) interdependencies between the various variables are properly captured. These statistical multivariate relationships are the key value of any dataset, as they help to gain a deeper understanding of the underlying domain.
As can be seen, the patterns of the synthetic dataset mimic the patterns found in the original dataset to a very high degree. By analyzing the synthetic data we can gain various insights in terms of the diamond market. Whether that’s the relationship between market prices and carats, or the dominance of “Ideal” cut for diamonds with clarity “IF”. Even unexpected patterns, like the lack of “just-little-less-than-2-carat” diamonds is perfectly captured as well, if you take a close look at the upper chart.
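A quick way to verify that bivariate relationships like price-versus-carat survive synthesis is to compare correlation coefficients between the original and the synthetic data. The sketch below uses made-up toy data: independently shuffling a column (which is effectively what naive column-wise synthesis would produce) destroys the correlation that a faithful generator must preserve.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)
# Toy "diamonds": price roughly proportional to carat, plus noise.
carat = [rng.uniform(0.2, 2.0) for _ in range(1000)]
price = [4000 * c + rng.gauss(0, 300) for c in carat]

# Independently sampled columns, as naive synthesis would yield:
shuffled_price = price[:]
rng.shuffle(shuffled_price)

print(round(pearson(carat, price), 2))           # strong correlation, close to 1
print(round(pearson(carat, shuffled_price), 2))  # near zero
```

Comparing such pairwise statistics between the original and the synthetic dataset is a cheap sanity check that the multivariate structure was actually retained.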
Apart from descriptive statistics, the synthetic data also turns out to be of immense value for training machine learning algorithms and for predictive tasks. This opens up a whole range of opportunities to power next-generation AI, as more data can be gathered across different sources, borders and industries without infringing on individuals’ privacy.
Let’s now take a subset of 40k actual records and generate 40k synthetic records based on it. We keep the remaining 13,940 actual records as a fair holdout set to benchmark our two models for diamond prices: one model trained on actual data, and another trained on synthetic data. As expected, there is a loss in accuracy, since the data synthesis incurs an information loss (to protect privacy), but the loss stays within a rather small margin: in terms of mean absolute error, we achieve $350 versus $296. Still, the $350 is a far cry from a naive model, which, with no data available at all, would result in an error of $3,036.
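The evaluation protocol above can be sketched as follows. Since the actual pipeline is not shown, this is a minimal stand-in under stated assumptions: a one-feature linear model of price on carat, with noisier toy data playing the role of the synthetic set. The point is the protocol itself – train on actual vs. synthetic data, score both on the same actual holdout, and compare against a naive mean predictor.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y ≈ a*x + b (one-feature linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mae(preds, actuals):
    """Mean absolute error between predictions and actual values."""
    return sum(abs(p - t) for p, t in zip(preds, actuals)) / len(actuals)

rng = random.Random(1)

def make(n, noise):
    """Toy diamonds: price roughly proportional to carat, plus noise."""
    xs = [rng.uniform(0.2, 2.0) for _ in range(n)]
    return xs, [4000 * x + rng.gauss(0, noise) for x in xs]

x_real, y_real = make(4000, 300)  # stand-in for the actual training data
x_syn, y_syn = make(4000, 400)    # noisier stand-in for the synthetic data
x_hold, y_hold = make(1000, 300)  # actual holdout set for scoring

results = {}
for label, (xs, ys) in [("real", (x_real, y_real)),
                        ("synthetic", (x_syn, y_syn))]:
    a, b = fit_line(xs, ys)
    results[label] = mae([a * x + b for x in x_hold], y_hold)
    print(label, round(results[label]))

# Naive baseline: always predict the mean of the training prices.
mean_price = sum(y_real) / len(y_real)
results["naive"] = mae([mean_price] * len(y_hold), y_hold)
print("naive", round(results["naive"]))
```

As in the benchmark reported above, the model trained on the (stand-in) synthetic data lands close to the model trained on real data, while both beat the naive baseline by a wide margin.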
In an upcoming post we will dive deeper into an example of synthetic sequential data, which actually turns out to be the prevalent form when it comes to privacy-sensitive user data. But that is a story for another day. For now, let’s simply generate a million more synthetic data diamonds and wallow in those for the time being.