Synthetic data is data that is artificially generated using algorithms. It can be used to augment existing data, create new data, and simulate future scenarios. Data scientists have been imputing and modelling data for decades for many purposes, including in market research – but synthetic data goes to the next level, purporting to represent an individual’s or a group’s attitudes or behaviors.
Synthetic data can offer many benefits for market research, including increasing sample size and diversity by mimicking hard-to-reach populations at low cost, creating new insights and solutions through predictive modelling, and speeding up the research process.
But synthetic data also poses major risks. In market research, synthetic datasets can introduce bias or distortions so that they don’t accurately reflect the characteristics and preferences of a target population. So, it’s critical to understand the use cases, the solution methodologies, and the evaluation frameworks before relying on synthetic data to inform key business decisions.
Synthetic data needs solid foundations
It is not often recognized that a key precursor to any synthetic data-generating algorithm is an abundant baseline of “real” data. As we demonstrated in our earlier article, Synthetic Data: Is it all it’s Cracked up to be?, exclusively relying on an off-the-shelf Large Language Model (LLM) is often a poor strategy. It’s vital to start with a high-quality data source that is very specific to the problem at hand, and use that to train a synthetic data-generating algorithm.
This is why Kantar invests heavily in the quality of our panels, which are the backbone of our data collection and synthesis. Our panels are carefully designed, recruited and maintained to ensure that they are representative, diverse, engaged, and compliant.
We use rigorous quality checks and controls to verify the identity, location, and behavior of our panelists, and use AI to detect and prevent any fraudulent or anomalous responses. By ensuring the high quality of our underlying panel data, we can train AI models to more reliably and accurately generate synthetic data that can mitigate the inherent risks we outlined earlier.
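As an illustration of the kind of automated check involved, the minimal sketch below flags anomalous respondents with an off-the-shelf outlier detector. The feature names (answer speed, straight-lining rate) are hypothetical examples of quality signals, and IsolationForest is a generic stand-in rather than a description of Kantar’s actual detection systems.

```python
# A minimal, generic sketch of flagging anomalous survey responses.
# Feature names are hypothetical quality signals, not Kantar's real checks.
import pandas as pd
from sklearn.ensemble import IsolationForest

responses = pd.DataFrame({
    "seconds_per_question": [12.1, 10.8, 1.2, 11.5, 0.9],   # very fast = suspect
    "straightlining_rate":  [0.10, 0.15, 0.95, 0.05, 0.90], # same answer everywhere
})

detector = IsolationForest(contamination=0.4, random_state=0)
responses["flagged"] = detector.fit_predict(responses) == -1  # -1 marks outliers
print(responses)
```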
What are some common use cases for synthetic data?
Although the term synthetic data is new, many of the use cases that come under this broad umbrella aren’t.
Indeed, the problem of “filling holes” in datasets by looking for information either within the dataset, or by fusing information from other datasets, is an age-old problem that we have been tackling at Kantar for decades, at scale. A common example of such a use case is when we join multiple datasets – for instance, attitudes from a survey with behaviors from a different panel. Or when we shorten surveys by not asking all questions of all respondents, but predicting some answers using machine learning, as in the sketch below. This means we are already well positioned to take on some of the new use cases currently under discussion.
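To give a flavor of how such prediction works, here is a minimal sketch: a model is trained on respondents who answered a given question and then fills in a modelled answer for those who were not asked it. The column names (q1, q2, q3, q7) and the scikit-learn model choice are purely illustrative assumptions, not a description of Kantar’s production pipeline.

```python
# A minimal sketch of "filling holes" in a survey dataset: respondents who
# were not asked a question get a modelled answer predicted from questions
# everyone answered. Column names are hypothetical, not a real schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_unasked_question(df: pd.DataFrame, asked_cols, target_col):
    """Train on respondents who answered target_col; predict for the rest."""
    missing = df[target_col].isna()
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(df.loc[~missing, asked_cols], df.loc[~missing, target_col])

    out = df.copy()
    out.loc[missing, target_col] = model.predict(df.loc[missing, asked_cols])
    out["modelled_" + target_col] = missing  # flag which answers are synthetic
    return out

# Hypothetical usage: q1-q3 asked of everyone, q7 only of half the sample.
# survey = impute_unasked_question(survey, ["q1", "q2", "q3"], "q7")
```

Flagging which fields are modelled, as the last column does here, matters in practice: downstream users need to distinguish collected answers from synthetic ones.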
There are three main use cases that we see in this field. For each one, we are conducting extensive and careful pilot testing with clients, as well as rigorous data cleaning, to assess the accuracy of the outcomes:
- Sample Boosting: it is possible to take a survey dataset in a particular category and boost it with more respondents in one or more subgroups that might, for example, be under-represented or expensive to recruit. If we think of a survey dataset as a table where the rows are respondents and the columns are survey questions, we are trying to synthetically create “new rows”, corresponding to respondents from small subgroups (a minimal sketch follows this list).
- Predictive Augmentation: survey length considerations often mean that we need to make difficult decisions about which questions we can accommodate. Can we fill in gaps in our data using responses from historical survey respondents, or indeed profiling data we already hold on our panel, to deliver additional (modelled) fields alongside the base survey data collected? This is essentially the prediction approach sketched above.
- Digital Twins: over time, we’ve built up a huge number of data points for individual respondents; at Kantar, we often have years of high-quality attitudinal and behavioral data on our most loyal panelists. Can we leverage this historical information to fine-tune AI models that then allow us to “extend” beyond previous survey questions into new categories, behaviors and topics? (See the second sketch after this list.)
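As a concrete illustration of sample boosting, the sketch below uses SMOTE-style interpolation to synthesize new rows for an under-represented segment. This is a deliberately simple stand-in for the more sophisticated generative approaches discussed later; the column names and the imbalanced-learn library choice are assumptions for illustration only.

```python
# A minimal sketch of sample boosting: synthesizing "new rows" for a rare
# subgroup via SMOTE interpolation. The segment column is hypothetical.
import pandas as pd
from imblearn.over_sampling import SMOTE

def boost_subgroup(df: pd.DataFrame, feature_cols, segment_col):
    """Synthesize new respondent rows until the rare segment matches the majority."""
    X, y = df[feature_cols], df[segment_col]  # numeric, pre-encoded responses
    X_boosted, y_boosted = SMOTE(random_state=0).fit_resample(X, y)
    boosted = pd.DataFrame(X_boosted, columns=feature_cols)
    boosted[segment_col] = y_boosted
    return boosted  # original respondents plus interpolated synthetic ones
```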
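For the Digital Twins use case, one common pattern is to convert a panelist’s response history into supervised fine-tuning examples for a language model. The record layout, prompt format and file name below are illustrative assumptions, a sketch of the general pattern rather than Kantar’s pipeline.

```python
# A minimal sketch of turning a panelist's survey history into fine-tuning
# examples for a "digital twin" model. All data shown is hypothetical.
import json

def history_to_finetune_rows(panelist_id, history):
    """history: list of dicts like {"question": ..., "answer": ...}."""
    profile = "; ".join(f"{h['question']} -> {h['answer']}" for h in history[:-1])
    target = history[-1]
    return [{
        "prompt": f"Panelist {panelist_id} previously answered: {profile}\n"
                  f"How would they answer: {target['question']}?",
        "completion": str(target["answer"]),
    }]

with open("twin_finetune.jsonl", "w") as f:
    for row in history_to_finetune_rows("P001", [
        {"question": "Preferred soft drink", "answer": "Brand A"},
        {"question": "Buys organic weekly", "answer": "Yes"},
        {"question": "Likely to try new flavor", "answer": "Probably"},
    ]):
        f.write(json.dumps(row) + "\n")
```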
How is Kantar leading the way in synthetic data?
We have talked about the paramount importance of high-quality foundational data. This is something that we possess in abundance. But data alone isn’t enough. Most existing methods to create synthetic data rely on sophisticated data science algorithms such as neural networks (in particular, generative adversarial networks), boosting, elastic nets and other machine learning techniques, and advanced econometric modelling. Some of the use cases, like Digital Twins, also require a deep understanding of foundation models and large language models – not just the ability to use them off-the-shelf, but to fine-tune and adapt them to new datasets in clever ways.
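To make the generative adversarial network idea concrete, here is a minimal sketch of the adversarial training loop applied to a table of numeric survey responses. Everything here, from the network sizes to the random stand-in data, is an illustrative assumption; production tabular GANs add substantial machinery such as conditional generation and careful encoding of categorical answers.

```python
# A minimal sketch of the GAN idea: a generator learns to produce rows a
# discriminator cannot tell apart from real (here, random stand-in) data.
import torch
import torch.nn as nn

n_features = 8   # hypothetical number of numeric survey columns
noise_dim = 16

generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

real_data = torch.randn(512, n_features)  # stand-in for scaled panel data

for step in range(1000):
    real = real_data[torch.randint(0, 512, (64,))]
    fake = generator(torch.randn(64, noise_dim))

    # Discriminator: tell real panel rows from synthetic ones.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: produce rows the discriminator accepts as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_rows = generator(torch.randn(100, noise_dim)).detach()
```

The design intuition is that the generator improves only by fooling a discriminator that has seen real panel data, which is exactly why the quality of that underlying data is so critical.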
At Kantar, these are capabilities we have years of experience with, and have already deployed as part of other AI tools such as LinkAI, ConceptEvaluate AI and our GenAI assistant KAiA. The combination of exceptional data assets and over a decade of building and deploying AI tools, especially in the context of predictive modelling, gives us an excellent starting point to tackle the three emerging use cases we have shared here.
What are our next steps?
We have identified synthetic data as one of the key strategic issues that Kantar and the industry are facing. This is a major focus of our recently announced AI Lab, a cross-Group forum that brings together our scale and expertise to accelerate R&D, and to build state-of-the-art AI and GenAI products and solutions across the Group. The AI Lab has inbuilt mechanisms to ensure rigor and scalability of our AI solutions, and provides an opportunity to share best practices, learnings and feedback. We have also been exploring partnerships and collaborations with external providers, as well as engaging with our clients and stakeholders to understand their needs and expectations.
Synthetic data has lots of potential, but the industry has a lot more work to do to build technically and methodologically sound, enterprise-ready solutions. While we intend to take full advantage of the latest algorithms and technologies, as responsible leaders in the industry we also need to address the major questions and challenges that synthetic data presents, such as its accuracy and bias, its feasibility in particular situations, and its robustness. These are the questions we are actively exploring and piloting with our clients, and we will issue further guidance as it becomes available. Stay tuned.
Looking to gain deeper insights on AI? Connect with Kantar today and discover how our expert analysis and solutions, including those powered by AI, can empower your growth.