There are significant bias and infrastructure challenges inherent to big-data marketing analytics solutions built on cookies or cookie-replacement IDs. In this article we will explore how panel-based data sets avoid these challenges to enable more accurate, faster, and more efficient insights. As cookies go away, marketers are searching for new approaches to user-level data sets for analysis. In my previous article we defined two key options: cookie-replacement and panel-based solutions, and highlighted some targeting limitations that cookie-replacement solutions struggle with. Now we will dive deeper into the drawbacks of these solutions, and why panel-based approaches are superior.
Selection bias: What it is, and why it’s bad
A major issue for “all-inclusive” data sets built on cookies or other IDs is selection bias, which occurs when a sample is not representative of the population because of how the sample was selected.
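A toy simulation makes the mechanism concrete. The population mix, purchase rates, and cookie-pool inclusion probabilities below are invented for illustration; the point is only that when one segment is more likely to end up in the data set, even a very large sample produces a skewed estimate.

```python
import random

random.seed(7)

# Toy population: 30% "young/tech-savvy" users, 70% everyone else,
# with different (assumed) purchase rates per segment.
population = (
    [{"segment": "young", "purchase_rate": 0.20}] * 30_000
    + [{"segment": "other", "purchase_rate": 0.05}] * 70_000
)

true_rate = sum(p["purchase_rate"] for p in population) / len(population)

# A cookie pool that is twice as likely to contain young/tech-savvy users
# over-represents them in any "all-inclusive" sample drawn from it.
def in_cookie_pool(person):
    p = 0.8 if person["segment"] == "young" else 0.4
    return random.random() < p

sample = [p for p in population if in_cookie_pool(p)]
sample_rate = sum(p["purchase_rate"] for p in sample) / len(sample)

print(f"true purchase rate:     {true_rate:.3f}")    # 0.095
print(f"biased sample estimate: {sample_rate:.3f}")  # noticeably higher
```

Note that the sample here is over half the population, yet the estimate is still wrong: scale does not cure selection bias.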
For big-data, cookie-replacement solutions, bias is introduced at every step of the data integration process, and many companies have dozens of integrations that introduce bias.
When your first-party data is onboarded to digital IDs like cookies, the onboarding partner can only match to the IDs already in its database. Unfortunately, these databases are built on black-box approaches. Is the database overweighted to younger or more tech-savvy populations? What about renters vs. homeowners? If there were clear information on how the sample was skewed, marketers could account for it; unfortunately, these systems neither disclose nor correct for the biases known to exist in their populations.
From here the process compounds bias, as onboarding providers typically make assumptions and ‘loosely’ match customer data to a questionable ID to achieve the scale marketers are looking for. For example, does this digital ID belong to me, or to someone in my household who shares my physical address or a credit card? Even with deterministic authentications like user IDs and password-protected logins, there is significant bad data. More than 20% of Netflix users share a password; that’s more than 12 million inaccurate matches in the U.S. alone. How many other bad connections are made?
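A minimal sketch of the ‘loose’ matching problem, using made-up records: if an onboarder broadens reach by matching on a shared household attribute such as address, a customer is “deterministically” linked to devices that belong to other people in the household.

```python
# Hypothetical CRM records and ID-graph entries (all values invented).
crm = [
    {"email": "pat@example.com", "address": "12 Oak St"},
    {"email": "sam@example.com", "address": "98 Elm Ave"},
]
id_graph = [
    {"cookie": "c-101", "address": "12 Oak St"},  # Pat's laptop
    {"cookie": "c-102", "address": "12 Oak St"},  # a roommate's phone
    {"cookie": "c-103", "address": "98 Elm Ave"},
]

# "Loose" match: join on household address to maximize scale.
matches = [
    (person["email"], device["cookie"])
    for person in crm
    for device in id_graph
    if person["address"] == device["address"]
]

# Pat is linked to the roommate's phone as well as their own laptop:
print(matches)
# [('pat@example.com', 'c-101'), ('pat@example.com', 'c-102'),
#  ('sam@example.com', 'c-103')]
```

The extra match inflates reported scale while silently polluting every downstream analysis that treats each cookie as one known person.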
From there, third-party enrichment uses approximate matches to provide additional attributes for segmentation, introducing selection bias again. The very point of enriching your customer data is to improve your understanding of your customers, but when the underlying data set is inaccurate, and additional black-box mechanisms further dilute the precision, it’s no wonder marketers struggle to find insights.
Finally, ad networks and publishers further wash out the sample to deliver the desired volume of impressions. This results in inaccurate data on who is actually being targeted with ads and included in the sample for modeling media effectiveness. All of these steps alter the sample and make it impossible for big-data solutions to deliver a truly representative one. Whether expanding a user-level ID to members of a household based on IP-address matching, or simply using outdated data, significant mismatches happen everywhere.
If we are using a biased sample for analytics, we are getting inaccurate results. A study from Flatfile showed that nearly 80% of marketers have trouble with customer data accuracy.
The majority of data linkages are hypothetical
Many data linkages connect offline data to online ‘cookie pools’ built from cookies or some other replacement ID, but there is often no corroboration against an authoritative offline dataset. Because providers don’t have independent data with which to “clean” your records, they rely on the data you already have, which often contains incomplete and outdated information. As a result, they find fewer accurate connections to online IDs. Definitive email-to-cookie linkages typically generate only ~30% match rates, which is simply too small to be useful. Onboarding providers try to extend their reach by supplementing your data with statistical modeling (which is really another way of saying “best guesses”). Naturally, these hypothetical linkages produce a higher rate of false positives. Even if you’re told that you have a high match rate, 50–70% of your audience is not an accurate match.
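The arithmetic behind that claim can be sketched with back-of-envelope numbers. The precision figures below are assumptions chosen to illustrate the mechanism, not measured values: a ~30% deterministic match rate padded out with modeled “best guesses” can report a high match rate while roughly half the matched audience is wrong.

```python
audience = 1_000_000

# Deterministic linkages: ~30% match rate (per the text),
# with an assumed precision — even logins carry some error.
det_matches = int(audience * 0.30)
det_precision = 0.90

# Modeled "best guess" linkages fill the gap to a reported 80% match rate.
reported_match_rate = 0.80
modeled_matches = int(audience * reported_match_rate) - det_matches
modeled_precision = 0.25  # assumption: probabilistic links are often wrong

accurate = det_matches * det_precision + modeled_matches * modeled_precision
total = det_matches + modeled_matches
blended_precision = accurate / total

print(f"reported match rate: {reported_match_rate:.0%}")
print(f"matches that are actually correct: {blended_precision:.0%}")
# ~49% correct under these assumptions, i.e. about half are bad matches
```

Changing the assumed precisions moves the result around, but any realistic values for modeled linkages land in the 50–70% inaccurate range the text describes.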
Media Post covered this phenomenon and estimated that 70% of matches are false positives. Insights from analytical models are only as good as the data feeding them (we’ve all heard “garbage in, garbage out”), and Adweek confirms this is a major issue for marketers.
Kantar’s panel-based solutions use numerous checks and balances to ensure the sample is accurate and representative of the market. Many data providers interlock their panels with the U.S. Census, ensuring that each zip code, age group, household income category, and other characteristics mirror the national profile. Most providers also run detailed quality checks on the survey responses themselves, confirming that responses come from actual humans, not bots. Additionally, data providers work with many different sources of panelists to balance the data inputs and ensure consistency over time, through the law of large numbers. Because these panels are constantly in flux, with new panelists joining the data set on a weekly basis, the data stays accurate and reflective of the marketplace, not skewed by a single respondent with undue power from weighting, as was often the case with TV households measured with insufficient sample sizes.
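Interlocking a panel with census targets is, at its simplest, post-stratification weighting. The sketch below uses invented age-group targets and a deliberately youth-skewed panel to show the core step: each cell’s weight is the target share divided by the observed share, so weighted tallies mirror the benchmark. (Real providers weight on many interlocked dimensions at once, typically via raking.)

```python
# Illustrative targets only, not real census figures.
census_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# A panel skewed young relative to those targets.
panel = [
    {"id": i, "age_group": g}
    for i, g in enumerate(
        ["18-34"] * 500 + ["35-54"] * 300 + ["55+"] * 200
    )
]

n = len(panel)
panel_share = {
    g: sum(p["age_group"] == g for p in panel) / n for g in census_share
}

# Weight = target share / observed share, per cell.
weights = {g: census_share[g] / panel_share[g] for g in census_share}
for p in panel:
    p["weight"] = weights[p["age_group"]]

# Weighted shares now mirror the census targets.
weighted = {
    g: sum(p["weight"] for p in panel if p["age_group"] == g) / n
    for g in census_share
}
print(weighted)
```

Crucially, this correction is only possible because the panel’s composition is known; the black-box cookie pools described earlier expose no such demographics to weight against.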
The challenge of building big data infrastructure
The International Data Corporation estimates that the world’s data will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025 – a compound annual growth rate of 61%. This growth is unmanageable. Not surprisingly, 42% of brands say that inaccurate customer data is their biggest barrier to effective multichannel marketing.
There are vast engineering requirements to building big data sets for analysis. Building data lakes, setting up governance rules, and joining, normalizing, and formatting data is the work of multiple full-time employees. How many companies have the resources to invest in that level of data-warehousing infrastructure and then maintain and continually improve it? And this is before analysis even begins.
When building data sets for analytics, any company that tries to bring in a complete picture of all customers faces a daunting challenge simply with the size and scope of the data being generated.
The importance of data governance
Another key challenge brands are experiencing is how to unify the massive amounts of customer data that they collect across all channels and touchpoints. Right now, that data ends up in silos, where it's difficult or impossible to use effectively. A study cited by Marketing Profs noted that 62% of marketers don’t have data in the right formats to use.
For example, only 38% of marketers say they have the customer segment and persona data they need, in the right format, to make good marketing decisions. The Capgemini Research Institute CMO survey found that most marketers are struggling to make sense of the data they already have. Although working with an external vendor can help brands get a better handle on their data, the engineering and data science resources required to capture, normalize, and maintain data sets of this size make these types of solutions cost-prohibitive for all but the largest advertisers.
One senior analytics leader from the consulting industry noted the following about his experiences working with Fortune 500 clients: “Most organizations are woefully inept when it comes to data governance. Formats, documentation, access rights, quality assurance, validation and consistency across sources are all pressing issues that make a big data approach difficult to pull off in even the best of situations... it's actually kind of silly to not leverage professionally managed panels.”
Are we there yet?
The complexity of assembling a data set of all available consumers creates a monumental task, even for the best external partners. Even setting aside that the data set is biased, it simply takes too much time to gather the data, format it, normalize it, join it with other data, and so on. These projects almost always stretch the original timeline; it’s not uncommon for user ID-based projects to take six months or more just to assemble the data. All the while, the brand is paying for insights it doesn’t have yet and missing opportunities to optimize marketing. In the real world of business, do you have that time to wait?
Then there’s privacy
In the age of the General Data Protection Regulation (GDPR), anyone working with customer data must create a transparent permission and control experience for consumers. What explanatory copy do you use? Is it clear enough? Are you popping up to ask for acceptance? How can consumers access information on how their data is used?
Brands need to constantly review their end-to-end data collection experience to see where it can be improved. Often, privacy and data use terms are intentionally obscured with long and hard-to-understand writing that customers must scroll through to find the box to check. Is that really consent?
These are all poor customer experiences, both from a communication and trust standpoint and from the perspective of efficiency. Imagine if a customer walked into a store and the security guard asked for identity documents before they could shop, and then required them to sign an agreement they didn't understand before allowing them to look at merchandise.
Just as cookies operated in the background with limited explicit permissions, most cookie replacement solutions rely on the same approach, banking on customer laziness, not true informed consent for tracking.
Customers are noticeably on edge, and the implementations of GDPR and the California Consumer Privacy Act (CCPA) have absolutely worsened their online experience, confusing them more than enlightening them. Why should they trust any entity collecting their data these days?
Given the inherent selection bias, overwhelming logistics, and privacy concerns intrinsic to cookie-replacement approaches to building data sets for analytics, there are just too many pitfalls with this type of solution. In the next part of our series, we will look at why “small data”, or permission-based panels, are a much more viable alternative now and in the future, and examine some of the specific strengths permission-based panels bring to delivering superior insights for marketing analytics.