#10 - Berkson's Paradox, Single Customer View, Data Quality and Big Data

Have your pen and paper ready, as there is a lot to unpack in this week's newsletter

Jan 14, 2024

Berkson’s paradox

Our limited experiences sometimes lead us to draw wrong conclusions about things, for example:

Nice men tend to be unattractive
Heavy smokers suffer less severe Covid-19 symptoms
The more firefighters deployed, the more injuries reported

These conclusions are incorrect; they arise from misinterpreting the collected statistics. They are examples of Berkson’s paradox.

Berkson’s paradox refers to the false conclusion that two unrelated characteristics are negatively correlated. The fallacy is driven by sampling bias: the sample shows a negative relationship between two characteristics that does not exist in the whole population.

An example is how sampling affects the study of the relationship between SAT Verbal and Math scores:

When the sample is drawn from elite universities, only admitted students are included (the orange section). In this subset, there is a negative relationship between Math and Verbal SAT scores. However, in the entire population, the relationship between them is actually positive. (Allen Downey, Probably Overthinking It)
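The selection effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the scores are standardised numbers rather than real SAT data, "admission" is modelled as being in the top 5% by combined score, and the names `pearson` and `simulate` are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20_000, admit_fraction=0.05, seed=0):
    """Return (population correlation, correlation among admitted students)."""
    rng = random.Random(seed)
    math_scores, verbal_scores = [], []
    for _ in range(n):
        # Assumption: a shared "ability" factor makes the two scores
        # positively correlated in the whole population.
        ability = rng.gauss(0, 1)
        math_scores.append(ability + rng.gauss(0, 1))
        verbal_scores.append(ability + rng.gauss(0, 1))

    # Assumption: admission depends only on the combined score.
    totals = sorted(
        (m + v for m, v in zip(math_scores, verbal_scores)), reverse=True
    )
    cutoff = totals[int(n * admit_fraction)]
    admitted = [
        (m, v) for m, v in zip(math_scores, verbal_scores) if m + v > cutoff
    ]

    admitted_math = [m for m, _ in admitted]
    admitted_verbal = [v for _, v in admitted]
    return (
        pearson(math_scores, verbal_scores),
        pearson(admitted_math, admitted_verbal),
    )


population_r, admitted_r = simulate()
print(f"Whole population: r = {population_r:.2f}")
print(f"Admitted only:    r = {admitted_r:.2f}")
```

Under these assumptions the first correlation comes out positive and the second negative. Among students who all cleared a high combined bar, a very high Math score leaves less room for the Verbal score to be high as well, and vice versa.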

Berkson’s paradox can lead both humans and machines to wrong decisions. Machines are especially at risk, because models trained on biased samples persistently reinforce biased predictions (as seen in Amazon's AI recruiting tool).

With this paradox in mind, ask yourself before making a decision: “Are my samples biased?”

This Week’s Strategy

Single Customer View

If you are a data leader aiming to increase your organisation's data maturity, Single Customer View (SCV) is a key strategic capability to consider.

With SCV, you can expect tremendous benefits [1]:

More effective and personalised communication with customers
Better product offerings based on customers’ profiles
Higher retention rate
More revenue

The business case for SCV is also strengthened when you consider the needs of business-facing departments, such as Marketing or Customer Service:

“You can’t transform something you don’t understand. If you don’t know and understand what the current state of the customer experience is, how can you possibly design the desired future state?” - Annette Franz - author of Customer Understanding and Built To Win.

Do note that SCV relies on strong data governance and good data quality. It also requires integrating data from multiple systems, reconciling formatting mismatches, removing duplicates and designing a data model that reflects the true business logic.
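To make those steps concrete, here is a minimal sketch of merging customer records from two systems into one profile per customer. Everything in it is an assumption for illustration: the `crm` and `billing` sources, the field names, the choice of email as the matching key, and the "first non-empty value wins" rule. A real SCV would need far more careful identity matching.

```python
def normalise_email(email):
    """Reconcile formatting mismatches: trim whitespace, lower-case."""
    return email.strip().lower()


def normalise_phone(phone):
    """Keep digits only, so '+44 20 7946 0000' and '442079460000' match."""
    return "".join(ch for ch in phone if ch.isdigit())


def build_single_customer_view(*sources):
    """Merge records from several systems into one profile per email."""
    customers = {}
    for records in sources:
        for record in records:
            key = normalise_email(record["email"])
            # Records sharing a normalised email collapse into one profile.
            profile = customers.setdefault(key, {"email": key})
            for field, value in record.items():
                if field == "email" or value in (None, ""):
                    continue
                if field == "phone":
                    value = normalise_phone(value)
                # Assumption: the first non-empty value seen wins.
                profile.setdefault(field, value)
    return customers


# Hypothetical records from two hypothetical systems.
crm = [
    {"email": " Jane.Doe@Example.com ", "name": "Jane Doe",
     "phone": "+44 20 7946 0000"},
]
billing = [
    {"email": "jane.doe@example.com", "phone": "", "plan": "premium"},
    {"email": "sam@example.com", "name": "Sam", "plan": "basic"},
]

scv = build_single_customer_view(crm, billing)
for profile in scv.values():
    print(profile)
```

The two Jane records end up as a single profile that combines the name and phone number from one system with the plan from the other, which is the essence of a single customer view.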

[1] Summary Insights from Data and Analytics Strategy for Business, chapter 09 - A single customer view, Simon-Aspen Taylor

This Week’s Operation

To be influential, focus on the positive side of data quality

“When looking for attention and budget, it is often more effective to sell the positive benefits of data quality internally: what can be achieved with high-quality data that is currently not possible” - Simon-Aspen Taylor, Data and Analytics Strategy for Business.

Emphasising the downside of poor data quality is certainly effective, but it also makes data quality improvement a chore rather than something one can enjoy doing.

The positive side of improving data quality can be motivating for your data team as well. It’ll improve the usability of the dataset, leading to more effective decision making, and more revenue. If these positive impacts can be quantified, they will build a strong business case for a data quality quick-win project.

Simon-Aspen Taylor shares an example of data quality as a lever for improving results in a CRM team. When testing the quality of the team's customer database, he discovered that only 32% of customers had email, post or SMS details associated with them. By cleaning up the data, he could help the team contact 35% more users. Assuming a 4% conversion rate among the contacted customers, he calculated that the additional revenue attributable to improved data quality would be £4 million per annum. Quite a significant number, and surely great motivation for data teams to love their tedious data cleansing work a little more.
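If you want to build a similar business case for your own data, the arithmetic can be sketched as below. The example does not state the size of the customer base or the value of a conversion, so the inputs here are hypothetical placeholders and are not the figures behind the £4 million estimate; only the 4% conversion rate is taken from the example.

```python
def additional_annual_revenue(newly_contactable, conversion_rate,
                              revenue_per_conversion):
    """Extra yearly revenue from customers who become reachable after cleanup."""
    return newly_contactable * conversion_rate * revenue_per_conversion


# Hypothetical inputs: replace them with figures from your own database.
estimate = additional_annual_revenue(
    newly_contactable=100_000,    # customers reachable only after the cleanup
    conversion_rate=0.04,         # the 4% assumption from the example above
    revenue_per_conversion=50.0,  # hypothetical yearly value of one conversion
)
print(f"Estimated additional revenue: £{estimate:,.0f} per annum")
```

Keeping each assumption as a named input makes it easy to show stakeholders how sensitive the business case is to the conversion rate or to the number of customers recovered.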

This Week’s Impact

It’s 7am on a Sunday. John hears the alarm on his phone. He hits the snooze button, then goes back to sleep. Half an hour later, he opens his eyes, grabs his phone, opens the Reddit app and browses for a while. At 8am, he gets out of bed and plays his favourite morning playlist on Spotify while going through his hygiene routine.

It’s 8:30 am, time for the morning run. It’s lightly raining outside, but that doesn’t stop John. He runs for about an hour, heading towards the south of the city where one of his best friends lives. At the end of the run, he picks up his phone and calls his friend. “Hey mate, you up yet? Want to grab a cup of coffee?” he asks.

Meanwhile, at a data centre somewhere in the US, roughly 10-50MB of data is recorded and attributed to customer ID 9x34 (John’s ID). The data includes where John has been (longitude/latitude), what he did (server IDs of the apps he interacted with, timestamps of texts and calls), and the context of his activities (weather conditions). It may even categorise John as one of the fitness lovers who run even in the rain. It’s a binary journal of John’s Sunday morning.

John is a hypothetical customer of Sprint Corporation (which has since merged with T-Mobile). Before the merger, the company provided mobile network services to 55 million customers and recorded 60TB of new data every day [1].

The recorded data combined first-hand, network-authenticated data (user information, billing information, long/lat, events) with third-party data coming from websites and apps. It provided a comprehensive picture of customers’ locations, behaviours and demographics. [2]

Exploiting Mobile Big Data: Sources, Features, and Applications - Cheng et al - https://www.semanticscholar.org/paper/Exploiting-Mobile-Big-Data%3A-Sources%2C-Features%2C-and-Cheng-Fang/c78e16a91eabc65edfb869151b2c7ce801a8a36d

Initially used to better understand network traffic performance, the data later became crucial for Sprint’s targeted mobile advertising subsidiary, Pinsight Media. Back in 2016, the company was already serving 6 billion ad impressions per month [3]. Now a part of T-Mobile’s marketing empire, it contributes to T-Mobile’s $78B annual revenue [4]. The revenue from mobile advertising is estimated to be in the hundreds of millions of dollars.

Putting aside the elephant in the room (GDPR and customer privacy), Sprint is an impressive example of how customer data can be monetised, create a competitive advantage, and even completely change a company's business model.

----

Insights from:

[1] Bernard Marr, Big Data In Practice, Chapter 26

[2] T-Mobile is now selling app usage data to advertisers, but iPhone users are in the clear, https://9to5mac.com/2022/06/27/t-mobile-selling-user-data/

[3] Big Data At Sprint: Turning Mobile Network Data Into Business Value, https://www.forbes.com/sites/bernardmarr/2016/05/05/big-data-at-sprint-turning-mobile-network-data-into-business-value/

[4] Revenue for T-Mobile US, https://companiesmarketcap.com/t-mobile-us/revenue/

That’s it for this week! If you enjoyed or were puzzled by the content, please leave a comment so we can continue the discussion. Throw in a like or a share as well if you know someone who may enjoy this newsletter :)

Data & Beyond