October 2, 2015

Making big data into little data


Are there good ways to shrink ‘big’ data without destroying potential insights?

‘Big’ data is the subject of much mythologising throughout the marketing community, but in this short paper we want to ask a simple question: can analysis do for ‘big’ data what black holes do for matter? In other words, can it maintain the mass but in a much smaller space?

The black hole at the centre of our galaxy has been on a near-starvation diet for almost a million years—but now it’s time for a snack. Scientists in Garching, Germany, are closely watching a rare event some 26,000 light years away: a super-massive black hole in the act of devouring a huge gas cloud. It’s providing the first-ever glimpse of how a black hole uses its massive gravitational power to pull in and consume interstellar materials—a little understood phenomenon. Wall Street Journal 19 August 2013

When we are dealing with ‘big’ data, many of us fear that means unmanageably large datasets, multiple files, and a galactic level of complexity, but that need not necessarily be the case.

So here we take a look at how to make ‘big’ data into ‘little’ data!


What does ‘big’ mean in ‘big data’?

The definition of ‘big’ data according to a Google search is:

“Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.”


According to the Oxford Dictionary, there are eight definitions of the word ‘big’. It can mean large in size, older, important, ambitious, popular, enthusiastic or generous. ‘Big’ when applied to data seems to cover several of these definitions.

Our Google definition of big data suggests that the data sets are “extremely large”, and the inclusion of “trends” suggests that it may contain a time element as well (i.e. historical data).

So let us tackle the ‘large’ element of this. Again let’s refer to our Google definition (the last definition, I promise), which gives us two options for ‘large’: of great size, or of wide scope. Let’s assume these relate to the ‘length’ of a dataset and the ‘number’ of datasets respectively.

We have all dealt with long datasets for years. The electoral roll is tens of millions of records long, but it only has a few name and address fields on it, so it is not considered big. A bank may have tens of billions of recorded transactions per year, which is considered large, but this is not ‘big data’ as it is regarded as only one source of data.

Here we can start to see what ‘big’ actually means. It refers to the number of data sources that need pulling together, as well as their size and, potentially, their time dimension.


Creating actionable insight from big data

So the focus of a big data project is often going to be on merging multiple sources of data together (which we discussed in a previous article, “The case for an Analytical Data Mart”), and then successfully reducing them through analysis. The output may be an individual-level file of several million customers, with information drawn from several sources of data.

As we said earlier, this problem is not a new one. For years companies have been gaining benefit from large datasets. Home shopping companies, for example, often summarise their large-volume transactional data into a very condensed customer-level segmentation.

For instance, a recency, frequency and value (RFM) segmentation of their most recent customers could look like this:

                        AOV (£)
Number of transactions  £0-£50  £51-£100  £100-£500  £500+
1-2                      8,842       465      1,353  4,653
3-5                      4,353    21,315      3,155  3,215
6+                       3,541       480        654    544


So in this case the segment “most recent customers with an AOV over £500 and 1-2 transactions” has 4,653 people in it.
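
As a minimal sketch of how transaction-level data condenses into a grid like this, the Python below counts each customer's transactions, works out their average order value (AOV), and tallies customers per cell. The record layout, function names and sample data are illustrative assumptions. The band edges follow the table, except that an AOV of exactly £100 is placed in the £51-£100 band by assumption, since the table leaves that boundary ambiguous. Recency is left out for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical transaction records: (customer_id, order value in pounds).
transactions = [
    ("c1", 20.0), ("c1", 40.0),
    ("c2", 600.0),
    ("c3", 80.0), ("c3", 90.0), ("c3", 70.0),
    ("c4", 550.0), ("c4", 650.0),
]


def frequency_band(n_orders):
    """Map a transaction count to the row labels used in the table."""
    if n_orders <= 2:
        return "1-2"
    if n_orders <= 5:
        return "3-5"
    return "6+"


def aov_band(aov):
    """Map an average order value to a column label (boundary handling assumed)."""
    if aov <= 50:
        return "0-50"
    if aov <= 100:
        return "51-100"
    if aov <= 500:
        return "100-500"
    return "500+"


def segment_counts(records):
    """Return a Counter of customers per (frequency band, AOV band) cell."""
    orders = defaultdict(list)
    for customer_id, value in records:
        orders[customer_id].append(value)

    cells = Counter()
    for values in orders.values():
        aov = sum(values) / len(values)
        cells[(frequency_band(len(values)), aov_band(aov))] += 1
    return cells


grid = segment_counts(transactions)
for cell, customers in sorted(grid.items()):
    print(cell, customers)
```

Eight transaction rows become three populated cells here; the same reduction applied to millions of transactions still yields a grid of only twelve cells.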

From there you may decide to produce a profile by gender:

Gender  Segment (%)  Base file (%)  Index
Male             80             50    160
Female           20             50     40


In this case we can see that males are over-represented in this particular segment, at 80% compared to 50% in the base file. If the more active, high-value segments contain more females, this may suggest that the company is not communicating with males as well as it does with females.
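
The index figures in the profile are the segment percentage divided by the base file percentage, multiplied by 100, so an index of 100 means the segment looks just like the base. A small Python sketch of that arithmetic follows; the function name is an illustrative assumption.

```python
def profile_index(segment_pct, base_pct):
    """Index a segment percentage against the base file (100 = same as base)."""
    if base_pct == 0:
        raise ValueError("base percentage must be non-zero")
    return round(segment_pct / base_pct * 100)


# (segment %, base file %) pairs taken from the gender profile.
profile = {"Male": (80, 50), "Female": (20, 50)}
for gender, (segment_pct, base_pct) in profile.items():
    print(gender, profile_index(segment_pct, base_pct))
# Male 160
# Female 40
```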

In another example, from the life assurance industry, we found that customer and prospect data covering a seven-year period and ten million records, and containing every step in the journey from application to underwriting and policy, was held in 130 separate tables. Through analysis we were able to simplify this to a single Analytical Data Mart containing fewer than 200 columns.

Within that, for example, multiple attempts by a prospect to apply and purchase a policy could be reduced to a single row. In another section we were able to categorise thousands of different types of ‘impairments’ (what we would call health problems) and record them as a count by category.
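
To illustrate the kind of reduction described here, the sketch below collapses several application attempts into one row per prospect and records impairments as a count by category. It is not the actual data mart design: the field names, statuses, impairment names and category lookup are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical application attempts: one record per attempt.
attempts = [
    {"prospect_id": "p1", "date": "2014-01-10", "status": "abandoned",
     "impairments": ["asthma"]},
    {"prospect_id": "p1", "date": "2014-03-02", "status": "policy issued",
     "impairments": ["asthma", "angina"]},
    {"prospect_id": "p2", "date": "2014-02-20", "status": "declined",
     "impairments": ["type 2 diabetes"]},
]

# Hypothetical lookup from a detailed impairment to a broad category.
IMPAIRMENT_CATEGORY = {
    "asthma": "respiratory",
    "angina": "cardiac",
    "type 2 diabetes": "metabolic",
}


def one_row_per_prospect(records):
    """Collapse all attempts by a prospect into a single summary row."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["prospect_id"]].append(record)

    rows = {}
    for prospect_id, history in grouped.items():
        history.sort(key=lambda r: r["date"])  # ISO dates sort chronologically
        # Count each distinct impairment once, however often it was declared.
        distinct = {name for r in history for name in r["impairments"]}
        by_category = Counter(
            IMPAIRMENT_CATEGORY.get(name, "other") for name in distinct
        )
        rows[prospect_id] = {
            "attempts": len(history),
            "first_attempt": history[0]["date"],
            "latest_status": history[-1]["status"],
            "impairments_by_category": dict(by_category),
        }
    return rows


rows = one_row_per_prospect(attempts)
for prospect_id, row in sorted(rows.items()):
    print(prospect_id, row)
```

Counting distinct impairments across all attempts is a design choice made for this sketch; counting only those declared on the latest attempt would be equally defensible.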


So how can you shrink your ‘big’ data into manageable insight?

There is, we suggest, a relatively straightforward pathway, of which these are some of the key steps:

  • Start by deciding what tasks the data is expected to perform, from managing individual customers in different ways to producing summary reports that improve your ability to make evidence-based decisions
  • Then source the data you will need from amongst the plethora of candidate ‘big’ data tables
  • Design, where possible, a single table or Analytical Data Mart to hold the data you do need, including derived data elements like the RFM described above
  • Set up the feeds you need from the ‘big’ data
  • Publish the customer table with refined insights and summaries to the business users
  • And keep the design flexible so that new insights or data sources can easily be accommodated
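
One way to picture the last few steps is a customer table assembled from a list of feeds, where each feed contributes its own columns. The Python below is only a sketch under that assumption: the feed functions, column names and sample values are hypothetical, and a real data mart would normally live in a database rather than in memory.

```python
from collections import defaultdict


def transactions_feed():
    """Hypothetical feed: summary columns derived from transaction data."""
    return {"c1": {"orders": 2, "aov": 30.0}, "c2": {"orders": 1, "aov": 600.0}}


def profile_feed():
    """Hypothetical feed: demographic columns from a second source."""
    return {"c1": {"gender": "F"}, "c3": {"gender": "M"}}


# Adding a new data source later means appending one more function here.
FEEDS = [transactions_feed, profile_feed]


def build_mart(feeds):
    """Merge every feed into a single row of columns per customer."""
    mart = defaultdict(dict)
    for feed in feeds:
        for customer_id, columns in feed().items():
            mart[customer_id].update(columns)
    # Customers absent from a feed simply lack that feed's columns.
    return dict(mart)


mart = build_mart(FEEDS)
for customer_id, row in sorted(mart.items()):
    print(customer_id, row)
```

Because each feed is independent, a new insight or source can be added without redesigning the table, which is the flexibility the final step asks for.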