
Predictive Analytics in Marketing – Case Study 1: Lead Generation for SaaS and Leaks from the Future

This is the first of many Case Studies that the Xpanse AI team will share on this blog about applications of Predictive Analytics in Marketing.

We are reminiscing about our past projects – the good, the bad and the ugly, executed in different workplaces – in the hope that they will give Marketing Teams and their Data Scientists some ideas on how to use Predictive Analytics to improve their business practice. We are also open to snarky comments from “been there, done that” veterans if they notice bugs and errors in the projects we describe.

The names of the victims are changed but all stories are real.

We hope you will find them useful.

Key Learning from this project: Beware of Leaks from the Future – they are out to get you.

Business Understanding

Some years ago one of our Directors successfully sold the idea of Lead Conversion modelling to a SaaS company. The cost of the project had already been set at 40 man-days when it landed in our laps to deliver.

Yeah – negotiate the effort instead of estimating it – a fundamental principle of best-practice Project Management and a pillar of modern consultancy (sarcasm emoji here).

Anyway, after the first meeting we had a rough idea of what was going on. The Client was selling virtual fax services. You got a phone number and could send and receive faxes online without buying a fax machine. Pretty neat.

The Problem:

  • hundreds of thousands of potential leads captured on the “Free Tier”;
  • a low conversion rate to the paid tiers;
  • a recently set up Call Centre operation of 10 people that was barely covering its own costs with the sales it was making.

You get the picture. Time to look at the data.

Data Understanding

The key step for any Predictive Analytics project is to define the training Target: which users should be tagged (labelled) as “good profiles” and which as “bad profiles”.

If this sounds like an insignificant step, imagine that it is the exact equivalent of aiming a gun when you have only one bullet. Better get it right, as a do-over will cost you half of the project.

There is usually more than one way to define the Target. This time we decided to use the simplest possible definition:

  • Good profiles: users sent to Call Centre who converted to a paid tier.
  • Bad profiles: users sent to Call Centre who stayed on the free tier.
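The Target definition above can be sketched in a few lines of Python. This is only an illustration: the `User` record and its two flags are hypothetical names, not the Client's actual schema. The point worth noticing is that users who were never sent to the Call Centre get no label at all and stay out of the training population.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    user_id: str
    sent_to_call_centre: bool  # hypothetical flag: was the lead handed to the Call Centre?
    converted_to_paid: bool    # hypothetical flag: did the user move to a paid tier?


def label_target(user: User) -> Optional[int]:
    """Return 1 for a good profile, 0 for a bad profile, None if out of scope."""
    if not user.sent_to_call_centre:
        return None  # never contacted: excluded from training
    return 1 if user.converted_to_paid else 0


# Hypothetical users, for illustration only
users = [
    User("u1", True, True),
    User("u2", True, False),
    User("u3", False, False),
]
labels = {u.user_id: label_target(u) for u in users}
```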

The database consisted of a number of tables, of which we focused on the following:

  • Pre sign-up information (acquisition channel, campaign, etc)
  • Sign-up information (who, what industry, geography, role, etc)
  • Outbound usage information (all activities on the account related to sending faxes)
  • Inbound usage information (all activities on the account related to receiving faxes)
  • Visits to the service and clicking patterns outside the faxing facilities
  • Previous tiers, e.g. if the account used to be paid and moved to free
  • CRM data – a 360 view of the customer.

Data Preparation

So far – so good. We started the process of Feature Engineering, i.e. converting the non-aggregated signals and data-points into a cohesive Modelling Dataset.

The process above portrays a typical Predictive Analytics delivery. The fun part (Model Build with Machine Learning tech) is actually very short, with most of the effort going into integrating the data into the Modelling Dataset format.

5 days into this process we decided to run initial models and see what happens.

“Knowingly using Leaks from the Future during modelling is playing insider trading with Nature”

The predictive accuracy of our model was 100%!

So... did we just break the world record for the most powerful model ever?

Unlikely. When you see results like that, it can only mean two things: either your Target is messed up or you have “Leaks from the Future”. Or both.

What are “Leaks”? They are pieces of information that were generated AFTER the target event happened but are fed to the model as inputs, falsely pretending to have existed BEFORE the target event.

Leaks are a curse, an arch-nemesis of Predictive Modelers. Sometimes they pour into your model in broad daylight and on other occasions they sneak in hidden behind other inputs.

Knowingly using Leaks in your models is playing insider trading with Nature. It will only work on paper: at scoring time the target event has not happened yet, so the leaked information simply does not exist and the model fails spectacularly.
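One common defence, where the raw data carries timestamps, is to aggregate only the events that happened before a per-user cutoff. The sketch below is a minimal illustration and not the code from this project: the event list, the cutoff dictionary and the choice of “moment the lead was sent to the Call Centre” as the cutoff are all assumptions.

```python
from datetime import datetime


def count_events_before(events, cutoff):
    """Count events per user that happened strictly before that user's cutoff."""
    counts = {}
    for user_id, event_time in events:
        user_cutoff = cutoff.get(user_id)
        # Users without a cutoff are skipped; later events are ignored as leaks
        if user_cutoff is not None and event_time < user_cutoff:
            counts[user_id] = counts.get(user_id, 0) + 1
    return counts


# Hypothetical data: (user_id, timestamp of a sent fax)
events = [
    ("u1", datetime(2020, 1, 5)),
    ("u1", datetime(2020, 1, 20)),  # after the cutoff: would be a leak
    ("u2", datetime(2020, 1, 3)),
]
# Hypothetical cutoff: the moment each lead was sent to the Call Centre
cutoff = {
    "u1": datetime(2020, 1, 10),
    "u2": datetime(2020, 1, 10),
}

features = count_events_before(events, cutoff)
```

A filter like this only helps for event-level data. A one-record-per-user table with no timestamps cannot be filtered this way.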

The Leaks were here and we were in trouble.

Fighting Leaks from the Future – Round 1 

The prime suspect was the CRM data. It was a one-record-per-user table, and a quick investigation revealed that it contained ALL interactions with customers – including activities after the users had signed up to the paid tier.

Machine Learning “learned” very quickly that if someone, for instance, started receiving invoices – well, uhm, it’s strongly correlated with that user purchasing a paid plan. True but useless.

We ended up in a joint session with the client, manually rejecting the columns populated AFTER the successful sale.

Quick run of Machine Learning on new data and…

Nothing changed: our model was still a pitch-perfect Crystal Ball.

Fighting Leaks from the Future – Round 2 

We had to keep looking.

Soon we found out that the BI team had implemented usage-based segments in the system and had decided to retrospectively populate the segments for historical data.

Nothing wrong with that, until we discovered that the retrospective segmentation had been conducted only for customers who were signed up to paid tiers on the day the segmentation was deployed.

As a result, the pre-segmentation values were highly contaminated and Machine Learning discovered a simple rule in the pre-segmentation data:

If Segment = “Free Trial” Then Good Profile (successful sale in the future)

If Segment = NULL Then Bad Profile (no sale in the future).
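A rule like this can be hunted down mechanically before any model is trained. The sketch below is one possible screening check, not what we ran at the time: it flags every input whose individual values (missing values included) each map to exactly one label. The column names and rows are made up for illustration.

```python
from collections import defaultdict


def perfect_predictors(rows, target):
    """Return features where every value (None included) maps to a single label."""
    suspects = []
    features = [name for name in rows[0] if name != target]
    for feature in features:
        labels_by_value = defaultdict(set)
        for row in rows:
            labels_by_value[row[feature]].add(row[target])
        # Suspicious: more than one distinct value, each seen with only one label
        if len(labels_by_value) > 1 and all(
            len(labels) == 1 for labels in labels_by_value.values()
        ):
            suspects.append(feature)
    return suspects


# Hypothetical modelling rows, mimicking the contaminated segment field
rows = [
    {"segment": "Free Trial", "industry": "legal", "good_profile": 1},
    {"segment": "Free Trial", "industry": "health", "good_profile": 1},
    {"segment": None, "industry": "legal", "good_profile": 0},
    {"segment": None, "industry": "health", "good_profile": 0},
]
suspects = perfect_predictors(rows, "good_profile")
```

Unique identifiers and other high-cardinality fields will also be flagged by this check, so the output is a list of suspects to investigate and not a verdict.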


We had to say bye-bye to the segmentation data.

Fighting Leaks from the Future – Rounds 3-12

We discovered a few more leaks – for instance, there was a field containing call status, and a value of “call again” indicated that the prospect had been successfully reached even though the “contacted date” was not populated by the system. Things like that.

We spent ca. 30 days on converting the “signals” into aggregates ready for Machine Learning, of which ca. 10 days were burnt on investigating the “leaks”.


Once we had the Modelling Dataset the work progressed fast.

We built the model within a few days and started drip-feeding the best leads to the Call Centre, while mixing in a control group with lower scores.
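For readers who want to try a similar set-up, here is a minimal sketch of such a split. It assumes the control group is a random sample drawn from the leads outside the top-scored batch, and the function name, scores and group sizes are all hypothetical.

```python
import random


def split_leads(scores, n_target, n_control, seed=0):
    """Pick the top-scored leads plus a random control sample from the rest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    target = ranked[:n_target]
    rest = ranked[n_target:]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    control = rng.sample(rest, min(n_control, len(rest)))
    return target, control


# Hypothetical model scores per lead
scores = {"u1": 0.91, "u2": 0.15, "u3": 0.78, "u4": 0.05, "u5": 0.33}
target, control = split_leads(scores, n_target=2, n_control=2)
```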

When there is no “champion” model to challenge, the outcomes of lead-gen projects are usually very strong. We got an uplift of 300% for the Target Group versus the Control Group, at which point the Control Group was abandoned.
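As a reminder of the arithmetic behind an uplift figure, here is a small sketch. It assumes uplift is measured as the relative difference in conversion rate between the two groups, and the counts are invented purely to illustrate the calculation. They are not the Client's numbers.

```python
def conversion_rate(sales, leads_called):
    """Share of called leads that ended in a sale."""
    return sales / leads_called


def relative_uplift(target_rate, control_rate):
    """Relative uplift of target over control, as a fraction (3.0 means 300%)."""
    return (target_rate - control_rate) / control_rate


# Hypothetical counts, chosen only to illustrate the arithmetic
target_rate = conversion_rate(sales=80, leads_called=1000)
control_rate = conversion_rate(sales=20, leads_called=1000)
uplift = relative_uplift(target_rate, control_rate)  # roughly 3.0, i.e. 300%
```

Under this definition a 300% uplift means the Target Group converts at four times the rate of the Control Group.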

Xpanse AI Team