Customer Segmentation and Targeting

Akshay Jadiya
7 min readJun 24, 2021

--

I have led a bunch of Segmentation & Targeting projects at my work and I aim to share my learnings through this article.

This article will —

  1. Cover the “What” , “Why” and “How” of the S&T process
  2. Give you a walkthrough of a Segmentation project using a Kaggle dataset

So, let’s dive right in!

Photo by Alex on Unsplash

What is Segmentation and Targeting?

I like to think of the Segmentation process as an analytical way of classifying customers to optimize targeting efforts. Although there can be more, but I have come across these two types —

  1. Behavioral Segmentation — Done using ‘secondary’ data i.e. using historical data that captures the ‘behavior’ of a customer.
  2. Attitudinal Segmentation — Float a survey having questions directed towards assessing customer preferences.

The focus of this post will be on the Behavioral Segmentation.

Targeting is the process of using segment definitions and corresponding characteristics to inform marketing decisions (frequency / channel / target value etc.). Targeting can be thought of prioritizing segments in a way that aligns with the marketing strategy and optimizes resources.

Why Segmentation and Targeting is done?

There are three main reasons to leverage data and analytically derive segments —

  1. It leads to effective deployment of marketing tactics
  2. It divides the customer universe into logical groups that are easy-to-understand
  3. It allows you to make efficient use of the marketing budget allocated to a specific channel

The “How” of Segmentation

The best way to understand one of the ways to create segments is to show an example. You can find the code/dynamic excel file here and get the data here.

The data that I am using has the drug level prescriptions written by a doctor over a period of time. The dataset also captures some of the other physician characteristics such as specialty, years of experience etc. The goal is the divide the doctors into 4 different segments so that the sales representatives can visit them with varying frequencies — highest frequency to physicians in a group that is the most valuable and low/no visits to doctors in segments that are of low value.

If I have to summarize the overall process in 3 broad steps, they would be —

  1. Identify customer characteristics that influence purchasing behavior
  2. Rank customers on each of the identified dimensions
  3. Define segments based on market understanding

Step 1 — Identifying customer characteristics

I derived 3 characteristics that affect the “frequency of visit” of a sales representative to a physician to promote a product —

  1. The “value” of a physician — This is equal to the total number of scripts written by a doctor for the product of interest (Prednisone in our analysis)
  2. The “potential” of a physician — This is equal to the total number of scripts written by a doctor for ALL the drugs (Prednisone + drugs of other competitor) that can be used to treat a disease/ailment
  3. The “access” of a physician — The ratio of “brand scripts” to “generic scripts” written by a physician. Brand scripts mean scripts for drugs that are manufactured by a large Pharma Company and is still under patent (what this means is that other manufacturers can’t manufacture copies of this drug to bring the cost down). To understand more about the product lifecycle in a pharma industry, please watch this. For our purposes, we just need to understand that is there is high access — the doctor is more willing to write branded products in general, hence, high access is good for us.

Okay, let’s get to the data now.

The Kaggle dataset is an jsonl file that I read using pandas as the dataset looks like this —

Physician level prescription data

Each row has an NPI which is the doctor identifier and dictionaries for product wise prescriptions and demographic variables in ‘provider_variables’ column. For the purpose of this post, I assumed PREDNISONE to be the brand of interest and METOLAZONE, OMEPRAZOLE and LISINOPRIL as its competitors. I extracted prescription counts for PREDNISONE and the market (PREDNISONE + competitor prescriptions) and now the data looks like —

Dataset with prescriptions extracted for brand and market

So here we have our “value” and “potential” dimensions. For the “access” dimension, I created a dataframe from the ‘provider_variables’ column and it looks something like —

Physician demographic information

The “access” variable is the ratio brand_rx_count / generic_rx_count.

So, now we have all the three dimensions for creating segments. Now let’s move on to the step 2

Step 2 — Ranking customers among each dimension (creating deciles)

This is pretty straightforward — I created deciles of each dimension using the pandas qcut function. The below table shows decile summary for the “potential” dimension—

Market deciles — ‘sum’ means the sum of prescriptions and ‘count’ means no. of doctors contributing to the sales in that decile

As you can see, decile 1 physicians are the most valuable ones as they write more avg. prescriptions than all the other deciles. This is apparent as there are roughly same doctors in every decile, however, the prescriptions decrease significantly as we go from decile 1 to decile 10.

Decreasing total prescriptions as decile increases

After making deciles for all the 3 dimensions, I grouped the “value” and “potential” into High, Medium and Low (Very low for decile 0) as per the following rule —

High (H)— Decile 1 to Decile 3

Medium(M) — Decile 4 to Decile 7

Low (L)— Decile 8 to Decile 10

Very Low (VL) — Decile 0

I followed a slightly different grouping approach for the “access” dimension as well. I created different histograms to check the distribution of access ratios and decide cutoffs for Very High, High, Medium, Low and Very Low access ratios

1. For complete data
2. Access ratios > 1
3. Access ratios between 1 and 5
4. Access ratios between 0 and 1
5. Access ratios between 0 and 0.4

Based on the above ratios I came up with these cutoffs for the access ratios —

  • Very High (VH) : >2
  • High (H) : (0.8 to 2]
  • Medium (M) : (0.15 to 0.8]
  • Low (L) : (0 to 0.15]
  • Very Low (VL) : 0

It is completely okay to define some other ranges of cutoffs, however, I found those reasonable by looking at the distributions. Try coming up with some other cutoffs yourself!

This completes our step 2. Now, let’s move on to step 3.

Step 3 — Assign segments to physician groups (VH,H,M,L,VL) as per market understanding

For step 3, I thought creating a dynamic excel workbook will help a lot as we can choose a different structure of assigning segments and can see changes being reflected in real-time.

This part is very specific to the industry you’re working in. Also, it depends on the total sales force size, brand strategy, competitive landscape etc.

I have divided the customers into 4 segments— A,B,C and D. Segments A,B and C are “targets” as they will be covered by the sales force albeit with reducing frequency(say 12 visits a year for Segment A physicians, 8 for Segment B and 4 for segment C). Segment D is the non-target segment (0 visits). The logic I used to assign the segments is based on the below table —

Decision Chart for assigning segments to dimension groups

The below gif shows how you can select segments in the top left table using a dropdown. For reference, the table just below it shows distribution of doctors among “value” , “potential” and “access” deciles. Also, the summary table in top right updates dynamically to reflect the changes.

Dynamic excel to do make adjustments

Conclusion

The below figure shows the arrangement I finally settled upon.

Final Segmentation

A few observations to note —

  1. Top ~10% physicians cover ~30% market potential and ~45% PREDNISONE prescriptions. Being so valuable, we want them to remain loyal to our brand so cover them with a higher frequency to maintain our relationship with them
  2. As we go to segment B and C, we see that the physicians in these groups become less and less valuable to us. Hence, we want to reach them with reduced frequency
  3. If you look at segment D, the % physician to % value is flipped as compared to segment A. We should be okay to leave behind ~10% of value as well as potential as covering these physicians will not result in many prescriptions
  4. Ideally, if we had infinite resources, we would want to send a sales rep to each and every doctor there is. However, since the cost of a sales rep in the US is very high, the number of physicians we can reach is usually limited by this high cost

This is my first blog post, so feel free to drop comments for suggestions/present your POV. Thanks for reading!

Cheers!

--

--

Akshay Jadiya

Data science graduate student @Georgia Tech. Passionate about decision science.