One of the most common analyses an analyst performs is user segmentation. Segmenting users, and other important business entities, is standard practice in business analytics, which makes it an important skill.
In this post you'll learn how best to segment your users, and avoid some common pitfalls.
Why do we want to segment our users?
The main reason we want to segment our users is to get a better understanding of our user base.
Not all users are created equal and it's important to understand how different segments of users perform.
Let's take an example to explore this concept in more detail.
Let's say that we've built a service which is specifically designed for freelancers. Since we want to make it easy for freelancers to sign up to our service, we provide the service for free for the first 30 days.
When we analyze our accounts, we see that on average only 10% convert to a paying account.
This is what I like to call the "fully blended" view of a metric: you don't segment your users, and instead group everyone together into a single number.
Now let's say we segment our users by the number of years they have worked as a freelancer. When we do that we see the following:
- 0 - 1 year: 2% conversion to paid
- 1 - 3 years: 15% conversion to paid
- >3 years: 20% conversion to paid
Since there is a big difference between someone who is just starting as a freelancer compared to someone who has been doing it for a while, we see a big difference in adoption for our service. A freelancer who is just starting has limited resources and is less likely to invest in software than a freelancer with a successful business.
If we just looked at the "fully blended" view we would miss out on this important insight.
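To see how a blended number can hide the segments underneath it, here is a minimal sketch in Python. The trial counts per segment are hypothetical, picked only so that the rates match the example above; the real mix of users would come from your own data.

```python
# Hypothetical trial counts per segment. Only the conversion rates
# (2%, 15%, 20% and 10% blended) come from the example in the text.
segments = {
    "0 - 1 year": {"trials": 250, "paid": 5},
    "1 - 3 years": {"trials": 200, "paid": 30},
    ">3 years": {"trials": 100, "paid": 20},
}


def conversion_rate(paid: int, trials: int) -> float:
    """Share of trial accounts that converted to paid."""
    return paid / trials if trials else 0.0


# Conversion rate per segment.
segment_rates = {
    name: conversion_rate(s["paid"], s["trials"]) for name, s in segments.items()
}

# The fully blended rate pools every account before dividing.
total_trials = sum(s["trials"] for s in segments.values())
total_paid = sum(s["paid"] for s in segments.values())
blended_rate = conversion_rate(total_paid, total_trials)

for name, rate in segment_rates.items():
    print(f"{name}: {rate:.0%}")
print(f"fully blended: {blended_rate:.0%}")
```

The blended rate is a weighted average of the segment rates, so it depends on how many users fall into each segment as much as on how each segment converts.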
We can now take this insight and use it to make strategic decisions that will impact all areas of the business.
How to perform a user segmentation analysis?
The best way to perform a user segmentation analysis is by following the steps below:
Step #1 - Get to know the business, target market and available data
As an analyst you must know the business at a very detailed level. This includes the target market and data which is important to the business.
Ask yourself the following questions:
- What does the perfect user look like?
- Which segments exist among the user base? (think industry, geo, size, age, etc)
- What information about our users should we collect?
- Which information about our users are we collecting?
The best person to talk to about the user base, especially in an early-stage company, is the CEO. The CEO is typically one of the founders and has been thinking about the target market and ideal user from day 1.
Step #2 - Balance the quick and dirty approach vs. the long-term approach
When it comes to segmenting your users, there are two ways you can tackle this challenge: the quick way, or the slow but smart way.
The quick way involves manually building a data set of your users with the dimensions you want to use in your analysis. Most early-stage companies with limited business intelligence infrastructure will opt for this method.
The issue with this approach is that if in a week you want to rerun the numbers, you have to start from scratch.
A better approach is to build a master data source that can be reused whenever needed. Ideally this data source exists as an extracted table in a tool like Tableau or PowerBI. If you aren't there yet, then at least build such a data source using SQL and save it as a view in your database. An extra tip is to save the query somewhere (sharing it on Slack or sending it to yourself via email works).
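As a rough sketch of the "save it as a view" idea, the snippet below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for your real one. The users table, its columns, and the view name user_segmentation are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table and sample rows.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT, industry TEXT);
    INSERT INTO users VALUES
        (1, '2024-01-03', 'design'),
        (2, '2024-01-09', 'writing'),
        (3, '2024-02-14', 'design');

    -- The reusable data source: defined once, queried whenever needed.
    CREATE VIEW user_segmentation AS
    SELECT user_id, created_at, industry
    FROM users;
""")

# Rerunning the numbers later is a single query against the view.
rows = conn.execute(
    "SELECT industry, COUNT(DISTINCT user_id) FROM user_segmentation "
    "GROUP BY industry ORDER BY industry"
).fetchall()
print(rows)
```

Because the view is just a stored query, it reflects new users automatically the next time you run the analysis.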
Step #3 - Building your user segmentation data set
The next step in the process is to actually build the data set.
For every business this will be different but the overall structure of the data set should be the same.
The goal is to build a data set which has the following attributes:
- One row per user
- Include user_id as a column - This will make it easy to do a distinct count of users and group by the different dimensions
- Include created_at (when the user was created in the database) as a column
- Include relevant dimensions that you want to use to either filter or segment your users
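The structure above could look something like the following in Python. This is only an illustration: the UserRow type, the two dimensions (years_freelancing and country), and the sample rows are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UserRow:
    """One row per user: an id, a creation date, and the dimensions."""
    user_id: int
    created_at: date
    years_freelancing: str  # hypothetical dimension
    country: str            # hypothetical dimension


# Hypothetical data set with one row per user.
dataset = [
    UserRow(1, date(2024, 1, 3), "0 - 1 year", "US"),
    UserRow(2, date(2024, 1, 9), "1 - 3 years", "DE"),
    UserRow(3, date(2024, 2, 14), "0 - 1 year", "US"),
]


def distinct_users_by(rows, dimension):
    """Distinct count of user_id for each value of the given dimension."""
    groups = defaultdict(set)
    for row in rows:
        groups[getattr(row, dimension)].add(row.user_id)
    return {value: len(ids) for value, ids in groups.items()}


print(distinct_users_by(dataset, "years_freelancing"))
print(distinct_users_by(dataset, "country"))
```

With user_id on every row, the same data set answers a new segmentation question just by grouping on a different column.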
Step #4 - Build your user segmentation visualizations
There are different ways to represent the findings of a user segmentation analysis. The most common approaches are with pie charts, tables and bar charts.
Pie charts can be used if you're providing a snapshot-style analysis, for example a segmentation of users that signed up in a given month by a certain dimension. In this case there is no time dimension, so a percentage-of-total representation is all you need.
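The numbers behind such a snapshot are just a percentage of total per segment. Here is a small sketch, assuming a hypothetical "plan" dimension for users who signed up in a single month:

```python
from collections import Counter

# Hypothetical plan chosen by each user who signed up in one given month.
signup_plans = ["free", "free", "pro", "free", "team", "pro", "free", "free"]

# Count users per segment, then divide by the total to get each share.
counts = Counter(signup_plans)
total = sum(counts.values())
share_of_total = {plan: count / total for plan, count in counts.items()}

# Print the largest segment first.
for plan, share in sorted(share_of_total.items(), key=lambda kv: -kv[1]):
    print(f"{plan}: {share:.1%}")
```

These shares are what you would feed into a pie chart, a bar chart, or a simple table.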
If you want to look at a distribution of users with a time dimension then you're better off using a cohort type view.
You must be careful when trying to segment users with a time dimension. A cohort representation of users with a dimension is a different animal and has its own pitfalls.
Step #5 - QA your results and present your user segmentation analysis
The last step in every analysis is to QA your results and present your findings.
Sit with a second analyst and go over what you see so you have a fresh pair of eyes that can help point out any errors you may have made.
If any segmentation doesn't make sense to you then go back to your database and make sure that you don't have any data issues.
If everything looks good then schedule a meeting with whoever requested the analysis and go over your findings.
Common mistakes analysts make when segmenting users
There are a number of common mistakes that inexperienced analysts will make when segmenting users.
Not re-categorizing null values
This mistake is more of an aesthetic error than an analytical one.
There will be times when you build your data set and you will have users which don't have values for certain calculated fields or dimensions.
For example, you ask users when they sign up to enter their age but this isn't a required field. In such a situation a certain percentage of users won't have a value set in the "age" field. If you bring that field into your data set you will have blanks (null values).
In your query or via a calculated cell in Excel, you should replace blanks with "unknown".
Do this for all relevant fields so when you present your findings you don't show "blank" or "null".
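If you prepare the data set in Python rather than in SQL or Excel, the replacement could be sketched as below. The rows and the label_missing helper are hypothetical; in SQL the same idea is usually expressed with COALESCE(age, 'unknown').

```python
# Hypothetical rows: "age" is optional, so some values are missing or blank.
rows = [
    {"user_id": 1, "age": "25-34"},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": ""},
]


def label_missing(value, placeholder="unknown"):
    """Return the placeholder for None or blank strings, else the value."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return placeholder
    return value


# Build cleaned copies of the rows, leaving the originals untouched.
cleaned = [{**row, "age": label_missing(row["age"])} for row in rows]
print(cleaned)
```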
Forgetting to normalize for time-sensitive dimensions
A very common mistake that analysts make is forgetting to normalize for time.
When the thing you're measuring can happen at any point after sign-up, you need to make sure that you give each record the same amount of time to accomplish it.
Let's say you want to segment users by those that started a trial for your premium offering, and those that didn't. Let's assume that a user can start a trial at any point.
A user who signed up last week has had 7 days to start a trial, while a user who signed up a month ago has had over 30 days to accomplish the same task.
The correct way to segment these users would be by picking a time frame, say within 14 days from signing up, and then only looking at users that have had at least 14 days to start the trial. Users that started a trial after the first 14 days would still be in the "didn't start a trial within 14 days" bucket.
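The 14-day rule can be sketched in a few lines of Python. The analysis date, the user records, and the segment function are hypothetical; only the 14-day window and the bucket names follow the text.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=14)
as_of = date(2024, 3, 31)  # hypothetical date the analysis is run

# Hypothetical records: (user_id, signed_up, trial_started or None).
users = [
    (1, date(2024, 3, 1), date(2024, 3, 5)),    # trial 4 days after sign-up
    (2, date(2024, 3, 1), date(2024, 3, 25)),   # trial 24 days after sign-up
    (3, date(2024, 3, 10), None),               # never started a trial
    (4, date(2024, 3, 25), date(2024, 3, 26)),  # signed up 6 days before as_of
]


def segment(signed_up, trial_started):
    """Bucket a user, or return None if they haven't had the full window yet."""
    if as_of - signed_up < WINDOW:
        return None  # too recent: leave out of the analysis
    if trial_started is not None and trial_started - signed_up <= WINDOW:
        return "started trial within 14 days"
    return "didn't start a trial within 14 days"


buckets = {uid: segment(s, t) for uid, s, t in users}
print(buckets)
```

Note that user 4 did start a trial, but is still excluded: they haven't had the full 14 days, so including them would bias the comparison.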
Coming up with the ideal time frames can be tricky and should be discussed internally. You could also run a distribution analysis to see when the majority of certain actions take place and then use that to determine your time periods.
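One possible way to run that distribution analysis is to look at how many days it takes for a given share of users to perform the action. The sketch below uses made-up values, and the helper name smallest_window_covering is hypothetical.

```python
# Hypothetical days between sign-up and trial start,
# one value per user who started a trial.
days_to_trial = [0, 1, 1, 2, 3, 3, 5, 8, 13, 40]


def smallest_window_covering(days, target_share):
    """Smallest number of days within which at least target_share of actions happened."""
    ordered = sorted(days)
    for i, d in enumerate(ordered, start=1):
        # i values are <= d, so the share covered by day d is at least i / n.
        if i / len(ordered) >= target_share:
            return d
    raise ValueError("no data, or target_share is above 1")


print(smallest_window_covering(days_to_trial, 0.8))
print(smallest_window_covering(days_to_trial, 0.9))
```

A long tail, like the single 40-day value here, is a hint that waiting for 100% coverage would make the window impractically long.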
Overcomplicating the analysis
Like any analysis, it is easy to overcomplicate things.
I always suggest picking a small number of questions that you want to answer and then working systematically, and smartly. You want to build an ever-expanding data set that allows you to answer each question.
Start with the highest-priority segmentation questions and go from there.
Often a segmentation analysis is coupled together with a correlational analysis. The decision makers want to know how to move certain metrics so they want to see which types of users are more likely to perform certain actions.
This is completely valid and important but such a challenge should be done in steps.
The first step is to build the relevant data set to answer the segmentation questions, and then you can blend it with the relevant performance indicators and run the correlational analysis.
Creating duplicates
Another common mistake is creating duplicates in your data set. This is more likely to happen if you're building your data sets using SQL with left joins.
Make sure you visualize your data set in its raw form before jumping into your analysis. This will help you spot any duplication, or other data issues. You may also want to check a count vs. distinct count of your user id to confirm you have one row per user.
If you do have duplications, it's most likely because you didn't realize you have multiple records matching your user_id. You will need to rethink which data you want to add to your data set and go with a min, max, or some other aggregation so you bring in a single data point for each user.
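Here is a small sketch of both the check and the fix, again using Python's built-in sqlite3 module. The users and projects tables are hypothetical; the point is that a left join to a table with several rows per user duplicates that user, and an aggregation such as MIN brings it back to one row.

```python
import sqlite3

# In-memory SQLite database with hypothetical tables and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    INSERT INTO users VALUES (1, '2024-01-03'), (2, '2024-01-09');
    INSERT INTO projects VALUES
        (10, 1, '2024-01-04'), (11, 1, '2024-01-20'), (12, 2, '2024-01-10');
""")


def count_vs_distinct(query):
    """Return (row count, distinct user_id count) for a query that yields user_id."""
    return conn.execute(
        f"SELECT COUNT(user_id), COUNT(DISTINCT user_id) FROM ({query})"
    ).fetchone()


# User 1 has two projects, so the plain left join returns that user twice.
naive = """
    SELECT u.user_id, p.created_at AS project_created_at
    FROM users u LEFT JOIN projects p ON p.user_id = u.user_id
"""

# Aggregating to a single data point per user restores one row per user.
aggregated = """
    SELECT u.user_id, MIN(p.created_at) AS first_project_at
    FROM users u LEFT JOIN projects p ON p.user_id = u.user_id
    GROUP BY u.user_id
"""

print(count_vs_distinct(naive))       # the two counts differ
print(count_vs_distinct(aggregated))  # the two counts match
```

If the two counts differ, you have more than one row for at least one user, and any distinct-count or sum built on top of that data set needs a second look.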