DataHadi: Bike Share Marketing Campaigns Analysis using R or Python

Miftahul Hadi

🧾Information 🧾

For presentable analysis and script code in R & Python, click Appendix

Introduction

Scenario

Welcome to the Cyclistic bike-share analysis case study!

Fig. 1. Bike-share station — ***Fig. 1.*** Bike-share station

In this case study, I am here as a junior data analyst working in the marketing team at Cyclistic, a fictional bike-share company in Chicago. Lily Moreno, the marketing director, believes that the company’s future success depends on maximizing the number of annual memberships.

To ride a Cylcistic bike, you can unlock it from one station and return it to any other station in the system. There are a few pricing plans to purchase:

Single-ride passes
Full-day passes
Annual memberships

Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Our task here is to understand how casual riders and annual members use Cyclistic bikes differently. The marketing team will design a new marketing strategy from these insights to convert casual riders into annual members. But, Cyclistic executives must approve your recommendations with compelling data insights and visualization.

More about company’s details

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign targeting all-new customers, Moreno believes there is a good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Objectives

Identify how casual and members riders use Cyclistic bikes differently to maximize the number of annual memberships.

Preparation

Here are a few tools used in this analysis:

RStudio
The library used in R: tidyverse, ggplot2, lubridate, readxl
Dataset
- Cyclistic’s historical trip data in 2022
- The license has been made available by Motivate International Inc.

Processing the Data

Here’s a step by step of processing raw data to be prepared for analysis:

Import your data

To connect the data to the R environment, the step as follows:

First, download the previous 12 months’ Cyclistic trips in .csv format.
Second, convert the data from .csv to xlsx format. Store it in a new folder (e.g. “~/bike-project/xlsx-version”)
Next, load the library used in this analysis and set the main directory where the data is stored.

library(tidyverse)
library(ggplot2)
library(lubridate)
library(readxl)
#setwd("C:/Users/LENOVO/Documents/bike-project/xlsx-version")
#setwd is not used because of web development circumstances

Using the readxl library, load the data and give the name each month.

m1_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202201-divvy-tripdata.xlsx")
m2_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202202-divvy-tripdata.xlsx")
m3_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202203-divvy-tripdata.xlsx")
m4_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202204-divvy-tripdata.xlsx")
m5_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202205-divvy-tripdata.xlsx")
m6_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202206-divvy-tripdata.xlsx")
m7_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202207-divvy-tripdata.xlsx")
m8_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202208-divvy-tripdata.xlsx")
m9_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202209-divvy-tripdata.xlsx")
m10_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202210-divvy-tripdata.xlsx")
m11_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202211-divvy-tripdata.xlsx")
m12_2022 <- read_xlsx("C:/Users/LENOVO/Documents/bike-project/xlsx-version/202212-divvy-tripdata.xlsx")

#("C:/Users/LENOVO/Documents/bike-project/xlsx-version") still used because of web development circumstances

Make columns consistent and merge into a single data frame

To make columns ready to analysis, here’s the step:

Firstly, check the columns name of each months trips data.

colnames(m1_2022)
colnames(m2_2022)
colnames(m3_2022)
colnames(m4_2022)
colnames(m5_2022)
colnames(m6_2022)
colnames(m7_2022)
colnames(m8_2022)
colnames(m9_2022)
colnames(m10_2022)
colnames(m11_2022)
colnames(m12_2022)

After that, check the data types.

str(m1_2022)
str(m2_2022)
str(m3_2022)
str(m4_2022)
str(m5_2022)
str(m6_2022)
str(m7_2022)
str(m8_2022)
str(m9_2022)
str(m10_2022)
str(m11_2022)
str(m12_2022)

Change the data types if you find a column that differs from the rest of the columns in another file.
Lastly, combine all sheets into one single data frame and named it all_trips.

all_trips <- bind_rows(m1_2022, m2_2022, m3_2022, 
                       m4_2022, m5_2022, m6_2022, 
                       m7_2022, m8_2022, m9_2022, 
                       m10_2022, m11_2022, m12_2022)

Clean the data

After combine all month trips data, we can clean them to be ready to analysis. Here’s the step:

Remove a few unused column.

all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng))

Check the data frame attributes.

colnames(all_trips)   #list of column names
nrow(all_trips)       #total row
dim(all_trips)        #total dimension
head(all_trips)       #view first 6 rows
str(all_trips)        #see list of data types
summary(all_trips)    #statistical summary

Check if there’s any problem in memberships column

unique(all_trips$member_casual) #for unique value

[1] "casual" "member" NA

table(all_trips$member_casual)  #total unique value


 casual  member 
2322031 3345683

As we know, memberships program in Cyclistic only have two, casual and members riders. So, if there’s any dualism value (e.g. subscriber instead of members), change the value.

all_trips <- all_trips %>% 
mutate(member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual"))

Add columns that list the date, month, day, and year of each ride.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Make a new column that shows different time of started ride to ended ride.

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

After creating a new column, inspect the columns data types to make sure.

str(all_trips)

After checking the data types, we found that ride_length column still not in numeric types.

is.numeric(all_trips$ride_length) #to check if the ride length was in number/factor

[1] FALSE

all_trips$ride_length <- as.numeric(all_trips$ride_length)
is.numeric(all_trips$ride_length)

[1] TRUE

We found that there is two ride_length column. Remove bad data to make it consistent.

all_trips <- select(all_trips, -c(ride_of_length))

After creating a ride length column, there’s some data that peculiar (minus ride length). Remove it using subset function, because its a bad data.

all_trips_v2 <- subset(all_trips, ride_length>0)     #include only ride length over 0 s

Conduct descriptive analysis

In order to identify trends and relationships in the data, we can conduct descriptive analysis. Here’s the steps:

First, we can summarize the rider behavior based on ride_length column. For example longest & shortest ride and average ride length.

mean(all_trips_v2$ride_length)

[1] 1166.846

median(all_trips_v2$ride_length)

[1] 617

max(all_trips_v2$ride_length)

[1] 2483235

min(all_trips_v2$ride_length)

[1] 1

summary(all_trips_v2$ride_length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1     349     617    1167    1108 2483235

Compare members and casual riders based on average ride length or number of rides.

member_compare <- all_trips_v2 %>%                #call dataset
  drop_na(member_casual) %>%                      #drop empty data
  group_by(membership_type = member_casual) %>%   #group the data by members type
  summarize(average_ride_length = mean(ride_length), number_of_rides = n())  #summarize them with avg and number of rides

View(member_compare)      #view new data frame

Do a comparison of average ride time by each day_of_week column.

#Reorder the levels of day_of_week column with Sunday as the first

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#Summarize them based on day of the week

avg_member_per_day <- all_trips_v2 %>%
  drop_na(member_casual) %>% 
  group_by(membership_type = member_casual, day_of_week) %>% 
  summarize(average_ride_length = mean(ride_length))

View(avg_member_per_day)

Lastly, let’s count the number of rides by members type and day of week using ride_id column.

### first check the ride_id duplicate to count number of ride
sum(duplicated(all_trips_v2$ride_id))

[1] 0

### if zero, lets goo
membertype_per_day <- all_trips_v2 %>%
  drop_na(member_casual) %>% 
  group_by(membership_type = member_casual, day_of_week) %>% 
  summarise(number_of_rides = n(), average_ride_length = mean(ride_length)) %>% #n() for counting 
  arrange(day_of_week)

View(membertype_per_day)

Design a data visualization

For the next phase, we visualize what has been describe in the analysis process. Here’s the step:

First, we can visualize number of rides by membership type. This will helps us to understand which riders have the highest number of rides.

ggplot(data = membertype_per_day, mapping = aes(x = day_of_week, y = (number_of_rides/1000), fill = membership_type)) + 
  geom_col(position = "dodge") +
  labs(title = "Number of Rides by Membership Type in 2022", 
       x = "", y = "Number of Rides<br><span style = 'font-size:8pt'>(in thousand)</span>", 
       fill = "Membership Type")+
  theme(axis.title.y = ggtext::element_markdown())

#save the data viz in .png format
ggsave(filename = "number-of-ride.png", width = 6.86, height = 4.11, dpi=300)

Next, visualize the average duration by members type each day of the week.

ggplot(data = membertype_per_day, mapping = aes(x = day_of_week, y = average_ride_length, fill = membership_type)) + 
  geom_col(position = "dodge") +
  labs(title = "Average Ride Length by Membership Type in 2022", 
       x = "", y = "Average Ride Length<br><span style = 'font-size:8pt'>(in second)</span>", 
       fill = "Membership Type")+
  theme(axis.title.y = ggtext::element_markdown())

#save the data viz in .png format
ggsave(filename = "avg-length.png", width = 6.86, height = 4.11, dpi=300)

To find a potential days to promote memberships program to casual riders, we need to identify which day with the most number of rides by casual riders.

#filter the dataframe by only casual riders
highest_day_nor <- membertype_per_day %>% 
  filter(membership_type == "casual") %>% 
  arrange(desc(number_of_rides))

  ggplot(data = highest_day_nor, mapping = aes(x = (number_of_rides/1000), y = (day_of_week = reorder(day_of_week, number_of_rides)))) + 
    geom_col() +
    labs(title = "Number of Rides Per Day by Casual Riders",
        x = "Number of Rides<br><span style = 'font-size:8pt'>(in thousand)</span>", y = "")+
    theme(axis.title.x = ggtext::element_markdown())

#save the graph  
ggsave(filename = "high-nor-day.png", width = 6.86, height = 4.11, dpi=300)

Export a summary file for further analysis

To export a summary data frame into a csv format, here’s the code

write.csv(member_compare, file = "all_member_compare.csv")
write.csv(membertype_per_day, file = "by_day_member_compare.csv")

Analysis

The Cyclistic bikes usage difference by rider type

To understand each rider behavior of using bikes in 2022, we need to utilize number of rides and ride duration data. This will helps describe how they differ from each other as we can see from Fig. 2.

Fig. 2. Casual and members riders difference based on number of rides and ride duration. — ***Fig. 2.*** Casual and members riders difference based on number of rides and ride duration.

In Fig. 2, we see a comparison between two types of rider that using Cyclistic bikes. Then compare them based on number of rides and average ride length. We discovered that members riders have a higher number of rides than casual riders. However, casual riders tend to have a longer ride duration (1,749 second) even with less number of rides (2.321 mil).

Number of rides by day

This section will explain about how frequently they use Cyclistic bikes during the week. This will assist Cyclistic marketing to determining which day they are most likely to ride. We can see more detail in this Fig. 3.

Fig. 3. The number of rides done by each rider during the week. — ***Fig. 3.*** The number of rides done by each rider during the week.

This graph Fig. 3. above illustrates the number of rides done from Sunday to Saturday throughout 2022. Casual riders are represented by the red column, while Cyclistic members are represented by the blue column. From this graph Fig. 3. , we discovered that casual riders maintained a lower number of rides than members from Monday to Friday. On weekends, however, casual riders have a slightly higher number of rides than members. This suggest that casual rider less likely to ride on weekday.

Average ride length by day

This section will explain about how longer they ride using Cyclistic bikes during the week. This will assist Cyclistic marketing to determining which day they are most likely to ride longer. We can see more detail in this Fig. 4.

Fig. 4. The average ride lengths done by each rider during the week. — ***Fig. 4.*** The average ride lengths done by each rider during the week.

This graph Fig. 4. above illustrates the average ride length done during the week throughout 2022. From this graph Fig. 4. , we observed that casual riders shown longer ride lengths than members every day. In facts, casual riders have a longer ride on weekday, even with less number of rides than members. This suggest that casual rider more likely to ride longer.

Potential days to promote membership program

In this section, we observed the number of rides taken by casual riders. This data will help us to determine potential days to promote the membership program to casual riders who are already familiar with Cyclistic. This will also help the company generate more annual members, which will be beneficial. Here is Fig. 5. about number of rides per day by casual rider.

Fig. 5. The number of rides per day by casual riders. — ***Fig. 5.*** The number of rides per day by casual riders.

This graph Fig. 5. above illustrates the number of rides taken by casual riders during the week in 2022. From this graph Fig. 5. , we observed that casual riders show a higher number of ride on weekend. On Saturday, 473 thousand ride are completed. On the other hand, weekdays appear to have fewer rides, ranging from 264 to 334 thousand rides. This suggest that on weekends, there are more casual riders.

Why casual riders do not convert to members?

In this section, we create a hypothesis after identifying how casual and members use Cyclistic bikes differently. Here are a few potential causes of why casual riders do not purchase membership programs:

Marketing mismatch

Our first hypothesis is that casual riders do not join the membership program due to a marketing mismatch. So, we need to analyze the Cyclistic marketing strategies in 2022. To realize this, we need more data required such as:

Lead generation data (e.g. how did you hear about Cyclistic?)
Digital media/ads performance data

After gathering data, we can determine whether there are significantly different methods of discovering Cyclistic for casual riders and members. Then we look at how digital media influences casual riders to become members.

Incentive mismatch

Our second hypothesis is that casual riders are not interested by current membership benefits as they can obtain them as a casual. As a result, we need to investigate the membership plan. To accomplish this, we need to take steps such as:

Breakdown of riders difference by number of rides and ride length
Survey of “why did you use Cyclistic bikes” – broken out by riders type

After breakdown the data we have and conducting a survey, we can understand how casual and member riders use the bikes differently. As a result, we understand why the use of Cyclistic bikes varies by rider type.

Recommendation

After conducting a descriptive analysis to identify how different riders use Cyclistic, our strategies to increase annual membership as follow:

Define new membership program main benefits

As we learn before, casual riders have a longer ride duration. We can then design pricing plans based on limiting the ride duration. So, more people have the urge to be Cyclistic members with unlimited duration benefits.

Create new pricing plans based on ride length for casual riders

Based on casual riders longer ride length, we can create new pricing plans using The Decoy Effect strategies. For casual, make the pricing plans based on limited ride duration. This will make the membership plan more profitable.

Design a marketing strategies that highlight new membership benefits

Following the establishment of pricing plans, we can plan the marketing strategy on a day with higher number of rides by casual riders (Saturday and Sunday). Highlight the main benefits of members in a targeted manners.

Test and learn

Run the program on weekly basis to evaluate which pricing plans are working and which aren’t. Create a KPI’s to define a successful program (e.g. new members growth, number of rides by casual, day of the week, etc.)

Appendix

Presentable data analysis → PPT
Data Analysis Process in:
- R scripts → R Code
- Python scripts → Python Code
Google Data Analytics Coursera

Bike Share Marketing Campaigns Analysis using R or Python