Divvy Bikes: A Year in 4.3 Million Trips


Behavioral and operational analysis of the full 2023 Divvy bike-share dataset, separating member from casual riders and recommending where the operational levers are.

Tableau · Python · Pandas · Matplotlib · Statistical Modeling · Spatial Analysis

In 2023, Divvy recorded 4,331,707 bike-share trips across Chicago. This case study breaks that dataset apart along the dimensions an operator would actually act on: member versus casual usage, weekday versus weekend, seasonal pattern, and station-level concentration. The deliverable is a Tableau workbook plus a written set of recommendations. The point of the exercise was to take a year of mobility data and produce something an operations director could use on Monday morning, not a notebook full of charts that prove the analyst can use seaborn.

I picked the Divvy dataset specifically because the obvious chart (total trips by month) is the chart that hides the most. The interesting structure is in the disaggregation: who is riding, when, and why.

The system

Divvy is operated by Lyft for the City of Chicago. As of 2023, the system covered all 50 wards with more than 800 stations and over 15,000 bikes and e-bikes. Among North American bike-share systems it is the largest by service area. The dataset is publicly released by the City and contains, for every trip, the start and end station, the start and end timestamps, and a rider type (member or casual).

Operationally a successful bike-share system runs on three things: a viable membership base that anchors revenue and demand predictability, enough bike availability at the right stations at the right time of day, and pricing and product design that attracts new riders fast enough to grow. The 2023 data contains signal on all three.

The headline numbers

Members took 60 percent of trips. Casual riders took the other 40 percent. The asymmetry matters because the two groups behave entirely differently and require different operational responses.

The median trip duration was about 11 minutes, but that single number hides a split: members average roughly 12 to 13 minutes per trip, casual riders 25 minutes or more. Members are using bikes for transportation. Casual riders are using bikes for recreation. Same fleet, two products.
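The rider-type split is a one-line groupby on the trip table. A minimal sketch on a few synthetic rows, assuming the standard Divvy column names (`started_at`, `ended_at`, `member_casual`); the real run loads the monthly CSVs from the public release:

```python
import pandas as pd

# Tiny synthetic sample in the shape of the public Divvy schema;
# the real release ships one CSV per month with the same columns.
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2023-07-01 08:00", "2023-07-01 08:05",
        "2023-07-01 13:00", "2023-07-02 14:00",
    ]),
    "ended_at": pd.to_datetime([
        "2023-07-01 08:12", "2023-07-01 08:18",
        "2023-07-01 13:28", "2023-07-02 14:31",
    ]),
    "member_casual": ["member", "member", "casual", "casual"],
})
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60

# Mean duration per rider type -- the split a single headline number hides.
by_type = trips.groupby("member_casual")["duration_min"].mean()
print(by_type)  # casual 29.5, member 12.5 on this sample
```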

Seasonal volatility is extreme. Peak summer days run roughly four times the volume of mid-winter days. The seasonal curve starts climbing in April, peaks in July and August, and drops sharply through November as Chicago weather turns. Operations that don't acknowledge that volatility (in fleet sizing, in maintenance scheduling, in marketing spend) leak money.
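The seasonal curve itself is a resample of start timestamps to monthly counts; a sketch on synthetic dates (the real cut runs on the full year's `started_at` column):

```python
import pandas as pd

# Synthetic start timestamps standing in for a year of trips.
starts = pd.to_datetime([
    "2023-01-15 09:00", "2023-07-04 12:00", "2023-07-20 17:30",
    "2023-07-21 18:00", "2023-11-30 08:15",
])

# Trips per month ("MS" = month-start bins): the seasonal curve the
# operator plans fleet sizing and maintenance around.
monthly = pd.Series(1, index=starts).resample("MS").sum()
print(monthly)
```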

Members: the commuting backbone

The temporal signature of member trips is rush hour, twice a day, weekdays. Trip volume spikes around 8 AM, plateaus through midday, spikes again around 6 PM, and falls off through the evening. Saturdays and Sundays show a flatter curve with no rush-hour peaks. The interpretation is straightforward: members are commuting to and from work and using midday trips for short errands.
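The rush-hour signature falls out of a groupby on hour of day split by weekday versus weekend. A sketch on a few synthetic member trips; dates and counts here are illustrative:

```python
import pandas as pd

# Synthetic member trip starts: 2023-05-01 is a Monday, 2023-05-06 a Saturday.
starts = pd.to_datetime([
    "2023-05-01 08:10", "2023-05-01 08:40", "2023-05-01 18:05",
    "2023-05-06 13:20", "2023-05-06 15:45",
])
df = pd.DataFrame({"started_at": starts})
df["hour"] = df["started_at"].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so >= 5 means weekend.
df["day_type"] = df["started_at"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Trip counts per hour by day type -- weekday rows peak at the
# commute hours, weekend rows stay flat through midday.
profile = df.groupby(["day_type", "hour"]).size().unstack(fill_value=0)
print(profile)
```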

Geographically, member trips concentrate along corridors that overlap with the city's commuting infrastructure. Origin-destination pairs cluster between residential neighbourhoods (Lincoln Park, Logan Square, West Town) and the central business district. The pattern looks like the CTA Blue and Red lines drawn in trip arrows.

The operational implication is that member demand is predictable. You can build rebalancing algorithms around it. You can stage maintenance during the midday lull. You can size station capacity based on the morning surge. None of this is true of casual riders to the same degree.

Casual riders: the recreation product

Casual rider behaviour is essentially the inverse. Weekend-heavy, midday-and-afternoon-clustered rather than rush-hour spiked, longer per-trip duration, geographically concentrated around tourist destinations and the lakefront. Origin-destination pairs cluster around Millennium Park, Navy Pier, Grant Park, the lakefront trail, and the waterfront entertainment districts.

The seasonal sensitivity is steeper for casual riders than for members. Members ride year-round at reduced volume. Casual riders functionally disappear in January and reappear in May. That swing creates demand-management problems summer-side (not enough bikes at the lakefront on a 28°C (82°F) Saturday) and revenue-management problems winter-side (a lot of capacity sitting idle).

The operational implication is that casual demand requires flexibility, not predictability. Surge pricing, dynamic fleet repositioning, partnerships with tourism operators, weather-responsive marketing. The skill is in matching capacity to a moving target.

Weather and seasonal mechanics

Temperature is the dominant exogenous variable. Trip volume starts climbing meaningfully above 50°F (10°C), peaks in the 70 to 80°F range (21 to 27°C), then declines slightly above about 90°F (32°C), where extreme heat starts discouraging outdoor activity.

Precipitation has a sharp same-day effect. Rainy days show 60 to 80 percent volume reductions versus comparable dry days. The reduction is steeper for casual riders than for members, which is consistent with the underlying use cases: a commuter still has to get to work in the rain, a tourist does not have to bike along the lakefront.

These patterns are not surprising. They are quantifiable, which makes them useful for operational planning. Predictive models that incorporate weather forecasts can size the active fleet a day or two ahead with usable accuracy.

Station-level concentration

The station distribution follows a Pareto-like shape: roughly 20 percent of stations carry about 60 percent of system volume. The top stations are concentrated downtown and along the lakefront. The bottom of the distribution contains stations that record fewer than five trips per day on average.
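The concentration figure is a sort, a cumulative sum, and a division. A sketch with hypothetical counts; the station names and volumes are illustrative, not the actual 2023 ranking, and the real input is `trips["start_station_name"].value_counts()` over the full year:

```python
import pandas as pd

# Hypothetical per-station trip counts (illustrative names and volumes).
counts = pd.Series(
    [5000, 3000, 1000, 500, 300, 200],
    index=["Streeter Dr", "DuSable LSD", "Michigan Ave", "A", "B", "C"],
)

# Cumulative share of system volume carried by the top-k stations --
# the Pareto curve. Here the top 2 of 6 stations carry 80 percent.
share = counts.sort_values(ascending=False).cumsum() / counts.sum()
print(share.round(2))
```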

Two operational responses follow.

Strategic capacity additions at the high-volume stations. Adding docks at a station that already runs at saturation captures real incremental demand. Adding docks at a station running at 20 percent of capacity captures nothing.

Selective repositioning of low-volume stations. Some low-traffic stations exist for equity reasons (network coverage of underserved neighbourhoods) and should stay regardless of volume. Others are simply badly located and should be moved. Distinguishing those two cases is a policy question, but the data identifies the candidate list.

Recommendations

The case-study writeup ended with five recommendations, paraphrased here:

  1. Convert casual riders systematically. The data identified casual riders who took three or more trips in a single month as a high-likelihood conversion segment. A targeted offer (free trial, summer discount, partner promotion) aimed at that segment converts at rates that justify the marketing spend.

  2. Build a predictive rebalancing model. Member trip patterns are predictable enough to forecast directional flows by time of day. Pre-staging bikes at expected destinations reduces stockouts at the morning end and full-station refusals at the evening end. The expected operational gain is in the 20 to 30 percent range based on comparable systems' published results.

  3. Implement dynamic pricing for casual rides. Same fleet, different price by time and weather. Smooths peak-hour congestion at high-demand stations and improves revenue recovery on weekend afternoons in summer.

  4. Expand strategically into latent-demand neighbourhoods. Origin-destination analysis identified neighbourhoods that show high inbound trip volume but limited station coverage. These are the candidate expansion locations.

  5. Integrate with CTA for first-and-last-mile use cases. Member trip patterns show clear handoffs between bike trips and transit. A bundled or integrated payment product would capture commuters who currently use both systems but pay separately.

What I got wrong on the first pass

Two things, both worth flagging because they are common.

The first was a duration outlier I did not catch until the second cleaning pass. Divvy's 2023 release included trips with ended_at - started_at measured in days, not minutes. Some of those are real (a bike stolen and recovered); most are docking-sensor failures. Computing the average trip duration with those rows included inflated the mean by minutes. The fix was to clip durations to a sensible range: drop trips under one minute (mostly docking-sensor glitches) and trips over twenty-four hours (mostly equipment lost or stolen). Plotting the duration histogram before and after was the diagnostic: the clipped distribution had a clean right tail, while the unclipped one had a thin scatter of multi-day outliers that pulled every aggregate.
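The clipping rule is two boolean filters. A minimal sketch on a handful of durations in minutes, with one sensor glitch and one multi-day outlier:

```python
import pandas as pd

# Durations in minutes: a sub-minute sensor glitch (0.3) and a
# multi-day outlier (3100) bracket three legitimate trips.
durations = pd.Series([0.3, 8.0, 12.0, 27.0, 3100.0])

# The clip described above: keep 1 minute <= duration <= 24 hours.
clean = durations[(durations >= 1) & (durations <= 24 * 60)]

print(durations.mean())  # pulled far upward by the multi-day outlier
print(clean.mean())      # ~15.7 -- the figure the clipped data supports
```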

The second was an early version of the station-concentration chart that conflated station activity with trip volume. Every trip contributes one start-station count and one end-station count, so a naive count that pools both columns double-counts trips by a factor of two. The Pareto shape is real, but my first numbers overstated it. The corrected version is the one in the final workbook.
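The fix was to keep origin and destination counts as separate columns rather than pooling them into one tally. A sketch assuming the standard `start_station_name` / `end_station_name` columns:

```python
import pandas as pd

# Each trip contributes one origin count and one destination count;
# keeping the roles as separate columns avoids double-counting volume.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "C", "A"],
})

origins = trips["start_station_name"].value_counts().rename("origins")
dests = trips["end_station_name"].value_counts().rename("destinations")
flows = pd.concat([origins, dests], axis=1).fillna(0).astype(int)
print(flows)
# The two columns sum to 2x the trip count by construction, so
# concentration shares must be computed within one role at a time.
```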

Neither error showed up in tutorials I had used to learn pandas. They came from spending an extra evening on the data after I thought I was done.

Methodology

The data work was done in Python with Pandas for cleaning and feature engineering. Time-series decomposition isolated weekly cycles from seasonal trends. Spatial analysis used station-level geocoding to produce origin-destination flow maps. The visualizations were built in Tableau Public for the final deliverable.

The interactive workbook is embedded below. Explore the full dashboard, filter by season, and drill into station-level patterns.

Divvy: a year in review (2023). Interactive Tableau Public dashboard.

This analysis was completed during the Big Data Analytics program at Georgian College and was framed as a hypothetical consulting engagement rather than a published business analysis. The dataset, as a public release, is identical to the one used by Divvy's operating team. The recommendations are mine.

What carried forward

The pattern I keep using from this project is the member-versus-casual disaggregation. It is a generic move (split a population by behavioural type before computing any aggregate) and once you have the habit it shows up everywhere. The dashboards I build at metricHEALTH for clinical operations are the same shape: total volume is rarely the metric a manager actually wants. Volume by population, by time of day, by hand-off pattern is. The Divvy workbook was where I built the muscle for that kind of cut, before the data started being about people instead of bikes.

GitHub: https://github.com/TirtheshJani/Case_Study_Divvy_Bikes-