Christopher Orr

Using Customer.io to fix rate limits on Customer.io

Customer.io Journeys is an excellent product. It’s by far the best customer messaging platform I’ve used — and there have been a few.

It works reliably, it’s easy to get data into (and out of!) the product, and everything is well tied together. From looking at a customer profile, I can see which segments they belong to. From a segment, I can see which campaigns are targeting it. From a campaign, I can see exactly which step any given customer is at.

There are many flexible features. Their documentation is pretty solid, and sometimes even provides guidance on how to implement certain features if you don’t want to use their SDK. It’s a very good product overall.

However, there’s a bit of a learning curve, and there are definitely rough edges and less-than-stellar parts, e.g. anything involving subscription preferences or multiple languages.

But today we’ll concentrate on the problems with rate-limiting on Customer.io, and how you can use features of Customer.io itself to fix it! 🚀

Sending messages with Customer.io

Customer.io enables you to send various types of message, including emails, mobile push notifications, in-app messages, SMS, or webhooks.

These can be sent in different ways, including:

  • Broadcasts — sent to a large group of people at once, such as an email newsletter;
  • Campaigns — individual customers pass through a workflow: like a flowchart where different messages can be sent over time, based on various conditions.

Example of a campaign workflow in the Customer.io visual editor

Broadcasts are designed to contact tens of thousands of customers at the same time. It’s also possible for large numbers of customers to enter a campaign at the same time.

Either way, Customer.io has the infrastructure set up to send massive volumes of messages within a matter of moments. Which is great; this is exactly what we want!

The problem with sending too many messages at once

Sending email or SMS as fast as possible should be fine, since Customer.io takes care of this for you. But for other types of message, sending a lot of them too quickly can cause a burden on your infrastructure.

The simplest example is sending webhooks. If Customer.io tries to deliver a webhook for each of 10,000 customers as quickly as possible, this may completely overwhelm your backend servers. Customer.io is capable of making thousands of webhook requests within just a few seconds, whereas your infrastructure is perhaps set up to typically deal with tens or hundreds of HTTP requests per second, rather than huge request spikes.

But this doesn’t just affect webhooks. For example, if your Android app receives a push notification, its process will be started, which may trigger API requests to your backend. If 10,000 customers receive a push notification within a few seconds, this may indirectly result in thousands of requests to your backend at almost exactly the same time.

Rate-limiting messages on Customer.io

The good news is that broadcasts allow you to throttle the message sending rate:

Broadcast setup on Customer.io showing the 'limit send rate' option

Broadcasts let you restrict how many messages are sent per minute/hour/day

The bad news is that there are numerous caveats to the rate-limiting behaviour on Customer.io, meaning that “limit send rate” doesn’t really solve the concerns above.

Problem 1: “Limit send rate” only works for broadcasts

If many people enter your campaign, or otherwise hit an action at the same time, Customer.io can trigger hundreds of messages at the same time. There is no built-in option to limit the send rate on a per-message or per-campaign basis, and messages will be sent as quickly as possible.

Problem 2: “Limit send rate” can’t be used in various cases

Customer.io has a very useful feature which lets you send a newsletter at a certain time of day, while respecting each customer’s time zone.

Unfortunately, in addition to the limitation with languages shown in the screenshot above, if you want to use this time zone feature, the rate-limiting option will be disabled.

Splitting up your customer base by time zone should cause fewer messages to be sent each hour, but each of these batches can still contain a huge number of messages. Ideally, the “limit send rate” config would be applied to these batches, but it’s currently not possible with Customer.io, and messages will be sent as quickly as possible.

Problem 3: Messages are not spread across the rate-limit period

In cases where you can apply a limited send rate, it unfortunately doesn’t work as you might expect. While the broadcast page says that you can “set a maximum send rate” to avoid “large spikes in traffic”, this feature doesn’t actually prevent request spikes.

Warning in the broadcast UI that sending as quickly as possible can lead to large spikes in traffic, and that you can set a maximum send rate

For example, if you specify a send rate of 1500 messages/minute, you might reasonably expect Customer.io to spread out those 1500 messages across a 60-second period, attempting to deliver an average of 25 messages per second.

In reality, Customer.io will send a batch of 1500 messages as quickly as possible, in the first few seconds of the minute. It will then sit idle for a minute before repeating this behaviour with the next batch of 1500 messages.

This means that even with “limit send rate” enabled, a webhook broadcast will send you huge numbers of HTTP requests within a few seconds, rather than a more constant and manageable number of requests spread across the duration of the broadcast.

Chart of HTTP requests, spiking to 1000 reqs/sec once each minute, followed by 60 seconds of nothing

Webhook broadcast that respects the 1500 reqs/min limit, but has a peak of over 1000 reqs/sec
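To put rough numbers on the gap between the configured rate and the observed peak, here’s a toy model of the two delivery patterns for a limit of 1500 messages/minute. The two-second burst duration is an assumption based on what I observed, and the function names are made up for this sketch; it says nothing about how Customer.io’s scheduler is actually implemented.

```javascript
// Toy model: number of messages delivered in each second of one minute.
// `burstSeconds` is an assumed figure, not a documented value.
function burstProfile(perMinute, burstSeconds = 2) {
  const perSecond = new Array(60).fill(0);
  for (let s = 0; s < burstSeconds; s++) {
    perSecond[s] = perMinute / burstSeconds;
  }
  return perSecond;
}

// What you might expect instead: the same total, spread evenly.
function evenProfile(perMinute) {
  return new Array(60).fill(perMinute / 60);
}

const burst = burstProfile(1500);
const even = evenProfile(1500);

console.log(Math.max(...burst)); // 750
console.log(Math.max(...even)); // 25
```

Both profiles respect the same per-minute limit, but the peak load on your backend differs by a factor of 30.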

You can reduce the height of these request peaks by vastly reducing the configured send rate. This extends the time it takes for the broadcast to complete, but this might be acceptable if you’re sending messages that aren’t particularly time-critical.

Problem 4: Webhook retries occur without much jitter

If your backend servers are overloaded and unable to respond to a webhook request quickly enough, or they return HTTP errors, Customer.io will keep retrying those failed deliveries with exponential backoff.

The first few retries sometimes seem to arrive too quickly, which can further exacerbate the problem of your servers being overloaded. Beyond that, exponential backoff behaviour can indeed be observed for HTTP requests that Customer.io retries. 👍

However, it appears that the backoff implementation has very little jitter, meaning that retried requests are sent at roughly the same time, leading to request spikes, which can once again overload your servers, and cause some of those webhooks to time out.

Chart of HTTP requests, with spiky clusters of retries around 300 and 800 seconds after campaign start

Example where webhook retries are quite clustered, rather than spread out more evenly
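For anyone unfamiliar with the term, here’s a small sketch of what jitter means in this context. This is not Customer.io’s retry logic, which I can only observe from the outside; the base delay, the cap, and the function names are all assumptions picked for illustration. It uses the common “full jitter” variant, where each retry waits a random fraction of the backoff delay.

```javascript
// Exponential backoff: the delay doubles with each attempt, up to a cap.
// baseSeconds and capSeconds are illustrative values only.
function backoffDelay(attempt, baseSeconds = 10, capSeconds = 3600) {
  return Math.min(capSeconds, baseSeconds * 2 ** attempt);
}

// "Full jitter": pick a random delay between 0 and the backoff value,
// so that requests which failed together do not retry together.
function jitteredDelay(attempt, random = Math.random) {
  return random() * backoffDelay(attempt);
}

// Without jitter, 1000 failed requests all retry after the same delay...
const withoutJitter = new Set(
  Array.from({ length: 1000 }, () => backoffDelay(3))
);
// ...with jitter, they are spread across the whole 0-80 second window.
const withJitter = new Set(
  Array.from({ length: 1000 }, () => Math.floor(jitteredDelay(3)))
);

console.log(withoutJitter.size); // 1
console.log(withJitter.size); // many distinct seconds
```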

Implementing granular rate-limiting using Customer.io

The great thing about Customer.io is that you can do a lot with campaign workflows — including combining features together to greatly improve the rate-limiting behaviour!

Rather than sending a huge message spike once per minute, you can spread the load across each minute, sending manageable batches of messages every five seconds! 🔥

Using this technique, I’ve been able to stop worrying about huge numbers of customers entering a campaign at the same time. I’m also able to broadcast push notifications to customers at the same time of day with respect to their time zone, without causing our backend to be hit with a deluge of concurrent HTTP requests.

Here’s a look at a rate-limited campaign workflow on Customer.io:

Screenshot of the rate-limiting campaign workflow

How this Customer.io campaign limits the message send rate

We allow the relevant customers to enter the campaign, and for each one:

  1. Just before it’s time to send a message, assign them a random future timestamp;
    • This can be anything: one minute from now, or one hour from now; it depends on the time period over which you wish to spread out message deliveries.
  2. Wait until that customer’s randomly chosen time arrives;
  3. Send them the desired message;
  4. Tidy things up by removing the timestamp attribute.

Since these steps are taken for each customer, each one is assigned a different random timestamp, after which point they’ll be sent the message. This automatically spreads out the message sending over time, rather than sending to everybody simultaneously.

Customer.io appears to evaluate the journey steps for each customer every five seconds, so the message won’t be sent exactly when the customer’s random timestamp is reached, but it’ll be very close!
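If you’d like to see the effect before building anything, the idea can be simulated in a few lines. This is only a sketch: the five-second tick reflects what I’ve observed rather than anything documented, and simulateSpread and its parameters are names I’ve invented for the example.

```javascript
// Simulate customers entering a campaign together at `nowSeconds`.
// Each is given a random send time within `windowSeconds`, then the sends
// are grouped into evaluation ticks (assumed here to be 5 seconds apart).
function simulateSpread(customers, windowSeconds, nowSeconds, tickSeconds = 5, random = Math.random) {
  const perTick = new Map();
  for (let i = 0; i < customers; i++) {
    const sendAt = Math.ceil(nowSeconds + random() * windowSeconds);
    // The message goes out on the first tick at or after sendAt.
    const tick = Math.ceil((sendAt - nowSeconds) / tickSeconds) * tickSeconds;
    perTick.set(tick, (perTick.get(tick) || 0) + 1);
  }
  return perTick;
}

// 1000 customers spread over one minute.
const ticks = simulateSpread(1000, 60, 1700000000);
const counts = [...ticks.values()];

console.log(ticks.size); // at most 13 distinct ticks (0, 5, ..., 60)
console.log(Math.max(...counts)); // largest batch, far below 1000
```

Instead of one batch of 1000 messages, you get a series of much smaller batches, one per tick.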

Let’s go through each step needed to implement this in a campaign.

1️⃣ Assign a random future timestamp to the customer

Add a “Create or Update Person” action so that we can attach a new attribute to the customer, specifying when the message should be sent.

In this example, the attribute is called tmp_delay, but this can be anything you like, so long as it’s unique and won’t be used elsewhere. I like using a prefix like tmp_ when creating short-lived profile attributes that are specific to a particular campaign.

Screenshot of adding an attribute to a customer using JavaScript

The attribute’s value should be defined by a JavaScript snippet that looks like this:

return Math.ceil((Date.now() / 1000) + (Math.random() * 60 * 3))

This example calculates the current timestamp, then adds on a random number of seconds between 0 and 180.

You can choose to add any length of time here: the value you choose defines how long it will take to send messages for a group of customers who enter the campaign at the same time.

Note: the Date.now() method returns the current timestamp in milliseconds, but Customer.io expects an epoch timestamp in seconds, so that’s why we first divide by 1000 before adding on the randomly calculated extra time.
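If you want to experiment with the window length outside of Customer.io, the same calculation can be wrapped in a function. The name randomSendTime and its parameters are my own for this sketch; in the campaign itself you only need the single return expression shown above.

```javascript
// Same calculation as the campaign snippet, with the window as a parameter.
// nowMs and random can be overridden, which makes the function easy to test.
function randomSendTime(windowSeconds, nowMs = Date.now(), random = Math.random) {
  const nowSeconds = nowMs / 1000; // milliseconds -> seconds
  return Math.ceil(nowSeconds + random() * windowSeconds);
}

const sendAt = randomSendTime(3 * 60);
const secondsFromNow = sendAt - Date.now() / 1000;
console.log(secondsFromNow); // somewhere between roughly 0 and 180

// A seconds-based epoch timestamp currently has 10 digits, while a
// milliseconds-based one has 13: an easy way to spot a unit mix-up.
console.log(String(sendAt).length); // 10
```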

2️⃣ Wait until the time chosen at random

Add a “Wait Until” action so that we can delay sending the message until the current time reaches the randomly chosen value stored in the tmp_delay attribute.

Screenshot of the 'wait until' action configuration

Relative dates are somewhat tricky to understand in Customer.io, but to wait until the tmp_delay timestamp, you should add an attribute condition which says:

  • tmp_delay (the timestamp attribute)
  • is a timestamp before
  • a relative date
  • 0 days
  • from now.

While setting up this campaign, the tmp_delay attribute won’t yet exist and so won’t appear in the dropdown, but you can type its name in the textbox and hit Enter.

Once customers reach this point of the campaign, they should all have a tmp_delay attribute. But to be on the safe side, you should also add a “Maximum time” path, set to a duration that’s a little longer than the campaign duration chosen above (e.g. five minutes for a three-minute campaign).
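If the relative date wording is hard to parse, it may help to see the logic written as code. This is just my reading of how the step behaves, not Customer.io’s implementation; the function name, the enteredStepAt parameter, and the returned strings are invented for the sketch, and all timestamps are epoch seconds.

```javascript
// Decide what happens to a customer sitting at the "Wait Until" step.
function waitUntilOutcome(profile, enteredStepAt, nowSeconds, maxWaitSeconds = 5 * 60) {
  // "tmp_delay is a timestamp before 0 days from now" => tmp_delay < now
  if (typeof profile.tmp_delay === 'number' && profile.tmp_delay < nowSeconds) {
    return 'condition-met';
  }
  // Safety net: a missing attribute can't hold the customer here forever.
  if (nowSeconds - enteredStepAt >= maxWaitSeconds) {
    return 'max-time-reached';
  }
  return 'keep-waiting';
}

console.log(waitUntilOutcome({ tmp_delay: 100 }, 50, 101)); // condition-met
console.log(waitUntilOutcome({ tmp_delay: 100 }, 50, 100)); // keep-waiting
console.log(waitUntilOutcome({}, 0, 300)); // max-time-reached
```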

3️⃣ Send the message

This is the easy part! Send the push notification, webhook, or whatever type of message you need, as you would normally in a Customer.io campaign.

4️⃣ Tidy up

Once the “Wait Until” step has been passed, the tmp_delay attribute is no longer needed for the customer, and can be removed with another “Create or Update Person” action.

You can do this before or after sending the message; it doesn’t matter.

Impact of our more granular rate-limiting strategy

The chart below represents what happens when using this campaign approach, where 4500 customers enter the campaign at the same time, and we assign each of them a random timestamp up to three minutes in the future.

We can see that a much smaller number of requests is sent roughly every five seconds — peaking at just over 150 requests per second.

Chart of HTTP requests, showing spikes of 50–150 reqs/sec every few seconds

Our solution: 4500 webhook deliveries in three minutes, with a peak of around 150 reqs/sec

Compare this to the example we saw previously, where a Customer.io broadcast to the same 4500 customers uses the “limit send rate” feature, but hits the per-minute rate limit within two seconds, sending between 500 and 1000 requests per second:

Chart of HTTP requests, spiking to 1000 reqs/sec once each minute, followed by 60 seconds of nothing

CIO Broadcast: 4500 webhook deliveries in three minutes, with a peak of around 1000 reqs/sec

Downsides to this approach

This is definitely much more cumbersome than if rate-limiting worked as expected. However, this approach at least gives us the ability to rate-limit messages for campaigns, which otherwise isn’t possible at all.

But it also requires having a rough idea of how many people will enter your campaign at the same time, so that you can choose an appropriate time period to use in the JavaScript calculation. This will also depend on what your backend infrastructure can handle, so some trial and error will be needed.
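As a starting point for that trial and error, you can estimate the window from the number of customers you expect and the batch size your backend can comfortably absorb. This is a back-of-the-envelope sketch: suggestWindowSeconds is a made-up helper, and the five-second tick is the interval I’ve observed rather than a documented value.

```javascript
// Estimate how long the random window should be so that, on average,
// no more than `maxRequestsPerTick` messages go out per evaluation tick.
function suggestWindowSeconds(expectedCustomers, maxRequestsPerTick, tickSeconds = 5) {
  const ticksNeeded = Math.ceil(expectedCustomers / maxRequestsPerTick);
  return ticksNeeded * tickSeconds;
}

console.log(suggestWindowSeconds(4500, 125)); // 180
console.log(suggestWindowSeconds(10000, 100)); // 500
```

Because the timestamps are random, individual batches will be somewhat larger or smaller than the average, so leave yourself some headroom.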

If you’re using this to emulate a broadcast, you also need to take care that only the desired customers enter the campaign. Otherwise, since a campaign will always keep monitoring for customers that meet the entry criteria, you might end up sending to more people than intended.

For example, you could target the campaign using the desired segmentation for the broadcast, but apply an additional segment that targets customers with a created_at timestamp before the intended broadcast time. This way, customers created after starting the “broadcast” campaign won’t enter it. You can also stop the campaign once you see that all customers have flowed through it.
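In Customer.io this is configured as a segment condition rather than written as code, but the logic is simple enough to sketch. The function name and the sample data are purely illustrative; created_at is assumed to be an epoch timestamp in seconds.

```javascript
// Keep only customers who already existed when the "broadcast" started.
function eligibleForBroadcast(customers, broadcastStartSeconds) {
  return customers.filter((c) => c.created_at < broadcastStartSeconds);
}

const broadcastStart = 1700000000;
const sample = [
  { id: 'a', created_at: 1699990000 }, // existed before the broadcast
  { id: 'b', created_at: 1700000500 }, // signed up afterwards
];

console.log(eligibleForBroadcast(sample, broadcastStart).map((c) => c.id)); // [ 'a' ]
```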

Emulating a Customer.io broadcast with time zones

Here’s a quick look at how to use this campaign approach to work around the problem of not being able to combine broadcast rate-limiting and time zones.

Screenshot of the rate-limiting campaign workflow

In this case, we know in advance that we want to send a campaign this week, on Friday, at 1pm local time for each customer.

Compared to the campaign described above, the only difference is a “Time Window” action which waits until 1pm on Friday in each customer’s time zone before proceeding to assign a random delay.
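Customer.io takes care of the time zone handling for you, so there’s nothing to code here. Still, if you want a feel for what “1pm on Friday in each customer’s time zone” means in practice, here’s a conceptual sketch using the standard Intl.DateTimeFormat API. The function name is hypothetical, and it treats the whole 1pm hour as the window, which is my simplification.

```javascript
// Returns true if `date` falls within the 1pm hour on a Friday in the
// given IANA time zone (e.g. "Europe/Berlin").
function isFridayOnePm(date, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    weekday: 'short',
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(date);
  const get = (type) => parts.find((p) => p.type === type).value;
  return get('weekday') === 'Fri' && Number(get('hour')) === 13;
}

// Friday 5 January 2024, 12:30 UTC: 13:30 in Berlin, 07:30 in New York.
const moment = new Date('2024-01-05T12:30:00Z');
console.log(isFridayOnePm(moment, 'Europe/Berlin')); // true
console.log(isFridayOnePm(moment, 'America/New_York')); // false
```

The same instant is “time to send” for one customer and hours too early for another, which is why customers leave the time window in waves rather than all at once.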

Since you can’t schedule when a campaign starts (or start one via the API), you have to consider when best to start it (and later remember to stop it) when attempting to use a campaign like this to broadcast to many customers.

Final thought

I’m a big fan of products that aren’t too rigid, and provide some degrees of freedom when working with your data. The fact that we can use Customer.io to fill some of the (few) gaps in its own behaviour is truly remarkable.

Though as mentioned, the setup demonstrated in this article is still pretty inconvenient. I would be very happy if Customer.io were to fix all of the shortcomings described here, and this article became redundant. 🤞

Anyway, I hope this article was informative and can help on your Customer.io journey. Let me know if you have any feedback!