Telemetry Is Math. Start With The Constraint.

Telemetry is math.

If it does not change a decision, it is trivia.

I learned this the expensive way.

By asking engineering for “a few more events” every week.

Then wondering why nobody trusted the dashboard.

Then realizing the dashboard was not the problem.

My telemetry design was.

So I went back to first principles.

Pick an outcome.

Pick the smallest set of signals that explain it.

Turn those signals into triggers that drive actions.

And yes.

I started with regression.

It failed.

That was useful.

The Real Issue Was Shape

Most PLG motions are not smooth.

They look smooth in your head.

Because you imagine a buyer slowly warming up over time.

Then you look at behavior.

And it is a cliff.

Nothing changes.

Then the user hits a structural constraint.

Free tier cap.

Rate limit.

Quota.

Seat limit.

Usage ceiling.

Whatever your product calls the wall.

Now you do not have “more usage equals slightly higher probability.”

You have “fine, fine, fine, decision.”

That shape matters.

Because your modeling tool is only as good as your assumptions.

Why Regression Struggled

Regression works great when the world is continuous.

It wants a relationship that behaves like a dimmer switch.

A little more X.

A little more probability.

A capped PLG motion behaves like a turnstile.

You can push on it all day.

Nothing happens.

Then you hit the stop.

Then you decide whether to pay, churn, or hack around it.

In that world, regression can still work.

But it often needs help.

You end up engineering the step into the feature set.

At that point, you are spending time translating the product’s constraint into math.

Instead of letting the math surface the constraint.
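That translation step looks like this in practice. A minimal sketch, assuming a hypothetical 300-minute free cap; the constant and feature names are invented for illustration:

```python
import numpy as np

# Hypothetical free-tier cap, invented for this sketch.
FREE_CAP_MINUTES = 300.0

def step_features(minutes: np.ndarray) -> np.ndarray:
    """Hand-engineer the product's constraint into regression features.

    Raw minutes alone describes a dimmer switch. To let a linear model
    see the turnstile, you add the step and the wall proximity yourself.
    """
    hit_cap = (minutes >= FREE_CAP_MINUTES).astype(float)      # the step
    proximity = np.clip(minutes / FREE_CAP_MINUTES, 0.0, 1.0)  # distance to the wall
    return np.column_stack([minutes, hit_cap, proximity])

X = step_features(np.array([50.0, 290.0, 310.0]))
print(X)
```

Every feature here encodes knowledge of the constraint that a tree could have discovered on its own. That is the complaint.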

So I changed tools.

Why CART Worked

A decision tree (CART, short for classification and regression trees) is built for thresholds.

It is not trying to explain the world with a single smooth equation.

It is trying to split the world into buckets that behave differently.

That maps to how PLG actually runs.

Because PLG operations are routing.

If usage crosses X, do Y

If cap proximity is within Z, prompt upgrade

If errors spike, intervene

If activity drops after onboarding, rescue

That is what a tree gives you.

A split you can point at.

A threshold you can build around.

In practice, the tree surfaced a tipping point around recent usage minutes.

That became the heavy tester window.

And it gave me something that matters more than a pretty model.

A defensible trigger.
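A minimal sketch of that workflow with scikit-learn, on invented data with a synthetic 240-minute tipping point. The numbers are illustrative, not the real model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Invented data: conversion probability jumps at 240 usage minutes.
minutes = rng.uniform(0, 600, size=2000)
converted = rng.random(2000) < np.where(minutes >= 240, 0.6, 0.05)

# A depth-1 tree (a "stump") recovers the threshold as its first split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(minutes.reshape(-1, 1), converted)

threshold = stump.tree_.threshold[0]
print(f"heavy-tester window starts near {threshold:.0f} minutes")
```

The split is the trigger: a single number you can point at and build routing around.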

Where Regression Comes Back In

Regression is still valuable when you use it as a filter and a validator.

Once the tree showed me where the signal lived, I stopped chasing clever behavioral features.

I focused on volume primitives.

Usage minutes.

Cap proximity.

Error volume.

Maybe one or two operational health signals.

Then multivariable logistic regression held up fine.

The takeaway was blunt.

Minutes carried most of the conversion signal.
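Regression-as-validator can be sketched like this, on invented data where minutes drives the outcome and the other primitives are noise. Standardizing first makes the coefficients comparable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000

# Invented primitives; in practice these come from the usage pipeline.
minutes = rng.uniform(0, 600, n)
cap_proximity = rng.uniform(0, 1, n)
errors = rng.poisson(2, n).astype(float)

# Synthetic outcome driven almost entirely by minutes.
logit = 0.01 * minutes - 4.0
converted = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = StandardScaler().fit_transform(
    np.column_stack([minutes, cap_proximity, errors])
)
model = LogisticRegression().fit(X, converted)

# Standardized coefficients make "which primitive carries the signal"
# readable at a glance.
coefs = dict(zip(["minutes", "cap_proximity", "errors"], model.coef_[0]))
for name, coef in coefs.items():
    print(f"{name:>14}: {coef:+.2f}")
```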

That is the whole game for telemetry.

Find the boring variables that explain outcomes.

Boring is cheap to maintain.

Boring stays stable when the UI changes.

Boring survives redesigns, experiments, and onboarding tweaks.

Primitives Beat Event Spam

Most teams instrument nouns.

Clicked pricing

Viewed docs

Hovered tooltip

Opened modal

Those can be useful for product discovery.

They are also fragile.

They break when UX changes.

They create endless debate.

They grow into 200 events that nobody owns.

Primitives travel.

Time

Volume

Rate

Proximity to a constraint

Failures and friction counts

If you are building a system, primitives win.

Because they support thresholds.

Thresholds support plays.

Plays change outcomes.
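The thresholds-to-plays chain is plain routing once it is written down. The cutoffs and play names below are invented for this sketch; the heavy-tester cutoff would come from the tree split:

```python
# Invented thresholds for this sketch.
HEAVY_TESTER_MINUTES = 180   # would come from the CART split
FREE_CAP_MINUTES = 300
ERROR_SPIKE = 10

def route(minutes_7d: float, errors_24h: int, days_since_signup: int) -> str:
    """Map primitive signals to a single next play, most urgent first."""
    if errors_24h >= ERROR_SPIKE:
        return "support_intervention"          # errors spike: intervene
    if minutes_7d >= 0.8 * FREE_CAP_MINUTES:
        return "upgrade_prompt"                # near the cap: prompt upgrade
    if minutes_7d >= HEAVY_TESTER_MINUTES:
        return "sales_assist"                  # heavy tester: assist
    if days_since_signup > 14 and minutes_7d < 10:
        return "rescue_campaign"               # activity dropped: rescue
    return "no_action"

print(route(minutes_7d=250, errors_24h=1, days_since_signup=5))  # upgrade_prompt
```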

Two Classes Of Telemetry

I bucket telemetry into two classes.

  • Action telemetry: signals that trigger an intervention

  • Diagnosis telemetry: signals that explain what happened after the fact

Both have value.

They should not be funded the same way.

Action telemetry deserves engineering time.

Because it drives routing.

Diagnosis telemetry should start lightweight.

Sampled.

Aggregated.

Added only when there is a specific learning goal.

If you treat every curiosity as a first-class event, you bury the team.

Then you get what you deserve.

A noisy dashboard and a cynical engineering org.

The Build Sequence

Here is the sequence that avoids wasted instrumentation.

  1. Pick one outcome

    Conversion, activation, retention, expansion

    One at a time

  2. Start with primitives

    Minutes

    Cap proximity

    Errors

    Latency

    Retry volume

    Anything that reflects friction or value

  3. Use regression as a filter

    Does this primitive explain the outcome at all

    If it does not, kill it

  4. Use CART to find thresholds

    Where does behavior change

    What window separates converters from non-converters

  5. Turn thresholds into plays

    Product prompts

    Lifecycle messaging

    Sales assist

    Support interventions

    Whatever your motion uses

This keeps telemetry tied to decisions.

It keeps the backlog sane.

Cap Proximity Management

If your PLG motion has a cap, cap proximity management is core infrastructure.

You are not tracking minutes for fun.

You are tracking minutes because the wall creates a predictable moment.

The system you want is simple.

Measure how close someone is to the wall

Measure how fast they are approaching it

Measure friction near the wall

Trigger the right intervention before they collide with payment friction

That is where conversion happens.

Or does not.
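The first two measurements reduce to a few lines of arithmetic. A sketch, again assuming a hypothetical 300-minute free tier:

```python
from dataclasses import dataclass

FREE_CAP_MINUTES = 300.0  # hypothetical free tier

@dataclass
class CapStatus:
    remaining: float    # minutes left before the wall
    days_to_cap: float  # at the current 7-day pace (inf if idle)

def cap_status(used_minutes: float, minutes_last_7d: float) -> CapStatus:
    remaining = max(FREE_CAP_MINUTES - used_minutes, 0.0)
    daily_pace = minutes_last_7d / 7.0
    days = remaining / daily_pace if daily_pace > 0 else float("inf")
    return CapStatus(remaining=remaining, days_to_cap=days)

s = cap_status(used_minutes=260.0, minutes_last_7d=70.0)
print(f"{s.remaining:.0f} minutes left, ~{s.days_to_cap:.1f} days to cap")
# 40 minutes left, ~4.0 days to cap
```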

Most teams wait until the user hits the wall.

Then they call it a “pricing problem.”

It is rarely a pricing problem.

It is a sequencing problem.

Engineering Checklist

This is what I ask engineering for now.

It is boring on purpose.

Core primitives

  • Rolling usage minutes by window (24h, 7d)

  • Remaining free capacity and days-to-cap at current pace

  • Error count and error rate near the cap

  • Latency and timeout rates if relevant

Core derived signals

  • Cap proximity band (far, near, imminent)

  • Heavy tester band based on the tree threshold

  • Friction band based on recent failures

If a proposed event cannot be mapped to one of those signals or to a specific decision, it does not ship.
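The derived signals are just cutoffs over the primitives. A sketch with invented band boundaries; only the heavy-tester cutoff is model-derived:

```python
def cap_proximity_band(days_to_cap: float) -> str:
    """Far / near / imminent, using invented day boundaries."""
    if days_to_cap <= 3:
        return "imminent"
    if days_to_cap <= 14:
        return "near"
    return "far"

def heavy_tester_band(minutes_7d: float, tree_threshold: float = 240.0) -> str:
    """The cutoff would come from the CART split, not from taste."""
    return "heavy" if minutes_7d >= tree_threshold else "light"

def friction_band(errors_24h: int) -> str:
    """Low / elevated / high, using invented error counts."""
    if errors_24h >= 10:
        return "high"
    if errors_24h >= 3:
        return "elevated"
    return "low"

print(cap_proximity_band(2.5), heavy_tester_band(310), friction_band(4))
# imminent heavy elevated
```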

The Point

Telemetry is not a scrapbook.

It is a control system.

Regression helps you identify which primitives matter.

CART helps you turn those primitives into thresholds.

Thresholds turn into plays.

Plays change outcomes.

Everything else can wait.
