A/B testing is the only reliable way to know if a change improves conversion. Done wrong, it lets you cherry-pick noise and ship changes that hurt. Freshmarketer’s experiment tool is solid; here is how to use it without fooling yourself.

Pick the right metric

The primary metric is the one you optimize. It should be:

  • Downstream of the change you are testing (a checkout test should optimize purchase, not button clicks).
  • Stable enough to detect a 10 percent change with a reasonable sample.
  • Aligned with business value (revenue, sign-ups, retention, not vanity metrics).

Pick the metric before launching the test. Do not change it mid-experiment; that invalidates the result.

Sample size calculation

Freshmarketer’s experiment setup includes a sample size calculator. Inputs: baseline conversion rate, minimum detectable effect, confidence level (95 percent default, which corresponds to a 5 percent significance level), power (80 percent default).

A typical e-commerce checkout at 3 percent conversion needs roughly 53,000 visitors per arm to detect a 10 percent relative lift (3.0 to 3.3 percent) at 95 percent confidence and 80 percent power. Below the calculated sample, you cannot tell a real effect from noise.
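To sanity-check whatever a calculator reports, the arithmetic can be reproduced in a few lines. This is a minimal sketch using the standard normal-approximation formula for comparing two proportions; it is not Freshmarketer's own implementation, and the function name `sample_size_per_arm` is made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 3 percent baseline, 10 percent relative lift: roughly 53,000 per arm.
n = sample_size_per_arm(0.03, 0.10)
print(n)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why small lifts on low-conversion pages take so long to confirm.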

Do not stop a test early because it “looks like it’s winning.” Wait for the sample, then decide.

Randomization

By default, Freshmarketer randomizes per visitor cookie, so the same visitor sees the same variant on repeat visits. This is correct for most experiments.
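The idea of sticky assignment can be illustrated with a deterministic hash. This is only a sketch of the general technique, not how Freshmarketer assigns variants internally; `visitor_id` stands in for whatever identifier the cookie stores, and `assign_variant` is a hypothetical helper.

```python
import hashlib


def assign_variant(visitor_id, experiment_id, variants=("control", "variant")):
    """Map a visitor to a variant; the same inputs always give the same answer."""
    key = f"{experiment_id}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)  # stable, roughly uniform bucket
    return variants[bucket]


# A returning visitor lands in the same arm on every visit.
first_visit = assign_variant("visitor-123", "checkout-button-test")
repeat_visit = assign_variant("visitor-123", "checkout-button-test")
print(first_visit == repeat_visit)
```

Including the experiment identifier in the hash keeps assignments independent across experiments, so a visitor in the control arm of one test is not systematically in the control arm of the next.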

For experiments where the change persists post-purchase (subscription pricing, for example), confirm the cookie persists long enough to capture the post-conversion behavior. A 30-day cookie cannot support a 60-day churn analysis.

Variant design

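Test one change at a time. Multivariate testing multiplies the number of arms: 4 elements with 2 versions each make 16 combinations, so comparing combinations at the same per-arm sample needs about 8 times the traffic of a simple two-arm test.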

The variant should differ from the control only in what the hypothesis tests. If your hypothesis is “shorter button text increases clicks,” only the button text changes. Same color, same placement, same surrounding copy.

Running the test

Launch on a Tuesday. Run for at least one full week to capture day-of-week effects. A common e-commerce trap: launch on Monday, see a result by Friday, and ship over the weekend. Weekend behavior differs, and the win evaporates.

Two weeks is the safe minimum for B2B; one week is enough for high-traffic consumer sites.
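Planned duration follows from the required sample and your traffic, rounded up to whole weeks so every weekday is represented equally. The sketch below assumes traffic is split evenly across arms; `planned_duration_days` and the traffic figures are illustrative, not values from any tool.

```python
import math


def planned_duration_days(visitors_per_arm, arms, daily_visitors, min_weeks=1):
    """Days to run, rounded up to full weeks and never below min_weeks."""
    total_needed = visitors_per_arm * arms
    days_for_sample = math.ceil(total_needed / daily_visitors)
    weeks = max(min_weeks, math.ceil(days_for_sample / 7))
    return weeks * 7


# Hypothetical site: 53,000 per arm, 2 arms, 10,000 eligible visitors a day.
days = planned_duration_days(53000, 2, 10000)
print(days)
```

If the answer comes out at several months, the test is probably underpowered for your traffic, and a larger minimum detectable effect or a higher-traffic page is the more realistic choice.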

Stopping criteria

Stop the test when:

  • Sample size reached and result is significant.
  • Sample size reached and result is not significant (no effect detected).
  • Test has run 4 weeks with no convergence (something is wrong with setup).

Do not stop because of a hunch.
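Writing the stopping rules down as code before launch removes the temptation to improvise. The following is a sketch under stated assumptions: it uses a pooled two-proportion z-test for significance, and it reads "no convergence" as "4 weeks elapsed without reaching the sample size", which is one possible interpretation. All names are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def stopping_decision(conv_a, n_a, conv_b, n_b, required_per_arm,
                      days_running, alpha=0.05, max_days=28):
    """Apply the pre-registered stopping rules; never stop on a hunch."""
    if min(n_a, n_b) >= required_per_arm:
        p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
        if p_value < alpha:
            return "stop: significant"
        return "stop: no effect detected"
    if days_running >= max_days:
        return "stop: check setup"
    return "keep running"


# Made-up counts: sample reached, 3.0 percent vs about 3.4 percent conversion.
print(stopping_decision(1590, 53000, 1800, 53000, 53000, days_running=14))
```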

Avoiding p-hacking

The cardinal sin: peeking at results, deciding to keep running until significance, then stopping. Every extra look is another chance for noise to cross the threshold, so with enough peeks a false positive is all but guaranteed.

Freshmarketer shows interim results. Discipline yourself: only check at the calculated sample size or at predefined checkpoints.
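A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests, where both arms share the same true conversion rate and any "significant" result is a false positive by construction. It compares stopping at the first significant peek against checking once at the planned sample. The run count, rates, and peek interval are arbitrary illustration values.

```python
import math
import random
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two arms of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(runs=500, n_per_arm=2000, peek_every=200, rate=0.10, seed=7):
    """Return (false positive rate with peeking, false positive rate at final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_at_peek = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeker would stop and declare a winner at the first significant look.
            if i % peek_every == 0 and is_significant(conv_a, conv_b, i):
                flagged_at_peek = True
        peeking_hits += flagged_at_peek
        final_hits += is_significant(conv_a, conv_b, n_per_arm)
    return peeking_hits / runs, final_hits / runs


peek_rate, final_rate = simulate_aa_tests()
print(f"false positives with peeking: {peek_rate:.1%}, single final look: {final_rate:.1%}")
```

The single final look should land near the nominal 5 percent, while ten peeks typically push the false positive rate well above that.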

Segmentation analysis

After the test concludes, segment the result: did mobile users behave differently from desktop? New vs returning? Different traffic sources?

Segment analysis reveals who the change helped or hurt. A test that wins overall may lose for mobile users; ship only to desktop in that case.

Do not segment to hunt for significance when the overall test failed. That is p-hacking by another name.
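One common guard when looking at several segments is to tighten the significance threshold for each one. The sketch below uses a Bonferroni correction, which is a deliberately conservative choice and is not something the experiment tool is known to apply for you; the p-values are invented for illustration.

```python
def bonferroni_significant(segment_p_values, alpha=0.05):
    """Flag segments whose p-value survives a Bonferroni-corrected threshold."""
    if not segment_p_values:
        return {}
    threshold = alpha / len(segment_p_values)  # 4 segments -> 0.0125
    return {name: p < threshold for name, p in segment_p_values.items()}


# Hypothetical per-segment p-values from a finished test.
flags = bonferroni_significant(
    {"mobile": 0.03, "desktop": 0.004, "new": 0.20, "returning": 0.60}
)
print(flags)
```

Here the mobile result at 0.03 would look significant on its own but does not survive the correction, which is the kind of finding to treat as a hypothesis for a follow-up test rather than a conclusion.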

Documenting results

Every experiment gets a write-up:

  • Hypothesis.
  • Variant description.
  • Sample size, duration, primary metric, result.
  • Segments analyzed.
  • Decision (ship, kill, iterate).

Without documentation, you re-test the same questions and lose institutional learning.
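A fixed record structure keeps write-ups comparable across experiments. This is one possible shape, sketched as a Python dataclass; the field names mirror the list above and the example values are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    variant_description: str
    primary_metric: str
    sample_size_per_arm: int
    duration_days: int
    result: str
    segments_analyzed: list = field(default_factory=list)
    decision: str = "iterate"

    def __post_init__(self):
        # Restrict the decision to the three outcomes used in the write-up.
        if self.decision not in {"ship", "kill", "iterate"}:
            raise ValueError(f"unknown decision: {self.decision}")


# Hypothetical example entry.
record = ExperimentRecord(
    hypothesis="Shorter button text increases purchases",
    variant_description="Button text shortened; color and placement unchanged",
    primary_metric="purchase conversion rate",
    sample_size_per_arm=53000,
    duration_days=14,
    result="no effect detected",
    segments_analyzed=["mobile", "desktop"],
    decision="kill",
)
print(json.dumps(asdict(record), indent=2))
```

Storing these as JSON in a shared folder or repository is enough to make past experiments searchable before someone proposes the same test again.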

What to do this week

If you have an active experiment, confirm the sample size requirement and stick to it. If you do not have one, pick the highest-leverage page on your site and design one experiment for next week.
