Visualize Bayesian Updating From Prior to Posterior

Visualize how a Beta prior distribution updates to a posterior after observing successes and failures. See the shift in probability estimates and credible intervals.

Last Updated: February 13, 2026

Posterior ∝ prior × likelihood. That's Bayes' rule for an unknown probability θ: you start with a prior belief about θ, multiply by the likelihood of the data given θ, and the renormalized product is the posterior. The page works on Bernoulli/binomial data because the conjugate prior is Beta(α, β): with s successes and f failures, the posterior is Beta(α + s, β + f) in closed form. No MCMC needed.

The Beta parameters have an intuitive read. α acts like pseudo-successes, β like pseudo-failures, baked into the prior before any data arrives. Beta(1, 1) is uniform (no information). Beta(2, 2) is weak, gently bunched around 0.5. Beta(50, 50) is sharp, peaked at 0.5, and takes a lot of contrary data to budge. The credible interval (the central region containing 95% of the posterior mass) is the Bayesian analogue of a frequentist CI, and unlike a CI it can be read directly as "P(θ ∈ this range | data) = 0.95." That clean interpretation is the practical reason to go Bayesian for proportion problems when stakeholders ask the natural question.

Set the Prior (Beta Parameters) With Intuition

The Beta distribution lives on [0, 1], making it a natural choice for probabilities. Its two parameters, α and β, shape the curve. Think of α as "pseudo-successes" and β as "pseudo-failures" baked into your initial belief. Beta(1, 1) is uniform—every probability is equally likely before you see data.

Higher α + β means a tighter distribution. Beta(2, 2) is gently mounded around 0.5, representing mild uncertainty. Beta(50, 50) is sharply peaked at 0.5, representing strong confidence that the probability is near 50%. The sum α + β acts like sample size—larger sums indicate firmer beliefs that take more data to move.

To encode historical knowledge, translate past data into α and β. If last quarter's conversion rate was 8% across 200 trials, a reasonable informative prior is Beta(16, 184), with mean 16/200 = 0.08. This anchors your analysis without ignoring what you already know.
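As a rough sketch of that translation, the helper below turns a historical rate and trial count into Beta pseudo-counts. The function name `prior_from_history` and the optional `weight` argument (a way to discount older data) are illustrative assumptions, not something this page defines.

```python
def prior_from_history(rate, trials, weight=1.0):
    """Turn a historical success rate into Beta(alpha, beta) pseudo-counts.

    weight < 1 discounts the historical data (hypothetical knob for this sketch).
    """
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must be strictly between 0 and 1")
    effective_n = trials * weight        # pseudo-sample-size, alpha + beta
    alpha = rate * effective_n           # pseudo-successes
    beta = (1.0 - rate) * effective_n    # pseudo-failures
    return alpha, beta


# Last quarter: 8% conversion across 200 trials.
alpha, beta = prior_from_history(0.08, 200)
prior_mean = alpha / (alpha + beta)
```

Passing `weight=0.5` would keep the same prior mean while halving α + β, which is one way to express that last quarter's data is only partly relevant.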

Common priors:

• Beta(1, 1): uniform, no preference

• Beta(0.5, 0.5): Jeffreys prior, invariant under reparameterization of θ

• Beta(2, 2): weak, centered at 0.5

Add Evidence: Successes and Failures

The Beta-Binomial conjugate pair makes updating trivial. If your prior is Beta(α, β) and you observe s successes and f failures, the posterior is Beta(α + s, β + f). No integrals, no numerical approximations—just add the counts. This property is why the Beta prior is so popular for binary outcomes.

Sequential updating works identically: update after each batch of data, or wait until all data arrives—the final posterior is the same. Only total counts matter, not the order. This lets you monitor results in real time without penalty.
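The order-independence claim can be checked with a few lines of Python. This is a minimal sketch: the `update` helper and the batch counts are made up for illustration.

```python
def update(prior, successes, failures):
    """Conjugate Beta update: add observed counts to the prior parameters."""
    alpha, beta = prior
    return alpha + successes, beta + failures


prior = (2, 2)
batches = [(3, 1), (0, 4), (7, 5)]   # hypothetical (successes, failures) per batch

# Sequential: each posterior becomes the prior for the next batch.
sequential = prior
for s, f in batches:
    sequential = update(sequential, s, f)

# Batch: pool all counts and update once.
total_s = sum(s for s, _ in batches)
total_f = sum(f for _, f in batches)
batch = update(prior, total_s, total_f)

# Processing the batches in reverse order gives the same posterior.
reordered = prior
for s, f in reversed(batches):
    reordered = update(reordered, s, f)
```

All three routes end at the same Beta parameters because the update is plain addition of counts.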

With little data, the posterior clings to the prior. With lots of data, the posterior shifts toward the observed rate and sharpens. Twenty observations barely move a Beta(100, 100) prior, but decisively reshape a Beta(1, 1) prior.

Example: Prior Beta(5, 5), observe 30 successes and 20 failures → Posterior Beta(35, 25). New mean = 35/60 ≈ 0.583, up from prior mean of 0.5.

Posterior Mean and Credible Interval Readout

The posterior mean is (α + s) / (α + s + β + f), a weighted blend of your prior belief and the observed rate. When α + β is small relative to s + f, the data dominates. When α + β is large, the prior anchors the estimate.
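To make the "weighted blend" concrete, the sketch below splits the posterior mean into a prior part and a data part. The weight on the prior is (α + β) / (α + β + s + f). The function name is an assumption for illustration, and it needs at least one observation.

```python
def posterior_mean_as_blend(alpha, beta, s, f):
    """Return (prior weight, blended mean, direct posterior mean)."""
    prior_n = alpha + beta             # pseudo-sample-size of the prior
    data_n = s + f                     # real sample size (must be > 0)
    w = prior_n / (prior_n + data_n)   # share of weight held by the prior
    prior_mean = alpha / prior_n
    observed_rate = s / data_n
    blended = w * prior_mean + (1 - w) * observed_rate
    direct = (alpha + s) / (alpha + beta + s + f)
    return w, blended, direct


# Prior Beta(5, 5) with 30 successes and 20 failures.
w, blended, direct = posterior_mean_as_blend(5, 5, 30, 20)
```

Here the prior holds 10/60 of the weight, so the posterior mean sits much closer to the observed rate of 0.6 than to the prior mean of 0.5.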

A 95% credible interval marks the region where the true probability lies with 95% posterior probability. Unlike frequentist confidence intervals, this is a direct probability statement given your prior and data—not a long-run coverage guarantee.

As data accumulates, the credible interval shrinks, with its width falling roughly in proportion to 1/√n. With 10 observations, you might have [0.25, 0.75]. With 1,000 observations split about evenly, that narrows to roughly [0.47, 0.53]. More data means more certainty.

Credible interval from inverse CDF:

Lower = F⁻¹(0.025), Upper = F⁻¹(0.975)

where F is the Beta CDF with posterior parameters
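The inverse-CDF recipe can be sketched with the standard library alone. This version integrates the Beta density numerically and inverts the CDF by bisection. It assumes α, β ≥ 1 so the density stays bounded, and all function names are illustrative. In practice a library routine such as `scipy.stats.beta.ppf` does the inversion.

```python
import math


def beta_pdf(x, a, b):
    """Beta(a, b) density at 0 < x < 1, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_cdf(x, a, b, steps=2000):
    """F(x) by the midpoint rule on [0, x]; assumes a, b >= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    return h * sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps))


def beta_ppf(q, a, b, tol=1e-6):
    """Inverse CDF by bisection, using the fact that F is increasing."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(a, b, level=0.95):
    """Central interval: equal posterior mass cut from each tail."""
    tail = (1 - level) / 2
    return beta_ppf(tail, a, b), beta_ppf(1 - tail, a, b)


# Posterior Beta(35, 25): prior Beta(5, 5) plus 30 successes, 20 failures.
lower, upper = credible_interval(35, 25)
```

This gives the central (equal-tailed) interval. A highest-density interval is a different construction and can differ when the posterior is skewed.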

How Strong Is the Prior? Sensitivity Toggle

Sensitivity analysis tests how much your conclusions depend on the prior. Run the same data through Beta(1, 1), Beta(5, 5), and Beta(50, 50). If all three posteriors agree, your inference is robust. If they diverge, you need more data or a better-justified prior.

A weak prior (low α + β) lets the data speak. A strong prior (high α + β) resists change. Neither is inherently right—the choice depends on how much you trust your prior information versus the new observations.

When sample size vastly exceeds prior strength, different priors converge to the same posterior. At n = 1,000 observations, Beta(1, 1) and Beta(10, 10) yield nearly identical results. At n = 10, they differ noticeably.
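A quick way to see this convergence is to compare posterior means across several priors at two sample sizes. The sketch below uses made-up data with a 70% observed rate, and the helper names are illustrative.

```python
def posterior_mean(prior, s, f):
    """Mean of the Beta(alpha + s, beta + f) posterior."""
    alpha, beta = prior
    return (alpha + s) / (alpha + beta + s + f)


def spread_across_priors(priors, s, f):
    """Gap between the largest and smallest posterior mean."""
    means = [posterior_mean(p, s, f) for p in priors]
    return max(means) - min(means)


priors = [(1, 1), (5, 5), (50, 50)]

# Hypothetical data: same 70% observed rate at two sample sizes.
small_n_spread = spread_across_priors(priors, 7, 3)        # n = 10
large_n_spread = spread_across_priors(priors, 700, 300)    # n = 1,000
```

A large spread at small n means the conclusion still depends on the prior. A small spread at large n means the data has taken over.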

Tip: Document your prior choice and rationale. Reviewers and stakeholders should understand why you used Beta(α, β) and what it represents in real-world terms.

Limits: When Beta-Binomial Isn't Appropriate

The model assumes independent, identically distributed Bernoulli trials with constant success probability. If your success rate drifts over time, trials are clustered, or outcomes depend on covariates, the simple Beta-Binomial breaks down.

Non-binary outcomes need different conjugate pairs. For counts without an upper bound, use Poisson-Gamma. For continuous measurements, use Normal-Normal. The Beta-Binomial is purpose-built for yes/no data.

Hierarchical models handle groups with different underlying rates. If you're comparing multiple variants or segments, each with its own probability, a hierarchical Beta-Binomial pools information across groups rather than treating them as one homogeneous population.

Check assumptions: Independence, constant probability, binary outcomes. Violations require more sophisticated models (logistic regression, mixed effects, time-series).

Bayes Visualizer Questions

Why does my posterior look so similar to my prior?

Your prior is stronger than your data. If α + β = 100 and you observe 20 trials, the prior contributes five times as much weight as the data. Collect more observations, or use a weaker prior if you lack solid historical justification.

Can I use Bayesian updating for continuous outcomes?

Yes, but not with the Beta-Binomial. For continuous data, the Normal-Normal or Normal-Gamma conjugate pairs apply. Each likelihood-prior pair has its own update rules. The principle—posterior ∝ prior × likelihood—remains the same.

How do I compare two variants in an A/B test?

Compute separate posteriors for each variant. Then sample from both posteriors and count how often variant A exceeds variant B. The fraction of times A > B is P(A better than B). Alternatively, compute the posterior on the difference or ratio directly.
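The sampling approach takes only a few lines with Python's standard library. This is a Monte Carlo sketch: the conversion counts are hypothetical, both variants start from a Beta(1, 1) prior, and the result carries simulation noise that shrinks as `draws` grows.

```python
import random


def prob_a_beats_b(post_a, post_b, draws=20000, seed=0):
    """Estimate P(theta_A > theta_B) by sampling both Beta posteriors."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        theta_a = rng.betavariate(*post_a)
        theta_b = rng.betavariate(*post_b)
        if theta_a > theta_b:
            wins += 1
    return wins / draws


# Hypothetical data: A converts 120 of 1,000, B converts 100 of 1,000.
post_a = (1 + 120, 1 + 880)
post_b = (1 + 100, 1 + 900)
p_a_better = prob_a_beats_b(post_a, post_b)
```

The same draws can be reused to summarize the difference θ_A − θ_B, for example by taking its 2.5% and 97.5% sample quantiles.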

What if my prior is wrong?

With enough data, the prior washes out. A poor prior slows convergence but does not block it. Sensitivity analysis shows whether the prior is still steering the result or whether the data has effectively taken over at your current sample size.

Why is the credible interval asymmetric?

If α and β differ, the Beta distribution is skewed. A posterior Beta(10, 90) has mean 0.1 and is right-skewed, so the interval is tighter on the low end and longer on the high end. Exact symmetry requires α = β.
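The direction of the skew can be read off the standard closed-form skewness of a Beta distribution, 2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ)). A small sketch, with an illustrative function name:

```python
import math


def beta_skewness(alpha, beta):
    """Skewness of Beta(alpha, beta); positive means a longer right tail."""
    numerator = 2 * (beta - alpha) * math.sqrt(alpha + beta + 1)
    denominator = (alpha + beta + 2) * math.sqrt(alpha * beta)
    return numerator / denominator


right_skewed = beta_skewness(10, 90)   # mass near 0.1, long tail toward 1
left_skewed = beta_skewness(90, 10)    # mirror image
symmetric = beta_skewness(5, 5)        # alpha == beta gives zero skew
```

A positive value matches the longer upper arm of the interval for Beta(10, 90).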

Limitations of the Beta-Binomial model

Binary outcomes with constant p: the Beta-Binomial conjugate pair only fits binary outcomes with a constant true probability. Count data wants Poisson-Gamma. Continuous outcomes want Normal-Normal or Normal-Gamma. Each likelihood-prior pair has its own update rules.

Independence: assumed between trials. Clustered data (multiple users from the same household, repeated visits from the same customer) violates it and you want a hierarchical model.

Prior sensitivity at small n: with only 3 observations, a Beta(50, 50) prior dominates the data, while a uniform Beta(1, 1) prior lets the data drive the posterior. Run a sensitivity check across plausible priors before drawing firm conclusions. If the posterior moves meaningfully from one prior to the next, your prior is doing too much work.

Credible interval interpretation: reads as "P(θ in this range | data) = 0.95," a direct probability statement about θ. That statement is only as trustworthy as the prior behind it, so the prior has to be defensible.

Note: For real Bayesian work, PyMC and Stan are the workhorses. JAGS still has its niche for older models. Gelman et al. "Bayesian Data Analysis" (BDA3) is the standard reference. Kruschke's "Doing Bayesian Data Analysis" is the more applied entry point.

Bayesian updating: working questions

What's a prior, and how do I pick one?

A prior is your belief about the parameter before seeing data, encoded as a probability distribution. For Beta-Binomial work, the Beta(α, β) prior parameterizes that belief, with α acting like "pseudo-successes" and β like "pseudo-failures." Beta(1, 1) is uniform: no information. Beta(2, 2) is gently mounded around 0.5: weak prior. Beta(50, 50) is sharply peaked at 0.5: strong prior. Pick based on what you actually know before the experiment. If you have historical data showing 30% conversion across many trials, encode that. If you have nothing, use Beta(1, 1) or the Jeffreys prior Beta(0.5, 0.5).

Why are conjugate priors useful?

The posterior stays in the same family as the prior, with parameters updated by simple addition. For a Beta(α, β) prior plus binomial data with s successes and f failures in n = s + f trials, the posterior is Beta(α + s, β + f). No integration, no MCMC, just arithmetic. Other conjugate pairs: Normal-Normal (for a mean with known variance), Poisson-Gamma, Dirichlet-Multinomial, and Inverse-Gamma for a Normal variance with known mean. Conjugacy is mathematically convenient but not always physically natural; for problems without conjugate structure, you fall back to numerical methods (PyMC, Stan).

Posterior, likelihood, prior, what's the difference?

Bayes' rule: posterior ∝ prior × likelihood. The prior P(θ) is what you believed before seeing data. The likelihood P(data | θ) is the probability of the observed data assuming a specific parameter value, read as a function of θ. The posterior P(θ | data) is what you believe after combining both. The renormalizing constant on the bottom of Bayes' rule (the marginal likelihood, P(data) = ∫ P(data | θ) P(θ) dθ) ensures the posterior integrates to 1. For point estimates from the posterior, the mean, median, or mode are all defensible choices, with the mean being most common.
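The "multiply, then renormalize" recipe can be carried out numerically on a grid of θ values and compared with the closed-form Beta result. This is a rough sketch under stated assumptions: the grid size and function name are arbitrary choices, and the example reuses a Beta(5, 5) prior with 30 successes and 20 failures.

```python
def grid_posterior(alpha, beta, s, f, points=1000):
    """Approximate the posterior on a grid: prior x likelihood, renormalized."""
    # Midpoints of equal-width cells, so theta never hits exactly 0 or 1.
    thetas = [(i + 0.5) / points for i in range(points)]
    unnormalized = [
        t ** (alpha - 1) * (1 - t) ** (beta - 1)   # Beta prior, up to a constant
        * t ** s * (1 - t) ** f                     # binomial likelihood, up to a constant
        for t in thetas
    ]
    total = sum(unnormalized)                       # stands in for the marginal likelihood
    weights = [u / total for u in unnormalized]
    return thetas, weights


thetas, weights = grid_posterior(5, 5, 30, 20)
grid_mean = sum(t * w for t, w in zip(thetas, weights))
closed_form_mean = (5 + 30) / (5 + 5 + 30 + 20)
```

The constants dropped from the prior and likelihood cancel in the division by `total`, which is why only the shape of each factor matters.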

Bayesian vs frequentist, what's the practical difference?

Frequentist: parameters are fixed unknowns, data are random; conclusions phrased as long-run capture rates of intervals or rejection rates of tests. Bayesian: parameters are random, data are fixed once observed; conclusions are direct probability statements about the parameter given the data. The Bayesian credible interval reads directly as "P(θ in this range | data) = 0.95," which is the cleaner interpretation and what stakeholders usually mean. The cost: you have to specify a prior, and the conclusions depend on that choice. For small data, prior matters. For lots of data, prior washes out and Bayes and frequentist agree.

Why use Beta for proportions?

Beta lives on [0, 1], the natural support for a probability. Its two parameters (α, β) flexibly encode a wide range of beliefs: uniform (1, 1), peaked at 0.5 (50, 50), skewed toward 0 (1, 10), skewed toward 1 (10, 1), bathtub-shaped (0.5, 0.5). And it's conjugate to the binomial likelihood, which makes updating trivial. The combination of flexibility and conjugacy is why Beta dominates for proportion problems. For other quantities (rates, means, variances), different distributions handle the same role.

Credible interval vs confidence interval, what's the difference?

A 95% credible interval contains the parameter with probability 0.95 given the data and prior. That's the direct interpretation. A 95% confidence interval has the property that, across repeated sampling, 95% of such intervals would contain the true parameter; for any specific interval, the probability is either 0 or 1, not 95%. The interpretations differ even when the numbers agree. With weak priors and moderate samples, Bayesian credible intervals and frequentist CIs usually agree closely. They diverge most under strong priors or small samples.

How does prior strength affect the posterior?

α + β acts like a pseudo-sample-size for the prior. A strong prior (large α + β) needs lots of contrary data to budge. A weak prior (small α + β) lets even small datasets dominate. With a Beta(50, 50) prior and 5 successes / 5 failures of new data, the posterior is Beta(55, 55), barely changed. With a Beta(1, 1) prior and the same data, the posterior is Beta(6, 6): now peaked at 0.5, but far wider than Beta(55, 55). Run a sensitivity check: try a few plausible priors and see how much the posterior changes. If the conclusion flips, your prior is doing too much work.
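Because the data in this example matches the prior mean, the effect shows up in the spread rather than the mean. The sketch below compares standard deviations before and after the update, using the standard Beta variance αβ / ((α + β)²(α + β + 1)). The helper names are illustrative.

```python
import math


def beta_sd(alpha, beta):
    """Standard deviation of Beta(alpha, beta)."""
    n = alpha + beta
    return math.sqrt(alpha * beta / (n * n * (n + 1)))


def update(prior, s, f):
    """Conjugate update: add successes and failures to the prior counts."""
    return prior[0] + s, prior[1] + f


results = {}
for prior in [(50, 50), (1, 1)]:
    posterior = update(prior, 5, 5)   # 5 successes, 5 failures
    # Ratio near 1 means the data barely tightened the belief.
    results[prior] = beta_sd(*posterior) / beta_sd(*prior)
```

A ratio close to 1 for Beta(50, 50) and a much smaller one for Beta(1, 1) is the numerical version of "barely changed" versus "reshaped".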

What's a posterior predictive distribution?

The distribution of a future observation, marginalized over the posterior of the parameter. For Beta-Binomial, after observing s successes and f = n − s failures in n trials with prior Beta(α, β), the posterior is Beta(α + s, β + f). The posterior predictive for the next m trials is Beta-Binomial(m, α + s, β + f), which is more spread out than a plain binomial with the same mean because it integrates over parameter uncertainty. Useful for: forecasting, A/B test outcome simulation, posterior model checks. PyMC's pm.sample_posterior_predictive and Stan's generated quantities block both compute it.
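For the conjugate case the posterior predictive has a closed form, so no sampler is needed. The sketch below evaluates the Beta-Binomial probability mass function with log-gamma terms and compares its variance with a plain binomial at the same mean. The posterior Beta(35, 25) and the horizon m = 10 are example values, and the function names are illustrative.

```python
import math


def log_beta_fn(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, m, a, b):
    """P(k successes in the next m trials) under a Beta(a, b) posterior."""
    log_choose = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
    return math.exp(log_choose + log_beta_fn(k + a, m - k + b) - log_beta_fn(a, b))


a, b = 35, 25    # example posterior parameters
m = 10           # number of future trials to forecast
pmf = [beta_binomial_pmf(k, m, a, b) for k in range(m + 1)]

pred_mean = sum(k * p for k, p in enumerate(pmf))
pred_var = sum((k - pred_mean) ** 2 * p for k, p in enumerate(pmf))

# Plain binomial that plugs in the posterior mean and ignores its uncertainty.
theta_hat = a / (a + b)
binomial_var = m * theta_hat * (1 - theta_hat)
```

The predictive variance comes out larger than the plug-in binomial variance, which is the extra spread from not knowing θ exactly.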