
Imagine you’re building a model to predict whether a patient has a disease based on their symptoms. After going through the full pipeline — data cleaning, feature engineering, modeling — you reach an accuracy of 83%.
In a stakeholder meeting, you confidently say:
“We can probably push this to 90%.”
Then the real work begins.
You tune hyperparameters.
You try more data.
You switch to bigger, more powerful models.
But nothing changes. The accuracy barely moves.
And then it hits you:
Two patients can have identical symptoms…
and still have different diagnoses.
At that point, a different kind of question emerges:
What if the problem itself has a limit?
When a model stops improving, we usually assume one of two things:
So we respond accordingly:
But there’s a third possibility that is often overlooked:
The data itself does not contain enough information.
Even with the perfect model, trained perfectly, some predictions would still be wrong. Not because the model failed — but because the problem is inherently ambiguous.
This unavoidable uncertainty has a name:
Bayes Error.
It represents the lowest possible error any model can achieve on a given problem.
Not because models are limited —
but because the data itself is.
In simple terms:
Even if you had the perfect model, you still wouldn’t get everything right.
Why?
Because sometimes, the same input can legitimately map to different outcomes.
Machine learning is often described as learning a function from inputs to outputs. But another useful way to think about it is this:
Machine learning is about reducing uncertainty using the information available in the data.
Before seeing any features, the label has some inherent unpredictability. In information theory, this is captured by Entropy — a measure of how uncertain the outcome is before you know anything. Once we observe the features, some of that uncertainty is reduced. How much uncertainty was removed is captured by Mutual Information — the overlap between what the features tell you and what the label is.
But features rarely explain everything completely. The uncertainty that remains after observing the inputs is captured by Conditional Entropy — H(Y|X). This is the part of the label’s unpredictability that the features simply cannot explain.
And this is exactly where Bayes Error lives.
In our experiment with 20% label noise, the features still perfectly separate the two classes — X > 0 predicts the true label without error. But 20% of the labels were randomly flipped, so the features no longer predict the observed label perfectly. That 20% is pure conditional entropy. It is uncertainty that no model can resolve, because it has no structure, no pattern, and no feature that explains it. It is just noise.
In practice, Bayes Error is tied to this remaining uncertainty — the part of the problem the features simply cannot explain. A model can only do as well as the information in the features allows. Everything beyond that is irreducible.
To see this in action, we’ll create a simple dataset where:
This allows us to control how much information is available — and observe what happens when that information is limited.
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 20000
# Input feature
X = np.random.randn(n, 1)
# True rule: perfectly predictable
y = (X[:, 0] > 0).astype(int)
# Inject irreducible noise
noise_rate = 0.2
flip = np.random.rand(n) < noise_rate
y[flip] = 1 - y[flip]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y)
The underlying rule here is simple:
A perfect model could achieve 100% accuracy. So, we introduce noise. 20% of labels are randomly flipped. This means that even a perfect model cannot exceed 80% accuracy and that 20% is pure uncertainty. It cannot be learned, no matter what model we use.
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
models = {
"Logistic Regression": LogisticRegression(),
"MLP (small)": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
"MLP (large)": MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000),
}
for name, model in models.items():
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"{name} → Train: {train_acc:.3f}, Test: {test_acc:.3f}")
We try models with increasing capacity:
If model capacity was the limiting factor, we should see accuracy improve. But that’s not what happens.
| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression | 0.795 | 0.803 |
| MLP (small) | 0.796 | 0.803 |
| MLP (large) | 0.796 | 0.805 |
As we increase model capacity, both training and test accuracy plateau at nearly the same level.
This tells us something important:
The model is not improving because there is no additional signal left to learn.

Here, we keep the noise rate fixed at 20% and compare models with increasing capacity. The point is simple: once the information limit is reached, adding model capacity does not meaningfully improve performance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
np.random.seed(0)
n = 20000
noise_rate = 0.2
X = np.random.randn(n, 1)
y = (X[:, 0] > 0).astype(int)
flip = np.random.rand(n) < noise_rate
y[flip] = 1 - y[flip]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
models = [
("LogReg", LogisticRegression()),
("MLP (5)", MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=42)),
("MLP (10)", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=42)),
("MLP (50)", MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=42)),
("MLP (100,100)", MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000, random_state=42)),
]
model_names = []
train_scores = []
test_scores = []
for name, model in models:
model.fit(X_train, y_train)
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
model_names.append(name)
train_scores.append(accuracy_score(y_train, train_pred))
test_scores.append(accuracy_score(y_test, test_pred))
plt.figure(figsize=(10, 5))
plt.plot(model_names, train_scores, marker="o", label="Train Accuracy")
plt.plot(model_names, test_scores, marker="o", label="Test Accuracy")
plt.axhline(y=1 - noise_rate, linestyle="--", label="Theoretical Ceiling")
plt.xlabel("Increasing Model Capacity")
plt.ylabel("Accuracy")
plt.title("More Capacity Does Not Break the Information Limit")
plt.ylim(0.75, 0.85)
plt.legend()
plt.tight_layout()
plt.show()
This is the point where the idea becomes hard to ignore. Across models, the achievable accuracy stays pinned near the same ceiling. The important thing is not that one model is slightly better than another. The important thing is that none of them escape the limit.

That is the signature of an information-limited problem. The ceiling doesn’t depend on the model - it depends on the noise.
Just like in the medical example, the model wasn’t failing because it was too simple. It was failing because data didn’t contain enough signal.
At this point, it’s natural to question what’s happening. If models are not improving, we usually fall back on the familiar explanations.
But neither explanation fits what we’re seeing.
It’s not underfitting, because even simple models can learn the true rule.
It’s not overfitting, because training and test accuracy are close.
Instead:
The model is doing the best it possibly can given the information available.
The remaining errors are not due to bad optimization or insufficient model capacity. They exist because the data itself is uncertain.
So far, this was a controlled experiment. We deliberately injected noise to make the limit visible. But in real-world problems, no one tells you how much noise exists. And that’s where things get tricky.
In practice, when a model plateaus, the default reaction is almost automatic:
It feels like progress.
But often, it isn’t.
Because sometimes, the issue is not the model — it’s the amount of information available in the data.
In real systems, noise isn’t artificially injected. It comes from things you simply don’t observe.
For example:
Churn prediction
You can track clicks, sessions, and purchase history. But you cannot observe the moment a user simply loses interest, finds a better alternative, or goes through a life change that makes your product irrelevant. That decision happens entirely outside your data. No matter how much behavioral data you collect, the signal that actually drove the churn was never recorded.
Medical diagnosis
Two patients can present with identical symptoms, identical test results, and identical histories, and yet have different underlying conditions. Biology has individual variation that no standard set of features can fully account for. The ambiguity is not a gap in the data collection process. It is built into the problem itself.
In these cases, the data is inherently incomplete. The missing information is not something you can engineer your way around.
Instead of asking:
“How do I make the model better?”
You start asking:
“Do I have enough information to solve this problem?”
When you hit this limit, the solution is not:
The solution is:
In the end, models don’t create information - they extract it.
It’s easy to believe that better models should always lead to better results, and in many cases they do. But this experiment highlights something more fundamental: sometimes performance stops improving not because the model is weak, but because the problem itself has a limit. By creating a simple dataset and controlling the noise, we saw that increasing model capacity didn’t break the accuracy ceiling and that training and test performance plateaued together. That is the essence of Bayes Error: there is a limit to how well any model can perform, and that limit is set by the information available. In real-world systems, this limit is hidden. You don’t know the noise rate or exactly how much uncertainty exists; you only see a model that refuses to improve. And that’s where this perspective matters, because instead of endlessly tuning models, you can step back and ask a better question: am I trying to solve a problem that the data cannot fully answer? That shift changes how you approach machine learning. You stop chasing marginal gains through complexity and start focusing on improving the data itself.
Key takeaway
Sometimes, the most important decision isn’t how to improve the model — it’s whether improving the model will help at all.
© Copyright 2023 by DataChronicle