
Incorrect values returned by the evaluate method in Model API #20788

Open
Carath opened this issue Jan 20, 2025 · 5 comments
Carath commented Jan 20, 2025

The evaluate() method from the Model API returns accuracy and loss values that are inconsistent with the ones computed during training.

Issue happening with:

  • tensorflow==2.18.0, keras==3.8.0, numpy==2.0.2
  • tensorflow==2.17.1, keras==3.5.0, numpy==1.26.4

Issue not happening with:

  • tensorflow==2.12.0, keras==2.12.0, numpy==1.23.5

This has been tested both with and without a GPU, on two personal computers and on a Google Colab VM.

Example of this behavior on the validation values:

Training
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9496 - loss: 0.0078 - val_accuracy: 0.9581 - val_loss: 0.0065

evaluate()
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9519 - loss: 0.0075

Code to reproduce the issue:

import numpy as np
import tensorflow as tf
import keras

seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)
keras.utils.set_random_seed(seed)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x_train = np.expand_dims(x_train / 255.0, -1)
x_test = np.expand_dims(x_test / 255.0, -1)
y_train_one_hot = keras.utils.to_categorical(y_train, 10)
y_test_one_hot = keras.utils.to_categorical(y_test, 10)

model = keras.models.Sequential([
	keras.layers.Input((28, 28, 1)),
	keras.layers.Flatten(),
	keras.layers.Dense(64, activation="relu"),
	keras.layers.Dense(10, activation="softmax")
])

model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["accuracy"])
model.summary()

model.fit(
	x=x_train,
	y=y_train_one_hot,
	batch_size=40,
	epochs=2,
	validation_data=(x_test, y_test_one_hot)
)

mse = keras.losses.MeanSquaredError()

# Incorrect values. Changing the batch_size to 40 or 1 does not remove the issue:
print("\nResults from evaluate():")
model.evaluate(x_test, y_test_one_hot, batch_size=None)

print("\nBypassing evaluate():") # those values match the ones computed during training!
print("accuracy =", np.mean(np.argmax(model.predict(x_test), axis=1) == y_test))
print("mse = %.4f" % float(mse(model.predict(x_test), y_test_one_hot)))
Carath commented Jan 20, 2025

Additional information: the problem doesn't seem to arise when passing batch_size=None and steps=1 to evaluate().

Maybe there is an issue with one of the internal callbacks? In any case, the correct loss and accuracy values should be returned no matter which batch_size is used (as long as it divides the number of samples).
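
For reference, a minimal sketch of the workaround described above, reusing model, x_test and y_test_one_hot from the reproduction script (the return_dict=True flag is only added here to make the returned values easy to inspect):

# Workaround sketch: pass batch_size=None together with steps=1 so that
# evaluate() finishes after a single step; in this thread that was reported
# to avoid the mismatching printed metrics.
results = model.evaluate(x_test, y_test_one_hot, batch_size=None, steps=1, return_dict=True)
print(results)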

sonali-kumari1 (Contributor) commented:
Hi @Carath -

I have tried to replicate the issue. The problem doesn't arise when passing batch_size=None and steps=1 to model.evaluate() because the entire dataset is processed in a single batch. To keep a standard batch_size, you can use verbose=2 in model.evaluate() to get consistent results and more detailed metric aggregation during evaluation. Attaching a gist for your reference. Thanks!
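
A minimal sketch of the suggestion above, assuming the model and test arrays from the reproduction script are still in scope:

# Suggested call: verbose=2 prints a single summary line per evaluation
# instead of a per-batch progress bar with running metric estimates.
results = model.evaluate(x_test, y_test_one_hot, verbose=2)
print(results)  # typically [loss, accuracy] when return_dict is not set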

Carath commented Jan 22, 2025

Thank you for your answer.

I fail to see how this really fixes the issue; many users might simply call evaluate() and be confused by the mismatching values. Moreover, it seems quite strange that changing the verbosity level of the method causes completely different values to be printed.

I also don't think this is a batch size issue, as the values returned when passing return_dict=True are actually correct (to within 1e-4) despite the printed values being wrong:

print(model.evaluate(x_test, y_test_one_hot, batch_size=32, return_dict=True))

Output:

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9519 - loss: 0.0075 
{'accuracy': 0.9581000208854675, 'loss': 0.006478920113295317}

I propose setting verbose=2 as the default argument of evaluate(), so as not to confuse users until the underlying issue is fixed.

sonali-kumari1 (Contributor) commented:
Hi @Carath -

Changing the verbosity level of the method yields slightly different values because verbose=1 displays a progress bar after each batch, while verbose=2 shows summary metrics of training and validation at the end of each epoch. Avoid passing batch_size to model.evaluate(), as batch generation is handled internally. You can refer to this documentation for more details.
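
For comparison, a small sketch (again reusing the objects from the reproduction script) that contrasts the output of the two verbosity levels with the values actually returned:

# verbose=1: per-batch progress bar; the displayed numbers are running
# estimates updated batch by batch.
res_v1 = model.evaluate(x_test, y_test_one_hot, verbose=1, return_dict=True)

# verbose=2: a single summary line for the whole evaluation.
res_v2 = model.evaluate(x_test, y_test_one_hot, verbose=2, return_dict=True)

# The returned dictionaries should match each other (and the validation
# metrics from training) regardless of the verbosity level.
print(res_v1)
print(res_v2)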

Carath commented Jan 23, 2025

Created a pull request to prevent the incorrect behavior from affecting most users.

On a side note, the replies to this thread seem to be (at least partially) coming from a chatbot; I don't believe that is really appropriate, if it is the case.
