
Incorrect values returned by the evaluate method in Model API #20788

Open
Carath opened this issue Jan 20, 2025 · 5 comments
Carath commented Jan 20, 2025

The evaluate() method from the Model API returns accuracy and loss values that are inconsistent with the ones computed during training.

Issue happening with:

  • tensorflow==2.18.0, keras==3.8.0, numpy==2.0.2
  • tensorflow==2.17.1, keras==3.5.0, numpy==1.26.4

Issue not happening with:

  • tensorflow==2.12.0, keras==2.12.0, numpy==1.23.5

This has been tested both with and without a GPU, on two personal computers and on a Google Colab VM.

Example of this behavior on the validation values:

Training
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9496 - loss: 0.0078 - val_accuracy: 0.9581 - val_loss: 0.0065

evaluate()
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9519 - loss: 0.0075

Code to reproduce the issue:

import numpy as np
import tensorflow as tf
import keras

seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)
keras.utils.set_random_seed(seed)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x_train = np.expand_dims(x_train / 255.0, -1)
x_test = np.expand_dims(x_test / 255.0, -1)
y_train_one_hot = keras.utils.to_categorical(y_train, 10)
y_test_one_hot = keras.utils.to_categorical(y_test, 10)

model = keras.models.Sequential([
	keras.layers.Input((28, 28, 1)),
	keras.layers.Flatten(),
	keras.layers.Dense(64, activation="relu"),
	keras.layers.Dense(10, activation="softmax")
])

model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["accuracy"])
model.summary()

model.fit(
	x=x_train,
	y=y_train_one_hot,
	batch_size=40,
	epochs=2,
	validation_data=(x_test, y_test_one_hot)
)

mse = keras.losses.MeanSquaredError()

# Incorrect values. Changing the batch_size to 40 or 1 does not remove the issue:
print("\nResults from evaluate():")
model.evaluate(x_test, y_test_one_hot, batch_size=None)

print("\nBypassing evaluate():") # those values match the ones computed during training!
print("accuracy =", np.mean(np.argmax(model.predict(x_test), axis=1) == y_test))
print("mse = %.4f" % float(mse(model.predict(x_test), y_test_one_hot)))
Carath commented Jan 20, 2025

Additional information: the problem doesn't seem to arise when passing batch_size=None and steps=1 to evaluate().

Maybe there is an issue with one of the internal callbacks? In any case, the correct loss and accuracy values should be returned no matter which batch_size is used (as long as it divides the number of samples).
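
For reference, a minimal sketch of the workaround described above, reusing model, x_test and y_test_one_hot from the reproduction script (the return_dict=True flag is only added here to make the returned values easy to inspect):

# Workaround sketch: pass batch_size=None together with steps=1 so that
# evaluate() finishes after a single step; in this thread that was reported
# to avoid the mismatching printed metrics.
results = model.evaluate(x_test, y_test_one_hot, batch_size=None, steps=1, return_dict=True)
print(results)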

sonali-kumari1 (Contributor) commented:
Hi @Carath -

I have tried to replicate the issue. The problem doesn't arise when passing batch_size=None and steps=1 to model.evaluate() because the entire dataset is processed in a single batch. To keep a standard batch_size, you can use verbose=2 in model.evaluate() to get consistent results and more detailed metric aggregation during evaluation. Attaching a gist for your reference. Thanks!
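
A minimal sketch of the suggestion above, assuming the model and test arrays from the reproduction script are still in scope:

# Suggested call: verbose=2 prints a single summary line per evaluation
# instead of a per-batch progress bar with running metric estimates.
results = model.evaluate(x_test, y_test_one_hot, verbose=2)
print(results)  # typically [loss, accuracy] when return_dict is not set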

Carath commented Jan 22, 2025

Thank you for your answer.

I fail to see how this really fixes the issue; many users might simply call evaluate() and be confused by the mismatching values. Moreover, it seems quite strange that changing the verbosity level of the method causes completely different values to be printed.

I also don't think this is a batch size issue, as the values returned when passing return_dict=True are actually correct (to within 1e-4) despite the printed values being wrong:

print(model.evaluate(x_test, y_test_one_hot, batch_size=32, return_dict=True))

Output:

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9519 - loss: 0.0075 
{'accuracy': 0.9581000208854675, 'loss': 0.006478920113295317}

I propose setting verbose=2 as the default argument of evaluate(), so as not to confuse users until the underlying issue is fixed.

sonali-kumari1 (Contributor) commented:
Hi @Carath -

Changing the verbosity level of the method yields slightly different values because verbose=1 displays a progress bar after each batch, while verbose=2 shows summary metrics of training and validation at the end of each epoch. Avoid passing batch_size to model.evaluate(), as batch generation is handled internally. You can refer to this documentation for more details.
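
For comparison, a small sketch (again reusing the objects from the reproduction script) that contrasts the output of the two verbosity levels with the values actually returned:

# verbose=1: per-batch progress bar; the displayed numbers are running
# estimates updated batch by batch.
res_v1 = model.evaluate(x_test, y_test_one_hot, verbose=1, return_dict=True)

# verbose=2: a single summary line for the whole evaluation.
res_v2 = model.evaluate(x_test, y_test_one_hot, verbose=2, return_dict=True)

# The returned dictionaries should match each other (and the validation
# metrics from training) regardless of the verbosity level.
print(res_v1)
print(res_v2)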

Carath commented Jan 23, 2025

Created a pull request to prevent the incorrect behavior from affecting most users.

On a side note, the replies to this thread seem to be (at least partially) coming from a chatbot; I don't believe that is really appropriate, if it is the case.
