No step marker observed and hence the step time is unknown #578
Comments
Hi, I retried my experiment and I actually did a slightly different thing -- I installed |
I'm running into the same issue. The workaround suggested by @foxik didn't work for me either. |
@JustASquid do you have the flexibility to run on a slightly older version of tf*? I was able to get this to work by doing that. I can share my config with you later today if that's a viable option. |
@pritamdodeja I did a run in version 2.10, I no longer get this warning, but the training step markers are wrong: This run was with Could this be related to the same issue? |
@JustASquid the original symptom I faced was that the profiler wasn't available, together with the step marker message in the screenshot above. Is the profiler available to you in TensorBoard? Try running the reproducible example I posted as a snippet above and see what results you get. |
@pritamdodeja to clarify, the warning no longer shows up after downgrading from TensorFlow 2.11 to TensorFlow 2.10 for the training run. The issue now is that the step numbers are all wrong: the x-axis shows incorrect step numbers and some very strange "spiking". Could this be related to #266? |
@JustASquid It looks like the same issue to me. I don't know enough about protocol buffers yet to debug it effectively, though. If/when that changes, I will post back here with an update. |
@JustASquid I just tested this issue on the following configuration and it's still broken. Things are actually worse now as you cannot go back to an older tf version because of cudnn dependency :( - Profiler no longer shows up. If I get the time, I'm going to do a deep dive on tensorboard profiler and protocol buffers. I'm using the latest protobuf but setting
|
I was able to work out why this is happening. The profiler is writing its data to a different place in the directory hierarchy. Once that is corrected and the profile duration is long enough, the step marker issue goes away for me. I will provide details in the next day or so. |
@JustASquid @foxik Here is my understanding of the possible cause of this: Let's say you usually run tensorboard expects plugins/profile to exist in Starting with tensorflow 2.12 (possibly earlier) plugins/profile is instead appearing at This is causing tensorboard to not see the profile data, and not activating the profiler in the UI, etc. Once you manually rectify this by copying the data using
in and refresh tensorboard, it should start seeing the profiler. If I had to guess what introduced the change/error, I would say it's somewhere in the vicinity of
More specifically, in the tensorflow repo, I suspect the following might be helpful to figure out what exactly broke this
My use-case is in the context of a tfx pipeline, but I believe this applies to other use cases where profiling is happening, so likely your |
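The manual copy described above could be scripted roughly as follows. This is only a minimal sketch, not the fix used in this thread: `find_profile_dirs`, `copy_profile_data`, `search_root` and `expected_logdir` are hypothetical names, and it assumes you already know which directory your TensorBoard instance treats as the log directory. It uses only the Python standard library and makes no TensorFlow or TensorBoard calls.

```python
import shutil
from pathlib import Path


def find_profile_dirs(search_root):
    """Return every directory under search_root whose path ends in plugins/profile."""
    root = Path(search_root)
    return sorted(
        p for p in root.rglob("profile")
        if p.is_dir() and p.parent.name == "plugins"
    )


def copy_profile_data(search_root, expected_logdir):
    """Copy profile data found under search_root into expected_logdir/plugins/profile.

    expected_logdir is an assumption: pass the directory where your
    TensorBoard instance actually looks for plugins/profile.
    """
    target = Path(expected_logdir) / "plugins" / "profile"
    copied = []
    for src in find_profile_dirs(search_root):
        if src.resolve() == target.resolve():
            continue  # this data is already in the expected place
        target.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok (Python 3.8+) merges into an existing target directory
        shutil.copytree(src, target, dirs_exist_ok=True)
        copied.append(src)
    return copied
```

Calling `copy_profile_data(my_logdir, my_logdir)` and then refreshing TensorBoard would mirror the manual steps. The original files are copied rather than moved, so nothing is lost if the guess about the expected location turns out to be wrong.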
Hello, Thanks for raising and discussing the issue. I am facing the same issue. Could you tell me if this is resolved? Thanks. |
In my case I am able to obtain stats for example code similar to what @pritamdodeja has provided above, but not when I change to my own loss function that I am trying to debug (and this runs okay). I get the impression the core profiler is not outputting those markers, as they don't appear to be in the protobuf file, so have opened here. |
Environment information (required)
Diagnostics
Diagnostics output
Next steps
No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.
Issue description
Running a very standard example of the TensorBoard callback (code below) and getting the "No step marker observed" issue.
Step markers are either not being logged by Keras or not being read by TensorBoard. I would expect this information to be logged so that I can use the module for optimizing tf.data usage. The environment is a standard TensorFlow Docker container, with tensorboard_plugin_profile as the only additional package installed.
@foxik has suggested this is a protobuf version issue and that upgrading to 3.20.3 fixed a similar issue for him. It didn't fix it for me; I am attaching the logs from both versions, before and after the upgrade. I originally opened the issue at tensorflow/tensorboard#6210, and @bmd3k asked me to recreate it here with all the information consolidated.
logs.oldprotobuf 2.zip
logs.protobuf.3.20.3.zip