No step marker observed and hence the step time is unknown #578

pritamdodeja · 2023-03-18T09:51:55Z

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same
environment from which you normally run TensorFlow/TensorBoard, and
paste the output here:

https://raw.githubusercontent.com/tensorflow/tensorboard/master/tensorboard/tools/diagnose_tensorboard.py

Diagnostics

Diagnostics output

--- check: autoidentify                                                         
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1  
                                                                                
--- check: general                                                              
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
INFO: os.name: posix                                                            
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='71d6fe811d18', release='6.0.5-200.fc36.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Oct 26 15:55:21 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A                                              
                                                                                
--- check: package_management                                                   
INFO: has conda-meta: False                                                     
INFO: $VIRTUAL_ENV: None                                                        
                                                                                
--- check: installed_packages                                                   
INFO: installed: tensorboard==2.11.0                                            
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
INFO: installed: tensorflow-estimator==2.11.0                                   
INFO: installed: tensorboard-data-server==0.6.1                                 
                                                                                
--- check: tensorboard_python_version                                           
INFO: tensorboard.version.VERSION: '2.11.0'                                     
                                                                                
--- check: tensorflow_python_version                                            
INFO: tensorflow.__version__: '2.11.0'                                          
INFO: tensorflow.__git_version__: 'v2.11.0-rc2-17-gd5b57ca93e5'                 
                                                                                
--- check: tensorboard_data_server_version                                      
INFO: data server binary: '/usr/local/lib/python3.8/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'                            
                                                                                
--- check: tensorboard_binary_path                                              
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'                        
                                                                                
--- check: addrinfos                                                            
socket.has_ipv6 = True                                                          
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>                                 
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>                                
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>                          
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>                                 
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>                                 
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>                                     
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
                                                                                
--- check: readable_fqdn                                                        
INFO: socket.getfqdn(): '71d6fe811d18'                                          
                                                                                
--- check: stat_tensorboardinfo                                                 
INFO: directory: /tmp/.tensorboard-info                                         
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=805882112, st_dev=51, st_nlink=2, st_uid=0, st_gid=0, st_size=6, st_atime=1677293427, st_mtime=1677293598, st_ctime=1677293598)
INFO: mode: 0o40777                                                             
                                                                                
--- check: source_trees_without_genfiles                                        
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/dist-packages']; bad_roots (0): []
                                                                                
--- check: full_pip_freeze                                                      
INFO: pip freeze --all:                                                         
absl-py==1.3.0                                                                  
anyio==3.6.2                                                                    
argon2-cffi==21.3.0                                                             
argon2-cffi-bindings==21.2.0                                                    
asttokens==2.1.0                                                                
astunparse==1.6.3                                                               
attrs==22.1.0                                                                   
backcall==0.2.0                                                                 
beautifulsoup4==4.11.1                                                          
bleach==5.0.1                                                                   
cachetools==5.2.0                                                               
certifi==2022.9.24                                                              
cffi==1.15.1                                                                    
charset-normalizer==2.1.1                                                       
contourpy==1.0.6                                                                
cycler==0.11.0                                                                  
debugpy==1.6.3                                                                  
decorator==5.1.1                                                                
defusedxml==0.7.1                                                               
entrypoints==0.4                                                                
executing==1.2.0                                                                
fastjsonschema==2.16.2                                                          
flatbuffers==22.10.26                                                           
fonttools==4.38.0                                                               
gast==0.4.0                                                                     
google-auth==2.14.1                                                             
google-auth-oauthlib==0.4.6                                                     
google-pasta==0.2.0                                                             
grpcio==1.50.0                                                                  
gviz-api==1.10.0                                                                
h5py==3.7.0                                                                     
idna==3.4                                                                       
importlib-metadata==5.0.0                                                       
importlib-resources==5.10.0                                                     
ipykernel==5.1.1                                                                
ipython==8.6.0                                                                  
ipython-genutils==0.2.0                                                         
ipywidgets==8.0.2                                                               
jedi==0.17.2                                                                    
Jinja2==3.1.2                                                                   
jsonschema==4.17.0                                                              
jupyter==1.0.0                                                                  
jupyter-client==7.4.7                                                           
jupyter-console==6.4.4                                                          
jupyter-core==5.0.0                                                             
jupyter-http-over-ws==0.0.8                                                     
jupyter-server==1.23.2                                                          
jupyterlab-pygments==0.2.2                                                      
jupyterlab-widgets==3.0.3                                                       
keras==2.11.0                                                                   
kiwisolver==1.4.4                                                               
libclang==14.0.6                                                                
Markdown==3.4.1                                                                 
MarkupSafe==2.1.1                                                               
matplotlib==3.6.2                                                               
matplotlib-inline==0.1.6                                                        
mistune==2.0.4                                                                  
nbclassic==0.4.8                                                                
nbclient==0.7.0                                                                 
nbconvert==7.2.5                                                                
nbformat==4.4.0                                                                 
nest-asyncio==1.5.6                                                             
notebook==6.5.2                                                                 
notebook-shim==0.2.2                                                            
numpy==1.23.4                                                                   
oauthlib==3.2.2                                                                 
opt-einsum==3.3.0                                                               
packaging==21.3                                                                 
pandocfilters==1.5.0                                                            
parso==0.7.1                                                                    
pexpect==4.8.0                                                                  
pickleshare==0.7.5                                                              
Pillow==9.3.0                                                                   
pip==20.2.4                                                                     
pkgutil-resolve-name==1.3.10                                                    
platformdirs==2.5.4                                                             
prometheus-client==0.15.0                                                       
prompt-toolkit==3.0.32                                                          
protobuf==3.19.6                                                                
psutil==5.9.4                                                                   
ptyprocess==0.7.0                                                               
pure-eval==0.2.2                                                                
pyasn1==0.4.8                                                                   
pyasn1-modules==0.2.8                                                           
pycparser==2.21                                                                 
Pygments==2.13.0                                                                
pyparsing==3.0.9                                                                
pyrsistent==0.19.2                                                              
python-dateutil==2.8.2                                                          
pyzmq==24.0.1                                                                   
qtconsole==5.4.0                                                                
QtPy==2.3.0                                                                     
requests==2.28.1                                                                
requests-oauthlib==1.3.1                                                        
rsa==4.9                                                                        
Send2Trash==1.8.0                                                               
setuptools==65.5.1                                                              
six==1.16.0                                                                     
sniffio==1.3.0                                                                  
soupsieve==2.3.2.post1                                                          
stack-data==0.6.1                                                               
tensorboard==2.11.0                                                             
tensorboard-data-server==0.6.1                                                  
tensorboard-plugin-profile==2.11.1                                              
tensorboard-plugin-wit==1.8.1                                                   
tensorflow-cpu==2.11.0                                                          
tensorflow-estimator==2.11.0                                                    
tensorflow-io-gcs-filesystem==0.27.0                                            
termcolor==2.1.0                                                                
terminado==0.17.0                                                               
tinycss2==1.2.1                                                                 
tornado==6.2                                                                    
traitlets==5.5.0                                                                
typing-extensions==4.4.0                                                        
urllib3==1.26.12                                                                
wcwidth==0.2.5                                                                  
webencodings==0.5.1                                                             
websocket-client==1.4.2                                                         
Werkzeug==2.2.2                                                                 
wheel==0.34.2                                                                   
widgetsnbextension==4.0.3                                                       
wrapt==1.14.1                                                                   
zipp==3.10.0

Next steps

No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.

Issue description

I am running a very standard example of the TensorBoard callback (code below) and getting the "No step marker observed" issue.

import tensorflow as tf
import datetime
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28), name='layers_flatten'),
    tf.keras.layers.Dense(512, activation='relu', name='layers_dense'),
    tf.keras.layers.Dropout(0.2, name='layers_dropout'),
    tf.keras.layers.Dense(10, activation='softmax', name='layers_dense_2')
  ])

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch=(1,50))

model.fit(x=x_train, 
          y=y_train, 
          epochs=5, 
          validation_data=(x_test, y_test), 
          callbacks=[tensorboard_callback])

Step markers are either not getting logged by Keras or are not being read by TensorBoard. I would expect this information to be logged so that I can use the profiler to optimize tf.data usage. The environment is a standard TensorFlow Docker container, and the only additional package installed is tensorboard_plugin_profile.

@foxik has suggested this is a protobuf version issue and that upgrading to 3.20.3 fixed a similar issue for him. It didn't fix it for me; I am attaching the logs from both versions, before and after the upgrade. I originally opened the issue at tensorflow/tensorboard#6210, and @bmd3k asked me to recreate it here with all the information consolidated.

logs.oldprotobuf 2.zip

logs.protobuf.3.20.3.zip

foxik · 2023-03-18T18:36:17Z

Hi,

I retried my experiment and I actually did a slightly different thing -- I installed tensorflow==2.12.0rc0 (which brought tensorboard==2.12.0) and then tensorboard-plugin-profile==2.11.1, and finally downgraded to protobuf==3.20.3. This allows me to open profile runs created by both TF 2.11 and TF 2.12.0rc0.

JustASquid · 2023-03-31T02:45:02Z

I'm running into the same issue. The workaround suggested by @foxik didn't work for me either.
Are there any suggestions for other workarounds? Profiling is currently not possible for our model, and trying to figure out which custom layer is causing the issue is not feasible.

pritamdodeja · 2023-03-31T13:03:54Z

@JustASquid do you have the flexibility to run on a slightly older version of tf*? I was able to get this to work by doing that. I can share my config with you later today in case that's a viable option.

JustASquid · 2023-04-01T02:42:59Z

@pritamdodeja I did a run in version 2.10. I no longer get this warning, but the training step markers are wrong:

This run was with profile_batch="10,20" with a 30-batch epoch.

Could this be related to the same issue?

pritamdodeja · 2023-04-01T15:55:51Z

@JustASquid the original symptom I faced was that the profiler wasn't available, with the message related to the step markers shown in the screenshot above. Do you see that the profiler is available to you in TensorBoard? Try running the reproducible example I posted as a snippet above and see what results you get.

JustASquid · 2023-04-01T22:20:27Z

@pritamdodeja to clarify, the warning doesn't show up anymore when downgrading from TensorFlow 2.11 to TensorFlow 2.10 for the training run.

But the issue now is that the step numbers are all wrong, as you can see from the x-axis, which shows incorrect step numbers, and from the very strange "spiking". Could this be related to #266, perhaps?

pritamdodeja · 2023-04-02T12:07:26Z

@JustASquid It looks like the same issue to me. I don't know enough about protocol buffers yet to be able to debug it effectively, though. If/when that changes, I will post back here with an update.

pritamdodeja · 2023-08-26T21:11:19Z

@JustASquid I just tested this issue on the following configuration and it's still broken. Things are actually worse now, as you cannot go back to an older tf version because of the cuDNN dependency :( The profiler no longer shows up. If I get the time, I'm going to do a deep dive on the TensorBoard profiler and protocol buffers. I'm using the latest protobuf but setting

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

$ pip freeze | grep tensor
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorboard-plugin-profile==2.13.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.13.0
tensorflow-data-validation==1.13.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.33.0
tensorflow-metadata==1.13.1
tensorflow-model-analysis==0.44.0
tensorflow-serving-api==2.12.2
tensorflow-transform==1.13.0
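
The environment variable in the export line above can also be set from Python instead of the shell, provided this happens before protobuf picks its backend. The following is a minimal sketch under that assumption. The helper name `force_pure_python_protobuf` is hypothetical, and `api_implementation` is an internal protobuf module, so relying on its `Type()` call to report the active backend is an assumption that may not hold across protobuf releases.

```python
import os
import sys

ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"
# Internal protobuf module that selects the backend when first imported.
BACKEND_MODULE = "google.protobuf.internal.api_implementation"


def force_pure_python_protobuf():
    """Request the pure-Python protobuf backend for this process.

    Returns True if the backend has not been selected yet, i.e. the
    setting can still take effect in this interpreter.
    """
    backend_already_chosen = BACKEND_MODULE in sys.modules
    os.environ[ENV_VAR] = "python"
    return not backend_already_chosen


if __name__ == "__main__":
    in_time = force_pure_python_protobuf()
    print(f"{ENV_VAR}={os.environ[ENV_VAR]} (set before backend import: {in_time})")
    try:
        # Assumption: this internal module and its Type() helper are present.
        from google.protobuf.internal import api_implementation
        print("active protobuf backend:", api_implementation.Type())
    except ImportError:
        print("protobuf is not installed in this environment")
```

Call the helper at the very top of the training script, before importing tensorflow or tensorboard, since those imports pull in protobuf.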

pritamdodeja · 2023-08-27T00:37:12Z

I was able to understand why this is happening. The profiler is writing its data to a different place in the directory hierarchy. Once that is fixed, and the profile duration is long enough, the step marker issue goes away for me. I will provide details in the next day or so.

pritamdodeja · 2023-08-27T10:55:12Z

@JustASquid @foxik Here is my understanding of the possible cause of this:

Let's say you usually run tensorboard --logdir model_run to start tensorboard

tensorboard expects plugins/profile to exist in model_run/<run number>/<train|validation>

Starting with tensorflow 2.12 (possibly earlier) plugins/profile is instead appearing at model_run/<run number>

This causes tensorboard to not see the profile data and to not activate the profiler in the UI. Once you manually rectify this by copying the data using

cp -Rpv ../plugins .

in model_run/<run number>/<train|validation>

and refresh tensorboard, it should start seeing the profiler.
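
The manual copy described above can also be scripted. This is a minimal sketch, assuming the layout described in this comment: a plugins/profile directory directly under the run directory, with train and validation subdirectories next to it. The function name `mirror_profile_plugins` and the example path `model_run/1` are hypothetical, and the script only mirrors the `cp -Rpv ../plugins .` workaround; it does not address whatever causes the data to be written in the wrong place.

```python
import shutil
from pathlib import Path


def mirror_profile_plugins(run_dir, subdirs=("train", "validation")):
    """Copy <run_dir>/plugins into each existing <run_dir>/<subdir>.

    Returns the list of destination 'plugins' directories that were
    created or updated. Returns an empty list if no profile data is
    found at the run root.
    """
    run_dir = Path(run_dir)
    source = run_dir / "plugins"
    if not (source / "profile").is_dir():
        return []

    copied = []
    for name in subdirs:
        parent = run_dir / name
        if not parent.is_dir():
            continue  # e.g. no validation data was logged for this run
        target = parent / "plugins"
        # copytree uses copy2 by default, so file metadata is preserved
        # (similar to cp -p); dirs_exist_ok needs Python 3.8 or newer.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Hypothetical run directory; adjust to your own logdir layout.
    for path in mirror_profile_plugins("model_run/1"):
        print("copied profile data to", path)
```

After running it, refresh tensorboard as described above. An empty result means there was no plugins/profile directory at the run root, in which case this workaround does not apply.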

If I had to guess what introduced the change/error, I would say it's somewhere in the vicinity of

tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.cc

More specifically, in the tensorflow repo, I suspect the following might be helpful to figure out what exactly broke this

git diff 7a500e 4d4873 tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.h

My use case is a tfx pipeline, but I believe this applies to other setups where profiling is happening. Your log_dir and hierarchy will likely be different, but in relative terms the problem should be the same.

Gaura · 2023-11-03T14:23:41Z

Hello,

Thanks for raising and discussing the issue. I am facing the same issue. Could you tell me if this is resolved?

Thanks.

stellarpower · 2024-04-25T01:02:50Z

In my case I am able to obtain stats for example code similar to what @pritamdodeja has provided above, but not when I change to my own loss function that I am trying to debug (and this runs okay). I get the impression the core profiler is not outputting those markers, as they don't appear to be in the protobuf file, so have opened here.

pritamdodeja mentioned this issue Mar 18, 2023

No step marker observed and hence the step time is unknown tensorflow/tensorboard#6210

Open
