Azure streaming binary data error #716

Open
Joe-Heffer-Shef opened this issue Aug 23, 2022 · 7 comments

@Joe-Heffer-Shef

Joe-Heffer-Shef commented Aug 23, 2022

Problem description

I am trying to stream a binary file from Azure Blob Storage.

I expect to be able to iterate over chunks of the data set, but I see an error to do with the Azure readinto function.

I'm using the npTDMS library to read a LabVIEW data file in TDMS format (binary files of quantitative data).

Steps/code to reproduce the problem

The code is something like this:

import azure.storage.blob
import smart_open
import nptdms

CONN_STR = '******************'
BLOB_URI = 'azure://test/my_data_file.tdms'

transport_params = dict(
    client=azure.storage.blob.BlobServiceClient.from_connection_string(conn_str=CONN_STR),
)

with smart_open.open(BLOB_URI, mode='rb', transport_params=transport_params) as file:

    with nptdms.TdmsFile.open(file) as tdms_file:
        for group in tdms_file.groups():
            for channel in group.channels():
                for chunk in channel.data_chunks():
                    pass

and the error I get is:

Traceback (most recent call last):
  File "C:\Users\my_username\my_project\scripts\blob-tdms\smart.py", line 35, in <module>
    main()
  File "C:\Users\my_username\my_project\scripts\blob-tdms\smart.py", line 28, in main
    for chunk in channel.data_chunks():
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms.py", line 564, in data_chunks
    for raw_data_chunk in self._read_channel_data_chunks():
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms.py", line 758, in _read_channel_data_chunks
    for chunk in self._reader.read_raw_data_for_channel(self.path):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\reader.py", line 191, in read_raw_data_for_channel
    for i, chunk in enumerate(
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 269, in read_raw_data_for_channel
    for chunk in self._read_channel_data_chunks(f, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 367, in _read_channel_data_chunks
    for chunk in reader.read_channel_data_chunks(file, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 64, in read_channel_data_chunks
    yield self._read_channel_data_chunk(file, data_objects, chunk_index, channel_path)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 72, in _read_channel_data_chunk
    data_chunk = self._read_data_chunk(file, data_objects, chunk_index)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\daqmx.py", line 39, in _read_data_chunk
    combined_data = read_interleaved_segment_bytes(file, raw_data_width, chunk_size)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 159, in read_interleaved_segment_bytes
    combined_data = fromfile(f, dtype=np.uint8, count=number_bytes)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 147, in fromfile
    bytes_read = file.readinto(buffer[offset:])
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data
ValueError: invalid literal for int() with base 10: b'\x93\xad\x03\x00k\xf0\xff\xff\xfe\xee\xff\xffm\xfd\xff\xffd\xc1E\x00<\xad\x03\x00O\xf0\xff\xffI\xee\xff\xff\xd1\xfd\xff\xff\xbe\xc2E\x00\xe8\xac\x03\x00\xa6\xef\xff\xff\xe5\xed\xff\xff\x92\xfd\xff\x

It seems like it's expecting a text file? Or it's not calculating the data index correctly to page through the data set?

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Windows-10-10.0.19042-SP0
>>> print("Python", sys.version)
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:15:42) [MSC v.1916 64 bit (AMD64)]
>>> print("smart_open", smart_open.__version__)
smart_open 6.1.0

From pip list:

azure-core          1.23.0
azure-storage-blob  12.10.0
npTDMS              1.4.0
smart-open          6.1.0
@mpenkov
Collaborator

mpenkov commented Aug 23, 2022

Can you poke around with a debugger?

ValueError: invalid literal for int() with base 10

I don't see int() in the stack trace anywhere... I wonder what's actually raising that exception.

@Joe-Heffer-Shef
Author

It looks like it's something to do with how len works.

For example:

>>> import os
>>> data = os.urandom(32)
>>> data
b"\xaf\xc6\x89\xc4xt2s'_\xc5\xd3\xb1\xe9\x86\xa5&\x80\xf2!\x96q\xff\xbc\x81?\xc4\x8e\x14q\xe9E"
>>> len(data)
32

@piskvorky
Owner

I don't see any int() in your example either – how do you mean?

@Joe-Heffer-Shef
Author

Joe-Heffer-Shef commented Aug 24, 2022

I think the Python built-in function len is implemented in C as part of CPython, so there is no Python source code to step into.
https://docs.python.org/3/library/functions.html#len

def len(*args, **kwargs): # real signature unknown
    """ Return the number of items in a container. """
    pass

This means we won't see int in the stack trace.

I guess that when calling len(s) it tries to cast the size of the argument s to an integer. For some reason, this part of the code gets a binary data value for the size of the data variable?

File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data
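
As a quick check on where the message could come from: the text matches what Python's int() raises when it is handed bytes that are not ASCII digits, while len() on the same bytes simply returns an integer. A minimal sketch, using a made-up four-byte value in place of the real blob data:

```python
# Made-up sample bytes; the real data comes from the blob.
data = b"\x93\xad\x03\x00"

# len() only reports the number of bytes and never parses them.
size = len(data)

# int() accepts bytes only when they spell out a number in ASCII.
parsed = int(b"42")

# Handing int() raw binary bytes produces the message seen in the traceback.
try:
    int(data)
    message = None
except ValueError as exc:
    message = str(exc)

print(size, parsed)
print(message)
```

This suggests that something other than len is trying to convert the whole bytes object to an integer.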

@mpenkov
Collaborator

mpenkov commented Aug 24, 2022

For what possible values of b and data will b[:len(data)] = data (or parts of it) raise that exception?

If you're able to dig in with a debugger, it would be good to know what those values are.

@nharada1

I believe this is an issue under the hood with the readinto implementation. I run into this same error when using S3 and Linux. The problem seems to be assigning a binary string into a numpy array. Perhaps the exception that the next line catches should be ValueError instead of AttributeError?
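
A minimal sketch of that diagnosis, assuming the buffer passed to readinto is a NumPy uint8 array (the sample bytes and variable names below are made up for illustration). NumPy treats a bytes object as a single string scalar, so the slice assignment tries to parse it as an integer instead of copying it byte by byte:

```python
import numpy as np

# Stand-in for the buffer the caller passes to readinto (assumption: uint8).
b = np.zeros(8, dtype=np.uint8)
# Stand-in for the bytes returned by read().
data = b"\xff\xfe\xfd\xfc"

# Same statement as in readinto: NumPy sees one string scalar, not four bytes.
try:
    b[:len(data)] = data
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)

# Viewing the bytes as a uint8 array first makes the copy element-wise.
b[:len(data)] = np.frombuffer(data, dtype=np.uint8)
print(b)
```

Catching ValueError as suggested would hide the failure, whereas converting data before the assignment (as in the last step) avoids it; which is preferable here is a judgment call for the maintainers.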

@Joe-Heffer-Shef
Author

Joe-Heffer-Shef commented Sep 15, 2022

For what possible values of b and data will b[:len(data)] = data (or parts of it) raise that exception?

If you're able to dig in with a debugger, it would be good to know what those values are.

I ran the script using the PyCharm debugger.

Here are the values of the variables when the exception occurs:

# type: numpy.ndarray
b = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# type: bytes
data = b'\x00\x00\x00\x00\x00\xf05\xbf\x00\x00\x00\x00.... (lots of binary data)

This is the traceback:

Traceback (most recent call last):
  File "C:/Users/my_username/my_project/scripts/blob-tdms/smart.py", line 45, in main
    for chunk in channel.data_chunks():
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms.py", line 586, in data_chunks
    for raw_data_chunk in self._read_channel_data_chunks():
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms.py", line 780, in _read_channel_data_chunks
    for chunk in self._reader.read_raw_data_for_channel(self.path):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\reader.py", line 218, in read_raw_data_for_channel
    for i, chunk in enumerate(
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 269, in read_raw_data_for_channel
    for chunk in self._read_channel_data_chunks(f, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 367, in _read_channel_data_chunks
    for chunk in reader.read_channel_data_chunks(file, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 64, in read_channel_data_chunks
    yield self._read_channel_data_chunk(file, data_objects, chunk_index, channel_path)
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 492, in _read_channel_data_chunk
    channel_data = RawChannelDataChunk.channel_data(obj.read_values(file, number_values, self.endianness))
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 557, in read_values
    return fromfile(file, dtype=dtype, count=number_values)
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 147, in fromfile
    bytes_read = file.readinto(buffer[offset:])
  File "C:\Anaconda\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data
ValueError: invalid literal for int() with base 10: b"\x00\x00\x00\x00\x00\xf05\xbf\x00\x00\x00\x00\x00\xa0=\xbf\x00\x00\x00\x00\x00P<\xbf\x00\x00\x00\x00\x00\xd0G\xbf\x00\x00\x00\x00\x00\xd0M\xbf\x00\x00\x00\x00\x00PL\xbf\x00\x00\x00\x00\x00\x98F\xbf\

This is the code in azure.py where the crash happens:

    def readinto(self, b):
        """Read up to len(b) bytes into b, and return the number of bytes read."""
        data = self.read(len(b))
        if not data:
            return 0
        b[:len(data)] = data
        return len(data)
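
For comparison, here is a sketch of a readinto that writes through a memoryview, so the copy goes through the buffer protocol rather than through the target object's own slice assignment. This is not the library's code: readinto_via_memoryview is a hypothetical helper, io.BytesIO stands in for the blob reader, and array.array stands in for a non-bytearray buffer using only the standard library.

```python
import array
import io


def readinto_via_memoryview(stream, b):
    """Hypothetical helper: read up to the byte size of b into b."""
    # Flat, writable byte view of the caller's buffer (must be contiguous).
    view = memoryview(b).cast("B")
    data = stream.read(len(view))
    if not data:
        return 0
    # Byte-for-byte copy into the underlying memory of b.
    view[:len(data)] = data
    return len(data)


# io.BytesIO stands in for the remote blob reader.
stream = io.BytesIO(bytes(range(8)))

# array.array stands in for a buffer type that is not a bytearray.
buffer = array.array("B", bytes(4))
count = readinto_via_memoryview(stream, buffer)
print(count, buffer.tolist())

# A plain bytearray works the same way and receives the next four bytes.
second = bytearray(4)
count_second = readinto_via_memoryview(stream, second)
print(count_second, list(second))
```

One difference from the original method: the sketch sizes the read by the number of bytes in the view, whereas len(b) counts elements, which only matches the byte count for single-byte element types.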

Please note that I've updated the package versions as follows (from the Conda environment.yaml file):

name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - ca-certificates=2022.9.14=h5b45459_0
  - certifi=2022.9.14=pyhd8ed1ab_0
  - nptdms=1.6.0=pyhd8ed1ab_0
  - smart-open=6.2.0=pyh1a96a4e_0
  - smart_open=6.2.0=pyha770c72_0
