
Speed and storage reproduce #29

Open
lixinghe1999 opened this issue Jun 15, 2023 · 2 comments
@lixinghe1999

Dear author and other people who are interested,

I have recently become interested in ML on compressed video. However, after trying some of the code in this repo, I have the following questions:

  1. Why are the motion vectors stored as int64 by default? In my experiments, they can easily take up more space than the original video.
  2. Even if I change the motion vector type to uint8, the result is still about 4× larger than the original video. I guess there is further compression behind H.264 (I am new to video codecs); can anyone confirm this?
  3. Reading the pre-saved .npy motion vectors seems to give only a very limited advantage over directly reading the RGB frames (7 s for motion vectors vs. 8 s for OpenCV RGB reading, for a 30-minute video). I understand that reading a .npy file is not the same as reading the bytes from the video in C++ (although np.load has a C++ backend), but since the decoder is not that heavy, I still feel that reading only the motion vectors gives a limited benefit.
  4. Besides, can I use mv-extractor to get the motion vectors directly?

I would appreciate any comments or help! Thanks in advance.
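For a rough sense of scale on question 1, here is a back-of-the-envelope sketch. The figures are assumptions, not measurements: 16×16 macroblocks for a 720p frame, and 10 numeric fields per motion vector (matching the shape of mv-extractor's `(N, 10)` output arrays):

```python
# Rough storage estimate for the motion vectors of one 720p frame.
# Assumptions: 16x16 macroblocks, 10 numeric fields per motion vector.
num_mvs = (1280 // 16) * (720 // 16)   # 3600 motion vectors per frame
fields = 10

for dtype, nbytes in [("int64", 8), ("int32", 4), ("uint8", 1)]:
    size_kib = num_mvs * fields * nbytes / 1024
    print(f"{dtype}: {size_kib:.2f} KiB per frame")
# int64: 281.25 KiB, int32: 140.62 KiB, uint8: 35.16 KiB
```

At int64, the per-frame motion-vector array is already an appreciable fraction of a compressed frame's size, which is consistent with the blow-up described above.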

@LukasBommes
Owner

LukasBommes commented Oct 27, 2024

Hey @lixinghe1999,
the purpose of this library is by no means to beat the codec at storing the video in the most space-saving way. For that, you are better off storing the video fully encoded. Instead, the purpose of this library is to make the motion vectors, which are an internal detail of the codec, accessible for research and for projects that require access to them.

To answer your questions:

  1. The motion vectors are stored as int32, as defined here. It's true that one could store them in a more space-saving way, as some fields require only uint8 or int16 (see the definition of AVMotionVector here). It's too long ago to say for sure why I cast everything to int32, but I assume it was needed because the entire array is passed to numpy by reference to form a numpy array, and standard numpy arrays can only contain homogeneous data types.
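To illustrate the trade-off: the field layout below is taken from FFmpeg's AVMotionVector struct (libavutil/motion_vector.h). Tightly packed, the mixed-type fields fit in 32 bytes, while promoting every field to a single homogeneous dtype costs considerably more:

```python
import struct

# Fields of FFmpeg's AVMotionVector:
# int32 source; uint8 w, h; int16 src_x, src_y, dst_x, dst_y;
# uint64 flags; int32 motion_x, motion_y; uint16 motion_scale
packed = struct.calcsize("<i2B4hQ2iH")  # little-endian, no padding
print(packed)  # 32 bytes per motion vector

# A homogeneous numpy array must promote all 11 fields to one dtype:
num_fields = 11
print(num_fields * 4)  # int32: 44 bytes per vector
print(num_fields * 8)  # int64: 88 bytes per vector
```

A numpy structured array (record dtype) could preserve the mixed field types, but that complicates downstream consumers compared to a plain homogeneous array.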

  2. Well, uint8 won't fit the data contained in the AVMotionVector, which has some fields that are int16 or int32. Regarding the second part of your question, I can't recall how this works and will have to look it up myself. You could take a look at this book; it explains such details very well.
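A quick sketch of why the uint8 cast is unsafe (the concrete values are hypothetical, but the int16/int32 fields routinely fall outside 0–255, so a narrowing cast silently corrupts them):

```python
# uint8 can only represent 0..255.
src_x = 300     # hypothetical pixel coordinate, needs int16
motion_x = -5   # motion components can be negative

print(src_x & 0xFF)     # 44  -> value wrapped, not 300
print(motion_x & 0xFF)  # 251 -> sign information lost
```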

  3. As mentioned above, the aim of extracting the motion vectors is not to be faster than the codec. The codec also makes use of hardware acceleration, which you won't get with numpy arrays.

  4. Sure, you can use the extract_mvs command. Please refer to the README on how to do this.

@LukasBommes
Owner

LukasBommes commented Oct 31, 2024

To follow up on your second question: Why would the motion vectors be larger than the original frame? Let's assume the original frame is 1280 px × 720 px (720p) and contains three color channels (RGB). If it's an 8-bit image, it takes up 1280 × 720 × 3 bytes = 2700 KiB. If that same frame were stored as a P-frame, the associated 3600 motion vectors (each 32 bytes) of the MPEG-4 encoded frame would take up 3600 × 32 bytes = 112.5 KiB. So, the motion vectors are 24 times more compact than the frame. (Obviously, this calculation does not account for the keyframe, which is needed as a reference for the motion vectors to make sense.)
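The arithmetic above can be checked directly:

```python
# Size comparison: raw 8-bit RGB 720p frame vs. its motion vectors.
frame_bytes = 1280 * 720 * 3   # raw frame
mv_bytes = 3600 * 32           # 3600 motion vectors, 32 bytes each

print(frame_bytes / 1024)      # 2700.0 KiB
print(mv_bytes / 1024)         # 112.5 KiB
print(frame_bytes / mv_bytes)  # 24.0x more compact
```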

Let me also cite a section from the book that I recommended:

4.3.1.3 Bitstream encoding

The video coding process produces a number of values that must be encoded to form the compressed bitstream. These values include:

  • Quantized transform coefficients
  • Information to enable the decoder to re-create the prediction
  • Information about the structure of the compressed data and the compression tools used during encoding
  • Information about the complete video sequence.

These values and parameters, syntax elements, are converted into binary codes using variable length coding and/or arithmetic coding. Each of these encoding methods produces an efficient, compact binary representation of the information. The encoded bitstream can then be stored and/or transmitted.

So, the motion vectors will be compressed even further by the codec. And I am quite sure the codec won't store all the zero-valued motion vectors, of which there can be many, depending on the scene composition.
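To make that concrete, here is a toy illustration. zlib stands in for the codec's actual entropy coder (H.264 uses CAVLC/CABAC, not zlib), and the 90% static-block ratio is an assumption, but the point carries over: mostly-zero motion vectors compress dramatically under any entropy-style coding.

```python
import random
import struct
import zlib

random.seed(0)

# 3600 motion vectors for one hypothetical frame; 90% of blocks static.
mvs = []
for _ in range(3600):
    if random.random() < 0.9:
        mvs.append((0, 0))
    else:
        mvs.append((random.randint(-32, 32), random.randint(-32, 32)))

# Pack each (motion_x, motion_y) pair as two int16 values.
raw = b"".join(struct.pack("<2h", x, y) for x, y in mvs)
compressed = zlib.compress(raw, 9)
print(len(raw), len(compressed))  # compressed is far smaller than raw
```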
