Add parameter to control the file size #115

Open · wants to merge 1 commit into base: master
Conversation

vlad-ignatov (Contributor)

This PR is not final! I hope to clarify the text after some discussion

Notes:

  1. The servers are expected to have their own default limit. They should validate this parameter and only apply it if it is within reasonable boundaries. For example:
    • a server might not be able to generate files bigger than 10G, and in that case
      `_pageSize=1T` should be considered invalid.
    • Setting the limit to a very low value might result in a huge file list being generated
      by the status endpoint, which is also not desirable.
  2. The server should decide whether to support `_pageSize=0` based on the amount of data that it has. If that is not supported, the server should reject such requests instead of silently ignoring the parameter.
  3. When using a file-size-based limit, clients should be aware that the result might be approximate. Because the servers will stream the data in chunks, they will not know that they have reached the size limit until they actually exceed it. That is why `_pageSize=100M` will probably produce a file with a size equal to 100M plus the size of some portion of the last resource.
  4. About the count-based limiting:
    1. It will obviously produce variable-size files, but with a consistent resource count. In some cases, the client might be part of a specific pipeline for which the resource count is more important than the file size.
    2. So far, there are two types of bulk-data server implementations:
      1. Most will probably generate files that clients will then download.
      2. Some will not create files but will generate and stream them on the fly from the download endpoint. Such servers compile a list of file links based on the count limit (either `_pageSize=number` or an internal limit) and then return that list from the status endpoint. These servers will not be able to generate the download links based on a file-size limit.
  5. The `_pageSize` parameter is optional. Servers that don't support it should silently ignore it. However, those that do must return an error if the passed `_pageSize` value is not acceptable.
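To make notes 1 and 5 concrete, here is a minimal sketch of how a server might parse and validate the proposed parameter. The unit suffixes, limits, and function names are illustrative assumptions, not part of this PR:

```python
import re

# Assumed size-suffix multipliers; the notes above use M, G, and T as examples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Hypothetical server boundaries for illustration (note 1: a server that cannot
# generate files bigger than 10G should treat _pageSize=1T as invalid, and a
# very low limit would produce a huge file list at the status endpoint).
MAX_FILE_BYTES = 10 * 1024**3
MIN_FILE_BYTES = 1 * 1024**2

def parse_page_size(value):
    """Parse a _pageSize value into (kind, amount).

    A bare number is treated as a resource count; a number with a unit
    suffix is a byte-size limit. Returns None for malformed input.
    """
    m = re.fullmatch(r"(\d+)([KMGT]?)", value.strip().upper())
    if not m:
        return None
    number, unit = int(m.group(1)), m.group(2)
    if unit == "":
        return ("count", number)
    return ("bytes", number * _UNITS[unit])

def validate_page_size(value):
    """Return the parsed limit, or raise ValueError (note 5: servers that
    support the parameter must reject unacceptable values rather than
    silently ignore them)."""
    parsed = parse_page_size(value)
    if parsed is None:
        raise ValueError("malformed _pageSize: %r" % value)
    kind, amount = parsed
    if kind == "bytes" and not (MIN_FILE_BYTES <= amount <= MAX_FILE_BYTES):
        raise ValueError("_pageSize %r outside supported range" % value)
    return parsed
```

With these assumed limits, `validate_page_size("100M")` succeeds while `validate_page_size("1T")` raises, matching the 10G example in note 1.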
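Note 3's "approximate" behavior can be sketched as a pager that only discovers the limit after crossing it. This is a simplified in-memory model, not an actual bulk-data server implementation:

```python
def write_pages(lines, limit_bytes):
    """Split serialized NDJSON lines into pages of roughly limit_bytes.

    Mirrors note 3: a resource is appended before the running size is
    checked, so each page may exceed limit_bytes by up to the size of its
    last resource.
    """
    pages, current, size = [], [], 0
    for line in lines:
        current.append(line)
        size += len(line)
        if size >= limit_bytes:  # the limit is only detected after exceeding it
            pages.append(current)
            current, size = [], 0
    if current:
        pages.append(current)
    return pages
```

For example, with 40-byte lines and a 100-byte limit, each full page holds three lines (120 bytes), overshooting by one partial resource's worth, exactly as note 3 predicts for `_pageSize=100M`.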
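Note 4.2's point about streaming servers can also be illustrated: with a count limit, the status-endpoint file list is computable up front from the total resource count alone, which a size-based limit cannot offer before the data is generated. The URL pattern and function name below are hypothetical:

```python
import math

def status_file_links(total_resources, page_count, base_url):
    """Sketch of note 4.2: a server that streams files on the fly can
    precompute its status-endpoint link list from a count limit, because
    the number of files is simply ceil(total / count)."""
    n_files = math.ceil(total_resources / page_count)
    return ["%s/file/%d" % (base_url, i) for i in range(1, n_files + 1)]
```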

Questions

  1. We need a name generic enough to fit both the count-based and the size-based limit. I came up with `_pageSize`, but I am not sure if that is the best one.
  2. Do you think the `_pageSize=number` syntax is confusing (looking like a size in bytes)?
