Add parameter to control the file size #115

Open · wants to merge 1 commit into base: master
Conversation

vlad-ignatov (Contributor)

This PR is not final! I hope to clarify the text after some discussion

Notes:

  1. The servers are expected to have their own default limit. They should validate this parameter and only apply it if it is within reasonable boundaries. For example:
    • a server might not be able to generate files bigger than 10G, and in that case
      `_pageSize=1T` should be considered invalid.
    • Setting the limit to a very low value might result in a huge file list being generated
      by the status endpoint, which is also not desirable.
  2. The server should decide whether to support `_pageSize=0` based on the amount of data that it has. If that is not supported, the server should reject such requests instead of silently ignoring the parameter.
  3. When using a file-size-based limit, clients should be aware that the result might be approximate. Because the servers will stream the data in chunks, they will not know that they have reached the size limit until they actually exceed it. That is why `_pageSize=100M` will probably produce a file with a size equal to 100M plus the size of some portion of the last resource.
  4. About the count-based limiting:
    1. It will obviously produce variable-size files, but with a consistent resource count. In some cases, the client might be part of a specific pipeline for which the resource count is more important than the file size.
    2. So far, there are two types of bulk-data server implementations:
      1. Most will probably generate files that clients will then download.
      2. Some will not create files but will generate and stream them on the fly from the download endpoint. Such servers compile a list of file links based on the count limit (either `_pageSize=number` or an internal limit) and then return that list from the status endpoint. These servers will not be able to generate the download links based on a file-size limit.
  5. The `_pageSize` parameter is optional. Servers that don't support it should silently ignore it. However, those that do must return an error if the passed `_pageSize` value is not acceptable.
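To make notes 1 and 5 concrete, here is a minimal sketch of how a server might parse and validate the proposed parameter. The unit suffixes, limits, and function names are illustrative assumptions, not part of this PR:

```python
import re

# Assumed size-suffix multipliers; the notes above use M, G, and T as examples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Hypothetical server boundaries for illustration (note 1: a server that cannot
# generate files bigger than 10G should treat _pageSize=1T as invalid, and a
# very low limit would produce a huge file list at the status endpoint).
MAX_FILE_BYTES = 10 * 1024**3
MIN_FILE_BYTES = 1 * 1024**2

def parse_page_size(value):
    """Parse a _pageSize value into (kind, amount).

    A bare number is treated as a resource count; a number with a unit
    suffix is a byte-size limit. Returns None for malformed input.
    """
    m = re.fullmatch(r"(\d+)([KMGT]?)", value.strip().upper())
    if not m:
        return None
    number, unit = int(m.group(1)), m.group(2)
    if unit == "":
        return ("count", number)
    return ("bytes", number * _UNITS[unit])

def validate_page_size(value):
    """Return the parsed limit, or raise ValueError (note 5: servers that
    support the parameter must reject unacceptable values rather than
    silently ignore them)."""
    parsed = parse_page_size(value)
    if parsed is None:
        raise ValueError("malformed _pageSize: %r" % value)
    kind, amount = parsed
    if kind == "bytes" and not (MIN_FILE_BYTES <= amount <= MAX_FILE_BYTES):
        raise ValueError("_pageSize %r outside supported range" % value)
    return parsed
```

With these assumed limits, `validate_page_size("100M")` succeeds while `validate_page_size("1T")` raises, matching the 10G example in note 1.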
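Note 3's "approximate" behavior can be sketched as a pager that only discovers the limit after crossing it. This is a simplified in-memory model, not an actual bulk-data server implementation:

```python
def write_pages(lines, limit_bytes):
    """Split serialized NDJSON lines into pages of roughly limit_bytes.

    Mirrors note 3: a resource is appended before the running size is
    checked, so each page may exceed limit_bytes by up to the size of its
    last resource.
    """
    pages, current, size = [], [], 0
    for line in lines:
        current.append(line)
        size += len(line)
        if size >= limit_bytes:  # the limit is only detected after exceeding it
            pages.append(current)
            current, size = [], 0
    if current:
        pages.append(current)
    return pages
```

For example, with 40-byte lines and a 100-byte limit, each full page holds three lines (120 bytes), overshooting by one partial resource's worth, exactly as note 3 predicts for `_pageSize=100M`.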
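Note 4.2's point about streaming servers can also be illustrated: with a count limit, the status-endpoint file list is computable up front from the total resource count alone, which a size-based limit cannot offer before the data is generated. The URL pattern and function name below are hypothetical:

```python
import math

def status_file_links(total_resources, page_count, base_url):
    """Sketch of note 4.2: a server that streams files on the fly can
    precompute its status-endpoint link list from a count limit, because
    the number of files is simply ceil(total / count)."""
    n_files = math.ceil(total_resources / page_count)
    return ["%s/file/%d" % (base_url, i) for i in range(1, n_files + 1)]
```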

Questions

  1. We need a name generic enough to fit both the count-based and the size-based limit. I came up with `_pageSize`, but I am not sure if that is the best one.
  2. Do you think the `_pageSize=number` syntax is confusing (looking like a size in bytes)?
