avoid caching on metalnx server? #364

cmeesters · 2024-09-29T10:23:16Z

Hi,

On our cluster, we use iRODS to store big data (obviously). With the introduction of Metalnx there is a minor issue: all data get cached on the VM running Metalnx, so any curl or wget request for more than a few gigabytes accumulates too much data on that machine.

Now, I am not the admin of said VM, but can this be configured away, such that the VM only mediates the request to the iRODS server and does not cache the data in between?

Cheers
Christian

korydraughn · 2024-09-30T17:02:16Z

... all data get cached on the VM running Metalnx ...

It's not clear what you mean?
Are files accumulating somewhere?
Can you provide a screenshot showing what you're referring to?

cmeesters · 2024-10-01T16:26:00Z

Meanwhile, I got this to serve as an example:

> ls -ltr tmp-ticket-files/
total 5089404
-rw-r----- 1 root root 5211545600 Oct  1 04:51 backup-20160419.tar

The base path is /usr/local/tomcat/tmp-ticket-files/ within a docker container. Apparently, every file requested via a URL containing the ticket string is accumulated there.

I hope this helps to clarify?
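To put a number on how much data accumulates there, a small script along these lines could be run inside the container. This is only a sketch: the function name `directory_usage` is made up for illustration, and the path is the one from the listing above and may differ in other deployments.

```python
import os


def directory_usage(path):
    """Return (file_count, total_bytes) for all regular files below path."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.path.getsize(full)
                count += 1
            except OSError:
                # The file vanished between listing and stat (e.g. it was cleaned up).
                continue
    return count, total


if __name__ == "__main__":
    # Path taken from the listing above; adjust it for your container.
    files, size = directory_usage("/usr/local/tomcat/tmp-ticket-files")
    print(f"{files} files, {size / 1e9:.2f} GB")
```

Running it periodically while downloads are in progress would show whether files pile up or are removed again once a request finishes.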

korydraughn · 2024-10-01T18:06:41Z

Are you downloading multiple data objects at the same time?

cmeesters · 2024-10-01T18:57:11Z

This is a two-fold yes:

I could do that, as any user can, with a single wget call.
Moreover, this is a system where users can publish URLs. Whenever a new scientific paper is released, people may feel tempted to download data. Over time, the "risk" of having to serve concurrent download requests increases.

As data objects are potentially huge, a potential solution might be not to cache any data and to connect the request to the iRODS server directly. As for the second item, I do not feel able to assess the feasibility.

edit: PS: Thank you for looking into this!
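For illustration, the "no cache" idea suggested above amounts to copying from the data source to the client in fixed-size chunks, so that neither memory nor disk usage grows with the size of the data object. The following is a minimal sketch in Python, not Metalnx code (Metalnx itself is not written in Python); `stream_copy` and `CHUNK_SIZE` are hypothetical names, and in-memory buffers stand in for the iRODS read stream and the HTTP response.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary choice for this sketch


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink chunk by chunk and return the number of bytes copied.

    At most one chunk is held in memory at a time and nothing is written
    to an intermediate file.
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        sink.write(chunk)
        copied += len(chunk)
    return copied


# In-memory stand-ins for the data object and the client connection.
source = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
sink = io.BytesIO()
copied = stream_copy(source, sink)
```

Whether Metalnx can hand the iRODS stream to the HTTP response in this way is exactly the feasibility question raised above; the sketch only shows the general pattern.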

korydraughn · 2024-10-01T20:45:21Z

I'm surprised you mentioned wget. Are you actually using wget against Metalnx?
Are you using tickets to access the data?

Please provide a set of steps that will allow us to reproduce what you're seeing locally.

joergsteinkamp · 2024-10-02T07:56:06Z

Hi Kory,

Yes, we use tickets in Metalnx to grant public access to the data. Previously we used irods-rest, which is no longer being developed. irods-rest did not implement disk caching; it used streams. We now experience trouble when large files or folders are downloaded from outside our institution with wget or curl: only part of the file is downloaded, and then the transfer stops with Error 500 (Internal Server Error). This looks like a timeout problem in Metalnx, where the file gets deleted before it is fully downloaded (see attached docker log). If that is the case, increasing the internal timeout could help, but I could not find such a setting.
Nevertheless, if several people download different files in parallel, the disk will fill up with cached data. And is this method thread-safe if two people start downloading the same file? In my opinion, streaming, as it was implemented in irods-rest, would be a better solution.

irods-metalnx-docker.log

Kind regards,
Jörg

The file from the above log is ~6 GB in size:

> curl -v -D climatologies.txt --max-time 600 https://irods-web.zdv.uni-mainz.de/metalnx/ticketclient/ELrGhkqUO7IoHqB?path=/zdv/project/m2_jgu-w2w/w2w-c4/amayer02/DATA/mayer_and_wirth_lagrangian_characterization_of_heat_waves/2010-2022_tracerdiagnostics/lambda7/postprocess/dailymeans/temperature_diagnostics_in_terms_of_theta_with_indirect_diabatic_source_and_seasonality/climatologies.tar --output climatologies.tar
[...]
> sha256sum climatologies.tar
cc2161a26e512da28da3156747308c3715ee4c319b8dc4932f36e05f159e46d8  climatologies.tar
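A truncated download can be detected by comparing checksums, provided a reference hash of the complete file is available (for example one computed on the iRODS side). The thread does not say whether the hash quoted above belongs to the complete or to the partial file, so it is not used here. A minimal sketch, with `sha256_of_file` and `download_is_complete` as hypothetical helper names:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_complete(path, expected_hex):
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `download_is_complete("climatologies.tar", reference_hash)` after each attempt would make it easy to tell a partial transfer from a complete one without comparing file sizes by hand.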

korydraughn added the question label Sep 30, 2024

korydraughn added this to the 3.0.0 milestone Sep 30, 2024

joergsteinkamp mentioned this issue Oct 2, 2024

Ticket downloads not thread-safe #365

Open
