avoid caching on metalnx server? #364

Open
cmeesters opened this issue Sep 29, 2024 · 6 comments

@cmeesters

Hi,

On our cluster, we use iRODS to store big data (obviously). Since the introduction of Metalnx there is a minor issue: all downloaded data gets cached on the VM running Metalnx, so any curl or wget request exceeding a few gigabytes leaves too much data on that machine.

Now, I am not the admin of said VM, but can this be configured away, such that the VM only mediates the request to the iRODS server and does not cache the data in between?

Cheers
Christian

@korydraughn
Contributor

... all data get cached on the VM running Metalnx ...

It's not clear what you mean.
Are files accumulating somewhere?
Can you provide a screenshot showing what you're referring to?

@korydraughn korydraughn added this to the 3.0.0 milestone Sep 30, 2024
@cmeesters
Author

Meanwhile, I got this to serve as an example:

> ls -ltr tmp-ticket-files/
total 5089404
-rw-r----- 1 root root 5211545600 Oct  1 04:51 backup-20160419.tar

The base path is /usr/local/tomcat/tmp-ticket-files/ within a Docker container. Apparently, every file requested via a URL containing the ticket string accumulates there.
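To watch how quickly that directory grows, a small script along the following lines could be run inside the container. This is only a sketch: the path is the one reported above, the function name `directory_usage` is made up for illustration, and it assumes a Python interpreter is available in the container.

```python
import os


def directory_usage(path):
    """Return (file_count, total_bytes) for all regular files below path."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.path.getsize(full)
                count += 1
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
    return count, total


if __name__ == "__main__":
    # Path as reported in this comment; adjust for other deployments.
    count, total = directory_usage("/usr/local/tomcat/tmp-ticket-files")
    print(f"{count} files, {total / 1024**3:.2f} GiB")
```

Running it periodically while downloads are in progress would show whether files are removed again after a download finishes or keep piling up.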

I hope this helps to clarify?

@korydraughn
Contributor

Are you downloading multiple data objects at the same time?

@cmeesters
Author

cmeesters commented Oct 1, 2024

This is a two-fold yes:

  • I could do that, as any user can, with a single wget call.
  • Moreover, this is a system where users can publish URLs. Whenever a new scientific paper is released, people may feel tempted to download data. Over time, the "risk" of having to serve concurrent download requests increases.

As data objects are potentially huge, a possible solution might be not to cache any data and to connect the request to the iRODS server directly. As for the second item, I do not feel able to assess the feasibility.

edit: PS: Thank you for looking into this!

@korydraughn
Contributor

I'm surprised you mentioned wget. Are you actually using wget against Metalnx?
Are you using tickets to access the data?

Please provide a set of steps that will allow us to reproduce what you're seeing locally.

@joergsteinkamp
Contributor

Hi Kory,

Yes, we use tickets in Metalnx to grant public access to the data. Before that we used irods-rest, which is no longer being developed. irods-rest did no disk caching; it used streams. We now run into trouble when large files or folders are downloaded from outside our institution with wget or curl: only part of the file is downloaded, and the transfer then stops with Error 500 (Internal Server Error). This seems to be a timeout problem in Metalnx, where the file gets deleted before it has been fully downloaded (see attached Docker log). If that's the case, increasing the internal timeout could help, but I didn't find such a setting.
Nevertheless, if several people download different files in parallel, the disk will fill up with cached data. And is this method thread-safe if two people start downloading the same file? In my opinion, streaming, as it was implemented in irods-rest, would be a better solution.
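To illustrate what is meant by streaming here, the following is a minimal sketch in Python, not code from Metalnx or irods-rest. `stream_copy` and `CHUNK_SIZE` are hypothetical names; `source` stands in for a read handle on the iRODS data object and `sink` for the HTTP response body. The point is that only one chunk is held in memory at a time and nothing is written to a temporary file on the web server.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary choice for this sketch


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink in fixed-size chunks and return the bytes copied.

    Both arguments are binary file-like objects. At most one chunk is held
    in memory at any time, and no intermediate file is created.
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read signals end of stream.
            break
        sink.write(chunk)
        copied += len(chunk)
    return copied


if __name__ == "__main__":
    # In-memory stand-ins for the iRODS read handle and the HTTP response.
    src = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 123))
    dst = io.BytesIO()
    print(stream_copy(src, dst), "bytes copied")
```

With this pattern each request only needs its own buffer, so parallel downloads of the same data object do not share any on-disk state.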

irods-metalnx-docker.log

Kind regards,
Jörg

The file from the log above is about 6 GB in size:

> curl -v -D climatologies.txt --max-time 600 https://irods-web.zdv.uni-mainz.de/metalnx/ticketclient/ELrGhkqUO7IoHqB?path=/zdv/project/m2_jgu-w2w/w2w-c4/amayer02/DATA/mayer_and_wirth_lagrangian_characterization_of_heat_waves/2010-2022_tracerdiagnostics/lambda7/postprocess/dailymeans/temperature_diagnostics_in_terms_of_theta_with_indirect_diabatic_source_and_seasonality/climatologies.tar --output climatologies.tar
[...]
> sha256sum climatologies.tar
cc2161a26e512da28da3156747308c3715ee4c319b8dc4932f36e05f159e46d8  climatologies.tar
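Since the curl call above writes the response headers to a file via -D, a truncated download can be detected by comparing the Content-Length header with the size of the file on disk. The following Python sketch does that. It assumes the server actually sends a Content-Length header, which may not be the case for this endpoint; the function names are made up for illustration.

```python
import os


def expected_length(header_path):
    """Return the last Content-Length value in a `curl -D` dump, or None."""
    length = None
    with open(header_path, "r", encoding="iso-8859-1") as fh:
        for line in fh:
            name, sep, value = line.partition(":")
            # Header names are case-insensitive (HTTP/2 sends them lowercase).
            if sep and name.strip().lower() == "content-length":
                try:
                    # Keep the last one seen, e.g. after redirects.
                    length = int(value.strip())
                except ValueError:
                    pass
    return length


def download_is_complete(header_path, file_path):
    """Return True/False, or None if no Content-Length was sent."""
    expected = expected_length(header_path)
    if expected is None:
        return None
    return os.path.getsize(file_path) == expected


if __name__ == "__main__":
    # File names taken from the curl command above.
    if os.path.exists("climatologies.txt") and os.path.exists("climatologies.tar"):
        print(download_is_complete("climatologies.txt", "climatologies.tar"))
```

A result of False would indicate that the transfer was cut off before the announced size was reached; None means the check cannot be made from the headers alone.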
