
Storage Docs / Notifications #116

Open: wants to merge 10 commits into main


@cs1jmc (Contributor) commented Feb 15, 2024

Adds docs on Ronin notifications that users might receive, plus some basic guidance on understanding drive storage.

@cs1jmc cs1jmc self-assigned this Feb 15, 2024
@cs1jmc cs1jmc added the documentation Improvements or additions to documentation label Feb 15, 2024
@willfurnass (Collaborator) left a comment


Suggested some minor tweaks/clarifications but mostly LGTM!

ronin/drive-storage.rst (outdated, resolved)
ronin/drive-storage.rst (outdated, resolved)
pilot/notifications.rst (outdated, resolved)
ronin/drive-storage.rst (outdated, resolved)
@willfurnass (Collaborator)

@Joe-Heffer-Shef Anything else you'd suggest including?

This is designed to ensure you haven't accidentally left a machine running idle. Unlike on-premises VMs that run 24/7, we recommend shutting down your instances when they're not in use, much as you would your own PC.

We do understand that this alert could be a false positive if your workload is not CPU-intensive but still requires the machine to be on for extended periods.
If this is the case, please get in touch and we can exempt the instance from these alerts.
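Before asking for an exemption, it can help to check whether the instance really is busy. The sketch below assumes a Linux instance and normalises the 15-minute load average by the number of logical CPUs. The `IDLE_LOAD_PER_CPU` threshold is a hypothetical value chosen for illustration; it is not Ronin's actual alert criterion, which isn't documented here. Note that on Linux the load average also counts processes waiting on disk I/O, so it is only a rough indicator.

```python
import os

# Hypothetical threshold for illustration only; NOT Ronin's real alert criterion.
IDLE_LOAD_PER_CPU = 0.05


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalise a load average by the number of logical CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def looks_idle(load_avg: float, cpu_count: int,
               threshold: float = IDLE_LOAD_PER_CPU) -> bool:
    """Return True if the per-CPU load is below the (hypothetical) threshold."""
    return load_per_cpu(load_avg, cpu_count) < threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems (e.g. Linux instances).
    _, _, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"15-minute load per CPU: {load_per_cpu(fifteen_min, cpus):.2f}")
    if looks_idle(fifteen_min, cpus):
        print("This instance looks idle; consider shutting it down.")
    else:
        print("This instance appears to be doing CPU work.")
```

A consistently low per-CPU figure on a large instance may also suggest the machine is bigger than the workload needs.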


It's more likely that the user has provisioned a far-too-big machine or doesn't understand how to use parallel processing, so it might be worth suggesting that they spin up a smaller machine if appropriate for that workload.

Contributor Author


I'm trying to avoid discussing 'right-sizing' in this specific doc. I think it needs its own page or paragraph that we can reference.

That said, I'm not sure how to approach the topic, as sizing is going to be very specific to each user's needs and AWS's instance types change semi-regularly. It's hard not to just point to the AWS docs, which isn't helpful for the majority of people.

I fear the doc we make will end up being too simple. This is a topic we're eager to push over to Ronin to see if they can make the UX explain this better so that people are less likely to pick a silly instance for their workload.

@cs1jmc (Contributor Author) commented Mar 6, 2024

https://blog.ronin.cloud/selecting-machine/

Turns out they do have a doc for this. It's probably a good place to signpost people to.
Q: What's the likelihood a user will actually read this?

When creating a new non-root drive, Ronin gives you multiple options for drive types.
As a general rule of thumb, we recommend selecting **SSD** because of how AWS provisions drive speed.

SSD drives are allocated 125 MiB/s of throughput and 3,000 IOPS, as per the `gp3 defaults <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/general-purpose.html#gp3-performance>`_.
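To see what those two limits mean in practice, here is a worked sketch using a simplified model: throughput is roughly IOPS multiplied by I/O size, capped at the volume's throughput limit. It uses only the gp3 baseline figures quoted above and ignores EBS details such as how larger I/Os are counted, so treat the numbers as rough estimates rather than guarantees.

```python
# gp3 baseline figures quoted above (AWS defaults at the time of writing).
GP3_BASELINE_IOPS = 3000
GP3_BASELINE_THROUGHPUT_MIB_S = 125.0


def effective_throughput_mib_s(io_size_kib: float,
                               iops: int = GP3_BASELINE_IOPS,
                               cap_mib_s: float = GP3_BASELINE_THROUGHPUT_MIB_S) -> float:
    """Simplified model: IOPS x I/O size, capped at the volume's throughput limit."""
    uncapped = iops * io_size_kib / 1024  # KiB/s -> MiB/s
    return min(uncapped, cap_mib_s)


if __name__ == "__main__":
    for size in (4, 16, 64, 256):
        print(f"{size:>3} KiB I/O -> ~{effective_throughput_mib_s(size):.1f} MiB/s")
```

Under this model, small 4 KiB random I/O is limited by the IOPS figure to roughly 11.7 MiB/s, while I/O of 64 KiB or larger reaches the 125 MiB/s throughput cap.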


This is fine but a bit cryptic, agree with Will's proposal


If you have received an email titled "Unused Drive Storage Detected", it means Ronin has noticed that detached drives have been sitting in your project for an extended period.

This could be from a terminated instance that had the "Keep On Termination" flag set:


Describe what happens when this option is set. People won't necessarily understand what "keep on termination" means.

Contributor Author


I try to avoid adding bits like that, as I feel it quickly risks teaching people to suck eggs. This is where I'm disappointed that Ronin don't have their own end-user documentation we can reference...

Do we really have to write our own docs on how another company's product works?
@willfurnass Maybe we add that to my list of questions for Ronin...


People won't understand these concepts already so it's better for it to be explained simply and clearly. (It will feel too simple for us but it will be useful for them to have it spelled out.)

But I agree, the RONIN docs need to be better, it's not our job to explain their product.


Drive storage, often referred to as block storage, is the storage attached directly to your :term:`instances<instance>` within Ronin.
The most common example is the "Root Drive", but Ronin also lets you create additional drives that you can attach to, and move between, instances.
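A quick way to see how much space is left on each drive from inside an instance is Python's standard `shutil.disk_usage`. This is a minimal sketch: `/` is the root drive on a Linux instance, while `/mnt/data` is a hypothetical mount point for an additional drive, so substitute wherever you actually mounted yours. If that path exists only as an ordinary folder on the root drive, the figures reported will be the root drive's.

```python
import shutil


def usage_report(path: str) -> str:
    """Summarise total/used/free space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    pct_used = 100 * usage.used / usage.total if usage.total else 0.0
    return (f"{path}: {usage.total / gib:.1f} GiB total, "
            f"{usage.used / gib:.1f} GiB used ({pct_used:.0f}%), "
            f"{usage.free / gib:.1f} GiB free")


if __name__ == "__main__":
    # "/mnt/data" is a hypothetical mount point for an additional drive.
    for mount in ("/", "/mnt/data"):
        try:
            print(usage_report(mount))
        except FileNotFoundError:
            print(f"{mount}: not found (drive not attached or not mounted here)")
```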


Link to the relevant RONIN documentation so people can learn how to do this

@cs1jmc (Contributor Author) commented Mar 6, 2024

https://blog.ronin.cloud/storage-help/
This?

We already link to a few docs from https://docs.rcc.shef.ac.uk/ronin/index.html, but it looks like it's worth adding.

bcab9e1 Adds the link in.


As a general rule of thumb, we recommend selecting **SSD** because of how AWS provisions drive speed.

SSD drives are allocated 125 MiB/s of throughput and 3,000 IOPS, as per the `gp3 defaults <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/general-purpose.html#gp3-performance>`_.
Should your workload require more performance please do get in touch.


I'd like a guide on using the storage-optimised machines properly please 😸
Most of the data processing tasks for CURED are heavily bottlenecked by disk I/O, so making it easy for users to optimise this would save a lot of time.

Collaborator


We could signpost tooling that helps people quantify I/O performance on both Linux and Windows instances (e.g. iostat and iotop on Linux).

A flow chart (e.g. built with mermaid.js) for diagnosing I/O performance issues could be useful.
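As a very rough first check before reaching for iostat or iotop, a user could time a sequential write to the drive in question and compare the result with the 125 MiB/s gp3 baseline. This sketch is a hypothetical helper, not a rigorous benchmark: it only measures large sequential writes (forced to disk with `fsync`) into the chosen directory, and the 64 MiB default size is an arbitrary assumption.

```python
import os
import tempfile
import time


def sequential_write_mib_s(directory: str, size_mib: int = 64, block_kib: int = 1024) -> float:
    """Write `size_mib` MiB in `block_kib` KiB blocks to `directory`, fsync, return MiB/s."""
    block = os.urandom(block_kib * 1024)
    n_blocks = max(1, (size_mib * 1024) // block_kib)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push data to the drive rather than leaving it in the page cache
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # always clean up the temporary test file
    return (n_blocks * block_kib / 1024) / elapsed


if __name__ == "__main__":
    # Run this from a directory on the drive you want to test.
    print(f"Sequential write: ~{sequential_write_mib_s('.'):.0f} MiB/s")
```

A result far below the baseline on a large sequential write would be a hint to investigate further with the dedicated tools.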


Yes please!
I've collected a few tips, but I realise it's probably a tricky topic for newcomers to grasp. Users' experience varies very widely, so some extremely basic concepts will need explaining, or at least signposting to relevant training.

@Joe-Heffer-Shef

My overall feedback is: this is good, but in general there needs to be some basic computing training (not necessarily here in this documentation, but somewhere) so that researchers can gain the skills they need to use RONIN properly. Maybe some tutorials/walk-throughs that show in detail the steps to achieve common tasks?

This is my RONIN Ubuntu cheat sheet if that helps at all.

@cs1jmc (Contributor Author) commented Mar 6, 2024

I admit I didn't see this until I'd already replied to a bunch of the other suggestions.
We've vaguely discussed in the past the potential for a research cloud compute driver's licence, a bit like the HPC one. Do you think that could help with the basic compute training? It might be difficult to make sure it doesn't become a barrier to entry...
Minds better than mine will know how best to form something like that I imagine.

@Joe-Heffer-Shef

Yes I think an introductory RCC course is a very good idea. Also, Norbi is starting to talk about a new "research computing 101" talk to cover fundamentals that should be relevant to all different areas.

Labels
documentation Improvements or additions to documentation
3 participants