Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Review: Cleanup service for old version records. #335

Conversation

mfendeksilverstripe
Copy link
Contributor

@mfendeksilverstripe mfendeksilverstripe commented Jun 29, 2021

Feature Review: Cleanup service for old version records.

This PR is to showcase a bespoke feature which might be useful in a core module. The goal is the answer the following questions:

  • is this feature useful in other projects?
  • should this be a standalone module or a configurable add-on?
  • where does the Fluent specific code live?

This is not a code review. The code is meant to provide some context and documentation. If we find this feature useful, we can move to standard PR flow.

Summary

This is enterprise level solution for purging old version records and ChangeSetItem records. Version records are accumulated over time and larger projects may find too difficult and unnecessary to keep them around. This feature allows you to delete old version records in a safe and incremental way.

  • register base classes that need version cleanup
  • specify how many draft versions you want to keep (published versions are never deleted)
  • configure the version record lifetime to avoid deleting fresh versions
  • tweak the deletion increment so the deletion doesn't impair your application

Once this is configured to suit your project, you can just sit back and relax as the old version get deleted in the background.

You can also hook up the deletion events to your monitoring charts like this:

Screen Shot 2021-06-30 at 11 41 14 AM

Dependencies

  • Fluent support is included and the related code is well isolated, we could add some extension points and create a Fluent addon
  • Fluent version indexes need to be present

@mfendeksilverstripe mfendeksilverstripe force-pushed the feature-review/versions-cleanup branch from 2aae5fc to 67de53c Compare June 29, 2021 02:43
@brynwhyman
Copy link

Awesome! Without having context of the code, my immediate feeling is this would be a great addition as a separate module. I think we should continue to focus on keeping core module code as the foundation that provide the functionality that meets the common use case very well - while being as straight forward as possible to maintain.

While I'm sure there are many many projects out there that would benefit from this, it seems like the most relevant use case will be enterprise-sized sites with a significant amount of versioned pages (especially when you bring Fluent into the mix) - where performance is already something that needs a lot of attention.

In a somewhat similar vein, I've recently been pointing people to this module: https://github.com/dnadesign/userforms-bulk-delete. Sure, it's a feature that the majority of sites with userforms would find valuable, but separating it into a separate module keeps the code opt-in and doesn't require the userforms module to continue to grow with every new feature.

I'm guessing putting this into a separate module but come with concerns about discovering it - but we've recently been talking about some more dedicated developer docs to helping to improve performance of large sites - something which this could be referenced in too.

@brettt89
Copy link

As a support engineer (and someone who spends a lot of time working with performance). Versioning is a massive pain for us due to lack of cleanup ability (e.g. purging / removing old data). So having a method (or built-in garbage collection) or alternative cold-storage for old data (e.g. File storage) or anything to get this data out of the database would help improve performance and support greatly.

This results in Database often becoming overpopulated and slow as records start getting into the millions.
E.g. Imagine a site using Page Versions over 10+ years.

@andrewandante
Copy link
Contributor

Anything that makes a DB snapshot smaller is grand in my book! 😄

But more seriously, this will affect performance in a really great way, I'd been hoping to do something similar but never found the time. IMO I'd have it on by default with fairly conservative restraints (keep 10 versions or something) just because it's so easy for those tables to explode out, and once they are there it's much harder to deal with them. Is ChangeSet a possible add to the list, and ChangeSetItem - or do they get incredibly complicated?

Excellent work!

@brettt89
Copy link

Anything that makes a DB snapshot smaller is grand in my book! 😄

But more seriously, this will affect performance in a really great way, I'd been hoping to do something similar but never found the time. IMO I'd have it on by default with fairly conservative restraints (keep 10 versions or something) just because it's so easy for those tables to explode out, and once they are there it's much harder to deal with them. Is ChangeSet a possible add to the list, and ChangeSetItem - or do they get incredibly complicated?

Excellent work!

I would also love to have this as the default (limited version records - E.g. 10 versions) for all Versioned tables by default (maybe with ability to customize per-dataobject).

But also worth considering other tables which naturally get quite large over time (e.g. FormSubmissions / Files / etc). Perhaps there may be some cross-over in the approach being taken that could be used outside of Versioning as well.

@mfendeksilverstripe
Copy link
Contributor Author

@andrewandante I've added ChangeSet cleanup task as well since we already have it ;-)

@brynwhyman What's the process of setting up a new module? Should I just create one in the Terraformers space? I'll use the module skeleton as a template for the structure.

@brynwhyman
Copy link

What's the process of setting up a new module? Should I just create one in the Terraformers space? I'll use the module skeleton as a template for the structure.

You're free to create the module under any GitHub organisation. At Silverstripe, there's the option to bring it over to the open-source Silverstripe GH org at a later stage so to include it in the paid TravisCI plan for CI builds, but that's entirely optional :)

Although, if others see value in this being included in the versioned module that could still be an option... @maxime-rainville @emteknetnz @dhensby @chillu what do you think?

@emteknetnz
Copy link
Member

emteknetnz commented Jun 29, 2021

I'd prefer it as a separate module as it's not a common use case. I think deleting old Versioned records would do more harm than good for the majority of sites.

People are more likely to pause think and think about consequences if they need to composer require <module> rather than just change some yml config. Organizations may not like it if their audit history is purged because a dev just changed some config on a whim one day

Making this an official feature also means CMS designers may need to start thinking about scenarios where Versioned records don't exist when there's a reasonable expectation that they do

Even though it wouldn't be an "official" module, probably still worth a mention of the new module on docs.silverstripe.org to help more people discover it if they're struggling with too many records

@brettt89
Copy link

I'd prefer it as a separate module as it's not a common use case. I think deleting old Versioned records would do more harm than good for the majority of sites.

People are more likely to pause think and think about consequences if they need to composer require <module> rather than just change some yml config. Organizations may not like it if their audit history is purged because a dev just changed some config on a whim one day

Making this an official feature also means CMS designers may need to start thinking about scenarios where Versioned records don't exist when there's a reasonable expectation that they do

Even though it wouldn't be an "official" module, probably still worth a mention of the new module on docs.silverstripe.org to help more people discover it if they're struggling with too many records

As an engineer, I would argue that if they want to maintain an accurate historic record, they need to use backups instead. Versioning appears to work on the basis that nothing is every deleted from the website which is what causes most the pain (E.g. Assets are never actually deleted, DB records are never actually removed).

(E.g. Storing historic page content from 5+ years ago is not something I would expect a standard website to store. This is what backups are for).

People are more likely to pause think and think about consequences if they need to composer require rather than just change some yml config.

If developers are not thinking of consequences from code changes that is a bigger problem. There are many 'breakable' changes that can be done by developers without composer that could also end up purging data. This is why Backups are important. Deploying changes without taking backups is not a risk that should be managed by CMS - In my opinion.

E.g.
Versioned = Short term roll-back capability
Backups = Long term roll-back capability

NOTE: This is all just my opinion as a Support Engineer.

@mateusz
Copy link

mateusz commented Jun 29, 2021

Chiming in with operational folks, this would be awesome to have in core and enabled by default, but this smells like an impractical MAJOR change to me :)

Second best is having it in core, but disabled. Ops would then be able to point to a self-service solution: "ah yes, just run this task, and BTW ask your devs to enable it permanently." A single config file flip is easier than telling devs to pull in additional module.

Additional visibility would be good too: surface the total count somewhere to increase awareness among CMS editors. Maybe even plaster it with "Warning: large amount of version history detected. Your CMS performance may suffer" at a certain threshold.

DELETION_LIFETIME sounds good to me - data retention is more often specified as time periods, rather than record quantities.

@emteknetnz
Copy link
Member

emteknetnz commented Jun 29, 2021

As an engineer, I would argue that if they want to maintain an accurate historic record, they need to use backups instead.

I'm certainly not arguing against the value of backups.

I'm more concerned that this feature is only relevant to the top % of sites, and we'd more likely to unnecessarily remove data from sites for no real upside. Being able to rollback to a page that was last updated 3 years ago seems nice to have, I wouldn't want to remove this ability from users unless there was a compelling reason to - which in the case of very large sites there is, but for small / regular sites probably not

Additional visibility would be good too

Agree, though I'm not sure the CMS is the correct place to surface this information.

This information is not relevant to a CMS author whose primary concern is updating website content i.e. completely non-technical

Even to a website adminstrator, I'm not sure "database has lots of records" is relevant. This is assumes a level of technical proficiency which cannot be assumed.

Also how would the CMS be aware of what "too much" means? Seems like this requires awareness of the capability of the database i.e. there's a big difference between $5 a month hosting vs $5,000 per month. The CMS has no awareness of the hardware that it's running on.

Seems like the hosting platform may be a more appropriate location to surface a warning of "too much many records"

@mateusz
Copy link

mateusz commented Jun 29, 2021

Seems like the hosting platform may be a more appropriate location to surface a warning of "too much many records"

Yeah, I understand it's a gray zone. I prefer though to frame it as a shared responsibility between CMS users, devs and platforms.

I'm not excluding a platform-side visibility - we should have both. I purposefully framed the CMS notice in CMS terms - "version history", not "database records". And I reckon (caution, opinion ahead) there would be a volume of past records (e.g. 1,000,000 in any of the *_version tables) where that kind of warning would be applicable to everyone, regardless of hosting size (because SQL queries are anyway limited to a single-CPU performance, which doesn't vary that much. And also those are some pretty large rows, so at 1 million, couple of kB each, it won't fit in memory, so we are just one non-indexed query away from a disk thrash).

But let's go back to the root cause. What is being argued here is that at certain volume of versioned records, CMS performance suffers. I think we are being hamstrung by lack of a non-functional requirement: it's currently implied CMS can store infinite amount of version history - and people believe that. But CMS performance is not tested for against infinite amount of version history records, so you get these operational issues cropping up.

In my experience any assumption of infinite performance/resources sooner or later translates to operational issues. Platforms are probably at fault, often trying to uphold an illusion of infinitely scalable performance (Kube, I'm looking at you), but that's A BIG HORRIBLE LIE 😄

@mfendeksilverstripe
Copy link
Contributor Author

mfendeksilverstripe commented Jun 29, 2021

Regarding the monitoring of the versions deletion. We use a custom chart in the Graylog for this:

Screen Shot 2021-06-30 at 11 41 14 AM

This allows us to keep track of the deletion progress. This is not available in the CMS.

The cleanup service reports number of deleted records @mateusz . It still needs to be hooked up to something that can display the information.

@elliot-sawyer
Copy link
Contributor

I would love the ability to disable versioning entirely if possible. It's absence would mean records only have a "Draft" and "Published" state with no history at all.

@ssmarco
Copy link

ssmarco commented Jun 30, 2021

Hi @mfendeksilverstripe

Nice work by the way.

If this does not go to the core directly then similar path with other modules when they started at first separately until there is such a time in the future.

What I believe in this feature is that there is now a way for all of us to benefit from your hard work without putting much effort in writing custom code to do the same thing.

I've also written a custom versions cleaner for another client and what a massive savings we got in the end from GBs down to MBs of data and it also helped decrease the deployment period.

So definitely once your module is out there, more people will use and contribute to it.

@chillu
Copy link
Member

chillu commented Jul 1, 2021

Awesome work @mfendeksilverstripe, kicked off a long overdue discussion! :)

From an operational perspective, database size issues can pop up through other means as well (e.g. a naive developer writing "Google Analytics as a database table"). Versioned is by far the most common cause of course, and the CMS team has made this worse for any ops team through the default introduction of campaigns (Changeset table), and then again through the opt-in introduction of versioned-snapshots. Reducing database size could be incentivised through service catalogues, granular pricing, or by directly engaging with large customers and making (strong) recommendations. It might help the discussion to describe specifically how backups can replace database content, and ensure them that every change is retained somewhere.

We simply can't maintain more code in Supported Modules until we have more sustained capacity, so my view is that anything that could go into a separate module shouldn't be in core.

@chillu
Copy link
Member

chillu commented Jul 1, 2021

There's a related discussion about versioned-snapshots at #334. Version truncation has been discussed before in #198. There's already a (probably less capable?) module at https://github.com/axllent/silverstripe-version-truncator.

Two issues which you should work through or at least call out as "sharp edges":

  1. Author UX: The CMS has an archiveDate URL parameter, that's used to view archived pages. Technically it enables you to view the whole website at any past date (for versioned content). If you garbage collect based on previous version records (rather than a hard date cutoff), these views no longer present an accurate picture: It depends on the distribution of versions over time. The versioning logic is likely resilient enough to just pick the "next available version" and not break. Ideally we'd show a "here be dragons" message to users though? Note that there's no date picker UI, it's more of a hidden core feature that's surfaced when viewing archived records.

  2. Campaigns and Versioned-Snapshots: Does this break views for these data structured point to garbage collected versions?

@mfendeksilverstripe
Copy link
Contributor Author

@chillu This feature deletes only draft versions. Version history viewer will fall back on later versions so as far as I can see, nothing breaks. Note that this code is running on Production of a large project, this is not a POC.

We keep all the published versions so there is always something the roll back to. We've set the expectation with the client that draft versions have really low value after 6 months. We also keep to two most recent draft versions even past 6 months.

Overall, I think this should be a separate module not necessarily limited to versions cleanup as I also included the ChangeSetItem cleanup.

I had a look at other modules but none of them had the capability to allow very granular level of configuration and ongoing incremental deletion in the background.

My feature runs in the background as jobs which automatically scale back when the queue is too busy as the cleanup jobs are low priority.

@elliot-sawyer
Copy link
Contributor

@axllent's module is a very interesting approach. Infinite historical records aren't necessary, even for Public Records Act reasons, and a sensible pruning strategy is an excellent alternative if going completely versionless isn't possible. I wonder if keep_versions: 0 has the same effect...

@jcop007
Copy link

jcop007 commented Jul 2, 2021

Excessive version records has caused us issues as well - making databases enormous and therefore time consuming to transfer. Admittedly, some of the issues have been caused by version records being created when they probably shouldn't, like when cron tasks run and make updates to dataobjects. We've started using this module to start pruning version records - https://github.com/isobar-nz/silverstripe-versionprune

@sinan-evanshunt
Copy link

Thanks to @mfendeksilverstripe for the cleanup task.
See this issue on the SilverStripe Slack channel https://silverstripe-users.slack.com/archives/C36U3BDHN/p1625006165243300

I have set private const DELETION_LIFETIME = 5; and private const DELETION_LIMIT = 50000;
I ran the task a couple of times (since there are +2 million records!) and I was able to access the page on the admin side.

The only change to the task's code was the removal of use App\Helpers\QueueLockHelper; and

        if (QueueLockHelper::isMaintenanceLockActive()) {
            return;
        }

Couldn't figure out how to get this part working. But regardless, this task looks like it saved the day! so, thank you 👍

@mfendeksilverstripe
Copy link
Contributor Author

The only change to the task's code was the removal of use App\Helpers\QueueLockHelper; and
Couldn't figure out how to get this part working. But regardless, this task looks like it saved the day! so, thank you

QueueLockHelper::isMaintenanceLockActive() is a global lock to disabled all dev tasks from executing. This is a piece of bespoke code not directly related to this feature so you can safely remove it ;-).

I'm happy to hear that the cleanup task is working for you @sinan-evanshunt .

@brettt89
Copy link

brettt89 commented Jul 18, 2021

FYI I have put together a Module for working with Garbage Collection in general (including collectors for Versioned records). This is still in draft as I have not yet been able to test this as heavily as I would like, but I am looking for contributors if anyone is interested.

Silverstripe Garbage Collector Module

Github: https://github.com/brettt89/silverstripe-garbage-collector

Garbage Collector Service for Silverstripe Applications. Contains Service for operations and collection of Collectors and Processors that can be consumed or extended.

PR for Fluent Support: brettt89/silverstripe-garbage-collector#1

@mfendeksilverstripe
Copy link
Contributor Author

Really nice work @brettt89 . I was wondering where should the Fluent related code live? Should that be a new module or should it be just an extension in the Fluent module?

@andrewandante
Copy link
Contributor

where should the Fluent related code live

With these kinds of things, I think it depends on how much code it is. If it's relatively little, including it in the core with only config in yaml is a nice way to work around it. Otherwise, making it a separate module is better, so non-Fluent people don't need to download 10 extra files they won't need. I've done the former for subsites, for example.

@brettt89
Copy link

brettt89 commented Jul 18, 2021

where should the Fluent related code live

With these kinds of things, I think it depends on how much code it is. If it's relatively little, including it in the core with only config in yaml is a nice way to work around it. Otherwise, making it a separate module is better, so non-Fluent people don't need to download 10 extra files they won't need. I've done the former for subsites, for example.

GarbageCollector module has a PR raised for adding Fluent Extension support.

brettt89/silverstripe-garbage-collector#1 [Merged]

Garbage Collector should be ready for testing on actual data sets.
https://github.com/brettt89/silverstripe-garbage-collector/

@chillu
Copy link
Member

chillu commented Oct 14, 2021

@mfendeksilverstripe Would integrating this PR into Brett's module be a viable path? I'd much prefer that to adding more code to core. Although we should strongly encourage devs operating Silverstripe CMS sites to consider this gotcha, and link to the relevant modules in the following places:

https://docs.silverstripe.org/en/4/getting_started/server_requirements/
https://docs.silverstripe.org/en/4/developer_guides/model/versioning/
https://docs.silverstripe.org/en/4/developer_guides/performance/resource_usage/

@mfendeksilverstripe
Copy link
Contributor Author

hey @chillu Brett's module is based on this PR so the job's done :D. I'll close this as this PR has served its purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.