Feature Review: Cleanup service for old version records. #335
Awesome! Without having context of the code, my immediate feeling is this would be a great addition as a separate module. I think we should continue to focus on keeping core module code as the foundation that provides the functionality that meets the common use case very well - while being as straightforward as possible to maintain. While I'm sure there are many projects out there that would benefit from this, it seems like the most relevant use case will be enterprise-sized sites with a significant amount of versioned pages (especially when you bring Fluent into the mix) - where performance is already something that needs a lot of attention.
In a somewhat similar vein, I've recently been pointing people to this module: https://github.com/dnadesign/userforms-bulk-delete. Sure, it's a feature that the majority of sites with userforms would find valuable, but separating it into its own module keeps the code opt-in and doesn't require the userforms module to continue to grow with every new feature.
I'm guessing putting this into a separate module would come with concerns about discovering it - but we've recently been talking about some more dedicated developer docs on improving the performance of large sites - something which this could be referenced in too. |
As a support engineer (and someone who spends a lot of time working with performance), versioning is a massive pain for us due to the lack of cleanup ability (e.g. purging / removing old data). Without it, the database often becomes overpopulated and slow as record counts start getting into the millions. So having a method (or built-in garbage collection) or alternative cold storage for old data (e.g. file storage) or anything to get this data out of the database would help improve performance and support greatly. |
Anything that makes a DB snapshot smaller is grand in my book! 😄 But more seriously, this will affect performance in a really great way. I'd been hoping to do something similar but never found the time. IMO I'd have it on by default with fairly conservative constraints (keep 10 versions or something) just because it's so easy for those tables to explode out, and once they have, it's much harder to deal with them. Is Excellent work! |
I would also love to have this as the default (limited version records, e.g. 10 versions) for all Versioned tables (maybe with the ability to customize per DataObject). It is also worth considering other tables which naturally get quite large over time (e.g. FormSubmissions / Files / etc). Perhaps there may be some cross-over in the approach being taken that could be used outside of Versioning as well. |
@andrewandante I've added @brynwhyman What's the process of setting up a new module? Should I just create one in the Terraformers space? I'll use the module skeleton as a template for the structure. |
You're free to create the module under any GitHub organisation. At Silverstripe, there's the option to bring it over to the open-source Silverstripe GH org at a later stage so as to include it in the paid TravisCI plan for CI builds, but that's entirely optional :) Although, if others see value in this being included in the versioned module, that could still be an option... @maxime-rainville @emteknetnz @dhensby @chillu what do you think? |
I'd prefer it as a separate module as it's not a common use case. I think deleting old Versioned records would do more harm than good for the majority of sites. People are more likely to pause and think about consequences if they need to
Making this an official feature also means CMS designers may need to start thinking about scenarios where Versioned records don't exist when there's a reasonable expectation that they do.
Even though it wouldn't be an "official" module, it's probably still worth a mention of the new module on docs.silverstripe.org to help more people discover it if they're struggling with too many records. |
As an engineer, I would argue that if they want to maintain an accurate historic record, they need to use backups instead. Versioning appears to work on the basis that nothing is ever deleted from the website, which is what causes most of the pain (e.g. assets are never actually deleted, DB records are never actually removed). Storing historic page content from 5+ years ago is not something I would expect a standard website to do. This is what backups are for.
If developers are not thinking of consequences from code changes that is a bigger problem. There are many 'breakable' changes that can be done by developers without E.g. NOTE: This is all just my opinion as a Support Engineer. |
Chiming in with operational folks, this would be awesome to have in core and enabled by default, but this smells like an impractical MAJOR change to me :) Second best is having it in core, but disabled. Ops would then be able to point to a self-service solution: "ah yes, just run this task, and BTW ask your devs to enable it permanently." A single config file flip is easier than telling devs to pull in an additional module. Additional visibility would be good too: surface the total count somewhere to increase awareness among CMS editors. Maybe even plaster it with "Warning: large amount of version history detected. Your CMS performance may suffer" at a certain threshold. DELETION_LIFETIME sounds good to me - data retention is more often specified as time periods, rather than record quantities. |
I'm certainly not arguing against the value of backups. I'm more concerned that this feature is only relevant to the top % of sites, and we'd be more likely to unnecessarily remove data from sites for no real upside. Being able to roll back to a page that was last updated 3 years ago seems nice to have. I wouldn't want to remove this ability from users unless there was a compelling reason to, which in the case of very large sites there is, but for small / regular sites probably not.
Agree, though I'm not sure the CMS is the correct place to surface this information. This information is not relevant to a CMS author whose primary concern is updating website content, i.e. someone completely non-technical. Even to a website administrator, I'm not sure "database has lots of records" is relevant. This assumes a level of technical proficiency which cannot be assumed. Also, how would the CMS be aware of what "too much" means? Seems like this requires awareness of the capability of the database, i.e. there's a big difference between $5 a month hosting vs $5,000 per month. The CMS has no awareness of the hardware that it's running on. Seems like the hosting platform may be a more appropriate location to surface a warning of "too many records". |
Yeah, I understand it's a gray zone. I prefer though to frame it as a shared responsibility between CMS users, devs and platforms. I'm not excluding platform-side visibility - we should have both. I purposefully framed the CMS notice in CMS terms - "version history", not "database records". And I reckon (caution, opinion ahead) there would be a volume of past records (e.g. 1,000,000 in any of the *_version tables) where that kind of warning would be applicable to everyone, regardless of hosting size (because SQL queries are limited to single-CPU performance anyway, which doesn't vary that much. And also those are some pretty large rows, so at 1 million, a couple of kB each, it won't fit in memory, so we are just one non-indexed query away from a disk thrash).
But let's go back to the root cause. What is being argued here is that at a certain volume of versioned records, CMS performance suffers. I think we are being hamstrung by the lack of a non-functional requirement: it's currently implied that the CMS can store an infinite amount of version history - and people believe that. But CMS performance is not tested against an infinite amount of version history records, so you get these operational issues cropping up. In my experience any assumption of infinite performance/resources sooner or later translates to operational issues. Platforms are probably at fault, often trying to uphold an illusion of infinitely scalable performance (Kube, I'm looking at you), but that's A BIG HORRIBLE LIE 😄 |
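To put a rough number on the estimate in the comment above, here is a back-of-envelope calculation. It assumes "a couple of kB" means 2 kB per row and ignores indexes and storage-engine overhead, so it is an illustrative lower bound rather than a measurement.

```python
# Back-of-envelope size of a single *_version table (illustrative figures only).
ROW_COUNT = 1_000_000        # threshold suggested in the comment above
BYTES_PER_ROW = 2 * 1024     # assumption: "a couple of kB" taken as 2 kB


def table_size_gib(rows: int, bytes_per_row: int) -> float:
    """Approximate raw row data in GiB, ignoring indexes and engine overhead."""
    return rows * bytes_per_row / 1024**3


size = table_size_gib(ROW_COUNT, BYTES_PER_ROW)
print(f"{size:.2f} GiB")  # prints "1.91 GiB"
```

Whether roughly 2 GiB per versioned table fits in memory depends on the database's buffer configuration and on how many such tables a site has, which is the hosting-dependent part of the argument.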
Regarding monitoring of the version deletion, we use a custom chart in Graylog for this: This allows us to keep track of the deletion progress. This is not available in the CMS. The cleanup service reports the number of deleted records @mateusz . It still needs to be hooked up to something that can display the information. |
I would love the ability to disable versioning entirely if possible. Its absence would mean records only have a "Draft" and "Published" state with no history at all. |
Nice work by the way. If this does not go into core directly, it could follow the same path as other modules that started out separately and were brought in at a later time. What I like about this feature is that there is now a way for all of us to benefit from your hard work without putting much effort into writing custom code to do the same thing. I've also written a custom versions cleaner for another client, and we got massive savings in the end, from GBs down to MBs of data, and it also helped decrease the deployment time. So definitely, once your module is out there, more people will use and contribute to it. |
Awesome work @mfendeksilverstripe, kicked off a long overdue discussion! :) From an operational perspective, database size issues can pop up through other means as well (e.g. a naive developer writing "Google Analytics as a database table"). Versioned is by far the most common cause of course, and the CMS team has made this worse for any ops team through the default introduction of campaigns ( We simply can't maintain more code in Supported Modules until we have more sustained capacity, so my view is that anything that could go into a separate module shouldn't be in core. |
There's a related discussion about versioned-snapshots at #334. Version truncation has been discussed before in #198. There's already a (probably less capable?) module at https://github.com/axllent/silverstripe-version-truncator. Two issues which you should work through or at least call out as "sharp edges":
|
@chillu This feature deletes only draft versions. The version history viewer will fall back on later versions, so as far as I can see, nothing breaks. Note that this code is running in production on a large project; this is not a POC. We keep all the published versions so there is always something to roll back to. We've set the expectation with the client that draft versions have really low value after 6 months. We also keep the two most recent draft versions even past 6 months. Overall, I think this should be a separate module not necessarily limited to versions cleanup as I also included the
I had a look at other modules but none of them had the capability to allow a very granular level of configuration and ongoing incremental deletion in the background. My feature runs in the background as jobs which automatically scale back when the queue is too busy, as the cleanup jobs are low priority. |
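The retention rules described in the comment above can be sketched as a small selection function. This is a minimal Python sketch, not the PR's code: the `Version` fields, the `select_deletable` name, and the choice of 180 days for "6 months" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Version:
    # Hypothetical shape of a version row; field names are illustrative.
    record_id: int
    version: int
    was_published: bool
    last_edited: datetime


def select_deletable(
    versions: Iterable[Version],
    now: datetime,
    lifetime: timedelta = timedelta(days=180),  # assumption: "6 months"
    keep_drafts: int = 2,
) -> List[Version]:
    """Pick draft versions that are safe to delete under the stated rules."""
    by_record: Dict[int, List[Version]] = {}
    for v in versions:
        by_record.setdefault(v.record_id, []).append(v)

    deletable: List[Version] = []
    for items in by_record.values():
        # Published versions are never candidates, so rollback always works.
        drafts = sorted(
            (v for v in items if not v.was_published),
            key=lambda v: v.version,
            reverse=True,
        )
        # The newest `keep_drafts` drafts survive regardless of age.
        for v in drafts[keep_drafts:]:
            if now - v.last_edited > lifetime:
                deletable.append(v)
    return deletable


now = datetime(2021, 1, 1)
old = now - timedelta(days=400)
history = [
    Version(1, 1, True, old),    # published: always kept
    Version(1, 2, False, old),   # old draft: deletable
    Version(1, 3, False, old),   # old draft: deletable
    Version(1, 4, False, old),   # old, but one of the two newest drafts
    Version(1, 5, False, now - timedelta(days=10)),
]
to_delete = select_deletable(history, now)
print(sorted(v.version for v in to_delete))  # prints [2, 3]
```

Keeping the selection separate from the actual deletion makes the policy easy to test against sample histories before it is pointed at real data.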
@axllent's module is a very interesting approach. Infinite historical records aren't necessary, even for Public Records Act reasons, and a sensible pruning strategy is an excellent alternative if going completely versionless isn't possible. I wonder if |
Excessive version records have caused us issues as well - making databases enormous and therefore time-consuming to transfer. Admittedly, some of the issues have been caused by version records being created when they probably shouldn't be, like when cron tasks run and make updates to DataObjects. We've started using this module to start pruning version records - https://github.com/isobar-nz/silverstripe-versionprune |
Thanks to @mfendeksilverstripe for the cleanup task. I have set The only change to the task's code was the removal of
Couldn't figure out how to get this part working. But regardless, this task looks like it saved the day! so, thank you 👍 |
I'm happy to hear that the cleanup task is working for you @sinan-evanshunt . |
FYI I have put together a module for working with garbage collection in general (including collectors for Versioned records). This is still in draft as I have not yet been able to test it as heavily as I would like, but I am looking for contributors if anyone is interested.
Silverstripe Garbage Collector Module
GitHub: https://github.com/brettt89/silverstripe-garbage-collector
Garbage Collector Service for Silverstripe Applications. Contains Service for operations and collection of Collectors and Processors that can be consumed or extended.
PR for Fluent Support: brettt89/silverstripe-garbage-collector#1 |
Really nice work @brettt89 . I was wondering where should the Fluent related code live? Should that be a new module or should it be just an extension in the Fluent module? |
With these kinds of things, I think it depends on how much code it is. If it's relatively little, including it in the core with |
GarbageCollector module has a PR raised for adding Fluent Extension support. brettt89/silverstripe-garbage-collector#1 [Merged] Garbage Collector should be ready for testing on actual data sets. |
@mfendeksilverstripe Would integrating this PR into Brett's module be a viable path? I'd much prefer that to adding more code to core. Although we should strongly encourage devs operating Silverstripe CMS sites to consider this gotcha, and link to the relevant modules in the following places: https://docs.silverstripe.org/en/4/getting_started/server_requirements/ |
hey @chillu Brett's module is based on this PR so the job's done :D. I'll close this as this PR has served its purpose. |
Feature Review: Cleanup service for old version records.
This PR is to showcase a bespoke feature which might be useful in a core module. The goal is to answer the following questions:
This is not a code review. The code is meant to provide some context and documentation. If we find this feature useful, we can move to standard PR flow.
Summary
This is an enterprise-level solution for purging old version records and ChangeSetItem records. Version records accumulate over time, and larger projects may find it too difficult and unnecessary to keep them around. This feature allows you to delete old version records in a safe and incremental way. Once this is configured to suit your project, you can just sit back and relax as the old versions get deleted in the background.
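As a rough illustration of what "safe and incremental" deletion can look like, the sketch below removes records in small batches and stops early when higher-priority work is waiting. It is a generic Python sketch under stated assumptions, not the PR's implementation: `run_cleanup`, `delete_batch`, `queue_is_busy`, and the batch size of 100 are hypothetical names and values.

```python
from typing import Callable, Iterable, Iterator, List


def batches(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield ids in lists of at most `size` items."""
    batch: List[int] = []
    for i in ids:
        batch.append(i)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_cleanup(
    ids: Iterable[int],
    delete_batch: Callable[[List[int]], int],  # hypothetical: returns rows deleted
    queue_is_busy: Callable[[], bool],         # hypothetical back-pressure check
    batch_size: int = 100,
) -> int:
    """Delete in small batches; return the number of deleted records."""
    deleted = 0
    for batch in batches(ids, batch_size):
        if queue_is_busy():
            break  # leave the remainder for a later run
        deleted += delete_batch(batch)
    return deleted


# In-memory stand-in for a table of version ids.
store = set(range(250))


def delete_from_store(batch: List[int]) -> int:
    store.difference_update(batch)
    return len(batch)


total = run_cleanup(sorted(store), delete_from_store, queue_is_busy=lambda: False)
print(total, len(store))  # prints "250 0"
```

Returning the deleted count from each run gives a number that can be logged and charted by whatever monitoring a project already uses.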
You can also hook up the deletion events to your monitoring charts like this:
Dependencies