From 6e2d2832a54737ba7ae25c5e07045027ad13090f Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Mon, 4 Dec 2023 16:03:48 +0000
Subject: [PATCH 1/6] moved services post

---
 _posts/2023-12-04-services-moved.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 _posts/2023-12-04-services-moved.md

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
new file mode 100644
index 0000000..a0591ee
--- /dev/null
+++ b/_posts/2023-12-04-services-moved.md
@@ -0,0 +1,14 @@
+---
+title: Relocating opam.ci.ocaml.org and ocaml.ci.dev
+---
+
+About six months ago, [opam-repo-ci (opam.ci.ocaml.org)](https://opam.ci.ocaml.org) was suffering a lack of system memory [issue 220](https://github.com/ocurrent/opam-repo-ci/issues/220) which caused it to be moved to the same machine which was being used to host [ocaml-ci (ocaml.ci.dev)](https://ocaml.ci.dev).
+
+Subsequently, that machine suffered from BTRFS volume corruption [issue 51](https://github.com/ocaml/infrastructure/issues/51). Therefore, both services were moved to a big new server. The data was efficiently migrated using BTRFS tools `btrfs send | btrfs receive`.
+
+Since the move, we have seen issues with BTRFS metadata plus we have suffered from a build-up of sub volumes as reported by other users: [Docker gradually exhausts disk space on BTRFS](https://github.com/moby/moby/issues/27653).
+
+Unfortunately, both services went down on Friday evening [issue 85](https://github.com/ocaml/infrastructure/issues/85). Analysis showed over 500 BTRFS sub volumes, a shortage of metadata space and insufficient space to perform a BTRFS _rebalance_.
+
+Returning to the original configuration of splitting the ci.dev and ocaml.org services, they have been moved on to new and separate hardware. The underlying filesystem is now a RAID1-backed ext4 formatted with `-i 8192` to ensure sufficient inodes are available. Docker is using Overlayfs. RSYNC was used to copy the databases and logs from the old server. This change should add resilience and has doubled the capacity for storing history logs.
+

From 1cae4ca99d0c30681d6e38cd9c8c9671260c031e Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Tue, 5 Dec 2023 12:51:34 +0000
Subject: [PATCH 2/6] Update _posts/2023-12-04-services-moved.md

Co-authored-by: Christine Rose
---
 _posts/2023-12-04-services-moved.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
index a0591ee..ca0d5d8 100644
--- a/_posts/2023-12-04-services-moved.md
+++ b/_posts/2023-12-04-services-moved.md
@@ -2,7 +2,7 @@
 title: Relocating opam.ci.ocaml.org and ocaml.ci.dev
 ---
 
-About six months ago, [opam-repo-ci (opam.ci.ocaml.org)](https://opam.ci.ocaml.org) was suffering a lack of system memory [issue 220](https://github.com/ocurrent/opam-repo-ci/issues/220) which caused it to be moved to the same machine which was being used to host [ocaml-ci (ocaml.ci.dev)](https://ocaml.ci.dev).
+About six months ago, [`opam-repo-ci` (opam.ci.ocaml.org)](https://opam.ci.ocaml.org) suffered from a lack of system memory ([issue 220](https://github.com/ocurrent/opam-repo-ci/issues/220)) which caused it to be moved to the machine hosting [`ocaml-ci` (ocaml.ci.dev)](https://ocaml.ci.dev).
 
 Subsequently, that machine suffered from BTRFS volume corruption [issue 51](https://github.com/ocaml/infrastructure/issues/51). Therefore, both services were moved to a big new server. The data was efficiently migrated using BTRFS tools `btrfs send | btrfs receive`.
 

From b1ab96cd6e9cd1670743dd81d58d545b91be1ef9 Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Tue, 5 Dec 2023 12:51:40 +0000
Subject: [PATCH 3/6] Update _posts/2023-12-04-services-moved.md

Co-authored-by: Christine Rose
---
 _posts/2023-12-04-services-moved.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
index ca0d5d8..0a9bbaa 100644
--- a/_posts/2023-12-04-services-moved.md
+++ b/_posts/2023-12-04-services-moved.md
@@ -4,7 +4,7 @@ title: Relocating opam.ci.ocaml.org and ocaml.ci.dev
 
 About six months ago, [`opam-repo-ci` (opam.ci.ocaml.org)](https://opam.ci.ocaml.org) suffered from a lack of system memory ([issue 220](https://github.com/ocurrent/opam-repo-ci/issues/220)) which caused it to be moved to the machine hosting [`ocaml-ci` (ocaml.ci.dev)](https://ocaml.ci.dev).
 
-Subsequently, that machine suffered from BTRFS volume corruption [issue 51](https://github.com/ocaml/infrastructure/issues/51). Therefore, both services were moved to a big new server. The data was efficiently migrated using BTRFS tools `btrfs send | btrfs receive`.
+Subsequently, that machine suffered from BTRFS volume corruption ([issue 51](https://github.com/ocaml/infrastructure/issues/51)). Therefore, we moved both services to a larger new server. The data was efficiently migrated using BTRFS tools: `btrfs send | btrfs receive`.
 
 Since the move, we have seen issues with BTRFS metadata plus we have suffered from a build-up of sub volumes as reported by other users: [Docker gradually exhausts disk space on BTRFS](https://github.com/moby/moby/issues/27653).
 

From 1a9d35f97e03bd91809718dd0ab418e9f2c4d8a9 Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Tue, 5 Dec 2023 12:51:45 +0000
Subject: [PATCH 4/6] Update _posts/2023-12-04-services-moved.md

Co-authored-by: Christine Rose
---
 _posts/2023-12-04-services-moved.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
index 0a9bbaa..e8cfde6 100644
--- a/_posts/2023-12-04-services-moved.md
+++ b/_posts/2023-12-04-services-moved.md
@@ -6,7 +6,7 @@ About six months ago, [`opam-repo-ci` (opam.ci.ocaml.org)](https://opam.ci.ocaml
 
 Subsequently, that machine suffered from BTRFS volume corruption ([issue 51](https://github.com/ocaml/infrastructure/issues/51)). Therefore, we moved both services to a larger new server. The data was efficiently migrated using BTRFS tools: `btrfs send | btrfs receive`.
 
-Since the move, we have seen issues with BTRFS metadata plus we have suffered from a build-up of sub volumes as reported by other users: [Docker gradually exhausts disk space on BTRFS](https://github.com/moby/moby/issues/27653).
+Since the move, we have seen issues with BTRFS metadata. Plus, we have suffered from a build-up of subvolumes, as reported by other users: [Docker gradually exhausts disk space on BTRFS](https://github.com/moby/moby/issues/27653).
 
 Unfortunately, both services went down on Friday evening [issue 85](https://github.com/ocaml/infrastructure/issues/85). Analysis showed over 500 BTRFS sub volumes, a shortage of metadata space and insufficient space to perform a BTRFS _rebalance_.
 
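The `btrfs send | btrfs receive` migration and the subvolume/metadata analysis described in the post correspond to standard `btrfs-progs` commands. A minimal sketch, assuming illustrative device paths, mount points, and hostnames rather than the actual server layout:

```shell
# Migration: btrfs send requires a read-only snapshot of the source subvolume
btrfs subvolume snapshot -r /data /data/migrate.ro

# Stream the snapshot to the new server's BTRFS filesystem over SSH
btrfs send /data/migrate.ro | ssh new-server 'btrfs receive /data'

# Diagnosis: count subvolumes (Docker's btrfs storage driver creates one
# per image layer and per container, hence the build-up)
btrfs subvolume list / | wc -l

# Inspect metadata allocation; a nearly full Metadata line is the
# shortage described above
btrfs filesystem df /

# A rebalance reclaims under-used chunks, but itself needs free space to run
btrfs balance start -dusage=50 /
```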
From b55e90e8430893c57cb61856c7f2764c447118b1 Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Tue, 5 Dec 2023 12:51:51 +0000
Subject: [PATCH 5/6] Update _posts/2023-12-04-services-moved.md

Co-authored-by: Christine Rose
---
 _posts/2023-12-04-services-moved.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
index e8cfde6..9449f92 100644
--- a/_posts/2023-12-04-services-moved.md
+++ b/_posts/2023-12-04-services-moved.md
@@ -8,7 +8,7 @@ Subsequently, that machine suffered from BTRFS volume corruption ([issue 51](htt
 
 Since the move, we have seen issues with BTRFS metadata. Plus, we have suffered from a build-up of subvolumes, as reported by other users: [Docker gradually exhausts disk space on BTRFS](https://github.com/moby/moby/issues/27653).
 
-Unfortunately, both services went down on Friday evening [issue 85](https://github.com/ocaml/infrastructure/issues/85). Analysis showed over 500 BTRFS sub volumes, a shortage of metadata space and insufficient space to perform a BTRFS _rebalance_.
+Unfortunately, both services went down on Friday evening ([issue 85](https://github.com/ocaml/infrastructure/issues/85)). Analysis showed over 500 BTRFS subvolumes, a shortage of metadata space, and insufficient space to perform a BTRFS _rebalance_.
 
 Returning to the original configuration of splitting the ci.dev and ocaml.org services, they have been moved on to new and separate hardware. The underlying filesystem is now a RAID1-backed ext4 formatted with `-i 8192` to ensure sufficient inodes are available. Docker is using Overlayfs. RSYNC was used to copy the databases and logs from the old server. This change should add resilience and has doubled the capacity for storing history logs.
 

From 691be739f22a98663c708be7107afbf884751903 Mon Sep 17 00:00:00 2001
From: Mark Elvers
Date: Tue, 5 Dec 2023 12:51:56 +0000
Subject: [PATCH 6/6] Update _posts/2023-12-04-services-moved.md

Co-authored-by: Christine Rose
---
 _posts/2023-12-04-services-moved.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-12-04-services-moved.md b/_posts/2023-12-04-services-moved.md
index 9449f92..502d7d8 100644
--- a/_posts/2023-12-04-services-moved.md
+++ b/_posts/2023-12-04-services-moved.md
@@ -10,5 +10,5 @@ Since the move, we have seen issues with BTRFS metadata. Plus, we have suffered
 
 Unfortunately, both services went down on Friday evening ([issue 85](https://github.com/ocaml/infrastructure/issues/85)). Analysis showed over 500 BTRFS subvolumes, a shortage of metadata space, and insufficient space to perform a BTRFS _rebalance_.
 
-Returning to the original configuration of splitting the ci.dev and ocaml.org services, they have been moved on to new and separate hardware. The underlying filesystem is now a RAID1-backed ext4 formatted with `-i 8192` to ensure sufficient inodes are available. Docker is using Overlayfs. RSYNC was used to copy the databases and logs from the old server. This change should add resilience and has doubled the capacity for storing history logs.
+Returning to the original configuration of splitting the `ci.dev` and OCaml.org services, they have been moved onto new and separate hardware. The underlying filesystem is now a RAID1-backed ext4, formatted with `-i 8192` in order to ensure the availability of sufficient inodes. Docker uses Overlayfs. RSYNC was used to copy the databases and logs from the old server. This change should add resilience and has doubled the capacity for storing history logs.
 
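The rebuilt configuration in the final revision maps onto equally standard tooling. A sketch only, assuming hypothetical device names and paths (the post does not give the actual layout):

```shell
# Mirror two disks with mdadm to get the RAID1-backed block device
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Format as ext4 with one inode per 8192 bytes rather than the mke2fs
# default of 16384, doubling the inode count for builds with many small files
mkfs.ext4 -i 8192 /dev/md0
mount /dev/md0 /var/lib/docker

# Copy the service databases and logs across, preserving permissions,
# ownership, timestamps, and hard links
rsync -aH old-server:/var/lib/ocaml-ci/ /var/lib/ocaml-ci/
```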