Add ADR on Security Zones (#943)

Co-authored-by: Viktor <[email protected]> Co-authored-by: Olle Larsson <[email protected]> Co-authored-by: Viktor <[email protected]>
elastisys · Aug 20, 2024 · 5d90176 · 5d90176
1 parent 91ea7ae
commit 5d90176
Show file tree

Hide file tree

Showing 8 changed files with 152 additions and 8 deletions.
diff --git a/CODEOWNERS b/CODEOWNERS
@@ -14,3 +14,7 @@
 /linkchecker.conf                   @elastisys/goto-public-docs @elastisys/qa
 /package.json                       @elastisys/goto-public-docs @elastisys/qa
 /requirements-linkchecker.txt       @elastisys/goto-public-docs @elastisys/qa
+
+# Architectural decisions MUST be reviewed by the product team
+/docs/architecture.md               @elastisys/product-team
+/docs/adr/                          @elastisys/product-team
diff --git a/ci/vale/styles/config/vocabularies/Elastisys/accept.txt b/ci/vale/styles/config/vocabularies/Elastisys/accept.txt
@@ -12,6 +12,7 @@ Grafana
 IMY
 Kubespray
 Kured
+Tekton
 Telavox
 Trivy
 Velero
@@ -24,6 +25,7 @@ allowlisted
 (?i)autoscaling
 colocation
 CPU
+cybersecurity
 Dockerfile
 (?i)downsampled
 downscaling

diff --git a/docs/adr/0015-we-believe-in-community-driven-open-source.md b/docs/adr/0015-we-believe-in-community-driven-open-source.md
@@ -2,7 +2,6 @@
 tags:
   - ISO 27001 A.17.1.1 Planning Information Security Continuity
 ---
-
 # We believe in community-driven open source
 
 - Status: accepted

diff --git a/docs/adr/0022-use-dedicated-nodes-for-additional-services.md b/docs/adr/0022-use-dedicated-nodes-for-additional-services.md
@@ -3,7 +3,6 @@ tags:
   - BSI IT-Grundschutz APP.4.4.A14
   - NIST SP 800-171 3.13.3
 ---
-
 # Use Dedicated Nodes for Additional Services
 
 - Status: accepted

diff --git a/docs/adr/0050-use-cluster-isolation.md b/docs/adr/0050-use-cluster-isolation.md
@@ -0,0 +1,132 @@
+# Use Cluster Isolation to separate the application and its traces from its logs and metrics
+
+- Status: accepted
+- Deciders: Product Team
+- Date: 2024-08-16
+
+## Context and Problem Statement
+
+Let us introduce the context of this ADR.
+First, we look at the regulatory and information security landscape.
+Second, we look at the technological landscape.
+Finally, we merge these two discussions to devise the problem statement.
+
+### Regulatory and Information Security Landscape
+
+The [NIS2 Directive](https://digital-strategy.ec.europa.eu/en/policies/nis2-directive) came into force 2023 and will affect how platforms hosting software critical to society are secured.
+Article 21 lists 10 cybersecurity risk-management measures which all essential and important entities will need to implement.
+
+This ADR focuses on the following two measures, reproduced below verbatim from the NIS2 Directive:
+
+> - (e) security in network and information systems acquisition, development and maintenance, including vulnerability handling and disclosure;
+> - (i) human resources security, access control policies and asset management.
+
+The most common way to implement these measures is using a concept called **Security Zone**.
+A Security Zone is a logical grouping of IT systems with similar needs of protection.
+For example, one Security Zone hosts [mission-critical applications](https://en.wikipedia.org/wiki/Mission_critical), whereas another Security Zone hosts non-mission-critical applications.
+
+Each Security Zone has a separate network, either isolated physically (OSI layer 1) or logically (OSI layer 2, e.g., [VLAN](https://en.wikipedia.org/wiki/VLAN)).
+The Internet itself can be conceptualized as a Security Zone with the lowest protection class.
+Network traffic can only travel between Security Zones through firewalls.
+Usually, connections are only allowed from Security Zones with higher protection class to those with lower.
+As an additional measures, applications and protocols are designed to ensure that:
+
+- A compromise of information availability and integrity of a Security Zone with lower protection class does not affect a Security Zone with a higher protection class.
+For example, updates from the Internet need to pass through a manual review and validation process on a non-active production environment before being applied to the active production environment.
+- Confidential information is not leaked from a Security Zone with higher protection class into one with lower protection class.
+For example, application logs required for diagnostics may traverse from a higher to lower class. This, of course, entails that adequate requirements are set on the application development process, such as, logs should contain not confidential information.
+However, application traces, given that they may contain function call parameters and hence pose a higher risk of leaking confidential information, should remain in the Security Zone with the higher protection class.
+
+Besides network isolation, Security Zones with higher protection class may be accessed only by staff with a given security clearance.
+For example, the Security Zone in the highest protection class may be accessed only by staff who cleared the [Swedish SÄPO Security Clearance](https://sakerhetspolisen.se/ovriga-sidor/other-languages/english-engelska/what-we-do/protective-security.html) or the [German SÜG](https://de.wikipedia.org/wiki/Sicherheits%C3%BCberpr%C3%BCfungsgesetz).
+In contrast, Security Zones in a lower protection class may be accessed by the staff of suppliers, application developers, platform support staff, etc.
+
+### Technological Landscape
+
+Somewhat simplified, a containerized platform, such as Compliant Kubernetes, hosts the application stack and observability stack.
+The [three pillars of observability](https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html) are metrics, logs and traces.
+Metrics and logs are rather "explicit" in nature, i.e., the application developer usually add code to their application to explicitly decide what metrics and what log lines the application produces.
+Traces are rather "implicit" in nature, i.e., the application developer [includes a library in their application and some code snippet](https://opentelemetry.io/docs/zero-code/python/example/), which automatically produces traces of all function calls and returns, including invoked parameters.
+
+The application and observability stack may be separated through various levels of isolation: labels, namespace isolation, Node isolation and Cluster isolation.
+See [Levels of Isolation](../user-guide/how-many-environments.md#levels-of-isolation) for a description of each isolation level and what kind of isolation they achieve.
+
+## Problem Statement
+
+Given the regulatory and information security landscape above, what level of isolation should Compliant Kubernetes employ between the application stack and observability stack?
+
+## Decision Drivers
+
+- We want to make it easy to protect code which runs in Security Zones with higher protection class, even if that means a reduction in productivity.
+- We want to protect code which runs in Security Zones with lower protection class, but a reduction in productivity should be minimized.
+- We want to avoid "Cluster sprawl" which increases maintenance burden.
+
+## Considered Options
+
+1. Use Namespace isolation between application and observability stack.
+    - Good, because it avoids Cluster sprawl.
+    - Bad, because Security Zones are only enforced using NetworkPolicies, which is usually insufficient for high-security organizations.
+1. Use Node isolation between application and observability stack.
+    - Good, because it avoids Cluster sprawl.
+    - Good, because each Node may run in a different Security Zone.
+    - Bad, because Kubernetes and its CNI tends to assume non-restricted communication between Nodes. The resulting firewall rules might look very generous and hard to audit.
+    - Bad, because the Kubernetes control plane is shared between Security Zones, which violates the spirit of zoning.
+1. Use Cluster isolation between application and observability stack.
+    - Good, because each Cluster can run in a different Security Zone.
+    - Good, because the application and observability stack can be in different Security Zones.
+    - Bad, because traces end up in the same Security Zone as logs and metrics.
+1. Use Cluster isolation between application and observability stack. Host the application and its traces in one Cluster, while the other Cluster hosts logs and metrics.
+    - Good, because it avoid Cluster sprawl, i.e., only two Cluster, instead of the minimum of one.
+    - Good, because it follows the Security Zone model.
+1. Use Cluster isolation with one Cluster for each of application, traces, logs and metrics.
+    - Neutral, because, compared to the option above, it does allow more Security Zones. However, many Security Zones are likely to be of the same protection class, hence, the benefits in terms of security does not outweigh the cost.
+    - Bad, because it leads to Cluster sprawl.
+
+## Decision Outcome
+
+Chosen option: "Option 4: Host the application and its traces in one Cluster, while the other Cluster hosts logs and metrics", because it provides the right tradeoff between number of Security Zones and Cluster sprawl.
+
+![Illustration of Option 4](img/0050-use-cluster-isolation.drawio.svg)
+
+## Other Considerations
+
+### Glossary used in Compliant Kubernetes
+
+Compliant Kubernetes uses the following glossary:
+
+- The Cluster hosting the application and traces is called the [Workload Cluster](../glossary.md#workload-cluster).
+- The Cluster hosting logs and metrics is called the [Management Cluster](../glossary.md#management-cluster).
+
+The pair of Cluster form a Compliant Kubernetes [Environment](../glossary.md#environment).
+
+### Automated Platform Updates via Tekton
+
+If the Workload Cluster is in a Security Zone with a high protection class, then we recommend against using auto-updates with Tekton.
+Prefer running the migration scripts manually for added human supervision.
+See [ADR-0035 Run Tekton on Management Cluster](0035-run-tekton-on-service-cluster.md).
+
+### Cluster API
+
+If the Workload Cluster is in a Security Zone with a higher protection class than the Management Cluster, then the Workload Cluster should run its own Cluster API controller.
+This goes against [ADR-0033 Run Cluster API controllers on Management Cluster](0033-run-cluster-api-controllers-on-service-cluster.md).
+
+### Tamper-Proof Logging
+
+Some regulations and information security standards require tamper-proof logging.
+In other words, a compromise of the application deployment should not enable an attacker to remove their trails and hinder forensics.
+Compliant Kubernetes already observes this principle for its observability stack, e.g., the access that the Workload Cluster has to the Service Cluster cannot be used to remove old log entries.
+(Of course, garbage new log entries may be created, but that only makes an attack more obvious during [log review](../ciso-guide/log-review.md).
+
+Cluster isolation between the application and its logs adds another layer of protection:
+The attacker would need to:
+
+- compromise the application;
+- compromise the underlying Workload Cluster to reach log collection components (Fluentd);
+- compromise the OpenSearch endpoint on the Service Cluster.
+
+## Links
+
+- [NIS2 Directive](https://digital-strategy.ec.europa.eu/en/policies/nis2-directive)
+- [Proposal for an implementation of the NIS2 Directive in Sweden](https://www.regeringen.se/contentassets/1e56bf5cad214fc78eb80d91c11cccb6/nya-regler-om-cybersakerhet-sou-202418.pdf)
+- [This is the NIS2 Directive from the Swedish Civil Contingencies Agency](https://www.msb.se/sv/amnesomraden/informationssakerhet-cybersakerhet-och-sakra-kommunikationer/krav-och-regler-inom-informationssakerhet-och-cybersakerhet/nis-direktivet/det-har-ar-nis2-direktivet/)
+- [NIS2 in Germany](https://www.openkritis.de/eu/eu-nis-2-germany.html)
diff --git a/docs/adr/img/0050-use-cluster-isolation.drawio.svg b/docs/adr/img/0050-use-cluster-isolation.drawio.svg
diff --git a/docs/adr/index.md b/docs/adr/index.md
@@ -7,7 +7,6 @@ tags:
   - NIST SP 800-171 3.13.2
   - NIST SP 800-171 3.13.3
 ---
-
 # Architectural Decision Log
 
 ## Mapping to ISO 27001 Controls
@@ -102,7 +101,7 @@ This log lists the architectural decisions for Compliant Kubernetes.
 - [ADR-0022](0022-use-dedicated-nodes-for-additional-services.md) - Use Dedicated Nodes for Additional Services
 - [ADR-0023](0023-allow-snippets-annotations.md) - Only allow Ingress Configuration Snippet Annotations after Proper Risk Acceptance
 - [ADR-0024](0024-allow-Harbor-robot-account.md) - Allow a Harbor robot account that can create other robot accounts with full privileges
-- [ADR-0025](0025-local-storage.md) - Use local-volume-provisioner for Managed Services that requires high-speed disks.
+- [ADR-0025](0025-local-storage.md) - Use local-volume-provisioner for Managed Services that requires high-speed disks
 - [ADR-0026](0026-hnc.md) - Use `environment-name` as the default root of Hierarchical Namespace Controller (HNC)
 - [ADR-0027](0027-postgresql-external-replication.md) - PostgreSQL - Enable external replication
 - [ADR-0028](0028-harder-pod-eviction-when-node-goes-OOM.md) - Harder Pod eviction when nodes are going OOM
@@ -121,12 +120,13 @@ This log lists the architectural decisions for Compliant Kubernetes.
 - [ADR-0041](0041-encryption-at-rest.md) - Rely on Infrastructure Provider for encryption-at-rest
 - [ADR-0042](0042-argocd-dynamic-hnc-namespaces.md) - ArgoCD with dynamic HNC namespaces
 - [ADR-0043](0043-rclone-and-encryption-adhere-cryptography-policy.md) - Rclone and Encryption adheres Cryptography Policy
-- [ADR-0044](0044-argocd-managing-its-own-namespace.md) - ArgoCD is not allowed to manage its own namespace.
+- [ADR-0044](0044-argocd-managing-its-own-namespace.md) - ArgoCD is not allowed to manage its own namespace
 - [ADR-0045](0045-use-specialised-prebuilt-images.md) - Use specialised prebuilt images
 - [ADR-0046](0046-handle-crds.md) - Handle all CRDs with the standard Helm CRD management
 - [ADR-0047](0047-kubernetes-versions.md) - When to upgrade to new Kubernetes versions
-- [ADR-0048](0048-access-management-for-AMS-with-network-policies.md) - Access Management for Additional Managed Services (AMS-es).
-- [ADR-0049](0049-run-ingress-nginx-in-chroot.md) - Run Ingress-NGINX in chroot.
+- [ADR-0048](0048-access-management-for-AMS-with-network-policies.md) - Access Management for Additional Managed Services (AMS-es)
+- [ADR-0049](0049-run-ingress-nginx-in-chroot.md) - Running NGINX with Chroot Option
+- [ADR-0050](0050-use-cluster-isolation.md) - Use Cluster Isolation to separate the application and its traces from its logs and metrics
 
 <!-- adrlogstop -->
 
@@ -142,4 +142,4 @@ Pre-requisites:
 - Install [adr-log](https://github.com/adr/adr-log#install)
 - Install [make](https://packages.ubuntu.com/search?keywords=make)
 
-Run `make -C docs/adr`
+Run `make -C docs/adr`, then run `pre-commit run --all-files`.
diff --git a/docs/user-guide/how-many-environments.md b/docs/user-guide/how-many-environments.md
@@ -54,6 +54,10 @@ In particular, production data should not be compromised, no matter what happens
 Similarly, some regulations -- such as Medical Devices Regulation (MDR) -- require you to take a risk-based approach to changing the tech stack.
 Depending on your risk assessment, this implies **verifying** changes in a non-production Application Deployment before going into production.
 
+Some regulations, such as [NIS2](../ciso-guide/nis2.md), require that the organization takes measures related to "security in network and information systems acquisition, development and maintenance, including vulnerability handling and disclosure".
+This is commonly implemented using a concept called Security Zones.
+If two applications are in different Security Zones, then Cluster isolation might be required.
+
 ## Recommendations
 
 Taking into account the relevant regulations, Compliant Kubernetes recommends setting up **at least two Environments**: