ROX-18431: restore deleted tenant through API #1207

Merged 22 commits on Sep 12, 2023
2 changes: 1 addition & 1 deletion .secrets.baseline
Original file line number Diff line number Diff line change
Expand Up @@ -564,5 +564,5 @@
}
]
},
"generated_at": "2023-08-30T10:02:58Z"
"generated_at": "2023-09-06T14:19:26Z"
}
10 changes: 5 additions & 5 deletions dev/env/scripts/exec_fleetshard_sync.sh
Expand Up @@ -20,12 +20,12 @@ export AWS_AUTH_HELPER="${AWS_AUTH_HELPER:-aws-saml}"
source "${GITROOT}/scripts/lib/external_config.sh"
init_chamber

CLUSTER_NAME="cluster-acs-dev-dp-01"
CLUSTER_NAME="${CLUSTER_NAME:-cluster-acs-dev-dp-01}"

ARGS="CLUSTER_ID=${CLUSTER_ID:-$(chamber read ${CLUSTER_NAME} ID -q -b ssm)} \
MANAGED_DB_SECURITY_GROUP=${MANAGED_DB_SECURITY_GROUP:-$(chamber read ${CLUSTER_NAME} MANAGED_DB_SECURITY_GROUP -q -b ssm)} \
MANAGED_DB_SUBNET_GROUP=${MANAGED_DB_SUBNET_GROUP:-$(chamber read ${CLUSTER_NAME} MANAGED_DB_SUBNET_GROUP -q -b ssm)} \
SECRET_ENCRYPTION_KEY_ID=${SECRET_ENCRYPTION_KEY_ID:-$(chamber read ${CLUSTER_NAME} SECRET_ENCRYPTION_KEY_ID -q -b ssm)} \
ARGS="CLUSTER_ID=${CLUSTER_ID:-$(chamber read "${CLUSTER_NAME}" ID -q -b ssm)} \
MANAGED_DB_SECURITY_GROUP=${MANAGED_DB_SECURITY_GROUP:-$(chamber read "${CLUSTER_NAME}" MANAGED_DB_SECURITY_GROUP -q -b ssm)} \
MANAGED_DB_SUBNET_GROUP=${MANAGED_DB_SUBNET_GROUP:-$(chamber read "${CLUSTER_NAME}" MANAGED_DB_SUBNET_GROUP -q -b ssm)} \
SECRET_ENCRYPTION_KEY_ID=${SECRET_ENCRYPTION_KEY_ID:-$(chamber read "${CLUSTER_NAME}" SECRET_ENCRYPTION_KEY_ID -q -b ssm)} \
AWS_ROLE_ARN=${FLEETSHARD_SYNC_AWS_ROLE_ARN:-$(chamber read fleetshard-sync AWS_ROLE_ARN -q -b ssm)} \
$ARGS"

Expand Down
4 changes: 2 additions & 2 deletions docs/development/howto-e2e-test-rds.md
Expand Up @@ -25,8 +25,8 @@ At the point in time this documentation was written AWS RDS DB creation and dele
# Prepare environment
export AWS_AUTH_HELPER=aws-saml
export MANAGED_DB_ENABLED=true
# flip the PublicAcessible flag to true in rds.go line 354
export CLUSTER_NAME=local_cluster
# flip the PubliclyAccessible flag to true in rds.go line 514
make binary
./dev/env/scripts/exec_fleetshard_sync.sh
Expand Down
88 changes: 86 additions & 2 deletions fleetshard/pkg/central/cloudprovider/awsclient/rds.go
Expand Up @@ -5,6 +5,7 @@ import (
"context"
"errors"
"fmt"
"strings"
"time"

"github.com/aws/aws-sdk-go/aws"
Expand All @@ -15,11 +16,13 @@ import (
"github.com/stackrox/acs-fleet-manager/fleetshard/config"
"github.com/stackrox/acs-fleet-manager/fleetshard/pkg/central/cloudprovider"
"github.com/stackrox/acs-fleet-manager/fleetshard/pkg/central/postgres"
"k8s.io/apimachinery/pkg/util/rand"
)

const (
dbAvailableStatus = "available"
dbDeletingStatus = "deleting"
dbBackingUpStatus = "backing-up"
dbUser = "rhacs_master"
dbPrefix = "rhacs-"
dbInstanceSuffix = "-db-instance"
Expand Down Expand Up @@ -105,7 +108,15 @@ func (r *RDS) EnsureDBDeprovisioned(databaseID string, skipFinalSnapshot bool) e
// to construct a PostgreSQL connection string. It expects that the database was already provisioned.
func (r *RDS) GetDBConnection(databaseID string) (postgres.DBConnection, error) {
dbCluster, err := r.describeDBCluster(getClusterID(databaseID))

if err != nil {
var awsErr awserr.Error
if errors.As(err, &awsErr) {
if awsErr.Code() == rds.ErrCodeDBClusterNotFoundFault {
err = errors.Join(cloudprovider.ErrDBNotFound, err)
}
}

return postgres.DBConnection{}, err
}

Expand Down Expand Up @@ -154,7 +165,13 @@ func (r *RDS) ensureDBClusterCreated(clusterID, acsInstanceID, masterPassword st
return nil
}

finalSnapshotID, err := r.getFinalSnapshotIDIfExists(clusterID)
Contributor commented:

As a side note, this will not work if the default database ID is overridden, at least the way I started to implement this here: #1198

In that PR the override is currently implemented as a ConfigMap in the data plane. Do you think the override name should actually be stored by FM?

Contributor Author commented:

I was thinking about that. For now I just assumed clusterID would be correct after #1198 in any case because I didn't know how you would handle that.

IIUC all overrides at some point should be done declaratively through GitOps. But we don't want to make the Backup/Restore epic depend on that.

One idea instead of a ConfigMap could be to use the tags you introduced to identify the clusterID. This would work as long as there is a DB running with appropriate tags. With this approach we wouldn't need a dedicated ConfigMap to store the override, just some tags in AWS. But we would still have the problem that we lose the override information once a tenant is deleted. We wouldn't find a fitting final snapshot in that case.

Not sure if we're overthinking here. This is an edge case that is very unlikely to ever happen.

Contributor commented:

The idea with using the tags directly instead of the ConfigMap is worth considering either way, thanks!

I suppose we can discuss elsewhere whether this edge case should be handled automatically, or have some manual steps in an SOP. But it can definitely happen: once a DB override is in place, the restore deleted tenant operation will not work anymore.
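
The concern in this thread can be illustrated with a small, self-contained sketch. It assumes a simplified `snapshot` type and made-up IDs (none of these names come from the PR); it mirrors the idea of picking the newest snapshot whose ID contains "final" among those returned for one cluster ID. If snapshots were stored under an overridden name, a lookup keyed on the default ID would find nothing.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// snapshot is a simplified, hypothetical stand-in for an RDS cluster snapshot.
type snapshot struct {
	ID        string
	CreatedAt time.Time
}

// latestFinalSnapshot returns the ID of the newest snapshot whose ID contains
// "final", or "" if there is none.
func latestFinalSnapshot(snapshots []snapshot) string {
	var bestID string
	var bestTime time.Time
	for _, s := range snapshots {
		if !strings.Contains(s.ID, "final") {
			continue
		}
		if bestID == "" || s.CreatedAt.After(bestTime) {
			bestID, bestTime = s.ID, s.CreatedAt
		}
	}
	return bestID
}

// demoSnapshots returns made-up snapshots keyed by the cluster ID they belong to.
func demoSnapshots() map[string][]snapshot {
	day := func(d int) time.Time {
		return time.Date(2023, time.September, d, 0, 0, 0, 0, time.UTC)
	}
	return map[string][]snapshot{
		"rhacs-default-id": {
			{ID: "rhacs-default-id-aaaaa-final", CreatedAt: day(1)},
			{ID: "rhacs-default-id-bbbbb-final", CreatedAt: day(5)},
			{ID: "rhacs-default-id-automated", CreatedAt: day(6)},
		},
		"overridden-name": {
			{ID: "overridden-name-ccccc-final", CreatedAt: day(7)},
		},
	}
}

func main() {
	all := demoSnapshots()
	// Lookup by the default ID finds the newest final snapshot stored under it.
	fmt.Printf("default ID lookup: %q\n", latestFinalSnapshot(all["rhacs-default-id"]))
	// Lookup by an ID that has no snapshots yields the empty string, so a
	// fresh cluster would be created instead of a restore.
	fmt.Printf("unknown ID lookup: %q\n", latestFinalSnapshot(all["rhacs-other-id"]))
}
```

In this sketch the snapshot under "overridden-name" is never considered when the caller only knows the default ID, which is the edge case discussed above.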

if err != nil {
return err
}

glog.Infof("Initiating provisioning of RDS database cluster %s.", clusterID)

input := &createCentralDBClusterInput{
clusterID: clusterID,
acsInstanceID: acsInstanceID,
Expand All @@ -164,14 +181,60 @@ func (r *RDS) ensureDBClusterCreated(clusterID, acsInstanceID, masterPassword st
dataplaneClusterName: r.dataplaneClusterName,
isTestInstance: isTestInstance,
}
_, err = r.rdsClient.CreateDBCluster(newCreateCentralDBClusterInput(input))

rdsCreateDBClusterInput := newCreateCentralDBClusterInput(input)

if finalSnapshotID != "" {
glog.Infof("Restoring DB cluster: %s from snapshot: %s", clusterID, finalSnapshotID)
return r.restoreDBClusterFromSnapshot(finalSnapshotID, rdsCreateDBClusterInput)
}

return r.createDBCluster(rdsCreateDBClusterInput)
}

func (r *RDS) restoreDBClusterFromSnapshot(snapshotID string, clusterInput *rds.CreateDBClusterInput) error {
_, err := r.rdsClient.RestoreDBClusterFromSnapshot(newRestoreCentralDBClusterInput(snapshotID, clusterInput))
if err != nil {
return fmt.Errorf("restoring DB cluster: %w", err)
}

return nil
}

func (r *RDS) createDBCluster(clusterInput *rds.CreateDBClusterInput) error {
_, err := r.rdsClient.CreateDBCluster(clusterInput)
if err != nil {
return fmt.Errorf("creating DB cluster: %w", err)
}

return nil
}

func (r *RDS) getFinalSnapshotIDIfExists(clusterID string) (string, error) {
snapshotsOut, err := r.rdsClient.DescribeDBClusterSnapshots(&rds.DescribeDBClusterSnapshotsInput{
DBClusterIdentifier: &clusterID,
})

if err != nil {
return "", fmt.Errorf("checking if final snapshot for clusterID: %s exists: %w", clusterID, err)
}

var mostRecentSnapshotID string
var mostRecentSnapshotTime *time.Time
for _, snapshot := range snapshotsOut.DBClusterSnapshots {
if !strings.Contains(*snapshot.DBClusterSnapshotIdentifier, "final") {
continue
}

if mostRecentSnapshotTime == nil || mostRecentSnapshotTime.Before(*snapshot.SnapshotCreateTime) {
mostRecentSnapshotID = *snapshot.DBClusterSnapshotIdentifier
mostRecentSnapshotTime = snapshot.SnapshotCreateTime
}
}

return mostRecentSnapshotID, nil
}

func (r *RDS) ensureDBInstanceCreated(instanceID, clusterID, acsInstanceID string, isTestInstance bool) error {
instanceExists, _, err := r.instanceStatus(instanceID)
if err != nil {
Expand Down Expand Up @@ -227,6 +290,10 @@ func (r *RDS) ensureClusterDeleted(clusterID string, skipFinalSnapshot bool) err
return nil
}

if clusterStatus == dbBackingUpStatus {
return cloudprovider.ErrDBBackupInProgress
}

if clusterStatus != dbDeletingStatus {
glog.Infof("Initiating deprovisioning of RDS database cluster %s.", clusterID)
_, err := r.rdsClient.DeleteDBCluster(newDeleteCentralDBClusterInput(clusterID, skipFinalSnapshot))
Expand Down Expand Up @@ -415,6 +482,23 @@ func newCreateCentralDBClusterInput(input *createCentralDBClusterInput) *rds.Cre
return awsInput
}

func newRestoreCentralDBClusterInput(snapshotID string, input *rds.CreateDBClusterInput) *rds.RestoreDBClusterFromSnapshotInput {
restoreInput := &rds.RestoreDBClusterFromSnapshotInput{
DBClusterIdentifier: input.DBClusterIdentifier,
Engine: input.Engine,
EngineVersion: input.EngineVersion,
VpcSecurityGroupIds: input.VpcSecurityGroupIds,
PubliclyAccessible: input.PubliclyAccessible,
DBSubnetGroupName: input.DBSubnetGroupName,
ServerlessV2ScalingConfiguration: input.ServerlessV2ScalingConfiguration,
Tags: input.Tags,
SnapshotIdentifier: &snapshotID,
EnableCloudwatchLogsExports: input.EnableCloudwatchLogsExports,
}

return restoreInput
}

type createCentralDBInstanceInput struct {
clusterID string
instanceID string
Expand Down Expand Up @@ -482,7 +566,7 @@ func newRdsClient() (*rds.RDS, error) {
}

func getFinalSnapshotID(clusterID string) *string {
return aws.String(fmt.Sprintf("%s-%s", clusterID, "final"))
return aws.String(fmt.Sprintf("%s-%s-%s", clusterID, rand.String(10), "final"))
}

func getInstanceType(isTestInstance bool) string {
Expand Down
32 changes: 18 additions & 14 deletions fleetshard/pkg/central/cloudprovider/awsclient/rds_test.go
Expand Up @@ -67,31 +67,32 @@ func waitForClusterToBeDeleted(ctx context.Context, rdsClient *RDS, clusterID st
}
}

func waitForFinalSnapshotToExist(ctx context.Context, rdsClient *RDS, clusterID string) (bool, error) {
func waitForFinalSnapshotToExist(ctx context.Context, rdsClient *RDS, clusterID string) (bool, string, error) {

ticker := time.NewTicker(awsRetrySeconds * time.Second)
for {
select {
case <-ticker.C:
snapshotOut, err := rdsClient.rdsClient.DescribeDBClusterSnapshots(&rds.DescribeDBClusterSnapshotsInput{
DBClusterSnapshotIdentifier: getFinalSnapshotID(clusterID),
DBClusterIdentifier: &clusterID,
})

if err != nil {
if awsErr, ok := err.(awserr.Error); ok {
if awsErr.Code() != rds.ErrCodeDBClusterSnapshotNotFoundFault {
return false, err
return false, "", err
}

continue
}
}

if snapshotOut != nil {
return len(snapshotOut.DBClusterSnapshots) == 1, nil
if snapshotOut != nil && len(snapshotOut.DBClusterSnapshots) == 1 {
return true, *snapshotOut.DBClusterSnapshots[0].DBClusterSnapshotIdentifier, nil
}

case <-ctx.Done():
return false, fmt.Errorf("waiting for final DB snapshot: %w", ctx.Err())
return false, "", fmt.Errorf("waiting for final DB snapshot: %w", ctx.Err())
}

}
Expand Down Expand Up @@ -164,16 +165,18 @@ func TestRDSProvisioning(t *testing.T) {
require.NoError(t, err)
assert.True(t, clusterDeleted)

// Always attempt to delete the final snapshot if it exists
defer func() {
_, err := rdsClient.rdsClient.DeleteDBClusterSnapshot(
&rds.DeleteDBClusterSnapshotInput{DBClusterSnapshotIdentifier: getFinalSnapshotID(clusterID)},
)
snapshotExists, snapshotID, err := waitForFinalSnapshotToExist(deleteCtx, rdsClient, clusterID)

assert.NoError(t, err)
}()
if snapshotExists {
defer func() {
_, err := rdsClient.rdsClient.DeleteDBClusterSnapshot(
&rds.DeleteDBClusterSnapshotInput{DBClusterSnapshotIdentifier: &snapshotID},
)

assert.NoError(t, err)
}()
}

snapshotExists, err := waitForFinalSnapshotToExist(deleteCtx, rdsClient, clusterID)
require.NoError(t, err)
require.True(t, snapshotExists)
}
Expand All @@ -190,6 +193,7 @@ func TestGetDBConnection(t *testing.T) {
var awsErr awserr.Error
require.ErrorAs(t, err, &awsErr)
assert.Equal(t, awsErr.Code(), rds.ErrCodeDBClusterNotFoundFault)
require.ErrorIs(t, err, cloudprovider.ErrDBNotFound)
}

func TestGetAccountQuotas(t *testing.T) {
Expand Down
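
The test helper above polls AWS on a ticker until a snapshot shows up or the context ends. A minimal sketch of that pattern, independent of AWS, might look like the following. The `pollUntil` and `demoPoll` names and the `check` callback are hypothetical, and unlike the helper in the diff this sketch also stops the ticker when it returns.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls check on every tick until it reports done, returns an
// error, or ctx is cancelled.
func pollUntil(ctx context.Context, interval time.Duration, check func() (string, bool, error)) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			id, done, err := check()
			if err != nil {
				return "", err
			}
			if done {
				return id, nil
			}
		case <-ctx.Done():
			return "", fmt.Errorf("waiting for final DB snapshot: %w", ctx.Err())
		}
	}
}

// demoPoll simulates a snapshot that becomes visible on the third check.
func demoPoll() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	calls := 0
	return pollUntil(ctx, 5*time.Millisecond, func() (string, bool, error) {
		calls++
		if calls < 3 {
			return "", false, nil
		}
		return "example-final", true, nil
	})
}

func main() {
	id, err := demoPoll()
	fmt.Println(id, err)
}
```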
7 changes: 7 additions & 0 deletions fleetshard/pkg/central/cloudprovider/dbclient.go
Expand Up @@ -3,6 +3,7 @@ package cloudprovider

import (
"context"
"errors"

"github.com/stackrox/acs-fleet-manager/fleetshard/pkg/central/postgres"
)
Expand Down Expand Up @@ -38,6 +39,12 @@ const (
DBSnapshots
)

// ErrDBBackupInProgress is returned if an action failed because a DB backup is in progress
var ErrDBBackupInProgress = errors.New("DB Backup in Progress")

// ErrDBNotFound is returned if an action failed because an expected DB is not found
var ErrDBNotFound = errors.New("DB not found")

// AccountQuotaValue holds quota data for services, as a pair of currently Used out of Max
type AccountQuotaValue struct {
Used int64
Expand Down
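
As a rough illustration of how a sentinel such as `ErrDBNotFound` can be combined with the provider's own error, here is a sketch using `errors.Join` (Go 1.20 or later). The `providerError` type, `lookup`, and `classify` are hypothetical stand-ins; the point is only that both `errors.Is` on the sentinel and `errors.As` on the concrete error keep working after joining.

```go
package main

import (
	"errors"
	"fmt"
)

// errDBNotFound plays the role of a package-level sentinel error.
var errDBNotFound = errors.New("DB not found")

// providerError is a hypothetical stand-in for a cloud SDK error type.
type providerError struct{ code string }

func (e *providerError) Error() string { return "provider error: " + e.code }

// lookup returns a joined error when the database does not exist.
func lookup(found bool) error {
	if found {
		return nil
	}
	var err error = &providerError{code: "DBClusterNotFoundFault"}
	return errors.Join(errDBNotFound, err)
}

// classify reports whether err matches the sentinel and, if present,
// the provider error code carried alongside it.
func classify(err error) (notFound bool, code string) {
	var pe *providerError
	if errors.As(err, &pe) {
		code = pe.code
	}
	return errors.Is(err, errDBNotFound), code
}

func main() {
	// Wrapping with %w keeps both joined errors reachable.
	err := fmt.Errorf("getting DB connection: %w", lookup(false))
	notFound, code := classify(err)
	fmt.Println(notFound, code)
}
```

Callers that only care about "not found" can test the sentinel, while tests or logs that need the provider code can still extract it.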
17 changes: 16 additions & 1 deletion fleetshard/pkg/central/reconciler/reconciler.go
Expand Up @@ -987,8 +987,14 @@ func (r *CentralReconciler) ensureCentralDeleted(ctx context.Context, remoteCent
if r.managedDBEnabled {
// skip Snapshot for remoteCentral created by probe
skipSnapshot := remoteCentral.Metadata.Internal

err = r.managedDBProvisioningClient.EnsureDBDeprovisioned(remoteCentral.Id, skipSnapshot)
if err != nil {
if errors.Is(err, cloudprovider.ErrDBBackupInProgress) {
glog.Infof("Cannot delete Central DB for: %s, backup in progress", remoteCentral.Metadata.Namespace)
return false, nil
}

return false, fmt.Errorf("deprovisioning DB: %v", err)
}

Expand Down Expand Up @@ -1103,7 +1109,16 @@ func (r *CentralReconciler) getCentralDBConnectionString(ctx context.Context, re

dbConnection, err := r.managedDBProvisioningClient.GetDBConnection(remoteCentral.Id)
if err != nil {
return "", fmt.Errorf("getting RDS DB connection data: %w", err)
if !errors.Is(err, cloudprovider.ErrDBNotFound) {
return "", fmt.Errorf("getting RDS DB connection data: %w", err)
}

glog.Infof("expected DB for %s not found, trying to restore...", remoteCentral.Id)
// Using no password because we try to restore from backup
err := r.managedDBProvisioningClient.EnsureDBProvisioned(ctx, remoteCentral.Id, remoteCentral.Id, "", remoteCentral.Metadata.Internal)
if err != nil {
return "", fmt.Errorf("trying to restore DB: %w", err)
}
}
return dbConnection.GetConnectionForUserAndDB(dbCentralUserName, postgres.CentralDBName).WithSSLRootCert(postgres.DatabaseCACertificatePathCentral).AsConnectionString(), nil
}
Expand Down
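
The deletion path in this file treats "backup in progress" as "not deleted yet, no error", so a later reconcile round can try again. A minimal sketch of that control flow, assuming a hypothetical `fakeDBClient` and a simple retry loop in place of the real reconciler, could look like this:

```go
package main

import (
	"errors"
	"fmt"
)

var errDBBackupInProgress = errors.New("DB backup in progress")

// fakeDBClient is a hypothetical client that reports a running backup
// for the first backupRounds calls.
type fakeDBClient struct{ backupRounds int }

func (c *fakeDBClient) EnsureDBDeprovisioned(id string) error {
	if c.backupRounds > 0 {
		c.backupRounds--
		return errDBBackupInProgress
	}
	return nil
}

// ensureDBDeleted reports (false, nil) while a backup blocks deletion,
// so the caller retries later instead of failing.
func ensureDBDeleted(c *fakeDBClient, id string) (bool, error) {
	if err := c.EnsureDBDeprovisioned(id); err != nil {
		if errors.Is(err, errDBBackupInProgress) {
			return false, nil
		}
		return false, fmt.Errorf("deprovisioning DB: %w", err)
	}
	return true, nil
}

// reconcileUntilDeleted returns the round in which deletion succeeded.
func reconcileUntilDeleted(c *fakeDBClient, id string, maxRounds int) (int, error) {
	for round := 1; round <= maxRounds; round++ {
		deleted, err := ensureDBDeleted(c, id)
		if err != nil {
			return round, err
		}
		if deleted {
			return round, nil
		}
	}
	return maxRounds, errors.New("gave up waiting for DB deletion")
}

func main() {
	rounds, err := reconcileUntilDeleted(&fakeDBClient{backupRounds: 2}, "example-id", 5)
	fmt.Println(rounds, err)
}
```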
2 changes: 1 addition & 1 deletion fleetshard/pkg/central/reconciler/reconciler_test.go
Expand Up @@ -578,7 +578,7 @@ func TestReconcileDeleteWithManagedDB(t *testing.T) {
assert.Len(t, managedDBProvisioningClient.EnsureDBProvisionedCalls(), 1)

deletedCentral := simpleManagedCentral
deletedCentral.Metadata.DeletionTimestamp = "2006-01-02T15:04:05Z07:00"
deletedCentral.Metadata.DeletionTimestamp = "2006-01-02T15:04:05+00:00"

// trigger deletion
managedDBProvisioningClient.EnsureDBProvisionedFunc = func(_ context.Context, _ string, _ string, _ string, _ bool) error {
Expand Down
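
One plausible reason for the timestamp change above (an assumption, since the PR does not state it) is that the old value is the RFC 3339 layout string itself rather than a valid timestamp. The sketch below checks both strings with `time.Parse`; the helper name `parsesAsRFC3339` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// parsesAsRFC3339 reports whether s is accepted by time.Parse with the
// time.RFC3339 layout.
func parsesAsRFC3339(s string) bool {
	_, err := time.Parse(time.RFC3339, s)
	return err == nil
}

func main() {
	for _, s := range []string{
		"2006-01-02T15:04:05Z07:00", // the layout string, not a real timestamp
		"2006-01-02T15:04:05+00:00", // a timestamp with an explicit UTC offset
	} {
		fmt.Printf("%-28s parses as RFC 3339: %v\n", s, parsesAsRFC3339(s))
	}
}
```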
43 changes: 43 additions & 0 deletions internal/dinosaur/pkg/api/admin/private/api/openapi.yaml
Expand Up @@ -376,6 +376,49 @@ paths:
security:
- Bearer: []
summary: Update a Central instance by ID
/api/rhacs/v1/admin/centrals/{id}/restore:
post:
parameters:
- description: The ID of record
in: path
name: id
required: true
schema:
type: string
responses:
"201":
description: Request to restore accepted
"400":
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
description: Validation error occurred
"401":
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
description: Auth token is invalid
"403":
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
description: User is not authorised to access the service
"404":
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
description: No Central found with the specified ID
"500":
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
description: Unexpected error occurred
summary: Restore a central tenant that was already deleted
/api/rhacs/v1/admin/centrals/db/{id}:
delete:
operationId: deleteDbCentralById
Expand Down
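
A client call against the new restore endpoint might be sketched as follows. The path and the 201 success status come from the spec above; the bearer token header, the `restoreCentral` and `demoRestore` names, and the in-process fake server are assumptions made only so the example runs on its own.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// restoreCentral sends POST {baseURL}/api/rhacs/v1/admin/centrals/{id}/restore
// and treats any status other than 201 as an error.
func restoreCentral(ctx context.Context, client *http.Client, baseURL, id, token string) error {
	endpoint := fmt.Sprintf("%s/api/rhacs/v1/admin/centrals/%s/restore", baseURL, url.PathEscape(id))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, nil)
	if err != nil {
		return fmt.Errorf("building restore request: %w", err)
	}
	// Assumption: the admin API expects a bearer token.
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("sending restore request: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("restore request for %q failed with status %d", id, resp.StatusCode)
	}
	return nil
}

// demoRestore runs restoreCentral against a fake server that only knows
// the tenant "known-id".
func demoRestore(id string) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/api/rhacs/v1/admin/centrals/known-id/restore" {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusCreated)
	}))
	defer srv.Close()

	return restoreCentral(context.Background(), srv.Client(), srv.URL, id, "example-token")
}

func main() {
	fmt.Println("known-id:", demoRestore("known-id"))
	fmt.Println("other-id:", demoRestore("other-id"))
}
```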