Adding a unique constraint can fail under high load #485

Open · andrew-farries opened this issue Nov 25, 2024 · 3 comments
Labels: bug (Something isn't working)
Milestone: v1

Comments
@andrew-farries (Collaborator) commented Nov 25, 2024

There is a race condition when adding a unique constraint to a column.

The race can cause the operation to fail, particularly when the target database is under high load at the time of the migration.

Reproduction 1

Create a pair of migrations:

01_create_table.json
{
  "name": "01_create_table",
  "operations": [
    {
      "create_table": {
        "name": "movies",
        "columns": [
          {
            "name": "id",
            "type": "serial",
            "pk": true
          },
          {
            "name": "name",
            "type": "varchar(255)"
          }
        ]
      }
    }
  ]
}
02_set_unique.json
{
  "name": "02_set_unique",
  "operations": [
    {
      "alter_column": {
        "table": "movies",
        "column": "name",
        "unique": {
          "name": "movies_name_unique"
        },
        "up": "name",
        "down": "name"
      }
    }
  ]
}

Apply the first migration:

pgroll start 01_create_table.json --complete

In the target database, generate some load on the system:

CREATE TABLE test(a int);
INSERT INTO test(a) SELECT generate_series(1, 10000000);

While the above INSERT operation is running, apply the second migration:

pgroll start 02_set_unique.json --complete

The second migration is likely to fail with the following error:

Failed to complete migration: unable to execute complete operation: pq: index "movies_name_unique" is not valid
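The error shows that the index created during Start exists but is still marked invalid (pg_index.indisvalid = false) at the moment Complete tries to use it.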

Reproduction 2

The same problem also exists when a column with a UNIQUE constraint is duplicated, for example to change the nullability of the column. To reproduce:

Create a pair of migrations:

01_create_table.json
{
  "name": "01_create_table",
  "operations": [
    {
      "create_table": {
        "name": "movies",
        "columns": [
          {
            "name": "id",
            "type": "serial",
            "pk": true
          },
          {
            "name": "name",
            "type": "varchar(255)",
            "unique": true,
            "nullable": true
          }
        ]
      }
    }
  ]
}
02_set_notnull.json
{
  "name": "02_set_notnull",
  "operations": [
    {
      "alter_column": {
        "table": "movies",
        "column": "name",
        "nullable": false,
        "up": "name",
        "down": "name"
      }
    }
  ]
}

Now follow the same steps as for the first reproduction: apply the first migration, generate load, apply the second migration. The second migration will likely fail with the same error as in the first reproduction.

Notes

In the first reproduction, the operation fails because of a race between the movies_name_unique index being marked valid after its concurrent creation on operation Start:

func (o *OpSetUnique) addUniqueIndex(ctx context.Context, conn db.DB) error {
	// create unique index concurrently
	_, err := conn.ExecContext(ctx, fmt.Sprintf("CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS %s ON %s (%s)",
		pq.QuoteIdentifier(o.Name),
		pq.QuoteIdentifier(o.Table),
		pq.QuoteIdentifier(TemporaryName(o.Column))))
	return err
}

and the unique constraint being added using that index on operation Complete:

func (o *OpSetUnique) Complete(ctx context.Context, conn db.DB, tr SQLTransformer, s *schema.Schema) error {
	// Create a unique constraint using the unique index
	_, err := conn.ExecContext(ctx, fmt.Sprintf("ALTER TABLE IF EXISTS %s ADD CONSTRAINT %s UNIQUE USING INDEX %s",
		pq.QuoteIdentifier(o.Table),
		pq.QuoteIdentifier(o.Name),
		pq.QuoteIdentifier(o.Name)))
	return err
}
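Postgres only allows ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX to use an index that is already valid, so if Complete runs while the index built concurrently during Start is still being validated, the statement fails with the error shown in the reproductions.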

In the second reproduction, the race is between the concurrent index creation performed by the column duplicator:

func (d *duplicatorStmtBuilder) duplicateUniqueConstraints(withoutConstraint []string, colNames ...string) []string {
	stmts := make([]string, 0, len(d.table.UniqueConstraints))
	for _, uc := range d.table.UniqueConstraints {
		if slices.Contains(withoutConstraint, uc.Name) {
			continue
		}
		if duplicatedMember, constraintColumns := d.allConstraintColumns(uc.Columns, colNames...); duplicatedMember {
			stmts = append(stmts, fmt.Sprintf(cCreateUniqueIndexSQL,
				pq.QuoteIdentifier(DuplicationName(uc.Name)),
				pq.QuoteIdentifier(d.table.Name),
				strings.Join(quoteColumnNames(constraintColumns), ", "),
			))
		}
	}
	return stmts
}

and the unique constraint being created from that index when the column is renamed on operation Complete:

// Rename any `UNIQUE` indexes on the duplicated column and use them to
// create `UNIQUE` constraints.
for _, ui := range table.Indexes {
	if !IsDuplicatedName(ui.Name) {
		continue
	}
	if !ui.Unique {
		continue
	}
	if slices.Contains(ui.Columns, TemporaryName(column.Name)) {
		// Rename the unique index to its original name
		renameIndexSQL := fmt.Sprintf(cRenameIndexSQL,
			pq.QuoteIdentifier(ui.Name),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
		)
		_, err = conn.ExecContext(ctx, renameIndexSQL)
		if err != nil {
			return fmt.Errorf("failed to rename unique index %q: %w", ui.Name, err)
		}
		// Create a unique constraint using the unique index
		createUniqueConstraintSQL := fmt.Sprintf(cCreateUniqueConstraintSQL,
			pq.QuoteIdentifier(table.Name),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
		)
		_, err = conn.ExecContext(ctx, createUniqueConstraintSQL)
		if err != nil {
			return fmt.Errorf("failed to create unique constraint from index %q: %w", ui.Name, err)
		}
		// Index no longer exists, remove it from the table
		delete(table.Indexes, ui.Name)
	}
}

In both cases the correct behaviour is to wait for the concurrently created index to be marked as valid before returning.
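As a minimal sketch of that wait, the Complete path could poll pg_index until indisvalid is set for the new index. The function name, the use of *sql.DB rather than pgroll's db.DB wrapper, the unqualified relname lookup and the one-second poll interval are all illustrative assumptions, not pgroll's actual API:

package migrations

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// waitForIndexValid polls the pg_index catalog until the named index is
// marked valid (indisvalid = true). It returns early if the index does not
// exist or the context is cancelled. Sketch only; not pgroll's actual API.
func waitForIndexValid(ctx context.Context, conn *sql.DB, indexName string) error {
	const q = `SELECT i.indisvalid
		FROM pg_catalog.pg_index i
		JOIN pg_catalog.pg_class c ON c.oid = i.indexrelid
		WHERE c.relname = $1`

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		var valid bool
		err := conn.QueryRowContext(ctx, q, indexName).Scan(&valid)
		switch {
		case errors.Is(err, sql.ErrNoRows):
			return fmt.Errorf("index %q does not exist", indexName)
		case err != nil:
			return fmt.Errorf("checking validity of index %q: %w", indexName, err)
		case valid:
			return nil
		}

		// Index exists but is not yet valid; wait and poll again.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

In practice the wait would also need an upper bound, or a liveness check such as querying pg_stat_progress_create_index, because an index left behind by a failed concurrent build remains invalid indefinitely rather than eventually becoming valid.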

@andrew-farries andrew-farries added the bug Something isn't working label Nov 25, 2024
@andrew-farries andrew-farries added this to the v1 milestone Nov 25, 2024
@agedemenli
Copy link
Contributor

We could add a check that periodically looks at the catalog table pg_index to see whether the index has been validated, so that we can wait until it is valid before returning.
Alternatively, we could use the Postgres view pg_stat_progress_analyze. I haven't used it before, but it could maybe be helpful for cases like these.
One other thing that comes to mind is to introduce a locking mechanism to prevent the race, but that feels like overkill.

@andrew-farries (Collaborator, Author)

I think waiting for the index to become valid before returning is a good solution here 👍
I'd avoid anything that involves locking the affected table.

@gulcin commented Jan 21, 2025

Regarding the Slack discussion of this issue:
The best approach is to wait until the index is valid, and to return an error if the index is not created successfully, since that implies other problems (unique constraint violations, deadlocks, etc.).

I agree with Andrew:

> Maybe the pg_stat_progress_create_index view can be queried to find out if Postgres is still working on the index or has given up. (@andrew-farries)

In the case where we do get an error, the right action imo is not reindexing, but dropping the invalid index and re-running CREATE INDEX CONCURRENTLY, as sketched below.
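A minimal sketch of that recovery path, under the same caveats as the earlier sketch (illustrative names and a plain *sql.DB rather than pgroll's db.DB wrapper):

package migrations

import (
	"context"
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

// retryConcurrentIndexBuild drops the invalid index left behind by a failed
// CREATE INDEX CONCURRENTLY and re-runs the concurrent build from scratch.
// Sketch only; the name and signature are illustrative.
func retryConcurrentIndexBuild(ctx context.Context, conn *sql.DB, index, table, column string) error {
	// DROP INDEX CONCURRENTLY so that cleaning up the invalid index does
	// not take a blocking lock on the table either.
	if _, err := conn.ExecContext(ctx, fmt.Sprintf("DROP INDEX CONCURRENTLY IF EXISTS %s",
		pq.QuoteIdentifier(index))); err != nil {
		return fmt.Errorf("dropping invalid index %q: %w", index, err)
	}

	// Re-run the concurrent build from scratch.
	_, err := conn.ExecContext(ctx, fmt.Sprintf("CREATE UNIQUE INDEX CONCURRENTLY %s ON %s (%s)",
		pq.QuoteIdentifier(index),
		pq.QuoteIdentifier(table),
		pq.QuoteIdentifier(column)))
	return err
}

Dropping and rebuilding is preferred over REINDEX here because a plain REINDEX takes a lock that blocks writes, and REINDEX CONCURRENTLY (Postgres 12+) can itself leave an invalid index behind if it fails.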
