Adding a unique constraint can fail under high load #485

Open · andrew-farries opened this issue Nov 25, 2024 · 3 comments
Labels: bug (Something isn't working)
Milestone: v1

Comments
@andrew-farries (Collaborator) commented Nov 25, 2024

There is a race condition when adding a unique constraint to a column.

The race can cause the operation to fail, particularly when the target database is under high load at the time of the migration.

Reproduction 1

Create a pair of migrations:

01_create_table.json
{
  "name": "01_create_table",
  "operations": [
    {
      "create_table": {
        "name": "movies",
        "columns": [
          {
            "name": "id",
            "type": "serial",
            "pk": true
          },
          {
            "name": "name",
            "type": "varchar(255)"
          }
        ]
      }
    }
  ]
}
02_set_unique.json
{
  "name": "02_set_unique",
  "operations": [
    {
      "alter_column": {
        "table": "movies",
        "column": "name",
        "unique": {
          "name": "movies_name_unique"
        },
        "up": "name",
        "down": "name"
      }
    }
  ]
}

Apply the first migration:

pgroll start 01_create_table.json --complete

In the target database, generate some load on the system:

CREATE TABLE test(a int);
INSERT INTO test(a) SELECT generate_series(1, 10000000);

While the above INSERT operation is running, apply the second migration:

pgroll start 02_set_unique.json --complete

The second migration is likely to fail with the following error:

Failed to complete migration: unable to execute complete operation: pq: index "movies_name_unique" is not valid
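The error shows that the index created during Start exists but is still marked invalid (pg_index.indisvalid = false) at the moment Complete tries to use it.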

Reproduction 2

The same problem also exists when a column with a UNIQUE constraint is duplicated, for example to change the nullability of the column. To reproduce:

Create a pair of migrations:

01_create_table.json
{
  "name": "01_create_table",
  "operations": [
    {
      "create_table": {
        "name": "movies",
        "columns": [
          {
            "name": "id",
            "type": "serial",
            "pk": true
          },
          {
            "name": "name",
            "type": "varchar(255)",
            "unique": true,
            "nullable": true
          }
        ]
      }
    }
  ]
}
02_set_notnull.json
{
  "name": "02_set_notnull",
  "operations": [
    {
      "alter_column": {
        "table": "movies",
        "column": "name",
        "nullable": false,
        "up": "name",
        "down": "name"
      }
    }
  ]
}

Now follow the same steps as for the first reproduction: apply the first migration, generate load, apply the second migration. The second migration will likely fail with the same error as in the first reproduction.

Notes

In the first reproduction, the operation fails because of a race between the movies_name_unique index being marked valid after its concurrent creation on operation Start:

func (o *OpSetUnique) addUniqueIndex(ctx context.Context, conn db.DB) error {
	// create unique index concurrently
	_, err := conn.ExecContext(ctx, fmt.Sprintf("CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS %s ON %s (%s)",
		pq.QuoteIdentifier(o.Name),
		pq.QuoteIdentifier(o.Table),
		pq.QuoteIdentifier(TemporaryName(o.Column))))
	return err
}

and the unique constraint being added using that index on operation Complete:

func (o *OpSetUnique) Complete(ctx context.Context, conn db.DB, tr SQLTransformer, s *schema.Schema) error {
	// Create a unique constraint using the unique index
	_, err := conn.ExecContext(ctx, fmt.Sprintf("ALTER TABLE IF EXISTS %s ADD CONSTRAINT %s UNIQUE USING INDEX %s",
		pq.QuoteIdentifier(o.Table),
		pq.QuoteIdentifier(o.Name),
		pq.QuoteIdentifier(o.Name)))
	return err
}
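Postgres only allows ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX to use an index that is already valid, so if Complete runs while the index built concurrently during Start is still being validated, the statement fails with the error shown in the reproductions.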

In the second reproduction, the race is between the concurrent index creation performed by the column duplicator:

func (d *duplicatorStmtBuilder) duplicateUniqueConstraints(withoutConstraint []string, colNames ...string) []string {
	stmts := make([]string, 0, len(d.table.UniqueConstraints))
	for _, uc := range d.table.UniqueConstraints {
		if slices.Contains(withoutConstraint, uc.Name) {
			continue
		}
		if duplicatedMember, constraintColumns := d.allConstraintColumns(uc.Columns, colNames...); duplicatedMember {
			stmts = append(stmts, fmt.Sprintf(cCreateUniqueIndexSQL,
				pq.QuoteIdentifier(DuplicationName(uc.Name)),
				pq.QuoteIdentifier(d.table.Name),
				strings.Join(quoteColumnNames(constraintColumns), ", "),
			))
		}
	}
	return stmts
}

and the unique constraint being created from that index when the column is renamed on operation Complete:

// Rename any `UNIQUE` indexes on the duplicated column and use them to
// create `UNIQUE` constraints.
for _, ui := range table.Indexes {
	if !IsDuplicatedName(ui.Name) {
		continue
	}
	if !ui.Unique {
		continue
	}
	if slices.Contains(ui.Columns, TemporaryName(column.Name)) {
		// Rename the unique index to its original name
		renameIndexSQL := fmt.Sprintf(cRenameIndexSQL,
			pq.QuoteIdentifier(ui.Name),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
		)
		_, err = conn.ExecContext(ctx, renameIndexSQL)
		if err != nil {
			return fmt.Errorf("failed to rename unique index %q: %w", ui.Name, err)
		}
		// Create a unique constraint using the unique index
		createUniqueConstraintSQL := fmt.Sprintf(cCreateUniqueConstraintSQL,
			pq.QuoteIdentifier(table.Name),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
			pq.QuoteIdentifier(StripDuplicationPrefix(ui.Name)),
		)
		_, err = conn.ExecContext(ctx, createUniqueConstraintSQL)
		if err != nil {
			return fmt.Errorf("failed to create unique constraint from index %q: %w", ui.Name, err)
		}
		// Index no longer exists, remove it from the table
		delete(table.Indexes, ui.Name)
	}
}

In both cases the correct behaviour is to wait for the concurrently created index to be marked as valid before returning.
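As a minimal sketch of that wait, the Complete path could poll pg_index until indisvalid is set for the new index. The function name, the use of *sql.DB rather than pgroll's db.DB wrapper, the unqualified relname lookup and the one-second poll interval are all illustrative assumptions, not pgroll's actual API:

package migrations

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// waitForIndexValid polls the pg_index catalog until the named index is
// marked valid (indisvalid = true). It returns early if the index does not
// exist or the context is cancelled. Sketch only; not pgroll's actual API.
func waitForIndexValid(ctx context.Context, conn *sql.DB, indexName string) error {
	const q = `SELECT i.indisvalid
		FROM pg_catalog.pg_index i
		JOIN pg_catalog.pg_class c ON c.oid = i.indexrelid
		WHERE c.relname = $1`

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		var valid bool
		err := conn.QueryRowContext(ctx, q, indexName).Scan(&valid)
		switch {
		case errors.Is(err, sql.ErrNoRows):
			return fmt.Errorf("index %q does not exist", indexName)
		case err != nil:
			return fmt.Errorf("checking validity of index %q: %w", indexName, err)
		case valid:
			return nil
		}

		// Index exists but is not yet valid; wait and poll again.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

In practice the wait would also need an upper bound, or a liveness check such as querying pg_stat_progress_create_index, because an index left behind by a failed concurrent build remains invalid indefinitely rather than eventually becoming valid.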

@andrew-farries andrew-farries added the bug Something isn't working label Nov 25, 2024
@andrew-farries andrew-farries added this to the v1 milestone Nov 25, 2024
@agedemenli
Copy link
Contributor

We could add a check that periodically looks at the catalog table pg_index to see whether the index has been validated, so that we can wait until it is valid before returning.
Alternatively, we could use the Postgres view pg_stat_progress_analyze. I haven't used it before, but it could maybe be helpful for cases like these.
One other thing that comes to mind is to introduce a locking mechanism to prevent the race, but that feels like overkill.

@andrew-farries (Collaborator, Author)

I think waiting for the index to become valid before returning is a good solution here 👍
I'd avoid anything that involves locking the affected table.

@gulcin commented Jan 21, 2025

Regarding the Slack discussion of this issue:
The best approach is to wait until the index is valid, and to return an error if the index is not created successfully, since that implies other problems (unique constraint violations, deadlocks, etc.).

I agree with Andrew:

> Maybe the pg_stat_progress_create_index view can be queried to find out if Postgres is still working on the index or has given up. (@andrew-farries)

In the case where we do get an error, the right action imo is not reindexing, but dropping the invalid index and re-running CREATE INDEX CONCURRENTLY, as sketched below.
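A minimal sketch of that recovery path, under the same caveats as the earlier sketch (illustrative names and a plain *sql.DB rather than pgroll's db.DB wrapper):

package migrations

import (
	"context"
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

// retryConcurrentIndexBuild drops the invalid index left behind by a failed
// CREATE INDEX CONCURRENTLY and re-runs the concurrent build from scratch.
// Sketch only; the name and signature are illustrative.
func retryConcurrentIndexBuild(ctx context.Context, conn *sql.DB, index, table, column string) error {
	// DROP INDEX CONCURRENTLY so that cleaning up the invalid index does
	// not take a blocking lock on the table either.
	if _, err := conn.ExecContext(ctx, fmt.Sprintf("DROP INDEX CONCURRENTLY IF EXISTS %s",
		pq.QuoteIdentifier(index))); err != nil {
		return fmt.Errorf("dropping invalid index %q: %w", index, err)
	}

	// Re-run the concurrent build from scratch.
	_, err := conn.ExecContext(ctx, fmt.Sprintf("CREATE UNIQUE INDEX CONCURRENTLY %s ON %s (%s)",
		pq.QuoteIdentifier(index),
		pq.QuoteIdentifier(table),
		pq.QuoteIdentifier(column)))
	return err
}

Dropping and rebuilding is preferred over REINDEX here because a plain REINDEX takes a lock that blocks writes, and REINDEX CONCURRENTLY (Postgres 12+) can itself leave an invalid index behind if it fails.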
