
New plan for trace storage work #10

Open · 41 of 60 tasks
StachuDotNet opened this issue Mar 1, 2023 · 2 comments

StachuDotNet commented Mar 1, 2023

DB Clone

  • prototype the DB clone (go through the steps, record downtimes)
    • events table (the old one, not events_v2)
      • verify that nothing is querying the old events table (see the query sketch after this list)
      • check whether the events table has any foreign keys
        • if so, remove them
      • delete the events table from dark-west
    • do the DB clone
    • update the DB clone
      • set zoning to single-zone (applies to both servers and storage)
      • update the Postgres version to 14
        (conclusion: probably brings unnecessary risk)
      • turn down the CPUs by ~1/3
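A rough sketch of the two verification queries above, assuming direct psql access and that the legacy table is literally named events. Note the pg_stat counters only reflect activity since the last statistics reset, so a zero count is suggestive rather than conclusive.

    -- Check whether anything has touched the events table: seq_scan and
    -- idx_scan staying at 0 (since the last stats reset) suggests it is unused.
    SELECT relname, seq_scan, idx_scan, n_live_tup
    FROM pg_stat_user_tables
    WHERE relname = 'events';

    -- List any foreign-key constraints defined on, or referencing, events.
    SELECT conname,
           conrelid::regclass  AS on_table,
           confrelid::regclass AS references_table
    FROM pg_constraint
    WHERE contype = 'f'
      AND (conrelid = 'events'::regclass OR confrelid = 'events'::regclass);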

Drop Events table

  • drop events table in the codebase
    • review usages of "events" in the codebase - see if we're missing anything
    • investigate the connection to worker_stats_v1
    • write a migration script (drop if still exists) to drop events
    • update tests if they were somehow referencing events
    • update clear-canvas script to not reference events table
      (note: apparently we weren't clearing events_v2!)
    • do we need to merge any changes before we drop the events table in prod?
      • Yes.
  • drop events table in production (consolidated as a SQL sketch after this list)
    • set lock_timeout = '1s'
    • set statement_timeout = '1s'
    • alter table events drop constraint events_canvas_id_fkey
    • alter table events drop constraint events_account_id_fkey
    • drop index concurrently if exists idx_events_for_dequeue
    • drop index concurrently if exists idx_events_for_dequeue2
    • truncate events table
    • drop events table
  • merge the migration
  • copy all of the above from stable-dark to dark
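For reference, the production steps above consolidated into one SQL sketch, assuming the statements run one-by-one in a psql session (the constraint and index names are the ones listed; DROP INDEX CONCURRENTLY cannot run inside a transaction block):

    -- Fail fast rather than queue behind other traffic: if a lock can't be
    -- acquired within 1s, the statement errors instead of blocking.
    SET lock_timeout = '1s';
    SET statement_timeout = '1s';

    ALTER TABLE events DROP CONSTRAINT events_canvas_id_fkey;
    ALTER TABLE events DROP CONSTRAINT events_account_id_fkey;

    -- Must run outside a transaction block.
    DROP INDEX CONCURRENTLY IF EXISTS idx_events_for_dequeue;
    DROP INDEX CONCURRENTLY IF EXISTS idx_events_for_dequeue2;

    -- TRUNCATE reclaims the table's disk space immediately; the final DROP
    -- then only has to remove catalog entries.
    TRUNCATE TABLE events;
    DROP TABLE events;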

Get Google to shrink a clone

Goal: determine the amount of downtime

  • make a clone of our DB
  • set lock_timeout = '1s'
  • set statement_timeout = '1s'
  • drop FK on account_id
    alter table events drop constraint events_account_id_fkey
  • drop FK on canvas_id
    alter table events drop constraint events_canvas_id_fkey
  • drop index idx_events_for_dequeue
    drop index concurrently if exists idx_events_for_dequeue
  • drop index idx_events_for_dequeue2
    drop index concurrently if exists idx_events_for_dequeue2
  • drop index index_events_for_stats
    drop index concurrently if exists index_events_for_stats
  • truncate events table
  • drop events table
  • ask Google to shrink it (see the size-check sketch after this list)
    (they'll do this in real time, synchronously, during a workday/call)
  • record the downtime for reference: [downtime]
  • lower availability to single-zone
  • lower CPU from 16 vCPUs to 12 vCPUs
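A sketch of a before/after size check for the clone, assuming psql access: run it before the truncate/drop to see how much space events accounts for, and again after the shrink to confirm what was reclaimed.

    -- Total on-disk size of the database.
    SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

    -- Size of the events table plus its indexes (run before dropping it).
    SELECT pg_size_pretty(pg_total_relation_size('events')) AS events_size;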

Make a plan for doing this against the prod DB

  • plan how to alert customers
    • of expected downtime, etc
  • ...

Another day (pull into another issue):

Cloud storage

  • delete trace-related tests
  • check that 404s continue to work
  • ensure we overwrite cloud-storage traces for the execute_handler button
  • check that execute_function traces are appropriately merged with a cloud-storage-based trace
  • garbage collection: set an object lifecycle for the bucket or for traces
  • ensure Pusher is supported
  • do a walkthrough and check it all works

Monitoring

  • schedule a weekly call/meeting to review usage, for 4 weeks; at the end, decide what to do next
    • check table sizes (see the query sketch after this list)
    • check costs
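For the table-size check, a sketch of a catalog query listing the largest tables, assuming everything lives in the usual public schema (pg_total_relation_size includes indexes and TOAST):

    -- Largest tables first, with index size broken out separately.
    SELECT c.relname AS table_name,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
           pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r' AND n.nspname = 'public'
    ORDER BY pg_total_relation_size(c.oid) DESC;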

Migrate existing canvases

  • upload to both simultaneously
  • fetch and upload existing trace data for existing canvases/handlers
  • possibly switch the LD flag automatically once this is done
  • switch all users to only use uploaded storage data

Maybe later?

  • turn on private IPs (requires DB downtime)
StachuDotNet commented:

From an earlier call - notes on which tables have/don't have PKs (a query to reproduce this list follows):

  • toplevel_oplists has PK
  • stored_events_v2 no PK
  • function_results_v3 no PK
  • function_arguments no PK
  • traces_v0 has PK
  • user_data has PK
  • system_migrations has PK
  • static_asset_deploys has PK
  • secrets has PK
  • scheduling_rules has PK
  • packages_v0 has PK
  • op_ctrs no PK (though it does have a unique constraint)
  • events_v0 has PK
  • events has PK
  • custom_domains has PK
  • cron_records has PK
  • canvases has PK
  • accounts has PK
  • access no PK
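A sketch of a catalog query that reproduces the list above, assuming the tables live in the public schema:

    -- Ordinary tables in public that have no PRIMARY KEY constraint.
    SELECT c.relname AS table_without_pk
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname = 'public'
      AND NOT EXISTS (
        SELECT 1 FROM pg_constraint p
        WHERE p.conrelid = c.oid AND p.contype = 'p'
      )
    ORDER BY c.relname;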


StachuDotNet commented Mar 3, 2023

Edit: the contents of this comment have been moved to #11, and have been executed
