dsc failed to start a downstairs #1498

Open
leftwo opened this issue Oct 7, 2024 · 3 comments · Fixed by #1504

Comments

leftwo commented Oct 7, 2024

In CI (link may be bad) job: https://buildomat.eng.oxide.computer/wg/0/artefact/01J9KZDBT7Q76BMAZ3NFT2EM6G/JHuMk44VV03fuovquSYDTdL9uCEruARhbYRD7WhRUTif7Lam/01J9KZDV6QXGVBQG0ZHA6JNXTH/01J9M3TRMJHBD7KTGG9QWWK022/dsc-out.txt

We had the test time out.
Looking at the test log output, we can see that a new check we added is waiting for all of the downstairs to respond:


120 | 2024-10-07T17:27:21.088Z | + ptime -m bash /input/build/work/scripts/test_up.sh -r 2 encrypted
121 | 2024-10-07T17:27:21.099Z | /input/build/work
122 | 2024-10-07T17:27:21.111Z | Using 2 region sets
123 | 2024-10-07T17:27:21.126Z | Upstairs using key: 1gql5EL5hhzi5hY5rXSvSfN31jxffOX/9IhgN0kkwMI=
124 | 2024-10-07T17:27:21.129Z | dsc output goes to /tmp/test_up-build/dsc-out.txt
125 | 2024-10-07T17:27:21.129Z | Creating 6 downstairs regions
126 | 2024-10-07T17:27:21.586Z | Starting 6 downstairs
127 | 2024-10-07T17:27:21.589Z | dsc started at PID: 1191
128 | 2024-10-07T17:27:26.657Z | Downstairs client 0 not running, waiting for it
129 | 2024-10-07T17:27:31.718Z | Downstairs client 0 not running, waiting for it
130 | 2024-10-07T17:27:36.778Z | Downstairs client 0 not running, waiting for it

This waiting continues until the test hits its overall timeout and we abort the run.
In the dsc logs, we can see that client 0 got to Starting, but never to Running:

/var/tmp/bins/dsc create --encrypted --cleanup --region-count 6 --extent-size 10 --extent-count 5 --output-dir /tmp/test_up-build/dsc --ds-bin /var/tmp/bins/crucible-downstairs --region-dir /var/tmp/test_up-build
Removing existing dsc directory "/tmp/test_up-build/dsc"
Creating region directory at: /var/tmp/test_up-build
Creating dsc directory at: /tmp/test_up-build/dsc
Downstairs region 0 created at /var/tmp/test_up-build/8810 in 0.26448342
Downstairs region 1 created at /var/tmp/test_up-build/8820 in 0.036061257
Downstairs region 2 created at /var/tmp/test_up-build/8830 in 0.03765977
Downstairs region 3 created at /var/tmp/test_up-build/8840 in 0.047207728
Downstairs region 4 created at /var/tmp/test_up-build/8850 in 0.024012322
Downstairs region 5 created at /var/tmp/test_up-build/8860 in 0.022937205
Created 6 regions
/var/tmp/bins/dsc start --output-dir /tmp/test_up-build/dsc --ds-bin /var/tmp/bins/crucible-downstairs --region-dir /var/tmp/test_up-build --region-count 6
Using existing output directory "/tmp/test_up-build/dsc"
Update our region info with: Some(RegionExtentInfo { block_size: 4096, blocks_per_extent: 10, extent_count: 5 })
start ds: 8810
start ds: 8820
start ds: 8830
start ds: 8840
start ds: 8850
start access at:127.0.0.1:9998
start ds: 8860
Starting downstairs at port 8810
Starting downstairs at port 8830
Make output file at "/tmp/test_up-build/dsc/downstairs-8810.txt"
Make output file at "/tmp/test_up-build/dsc/downstairs-8830.txt"
Starting downstairs at port 8820
Make output file at "/tmp/test_up-build/dsc/downstairs-8820.txt"
Starting downstairs at port 8840
Make output file at "/tmp/test_up-build/dsc/downstairs-8840.txt"
Starting downstairs at port 8850
Make output file at "/tmp/test_up-build/dsc/downstairs-8850.txt"
Starting downstairs at port 8860
Make output file at "/tmp/test_up-build/dsc/downstairs-8860.txt"
[8810][0] initial start wait reports Starting
[8830][2] initial start wait reports Starting
[8820][1] initial start wait reports Starting
[8840][3] initial start wait reports Starting
[8850][4] initial start wait reports Starting
[8860][5] initial start wait reports Starting
Downstairs /var/tmp/test_up-build/8850 port 8850 PID:Some(1193)
[8850][4] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8820 port 8820 PID:Some(1196)
[8820][1] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8840 port 8840 PID:Some(1197)
[8840][3] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8860 port 8860 PID:Some(1195)
[8860][5] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8830 port 8830 PID:Some(1198)
[8830][2] initial start wait reports Running
Oct 07 17:27:21.192 INFO listening, local_addr: 127.0.0.1:9998
Control access at:127.0.0.1:9998
Oct 07 17:27:26.229 INFO accepted connection, remote_addr: 127.0.0.1:35620, local_addr: 127.0.0.1:9998
Oct 07 17:27:26.231 INFO request completed, latency_us: 407, response_code: 200, uri: /state/cid/0, method: GET, req_id: 0ffa13a9-ee3c-4b4c-b7a6-c51ce6f2348d, remote_addr: 127.0.0.1:35620, local_addr: 127.0.0.1:9998
Oct 07 17:27:31.292 INFO accepted connection, remote_addr: 127.0.0.1:42984, local_addr: 127.0.0.1:9998
Oct 07 17:27:31.292 INFO request completed, latency_us: 158, response_code: 200, uri: /state/cid/0, method: GET, req_id: a61885b0-a975-4400-867a-6c121fb549a0, remote_addr: 127.0.0.1:42984, local_addr: 127.0.0.1:9998
Oct 07 17:27:36.352 INFO accepted connection, remote_addr: 127.0.0.1:41958, local_addr: 127.0.0.1:9998
Oct 07 17:27:36.353 INFO request completed, latency_us: 165, response_code: 200, uri: /state/cid/0, method: GET, req_id: 3fda854a-0c0b-4ccf-ae32-810e7bf7a3b5, remote_addr: 127.0.0.1:41958, local_addr: 127.0.0.1:9998
Oct 07 17:27:41.413 INFO accepted connection, remote_addr: 127.0.0.1:55992, local_addr: 127.0.0.1:9998
Oct 07 17:27:41.413 INFO request completed, latency_us: 168, response_code: 200, uri: /state/cid/0, method: GET, req_id: 839a976b-489e-4375-9fef-16311ae7953e, remote_addr: 127.0.0.1:55992, local_addr: 127.0.0.1:9998
Oct 07 17:27:46.474 INFO accepted connection, remote_addr: 127.0.0.1:45792, local_addr: 127.0.0.1:9998
Oct 07 17:27:46.475 INFO request completed, latency_us: 159, response_code: 200, uri: /state/cid/0, method: GET, req_id: 15182d54-4131-4011-9787-fb84d2f63414, remote_addr: 127.0.0.1:45792, local_addr: 127.0.0.1:9998
Oct 07 17:27:51.537 INFO accepted connection, remote_addr: 127.0.0.1:34146, local_addr: 127.0.0.1:9998
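
The repeated GET /state/cid/0 requests at the end of that log are the test's poll loop checking client 0's state. A rough stand-in for that loop, written in Rust rather than the shell of test_up.sh (wait_for_running, the blocking reqwest dependency, and the response handling are assumptions for illustration, not the real test code):

use std::time::Duration;

// Hypothetical stand-in for the test script's poll loop. The /state/cid/{id}
// path and the 127.0.0.1:9998 control address come from the dsc log above;
// the exact shape of the response body is an assumption, so we just look for
// "running" in whatever comes back.
fn wait_for_running(client_id: u32, attempts: u32) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("http://127.0.0.1:9998/state/cid/{client_id}");
    for _ in 0..attempts {
        let state = reqwest::blocking::get(url.as_str())?.text()?;
        if state.to_lowercase().contains("running") {
            return Ok(());
        }
        println!("Downstairs client {client_id} not running, waiting for it");
        std::thread::sleep(Duration::from_secs(5));
    }
    Err(format!("downstairs client {client_id} never reached Running").into())
}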

A third bit of information: if I look at the output for client 0 (port 8810), I do see that it started:

{"msg":"Opened existing region file \"/var/tmp/test_up-build/8810/region.json\"","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.199648921Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Database read version 1","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.201163781Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Database write version 1","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.20118688Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"UUID: 12345678-0000-0000-0000-000000008810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.20120484Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Blocks per extent:10 Total Extents: 5","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.201221359Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Crucible Version: Crucible Version: 0.0.1\nCommit SHA: fb9332b3671c250552cf7fa7cbd0a2572ab72b79\nCommit timestamp: 2024-10-07T17:07:42.000000000Z  branch: alan/test-more-regions\nrustc: 1.80.0 stable x86_64-unknown-illumos\nCargo: x86_64-unknown-illumos  Debug: true Opt level: 0","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.290495955Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"Upstairs <-> Downstairs Message Version: 11","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.290804692Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"Repair listens on 0.0.0.0:12810 for path:\"/var/tmp/test_up-build/8810\"","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.296602114Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"repair"}
{"msg":"listening","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298231449Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"local_addr":"0.0.0.0:12810","task":"repair"}
{"msg":"Using repair address: 0.0.0.0:12810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298915952Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"No SSL acceptor configured","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298953321Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"downstairs listening on 0.0.0.0:8810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.299080086Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}

This suggests the problem lies somewhere in the message passing between the different parts of dsc.

leftwo commented Oct 7, 2024

In dsc/src/main.rs, we have the following code that starts a downstairs:

    fn start(&self) -> Result<Child> {
        println!("Make output file at {:?}", self.output_file);
        let outputs = File::create(&self.output_file)
            .context("Failed to create output file")?;
        let errors = outputs.try_clone()?;

        let port_value = format!("{}", self.port);

        let mode = if self.read_only {
            "ro".to_string()
        } else {
            "rw".to_string()
        };

        let region_dir = self.region_dir.clone();
        let cmd = Command::new(self.ds_bin.clone())
            .args([
                "run",
                "-p",
                &port_value,
                "-d",
                &region_dir,
                "--mode",
                &mode,
            ])
            .stdout(Stdio::from(outputs))
            .stderr(Stdio::from(errors))
            .spawn()
            .context("Failed trying to run downstairs")?;

        println!(
            "Downstairs {} port {} PID:{:?}",
            region_dir,
            self.port,
            cmd.id()
        );

In the logs above, we see the first message, Make output file at ..., for this client, but we never see the second, Downstairs {} port {} .... So the downstairs does get spawned (we can see its output in the output file), but as near as I can determine, we never come back from that spawn() call.
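
For what it's worth, one way to keep a wedged spawn from silently stalling the whole start sequence would be to run the blocking spawn on tokio's blocking pool and put a deadline on the wait. This is only a sketch under assumptions (anyhow and tokio, which the snippet above appears to use, and std::process types for simplicity; spawn_with_deadline is a hypothetical helper, not existing dsc code), not the actual fix:

use std::process::{Child, Command};
use std::time::Duration;

use anyhow::{bail, Context, Result};

// Hypothetical helper: run the blocking spawn() on tokio's blocking pool so a
// stuck spawn cannot wedge an async worker, and bound how long we wait for it
// so a hang turns into an error we can log.
async fn spawn_with_deadline(mut cmd: Command, deadline: Duration) -> Result<Child> {
    let handle = tokio::task::spawn_blocking(move || cmd.spawn());
    match tokio::time::timeout(deadline, handle).await {
        Ok(join_result) => {
            let spawn_result = join_result.context("spawn task panicked")?;
            spawn_result.context("Failed trying to run downstairs")
        }
        Err(_) => bail!("spawn() did not return within {:?}", deadline),
    }
}

With something like this, a healthy spawn behaves as before, while a spawn that never comes back at least produces an error after the deadline instead of a silent hang.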

leftwo commented Oct 8, 2024

I was able to reproduce it outside of CI using a debug build.

The stacks look mostly similar, but this one stuck out:

 fffff9ffeeb7bf8a read     (20, fffff9ffec3fbc00, 8)
 0000000003add616 std::sys::pal::unix::process::process_inner::<impl std::sys::pal::unix::process::process_common::Command>::spawn::h356b47fec30ee6a3 () + 356
 0000000003ace1ec std::process::Command::spawn::hb5ff4e4d3aae5893 () + 1c
 00000000036eb319 tokio::process::imp::spawn_child::h2e091d8168139f18 () + 29
 00000000036c3ea0 tokio::process::Command::spawn::h82818d972f5bd22b () + 30
 0000000001ebd1ff dsc::DownstairsInfo::start::hdb9e8fbe2e7641c0 () + 4af
 0000000001fc382a dsc::start_ds::{{closure}}::h72fb5d778549a741 () + 39a
 0000000001fc3d27 dsc::ds_start_monitor::{{closure}}::h79d20fed6577f95e () + 1a7
 0000000001fc340b dsc::start_dsc::{{closure}}::{{closure}}::h4710d443c5725ecc () + db
 0000000001ee5cde tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::h25306ec96d735a5a () + 6e
 0000000001ee4d6c tokio::runtime::task::core::Core<T,S>::poll::h323db97a5faf7906 () + 2c
 0000000001f23dcf tokio::runtime::task::harness::poll_future::{{closure}}::h4932b9dce84b5721 () + 3f
 0000000001e8f172 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hcffcc65ef2a830a9 () + 22
 0000000001f99be6 std::panicking::try::do_call::h791a875498f9620a () + 36
 0000000001f94064 std::panicking::try::h1ebe6082f4923d75 () + 44
 0000000001e901e4 std::panic::catch_unwind::h280a3ab6527a6078 () + 14
 0000000001f22563 tokio::runtime::task::harness::poll_future::hcb4b3af1b5128951 () + 73
 0000000001f251d6 tokio::runtime::task::harness::Harness<T,S>::poll_inner::h1765a596355314e0 () + 86
 0000000001f2b3e5 tokio::runtime::task::harness::Harness<T,S>::poll::h9cb104db5087b249 () + 15
 0000000001ecbc2d tokio::runtime::task::raw::poll::hafbc00b5629743bc () + 1d
 0000000003735146 tokio::runtime::task::raw::RawTask::poll::hbcd9b90a9f68e73f () + 26
 0000000003742072 tokio::runtime::task::LocalNotified<S>::run::h82c13fc69d40bfa2 () + 22
 00000000036c898c tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}::h9ecf202fb4aedd36 () + 1c
 00000000036c8950 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h9f2957183dfc0a17 () + 120
 00000000036c8634 tokio::runtime::scheduler::multi_thread::worker::Context::run::hb79d7578a4d720cc () + 1b4
 00000000036c8424 tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}::h947c9cce25841432 () + 34
 000000000370ff9f tokio::runtime::context::scoped::Scoped<T>::set::h589875c319f11e61 () + 5f
 00000000036c547b tokio::runtime::context::set_scheduler::{{closure}}::h946422b065b2537b () + 2b
 00000000036cd4a7 std::thread::local::LocalKey<T>::try_with::hbff304b678572a27 () + b7
 00000000036cbaa1 std::thread::local::LocalKey<T>::with::h7e497ece18cd4db2 () + 11
 00000000036c5440 tokio::runtime::context::set_scheduler::h92d580ae830637b9 () + 30
 00000000036c83d8 tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::h23bf5525e61184da () + c8
 000000000375a83e tokio::runtime::context::runtime::enter_runtime::h1093cfccbd6ee326 () + ae
 00000000036c829b tokio::runtime::scheduler::multi_thread::worker::run::h6416b5163edbc5b8 () + 11b
 00000000036c8171 tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}::hc6817b4d669b3e32 () + 11
 0000000003707056 <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll::h234967f3a9f321ed () + 56
 00000000036e37a5 tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::h596f8a7421257a1b () + 75
 00000000036e2afc tokio::runtime::task::core::Core<T,S>::poll::h7990afe31ca5173b () + 2c
 00000000036b649f tokio::runtime::task::harness::poll_future::{{closure}}::h2b16ec35846ce065 () + 3f
 0000000003746182 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hc3c3bbccb1345b38 () + 22
 0000000003767a06 std::panicking::try::do_call::h80b93ac65f9ea34a () + 36
 0000000003763c04 std::panicking::try::h62a93063541a4e4f () + 44
 0000000003751c24 std::panic::catch_unwind::hf4eeead54b5fbc66 () + 14
 00000000036b33b3 tokio::runtime::task::harness::poll_future::h33ea2f16a04c8f82 () + 73
 00000000036a9906 tokio::runtime::task::harness::Harness<T,S>::poll_inner::h2151a92fefbb26a3 () + 86
 00000000036a8dd5 tokio::runtime::task::harness::Harness<T,S>::poll::h6fa656e3a74f94bf () + 15
 000000000373563d tokio::runtime::task::raw::poll::h6c0e9ed682716b6b () + 1d
 0000000003735146 tokio::runtime::task::raw::RawTask::poll::hbcd9b90a9f68e73f () + 26
 0000000003742106 tokio::runtime::task::UnownedTask<S>::run::hd214edb526cc3a17 () + 26
 0000000003702109 tokio::runtime::blocking::pool::Task::run::ha15a3499ed14041a () + 19
 000000000370643c tokio::runtime::blocking::pool::Inner::run::hc0d1077484ccb8f8 () + 11c
 00000000037062c9 tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}::h9b828a85de1189c0 () + 39
 00000000037376ed std::sys_common::backtrace::__rust_begin_short_backtrace::hcde0af8ac6e94548 () + d
 0000000003714d50 std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h682a7ee890499fa7 () + 30
 0000000003745470 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h1db3259947d8925b () + 30
 0000000003767bc0 std::panicking::try::do_call::h95bd2eb6539cb660 () + 40
 0000000003762b0b std::panicking::try::h1ff37b4fa23118c9 () + 5b
 0000000003714be8 std::thread::Builder::spawn_unchecked_::{{closure}}::h13ddbdd7ae4a1097 () + 1f8
 00000000036b7c71 core::ops::function::FnOnce::call_once{{vtable.shim}}::hb6cc2aa0206b9ae1 () + 11
 0000000003adce5b std::sys::pal::unix::thread::Thread::new::thread_start::hc6a6dcaabc211891 () + 1b
 fffff9ffeeb74f37 _thrp_setup (fffff9ffeec93a40) + 77
 fffff9ffeeb75280 _lwp_start ()
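
The blocked read at the top is consistent with the error-reporting pipe that a fork/exec based spawn uses on Unix: the parent sits in read() until the child either writes back an errno (exec failed) or the last write end of the pipe is closed (exec succeeded), and the 8-byte read in the first frame fits that pattern. That is an assumption about the std internals based on the symbol names, not something the trace proves. A rough, simplified sketch of the pattern (assuming the libc crate; spawn_sketch is illustrative only, not the actual std code):

use std::ffi::CString;
use std::io;

// Simplified illustration of a fork/exec spawn with an error-reporting pipe.
fn spawn_sketch(path: &str) -> io::Result<libc::pid_t> {
    let prog = CString::new(path).expect("path contains a NUL byte");
    let argv = [prog.as_ptr(), std::ptr::null()];

    // A pipe whose ends are close-on-exec: if exec succeeds in the child,
    // the child's copy of the write end is closed automatically.
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let (read_fd, write_fd) = (fds[0], fds[1]);
    unsafe {
        libc::fcntl(read_fd, libc::F_SETFD, libc::FD_CLOEXEC);
        libc::fcntl(write_fd, libc::F_SETFD, libc::FD_CLOEXEC);
    }

    match unsafe { libc::fork() } {
        -1 => Err(io::Error::last_os_error()),
        0 => unsafe {
            // Child: exec. On success the kernel closes our write end for us
            // (CLOEXEC); on failure we send errno back to the parent.
            libc::execv(prog.as_ptr(), argv.as_ptr());
            let errno = io::Error::last_os_error().raw_os_error().unwrap_or(0);
            libc::write(write_fd, errno.to_ne_bytes().as_ptr() as *const _, 4);
            libc::_exit(127)
        },
        child => unsafe {
            // Parent: drop our write end, then block in read() until the
            // child either reports an exec failure or every write end of
            // the pipe is gone. If neither ever happens, the parent never
            // leaves read() -- the same shape as the stack above.
            libc::close(write_fd);
            let mut buf = [0u8; 8];
            let n = libc::read(read_fd, buf.as_mut_ptr() as *mut _, buf.len());
            libc::close(read_fd);
            if n > 0 {
                Err(io::Error::from_raw_os_error(i32::from_ne_bytes(
                    buf[..4].try_into().unwrap(),
                )))
            } else {
                Ok(child)
            }
        },
    }
}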

mkeeter commented Oct 14, 2024

I believe this was closed automatically (and incorrectly) because you included the phrase fix: #1498 in #1504!

mkeeter reopened this Oct 14, 2024