
Regression causing deadlock between File.Close and File.WriteTo #603

Open
ncw opened this issue Nov 11, 2024 · 7 comments


@ncw
Contributor

ncw commented Nov 11, 2024

I noticed when I upgraded the dependencies for rclone that the sftp server tests have started deadlocking.

I used git bisect to narrow this down to commit d1903fb.

This produced an enormous backtrace; I removed all traces that don't involve pkg/sftp and deduplicated the rest, which left us with three.

I think backtrace 1 is irrelevant. What appears to have happened is that a goroutine was started (backtrace 3) which is holding the File lock, so the Close in backtrace 2 blocks waiting for it.

The rclone code which starts that goroutine is this (backtrace 3):

func (f *Fs) newObjectReader(sftpFile *sftp.File) *objectReader {
	pipeReader, pipeWriter := io.Pipe()
	file := &objectReader{
		f:          f,
		sftpFile:   sftpFile,
		pipeReader: pipeReader,
		done:       make(chan struct{}),
	}
	// Show connection in use
	f.addSession()

	go func() {
		// Use sftpFile.WriteTo to pump data so that it gets a
		// chance to build the window up.
		_, err := sftpFile.WriteTo(pipeWriter)
		// Close the pipeWriter so the pipeReader fails with
		// the same error or EOF if err == nil
		_ = pipeWriter.CloseWithError(err)
		// signal that we've finished
		close(file.done)
	}()

	return file
}

And the Close method which deadlocks is here (backtrace 2):

// Close a reader of a remote sftp file
func (file *objectReader) Close() (err error) {
	// Close the sftpFile - this will likely cause the WriteTo to error
	err = file.sftpFile.Close()
	// Close the pipeReader so writes to the pipeWriter fail
	_ = file.pipeReader.Close()
	// Wait for the background process to finish
	<-file.done
	// Show connection no longer in use
	file.f.removeSession()
	return err
}

So the problem is that rclone expects the line err = file.sftpFile.Close() to close the sftp File and return an error. Now, because of the locking added in d1903fb, Close waits indefinitely for File.WriteTo() to finish, which it never will since rclone has stopped draining data from the pipe.

So I think this is a regression - this has worked perfectly for many years!

I haven't made a small reproduction - I probably can now that I understand what is going on - but maybe you've got an immediate idea on how to solve this problem?
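
To make the deadlock concrete, here is a minimal self-contained sketch of the pattern (the lockedFile type is illustrative only, not the sftp library's actual code): WriteTo holds the same mutex that Close needs while it is blocked writing into an io.Pipe whose reader has stopped reading.

package main

import (
	"fmt"
	"io"
	"sync"
	"time"
)

// lockedFile stands in for sftp.File after d1903fb: WriteTo and Close
// contend for the same mutex.
type lockedFile struct {
	mu sync.Mutex
}

// WriteTo holds the lock for the whole transfer while pumping data into w.
func (f *lockedFile) WriteTo(w io.Writer) (int64, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	buf := make([]byte, 32*1024)
	var n int64
	for {
		m, err := w.Write(buf) // blocks forever once the pipe reader stops reading
		n += int64(m)
		if err != nil {
			return n, err
		}
	}
}

// Close needs the same lock, so it cannot run until WriteTo returns.
func (f *lockedFile) Close() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	return nil
}

func main() {
	f := &lockedFile{}
	rd, wr := io.Pipe()

	go func() {
		_, err := f.WriteTo(wr)
		_ = wr.CloseWithError(err)
	}()

	// Read a little and then stop draining the pipe, like the rclone reader does.
	io.CopyN(io.Discard, rd, 42)

	closed := make(chan struct{})
	go func() {
		_ = f.Close() // blocks: WriteTo still holds the mutex
		close(closed)
	}()

	select {
	case <-closed:
		fmt.Println("closed")
	case <-time.After(time.Second):
		fmt.Println("deadlock: Close is stuck behind WriteTo")
	}
}

In this shape the only thing that can break the cycle is something that unblocks the pipe write, which is exactly the ordering question discussed below.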

Backtrace 1

goroutine 56 gp=0xc0004048c0 m=nil [sync.Cond.Wait]:
runtime.gopark(0xc0000a5ce0?, 0x1?, 0x20?, 0x24?, 0x23?)
	go1.23/src/runtime/proc.go:424 +0xce fp=0xc000352430 sp=0xc000352410 pc=0x473b6e
runtime.goparkunlock(...)
	go1.23/src/runtime/proc.go:430
sync.runtime_notifyListWait(0xc00011e0d0, 0x32c)
	go1.23/src/runtime/sema.go:587 +0x159 fp=0xc000352480 sp=0xc000352430 pc=0x4753b9
sync.(*Cond).Wait(0x0?)
	go1.23/src/sync/cond.go:71 +0x85 fp=0xc0003524c0 sp=0xc000352480 pc=0x484dc5
golang.org/x/crypto/ssh.(*buffer).Read(0xc00007c060, {0xc0001167f8, 0x4, 0x4})
	/home/ncw/go/pkg/mod/golang.org/x/[email protected]/ssh/buffer.go:94 +0x1ed fp=0xc000352548 sp=0xc0003524c0 pc=0x66eb6d
golang.org/x/crypto/ssh.(*channel).ReadExtended(0xc0003e40c0, {0xc0001167f8?, 0x1c0004048c0?, 0xc0003525d8?}, 0x40811e?)
	/home/ncw/go/pkg/mod/golang.org/x/[email protected]/ssh/channel.go:363 +0x93 fp=0xc0003525a8 sp=0xc000352548 pc=0x672553
golang.org/x/crypto/ssh.(*channel).Read(0x408245?, {0xc0001167f8?, 0x471f94?, 0xc000400e08?})
	/home/ncw/go/pkg/mod/golang.org/x/[email protected]/ssh/channel.go:540 +0x25 fp=0xc0003525e0 sp=0xc0003525a8 pc=0x673265
github.com/pkg/sftp.(*conn).Read(0x0?, {0xc0001167f8?, 0xc000352648?, 0x475e49?})
	<autogenerated>:1 +0x29 fp=0xc000352610 sp=0xc0003525e0 pc=0x6b0029
io.ReadAtLeast({0xb6e6c0, 0xc0000ea240}, {0xc0001167f8, 0x4, 0x4}, 0x4)
	go1.23/src/io/io.go:335 +0x90 fp=0xc000352658 sp=0xc000352610 pc=0x4a93f0
io.ReadFull(...)
	go1.23/src/io/io.go:354
github.com/pkg/sftp.recvPacket({0xb6e6c0, 0xc0000ea240}, 0x0, 0x2?)
github.com/pkg/sftp.(*conn).recvPacket(...)
github.com/pkg/sftp.(*clientConn).recv(0xc0000ea240)
github.com/pkg/sftp.NewClientPipe.func1()
runtime.goexit({})
	go1.23/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0003527e8 sp=0xc0003527e0 pc=0x47bf21
created by github.com/pkg/sftp.NewClientPipe in goroutine 46

Backtrace 2

goroutine 1762 gp=0xc0003a96c0 m=nil [sync.Mutex.Lock]:
runtime.gopark(0x0?, 0xc000034700?, 0xc0?, 0x20?, 0xc00001fbb8?)
	go1.23/src/runtime/proc.go:424 +0xce fp=0xc00001fad8 sp=0xc00001fab8 pc=0x473b6e
runtime.goparkunlock(...)
	go1.23/src/runtime/proc.go:430
runtime.semacquire1(0xc00038e06c, 0x0, 0x3, 0x1, 0x15)
	go1.23/src/runtime/sema.go:178 +0x225 fp=0xc00001fb40 sp=0xc00001fad8 pc=0x450725
sync.runtime_SemacquireMutex(0xc00001fbb0?, 0xe5?, 0xb?)
	go1.23/src/runtime/sema.go:95 +0x25 fp=0xc00001fb78 sp=0xc00001fb40 pc=0x475105
sync.(*Mutex).lockSlow(0xc00038e068)
	go1.23/src/sync/mutex.go:173 +0x15d fp=0xc00001fbc8 sp=0xc00001fb78 pc=0x4864fd
sync.(*Mutex).Lock(...)
	go1.23/src/sync/mutex.go:92
sync.(*RWMutex).Lock(0x2?)
	go1.23/src/sync/rwmutex.go:148 +0x31 fp=0xc00001fbf8 sp=0xc00001fbc8 pc=0x4877d1
github.com/pkg/sftp.(*File).Close(0xc00038e050)
github.com/rclone/rclone/backend/sftp.(*objectReader).Close(0xc0002ac680)
	github.com/rclone/rclone/backend/sftp/sftp.go:2105 +0x1f fp=0xc00001fc90 sp=0xc00001fc58 pc=0x8cacbf
github.com/rclone/rclone/lib/readers.(*LimitedReadCloser).Close(0xc000338480)
	github.com/rclone/rclone/lib/readers/limited.go:17 +0x24 fp=0xc00001fce0 sp=0xc00001fc90 pc=0x7c8d44
github.com/rclone/rclone/fstest/fstests.ReadObject({0xb73360, 0xf3e900}, 0xc000312ea0, {0xb77540, 0xc000486e80}, 0xffffffffffffffff, {0xc00061e350, 0x1, 0x1})
	github.com/rclone/rclone/fstest/fstests/fstests.go:293 +0x2f5 fp=0xc00001fdb0 sp=0xc00001fce0 pc=0x925895
github.com/rclone/rclone/fstest/fstests.Run.func13.14.36(0xc000312ea0)
	github.com/rclone/rclone/fstest/fstests/fstests.go:1802 +0x310 fp=0xc00001ff70 sp=0xc00001fdb0 pc=0x933db0
testing.tRunner(0xc000312ea0, 0xc000486e00)
	go1.23/src/testing/testing.go:1690 +0xf4 fp=0xc00001ffc0 sp=0xc00001ff70 pc=0x526174
testing.(*T).Run.gowrap1()
	go1.23/src/testing/testing.go:1743 +0x25 fp=0xc00001ffe0 sp=0xc00001ffc0 pc=0x527165
runtime.goexit({})
	go1.23/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00001ffe8 sp=0xc00001ffe0 pc=0x47bf21
created by testing.(*T).Run in goroutine 1096
	go1.23/src/testing/testing.go:1743 +0x390

Backtrace 3

goroutine 1676 gp=0xc000160e00 m=nil [select]:
runtime.gopark(0xc000416cf0?, 0x2?, 0x98?, 0x6b?, 0xc000416cb4?)
	go1.23/src/runtime/proc.go:424 +0xce fp=0xc000416b50 sp=0xc000416b30 pc=0x473b6e
runtime.selectgo(0xc000416cf0, 0xc000416cb0, 0xc0001ef470?, 0x1, 0x40f905?, 0x1)
	go1.23/src/runtime/select.go:335 +0x7a5 fp=0xc000416c78 sp=0xc000416b50 pc=0x44f685
io.(*pipe).write(0xc000112f00, {0xc0003f0000, 0x5f, 0x8000})
	go1.23/src/io/pipe.go:86 +0x1e5 fp=0xc000416d50 sp=0xc000416c78 pc=0x4ab185
io.(*PipeWriter).Write(0xc00038e050?, {0xc0003f0000?, 0xc0003f0000?, 0x8000?})
	go1.23/src/io/pipe.go:161 +0x1a fp=0xc000416d80 sp=0xc000416d50 pc=0x4ab59a
github.com/pkg/sftp.(*File).writeToSequential(0xc00038e050, {0xb6e140, 0xc000112f00})
github.com/pkg/sftp.(*File).WriteTo(0xc00038e050, {0xb6e140, 0xc000112f00})
github.com/rclone/rclone/backend/sftp.(*Fs).newObjectReader.func1()
	github.com/rclone/rclone/backend/sftp/sftp.go:2085 +0x30 fp=0xc000416fe0 sp=0xc000416fa8 pc=0x8cac10
runtime.goexit({})
	go1.23/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000416fe8 sp=0xc000416fe0 pc=0x47bf21
created by github.com/rclone/rclone/backend/sftp.(*Fs).newObjectReader in goroutine 1762
	github.com/rclone/rclone/backend/sftp/sftp.go:2082 +0x199
@puellanivis
Collaborator

puellanivis commented Nov 11, 2024

🤔 Right… so to address this, we would need to trigger a signal to close any outstanding WriteTo or ReadFrom (or heck, any outstanding Read or Write)…

Actually, let’s look at the behavior of os.File:

package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	f, err := os.Open("/dev/zero")
	if err != nil {
		panic(err)
	}

	rd, wr := io.Pipe()

	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()

		_, err := f.WriteTo(wr)
		_ = wr.CloseWithError(err)
	}()

	lr := &io.LimitedReader{R: rd, N: 42}

	b, err := io.ReadAll(lr)
	if err != nil {
		panic(err)
	}

	fmt.Println(string(b))

	err = f.Close()
	if err != nil {
		panic(err)
	}

	wg.Wait()
}

output:

close-on-open-fd$ go run main.go | allcat -o clip:
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
        /usr/lib64/go/1.23/src/runtime/sema.go:71 +0x25
sync.(*WaitGroup).Wait(0xc0000b0180?)
        /usr/lib64/go/1.23/src/sync/waitgroup.go:118 +0x48
main.main()
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:42 +0x232

goroutine 6 [select]:
io.(*pipe).write(0xc0000b01e0, {0xc0000d0000, 0x8000, 0x8000})
        /usr/lib64/go/1.23/src/io/pipe.go:86 +0x1e5
io.(*PipeWriter).Write(0x55da90?, {0xc0000d0000?, 0x46567f?, 0xc00008e048?})
        /usr/lib64/go/1.23/src/io/pipe.go:161 +0x1a
io.copyBuffer({0x4deb38, 0xc0000b01e0}, {0x4dec38, 0xc00008e048}, {0x0, 0x0, 0x0})
        /usr/lib64/go/1.23/src/io/io.go:431 +0x1de
io.Copy(...)
        /usr/lib64/go/1.23/src/io/io.go:388
os.genericWriteTo(0xc00008e040?, {0x4deb38, 0xc0000b01e0})
        /usr/lib64/go/1.23/src/os/file.go:275 +0x4f
os.(*File).WriteTo(0xc00008e040, {0x4deb38, 0xc0000b01e0})
        /usr/lib64/go/1.23/src/os/file.go:253 +0x9c
main.main.func1()
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:24 +0x57
created by main.main in goroutine 1
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:21 +0x150
exit status 2

It seems that here, while the Close() succeeds, the WriteTo() doesn't crash out or return an error.

🤔

P.S.: Adding an rd.Close() before the Wait allows it to complete, and WriteTo returns io.ErrClosedPipe.

@puellanivis
Collaborator

Min-repro:

package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"sync"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/agent"
)

func main() {
	var auths []ssh.AuthMethod
	if aconn, err := net.Dial("unix", os.Getenv("SSH_AUTH_SOCK")); err == nil {
		auths = append(auths, ssh.PublicKeysCallback(agent.NewClient(aconn).Signers))
	}
	auths = append(auths, ssh.Password("notasecret"))

	config := ssh.ClientConfig{
		User:            "testuser",
		Auth:            auths,
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}

	addr := "127.0.0.1:2022"

	conn, err := ssh.Dial("tcp", addr, &config)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	cl, err := sftp.NewClient(conn)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	f, err := cl.Open("/dev/zero")
	if err != nil {
		panic(err)
	}

	rd, wr := io.Pipe()

	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()

		_, err := f.WriteTo(wr)
		_ = wr.CloseWithError(err)
	}()

	lr := &io.LimitedReader{R: rd, N: 42}

	b, err := io.ReadAll(lr)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%v\n", b)

	err = f.Close()
	if err != nil {
		panic(err)
	}

	wg.Wait()
}

This locks up:

$ go run main.go
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Adding an rd.Close() at the bottom causes the process to succeed, with the same io.ErrClosedPipe coming from the WriteTo.

It seems that things only ever worked because the io.Pipe was triggering the end. Moving the rd.Close() to before the f.Close() works even on v1.13.7.
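
Here is a stripped-down sketch of why that ordering unblocks things, with an in-process zeroReader standing in for the remote /dev/zero so it runs without a server (illustrative only, not library code):

package main

import (
	"fmt"
	"io"
)

// zeroReader stands in for the remote /dev/zero.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

func main() {
	rd, wr := io.Pipe()
	done := make(chan error, 1)

	// The pump goroutine, standing in for f.WriteTo(wr) in the repro above.
	go func() {
		_, err := io.Copy(wr, zeroReader{})
		_ = wr.CloseWithError(err)
		done <- err
	}()

	// Drain a little, like the LimitedReader/ReadAll does, then stop reading.
	if _, err := io.CopyN(io.Discard, rd, 42); err != nil {
		panic(err)
	}

	// Closing the read end first makes the pending pipe write fail with
	// io.ErrClosedPipe, so the pump goroutine exits and there is nothing
	// left for a later f.Close() (or a Wait) to deadlock against.
	_ = rd.Close()

	fmt.Println("WriteTo finished with:", <-done)
}

Once the pipe write has returned, the WriteTo on the sftp side can finish and release its lock before Close asks for it.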

ncw added a commit to rclone/rclone that referenced this issue Nov 11, 2024
… tests

Before this change, upgrading to v1.13.7 caused a deadlock in the tests.

This was caused by additional locking in the sftp package exposing a
bad choice by the rclone code.

See pkg/sftp#603 and thanks to @puellanivis
for the fix suggestion.
@ncw
Contributor Author

ncw commented Nov 11, 2024

Thank you so much for making a repro, @puellanivis.

You are right, closing things in a different order makes the deadlock go away:

--- a/backend/sftp/sftp.go
+++ b/backend/sftp/sftp.go
@@ -2101,10 +2101,10 @@ func (file *objectReader) Read(p []byte) (n int, err error) {
 
 // Close a reader of a remote sftp file
 func (file *objectReader) Close() (err error) {
-       // Close the sftpFile - this will likely cause the WriteTo to error
-       err = file.sftpFile.Close()
        // Close the pipeReader so writes to the pipeWriter fail
        _ = file.pipeReader.Close()
+       // Close the sftpFile - this will likely cause the WriteTo to error
+       err = file.sftpFile.Close()
        // Wait for the background process to finish
        <-file.done
        // Show connection no longer in use

I think the new way is probably the best way and the fact that it worked before was due to an implementation detail of the sftp library.

I'm happy with that fix, so we can close this if you are too?

@puellanivis
Collaborator

Yeah, I’m still thinking I’d rather replicate the os.File Close behavior, since we’re generally trying to mirror that usage. Though it would probably be quite a bit of work to ensure we don’t also introduce a race condition, where the WriteTo could theoretically end up making one or more reads from a different file due to a recycled file handle on the server side.

I’ll timebox this to a week. I would like to at least get dev-v2 working the same as os.File.Close (that is, it deadlocks without the rd.Close() but at least doesn’t deadlock on file.Close()). Then depending on how much work that takes, I’ll have a perspective on how to bring the fix to v1.
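
For what it's worth, here is one possible shape as a sketch only (the file type and its fields are illustrative, with a bytes.Reader standing in for the remote handle; this is not the actual pkg/sftp code): re-acquire the lock per chunk and never hold it across the blocking w.Write, so Close only has to wait for the in-flight chunk.

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"sync"
)

// file is a hypothetical stand-in: the copy loop takes the lock per chunk,
// checks a closed flag, and releases the lock before the blocking w.Write.
type file struct {
	mu     sync.Mutex
	closed bool
	src    io.Reader // stands in for the remote handle
}

func (f *file) Close() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.closed {
		return os.ErrClosed
	}
	f.closed = true // a real client would also send SSH_FXP_CLOSE here
	return nil
}

func (f *file) WriteTo(w io.Writer) (int64, error) {
	var n int64
	buf := make([]byte, 8)
	for {
		f.mu.Lock()
		if f.closed {
			f.mu.Unlock()
			return n, os.ErrClosed
		}
		m, rerr := f.src.Read(buf) // chunk read happens under the lock
		f.mu.Unlock()

		if m > 0 {
			wm, werr := w.Write(buf[:m]) // blocked writes no longer block Close
			n += int64(wm)
			if werr != nil {
				return n, werr
			}
		}
		if rerr != nil {
			if rerr == io.EOF {
				rerr = nil
			}
			return n, rerr
		}
	}
}

func main() {
	f := &file{src: bytes.NewReader(make([]byte, 1<<20))}
	rd, wr := io.Pipe()

	done := make(chan error, 1)
	go func() {
		_, err := f.WriteTo(wr)
		_ = wr.CloseWithError(err)
		done <- err
	}()

	io.CopyN(io.Discard, rd, 42) // drain a little, then stop reading

	fmt.Println("Close:", f.Close()) // returns immediately instead of deadlocking
	_ = rd.Close()                   // unblock the pending pipe write
	fmt.Println("WriteTo:", <-done)  // io.ErrClosedPipe, like os.File
}

Reading each chunk under the lock also means no read can start after Close has marked the handle closed, which speaks to the recycled-handle race mentioned above.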

@ncw
Contributor Author

ncw commented Nov 12, 2024

Sounds like a plan! For my use, I'll put the above tweak in and update rclone to v1.13.7 for its next release.

I haven't looked at the v2 branch yet - should I?

@puellanivis
Collaborator

Yeah, I think the tweak of closing the read pipe first is definitely the better solution in general. 👍

Yes, you can look at the dev-v2 branch already. 😁

@puellanivis
Collaborator

#604 works! 😁 Just like the os.File package in a couple of different configurations.
