
[bug]: Payment Flow: Don't abort payment until HTLC attempts are still in flight #8975

Open · ziggie1984 opened this issue Aug 5, 2024 · 4 comments · May be fixed by #9150

Labels: bug (Unintended code behaviour), payments (Related to invoices/payments)

ziggie1984 commented Aug 5, 2024

Remarked in #8174

Analysed some of the behavior here:

#8174 (comment)

What I think might be taken into consideration to change (from my point of understanding the codebase here)

When sending a payment, LND will by default attempt an MPP payment with up to 16 shards per payment. We launch all of them asynchronously; however, if during the process one of the HTLC attempts fails for a specific reason, e.g. requesting a route fails or we cannot register a payment attempt:

```go
rt, err := p.requestRoute(ps)
if err != nil {
	return exitWithErr(err)
}

// We may not be able to find a route for current attempt. In
// that case, we continue the loop and move straight to the
// next iteration in case there are results for inflight HTLCs
// that still need to be collected.
if rt == nil {
	log.Errorf("No route found for payment %v",
		p.identifier)

	continue lifecycle
}

log.Tracef("Found route: %s", spew.Sdump(rt.Hops))

// We found a route to try, create a new HTLC attempt to try.
attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
if err != nil {
	return exitWithErr(err)
}
```

we abort the payment cycle and don't wait for the INFLIGHT HTLC attempts to be resolved; we will only reconsider them after a restart.

One of the examples described in #8174 (comment) is that as soon as the MPP times out at the receiver side while we are in the process of sending another HTLC, we will fail the payment lifecycle without resolving all the INFLIGHT HTLCs.

This has downsides:

  1. On the one hand the payment will stay inflight indefinitely, and service providers using the API might not be aware that this payment has no chance of resolving successfully, e.g. because the MPP timeout has already been reached.

  2. Moreover, when subscribed via the TrackPayment API, the payment stays INFLIGHT/PENDING forever.

Possible solutions, interested in feedback here:

I think it makes sense to store the HTLC attempts in memory during the execution of the payment, so that the payment lifecycle can always be aware how many HTLCs have been sent. We store them in the DB, but I think it might be worthwhile to also have them easily accessible in the lifecycle struct. (Currently we only have one golang channel which listens for potential results from all the result collectors.)

ziggie1984 added the bug (Unintended code behaviour) and payments (Related to invoices/payments) labels on Aug 5, 2024
yyforyongyu (Collaborator) commented:

I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.

The previous assumption is, we should only rely on this func to decide if we want to abort the loop,

```go
// Now decide the next step of the current lifecycle.
step, err := p.decideNextStep(payment)
if err != nil {
	return exitWithErr(err)
}
```

which works most of the time given the setup that,

  1. the invoice has a timeout of 60s
  2. finding a route via RequestRoute is fast (on the order of milliseconds)
  3. creating and sending the HTLC attempt is fast.

but it can break if,

  1. one or more HTLCs are in flight.
  2. the sender node shuts down before the invoice times out and restarts after the invoice times out.
  3. we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
  4. we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

The root cause is we have two different views of the payment in two goroutines,

  • G1: in resumePayment, we always fetch the latest payment state for each iteration, and make decisions based on this state.
  • G2: in collectResultAsync, we will alter the payment state based on the collected result from htlcswitch.

The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment. Since the only slow action is to collect results from htlcswitch via,

```go
resultChan, err := p.router.cfg.Payer.GetAttemptResult(
	attempt.AttemptID, p.identifier, errorDecryptor,
)
```
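The single-goroutine shape described above can be sketched roughly like this (types and names are illustrative, not lnd's actual code): the collector goroutines only forward results over a channel, and resumePayment is the sole reader and writer of payment state, so the two views of the payment can never diverge.

```go
package main

import "fmt"

// htlcResult is a hypothetical result message; the collector
// goroutine forwards it without touching the payment record.
type htlcResult struct {
	attemptID uint64
	settled   bool
}

// paymentDB stands in for the on-disk payment state. In this
// sketch, only the lifecycle goroutine is allowed to mutate it.
type paymentDB struct {
	inflight map[uint64]bool
}

func (db *paymentDB) apply(r htlcResult) {
	delete(db.inflight, r.attemptID)
}

// resumePayment sketches the proposed shape: results arrive on
// resultChan and every DB write happens here, in one goroutine.
func resumePayment(db *paymentDB, resultChan <-chan htlcResult) {
	for r := range resultChan {
		db.apply(r) // the only writer
	}
}

func main() {
	db := &paymentDB{inflight: map[uint64]bool{1: true, 2: true}}
	resultChan := make(chan htlcResult)

	// Simulated collectResultAsync goroutines: forward only,
	// never write to the payment record themselves.
	go func() {
		resultChan <- htlcResult{attemptID: 1, settled: true}
		resultChan <- htlcResult{attemptID: 2, settled: false}
		close(resultChan)
	}()

	resumePayment(db, resultChan)
	fmt.Println(len(db.inflight)) // all attempts resolved
}
```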

A better fix is to refactor deeper to make the DB actions clearer, which I think can be accomplished in the upcoming native payment SQL PRs.

> I think it makes sense to store the HTLC Attempts in Memory during the execution of the payment so that the payment lifecycle can always be aware how many HTLCs have been sent?

The more we rely on ephemeral state, the less fault tolerant it becomes. Suppose a power outage happens: all the results will be lost, unless we can reconstruct them, say, if htlcswitch has stored them on disk.

ziggie1984 (Author) commented Aug 6, 2024


> I think the question is, should we ever abort the payment lifecycle even when there are inflight HTLCs? The answer is yes, but only if we get a non-MPP-related error, such as a DB-related error, or other lower-level critical errors.

Hmm, I would rather make sure we resolve all INFLIGHT HTLCs even if we hit a lower-level critical error, so that we mark this payment and all its HTLCs as unrecoverable, and at least the trackpayment cmd signals the exit of the payment correctly.

> but it can break if,
>
>   1. one or more HTLCs are in flight.
>   2. the sender node shuts down before the invoice times out and restarts after the invoice times out.
>   3. we call decideNextStep before the payment is marked as failed, and proceed to register another attempt.
>   4. we call RegisterAttempt, now it fails as we cannot register attempts on a failed payment.

Do you mean those steps must all happen in sequential order for the payment lifecycle to break, or does every point represent a case where the payment lifecycle fails? I guess it's the former, but I don't think point 2 is necessary for it to break?

> The simple fix is to always read and write to the DB in the same goroutine, which should happen in resumePayment. Since the only slow action is to collect results from htlcswitch via,

Depending on when the payment SQL refactor is planned, I would prefer going with a simple fix first and then proceeding from there. Your recommendation of writing to the DB in resumePayment is definitely a good way to solve it, and it seems not super complicated.

While looking through the code I am wondering why we mutate the state of the payment in a fetch method, fetchPayment (lnd/channeldb/payments.go, lines 314 to 319 in ab96de3):

```go
func fetchPayment(bucket kvdb.RBucket) (*MPPayment, error) {
	...

	// Set its state and status.
	if err := payment.setState(); err != nil {
		return nil, err
	}

	return payment, nil
}
```

I think we should not change the state in a fetch method, or we should rename this fetch method to an update method, wdyt?
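As a sketch of the separation suggested here (stand-in types, not channeldb's actual layout): fetchPayment would only deserialize the record, and deriving the status becomes an explicit, separately named step the caller invokes, so the mutation is visible at the call site.

```go
package main

import "fmt"

// mppPayment is a stand-in for channeldb's MPPayment; its status is
// derived from the raw HTLC records rather than stored directly.
type mppPayment struct {
	numInflight int
	status      string
}

// fetchPayment, in this sketch, only deserializes the record and
// leaves it untouched (here simulated with a plain constructor).
func fetchPayment(inflight int) *mppPayment {
	return &mppPayment{numInflight: inflight}
}

// setState derives the status from the HTLC records. Making this an
// explicit step (or renaming the combined helper to something like
// fetchAndUpdatePayment) keeps the mutation out of the "fetch" name.
func (p *mppPayment) setState() {
	if p.numInflight > 0 {
		p.status = "INFLIGHT"
	} else {
		p.status = "TERMINAL"
	}
}

func main() {
	p := fetchPayment(1)
	p.setState() // mutation is now explicit at the call site
	fmt.Println(p.status)
}
```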

michael1011 (Contributor) commented:

This keeps happening for us on a regular basis on mainnet. There is no way to prevent it apart from not using MPP and even that is just an educated guess.

@yyforyongyu yyforyongyu added this to the v0.19.0 milestone Aug 27, 2024
@yyforyongyu yyforyongyu linked a pull request Oct 2, 2024 that will close this issue
calvinrzachman (Contributor) commented:

Following this issue at somewhat of a distance at the moment, but very curious to see the native payment SQL PRs and any continued development on @yyforyongyu's comment:

> unless we can reconstruct them, say, if htlcswitch has stored it on disk.

Coming from the perspective of running the ChannelRouter remotely from the Switch and communicating via RPC, I get the feeling that the Switch store might need some upgrades. In such a scenario, the ChannelRouter drives state transitions for the status of a locally initiated payment (i.e. a non-forward) using state maintained by a remote Switch. The existing networkResultStore may not be quite sufficient for this.
