<!DOCTYPE html>
<html data-wf-page="5f71dd169010d6326b65485d">
<head>
<meta charset="utf-8" />
<title>Triage • Case Study</title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
<link href="assets/css/style.css" rel="stylesheet" type="text/css" />
<script
src="https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js"
type="text/javascript"
></script>
<script src="/assets/scripts/collapseScript.js"></script>
<link
rel="stylesheet"
href="https://fonts.googleapis.com/css?family=Inter:regular,500,600,700"
media="all"
/>
<link
rel="stylesheet"
href="https://fonts.googleapis.com/css?family=Work+Sans"
/>
<link
rel="stylesheet"
href="https://fonts.googleapis.com/css?family=Nunito+Sans"
/>
<meta
name="image"
property="og:image"
content="assets/images/synapse-case-study.png"
/>
<script type="text/javascript">
WebFont.load({ google: { families: ["Inter:regular,500,600,700"] } });
</script>
<script type="text/javascript">
!(function (o, c) {
var n = c.documentElement,
t = " w-mod-";
(n.className += t + "js"),
("ontouchstart" in o ||
(o.DocumentTouch && c instanceof DocumentTouch)) &&
(n.className += t + "touch");
})(window, document);
</script>
<link
href="assets/images/logo-color.png"
rel="shortcut icon"
type="image/x-icon"
/>
<link href="assets/images/logo-mono.png" rel="apple-touch-icon" />
<script
src="https://kit.fontawesome.com/d019875f94.js"
crossorigin="anonymous"
></script>
<meta
name="image"
property="og:image"
content="assets/images/thumbnail.png"
/>
</head>
<body>
<!--Navbar-->
<div class="navigation-wrap" id="navbar">
<div
data-collapse="medium"
data-animation="default"
data-duration="400"
role="banner"
class="navigation w-nav"
>
<div class="navigation-container">
<div class="navigation-left">
<a
href="/"
aria-current="page"
class="brand w-nav-brand w--current"
aria-label="home"
>
<img
src="assets/images/logo-color.png"
alt=""
class="template-logo"
onmouseover="hover(this);"
onmouseout="unhover(this);"
/>
</a>
<nav role="navigation" class="nav-menu w-nav-menu">
<a href="/case-study" class="link-block w-inline-block">
<div>Case Study</div>
</a>
<a href="/team" class="link-block w-inline-block">
<div>Team</div>
</a>
</nav>
</div>
<div class="navigation-right">
<div class="login-buttons">
<a
href="https://github.com/Team-Triage"
target="_blank"
>
<span style="color: #FFFFFF">
<i class="fab fa-github fa-lg"></i>
</span>
</a>
</div>
</div>
</div>
<div class="w-nav-overlay" data-wf-ignore="" id="w-nav-overlay-0"></div>
</div>
</div>
<!-- <div class="wrapper"></div> -->
<div id="sidebar" class="toc"></div>
<div class="section header">
<article class="container case-study-container">
<div class="hero-text-container">
<h1 class="h1 centered">Triage - A Kafka Proxy</h1>
</div>
<div id="case-study">
<br />
<br />
<!-- Introduction -->
<div class="case-study-section">
<h2 class="h2">Introduction</h2>
<div class="case-study-subsection">
<p>
Triage is an open-source consumer proxy for Apache Kafka that solves head-of-line blocking
(HoLB) caused by poison pill messages and non-uniform consumer latency. Once deployed, Triage
identifies poison pill messages and delivers them to a dead letter store. By fanning messages
out to additional consumer instances, Triage uses parallelism to ensure that an unusually
slow message does not block the queue.
</p>
<p>
Our goal was to create a service that could deal with HoLB in a message queue while making
it easy for consumer application developers to maintain their existing workflow.
</p>
<p>
This case study will begin by exploring the larger context of microservices and the role of
message queues in facilitating event-driven architectures. It will also describe some of the
basics regarding Kafka’s functionality and how HoLB can affect consumers, followed by an overview
of existing solutions. Finally, we will dive into the architecture of Triage, discuss important
technical decisions we made, and outline the key challenges we faced during the process.
</p>
</div>
</div>
<!-- Section 1: Problem Domain Setup -->
<div class="case-study-section">
<h2 class="h2">Problem Domain Setup</h2>
<div class="case-study-subsection">
<h3>The World of Microservices</h3>
<p>
Over the last decade, microservices have become a popular architectural
choice for building applications. By one estimate from 2020, 63% of enterprises
have adopted microservices, and many are satisfied with the tradeoffs <a href="https://dzone.com/articles/new-research-shows-63-percent-of-enterprises-are-a ">[1]</a>.
Decoupling services often leads to faster development time since work on
different services can be done in parallel. Additionally, many companies benefit
from the ability to independently scale individual components of their architecture,
and this same decoupling makes it easier to isolate failures in a system.
</p>
<p>
Microservice architectures are flexible enough to allow different technologies and
languages to communicate within the same system, creating a polyglot environment.
This flexibility enables a multitude of different approaches for achieving reliable
intra-system communication.
</p>
<p>
Two common options are the request-response pattern and event-driven architecture (EDA).
Although the latter is where our focus lies, it is useful to have some context on the shift
toward EDAs.
</p>
</div>
<div class="case-study-subsection">
<h3>From Request-Response to Event-Driven Architecture</h3>
<p>
The request-response pattern will be familiar from the web, and it works the same way between
microservices. Imagine a number of interconnected microservices: one sends a request
to another and waits for a response. If any service in this chain experiences lag or
failure, slowdowns cascade throughout the entire system.
</p>
<figure>
<img
src="assets/images/case-study/req-res.gif"
class="case-study-image"
/>
</figure>
<p>
In an EDA, however, the approach is centered around “events”, which can be thought of
as any changes in state or notifications about a change. The key advantage is that each service
can operate without concern for the state of any other service - they perform their
tasks without interacting with other services in the architecture. EDAs are often implemented using
message queues. Producers write events to the message queue, and consumers read events off of it. For
example, imagine an online store - a producer application might detect that an order has been submitted
and write an “order” event to the queue. A consumer application could then see that order, dequeue it,
and process it accordingly.
</p>
<figure>
<img
src="assets/images/case-study/eda.gif"
class="case-study-image"
/>
</figure>
</div>
</div>
<!-- Section 2: Apache Kafka -->
<div class="case-study-section">
<h2 class="h2">Apache Kafka</h2>
<div class="case-study-subsection">
<h3>What is Kafka?</h3>
<p>
In a traditional message queue, events are read and then removed from the
queue. An alternative approach is to use log-based message queues, which persist
events to a log. Among log-based message queues, Kafka is the most popular
– over 80% of Fortune 100 companies use Kafka as part of their architecture <a href="https://kafka.apache.org/">[2]</a>.
Kafka is designed for parallelism and scalability and maintains the intended
decoupling of an EDA. In Kafka, events are called messages.
</p>
</div>
<div class="case-study-subsection">
<h3>How Does Kafka Work?</h3>
<p>
Typically, when talking about Kafka, we are referring to a Kafka cluster
- a cluster is composed of several servers, referred to as brokers, working in
conjunction. A broker receives messages from producers, persists them, and makes
them available to consumers.
</p>
<p>
Topics are named identifiers used to group messages together. Topics, in turn,
are broken down into partitions. To provide scalability, each partition of a given
topic can be hosted on a different broker. This means that a single topic can be
scaled horizontally across multiple brokers to provide performance beyond the ability
of a single broker. Each instance of a consumer application can then read from a
partition, allowing for parallel processing of messages within a topic.
</p>
<figure>
<img
src="assets/images/case-study/kafka.png"
class="case-study-image"
/>
<figcaption>
A Kafka topic with two partitions
</figcaption>
</figure>
<p>
Consumers are organized into consumer groups under a common group ID to enable Kafka’s
internal load balancing. It is important to note that while a consumer instance can consume
from more than one partition, a partition can only be consumed by a single consumer instance.
If the number of instances is higher than the number of available partitions,
some instances will remain inactive.
</p>
<figure>
<img
src="assets/images/case-study/kafka.gif"
class="case-study-image"
/>
<figcaption>
One consumer instance reads from one partition
</figcaption>
</figure>
<p>
Internally, Kafka uses a mechanism called “commits” to track the successful processing of
messages. Consumer applications periodically send commits back to the Kafka cluster, indicating
the last message they’ve successfully processed. Should a consumer instance go down, Kafka
will have a checkpoint to resume message delivery from.
</p>
<figure>
<img
src="assets/images/case-study/commits.png"
class="case-study-image"
/>
<figcaption>
Kafka will resume message delivery from offset 51
</figcaption>
</figure>
</div>
</div>
<!-- Section 3: Problem Description -->
<div class="case-study-section">
<h2 class="h2">Problem Description</h2>
<div class="case-study-subsection">
<h3>Head-of-Line Blocking in Kafka</h3>
<p>
A significant problem that can be experienced when using message queues
is head-of-line blocking (HoLB). HoLB occurs when a message at the head
of the queue blocks the messages behind it. Since Kafka’s partitions are
essentially queues, messages may block the line for two common reasons –
poison pills and unusually slow messages.
</p>
</div>
<div class="case-study-subsection">
<h3>Poison Pills</h3>
<p>
Poison pills are messages that a consumer application receives but cannot
process. Messages can become poison pills for a host of reasons, such as
corrupted or malformed data.
</p>
<h4>HoLB Due to Poison Pills</h4>
<p>
To better understand how poison pills cause HoLB, imagine an online vendor
tracking orders on a website. Each order is produced to an orders topic.
A consumer application is subscribed to this topic and needs to process
each message so that a confirmation email for orders can be sent to customers.
</p>
<p>
The consumer application expects to receive a message that contains an
integer for the <span class="snippet">product_id</span> field, but instead, it receives a message with
no value for that field. With no mechanism to deal with this poison pill,
processing halts. This will stop all orders behind the message in question
even though they could be processed without problems.
</p>
<figure>
<img
src="assets/images/case-study/poison.gif"
class="case-study-image"
/>
<figcaption>
Consumer application crashes due to poison pill
</figcaption>
</figure>
</div>
<div class="case-study-subsection">
<h3>Non-Uniform Consumer Latency</h3>
<p>
Slow messages can cause non-uniform consumer latency,
where a consumer takes an unusually long time to process a message.
For instance, suppose a consumer application makes a call to one of
many external services based on the contents of a message. If one of
these external services is sluggish, a message's processing time
will be unusually slow. Messages in the queue that don’t rely on the
delayed external service will also experience an increase in processing latency.
</p>
<h4>HoLB Due to Non-Uniform Consumer Latency</h4>
<p>
To illustrate how non-uniform consumer latency causes HoLB, imagine a
consumer application that is subscribed to the
<span class="snippet">greenAndOrangeMessages</span> topic. It receives the messages and routes them to one of two
external services based on their color.
</p>
<ol class="numbered-list">
<li>If the message is green, it is sent to the green external service, called <span class="snippet">Green Service</span>.
</li>
<li> If the message is orange, it is sent to the orange external service, called <span class="snippet">Orange Service</span>.
</li>
</ol>
<p>
As the consumer is pulling messages, there’s a sudden spike in latency in
the response from <span class="snippet">Orange Service</span>.
When the consumer calls <span class="snippet">Orange Service</span> while processing
the orange message, the lack of response blocks the processing of all messages behind it.
</p>
<figure>
<img
src="assets/images/case-study/nucl.gif"
class="case-study-image"
/>
<figcaption>
HoLB due to non-uniform consumer latency
</figcaption>
</figure>
<p>
Although all the messages behind the orange message are green, they cannot be
processed by the consumer, even though <span class="snippet">Green Service</span> is functioning normally.
Here, non-uniform consumer latency slows down the entire partition and causes HoLB.
</p>
<p>
The consequences of HoLB in a message queue can range from disruptive, such as slow
performance, to fatal - potential crashes. An obvious solution to these issues might
be simply dropping messages; however, in many cases, data loss is unacceptable. For
our use case, an ideal solution would retain all messages.
</p>
</div>
<div class="case-study-subsection">
<h3>Solution Requirements</h3>
<p>
Based on the problem space described, we determined the following requirements for a solution:
</p>
<ol class="numbered-list">
<li>It should be publicly available to consumer application developers.</li>
<li>It should serve developers working in a polyglot microservices environment.</li>
<li>It should prevent data loss (messages should never be dropped).</li>
<li>It should integrate smoothly into existing architectures.</li>
<li>It should be easily deployed regardless of the user’s cloud environment (if any).</li>
</ol>
</div>
</div>
<!-- Section 4: Alternative Approaches -->
<div class="case-study-section">
<h2>Alternative Approaches</h2>
<p>
With the aforementioned requirements in mind, we extensively researched
existing solutions and approaches to solving HoLB. The solutions we found
ranged from built-in Kafka configurations to service models built to support
large Kafka deployments.
</p>
<div class="case-study-subsection">
<h3>Kafka Auto-Commit</h3>
<p>
By default, the Kafka consumer library sends commits back to Kafka every 5
seconds, regardless of whether a message has been successfully processed.
Where data loss is not an issue, auto-commit is a reasonable solution to HoLB.
If a problematic message is encountered, the application can simply drop the
message and move on. However, where data loss is unacceptable, this approach
will not work.
</p>
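<p>
For illustration, the sketch below shows what this default behavior looks like in consumer
configuration. It assumes the confluent-kafka-go client (one common Go client, not
necessarily the one a given application uses); the broker address and group ID are placeholders.
</p>
<pre><code>package main

import (
    "fmt"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    // With these standard settings, offsets are committed on a timer,
    // regardless of whether messages were processed successfully.
    consumer, err := kafka.NewConsumer(&amp;kafka.ConfigMap{
        "bootstrap.servers":       "localhost:9092",   // placeholder broker
        "group.id":                "orders-consumers", // placeholder group ID
        "enable.auto.commit":      true,
        "auto.commit.interval.ms": 5000, // commit every 5 seconds by default
    })
    if err != nil {
        panic(err)
    }
    defer consumer.Close()
    fmt.Println("consumer created with auto-commit enabled")
}
</code></pre>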
</div>
<div class="case-study-subsection">
<h3>Confluent Parallel Consumer</h3>
<p>
Confluent Parallel Consumer (CPC) is a Java Kafka Consumer library that seemingly
addresses HoLB by offering an increase in parallelism beyond partition count for
a given topic <a href="https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/">[3]</a>. It operates by processing messages in parallel using multiple
threads on the consumer application’s host machine.
</p>
<p>
While CPC is an attractive solution, there were a few areas where it differed from
our design requirements. The most obvious shortcoming for us was the fact that
it's written in Java. In modern polyglot microservice environments, this presents
a notable drawback - any developer wanting to utilize the advantages of CPC would need
to rewrite their applications in Java.
</p>
<p>
Additionally, our requirements did not permit data loss; while setting up data loss
prevention with CPC is feasible, we sought a solution that came with this functionality
out of the box.
</p>
</div>
<div class="case-study-subsection">
<h3>Kafka Workers (DoorDash)</h3>
<p>
DoorDash chose to leverage Kafka to help them achieve their goals of rapid
throughput and low latency. Unfortunately, their use of Kafka introduced
HoLB caused by non-uniform consumer latency.
</p>
<p>
The worker-based solution that DoorDash implemented to address this problem
consists of a single Kafka consumer instance per partition, called a "worker,"
which pipes messages into a local queue <a href="https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/">[4]</a>. Processes called <span class="snippet">task-executors</span>, within the "worker"
instance, then retrieve the events from this queue and process them.
</p>
<p>
This solution allows events on a single partition to be processed by multiple
<span class="snippet">task-executors</span> in parallel. If a single message is
slow to process, it doesn’t impact the processing time of other messages. Other
available <span class="snippet">task-executors</span>
can read off the local queue and process messages even though a message at the
head might be slow.
</p>
<p>
While this solution solves HoLB caused by non-uniform consumer latency, it did not fit our design
requirements due to its lack of data loss prevention. According to DoorDash,
if a worker crashes, messages within its local queue may be lost. As previously
established, data loss prevention was a strict design requirement for us, making
this approach a poor fit for our use case.
</p>
</div>
<div class="case-study-subsection">
<h3>Consumer Proxy Model (Uber)</h3>
<p>
Uber sought to solve HoLB caused by non-uniform consumer latency and poison pills while ensuring at-least-once
delivery since they deemed data loss intolerable.
</p>
<p>
Their solution, Consumer Proxy, solves HoLB by acting as a proxy between the Kafka
cluster and multiple instances of the consumer application <a href="https://www.uber.com/en-CA/blog/kafka-async-queuing-with-consumer-proxy/">[5]</a>. With this approach,
messages are ingested and then processed in parallel by consumer instances.
Consumer Proxy also uses a system of internal acknowledgments sent by consumer instances,
indicating the successful processing of a message. Consumer Proxy only commits messages
back to Kafka which have been successfully processed. If a message cannot be processed, a
dead-letter queue is used to store it for later retrieval.
</p>
<p>
Uber’s Consumer Proxy is a feature-rich solution that seems to fulfill all of our requirements.
It eliminated HoLB due to the two causes our team was concerned with while avoiding data loss.
That being said, Consumer Proxy is an in-house solution that is not publicly
available for consumer application developers.
</p>
<figure>
<img
src="assets/images/case-study/existing_solutions.png"
class="case-study-image"
/>
<figcaption>
Comparison of existing solutions
</figcaption>
</figure>
</div>
</div>
<!-- Section 5: Introducing Triage -->
<div class="case-study-section">
<h2>Introducing Triage</h2>
<figure>
<img
src="assets/images/case-study/triage.png"
class="case-study-image"
/>
</figure>
<p>
Based on our research, none of the solutions fit all of our requirements – they either were not
supported in multiple languages, failed to solve HoLB for both causes identified, or were not
publicly available. We chose Uber's consumer proxy model as the basis for Triage because it
solved both causes of HoLB and was language agnostic. As seen in the figure above, a Triage
instance acts as a Kafka consumer proxy and passes messages to downstream consumer instances.
</p>
<div class="case-study-subsection">
<h3>How Does Triage Work?</h3>
<p>
Triage will subscribe to a topic on the Kafka cluster and begin consuming messages. When
a message is consumed, it is sent to an instance of a consumer application. This consumer
instance will process the message and send back a status code that reflects whether or not
a message has been successfully processed. Triage uses an internal system of
<span class="snippet">acks</span> and <span class="snippet">nacks</span>
(acknowledgments and negative acknowledgments) to identify healthy versus poison pill messages.
</p>
<p>
Internally, Triage uses a <span class="snippet">Commit Tracker</span> to determine which messages have been successfully
acknowledged and can be committed back to Kafka. Once it has done so, those records are deleted
from the tracker. For messages that have been negatively acknowledged, Triage utilizes
the dead-letter pattern to avoid data loss.
</p>
</div>
<div class="case-study-subsection">
<h3>Triage Solves HoLB Caused By Poison Pills</h3>
<p>
When a poison pill is encountered, the consumer instance will send back a <span class="snippet">nack</span> for that message.
A <span class="snippet">nack</span> directs Triage to deliver the message record in its entirety to a DynamoDB table. Here,
it can be accessed at any point in the future for further analysis or work. The partition will
not be blocked, and messages can continue to be consumed uninterrupted.
</p>
<figure>
<img
src="assets/images/case-study/poison-solved.gif"
class="case-study-image"
/>
<figcaption>
Triage solves HoLB due to poison pills
</figcaption>
</figure>
</div>
<div class="case-study-subsection">
<h3>Triage Solves HoLB Caused by Non-Uniform Consumer Latency</h3>
<p>
With Triage, if a consumer instance takes an unusually long time to process a message,
the partition remains unblocked. Messages can continue to be processed using other available
consumer instances. Once the consumer instance finishes processing the slow message, it can
continue processing messages.
</p>
<figure>
<img
src="assets/images/case-study/nucl-solved.gif"
class="case-study-image"
/>
<figcaption>
Triage solves HoLB due to non-uniform consumer latency
</figcaption>
</figure>
</div>
<div class="case-study-subsection">
<h3>How Can I Use Triage?</h3>
<h4>Deploying Triage</h4>
<p>
Triage can be deployed using our triage-cli command line tool, available as an
NPM package. It offers a two-step process that deploys Triage to AWS. You can read
our step-by-step instructions here: <a target="_blank" href="https://github.com/team-triage/triage-cli#readme">Triage CLI</a>.
</p>
<h4>Connecting to Triage</h4>
<p>
Consumer applications can connect to Triage using our <a target="_blank" href="https://github.com/team-triage/triage-client-go#readme">thin client library</a>, currently
offered in Go. It handles authenticating with and connecting to Triage and provides
an easy-to-use interface for developers to indicate whether a message has been
processed successfully.
</p>
<figure>
<img
src="assets/images/case-study/triage-dummy-consumer.png"
class="case-study-image"
/>
<figcaption>An example of a simple consumer application</figcaption>
</figure>
</div>
</div>
<!-- Section 6: Triage Design Challenges -->
<div class="case-study-section">
<h2>Triage Design Challenges</h2>
<p>
Based on our requirements for Triage, we encountered a few challenges. Below,
we’ll present our reasoning behind the solutions we chose and how they allowed
us to fulfill all of our solution requirements.
</p>
<div class="case-study-subsection">
<h3>Polyglot Support</h3>
<h4>Challenge</h4>
<p>
We knew we wanted Triage to be language-agnostic – a consumer application
should be able to connect to Triage, regardless of the language it’s written
in. To do this, we had to consider whether Triage would exist as a service between
Kafka and the consumer or as a client library on the consumer itself. We also
needed to decide on a suitable network protocol.
</p>
<h4>Solution</h4>
<p>
By leveraging a service + thin client library implementation and gRPC code generation,
we can build out support for consumer applications written in any language with relative ease.
</p>
<h5>Service vs. Client Library</h5>
<p>
On one hand, a client library offers simplicity of implementation and testing, as well as the
advantage of not having to add any new pieces of infrastructure to a user’s system.
We could also expect to get buy-in from developers with less pushback, as testing a
client library with an existing system is more manageable than integrating a new service.
</p>
<p>
There were, however, some disadvantages with this approach. Our solution
to addressing non-uniform consumer latency relies on parallel processing of a single partition. While,
in theory, a client library could support multiple instances of a consumer application, a service
implementation is more straightforward. Even if a client library were to be designed to
dispatch messages to multiple consumer instances, it would begin to resemble a service implementation.
</p>
<p>
Another concern of ours was ease of maintainability. Within modern polyglot microservice
environments, maintenance of client libraries written in multiple languages consumes a non-trivial
amount of engineering hours. Changes in the Kafka version and the dependencies of the client
libraries themselves could cause breaking changes that require time to resolve. We assumed
that those hours could be better spent on core application logic.
</p>
<button type="button" class="collapsible">Kafka configs =></button>
<div class="content">
<p>
Kafka can be difficult to work with. While the core concepts of Kafka are relatively
straightforward to understand, in practice, interaction with a Kafka cluster involves
a steep learning curve. There are over 40 configuration settings that a Kafka client
can specify, making setting up an optimal or even functional consumer application
difficult. Uber, for example, noted that their internal Kafka team was spending about
half of their working hours troubleshooting for consumer application developers <a href="https://videos.confluent.io/watch/Jw7M7MpKVE5sS4hu2dWGvU">[6]</a>.
</p>
</div>
<p>
By centralizing the core functionality of Triage to a service running in the cloud and only
utilizing a thin client library for connecting to Triage, support and maintenance become
easier. Triage’s client library is simple – it makes an initial HTTP connection request
with an authentication key provided by the developer and runs a gRPC server that listens
for incoming messages. Implementing support in additional languages for this thin client
library is straightforward, and much of the challenge around configuring a Kafka consumer
is abstracted away from the developer.
</p>
<h5>Network Protocol</h5>
<p>
The next decision that we faced was choosing an appropriate network protocol for communication
with consumer applications. HTTP was an obvious consideration both for its ubiquity and ease
of implementation; however, after further research, we felt gRPC was the better option <a href="https://learn.microsoft.com/en-us/aspnet/core/grpc/comparison?view=aspnetcore-7.0">[10]</a>.
</p>
<p>
gRPC allows us to leverage the benefits of HTTP/2 over HTTP/1.1, specifically regarding the
size of traffic we send and receive. gRPC serializes messages with protocol buffers, a compact
binary format that produces much smaller payloads than the de-facto standard of JSON typically
sent over HTTP/1.1. Smaller payloads mean less data to send over the network and, ultimately,
faster throughput.
</p>
<button type="button" class="collapsible">gRPC vs JSON + gZip =></button>
<div class="content">
<p>
A counterpoint to the compression argument is the existence and growing popularity of JSON
with gzip. Compression gains from protocol buffers compared to JSON with gzip are less impressive;
however, we run into similar dependency pains mentioned in our discussion of service versus client library
implementations. Each version of the thin client library we would potentially write must
import its own language’s implementation of gzip.
</p>
</div>
<p>
gRPC also makes it easy to build out support for multiple languages via out-of-the-box code
generation. Using the same gRPC files we’ve used for Triage's Go client library, we can
utilize a simple command-line tool to generate gRPC server and client implementations in
all major programming languages.
</p>
</div>
<div class="case-study-subsection">
<h3>Enabling Parallel Consumption</h3>
<h4>Challenge</h4>
<p>
Since Triage operates by dispatching messages to several consumer
instances, we needed a way to send messages to, and receive responses from, them simultaneously.
We knew that the language we chose would play a significant role in solving this
challenge.
</p>
<h4>Solution</h4>
<p>
By creating dedicated Goroutines for each consumer instance and synchronizing them
with the rest of Triage via channels, we enable parallel consumption of a single
Kafka partition.
</p>
<h5>Language</h5>
<p>
We chose Go for the relative simplicity of implementing concurrency via Goroutines
and the ease of synchronization and passing data across these Goroutines via channels.
</p>
<h6>Goroutines</h6>
<p>
Goroutines can be thought of as lightweight threads managed by the Go runtime, allowing
functions to run concurrently with other functions <a href="https://softwareengineeringdaily.com/2021/03/03/why-we-switched-from-python-to-go/">[11]</a>. The resource overhead of creating and running a Goroutine is
negligible, so it’s not uncommon to see programs with thousands of Goroutines. This
sort of multithreaded behavior is easy to use with Go, as a generic function can be
turned into a Goroutine by simply prepending its invocation with the keyword <span class="snippet">go</span>. Each
major component of Triage exists as a Goroutine, which often relies on other
underlying Goroutines. Channels are used extensively to pass data between these components
and achieve synchronization where needed.
</p>
<p>
In the figure below, <span class="snippet">myFunc()</span>'s execution will block execution
of the print statement <em>"I'm after myFunc!"</em>
</p>
<figure>
<img src="assets/images/case-study/nogoroutine.PNG">
<figcaption>Execution without Goroutines</figcaption>
</figure>
<p>
Conversely, the invocation of <span class="snippet">myFunc()</span> below is prepended with the
<span class="snippet">go</span> keyword. It will now execute in the background, as a Goroutine, allowing
the execution of the print statement.
</p>
<figure>
<img src="assets/images/case-study/withgoroutine.png">
<figcaption>Execution with Goroutine</figcaption>
</figure>
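<p>
The same idea in a minimal, self-contained sketch (function names mirror the figures
above and are purely illustrative):
</p>
<pre><code>package main

import (
    "fmt"
    "time"
)

func myFunc() {
    time.Sleep(time.Second) // simulate slow work
    fmt.Println("myFunc finished")
}

func main() {
    // Prepending the invocation with "go" runs myFunc as a Goroutine,
    // so the next line executes immediately instead of waiting.
    go myFunc()
    fmt.Println("I'm after myFunc!")

    time.Sleep(2 * time.Second) // keep main alive until the Goroutine finishes
}
</code></pre>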
<h6>Channels</h6>
<p>
Channels in Go are queue-like data structures that facilitate communication between
Goroutines within a Go application. Channels support passing both primitives and structs.
We can think of a function that writes a given value to a channel as a “sender” and a function
that reads said value off of the channel as a “receiver.” With an unbuffered channel, when a sender
attempts to write a message but there is no receiver attempting to pull a message off of the
channel, code execution is blocked until a receiver is ready. Similarly, if a receiver attempts
to read a message off of the channel when there is no sender, code execution is blocked
until a sender writes a message.
</p>
<figure>
<img src="assets/images/case-study/channels.PNG">
<figcaption>A simple example of a channel</figcaption>
</figure>
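<p>
A minimal, self-contained sketch of the sender/receiver behavior described above
(illustrative only, not code from Triage):
</p>
<pre><code>package main

import "fmt"

func main() {
    ch := make(chan string) // an unbuffered channel of strings

    // Sender: a Goroutine that writes one value to the channel.
    go func() {
        ch &lt;- "hello from the sender"
    }()

    // Receiver: blocks until the sender has written a value.
    msg := &lt;-ch
    fmt.Println(msg)
}
</code></pre>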
</div>
<div class="case-study-subsection">
<h3>Ease of Deployment</h3>
<h4>Challenge</h4>
<p>
We wanted to make sure that deploying Triage was as simple as possible for consumer
application developers. Our goal was for setup, deployment, and teardown to be painless.
</p>
<h4>Solution</h4>
<p>
By taking advantage of the AWS Cloud Development Kit (CDK) in conjunction with AWS Fargate
on Elastic Container Service (ECS), we were able to create an automated deployment script that
interacts with our command line tool, triage-cli. This allows users to deploy a failure-resistant
Fargate service to AWS in just a few easy steps.
</p>
<h5>AWS Fargate on ECS</h5>
<p>
We chose AWS for its wide geographic distribution, general industry familiarity,
and support for containerized deployments.
</p>
<p>
AWS offers services in virtually every region of the world, meaning that developers
looking to use Triage can deploy anywhere. We selected Fargate as the deployment
strategy for Triage containers, removing the overhead of provisioning and managing
individual virtual private server instances. Instead, we could concern ourselves
with relatively simple ECS task and service definitions.
</p>
<p>
In our case, we define a task as a single container running Triage. Our service definition is a collection
of these tasks, with the number of tasks being equal to the number of partitions for a given topic.
This service definition is vital to how we guard against failure - if a Triage container
were to crash, it would be scrapped and another would immediately be provisioned automatically.
</p>
<button type="button" class="collapsible">Health Checks and Logs =></button>
<div class="content">
<p>
Using Fargate doesn’t mean that we sacrifice any of the benefits
that come with ECS since Fargate exists on top of ECS. Health checks and
logs, as well as all of the infrastructure created during the deployment,
are available to and owned by a user, since Triage deploys using the AWS
account logged into the AWS CLI.
</p>
</div>
<p>
Automated deployment is enabled by CDK.
Manually deploying Triage would require an understanding of cloud-based networking
that would increase technical overhead. Amazon’s CDK abstracts that away – instead
of having to set up the dozens of individual entities that must be provisioned and
interconnected for a working cloud deployment, we were able to use ready-made templates
provided by CDK for a straightforward deployment script.
</p>
<h5>triage-cli</h5>
<p>
We created triage-cli to interact with the deployment script created with AWS CDK. This allows
us to interpolate user-specific configuration details into the script and deploy using just two
commands – <span class="snippet">triage init</span> and <span class="snippet">triage deploy</span>.
</p>
</div>
</div>
<!-- Section 7: Implementation-->
<div class="case-study-section">
<h2>Implementation</h2>
<figure>
<img
src="assets/images/case-study/application_logic.png"
class="case-study-image"
/>
</figure>
<p>
Having found solutions to our design challenges, our next step in developing Triage was
implementation. In this section, we will discuss the components that make up the application
logic of a Triage container, as well as provide a brief overview of how our thin client library
interacts with it. We will address implementation with the following subsections:
</p>
<ol class="numbered-list">
<li><strong>Message Flow</strong> - How Triage pulls messages from Kafka and sends them to consumers</li>
<li><strong>Consumer Instances</strong> - How a consumer instance receives messages from Triage and responds</li>
<li><strong>Handling Acks/Nacks</strong> - How Triage handles these responses</li>
<li><strong>Commits</strong> - How Triage handles commits</li>
</ol>
<div class="case-study-subsection">
<h3>Message Flow</h3>
<p>
We will start by describing the flow of messages from Kafka, through Triage,
to downstream consumer instances.
</p>
<h4>Fetcher</h4>
<p>
The <span class="snippet">Fetcher</span> component is an instance of a Kafka consumer – it periodically polls Kafka for
messages. It then writes these messages to the <span class="snippet">Commit Tracker</span> component and sends them to
the messages channel. We will discuss the <span class="snippet">Commit Tracker</span> in a later section - for now, it is
sufficient to know that a copy of a message ingested by Triage is stored in a hashmap,
and a reference to this message is sent over the <span class="snippet">messages</span> channel.
</p>
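<p>
A hedged sketch of what such a polling loop might look like. The function name, the
<span class="snippet">store</span> callback standing in for the <span class="snippet">Commit Tracker</span>,
and the use of the confluent-kafka-go client are all assumptions for illustration, not Triage's actual code.
</p>
<pre><code>// fetcher polls Kafka, records each message for later commit tracking,
// and forwards a reference over the messages channel.
func fetcher(consumer *kafka.Consumer, store func(*kafka.Message), messages chan&lt;- *kafka.Message) {
    for {
        msg, err := consumer.ReadMessage(-1) // block until a message arrives
        if err != nil {
            continue // in practice, log and inspect the error
        }
        store(msg)      // keep a copy keyed by offset in the Commit Tracker
        messages &lt;- msg // hand a reference to the dispatching side
    }
}
</code></pre>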
<h4>Consumer Manager</h4>
<p>
At this point in the flow, messages from Kafka are sitting in the <span class="snippet">messages</span> channel and are
ready to be processed. The <span class="snippet">Consumer Manager</span> component runs a simple HTTP server that listens
for incoming requests from consumer instances. After authenticating a request, the <span class="snippet">Consumer
Manager</span> parses the consumer instance’s network address and writes it to the <span class="snippet">newConsumers</span> channel.
</p>
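<p>
A hypothetical sketch of that registration handler using Go's standard
<span class="snippet">net/http</span> package. The authentication scheme and the way the
address is passed are assumptions for illustration only.
</p>
<pre><code>// consumerManager returns an HTTP handler that authenticates a consumer
// instance and pushes its network address onto the newConsumers channel.
func consumerManager(newConsumers chan&lt;- string, authKey string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("Authorization") != authKey { // placeholder auth check
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        addr := r.URL.Query().Get("address") // hypothetical field, e.g. "10.0.0.7:50051"
        newConsumers &lt;- addr                 // hand the address to Dispatch
        w.WriteHeader(http.StatusOK)
    }
}
</code></pre>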
<p>
To recap, we now have messages from Kafka in a <span class="snippet">messages</span> channel and
network addresses to send them to in a <span class="snippet">newConsumers</span> channel.
</p>
<h4>Dispatch</h4>
<p>
Triage’s <span class="snippet">Dispatch</span> component is responsible for getting messages from within Triage to the
consumer instances. We can think of <span class="snippet">Dispatch</span> as a looping function that waits for network
addresses on the <span class="snippet">newConsumers</span> channel. When it receives a network address, it uses it to
instantiate a gRPC client – think of this as a simple agent to make network calls. When
this client is created, a connection is established between the client and the consumer
at the network address.
</p>
<h4>senderRoutine</h4>
<p>
Dispatch then calls a function called <span class="snippet">senderRoutine</span>, passing it the gRPC client as a
parameter. <span class="snippet">senderRoutine</span> is invoked as a Goroutine, ensuring that when <span class="snippet">Dispatch</span> loops
and listens for the next network address, <span class="snippet">senderRoutine</span> continues to run in the background.
</p>
<p>
<span class="snippet">senderRoutine</span> is essentially a <span class="snippet">for</span> loop. First, a message is pulled off of the messages
channel. The gRPC client passed to <span class="snippet">senderRoutine</span> is then used to send this message over
the network to the consumer instance. The <span class="snippet">senderRoutine</span> now waits for a response.
</p>
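<p>
A rough sketch of how <span class="snippet">Dispatch</span> and
<span class="snippet">senderRoutine</span> might fit together. The
<span class="snippet">triageClient</span> type, its <span class="snippet">Send</span> method,
and the response handling are hypothetical stand-ins for the generated gRPC client, not Triage's actual API.
</p>
<pre><code>// dispatch waits for consumer addresses and starts one senderRoutine per consumer.
func dispatch(newConsumers &lt;-chan string, messages &lt;-chan *kafka.Message, acks chan&lt;- acknowledgment) {
    for addr := range newConsumers {
        client := newTriageClient(addr) // hypothetical: dial a gRPC connection
        go senderRoutine(client, messages, acks)
    }
}

// senderRoutine pulls messages off the shared channel, sends each to its
// consumer instance, and reports the ack or nack it receives.
func senderRoutine(client triageClient, messages &lt;-chan *kafka.Message, acks chan&lt;- acknowledgment) {
    for msg := range messages {
        ok := client.Send(msg) // hypothetical call: deliver and wait for a response
        if ok {
            acks &lt;- acknowledgment{Status: 1}
        } else {
            acks &lt;- acknowledgment{Status: -1, Message: msg}
        }
    }
}
</code></pre>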
</div>
<div class="case-study-subsection">
<h3>Consumer Instances</h3>
<p>
We will now discuss how consumer instances receive messages from, and send responses to, Triage.
</p>
<p>
Consumer applications interact with Triage using the Triage Client. This client library is responsible for the following:
</p>
<ol class="numbered-list">
<li>Providing a convenience method to send an HTTP request to Triage</li>
<li>Accepting a message handler</li>
<li>Running a gRPC Server</li>
</ol>
<p>
We have already covered the HTTP request – we will now examine message handlers and gRPC servers.
</p>
<h4>Message Handler</h4>
<p>
Developers first pass the client library a message handler function – the message handler
should have a Kafka message as a parameter, process the message, and return either a
positive or negative integer based on the outcome of the processing. This integer is how
consumer application developers can indicate whether a message has or has not been successfully
processed.
</p>
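<p>
Continuing the online store example from earlier, a message handler might look like the
hypothetical sketch below. The exact signature expected by the client library may differ;
<span class="snippet">Order</span> and <span class="snippet">sendConfirmationEmail</span> are placeholders.
</p>
<pre><code>// handleOrder processes one Kafka message and reports the outcome:
// a positive integer for success, a negative integer for failure.
func handleOrder(msg *kafka.Message) int {
    var order Order // placeholder struct matching the producer's JSON schema
    if err := json.Unmarshal(msg.Value, &amp;order); err != nil {
        return -1 // poison pill: Triage will nack and dead-letter this message
    }
    if err := sendConfirmationEmail(order); err != nil {
        return -1 // processing failed
    }
    return 1 // success: Triage will record an ack for this offset
}
</code></pre>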
<h4>gRPC Server</h4>
<p>
When the consumer application is started, it runs a gRPC server that listens on a dedicated port.
When the server receives a message from Triage, it invokes the message handler, with the message
as an argument. If the message handler returns a positive integer, it indicates that the message
was successfully processed, and an <span class="snippet">ack</span> is sent back to Triage. If
the handler returns a negative integer, it indicates that the message was not processed
successfully, and a <span class="snippet">nack</span> is sent to Triage.
</p>
</div>
<div class="case-study-subsection">
<h3>Handling Acks/Nacks</h3>
<p> So far, we have covered how messages get from Kafka, through Triage, and to consumer instances.
We then explained how Triage’s client library manages receiving these messages and sending
responses. We will now cover how Triage handles these responses.
</p>
<h4>senderRoutine</h4>
<p>
As discussed in the message flow section, after sending a message to a consumer instance,
<span class="snippet">senderRoutine</span> waits for a response. When a response is received, <span class="snippet">senderRoutine</span> creates an
<span class="snippet">acknowledgment</span> struct. The struct has two fields - <span class="snippet">Status</span> and <span class="snippet">Message</span>.
The <span class="snippet">Status</span> field indicates either a positive or negative acknowledgment. If the
response was a <span class="snippet">nack</span>, a reference to the message is saved under the <span class="snippet">Message</span> field of the
struct. For <span class="snippet">acked</span> messages, the <span class="snippet">Message</span> field can be <span class="snippet">nil</span>. Finally, before looping to send
another message, <span class="snippet">senderRoutine</span> places the struct on the <span class="snippet">acknowledgments</span> channel.
</p>
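<p>
The struct described above might be defined along these lines (field names follow the prose;
the actual definition may differ):
</p>
<pre><code>// acknowledgment reports the outcome of delivering one message to a consumer.
type acknowledgment struct {
    Status  int            // positive for an ack, negative for a nack
    Message *kafka.Message // set only for nacks; nil for acked messages
}
</code></pre>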
<h4>Filter</h4>
<p>
A component called <span class="snippet">Filter</span> listens on the <span class="snippet">acknowledgments</span> channel. It pulls <span class="snippet">acknowledgment</span>
structs off the channel and performs one of two actions based on whether the struct represents
an <span class="snippet">ack</span> or a <span class="snippet">nack</span>. For <span class="snippet">acks</span>, <span class="snippet">Filter</span>
immediately updates the <span class="snippet">Commit Tracker</span>’s hashmap. We
will discuss the <span class="snippet">Commit Tracker</span>’s hashmap in the next section - for now, it is enough to know
that an <span class="snippet">ack</span> means we can mark the entry representing the message in the hashmap as acknowledged.
</p>
<p>
For <span class="snippet">nacks</span>, however, the hashmap cannot be updated immediately - we have a bad message and need
to ensure it is stored somewhere before moving on. <span class="snippet">Filter</span> places this negative acknowledgment in
the <span class="snippet">deadLetters</span> channel.
</p>
<h4>Reaper</h4>
<p>
A component called <span class="snippet">Reaper</span> listens on this <span class="snippet">deadLetters</span> channel. It makes an API call to DynamoDB,
attempting to write the faulty message to a table. Once confirmation is received from DynamoDB that
the write was successful, the entry representing the message in <span class="snippet">Commit Tracker</span>’s hashmap can be marked
as acknowledged.
</p>
<p>
At this point, we have covered how messages get from Kafka, through Triage, to consumer instances. We
have also covered how consumer instances process these messages, send responses back to Triage,
and how Triage handles these responses.
</p>
</div>
<div class="case-study-subsection">
<h3>Commits</h3>
<p>
We will now cover the <span class="snippet">Commit Tracker</span> component and how it allows us to manage commits back to Kafka effectively.
</p>
<h4>Commit Hashmap</h4>
<p>As discussed in the message flow section, as messages are ingested by Triage,
we store a reference to them in a hashmap. The hashmap’s keys are the offsets
of the messages, and the values are a custom struct called <span class="snippet">CommitStore</span>. The
<span class="snippet">CommitStore</span> struct has two fields - <span class="snippet">Message</span> and <span class="snippet">Value</span> .
The <span class="snippet">Message</span> field stores a reference to a specific Kafka message; the <span class="snippet">Value</span> field stores whether
or not Triage has received a response for this message.
</p>
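<p>
A hypothetical sketch of the <span class="snippet">Commit Tracker</span>'s hashmap and the
<span class="snippet">CommitStore</span> struct, following the field names in the prose
(the real definitions may differ):
</p>
<pre><code>// CommitStore pairs a Kafka message with its acknowledgment state.
type CommitStore struct {
    Message *kafka.Message // reference to the ingested Kafka message
    Value   bool           // true once the message has been acked or dead-lettered
}

// CommitTracker maps a message's offset to its CommitStore entry.
type CommitTracker struct {
    mu      sync.Mutex
    entries map[int64]*CommitStore
}

// Store records a newly ingested message, keyed by its offset (called by Fetcher).
func (t *CommitTracker) Store(offset int64, msg *kafka.Message) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.entries[offset] = &amp;CommitStore{Message: msg}
}

// MarkAcknowledged flips the Value field; called by Filter for acks and by
// Reaper after a successful DynamoDB write for nacks.
func (t *CommitTracker) MarkAcknowledged(offset int64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if entry, ok := t.entries[offset]; ok {
        entry.Value = true
    }
}
</code></pre>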
<p>
Previously, we mentioned that the <span class="snippet">Filter</span> and <span class="snippet">Reaper</span> components marked messages
in the hashmap as acknowledged. More specifically, they were updating the <span class="snippet">Value</span>
field. Because messages that are <span class="snippet">nacked</span> are stored in DynamoDB for processing
at a later time, we can think of them as “processed,” at least with respect to
calculating which offset to commit back to Kafka.
</p>
<figure>
<img
src="assets/images/case-study/ack-nack.gif"
class="case-study-image"
/>
<figcaption>
<span class="snippet">Commit Tracker</span> in action
</figcaption>
</figure>
<h4>Commit Calculator</h4>
<p>
To calculate which offset to commit, a component called <span class="snippet">CommitCalculator</span> periodically runs
in the background. To be efficient with our commits, we want to commit the highest
offset possible since Kafka will implicitly commit all messages below our committed
offset. For example, if there are 100 messages in a partition and we commit
offset 50, Kafka will consider offsets 0-49 as “committed.”
</p>
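<p>
A simplified sketch of that calculation (illustrative only): starting from the last committed
offset, walk forward through the hashmap and stop at the first message that has not yet been
acknowledged.
</p>
<pre><code>// nextCommitOffset returns the highest offset that can safely be committed:
// every entry below the returned offset has been acknowledged (or dead-lettered).
func nextCommitOffset(entries map[int64]*CommitStore, lastCommitted int64) int64 {
    next := lastCommitted
    for {
        entry, ok := entries[next]
        if !ok || !entry.Value {
            break // stop at the first missing or unacknowledged offset
        }
        next++
    }
    return next
}
</code></pre>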
<p>