-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare Varnish-Cache for the age of AI #4209
base: master
Are you sure you want to change the base?
Conversation
homework from bugwash: look at ministat and get more numbers |
I have re-run the mini benchmark. Because As apparently I had not made this clear in the initial note, the goal at this point is to show that a reimplementation of So here is an unfiltered comparison of two runs with vai vs. master. These are response times, so lower is better. crc32 check enabled
no crc32 check
Footnotes
|
b10085a
to
19fb692
Compare
1bb9644
to
092f933
Compare
With the force push as of just now I fixed a misnomer of |
Make sure that we do not confuse buffers with object data. The flags field comes at no extra memory cost on 64bit. For 32bit, this change breaks binary compatibility with any out-of-tree storages using storage_simple.
This commit prepares a more generic busy object extension notification facility in preparation of the asynchronous iteration facility introduced with the next commit. It makes more sense when looked at in context of that, but the changes constitute a fairly independent part and thus have been separated. Background To support streaming of busy objects (delivery to a client while the body is being fetched), the Object API provides ObjWaitExtend(), which is called by storage iterators to learn the available amount of body data and to wait for more if all available data has been processed (= sent to the client, usually). The other end of the facility is ObjExtend(), which is called by the fetch side of storage to update the available amount of body data and wake up any clients blocking in ObjWaitExtend(). This facility recently got extended by a blocking operation in the other direction, where the writing side blocks if the amount of unsent data exceeds the amount configured via the transit_buffer. Why this change? The existing facility is based on the model of blocking threads. In order to support asynchronous iterators, where a single thread may serve multiple requests, we need a different, non-blocking model with notifications. Implementation The basic implementation idea is to introduce a variant of ObjWaitExtend() which, rather than blocking on a condition variable, registers a callback function to be called when the condition variable got signalled. This is ObjVAIGetExtend(): It returns the updated extension, if available, _or_ registers the callback. To implement the actual callback, we add to struct boc a queue (struct vai_q_head), whose elements are basically the notification callback with two pointers: the caller gets a private pointer as well as vai_hdl is an opaque handle owned by storage. ObjExtend() now also works the list of registered callbacks. ObjVAICancel() removes a callback when the caller is no longer interested or needs to reclaim the queue entry.
This commit adds a new object iteration API to support asynchronous IO. Background To process object bodies, the Object API so far provides ObjIterate(), which calls a storage specific iterator function. It in turn calls a caller-provided objiterate_f function on individual, contigious segments of data (extents). In turn, objiterate_f gets called with either no flags, or one of OBJ_ITER_FLUSH and OBJ_ITER_END. The storage iterator uses these flags to signal lifetime of the provided entents: They remain valid until a flag is present, so the caller may delay use until an extent is provided with a flag sent. Or, seen from the other end, objiterate_f needs to ensure it does not use any previously received extent when a flag is seen. objiterate_f can not make any assumption as to if or when it is going to be called, if the storage iterator function needs time to retrieve data or a streaming fetch is in progress, then so be it, objiterate_f may eventually get called or not. Or, again seen from the other end, the storage iterator function assumes being called from a thread and may block at any time. Why this change? The model described above is fundamentally incompatible with asynchronous, event driven IO models, where a single thread might serve multiple requests in parallel to benefit from efficiency gains and thus no called function must ever block. This additional API is intended to provide an interface suitable for such asynchronous models. As before, also the asynchronous iterator is owned by a storage specific implementation, but now, instead of using a thread for its state, that state exists in a data structure opaque to the caller. Batching with scatter arrays (VSCARAB) As recapitulated above, the existing objiterate_f works on one buffer at a time, yet even before asynchronous I/O, issuing one system call for each buffer would be inefficient. So, for the case of HTTP/1, the V1L layer collects buffers into an array of io vectors (struct iovec), which are handed over to the kernel using writev(). These arrays of io vectors seem to have no established name even after decades of existence, elsewhere they are called siov or array, so in this API, we are going to call them scatter arrays. With the new API, we use scatter arrays for all the processing steps: The goal is that storage fills a scatter array, which then gets processed and maybe replaced by filters, until finally some transport hands many I/Os at once to the kernel. Established interfaces follow the signature of writev(), they have a pointer to an array of struct iovec and a count (struct iovec *iov, int iovcnt). Yet for our plans, we want to have something which can be passed around in a single unit, to ensure that the array is always used with the right count, something which can vary in size and live on the heap or the stack. This is the VSCARAB, the Varnish SCatter ARAy of Buffers, basically a container struct with a flexible array member (fam). The VSCARAB has a capacity, a used count, and is annotated with v_counted_by_() such that, when support for bounds checking is improved by compilers, we get additional sanity checks (and possibly optimizations). We add macros to work on VSCARABs for (bounds) allocation (on the stack and heap), initialization, checking (magic and limits), iterating, and adding elements. VSCARET and VFLA Managing scatter arrays is one side of the coin, when we are done using buffers, we need to return them to storage, such that storage can do LRU things or reuse memory. As before, we want to batch these operations for efficiency. As an easy to use, flexible data structure, we add VSCARABs sibing VSCARET. And, because both are basically the same, we generalize macros as VFLA, Varnish Flexible Arrays. API Usage The basic model for the API is that the storage engine "leases" to the caller a number of extents, which the caller is then free to use until it returns the leases to the storage engine. The storage engine can also signal to the caller that it can not return more extents unless some are returned or that it simply can not return any at this time for other reasons (for example, because it is waiting for data on a streaming fetch). In both cases, the storage engine promises to call the caller's notification function when it is ready to provide more extents or iteration has ended. The API consists of four functions: - ObjVAIinit() requests an asynchronous iteration on an object. The caller provides an optional workspace for the storage engine to use for its state, and the notification callback / private pointer introduced with the previous commit. Its use is explained below. ObjVAIinit() returns either an opaque handle owned jointly by the Object layer in Varnish-Cache and the storage engine, or NULL if the storage engine does not provide asynchronous iteration. All other API functions work on the handle returned by ObjVAIinit(): - ObjVAIlease() returns the next extents from the object body in a caller-prodived VSCARAB. Each extent is a struct viov, which contains a struct iovec (see iovec(3type) / readv(2)) with the actual extent, and an integer identifying the lease. For the VSCARAB containing the last extent, VSCARAB_F_END is set. The "lease" integer (uint64_t) of each viov is opaque to the caller and needs to be returned as-is later, but is guaranteed by storage to be a multiple of 8. This can be used by the caller to temporarily stash a tiny amount of additional state into the lower bits of the lease. ObjVAIlease either returns a positive integer with a number of available leases, zero if the end of the object has been reached, or a negative integer for "call again later" and error conditions: -EAGAIN signals that no more data is available at this point, and the storage engine will call the notification function when the condition changes. -ENOBUFS behaves identically, but also requires the caller to return more leases. -EPIPE mirrors BOS_FAILED on the busy object. Any other -(errno) can be used by the storage engine to signal other error conditions. - ObjVAIreturn() returns a VSCARET of leases to the storage when the caller is done with them For efficiency, leases of extents which are no longer in use should be collected in a VSCARET and returned using ObjVAIreturn() before any blocking condition. They must be returned when ObjVAIlease() requests so by returning -ENOBUFS and, naturally, when iteration over the object body ends. - ObjVAIfini() finalizes iteration. The handle must not be used thereafter. Implementation One particular aspect of the implementation is that the storage engine returns the "lease", "return" and "fini" functions to be used with the handle. This allows the storage engine to provide functions tailored to the attributes of the storage object, for example streaming fetches require more elaborate handling than settled storage objects. Consequently, the vai_hdl which is, by design, opaque to the caller, is not entirely opaque to the object layer: The implementation requires it to start with a struct vai_hdl_preamble containing the function pointers to be called by ObjVAIlease(), ObjVAIreturn() and ObjVAIfini(). More details about the implementation will become clear with the next commit, which implements SML's synchronous iterator using the new API.
…terator This commit implements the asynchronous iteration API defined and described in previous commits for the simple storage layer and reimplements the synchronous iterator with it. This commit message does not provide background information, please refer to the two previous commits. Implementation sml_ai_init() initializes the handle and choses either a simple or more elaborate "boc" lease function depending on whether or not a streaming fetch is ongoing (boc present). sml_ai_lease_simple() is just that, dead simple. It iterates the storage segment list and fills the VSCARAB provided by the caller. It is a good starting point into the implementation. sml_ai_lease_boc() handles the busy case and is more elaborate due to the nature of streaming fetches. It first calls ObjVAIGetExtend() to get the current extent. If no data is available, it returns the appropriate value. Other than that, is basically does the same things as sml_ai_lease_simple() with these exceptions: It also needs to return partial extents ("fragments") and it needs to handle the case where the last available segment is reached, in which case there is no successor to store for the next invocation. sml_ai_return() is only used for the "boc" case. It removes returned full segments from the list and then frees them outside the boc mtx. It also adds special handling for the last segment still needed by sml_ai_lease_boc() to resume. This segment is retained on the VSCARET. sml_ai_fini() is straight forward and should not need explanation. Implementation of sml_iterator() using the new API To reimplement the existing synchronous iterator based on the new API, we first need a little facility to block waiting for a notification. This is struct sml_notify with the four sml_notify* functions. sml_notify() is the callback, sml_notify_wait() blocks waiting for a notification to arrive. Until it runs out of work, the iterator performs these steps: ObjVAIlease() is called repeatedly until either the VSCARAB is full or a negative value is returned. This allows the rest of the code to react to the next condition appropriately by sending an OBJ_ITER_FLUSH with the last lease only. Calling func() on each extent is trivial, the complications only come from handling OBJ_ITER_FLUSH, "just in time" returns and error handling.
To bring VAI to filters, we are going to need buffer allocations all over the place, because any new data created by filters needs to survive after the filter function returns. So we add ObjVAIbuffer() to fill a VSCARAB with buffers and teach ObjVAIreturn() to return any kind of lease. We add an implementation for SML.
With the force-push as of just now (226d9eb), I have changed the implementation to use a "flexible array of things" for scatter arrays and return of leases which I worked on in November and December - mostly in preparation of bringing VAI also to VDPs. Batching with scatter arrays (VSCARAB)The existing With the new API, we (will) use scatter arrays for all the processing steps: The goal is that storage fills a scatter array, which then gets processed and maybe replaced by filters, until finally some transport hands many I/Os at once to the kernel. Established interfaces follow the signature of Yet for our plans, we want to have something which can be passed around in a single unit, to ensure that the array is always used with the right count, something which can vary in size and live on the heap or the stack. This is the VSCARAB, the Varnish SCatter ARAy of Buffers, basically a container struct with a flexible array member (fam). The VSCARAB has a capacity, a used count, and is annotated with We add macros to work on VSCARABs for allocation (on the stack and heap), initialization, checking (magic and limits), iterating, and adding elements. VSCARET and VFLAManaging scatter arrays is one side of the coin, when we are done using buffers, we need to return them to storage, such that storage can do LRU things or reuse memory. As before, we want to batch these operations for efficiency. As an easy to use, flexible data structure, we add VSCARABs sibing VSCARET. And, because both are basically the same, we generalize macros as VFLA, Varnish Flexible Arrays. |
... where of course AI stands for Asynchronous Iterators ;)
On this PR
This PR might be more extensive than others, and I would like to ask reviewers to lean back and take some time for it. Thank you in advance to anyone who invests their time into looking at this.
The PR description has been written to give an overview; individual commit messages have more details.
While the code proposed here works, and I am already building on top of it, I am deliberately marking this PR as a draft to make clear that I am very interested in feedback and comments before committing to a particular interface. Also I expect sensible changes to arrive still, but the fundamentals are ripe, IMHO.
Intro
This PR suggests a new API to enable use of asynchronous I/O on the delivery side of Varnish-Cache. The design has been highly influenced by io_uring, because my prior experience has been that creating compatibility with it is fairly easy: The io_uring interface is extremely simple in that completions are identified by a single
uint64_t
value, so once calling code is prepared to work with this limited interface, adding other implementations is relatively easy.Advantages of asynchronous I/O
In my mind, efficient asynchronous I/O interfaces primarily provides two main benefits:
a) using less threads
b) significantly lower context switching overhead when many I/O requests can be combined into a single system call (or, in the case of io_uring, optionally none at all)
a) has long been achieved with the traditional
select()
(poll()
/epoll()
/kqueue()
) event loop model, which Varnish-Cache uses for the waiter facility. This technique saves on threads, but each I/O still requires >1 system call per I/O. As long as we used HTTP1 with TCP, this overhead was not too bad, because we could simply make our I/O system calls very big (usingwritev()
). But already with HTTP2, handing large batches of data to the kernel becomes problematic, because the (questionable) benefits of the protocol require reacting to client events with lowest latency. And with HTTP3, we definitely need system calls to handle lots of small UDP datagrams, unless we want to map hardware into user address space and talk directly to it.So b) becomes much more important now: More than ever, we will need to be able to batch I/O requests into as small a number of system calls as possible with a tight feedback loop to the client. If we want to stay efficient and not push too much data for each client at once, consequently we also need to handle many clients from a central instance (a thread or a few threads).
Or, summarizing even more, we need to use an event loop-ish architeture for client I/O.
Existing interfaces vs. asynchronous I/O
Our existing ObjIterate() interface calls into the storage engine with a delivery function to execute on every extent (pointer + length) of body data. Iteration can block at any point: The storage engine may issue blocking I/O, as certainly will the delivery function while sending data to the client.
For asynchronous I/O, we need to turn this inside out: Firstly, our iteration state can not live on the stack as it currently does, we need a data structure containing all we need to know about the iteration such that we can come back later and ask the storage engine for more data. Secondly, we need a guarantee from the storage engine to never block. And, following from that, we need some notification mechanism allowing the storage engine to tell client I/O code that it now has more data available. A special case of the last point is streaming a busy object, where it is not just the storage engine that blocks, but rather creation of the cache object itself.
Overview
This series of commits starts with the last aspect:
ObjVAIGetExtend()
in the second commit checks for more streaming data and, if none is available, registers a callback, allowing for asynchronous notification.The third commit contains the proposed new API, which follows the idea that the storage engine hands out "leases" on segments of data to callers, which can then use them until they return the leases - where the storage engine might require leases to be returned before handing out new ones.
With an array of leases at hand, the caller can then do its asynchronous thing, also returning leases in an asynchronous fashion.
The last commit implements the new API for simple storage and reimplements the existing simple storage iterator using the new API as a PoC.
Performance
Performance of malloc storage with current varnish-cache head 508306f has been measured against the reimplementation using the new API. The test has been set up such that the more relevant case of streaming data from small-ish extents is tested with varnish-cache as its own backend: A cache object is created from 64MB of random data, which is then streamed in pass mode by using a request from Varnish-Cache against itself. The test has also been run with
debug.chkcrc32
from #4208 and also for much longer periods.Details are given below.
Typical results for
wrk -t100 -c400 -d100 http://localhost:8080
on my laptop are:baseline 508306f
with VAI reimplementation 125dc28
baseline 508306f with
debug.chkcrc32
with VAI reimplementation 125dc28 with
debug.chkcrc32
Performance Test details
VCL with varnishd start command line in a comment (code in vcl_deliver commented out for no-crc test)