
Building a Metrics Pipeline for High-Performance Data Collection


This is part three in a series discussing the metrics pipeline powering Kong Cloud.

In previous posts in this series, we’ve discussed how Kong Cloud collects, ships, and stores high volumes of metrics and time-series data. We’ve described the difference between push and pull models of collecting metrics data, and looked at the benefits and drawbacks of each from a manageability and performance perspective. 

 

This post will detail how our engineering team studied the integration of our metrics collection code with the services that actually process customer traffic, ensuring that our data collection was effective, correct, and most importantly, did not interfere with user traffic in terms of performance or latency.

Kong Cloud’s platform is powered at the edge by a tier of OpenResty servers. To briefly recap, OpenResty is a bundle of the venerable Nginx web server and LuaJIT, providing a stable, high-performance web application and proxy platform. Leveraging OpenResty at the edge allows us to write complex business logic (e.g., data transformation, communication with third-party services) in a very efficient manner. Plumbing code for mundane but critical tasks, like memory management and socket lifecycle handling, is delegated to the core OpenResty codebase, letting us focus on the correctness of our business logic. This paradigm lets us ship changes and new features to our edge tier in a matter of minutes, not the weeks or months that writing native Nginx module code would require.
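For a flavor of what this looks like, the business logic lives in small Lua handlers embedded directly in the Nginx configuration. The snippet below is a simplified, hypothetical example of an edge handler, not our production code (the upstream name and header are made up):

# hypothetical example: business logic in an OpenResty phase handler
location / {
  access_by_lua_block {
    -- tag each request before proxying it upstream
    -- (ngx.var.request_id requires Nginx 1.11.0 or newer)
    ngx.req.set_header("X-Request-Id", ngx.var.request_id)
  }
  proxy_pass http://edge_upstream;
}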

That said, OpenResty’s safety and stability does not mean that we can be blasé about the code we ship. Any new logic introduced at the edge of our infrastructure introduces complexity that needs to be validated. This validation extends beyond logic correctness – we also need to ensure that the code we ship doesn’t introduce unnecessary latency. Keeping latency at a minimum at our edge tier allows us to maintain a minimally-sized fleet of edge machines in each datacenter, reducing cost and operational complexity.

We kept these principles closely in mind while developing the logging code used by the edge tier machines as part of our metrics pipeline. Our previous blog post describes the basic architecture we use in OpenResty to store and periodically ship metrics data. The implementation itself relied on code patterns designed to leverage OpenResty APIs that are known to be friendly to the JIT compiler (e.g., including the resty.core library). We made a deliberate effort to avoid constructs that cannot be JIT compiled, such as iterating tables with the infamous pairs built-in function, and we used FFI data structures to reduce memory footprint.
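As an illustration of the pattern (a minimal sketch, not the real edge.logger code; the slot count and the incr function here are hypothetical), per-worker counters can live in a contiguous FFI array instead of an ordinary Lua table:

-- minimal sketch, not the actual edge.logger: counters in a fixed-size FFI array
local ffi = require "ffi"

local METRIC_SLOTS = 64  -- hypothetical number of counter slots
local counters = ffi.new("uint64_t[?]", METRIC_SLOTS)

local _M = {}

-- numeric indexing into FFI cdata compiles cleanly and avoids
-- per-request table allocations and pairs()-style iteration
function _M.incr(slot, value)
  counters[slot] = counters[slot] + (value or 1)
end

return _M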

Following the first pass of development, we took the opportunity to profile our logging library in a lab environment to gauge the expected performance hit it would introduce in our environment. We first set up a lab environment with a single Nginx worker process serving a local file from disk, with no Lua phase handlers or custom business logic:
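The baseline configuration was along these lines (an illustrative sketch; the document root is a guess, though the port matches the wrk command that follows):

# illustrative baseline config: one worker, a static file, no Lua
worker_processes 1;

events {}

http {
  server {
    listen 8080;

    location / {
      root /var/www/html;
    }
  }
}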

 

$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html
Running 5s test @ http://localhost:8080/index.html
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   218.65us   78.51us   7.68ms   96.75%
    Req/Sec    23.08k      1.43k   24.49k   91.18%
  468517 requests in 5.10s, 360.11MB read
Requests/sec:  91864.85
Transfer/sec:     70.61MB

 

The idea here was simply to get a rough baseline of what we’d expect to see from Nginx in terms of throughput and client latency. The raw numbers themselves aren’t as important as the deltas we’ll see in the following tests: the ratio of change gave us an idea of the overall impact of the code being exercised, given the parameters of the benchmark.

 

Next, we set up the same benchmark, this time with the addition of an empty Lua log phase handler:

 

# nginx config
log_by_lua_block {
  -- nothing to see here!
}

 

$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html
Running 5s test @ http://localhost:8080/index.html
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   263.71us  286.15us  12.47ms   99.59%
    Req/Sec    19.73k      4.13k   76.84k   99.50%
  394735 requests in 5.10s, 303.40MB read
Requests/sec:  77402.86
Transfer/sec:     59.49MB

 

This gave us a rough idea of how much overhead simply entering the Lua VM on every request would introduce. We saw roughly a 15% drop in maximum throughput in this test (from about 91,865 to 77,403 requests per second). Again, the idea here was not to perform a highly rigorous experiment, but rather to run a few simple benchmarks to make sure the inclusion of our logging library wouldn’t cause unexpected drops in performance.

 

We set up the benchmark for a third run, this time with our logging library included:

# nginx config
log_by_lua_block {
  local logger = require "edge.logger"
  logger.exec()
}

 

$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html
Running 5s test @ http://localhost:8080/index.html
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   392.11us  219.26us  14.60ms   98.21%
    Req/Sec    13.03k     697.97   13.61k   95.10%
  264432 requests in 5.10s, 203.25MB read
Requests/sec:  51852.36
Transfer/sec:     39.85MB

 

Well, that was unexpected. We knew that we were going to see some drop in throughput, but running at little more than half the raw request rate of Nginx was indicative of a real bottleneck in the logging library. After a fruitless few hours of eyeballing the code looking for bad patterns, we turned to LuaJIT’s profiling tools. We ran the benchmark again with LuaJIT’s jit.dump module enabled, giving us a verbose data set detailing the JIT compiler’s behavior. JIT dumps are a useful tool, printing the Lua bytecode, intermediate representation (IR), and generated machine code for every trace the JIT compiler generates. When the compiler cannot generate a trace, the dump also records the failure reason at the point of the trace abort. For example, when LuaJIT tried to compile a trace that relied on using pairs to iterate over a table, it generated the following output (most of it removed for brevity):
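(One way to switch the dump on in OpenResty is to load the jit.dump module from an init handler; the flag string and output path below are illustrative guesses rather than our exact settings. The excerpted trace output follows the snippet.)

# illustrative wiring for jit.dump; flags and path are guesses
init_worker_by_lua_block {
  local dump = require "jit.dump"
  -- "t" logs started/aborted traces, "b" dumps bytecode,
  -- "i" dumps the IR, "m" dumps generated machine code
  dump.on("tbim", "/tmp/jit.dump")
}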

 

0000  . FUNCC               ; ipairs_aux
0066  JITERL  16  21        (logger.lua:497)
0067  GGET    13  15        ; "pairs"  (logger.lua:510)
0068  TGETV   14   5   9    (logger.lua:510)
0069  CALL    13   4   2    (logger.lua:510)
0000  . FUNCC               ; pairs
0070  ISNEXT  16 => 0093    (logger.lua:510)
---- TRACE 22 abort logger.lua:510 -- NYI: bytecode 72

 

This particular trace wasn’t concerning, as it ran on a code path that executed infrequently on a background timer, so it wasn’t responsible for slowing down client requests. As we looked through the rest of the dump output, one repeated abort, in particular, caught our eye:

0009  . GGET     4   0      ; "ipairs"  (logger.lua:358)
0010  . MOV      5   2      (logger.lua:358)
0011  . CALL     4   4   2  (logger.lua:358)
0000  . . FUNCC             ; ipairs
0012  . JMP      7 => 0021  (logger.lua:358)
0021  . ITERC    7   3   3  (logger.lua:358)
0000  . . FUNCC             ; ipairs_aux
0022  . JITERL   7  21      (logger.lua:358)
---- TRACE 25 abort logger.lua:359 -- inner loop in root trace

 

This trace abort was occurring in the hot path (code called as part of the log phase on every request), so we suspected it accounted for at least a substantial part of the slowdown. The cause of failure here was not an operation that couldn’t be compiled, but rather the use of the ipairs iterator inside another trace. This was a bit confusing at first: ipairs itself can be compiled by the JIT, so it wasn’t immediately clear why this was failing. Here is the chunk of code where the trace aborted:

 

350 local function incr_histogram(tbl, tbl_length, histogram, value)
351   if value == nil then
352     return
353   end
354   -- increment sum which is at last index
355   tbl[tbl_length - 1] = tbl[tbl_length - 1] + value
356
357   for i, uplimit in ipairs(histogram) do
358     if value <= uplimit then
359       tbl[i - 1] = tbl[i - 1] + 1
360       return
361     end
362   end
363 end

In this case, a root JIT trace had been created at the start of the incr_histogram function. The for loop over the ipairs iterator on line 357 generated a new side trace, which returned out of the built-in ipairs once it found the appropriate histogram bucket to increment. It appeared that returning out of ipairs was what caused the trace to abort. On a hunch, we replaced the call to ipairs with an integer-based loop:

diff --git a/files/lualib/edge/logger.lua b/files/lualib/edge/logger.lua
index 5860c5b..3b83106 100644
--- a/files/lualib/edge/logger.lua
+++ b/files/lualib/edge/logger.lua
@@ -354,8 +354,8 @@ local function incr_histogram(tbl, tbl_length, histogram, value)
   -- increment sum which is at last index
   tbl[tbl_length - 1] = tbl[tbl_length - 1] + value
 
-  for i, uplimit in ipairs(histogram) do
-    if value <= uplimit then
+  for i = 1, tbl_length - 1 do
+    if value <= histogram[i] then
       tbl[i - 1] = tbl[i - 1] + 1
       return
     end
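For readability, here is the hot function after the patch, reassembled from the diff above:

local function incr_histogram(tbl, tbl_length, histogram, value)
  if value == nil then
    return
  end
  -- increment sum which is at last index
  tbl[tbl_length - 1] = tbl[tbl_length - 1] + value

  for i = 1, tbl_length - 1 do
    if value <= histogram[i] then
      tbl[i - 1] = tbl[i - 1] + 1
      return
    end
  end
end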

 

This allows the LuaJIT engine to generate a side trace inside our loop (which, incidentally, may only ever execute a single iteration) and exit it without aborting on a return from within the built-in ipairs. Re-running the benchmark with this change alone resulted in a marked improvement:

 

$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html
Running 5s test @ http://localhost:8080/index.html
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   293.55us   94.27us   6.61ms   98.36%
    Req/Sec    17.19k     466.69   17.65k   86.27%
  348846 requests in 5.10s, 268.13MB read
Requests/sec:  68406.82
Transfer/sec:     52.58MB

 

That’s a substantial improvement in both maximum throughput and the request latency seen by the benchmark client. We were able to corroborate this performance improvement with an increase in the number of compiled traces reported by SystemTap and the OpenResty stapxx set of tools.
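(As a rough in-VM cross-check of the compiled trace count, without SystemTap, something along these lines can be run from a debug timer; this is a sketch, not how we actually instrumented the edge tier.)

-- sketch: count live LuaJIT traces via jit.util.traceinfo()
-- assumes trace numbers stay below the scan limit
local jit_util = require "jit.util"

local count = 0
for tr = 1, 4096 do
  if jit_util.traceinfo(tr) then
    count = count + 1
  end
end
ngx.log(ngx.NOTICE, "compiled traces: ", count)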

This change highlights the need to thoroughly study and understand the runtime behavior of the code we ship, not just its logical correctness.

Our metrics pipeline is a crucial element of Kong Cloud. It gives us keen insight into the performance of the various Kong clusters we manage, and its position within our infrastructure requires that it be highly performant and stable. As we fleshed out the OpenResty library that delivers metrics from our edge tier, we kept a close eye on performance and JIT compilation behavior, and we were able to leverage LuaJIT’s excellent debugging tools to knock down performance issues that weren’t caught during code review or integration testing before they reached production.


