riverlog total size cap, perf and allocation optimizations#1157

Merged
bgentry merged 5 commits into master from bg/riverlog-log-cap-perf on Feb 25, 2026

Conversation

@bgentry
Contributor

@bgentry bgentry commented Feb 25, 2026

A user reported high transient memory usage and slow append cycles when using riverlog when the total size of its jsonb array grew large (especially in snooze-heavy jobs). It turns out we were doing a lot of highly inefficient stuff in here with lots of low-hanging fruit for optimization. To start, I added a benchmark to validate the perf troubles and potential solutions.

First, the biggest user-facing change is the new total history cap: MiddlewareConfig.MaxTotalBytes (default 8 MB). When appending a new river:log entry would exceed that limit, the middleware now prunes oldest entries first while always keeping the newest entry. That directly addresses unbounded log-history growth for long-lived/snoozed jobs.

Internally, the implementation was first refactored from its original struct-unmarshal/append logic to json.RawMessage; however, this turned out to still be quite slow and allocation-heavy, so I did another pass leveraging gjson/sjson, which was far better (results below).

A final round of optimization improved things further with a one-pass capped-array builder that marshals only the new entry, computes exactly how much of the existing suffix can be kept, and emits the final JSON array once instead of running append-then-repeated-delete loops. It also truncates log bytes before converting to string, so oversized buffers are not copied in full first.

Coverage was expanded with focused unit tests around append/prune semantics (missing key, non-array metadata, minimal pruning, newest-entry retention when over cap, and no-cap behavior).

In the end, we dramatically reduced the duration of these operations as well as the number of allocations, at the cost of a small increase in the total memory overhead per operation.

| Stage | Change | 256KB ns/op | 256KB B/op | 256KB allocs/op | 2MB ns/op | 2MB B/op | 2MB allocs/op |
|---|---|---|---|---|---|---|---|
| Baseline (23921c2) | Original append path + benchmark harness | 1,331,725 | 703,634 | 294 | 8,348,696 | 6,441,894 | 2,100 |
| Interim (ecb718b) | Internal []json.RawMessage path | 1,562,610 (+17.3%) | 674,228 (-4.2%) | 292 (-0.7%) | 12,557,009 (+50.4%) | 6,433,251 (-0.1%) | 2,097 (-0.1%) |
| Interim (c87f47e) | gjson/sjson append+prune | 627,022 (-52.9%) | 1,166,917 (+65.8%) | 24 (-91.8%) | 4,397,872 (-47.3%) | 6,480,502 (+0.6%) | 30 (-98.6%) |
| Final (5896050) | One-pass suffix selection + single-build output | 163,320 (-87.7%) | 884,066 (+25.6%) | 31 (-89.5%) | 1,207,252 (-85.5%) | 7,006,968 (+8.8%) | 46 (-97.8%) |

Large `river:log` metadata can drive significant allocation spikes while
an attempt appends one more log entry. Before changing behavior, we need
an inexpensive way to measure the hot path and compare alternatives.

Add focused benchmarks in riverlog that exercise middleware append
behavior with prebuilt 256 KB and 2 MB metadata payloads. The benchmark
stays in-process (no DB setup), reports allocations, and is skipped under
testing.Short() so quick CI-oriented runs remain fast.
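
A minimal version of such a harness might look like this (illustrative names and a simplified append cycle, not the actual riverlog benchmark, which also guards the expensive cases behind `testing.Short()`):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"testing"
)

// benchmarkAppend prebuilds a log history containing one payload of roughly
// payloadSize bytes, then measures a decode-append-encode cycle per
// iteration, reporting allocations.
func benchmarkAppend(b *testing.B, payloadSize int) {
	entry, _ := json.Marshal(map[string]string{"log": strings.Repeat("x", payloadSize)})
	history := []byte("[" + string(entry) + "]")

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var arr []json.RawMessage
		_ = json.Unmarshal(history, &arr)
		arr = append(arr, json.RawMessage(entry))
		out, _ := json.Marshal(arr)
		_ = out
	}
}

func main() {
	// testing.Benchmark lets the harness run outside `go test` for illustration.
	res := testing.Benchmark(func(b *testing.B) { benchmarkAppend(b, 256<<10) })
	fmt.Println(res.String(), res.MemString())
}
```
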
@bgentry bgentry force-pushed the bg/riverlog-log-cap-perf branch from 5896050 to 6da39e2 on February 25, 2026 02:00
Snoozed jobs can run repeatedly without consuming attempts, so river:log
entries may grow without bound over long-running jobs. Large metadata
payloads then amplify memory use on each future append.

Add a new `MiddlewareConfig` field `MaxTotalBytes` to bound the total
serialized `river:log` payload. When the new attempt is appended and the
result exceeds the cap, oldest entries are dropped first while always
retaining the newest entry.

The middleware now defaults `MaxTotalBytes` to 8 MB. New tests verify
that oldest entries are pruned when the cap is reached, a tiny cap still
retains the newest entry, and the default is applied.

Appending a new river:log entry previously decoded the whole historical
array into []logAttempt and then re-encoded it. For large metadata this
creates substantial transient allocations for strings and slices.

Switch middleware internals to decode river:log as []json.RawMessage.
Each run now marshals only the newest entry and re-serializes the raw
entry array while preserving the existing total-cap pruning behavior.
This removes per-entry struct decoding from the append path.
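
The shape of that change can be sketched like so (hypothetical `appendRaw` helper, stdlib only):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// appendRaw decodes the history only as opaque raw messages, so existing
// entries are never unmarshaled into per-entry structs; only the newest
// entry is marshaled before the array is re-serialized.
func appendRaw(history []byte, newest any) ([]byte, error) {
	var entries []json.RawMessage
	if len(history) > 0 {
		if err := json.Unmarshal(history, &entries); err != nil {
			return nil, err
		}
	}
	entryJSON, err := json.Marshal(newest)
	if err != nil {
		return nil, err
	}
	return json.Marshal(append(entries, json.RawMessage(entryJSON)))
}

func main() {
	out, err := appendRaw([]byte(`[{"attempt":1}]`), map[string]int{"attempt": 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // the old entry passes through untouched
}
```
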

The raw-message iteration still rebuilt log arrays through json
marshaling, which left append performance significantly worse than the
baseline implementation used on master.

Switch riverlog history to gjson/sjson so operations stay on raw JSON
arrays. The middleware reads only `river:log` with `gjson`, appends with
`sjson.SetRawBytes` using `-1`, and enforces `MaxTotalBytes` by deleting
oldest entries with `sjson.DeleteBytes("0")`.

Comparison for `BenchmarkMiddlewareWorkAppend` (`-benchtime=100ms`):

    case: 256 KB
    impl           ns/op      B/op  allocs/op
    baseline      976052    645856        291
    rawmessage   1605106    698233        293
    sjson         614433   1167265         25

    case: 2 MB
    impl           ns/op      B/op  allocs/op
    baseline     7762435   5920954       2096
    rawmessage  12516198   6651551       2098
    sjson        4485274   6482082         32

The pruning test changed only to avoid a brittle exact-byte boundary.
`MaxTotalBytes` is now set slightly above the two-entry payload so the
test consistently verifies the intended behavior: appending a third
entry drops the oldest and keeps the newest two.
@bgentry bgentry force-pushed the bg/riverlog-log-cap-perf branch from 6da39e2 to bb18bd1 on February 25, 2026 02:14
Appending to large river:log histories still did extra work in the sjson
version: it built an appended array first, then repeatedly deleted index 0
while over cap. In high-volume histories this caused multiple full-array
rewrites before reaching the final payload.

Rework append/prune to compute the kept suffix first and then build the
final array once. The new flow marshals only the new entry, derives bounds
for existing array elements, decides how many oldest entries to drop to fit
MaxTotalBytes, and emits a single final JSON array. It also truncates log
bytes before converting to string so oversized log buffers are not copied
in full.

Benchmark comparison for `BenchmarkMiddlewareWorkAppend`
(`-benchtime=100ms`, same machine):

    case: 256 KB
    impl            ns/op      B/op  allocs/op
    before         821178   1167325         25
    after          177355    884454         31

    case: 2 MB
    impl            ns/op      B/op  allocs/op
    before        4854185   6479606         29
    after         1247496   7006818         46

Added focused unit coverage for appendLogDataWithCap, including missing-key
behavior, non-array metadata errors, minimal oldest-first pruning, keeping
newest entry when over cap, and no-cap behavior.
@bgentry bgentry force-pushed the bg/riverlog-log-cap-perf branch from bb18bd1 to 020e1db on February 25, 2026 02:20
@bgentry bgentry requested a review from brandur February 25, 2026 02:21
Contributor

@brandur brandur left a comment

Looks great man! Really nice to see the implementation get swapped out like this for something more optimal while the API stays 99% the same and is non-breaking.


### Changed

- `riverlog.Middleware` now supports `MiddlewareConfig.MaxTotalBytes` (default 8 MB) to cap total persisted `river:log` history per job. When the cap is exceeded, oldest log entries are dropped first while retaining the newest entry. Values over 64 MB are clamped to 64 MB. [PR #1157](https://github.com/riverqueue/river/pull/1157).
Contributor Author

@brandur I should have called this out earlier, does this 8MB default seem reasonable to you? Do you feel it should be called out as "breaking" or is "changed" sufficient?

We can tweak it in a follow up before release if needed.

Contributor

Yeah, I saw it in there so I can't complain. I could see it potentially on the low side, but it really depends on the use case. How did you pick 8 MB anyway?

Just re-reading this, we might want to update this comment on MaxSizeBytes and call out that you'll likely hit MaxTotalBytes way before you hit something like 50 MB:

	// Be careful with this number because the maximum total log size is equal
	// to maximum number of attempts multiplied by this number (each attempt's
	// logs are kept separately). For example, 25 * 2 MB = 50 MB maximum
	// theoretical log size. Log data goes into metadata which is a JSONB field,
	// and JSONB fields have a maximum size of 255 MB, so any number larger than
	// 255 divided by maximum number of attempts may cause serious operational
	// problems.

@bgentry bgentry merged commit 9be9aff into master Feb 25, 2026
24 of 25 checks passed
@bgentry bgentry deleted the bg/riverlog-log-cap-perf branch February 25, 2026 04:16
@Sroose

Sroose commented Feb 25, 2026

Hi,

Thank you for the very quick response!

We tested the changes from this PR against our real-world case (a long-lived snoozed job with ~6000 log entries, ~17MB of river:log in metadata, snoozing every 2 minutes).
Unfortunately we see no meaningful improvement.

Numbers

We ran go tool pprof heap profiles after 7 snooze iterations (~12 minutes) on both v0.28.0 and a build with this PR applied.

inuse_space:

| | v0.28.0 | PR #1157 |
|---|---|---|
| iobufpool (pgx sync.Pool) | 112 MB (50%) | 128 MB (58%) |
| pgtype.encodePlan...Encode | 58.5 MB (26%) | 66.5 MB (30%) |
| bytes.growSlice | 16 MB | 0.5 MB |
| json.Marshal | 16 MB | - |
| riverlog.buildAppendedArray | - | 8 MB |
| Total | 224 MB | 219 MB |

alloc_space: 3,300 MB (old) vs 3,412 MB (new) - similar with the fix.

The json.Unmarshal / json.Marshal allocations inside riverlog are gone, which is good. But they've been replaced by gjson.ParseBytes (129 MB) + gjson.getBytes (129 MB) + buildAppendedArray (112 MB), so the net effect within alloc_space is roughly a wash.

More importantly, the two dominant allocators - iobufpool and pgtype.Encode - are completely unchanged. Together they account for ~88% of inuse in both old and new.

Why

The PR optimises step 3 of what's essentially a 7-step amplification chain per snooze cycle:

  1. JobGetAvailable fetches the job, including the full metadata column → pgx reads ~17 MB from the wire into iobufpool
  2. pgtype.Scan copies the metadata bytes out of the wire buffer → another ~17 MB
  3. riverlog parses the river:log array → this is what the PR fixes ✅
  4. riverlog builds the new capped array → buildAppendedArray, still ~8 MB (MaxTotalBytes default)
  5. job_executor.reportResult calls json.Marshal(metadataUpdates) on the full update map (containing the log array), then sjson.SetBytes to add the snooze counter → copies the whole blob again
  6. pgx encodes the UPDATE Bind message with metadata_updates → pgproto3.(*Bind).Encode at ~112 MB in alloc_space
  7. The UPDATE query has RETURNING ... metadata → pgx reads the full merged metadata back from the wire → another iobufpool allocation

Steps 1, 2, 5, 6 and 7 are untouched. The iobufpool sync.Pool retains peak-allocated buffers - once it allocates 17 MB for the first read it never shrinks, since the job keeps reusing the pool every 2 minutes.

Even after MaxTotalBytes trims the log down to 8 MB, that's still a large blob being round-tripped through pgx on every cycle. And the pool buffers sized for the original 17 MB read stick around.

Full pprof output

v0.28.0 (old)

    Type: inuse_space - 224.22MB total
         112MB 49.95%  iobufpool.init.0.func1
       58.53MB 26.10%  pgtype.encodePlanJSONCodecEitherFormatByteSlice.Encode
          16MB  7.14%  bytes.growSlice
       15.98MB  7.13%  encoding/json.Marshal

    Type: alloc_space - 3300.24MB total
      337.80MB 10.24%  pgtype.scanPlanAnyTextToBytes.Scan
         304MB  9.21%  iobufpool.init.0.func1
      225.25MB  6.83%  bytes.growSlice
      225.13MB  6.82%  encoding/json.Marshal
      192.44MB  5.83%  pgtype.encodePlanJSONCodecEitherFormatByteSlice.Encode
      132.23MB  4.01%  json.(*decodeState).literalStore
      114.70MB  3.48%  json.unquoteBytes
      111.81MB  3.39%  sjson.appendRawPaths
      111.80MB  3.39%  pgproto3.(*Bind).Encode

PR #1157 (new)

    Type: inuse_space - 219.09MB total
         128MB 58.42%  iobufpool.init.0.func1
       66.48MB 30.35%  pgtype.encodePlanJSONCodecEitherFormatByteSlice.Encode
           8MB  3.65%  riverlog.buildAppendedArray
        0.51MB  0.23%  bytes.growSlice

    Type: alloc_space - 3412MB total
      371.58MB 10.89%  pgtype.scanPlanAnyTextToBytes.Scan
         336MB  9.85%  iobufpool.init.0.func1
      180.88MB  5.30%  pgtype.encodePlanJSONCodecEitherFormatByteSlice.Encode
      128.79MB  3.77%  gjson.ParseBytes
      128.79MB  3.77%  gjson.getBytes
      124.17MB  3.64%  bytes.growSlice
      113.28MB  3.32%  encoding/json.Marshal
         112MB  3.28%  riverlog.buildAppendedArray
      111.78MB  3.28%  sjson.appendRawPaths
      111.76MB  3.28%  pgproto3.(*Bind).Encode

We have currently applied a workaround for our case (which is less logging and running less frequently), so this is not a blocker. But it would be nice to see the memory amplification being resolved eventually.

@bertpersyn-nxh

Thanks @Sroose for the detailed output :)

@bgentry and @brandur, thanks for the quick response and fix.

For users of the riverqueue Go SDK that do not want to use the logs at runtime (opt-out?), are there any Postgres/jsonb strategies that append (concatenate) and trim logs server-side when updating the river_job row? This could essentially remove the whole memory issue on the client side.

A (quick and non-production-ready) example:

-- 1. Example table
CREATE TEMP TABLE job_monitor (
    id serial PRIMARY KEY,
    metadata jsonb DEFAULT '{"logs": []}'::jsonb
);

-- 2. Seed data
INSERT INTO job_monitor (metadata)
VALUES ('{"status": "running", "logs": [{"m": "Start process", "t": 1700000000}]}');

-- 3. "Atomic Append & Rotate" update
UPDATE job_monitor
SET metadata = jsonb_set(
    metadata,
    '{logs}',
    (
        SELECT jsonb_agg(elem ORDER BY idx)  -- preserve chronological order (oldest -> newest) in stored array
        FROM (
            SELECT
                elem,
                idx,
                sum(length(elem::text)) OVER (ORDER BY idx DESC) AS total_bytes
            FROM unnest(
                array_append(
                    ARRAY(SELECT jsonb_array_elements(metadata->'logs')),
                    jsonb_build_object('m', 'Some text and some other text and some more text' || random(), 't', CURRENT_TIMESTAMP)::jsonb
                )
            ) WITH ORDINALITY AS t(elem, idx)
        ) sub
        WHERE total_bytes < 500
    )
)
WHERE id = 1;

SELECT
    pg_column_size(metadata) AS storage_size_bytes,
    jsonb_pretty(metadata) AS content
FROM job_monitor;

Execute step 3 a couple of times and the logs should rotate, keeping the stored array at roughly 500 bytes of data.

@bgentry
Contributor Author

bgentry commented Feb 25, 2026

@Sroose @bertpersyn-nxh I made another optimization in #1159 that lowers memory overhead for any snooze metadata updates when the metadata payload is large.

Beyond this, I'm not sure how much we can improve things without a significant re-architecture where we either don't store all these things in metadata or at least don't have to read/write them fully on every job update where some fields need adjustment. Most of the remaining overhead is in pgx itself from the sheer cost of writing those byte slices. Let me know if you have other ideas that seem workable!

4 participants