Forum teuk.org

🧿🪶 Mediabot v3: The Last Four Misbehaving Runes — MB373 to MB376

in Mediabot · started by TeuK · 19h ago

TeuK · 19h ago

This final debugging pass did not uncover a collapsing staircase or a dragon beneath the database.

It found four much smaller things — precisely the kind that survive for years in mature software:

  • a nickname accidentally treated as a regular expression;
  • an AI message measured in letters while IRC counts bytes;
  • a Prometheus help line trusting every backslash;
  • and a truncation suffix added after the byte budget had already been declared safe.

Claude completed MB373 through MB375. The final review added MB376 by following the complete AI output path one step beyond the new wrapper.

The standing rule remained untouched:

Improve the bot without changing its database schema, and make every number describe what actually reaches the wire.


🧿🪪 MB373 — The bot’s nickname is a name, not a spell pattern

The Hailo mention path used the current IRC nickname directly inside a regular expression:

$what =~ /$sCurrentNick/i

That is harmless for a simple alphanumeric nick.

IRC nicknames, however, may legally contain special characters, most of which are regex metacharacters:

[ ] \ ` ^ { } |

A nickname such as bot|x could match an unrelated x.

A nickname containing an unmatched [ could produce an invalid expression and break the message callback.

MB373 quotes the nickname literally in both places where it is used:

\Q$sCurrentNick\E

The removal step is now case-insensitive as well, matching the detection behaviour.
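
A minimal sketch of the difference, assuming a made-up nick bot|x and hypothetical helper names (mentions_nick_unsafe, mentions_nick, strip_nick) that are not taken from the Mediabot source:

```perl
use strict;
use warnings;

# Hypothetical nick containing a regex metacharacter.
my $sCurrentNick = 'bot|x';

# Unquoted: the pattern reads as "bot OR x", so any text containing x matches.
sub mentions_nick_unsafe {
    my ($what) = @_;
    return $what =~ /$sCurrentNick/i ? 1 : 0;
}

# Quoted: \Q...\E makes every character of the nick literal.
sub mentions_nick {
    my ($what) = @_;
    return $what =~ /\Q$sCurrentNick\E/i ? 1 : 0;
}

# Removal step, literal and case-insensitive like the detection.
sub strip_nick {
    my ($what) = @_;
    (my $copy = $what) =~ s/\Q$sCurrentNick\E//gi;
    return $copy;
}

print mentions_nick_unsafe('fix the xml parser'), "\n";   # 1 (false positive)
print mentions_nick('fix the xml parser'), "\n";          # 0
print mentions_nick('hello BOT|X'), "\n";                 # 1

# An unmatched [ is worse: the interpolated pattern fails to compile and dies.
my $broken_nick = 'bot[';
my $compiled = eval { my $m = ('hello' =~ /$broken_nick/i); 1 };
print "unquoted 'bot[' dies: ", ($compiled ? 'no' : 'yes'), "\n";
```

The eval is only there to keep the demonstration alive. In a message callback without such a guard, the die would propagate.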

A wizard’s name may contain unusual symbols. It should not become an accidental incantation merely because Perl sees a vertical bar.


🪶📏 MB374 — AI replies are measured by the owl’s luggage, not by the number of letters

IRC limits are byte limits.

The old _chatgpt_wrap() counted characters.

For plain ASCII, those numbers always agree. For French accents and emojis, they do not.

Three hundred é characters are six hundred bytes in UTF-8. A chunk described as “400 characters” could therefore be far too large for the intended IRC payload.
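
The arithmetic is easy to check with the core Encode module. This is only an illustration of the two ways of counting, not code from the bot:

```perl
use strict;
use warnings;
use Encode qw(encode);

# 300 copies of U+00E9 (é) as a decoded Perl string.
my $text = "\x{E9}" x 300;

my $chars = length $text;                    # counts characters
my $bytes = length encode('UTF-8', $text);   # counts bytes after UTF-8 encoding

printf "%d characters, %d bytes\n", $chars, $bytes;   # 300 characters, 600 bytes
```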

The lower-level botPrivmsg() already had a proven byte-safe splitter, so MB374 removed the duplicate character-based implementation and delegated to:

Mediabot::Helpers::_split_text_for_irc()

This improved more than transport safety.

MAX_PRIVMSG is supposed to limit how many lines the AI sends. When oversized chunks were split again downstream, the configured count no longer matched the number of messages actually placed on IRC.

After MB374, the initial AI chunks are prepared according to the same byte rules used by the final sender.
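
For readers who want the general shape of a byte-aware splitter, here is a rough sketch. The name split_by_bytes and the 400-byte budget are assumptions for illustration. The real Mediabot::Helpers::_split_text_for_irc() is not reproduced here and may well prefer word boundaries or handle other details differently.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, NOT Mediabot's _split_text_for_irc():
# cut a decoded string into chunks of at most $max_bytes UTF-8 bytes,
# never splitting inside a character.
sub split_by_bytes {
    my ($text, $max_bytes) = @_;
    my @chunks;
    my ($current, $used) = ('', 0);
    for my $char (split //, $text) {
        my $cost = length encode('UTF-8', $char);
        # Close the current chunk when the next character would overflow it.
        if ($used + $cost > $max_bytes && length $current) {
            push @chunks, $current;
            ($current, $used) = ('', 0);
        }
        $current .= $char;
        $used    += $cost;
    }
    push @chunks, $current if length $current;
    return @chunks;
}

# 300 two-byte characters against an assumed 400-byte budget.
my @chunks = split_by_bytes("\x{E9}" x 300, 400);
printf "%d chunks\n", scalar @chunks;   # 2 chunks
```

When the first pass and the final sender agree on this kind of budget, the sender has nothing left to re-split, which is what keeps the MAX_PRIVMSG count meaningful.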

The owl is weighed with its parcel, not by counting the feathers on the label.


🛰️📜 MB375 — One strange HELP line can no longer blind the Prometheus watchtower

Prometheus exposition includes metadata lines such as:

# HELP metric_name description

Metric label values were already escaped correctly.

The HELP text was not.

A future description containing a backslash or an embedded newline could produce malformed exposition. Because Prometheus parses the complete document, one broken HELP line could make the whole scrape fail.

MB375 added:

_escape_help_text()

It escapes backslashes first and newlines second, following the exposition format.

The # TYPE line also receives the defensive fallback:

untyped

when a type is absent.
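
A small sketch of what such an escaper can look like. The helper names and the metric name mediabot_example_total are invented for the example, and the real _escape_help_text() may differ in detail:

```perl
use strict;
use warnings;

# Hypothetical sketch of a HELP escaper for the Prometheus text format.
sub escape_help_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\\/\\\\/g;   # backslash first, so the next step is not doubled
    $text =~ s/\n/\\n/g;    # newline becomes the two characters \ and n
    return $text;
}

# Build the two metadata lines, falling back to "untyped" when no type is given.
sub metadata_lines {
    my ($name, $help, $type) = @_;
    $type = 'untyped' unless defined $type && length $type;
    return (
        "# HELP $name " . escape_help_text($help),
        "# TYPE $name $type",
    );
}

# A Windows-style path plus an embedded newline, and no type.
print "$_\n" for metadata_lines('mediabot_example_total', "Path C:\\logs\nsecond line");
```

The order matters: escaping newlines first would produce a backslash that the second step then doubles, turning one newline into \\n instead of \n.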

Current constant help strings remain visually unchanged. The protection is for the next contributor who innocently adds a Windows path, a regular expression or a multiline explanation.

The watchtower no longer goes dark because one annotation arrived with an adventurous backslash.


🪡🧮 MB376 — The truncation suffix is now counted before the parchment is cut

MB374 fixed the main wrapper, but one old piece of character arithmetic remained.

When an answer exceeded MAX_PRIVMSG, ChatGPT and Claude added a suffix to the final permitted chunk.

The historical code calculated the available space like this:

my $allow = $wrap_bytes - length($suffix);

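It then used character-based substr() before appending the suffix. On a decoded Perl string, length() counts characters too, so both the reservation and the cut were made in characters while the budget was expressed in bytes.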

The ChatGPT suffix itself contains UTF-8 characters:

 [¯\_(ツ)_/¯ guess you can’t have everything…]

That meant the line could be byte-safe before the suffix, exceed the budget afterwards, and be split again by botPrivmsg().

The old mismatch returned through the back door:

MAX_PRIVMSG configured : 4
chunks prepared        : 4
lines actually sent    : potentially more than 4

One byte-safe seal for both AI providers

MB376 introduced:

_irc_wire_bytes()
_irc_prefix_for_budget()
_fit_truncation_suffix()

The final helper calculates the real UTF-8 cost of the suffix first.

Only the remaining byte allowance is offered to the text prefix.

The final contract is simple:

bytes(prefix + suffix) <= WRAP_BYTES
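
The contract can be sketched like this. The function names utf8_bytes and fit_truncation_suffix and the 400-byte budget are assumptions for the example; they stand in for the real helpers, whose code is not shown in this post.

```perl
use strict;
use warnings;
use utf8;                      # the suffix literal below is UTF-8 source text
use Encode qw(encode);

# Byte cost of a decoded string once encoded for the wire.
sub utf8_bytes {
    my ($s) = @_;
    return length encode('UTF-8', $s);
}

# Hypothetical sketch; the real _fit_truncation_suffix() may differ.
# Assumes the suffix on its own fits inside $wrap_bytes.
sub fit_truncation_suffix {
    my ($text, $suffix, $wrap_bytes) = @_;
    my $budget = $wrap_bytes - utf8_bytes($suffix);   # reserve the suffix first
    $budget = 0 if $budget < 0;
    my ($prefix, $used) = ('', 0);
    for my $char (split //, $text) {
        my $cost = utf8_bytes($char);
        last if $used + $cost > $budget;   # never cut inside a character
        $prefix .= $char;
        $used   += $cost;
    }
    return $prefix . $suffix;
}

my $suffix = q{ [¯\_(ツ)_/¯ guess you can’t have everything…]};
my $line   = fit_truncation_suffix("\x{E9}" x 500, $suffix, 400);

printf "suffix: %d characters, %d bytes\n", length $suffix, utf8_bytes($suffix);
printf "final line: %d bytes\n", utf8_bytes($line);
```

With the suffix as rendered in this post, the character count is smaller than the byte count, and that gap is exactly what the old length($suffix) reservation left uncovered.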

Both OpenAI and Anthropic now use the same helper.

The visible suffixes remain unchanged. Only their accounting has become honest.

The parchment is no longer measured, cut, and then given an unmeasured decorative border.


⚗️🧪 Validation ledger

Claude’s three rounds arrived with complete green suites:

MB373 new test        : 13/13
MB373 full suite      : 7935/7935

MB374 new test        : 13/13
MB374 ChatGPT tests   : 26/26
MB374 full suite      : 7948/7948

MB375 new test        : 11/11
MB375 full suite      : 7959/7959

The final review added:

MB376 new test                 : 17/17
AI/Prometheus regression       : 67/67
MB373–MB376 regression         : 54/54
MB361–MB376 regression         : 405/405

The MB376 installer was tested on two clean copies of the MB375 snapshot and on a copy where MB376 had already been applied.

These test groups overlap and should not be added into an artificial grand total.


🧱 Database impact

None.

0 new tables
0 altered columns
0 migrations
0 schema changes

🗺️ What the final four rounds changed

Mediabot now:

  • treats its own IRC nickname literally in Hailo regex operations;
  • prepares AI replies according to UTF-8 wire bytes;
  • keeps MAX_PRIVMSG aligned with the number of intended IRC messages;
  • emits Prometheus HELP text that follows the exposition format;
  • reserves suffix space before truncating ChatGPT or Claude output;
  • uses one shared truncation contract for both AI providers.

🧹✨ Mischief managed — at least for this chapter

The interesting part of this pass is not that the bugs were dramatic.

They were not.

They were boundary disagreements:

name versus regex
character versus byte
description versus exposition syntax
chunk budget versus final decorated line

Each component looked reasonable alone.

The defect appeared where one component handed its result to the next under a slightly different definition.

That is where the final hunt ended: not with a grand rewrite, but with four definitions made consistent.

The nickname stays a nickname.

The byte count reaches the wire unchanged.

The Prometheus scroll remains readable.

And the final suffix fits on the parchment it decorates.

Teuk
