🧙‍♂️⏱️ Mediabot v3: The Time-Turner Cleanup Pass

in Mediabot · started by TeuK · 2w ago

TeuK · 2w ago

This development pass was all about one of the most annoying classes of bugs in long-running bots:

runtime timers and in-memory state that survive longer than they should

No database schema change.

No migration.

No table altered.

The focus was runtime correctness: DCC timers, authentication timers, DB-backed scheduled timers, reconnect behavior, and a few nasty edge cases that would not necessarily crash the bot but could make debugging painful weeks or months later.

Why this pass mattered

Mediabot is a long-running IRC bot.

That means small state leaks are not harmless.

A timer that stays alive for sixty extra seconds is not dramatic once.

A timer that gets duplicated on every reconnect is a different story.

A pending DCC offer that survives after its listener has closed can leave a user stuck with confusing messages.

A DB-backed timer that disappears after a transient database hiccup can make scheduled commands silently stop working.

This pass cleaned those traps.

mb145 — DCC offer timeout cleanup

The first issue was in the DCC CHAT offer path.

When the bot offers a DCC CHAT session, it creates a listener and a 60-second timeout. If the client never connects, the timeout closes the listener. That part was fine.

The bug was the successful connection path.

When the client connected before the timeout expired, the code removed the listener and started the DCC session, but the timeout timer itself stayed scheduled until its original expiry.

That meant the countdown timer and its closure remained alive unnecessarily.

The fix was to keep a reference to the timeout timer and cancel it immediately when the DCC client connects.

Affected paths:

offer_dcc_chat
accept_dcc_chat_passive

The same cleanup was also added to listen-error paths.

Result:

DCC client connects
offer removed
listener removed
timeout stopped
timeout removed from IO::Async loop

No dangling countdown waiting around for no reason.

mb146 — passive DCC timeout removes the pending offer

The next DCC issue was more subtle.

In the active DCC path, when an offer timed out, the bot removed the pending offer entry.

But in the passive DCC path, timeout only closed the listener. It did not remove the pending offer.

That could leave the user stuck in a false state:

A DCC CHAT offer is already pending. Please connect to it or wait for timeout.

Except the timeout already happened.

The fix added pending-offer cleanup to the passive timeout handler:

$self->_dcc_offer_remove('passive_chat', $nick);
eval { $loop->remove($listener) };

This removed another runtime ghost.

mb147 — DCC authentication timeout cleanup

Once DCC is connected, the bot starts a 60-second authentication timeout.

That timer is useful while the user is unauthenticated.

But after successful login, it becomes obsolete.

Before this pass, the timeout callback checked whether the user was already authenticated and returned safely, but the timer still stayed scheduled until expiry.

The same applied when the user disconnected before authenticating: the timer stayed alive until expiry, only to discover later that the session was gone.

The fix introduced:

_cancel_auth_timeout($id)

and stored the timeout handle in the session:

$self->{users}{$id}{auth_timeout_timer} = $timeout_timer;

The timeout is now stopped and removed:

on successful DCC authentication
on session close/disconnect

This prevents retained timer closures after login or early disconnect.

mb148 — `remtimer` really stops runtime timers

Then the audit moved from DCC timers to database-backed bot timers.

The remtimer command checked that a timer existed in runtime memory, then deleted the row from the TIMERS table.

But if the database row was already gone, the old code returned with a notice:

Timer <name> was running but not found in database

and left the runtime timer alive.

That is exactly the kind of drift that makes operators crazy:

DB says timer is gone
runtime timer still fires
remtimer says it was not found
restart fixes it

The fix now treats runtime as the stale side if the DB row is missing.

It stops and removes the runtime timer anyway:

my $timer = delete $self->{hTimers}{$name};
if ($timer) {
    eval { $timer->stop if $timer->can('stop') };
    eval { $self->{loop}->remove($timer) if $self->{loop} };
}

The normal removal path was also improved to explicitly call stop() before removing the timer from the loop.

mb149 — main timer getter typo

This one looked tiny, but it was not tiny.

The setter stored the main periodic timer here:

$self->{main_timer_tick} = $timer;

But the getter returned:

$self->{maint_timer_tick}

One extra t.

Because of that, reconnect cleanup could never retrieve the old main timer.

The reconnect code tried to do the right thing:

my $old_timer = $mediabot->getMainTimerTick();
if ($old_timer) {
    $old_timer->stop;
    $loop->remove($old_timer);
}

but $old_timer was always undefined.

So every reconnect could create another main 5-second timer while the previous one kept running.

This is exactly the kind of bug that looks impossible later:

Why is on_timer_tick firing twice?
Why only after reconnects?
Why does restart fix it?

The fix made the getter match the setter:

return $self->{main_timer_tick};

mb150 — avoid DB timer duplication after reconnect

After fixing the main timer getter, the next issue was the DB-backed timer loader.

on_login() calls:

$mediabot->onStartTimers();

That happens at initial connection, but also after IRC reconnects.

onStartTimers() creates IO::Async::Timer::Periodic objects from the TIMERS table.

Before mb150, it overwrote the hash of known timers without stopping the old timer objects first.

That meant after reconnects, the same scheduled DB timer could be duplicated:

startup       -> 1 timer
reconnect #1  -> 2 timers
reconnect #2  -> 3 timers

The fix introduced a helper to stop and remove all currently registered DB timers before replacing them:

_stop_all_runtime_db_timers()

There was one integration issue: DBCommands.pm exports methods into package Mediabot, so the helper also had to be listed in @EXPORT.

That was corrected and covered by a stronger regression test.

mb151 — make `onStartTimers()` reload atomic-ish

mb150 fixed duplication, but the first version had an important operational flaw.

It stopped old timers before proving that the database reload could succeed.

That meant a transient DB failure during reconnect could do this:

stop current runtime timers
DB prepare or execute fails
return 0
all DB-backed timers are now gone

No crash.

No obvious error beyond a log line.

Just scheduled commands silently no longer firing.

mb151 made the reload safer:

build fresh timers first
if DB reload fails, keep old runtime timers alive
if DB reload succeeds, stop old timers and publish the fresh set

So the commit point is now:

$self->_stop_all_runtime_db_timers();
%{$self->{hTimers}} = %hTimers;
return $i;

This keeps the mb150 protection against duplicate timers without turning a temporary DB hiccup into a silent timer outage.

Tests added and strengthened

This pass added or updated tests around the runtime cleanup paths:

396_mb145_dcc_offer_timer_cleanup.t
397_mb146_dcc_passive_timeout_offer_cleanup.t
398_mb147_dcc_auth_timeout_cleanup.t
399_mb148_runtime_timer_removal_cleanup.t
400_mb149_main_timer_tick_getter_typo.t
401_mb150_onstarttimers_reconnect_cleanup.t
402_mb151_onstarttimers_atomic_reload.t

The tests cover:

DCC offer timeout cancellation on connect
passive DCC pending-offer cleanup on timeout
DCC authentication timeout cancellation on login/close
runtime timer removal even when DB row is missing
main timer getter/setter consistency
DB timer cleanup across reconnects
atomic-ish DB timer reload behavior
Exporter integration for DBCommands helpers

The important lesson from this pass: the test suite must check not only that code exists, but also that the method is visible through Mediabot’s actual runtime architecture.

Recommended validation

cd /home/mediabot/mediabot_v3 || exit 1

perl -I. -c Mediabot/Partyline.pm
perl -I. -c Mediabot/DBCommands.pm
perl -I. -c Mediabot/Mediabot.pm
perl -c mediabot.pl

perl -c t/cases/396_mb145_dcc_offer_timer_cleanup.t
perl -c t/cases/397_mb146_dcc_passive_timeout_offer_cleanup.t
perl -c t/cases/398_mb147_dcc_auth_timeout_cleanup.t
perl -c t/cases/399_mb148_runtime_timer_removal_cleanup.t
perl -c t/cases/400_mb149_main_timer_tick_getter_typo.t
perl -c t/cases/401_mb150_onstarttimers_reconnect_cleanup.t
perl -c t/cases/402_mb151_onstarttimers_atomic_reload.t

perl t/cases/396_mb145_dcc_offer_timer_cleanup.t
perl t/cases/397_mb146_dcc_passive_timeout_offer_cleanup.t
perl t/cases/398_mb147_dcc_auth_timeout_cleanup.t
perl t/cases/399_mb148_runtime_timer_removal_cleanup.t
perl t/cases/400_mb149_main_timer_tick_getter_typo.t
perl t/cases/401_mb150_onstarttimers_reconnect_cleanup.t
perl t/cases/402_mb151_onstarttimers_atomic_reload.t

perl -I. -e 'require "Mediabot/Mediabot.pm"; die "missing method\n" unless Mediabot->can("_stop_all_runtime_db_timers"); print "OK: Mediabot can _stop_all_runtime_db_timers\n"'

git diff --check

Then a real foreground boot remains the best smoke test:

./mediabot.pl --conf=mediabot.conf

Operational impact

This pass removes several long-running-process traps:

DCC offer timers no longer linger after successful connection
passive DCC stale offers no longer block later attempts
DCC auth timers no longer linger after login or disconnect
remtimer now cleans runtime even when DB has already drifted
main timer cleanup works after reconnect
DB-backed timers no longer duplicate after reconnect
DB-backed timers no longer disappear after transient reload failure

These are not flashy user-facing features.

They are better than that.

They make the bot calmer, more predictable, and much easier to debug after it has been running for days.

Final note

This was a Time-Turner pass.

The bot already worked, but some old timers kept echoes of past states alive longer than they should.

Now the runtime state is cleaner:

less duplicated work
fewer stale offers
fewer retained closures
safer reconnect behavior
better timer lifecycle discipline

No ghosts left ticking in the walls.

🧙‍♂️⏱️

You must be logged in to reply.