📻🧯 Mediabot v3: Radio Firebreak — non-blocking cancellation and timer-safe downloads

in Mediabot · started by TeuK · 3d ago

TeuK · 3d ago

Some bugs are loud.

Some bugs crash immediately, throw a stack trace, and wave a red flag.

This one was quieter.

It lived in the space between a running yt-dlp process, an IRC admin command, and an IO::Async timer that could still wake up later.

That is exactly the kind of runtime edge case worth fixing before it becomes a ghost in production.

This pass is about radio download safety.

No database schema change.

No migration.

No dramatic rewrite.

Just a targeted firebreak around the radio download lifecycle.

📻 The problem: cancelling a download was not quite enough

Mediabot can launch radio downloads through the radio request pipeline.

That workflow has moving parts:

a child process
temporary stdout/stderr files
a polling timer
a runtime state hash
admin commands such as radiodlstatus and radiodlcancel

The old behavior worked most of the time, but it had two weak points.

First, radiodlcancel used a blocking delay:

sleep 1;

That is not what you want in the IRC command path. Even one second is too much when the bot is supposed to keep processing network events and timers.

Second, the polling timer created by the radio request code was local to the download logic. It was not stored in the active job state.

That meant an admin could cancel the job, clear the state, remove temporary files, and still have an old timer closure waiting to fire later.

That is not a clean lifecycle.

That is a half-closed trapdoor.

🧯 What could go wrong?

The risk was not theoretical elegance.

The risk was runtime weirdness:

double cleanup
double waitpid
stale finish callback
download-finished message after cancel
temporary files already gone
internal state already reset
one-second IRC loop stall

These bugs are painful because they are timing-dependent.

You do not always see them.

Then one day a user cancels a slow download at exactly the wrong moment and the bot does something haunted.

Better to fix the haunted staircase before somebody walks on it.

🛠 What changed

Two files were touched:

Mediabot/AdminCommands.pm
Mediabot/Radio/Request.pm

Still no database change.

🧵 Radio job state now owns its timer

The active radio download state now stores more than just the child process metadata.

It also stores the event loop and the polling timer:

$bot->{_radio_request_download} = {
    active           => 1,
    pid              => $pid,
    query            => $query,
    started          => time(),
    stdout           => $stdout,
    stderr           => $stderr,
    loop             => $loop,
    timer            => undef,
    cancel_requested => 0,
    term_sent_at     => undef,
};

After the timer is created, it is written back into the job:

$bot->{_radio_request_download}->{timer} = $timer;
$bot->{_radio_request_download}->{loop}  = $loop;

That is the key improvement.

The timer is no longer an invisible local thing.

It is part of the job lifecycle.

If the job is cancelled, the timer can be removed.

If the job is stale, the callback can detect it and exit.

🛑 `radiodlcancel` now cancels the whole lifecycle

The admin command now does the right thing in the right order.

It:

marks the job as cancel_requested
removes the stored polling timer from the loop
sends TERM
schedules a delayed KILL escalation without blocking
cleans temporary stdout/stderr files
clears _radio_request_download
returns quickly to the IRC loop

The important part is that it no longer blocks with sleep 1.

Instead, it uses an IO::Async::Timer::Countdown to escalate after one second if the child process is still alive.

That means the bot can keep breathing while the cancel sequence finishes.

A bot should not hold its breath just because yt-dlp is being stubborn.

⏱ Timeout handling is non-blocking too

The download timeout path also no longer does this:

kill 'TERM', $pid;
sleep 1;
kill 'KILL', $pid;

Instead, timeout escalation is split across timer ticks:

first timeout tick -> TERM
later tick         -> KILL if still alive
normal polling     -> reaps the child

That keeps the IO loop responsive.

More importantly, the timer callback now checks whether it is still responsible for the active job:

job is active
cancel_requested is false
pid still matches

If not, it exits.

That prevents stale callbacks from running a normal finish path after an admin cancel.

🧪 Suggested validation

Syntax checks:

perl -I. -c Mediabot/AdminCommands.pm
perl -I. -c Mediabot/Radio/Request.pm
perl -I. -c mediabot.pl

General hygiene:

git diff --check

Runtime smoke test on a dev bot:

m radiodlstatus
m play <something not cached>
m radiodlstatus
m radiodlcancel
m radiodlstatus

Expected behavior:

download job starts
status shows active pid
cancel returns quickly
status returns idle after cleanup
no stale “download finished” message appears after cancel
no one-second IRC loop freeze

🧹 Commit hygiene

Because the patch scripts create safety backups, remember to remove them before commit:

find /home/mediabot/mediabot_v3 \
  \( -name '*.bak.mb125_radio_cancel_timer_safe.*' \
     -o -name '*.bak.mb125_radio_cancel_timer_safe_v2.*' \
     -o -name '*.bak.mb125_radio_cancel_timer_safe_v3.*' \) \
  -delete

Then verify:

git status --short

No backup files should be staged or untracked.

🔥 Why “Radio Firebreak”?

A firebreak is not the fire.

It is the line you cut before the fire reaches the next forest.

That is what this patch does.

It does not redesign the whole radio subsystem.

It does not change the schema.

It does not invent a grand new radio architecture.

It simply makes sure that a cancelled download is really cancelled, that old timers do not wander back later, and that the IRC loop is not frozen by a blocking sleep.

Small patch.

Good engineering.

The kind that prevents tomorrow’s weird bug from becoming tomorrow night’s emergency.

📻🧯🕯️

You must be logged in to reply.