Some bugs are loud.
Some bugs crash immediately, throw a stack trace, and wave a red flag.
This one was quieter.
It lived in the space between a running yt-dlp process, an IRC admin command, and an IO::Async timer that could still wake up later.
That is exactly the kind of runtime edge case worth fixing before it becomes a ghost in production.
This pass is about radio download safety.
No database schema change.
No migration.
No dramatic rewrite.
Just a targeted firebreak around the radio download lifecycle.
Mediabot can launch radio downloads through the radio request pipeline.
That workflow has moving parts:
a child process
temporary stdout/stderr files
a polling timer
a runtime state hash
admin commands such as radiodlstatus and radiodlcancel
The old behavior worked most of the time, but it had two weak points.
First, radiodlcancel used a blocking delay:
sleep 1;
That is not what you want in the IRC command path. Even one second is too much when the bot is supposed to keep processing network events and timers.
Second, the polling timer created by the radio request code was local to the download logic. It was not stored in the active job state.
That meant an admin could cancel the job, clear the state, remove temporary files, and still have an old timer closure waiting to fire later.
That is not a clean lifecycle.
That is a half-closed trapdoor.
The risk was not theoretical elegance.
The risk was runtime weirdness:
double cleanup
double waitpid
stale finish callback
download-finished message after cancel
temporary files already gone
internal state already reset
one-second IRC loop stall
These bugs are painful because they are timing-dependent.
You do not always see them.
Then one day a user cancels a slow download at exactly the wrong moment and the bot does something haunted.
Better to fix the haunted staircase before somebody walks on it.
Two files were touched:
Mediabot/AdminCommands.pm
Mediabot/Radio/Request.pm
Still no database change.
The active radio download state now stores more than just the child process metadata.
It also stores the event loop and the polling timer:
$bot->{_radio_request_download} = {
active => 1,
pid => $pid,
query => $query,
started => time(),
stdout => $stdout,
stderr => $stderr,
loop => $loop,
timer => undef,
cancel_requested => 0,
term_sent_at => undef,
};
After the timer is created, it is written back into the job:
$bot->{_radio_request_download}->{timer} = $timer;
$bot->{_radio_request_download}->{loop} = $loop;
That is the key improvement.
The timer is no longer an invisible local thing.
It is part of the job lifecycle.
If the job is cancelled, the timer can be removed.
If the job is stale, the callback can detect it and exit.
radiodlcancel now cancels the whole lifecycleThe admin command now does the right thing in the right order.
It:
marks the job as cancel_requested
removes the stored polling timer from the loop
sends TERM
schedules a delayed KILL escalation without blocking
cleans temporary stdout/stderr files
clears _radio_request_download
returns quickly to the IRC loop
The important part is that it no longer blocks with sleep 1.
Instead, it uses an IO::Async::Timer::Countdown to escalate after one second if the child process is still alive.
That means the bot can keep breathing while the cancel sequence finishes.
A bot should not hold its breath just because yt-dlp is being stubborn.
The download timeout path also no longer does this:
kill 'TERM', $pid;
sleep 1;
kill 'KILL', $pid;
Instead, timeout escalation is split across timer ticks:
first timeout tick -> TERM
later tick -> KILL if still alive
normal polling -> reaps the child
That keeps the IO loop responsive.
More importantly, the timer callback now checks whether it is still responsible for the active job:
job is active
cancel_requested is false
pid still matches
If not, it exits.
That prevents stale callbacks from running a normal finish path after an admin cancel.
Syntax checks:
perl -I. -c Mediabot/AdminCommands.pm
perl -I. -c Mediabot/Radio/Request.pm
perl -I. -c mediabot.pl
General hygiene:
git diff --check
Runtime smoke test on a dev bot:
m radiodlstatus
m play <something not cached>
m radiodlstatus
m radiodlcancel
m radiodlstatus
Expected behavior:
download job starts
status shows active pid
cancel returns quickly
status returns idle after cleanup
no stale “download finished” message appears after cancel
no one-second IRC loop freeze
Because the patch scripts create safety backups, remember to remove them before commit:
find /home/mediabot/mediabot_v3 \
\( -name '*.bak.mb125_radio_cancel_timer_safe.*' \
-o -name '*.bak.mb125_radio_cancel_timer_safe_v2.*' \
-o -name '*.bak.mb125_radio_cancel_timer_safe_v3.*' \) \
-delete
Then verify:
git status --short
No backup files should be staged or untracked.
A firebreak is not the fire.
It is the line you cut before the fire reaches the next forest.
That is what this patch does.
It does not redesign the whole radio subsystem.
It does not change the schema.
It does not invent a grand new radio architecture.
It simply makes sure that a cancelled download is really cancelled, that old timers do not wander back later, and that the IRC loop is not frozen by a blocking sleep.
Small patch.
Good engineering.
The kind that prevents tomorrow’s weird bug from becoming tomorrow night’s emergency.
📻🧯🕯️
You must be logged in to reply.