Like systemd-timers' 'persistent' option, so that the user can configure
a job to not be run after powering up when it was previously missed.

This reverses the default behaviour: missed jobs are no longer run after
pvescheduler starts, since most of the time that's not what's desired.

Since we don't use it for updated schedules anymore, rename
'updated_job_schedule' to 'update_last_runtime'.
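
A rough sketch of the intended start-up check; the 'repeat-missed'
config key and the helper names are illustrative assumptions here, not
the actual implementation:

    sub check_missed_jobs {
        my ($jobs) = @_; # hash of jobid => job config/state
        for my $job (values %$jobs) {
            my $next = next_runtime($job->{schedule}, $job->{last_run} // 0);
            next if $next > time(); # nothing was missed, nothing to decide
            if ($job->{'repeat-missed'}) {
                run_job($job); # catch up, like systemd-timers' persistent
            } else {
                update_last_runtime($job); # new default: skip, only mark as run
            }
        }
    }
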
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>
Avoid hard-coding the replication stack's current assumption that it
does not get started again until the old worker is done.
We still apply the same check, but changing that to let the jobs have
control is rather easy now.
Also rework the stop logic: send TERM to _all_ workers and make the
timeout an actually shared one (not: the first worker gets the whole
timeout while the remaining ones get killed right away), then send a
KILL to the stuck, leftover ones in one go at the end, including some
logging so that the admin can actually know about this non-ideal
situation.
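
In pseudo-Perl, the reworked stop sequence could look like this (sub
and variable names are illustrative, not the actual code):

    use POSIX qw(WNOHANG);

    sub stop_workers {
        my ($workers, $timeout) = @_; # $workers: hash of pid => 1

        kill 'TERM', keys %$workers; # signal _all_ workers first

        my $deadline = time() + $timeout; # one deadline shared by all
        while (%$workers && time() < $deadline) {
            for my $pid (keys %$workers) {
                delete $workers->{$pid} if waitpid($pid, WNOHANG) > 0;
            }
            sleep 1 if %$workers;
        }

        # leftover, stuck workers get killed in one go, with logging
        for my $pid (keys %$workers) {
            warn "worker $pid did not stop in time, sending KILL\n";
            kill 'KILL', $pid;
            waitpid($pid, 0);
        }
    }
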
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Utilize PVE::Daemon's 'hup' functionality to reload gracefully. This
leaves the children running (if any) and hands them to the new instance
via environment variables. After reloading, check whether they are
still around.
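
A minimal sketch of that hand-over, with PVE_DAEMON_WORKER_PIDS as a
made-up variable name for illustration:

    # before the re-exec (old instance): hand over the child PIDs
    # ($workers: hash of pid => 1, as tracked by the daemon)
    $ENV{PVE_DAEMON_WORKER_PIDS} = join(',', keys %$workers);

    # after the reload (new instance): adopt children still alive
    my $adopted = {};
    for my $pid (split(/,/, $ENV{PVE_DAEMON_WORKER_PIDS} // '')) {
        # kill with signal 0 only checks that the process still exists
        $adopted->{$pid} = 1 if kill(0, $pid);
    }
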
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Previously, systemd timers were responsible for running replication
jobs, and those timers would not restart while the previous invocation
was still running. Trying again while one is running does no real harm,
but it spams the log with errors about not being able to acquire the
correct lock.

To fix this, rework the handling of child processes such that we only
start one per loop iteration if none is currently running. For that,
introduce the types of forks we do and allow one child process per type
(for now, we have 'jobs' and 'replication' as types).
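
The resulting structure is roughly (illustrative sketch, not the actual
code):

    use POSIX qw(WNOHANG);

    my $forks = {}; # type ('jobs', 'replication') => pid of running child

    # start a child for $type only if none of that type is running
    sub fork_once {
        my ($type, $code) = @_;
        return if defined $forks->{$type}; # skip, retry next loop iteration

        my $pid = fork() // die "fork failed: $!\n";
        if ($pid == 0) { # child: do the work, then exit without cleanup
            eval { $code->() };
            POSIX::_exit(0);
        }
        $forks->{$type} = $pid;
    }

    # called from the main loop: free the slot of finished children
    sub reap_forks {
        for my $type (keys %$forks) {
            delete $forks->{$type} if waitpid($forks->{$type}, WNOHANG) > 0;
        }
    }
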
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
If '$sub' dies, the error handler of PVE::Daemon triggers, which
initiates a shutdown of the child, resulting in confusing error logs
(e.g. 'got shutdown request, signal running jobs to stop'). Instead,
run it under 'eval' and print the error to the syslog.
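
A minimal, self-contained example of that containment (using
Sys::Syslog's syslog() here for illustration):

    use Sys::Syslog qw(syslog);

    my $sub = sub { die "replication run failed\n" }; # the actual work
    eval { $sub->() };
    if (my $err = $@) {
        # no shutdown of the child is triggered; only the error is logged
        syslog('err', "error during scheduled run: %s", $err);
    }
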
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
The whole thing is already prepared for this; the systemd timer was
just a fixed periodic timer with a frequency of one minute. And we only
introduced it on the assumption that this approach would use less
memory, AFAIK.

But logging 4+ lines just to say that the timer was started, even if it
does nothing, and that 24/7, is not too cheap and a bit annoying.

So, as a first step, add a simple daemon which forks off a child for
running jobs once a minute.

This could still be made a bit more intelligent, i.e., check whether we
have jobs to run before forking - as forking is not the cheapest
syscall. Further, we could adapt the sleep interval to the next time we
actually need to run a job (and send a SIGUSR to the daemon if a job's
interval changes such that the interval got narrower).

We try to sync runs to minute-change boundaries at start; this emulates
the systemd.timer behaviour we had until now. Also, users can configure
jobs with minute precision, so they probably expect those to start
really close to a minute-change event.

This could be adapted to resync while running, to factor in time drift.
But as long as enough CPU cycles are available we run in correct
monotonic intervals, so this isn't a must, IMO.

Another improvement could be more fine-grained locking, i.e., not on a
per-all-local-job-runs basis, but per-job (per-guest?), which would
reduce the temporary starvation of small high-frequency jobs by big,
less periodic ones. We argued that it's the user's fault if such
situations arise, but they can evolve over time without anyone
noticing, especially in more complex setups.
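
The boundary sync mentioned above is cheap; a sketch of the loop, with
run_jobs() standing in for the actual fork of the job-runner child:

    sleep(60 - (time() % 60)); # wait for the next minute-change boundary

    while (1) {
        my $started = time();
        run_jobs();

        # subtract the work time so the next run lands on the next minute
        my $elapsed = time() - $started;
        sleep(60 - $elapsed) if $elapsed < 60;
    }
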
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>