This is required if we're going to have out of package resources. In
particular for third party packages, and also for if we decide to split
out each resource into a separate sub package.
This causes a graph to actually stop processing part way through, even
if there are poke's that want to continue on. This is so that the user
experience of pressing ^C actually causes a shutdown without finishing
the graph execution. It might be preferred to have this be a user
defined setting at some point in the future, such as if the user presses
^C twice. As well, we might want to implement an interrupt API so that
individual resource execution can be asked to bail out early if
requested. This could happen on a third ^C press.
I can't think of a reason we should grab a semaphore before backpoking.
The semaphore is intended to block around the actual work in CheckApply,
not the dependency resolution of the correct vertex.
This prevents some nasty races where a BackPoke could arrive on a paused
vertex either during a resume or pause operation. Previously we might
also have poked an excessive number of resources on resume.
The solution was to discard BackPokes during pause or resume. On pause,
they can be discarded because we've asked the graph to quiesce, and any
further work can be done on resume, and on resume we ignore them because
this should only happen during the unrolling (reverse topological resume
of the graph) and at the end of this the indegree == 0 vertices will
initiate a series of pokes which should deal with any BackPoke that was
possibly discarded.
One other aspect of this which is important: if an indegree == 0 vertex
is poked (Process runs) but it's already in the correct state, it should
still transmit the Poke through itself so that subsequent vertices know
to run. Currently this is done correctly in Process().
I'm a bit ashamed that this wasn't done properly in the engine earlier,
but I suppose that's what comes out of running fancier graphs and really
thinking in detail about what's truly correct. Hopefully I got it right
this time!
This prevents a nasty race that can happen in a graph with more than one
resource. If a resource has someone that it can BackPoke, and then
suppose an event comes in. It runs the obj.Event() method (from inside
its Watch loop) and then *before* the resulting Process method can run
it receives a pause event and pauses. Then the parent resource pauses as
well. Finally (it's a race) the Process gets around to running, and
decides it needs to BackPoke. At this point since the parent resource is
paused, it receives the BackPoke at a time when it can't handle
receiving one, and it panics!
As a result, we now track the number of running Process possibilities
via a WaitGroup which gets incremented from the obj.Event() and we don't
finish our pause or exit operations until it has quiesced and our
WaitGroup lets us know via Wait(). Lastly in order to prevent repeated
replays, we detect when we're quiescing and suspend replaying until post
pause. We don't need to save the replay (playback variable) explicitly
because its state remains during pause, and on exit it would get
re-checked anyways.
This adds a P/V style semaphore mechanism to the resource graph. This
enables the user to specify a number of "id:count" tags associated with
each resource which will reduce the parallelism of the CheckApply
operation to that maximum count.
This is particularly interesting because (assuming I'm not mistaken) the
implementation is dead-lock free assuming that no individual resource
permanently ever blocks during execution! I don't have a formal proof of
this, but I was able to convince myself on paper that it was the case.
An actual proof that N P/V counting semaphores in a DAG won't ever
dead-lock would be particularly welcome! Hint: the trick is to acquire
them in alphabetical order while respecting the DAG flow. Disclaimer,
this assumes that the lock count is always > 0 of course.
This cleans up some of the resource events and also reorganizes the
struct for simplicity. This should hopefully kill off at least one race
which would cause unnecessary blocking!
Yes this patch is a bit yucky, but so was the bug I was fighting with!
I'm still working on reducing the size of the monster patches that I
land, but I'm exercising the priviledge as the initial author. In any
case, this refactors worker into two, and cleans up the passing around
of the processChan. This puts common code into Init and Close.
This more appropriately blocks converging in the engine, since we are
now 1-1 decoupled from the Watch resource. This simplifies resource
writing, and should be more accurate around small converged timeouts.
We don't block in the Worker routine when we are polling, because we
expect to get constant poll events, and we can instead be more careful
about these by looking at CheckApply results.
If we can do this for all resources in the future, it would be
excellent!
I can't guarantee this has a significant effect, but it's likely to add
some efficiency when sending multiple BackPoke's at the same time, so
that erroneous ones can be cancelled out easier.
The mgmt graph depends on state tracking to eliminate redundant pokes.
With the Watch loop now able to produce events quickly, it should no
longer play a part in determining the vertex state. This simplifies the
resource API as well!
This adds rate limiting with the limit and burst meta parameters. The
limits apply to how often the Process check is called. As a result, it
might get called more often than there are Watch events due to possible
Poke/BackPoke events.
This system might need to get rethought in the future depending on its
usefulness.
This patch makes a number of changes in the engine surrounding the
resource API. In particular:
* Cleanup of send/read event.
* Cleanup of DoSend (now Event) in the Watch method.
* Events are now more consistently pointers.
* Exiting within Watch is now done in a single place.
* Multiple incoming events will be combined into a single action.
* Events in flight during an action are played back after CheckApply.
* Addition of Close method to API
This gets things ready for rate limiting and semaphore metaparams!
This allows a resource to use polling instead of the event based
mechanism. This isn't recommended, but it could be useful, and it was
certainly fun to code!
This signals which resources have to run their initial pokes, and
removes the racy retry timer. We actually get a proper signal when
things are running too!
This removes some boilerplate from the Watch methods which can be baked
into the engine instead.
This code should be checked for races and locks to make sure we only
start resources when it makes sense to.
This takes the Converged initialization and Startup patterns that are
common in all resources, and bakes it into the core engine. This way
resource writing is much more concise and there is less boilerplate!
This polishes the password resource so that it can actually avoid
writing the password to disk, and so that the work actually happens in
CheckApply where it can properly interact with the graph. This resource
now re-generates the password when it receives a notification.
The send/recv plumbing has been extended so that receivers can detect
when they're receiving new values. This is particularly important if
they might otherwise not expect those values to change and cache them
for efficiency purposes.
Clearly the use of errgroup is flawed.
1) You can't pass in variables, so this is likely to race.
2) You can't get a set of errors, so this is a bad API.
For the second problem, it would be much more sane to return a multierr
or a list of errors. If there's no fix for the first, I think it should
be removed from the lib.
Resources can send "refresh" notifications along edges. These messages
are sent whenever the upstream (initiating vertex) changes state. When
the changed state propagates downstream, it will be paired with a
refresh flag which can be queried in the CheckApply method of that
resource.
Future work will include a stateful refresh tracking mechanism so that
if a refresh event is generated and not consumed, it will be saved
across an interrupt (shutdown) or a crash so that it can be re-applied
on the subsequent run. This is important because the unapplied refresh
is a form of hysteresis which needs to be tracked and remembered or we
won't be able to determine that the state is wrong!
Still to do:
* Update the autogrouping code to handle the edge notify properties!
* Actually finish the stateful bool code