Commit Graph

34 Commits

Author SHA1 Message Date
James Shubin
b46432b5b6 misc: Golint fixes 2016-09-02 01:46:45 -04:00
James Shubin
5e3f03df06 etcd: Catch possible raft grpc error 2016-09-02 00:35:05 -04:00
James Shubin
ff01e4a5e7 remote: Add distributed converged timeout
This patch extends the --converged-timeout argument so that when used
with --remote it waits for the entire set of remote mgmt agents to
converge simultaneously before exiting.

purpleidea says: This particular part of the patch probably took as much
work as all of the work required for the initial remote patches alone!
2016-08-31 21:55:19 -04:00
James Shubin
6794aff77c miscellaneous cleanups and fixes 2016-08-31 21:55:19 -04:00
James Shubin
636f2a36b1 etcd: Use new converged timers and allow skipping them
This implements the new extensions to the converged UUID API so that we
can keep a consistent timer running and reset it when needed. This is
useful because we now allow certain Watcher callbacks and other
operations to explicitly _not_ cause a convergerUUID to un-converge.
This is a necessary dependency for the distributed remote
converged-timeout work.
2016-08-31 21:55:19 -04:00
James Shubin
f5fb135793 converger: Update the API for errors and naming
We updated the API and behaviour to make two changes:

1) We remove the log.* stuff, and replace the Fatal calls with straight
calls to panic(). These are meant to be like an assert, and shouldn't
happen unless there is a user error or a bug.

2) We made the !uuid.IsValid() checks return an error instead of causing
a panic. These returns errors instead, and makes the process safer for
users who are okay calling a function that won't have an effect.
2016-08-31 21:55:19 -04:00
James Shubin
6bf32c978a etcd: Rename loop to be more consistent in messages
Small nitpick fixups
2016-08-31 21:55:19 -04:00
James Shubin
8d3011fb9c etcd: Add a timeout for etcd server to start correctly
This also updates etcd to a newer version with a fix that allows this
detection and timeout operation to be possible.
2016-08-31 21:55:19 -04:00
James Shubin
9260066fa3 tests: Workaround regression in two host etcd clusters
If you don't give your two host cluster enough time to "feel healthy",
it will generate an error if you do operations within five seconds. This
is a regression and the five seconds is also quite arbitrary. This is
detailed at: https://github.com/coreos/etcd/issues/6305

This seems to be a bit of a race condition, even with a 10s timer, so
this also disables the StrictReconfigCheck. Re-enable this as soon as
possible.
2016-08-31 21:55:19 -04:00
James Shubin
5e45c5805b Improve internal etcd error handling 2016-08-31 21:55:19 -04:00
James Shubin
03c1df98f4 Improve prefix creation and feedback
This makes the default /var/lib/mgmt/ directory, but also allows you to
use a prefix in /tmp/ automatically if you can't write anywhere else.
2016-08-31 21:55:19 -04:00
James Shubin
1d0e187838 Etcd: switch to using a directory prefix 2016-08-31 21:55:19 -04:00
James Shubin
7032eea045 Remote "agent-less" mode
This is a new mode to be used for bootstrapping mgmt clusters or in
situations with tight operational restrictions.

This includes the basics, additional functionality will follow!
2016-08-31 21:55:19 -04:00
James Shubin
24ba6abc6b Don't block on member exits
The MemberList() function which apparently looks at all of the endpoints
was blocking right after a member exited, because we still had a stale
list of endpoints, and while a member that exited safely would update
the endpoints lists, the other member would (unavoidably) race to get
that message, and so it might call the MemberList() with the now stale
endpoint list. This way we invalidate an endpoint we know to be gone
immediately.

This also adds a simple test case to catch this scenario.
2016-07-26 04:23:46 -04:00
James Shubin
f6c1bba3b6 Avoid a rare panic if DestroyServer is called early
I never actually hit this bug, but I noticed it was possible when
examining the WaitGroup code that gets .Done() by DestroyServer().
2016-07-26 01:58:13 -04:00
James Shubin
a606961a22 Be safe when closing in destroy in case client is nil 2016-07-25 21:46:08 -04:00
James Shubin
e28c1266cf Do some gofmt simplifications 2016-07-25 20:56:33 -04:00
James Shubin
7aeb55de70 Port embedded etcd to embed API
This also updates the vendored version of etcd to current git master,
which is the only place this is supported at the moment.
2016-07-25 19:10:09 -04:00
James Shubin
8ca65f9fda Revert "Copy in out of tree patches"
This reverts commit d26b503dca.

Use new etcd "embed" API.
2016-07-24 00:08:58 -04:00
James Shubin
d26b503dca Copy in out of tree patches
These patches are proposed upstream changes and code for and from etcd.
Ideally we would revert this patch when/if things are merged upstream!
The majority of the work is in: https://github.com/coreos/etcd/pull/5584
2016-06-18 04:43:19 -04:00
James Shubin
5363839ac8 Embedded etcd
This monster patch embeds the etcd server. It took a good deal of
iterative work to tweak small details, and survived a rewrite from the
initial etcd v2 API implementation to the beta version of v3.

It has a notable race, and is missing some features, but it is ready for
git master and external developer consumption.
2016-06-18 04:43:19 -04:00
James Shubin
6f3ac4bf2a Rework the converged detection and provide a clean interface
The old converged detection was hacked in code, instead of something
with a nice interface. This cleans it up, splits it into a separate
file, and removes a race condition that happened with the old code.

We also take the time to get rid of the ugly Set* methods and replace
them all with a single AssociateData method. This might be unnecessary
if we can pass in the Converger method at Resource construction.

Lastly, and most interesting, we suspend the individual timeout callers
when they've already converged, thus reducing unnecessary traffic, and
avoiding fast (eg: < 5 second) timers triggering more than once if they
stay converged!

A quick note on theory for any future readers... What happens if we have
--converged-timeout=0 ? Well, for this and any other positive value,
it's important to realize that deciding if something is converged is
actually a race between if the converged timer will fire and if some
random new event will get triggered. This is because there is nothing
that can actually predict if or when a new event will happen (eg the
user modifying a file). As a result, a race is always inherent, and
actually not a negative or "incorrect" algorithm.

A future improvement could be to add a global lock to each resource, and
to lock all resources when computing if we are converged or not. In
practice, this hasn't been necessary. The worst case scenario would be
(in theory, because this hasn't been tested) if an event happens
*during* the converged calculation, and starts running, the exit command
then runs, and the event finishes, but it doesn't get a chance to notify
some service to restart. A lock could probably fix this theoretical
case.
2016-03-29 20:27:38 -04:00
James Shubin
1b01f908e3 Add resource auto grouping
Sorry for the size of this patch, I was busy hacking and plumbing away
and it got out of hand! I'm allowing this because there doesn't seem to
be anyone hacking away on parts of the code that this would break, since
the resource code is fairly stable in this change. In particular, it
revisits and refreshes some areas of the code that didn't see anything
new or innovative since the project first started. I've gotten rid of a
lot of cruft, and in particular cleaned up some things that I didn't
know how to do better before! Here's hoping I'll continue to learn and
have more to improve upon in the future! (Well let's not hope _too_ hard
though!)

The logical goal of this patch was to make logical grouping of resources
possible. For example, it might be more efficient to group three package
installations into a single transaction, instead of having to run three
separate transactions. This is because a package installation typically
has an initial one-time per run cost which shouldn't need to be
repeated.

Another future goal would be to group file resources sharing a common
base path under a common recursive fanotify watcher. Since this depends
on fanotify capabilities first, this hasn't been implemented yet, but
could be a useful method of reducing the number of separate watches
needed, since there is a finite limit.

It's worth mentioning that grouping resources typically _reduces_ the
parallel execution capability of a particular graph, but depending on
the cost/benefit tradeoff, this might be preferential. I'd submit it's
almost universally beneficial for pkg resources.

This monster patch includes:
* the autogroup feature
* the grouping interface
* a placeholder algorithm
* an extensive test case infrastructure to test grouping algorithms
* a move of some base resource methods into pgraph refactoring
* some config/compile clean ups to remove code duplication
* b64 encoding/decoding improvements
* a rename of the yaml "res" entries to "kind" (more logical)
* some docs
* small fixes
* and more!
2016-03-28 20:54:41 -04:00
James Shubin
d3f7432861 Update context since etcd upstream moved to vendor/ dir
Follow suit with what etcd upstream is doing.
2016-03-28 05:27:28 -04:00
James Shubin
3a85384377 Rename type to resource (res) and service to svc
Naming the resources "type" was a stupid mistake, and is a huge source
of confusion when also talking about real types. Fix this before it gets
out of hand.
2016-02-21 15:51:52 -05:00
James Shubin
491e9fd9bc Golint fixes
I used: `golint | grep -v comment | grep -v stringer` to avoid crap.
2016-01-19 23:35:33 -05:00
James Shubin
4c6647d807 Fixup state related items
* Fixup graph state readability
* Rename original SetState() to SetConvergedState() and friends...
* Add type state management for proper BackPoke() commands...
* Add better DEBUG logging

This is an important optimization that prevents running a BackPoke on a
parent which is in the process of running and will most certainly poke
the caller back in a moment. This avoids unnecessary roundtrips.
Unfortunately, there are still other algorithms required so that races
can't cause the graph to run for longer than necessary.
2016-01-12 04:57:05 -05:00
James Shubin
48eddc3721 Catch a different form of etcd disconnect 2016-01-10 04:09:28 -05:00
James Shubin
ea7fd76f93 Add exec type and fix up a few other things
* Add exec type
* Switch erroneous use of fmt to log instead
* Check for edge existence for safety before using
* Avoid recalling etcd channel maker
* Clean up logging output
2016-01-09 21:50:21 -05:00
James Shubin
d769309cc0 Hello 2016! 2016-01-08 02:43:38 -05:00
James Shubin
72525d30b1 Refactor etcd into object and add exit timers
This refactors my etcd use into a struct (object) wrapper, which makes
it easier to add an exit on converged timer.
2016-01-06 19:40:09 -05:00
James Shubin
904ace8027 Fix up go vet errors and integrate with ci 2016-01-04 21:02:22 -05:00
James Shubin
d8cbeb56f9 Support N distributed agents
This is the third main feature of this system. The code needs a bunch of
polish, but it actually all works :)

I've tested this briefly with N <= 3.

Currently you have to build your own etcd cluster. It's quite easy, just
run `etcd` and it will be ready. I usually run it in a throw away /tmp/
dir so that I can blow away the stored data easily.
2016-01-04 21:00:13 -05:00
James Shubin
6b4fa21074 Mega patch
This is still a dirty prototype, so please excuse the mess. Please
excuse the fact that this is a mega patch. Once things settle down this
won't happen any more.

Some of the changes squashed into here include:
* Merge vertex loop with type loop
(The file watcher seems to cache events anyways)
* Improve pgraph library
* Add indegree, outdegree, and topological sort with tests
* Add reverse function for vertex list
* Tons of additional cleanup!

Amazingly, on my first successful compile, this seemed to run!

A special thanks to Ira Cooper who helped me talk through some of the
algorithmic decisions and for his help in finding better ones!
2015-12-21 03:27:25 -05:00