etcd: Rewrite embed etcd implementation

This is a giant cleanup of the etcd code. The earlier version was written when I was less experienced with golang. This is still not perfect, and does contain some races, but at least it's a decent base to start from. The automatic elastic clustering should be considered an experimental feature. If you need a more battle-tested cluster, then you should manage etcd manually and point mgmt at your existing cluster.
2018-05-05 17:35:08 -04:00
parent fb275d9537
commit a5842a41b2
56 changed files with 5459 additions and 2654 deletions
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -215,23 +215,25 @@ requires a number of seconds as an argument.
 ./mgmt run lang --lang examples/lang/hello0.mcl --converged-timeout=5
 ```

-### What does the error message about an inconsistent dataDir mean?
+### On startup `mgmt` hangs after: `etcd: server: starting...`.

 If you get an error message similar to:

 ```
-Etcd: Connect: CtxError...
-Etcd: CtxError: Reason: CtxDelayErr(5s): No endpoints available yet!
-Etcd: Connect: Endpoints: []
-Etcd: The dataDir (/var/lib/mgmt/etcd) might be inconsistent or corrupt.
+etcd: server: starting...
+etcd: server: start timeout of 1m0s reached
+etcd: server: close timeout of 15s reached
 ```

-This happens when there are a series of fatal connect errors in a row. This can
-happen when you start `mgmt` using a dataDir that doesn't correspond to the
-current cluster view. As a result, the embedded etcd server never finishes
-starting up, and as a result, a default endpoint never gets added. The solution
-is to either reconcile the mistake, and if there is no important data saved, you
-can remove the etcd dataDir. This is typically `/var/lib/mgmt/etcd/member/`.
+and nothing happens afterwards, the cause may be a corrupt etcd storage
+directory. Each etcd server embedded in mgmt needs its own directory where it
+stores local state, and that directory must not be shared between members. It
+is typically `/var/lib/mgmt/etcd/member/`. If you accidentally reuse it (for
+example during testing) with a different cluster view, such as under a
+different hostname, you can corrupt it.
+
+The solution is to avoid this mistake. If there is no important data saved, you
+can remove the etcd member dir and start over.

 ### On running `make` to build a new version, it errors with: `Text file busy`.