146 lines
7.4 KiB
Markdown
146 lines
7.4 KiB
Markdown
# Service API design guide
|
|
|
|
This document is intended as a short instructional design guide in building a
|
|
service management API. It is certainly intended for someone who wishes to use
|
|
`mgmt` resources and functions to interact with their facilities, however it may
|
|
be of more general use as well. Hopefully this will help you make smarter design
|
|
considerations early on, and prevent some amount of unnecessary technical debt.
|
|
|
|
## Main aspects
|
|
|
|
What follows are some of the most common considerations which you may wish to
|
|
take into account when building your service. This list is non-exhaustive. Of
|
|
particular note, as of the writing of this document, many of these designs are
|
|
not taken into account or not well-handled or implemented by the major API
|
|
("cloud") providers.
|
|
|
|
### Authentication
|
|
|
|
#### The status-quo
|
|
|
|
Many services naturally require you to authenticate yourself. Usually the
|
|
initial user who sets up the account and provides credit card details will need
|
|
to download secret credentials in order to access the service. The onus is on
|
|
the user to keep those credentials private, and to prevent leaking them. It is
|
|
convenient (and insecure) to store them in `git` repositories containing scripts
|
|
and configuration management code. Since it's likely you will use multiple
|
|
different services, it also means you will have a ton of different credentials
|
|
to guard.
|
|
|
|
#### An alternative
|
|
|
|
Instead, build your service to accept a public key that you store in the users
|
|
account. Only consumers that can correctly sign messages matching this public
|
|
key should be authorized. This mechanism is well-understood by anyone who has
|
|
ever uploaded their public SSH key to a server. You can use SSH keys, GPG keys,
|
|
or even get into Kerberos if that's appropriate. Best of all, if you and other
|
|
services use a standardized mechanism like GPG, a user might only need to keep
|
|
track of their single key-pair, even when they're using multiple services!
|
|
|
|
### Events
|
|
|
|
#### The problem
|
|
|
|
People have been building "[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)"
|
|
and "[REST](https://en.wikipedia.org/wiki/REST)"ful API's for years. The biggest
|
|
missing part that most of them don't provide is events. If users want to know
|
|
when a resource changes, they have to repeatedly poll the server, which is both
|
|
network intensive, and introduces latency. When services were simpler, this
|
|
wasn't as much of a consideration, but these days it matters. An embarrassingly
|
|
small number of major software vendors implement these correctly, if at all.
|
|
|
|
#### Why events?
|
|
|
|
The `mgmt` tool is different from most other static tools in that it allows
|
|
reading streams of incoming data, and stream of change events from resources we
|
|
are managing. If an event API is not available, we can still poll, but this is
|
|
not as desirable. An event-capable API doesn't prevent polling if that's
|
|
preferred, you can always repeat a read request periodically.
|
|
|
|
#### Variants
|
|
|
|
The two common mechanisms for receiving events are "callbacks" and
|
|
"long-polling". In the former, the service contacts the consumer when something
|
|
happens. In the latter, the consumer opens a connection, and the service either
|
|
closes the connection or sends the reply, when it's ready. Long-polling is often
|
|
preferred since it doesn't require an open firewall on the consumers side.
|
|
Callbacks are preferred because it's often cheaper for the service to implement
|
|
that. It's also less reliable since it's hard to know if the callback message
|
|
wasn't received because it was dropped, or if there just wasn't an event. And it
|
|
requires static timeouts when retrying a callback message, and so on. It's best
|
|
to implement long-polling or something equivalent at a minimum.
|
|
|
|
#### "Since" requests
|
|
|
|
When making an event request, some API's will let you tack on a "since" style
|
|
parameter that tells the endpoint that we're interested in all of the events
|
|
_since_ a particular timestamp, or _since_ a particular sequence ID. This can be
|
|
very useful if missing an intermediate event is a concern. Implement this if you
|
|
can, but it's better for all concerned if purely declarative facilities are all
|
|
that is required. It also forces the endpoint to maintain some state, which may
|
|
be undesirable for them.
|
|
|
|
#### Out of band
|
|
|
|
Some providers have the event system tacked on to a separate facility. If it's
|
|
not part of the core API, then it's not useful. You shouldn't have to configure
|
|
a separate system in order to start getting events.
|
|
|
|
### Batching
|
|
|
|
With so many resources, you might expect to have 1000's of long-polling
|
|
connections all sitting open and idle. That can't be efficient! It's not, which
|
|
is why good API's need a batching facility. This lets the consumer group
|
|
together many watches (all waiting on a long-poll) inside of a single call. That
|
|
way, a single connection might only be needed for a large amount of information.
|
|
|
|
### Don't auto-generate junk
|
|
|
|
Please build an elegant API. Many services auto-generate a "phone book" SDK of
|
|
junk. It might seem inevitable, so if you absolutely need to do this, then put
|
|
some extra effort into making it idiomatic. If I'm using an SDK generated for
|
|
`golang` and I see an internal `foo.String` wrapper, then chances are you have
|
|
designed your API and code to be easier to maintain for you, instead of
|
|
prioritizing your customers. Surely the total volume of all customer code is
|
|
more than your own, so why optimize for that instead of the putting the customer
|
|
first?
|
|
|
|
### Resources and functions
|
|
|
|
`Mgmt` has a concept of "resources" and "functions". Resources are used in an
|
|
idempotent model to express desired state and perform that work, and "functions"
|
|
are used to receive and pull data into the system. That separation has shown to
|
|
be an elegant one. Consider it when designing your API's. For example, if some
|
|
vital information can only be obtained after performing a modifying operation,
|
|
then it might signal that you're missing some sort of a lookup or event-log
|
|
system. Design your API's to be idempotent, this solves many distributed-system
|
|
problems involving receiving duplicate messages, and so on.
|
|
|
|
## Using mgmt as a library
|
|
|
|
Instead of building a new service from scratch, and re-inventing the typical
|
|
management and CLI layer, consider using `mgmt` as a library, and directly
|
|
benefiting from that work. This has not been done for a large production
|
|
service, but the author believes it would be quite efficient, particularly if
|
|
your application is written in golang. It's equivalently easy to do it for other
|
|
languages as well, you just end up with two binaries instead of one. (Or you can
|
|
embed the other binary into the new golang management tool.)
|
|
|
|
## Cloud API considerations
|
|
|
|
Many "cloud" companies have a lot of technical debt and a lot of customers. As a
|
|
result, it might be very hard for them to improve their API's, particularly
|
|
without breaking compatibility promises for their existing customers. As a
|
|
result, they should either add a versioned API, which lets newer consumers get
|
|
the benefit, or add new parallel services which offer the modern features. If
|
|
they don't, the only solution is for new competitors to build in these better
|
|
efficiencies, eventually offering better value to cost ratios, which will then
|
|
make legacy products less lucrative and therefore unmaintainable as compared to
|
|
their competitors.
|
|
|
|
## Suggestions
|
|
|
|
If you have any ideas for suggestions or other improvements to this guide,
|
|
please let us know! I hope this was helpful. Please reach out if you are
|
|
building an API that you might like to have `mgmt` consume!
|