---
title: 'proactive redundancy'
date: 2018-11-15T18:39:26
tags:
- 'sysadmin'
- 'tilde'
---
after the [fiasco](november-13-post-mortem.html) earlier this week, i've
been taking steps to minimize the impact if tilde.team were to go down.
it's still a large single point of failure (spof), but i'm reasonably
certain that at least the irc net will remain up and functional in the
event of another outage.
<!-- more -->

the first thing that i set up was a handful of additional ircd nodes
(see [the tilde.chat wiki](https://tilde.chat/wiki/?page=servers) for a
full list). slash.tilde.chat is on my personal vps, and bsd.tilde.chat is
hosted on the bsd vps that i set up for tilde.team.

i added the ipv4 addresses for these machines, along with the ip for
yourtilde.com, as A records for tilde.chat, creating a dns round-robin.
`host tilde.chat` will return all four. a lookup gets the whole set back,
but the order rotates in a semi-random fashion and most clients simply
connect to the first address in the answer. this means that when
connecting to tilde.chat on 6697 for irc, you might end up on any of
`{your,team,bsd,slash}.tilde.chat`.
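
as a rough illustration of why the landing server changes between connections, the rotation can be sketched in python. this is a simplification and not the actual tilde.chat setup: the addresses are placeholders from the rfc 5737 documentation range, the function names are made up for this example, and real nameservers tend to shuffle rather than rotate strictly.

```python
import socket
from collections import deque

# placeholder addresses (rfc 5737 documentation range), not the real records
EXAMPLE_RECORDS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]


def resolve_all(host: str) -> list[str]:
    """Return every ipv4 address published for host (needs network access)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return addresses


def rotated_answers(records: list[str], queries: int) -> list[list[str]]:
    """Simulate a round-robin nameserver: same records, rotated per query."""
    ring = deque(records)
    answers = []
    for _ in range(queries):
        answers.append(list(ring))
        ring.rotate(-1)  # move the first record to the back
    return answers


def first_choice(answers: list[list[str]]) -> list[str]:
    """The address a simple client would connect to for each answer."""
    return [answer[0] for answer in answers]


if __name__ == "__main__":
    # six simulated lookups: each one starts with a different address
    print(first_choice(rotated_answers(EXAMPLE_RECORDS, 6)))
```

calling `resolve_all("tilde.chat")` on a machine with network access would list whatever A records are currently published, which may differ from the four described here.
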

this creates the additional problem that visiting the [tilde.chat
site](https://tilde.chat) will end up at any of those 4 machines in much
the same way. for the moment, the site is deployed on all of the boxes,
making site setup issues hard to
[debug](https://tildegit.org/tildeverse/tilde.chat/issues/8). the
solution to this problem is to use a subdomain as the round-robin host,
as other networks like freenode do (see `host chat.freenode.net` for the
list of servers).

i'm not sure how to make any of the other services more resilient. it's
something that i have been researching and will continue to look into.

the other main step that i have taken to prevent the same issue from
happening again was to configure the firewall to drop outgoing requests
to the private subnets defined in [rfc
1918](https://tools.ietf.org/html/rfc1918).
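
the firewall rules themselves aren't shown here, so the following is only a sketch of which destinations such a rule would match. the three ranges come from rfc 1918; the function name `is_rfc1918` is made up for this example and has nothing to do with the actual firewall config.

```python
import ipaddress

# the three private address blocks defined in rfc 1918
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """True if address falls inside one of the rfc 1918 private blocks."""
    addr = ipaddress.ip_address(address)
    return any(addr in network for network in RFC1918_NETWORKS)


if __name__ == "__main__":
    # a mix of private and public destinations
    for destination in ["10.0.0.1", "172.31.255.255", "172.32.0.1", "8.8.8.8"]:
        verdict = "drop" if is_rfc1918(destination) else "allow"
        print(destination, verdict)
```

note that 172.16.0.0/12 ends at 172.31.255.255, so 172.32.0.1 is a public address and would not be dropped.
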

i'd like to consider at least this risk to be mitigated.

thanks for reading,
~ben

**update**: the round-robin host is now *irc*.tilde.chat, which resolves
the site issues that the duplicated deployments were causing.