I’ve been running my own node on the fediverse for a little while now: pleroma.gidikroon.eu. That is mainly a hobby, but partly to learn from it as well. I had not done sysadmin before or set up my own server. This WordPress blog is one of those one-click affairs. I also believe that part of the promise of the fediverse is to be able to self-host your social media presence while staying in contact with others on other servers; I wanted to experience how that would work out.
If a server is only for personal use, you don’t need many resources. Some people install the software on a PC at home that they leave turned on. Others use a Raspberry Pi, a very cheap single-board computer that you can get for under €50. Cheap cloud providers like DigitalOcean and Hetzner are popular too. There are also services that can run your fediverse server for you, like masto.host and spacebear.
I wanted to try using Amazon Web Services (AWS). By far the easiest setup, as well as the cheapest (within AWS), is Amazon Lightsail. The smallest server option is only $3.50 per month with everything included. If you price out the same thing on Amazon EC2 (a t3a.nano instance running full time with block storage, network traffic and an IP address), it is more expensive and you have to set up everything yourself. Lightsail (and DigitalOcean, Hetzner and others) is really easy and cheap for people who don’t want to spend a lot of effort on self-hosting something.
On this smallest server option I have run at the same time:
- My Pleroma server (an ActivityPub server)
- The PostgreSQL database
- Another development Pleroma server (to test stuff)
- A relay that receives and sends posts between fediverse servers
The Pleroma server was also subscribed to one of the busiest relays, so posts came in at a rate of about three per second, around the clock. I had however stopped the second Pleroma server a while ago.
So what are some of the things I learned so far? The main thing relates to cloud servers being ‘burstable’. This means that they can only run at full speed for short periods of time (up to around half an hour), after which the speed gets reduced to 5% of full capacity. Which is perfect for personal servers that only get visited occasionally. Not so much for a server that gets constant traffic 24/7.
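To get a feel for how this plays out, here is a toy model of a burstable CPU in Python. It is only a sketch and not how Amazon actually does its accounting: I assume a credit bank that earns credits at the 5% baseline and spends them at whatever the workload uses, and I picked the size of the bank (MAX_CREDITS) so that a full-speed burst lasts half an hour. The names and the bank size are my own assumptions.

```python
# Toy model of a burstable CPU. Units are "percent-minutes": running at 100%
# for one minute costs 100 credits, and every minute earns BASELINE_PCT credits.

BASELINE_PCT = 5       # sustainable share of one CPU, in percent (from the text)
MAX_CREDITS = 2850     # assumed size of the credit bank, chosen for a ~30 min burst


def minutes_until_throttled(demand_pct, horizon_minutes, credits=MAX_CREDITS):
    """Return the minute at which the credits run out, or None if they never do."""
    for minute in range(1, horizon_minutes + 1):
        # Earn the baseline, spend the demand, never bank more than the cap.
        credits = min(MAX_CREDITS, credits + BASELINE_PCT - demand_pct)
        if credits <= 0:
            return minute
    return None


if __name__ == "__main__":
    for demand in (100, 20, 5):
        print(demand, minutes_until_throttled(demand, 24 * 60))
```

In this model a server running flat out is throttled after 30 minutes, one that sits at a steady 20% lasts 190 minutes, and one that stays at 5% or below never runs out. The point is that even a modest but constant load above the baseline ends in throttling eventually.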
Amazon is clear that you don’t get full capacity. In fact, these are recent metrics for my server, which clearly show a very low sustainable zone:
[Graph: recent CPU metrics for my server, with the sustainable zone marked]
At the end of the graph you can also see that after my server has been above the sustainable zone for a couple of hours, it has now been put into what I call cpu jail: the server will not do more than 5% of its full capacity. The problem is that other servers, relays, etc. keep talking to it, and when their connections fail, they just retry later. The load being placed on the server will not let up, so this cpu jail situation will not resolve itself.
(BTW you can see that my approach to sysadmin is very low-effort, since I’m continuing to type this blog post while the above graph is a real current picture of the server that’s locked up. It needs to be a hobby after all and I’ll look into it when I feel like it.)
In the graph you can also see that early in the morning the server had just crashed after a similar period in cpu jail.
What to do about it?
For now I have stopped the relay server and unsubscribed from the other relay. This should mean that (after a while) the only social media posts coming in will be from people I actually follow. This takes a while to take effect however, as most servers are still retrying their earlier failed deliveries.
I have also looked into the configuration of the software. By default it is configured for medium-sized servers that have dedicated resources. These have multiple CPUs that work in parallel. The software takes advantage of that by queuing background jobs and having several worker threads deal with the queued jobs in parallel. The standard configuration has about five different queues with up to 25 parallel workers each. That is not suitable for a cheap burstable cloud server.
Remember that I get only 5% of one CPU. Not 25 full-time CPUs…
So I changed the configuration to have only one worker per queue. That is still too much, but I can’t configure it lower. I have also reduced the number of parallel connections to the database (which was 10), for the same reason.
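A bit of back-of-the-envelope arithmetic shows why. The sketch below assumes the worst case where every worker in every queue is busy at the same time and the throttled CPU is shared evenly between them. The function name and the even split are my own simplifications.

```python
CPU_SHARE = 0.05   # sustainable share of one CPU (the 5% from the text)
QUEUES = 5         # "about five" queues in the standard configuration


def share_per_worker(workers_per_queue, queues=QUEUES, cpu_share=CPU_SHARE):
    """Share of one CPU each worker gets if all workers are busy at once."""
    return cpu_share / (queues * workers_per_queue)


if __name__ == "__main__":
    for workers in (25, 1):
        total = QUEUES * workers
        print(f"{workers} per queue: {total} workers, "
              f"{share_per_worker(workers):.2%} of a CPU each")
```

With the defaults that is 125 workers fighting over 0.04% of a CPU each. With one worker per queue it is five workers at 1% each, which is better but still not much.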
The database. That’s the next thing.
First the easy thing: with the cheapest server you get 20GB of storage, which is more than enough to keep your database on. However, since I subscribed to a relay and kept the posts for three months, I essentially had everything published on the whole fediverse over the last few months in my database, which was about 15GB. It turns out that PostgreSQL can need free space for roughly a second copy of the database for some of its management operations (a dump-and-reload upgrade, for example, builds a complete new copy next to the old one). So I attached an extra disk within Lightsail for all data, which also makes it easier to move between instances. It costs a bit extra, but makes things easier.
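As a quick sanity check, this is the sum I should have done beforehand. It is a sketch that uses the rule of thumb of needing room for two copies of the database. The helper name is made up.

```python
def free_space_shortfall_gb(db_gb, disk_gb, copies=2):
    """GB missing if the disk must hold `copies` copies of the database (0 if it fits)."""
    return max(0, db_gb * copies - disk_gb)


if __name__ == "__main__":
    print(free_space_shortfall_gb(15, 20))  # my 15GB database on the 20GB disk
    print(free_space_shortfall_gb(5, 20))   # a small personal database
```

For my 15GB database on the 20GB disk that comes out 10GB short, while a small 5GB database would have fit without trouble.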
A while ago I decided to upgrade the database from version 11 to 12. Seemed easy enough, since there is a command in the Debian operating system that does everything for you:
pg_upgradecluster 11 main
This automatically creates a new version 12 cluster and uses the ‘dump’ method (also used for backups) to stream data out of the old cluster straight into the new cluster. Sounds fine.
But the ‘burstable’ concept also applies to reading and writing data on a disk. Again, Amazon is clear about this and says that you can’t really load more than 10GB of data into a database on the smallest server options.
So after this had been reading and writing at about 100MB/s for a while, the speed dropped to about 100kB/s. I left it running for over 24 hours, but in the end the process just crashed. ‘pg_upgradecluster’ nicely rolled back all changes and everything was operational again, but still on version 11.
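In hindsight the numbers explain why 24 hours was never going to be enough. Here is a rough estimate, assuming my database of about 15GB, decimal units and a perfectly steady transfer rate. A real dump and reload does more work than a plain copy, so treat it as a lower bound.

```python
def transfer_hours(size_gb, rate_kb_per_s):
    """Hours needed to move size_gb at a steady rate (1GB = 1,000,000kB)."""
    size_kb = size_gb * 1_000_000
    return size_kb / rate_kb_per_s / 3600


if __name__ == "__main__":
    print(f"at 100MB/s: {transfer_hours(15, 100_000) * 60:.1f} minutes")
    print(f"at 100kB/s: {transfer_hours(15, 100):.1f} hours")
```

At the burst speed the whole database moves in about two and a half minutes. At the throttled speed it takes well over 41 hours.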
Reading up on things I found that I should have added an extra option:
pg_upgradecluster --method upgrade 11 main
With this extra option, it would not use the ‘dump’ method, but an ‘upgrade’ method to change the data from the version 11 format to the version 12 format. With this, the process was done, successfully, in five minutes, rather than unsuccessfully in 24 hours.
I realize however that if I ever need to restore the database from a backup, which necessarily uses the ‘dump’ method, I will run into the same problems.
It seems that in theory you can run the Pleroma software on the cheapest cloud server option, but there are limits to what it can do. I am glad however that I used Pleroma and not Mastodon, since Mastodon requires many more resources to run and is not really suitable for self-hosting.
The learning continues.