High availability / failover with Gitea Helm chart

Hi there,

I’m running Gitea with the official Helm chart, and it’s working fine so far. I already have a Redis and an Elasticsearch cluster configured, and the PostgreSQL server will be HA soon. Data is stored on a shared NFS drive.

So now I tried setting replicaCount: 2, which basically works too, but sometimes I get broken PRs.
I think this mostly happens when there is a lot of load from CI and my Renovate bot.

If I only have one Gitea container running, the PR only shows as broken for a few seconds after a push. But if I try to view big diffs, Gitea blocks for more than a second, so my Kubernetes health checks fail and the container gets restarted.

So I’d like to have at least two containers running, to have a small failover if one container fails.

Does anybody know how best to configure this so I don’t get the broken PRs?

I’ve switched to the v3 Helm chart with Gitea v1.14.1 rootless. I’m using HTTP-only access.

I can share my values if it helps.

Soo uhh, Gitea shouldn’t block like that. There are multiple instances that serve (tens of) thousands of users, that are not multi-homed, where this isn’t happening. It sounds like there could be an underlying issue that should be solved rather than attempting to apply a bandaid with HA.

So maybe it’s a Postgres or Redis issue? Or NFS that is too slow? :thinking:

This sometimes happens when I do a big push: the Kubernetes probes to /user/login then fail with timeouts (the timeout was 1 sec). The readiness probe now has a 2 sec timeout and the liveness probe 10 sec.
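To reproduce the probe failures from outside the kubelet, something like the following Python sketch can help. It only approximates an HTTP probe (one GET, success on a 2xx/3xx answer within the timeout) and is not the kubelet's actual implementation. The function name `http_probe` is made up for illustration.

```python
import http.client
import urllib.request

# Opener that ignores proxy environment variables, so the request goes
# straight to the target, as a cluster-internal probe would.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_probe(url: str, timeout_s: float) -> bool:
    """Return True if `url` answers with a 2xx/3xx status in time.

    Note: `timeout_s` is applied per socket operation (connect, read),
    so this is an approximation of a probe timeout, not a hard deadline.
    """
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP 4xx/5xx
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False
```

Calling `http_probe("http://gitea.example.internal:3000/user/login", 1.0)` in a loop during a big push (the host and port here are placeholders, not values from this thread) should show whether the 1 sec timeout is what trips the probe, and whether 2 sec or 10 sec is enough headroom.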

Turns out the hanging is likely an issue with an upstream library (invalid use of mutex locks in that library, which lock things up under load). We’re still investigating how we as a project can solve this, and hopefully it also means that we can fix the upstream library so others can benefit as well.

Thanks for the information. With the extended timeouts it no longer seems to get restarted by Kubernetes, but that’s not fixing the underlying issue.

So it would be nice to have an extra /_health endpoint for general health checks (database available, Redis available, …) that hopefully doesn’t need any lock.
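To make the idea concrete, here is a rough Python sketch of what such a handler could compute. It does not describe anything Gitea provides: `health_report` and the check names are hypothetical, and the real checks (a database ping, a Redis ping) are left to the caller. Each check runs in its own thread under one shared deadline, so a hung dependency yields a 503 instead of blocking the answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple


def health_report(
    checks: Dict[str, Callable[[], None]], timeout_s: float = 1.0
) -> Tuple[int, Dict[str, str]]:
    """Run all checks concurrently; return (HTTP status, per-check result).

    A check passes if it returns without raising before the deadline.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s  # one deadline for all checks
    results: Dict[str, str] = {}
    for name, fut in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            fut.result(timeout=remaining)
            results[name] = "ok"
        except FutureTimeout:
            results[name] = "timeout"
        except Exception as exc:  # the check itself reported a failure
            results[name] = f"error: {exc}"
    # Do not wait for hung checks; their threads finish in the background.
    pool.shutdown(wait=False)
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A readiness probe pointed at an endpoint like this would stay cheap under load, as long as the checks themselves do not go through whatever lock is causing the hang.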

btw: I’m using MinIO (4-node cluster) as S3 storage, which is also shared with other services.