Issues cloning a big repository


#1

Hi all,

I stumbled across gitea whilst looking for a git repository manager that could handle mirroring… as I’d like to keep a few local mirrors of remote git repositories on the local network.

I am running gitea via the Docker container on an Alpine Linux-based virtual machine with 1 core, 512MB RAM and 80GB disk. The VM has a fairly heavily firewalled view of the world, but I have permitted the HTTP/HTTPS, git and SSH protocols, as well as access to a local SMTP server.

The first thing I tried doing was to mirror this repository:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/

The first attempt failed because I hadn’t allowed git:// protocol access through the firewall. That has left me with a repository that I can neither view nor delete… it gives me an error 500 every time I try to visit it:

https://git.longlandclan.id.au/stuartl/linux-stable

I fixed the firewall problem and tried again. While git-clone seems to keep going in the background (according to top), the front-end proxy eventually times out. Visiting the repository in a separate tab again yields error 500. At some point, the clone seems to give up, and I’m left with no repository clone.

Is there some trick one can do to clone a big repository like the one I linked to? I have other clones of the repository that I could use to get things going but once imported, I’d like gitea to pull down the latest updates from the upstream repository.


#2

The Linux repo is huge; I think it’s improbable that you will be able to use it inside Gitea. Even if you can clone it, you’ll probably have performance issues while trying to browse it in Gitea.

If you’re determined to try, maybe temporarily disable the timeout on your proxy?


#3

It is big… but the amount of browsing I’ll be doing is minimal, and this will be largely a single-user system. With my broken repository, I ended up having to blow away the back-end PostgreSQL database and re-create it, as I was unable to delete the broken repository.

I tried bumping the proxy timeout to 1 hour, but this didn’t help. (I don’t think I can “disable” it as such; I’m not familiar with nginx, though it is lighter weight than Apache.)

I find I’m able to push the entire Linux repository just fine to an empty repository, and browsing that works. https://git.longlandclan.id.au/stuartl/linux-temp is that repository.

I also note we’re able to change the mirror repository URL after creation, but not convert a normal repository to a mirror. So I’m exploiting this: I mirror a small subset first, then I’ll switch the mirror URL to my full repository and let it update in the background.

Right now I’ve created an “empty” repository… initially I tried mirroring that, but it objected due to it not having any branches, so I cherry-picked the following commit into it:

commit 5fe785d8ac3b070d05ff08cbb0b0c0aac6c6be3c
Author: Linus Torvalds <torvalds@ppc970.osdl.org>
Date:   Sat Apr 16 15:20:36 2005 -0700

    Linux-2.6.12-rc2
    
    Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.
    
    Let it rip!

That repository is here: https://git.longlandclan.id.au/stuartl/empty/commits/master

I’ll let it chug away at that and see how it goes.


#4

I think this is a known issue we should resolve later.


#5

Indeed… at the moment it looks like it’s doable if you’re:

  1. patient (expect hours, not minutes)
  2. not stingy with RAM (for a repo this size, 2GB should be considered a minimum.)

For smaller repositories, it JustWorks™. gitea is just perfect for tracking smaller upstream projects; it’s just the Linux kernel that it chokes on, and understandably so, as there’s over a decade’s worth of commit history there. It’d be worse if I pulled in the Linux/MIPS tree: their repo goes back to kernel 1.x (mid 90s, when they used CVS).

I think my problem is exacerbated by the fact that I allocated 512MB RAM for the VM thinking “that ought to be enough”. I checked this morning, and saw that it had pulled in the initial push from linux-temp into my mirror, so I pushed some more and told it to sync again.

I think a few more iterations of this, and I’ll be ready to switch it to use a real outside mirror.
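
The incremental approach could be scripted roughly as follows. This is only a sketch: `PushSteps`, the remote name `staging` and the chunk size are made-up names for illustration, and `total` is assumed to be the first-parent commit count of the branch (for example from `git rev-list --first-parent --count master`). Because `master~N` only walks first parents, each push will carry more commits than the step size suggests once merges are involved.

```go
package main

import "fmt"

// PushSteps returns refspecs that push a branch in chunks, oldest history
// first. total is the first-parent commit count of the branch and step is
// the number of first-parent commits to advance per push (both assumptions
// of this sketch, not anything gitea provides).
func PushSteps(branch string, total, step int) []string {
	if step <= 0 {
		step = total // nonsensical step: fall back to a single push
	}
	var specs []string
	// branch~back is the commit "back" first-parent steps behind the tip.
	for back := total - step; back > 0; back -= step {
		specs = append(specs, fmt.Sprintf("%s~%d:refs/heads/%s", branch, back, branch))
	}
	// The final push brings the branch fully up to date.
	return append(specs, fmt.Sprintf("%s:refs/heads/%s", branch, branch))
}

func main() {
	// "staging" is a hypothetical remote pointing at the repository
	// that the gitea mirror pulls from.
	for _, spec := range PushSteps("master", 10, 4) {
		fmt.Println("git push staging", spec)
	}
}
```

After each push, the mirror is told to sync again before the next chunk goes up, so no single sync has to swallow the whole history.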

I can of course shut down the VM and allocate more RAM… and I can also buy more RAM for the VM host (I’m already thinking of doing this, although ECC SO-DIMMs, the type of RAM my nodes take, are rare and expensive).

Thinking about how to solve this long-term… and I come at this not being a Go programmer… but perhaps the clone process could be forked into the background. If it is still running after a timeout, gitea sets a flag on the repository and displays a page saying the clone is taking place in the background. Anyone who hits the repository during this time gets told the same thing.

When the clone completes, the flag is cleared and users can then see and use the repository as normal.
If a clone fails (and this is the critical bit), it blows away the repository then sends the owner a message notifying them of the failure. That way they’re not left with a repository that just coughs up error 500 that can’t be easily removed (I managed to completely break gitea, necessitating a re-install last time).

It’s nice having this feature at all, though, and I appreciate the effort on gitea. It’s one of the few decent repository host systems I know of… the other I was thinking of using was Girocco (which runs repo.or.cz), having tried GitLab (which doesn’t do mirroring unless you’re an enterprise customer).