What are git commits?

There seems to be some confusion about how git commits are stored internally. When we use git, it might look like commits are stored as differences between versions of files. Most commands, like `git diff` and `git show`, present information as diffs and never as entire files. And storing only the changes sounds sensible too: we have large code bases but make only small changes at a time.

However, the awesome Pro Git book states that every commit is a snapshot of your project at that moment. What that means is that when you run `git commit`, it doesn't just store the difference between the previous commit and the new one, it records the files in their entirety!
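To see the snapshot model for yourself, here is a minimal sketch that builds a throwaway repository and inspects what a commit points to. The temporary directory, file names and identity are made up for illustration. It shows that a commit names a single tree (the snapshot) rather than a diff, and that a file left untouched between two commits keeps the same blob ID, so its content is not stored a second time.

```shell
#!/bin/sh
set -e

# Throwaway repository in a temporary directory (path chosen by mktemp).
repo=$(mktemp -d)
cd "$repo"
git init -q

# Local identity so the commits work on a machine with no global config.
git config user.name "Example"
git config user.email "example@example.com"
git config commit.gpgsign false

printf 'hello\n' > a.txt
printf 'world\n' > b.txt
git add a.txt b.txt
git commit -q -m "first"

# Change only a.txt in the second commit.
printf 'hello again\n' > a.txt
git commit -q -am "second"

# A commit object names one tree (the snapshot) plus its parent, not a diff.
git cat-file -p HEAD

# Every snapshot lists every file, but blobs are named by a hash of their
# content, so the untouched b.txt resolves to the same object in both commits.
a_first=$(git rev-parse HEAD~1:a.txt)
a_second=$(git rev-parse HEAD:a.txt)
b_first=$(git rev-parse HEAD~1:b.txt)
b_second=$(git rev-parse HEAD:b.txt)
echo "a.txt: $a_first -> $a_second"
echo "b.txt: $b_first -> $b_second"
```

The two IDs printed for `b.txt` should be identical, while those for `a.txt` differ. This reuse of unchanged objects is already one reason a long history does not cost a full copy of the project per commit.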

One of my projects takes 20MB and has 1066 commits. But running `du -ch .git` shows that the entire git history takes only 34MB! How is that possible? If every commit is supposed to store the entire project, why does the whole commit history take only 70% more storage than the project itself?

It turns out the answer is `git gc`. Git initially stores each new version of a file as its own compressed object, and from time to time it automatically runs `git gc`, which packs those objects together and stores similar ones as deltas (diffs) against each other. This saves a lot of space in exchange for some performance (e.g. when checking out old commits). It is a good trade-off, considering that most of the time older commits just sit there for redundancy without ever being used. However, if you need to, you can turn the automatic packing off by running

git config --global gc.auto 0
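A rough way to watch this happen is to create a small repository, make a series of tiny commits, and compare `git count-objects` before and after a manual `git gc`. This is only a sketch under assumptions: the file name, its size and the number of commits are arbitrary, and `seq` is assumed to be available. Automatic gc is disabled in this one repository (no `--global`) so that the "before" numbers are not disturbed.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git config commit.gpgsign false
# Disable automatic gc for this repository only, so nothing is packed yet.
git config gc.auto 0

# A text file of roughly 100 KB (size chosen arbitrarily).
seq 1 20000 > data.txt
git add data.txt
git commit -q -m "initial"

# Twenty small changes, one commit each.
i=1
while [ "$i" -le 20 ]; do
  echo "change $i" >> data.txt
  git commit -q -am "change $i"
  i=$((i + 1))
done

# Before gc: every commit, tree and file version is a separate loose object.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "--- before gc ---"
git count-objects -vH

git gc --quiet

# After gc: the objects live in a packfile, stored as deltas where useful.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
echo "--- after gc ---"
git count-objects -vH
```

In the output, `size` is the space used by loose objects and `size-pack` the space used by packfiles. Comparing the `size` reported before gc with the `size-pack` reported after gives a feel for how much the delta compression saves on a file that changes a little at a time.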

Why should I care?
Understanding how git stores objects internally can help you make more educated decisions about how to organize your workflow. For example, binary files (images, library files, executables, etc.), especially ones that are already compressed, usually get little benefit from diffs, so each new version ends up stored almost in full. Knowing that, you can see the downsides of storing frequently updated binary files in git:

  • dramatic increase in repository size,
  • decrease in performance.
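The size effect can be illustrated with a small experiment. The sketch below makes two throwaway repositories with ten commits each and compares the packed size after `git gc`. The helper `pack_size_after_commits` is hypothetical, written only for this example, and random bytes from `/dev/urandom` are an assumption standing in for a compressed binary file that is regenerated on every change. Real binaries may behave better or worse than this.

```shell
#!/bin/sh
set -e

# Hypothetical helper: commit ten versions of one file in a fresh repository,
# run git gc, and print the resulting pack size in KiB.
pack_size_after_commits() {
  kind=$1
  repo=$(mktemp -d)
  (
    cd "$repo"
    git init -q
    git config user.name "Example"
    git config user.email "example@example.com"
    git config commit.gpgsign false

    # The text variant starts from a file of roughly 100 KB.
    if [ "$kind" = "text" ]; then
      seq 1 20000 > file.dat
    fi

    i=1
    while [ "$i" -le 10 ]; do
      if [ "$kind" = "binary" ]; then
        # 100 KiB of fresh random bytes: nothing in common with the last version.
        head -c 102400 /dev/urandom > file.dat
      else
        # One extra line per commit: almost identical to the last version.
        echo "change $i" >> file.dat
      fi
      git add file.dat
      git commit -q -m "version $i"
      i=$((i + 1))
    done

    git gc --quiet
    git count-objects -v | awk '/^size-pack:/ {print $2}'
  )
}

text_kib=$(pack_size_after_commits text)
binary_kib=$(pack_size_after_commits binary)
echo "text history:   ${text_kib} KiB packed"
echo "binary history: ${binary_kib} KiB packed"
```

Both working files are about the same size, yet the repository with the random "binary" file should come out many times larger, because each of its ten versions has to be kept nearly in full.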
2018   git