Wednesday, February 01, 2017

Why I will never Squash or Rebase in Git...

I'm a versioning purist. I'll admit that. I love being able to access version history in the version control systems I work with. Maybe it's just my Subversion (SVN) background, but I like being able to read through the history easily and find the logic for what changed and why in the log messages associated with commits. (And if the log messages don't contain that information then they're not *good* log messages.) So I am heavily against Squashing and Rebasing in Git.

GitHub recently introduced the ability to do Squashed Commits on merge, and some of my team members decided to give it a try. However, it was immediately apparent that Squashing is evil. Why? Because it really hurts the ability to track changes and keep a clean working copy locally.

My general local development works something like this (assuming the working copy has already been cloned and the upstream remote set up):
$ git checkout master
$ git fetch upstream
$ git merge upstream/master
$ git checkout -b my_working_branch
do stuff
$ git commit
$ git push origin my_working_branch
get it merged remotely
$ git checkout master
$ git fetch upstream
$ git merge upstream/master
$ git branch --merged
look for my_working_branch
$ git branch -d my_working_branch

Squashing creates several issues:

1. Detection of merges breaks

For example, if you squash a branch on merge, the last couple of steps above won't work. Git can't tell that the branch was merged because the squashed commit has a new hash and none of the branch's own commit hashes end up in master. This means that you now have to do:

$ git branch -D my_working_branch

Since -D deletes a branch whether or not it was merged, it now becomes extremely easy to remove the wrong branch.

This is also true of specific commits if you squash before pushing and are cherry-picking between branches, etc. - so it's not isolated to just branch merges.
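
To make the first issue concrete, here is a minimal sketch that reproduces it in a throwaway repository. The temporary directory, file name, and commit messages are made up for illustration; only my_working_branch comes from the workflow above. It does the squash locally with git merge --squash, which I'm assuming behaves the same as a squash merge done on GitHub for this purpose.

```shell
# Throwaway repository; directory, file name, and messages are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
main=$(git symbolic-ref --short HEAD)   # master or main, depending on local config

echo base > file.txt
git add file.txt
git commit -q -m "Initial commit"

# Two commits on the working branch.
git checkout -q -b my_working_branch
echo one >> file.txt
git commit -q -am "First change"
echo two >> file.txt
git commit -q -am "Second change"

# Squash-merge the branch: one new commit with a new hash lands on the main branch.
git checkout -q "$main"
git merge --squash my_working_branch >/dev/null 2>&1
git commit -q -m "Squashed my_working_branch"

# The branch does not show up as merged, and the safe delete is refused.
merged=$(git branch --merged)
if git branch -d my_working_branch >/dev/null 2>&1; then
  safe_delete=succeeded
else
  safe_delete=refused
fi
echo "merged branches: $merged"
echo "git branch -d: $safe_delete"
```

The content of the branch is in the main branch, but none of its commits are, so the only way left to clean up is the forced delete.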

2. Removes valuable history and insight

History contains details. Logs contain details. This is very important information when trying to determine why someone did something the way they did - e.g., when trying to find and fix a bug.

Git has an awesome feature called git bisect that allows you to find the exact commit a bug was introduced in. Squashing means you can only find the squashed group of commits that introduced the bug, not the commit itself. You are left with a take-it-or-leave-it choice for the entire group. You also lose any contextual information regarding the specific commit and why it may have happened.

3. It foobars anyone tracking what you're doing

When using a version control system, code is meant for sharing, and once you share it others (e.g., upstream maintainers, co-maintainers, etc.) may check out your branches to monitor progress if they are interested in what you are doing. You do not necessarily have any clue who these people are either. However, squashing and rewriting history will screw up their ability to cleanly track your work.

You also make it problematic for yourself, especially if you work on the same codebase on multiple systems (e.g., laptop, desktop, server). If you squash after pushing, you have to do a force push (git push origin my_working_branch --force), which means you'll have the same issues as everyone else when you need to keep your other machines in sync. You may also lose your own work: if, for example, you force push one change set from your laptop and then another from your desktop without merging first, what was pushed from the first system (e.g., the laptop) will be lost when the second system (e.g., the desktop) pushes up its changes.
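
The laptop/desktop scenario can be simulated locally. This is only a sketch: the bare origin.git repository and the two clones named laptop and desktop are stand-ins I made up for the real remote and machines, and the commits are trivial placeholders.

```shell
# Sandbox with a bare "origin" and two clones playing the laptop and the desktop.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare origin.git

git clone -q origin.git laptop 2>/dev/null
cd laptop
git config user.email "demo@example.com"
git config user.name "Demo"
echo base > file.txt
git add file.txt
git commit -q -m "Shared starting point"
git branch -M my_working_branch
git push -q origin my_working_branch 2>/dev/null

cd "$sandbox"
git clone -q -b my_working_branch origin.git desktop
cd desktop
git config user.email "demo@example.com"
git config user.name "Demo"

# The laptop does some work and force pushes it.
cd "$sandbox/laptop"
echo laptop > laptop.txt
git add laptop.txt
git commit -q -m "Laptop work"
git push -q --force origin my_working_branch 2>/dev/null

# The desktop never fetched the laptop's push, and force pushes its own work.
cd "$sandbox/desktop"
echo desktop > desktop.txt
git add desktop.txt
git commit -q -m "Desktop work"
git push -q --force origin my_working_branch 2>/dev/null

# What the remote branch contains now.
remote_log=$(git --git-dir="$sandbox/origin.git" log --format=%s my_working_branch)
echo "$remote_log"
```

The remote branch ends up with the desktop's commit on top of the starting point, and the laptop's commit is no longer reachable from it. A plain push from the desktop would have been rejected, which is exactly the warning the force flag overrides.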

Git Rebase runs into many of these same issues, even exacerbating some of them (#3). Rebasing also runs into the following issues:

1. Your own branch history makes less sense.

That is to say, you lose the context of the changes in your branch by moving the commits around. The reason why you did something in a commit has very much to do with what the code looked like prior to that commit. Rearranging the history so that your commits appear after changes merged later removes that context.

2. Sharing branches becomes that much harder.

This puts a really, really big emphasis on #3 above, where squashing breaks sharing branches with others and even with yourself. Only here it happens at the merge level instead of the push level.
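
Here is a small sketch of why a rebased branch is hard to share: the rebased commits are new commits with new hashes, so anyone who checked out the old ones has diverged from you. The repository, file names, and commit messages are invented for the example.

```shell
# Throwaway repository; file names and messages are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
main=$(git symbolic-ref --short HEAD)   # master or main, depending on local config

echo base > base.txt
git add base.txt
git commit -q -m "Base"

git checkout -q -b my_working_branch
echo feature > feature.txt
git add feature.txt
git commit -q -m "Feature work"
before=$(git rev-parse HEAD)            # the hash someone else may have checked out

# Meanwhile the main branch moves on.
git checkout -q "$main"
echo upstream > upstream.txt
git add upstream.txt
git commit -q -m "Upstream change"

# Rebase the working branch on top of the new main branch.
git checkout -q my_working_branch
git rebase -q "$main"
after=$(git rev-parse HEAD)

# The old commit is not part of the rebased branch's history.
if git merge-base --is-ancestor "$before" "$after"; then
  old_commit_kept=yes
else
  old_commit_kept=no
fi
echo "before: $before"
echo "after:  $after"
echo "old commit still in branch history: $old_commit_kept"
```

The commit message and the change are the same, but as far as Git is concerned it is a different commit, and pushing it over the old one requires a force push.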

Now, this is not to say that there is no use for these features at times. There is, but they should be used with extreme caution and with extreme rarity. The smaller the project, the less likely it is that they should be used.

For example, I can certainly understand why the Linux Kernel maintainers may use these features - with dozens of people sharing code and consolidating it down as it moves upstream. However, that is a project that has numerous layers where upper layers don't need to care as much about the details of the lowest layers, so squashing and rebasing can happen at controlled points between the layers and everyone - tracking at their layer - will be able to track what's going on more easily. Bugs are tracked and fixed at the various layers. Most projects have neither this size (millions of lines of code, contributed by tens of thousands of people) nor this complexity, nor such a large hierarchy of contributors (each subsystem and release has a person dedicated to its maintenance).

In the end, you really need to care more about history than most people do - especially in small projects, even more so in projects that may have high turnover among their contributors, and more still when that turnover leaves little (if any) time for transfer of knowledge.

History is always important, and you may never know how important it is, because the person who needs the history and details may be several times removed from you - someone maintaining the code long after you have moved on to better things, trying to figure out why you did something. Always write code and use history for that person, not yourself. They likely won't be as smart as you either.

The above is, for now, my current list of well-known reasons not to use Rebase and Squashing. I'll add more as they come up.

UPDATE: If you ever have to force push, then you did something wrong.