Hacker News
What's the difference between 'git pull' and 'git fetch'? (stackoverflow.com)
41 points by leonegresima on May 26, 2013 | 26 comments



I think what many people miss is that if you have a remote repository called, say, "origin", and a local branch synced with it called, say, "master", then there's a behind-the-scenes local branch called "origin/master". This third branch is what your local git repository knows about the remote branch.

    [ master ] -------------- [origin/master | master]
     remote                           local
With this model in mind the difference is pretty clear. `git fetch` pulls down all the new stuff from origin and updates origin/master but leaves your local master untouched. You can then merge origin/master into your local master to bring it up to date. `git pull` just does both steps: pull down the remote data into origin/whatever, and then merge origin/whatever into whatever.

    master   |    origin/master    | master
      ----------------->
      git fetch

      ----------------->--------------->
      git pull
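To make the picture above concrete, here is a toy sketch in Python. It assumes each repository is nothing more than a mapping from ref names to commit ids (the ids `c1` and `c3` are made up), and it treats the merge step as a plain fast-forward, so it illustrates the idea rather than what git actually does internally.

```python
# Toy model: a repository is just a dict mapping ref names to commit ids.
# The commit graph and the transfer of objects are ignored entirely.

def fetch(local, remote):
    """Copy the remote's branch tips into the local origin/* refs only."""
    for branch, commit in remote.items():
        local["origin/" + branch] = commit

def merge(local, branch):
    """Stand-in for a fast-forward merge: move the branch to origin/<branch>."""
    local[branch] = local["origin/" + branch]

def pull(local, remote, branch):
    """pull = fetch, then merge origin/<branch> into <branch>."""
    fetch(local, remote)
    merge(local, branch)

remote = {"master": "c3"}                        # hypothetical commit ids
local = {"master": "c1", "origin/master": "c1"}

fetch(local, remote)
after_fetch = dict(local)    # origin/master moved, master untouched

pull(local, remote, "master")
after_pull = dict(local)     # master has caught up as well

print(after_fetch)
print(after_pull)
```

The only point is which names move: `fetch` touches `origin/master` alone, while `pull` moves `master` too.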


I feel like git has turned into the Perl of version control systems, in both the best and worst senses of the analogy. It's a system that was designed at a very low level but has grown organically from there, in directions dictated by how people use it (with Linus himself having an overwhelming influence here).

The real-world analogy is the probably apocryphal story about how some university/corporate campus/park whatever didn't put in paths, but instead waited a year to see where people walked and put in paths there. I first heard this story on John Siracusa's old podcast, but it's apparently been around for a long time. Some googling turned this up for yet another retelling: http://opensource.com/business/10/12/discovering-desire-line....

It's great because it really is insanely featureful compared to older VCS's, and it has absolutely enabled workflows that simply were not possible 10 years ago. On the other hand, the complexity and the inconsistencies and the TIMTOWTDIness of it all means that it is one more tool that you have to dedicate yourself to knowing inside out and thinking about all the time, as opposed to a tool that's more of a fire-and-forget thing that's easy to learn, easy to use, and never screws up (and doesn't give you enough rope to hang yourself).

Somewhere among git, Perl, Python, the Linux kernel, Mercurial, various Linux distros, various BSDs, and many many proprietary software projects, there is some really awesome classic book on software engineering waiting to be written, about the pros and cons of top-down vs. bottom-up design, having a BDFL vs. community-driven design and goals, user- vs. marketing- vs. developer-driven designs, etc.

Cathedral and the Bazaar doesn't count; it's way too shallow and opinionated about there being a right way to do everything. There really is no one true way to choose any of these approaches, and they all have their upsides and downsides. Maybe the best we can hope for is to be aware of these things and choose the best model on a per-project basis, and be able to adapt when that project's ideal model changes.


First let's start with the simple answer: "pull" starts by invoking "fetch". It then performs either a merge or rebase depending upon various "git config" settings and CLI options you can pass to git pull.
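As a rough illustration of "depending upon config settings and CLI options", the sketch below picks between merge and rebase. It assumes a simplified precedence (an explicit `--rebase` / `--no-rebase` flag first, then `branch.<name>.rebase`, then `pull.rebase`, then merge as the default) and only understands the values "true" and "false". The function name `pull_strategy` is made up for this example, and real git accepts more values than this.

```python
def pull_strategy(branch, config, cli_rebase=None):
    """Return "rebase" or "merge" for a pull on `branch`.

    config     -- dict of git config keys to string values
    cli_rebase -- True for --rebase, False for --no-rebase, None if not given
    """
    if cli_rebase is not None:
        # An explicit command-line flag beats any configuration.
        return "rebase" if cli_rebase else "merge"
    # The per-branch setting is consulted before the general one.
    for key in ("branch.%s.rebase" % branch, "pull.rebase"):
        if key in config:
            return "rebase" if config[key] == "true" else "merge"
    return "merge"

default_mode = pull_strategy("master", {})
global_mode = pull_strategy("master", {"pull.rebase": "true"})
branch_mode = pull_strategy(
    "master", {"pull.rebase": "true", "branch.master.rebase": "false"})
flag_mode = pull_strategy("master", {}, cli_rebase=True)
print(default_mode, global_mode, branch_mode, flag_mode)
```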

Now let's pop the hood.

Git commit history is represented by a graph. Typically each commit has a pointer to a single parent. When history diverges and needs to be merged, a merge commit is created which has two parents (it can have more than two but such a commit is extremely atypical). The first commit obviously has no parents and is also called a root commit.

Okay, so we've got a history of commits pointing to each other in a directed acyclic graph. Now we want to traverse that graph with "git log". Where do we start? This is what branches are, a mapping from a name to a particular commit. Git stores the branch names under .git/refs/heads. Go look. These are just files whose names are the branch names, and whose values are the SHA-1 of a particular commit. (For performance reasons, git will occasionally remove the files and instead use .git/packed-refs. But again this is a file you can go cat.)
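If you'd rather see that in code than go cat the files, here is a small Python sketch that reads branch names the same way: loose files under refs/heads, plus packed-refs. It builds a fake .git directory in a temp folder so it runs anywhere; the two 40-character ids are placeholders, not real commits, and `read_branches` is a name invented for this example.

```python
import os
import tempfile

def read_branches(git_dir):
    """Map local branch names to commit ids, from packed-refs and loose refs."""
    branches = {}
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.exists(packed):
        with open(packed) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, the header comment, and peeled-tag lines.
                if not line or line[0] in "#^":
                    continue
                sha, ref = line.split(" ", 1)
                if ref.startswith("refs/heads/"):
                    branches[ref[len("refs/heads/"):]] = sha
    heads = os.path.join(git_dir, "refs", "heads")
    for root, _dirs, files in os.walk(heads):
        for name in files:
            path = os.path.join(root, name)
            branch = os.path.relpath(path, heads).replace(os.sep, "/")
            with open(path) as f:
                # A loose ref file overrides a packed entry of the same name.
                branches[branch] = f.read().strip()
    return branches

# Fake repository layout with one loose ref and one packed ref.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write("a" * 40 + "\n")
with open(os.path.join(git_dir, "packed-refs"), "w") as f:
    f.write("# pack-refs with: peeled fully-peeled\n")
    f.write("b" * 40 + " refs/heads/topic\n")

branches = read_branches(git_dir)
print(branches)
```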

Now, there are two types of branches: 1) local branches; 2) remote branches. Local branches are just the names (those things under refs/heads) which git updates whenever you create a new commit. Remote branches are the things which git updates when you perform a git fetch. That's it.

So a fetch operation examines a remote repo's local branches (refs/heads), examines the corresponding remote branches in your repo (refs/remotes/<remote_name>), pulls over the differences, then updates your remote branches to match the remote's local branches. It does so according to .git/config with a section that looks like this:

  [remote "origin"]
    url = https://github.com/gitster/git.git
    fetch = +refs/heads/master:refs/remotes/origin/master
That "fetch =" line tells git what to do when you invoke "git fetch" or "git fetch origin". It says to update your repo's refs/remotes/origin/master to match refs/heads/master in gitster's repo on github. The "+" at the start means to force it to happen even if the remote end has been rewritten (that is, the remote master's history does not "contain" refs/remotes/origin/master on your end). When you invoke git fetch you can see this happen in its output:

  From https://github.com/gitster/git
     52a3e01..edca415  master     -> origin/master
Well what happened here? Fetch examined refs/remotes/origin/master in my repo and refs/heads/master in gitster's repo, pulled over the commits I was missing, then updated my refs/remotes/origin/master (aka origin/master) from 52a3e01 to edca415. So "git log 52a3e01..edca415" will show me exactly the commits that fetch just brought over.
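Here is a hedged Python sketch of what that "fetch =" line means. It splits the refspec into its parts and applies the rule described above: without the "+", the update is refused unless the old value is an ancestor of the new one. The parent relationship between 52a3e01 and edca415 is assumed to be direct for simplicity (the real range may hold many commits), and the commit named "rewritten" is hypothetical.

```python
def parse_refspec(spec):
    """Split a fetch refspec into (force, source, destination)."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, dst = spec.split(":", 1)
    return force, src, dst

# Assumed history: edca415 descends from 52a3e01; "rewritten" is unrelated.
parents = {"52a3e01": [], "edca415": ["52a3e01"], "rewritten": []}

def is_ancestor(old, new):
    """True if `old` is reachable from `new` via parent pointers."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False

def apply_fetch(spec, remote_refs, local_refs):
    """Update the destination ref; refuse a non-fast-forward unless forced."""
    force, src, dst = parse_refspec(spec)
    new = remote_refs[src]
    old = local_refs.get(dst)
    if old is None or force or is_ancestor(old, new):
        local_refs[dst] = new
        return True
    return False

forced = "+refs/heads/master:refs/remotes/origin/master"
plain = "refs/heads/master:refs/remotes/origin/master"
mine = {"refs/remotes/origin/master": "52a3e01"}

normal_update = apply_fetch(plain, {"refs/heads/master": "edca415"}, mine)
# The remote end is rewritten so that it no longer contains edca415.
rejected = apply_fetch(plain, {"refs/heads/master": "rewritten"}, mine)
forced_update = apply_fetch(forced, {"refs/heads/master": "rewritten"}, mine)
print(normal_update, rejected, forced_update, mine)
```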

You'd typically perform a git fetch on its own so that you can then do something like "git log master..origin/master" which tells git to show you all the commits that are in refs/remotes/origin/master but that are NOT in refs/heads/master, i.e. exactly those commits you either need to merge in or rebase upon.
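The "in one but NOT in the other" rule is just a set difference over reachable commits. A minimal Python sketch, using a made-up four-commit history (the names `base`, `mine`, `theirs1`, `theirs2` are hypothetical):

```python
def reachable(parents, tip):
    """All commits reachable from `tip`, including `tip` itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def log_range(parents, exclude_tip, include_tip):
    """Commits in include_tip's history but not in exclude_tip's."""
    return reachable(parents, include_tip) - reachable(parents, exclude_tip)

# Hypothetical history: both sides started from "base".
parents = {
    "base": [],
    "mine": ["base"],
    "theirs1": ["base"],
    "theirs2": ["theirs1"],
}
refs = {"master": "mine", "origin/master": "theirs2"}

# Like master..origin/master: what fetch brought that master lacks.
incoming = log_range(parents, refs["master"], refs["origin/master"])
# Like origin/master..master: local commits not yet on the remote.
outgoing = log_range(parents, refs["origin/master"], refs["master"])
print(sorted(incoming), sorted(outgoing))
```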

So that's fetch.

Git pull then invokes either merge or rebase. To talk about these let's add some history.

Pretend your local repo started as a clone. The remote repo ("origin") has a single branch, master, and at the time you cloned it there was a single commit on master, A. So your local repo after the clone:

  refs/heads/master: A
  refs/remotes/origin/master: A
Now you create a new commit:

  refs/heads/master: B
  refs/remotes/origin/master: A
Someone else pushes a new commit to origin and you fetch that commit:

  refs/heads/master: B
  refs/remotes/origin/master: C
Now we have a case where history has diverged. Two people have created commits, both of which have the same parent commit A, and we need to tie these both into master. Let's do it with a merge first:

  refs/heads/master: D
  refs/remotes/origin/master: C
But what is "D"? D is a merge commit with two parents, B and C. And because D "contains" C, you can push it to the remote repo, updating refs/heads/master in the remote repo.

But what if instead you want to rebase?

  refs/heads/master: B'
  refs/remotes/origin/master: C
This has linearized history. B' has a single parent, C, whose parent is A.

Similar to creating the merge, you can push B' to the remote repo because B' contains C (unlike the original B). The rebase operation "rewrote" refs/heads/master, dropping the original B which had A as its parent and replacing it with B', which has C as its parent. Your original B is still in your repo btw, and will be there for some time until "git gc" removes it. You can find the original B in your reflog.
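The A/B/C example can be played out in a few lines of Python. This is only a model of the parent pointers described above, not of what git does to file contents; `contains` is a helper written for this sketch.

```python
def contains(parents, tip, commit):
    """True if `commit` is reachable from `tip` by following parents."""
    stack = [tip]
    while stack:
        current = stack.pop()
        if current == commit:
            return True
        stack.extend(parents[current])
    return False

def diverged():
    """State after the fetch: master at B, origin/master at C, both on A."""
    parents = {"A": [], "B": ["A"], "C": ["A"]}
    refs = {"master": "B", "origin/master": "C"}
    return parents, refs

# Before doing anything, master does not contain C, so a push is refused.
d_parents, d_refs = diverged()
pushable_before = contains(d_parents, d_refs["master"], "C")

# Merge: a new commit D with two parents, B and C.
m_parents, m_refs = diverged()
m_parents["D"] = [m_refs["master"], m_refs["origin/master"]]
m_refs["master"] = "D"
pushable_merge = contains(m_parents, m_refs["master"], "C")

# Rebase: B is replayed on top of C as a new commit B'.
r_parents, r_refs = diverged()
r_parents["B'"] = [r_refs["origin/master"]]
r_refs["master"] = "B'"
pushable_rebase = contains(r_parents, r_refs["master"], "C")

# The original B is no longer named by master but still exists.
original_b_kept = "B" in r_parents

print(pushable_before, pushable_merge, pushable_rebase, original_b_kept)
```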


When I see people say git is/is not complicated, I think that doesn't quite capture the problem. Git, like the set of chess moves, is simple. But git, like playing chess without being able to see the chessboard, puts a big cognitive load on users that many users think isn't necessary.

Moreover, that analogy might approach illustrating the problem, but is still unsatisfying. Unlike, say, Eclipse, where one can readily sum up how Eclipse fails at being a well behaved GUI - it offers to do things that will fail and are nonsensical - git is, on the one hand, a lot better designed and it provides more value than Eclipse, and on the other hand git is even more baffling to beginners.

I want to see what's happening. I want to see what's going on at every accessible point in the workflow of every project I am working on. I want to see the results of actions. I want to be presented with all and only the sensible actions in ways that are discoverable. I do not want a lame fake "GUI" that is just dialogs pasted on a CLI.


Consider that git is a database of snapshots-in-time with an interface to append your own snapshot(s), and a protocol for distributing the revised timeline to others.

`git fetch` is how you retrieve those changes from a remote (like github). `git pull` takes it a step further and attempts to merge the fetched changes into your current branch automatically. Think of them as two separate operations, "download" and "sync": depending on your workflow, you may want to download but defer the sync. Eventually, you will need to sync before sending your changes upstream (`git push`), but you have some flexibility in deciding when.

For example, if you are travelling, you may want to fetch your remote(s) so that you have access to a relatively recent copy of the remote, but you aren't ready to apply those changes to your current source tree. Or, you may want to fetch the remote(s) and compare the differences before applying the changes locally in case there's a risk of incorporating a breaking change that could disrupt things.


Wow, after reading a dozen or so responses, I still don't really get it.


I never liked stack overflow, so let me try here: because git is distributed, you actually have two "copies" of your code; one that's a mirror of what's on the server (origin/master), and one that's your local checkout (master). Things that in SVN you would just do by contacting the server (e.g. svn diff . https://server/repo/trunk) you instead do against your "local server copy" (git diff . origin/master).

So there's an extra stage to think about; rather than making changes to your local checkout and committing them to the server, in git you make changes to your local checkout, commit them to your local repository, and then push them to the remote server. But this also works in the other direction: rather than fetching changes from the server directly into your local checkout, you fetch changes from the server to your local repository, then from there into your local checkout. At least, if you want to. "fetch" does the first of these, while "pull" does both of them together.

Now, you can naturally ask why we would ever want this additional complexity, but as I said at the start it's the whole point of a DVCS. Conceptually, your repository on github and your local repository are the same kind of thing, rather than a client/server relationship. You can use git without using remotes at all, in which case you'd never use push, fetch or pull; in that mode it behaves rather like SVN (just with the repository being on the same machine as the checkout). Conversely, you can use it in truly distributed fashion, where there are several federated repositories, none of which is physically distinguished from the other. At that point there is no real notion of committing to the "canonical" repository, because there isn't one, but what you can do is commit to, and checkout from, your local repository, and you can synchronize your repository with another repository in either direction (i.e. you can send commits from your repository to another, or receive commits from another repository to yours). When using it in this fashion it becomes very important that these are different operations and you want to be able to manually control when each step happens.


If you want a DVCS that IMO works a lot more like what you are accustomed to from svn, at least in the routine "checkout, change, commit" workflow, have a look at Mercurial (hg).


I think an important aspect to "getting" git is realizing and accepting that git has several somewhat redundant commands and usually two or more ways to accomplish something.

Personally I always do "git remote update -p" to synchronize all my remotes and then explicitly merge or rebase depending on what I'm trying to do.


Simple:

`git pull` = `git fetch` + `git merge`

`git fetch` is much more often used by those who use the rebase workflow (vs. the merge workflow).


My understanding of git fetch is that the remote changes are stored out of the way, ready to be merged into a branch. git pull is just a convenience function.

Now a git pull -r, that's interesting. Does this just fetch the commits and then do a rebase instead of a merge?


Pull with rebase: yes, that's how it works. You can also make it the default behavior globally or per-repository if you prefer.


With Mercurial it's the opposite where fetch is the "automated" command. Pull just pulls the latest version from the server while fetch also tries to update, merge and commit.


Just read git-pull.sh: it's a 300-line shell script. How much simpler does it get?


It gets simpler by someone telling in short what's the difference so you don't have to read and understand the code.

Just because you understand the basics of Git doesn't mean you can understand the language it was written in.


Then being able to read basic code is a nice thing to aspire to, no?

Read and ask questions about what you don't understand.


I know how to use drills, saws and other tools but never took them apart to see how they work inside.


Why do so many people have such trouble with this simple concept?

Git isn't complicated, you just have to understand that a git repository is composed of two things:

1) The index which is the git information about all your files (ie. .git/)

2) The files you currently have checked out.

When you git fetch, you update the index, nothing else.

When you git merge or git rebase or reset or checkout, you update your files from the index.

-____-

It makes me extremely sad to see this repeated over and over and people don't get it.


The index isn't the information on all your files (it's not the .git directory). The index is an intermediate holding location between the file system and actually storing all of those files as a commit in the commit tree. It's the current state of the proposed next commit.

Your description misses (or conflates) an entire tree of information (the HEAD tree), and it's arguably the most important one as it holds the whole of your git repo's history.
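A toy Python model of the three trees may help here. It treats each tree as a plain dict from path to contents, which is an assumption made for illustration (git stores much more than this), and the file names and helper functions are made up.

```python
# Three "trees", each modelled as {path: contents}.
head = {"README": "v1"}     # snapshot recorded by the last commit
index = dict(head)          # proposed next commit (the staging area)
working = dict(head)        # files as they exist on disk

def add(index, working, path):
    """Stage one path: copy it from the working tree into the index."""
    index[path] = working[path]

def commit(index):
    """Record the index as the new HEAD snapshot."""
    return dict(index)

# Edit one file and create another, but stage only the first.
working["README"] = "v2"
working["notes.txt"] = "draft"
add(index, working, "README")

# Index vs HEAD: what the next commit would change.
staged = {p for p in index if head.get(p) != index[p]}
# Working tree vs index: what is on disk but not yet staged.
not_staged = {p for p in working if index.get(p) != working[p]}

head = commit(index)
print(staged, not_staged, head)
```

Note that fetch appears nowhere in this model: it updates neither the index nor the working tree, only the stored commits and the origin/* names.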

I didn't fully understand git till I read Scott Chacon's "A Tale of 3 Trees" which explains what reset is all about and goes into the details: http://git-scm.com/blog/2011/07/11/reset.html

I love git, but I do not think it's obvious or intuitive without some explanation. It's different than any other SCM I've used in the past. I created this presentation a while ago that I think highlights some of the real concepts that people need to know to really understand git: http://tednaleid.github.io/showoff-git-core-concepts/


Ok, not the index. The point is:

You have repository meta data in your .git folder.

You can download new information into your .git folder using git fetch.

...but you can only apply changes to your .git folder's data locally (eg. git merge)

...and you can only apply changes to your file system from your local .git folder.

Four, basic concepts. Why is this difficult?

There's tonnes of complexity in git, sure. If you have trouble merging after a rebase, that's totally understandable.

...but the basic failure to understand the difference between applying a remote change directly to your current file system (not possible) and downloading that change and then applying it locally in various ways frustrates me, I've got to say.


This might be best explained with pictures, which of course aren't really possible on a forum like this. The problem is when people try to explain git in words they often start talking about "directed acyclic graphs" and "local remote" or "remote local" branches (what?) and using terms like "clone", "checkout", "rebase", "master", "head", "origin" without defining them; terms that have specific meanings in git that are different from their meanings in other systems and different from what many uninitiated users might think they mean intuitively.


> Four, basic concepts. Why is this difficult?

Because it's 3 more concepts than most people need.

A simple remote repository, coupled with a working copy, is easy to understand (and also all that most people really need).

You commit, you update. The only local divergences are the changes you haven't committed yet. In my experience, almost anyone can understand this semi-intuitively, and they only start getting confused when they hit merge conflicts.

Git, however, introduces considerably more state that a user must understand. There is local working copy state. There is local repository state. There is remote repository state. There is other people's remote repository state. Those states can come into conflict with one another, which means that there are quite a few more places to hit conflicts, and lots of sharp edges in the tool (git) with which you'd resolve them.

That's a lot of state, and it's state that most people don't actually need; it incurs a lot of mental overhead, even for people that understand how it works -- as evidenced by the fact that we're even having this conversation and two of you are debating how the index works.

The fact that this is git's default operating mode means that git is a bit like using a thermonuclear weapon to cook your breakfast, when most of the time, a simple gas stove would do. The tool was built for Linux, where it was designed to handle competing, divergent organizations, all of whom maintain long-term ongoing forks, with existing political and technical disagreements, divergent code bases and divergent ideas of stability and support requirements, all of whom are independently and concurrently maintaining, sharing, and rejecting patch sets across the graph of organizations participating in Linux development.

It's a huge, complicated, expensive, and messy development process, and it's nothing like what most organizations and projects need to support as their default mode of operation.

The problem isn't the people, the problem is that the tool is too damn complicated.


> Git isn't complicated

Yes it is. It's not necessarily incomprehensible, but it is absolutely more complicated than predecessor systems like svn. Simply compare the man pages and number of commands and options.

> It makes me extremely sad to see this repeated over and over and people don't get it.

If this is the case, then existing explanations/documentation are lacking, or the system is fundamentally too complex for "people" to "get."


And you don't understand what the index is. And because of that you are confusing people instead of helping them.

tednaleid already explained it in his reply to you.


It's not the same people not getting it. More and more people are using git every day so they are the ones who do not understand the git terminology/technology.


A lot of people use git without understanding it.



