My question is this: if wearing the headphones makes everything sound sharp, and you were getting your voice folded back into the cans, wouldn't that make your voice sound sharp as well as the track? So wouldn't you just adjust your singing to be in tune with the track? Both would be off by the same amount, so wouldn't they both be in tune when you played it back?
Good question...had me convinced at first. The singer is in a feedback loop here, but there will be overshoot. For the purposes of illustrating with an example, let's imagine that the mix comes back 1 semitone sharper than it actually is. Say the actual tune requires the singer to sing C, while the singer hears C#. The singer will start to sing a C# to be in key with what (s)he's hearing, but this will be heard back as a D in the headphones...the singer will realize that they are going sharp and flatten the note naturally, but there's a pitch warble owing to the finite reaction time of the singer.
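
To make the overshoot idea concrete, here's a rough toy simulation of that loop. Everything in it is an assumption for illustration only: the function name `simulate_singer`, the correction `gain`, the reaction `delay` (in arbitrary time steps), and the idea that the singer nudges the pitch in proportion to the error they heard a few steps ago. It is not a model of how real singers behave, just a sketch of a delayed feedback loop.

```python
def simulate_singer(offset=1.0, gain=0.3, delay=3, steps=40):
    """Toy delayed-feedback model; pitches are in semitones relative to the true note.

    offset: how sharp the headphones make everything sound (assumed +1 semitone)
    gain:   fraction of the heard error corrected per step (hypothetical value)
    delay:  reaction time in steps before the singer acts on what was heard
    """
    track = 0.0                      # the note the tune actually requires (C)
    sung = [track + offset]          # first attempt is aimed at the heard pitch (C#)
    for t in range(steps):
        t_heard = max(t - delay, 0)  # singer reacts to what was heard `delay` steps ago
        heard_voice = sung[t_heard] + offset
        heard_track = track + offset
        # The offset cancels here: both voice and track are shifted equally,
        # so the error is just how far the sung pitch is from the true note.
        error = heard_voice - heard_track
        sung.append(sung[-1] - gain * error)
    return sung


if __name__ == "__main__":
    for step, pitch in enumerate(simulate_singer()):
        print(f"step {step:2d}: {pitch:+.3f} semitones from the true note")
```

With a nonzero `delay` the sung pitch swings flat of the true note before settling back towards it, which is the warble described above. With `delay=0` it just glides down to the right note without overshooting.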

How's that for a theory?