See also /tino/english/opinion/vcs/ (which may at some point be reorganized or merged into this page)

VCS

There are zillions of VCS (Version Control Systems) out there.

Here I will list short notes on how to use such a VCS. Most of the time this is for read access only, so it is aimed at first-time users who want read-only access to a repository. If you are a developer you usually do not need this information, as you know your VCS.

History of VCS (from my perspective)

In the early days, version control usually started in the development directory. Next to the file itself (say: main.c) a history file (say: main.c,v) was saved. The history file recorded all changes to the file, so that you were able to revert to an older state of the file.

This was how SCCS and RCS worked. However this scheme had drawbacks. First, the history files were in the same directory and thus cluttered the listing. An "rm -rf" hit all files, so the history was accidentally destroyed along with the sources. And it did not work well when more than one developer worked on the same files.

So a shared VCS was developed. It can be said that the grandmother of those shared VCS was CVS. CVS was built on top of RCS; it was merely a wrapper and transport protocol to access a repository of RCS files. This repository was shared in the sense that concurrent updates to the same files were denied. And for more complex situations a locking technique was introduced, such that one developer was able to "reserve" the editing of a file.

Later this "reserve" was refined so that you can do "update, merge, checkin" cycles instead, because "reserved edits" hinder other developers. However CVS had several major drawbacks. Some of these are:

  • Commits are atomic per file only: CVS processes one file at a time. This way you do not have "changesets". All you have is tags and the checkout of a point in time. So to find out what changed you must find "nearby commits" in other files. This is difficult. It also does not help much in getting a build number or similar, because each file has its own revision and there is no global revision number.
  • No renames: CVS cannot track name changes. Another name means another file. So you lose the information on how files were renamed if it was not noted anywhere. And even if it was, who reads documentation? So this information must go into the VCS, and CVS is not capable of keeping it.
  • History can be changed: In CVS you can change the history by lying about the contents of old files. The problem is twofold: If you create some software, there is no forensic record of what happened when. You can edit old files and erase some history, or you can introduce some history (like a trojan), and nobody is able to tell what happened when. Modern VCS (like Git) use cryptographic hashes to denote the revision, and the hash of the head then authenticates all the changesets which are relevant to this revision. If you change anything in history, those hashes will tell you that this old information is corrupt.
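
The hash argument from the last point can be illustrated with a minimal sketch. It assumes that each changeset ID is the SHA-256 hash of the parent's ID plus the changeset content. This is not the object format of Git or of any other real VCS, and the names changeset_id and head_id are made up for this example.

```python
import hashlib


def changeset_id(parent_id: str, content: bytes) -> str:
    """Hash a changeset together with the ID of its parent (assumed scheme)."""
    h = hashlib.sha256()
    h.update(parent_id.encode("ascii"))
    h.update(b"\0")  # separator between parent ID and content
    h.update(content)
    return h.hexdigest()


def head_id(changesets: list[bytes]) -> str:
    """Chain all changesets, oldest first, and return the ID of the newest."""
    current = ""  # the first changeset has no parent
    for content in changesets:
        current = changeset_id(current, content)
    return current


history = [b"add main.c", b"fix typo in main.c", b"add Makefile"]
trusted_head = head_id(history)

# Silently rewrite the oldest changeset, e.g. to smuggle in a trojan.
tampered = [b"add main.c with trojan"] + history[1:]

print(head_id(history) == trusted_head)   # unchanged history: same head
print(head_id(tampered) == trusted_head)  # rewritten history: different head
```

Anyone who remembers only the trusted head ID can detect the rewrite, because every later ID depends on everything before it.
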
There are other aspects which I consider bad design in CVS (and in all other VCS out there that use this design):

  • CVS is a central server: Central servers have interlocking problems when it comes to distributed use. If you have 2 CVS servers, which one is the master, which one is the slave? What happens if somebody checks in (CI) changes to the slave, somebody else does the same with the master, and then both try to "sync" the repository? With CVS this is impossible, as the files then differ and the incompatibilities cannot be resolved automatically.
  • Difficult branch handling: Branch operations and their opposite (merge?) are difficult to do. However this is true for all VCS I have ever used. The problem is that you have to tell the VCS that this is a branch or a merge. Why? Branch and merge shall be natural actions. That is, if you edit a source in two directories, the edits differ and you CI these edits, this is a branch. You then get a note that you left your branch until you commit this change. If you then sync those two edits again and do the CI, this is a merge.
  • Difficult conflict handling: If you allow a branch to depend on another branch, this also resolves "conflict" handling. A conflict is when the source was edited at the same location to contain different data. For example, there is a distribution which must be modified locally to fit local needs. If a new dist comes out you want to CI this distribution. But you do not want to lose your local edits. If the dist modified the same part you modified locally, too, which one shall survive? This is a conflict which must be resolved. However handling of this situation (which is quite common!) is left to the developer: you have a dist-branch and a local-branch and do a 3 way diff to resolve the changes. This is more than clumsy. A good VCS shall be able to resolve that. That is, a branch shall be able to depend on another (major) branch, such that if that (major) branch is changed, all the changes are carried over to the depending branch on update (this actually does a 3 way diff, but with the help of the VCS). Note that the default branch (HEAD for CVS) this way usually is a depending branch!
  • CVS is not distributed: In modern development a central server is not helpful anymore. It is OK for enterprises with strict development guidelines, but in distributed development like in the Grid (modern Internet) there simply is no single central authority anymore. People vanish and show up as they like. The repository must survive all this chaos. So it must be distributed from the start.
  • CVS is a server: A modern VCS shall have no server. Just some access to a filesystem must be enough. Access to a repository through NFS/SMB (network share), HTTP (get) or ftp (anonymous) must be sufficient. You must be able to sync two repositories on the filesystem level with no server component or scripts involved. Just take a TAR or a wget -r and "merge" everything with the data on your drive. That's how a repository must be organized today. There can be a server which adapts this process to fit corporate guidelines. But on the low level, you must be able to work independently of this server (in case it is unreachable). That is, you still must be able to check out (CO) files, edit files and commit to the repository (the sync process then may be somewhat more complex, because you have to resolve conflicts first, but this is up to the server component and shall not be part of the VCS basics).
  • CVS already is too complex: There shall be no more than 3 to 4 basic commands which you use all day. There shall be 5 to 6 other commands you use regularly, like tagging, creating a release, fetching or syncing additional repositories etc. And there can be some 5 to 6 other commands you use rarely for administrative purposes, like repairing defective repositories (for example when a power loss occurred while updating a repository), checking file integrity or signing changesets. That shall be enough. Any VCS which needs more commands is too complex to use and has a too complex design.
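
The conflict case from the list above (a distribution and a local version that both derive from the same base) can be sketched as a tiny 3 way merge. This is a deliberately simplified sketch: it assumes lines are only changed in place, never inserted or deleted, so it is far weaker than a real diff3. The function name merge3 and the sample data are made up for this example.

```python
def merge3(base: list[str], dist: list[str], local: list[str]):
    """Merge two edited versions of `base` line by line.

    Simplification: all three versions must have the same number of lines.
    Returns (merged_lines, conflicting_line_numbers).
    """
    if not (len(base) == len(dist) == len(local)):
        raise ValueError("this sketch only handles in-place edits")
    merged, conflicts = [], []
    for number, (b, d, l) in enumerate(zip(base, dist, local), start=1):
        if d == l:            # both sides agree (or nobody touched the line)
            merged.append(d)
        elif d == b:          # only the local side changed the line
            merged.append(l)
        elif l == b:          # only the distribution changed the line
            merged.append(d)
        else:                 # both changed it differently: a real conflict
            merged.append(l)  # keep the local edit, but report the line
            conflicts.append(number)
    return merged, conflicts


base = ["PORT=80", "USER=www", "DEBUG=0"]
dist = ["PORT=80", "USER=httpd", "DEBUG=1"]    # new upstream release
local = ["PORT=8080", "USER=www", "DEBUG=2"]   # local adjustments

merged, conflicts = merge3(base, dist, local)
print(merged)     # ['PORT=8080', 'USER=httpd', 'DEBUG=2']
print(conflicts)  # [3]
```

Only line 3 needs a human decision. Everything else can be resolved mechanically once the VCS knows the common base, which is the point of letting one branch depend on another.
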
Some important notes on VCS design:

  • All commands must do what you expect: Commands shall tell what they do, they must not have bad side effects and they must work reliably every time they are used. They must also work well at the shell level (give descriptive result codes etc.). If something breaks, the command shall present you with a note on what failed and how you can fix the problem. For example, if an update fails because the central server went away, it shall ask if the update shall continue with the local repository. If you answer "no", the update must be rolled back completely, such that you have the situation from before the update. OTOH if a commit fails, the commit shall still be done to the local repository in whole. Then the command stops and tells you that the sync to the server could not be done, so that you now work on your local repository alone.
  • No command must be allowed to harm anything: This is true even in the worst situation, when the lights go out unexpectedly. Harm is, for example, if the VCS does an "rm" of a file which it is unable to bring back in exactly the version which was removed. In particular this means: No "fresh checkout" where local files are overwritten with the version from the VCS if a change might vanish this way. The VCS can do a CI of the changed files into a "temporary local branch" which is put into a trashbin, such that you can later decide to remove this branch or save it. But under no circumstances is it allowed to put the files somewhere "hidden" on the local filesystem (they must be moved to the repository).
  • Repositories must not be part of the working tree directory: A VCS shall refuse to run this way. It shall even warn the user if the repository and the working tree are on the same harddisk, such that one single hardware failure hits both the repository and the working tree. Note that there shall be a "temporary repository" on the working tree, too, such that you still have a backup of your working set (but only of the necessary changesets for your working branch) if some catastrophic failure hits the repository. Note that you usually have 3 repositories this way: the server repository (via network, not necessarily synced), your local repository (on another drive, usually synced) and the temporary repository (which provides the rollback in case of catastrophic failures such as sudden power loss in the middle of some operation), plus your working tree, of course (the latter two on the same drive). That shall be the usual setup! (If you are puzzled, perhaps read how SQLite does atomic transactions, then you will understand why you need 3 repositories. Also note that the temporary repository is organized differently. It can reside in a SQLite database or similar and be accompanied by files, so you cannot merge it on file level, however you can re-create the normal repository from it.)
  • Note that I do not demand anything of the server component. It can be as complicated as you like. A VCS must be able to run without such a server, so I do not need it, so I do not need to think about it.
  • Note that I do not need funny features like "Update repositories". Those are repositories which provide "updates" to "read only master repositories". Such crap is not needed, as in this case you simply mirror the master repository completely and add your changes to this mirror. Both must be able to live happily with each other within the same file tree. If not, the VCS has a bad design.
  • Note that I do not want to see "undo checkins". If you did a checkin and detect that it contains some important password which must not be distributed, you simply delete the complete changeset. If you do this, you invalidate all dependent changesets as well (hopefully there are none yet). There must be no command to delete this changeset, so you do this on file system level (using rm or del or whatever). All that shall be there is an admin command which informs you which particular file contains parts of which particular changeset. If a file contains more than this changeset, you have to reorganize the file (see next note).
  • Note that I want to be able to "reorganize" the repository. That is, there shall be a way to "repack" the data more efficiently in case you want to do so. Also there shall be no strict mapping like "1 file is one changeset". So changesets can be made of parts of several files and one file can contain more than one changeset. It may even be that one single file is split into parts across several files. If you think this will create chaos, you are wrong. A repository just contains what it contains. After a "reorg" it still contains the same data. If you then merge something which is not reorganized, this does not create chaos. It only means that the information now is redundant in the repository, so you can do another "reorg" to save space. The idea behind this is that the storage format of the repository is not fixed. It can change over time and evolve without you losing any of the properties of the repository (signed information later still is signed, hashes still are valid, etc.).
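
The "reorg" idea from the last note can be sketched as follows, under the assumption that every piece of data is identified by the hash of its content and not by the file it happens to be stored in. The repository is modelled as a plain dict from storage file name to a list of blobs. All names (blob_id, index, reorg, merge, "pack-0") are made up for this example, and signatures are not modelled at all.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """The ID of a blob depends only on its content, not on its location."""
    return hashlib.sha256(data).hexdigest()


def index(repo: dict[str, list[bytes]]) -> dict[str, bytes]:
    """Map blob ID to content for every blob in every storage file."""
    return {blob_id(blob): blob for blobs in repo.values() for blob in blobs}


def reorg(repo: dict[str, list[bytes]]) -> dict[str, list[bytes]]:
    """Repack all blobs into one storage file, dropping redundant copies."""
    unique = index(repo)
    return {"pack-0": [unique[i] for i in sorted(unique)]}


def merge(a: dict[str, list[bytes]], b: dict[str, list[bytes]]):
    """Merge on file level: simply put the storage files side by side."""
    merged = {name: list(blobs) for name, blobs in a.items()}
    for name, blobs in b.items():
        merged.setdefault(name, []).extend(blobs)
    return merged


def stored(repo: dict[str, list[bytes]]) -> int:
    """Number of blobs physically stored, including redundant copies."""
    return sum(len(blobs) for blobs in repo.values())


repo = {
    "file-a": [b"changeset 1", b"changeset 2"],
    "file-b": [b"changeset 3"],
}
packed = reorg(repo)
other = {"file-c": [b"changeset 3", b"changeset 4"]}  # not reorganized
combined = merge(packed, other)

print(index(packed) == index(repo))  # True: same data, same hashes
print(stored(combined))              # 5: "changeset 3" is stored twice
print(stored(reorg(combined)))       # 4: another reorg removes the copy
```

The layout changes with every reorg, but the set of content hashes does not, so merging a reorganized repository with one that is not reorganized only costs some redundant space.
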
I have not found such a VCS so far, sadly. As long as none fits my needs, I will continue to use CVS (for me alone). Because I know CVS deeply, because it can easily be managed on file level (due to RCS files), and because I have used it for nearly two decades now.

-Tino, 2008-03-09