Recently I lost some hours of work due to TortoiseSVN. This happened because SVN deletes data which is not yet versioned. I consider this incorrect behavior for a program, yet it is by design.

So here are some additional requirements I have for a VCS:

  • VCS must be a CMS, too.
  • No special software needed on the server side. Any kind of file server, like FTP, HTTP, SSH, Samba, or even Gopher, shall do.
  • Supports trainloads of developers in parallel.
  • Local repository with automatic branches and online edits.
  • Both single-file revisions and bundled commits.

Why CMS?

A CMS has certain features which are very useful. Think about a software distribution site which wants to give you an impression of the current source code.

However, such sites usually keep two types of software downloads: the latest stable release and the development code.

Now back to the CMS, what do you get?

  • There is an author (the programmer) who prepares the articles locally (edits the source)
  • There is an editor (the release people) who combines all the articles into a preliminary version of the pages (this is the development build; in a CMS it is called "staging").
  • When everything is signed and sealed, the final version is published. This is then the stable release or the production website.
CMS and software development have many things in common. So it would be very good if the VCS were also usable as a CMS without modification.

Paradigm change

To achieve this goal you must change some paradigms found in traditional VCS. The paradigms which I think can be bent to gain all this are the following three:

  • Version numbers
  • Tags
  • HEAD revisions
If you drop all three and combine them into one new thing, you get the following:

  • Versioning is still done, but it is no longer a pure number scheme. Numbers are used for the automated processes, while non-numeric parts are used as manual markers.
  • TAGs no longer exist. They are the same as versions; in fact, there is no difference at all. The downside is that a tag has to follow the version rules, so its form is not completely free. Within those rules, you are free to choose a tag as you want.
  • There is no longer a single HEAD, as there are multiple heads. One head is the edit head, on which the developer is working. Another is the publish head, after the developer has checked in. A third is the distribution head, where the current stable release can be found. You have a lot of heads on different hierarchy levels, but they are clearly structured.

Version numbers, TAGs and HEADs

In traditional VCS, you have one single HEAD revision. This always is the last revision checked in on the main branch. To stick to some older version, you either use a TAG or use a branch, which again has a branch head.

This is common practice and well understood. However, it must be learned.

So here is the deal: you learn something new, but in return you get rid of all those complex branching rules. For single users of the VCS there are only advantages. For multiple users there is a very simple setup which everybody can understand easily.

  • A version is a dotted string, that is, a sequence of components separated by dots. It may also have no dots at all.
  • A HEAD is the highest subversion.0 of some version. This is written as version.x.0 or plainly as version.x
  • There are no TAGs, you always refer to a version. However a file can have more than one version. If a file has more than one version, this means, there was a merge, a file belongs to more than one repository, or a file has some arbitrary additional tags on it.
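As a minimal sketch of such dotted versions, the following splits a version string into numeric and non-numeric components. The function names `parse_version` and `normalize` are made up for illustration, and treating a trailing .0 as implied is an assumption drawn from the remark that version.x.0 may be written as version.x.

```python
def parse_version(text):
    """Split a dotted version string into components.

    Numeric components become ints; anything else (such as an author
    name) is kept as a string.
    """
    if text == "":
        return []
    return [int(part) if part.isdecimal() else part for part in text.split(".")]


def normalize(components):
    """Drop trailing zeros, so that 1.2.0 and 1.2 name the same version."""
    result = list(components)
    while result and result[-1] == 0:
        result.pop()
    return result


print(parse_version("1.2.0.A.10"))  # [1, 2, 0, 'A', 10]
print(normalize(parse_version("1.2.0")) == normalize(parse_version("1.2")))  # True
```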
Let's explain it with an example:

There are two authors, A and B. The names A and B are unique. This is important. If both authors have the same name, they must get distinct names assigned, like A1 and A2. Also, these names are never used as TAGs.

There is a repository. The current head revision is 1.2.0 (this is the same as 1.2). It is downloadable from a website under some URL. So the webserver is serving the HEAD of version 1.

Now developer A starts the computer and downloads the software to start editing. The local repository is filled with the data from the website and A starts editing.

As A does not want to publish to the website directly but to do development checkouts instead, he checks out 1.2.0 instead of 1.2, so the version after download is 1.2.0.A.0. After the first edit it becomes 1.2.0.A.1 and so on.

After 10 edits, A is finished and does a commit to the website. The newly checked-in version therefore is 1.2.1. As the website only serves the HEAD of version 1, which is 1.x.0 with the maximum x, you still see 1.2.0 on the website.

Due to the commit, the version number of A changes from 1.2.0.A.10 to 1.2.1.A.0, as usual.
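Why the website keeps showing 1.2.0 after this commit can be sketched as follows. The function `head` and the list `published` are hypothetical, and the sketch only looks at versions exactly one level below the given prefix.

```python
def head(prefix, versions):
    """Return the HEAD of `prefix`: prefix.x.0 with the maximum numeric x.

    Only versions exactly one component longer than the prefix (after
    dropping trailing zeros) are candidates, so 1.2.1 is not a HEAD
    candidate of version 1. The bare prefix itself (x = 0) is not
    considered in this sketch.
    """
    def parse(text):
        parts = [int(p) if p.isdecimal() else p for p in text.split(".")]
        while parts and parts[-1] == 0:
            parts.pop()
        return parts

    base = parse(prefix)
    best = None
    for version in versions:
        parts = parse(version)
        is_child = len(parts) == len(base) + 1 and parts[:-1] == base
        if is_child and isinstance(parts[-1], int):
            if best is None or parts[-1] > best[-1]:
                best = parts
    return None if best is None else ".".join(str(p) for p in best)


published = ["1.0", "1.1", "1.2", "1.2.1"]
print(head("1", published))    # 1.2
print(head("1.2", published))  # 1.2.1
```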

Now developer B downloads the 1.2.1 version (this is 1.2.1.0, too) to continue editing. He does not check out 1.2.1.0; instead he checks out 1.2.1. So his version then is 1.2.1.B.0. B does some editing. In the meantime A does some editing, too, and does another commit before B. Then B does the commit.

Imagine that there is a conflict between the checkin of A and B. Which version survives? The answer is: Both. The VCS must do an automatic branching in this case as follows:

Before the checkin of B there are the following versions:

  • 1.2, the original website
  • 1.2.1, the first commit of A, this also carries the version number 1.2.0.A.10
  • 1.2.2, the second commit of A, this might carry the version number 1.2.1.A.5, too
Now B does a commit from 1.2.1.B.20 to 1.2.1 which does conflict with 1.2.2, so the next possible branch is used and the version committed is 1.2.1.1. So after the checkin of B we have:

  • 1.2, the original website
  • 1.2.1, the first commit of A, this also carries the version number 1.2.0.A.10
  • 1.2.2, the second commit of A (also version 1.2.1.A.5)
  • 1.2.1.1, the first commit of B (also version 1.2.1.B.20)
The version number B then has locally is 1.2.1.1.B.0, which also is 1.2.1.B.20 and 1.2.1.1.0.
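The automatic branching in this example can be sketched as below. The function `commit_target` is hypothetical; it assumes the checked-out version ends in a number, and what happens when the first branch number is taken as well (here: use the next free one) is my assumption, not something the example settles.

```python
def commit_target(checked_out, existing):
    """Pick the version a commit lands on.

    First choice: the checked-out version with its last component
    incremented. If that version already exists (a conflict), branch
    instead by appending .1, .2, ... to the checked-out version.
    """
    parts = checked_out.split(".")
    successor = ".".join(parts[:-1] + [str(int(parts[-1]) + 1)])
    if successor not in existing:
        return successor
    n = 1
    while f"{checked_out}.{n}" in existing:
        n += 1  # assumption: later conflicts take the next free branch number
    return f"{checked_out}.{n}"


repo = {"1.2"}
first = commit_target("1.2.0", repo)   # A's first commit
repo.add(first)
second = commit_target("1.2.1", repo)  # A's second commit
repo.add(second)
branch = commit_target("1.2.1", repo)  # B's conflicting commit
print(first, second, branch)           # 1.2.1 1.2.2 1.2.1.1
```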

Now B resolves the conflict by merging with version 1.2.2 manually. The merged result then has the following version numbers (assuming one edit): 1.2.1.1.B.1 and 1.2.2.B.1

By default the next commit automatically takes the highest version number to commit to, and therefore commits to 1.2.2. So the next commit checks in:
  • 1.2.3 with additional versions 1.2.1.1.B.1 and 1.2.2.B.1
Now A does an update. The current version of A is 1.2.2.A.0, which also is 1.2.2.0.

So the update tries to get the version 1.2.x with the maximum x higher than 2. If this fails the update will look for 1.2.2.x.0 with a maximum x higher than 0, and then a 1.2.2.A.x (or 1.2.2.A.x.0) with a maximum x higher than 0.

This is a common feature. If some file has more than one version, all possible newer versions are tried. A "newer version" is the next subversion above a .0 and as a second try the increment of the .0.

For this to work correctly, a slightly weird sort order is needed:

Alphanumeric tags are bigger than 0 and any other alphanumeric tag, but are smaller than 1. This allows that version 1.2.1.1 is higher than 1.2.A.0, which is correct, as 1.2.1.1 is based on 1.2.1.0 which is based on 1.2.0 somehow. 1.2.A.0 is based on 1.2.0, too, so 1.2.1.0 is a valid successor.

It is arguable if 1.2.1.2.0 would be a valid successor to 1.2.A.0, too. However I don't think this is needed in practice. If you really want to update from 1.2.A.0 to 1.2.1.2.0 then you can issue update twice (first will update to 1.2.1.A.0 and the second to 1.2.1.2.0).
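The sort order described above can be sketched as a sort key. The name `sort_key` is made up, and ordering tags among each other alphabetically is my assumption; the text only fixes their position between 0 and 1.

```python
def sort_key(version):
    """Sort key: 0 < alphanumeric tag < any positive number, per component."""
    parts = version.split(".")
    while parts and parts[-1] == "0":
        parts.pop()  # a trailing .0 is implied, so 1.2.0 sorts like 1.2
    key = []
    for part in parts:
        if part.isdecimal():
            number = int(part)
            key.append((0, 0) if number == 0 else (2, number))
        else:
            key.append((1, part))  # assumption: tags compare alphabetically
    return tuple(key)


versions = ["1.2.1.1", "1.2.A.0", "1.2.0", "1.2.1", "1.2.B.3"]
print(sorted(versions, key=sort_key))
# ['1.2.0', '1.2.A.0', '1.2.B.3', '1.2.1', '1.2.1.1']
```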

Tracked edits

The VCS is able to track edits without a commit. The idea behind this is that the developer shall be allowed to be lazy and forget about local commits. Commits usually signal some finished state. From my experience there seldom is a finished state locally. It's hard enough to commit to the next development version, so I usually forget about intermediate commits.

However, this is bad. It is bad because it loses intermediate file states, and you might lose valuable data if something bad happens (like a delete before commit). So the idea is that the VCS can do commits at regular intervals (or right after each save), just in case you need them.

Tracked edits are simple. They are just appended behind the normal version number.

That is, if you are working on 1.2.A.0, then the first tracked edit is 1.2.A.0.1, the second one is 1.2.A.0.2 and so on. After you jump from 1.2.A.0 to 1.2.A.1, the first local edit is 1.2.A.1.0.1 (and not 1.2.A.1.1 as you might have thought).

The rule is simple. If the VCS sees some version.0.x, it increments x. If it sees version.0, it must append .1; otherwise it must append .0.1.

As always, the highest available version is used for this rule.
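A minimal sketch of this numbering rule, with a made-up function name `next_tracked_edit`, could look like this:

```python
def next_tracked_edit(version):
    """Return the version number of the next automatic (tracked) edit."""
    parts = version.split(".")
    ends_in_zero_x = (
        len(parts) >= 2
        and parts[-2] == "0"
        and parts[-1].isdecimal()
        and parts[-1] != "0"
    )
    if ends_in_zero_x:
        # version.0.x: increment x
        return ".".join(parts[:-1] + [str(int(parts[-1]) + 1)])
    if parts[-1] == "0":
        # version.0: append .1
        return version + ".1"
    # anything else: append .0.1
    return version + ".0.1"


print(next_tracked_edit("1.2.A.0"))    # 1.2.A.0.1
print(next_tracked_edit("1.2.A.0.1"))  # 1.2.A.0.2
print(next_tracked_edit("1.2.A.1"))    # 1.2.A.1.0.1
```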

Tracked edits are important in order not to lose any data. The idea is that tracked edits are automatic. They are also done before a delete or similar operations.

So if you edit a file, regardless where you got it, this change will first enter the repository before you alter or replace it. This must be always done, so there shall not be any way to override this feature, even if you want to.

Tracked edits are ignored when you issue updates or merges if you forget to commit first.

Global repositories

In normal circumstances you work solely with your local repository. From this repository you do commits to some external repository. This commit, by default, skips all intermediate versions in your local repository (these are still kept in your local repos, but they do not make it into the global one). However if you want you are able to "sync" your local edits to the global repository, too.

If you lack a local repository, you can use the global repository as your local one. This is convenient, as there shall be no difference between local and global repositories.

Committing to another repository therefore is nothing special. It is just a sync process between your local and the global repository. This can be done on the level above again. The hierarchy can be as complex as you like.

Multiple global repositories

If you want to manage more than one global repository, for example you have two repositories sharing some common codebase, there must be some way to track all the versions, as the versions can differ between repositories.

For this, each repository gets a unique name (like the developers have), and this name is then prepended to the version as seen from the other repository.

So if you have repositories r1 and r2, then version 1.2 in r2 will be seen as r2.1.2 in repository r1. r1 might have version 1.3 for the same file.

This mainly is to track changes. If you commit version 1.15 from r1 to r2, then the tools can find out that 1.15 is based on 1.2 in r2 and therefore commit version 1.3 into r2. If there is a 1.3 already, this version would become 1.2.1, of course.

Automatic merges

If you commit a file whose contents are already known, identically, under another version, these two versions get merged.

So to resolve a conflict you can do one of the following:

  • Run the merger tool and merge another version into your local one. This will place the conflict lines in the text and record the merged versions, such that you can commit again.
  • Check out the other file, too, and edit both such that they are identical. Then check in both again, separately. This will merge, too, as both files then are the same. This is the recommended way for binary files (merger tools only work for text files).
  • Or check out the right one and commit it to the other version with the conflict. That is the easy case, when you see that only one of the versions shall survive.
Automatic merges are extremely important.

Please note that this only works for the same file name. It would not make sense to merge different files, as empty files are not uncommon and must not be merged.
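A minimal sketch of such a merge by identical contents follows. The class `FileHistory` is hypothetical, and SHA-256 is just one possible way to detect identical contents.

```python
import hashlib


class FileHistory:
    """Versions of ONE file name; identical contents share one entry."""

    def __init__(self):
        # content digest -> set of version strings carrying that content
        self.versions_by_digest = {}

    def commit(self, version, content):
        """Record `content` under `version`; return all versions it now has."""
        digest = hashlib.sha256(content).hexdigest()
        versions = self.versions_by_digest.setdefault(digest, set())
        versions.add(version)
        return sorted(versions)


history = FileHistory()
history.commit("1.2.2", b"merged text\n")
print(history.commit("1.2.1.1.B.1", b"merged text\n"))
# ['1.2.1.1.B.1', '1.2.2']
```

Keeping one `FileHistory` per file name reflects the rule that different files are never merged, even if both happen to be empty.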

Why this all?

For me there must not be any difference between a Wiki, a CMS and a VCS. The only differences are who is allowed to edit, what you want to publish, and what tools exist to access it.

There really is no point in not integrating all three into one single thing. A CMS only has some component to define staging more comfortably, while a Wiki has some nifty output formatting and allows all people to edit. However, internally the same basic engine could be used.

Static pages vs. dynamic server

The idea behind the VCS is that a commit automatically triggers some checkouts to other paths. That is, if you check in a version which affects some HEAD, this HEAD is checked out to its path. This shall be built into the tools as well.

This way you would not need any dynamic server. You can just have the repository on it, globally accessible or not, and checkout the versions you need as static pages. The checkout can be a render process.

In fact, rendering is nothing bad for a VCS. In CVS you have some sequences (keywords such as $Id$) which are automatically replaced by something else. This basically was a good feature and shall be kept. Why not extend it, such that you can write some "checkout scripts" which alter the data in some way?

You can then provide different "filters" to the checkout, like views. One of the filters might render Wiki or BBCODE style text into HTML phrases, and another one might apply a template to files, and these templates then might fill in header, footer, menubars or even other templates.
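Such checkout filters could be sketched as plain functions applied one after another. Everything here is hypothetical: the names `wiki_to_html`, `apply_template` and `checkout`, the '''bold''' markup, and the {body} placeholder are only stand-ins for real filters.

```python
import re


def wiki_to_html(text):
    """Toy filter: turn '''bold''' wiki markup into <b>bold</b>."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)


def apply_template(template):
    """Build a filter that places the text into a page template."""
    def render(text):
        return template.replace("{body}", text)
    return render


def checkout(content, filters):
    """Run the stored content through each filter in order."""
    for apply_filter in filters:
        content = apply_filter(content)
    return content


page = checkout(
    "This is '''important'''.",
    [wiki_to_html, apply_template("<html><body>{body}</body></html>")],
)
print(page)  # <html><body>This is <b>important</b>.</body></html>
```

A printable version would then simply be another filter list applied to the same stored content.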

All these files could be fed from the VCS, too, such that you can create your websites with it. Another checkout then can provide the printable version. And a third checkout can provide an archive of deleted pages and so on.

The important thing for me is, that you do not need to keep everything on the web server. The web server can have a local repository, or not. Or it can be the global repository in case you have some people contributing.

With the hierarchical organization of repositories you can also easily manage all demands, like providing a public Wiki and incorporating the best things found there into your intranet. Or the other way round: have some intranet pages and commit the ones which are allowed to go public to the external web server.

As intermediate edits are not replicated, no information leaks. However you still can keep a complete history with your VCS.

Repository considerations

To provide the cryptographic background, all files must be self-authenticating. That is, the filename must be the cryptographic hash of the contents. This is like GIT.

However, I like the way CVS works, as you keep all the history within the file itself. The two goals seem to contradict each other, but they do not.

If the existing file contents never change and the file is only appended to (grow-only files), you can still calculate the cryptographic checksum over a leading portion of the file. All you need is to include the length of that portion in the sum.
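A minimal sketch of such a checksum over a leading portion of a grow-only file is shown below. The function name `prefix_checksum`, the choice of SHA-256, and the 8-byte encoding of the length are all assumptions made for illustration.

```python
import hashlib


def prefix_checksum(data, length):
    """Checksum of the first `length` bytes, with the length mixed in."""
    if length > len(data):
        raise ValueError("file is shorter than the requested portion")
    digest = hashlib.sha256()
    digest.update(length.to_bytes(8, "big"))  # assumption: 8-byte length field
    digest.update(data[:length])
    return digest.hexdigest()


history_file = b"revision 1\n"
name_v1 = prefix_checksum(history_file, len(history_file))

grown = history_file + b"revision 2\n"  # the file only ever grows
name_v2 = prefix_checksum(grown, len(grown))

# The old name can still be verified against the grown file.
print(prefix_checksum(grown, len(history_file)) == name_v1)  # True
print(name_v1 != name_v2)                                    # True
```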

A file can only have one name. You can create hardlinks, but hardlinks are inconvenient, as they are limited in number. The solution is easy: you need some dictionary which keeps track of the old names of a file. If a file is not found, the dictionary is searched instead.

Note that the dictionary can be rebuilt from the files alone, by calculating the checksums for all possible file lengths. This might look inconvenient, but it isn't, if some care is taken when the files are created. You can read the files and find "markers" that say where to create checksums, then stop there temporarily and enter the checksum into the dictionary.
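Rebuilding the dictionary from markers could be sketched like this. Using the end of each line as the marker is purely an assumption of this sketch, as are the names `checksum` and `rebuild_dictionary` and the use of SHA-256.

```python
import hashlib


def checksum(data):
    """Checksum over a leading portion; the length is part of the sum."""
    return hashlib.sha256(len(data).to_bytes(8, "big") + data).hexdigest()


def rebuild_dictionary(files, marker=b"\n"):
    """Map every checksum taken at a marker to (file name, portion length).

    `files` maps the current file name to the file's bytes. The marker
    (here: end of line) is an assumption of this sketch.
    """
    dictionary = {}
    for name, data in files.items():
        position = data.find(marker)
        while position != -1:
            end = position + len(marker)
            dictionary[checksum(data[:end])] = (name, end)
            position = data.find(marker, end)
    return dictionary


files = {"current-name": b"revision 1\nrevision 2\n"}
dictionary = rebuild_dictionary(files)
old_name = checksum(b"revision 1\n")
print(dictionary[old_name])  # ('current-name', 11)
```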

Another thing is that you should not checksum the files themselves, but the contents the files represent. To explain: you have a text file and its cryptographic checksum. Now you ZIP the file. The representation of the file has changed, but after unzipping, the cryptographic checksum still is the same. However, it would differ if you calculated it over the ZIPped data before unzipping.
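This can be tried out in a few lines. The sketch uses zlib compression as a stand-in for ZIP and SHA-256 as the checksum; both choices are mine.

```python
import hashlib
import zlib

text = b"some file contents\n" * 100
packed = zlib.compress(text)  # a different representation of the same contents

content_sum = hashlib.sha256(text).hexdigest()
unpacked_sum = hashlib.sha256(zlib.decompress(packed)).hexdigest()
packed_sum = hashlib.sha256(packed).hexdigest()

print(content_sum == unpacked_sum)  # True: same contents
print(content_sum == packed_sum)    # False: different representation
```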

This way you can even "pack" several files into one file. You need not stick to fixed rules like "all history of a file in one file" or "different files, different history files". The dictionary can handle all these cases.

This way you can reorganize the contents of the repository as you like! As long as it still contains all the old data, you will be able to access it.

Cryptographic nature

GIT has it, and I like it. One single checksum authenticates a complete tree of files.

This is done as follows:

  • The checksum defines the dictionary, which contains all the file versions in question.
  • The file versions themselves contain the checksum(s) of their predecessor(s). This way the complete history of the data unfolds by iteration (in case of branches it might recurse a little bit, as you must follow more than one path).
This is definitely the way to go. Being able to track histories exactly is most important to a VCS. It also allows you to see if some unpredictable error (like a harddisk failure) has happened and altered some bits.
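How a history unfolds from one checksum can be sketched as follows. The record layout (JSON with "content" and "parents") and the names `store` and `history` are assumptions for illustration only; this is not how GIT stores its objects.

```python
import hashlib
import json


def store(objects, content, parents):
    """Store a version that names its predecessors by checksum."""
    record = json.dumps(
        {"content": content, "parents": sorted(parents)}, sort_keys=True
    ).encode()
    name = hashlib.sha256(record).hexdigest()
    objects[name] = record
    return name


def history(objects, name):
    """Yield every version reachable from `name`, verifying each checksum."""
    pending, seen = [name], set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        record = objects[current]
        if hashlib.sha256(record).hexdigest() != current:
            raise ValueError("corrupted object " + current)
        data = json.loads(record)
        yield data["content"]
        pending.extend(data["parents"])  # branches: follow every path


objects = {}
first = store(objects, "v1", [])
left = store(objects, "v2 by A", [first])
right = store(objects, "v2 by B", [first])
merged = store(objects, "merge", [left, right])
print(sorted(history(objects, merged)))  # ['merge', 'v1', 'v2 by A', 'v2 by B']
```

A single flipped bit in any stored record changes its checksum, so walking the history also detects damage.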

Mixing of file trees

If done right, there are only two operations needed when downloading data:

  • Adding new files.
  • Expanding already known files.
Files either spring into existence (if a new file is tracked) or their contents get appended to. This way you can quickly download all updates from a repository.

In your local repository, you only keep the "deltas" to the downloaded files. These can be kept separate from the "global repository cache". With a little care a "wget --mirror" is enough to download a repository. So, besides your webserver (which you always have), you do not need any additional software to give people full access to your repository.

Also, the "commit" can be done via email. People can send their local repository files directly to you. All you have to do to add these updates is to "mix" the file trees, that is, just add the files.
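Mixing file trees can be sketched as a plain union of files named by their checksum. This is a simplification: it uses whole-file SHA-256 names and in-memory dictionaries instead of directories, and it leaves out the "expanding already known files" case. The names `mix` and `named` are made up.

```python
import hashlib


def mix(*trees):
    """Merge content-addressed file trees by simply adding files.

    Each tree maps a file name (the checksum of its contents) to bytes.
    Files whose name does not match their contents are rejected.
    """
    combined = {}
    for tree in trees:
        for name, data in tree.items():
            if hashlib.sha256(data).hexdigest() != name:
                raise ValueError("checksum mismatch for " + name)
            combined[name] = data  # same name implies same contents
    return combined


def named(data):
    """Return (name, contents) for a file named by its checksum."""
    return hashlib.sha256(data).hexdigest(), data


mine = dict([named(b"file one"), named(b"file two")])
mailed_in = dict([named(b"file two"), named(b"file three")])
print(len(mix(mine, mailed_in)))  # 3
```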

There are only some files to regenerate, like the topmost directory entry (which tells the root to start with) or the intermediate directories. However all these files can be missing, too, as you can re-create them locally as well.

The important part then is:

Just hand out some unorganized bunch of files to people, and they are still able to use them. They can tell whether something is missing, and what. And they can reorganize the data on the fly to improve what you provided.

That is what counts in my eyes.

And to make it perfect, all files shall start with a Rosetta stone. That is a short preamble which describes the nature of the file and explains how to create tools to read it. If you have ever seen dds.c, such a goal is not impossible (if you still don't think it's feasible, scroll to the dds.c entry of 1991).

-Tino, 2007-07-11