Some Issues with GNU arch

This memo collects some issues with GNU arch: apparent design defects, hard-to-overcome deficiencies in arch's current implementation (called tla), and concerns about its general design direction. The memo is still a work in progress and is, of course, entirely subjective.

Design Defects
Implementation Issues
Directions
Some Ideas

Design Defects

The changeset format is defined relative to GNU patch and GNU tar. These data formats are still somewhat in flux.
The changeset format does not handle binaries efficiently, nor certain text files (e.g. XML files that were not created with a text editor and formatted for readability).
In essence, an archive consists of concatenated changesets, which are directly exposed in a file-based interface. This makes it very complex to address issues with the changeset format itself, and the archive interpretation might change when new versions of patch and tar are installed.
Arch does not implement a distributed system. For example, its archive replication does not transparently handle write operations.
There is no integrated mechanism to atomically commit related changesets to two branches (even if these branches are contained in the same archive).
Categories, branches, and versions are not orthogonal at all and add unnecessary complexity. Future features cannot differentiate between them because they are used very inconsistently in existing archives.
Automatically subjecting files to revision control based on regular expressions is very hard for users to deal with. While it is an interesting experiment, it does not increase usability.
GNU arch does not support a centralized development model without a single, designated committer.
Branch creation is not versioned. Branches cannot be deleted. This means that branches stay around forever, even after development on them has finished. (This could be worked around in the implementation by hiding branches, but it doesn't seem to be the right thing to do.)

Please note that while these issues are likely too fundamental to be fixed in GNU arch without breaking backwards compatibility, the general design of arch is quite good, and probably the best among the advanced free version control systems which are close to production quality. Other systems (including plain Subversion) have not yet evolved far enough to judge how their basic features (e.g. inventory management) and advanced features (e.g. merge tracking) work together in practice.

Implementation Issues

The access methods for remote archives are subject to a lot of round trips. Therefore, archive replication using tla itself is very slow.
The archive format optimizes for access to early versions, not most recent ones as one would expect. (Once the archive format is no longer exposed directly, this becomes an implementation issue, not a design issue.)
The caches which compensate for the previously mentioned issues are not expired by tla. (This includes revision libraries and, apparently, pristine copies stored inside a checked-out copy of a revision.)
Changesets are tar files. They cannot be posted easily to a mailing list for approval and commit; metadata tends to get lost.
In practice, tla requires four inodes per file in a checked-out project tree: one for the file, one for the file ID, and a pristine copy of each. This gratuitous use of inodes can cause problems.
A checked-out revision of a branch contains at least one inode for each revision that was ever committed in the history of the branch. Long-running branches also result in huge directories with lots of entries.
The inventory code can produce inconsistent results. For example, explicit tagging overrides classification based on regular expressions in some (but not all) parts of tla.
The inventory constructor, project tree checker, and changeset creation code are not fully synchronized. For example, it is possible to commit a changeset based on an inconsistent inventory, and the changeset is then inconsistent as well.
Branch creation is very cheap (a few inodes in the archive), but a long-running branch to which changes in a mainline branch are periodically merged replicates all changes on mainline. This means that branch maintenance costs are controlled by the amount of development on the branch and the development on the mainline, and branches are no longer very cheap in total. (This is an implementation issue because unlike other systems, merge tracking does not depend on the way changesets are combined in the archive. This is actually a very strong point of GNU arch.)

The author of this memo thinks that these issues are not ordinary bugs. Workarounds may exist (faster machines and networks, cached revisions, non-traditional file systems, hard link farms, "simply don't do that, then"). However, these issues have not been addressed for quite some time now and, though still rather bothersome, some of them will probably remain forever. (Other issues, such as some hairy error messages and the lack of a file-specific revert command, are considered transitional problems.)

Directions

The GNU arch developers believe that it's easy for all developers participating in a project to publish a repository. However, this requires write access to webspace without file name and directory layout limitations. Such resources are often not available in a corporate environment, or to those who can only afford cheap Internet access. Genuine support for centralized development is required, but GNU arch is unlikely to provide it.
The tendency to accept increased running time and disk usage in exchange for lower code complexity was fine when tla got started, but today it results in performance that does not compare favorably with optimized competitors. In addition, disk seek times have not improved at a significant rate, and the huge number of stat operations performed by tla will remain a bottleneck even when developers move to larger machines.
The developers seem to underestimate the need for a robust user interface with clear error messages and transaction semantics (i.e. a command either fails and changes nothing, or it completes successfully). Even non-programmers use revision control systems, and most programmers are primarily interested in their own project and not necessarily in tla internals.
tla input and output formats are currently deliberately incompatible with the rest of the GNU system. An internal file name encoding is externally exposed, and this encoding is not used anywhere else at the moment. The problem at hand is the possibility that file names might contain column or record separators. (The GNU way is to use the ASCII NUL character optionally, to deal with file names with spaces or control characters. tla does not support this.)
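The separator problem mentioned in the last item can be illustrated with a small sketch. It is not tla code: the function names are hypothetical, and it only shows the NUL-terminated record convention used by tools such as GNU find (-print0) and xargs (-0). NUL cannot occur in a file name, so no internal encoding is needed, whereas newline-separated output breaks as soon as a name contains a newline.

```python
def join_nul_records(names):
    """Serialize file names as NUL-terminated records (hypothetical helper)."""
    return b"".join(name + b"\0" for name in names)


def split_nul_records(data):
    """Parse NUL-terminated records back into a list of file names."""
    if not data:
        return []
    # A trailing NUL terminates the last record; drop the empty tail.
    if data.endswith(b"\0"):
        data = data[:-1]
    return data.split(b"\0")


# File names are byte strings; spaces, tabs and newlines are all legal.
names = [b"plain.txt", b"with space.txt", b"tab\tand\nnewline.txt"]

# NUL-terminated records survive a round trip unchanged.
round_trip = split_nul_records(join_nul_records(names))

# Newline-separated records do not: the third name is split in two.
broken = b"\n".join(names).split(b"\n")
```

Here round_trip equals the original list of three names, while broken contains four entries.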

Some Ideas

Redesign the changeset format, probably based on VCDIFF (RFC 3284). Unlike unified diffs (which are currently used by tla), VCDIFF deltas are one-way and cannot be reversed when only the delta itself is known. (This is not much of a problem, since tla uses changesets only in the forward direction most of the time.) Additionally, VCDIFF deltas only support exact application. Fuzzy application (i.e. the contents of the new source file for the delta are not identical to the source file contents the delta was computed against) requires access to the original source file of the data. This is a fundamental change in some respects, but many merge operations in tla already require the computation of source files.
Provide a human-readable changeset format with complete metadata. This format is intended for exchange of patches over mailing lists and should include unified diffs.
Do not expose the archive format, but use a changeset server which implements access control (and pipelining, to cut down the effects of network latency).
Project trees should not abuse the file system as a database. If a database is required, use a real one (such as BDB or SQLite), or CSV files containing multiple records, but not one file per record.
Use a file cache (with LRU logic) instead of revision libraries.
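The last idea, a cache with LRU logic, could look roughly like the following sketch. It is an illustration only and not part of tla: the class name, the size limit and the keys are assumptions, and the data is kept in memory, whereas a real replacement for revision libraries would keep files on disk. The point is merely that entries expire automatically once a size budget is exceeded, least recently used first.

```python
from collections import OrderedDict


class LRUFileCache:
    """Size-bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        """Return the cached contents, or None on a cache miss."""
        data = self._entries.get(key)
        if data is not None:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
        return data

    def put(self, key, data):
        """Store contents and evict old entries until the budget is met."""
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)


# Keys here are made-up (revision, path) pairs.
cache = LRUFileCache(max_bytes=10)
cache.put(("patch-1", "README"), b"1234")
cache.put(("patch-2", "README"), b"5678")
cache.get(("patch-1", "README"))            # touch patch-1, so patch-2 is now oldest
cache.put(("patch-3", "README"), b"90ab")   # exceeds the budget, evicts patch-2
```

After the last call, the patch-2 entry is gone while patch-1 and patch-3 remain. Unlike a revision library, nothing here has to be expired by hand.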

Revisions

2004-06-09: published

Florian Weimer