Opinion

HTTPmirror

I hate HTTP mirroring tools

Neither curl nor wget is suited to efficient HTTP mirroring. What I need is the following:

  • Automated checking if the destination was changed (Timestamp+Size)
  • Automated download of new data
  • Automated resumption of broken downloads from where they stopped
  • Automated rotation of the downloaded file (keeps history)
I have wanted this since at least 1992, but to this day there is no such tool. No tool seems able to fulfill all of the above goals together; every one of them fails at least two.
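The change check and the rotation can be sketched without any network code. The following Python sketch assumes the remote timestamp and size are already known (for example from the Last-Modified and Content-Length headers of a HEAD request). The function names `needs_download` and `rotate` and the numbered-suffix naming scheme are assumptions made for illustration, not taken from any existing tool.

```python
import os


def needs_download(local_path, remote_mtime, remote_size):
    """Return True if the local copy is missing or differs in timestamp or size."""
    try:
        st = os.stat(local_path)
    except FileNotFoundError:
        return True
    return int(st.st_mtime) != int(remote_mtime) or st.st_size != remote_size


def rotate(path, keep=5):
    """Rename path to path.1, shifting older copies up to path.<keep>.

    Only the copy that would exceed `keep` generations is deleted;
    the current file itself is renamed, never overwritten.
    """
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    # Shift path.(keep-1) ... path.1 up by one, highest number first.
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.rename(src, f"{path}.{n + 1}")
    if os.path.exists(path):
        os.rename(path, f"{path}.1")
```

A caller would run `rotate(dest)` only after `needs_download(...)` returned True and the new data had been fetched completely into a temporary file.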

Note that wget is nearly the tool, but one thing makes it unusable:

You cannot stop it from following query links to HTML pages when in --mirror mode. I have several cases where this transfers several times the data needed for a mirror (2.5 MB instead of 500 KB), which makes it counterproductive.

To stress it: There is a directory served by Apache. The directory has 10000 files of under 100 bytes each, which must be downloaded. Reading everything from scratch is about 20 MB. Reading everything via wget downloads 100 MB. Why? Apache hands out the directory listing 9(!) times at 10 MB each, plus the 10 MB of data downloaded from the directory.
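The repeated listings presumably come from the column-sorting links in the generated index, which differ from the plain listing only in their query string (for example ?C=N;O=D). A mirroring script could avoid them by discarding every link that carries a query component. A minimal Python sketch, assuming the listing is already available as HTML text; the sample listing and the name `file_links` are invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def file_links(index_html):
    """Return hrefs that look like plain files: no query, no directory."""
    parser = _LinkCollector()
    parser.feed(index_html)
    result = []
    for href in parser.hrefs:
        parts = urlsplit(href)
        if parts.query:
            continue  # sort links such as ?C=N;O=D
        if not parts.path or parts.path.endswith("/"):
            continue  # parent directory and subdirectories
        result.append(href)
    return result


# Invented sample in the style of a generated directory index.
SAMPLE = """
<a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a>
<a href="/parent/">Parent Directory</a>
<a href="a.txt">a.txt</a>
<a href="b.txt">b.txt</a>
"""
print(file_links(SAMPLE))  # ['a.txt', 'b.txt']
```

With such a filter the listing is fetched once, so the transfer stays at the listing plus the files themselves.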

Real Bullshit Bingo as the programmers of WGET did not think for a cent when implementing this crap.

Well, this probably means I have to write my own

It shall have the following properties:

  • Easy to use like
    tool url file
    where file can be omitted to write to stdout.
  • Extensible: It shall focus on HTTP/HTTPS but
    it shall be easy to extend to other data sources / URL schemes.
  • Easy to use: It shall give meaningful return values to scripts
    • Return 0 means: "New data available" (Success!)
    • Return 1 means: "Old data available" (Success!)
    • Return 10 means: "Something broke, but probably can be fixed by calling the tool again."
    • Return 20 means: "Something broke severely, the URL cannot be retrieved"
    • Return 30 means: "Something's wrong, need attention" (like filesystem full, setup corrupt, etc.)
  • Safe to use: This means it never overwrites a file (it renames the old one first).
  • Safe to interrupt at any time:
    It shall keep the data received up to the interruption, and no data shall be lost.
  • Safe to restart at any time:
    You can continue to download HTTP URLs or similar without thinking of anything. The restart shall be as efficient as possible. If resuming (reget) is not possible, the destination is replaced only if more data was fetched than before!
  • Easy to integrate:
    There shall be some portable way to query the download status by external tools. This probably means, the tool needs a SQLite database.
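One way the status could be made queryable is a single SQLite table keyed by URL. The schema below, including the table name `downloads`, its columns and the state names, is purely an assumption for illustration; the text above only says that a SQLite database is probably needed.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    url         TEXT PRIMARY KEY,
    dest        TEXT NOT NULL,
    bytes_done  INTEGER NOT NULL DEFAULT 0,
    bytes_total INTEGER,           -- NULL while the size is unknown
    state       TEXT NOT NULL,     -- e.g. 'running', 'partial', 'complete'
    updated     REAL NOT NULL      -- Unix timestamp of the last update
)
"""


def open_status(db_path):
    """Open (and if necessary create) the status database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record(conn, url, dest, bytes_done, bytes_total, state):
    """Insert or update the status row for one URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO downloads VALUES (?, ?, ?, ?, ?, ?)",
            (url, dest, bytes_done, bytes_total, state, time.time()),
        )


def status(conn, url):
    """Return (bytes_done, bytes_total, state) for a URL, or None if unknown."""
    return conn.execute(
        "SELECT bytes_done, bytes_total, state FROM downloads WHERE url = ?",
        (url,),
    ).fetchone()
```

Because the database is an ordinary file, any external tool with SQLite bindings could read the same table without talking to the downloader itself.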
All I need is a "just do it" utility. It need not have a "crawler" feature, as you can implement that more efficiently with some scripting around this tool.
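The return values listed above are what would make such scripting practical. As a sketch, a calling script could map them to actions like this. The command name `tool` is the placeholder used above, and the names `Exit`, `next_action` and `fetch` are hypothetical:

```python
import enum
import subprocess


class Exit(enum.IntEnum):
    NEW_DATA = 0      # new data available
    OLD_DATA = 1      # old data available, nothing changed
    RETRY = 10        # something broke, calling again may fix it
    UNREACHABLE = 20  # the URL cannot be retrieved
    ATTENTION = 30    # needs a human (filesystem full, setup corrupt, ...)


def next_action(code):
    """Translate a return value into what a calling script should do next."""
    if code == Exit.NEW_DATA:
        return "process"
    if code == Exit.OLD_DATA:
        return "skip"
    if code == Exit.RETRY:
        return "retry"
    if code == Exit.UNREACHABLE:
        return "give up"
    return "alert"  # 30 and anything unexpected


def fetch(url, dest, command="tool", max_tries=3):
    """Run the (hypothetical) tool, retrying only on the 'call again' code."""
    action = "alert"
    for _ in range(max_tries):
        code = subprocess.run([command, url, dest]).returncode
        action = next_action(code)
        if action != "retry":
            break
    return action
```

A crawler then reduces to a loop that calls `fetch` for each URL and parses the result only when the action is "process".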

But of course it shall have options to set the referrer etc. This includes "convenience" options to work around common problems (like sites which require IE or need cookies or a referrer).
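Such options ultimately come down to a handful of request headers, and resuming a broken download uses the same mechanism through a Range header. A rough Python sketch, in which the function name `request_headers` and its parameters are assumptions for illustration:

```python
def request_headers(referer=None, user_agent=None, cookies=None, resume_from=0):
    """Build the HTTP request headers for one download attempt."""
    headers = {}
    if referer:
        # HTTP spells the header name with a single "r".
        headers["Referer"] = referer
    if user_agent:
        headers["User-Agent"] = user_agent
    if cookies:
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    if resume_from > 0:
        # Ask only for the missing remainder. A server that honours this
        # answers 206 Partial Content; otherwise it sends the whole file.
        headers["Range"] = f"bytes={resume_from}-"
    return headers


# Example: pretend to be IE 6 and continue after the first 1024 bytes.
example = request_headers(
    user_agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    resume_from=1024,
)
```

The resulting dictionary can be passed as the `headers` argument of `urllib.request.Request`, or translated into the equivalent options of whatever HTTP library is used.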

Perhaps it can be based on libcurl, as this library is very powerful.

Similar things

Here are some similar things I did because of trouble with the "standard" tools:

Quickies: Download only

  • www.scylla-charybdis.com/download/accept-2.0.0.tar.gz as netcat etc. cannot connect to Unix sockets. It is able to interconnect two sockets which can be incoming or outgoing and can be Unix or TCP. For incoming sockets socklinger (see below) is perhaps better suited.
    The name is badly chosen: "accept" comes from the fact that it was first written to use the accept() system call, too.
  • www.scylla-charybdis.com/download/sslconnect-0.0.1.tar.gz An SSL connector, as OpenSSL s_client goes into a 100% CPU loop in case the destination uses shutdown() on the output (instead of terminating with an EOF condition).
  • www.scylla-charybdis.com/download/sslwrap.tgz SSLwrap capable of PAM authentication. I added a password_verify.c for this, which is public domain (only my source, not the rest). It is the minimalistic implementation to verify a password through PAM. It is bible sized, though, as PAM is horribly broken. PAM is implemented as if somebody tried hard to make something completely unusable by doing it in such a mad way that nobody else can ever use it. PAM succeeded in this task so well that the library interface can be considered SPAM (probably the name PAM comes from this). The downside is that it does not yet allow Unix domain sockets.

With page history and documentation

  • www.scylla-charybdis.com/tool.php?tool=dbm A tool to use DBM files from shell
  • www.scylla-charybdis.com/tool.php?tool=ptybuffer A replacement of expect and screen for services which require a PTY. It works with a Unix socket.
  • www.scylla-charybdis.com/tool.php?tool=socklinger A lingering wrapper to call shell scripts on socket connects. It supports Unix sockets, too, and more importantly, it limits the number of concurrently running scripts in a natural way. Lingering is needed in case the "accept" tool (see above) does not work correctly (some data is not transmitted, as the application goes away before the socket is fully flushed to the other side, in which case the kernel discards the data which was not transmitted).
Please note that, in the far future, I want to integrate SSLconnect, sslwrap, accept, ptybuffer, socklinger and probably others into one networking tool. And even the HTTP mirroring tools will be integrated as a side effect then.

-Tino, 2005-11-29