This deals with some topics not covered, or not covered fully, on the other RsyncingALargeFile pages.

See also:

(All of these pages exist / will exist on WikiLearn -- they may not all exist on twiki.org.)

See about these pages.

Rationale and Focus of these RsyncingALargeFile Pages

These RsyncingALargeFile pages are not intended to cover everything about rsync. The choice about what to include and what not to include depended on the following factors:

  • what I needed to do with rsync, and what I needed to learn that didn't seem easy to find or learn from the man page, the rsync web site, or other readily available resources

  • what I thought I needed to serve as a reference to remember how I used rsync the last time to be able to quickly refresh my memory the next time

There was a time when I thought it was important to separate discussion of using rsync for a (single) large file from a more general discussion of rsync.

Then I forgot what those reasons were and realized that the issues of dealing with a large file are not unique to dealing with a single large file, but are applicable in dealing with one or more large files. (duh!)

Now I remember one of the original issues:

    • More than one large file increases the CRC calculation time and increases the possibility of getting the "unexpected EOF on read-timeout" error.

Thus, it is not inappropriate to have the wording (and titles) on these pages focus on a single large file. Someday I (or someone) will add pages on using rsync for other purposes.

Communication Protocols

I should discuss communication protocols to some extent. I don't know enough to tell you much -- here's what I think I know:

  • rsync can use three different means of communication, which I will describe (possibly inaccurately) as:
    • "over" a remote shell such as rsh
    • "over" ssh
    • "native" rsync to rsync, where the client talks to an rsync daemon over a TCP socket (port 873 by default)

The important things about this are:
  • the command line differs depending on which you use (the changes can be subtle)
  • AFAIK, I've used only the last method. This requires the double colon (host::module/path) that you see in the command line; the remote-shell methods use a single colon.
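
As a rough illustration of how the command line changes, here is a Python sketch that builds (but does not run) the rsync argument list for each transport. The host name, module name, and paths are made up for the example; the single-colon, double-colon, and rsync:// forms are standard rsync syntax.

```python
def rsync_argv(transport, host, remote, local):
    """Build an rsync "pull" command for the given transport (not executed)."""
    if transport == "ssh":
        # Remote shell: single colon, and -e names the shell to use.
        return ["rsync", "-e", "ssh", f"{host}:{remote}", local]
    if transport == "daemon":
        # Native rsync daemon: double colon; the first path component
        # is a module name configured on the server.
        return ["rsync", f"{host}::{remote}", local]
    if transport == "daemon-url":
        # Same daemon transport, written as a URL.
        return ["rsync", f"rsync://{host}/{remote}", local]
    raise ValueError(f"unknown transport: {transport}")


# Hypothetical host, module, and file names.
for t in ("ssh", "daemon", "daemon-url"):
    print(" ".join(rsync_argv(t, "mirror.example.org", "isos/big.iso", "big.iso")))
```

The only difference between the ssh and daemon forms is one colon and the -e option, which is why the changes are easy to miss.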

Partial and Compare-Dest: The Problem

The --partial and, even more so, the --compare-dest options are more useful for directories (containing many files) than for single files. --partial in particular creates a trap for unwary users rsyncing a single large file: if the rsync process is interrupted, the original file is deleted and replaced with the truncated, partially rsync'd file. When rsync is restarted, the remainder of the transfer (after the recheck of the already rsync'd portion of the file) degrades to the equivalent of a download, because the old data beyond the truncation point is gone and rsync has no blocks left to match against to perform the rsync magic.

Rsync's Normal Behavior

I should add a few more steps for better understanding:

  • The first thing rsync does is compare the local and remote copies of the file to see if they are different. Normally it does this just by comparing the file sizes and timestamps. If the sizes and timestamps are the same but the contents differ, you can either change the timestamp on the local (i.e., the non-canonical) file or specify the -c option. Specifying the -c option is not recommended: rsync will engage in a particularly long calculation, reading every file in full on both ends to compute and compare checksums.

  • Next (assuming rsync recognizes the files as different), rsync calculates the checksums on each block of the local and remote files in order to determine which blocks have to be transmitted.

Note: During these first two steps, rsync (at least the version I used at the time this was written) does not show any indication of progress. When blocks start to be exchanged (next step), a progress indicator is displayed.
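
To make the block comparison a little more concrete, here is a much simplified Python sketch. It is not rsync's actual algorithm: real rsync pairs a rolling weak checksum with a strong checksum and can match a block at any byte offset, while this sketch only compares aligned fixed-size blocks. The block size and the sample data are invented for the example.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow


def block_sums(data, block=BLOCK):
    """Checksum each fixed-size block of the old (local) file."""
    return {hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}


def blocks_to_send(new, old, block=BLOCK):
    """Return the blocks of `new` that have no match anywhere in `old`."""
    known = block_sums(old, block)
    missing = []
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        if hashlib.sha256(chunk).hexdigest() not in known:
            missing.append(chunk)
    return missing


old = b"AAAABBBBCCCCDDDD"   # local, out-of-date copy
new = b"AAAAXXXXCCCCDDDD"   # remote, up-to-date copy

print(blocks_to_send(new, old))      # [b'XXXX']
# With only a truncated copy of the new file to work from,
# everything past the truncation point has to be sent.
print(blocks_to_send(new, new[:8]))  # [b'CCCC', b'DDDD']
```

The second call is the single-large-file trap in miniature: the fewer old blocks there are to match against, the more of the file has to travel over the wire.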

Then:

The normal behavior of rsync is to construct a hidden copy of each file (in the same directory as the original file) as the rsync synchronization proceeds. When the synchronization of a file is complete, the original file is deleted and the newly constructed synchronized file is unhidden (renamed into place). By default this happens as each file finishes, not after all files; the --delay-updates option defers it until the end of the transfer.

So far, so good.
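
The "hidden copy, then swap" idea can be sketched in a few lines of Python. This only illustrates the general pattern and is not rsync's code; the function name and the temporary file naming are invented for the example.

```python
import os
import tempfile


def replace_via_hidden_copy(path, new_bytes):
    """Write new_bytes to a hidden temp file beside `path`, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # The leading dot makes the temporary file "hidden" on Unix-like systems.
    fd, tmp = tempfile.mkstemp(prefix=".", suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_bytes)
        os.replace(tmp, path)  # the original disappears only at this point
    except BaseException:
        # Interrupted: discard the hidden copy, leaving the original untouched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because the original is only replaced by the final rename, an interruption at any earlier point leaves it exactly as it was.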

Partial

If the rsync process is interrupted before completion, the default behavior of rsync is to restore the local system to its original condition by deleting the hidden updated files.

This is safe, but does not get you any closer to the goal.

If your rsync process is a long one, and has a reasonable chance of being interrupted before completion, the --partial option can be helpful, but can cause problems, so keep reading...

The --partial option replaces those original files that have been completely or partially rsync'd with the hidden complete or partial rsync'd files.

This can be useful behavior, but clearly, some decisions are required.

  • In some circumstances, having some files updated and others not updated is worse than not having any updated files at all. Other times it may be an acceptable temporary situation, correctable by running rsync again.

  • If the files are small, having one (or a few) complete but out-of-date files replaced by an up-to-date partial file may be an acceptable temporary situation, again, correctable by restarting rsync. But, when rsync goes to work on those partial files, it can only do the rsync magic on the portion of the file that exists -- it degrades to a full download for the remaining portion. Not a big deal for small files, but for a large file (in my case a 650 MB file that normally takes 65 hours to download), this behavior is not nice.
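
To put rough numbers on that, here is a small back-of-the-envelope calculation in Python. It uses the 650 MB and 65 hour figures from above, assumes a constant download rate, and ignores the time rsync spends rechecking the part it already has. The interruption points are hypothetical.

```python
SIZE_MB = 650      # file size from the text
FULL_HOURS = 65    # time for a full download, from the text


def hours_to_redownload(fraction_done):
    """Hours of plain downloading left if the local file is truncated at fraction_done."""
    rate = SIZE_MB / FULL_HOURS                  # MB per hour (10 here)
    remaining_mb = SIZE_MB * (1 - fraction_done)
    return remaining_mb / rate


# Hypothetical interruption points.
for done in (0.25, 0.5, 0.9):
    hours = hours_to_redownload(done)
    print(f"interrupted at {done:.0%}: roughly {hours:.1f} hours of download remain")
```

An interruption halfway through, for instance, leaves more than a day of ordinary downloading, where a normal rsync against the intact original would only have had to fetch the blocks that actually changed.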

A workaround in this case is to:

  • Keep a(n extra) backup copy of the original file in a safe location.
  • Use the --partial option.
  • If rsync is interrupted, manually create a new file for rsync to work from, the first part consisting of the partially rsync'd file, and the second part created from the tail end of the original file.
  • Just like the first time, store a(n extra) backup copy of this new "original" file in a safe location (in addition to the working copy in the "working" directory).
  • Now you can restart rsync, and it will basically proceed from where it left off. (Actually, it will recheck the first part of the file, but it will (or should) find that the first part is updated and then move on to the slower work of rsyncing the previously unsynched portions of the file.)
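
The manual splice in the third step could be done with a short Python script along these lines. This is a sketch only: the function name and file names are hypothetical, and it assumes the tail of the original file is still a useful source of matching blocks past the point where the partial file ends.

```python
import shutil


def splice_basis(partial_path, original_path, out_path):
    """Build a new working file: the partial file, then the original's tail."""
    with open(out_path, "wb") as out:
        # First part: everything rsync managed to transfer.
        with open(partial_path, "rb") as part:
            shutil.copyfileobj(part, out)
        offset = out.tell()  # length of the partial file
        # Second part: the original file from that offset onward.
        with open(original_path, "rb") as orig:
            orig.seek(offset)
            shutil.copyfileobj(orig, out)  # copies nothing if the original is shorter
    return offset
```

A call might look like splice_basis("big.iso", "backup/big.iso", "big.iso.new"), after which big.iso.new is renamed to big.iso in the working directory and copied to the safe location before rsync is restarted.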

The --partial option is useful in that scenario (depending on your goal) in that it takes all the files that have been successfully rsync'd and uses them to replace the original files, deleting the original files and unhiding the successfully rsync'd files. Now the directory consists of some updated files and some original files.

Compare-dest

The --compare-dest option allows you to specify an alternate directory containing an alternate set of the (original) files to be rsync'd.

My problem was that I misunderstood the purpose of --compare-dest. I was hoping that by specifying a directory containing another copy of the original corrupt file, I could overcome the problem described in the Partial section above in a different way. I hoped that if the transfer was interrupted with the --partial option in effect, and then rsync was restarted, rsync could look for matching blocks in both the truncated partially transferred file and the original file. (In retrospect, it seems like a rather naive hope.) Anyway, I wrote to the rsync mail list, and somebody there confirmed that rsync does not work that way. (IIRC, they mentioned that they either had once or could consider such an enhancement. They might have even offered me a patch, but I wasn't ready to try patching and then compiling rsync or anything else.)

--compare-dest is probably useful in a variety of situations. One example (I'm not sure it's a good one, and I think there are other ways to solve the same problem) is when you have a set of live working files supporting some process and you want to rsync them, but don't want to interrupt the process for the entire duration of the rsync operation. In this case you can copy the live files to an alternate directory, then run rsync specifying this directory. Now rsync uses the files in that alternate directory as the "fodder" for the rsync process.

Of course, unless some other option is specified, the rsync files are still constructed as hidden files in the target directory. Anyway, when the rsync process is done you can then shut down the process briefly and quickly switch over to the recently rsync'd files. (This is a bad example -- I've glossed over a few points -- it couldn't really work exactly the way I've described it. Maybe somebody else can suggest a better example.)

Contributors

  • RandyKramer - 2001-04-13 (created), 2001-08-22 (transferred from swiki), 2001-09-02 (rewritten)
Topic revision: r8 - 2015-09-25 - RandyKramer
 