Sometime in April, 2001 I used rsync to "repair" the md5sum on a 650 MB file I had downloaded, without repeating the download.

Rsync can quickly synchronize (update) files or directories between machines by transmitting only the differences. This page focuses on using rsync for a single large file.

I had some trouble learning to use rsync for this purpose. Some of those difficulties are mentioned on this page; there is more discussion of some of them on RsyncingALargeFileDiscussion.

Warning

Keep an extra backup copy of your "corrupt" file. Under certain circumstances, rsync will wipe it out before completing the update. In that case you would have to repeat the download of the file if you haven't saved an extra backup. (Those circumstances relate to the --partial option, see #Approach_If_Rsync_is_Interrupted.)
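
A minimal sketch of taking that backup and confirming it is byte-identical before rsync is started. It assumes ordinary cp and cmp; the names corrupt.iso and backup/ are hypothetical, and a tiny stand-in file is created in a scratch directory so the commands can be tried safely.

```shell
# Rehearsal in a scratch directory; corrupt.iso stands in for your downloaded file.
work=$(mktemp -d)
printf 'pretend this is a 650 MB iso\n' > "$work/corrupt.iso"

# Keep the backup outside the directory rsync will work in.
mkdir "$work/backup"
cp -p "$work/corrupt.iso" "$work/backup/corrupt.iso.orig"

# cmp -s prints nothing and exits 0 only if the two files are byte-identical.
if cmp -s "$work/corrupt.iso" "$work/backup/corrupt.iso.orig"; then
    backup_ok=yes
else
    backup_ok=no
fi
echo "backup verified: $backup_ok"
```

For a real 650 MB file the cmp step takes a little while, but it is much cheaper than repeating the download.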

Recommended Normal Approach

This is the command line I recommend. You must change the server, module, and file names appropriately for your situation:

rsync -a -vvv --progress carroll.cac.psu.edu::mandrake-iso/mandrakefreq-i586-20010316.iso  mandrakefreq-i586-20010316.iso 

20010901: deleted "--partial" and "--compare-dest=/mnt/cdrom" options from above -- for single large files they are not usually useful and are potentially dangerous.

Here is the command line in a "generic" form:

rsync -a -vvv --progress <rsync server host name>::<path and file name at server> <local file name>

Other Requirements:

  • This command may need to be run as root. (I did it as root, and never tried it as a normal user -- if you try it as a normal user you will need to make sure you have appropriate file and directory permissions.)
  • This command must be run from the directory where the (local) corrupted file exists. (I've never tried it from anywhere else. I suspect there are ways to run it from another directory, but it may not be straightforward, because rsync does certain things in the current directory by default, such as storing the hidden copy of the rsync'd file as it is constructed.)
  • The local partition which contains the corrupted file must have free space available equal in size to the file you are attempting to rsync (and a little extra just to be safe). (This is because it creates a hidden copy of the file in the current directory during the rsync process.) Don't forget to store an extra copy of the original "corrupted" file somewhere else, see #Warning.
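
The free-space requirement above can be checked before starting. This is only a sketch, assuming a POSIX df and awk; corrupt.iso is a hypothetical name (a small stand-in is created so the commands run as written), and the 10% safety margin is an arbitrary choice.

```shell
# Scratch directory with a small stand-in for the corrupted iso.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'stand-in for the corrupted iso\n' > corrupt.iso

# Size of the local file in bytes (tr strips the padding some wc versions add).
file_bytes=$(wc -c < corrupt.iso | tr -d ' ')

# Free space on the partition holding the current directory, in 1024-byte blocks.
# df -P forces one line per filesystem; column 4 is the "Available" figure.
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')

# Space needed: the file size plus a 10% margin, rounded up to whole KB.
need_kb=$(( (file_bytes + file_bytes / 10) / 1024 + 1 ))

if [ "$avail_kb" -ge "$need_kb" ]; then
    space_ok=yes
else
    space_ok=no
fi
echo "need ${need_kb} KB, have ${avail_kb} KB free: $space_ok"
```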

Other notes

  • If rsync is consistently interrupted before it can complete the rsync process for the entire file, you will need to use the --partial option. See #Approach_If_Rsync_is_Interrupted, below, as there are additional pitfalls to beware of and additional steps required.

  • One of my early errors was not knowing what rsync meant by a module (what I would call a directory), and then specifying the wrong starting directory (module) -- pub instead of mandrake-iso. I forget all the details, but, IIRC, the path to the file on the server must be the path relative to the rsync server. (And I don't know exactly what I mean by that -- I don't know if that is the pwd for the rsync server, or some sort of "home directory" specified in a configuration file for the rsync server. I just know I got it wrong at first. IIRC, I eventually got it right by somehow listing the files in the rsync home directory, then exploring paths from that directory until I got to the file I was looking for. I forget the details of how I did that.)

UPDATE: Bill Kenworthy and Ron Stodden provided some better information, quoted here until I refactor it into the text.

Bill Kenworthy provided the following:

Ok, the easiest is to use rsync like an ls command and navigate to the directory you want if you do not know the path up front:

e.g., "rsync rsync://ftp.uninett.no/" will list all the top level modules (directories) available. Note the last "/" and no target directory - without this nothing is printed! Next add the directory you next want to list: "rsync rsync://ftp.uninett.no/Mandrake/" and so on until you build the path and locate the file(s) you want.

To download the whole 8.1 updates (mirror the updates), use "rsync -Pcavub --bwlimit=3 --stats --exclude=kernel22\* --exclude=kernel-enterprise\* --exclude=kernel-smp\* --exclude=kernel-linus\* --exclude=kernel-pcmcia-cs\* --delete-excluded --delete rsync://ftp.uninett.no/Mandrake/Mandrake/updates/8.1/RPMS/\* ." Note that I had to escape the wildcard - rsync gives the usual unix (un)informative error otherwise. Kernels I do manually so I dont accidentally destroy the system, the bwlimit is coz I use a modem and it leaves some bandwidth for use whilst the command is running. I cron it at 1:05am local. For iso's, its basicly the same,just runs for 3 or so days per cd!

Ron Stodden provided the following:

rsync modules are essentially very convenient path shortcuts.

The modules available from any rsync server can easily be ascertained thusly:

rsync "rsync://<rsync server host name>/"

and the contents of any path under the module thusly:

rsync "rsync://<rsync server host name>/<module>/path/"

That way you can rapidly explore the whole module tree.

  • It is normal for there to be a long pause before rsync transmits any data. This is because rsync first has to calculate "checksums" for each block of the file on the server and of the local copy of the file. (The calculation takes longer if you use the -c option, see below.)

  • Do not use the -c option -- it was the cause of many of my early problems with rsync (resulting in the message "unexpected EOF in read_timeout"). It forces a particularly long calculation in order to determine whether the files are different. It is better (takes less processing time) to simply make sure that the date of the local file does not match the date of the remote file, using touch or similar to change the date on the local copy if necessary. Then the files will be recognized as different without the -c option.

  • An "unexpected EOF in read_timeout" error often indicates that something is taking too much time on the server. Don't specify the -c option, don't rsync more than one large file in a single rsync command, etc.

  • The -a option will update the date, time, owner, and permissions of the local copy to match those of the remote copy. (IIRC, there are some other options to correct some of those items individually.) Note that md5sum is calculated from the file's contents only, so a wrong date, time, owner, or permission does not by itself change the md5sum.

  • There is a potential problem on Windows (with the -a option) because time resolution (on FAT filesystems) is only to an even number of seconds. I have not experienced this problem so I only mention it here without knowing the best way to solve it. There is an option (--modify-window) to make rsync's timestamp comparison less exact so it can tolerate this difference, but keep in mind:
    • the option only hides small timestamp differences; it does not tell you whether the contents differ, and
    • the md5sum depends only on the file's contents, so it remains the real test of whether the file is correct

  • The -v's (i.e., -v, -vv, -vvv) control the verbosity of rsync. Up to three are recognized. Under circumstances where you are getting the "unexpected EOF in read_timeout" error, you may want to try deleting all v's in the hope that it will reduce the resource utilization on the server and allow you to proceed. I haven't tried that -- deleting the -c option worked for me. People have reported some "bugs" or anomalous behavior with the -v option, particularly with more than one -v. I experienced some problems on occasion with -vvv, but they did not reliably repeat, so I don't know whether they were problems with rsync or problems with one of my installations of Linux.

  • --progress presents a periodically updated byte count and percent complete, but this only begins after the real transfer begins -- it does not appear while the various checksums are being calculated and exchanged.

  • Be sure your firewall/gateway (if you're on a network) allows outgoing connections on the rsync port (873).

  • There are other parameters that might help prevent the "unexpected EOF in read_timeout" error depending on the exact cause. These include a parameter to limit the data transfer speed, and a parameter to choose a different checksum size. (There may be others -- there is a lot to read on the rsync man page.)

  • I've seen reference to another error message: "Write failed on usr/bin/rsync: No space left on device." IIRC, from the context of the notes I read, I assume that it is a server side problem. (By default, data on the client side is written in the destination directory (the directory where the local copy of the file being rsynced is located).)

  • During debugging, it can be helpful to run rsync under strace, like:
 strace rsync -a -vvv --progress  carroll.cac.psu.edu::mandrake-iso/mandrakefreq-i586-20010316.iso  mandrakefreq-i586-20010316.iso 2>strace.txt

  • If you go to the rsync site you can find Andrew Tridgell's [the creator of rsync] doctoral thesis, in which he discusses the rsync algorithm. It is reasonably interesting and answered several of the questions I had about rsync. You can also learn about the performance of rsync, how it varies for different types of files (or different types of mismatches between files), and how certain parameters were chosen and could be changed (some are command line options) to try to improve the performance of rsync.

  • A particular problem occurs if a text file with Linux line endings (LF) is accidentally converted to MS-DOS line endings (CRLF). Rsync checks blocks of about 3000 bytes for similarity. If the lines are around 80 bytes long, every block will contain several altered line endings, so no blocks will match, and rsync's performance will degrade to be on the order of an ordinary file transfer or worse (due to the rsync block comparison overhead).
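
The advice above about avoiding -c relies on the local file's date differing from the remote file's date. A minimal sketch of forcing that with touch follows; reference.iso and local.iso are hypothetical stand-ins created in a scratch directory, and the date chosen (1 January 2000) is arbitrary. The -nt test is not strictly POSIX but is available in bash, dash, and ksh.

```shell
# Scratch directory; reference.iso plays the part of the server's copy.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'good data\n' > reference.iso
cp -p reference.iso local.iso        # cp -p keeps the timestamp: same size, same date

# Give the local copy an arbitrary old date (1 Jan 2000, 00:00 local time).
touch -t 200001010000 local.iso

# -nt is true when the first file is newer than the second.
if [ reference.iso -nt local.iso ]; then
    dates_differ=yes
else
    dates_differ=no
fi
echo "timestamps differ: $dates_differ"
```

With the dates different, rsync should treat the file as changed without the whole-file checksum pass that -c forces.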

Approach If Rsync is Interrupted

If rsync is interrupted before completion, the partially rsync'd hidden file is discarded and the original corrupt file remains in its original condition. (You have effectively made no progress.)

A way to overcome the problem is by using the --partial option, but there are potential problems and additional steps required.

To use the --partial option:

  • Make doubly sure you have a backup copy of the corrupted file stored in a safe location, either in a different directory or with a different name (or on tape, CD, a different hard drive, or whatever). (With the --partial option, rsync, if interrupted, will discard the original and save only the (partially) rsync'd copy.)

If rsync proceeds to completion, wonderful, you're done.
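
One way to confirm that you really are done is to re-check the md5sum. The following is a minimal sketch, assuming GNU md5sum is installed; repaired.iso and MD5SUMS are hypothetical names, and the checksum list is generated locally here only so the example runs without a download (normally it comes from the mirror).

```shell
# Scratch directory with a small stand-in for the repaired iso.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'repaired iso contents\n' > repaired.iso

# Stand-in for the checksum list a mirror would publish.
md5sum repaired.iso > MD5SUMS

# md5sum -c re-computes each listed file's checksum and exits 0 only if all match.
if md5sum -c MD5SUMS > /dev/null 2>&1; then
    md5_ok=yes
else
    md5_ok=no
fi

# Changing the file's date leaves the checksum alone: md5sum reads only the contents.
touch -t 200001010000 repaired.iso
if md5sum -c MD5SUMS > /dev/null 2>&1; then
    md5_ok_after_touch=yes
else
    md5_ok_after_touch=no
fi
echo "md5 ok: $md5_ok, still ok after touch: $md5_ok_after_touch"
```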

If rsync is interrupted before completion (and if the interruption is such that the local rsync client has a chance to clean up before quitting), you will have to do some work before starting rsync again.

(Aside: The rsync process might be interrupted because the server dies, the link between the client and server dies, or because the client dies. In the case of the first two, the local rsync client will have a chance to recognize the problem and "clean up".) <Point to check: it may be that even if the client dies, the described cleanup occurs either intentionally or incidentally with the next startup of rsync.>

At this point, the original corrupted file will have been replaced by the partially rsync'd file (example: if only 20 MB of a 650 MB iso has been rsync'd, the partially rsync'd file will only be 20 MB). If you resume rsync at this point, the remainder of the rsync process will be the equivalent of a download, because rsync no longer has any potential duplicate blocks to work with.

The trick here is to create a new file that starts with the 20 MB which have been successfully rsync'd, and ends with the last 630 MB from the original corrupt file. (But, if you haven't saved a backup somewhere else, you won't have the original corrupt file to work with.)

Assuming you do have the original corrupt file, there are a variety of ways to proceed.

One is to split the corrupt file into pieces the same size as the partially rsync'd file (in this case 20 MB), and then concatenate the pieces back together, but starting with the rsync'd portion and deleting the first piece from the corrupt file. (And, in the proper order.)
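
The split-and-concatenate approach can be rehearsed at a small scale first. This is a sketch only, assuming the standard split and cat utilities: a 50-byte original.iso and a 20-byte partial.iso (both hypothetical names) stand in for the 650 MB and 20 MB files, and piece. is an arbitrary prefix for the pieces.

```shell
# Scratch directory with tiny stand-ins for the two files.
work=$(mktemp -d)
cd "$work" || exit 1
{ printf '%20s' '' | tr ' ' 'X'; printf '%30s' '' | tr ' ' 'o'; } > original.iso
printf '%20s' '' | tr ' ' 'G' > partial.iso

part_bytes=$(wc -c < partial.iso | tr -d ' ')

# Cut the original into pieces the size of the partial file: piece.aa, piece.ab, ...
split -b "$part_bytes" original.iso piece.

# Drop the first piece (the span rsync already repaired) and put the partial file in its place.
rm piece.aa
cat partial.iso piece.* > rebuilt.iso    # the shell expands piece.* in sorted order

rebuilt_bytes=$(wc -c < rebuilt.iso | tr -d ' ')
echo "rebuilt file is $rebuilt_bytes bytes"
```

The rebuilt file should come out the same size as the original. Remember that the pieces temporarily need as much free space as the original file.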

In this particular example, it probably would not hurt to actually concatenate the partially rsync'd file with the entire corrupt file to create a 670 MB file. Those extra 20 MB will take a little more time for rsync to check, but maybe not as much time as it would take for you to do the extra step of splitting and reconstructing the file. (I haven't tried this -- you would need to make sure you still have sufficient free space on the partition for the rsync'd file.)

If rsync is interrupted again before completing the process, you will have to repeat this procedure, so ...

  • Save a (backup) copy of this new "original" before restarting rsync.

UPDATE: Bill Kenworthy suggested (on 2002-02-22):

The way around this is to keep a copy of the iso and "tail -c +no_of_Bytes_needed+1 copy.iso >> rsync_truncated.iso" test this out first as its late and I am too tired to check the syntax and test it!, but it works a treat.

I have not tested this yet, but it sounds like a good approach. (Like he said, the syntax needs to be confirmed.)
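
A small-scale check of the tail syntax, as a sketch under stated assumptions: copy.iso and rsync_truncated.iso follow the names in the suggestion above, but here they are 50-byte and 20-byte stand-ins created in a scratch directory. The key point is that tail -c +N starts output at byte N, counting from 1, so N has to be the size of the truncated file plus one.

```shell
# Scratch directory with tiny stand-ins for the backup copy and the truncated file.
work=$(mktemp -d)
cd "$work" || exit 1
{ printf '%20s' '' | tr ' ' 'X'; printf '%30s' '' | tr ' ' 'o'; } > copy.iso
printf '%20s' '' | tr ' ' 'G' > rsync_truncated.iso

# Number of bytes rsync has already delivered.
have_bytes=$(wc -c < rsync_truncated.iso | tr -d ' ')

# Append everything in the backup copy from byte have_bytes + 1 onward.
tail -c "+$((have_bytes + 1))" copy.iso >> rsync_truncated.iso

spliced_bytes=$(wc -c < rsync_truncated.iso | tr -d ' ')
echo "spliced file is $spliced_bytes bytes"
```

Because the command appends to the truncated file in place, it needs less scratch space than splitting the original into pieces, but it should still be run only while the backup copy is safe elsewhere.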

Early Problems

When I first attempted to use rsync to correct the md5sum on this large file, I was not successful. I thought it might be due to one of the following issues.

  • Originally I tried rsync from a machine with Mandrake 7.0 and rsync 2.3.n (??). When rsync started, the first line of output mentioned that the protocol on the server was 24 and on the client was 21, but it did not describe that as an error. (I've since learned that, in most cases, the rsync server can handle protocols older than its own.)

  • Originally my local directory was on a fat32 (vfat) partition. I switched to ext2 but that did not solve the problem. (I have not gone back to try it on a fat32 partition.)

  • During many of my early trials I got the error message "unexpected EOF in read_timeout". This usually indicates a problem on the server side or the link. Eventually I recognized that the -c option was the culprit in my particular case, and the most significant problem.

When I finally learned that the "unexpected EOF in read_timeout" error often indicated a problem on the server side, I wondered whether the rsync server was using too much processor for Penn State to tolerate and they were just killing the process. That might be the case.

Some other rsync options

Not a complete list. If I recall correctly, there may be as many as 80 command line options for rsync.

  • --timeout -- There is an option to set a timeout period, but it defaults to no timeout, so it cannot be relevant to the problem I've been having. (And, setting it will only make things worse.) I'm guessing that there is some kind of timeout on the server side, which might not even be part of rsync -- Penn State might watch Internet processes, and kill them if they continue too long without transferring any bytes. I suspect that specifying -c was leading to exactly this kind of situation.

  • --temp-dir=<directory> -- I became concerned at one point that the temporary files that rsync was creating might be going to the wrong directory where there was insufficient space. I tried using this option to make sure the files were going to the right place. (Around this time I experienced several kernel panics (my choice of words) -- I don't know if there is a direct relationship or not.) Later I learned that the default is to store the temporary files in the destination directory, and stopped specifying this option.

  • -P --> equivalent to --partial --progress

  • -z, --compress --> compress file data (not recommended)

  • -n, --dry-run --> show what would have been transferred (for testing)

  • See man rsync.

Contributors

  • RandyKramer - 2001-04-13 (created), 2001-08-22 (transferred from swiki), 2001-09-02 (rewritten)
  • Lieven Van Acker - 2001-08-23
  • Sergio Korlowsky - 2001-09-03
  • Bill Kenworthy - 2002-02-22
  • Ron Stodden - 2002-02-23

Comments from Lieven not fully integrated yet:

Some remark on --compare-dest

When I tried this, the new local copy during the rsync was kept in the current directory (original location of the local copy) as a hidden file while another directory was given with --compare-dest?

Command line

 rsync -a -vvv --progress --partial --compare-dest=../tmp/ ftp1.sunet.se::Mandrake-iso/i586/MandrakeLinux-8.1-Raklet-beta1-CD2.i586.iso MandrakeLinux-8.1-Raklet-beta1-CD2.i586c.iso

RHK note: Based on my present understanding, I would recommend not using the --compare-dest option, and only using the --partial option if it is likely that rsync will be interrupted before completion, and then only after having read and understood #Approach_If_Rsync_is_Interrupted.

during rsyncing:

-rw-rw-r--    1 lieven   lieven   679436288 Aug 22 17:49 MandrakeLinux-8.1-Raklet-beta1-CD2.i586c.iso
-rw-------    1 root     root     104071168 Aug 23 09:02 .MandrakeLinux-8.1-Raklet-beta1-CD2.i586c.iso.2L6vVb
Topic revision: r8 - 2015-09-25 - RandyKramer
 
Copyright 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.