How rsync works: Why can it synchronize two files so quickly?
In day-to-day development and operations, rsync is a very common file synchronization tool. Compared with plain copy tools such as cp and scp, rsync's defining feature is that it does not have to retransmit the entire file; instead, it tries to transmit only the parts that changed.
This article explains rsync’s core principles through three questions:
- How does rsync quickly decide whether two files need to be synchronized?
- How does rsync transfer only the differing parts of a file?
- How do rolling checksum and strong checksum work together?
--
1. The core problem rsync solves
Suppose we have two files:
New file on the source: A
Old file on the destination: B
We want to update the destination-side B into the source-side A.
The simplest approach is to send the whole A over and overwrite B:
scp A remote:/path/B
But if the file is large, say several GB, and only a few KB actually changed, then transferring the whole thing is wasteful.
rsync’s goal is:
Reuse as much of the data already present on the destination as possible, and transfer only the parts that actually changed.
That is rsync’s core value.
2. How does rsync decide by default whether files are identical?
Many people think rsync checks a hash of the file contents by default, but it doesn’t.
By default, rsync uses a very fast method, usually called quick check.
It mainly looks at two pieces of metadata:
1. File size (size)
2. Last modification time (mtime)
If the source and destination files have the same size and the same modification time, rsync will assume the file hasn’t changed and doesn’t need syncing.
So when you run:
rsync -av src/ dst/
most unchanged files will be skipped quickly. That’s because reading file size and modification time only requires accessing file metadata; it doesn’t require reading the entire file.
This is also one of the reasons rsync is fast when synchronizing a large number of files.
However, this is not a strict content verification. If the file contents were modified but the size didn’t change, and the modification time was manually restored to the original value, default rsync might not detect the change.
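The quick check and its blind spot can be sketched in a few lines of Python. This is only an illustration, not rsync's code: the function name `quick_check_same` is made up, and truncating mtime to whole seconds is a simplifying assumption of the sketch.

```python
import os
import tempfile

def quick_check_same(src, dst):
    """Metadata-only comparison: same size and same mtime (whole seconds, by assumption)."""
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)

def demo():
    """Two files with different contents but identical size and mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.txt")
        dst = os.path.join(tmp, "dst.txt")
        with open(src, "wb") as f:
            f.write(b"hello world")
        with open(dst, "wb") as f:
            f.write(b"HELLO WORLD")  # same length, different bytes
        # Force the same access/modification time on both files.
        os.utime(src, (1_700_000_000, 1_700_000_000))
        os.utime(dst, (1_700_000_000, 1_700_000_000))
        same_by_metadata = quick_check_same(src, dst)
        with open(src, "rb") as fa, open(dst, "rb") as fb:
            same_by_content = fa.read() == fb.read()
        return same_by_metadata, same_by_content
```

Here `demo()` reports the files as unchanged by metadata even though their contents differ, which is exactly the case the quick check cannot see.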
If you want strict checking based on file contents, you can use:
rsync -avc src/ dst/
Here, -c or --checksum makes rsync compute a checksum of the file contents and then decide whether the file is identical.
The trade-off is: both sides must read the full file contents, so disk I/O will increase noticeably.
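To make that cost concrete, here is a rough sketch of a content-based comparison. The hash rsync actually uses depends on its version and options; SHA-256 is only a stand-in here, and `file_digest` and `content_same` are hypothetical helper names.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Hash the whole file in chunks; every byte has to be read from disk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def content_same(src, dst):
    """Compare by content: a size mismatch is a cheap early exit, otherwise hash both."""
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    return file_digest(src) == file_digest(dst)

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.txt")
        dst = os.path.join(tmp, "dst.txt")
        for path, data in ((src, b"hello world"), (dst, b"HELLO WORLD")):
            with open(path, "wb") as f:
                f.write(data)
        # A file compared with itself matches; same-size files with different bytes do not.
        return content_same(src, src), content_same(src, dst)
```

Unlike the metadata check, this reads both files completely, which is where the extra disk I/O comes from.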
In short:
Default mode: size + mtime; fast, but not a strict content check
Checksum mode: compares file contents; more reliable, but slower
3. Where rsync is truly impressive: delta transfer
When rsync determines that a file needs updating, it doesn’t necessarily retransmit the entire file.
It tries to find out:
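Which parts of the source-side new file A does the destination-side old file B already contain?
Which parts are missing on the destination and really need to be transferred?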
This process is rsync’s delta-transfer algorithm.
The core idea can be summarized in one sentence:
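The destination tells the source: these are the data blocks I already have.
The source scans the new file: any block you already have, I won't send; I'll send only the parts you lack.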
4. The basic flow of delta transfer
Assume the destination already has the old file B.
First, the destination splits the old file B into fixed-size blocks:
B = [B0][B1][B2][B3][B4]...
Then the destination computes two checksums for each block:
B0 -> weak0, strong0
B1 -> weak1, strong1
B2 -> weak2, strong2
...
There are two kinds of checksums:
weak checksum: a weak check, i.e. the rolling checksum
strong checksum: a strong check, used to finally confirm that the blocks really are identical
The destination does not send the entire old file to the source; it only sends the checksum list for these blocks.
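The destination-side step can be sketched as follows. This is an illustration under stated assumptions: `zlib.adler32` stands in for rsync's rolling checksum and SHA-256 for its strong checksum, and the block size of 4 bytes is far smaller than anything used in practice.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # tiny on purpose, so the example is easy to follow

def block_signatures(old, block_size=BLOCK_SIZE):
    """Return [(index, weak, strong), ...], one entry per fixed-size block of the old file."""
    signatures = []
    for index, start in enumerate(range(0, len(old), block_size)):
        block = old[start:start + block_size]   # the last block may be shorter
        weak = zlib.adler32(block)               # cheap 32-bit checksum (stand-in)
        strong = hashlib.sha256(block).hexdigest()  # expensive, collision-resistant (stand-in)
        signatures.append((index, weak, strong))
    return signatures

# The list below is all the destination would send; the block data itself stays local.
signatures = block_signatures(b"abcdefghij")
```

For a 10-byte file and 4-byte blocks this yields three entries, the last one covering the 2-byte tail.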
Next, after the source receives these checksums, it starts scanning the source-side new file A.
5. What is a rolling checksum?
A rolling checksum can be understood as a weak checksum suitable for a “sliding window.”
The problem with a normal hash is: if the window moves by 1 byte, you typically need to recompute the hash for the entire window.
For example, if the window size is 4KB:
Window 1: [byte 0 ... byte 4095]
Window 2: [byte 1 ... byte 4096]
The two windows differ by only one byte, but a normal hash usually requires rereading the full 4KB to compute.
The advantage of a rolling checksum is that it can quickly compute the checksum of the next window based on the previous window’s result.
Roughly speaking:
new checksum = old checksum - byte leaving the window + byte entering the window
The real algorithm is more complex than this, but that’s the intuition.
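A slightly fuller sketch uses two running sums, in the spirit of the scheme usually described for rsync. The constants and function names here are assumptions for illustration, not rsync's exact implementation.

```python
M = 1 << 16  # both sums are kept modulo 2^16 (an assumption of this sketch)

def weak_checksum(window):
    """Compute the pair (a, b) from scratch: O(n) in the window length."""
    n = len(window)
    a = sum(window) % M
    b = sum((n - i) * byte for i, byte in enumerate(window)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide the window one byte forward in O(1), using only the old sums."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def rolling_matches_recompute(data, n):
    """Check that rolling gives the same result as recomputing at every offset."""
    a, b = weak_checksum(data[:n])
    for pos in range(1, len(data) - n + 1):
        a, b = roll(a, b, data[pos - 1], data[pos + n - 1], n)
        if (a, b) != weak_checksum(data[pos:pos + n]):
            return False
    return True
```

The point of `roll` is that its cost does not depend on the window size: moving a 4 KB window by one byte touches two bytes, not 4096.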
So rolling checksum is very suitable for the source scanning the new file:
A = abcdefghijklmnop
    [----]            window 1
     [----]           window 2
      [----]          window 3
Every time the window slides forward by 1 byte, you can quickly get a new weak checksum.
6. What is a strong checksum?
Rolling checksum is fast, but it is a weak check and can have collisions.
That is, two different data blocks might produce the same weak checksum.
If rsync relied only on rolling checksum, it could mistakenly treat two different blocks as the same, causing incorrect contents in the final synchronized file.
So rsync also needs a strong checksum.
The role of a strong checksum is:
When the rolling checksum finds a "suspected match", the strong checksum is used for final confirmation.
It is more reliable than rolling checksum, but also more expensive to compute.
Therefore, rsync does not compute a strong checksum for every sliding window; it computes it only when a weak checksum hits.
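A collision is easy to show with a small example. The sketch below uses `zlib.adler32` as a stand-in weak checksum (rsync's own rolling checksum is similar in spirit but not identical) and SHA-256 as a stand-in strong checksum; `blocks_match` is a hypothetical helper.

```python
import hashlib
import zlib

# Two different 3-byte blocks whose Adler-32 values happen to coincide.
block_x = bytes([1, 0, 1])
block_y = bytes([0, 2, 0])

def blocks_match(x, y):
    """Two-stage comparison: cheap weak check first, strong check only on a weak hit."""
    if zlib.adler32(x) != zlib.adler32(y):
        return False  # most non-matching candidates are rejected here, cheaply
    # Weak hit: only now pay for the strong checksum.
    return hashlib.sha256(x).digest() == hashlib.sha256(y).digest()

weak_collides = zlib.adler32(block_x) == zlib.adler32(block_y)
confirmed_same = blocks_match(block_x, block_y)
```

The weak checksums agree, but the strong checksum rejects the pair, so the two blocks are not treated as identical.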
7. How do rolling checksum and strong checksum work together?
This is the core of the rsync algorithm.
The destination-side old file B has already been split into multiple blocks, and each block has a pair of checksums:
B0 -> weak0, strong0
B1 -> weak1, strong1
B2 -> weak2, strong2
...
When the source scans the new file A, it continuously moves a window of the same size.
For each window, it first computes the rolling checksum:
window = A[pos : pos + block_size]
weak = rolling_checksum(window)
Then it looks up this weak checksum in the checksum table sent by the destination.
Case 1: the weak checksum does not hit
If there is no hit, it means this window is very likely not any block the destination already has.
So the source slides 1 byte forward:
No match at the current position
=> move the window forward by 1 byte
=> continue computing the rolling checksum
This step is fast because rolling checksum can be computed incrementally.
Case 2: the weak checksum hits
If the weak checksum hits some old block, for example B7:
weak checksum of A's current window == weak checksum of B7
This suggests the current window may be the same as B7 in the old file.
But it’s only a suspected match.
Next, the source computes a strong checksum for the current window:
strong = strong_checksum(A[pos : pos + block_size])
Then it compares it with B7’s strong checksum:
if strong == strong7
=> the current window of A is, for practical purposes, confirmed to hold the same content as B7
After confirming a match, the source no longer needs to send the data itself.
It only needs to tell the destination:
Copy block 7 from your local old file
That is, send an instruction like:
copy block B7
For content that cannot be matched, the source sends the raw bytes, i.e. literal data.
In the end, what the destination receives is a sequence of construction instructions:
literal "abc"
copy B7
copy B8
literal "xyz"
copy B12
...
Based on these instructions, the destination combines the blocks it already has from the old file with the newly transferred literal data to reconstruct the new file A.
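Putting the pieces together, the whole exchange can be sketched end to end. This is a simplified model, not rsync's implementation: `zlib.adler32` and SHA-256 are stand-ins for the weak and strong checksums, the weak checksum is recomputed at each offset instead of being rolled (for brevity), a short trailing block of the old file is ignored, and all function names are made up.

```python
import hashlib
import zlib

def make_signatures(old, block_size):
    """Destination side: map weak checksum -> [(strong checksum, block index), ...]."""
    table = {}
    for index, start in enumerate(range(0, len(old), block_size)):
        block = old[start:start + block_size]
        if len(block) < block_size:
            break  # simplification: skip a short trailing block
        entry = (hashlib.sha256(block).digest(), index)
        table.setdefault(zlib.adler32(block), []).append(entry)
    return table

def compute_delta(new, table, block_size):
    """Source side: turn the new file into copy/literal instructions."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        match = None
        if len(window) == block_size:
            candidates = table.get(zlib.adler32(window))
            if candidates:  # weak hit: only now compute the strong checksum
                strong = hashlib.sha256(window).digest()
                match = next((i for s, i in candidates if s == strong), None)
        if match is None:
            literal.append(new[pos])  # unmatched byte becomes literal data
            pos += 1                  # slide the window forward by one byte
        else:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += block_size         # jump past the matched block
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops

def apply_delta(old, ops, block_size):
    """Destination side: rebuild the new file from old blocks plus literal data."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)

old = b"AAAABBBBCCCCDDDD"
new = b"xyBBBBCCCC!DDDD"
ops = compute_delta(new, make_signatures(old, 4), 4)
rebuilt = apply_delta(old, ops, 4)
```

In this run only three bytes ("xy" and "!") travel as literal data; everything else is expressed as references to blocks the destination already holds.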
8. Why not use only strong checksum?
Because when the source scans the new file, it’s not just checking fixed block boundaries; it needs to search for matches at every possible offset in the new file.
If the file is large, computing a strong checksum at every offset would be very expensive.
The value of rolling checksum is:
It can scan a large number of candidate positions very cheaply.
So it’s suitable as the first filtering layer.
Only when a rolling checksum hits does rsync compute a strong checksum.
This avoids a large amount of unnecessary strong-checksum computation.
9. Why not use only rolling checksum?
Because rolling checksum is a weak check and has a risk of collisions.
If you rely on it alone to decide whether two blocks are identical, false positives can occur.
The role of strong checksum is to reduce the probability of misjudgment and make the reconstructed file as reliable as possible.
So their division of labor is:
rolling checksum: quickly finds suspected matches
strong checksum: confirms whether a suspected match really holds
You can think of it as a two-stage filter:
Stage 1: rolling checksum quickly filters candidate blocks
Stage 2: strong checksum precisely confirms candidate blocks
This is why rsync can be both “fast” and “reliable.”
10. A simplified example
Assume the destination-side old file B is split into 4 blocks:
B = [B0][B1][B2][B3]
The destination computes and sends:
B0 -> weak0, strong0
B1 -> weak1, strong1
B2 -> weak2, strong2
B3 -> weak3, strong3
In the source-side new file A, a small piece of new data is inserted at the beginning, but most of the later content is the same as the old file:
A = [new data][B1][B2][B3]
When scanning A, the source will find:
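The new data at the beginning cannot be matched to any old block
=> send literal data
A later window matches B1
=> send copy B1
The next window matches B2
=> send copy B2
The next window matches B3
=> send copy B3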
In the end, the source only needs to send:
literal "new data"
copy B1
copy B2
copy B3
Then the destination can reconstruct the new file A based on the old file B and these instructions.
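The reconstruction step for this example can be sketched directly. The block contents, the 4-byte block size, and the `rebuild` helper are all made up for illustration.

```python
BLOCK_SIZE = 4

# Old file B on the destination: four blocks B0..B3.
old_b = b"AAAABBBBCCCCDDDD"

# Instruction stream received from the source for the example above.
instructions = [
    ("literal", b"new!"),  # data the destination does not have
    ("copy", 1),           # reuse B1
    ("copy", 2),           # reuse B2
    ("copy", 3),           # reuse B3
]

def rebuild(old, instructions, block_size):
    """Apply copy/literal instructions in order to produce the new file."""
    out = bytearray()
    for kind, value in instructions:
        if kind == "copy":
            out += old[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)

new_a = rebuild(old_b, instructions, BLOCK_SIZE)

# Only literal bytes carry file data over the network in this model.
literal_bytes = sum(len(v) for kind, v in instructions if kind == "literal")
```

In this toy case the new file is 16 bytes long, but only 4 of them are sent as data; the remaining 12 come from blocks already on the destination.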
This is why, when only a small part of a large file changes, rsync’s transfer volume can be far smaller than directly copying the whole file.
11. If you want to check locally whether two files are exactly the same, what should you use?
If you only want to determine on the same machine whether two files are completely identical, the most direct tool is actually not rsync, but cmp:
cmp -s file1 file2 && echo same || echo diff
cmp compares two files byte by byte.
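If the check has to happen inside a script rather than a shell, Python's standard library offers a comparable content comparison. The sketch below assumes `filecmp.cmp` with `shallow=False`, which compares file contents instead of trusting `os.stat()` signatures; `files_identical` is a hypothetical wrapper name.

```python
import filecmp
import os
import tempfile

def files_identical(path1, path2):
    """Content comparison of two local files, similar in spirit to `cmp -s`."""
    return filecmp.cmp(path1, path2, shallow=False)

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        p1, p2, p3 = (os.path.join(tmp, name) for name in ("f1", "f2", "f3"))
        for path, data in ((p1, b"same"), (p2, b"same"), (p3, b"diff")):
            with open(path, "wb") as f:
                f.write(data)
        # f1 and f2 hold the same bytes; f3 has the same size but different bytes.
        return files_identical(p1, p2), files_identical(p1, p3)
```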
Differences among several approaches:
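stat:
looks only at metadata such as size and mtime; fastest, but not strict
cmp:
compares byte by byte; suited to checking whether two local files are exactly identical
sha256sum:
computes a full hash; suited to generating fingerprints that can be stored and compared across machines
rsync:
suited to synchronizing directories or incremental synchronization across machines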
If you’re just syncing directories:
rsync -av src/ dst/
If you suspect modification times are unreliable and want to decide by content:
rsync -avc src/ dst/
If you only want to preview which files will change:
rsync -avcni src/ dst/
Where:
-a: archive mode; preserves permissions, timestamps, etc.
-v: show verbose information
-c: detect content changes by checksum
-n: dry run; only simulate, don't actually modify anything
-i: output an itemized change summary
12. Summary
rsync’s implementation principles can be divided into two layers:
The first layer is quickly determining whether a file needs processing.
By default, rsync uses:
file size (size) + modification time (mtime)
to quickly skip unchanged files.
The second layer is performing delta transfer for files that need updating.
It uses:
rolling checksum + strong checksum
to find which blocks in the source-side new file already exist in the destination-side old file.
They work together as follows:
rolling checksum quickly scans and finds candidate blocks;
strong checksum confirms whether the candidate blocks are really identical.
After finding matching blocks, the source doesn’t need to send the actual data; it only needs to send reference instructions like:
copy block N
Only the parts that cannot be matched are sent as raw data.
Therefore, when file changes are small, rsync can dramatically reduce the amount of data transferred.
In one sentence:
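By default, rsync relies on size + mtime to quickly decide whether a file has changed;
when it actually synchronizes a changed file, it uses the rolling checksum to find candidate blocks and the strong checksum to confirm matches, and transfers only the data the destination is missing.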
That is the core principle behind rsync’s efficient synchronization.