[Bug 265320] diff program runs long when running diff on large files

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 19 Jul 2022 23:53:28 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=265320

            Bug ID: 265320
           Summary: diff program runs long when running diff on large
                    files
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: bin
          Assignee: bugs@FreeBSD.org
          Reporter: shrikanth07@gmail.com

The issue was found when running diff on a set of files that are nearly
identical (1865319 lines / ~76MB in size) except having differences in the
first line and line 1865315.

# uname -a
FreeBSD BSD14vm 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n255938-326a8d3e085:
Fri Jun  3 08:30:41 UTC 2022    
root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

The diff program ran for more than 3172.21 seconds (or even more) I stopped
monitoring with CTRL-T after some point but here is the capture...

root@BSD14vm:# ls -ltrhG
total 469632
-rw-r--r--  1 root  wheel    76M Jul 19 11:46 config_edb_pre
-rw-r--r--  1 root  wheel    76M Jul 19 11:46 config_edb_post

root@BSD14vm:# diff config_edb_p*
load: 0.21  cmd: diff 972 [running] 3.60r 3.44u 0.15s 29% 40480k
load: 0.21  cmd: diff 972 [running] 4.90r 4.74u 0.15s 39% 40576k
load: 1.10  cmd: diff 972 [running] 156.52r 156.31u 0.15s 100% 46564k
load: 1.19  cmd: diff 972 [running] 462.20r 461.86u 0.18s 100% 49140k
load: 1.19  cmd: diff 972 [running] 462.64r 462.30u 0.18s 100% 49140k
load: 1.19  cmd: diff 972 [running] 462.85r 462.51u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 503.74r 503.39u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 503.96r 503.60u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 504.14r 503.78u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 504.30r 503.94u 0.18s 100% 49140k
load: 1.07  cmd: diff 972 [running] 766.07r 765.61u 0.21s 98% 51196k
load: 1.17  cmd: diff 972 [running] 919.11r 918.61u 0.21s 100% 51196k
load: 1.13  cmd: diff 972 [running] 1014.49r 1013.96u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1014.75r 1014.21u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1015.05r 1014.52u 0.22s 98% 53252k
load: 1.13  cmd: diff 972 [running] 1015.31r 1014.77u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1015.49r 1014.96u 0.22s 100% 53252k
load: 1.04  cmd: diff 972 [running] 1087.73r 1087.17u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1266.02r 1265.41u 0.22s 98% 53252k
load: 1.04  cmd: diff 972 [running] 1468.09r 1467.41u 0.22s 100% 59176k
load: 1.02  cmd: diff 972 [running] 1499.33r 1498.63u 0.22s 100% 59176k
load: 1.08  cmd: diff 972 [running] 2095.98r 2095.10u 0.23s 100% 61232k
load: 1.43  cmd: diff 972 [running] 2146.53r 2145.62u 0.23s 100% 61232k
load: 1.39  cmd: diff 972 [running] 2350.22r 2349.22u 0.23s 100% 63288k
load: 1.24  cmd: diff 972 [running] 2874.17r 2873.05u 0.25s 100% 67392k
load: 1.12  cmd: diff 972 [running] 2932.42r 2931.28u 0.25s 100% 67392k
load: 1.12  cmd: diff 972 [running] 2932.67r 2931.52u 0.25s 100% 67392k
load: 1.10  cmd: diff 972 [running] 2980.07r 2978.90u 0.26s 98% 85608k
load: 1.19  cmd: diff 972 [running] 3173.49r 3172.21u 0.26s 100% 85608k
...
1c1
< <configuration changed-seconds="1658220346" changed-localtime="2022-07-19
10:45:46 CEST">
---
> <configuration changed-seconds="1658219983" changed-localtime="2022-07-19 10:39:43 CEST">
1865315c1865315
<             <description>2022-07-19 10:43:02.391784</description>
---
>             <description>2022-07-18 16:19:44.594290</description>

As you see there is difference on line 1 and line 1865315 for the entire file.
If I use 'head' or 'tail' and retain only one of the difference in the file
'diff' is able to complete in less than a second. 

The below files have only the diff on line 1
 #head -n 1865310 config_edb_pre > p1_headn1865310
 #head -n 1865310 config_edb_post > p2_headn1865310

# time diff p1_headn1865310 p2_headn1865310
1c1
< <configuration changed-seconds="1658219983" changed-localtime="2022-07-19
10:39:43 CEST">
---
> <configuration changed-seconds="1658220346" changed-localtime="2022-07-19 10:45:46 CEST">
        0.83 real         0.74 user         0.09 sys

The below files have only the diff on line 1865306
 # tail -n 1865310 config_edb_pre > p1_tailn1865310
 # tail -n 1865310 config_edb_post > p2_tailn1865310

# time diff p1_tailn1865310 p2_tailn1865310 
1865306c1865306
<             <description>2022-07-18 16:19:44.594290</description>
---
>             <description>2022-07-19 10:43:02.391784</description>
        0.84 real         0.75 user         0.09 sys

# which diff
/usr/bin/diff

# diff --version
FreeBSD diff 20220309

# file config_edb_p*
config_edb_post: ASCII text
config_edb_pre:  ASCII text

-- 
You are receiving this mail because:
You are the assignee for the bug.