Software: GZIP vs. BZIP2 vs. XZ - performance

I was part of a discussion on G+ recently about the best possible compression method for the Linux kernel. Later it was extended to user-space compression algorithms as well. I think it is interesting to see how various compression methods and levels perform on different types of files.

Input data

For the test I selected the following files:

  • DVD.iso - ISO image containing an MPEG-2 stream (DVD-Video) and JPEG files (pictures)
  • fs.bin - ext4 file system containing "linux.tar" and "random.bin"
  • linux.tar - tarball archive of the Linux kernel sources plus objects and the final kernel/module images
  • random.bin - file containing data read from /dev/urandom (see the sketch after this list)
  • zero.bin - file containing only 'zero' data (read from /dev/zero)
  • sql.dump - text dump of my PostgreSQL database (backup catalog, SQL commands)
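
The two synthetic inputs could have been produced along these lines; the exact commands were not recorded, so the dd invocations below are an assumption, but the sizes match the "source" row of the size table below (1 GiB and 4 GiB):

dd if=/dev/urandom of=random.bin bs=1M count=1024    # 1 GiB of incompressible data
dd if=/dev/zero    of=zero.bin   bs=1M count=4096    # 4 GiB of maximally compressible data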

As preparation I executed the following loop:

for a in DVD.iso fs.bin linux.tar random.bin zero.bin sql.dump
do
        for b in 1 6 9
        do
                cat ${a} | gzip  -${b} > ${a}.${b}.gz
                cat ${a} | bzip2 -${b} > ${a}.${b}.bz2
                cat ${a} | xz    -${b} > ${a}.${b}.xz
        done
done
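
Although not part of the original procedure, the resulting archives can be sanity-checked before timing anything; all three tools provide a test mode:

for a in DVD.iso fs.bin linux.tar random.bin zero.bin sql.dump
do
        for b in 1 6 9
        do
                gzip  -t ${a}.${b}.gz   # each -t run exits non-zero on a corrupt archive
                bzip2 -t ${a}.${b}.bz2
                xz    -t ${a}.${b}.xz
        done
done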

Test methodology

The test was executed on an "Intel(R) Atom(TM) CPU 330 @ 1.60GHz". The system was running in dual-core mode with HT enabled (SMP); there should be no significant difference when using one core and a "UP" scheduler, as compression and decompression were done in a single thread. The system was configured with 3GB of usable RAM and without CPU frequency scaling. At the time of the test the system was idle. Sequential disk read speed is 80 MB/sec, so it should not affect the testing. I used /dev/null as the target for compression and decompression to prevent possible problems with concurrent I/O and cache entries.
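
The exact commands used to verify this setup were not recorded; something along these lines would confirm it (hdparm, cpupower and /dev/sda are assumptions, not tools or devices named above):

hdparm -t /dev/sda         # sequential read speed of the source disk (run as root)
cpupower frequency-info    # confirm that CPU frequency scaling is disabled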

Result

Size after compression

The table contains sizes as reported by the stat and ls -lh commands (smaller is better):

           DVD.iso            fs.bin             linux.tar         random.bin         zero.bin           sql.dump
source     6189107200 (5.8G)  4294967296 (4.0G)  760012800 (725M)  1073741824 (1.0G)  4294967296 (4.0G)  739279146 (706M)
gzip -1    5887330412 (5.5G)  1307236772 (1.3G)  222544172 (213M)  1073924290 (1.1G)  18734949 (18M)     294669689 (282M)
gzip -6    5879295258 (5.5G)  1265502809 (1.2G)  189164983 (181M)  1073915726 (1.1G)  4168175 (4.0M)     257863244 (246M)
gzip -9    5878183653 (5.5G)  1263912039 (1.2G)  187578775 (179M)  1073915726 (1.1G)  4168175 (4.0M)     255110366 (244M)
bzip2 -1   5845697940 (5.5G)  1259732950 (1.2G)  177327051 (170M)  1082371295 (1.1G)  26041 (26K)        234198972 (224M)
bzip2 -6   5485927519 (5.2G)  1235652239 (1.2G)  156321978 (150M)  1079336646 (1.1G)  4491 (4.4K)        225847372 (216M)
bzip2 -9   5430387273 (5.1G)  1231448849 (1.2G)  152999062 (146M)  1078496689 (1.1G)  3023 (3.0K)        224565320 (215M)
xz -1      5383272868 (5.1G)  1227513964 (1.2G)  153319596 (147M)  1073795128 (1.1G)  624848 (611K)      235411304 (225M)
xz -6      5305999740 (5.0G)  1188389560 (1.2G)  114173192 (109M)  1073795048 (1.1G)  624848 (611K)      175777564 (168M)
xz -9      5264433664 (5.0G)  1174081380 (1.1G)  99830680 (96M)    1073795048 (1.1G)  624848 (611K)      100893196 (97M)
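
Only stat and ls -lh are named above; one plausible way to collect both numbers per archive is:

for f in *.gz *.bz2 *.xz
do
        stat -c '%n %s' ${f}    # file name and exact size in bytes
done
ls -lh *.gz *.bz2 *.xz          # human-readable sizes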

Compression time and memory usage

for a in DVD.iso fs.bin linux.tar random.bin zero.bin sql.dump
do
        for b in 1 6 9
        do
                cat ${a} | /usr/bin/time -v -o ${a}.${b}.gz.c.txt  gzip  -${b} > /dev/null
                cat ${a} | /usr/bin/time -v -o ${a}.${b}.bz2.c.txt bzip2 -${b} > /dev/null
                cat ${a} | /usr/bin/time -v -o ${a}.${b}.xz.c.txt  xz    -${b} > /dev/null
        done
done

Table contain "wall clock" (e.g. total time of execution) and "maximum resident set size (memory)":

           DVD.iso             fs.bin              linux.tar        random.bin       zero.bin         sql.dump
gzip -1    00:17:02 (3872)     00:04:35 (3952)     00:51 (3952)     02:35 (3728)     01:58 (3952)     01:00 (3952)
gzip -6    00:20:01 (3856)     00:06:16 (3952)     02:02 (3920)     02:44 (3712)     02:38 (3952)     02:31 (3872)
gzip -9    00:34:32 (3840)     00:10:21 (3968)     06:01 (3904)     02:44 (3728)     02:37 (3952)     07:13 (3872)
bzip2 -1   01:34:24 (6768)     00:26:18 (6768)     05:22 (6752)     18:27 (5696)     04:19 (6768)     05:21 (5696)
bzip2 -6   01:43:47 (21536)    00:30:29 (21552)    07:12 (21552)    20:38 (19424)    04:32 (21552)    06:48 (19424)
bzip2 -9   01:47:26 (31056)    00:32:34 (31056)    08:11 (31040)    21:25 (27888)    04:34 (31056)    07:40 (27888)
xz -1      02:10:35 (38384)    00:32:22 (38400)    04:30 (38400)    25:46 (38400)    03:48 (38128)    06:56 (38384)
xz -6      02:52:31 (384928)   01:12:17 (385280)   25:11 (385280)   28:16 (384576)   33:13 (384336)   33:20 (385296)
xz -9      04:15:07 (2760704)  01:30:01 (2760960)  29:49 (2760960)  41:18 (2760256)  33:18 (2759968)  36:59 (2760976)

Note: Wall clock is in format hh:mm:ss or mm:ss. Memory is reported in kilobytes. Smaller numbers are better.
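
The values come from the .txt reports written by /usr/bin/time -v; the two relevant lines can be pulled out of each report like this (the file name is just an example following the naming scheme above):

grep -E 'Elapsed \(wall clock\)|Maximum resident set size' linux.tar.9.xz.c.txt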

Decompression time and memory usage

For the decompression test I used the following loop:

for a in DVD.iso fs.bin linux.tar random.bin zero.bin sql.dump
do
        for b in 1 6 9
        do
                cat ${a}.${b}.gz  | /usr/bin/time -v -o ${a}.${b}.gz.d.txt  gzip -d  > /dev/null
                cat ${a}.${b}.bz2 | /usr/bin/time -v -o ${a}.${b}.bz2.d.txt bzip2 -d > /dev/null
                cat ${a}.${b}.xz  | /usr/bin/time -v -o ${a}.${b}.xz.d.txt  xz -d    > /dev/null
        done
done

Table contain "wall clock" (e.g. total time of execution) and "maximum resident set size":

           DVD.iso            fs.bin             linux.tar          random.bin         zero.bin           sql.dump
gzip -1    03:19.63 (3184)    00:53.82 (3152)    00:15.20 (3168)    00:25.38 (3120)    00:25.46 (3104)    00:17.75 (3152)
gzip -6    03:11.92 (3152)    01:01.34 (3168)    00:13.51 (3152)    00:23.63 (3136)    00:40.41 (3088)    00:16.21 (3152)
gzip -9    03:11.00 (3152)    01:01.08 (3168)    00:13.17 (3168)    00:23.75 (3120)    00:40.41 (3088)    00:16.31 (3136)
bzip2 -1   27:22.65 (3584)    06:38.13 (3600)    01:15.41 (3600)    05:05.59 (3584)    00:42.02 (3584)    01:28.91 (3600)
bzip2 -6   35:12.75 (12048)   08:51.09 (12048)   01:40.37 (12032)   06:45.97 (12032)   00:42.78 (10992)   02:07.83 (12048)
bzip2 -9   35:49.54 (16272)   09:00.72 (16256)   01:44.57 (16272)   06:53.12 (16272)   00:42.96 (16272)   02:12.46 (16256)
xz -1      18:58.10 (8128)    01:26.62 (8128)    00:33.49 (8128)    00:13.17 (7968)    00:51.64 (8080)    00:58.49 (8128)
xz -6      18:53.31 (36784)   01:14.53 (36816)   00:28.21 (36816)   00:12.86 (36624)   00:51.59 (36752)   00:50.62 (36800)
xz -9      18:47.33 (266176)  01:11.99 (266192)  00:25.69 (266176)  00:13.15 (265984)  00:51.58 (266144)  00:33.14 (266176)

Note: Wall clock is in format mm:ss.ss. Memory is reported in kilobytes. Smaller numbers are better.

Summary

  • GZIP is a good option for compressing a lot of data, as it is quick and its memory usage is low. GZIP is the option for tasks requiring high throughput. I am using GZIP for backups.
  • BZIP2 provides a better compression ratio than GZIP but requires more CPU time to achieve it. If possible, a parallel BZIP2 implementation may reduce the overall compression time (see the sketch after this list). I am using BZIP2 to compress SQL dumps; the wall time is similar to GZIP, but the result is significantly smaller.
  • XZ uses the LZMA algorithm, which provides an impressive compression ratio but at the cost of very high CPU and memory usage. Decompression speed is very good, but it also consumes a lot of memory. For me it makes sense to use it only for archiving data (i.e. compress once, save, decompress as many times as necessary).
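
As one example of such a parallel implementation (pbzip2 is a suggestion on my side, not a tool used in the tests above), the SQL dump could be compressed using all four hardware threads of the Atom 330 like this:

pbzip2 -p4 -9 sql.dump    # splits the input into blocks, compresses them in parallel, writes sql.dump.bz2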