'Split' Command in Linux to Break Large File Into Smaller Chunks

In this article, we look at the split command in Linux, which breaks a large file into smaller pieces.

To demonstrate this, we create a file testfile.txt using the seq command. For those who do not know seq, it prints a sequence of numbers, which we redirect into a file. Let's do it.

# Dump 'seq' output to 'testfile.txt'
[root@LinuxFault split_test]# seq 10000 > testfile.txt

# First 10 lines
[root@LinuxFault split_test]$ head testfile.txt
1
2
3
4
5
6
7
8
9
10

# Last 10 lines
[root@LinuxFault split_test]$ tail testfile.txt
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000

1. Basic use of split

The basic usage of a command is running it without any options. Here, we pass the file name as the only argument to split, as shown below. After it finishes, run ls to list the smaller parts of the file.

[root@LinuxFault split_test]$ split testfile.txt

# The large file has been split into a number of smaller files
[root@LinuxFault split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj

We can see that a number of files with names of the form x followed by two letters (xaa, xab and so on) have been created. To make sure that they are parts of the original file, we check their line counts and their contents.

# Original file -> 10000 Lines
# 10 parts -> 1000 lines each
[root@LinuxFault split_test]$ wc -l *
10000 testfile.txt
 1000 xaa
 1000 xab
 1000 xac
 1000 xad
 1000 xae
 1000 xaf
 1000 xag
 1000 xah
 1000 xai
 1000 xaj
20000 total

# Check the contents of first part
[root@LinuxFault split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10

# Check the contents of last file
[root@LinuxFault split_test]$ tail xaj
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
[root@LinuxFault split_test]$ 
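Beyond looking at head and tail, the chunks can be joined back together and compared with the original byte for byte. The following is a minimal sketch of that check; the scratch directory and the file name rejoined.txt are assumptions made for illustration and are not part of the session above.

```shell
# Work in a scratch directory so existing files are not touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample file and split it with the defaults
seq 10000 > testfile.txt
split testfile.txt

# The glob expands in sorted order, so xaa..xaj are concatenated in sequence
cat x?? > rejoined.txt

# cmp -s prints nothing and returns 0 when the files are byte-identical
if cmp -s testfile.txt rejoined.txt; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

If the chunks really are consecutive pieces of the original, cmp reports no difference.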

In this case, the file has been split into 10 smaller chunks based on line count, so that every chunk consists of 1000 lines. Instead, you might want the file to be split into a specific number of chunks, say 5. Let's see how to do that.

2. Split a file in 'n' smaller parts - Option -n

We can define the number of parts a file should be split into using option -n. The syntax for this is split -n [No. of chunks] [file name]. Let's create 5 chunks of our file testfile.txt.

# Specify the number of chunks
[root@LinuxFault split_test]$ split -n 5 testfile.txt

# There are 5 chunks created
[root@LinuxFault split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae

# Line counts vary, because -n divides the file by bytes
[root@LinuxFault split_test]$ wc -l *
10000 testfile.txt
 2177 xaa
 1955 xab
 1956 xac
 1955 xad
 1957 xae
20000 total

# But together they hold the same information
[root@LinuxFault split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10
[root@LinuxFault split_test]$ tail xae
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
[root@LinuxFault split_test]$ 

So, this command has created 5 chunks of the file for us. Because -n divides the file by byte count, the chunks are nearly equal in bytes but differ in line count; put together, they have the same contents as the original file. Next, we see how chunks can be created based on the size of every chunk.
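Before moving on, one caveat: since plain -n N cuts by bytes, a cut can in general land in the middle of a line. GNU split also accepts the form -n l/N, which asks for N chunks without breaking any line. The sketch below assumes GNU coreutils and uses a scratch directory that is not part of the session above.

```shell
# Scratch directory and sample file (assumptions for this sketch)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 10000 > testfile.txt

# l/5 asks GNU split for 5 chunks that each contain only whole lines
split -n l/5 testfile.txt

# Count the chunks via positional parameters instead of parsing ls
set -- x??
chunk_count=$#

# A chunk made of whole lines ends with a newline character
incomplete=0
for f in x??; do
    [ "$(tail -c 1 "$f" | wc -l)" -eq 1 ] || incomplete=$((incomplete + 1))
done
echo "chunks: $chunk_count, chunks ending mid-line: $incomplete"
```

The chunks may still differ slightly in byte size, because each boundary is moved to the nearest line break.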

3. Split a file into chunks of equal sizes - Option -b

We've seen how files can be split based on the number of lines and the number of chunks. Now we see how to split a file based on the size of every chunk, so as to create chunks of equal size. For this, we use option -b as split -b [size] [file name], where the size is given in bytes unless a unit suffix such as K or M is appended.

# We specify the chunk size to be 10000 bytes
[root@LinuxFault split_test]$ split -b 10000 testfile.txt

# It creates 5 chunks for us
[root@LinuxFault split_test]$ ll
total 108
-rw-r--r--. 1 root root 48894 Nov 11 14:09 testfile.txt
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xaa
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xab
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xac
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xad
-rw-r--r--. 1 root root  8894 Nov 11 14:38 xae

We can see that 5 chunks are created: 4 with a size of 10000 bytes each and the last one with the leftover 8894 bytes. So far, we can split a file based on the size of each chunk and the number of chunks. What about the number of lines in each chunk? 1000 lines is the default value, and we can change it as needed.
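As a side note on -b, the size does not have to be spelled out in bytes. The sketch below uses the K suffix, which GNU split treats as 1024 bytes; the scratch directory is an assumption for illustration, and the sample file is the same 48894-byte file produced by seq 10000.

```shell
# Scratch directory and sample file (assumptions for this sketch)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 10000 > testfile.txt

# 10K means 10 * 1024 = 10240 bytes per chunk
split -b 10K testfile.txt

# Count the chunks and measure the final, partially filled one
set -- x??
chunk_count=$#
last_size=$(wc -c < xae)
echo "chunks: $chunk_count, last chunk: $last_size bytes"
```

With 10240-byte chunks, the 48894-byte file yields four full chunks (40960 bytes) and a fifth chunk holding the remaining 7934 bytes.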

4. Creating chunks with 'n' lines each - Option -l

With the -l option of split, we can set the number of lines each chunk should contain. The syntax is the same, with a different option this time. Let's split the file with each chunk having 1200 lines.

# Specify the number of lines -> 1200
[root@LinuxFault split_test]$ split -l 1200 testfile.txt

[root@LinuxFault split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae  xaf  xag  xah  xai

[root@LinuxFault split_test]$ wc -l *
10000 testfile.txt
 1200 xaa
 1200 xab
 1200 xac
 1200 xad
 1200 xae
 1200 xaf
 1200 xag
 1200 xah
  400 xai
20000 total

# Verify their contents
[root@LinuxFault split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10
[root@LinuxFault split_test]$ tail xai
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
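The chunk count seen above (8 full chunks plus one of 400 lines) follows from simple arithmetic: the number of chunks is the line total divided by the lines per chunk, rounded up. A small sketch of that calculation, with variable names chosen only for illustration:

```shell
total_lines=10000
lines_per_chunk=1200

# Integer ceiling division: (a + b - 1) / b
chunks=$(( (total_lines + lines_per_chunk - 1) / lines_per_chunk ))

# Lines left over for the final chunk
last=$(( total_lines - (chunks - 1) * lines_per_chunk ))
echo "$chunks chunks, last one has $last lines"
```

For 10000 lines and 1200 lines per chunk this gives 9 chunks with 400 lines in the last one, matching the wc -l listing.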

5. Numeric suffixes - Option -d

We have seen that the chunk names are alphabetical, of the form x-- where each - is a letter. We can switch to digits, so that the names read x00, x01 and so on (it makes more sense), using option -d as below.

# Numeric Suffixes
[root@LinuxFault split_test]$ split -d testfile.txt

[root@LinuxFault split_test]$ ls
testfile.txt  x00  x01  x02  x03  x04  x05  x06  x07  x08  x09

6. Suffix length - Option -a

We can also change the suffix length using option -a, so that x01 reads as x0001 if we specify a suffix length of 4. Let's check this.

# Suffix Length = 4
[root@LinuxFault split_test]$ split -d -a 4 testfile.txt

[root@LinuxFault split_test]$ ls
testfile.txt  x0000  x0001  x0002  x0003  x0004  x0005  x0006  x0007  x0008  x0009
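The suffix length also limits how many distinct chunk names are available: each position holds one of 10 digits with -d, or one of 26 letters with the default alphabetic suffixes. The sketch below only computes that upper bound for a fixed length; how split reacts when the names run out may differ between versions, so that part is left out.

```shell
suffix_len=4

# Multiply once per suffix position: 10 choices for digits, 26 for letters
numeric_max=1
alpha_max=1
i=0
while [ "$i" -lt "$suffix_len" ]; do
    numeric_max=$((numeric_max * 10))
    alpha_max=$((alpha_max * 26))
    i=$((i + 1))
done
echo "numeric: $numeric_max names, alphabetic: $alpha_max names"
```

With a suffix length of 4, numeric suffixes run from x0000 to x9999 (10000 names), while alphabetic ones run from xaaaa to xzzzz (456976 names).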

That's it! Thank you.

