pdbkeyedsort - sort “mostly sorted” large FSDB files using a double pass

pdbkeyedsort sorts large, “mostly sorted” FSDB files using two passes over the input. It reads the file twice, sorting the data by the column specified via the -c/--column option. During the first pass, it counts the rows per key so that it knows which lines it must hold in memory during the second pass. During the second pass, it stores in memory only the lines that are out of order. This can greatly reduce memory use when the data is already in a fairly sorted state (which is common for the output of map/reduce operations such as Hadoop). The cost is that the entire dataset must be read twice, so data cannot be passed in on stdin; a filename must be specified instead. The output, though, may go to stdout.
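The two-pass idea can be sketched in a few lines of Python. This is only an illustration of the technique, not pdbkeyedsort's actual implementation: the function name `keyed_sort` and its parameters are hypothetical, and it assumes tab-separated rows with no blank lines. It skips lines starting with `#` rather than rewriting the FSDB header.

```python
import sys
from collections import Counter, defaultdict


def keyed_sort(path, key_index, convert=str, sep="\t", out=sys.stdout):
    """Write the data rows of `path` to `out`, ordered by the key column.

    Returns the number of rows that had to be cached in memory.
    """

    def rows():
        # Re-open the file for each pass; skip "#" header and comment lines.
        with open(path) as handle:
            for line in handle:
                if not line.startswith("#"):
                    yield line.rstrip("\n").split(sep)

    # Pass 1: count rows per key. Only the counts are held in memory.
    remaining = Counter(convert(row[key_index]) for row in rows())
    pending = iter(sorted(remaining))
    current = next(pending, None)  # smallest key not yet fully written

    cache = defaultdict(list)  # out-of-order rows, grouped by key
    cached = 0

    # Pass 2: write rows for the current key at once, cache the rest.
    for row in rows():
        key = convert(row[key_index])
        if key != current:
            cache[key].append(row)
            cached += 1
            continue
        out.write(sep.join(row) + "\n")
        remaining[current] -= 1
        # Once a key is exhausted, move to the next one and flush any
        # rows that were cached for it earlier in the pass.
        while current is not None and remaining[current] == 0:
            current = next(pending, None)
            for held in cache.pop(current, []):
                out.write(sep.join(held) + "\n")
                remaining[current] -= 1
    return cached
```

Because the file is opened once per pass, the sketch needs a real path rather than a stream, which mirrors the stdin restriction described above. Pass `convert=float` to compare a numeric key column by value rather than as text.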

Example input (myfile.fsdb):

#fsdb -F s col1:l two:a andthree:d
1	key1	42.0
2	key2	123.0
3	key1	90.2

Example command usage

The -v flag reports how many lines were cached in memory. In general, you want this fraction to be small to conserve memory. In the example below, pdbkeyedsort only needed to cache one row (the second) of the file above.

$ pdbkeyedsort -c andthree -v myfile.fsdb

Example output

#fsdb -F t col1:l two andthree:d
1	key1	42.0
3	key1	90.2
2	key2	123.0
cached 1/3 lines