I was creating a new filesystem on a NetBSD box today, and wondered about the appropriate value for the “average file size” parameter. The first question I had was “what is the average filesize in my data set”, which I figured I could answer, since I had some representative data handy. I put together a quick one-liner to answer this question:

find /export -type f -print | perl -ne 'chomp; $count++; $total += (stat())[7]; END { print "$count files $total bytes total ", $total/$count, " byte average\n"; }'

After running this, I was surprised at just how large the “average” size was, which led me to wonder just which average they were looking for here: mean or median? While I was at it, I decided to calculate the mode as well.

The one-liner evolved to:

find /export -type f -print | perl -ne 'chomp; $count++; $size = (stat())[7]; push @sizes, $size; $total += $size; $sizes{int(($size+511)/512)}++; print "$count\r" if ($count % 2048 == 0); END { $median = (sort @sizes)[$#sizes / 2]; $mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; print "$count files $total bytes total ", $total/$count, " byte mean $median byte median $mode byte mode\n"; } BEGIN { $| = 1; }'

We can make this a bit more legible:

find /export -type f -print | perl -ne '\
      chomp; \
      $count++; \
      $size = (stat())[7]; \
      push @sizes, $size; \
      $total += $size; \
      $sizes{int(($size+511)/512)}++; \
      print "$count\r" if ($count % 2048 == 0); \
BEGIN { $| = 1; } \
END { $median = (sort @sizes)[$#sizes / 2]; \
      $mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; \
      print "$count files $total bytes total ", $total/$count, \
            " byte mean $median byte median $mode byte mode\n"; }'

Let’s take it line by line. First, the shell pipeline find /export -type f -print | perl -ne 'stuff' recursively descends into the /export directory and prints the names of all regular files found, one per line. This output is piped into Perl, which processes the commands in the -e 'stuff' block once per line due to the -n flag.

The heavy lifting is done in the Perl program:

Gathering the Information

chomp; strips the trailing newline from the input line.

$count++; counts the number of lines.

$size = (stat())[7]; gets the size in bytes of the input file.

push @sizes, $size; saves the size in a list of all file sizes seen.

$total += $size; adds up the total number of bytes seen.

$sizes{int(($size+511)/512)}++; converts the size in bytes into a size in 512-byte blocks, and counts the number of files with a given block count. This creates a histogram of file sizes using 512-byte bins.

Showing Progress

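print "$count\r" if ($count % 2048 == 0); prints a progress indication for every 2048 files processed. The trailing \r returns the cursor to the start of the line, so each update overwrites the previous one.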

BEGIN { $| = 1; } is a BEGIN block, which means it is executed once at the beginning of the program. It sets the Perl built-in variable $| to a true value, which turns off output buffering so that output is displayed immediately. Without this, the progress indication would sit in the buffer and would not appear until it had been output hundreds of times.

Calculating Results

Results are reported in the END {} block, which, as you may guess, is executed once at the end of the program. It starts with a few calculations:

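$median = (sort @sizes)[$#sizes / 2]; sorts the list of sizes seen (sort @sizes), then selects the middle item from the sorted list. $#sizes is the index of the last element in the list, so half of that (truncated to an integer) is the index of the middle element. One caveat: with no comparison routine, sort compares its elements as strings, so sizes such as 9 and 10 end up in the wrong order. A true numeric median needs sort { $a <=> $b } @sizes in place of sort @sizes.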

$mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; calculates the mode (the most frequently-occurring item). Working from the inside out, keys %sizes gets a list of the keys from the %sizes hash. Then sort { $sizes{$b} <=> $sizes{$a} } keys %sizes sorts those keys by hash value in descending order, so that the most-frequently occurring element comes first. The first element is accessed at index 0, then multiplied by 512 to convert back to bytes.

Sorting The Data: The sort is done by passing a custom comparison routine { $sizes{$b} <=> $sizes{$a} } to the built-in sort function.

Perl’s sort will call this routine repeatedly with pairs of list elements in $a and $b. The routine should return 0 if the elements are equal, less than 0 if $a should come first, or greater than 0 if $b should come first.

The default comparison is $a cmp $b, which is a lexical (string) sort. Since our data is numeric, we could use the numeric equivalent $a <=> $b, which in this case would sort the list in ascending order by size. Reversing the comparison to $b <=> $a sorts in descending order by size. Finally, using the hash values rather than the keys sorts by frequency of occurrence.

Output Results

Now it’s a simple matter of displaying the results:

print "$count files $total bytes total ", $total/$count, " byte mean $median byte median $mode byte mode\n";