I was creating a new filesystem on a NetBSD box today, and wondered about the appropriate value for the “average file size” parameter. The first question I had was “what is the average filesize in my data set”, which I figured I could answer, since I had some representative data handy. I put together a quick one-liner to answer this question:
find /export -type f -print | perl -ne 'chomp; $count++; $total += (stat())[7]; END { print "$count files $total bytes total ", $total/$count, " byte average\n"; }'
After running this, I was surprised at just how large the “average” size was, which led me to wonder just which average they were looking for here: mean or median? While I was at it, I decided to calculate the mode as well.
The one-liner evolved to:
find /export -type f -print | perl -ne 'chomp; $count++; $size = (stat())[7]; push @sizes, $size; $total += $size; $sizes{int(($size+511)/512)}++; print "$count\r" if ($count % 2048 == 0); END { $median = (sort { $a <=> $b } @sizes)[$#sizes / 2]; $mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; print "$count files $total bytes total ", $total/$count, " byte mean $median byte median $mode byte mode\n"; } BEGIN { $| = 1; }'
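We can make this a bit more legible (the trailing backslashes continue the quoted string across lines in csh-style shells; in a Bourne-style shell such as sh or bash, omit them):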
find /export -type f -print | perl -ne '\
chomp; \
$count++; \
$size = (stat())[7]; \
push @sizes, $size; \
$total += $size; \
$sizes{int(($size+511)/512)}++; \
print "$count\r" if ($count % 2048 == 0); \
BEGIN { $| = 1; } \
END { $median = (sort { $a <=> $b } @sizes)[$#sizes / 2]; \
$mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; \
print "$count files $total bytes total ", $total/$count, \
" byte mean $median byte median $mode byte mode\n"; }'
Let’s take it line by line. First, the shell pipeline find /export -type f -print | perl -ne 'stuff' recursively descends into the /export directory and prints the names of all regular files found, one per line. This output is piped into Perl, which processes the commands in the -e 'stuff' block once per line due to the -n flag.
The heavy lifting is done in the Perl program:
Gathering the Information
chomp; strips the trailing newline from the input line.
$count++; counts the number of lines.
$size = (stat())[7]; gets the size in bytes of the input file.
push @sizes, $size; saves the size in a list of all file sizes seen.
$total += $size; adds up the total number of bytes seen.
$sizes{int(($size+511)/512)}++; converts the size in bytes into a size in 512-byte blocks, and counts the number of files with a given block count. This creates a histogram of file sizes using 512-byte bins.
Showing Progress
print "$count\r" if ($count % 2048 == 0); prints a progress indication for every 2048 files processed. The trailing \r returns the cursor to the start of the line, so each update overwrites the previous one.
BEGIN { $| = 1; } is a BEGIN block, which means it is executed once at the beginning of the program. It sets the Perl built-in variable $| to a true value, so that output is displayed immediately. Without this, the output would be buffered and the progress indication would not appear until it had been printed hundreds of times.
Calculating Results
Results are reported in the END {} block, which, as you may guess, is executed once at the end of the program. It starts with a few calculations:
$median = (sort { $a <=> $b } @sizes)[$#sizes / 2]; sorts the list of sizes seen numerically, then selects the middle item from the sorted list ($#sizes is the index of the last element in the list; we use half of that as the index to get the middle element). The explicit { $a <=> $b } comparison matters here: a bare sort @sizes compares the sizes as strings, which puts 1000 before 9 and can select the wrong element.
$mode = 512 * (sort { $sizes{$b} <=> $sizes{$a} } keys %sizes)[0]; calculates the mode (the most frequently-occurring item). Working from the inside out, keys %sizes gets a list of the keys from the %sizes hash. Then sort { $sizes{$b} <=> $sizes{$a} } keys %sizes sorts those keys by hash value in descending order, so that the most-frequently occurring element comes first. The first element is accessed at index 0, then multiplied by 512 to convert back to bytes.
Sorting The Data: The sort is done by passing a custom comparison routine { $sizes{$b} <=> $sizes{$a} } to the built-in sort function.
Perl’s sort will call this routine repeatedly with pairs of list elements in $a and $b. The routine should return 0 if the elements are equal, less than 0 if $a should come first, or greater than 0 if $b should come first.
The default comparison is $a cmp $b, which is a lexical sort. Since our data is numeric, we use the numeric equivalent $a <=> $b, which in this case would sort the list in ascending order by size. Reversing the comparison to $b <=> $a sorts in descending order by size. Finally, using the hash values rather than the keys sorts by frequency of occurrence.
Output Results
Now it’s a simple matter of displaying the results:
print "$count files $total bytes total ", $total/$count, " byte mean $median byte median $mode byte mode\n"; }'