<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Carpé Cocoa &#187; Perl</title>
	<atom:link href="http://carpe-cocoa.com/category/perl/feed/" rel="self" type="application/rss+xml" />
	<link>http://carpe-cocoa.com</link>
	<description>My journey into iPhone development</description>
	<lastBuildDate>Tue, 24 Nov 2009 21:03:31 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Finding Average File Size with Perl</title>
		<link>http://carpe-cocoa.com/2008-10-09/finding-average-file-size-with-perl/</link>
		<comments>http://carpe-cocoa.com/2008-10-09/finding-average-file-size-with-perl/#comments</comments>
		<pubDate>Thu, 09 Oct 2008 20:22:47 +0000</pubDate>
		<dc:creator>Frank</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.szczerba.net/2008-10-09/finding-average-file-size-with-perl/</guid>
		<description><![CDATA[I was creating a new filesystem on a NetBSD box today, and wondered about the appropriate value for the &#8220;average file size&#8221; parameter. The first question I had was &#8220;what is the average filesize in my data set&#8221;, which I figured I could answer, since I had some representative data handy. I put together a [...]]]></description>
			<content:encoded><![CDATA[<p>I was creating a new filesystem on a NetBSD box today, and wondered about the appropriate value for the &#8220;average file size&#8221; parameter. The first question I had was &#8220;what is the average filesize in my data set&#8221;, which I figured I could answer, since I had some representative data handy. I put together a quick one-liner to answer this question:</p>
<p><code>find /export -type f -print | perl -ne 'chomp; $count++; $total += (stat())[7]; END { print "$count files $total bytes total ", $total/$count, " byte average\n"; }'</code></p>
<p>After running this, I was surprised at just how large the &#8220;average&#8221; size was, which led me to wonder just <em>which</em> average they were looking for here: <a href="http://en.wikipedia.org/wiki/Arithmetic_mean">mean</a> or <a href="http://en.wikipedia.org/wiki/Median">median</a>? While I was at it, I decided to calculate the <a href="http://en.wikipedia.org/wiki/Mode_(statistics)">mode</a> as well.</p>
<p><span id="more-46"></span></p>
<p>The one-liner evolved to:</p>
<p><code>find /export -type f -print | perl -ne 'chomp; $count++; $size = (stat())[7]; push @sizes, $size; $total += $size; $sizes{int(($size+511)/512)}++; print "$count\r" if ($count % 2048 == 0); END { $median = (sort { $a &lt;=&gt; $b } @sizes)[$#sizes / 2]; $mode = 512 * (sort { $sizes{$b} &lt;=&gt; $sizes{$a} } keys %sizes)[0]; print "$count files $total bytes total ", $total/$count, " byte mean $median byte median $mode byte mode\n"; } BEGIN { $| = 1; }'<br />
</code></p>
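<p>We can make this a bit more legible by splitting it across lines. The trailing backslashes are what csh-style shells need in order to continue a quoted string onto the next line; in sh or bash, leave them out, because a single-quoted string can already span lines and the backslashes would be passed through to Perl:</p>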
<pre><code>find /export -type f -print | perl -ne '\
      chomp; \
      $count++; \
      $size = (stat())[7]; \
      push @sizes, $size; \
      $total += $size; \
      $sizes{int(($size+511)/512)}++; \
      print &#34;$count\r&#34; if ($count % 2048 == 0); \
BEGIN { $| = 1; } \
END { $median = (sort { $a &lt;=&gt; $b } @sizes)[$#sizes / 2]; \
      $mode = 512 * (sort { $sizes{$b} &lt;=&gt; $sizes{$a} } keys %sizes)[0]; \
      print &#34;$count files $total bytes total &#34;, $total/$count, \
            &#34; byte mean $median byte median $mode byte mode\n&#34;; }'</code></pre>
<p>Let&#8217;s take it line by line. First, the shell pipeline <code>find /export -type f -print | perl -ne &#39;<i>stuff</i>&#39;</code>  recursively descends into the /export directory and prints the names of all regular files found, one per line. This output is piped into Perl, which processes the commands in the <code>-e &#39;<i>stuff</i>&#39;</code> block once per line due to the <code>-n</code> flag.</p>
<p>The heavy lifting is done in the Perl program:</p>
<p><strong>Gathering the Information</strong></p>
<p><code>chomp;</code> strips the trailing newline from the input line.</p>
<p><code>$count++;</code> counts the number of lines.</p>
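<p><code>$size = (stat())[7];</code> gets the size in bytes of the input file. With no argument, <code>stat</code> operates on <code>$_</code>, which holds the current file name, and element 7 of the list it returns is the size in bytes.</p>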
<p><code>push @sizes, $size;</code> saves the size in a list of all file sizes seen.</p>
<p><code>$total += $size;</code> adds up the total number of bytes seen.</p>
<p><code>$sizes{int(($size+511)/512)}++;</code> converts the size in bytes into a size in 512-byte blocks, and counts the number of files with a given block count. This creates a histogram of file sizes using 512-byte bins.</p>
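<p>As a rough illustration of the binning step, here is a minimal sketch that applies the same round-up formula to a handful of made-up sizes. The helper name <code>blocks_for</code> and the sample values are assumptions for the example, not part of the one-liner.</p>
<pre><code>use strict;
use warnings;

# Round a byte count up to the number of 512-byte blocks it occupies.
sub blocks_for {
    my ($size) = @_;
    return int(($size + 511) / 512);
}

# Build a histogram: block count =&gt; number of files with that block count.
# The sizes below are hypothetical sample data.
my %histogram;
$histogram{ blocks_for($_) }++ for (0, 1, 512, 513, 700, 1024);

# Print each bin in ascending block order.
for my $blocks (sort { $a &lt;=&gt; $b } keys %histogram) {
    print "$blocks block(s): $histogram{$blocks} file(s)\n";
}</code></pre>
<p>With these sample sizes the empty file lands in bin 0, the 1-byte and 512-byte files in bin 1, and the remaining three in bin 2.</p>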
<p><strong>Showing Progress</strong></p>
<p><code>print &#34;$count\r&#34; if ($count % 2048 == 0);</code> prints a progress indication for every 2048 files processed.</p>
<p><code>BEGIN { $| = 1; }</code> is a BEGIN block, which means it is executed once at the beginning of the program. It sets the Perl built-in variable <code>$|</code> to a true value, which turns off output buffering so that output is displayed immediately. Without this, the progress indication would not appear until it had been output hundreds of times.</p>
<p><strong>Calculating Results</strong></p>
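<p>Results are reported in the <code>END {}</code> block, which, as you may guess, is executed once at the end of the program. It starts with a few calculations:</p>
<p><code>$median = (sort { $a &lt;=&gt; $b } @sizes)[$#sizes / 2];</code> sorts the list of sizes seen in ascending numeric order (<code>sort { $a &lt;=&gt; $b } @sizes</code>), then selects the middle item from the sorted list (<code>$#sizes</code> is the index of the last element in the list; we use half of that as the index to get the middle element, and Perl truncates a fractional index to an integer). The explicit numeric comparison matters: a bare <code>sort @sizes</code> compares the sizes as strings, which would put 1000 before 9 and give the wrong median.</p>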
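<p>To see why the comparison used for the median matters, here is a small sketch with made-up sizes. The <code>median</code> helper and the sample list are assumptions for illustration; like the one-liner, it takes the lower of the two middle elements when the count is even.</p>
<pre><code>use strict;
use warnings;

# Median of a list of numbers, using a numeric ascending sort.
sub median {
    my @sorted = sort { $a &lt;=&gt; $b } @_;
    return $sorted[ int($#sorted / 2) ];
}

my @sizes = (2, 10, 9, 100, 30);    # hypothetical sample data

# Numeric order is 2 9 10 30 100, so the middle element is 10.
print "numeric median: ", median(@sizes), "\n";

# Default (string) order is 10 100 2 30 9, so the middle element is 2.
print "string-sorted middle: ", (sort @sizes)[ int($#sizes / 2) ], "\n";</code></pre>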
<p><code>$mode = 512 * (sort { $sizes{$b} &lt;=&gt; $sizes{$a} } keys %sizes)[0];</code> calculates the mode (the most frequently occurring item), here the most common size in 512-byte blocks. Working from the inside out, <code>keys %sizes</code> gets a list of the keys from the %sizes hash. Then <code>sort { $sizes{$b} &lt;=&gt; $sizes{$a} } keys %sizes</code> sorts those keys by hash value in descending order, so that the most frequently occurring element comes first. The first element is accessed at index 0, then multiplied by 512 to convert back to bytes.</p>
<blockquote>
<p><strong>Sorting The Data:</strong> The sort is done by passing a custom comparison routine <code>{&nbsp;$sizes{$b}&nbsp;&lt;=&gt;&nbsp;$sizes{$a}&nbsp;}</code> to the built-in <code>sort</code> function.</p>
<p>Perl&#8217;s <code>sort</code> will call this routine repeatedly with pairs of list elements in <code>$a</code> and <code>$b</code>. The routine should return 0 if the elements are equal, less than 0 if <code>$a</code> should come first, or greater than 0 if <code>$b</code> should come first.</p>
<p>The default comparison is <code>$a&nbsp;cmp&nbsp;$b</code>, which is a lexical sort. Since our data is numeric, we could use the numeric equivalent <code>$a&nbsp;&lt;=&gt;&nbsp;$b</code>, which in this case would sort the list in ascending order by size. Reversing the comparison to <code>$b&nbsp;&lt;=&gt;&nbsp;$a</code> sorts in descending order by size. Finally, using the hash values rather than the keys sorts by frequency of occurrence.</p>
</blockquote>
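<p>The sort-by-frequency idea can be tried on its own with a tiny hash. This is only a sketch: the histogram contents are invented, and the variable names <code>@by_frequency</code> and <code>$mode_blocks</code> are not from the one-liner.</p>
<pre><code>use strict;
use warnings;

# Hypothetical histogram: block count =&gt; number of files of that size.
my %sizes = (1 =&gt; 40, 2 =&gt; 75, 8 =&gt; 12, 16 =&gt; 3);

# Sort the keys by their counts, largest count first.
my @by_frequency = sort { $sizes{$b} &lt;=&gt; $sizes{$a} } keys %sizes;

# The first key is the most common block count; convert it back to bytes.
my $mode_blocks = $by_frequency[0];

print "keys by frequency: @by_frequency\n";
print "mode: ", 512 * $mode_blocks, " bytes\n";</code></pre>
<p>Here the keys come out as 2, 1, 8, 16 and the mode is 1024 bytes. If two bins had the same count, which one sorts first would not be guaranteed.</p>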
<p><strong>Output Results</strong></p>
<p>Now it&#8217;s a simple matter of displaying the results:</p>
<p><code>print &#34;$count files $total bytes total &#34;, $total/$count, &#34; byte mean $median byte median $mode byte mode\n&#34;; }'</code></p>
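<p>For anyone who would rather keep this around as a script, the same steps can be sketched as a small program. This is one possible arrangement, not a drop-in replacement: the <code>summarize</code> helper and the script name used below are assumptions, it omits the progress counter, and it adds two guards the one-liner does not have (skipping files that cannot be statted, and not dividing by zero on empty input).</p>
<pre><code>#!/usr/bin/perl
use strict;
use warnings;

# Summarise a list of file sizes: count, total, mean, median and mode.
# The mode is reported as the upper edge of the most common 512-byte bin.
sub summarize {
    my @sizes = @_;
    return unless @sizes;    # empty input: return nothing, avoid dividing by zero

    my ($total, %bins) = (0);
    for my $size (@sizes) {
        $total += $size;
        $bins{ int(($size + 511) / 512) }++;
    }

    my @sorted = sort { $a &lt;=&gt; $b } @sizes;
    my ($mode_bin) = sort { $bins{$b} &lt;=&gt; $bins{$a} } keys %bins;

    return (
        count  =&gt; scalar @sizes,
        total  =&gt; $total,
        mean   =&gt; $total / @sizes,
        median =&gt; $sorted[ int($#sorted / 2) ],
        mode   =&gt; 512 * $mode_bin,
    );
}

# Read file names from standard input, one per line, as produced by find.
my @sizes;
while (my $name = &lt;STDIN&gt;) {
    chomp $name;
    my $size = (stat($name))[7];
    push @sizes, $size if defined $size;    # skip files that cannot be statted
}

my %s = summarize(@sizes);
if (%s) {
    print "$s{count} files $s{total} bytes total $s{mean} byte mean ",
          "$s{median} byte median $s{mode} byte mode\n";
}
else {
    print "no files\n";
}</code></pre>
<p>Saved under a name of your choosing, say <code>filesize-stats.pl</code>, it would be run the same way as the one-liner: <code>find /export -type f -print | perl filesize-stats.pl</code>.</p>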
]]></content:encoded>
			<wfw:commentRss>http://carpe-cocoa.com/2008-10-09/finding-average-file-size-with-perl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
