How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question

A more recent version of this article can be found here.

For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes. So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases. The sites themselves were not chosen for any particular reason, although we looked for a minimum of 10,000 native files on each. We also chose not to use several larger sites where clients used text files as substitutes for the original natives.

Our focus for the study was on standard office files, such as Word, Excel, PowerPoint, PDFs and email. These are generally the focus of most review and discovery efforts and seem most important to our inquiry. I will discuss several other file types a bit later in this report.

I should also note that the files used in our study had already been processed and were loaded into Catalyst Insight, our discovery repository. Thus, they had been de-NISTed, de-duped (or not, depending on client requests), culled, reduced, etc. My point was not to exclude any particular part of the document population. Rather, the files removed at those stages rarely make it past processing and are typically not included in a review.

That said, here is a summary of what we found when we focused on the office and standard email files.

[Figure: Office file summary]

The weighted average for these files comes out to 3,124 docs per gigabyte. Not surprisingly, there are wide variations in the counts for different types of files. You can see these more easily when I chart the data.

[Figure: Office file chart]
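For readers who want to run the same kind of calculation on their own data, here is a minimal sketch of a weighted average in Python. It is not the query we ran in Insight. The function name and the per-type counts and sizes are hypothetical, chosen only to keep the arithmetic easy to follow. The point it illustrates is that a weighted average divides total documents by total gigabytes, which is not the same as averaging the per-type rates.

```python
# Hypothetical illustration: none of these counts or sizes come from the study.

def weighted_docs_per_gb(file_types):
    """Return total documents divided by total gigabytes.

    file_types maps a file-type label to a (doc_count, size_gb) pair.
    """
    total_docs = sum(count for count, _ in file_types.values())
    total_gb = sum(size_gb for _, size_gb in file_types.values())
    return total_docs / total_gb


# Made-up example: many small Word files, fewer but larger Excel files.
sample = {
    "word": (1_000, 0.25),   # 4,000 docs per GB on its own
    "excel": (500, 0.5),     # 1,000 docs per GB on its own
}

weighted = weighted_docs_per_gb(sample)

# The simple (unweighted) mean of the two per-type rates, for comparison.
per_type_rates = [count / size_gb for count, size_gb in sample.values()]
simple_mean = sum(per_type_rates) / len(per_type_rates)

print(f"weighted average: {weighted:,.0f} docs per GB")
print(f"simple mean of rates: {simple_mean:,.0f} docs per GB")
```

In this made-up example the weighted figure is lower than the simple mean because the larger Excel files account for most of the gigabytes.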

The average in 2014 was about 25% higher than our average in 2011 (3,124 versus 2,500 docs per gigabyte). Does that suggest a decrease in the size of the files we create today? I doubt it. People seem to be using more and more graphical elements in their PowerPoints and Word files, which would suggest larger file sizes and fewer docs per gigabyte. My guess is that we are seeing routine sampling variation here rather than some kind of trend.

EML and Text Files

We had several sites with EML files (about 2 million in total). These were extracted from Lotus Notes databases by one of our processing partners (our process would normally output to HTML rather than EML). An EML file is essentially a text file with some HTML formatting. Including the EML files will increase the averages for files per gigabyte.

We also had sites with a large number of text and HTML files. Some were chat logs, others were purchase orders and still others were product information. If your site has a lot of these kinds of files, you will see higher averages in your overall counts.

Here are the numbers we retrieved for these kinds of files.

[Figure: EML and text files]

Because of the large number of EML files, the weighted average here is much higher, at just over 15,500 files per gigabyte.

Image Files

Many sites had a large number of image files. In some cases they were small GIF files associated with logos or other graphics displaying on the email itself. It appears that these files were extracted from the email during processing and treated as separate records. In our processing, we don’t normally extract these types of files but rather leave them with the original email.

In any event, here are the numbers associated with these types of files.

[Figure: Image files]

We did not find many image files in our last study. I don’t know if these numbers reflect different collection practices, different case issues or simply the mix of matters we happened to sample in 2014.

In any event, I did not think it would be helpful to our inquiry to include image files (and especially GIF files) because they are not typically useful in a review. If you do include them, your number of docs per gigabyte will change accordingly.

What Did We Learn?

In many ways, the figures from this study confirmed my conclusions in 2011. Once again, it seems that the industry-accepted figure of 10,000 files per gigabyte is over the mark and even the lower range figure of 5,000 seems high. For the typical files being reviewed by our clients, our number is closer to 3,000.

That value changes depending on what files make up your review population. If your site has a large number of EML or text files, expect the averages to get higher. If, conversely, you have a lot of Excel files, the average can drop sharply.
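To make the effect of the file mix concrete, here is a rough sketch in Python. The per-type rates and document counts are hypothetical round numbers, not figures from our study, and the function name is mine. The sketch converts each group's document count into gigabytes using that group's docs-per-gigabyte rate, then divides total documents by total gigabytes.

```python
# Hypothetical illustration: rates and counts are invented round numbers.

def blended_docs_per_gb(mix):
    """Return the overall docs-per-GB rate for a mixed population.

    mix is a list of (doc_count, docs_per_gb) pairs, one per file group.
    """
    total_docs = sum(count for count, _ in mix)
    # Each group occupies doc_count / docs_per_gb gigabytes.
    total_gb = sum(count / rate for count, rate in mix)
    return total_docs / total_gb


office_only = [(8_000, 4_000)]                    # office files alone
with_text = office_only + [(2_000, 16_000)]       # add small text-like files
with_sheets = office_only + [(2_000, 500)]        # add large spreadsheets

for label, mix in [
    ("office only", office_only),
    ("plus text files", with_text),
    ("plus spreadsheets", with_sheets),
]:
    print(f"{label}: {blended_docs_per_gb(mix):,.0f} docs per GB")
```

With these assumed inputs, adding the small text files pushes the blended rate up and adding the large spreadsheets pulls it down much harder, because the spreadsheets take up far more gigabytes per document.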

In my discussion so far, I broke out the different file types in logical groupings. If we include all of the different file types in our weighted averages, the numbers come out like this:

[Figure: All file types]

Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower range of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, which is roughly midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn’t 10,000 or anything close. We use a figure closer to 3,000 for our estimates.
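The practical stakes are easy to see with a little arithmetic. The sketch below, again in Python, estimates a document count from a data volume under different docs-per-gigabyte assumptions. The 50-gigabyte collection size and the function name are hypothetical. The three rates are the figures discussed in this article: roughly 3,000 for office files, 5,000 as the low end of the industry range, and 10,000 as the commonly quoted figure.

```python
# Hypothetical illustration: the 50 GB collection size is an assumed example.

def estimate_doc_count(size_gb, docs_per_gb):
    """Estimate how many documents a collection of size_gb contains."""
    return round(size_gb * docs_per_gb)


collection_gb = 50  # assumed collection size for the example

for rate in (3_000, 5_000, 10_000):
    docs = estimate_doc_count(collection_gb, rate)
    print(f"at {rate:,} docs per GB: about {docs:,} documents")
```

For the same 50 gigabytes, the estimate ranges from 150,000 to 500,000 documents depending on which rate you choose, which is why the choice matters for budgeting a review.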

 


[1] I wish to particularly thank Greg Berka, Catalyst’s director of application support, for helping to assemble the data used in this article. He also assisted in the 2011 study.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.

Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series, Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on using computers in litigation technology. At the same time, he wrote, How to Prepare for Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote, Lawyer’s Guide to Spreadsheets (Glasser Publishing 2000), and, Lawyer’s Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in their special issue on the “Top Fifty Big Law Innovators” in the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London’s CityTech magazine named him one of the “Top 100 Global Technology Leaders.” In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA’s Law Practice Management Section. For many years, he was editor-in-chief of the ABA’s Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on legal technology to audiences on four of the five continents.

In his spare time, you will find him competing on the national equestrian show jumping circuit.

6 thoughts on “How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question”

  1. Joshua

    Why don’t you guys add share buttons to your blog posts? You put up great content. You should make it easier to share with everyone. Thanks.

  2. William Kellermann

    An interesting conundrum with a simple answer. As with everything it is important to characterize what you are measuring and also look at why.

    For years, Ralph Losey published a ‘How much’ listing on the sidebar of his eDiscovery Team blog. One of the measures listed was for a one gigabyte PST of Microsoft Outlook email and attachments. The item count was listed as 12,000 items (9,000 emails and 3,000 attachments) which was considered fairly accurate. (I am told this metric was purportedly derived from a study done by Applied Discovery back before its acquisition by Lexis-Nexis.) Assuming that is the source of the industry perception, that metric is a far cry from what you are measuring in your study.

    The other problem is your study is counting the horses after they’ve left the barn. Once reduced to a database, GB and item counts are only important to ongoing costs to host and manage data. They are only marginally helpful to tell us what we face at the front end of a preservation, collection or review exercise. The caveats (like GIFs as embedded objects that make their way to the database as a record) underscore the multi-faceted difficulty in measurement.

    The ‘12,000 documents’ measure is very useful to describe the burden of a GB of highly compressed email and attachments in a PST. It is worthless for anything else. The ‘2500 documents’ per GB measure for post-eDisco processed review data is also very useful for some purposes, and worthless for others. And is that GB size, size on disk, compressed or uncompressed?

    1. John Tredennick

      Thanks for writing, Bill:

      I have seen the sidebar on Ralph’s blog for some time. I will have to ask him where the information came from. One of the reasons for my study was the discrepancy between what we were seeing and what I would hear from others in the industry. I was curious to see what was what.

      I certainly agree with your point about the utility of this type of study. If someone had a large number of raw PST files and could take the time to process them, it would be interesting to see how the numbers looked.

      We measured the size of the individual files that come out of a PST. Thus, you can expect to get MSG files (which as you know actually hold both the email message and attachments) and the attachments themselves. Perhaps you could do some rough estimates on expected counts from the already-processed data but that is probably not something I will be taking on any time soon.

      We are measuring the file sizes on disk, which we record in Insight. You probably know better than I that these sizes differ between a Linux OS and Windows due to block sizes or the like. But that is beyond the scope of this inquiry.

