How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question


For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes.  So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs in 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases. The sites themselves were not chosen for any particular reason, although we looked for a minimum of 10,000 native files on each. We also chose not to use several larger sites where clients used text files as substitutes for the original natives.

Our focus for the study was on standard office files, such as Word, Excel, PowerPoint, PDFs and email. These are generally the focus of most review and discovery efforts and seem most important to our inquiry. I will discuss several other file types a bit later in this report.

I should also note that the files used in our study had already been processed and loaded into Catalyst Insight, our discovery repository. Thus, they had been de-NISTed, de-duped (or not, depending on client requests), culled and otherwise reduced. My point was not to exclude any particular part of the document population; rather, those kinds of files rarely make it past processing and are typically not included in a review.

That said, here is a summary of what we found when we focused on the office and standard email files.

[Table: Office File Summary]

The weighted average for these files comes out to 3,124 docs per gigabyte. Not surprisingly, there are wide variations in the counts for different types of files. You can see these more easily when I chart the data.

[Chart: Office Files]
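To make the arithmetic concrete, here is a minimal sketch of how a weighted average like this is computed: total documents divided by total gigabytes, rather than a simple mean of the per-type rates. The counts below are hypothetical round numbers for illustration, not the study's actual data.

```python
# Hypothetical per-file-type tallies (illustrative only -- not the study's data).
# Each entry maps a file type to (document_count, total_size_gb).
file_stats = {
    "Word":       (1_200_000, 350.0),
    "Excel":      (  600_000, 400.0),
    "PowerPoint": (  300_000, 250.0),
    "PDF":        (  900_000, 300.0),
    "Email":      (3_000_000, 700.0),
}

def docs_per_gb(stats):
    """Weighted average: total documents divided by total gigabytes."""
    total_docs = sum(docs for docs, _ in stats.values())
    total_gb = sum(gb for _, gb in stats.values())
    return total_docs / total_gb

print(f"{docs_per_gb(file_stats):,.0f} docs per gigabyte")  # -> 3,000 docs per gigabyte
```

Note that a file-heavy type with a low per-type rate (here, the hypothetical Excel at 600,000 docs over 400 GB, or 1,500 docs per gigabyte) pulls the weighted average well below the rates for lighter file types.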

The 2014 average is roughly 25% higher than our 2011 average (2,500 docs per gigabyte). Does that suggest a decrease in the size of the files we create today? I doubt it. People seem to be using more and more graphical elements in their PowerPoint and Word files, which would suggest larger file sizes and lower docs per gigabyte. My guess is that we are seeing routine sampling variation here rather than some kind of trend.

EML and Text Files

We had several sites with EML files (about 2 million in total). These were extracted from Lotus Notes databases by one of our processing partners (our process would normally output to HTML rather than EML). An EML file is essentially a text file with some HTML formatting. Including the EML files will increase the averages for files per gigabyte.

We also had sites with a large number of text and HTML files. Some were chat logs, others were purchase orders and still others were product information. If your site has a lot of these kinds of files, you will see higher averages in your overall counts.

Here are the numbers we retrieved for these kinds of files.

[Table: EML and Text Files]

Because of the large number of EML files, the weighted average here is much higher, at just over 15,500 files per gigabyte.

Image Files

Many sites had a large number of image files. In some cases they were small GIF files associated with logos or other graphics displayed in the email itself. It appears that these files were extracted from the email during processing and treated as separate records. In our processing, we don’t normally extract these types of files but rather leave them with the original email.

In any event, here are the numbers associated with these types of files.

[Table: Image Files]

We did not find many image files in our last study. I don’t know whether these numbers reflect different collection practices or different case issues, or whether such files simply happened to turn up in the 2014 matters.

I did not think it would be helpful to our inquiry to include image files (and especially GIF files), because they are not typically useful in a review. If you do include them, your docs-per-gigabyte numbers will be affected.

What Did We Learn?

In many ways, the figures from this study confirmed my conclusions in 2011. Once again, it seems that the industry-accepted figure of 10,000 files per gigabyte is over the mark and even the lower range figure of 5,000 seems high. For the typical files being reviewed by our clients, our number is closer to 3,000.

That value changes depending on what files make up your review population. If your site has a large number of EML or text files, expect the averages to get higher. If, conversely, you have a lot of Excel files, the average can drop sharply.

In my discussion so far, I broke out the different file types in logical groupings. If we include all of the different file types in our weighted averages, the numbers come out like this:

[Table: All File Types]

Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower end of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, roughly midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.
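The sensitivity to file mix can be sketched the same way. With hypothetical counts (again, not the study's actual data), even a modest gigabyte volume of tiny EML files pulls the overall weighted average up sharply, and removing them brings it back down:

```python
# Hypothetical tallies (illustrative only -- not the study's actual data).
# EML files are tiny, so a modest gigabyte total carries a huge document count.
file_stats = {
    "Office docs": (3_000_000, 1000.0),   # ~3,000 docs/GB
    "EML":         (2_000_000,  130.0),   # ~15,400 docs/GB
}

def docs_per_gb(stats):
    """Weighted average: total documents divided by total gigabytes."""
    total_docs = sum(docs for docs, _ in stats.values())
    total_gb = sum(gb for _, gb in stats.values())
    return total_docs / total_gb

def docs_per_gb_excluding(stats, excluded):
    """Recompute the weighted average with some file types removed."""
    kept = {ftype: v for ftype, v in stats.items() if ftype not in excluded}
    return docs_per_gb(kept)

print(round(docs_per_gb(file_stats)))                      # with EML    -> 4425
print(round(docs_per_gb_excluding(file_stats, {"EML"})))   # without EML -> 3000
```

The point of the sketch: a file type that makes up only about 11% of the gigabytes here accounts for 40% of the documents, which is why pulling EML out of the pool moves the average so much.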

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn’t 10,000 or anything close. We use a figure closer to 3,000 for our estimates.

 


[1] I wish to particularly thank Greg Berka, Catalyst’s director of application support, for helping to assemble the data used in this article. He also assisted in the 2011 study.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.