Shedding Light on an E-Discovery Mystery: How Many Documents in a Gigabyte?


In his article, “Accounting for the Costs of Electronic Discovery,” David Degnan states that conducting electronic discovery “may cost upwards of $30,000 per gigabyte.” (You can read Bob Ambrogi’s post about it here.) That is a lot of money for discovery, particularly considering that the number of gigabytes we are seeing per case seems to keep increasing.

Much of Degnan’s analysis turns on how many documents (files actually) you can expect to find in a gigabyte of data. As he points out, review costs make up nearly 60% of the total costs for e-discovery. If a reviewer can only get through an average of 50 documents per hour (as Degnan suggests), the number of documents likely to be found per gigabyte of data becomes important to understanding the costs of electronic discovery.
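To put that in perspective, here is a quick back-of-the-envelope sketch (in Python, purely for illustration) that converts the documents-per-gigabyte figures discussed below into review hours per gigabyte, using Degnan's 50-documents-per-hour pace:

```python
# Rough review-effort estimate: hours of attorney review per gigabyte,
# assuming Degnan's pace of 50 documents reviewed per hour.
DOCS_PER_HOUR = 50

for docs_per_gb in (5_000, 10_000, 25_000):
    hours_per_gb = docs_per_gb / DOCS_PER_HOUR
    print(f"{docs_per_gb:>6,} docs/GB -> {hours_per_gb:,.0f} review hours per GB")

# Output:
#  5,000 docs/GB -> 100 review hours per GB
# 10,000 docs/GB -> 200 review hours per GB
# 25,000 docs/GB -> 500 review hours per GB
```

At even modest hourly review rates, the difference between 100 and 500 review hours per gigabyte is enormous, which is why the documents-per-gigabyte figure matters so much.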

Degnan’s View

So, how many docs are there in a gig? In Table 3 of his article, Degnan provides the following range of figures:

Degnan cites the Clearwell cost calculator for the proposition that the range goes from 5,000 to 25,000 documents per GB. He also refers to an article by Chris Eagan and Glen Homer for the proposition that 10,000 documents per GB is the industry standard.

Neither estimate is backed by hard data but we have seen these kinds of ranges before. The EDRM group, for example, posted the following as “industry averages” for images and files per GB.

No basis for these ranges is given.

Dutton’s View

E-discovery consultant Cliff Dutton included this question in a survey he submitted to a number of law firms, corporations, consultants and software providers. Specifically, he asked: “What is the average number of documents (post culling) per GB collected from all sources?” See: eOPS 2010: Electronic Discovery Operational Parameters Survey (PDF posted with author’s permission).

Dutton’s survey resulted in figures that would support the low end of Degnan’s range. The mean (average) response to his survey was 5,244 documents per gigabyte. The median response was only a bit higher at 5,500 documents per gigabyte.

That leaves us with a pretty broad range. Dutton’s figure of 5,200 is only half the so-called industry standard of 10,000 documents per gigabyte. In turn, that figure is less than half of the upper limit of 25,000 documents per gigabyte suggested by Degnan.

So what’s the right number?

Catalyst Data

I have been interested in this question for quite some time but never did anything to pin down the answer. When pricing is discussed, clients are naturally interested in knowing how many documents to expect from their collection efforts. Since collections are typically measured in gigabytes, the “How many documents?” question comes up all the time. So, I decided to take a look at our own data.

We started by taking data from a handful of our cases. This survey was not meant to be scientific but we did look for cases with different types of data. For example, one of our clients sends us a lot of native PDF files. These are not scanned images but rather postscript files and thus their sizes tend to be small. Other cases involved Outlook, Word, Excel and the other Office-type files that you expect from e-discovery. We expect some variance among cases that hold different types of data.

We grabbed our initial data from nine cases, chosen pretty much at random. In total, the cases had just under 8 million files with a total of 1.6 terabytes of storage. Of course, I am not using real case names so I’ve labeled them cases 1 through 9.

I should also note that, following international standards bodies including NIST, we used 1,000 rather than 1,024 as the multiplier for converting bytes to gigabytes. Thus, we count 1 billion bytes as a gigabyte rather than 1,073,741,824 bytes. You can read about this debate here. If you prefer to use 1,024 as your multiplier, the difference is about 7 percent. Thus, 5,000 files per GB using the 1,000 multiplier would be about 5,370 files per GB based on a divisor of 1,024 × 1,024 × 1,024 (that is, bytes ÷ 1,073,741,824).
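For anyone who wants to check the conversion themselves, here is a small sketch (in Python, just for illustration) showing the two conventions and the roughly 7 percent gap between them:

```python
# Two common definitions of a "gigabyte":
GB_DECIMAL = 1_000_000_000   # 10**9 bytes (the NIST convention used in this article)
GB_BINARY = 1024 ** 3        # 1,073,741,824 bytes (the 1,024-based convention)

gap = GB_BINARY / GB_DECIMAL - 1
print(f"Gap between the two conventions: {gap:.1%}")   # about 7.4%

# Restating a files-per-GB figure under the 1,024-based convention:
files_per_decimal_gb = 5_000
files_per_binary_gb = files_per_decimal_gb * GB_BINARY / GB_DECIMAL
print(f"{files_per_decimal_gb:,} files per decimal GB is roughly "
      f"{files_per_binary_gb:,.0f} files per binary GB")
# Matches the ~5,370 figure above, within rounding.
```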

With all that behind us, here is what we found (with the cases sorted based on files per GB):

This seemed interesting. The bottom-line average (total files divided by total GBs) of 4,890 was on par with the 5,000-document low end of the “accepted” range reported by Degnan and nearly matched the 5,244 average found in Dutton’s survey. Our median was just a bit lower at 4,522. For those, like me, who almost failed statistics, the median value means that half the cases had more than 4,522 files per GB and half had fewer.
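For readers who want to see exactly how those two summary figures are computed, here is a minimal sketch. The per-case numbers are placeholders rather than our actual case data (which appears in the chart above); the point is the calculation: the bottom-line average pools total files over total gigabytes, while the median is taken over the per-case rates.

```python
from statistics import median

# Hypothetical (file_count, gigabytes) pairs for nine cases -- placeholders only;
# the actual per-case figures appear in the chart above.
cases = [
    (450_000, 300), (600_000, 250), (620_000, 200), (800_000, 200),
    (920_000, 200), (795_000, 150), (975_000, 150), (900_000, 100),
    (700_000, 50),
]

# "Bottom line" average: total files divided by total gigabytes.
pooled_average = sum(files for files, _ in cases) / sum(gb for _, gb in cases)

# Median of the per-case files-per-GB rates.
per_case_rates = sorted(files / gb for files, gb in cases)
median_rate = median(per_case_rates)

print(f"Pooled average: {pooled_average:,.0f} files per GB")
print(f"Median rate:    {median_rate:,.0f} files per GB")
```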

I found the disparity between the number of files per GB in the different cases interesting as well. Some cases had very low counts—1,500 files per GB—while others stretched as high as 14,000 files per GB. You can see the variation here:

Because Case 9 was the largest outlier, I took a look at the specific files stored there to see what I could learn about how file types played into these calculations.

This case had a lot of GIFs—possibly logos on email messages that were treated as separate attachments. It also had a lot of JPG files, which also could be logos. Should these be included in our calculations? I could argue the point either way but it does tell us that the composition of the files is an important factor.

The other types of files on this site had high files-per-GB counts as well. PDFs led the way at 21,000 files per GB, but counts for the Word documents were also high, at 12,000 files per GB. I suppose that is what makes it an outlier.

Case 8, the other site with a high files-per-GB count, was a PDF site. The client sent us postscript PDF files (native files rather than scanned images). You can see that the file sizes were rather small:

I was not surprised to see that the text files had a high files-per-GB rate. The postscript PDF files, which were largely emails, were also relatively small.

Case 1 was at the other end of the spectrum. The file distribution looked like this:

Across the board, the files per GB are dramatically lower. Short of analyzing the content of the individual files (or simply recognizing that file sizes vary with content), I don’t have an explanation for the variances. I include the case simply because it is interesting.

Here is a summary by file type across all nine sites:

Naturally, these numbers tie into the averages presented earlier. But this analysis also shows that the numbers can fall well below the 5,000 files per GB that Degnan suggested as the low end of the range. In particular, Excel and PowerPoint files average far lower than other file types.

Enlarging the Survey

With my curiosity piqued, I asked our team to look at additional cases. This time, we looked at 20 more cases of different sizes, again chosen pretty much at random. In total, the cases had just over 10 million files with a total of 3.8 terabytes of storage. Here is what we found:

This time, the values were quite a bit lower than in the first sample. The average number of files per GB (total files divided by total GBs) was about half of the 5,000-document low end of Degnan’s range, and roughly half the average Dutton reported. The median was a bit lower still, at 2,421.

Once again, we saw a range of values across the cases. You can see the variation here (you can also see that we presented this batch of cases in order):

Since it was relatively easy to do, I combined all of the data from the 29 cases to see what that would show me. Here is what I found:

Needless to say, these figures are quite a bit below the ones others have been throwing around. If the true average is closer to 3,300 files per GB, the industry standards will certainly need adjusting.
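The 3,300 figure is simply the pooled calculation applied to the combined samples. Here is a quick sketch using the approximate totals given above:

```python
# Approximate totals from the two samples described above.
files_first_sample = 7_800_000    # "just under 8 million" files
gb_first_sample = 1_600           # 1.6 terabytes

files_second_sample = 10_100_000  # "just over 10 million" files
gb_second_sample = 3_800          # 3.8 terabytes

combined_rate = (files_first_sample + files_second_sample) / (gb_first_sample + gb_second_sample)
print(f"Combined: {combined_rate:,.0f} files per GB")   # roughly 3,300
```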

Normalizing the Data

Given that we have some statistics expertise in our Search and Analytics Consulting Group, I thought I would do a little more analysis of the data we had uncovered. We started by ordering the data by size, from the lowest files per GB to the highest. Here is how it sorted out:

We then decided to remove outliers from the sample, as indicated in the chart. This is a common practice when statisticians review sample results: values that are unusually low or unusually high can skew the analysis and produce misleading results.

You can see the ones we removed (and agree or disagree with the practice). Doing so changed the figures to an average of 2,544 files per GB, with a median just a bit lower at 2,296.
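For those who like formulas, one common rule of thumb for flagging outliers is the interquartile-range (IQR) fence. Here is a sketch with hypothetical per-case rates, purely for illustration; I am not suggesting this is the exact rule reflected in the chart above.

```python
from statistics import mean, median, quantiles

# Hypothetical per-case files-per-GB rates, purely for illustration.
rates = [900, 1_200, 1_500, 1_800, 2_100, 2_300, 2_400, 2_600,
         2_900, 3_200, 3_600, 4_100, 4_500, 5_200, 9_800, 14_000]

q1, _, q3 = quantiles(rates, n=4)              # first and third quartiles
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = [r for r in rates if low_fence <= r <= high_fence]
removed = [r for r in rates if r < low_fence or r > high_fence]

print(f"Flagged as outliers: {removed}")
print(f"Trimmed mean:   {mean(kept):,.0f} files per GB")
print(f"Trimmed median: {median(kept):,.0f} files per GB")
```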

We also calculated the standard deviation to be 1,258 documents per GB. I am no expert on this, but the standard deviation measures how widely values spread around the mean and underlies the familiar bell curve. You can read more about standard deviation on Wikipedia. It is often denoted by the sigma character “σ” and is at the heart of the Six Sigma programs for reducing manufacturing defects.

The goal with a standard deviation analysis is to get a handle on how much variance you might expect when you survey future document populations. If your document population is fairly homogeneous and follows a normal distribution, you can expect that about 68% of the larger document population will fall within plus or minus one standard deviation of your mean (average). Within two standard deviations, that figure rises to roughly 95%.
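Applying those rules of thumb to the post-outlier figures above (a mean of 2,544 and a standard deviation of 1,258), and assuming for the moment that the population really is normally distributed, here is a quick sketch of where the bands would fall:

```python
from statistics import NormalDist

# Post-outlier figures reported above (files per GB).
MEAN = 2_544
SD = 1_258

print(f"Plus/minus 1 SD (about 68%): {MEAN - SD:,} to {MEAN + SD:,} files per GB")
print(f"Plus/minus 2 SD (about 95%): {MEAN - 2 * SD:,} to {MEAN + 2 * SD:,} files per GB")

# If the population really were normal with these parameters, the share of
# cases falling below 5,000 files per GB would be:
share_below_5000 = NormalDist(mu=MEAN, sigma=SD).cdf(5_000)
print(f"Share below 5,000 files per GB: {share_below_5000:.0%}")   # about 97%
```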

I am not in a position to say that the mean we calculated is representative or that e-discovery document populations are sufficiently homogeneous to follow a normal distribution. However, we can take the data we obtained and fit it to a normal distribution curve. It looks like this:

This graph suggests that the vast majority of the population, roughly 95 percent or more, will fall below about 5,000 documents per GB, with only the extreme outliers going beyond that.

So How Many Docs in a Gig?

What can we make of these surveys? They certainly suggest that the average number of files per GB is well under the 10,000 figure cited as the “industry standard,” let alone the higher numbers. Our belief is that the true figure is well below even the 5,000 files per GB that Cliff Dutton reported. If so, that could impact a lot of e-discovery estimates.

I am not a statistician, and the statistics professor we work with is in France on sabbatical. Nonetheless, we did look at over 18 million e-discovery files totaling over five terabytes of storage. That is a pretty solid sample from which to draw conclusions.

At the same time, let me be clear that the numbers will vary dramatically depending on the types of documents you have. With simple text files, which are rare in native e-discovery populations, the numbers could skyrocket. With postscript PDF files (converted email or short Word files for example), we saw values go well above the 10,000 mark. In contrast, with some of the other file types, PowerPoint and Excel for example, your numbers could go well below our calculated mean of 2,500 documents per GB.

Perhaps others have done more sophisticated analyses that they can share. I offer our figures just to get the discussion started. Let me know what your experience has been with your files.

I want to thank my fellow Catalyst employees, Greg Berka, Kevin Hughes and Nirupama Bhatt, for their help on this article.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named one of the top six “E-Discovery Trailblazers” by the American Lawyer, being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.