How Many Documents in a Gigabyte: 2016 Edition

Readers of our blog will know that I have a continuing interest in answering the perennial e-discovery question: “How many native documents are in a gigabyte?” I started thinking about this in 2011 and published my first article on the subject based on analysis of 18 million files. Challenging industry assumptions, which ran from 5,000 to as many as 15,000 documents per gigabyte, I concluded that the average across all files, based on that sample, was closer to 2,500.

In 2014, I decided to take another look at the question. For that round, I sampled from 8 million records pulled from Insight, our newer document repository platform. This time I came out with 4,500 docs in a gigabyte. Eschewing a declaration that file sizes had gotten smaller, which seemed counter-intuitive, I instead suggested we were seeing sampling variation and left it at that.

Later that year, and just for fun, I did simple math on some figures released by Barclay T. Blair for an article in Law Technology News, “Four Examples of Predictive Coding Success.” He reported on several successful predictive coding projects but happened also to provide information on numbers of documents and their respective gigabyte counts. In that case, the numbers for the three cases came to about 3,000 documents per gigabyte.

Last year I took another shot at the question, this time grabbing a larger sample of over 132 million files taken from 270 cases. This time the average came to 4,400 documents per gigabyte, with a range over time from a low of 3,809 to a high of 5,317 docs per gigabyte.

In that article, I also spent time looking at different file types across a number of cases. As you might expect, there was a wide variation in the number of docs per gigabyte depending on file type.

Figure: Docs per gigabyte by file type. PowerPoint files were the biggest in the survey.
How Many Docs Today?
This time, I decided to go big, broadening my study to just under 150 million documents pulled from over 350 different sites. The files consumed over 48 terabytes of data and consisted of all types of post-processed files. With a hat tip to Bill Kellerman, I was not trying to figure out how many docs might be in a strongly compressed gigabyte or could be stuffed into a PST container. These are interesting questions too, but not for today.

The number came to 3,058 documents per gigabyte in this sample. I find it interesting that this number is much closer to my 2011 sample and substantially lower than my findings in 2014 and 2015.
Pulling out a Big Case
Looking at the sample, we noticed that it included one particularly large case with just under 20 million records. We decided to sample just that case and found that its docs-per-gigabyte figure was noticeably lower than the norm: the sample of 18,065,024 files consumed 10.3 terabytes of file space, for an average of 1,740 docs per gigabyte.
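For readers who want to run the same arithmetic on their own sites, here is a minimal sketch in Python. The function name and the `gb_per_tb` parameter are my own illustrations, not anything from our platform. Because the 10.3-terabyte figure is rounded, and because a terabyte can be counted as 1,024 or 1,000 gigabytes, the sketch brackets the reported average rather than reproducing it exactly.

```python
def docs_per_gb(doc_count: int, size_tb: float, gb_per_tb: int = 1024) -> float:
    """Average documents per gigabyte for a collection sized in terabytes.

    gb_per_tb is an assumption: 1024 for binary units, 1000 for decimal.
    """
    return doc_count / (size_tb * gb_per_tb)


# Figures quoted in the article for the large case (size is rounded).
big_case_docs = 18_065_024
big_case_tb = 10.3

binary_rate = docs_per_gb(big_case_docs, big_case_tb, gb_per_tb=1024)
decimal_rate = docs_per_gb(big_case_docs, big_case_tb, gb_per_tb=1000)

print(f"Binary units:  {binary_rate:,.0f} docs per GB")
print(f"Decimal units: {decimal_rate:,.0f} docs per GB")
```

The two results land a little below and a little above 1,740, which is what you would expect from rounded inputs.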

Here is a look at the top 10 file types from this site (although about 8 million documents had been removed from the site between the date of the sample and the point when I pulled this graph).

Docs per gigabyte.

Removing this large case from the sample had an impact on the numbers. With a reduced sample of 131 million files consuming just over 38 terabytes of file space, the average came to 3,415 docs per gigabyte.
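The effect of pulling one case out of a pooled sample can be sketched as follows. This assumes the overall average is total documents divided by total gigabytes, so a very large case pulls the figure toward its own rate. The full-sample inputs below are round approximations of "just under 150 million" documents and "over 48 terabytes," and the 1,024 GB-per-TB conversion is an assumption, so the output only approximates the published averages.

```python
GB_PER_TB = 1024  # assumption: binary units


def pooled_rate(docs: float, size_tb: float) -> float:
    """Docs per gigabyte computed over a pooled sample (total docs / total GB)."""
    return docs / (size_tb * GB_PER_TB)


# Approximate full-sample figures (rounded from the article's wording).
full_docs, full_tb = 150_000_000, 48.0
# Large-case figures as quoted in the article.
big_docs, big_tb = 18_065_024, 10.3

rate_all = pooled_rate(full_docs, full_tb)
rate_rest = pooled_rate(full_docs - big_docs, full_tb - big_tb)

print(f"With the large case:    {rate_all:,.0f} docs per GB")
print(f"Without the large case: {rate_rest:,.0f} docs per GB")
```

Even with rounded inputs, both results come out within about one percent of the reported averages, and they show the same jump once the large case is excluded.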

So, How Many Docs per Gig?
Let’s recap my docs-per-gigabyte findings from the past half-decade:

  • 2011: 2,500.
  • 2014: 4,500.
  • 2014 (from Barclay T. Blair’s figures): 3,000.
  • 2015: 4,400.
  • 2016: 3,415 (3,058 with the large case included).

I have no magic way to normalize these. But if I had to bet, I would put the number at about 3,500 documents per gigabyte, with a range of between 3,000 and 4,000 across all file types. But if you get a case like the big one described above, you might see a far lower number, perhaps as low as 1,700 docs per gig. And, of course, if your site consists mostly of text and email files, the numbers could jump above 10,000.
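As a quick sanity check rather than a real normalization, the sketch below takes a plain unweighted mean of the five figures in the recap and then turns the 3,000 to 4,000 range into a document-count estimate for a collection of a given size. The `estimate_docs` helper and its parameter names are hypothetical, and the 100-gigabyte collection is just an example.

```python
from statistics import mean

# The five docs-per-gigabyte figures from the recap above.
recap_rates = [2500, 4500, 3000, 4400, 3415]

# Unweighted mean: a rough check only, since the samples differ hugely in size.
simple_mean = mean(recap_rates)


def estimate_docs(size_gb: float, low: int = 3000, mid: int = 3500,
                  high: int = 4000) -> dict:
    """Estimate document counts for a collection using a range of rates."""
    return {
        "low": round(size_gb * low),
        "mid": round(size_gb * mid),
        "high": round(size_gb * high),
    }


estimate = estimate_docs(100)  # hypothetical 100 GB collection

print(f"Unweighted mean of recap figures: {simple_mean:,.0f} docs per GB")
print(f"Estimated docs in 100 GB: {estimate['low']:,} to {estimate['high']:,}")
```

The unweighted mean works out to 3,563, which sits close to the 3,500 figure I would bet on. For a site dominated by one very large case, or by text and email, swap in rates nearer to 1,700 or 10,000.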

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.