How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern

Since 2011, I have been sampling our document repository and reporting on file sizes in my “How Many Docs in a Gigabyte” series of posts here. I started writing about the subject because we were seeing a large discrepancy between the number of files per gigabyte we stored and the number considered standard by our industry colleagues. Indeed, in 2011, I reported that we were finding far fewer documents per GB (2,500) than was generally thought to be the industry norm, which ranged from 5,000 to 15,000.

The article and its successors quickly became the most-read articles on our blog. Scientists and practitioners alike were interested in our findings. Even an author of the famed RAND study on e-discovery expenditures told me that he and his team had read my articles as they were doing their research.

Earlier Reports

In the last installment in this series in June 2016, I reviewed the averages we had generated over the years:

  • 2011: 2,500.
  • 2014: 4,500.
  • 2014: 3,000.
  • 2015: 4,400.
  • 2016: 3,415.

In that 2016 report and consistent with the above figures, I put the average number of documents in a gigabyte at 3,500. Based on gut feeling and experience, I suggested a range of between 3,000 and 4,000 documents per gigabyte.
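For readers who want to check the arithmetic, here is a minimal Python sketch (my own illustration, not part of the original analysis) averaging the five yearly figures listed above:

```python
# Yearly documents-per-gigabyte averages reported in this series
yearly_averages = [2_500, 4_500, 3_000, 4_400, 3_415]

# Simple mean across the five reports
mean_docs_per_gb = sum(yearly_averages) / len(yearly_averages)
print(round(mean_docs_per_gb))  # 3563 -- consistent with the 3,500 estimate
```

The unweighted mean lands at roughly 3,563, squarely inside the suggested 3,000–4,000 range.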

2017 Report: The Number Keeps Dropping

I began my 2017 analysis following the same methodology as before. This time, I looked at daily reports from our repository from January 2014 to April 9, 2017, a period of just over three years. The number of records under study grew from about 23 million at the beginning to over 173 million in the later samples, more than a six-fold increase in sample size.

Here were the results:

Average: 3,810
Minimum: 2,782
Maximum: 5,318
Median: 3,927

These figures are certainly consistent with my earlier results and support my 3,500-document estimate, or perhaps suggest that the number should be higher, closer to 3,900.
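These documents-per-gigabyte figures can also be read as an implied average file size. A quick sketch, assuming 1 GB = 1,048,576 KB (my own back-of-envelope conversion, not part of the underlying study):

```python
KB_PER_GB = 1_048_576  # 1 GB expressed in kilobytes (binary convention)

def avg_file_size_kb(docs_per_gb: float) -> float:
    """Implied average file size for a given documents-per-gigabyte figure."""
    return KB_PER_GB / docs_per_gb

print(round(avg_file_size_kb(3_810)))  # ~275 KB at the 2017 average
print(round(avg_file_size_kb(3_927)))  # ~267 KB at the median
```

In other words, the average document in this sample weighs in at roughly a quarter of a megabyte.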

But here’s the rub. By looking only at averages over time, I had failed to notice a clear trend: the numbers are going down day by day. This suggests that the relevant figure is closer to the minimum, and that what we should be talking about is the downward trend rather than past averages.

Here is a chart showing how the daily figures trended over time.


I would say this tells an interesting story. We see a steady drop, albeit with some variations, in the number of documents in a gigabyte. Back in early 2014, the average number of documents exceeded 5,000, rising to as much as 5,318 per gigabyte. By February 2017, the numbers had dropped to as low as 2,782, although we see a small uptick in more recent measurements.

This represents a drop of almost half, corresponding to a near-doubling of average file sizes. It seems like a trend that is likely to continue, at least if this chart is representative of other populations.
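The “drop of almost half” and the “doubling of file sizes” are two views of the same ratio. A small sketch (my own illustration) using the minimum and maximum from the table above:

```python
# Max (early 2014) and min (February 2017) documents-per-gigabyte figures
peak, trough = 5_318, 2_782

drop = 1 - trough / peak        # fractional drop in documents per GB
growth = peak / trough          # implied growth in average file size

print(f"drop: {drop:.0%}")      # 48% -- "almost half"
print(f"growth: {growth:.2f}x") # 1.91x -- roughly a doubling
```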

What Do We Make of This?

From our research, the pattern is clear: the number of documents per gigabyte is dropping, and dropping fast. The reason seems obvious to me. As users add more rich content to their files, such as graphs, charts, pictures and videos, the files get larger. I can’t think of any other reason for the decline.

Will the trend continue? I am betting that it will. Studies have shown the power of visual communication to both inform and persuade, and technology makes it easier and easier to add such content. What we can do, we will do, and this is no exception.

How many documents in a gigabyte? In 2017, my thinking is that the number is closer to 2,800 than 3,900. In a couple more years, I bet it will drop further. It may be time for our industry to consider per-document pricing so you don’t have to keep paying for the increase in file sizes.
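To see why per-document pricing matters, consider a fixed per-gigabyte rate (the $10/GB figure below is a hypothetical assumption for illustration only, not an actual price): as documents per gigabyte fall, the effective cost per document rises even though the rate is unchanged.

```python
RATE_PER_GB = 10.00  # hypothetical per-GB hosting rate, for illustration only

def cost_per_doc(docs_per_gb: float) -> float:
    """Effective cost per document under a flat per-GB rate."""
    return RATE_PER_GB / docs_per_gb

old, new = cost_per_doc(3_900), cost_per_doc(2_800)
print(f"per-doc cost rises {new / old - 1:.0%}")  # 39% for the same documents
```

Under these assumptions, moving from 3,900 to 2,800 documents per gigabyte means paying about 39% more per document for exactly the same material.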


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision. Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on the use of computers in litigation. At the same time, he wrote How to Prepare for, Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer’s Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer’s Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in its special issue on the “Top Fifty Big Law Innovators” of the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London’s CityTech magazine named him one of the “Top 100 Global Technology Leaders.” In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA’s Law Practice Management Section. For many years, he was editor-in-chief of the ABA’s Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine focusing on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on the subject to audiences on four of the five continents.
In his spare time, you will find him competing on the national equestrian show jumping circuit.

2 thoughts on “How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern”

  1. Stephanie Booher

    Thank you, John. This is fantastic information and extremely valuable in my view! I am one of the many who have followed and read your prior posts on this subject with great interest.

    While one could argue (and I often do) that Docs Per GB (DPG) should be used as a more germane standard of measurement for e-discovery data, we are still often confronted with attempting to translate conversion estimates around Pages Per GB (PPG), setting aside the potentially even more nebulous Pages Per Document (PPD) standards for now. I would be curious to hear your thoughts, and any findings you might be able to share, on current, representative PPG averages, whether for standard cumulative ESI file formats or limited simply to email file types rather than an aggregation of all ‘standard’ ESI file types, recognizing this assumes some form of post-processing job assessment in most cases.

    The original motivation for assessing PPG stems from an assumption that all ESI/native file records are being fully imaged (or even printed, back in the day), around which cost estimates were historically framed. While we strive to move away from this formula as a contextual framing and/or cost estimate/budget approach, the request still exists, we find at least. We still use 60,000 – 75,000 PPG as “average” ranges when required, though where these numbers seem to ring most true is from an illustrative standpoint, for stakeholders unfamiliar with the volumes of data associated with e-discovery work. As a result, while there are a few industry-standard, highly respected conversion table resources to be found, those we use appear to be either fairly outdated or very generically reported. Any thoughts or findings you have will be most welcome.

  2. Ethan

    I have to think changes in file-type distribution might be a big cause too – does the data support that? More users scanning or printing to PDF vs. sharing native application files?
    The ratio of attachments to emails is also something I suspect could increase over time – and be measurable. Again, I’d beg to see if your data supports that.

    Relative penetration of the various versions of MS Office is the last thing that comes to mind and could also explain it, but my understanding is that their newer, streamlined file formats should cause document sizes to decrease, so that doesn’t sound right.

