How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern

Since 2011, I have been sampling our document repository and reporting about file sizes in my “How Many Docs in a Gigabyte” series of posts here. I started writing about the subject because we were seeing a large discrepancy between the number of files per gigabyte we stored and the number considered to be standard by our industry colleagues. Indeed, in 2011, I reported that we were finding far fewer documents per GB (2,500) than was generally thought to be the industry norm, which ranged from 5,000 to 15,000.

The article and its successors quickly became the most read articles on our blog. Scientists and practitioners alike were interested in our findings. Even an author of the famed RAND study on e-discovery expenditures told me that he and his team had read my articles as they were doing their research.

Earlier Reports

In the last installment in this series, published in June 2016, I reviewed the averages we had generated over the years:

  • 2011: 2,500.
  • 2014: 4,500.
  • 2014: 3,000.
  • 2015: 4,400.
  • 2016: 3,415.

In that 2016 report and consistent with the above figures, I put the average number of documents in a gigabyte at 3,500. Based on gut feeling and experience, I suggested a range of between 3,000 and 4,000 documents per gigabyte.
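In practice, that figure gets used to turn a collection size into a rough document count for budgeting. Here is a minimal sketch of the arithmetic in Python, using the range above; the 50 GB example is purely illustrative:

    def estimate_documents(gigabytes, low=3000, high=4000):
        """Rough document-count range for a collection size in gigabytes."""
        return gigabytes * low, gigabytes * high

    low_count, high_count = estimate_documents(50)
    print(f"A 50 GB collection: roughly {low_count:,} to {high_count:,} documents")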

2017 Report: The Number Keeps Dropping

I began my 2017 analysis following the same methodology as before. This time, I looked at daily reports from our repository from January 2014 to April 9, 2017, a period of just over three years. The numbers under study ranged from about 23 million records at the beginning to over 173 million in the later samples. That represented more than a seven-fold increase in the sample size.
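For anyone who wants to run the same measurement against their own repository, the calculation is simple: divide the document count in each daily report by the stored volume in gigabytes. A minimal sketch in Python follows; the CSV layout and column names are my own assumptions, not a description of Catalyst’s actual reports:

    import csv

    def daily_docs_per_gb(report_path):
        """Documents-per-gigabyte ratio for each row of a daily report.

        Assumes a CSV export with 'total_documents' and 'total_bytes'
        columns; adjust the names to match your own repository reports.
        """
        ratios = []
        with open(report_path, newline="") as f:
            for row in csv.DictReader(f):
                docs = int(row["total_documents"])
                gigabytes = int(row["total_bytes"]) / (1024 ** 3)
                if gigabytes > 0:
                    ratios.append(docs / gigabytes)
        return ratios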

Here were the results:

  • Average: 3,810
  • Minimum: 2,782
  • Maximum: 5,318
  • Median: 3,927
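These summary statistics are easy to reproduce from a series of daily docs-per-GB figures. A minimal sketch; the sample numbers are hypothetical, not the actual data behind the figures above:

    from statistics import mean, median

    def summarize(ratios):
        """Summary statistics for a series of daily docs-per-GB figures."""
        return {
            "average": round(mean(ratios)),
            "minimum": round(min(ratios)),
            "maximum": round(max(ratios)),
            "median": round(median(ratios)),
        }

    # Hypothetical daily figures, for illustration only:
    print(summarize([5100, 4800, 4200, 3900, 3400, 3000, 2800]))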

Those summary figures are certainly consistent with my earlier results and support my 3,500-document estimate, or perhaps suggest that the number should be a bit higher, closer to 3,900.

But here’s the rub. By looking only at averages over time, I had failed to notice a clear trend: the numbers are going down day by day. This suggests that the relevant figure is closer to the minimum, and that what we should be talking about is the downward trend rather than past averages.

Here is a chart showing how the daily figures changed over time.

[Chart: Documents per gigabyte, measured daily, January 2014 through April 2017]

I would say this tells an interesting story. We see a steady drop, albeit with some variations, in the number of documents in a gigabyte. Back in early 2014, the average number of documents exceeded 5,000, rising to as much as 5,318 per gigabyte. By February 2017, the numbers had dropped to as low as 2,782, although we see a small uptick in more recent measurements.
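One way to quantify that decline, rather than eyeballing the chart, is to fit a straight line to the daily series and look at its slope. A minimal sketch using NumPy; the figures here are hypothetical stand-ins for the real daily measurements:

    import numpy as np

    # Hypothetical daily docs-per-GB figures (substitute the real daily series).
    ratios = np.array([5200.0, 5050, 4800, 4500, 4100, 3700, 3300, 3000, 2850])
    days = np.arange(len(ratios))

    # Least-squares straight line: the slope is the change in docs/GB per day.
    slope, intercept = np.polyfit(days, ratios, 1)
    print(f"Trend: {slope:.1f} docs/GB per day ({slope * 365:.0f} per year)")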

This represents a drop of almost half, which corresponds to a near doubling of average file sizes. It seems like a trend that is likely to continue, at least if this chart is representative of other populations.
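Restating the same figures as average file size makes the doubling explicit; a quick check of the arithmetic:

    BYTES_PER_GB = 1024 ** 3

    def avg_kb_per_doc(docs_per_gb):
        """Average file size in kilobytes implied by a docs-per-GB figure."""
        return BYTES_PER_GB / docs_per_gb / 1024

    print(round(avg_kb_per_doc(5318)))  # early 2014: about 197 KB per document
    print(round(avg_kb_per_doc(2782)))  # early 2017: about 377 KB per document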

What Do We Make of This?

From our research, the pattern is clear: the number of documents per gigabyte is dropping, and dropping fast. The reason seems obvious to me. As users add more rich content, such as graphs, charts, pictures and videos, their files get larger. I can’t think of any other reason for the decline.

Will the trend continue? I am betting that it will. Studies have shown the power of visual communications to both inform and persuade, and technology makes it easier and easier to add such content. What we can do, we will do, and this is no exception.

How many documents in a gigabyte? In 2017, my thinking is that the number is closer to 2,800 than 3,900. In a couple more years, I bet it will drop further. It may be time for our industry to consider per-record pricing so you don’t have to keep paying for the increase in file sizes.
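To see why per-record pricing insulates buyers from growing file sizes, compare the two models as average file size doubles. The rates below are invented purely for illustration:

    def monthly_cost(num_docs, avg_kb_per_doc, per_gb_rate=10.0, per_doc_rate=0.003):
        """Compare hypothetical per-GB and per-document hosting charges."""
        gigabytes = num_docs * avg_kb_per_doc * 1024 / (1024 ** 3)
        return {"per_gb": round(gigabytes * per_gb_rate),
                "per_doc": round(num_docs * per_doc_rate)}

    # The same one million documents, before and after average file size doubles:
    print(monthly_cost(1_000_000, 200))  # {'per_gb': 1907, 'per_doc': 3000}
    print(monthly_cost(1_000_000, 400))  # {'per_gb': 3815, 'per_doc': 3000} -- per-document cost unchanged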



About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” named to the FastCase 50 as a legal visionary and named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.

2 thoughts on “How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern”

  1. Stephanie Booher

    Thank you, John. This is fantastic information and extremely valuable in my view! I am one of the many who have followed and read your prior posts on this subject with great interest.

    While one could argue (and I often do) that Docs Per GB (DPG) is the more germane standard of measurement for e-discovery data, we are still often confronted with translating conversion estimates around Pages Per GB (PPG), setting aside the potentially even more nebulous Pages Per Document (PPD) standards for now. I would be curious to hear your thoughts, and any findings you can share, on what you see as current, representative PPG averages, whether for standard ESI file formats taken cumulatively or limited simply to email file types rather than an aggregation of all ‘standard’ ESI file types, recognizing this assumes some form of post-processing assessment in most cases.

    The original motivation for assessing PPG stems from the assumption that all ESI/native file records will be fully imaged (or even printed, back in the day), which is how cost estimates were historically framed. While we strive to move away from that formula for contextual framing and for cost estimates and budgets, we find the request still exists. We still use 60,000 – 75,000 PPG as an “average” range when required, but where these numbers seem to ring most true is as an illustration for stakeholders unfamiliar with the volumes of data associated with e-discovery work. And while there are a few industry-standard, highly respected conversion table resources to be found, the ones we use appear to be either fairly outdated or very generically reported. Any thoughts or findings you have will be most welcome.

  2. Ethan

    I have to think changes in file type distribution might be a big cause too – does the data support that? More users scanning or printing to PDF vs. sharing native application files?
    The ratio of attachments to emails is also something I suspect could increase over time – and be measurable.
    That is another thing I’d beg to see whether your data supports.

    The relative penetration of the various versions of MS Office is the last thing that comes to mind and could also explain it, but my understanding is that their newer, streamlined file formats should cause document sizes to decrease, so that doesn’t sound right.

