How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern

Since 2011, I have been sampling our document repository and reporting about file sizes in my “How Many Docs in a Gigabyte” series of posts here. I started writing about the subject because we were seeing a large discrepancy between the number of files per gigabyte we stored and the number considered to be standard by our industry colleagues. Indeed, in 2011, I reported that we were finding far fewer documents per GB (2,500) than was generally thought to be the industry norm, which ranged from 5,000 to 15,000.

The article and its successors quickly became the most read articles on our blog. Scientists and practitioners alike were interested in our findings. Even an author of the famed Rand study on e-discovery expenditures told me that he and his team had read my articles as they were doing research.

Earlier Reports

In the last installment in this series in June 2016, I reviewed the averages we had generated over the years:

  • 2011: 2,500.
  • 2014: 4,500.
  • 2014: 3,000.
  • 2015: 4,400.
  • 2016: 3,415.

In that 2016 report and consistent with the above figures, I put the average number of documents in a gigabyte at 3,500. Based on gut feeling and experience, I suggested a range of between 3,000 and 4,000 documents per gigabyte.

2017 Report: The Number Keeps Dropping

I began my 2017 analysis following the same methodology as before. This time, I looked at daily reports from our repository from January 2014 to April 9, 2017, a period of just over three years. The numbers under study ranged from about 23 million records at the beginning to over 173 million in the later samples. That is roughly 7.5 times the initial sample size.

Here were the results:

Average 3,810
Minimum 2,782
Maximum 5,318
Median 3,927

These figures are certainly consistent with my earlier results. They support my 3,500-document estimate, or suggest that the number should be somewhat higher, perhaps closer to 3,900.
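For readers who want to run the same kind of summary on their own repository, here is a minimal Python sketch. It assumes each daily report gives a total document count and a total size in bytes, and that a gigabyte is 1024³ bytes. Neither the report layout nor the byte convention is specified in this post, and the sample snapshots are hypothetical, so treat this as an illustration rather than the method behind the figures above.

```python
from statistics import mean, median

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabyte


def docs_per_gb(total_docs, total_bytes):
    """Documents per gigabyte for one daily snapshot."""
    return total_docs / (total_bytes / BYTES_PER_GB)


def summarize(daily_values):
    """Average, minimum, maximum and median of daily docs-per-GB values."""
    return {
        "average": mean(daily_values),
        "minimum": min(daily_values),
        "maximum": max(daily_values),
        "median": median(daily_values),
    }


# Hypothetical snapshots: (document count, repository size in bytes)
snapshots = [
    (5_000_000, 1_000 * BYTES_PER_GB),
    (8_000_000, 2_000 * BYTES_PER_GB),
    (9_000_000, 3_000 * BYTES_PER_GB),
]

daily = [docs_per_gb(docs, size) for docs, size in snapshots]
summary = summarize(daily)
for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

Note that a summary like this treats every day equally and says nothing about the order of the days.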

But here’s the rub. In looking at averages over time, I failed to notice a clear trend. The numbers are going down day by day. This suggests that the relevant number is closer to the minimum and that what we should be talking about is the downward trend rather than past averages.
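One simple way to check for a trend that an average hides is to fit a straight line through the daily values and look at the sign of its slope. The sketch below does this with ordinary least squares. It is not necessarily how I examined our data, and the sample series is hypothetical.

```python
def least_squares_slope(values):
    """Slope of the best-fit line through (0, v0), (1, v1), ...

    The result is in docs-per-GB per day when values are daily figures.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical daily docs-per-GB figures, oldest first
daily = [5000, 4800, 4900, 4500, 4300, 4400, 4000]

average = sum(daily) / len(daily)
slope = least_squares_slope(daily)
print(f"average: {average:.0f} docs/GB, slope: {slope:.1f} docs/GB per day")
```

A negative slope means the later values sit below the average, so the average overstates the current figure.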

Here is a chart showing how the daily figures changed over time.

[Chart: documents per gigabyte, daily figures, January 2014 to April 2017]

I would say this tells an interesting story. We see a steady drop, albeit with some variations, in the number of documents in a gigabyte. Back in early 2014, the average number of documents exceeded 5,000, rising to as much as 5,318 per gigabyte. By February 2017, the numbers had dropped to as low as 2,782, although we see a small uptick in more recent measurements.

This represents a drop of almost half, which corresponds to a near doubling of average file size. It seems like a trend that is likely to continue, at least if this chart is representative of other populations.
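The arithmetic behind that statement can be checked with the two extremes reported above (5,318 and 2,782 documents per gigabyte). The average file sizes in this sketch assume a decimal gigabyte of 1,000,000 KB, which is a simplification for illustration and not a convention stated in this post.

```python
PEAK_DOCS_PER_GB = 5318  # early 2014 maximum, from the figures above
LOW_DOCS_PER_GB = 2782   # February 2017 minimum, from the figures above
KB_PER_GB = 1_000_000    # assumption: decimal gigabyte

# Fractional drop in documents per gigabyte
drop_fraction = (PEAK_DOCS_PER_GB - LOW_DOCS_PER_GB) / PEAK_DOCS_PER_GB

# Average file size is the inverse of docs per GB, so its growth is the ratio
size_growth = PEAK_DOCS_PER_GB / LOW_DOCS_PER_GB

avg_kb_peak = KB_PER_GB / PEAK_DOCS_PER_GB
avg_kb_low = KB_PER_GB / LOW_DOCS_PER_GB

print(f"drop in docs per GB: {drop_fraction:.1%}")
print(f"growth in average file size: {size_growth:.2f}x")
print(f"average file size: {avg_kb_peak:.0f} KB -> {avg_kb_low:.0f} KB")
```

Because documents per gigabyte and average file size are reciprocals, a drop of just under half in one is a little less than a doubling in the other.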

What Do We Make of This?

From our research, the pattern is clear: the number of documents per gigabyte is dropping and dropping fast. The reason seems obvious to me. As users add more rich content—such as graphs, charts, pictures and videos—to files, they get larger. I can’t think of any other reason for the decline.

Will the trend continue? I am betting that it will. Studies have shown the power of visual communications to both inform and persuade. Technology makes it easier and easier to add such content. What we can do, we will do and this is no exception.

How many documents in a gigabyte? In 2017, my thinking is that the number is closer to 2,800 than 3,900. In a couple more years, I bet it will drop further. It may be time for our industry to consider per-record pricing so you don’t have to keep paying for the increase in file sizes.


One thought on “How Many Documents in a Gigabyte? Our Latest Analysis Shows A Shifting Pattern”

  1. Ethan

    I have to think changes in file type distribution might be a big cause too – does the data support that? More users scanning or printing to pdf vs. sharing native applications?
    The ratio of attachments to emails is also something I suspect could increase over time – and be measurable.
    Another beg to see if your data supports that.

    Relative penetrations of the various versions of MS Office is the last thing to come to mind and could also explain that, but my understanding is their newer streamlined file formats should cause document size to decrease, so that doesn’t sound right.
