How Much Data is Out There? A Lot More Than You Might Think

In 2004, I stumbled onto a study by researchers at the University of California, Berkeley. The study, How Much Information? 2003, was a follow-up to an earlier study released in 2000. These were two of the first attempts anyone had made to calculate the amount of digital content the world was creating. At the least, they were the first I had seen.

I found the report fascinating. The authors suggested that the world was creating about five exabytes of new content every year. What is an exabyte, you ask? Well, try it this way:

  • 5,000 petabytes.
  • 5 million terabytes.
  • 5 billion gigabytes.
  • 5 trillion megabytes.

That is a lot of data. The authors suggested that if you scanned every book and magazine in the Library of Congress, it would only come to about 136 terabytes of information. On that scale, the world created as much electronic data in one year as we might find in 37,000 new libraries the size of the Library of Congress.
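As a back-of-the-envelope check, a few lines of Python reproduce that comparison, assuming the decimal (power-of-ten) units the study uses and its 136-terabyte Library of Congress estimate:

```python
# Back-of-the-envelope check of the study's scale comparison.
# Decimal (SI) units are assumed, as in the Berkeley report.
EXABYTE = 10**18          # bytes
TERABYTE = 10**12         # bytes

new_data_per_year = 5 * EXABYTE          # ~5 EB of new content per year
library_of_congress = 136 * TERABYTE     # scanned books and magazines

libraries = new_data_per_year / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress per year")  # ~36,765
```

Round that off and you get the 37,000 figure.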

That blew my mind. I started speaking to audiences about the incredible amount of content we seemed to be creating and the fact that much of it could be discoverable.

How much data in 2006?

In 2007, IDC released a new study covering the amount of data created in 2006. This time, the study suggested a much bigger number, fully 161 exabytes, reflecting a compound annual growth rate of 57%. How did that happen? It seems we had gone from 37,000 Libraries of Congress to over 1 million of them each year. Holy Dewey Decimal System, Batman!
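That 57% is a compound annual growth rate, which you can recover from any two endpoints. A minimal sketch, assuming the same IDC study's projection of roughly 988 exabytes by 2010 (that figure is my assumption; the post cites only the rate):

```python
# Compound annual growth rate: (end / start) ** (1 / years) - 1.
def cagr(start: float, end: float, years: int) -> float:
    return (end / start) ** (1 / years) - 1

# 2006 actual: 161 exabytes; 2010 projection: ~988 exabytes (assumed here).
rate = cagr(161, 988, 2010 - 2006)
print(f"{rate:.0%}")  # ~57%
```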

More data in 2009

Imagine my surprise when I stumbled onto the latest IDC study, conveniently sponsored by storage giant EMC. The recently released 2009 IDC study now puts the total amount of data we have created at the mind-boggling figure of 800 exabytes, this time representing a compound annual growth rate of 62%. Think of it now as approaching 6 million Libraries of Congress. Holy Batman, Robin and all the rest of the superheroes! This much data could fill a stack of DVDs reaching from the Earth to the moon and back. Next year, IDC predicts, we will cross the zettabyte barrier, weighing in at 1,200 exabytes.

The hits keep on coming. The latest predictions target a decade from now. By 2020, IDC speculates that the virtual stack of DVDs will pass the moon and continue halfway to Mars. But rather than being stored on DVDs, much of this data will be stored in the cloud or at least will pass through the cloud. More than 70% of this digital universe will be generated by individuals, whether at home, the office or on the go.

What does all this mean for us legal types?

You can imagine where all of this is going. The demands on e-discovery systems and professionals will go through the roof. While much of the data we are creating consists of video and music, plenty of the ones and zeros come out of the business world. How much of that will be discoverable? Lots of it. The definition of relevance is broad under the Federal Rules, and the courts seem more likely to grant broad discovery requests.

It also means that the age of the e-discovery appliance is limited. I grew up in a world of Summation and Concordance. They worked fine when my cases consisted of 30,000 documents or fewer. Today, those numbers reach the hundreds of thousands, and even millions for the bigger cases. Systems designed to run on a single computer were never built to handle that load. They keep getting slower and slower.

Three years ago, our average hosted case size was about 15 gigabytes. If we were to use the old standard of 60,000 pages a gig, that would come to 900,000 pages. Today, our average case size has grown to about 140 gigs, perhaps reflecting the fact that we are more often called on to handle the bigger cases. That is more like 8.4 million pages, a lot of documents to review. The rate of growth, roughly ninefold over the period, seems somewhat commensurate with the IDC and Berkeley projections.
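The page math above is easy to reproduce, using the rough 60,000-pages-per-gigabyte conversion the post relies on:

```python
PAGES_PER_GB = 60_000  # the old rule-of-thumb conversion used above

def pages(gigabytes: float) -> int:
    """Estimated page count for a hosted case of the given size."""
    return int(gigabytes * PAGES_PER_GB)

then, now = 15, 140  # average hosted case size in GB, three years apart
print(f"{pages(then):,} pages")       # 900,000 pages
print(f"{pages(now):,} pages")        # 8,400,000 pages
print(f"{now / then:.1f}x growth")    # about 9.3x
```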

How much data is out there? Way more than any of us ever expected. If it keeps going at this pace, things will really get interesting.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.

Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on using computers in litigation. At the same time, he wrote How to Prepare for, Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer’s Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer’s Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in its special issue on the “Top Fifty Big Law Innovators” of the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London’s CityTech magazine named him one of the “Top 100 Global Technology Leaders.” In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA’s Law Practice Management Section. For many years, he was editor-in-chief of the ABA’s Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine focusing on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken about it to audiences on four of the five continents.
In his spare time, you will find him competing on the national equestrian show jumping circuit.

8 thoughts on “How Much Data is Out There? A Lot More Than You Might Think”

  1. Peg Duncan

    How much of this, though, is unique information? Consider the volume of torrents (leaving aside their legality for the moment), or the ripped music sitting on iPods, smartphones and the like. The same goes for digital photos uploaded to Facebook. The “alpha” copy is on the camera, with copies loaded onto the home computer using various types of electronic photo album software, backed up on the external drive, and then uploaded to Picasa and Facebook.

    We are certainly creating new (and unique) information at an ever-increasing pace, but we are also spreading copies everywhere.

    1. John Tredennick

      You make a good point. I have seen studies suggesting that a substantial proportion of the data out there consists of duplicates. We regularly remove duplicates when we process e-discovery documents just to cut down on needless review.

      But even if the percentage of duplicate material is high, say 60% of the total population, the point would be the same. By any measure, the amount of new data we are creating is breathtaking, particularly for old, paper-based guys like me. And a lot of it is discoverable.

      It simply means that the old methods for dealing with legal documents won’t cut it any more. Thanks for your comments.

  2. Ramone Reese

    This is extremely interesting and thought-provoking. As a current law student, it provides a glimpse into the world I will one day practice in. It will be interesting to see how courts continue to adapt and respond to the ever-changing world of e-discovery. As technology changes and the amount of data continues to increase at an exponential rate, I wonder which way costs will go.
