In 2004, I stumbled onto a study by researchers at the University of California, Berkeley. The study, How Much Information? 2003, was a follow-on to an earlier study released in 2000. These were two of the first attempts anyone had made to calculate the amount of digital content the world was creating. At the least, they were the first I had seen.
I found the report fascinating. The authors suggested that the world was creating about five exabytes of new content every year. "What is an exabyte?" you ask. Well, try it this way:
- 5,000 petabytes.
- 5 million terabytes.
- 5 billion gigabytes.
- 5 trillion megabytes.
That is a lot of data. The authors suggested that if you scanned every book and magazine in the Library of Congress, it would only come to about 136 terabytes of information. On that scale, the world created as much electronic data in one year as we might find in 37,000 new libraries the size of the Library of Congress.
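For anyone who wants to check my math, here is a quick back-of-the-envelope sketch in Python. It is just a sketch under stated assumptions: decimal (SI) units, as the Berkeley researchers used, and their 136-terabyte figure for the Library of Congress. The variable names are mine.

```python
# Back-of-the-envelope check of the figures above, using decimal (SI)
# units: 1 exabyte = 1,000 petabytes = 1,000,000 terabytes, and so on.
TB_PER_EB = 1_000_000

annual_data_eb = 5            # new content per year, per the 2003 study
library_of_congress_tb = 136  # scanned books and magazines, per the study

annual_data_tb = annual_data_eb * TB_PER_EB          # 5,000,000 TB
libraries = annual_data_tb / library_of_congress_tb  # about 36,765

print(f"{annual_data_tb:,} TB is roughly {libraries:,.0f} Libraries of Congress")
```

Run it and you get just under 37,000 Libraries of Congress, which is where that number comes from.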
That blew my mind. I started speaking to audiences about the incredible amount of content we seemed to be creating and the fact that much of it could be discoverable.
How much data in 2006?
In 2007, IDC authored a new study covering the amount of data created in 2006. This time, the study suggested a much bigger number, fully 161 exabytes, reflecting a compound annual growth rate of 57%. How did that happen? It seems we had gone from 37,000 Libraries of Congress to over 1 million of them each year. Holy Dewey Decimal System, Batman!
More data in 2009
Imagine my surprise when I stumbled onto the latest IDC study, conveniently sponsored by storage giant EMC. The recently released 2009 IDC study puts the total amount of data we have created at the mind-boggling figure of 800 exabytes, this time representing a compound annual growth rate of 62%. Think of it now as approaching 6 million Libraries of Congress. Holy Batman, Robin and all the rest of the superheroes! This much data could fill a stack of DVDs reaching from the earth to the moon and back. Next year, IDC predicts, we will cross the zettabyte barrier, with the total weighing in at 1,200 exabytes.
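The same Library of Congress arithmetic scales to each of the later estimates. A minimal sketch, again assuming the Berkeley study's 136-terabyte benchmark:

```python
# The same Library of Congress arithmetic applied to each estimate.
TB_PER_EB = 1_000_000
LOC_TB = 136  # the Berkeley study's Library of Congress benchmark

estimates_eb = {"Berkeley (2003 study)": 5, "IDC (2006)": 161, "IDC (2009)": 800}

for label, exabytes in estimates_eb.items():
    libraries = exabytes * TB_PER_EB / LOC_TB
    print(f"{label}: {exabytes:,} EB is roughly {libraries:,.0f} libraries")
```

That works out to roughly 37,000 libraries, then 1.2 million, then 5.9 million, which is how we get from "over 1 million" to "approaching 6 million" in three years.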
The hits keep on coming. The latest predictions target a decade from now. By 2020, IDC speculates that the virtual stack of DVDs will pass the moon and continue halfway to Mars. But rather than sitting on DVDs, much of this data will be stored in the cloud, or at least will pass through it. More than 70% of this digital universe will be generated by individuals, whether at home, at the office or on the go.
What does all this mean for us legal types?
You can imagine where all of this is going. The demands on e-discovery systems and professionals will go through the roof. While lots of the data we are creating consists of videos and music, plenty of the ones and zeros come out of the business world. How much of that will be discoverable? Lots of it. The definition of relevance is broad under the Federal Rules, and the courts seem increasingly willing to grant broad discovery requests.
It also means that the days of the e-discovery appliance are numbered. I grew up in a world of Summation and Concordance. They worked fine when my cases consisted of 30,000 documents or fewer. Today, those numbers reach into the hundreds of thousands and even millions for the bigger cases. Systems designed to run on a single computer were never made to handle the load. They are getting slower and slower and slower.
Three years ago, our average hosted case size was about 15 gigabytes. If we were to use the old standard of 60,000 pages a gig, that would come to 900,000 pages. Today, our average case size has grown to about 140 gigs, perhaps reflecting the fact that we are more often called on to handle the bigger cases. That is more like 8.4 million pages, a lot of documents to review. The rate of growth, roughly 900% over the period, seems somewhat commensurate with the IDC and Berkeley projections.
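For the curious, here is how those page counts fall out. The 60,000 pages-per-gigabyte figure is the old rule of thumb mentioned above, and a rough one at that; actual yields vary widely by file type.

```python
# Rough page counts from case size, using the old 60,000 pages/GB
# rule of thumb. Actual yields vary widely by file type.
PAGES_PER_GB = 60_000

old_case_gb, new_case_gb = 15, 140

print(f"Then: {old_case_gb * PAGES_PER_GB:,} pages")  # 900,000
print(f"Now:  {new_case_gb * PAGES_PER_GB:,} pages")  # 8,400,000

growth = (new_case_gb - old_case_gb) / old_case_gb
print(f"Growth: {growth:.0%}")  # 833%, i.e., roughly 900% over the period
```

The exact growth figure comes out closer to 830%, but on these scales the difference hardly matters.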
How much data is out there? Way more than any of us ever expected. If it keeps going at this pace, things will really get interesting.