When we discuss data volumes in electronic discovery, we typically speak in terms of kilobytes, megabytes, gigabytes and sometimes even petabytes, at least for very large matters. What do we mean by these terms? How many bytes are in a kilobyte, a megabyte or a gigabyte?
The question is trickier than you might think. Are we talking about multiples of 1,000 or 1,024? Is a kilobyte 1,000 bytes or 1,024 bytes? Is a gigabyte 1,000 megabytes or 1,024 megabytes? I have heard both measures used by people I respect. Which is right?
Let’s start by checking the Sedona Conference Glossary. I am not saying this is the bible, but there are a lot of smart people involved with Sedona. I have always found their materials helpful and exceedingly well written.
In the Glossary, the Working Group on Electronic Document Retention and Protection (WG1) defines a kilobyte as a unit of 1,024 bytes. They then go on to define megabyte as 1,024 kilobytes (1,048,576 bytes) and a gigabyte as 1,024 megabytes (1,073,741,824 bytes).
Is that right? The authors don’t cite any authority for their definitions. (To be fair, none of the glossary terms are referenced.) They may have had specific authority in mind or it may simply represent the consensus of the group at that time. (A lot of people believe this definition is correct.) Either way, I think we need to look a little farther before we reach a conclusion. Here is why.
The Metric System
The metric system was an outgrowth of the work done in 1875 by the International Bureau of Weights and Measures (“BIPM” for the French version), which itself was set up by the “Metre Convention.” At the time, 17 countries banded together by treaty in an attempt to create measurement standards. Today, at least 51 countries have signed on to the treaty, including the United States. See, Le Système international d’unités (8th ed. 2006) (English translation at 95).
The group almost immediately began work ratifying definitions of the meter and kilogram, both measures that had been used in France and elsewhere for over a hundred years. That work led to the International System of Units (SI), which was ratified in 1960. It is often called the metric system, with expanding and contracting units built around the power of 10 (base 10).
Following this history, let’s move on to some firm ground. The prefixes “kilo,” “mega” and “giga” are a central part of the SI. Each prefix is defined based on a power of 10. Under the SI, kilo means 10³, mega means 10⁶, and giga means 10⁹. Under the SI, one gigabyte is 1 billion bytes. No ifs, ands or buts.
If kilo means 1,000, where did all this 1,024 business come in? We need to go back a bit to find out. Like, to the ’60s when I was still wearing tie dye t-shirts and playing in rock bands.
Early Days of Computing
In the early days of computing, computer professionals needed a way to describe numbers that were growing by the minute. As most of us know, computers are binary creatures, using combinations of 1 and 0 for all of their calculations. A bit is a single integer that can be either a 1 or a 0. A byte consists of 8 bits and was the smallest unit in computing associated with a letter or other character.
As the number of bytes used for programs or data grew larger, computer scientists needed a way to express these larger amounts easily. Out of convenience, they reached for decimal prefixes from the metric system to aggregate byte values. They borrowed the term kilobyte for units of 1,024 bytes, and then megabytes and gigabytes for the larger groupings. The idea caught on and people started using the terms to describe binary values based on a divisor of 1,024.
While a bit odd, this misuse of the metric prefixes didn’t matter very much, at least early on. With the two values being relatively close, it seemed simpler to give 1,024 a metric label than invent another name. Since the volumes they were talking about were low, who cared? The differential between the metric and binary approaches was more a theoretical than practical problem in the early days.
By the late 1990s, volumes increased to the point where the differential mattered. The key point to understand is that the difference compounds with each step up the prefix ladder. The SI kilobyte is nearly 98% of the binary kilobyte, a megabyte is under 96% of a binary megabyte, and a gigabyte is just over 93% of a binary gigabyte. That meant that a 300 gigabyte hard disk would show as containing only 279 gigabytes.
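You can verify these percentages, and the 300-gigabyte example, with a quick back-of-the-envelope calculation. Here is a short Python sketch (the variable names are mine):

```python
# Compare the SI (decimal) and binary interpretations of each prefix.
SI = {"kilo": 10**3, "mega": 10**6, "giga": 10**9}
BINARY = {"kilo": 2**10, "mega": 2**20, "giga": 2**30}

for prefix in SI:
    ratio = SI[prefix] / BINARY[prefix] * 100
    print(f"1 {prefix}byte (SI) is {ratio:.2f}% of a binary {prefix}byte")
# kilo: 97.66%, mega: 95.37%, giga: 93.13%

# A drive marketed as "300 GB" (decimal), displayed in binary gigabytes:
marketed_bytes = 300 * 10**9
shown = marketed_bytes / 2**30
print(f"A 300 GB drive displays as about {shown:.0f} binary gigabytes")  # about 279
```

As the loop shows, each step up the prefix ladder widens the gap, which is why the discrepancy only became noticeable once drives reached gigabyte sizes.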
Different people were now using different measurements for the same or similar things. Memory makers, for example, used the binary system to calculate memory size. In contrast, hard drive manufacturers used the decimal system to express bytes. Remember CD-ROMs? They were measured using the binary system. Today, DVDs are measured using the decimal system. Computer clock speeds are expressed in kilohertz, which means a thousand hertz. And so on.
From the beginning, the Windows operating system has expressed storage in binary terms: when it reports a value in gigabytes, the figure is based on divisors of 1,024.
Apple takes a different approach. Like other hardware manufacturers, it reports hard drive size using the decimal version of the gigabyte. In earlier versions of the Mac OS, however, it reported disk size using binary gigabytes. That changed with Mac OS X 10.6, Snow Leopard. Now the OS reports storage capacity based on decimal calculations. For the first time, a 200 gigabyte hard drive shows 200 gigabytes of storage.
For what it’s worth, some components of the Linux kernel measure capacity using decimal units as well.
Naturally, consumers and non-geeks managed to get confused by all of this and lawsuits followed. Class actions were brought against Seagate and Western Digital, two of the largest hard drive manufacturers in the world. While they maintained that the decimal divisor was the correct one for measuring disk size, they ended up settling with a refund to the consumer class.
The Move to Standardize
In the mid-1990s, people started suggesting that we standardize on terms and stop this confusion. After a lot of discussion and false starts, the International Union of Pure and Applied Chemistry proposed the use of specific terms for storage values expressed in metric terms. Three more groups, the Institute of Electrical and Electronics Engineers (IEEE), the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), quickly joined the band.
In December 1998, the IEC, one of the leading international standards organizations, came up with new terms for binary multiples in an attempt to distinguish them from the metric terms. They suggested that the proper terms for binary calculations based on 1,024 as the divisor are kibi, mebi, gibi and the like.
This approach was picked up in the United States by the National Institute of Standards and Technology (NIST). The relationship between the metric units and the binary ones works out as follows:

- kilobyte (kB) = 10³ bytes; kibibyte (KiB) = 2¹⁰ bytes = 1,024 bytes
- megabyte (MB) = 10⁶ bytes; mebibyte (MiB) = 2²⁰ bytes = 1,048,576 bytes
- gigabyte (GB) = 10⁹ bytes; gibibyte (GiB) = 2³⁰ bytes = 1,073,741,824 bytes
This movement has gained steam to the point where every major standards body is in agreement that a gigabyte is 1 billion bytes (10⁹) and the corresponding gibibyte represents 1,073,741,824 bytes, based on the binary power 2³⁰. The organizations that accept this include:
- National Institute of Standards and Technology.
- International Electrotechnical Commission.
- Institute of Electrical and Electronics Engineers.
- International Committee for Weights and Measures.
- International System of Units.
- European Union.
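Put into code, the two conventions the standards bodies distinguish look like this. Here is a minimal Python sketch; the function names are mine, invented for illustration:

```python
# Format a raw byte count under both conventions:
# SI decimal units (kB, MB, GB) and IEC binary units (KiB, MiB, GiB).
def format_si(n_bytes: int) -> str:
    for unit, size in (("GB", 10**9), ("MB", 10**6), ("kB", 10**3)):
        if n_bytes >= size:
            return f"{n_bytes / size:.2f} {unit}"
    return f"{n_bytes} bytes"

def format_iec(n_bytes: int) -> str:
    for unit, size in (("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)):
        if n_bytes >= size:
            return f"{n_bytes / size:.2f} {unit}"
    return f"{n_bytes} bytes"

drive = 200 * 10**9          # a drive marketed as "200 GB"
print(format_si(drive))      # 200.00 GB
print(format_iec(drive))     # 186.26 GiB
```

The same physical drive yields two different numbers, which is exactly the confusion the kibi/mebi/gibi prefixes were invented to resolve: each convention gets its own label.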
Harkening back to our discussion on the International System of Units, the International Bureau of Weights and Measures (BIPM) now expressly prohibits the use of SI prefixes to denote binary multiples. Instead, they suggest adoption of the IEC prefixes for binary units. (See, Le Système international d’unités, page 121.)
I am not aware of any recognized standards organization, except perhaps the Sedona Conference, proposing that the prefixes kilo, mega and giga mean anything other than multiples of 1,000.
So Which Is It: 1,024 or 1,000?
So what is a gigabyte? Is it 1,024 megabytes as many of our techno geeks claim? Or is it 1,000 megabytes? At the least, we now have a basis to address the questions with a little broader perspective.
Maybe the answer is, “It can be whatever you want it to be.” I was the frog footman in our production of Alice in Wonderland back at Kecoughtan High School. I will never forget this dialog between Humpty Dumpty and Alice, excerpted from Through the Looking-Glass, by Lewis Carroll:
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is, which is to be master—that’s all.”
Does that work for e-discovery? I suppose it could if everyone agreed that 1,024 should be the measure. A kilobyte means 1,024 bytes because that is what we chose it to mean–or, more appropriately, because that is what we have been calling it for years.
I have spoken with a number of technical guys I respect about this topic. They are adamant. “A gigabyte is 1,024 megabytes,” they say with fervor. “That’s the way it’s always been.”
Maybe they are right. As one pointed out to me, “Every console and network application out there uses binary multiples. Even Windows shows binary gigabytes.” Another person suggested that file systems store data in blocks that are better tracked in binary multiples. That one flew right by me, but some of you may understand it. Others just go off what they learned when they got started.
With all due respect for differing opinions, I side with NIST and the other international standards bodies. The prefix kilo means 1,000 and that is that. It makes no sense to mix and match definitions depending on how the wind is blowing that day. Mega means 1 million and giga means 1 billion.
Certainly disclosure is central to this discussion. At Catalyst, we have followed the definitions used by the SI for as long as I can remember. We disclose that fact prominently on our price sheets and on our support site and explain that it is the accepted international standard. If others use different definitions, that is certainly their prerogative, just as it was for Humpty Dumpty. It is primarily a matter of disclosure but consistency and standards should factor into the discussion as well.
The problem was significant enough to lead the international standards bodies to create new titles for binary multiples: kibibyte, mebibyte and gibibyte. These sound a bit silly, which perhaps caused people not to use them as substitutes for their metric counterparts. Maybe the problem is a matter of familiarity; we just aren’t used to them. I remember seeing the first elevated tail lights in cars and thinking they looked strange. Now, they look quite normal. Perhaps it would be the same for kibibytes and gibibytes. Or perhaps they aren’t needed in the first place.
How many bytes in a gigabyte? The answer seems simple and straightforward to me. There are 1 billion bytes in a gigabyte, 1 million in a megabyte and 1,000 in a kilobyte. Kilo means 1,000 whether measuring bytes, meters or grams. These are metric figures and they should remain constant across the board. It is as simple as that.