We have seen pseudorandom generators, functions and permutations, as well as message authentication codes and CPA- and CCA-secure encryption. This week we will talk about cryptographic hash functions and some of their magical properties. We motivate this by the bitcoin cryptocurrency. As usual our discussion will be highly abstract and idealized, and any resemblance to real cryptocurrencies, living or dead, is purely coincidental.
Using cryptography to create a centralized digital currency is fairly straightforward, and indeed this is what is done by Visa, Mastercard, etc. The main challenge with bitcoin is that it is decentralized. There is no trusted server, there are no "user accounts", and there is no central authority to adjudicate claims. Rather we have a collection of anonymous and autonomous parties that somehow need to agree on what is a valid payment.
Before talking about cryptocurrencies, let's talk about currencies in general.^[I am not an economist by any stretch of the imagination, so please take the discussion below with a huge grain of salt. I would appreciate any comments on it.] At an abstract level, a currency requires two components:
- A scarce resource.
- A mechanism for determining and transferring ownership of certain quantities of this resource.
The original currencies were based on commodity money. The scarce resource was some commodity having intrinsic value, such as gold or silver, or even salt or tea, and ownership was based on physical possession. However, as commerce increased, carrying around (and protecting) large quantities of the commodity became impractical, and societies shifted to representative money, where the currency is not the commodity itself but rather a certificate that provides the right to the commodity. Representative money requires trust in some central authority that would respect the certificate. The next step in the evolution of currencies was fiat money, which is a currency (like today's dollar, ever since the U.S. moved off the gold standard) that does not correspond to any commodity, but rather only relies on trust in a central authority. (Another example is the Roman coins, which, though originally made of silver, underwent a continuous process of debasement until they contained less than two percent of it.) One advantage (sometimes disadvantage) of a fiat currency is that it allows for more flexible monetary policy on the part of the central authority.
Bitcoin is a fiat currency without a central authority. A priori this seems like a contradiction in terms. If there is no trusted central authority, how can we ensure a scarce resource? Who settles claims of ownership? And who sets monetary policy? Bitcoin (and other cryptocurrencies) is about solving these problems via cryptographic means.
The basic unit in the bitcoin system is a coin.
Each coin has a unique identifier, and a current owner.[1]
Transactions in the system have either the form of "mint coin with identifier
Since there are no user accounts in bitcoin, the "entities"
Please re-read the previous paragraph, to make sure you follow the logic.
One example of a puzzle is that
The main idea behind bitcoin is that there is a public ledger that contains an ordered list of all the transactions that were ever performed and are considered valid in the system. Given such a ledger, it is easy to answer the question of who owns any particular coin. The main problem is how a collection of anonymous parties without any central authority can agree on this ledger. This is an instance of the consensus problem in distributed computing. This seems quite scary, as there are very strong negative results known for this problem; for example, the famous Fischer, Lynch and Paterson (FLP) result showed that if there is even one party that has a benign failure (i.e., it halts and stops responding) then it is impossible to guarantee consensus in an asynchronous network. Things are better if we assume synchronicity (i.e., a global clock and some bounds on the latency of messages) as well as that a majority or supermajority of the parties behave correctly. The global clock assumption is typically approximately maintained on the Internet, but the honest majority assumption seems quite suspicious. What does a "majority of parties" mean in an anonymous network where a single person can create multiple "entities" and cause them to behave arbitrarily (known as "byzantine" faults in distributed computing parlance)? Also, why would we assume that even one party would behave honestly? If there is no central authority and it is profitable to cheat, then everyone would cheat, wouldn't they?
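To make the claim that a ledger determines ownership concrete, here is a minimal Python sketch. The transaction encoding (tuples tagged `"mint"` and `"transfer"`, with owners as plain strings) and the function name `current_owners` are illustrative assumptions, not the real Bitcoin format, and the sketch ignores puzzles and signatures entirely.

```python
# Hypothetical transaction encodings (NOT the real Bitcoin format):
#   ("mint", coin_id, owner)
#   ("transfer", coin_id, old_owner, new_owner)

def current_owners(ledger: list[tuple]) -> dict[str, str]:
    """Replay the ledger in order and return a map coin_id -> current owner.

    Raises ValueError at the first transaction that is invalid with
    respect to the transactions before it.
    """
    owners: dict[str, str] = {}
    for tx in ledger:
        if tx[0] == "mint":
            _, coin, owner = tx
            if coin in owners:
                raise ValueError(f"coin {coin} minted twice")
            owners[coin] = owner
        elif tx[0] == "transfer":
            _, coin, old, new = tx
            # Only the current owner may transfer a coin.
            if owners.get(coin) != old:
                raise ValueError(f"{old} does not own coin {coin}")
            owners[coin] = new
        else:
            raise ValueError(f"unknown transaction type {tx[0]!r}")
    return owners


ledger = [
    ("mint", "c1", "alice"),
    ("transfer", "c1", "alice", "bob"),
    ("mint", "c2", "carol"),
]
owners = current_owners(ledger)
```

Because the ledger is ordered, a second attempt by the old owner to spend the same coin is rejected when the ledger is replayed. The hard part, as discussed above, is agreeing on the ledger in the first place.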
Perhaps the main idea behind bitcoin is that "majority" will correspond to a "majority of computing power", or as the original bitcoin paper says, "one CPU one vote" (or perhaps more accurately, "one cycle one vote"). It might not be immediately clear how to implement this, but at least it means that creating fictitious new entities (sometimes known as a Sybil attack, after the movie about multiple-personality disorder) cannot help. To implement it we turn to a cryptographic concept known as "proof of work", which was originally suggested by Dwork and Naor in 1992 as a way to combat mass marketing email.[3]
Consider a pseudorandom function
Stop here and try to think if indeed it is the case that one cannot find an input
The main question in using PRF's for proofs of work is who is holding the key
Indeed, it is an excellent exercise to prove that (under the PRF conjecture) there exists a PRF
However, suppose that
"Theorem": Under the PRG conjecture, there exists a super strong PRF.
Unfortunately such a result is not known to be true, and for a very good reason. Most natural ways to define "super strong PRF" will result in properties that can be shown to be impossible to achieve. Nevertheless, the intuition behind it still seems useful and so we have the following heuristic:
The random oracle heuristic (aka "Random oracle model", Bellare-Rogaway 1993): If a "natural" protocol is secure when all parties have access to a random function $H:\{0,1\}^n\rightarrow\{0,1\}^\ell$, then it remains secure even when we give the parties the description of a cryptographic hash function with the same input and output lengths.
We don't have a good characterization as to what makes a protocol "natural" and we do have fairly strong counterexamples to this heuristic (though they are arguably "unnatural"). That said, it still seems useful as a way to get intuition for security, and in particular to analyze bitcoin (and many other practical protocols) we do need to assume it, at least given current knowledge.
The random oracle heuristic is very different from all the conjectures we considered before. It is not a formal conjecture since we don't have any good way to define "natural" and we do have examples of protocols that are secure when all parties have access to a random function but are insecure whenever we replace this random function by any efficiently computable function (see the homework exercises).
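When reasoning in the random oracle model, it can help to picture the random function as a table that is filled in lazily, one fresh random answer per new query. The following Python sketch illustrates that picture. The class name `RandomOracle` and the byte-oriented interface are assumptions made for illustration, and for convenience it accepts inputs of any length rather than exactly $n$ bits.

```python
import secrets


class RandomOracle:
    """A lazily sampled random function with ell_bytes bytes of output."""

    def __init__(self, ell_bytes: int = 32):
        self.ell_bytes = ell_bytes
        self._table: dict[bytes, bytes] = {}

    def query(self, x: bytes) -> bytes:
        # A fresh input gets an independent uniform answer;
        # a repeated input gets the answer recorded earlier.
        if x not in self._table:
            self._table[x] = secrets.token_bytes(self.ell_bytes)
        return self._table[x]


oracle = RandomOracle(ell_bytes=16)
first = oracle.query(b"some input")
second = oracle.query(b"some input")
```

The heuristic then amounts to replacing every call to `query` with a call to a fixed, publicly described hash function, and hoping that security is preserved.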
We can now specify the "proof of work" protocol for bitcoin. Given some identifier
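As a rough illustration of a hash-based proof of work, the sketch below searches for a value `x` such that the hash of the identifier concatenated with `x`, read as a number, is at most a target. SHA-256 is used here only as a stand-in for the random oracle; the function names, the 8-byte encoding of `x`, and the choice of target are assumptions for illustration and do not match the real Bitcoin encoding.

```python
import hashlib
from itertools import count


def pow_value(identifier: bytes, x: int) -> int:
    """Interpret SHA-256(identifier || x) as a 256-bit integer."""
    digest = hashlib.sha256(identifier + x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def solve(identifier: bytes, target: int) -> int:
    """Try x = 0, 1, 2, ... and return the first x that meets the target."""
    for x in count():
        if pow_value(identifier, x) <= target:
            return x


def verify(identifier: bytes, x: int, target: int) -> bool:
    """Checking a proof costs a single hash evaluation."""
    return pow_value(identifier, x) <= target


# Toy difficulty: each attempt succeeds with probability about 2**-12,
# so about 2**12 hash evaluations are expected.
TARGET = 2**256 >> 12
proof = solve(b"example-id", TARGET)
```

The asymmetry is the point: `solve` is expected to take about $2^{12}$ evaluations at this toy difficulty, while `verify` takes one. Lowering `TARGET` makes the puzzle harder without making verification any more expensive.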
How does proof of work help us in achieving consensus? The idea is that every transaction
An honest party in the bitcoin network will accept the longest valid ledger it is aware of. (A ledger is valid if every transaction in it of the form "transfer the coin
The question is then how we get to that happy state, given that many parties might be non-malicious but still selfish, and might not want to volunteer their computing power for the goal of creating a consensus ledger.
Bitcoin achieves this by giving some incentive, in the form of the ability to mint new coins, to any party that adds to the ledger.
This means that if we are already in the situation where there is a consensus ledger
Cost to mine, mining pools: Generally, if you know that completing a
The real bitcoin: There are several aspects in which the protocol described above differs from the real bitcoin protocol. Some of them were already discussed above: Bitcoin typically uses digital signatures for puzzles (though it has a more general scripting language to specify them), and transactions involve a number of satoshis (and the user interface typically displays currency in units of BTC, each of which is $10^8$ satoshis). The Bitcoin protocol also has a formula designed to factor in the decrease in dollar cost per cycle so that bitcoins become more expensive to mine with time. There is also a fee mechanism apart from the mining to incentivize parties to add to the ledger. (The issue of incentives in bitcoin is quite subtle and not fully resolved, and it is possible that parties' behavior will change with time.) The ledger does not grow by a single transaction at a time but rather by a block of transactions, and there is also some timing synchronization mechanism (which is needed, as per the consensus impossibility results). There are other differences as well; see the Bonneau et al. paper as well as the Tschorsch and Scheuermann survey for more.
Another issue we "brushed under the carpet" is how do we come up with these unique identifiers per transaction.
We want each transaction
The main idea is the following simple result, which can be thought of as one side of the so called "birthday paradox":
If
Let us think of
This means that a random function
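The birthday phenomenon is easy to observe experimentally. The sketch below truncates SHA-256 to `n_bits` bits (an assumption made purely so that the experiment finishes quickly; the function names are illustrative) and hashes distinct inputs until two of them collide. For a function that behaves like a random one, a collision typically appears after a number of queries on the order of $2^{n/2}$, far fewer than the $2^n$ possible outputs.

```python
import hashlib
from itertools import count


def h(x: bytes, n_bits: int) -> int:
    """Toy hash: the top n_bits bits of SHA-256(x)."""
    digest = hashlib.sha256(x).digest()
    return int.from_bytes(digest, "big") >> (256 - n_bits)


def find_collision(n_bits: int) -> tuple[bytes, bytes, int]:
    """Hash distinct inputs until two share an output.

    Returns the two colliding inputs and the number of queries made.
    """
    seen: dict[int, bytes] = {}
    for i in count():
        x = i.to_bytes(8, "big")
        y = h(x, n_bits)
        if y in seen:
            return seen[y], x, i + 1
        seen[y] = x


x1, x2, queries = find_collision(24)
```

With 24-bit outputs one would expect `queries` to be in the low thousands rather than anywhere near $2^{24}$. The same experiment with a full 256-bit output would need about $2^{128}$ queries, which is why it is considered infeasible.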
A collection
Once more we do not know a theorem saying that under the PRG conjecture there exists a collision resistant hash function collection, even though this property is considered one of the desiderata for cryptographic hash functions. However, we do know how to obtain collections satisfying this condition under various assumptions that we will see later in the course, such as the learning with errors problem and the factoring and discrete logarithm problems. Furthermore, if we consider the weaker notion of security under a second preimage attack (also known as being a "universal one way hash function" or UOWHF) then it is known how to derive such a function from the PRG assumption.
A collection
While we discussed hash functions as keyed collections, in practice people often think of a hash function as being a fixed keyless function. However, this is because most practical constructions involve some hardwired standardized constants (often known as IV) that can be thought of as a choice of the key.
Practical constructions of cryptographic hash functions start with a basic block which is known as a compression function
[Figure: the Merkle-Damgard construction.]{#merkledamgardfig}
Let
The intuition behind the proof is that if
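The following Python sketch shows one common way to write the Merkle-Damgard iteration. It is a toy under stated assumptions: the compression function `compress` is simulated by truncating SHA-256 (real designs build it from scratch or from a block cipher), the block size `N` is arbitrary, and the padding rule (a marker byte, zeros, then the message length) is one standard variant that may differ in details from the construction analyzed above.

```python
import hashlib

N = 16  # toy size in bytes of both a message block and the chaining value


def compress(z: bytes, m: bytes) -> bytes:
    """Stand-in compression function from 2N bytes to N bytes."""
    assert len(z) == N and len(m) == N
    return hashlib.sha256(z + m).digest()[:N]


def pad(msg: bytes) -> bytes:
    """Append 0x80, zero bytes, and the 8-byte message length,
    so that the result is a whole number of N-byte blocks."""
    zeros = (-(len(msg) + 1 + 8)) % N
    return msg + b"\x80" + b"\x00" * zeros + len(msg).to_bytes(8, "big")


def merkle_damgard(iv: bytes, msg: bytes) -> bytes:
    """Chain the compression function over the padded message blocks."""
    z = iv
    padded = pad(msg)
    for i in range(0, len(padded), N):
        z = compress(z, padded[i:i + N])
    return z


IV = bytes(N)  # an all-zero IV, chosen arbitrarily for the example
digest = merkle_damgard(IV, b"hello world")
```

Encoding the length in the final block ensures that two different messages never produce the same padded block sequence, so a collision in the overall hash can be traced back, block by block, to a collision in `compress`.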
In practice we want much more than collision resistance from our hash functions.
In particular we often would like them to be PRF's as well.
Unfortunately, the Merkle-Damgard construction is not a PRF even when
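A standard way to see why an iterated hash of this kind fails to be a PRF is the length extension attack: the output on a message is exactly the chaining value needed to continue hashing, so anyone who sees the output can compute the output on a longer message without knowing the secret starting value. The sketch below demonstrates this on a toy iteration without padding. The compression function (truncated SHA-256), the block size, and all names are illustrative assumptions.

```python
import hashlib
import secrets

N = 16  # toy block and chaining-value size in bytes


def compress(z: bytes, m: bytes) -> bytes:
    """Stand-in compression function from 2N bytes to N bytes."""
    return hashlib.sha256(z + m).digest()[:N]


def md(iv: bytes, blocks: list[bytes]) -> bytes:
    """Plain iteration: z_0 = iv, z_j = compress(z_{j-1}, m_j)."""
    z = iv
    for m in blocks:
        z = compress(z, m)
    return z


# The honest party holds a secret IV and publishes the tag of a message.
secret_iv = secrets.token_bytes(N)
blocks = [b"A" * N, b"B" * N]
tag = md(secret_iv, blocks)

# The attacker sees only `tag` and `blocks`, yet predicts the tag of a
# longer message by running one more compression step from `tag`.
extra = b"C" * N
forged = compress(tag, extra)
```

Here `forged` equals `md(secret_iv, blocks + [extra])` even though the attacker never used `secret_iv`, which a pseudorandom function would not allow.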
One fix for this is to use a different
A variant of this construction (where
The simplest implementation for a compression function is to take a block cipher with an
Almost all practically used hash functions are based on the Merkle-Damgard paradigm. Hash functions are designed to be extremely efficient,[6] which also means that they are often at the "edge of insecurity" and indeed have fallen over the edge.
In 1990 Ron Rivest proposed MD4. Weaknesses in it were already shown in 1991, and a full collision was found in 1995. Even faster attacks have since been found, and MD4 is considered completely insecure.
In response to these weaknesses, Rivest designed MD5 in 1991. A weakness was shown for it in 1996 and a full collision was shown in 2004. Hence it is now also considered insecure.
In 1993 the National Institute of Standards and Technology (NIST) proposed a standard for a hash function known as the Secure Hash Algorithm (SHA), which has quite a few similarities with the MD4 and MD5 functions. This function is known as SHA-0, and the standard was replaced in 1995 with SHA-1, which includes an extra "mixing" (i.e., bit rotation) operation. At the time no explanation was given for this change, but SHA-0 was later found to be insecure. In 2002 a variant with longer output, known as SHA-256, was added (as well as some others). In 2005, following the MD5 collision, significant weaknesses were shown in SHA-1. In 2017, a full SHA-1 collision was found. Today SHA-1 is considered insecure and SHA-256 is recommended.
Given the weaknesses in MD5 and SHA-1, NIST started in 2006 a competition for a new hashing standard, based on functions that seem sufficiently different from the MD5/SHA-0/SHA-1 family. (SHA-256 is unbroken but it seems too close for comfort to those other systems.) The hash function Keccak was selected as the winner in 2012 and was published as the new standard SHA-3 in August of 2015.
The NSA is the world's largest employer of mathematicians, and is very heavily invested in cryptographic research. It seems quite possible that they devote far more resources to analyzing symmetric primitives such as block ciphers and hash functions than the open research community. Indeed, the history above suggests that the NSA has consistently discovered attacks on hash functions before the cryptographic community (and the same holds for the differential cryptanalysis technique for block ciphers). That said, despite the "mythic" powers that are sometimes ascribed to the NSA, this history suggests that they are ahead of the open community but not so much ahead, discovering attacks on hash functions about 5 years or so ahead.
There are a few ways we can get "insider views" into the NSA's thinking. Some such insights can be obtained from the Snowden documents. The Flame malware was discovered in Iran in 2012 after operating since at least 2010. It used an MD5 collision to achieve its goals. Such a collision had been known in the open literature since 2008, but Flame used a different variant that was unknown in the literature. For this reason it is suspected that it was designed by a western intelligence agency.
Another insight into the NSA's thoughts can be found in pages 12-19 of the NSA's internal Cryptolog magazine, which has been recently declassified; one can find there a rather entertaining and opinionated (or obnoxious, depending on your point of view) review of the CRYPTO 1992 conference. On page 14 the author remarks that certain weaknesses of MD5 demonstrated in the conference are unlikely to be extended to the full version, which suggests that the NSA (or at least the author) was not aware of the MD5 collisions at the time.
Hash functions are of course also widely used for non-cryptographic applications such as building hash tables and load balancing. For these applications people often use linear hash functions known as cyclic redundancy codes (CRC). Note however that even in those seemingly non-cryptographic applications, an adversary might cause significant slowdown to the system if he can generate many collisions. This can and has been used to obtain denial of service attacks. As a rule of thumb, if the inputs to your system might be generated by someone who does not have your best interests at heart, you're better off using a cryptographic hash function.
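The linearity of CRCs mentioned above can be checked directly with Python's `zlib.crc32`. For messages of equal length, CRC-32 is an affine function of the message bits over GF(2), which implies that the checksum of the XOR of any three equal-length messages is the XOR of their checksums. The helper `xor_bytes` and the sample messages below are illustrative choices.

```python
import zlib


def xor_bytes(*parts: bytes) -> bytes:
    """Bytewise XOR of several byte strings of equal length."""
    out = bytearray(len(parts[0]))
    for p in parts:
        assert len(p) == len(out)
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)


# Three arbitrary 12-byte messages.
a, b, c = b"hello world!", b"attack at 9!", b"0123456789ab"

lhs = zlib.crc32(xor_bytes(a, b, c))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
```

The two values `lhs` and `rhs` agree. This sketch does not itself construct collisions, but this kind of algebraic structure is what lets an adversary find many colliding inputs by solving linear equations, something that is believed to be infeasible for a cryptographic hash function.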
Footnotes
1. This is one of the places where we simplify and deviate from the actual Bitcoin system. In the actual Bitcoin system, the atomic unit is known as a satoshi and one bitcoin (abbreviated BTC) is $10^8$ satoshis. For reasons of efficiency, there is no individual identifier per satoshi and transactions can involve transfer and creation of multiple satoshis. However, conceptually we can think of atomic coins each of which has a unique identifier.
2. There are reasons why Bitcoin uses digital signatures and not these puzzles. The main issue is that we want to bind the puzzle not just to the coin but also to the particular transaction, so that if you know the solution to the puzzle $P$ corresponding to the coin $ID$ and want to use that to transfer it to $Q$, it won't be possible for someone to take your solution and use that to transfer the coin to $Q'$ before your transaction is added to the public ledger. We will come back to this issue after we learn about digital signatures.
3. This was a rather visionary paper in that it foresaw this issue before the term "spam" was introduced and indeed when email itself, let alone spam email, was hardly widespread.
4. The actual bitcoin protocol is slightly more general, where the proof is some $x$ such that $H(ID|x)$, when interpreted as a number in $[2^n]$, is at most $T$. There are also other issues about how exactly $x$ is placed and $ID$ is computed from past history that we ignore here.
5. Note that the other side of the birthday bound shows that you can always find a collision in $h_k$ using roughly $2^{n/2}$ queries. For this reason we typically need to double the output length of hash functions compared to the key size of other cryptographic primitives (e.g., $256$ bits as opposed to $128$ bits).
6. For example, the Boneh-Shoup book quotes processing times of up to 255MB/sec on a 1.83 GHz Intel Core 2 processor, which is more than enough to handle not just Harvard's network but even Lamar College's.