[MUSIC PLAYING] DOUG LLOYD: By now you know a lot about arrays, and you know a lot about linked lists. And we've discuss the pros and cons, we've discussed that linked lists can get bigger and smaller, but they take up more size. Arrays are much more straightforward to use, but they're restrictive in as much as we have to set the size of the array at the very beginning and then we're stuck with it.
But that's, we've pretty much exhausted all of our topics about linked lists and arrays. Or have we? Maybe we can do something even more creative. And that sort of lends the idea of a hash table.
So in a hash table we're going to try combine an array with a linked list. We're going to take the advantages of the array, like random access, being able to just go to array element 4 or array element 8 without having to iterate across. That's pretty fast, right?
But we also want to have our data structure be able to grow and shrink. We don't need, we don't want to be restricted. And we want to be able to add and remove things very easily, which if you recall, is very complex with an array. And we can call this new thing a hash table.
And if implemented correctly, we're sort of taking the advantages of both data structures you've already seen, arrays and linked lists. Insertion can start to tend toward theta of 1. Theta we haven't really discussed, but theta is just the average case, what's actually going to happen. You're not always going to have the worst case scenario, and you're not always going to have the best case scenario, so what's the average scenario?
Well an average insertion into a hash table can start to get close to constant time. And deletion can get close to constant time. And lookup can get close to constant time. That's-- we don't have a data structure yet that can do that, and so this already sounds like a pretty great thing. We've really mitigated the disadvantages of each on its own.
To get this performance upgrade though, we need to rethink how we add data into the structure. Specifically we want the data itself to tell us where it should go in the structure. And if we then need to see if it's in the structure, if we need to find it, we want to look at the data again and be able to effectively, using the data, randomly access it. Just by looking at the data we should have an idea of where exactly we're going to find it in the hash table.
Now the downside of a hash table is that they're really pretty bad at ordering or sorting data. And in fact, if you start to use them to order or sort data you lose all of the advantages you previously had in terms of insertion and deletion. The time becomes closer to theta of n, and we've basically regressed into a linked list. And so we only want to use hash tables if we don't care about whether data is sorted. For the context in which you'll use them in CS50 you probably don't care that the data is sorted.
So a hash table is a combination of two distinct pieces with which we're familiar. The first is a function, which we usually call a hash function. And that hash function is going to return some non-negative integer, which we usually call a hashcode, OK? The second piece is an array, which is capable of storing data of the type we want to place into the data structure. We'll hold off on the linked list element for now and just start with the basics of a hash table to get your head around it, and then we'll maybe blow your mind a little bit when we combine arrays and link lists together.
The basic idea though is we take some data. We run that data through the hash function. And so the data is processed and it spits out a number, OK? And then with that number we just store the data we want to store in the array at that location. So for example we have maybe this hash table of strings. It's got 10 elements in it, so we can fit 10 strings in it.
Let's say we want to hash John. So John as the data we want to insert into this hash table somewhere. Where do we put it? Well typically with an array so far we probably would put it in array location 0. But now we have this new hash function.
And let's say that we run John through this hash function and it's spits out 4. Well that's where we're going to want to put John. We want to put John in array location 4, because if we hash John again-- let's say later we want to search and see if John exists in this hash table-- all we need to do is run it through the same hash function, get the number 4 out, and be able to find John immediately in our data structure. That's pretty good.
Let's say we now do this again, we want to hash Paul. We want to add Paul into this hash table. Let's say that this time we run Paul through the hash function, the hashcode that is generated is 6. Well now we can put Paul in the array location 6. And if we need to look up whether Paul is in this hash table, all we need to do is run Paul through the hash function again and we're going to get 6 out again.
And then we just look at array location 6. Is Paul there? If so, he's in the hash table. Is Paul not there? He's not in the hash table. It's pretty straightforward.
Now how do you define a hash function? Well there's really no limit to the number of possible hash functions. In fact there's a number of really, really good ones on the internet. There's a number of really, really bad ones on the internet. It's also pretty easy to write a bad one.
So what makes up a good hash function, right? Well a good hash function should use only the data being hashed, and all of the data being hashed. So we don't want to use anything-- we don't incorporate anything else other than the data. And we want to use all of the data. We don't want to just use a piece of it, we want to use all of it. A hash function should also be deterministic. What does that mean? Well it means that every time we pass the exact same piece of data into the hash function we always get the same hashcode out. If I pass John into the hash function I get out 4. I should be able to do that 10,000 times and I'll always get 4. So no random numbers effectively can be involved in our hash tables-- in our hash functions.
A hash function should also uniformly distribute data. If every time you run data through the hash function you get the hashcode 0, that's probably not so great, right? You probably want to big a range of hash codes. Also things can be spread out throughout the table. And also it would be great if really similar data, like John and Jonathan, maybe were spread out to weigh different locations in the hash table. That would be a nice advantage.
Here's an example of a hash function. I wrote this one up earlier. It's not a particularly good hash function for reasons that don't really bear going into right now. But do you see what's going on here? It seems like we're declaring a variable called sum and setting it equal to 0. And then apparently I'm doing something so long as strstr[j] is not equal to backslash 0. What am I doing there?
This is basically just another way of implementing [? strl ?] and detecting when you've reached the end of the string. So I don't have to actually calculate the length of the string, I'm just using when I hit the backslash 0 character I know I've reached the end of the string. And then I'm going to keep iterating through that string, adding strstr[j] to sum, and then at the end of the day going to return sum mod HASH_MAX.
Basically all this hash function is doing is adding up all of the ASCII values of my string, and then it's returning some hashcode modded by HASH_MAX. It's probably the size of my array, right? I don't want to be getting hash codes if my array is of size 10, I don't want to be getting out hash codes 11, 12, 13, I can't put things into those locations of the array, that would be illegal. I'd suffer a segmentation fault.
Now here is another quick aside. Generally you're probably not going to want to write your own hash functions. It is actually a bit of an art, not a science. And there's a lot that goes into them. The internet, like I said, is full of really good hash functions, and you should use the internet to find hash functions because it's really just kind of an unnecessary waste of time to create your own.
You can write simple ones for testing purposes. But when you actually are going to start hashing data and storing it into a hash table you're probably going to want to use some function that was generated for you, that exists on the internet. If you do just be sure to cite your sources. There's no reason to plagiarize anything here.
The computer science community is definitely growing, and really values open source, and it's really important to cite your sources so that people can get attribution for the work that they're doing to the benefit of the community. So always be sure-- and not just for hash functions, but generally when you use code from an outside source, always cite your source. Give credit to the person who did some of the work so you don't have to.
OK so let's revisit this hash table for a second. This is where we left off after we inserted John and Paul into this hash table. Do you see a problem here? You might see two. But in particular, do you see this possible problem?
What if I hash Ringo, and it turns out that after processing that data through the hash function Ringo also generated the hashcode 6. I've already got data at hashcode-- array location 6. So it's probably going to be a bit of a problem for me now, right?
We call this a collision. And the collision occurs when two pieces of data run through the same hash function yield the same hashcode. Presumably we still want to get both pieces of data into the hash table, otherwise we wouldn't be running Ringo arbitrarily through the hash function. We presumably want to get Ringo into that array.
How do we do it though, if he and Paul both yield hashcode 6? We don't want to overwrite Paul, we want Paul to be there too. So we need to find a way to get elements into the hash table that still preserves our quick insertion and quick look up. And one way to deal with it is to do something called linear probing.
Using this method if we have a collision, well, what do we do? Well we can't put him in array location 6, or whatever hashcode was generated, let's put him at hashcode plus 1. And if that's full let's put him in hashcode plus 2. The benefit of this being if he's not exactly where we think he is, and we have to start searching, maybe we don't have to go too far. Maybe we don't have to search all n elements of the hash table. Maybe we have to search a couple of them.
And so we're still tending towards that average case being close to 1 vs close to n, so maybe that'll work. So let's see how this might work out in reality. And let's see if maybe we can detect the problem that might occur here.
Let's say we hash Bart. So now we're going to run a new set of strings through the hash function, and we run Bart through the hash function, we get hashcode 6. We take a look, we see 6 is empty, so we can put Bart there.
Now we hash Lisa and that also generates hashcode 6. Well now that we're using this linear probing method we start at 6, we see that 6 is full. We can't put Lisa in 6. So where do we go? Let's go to 7. 7's empty, so that works. So let's put Lisa there.
Now we hash Homer and we get 7. OK well we know that 7's full now, so we can't put Homer there. So let's go to 8. Is 8 available? Yeah, and 8's close to 7, so if we have to start searching we're not going to have to go too far. And so let's put Homer at 8.
Now we hash Maggie and returns 3, thank goodness we're able to just put Maggie there. We don't have to do any sort of probing for that. Now we hash Marge, and Marge also returns 6.
Well 6 is full, 7 is full, 8 is full, 9, all right thank God, 9 is empty. I can put Marge at 9. Already we can see that we're starting to have this problem where now we're starting to stretch things kind of far away from their hash codes. And that theta of 1, that average case of being constant time, is starting to get a little more-- starting to tend a little more towards theta of n. We're starting to lose that advantage of hash tables.
This problem that we just saw is something called clustering. And what's really bad about clustering is that once you now have two elements that are side by side it makes it even more likely, you have double the chance, that you're going to have another collision with that cluster, and the cluster will grow by one. And you'll keep growing and growing your likelihood of having a collision. And eventually it's just as bad as not sorting the data at all.
The other problem though is we still, and so far up to this point, we've just been sort of understanding what a hash table is, we still only have room for 10 strings. If we want to continue to hash the citizens of Springfield, we can only get 10 of them in there. And if we try and add an 11th or 12th, we don't have a place to put them. We could just be spinning around in circles trying to find an empty spot, and we maybe get stuck in an infinite loop.
So this sort of lends to the idea of something called chaining. And this is where we're going to bring linked lists back into the picture. What if instead of storing just the data itself in the array, every element of the array could hold multiple pieces of data? Well that doesn't make sense, right? We know that an array can only hold-- each element of an array can only hold one piece of data of that data type.
But what if that data type is a linked list, right? So what if every element of the array was a pointer to the head of a linked list? And then we could build those linked lists and grow them arbitrarily, because linked lists allow us to grow and shrink a lot more flexibly than an array does. So what if we now use, we leverage this, right? We start to grow these chains out of these array locations.
Now we can fit an infinite amount of data, or not infinite, an arbitrary amount of data, into our hash table without ever running into the problem of collision. We've also eliminated clustering by doing this. And well we know that when we insert into a linked list, if you recall from our video on linked lists, singly linked lists and doubly linked lists, it's a constant time operation. We're just adding to the front.
And for look up, well we do know that look up in a linked list can be a problem, right? We have to search through it from beginning to end. There's no random access in a linked list. But if instead of having one linked list where a lookup would be O of n, we now have 10 linked lists, or 1,000 linked lists, now it's O of n divided by 10, or O of n divided by 1,000.
And while we were talking theoretically about complexity we disregard constants, in the real world these things actually matter, right? We actually will notice that this happens to run 10 times faster, or 1,000 times faster, because we're distributing one long chain across 1,000 smaller chains. And so each time we have to search through one of those chains we can ignore the 999 chains we don't care about , and just search that one.
Which is on average to be 1,000 times shorter. And so we still are sort of tending towards this average case of being constant time, but only because we are leveraging dividing by some huge constant factor. Let's see how this might actually look though. So this was the hash table we had before we declared a hash table that was capable of storing 10 strings. We're not going to do that anymore. We already know the limitations of that method. Now our hash table's going to be an array of 10 nodes, pointers to heads of linked lists.
And right now it's null. Each one of those 10 pointers is null. There's nothing in our hash table right now.
Now let's start to put some things into this hash table. And let's see how this method is going to benefit us a little bit. Let's now hash Joey. We'll will run the string Joey through a hash function and we return 6. Well what do we do now?
Well now working with linked lists, we're not working with arrays. And when we're working with linked lists we know we need to start dynamically allocating space and building chains. That's sort of how-- those are the core elements of building a linked list. So let's dynamically allocate space for Joey, and then let's add him to the chain.
So now look what we've done. When we hash Joey we got the hashcode 6. Now the pointer at array location 6 points to the head of a linked list, and right now it's the only element of a linked list. And the node in that linked list is Joey.
So if we need to look up Joey later, we just hash Joey again, we get 6 again because our hash function is deterministic. And then we start at the head of the linked list pointed to by array location 6, and we can iterate across that trying to find Joey. And if we build our hash table effectively, and our hash function effectively to distribute data well, on average each of those linked lists at every array location will be 1/10 the size of if we just had it as a single huge linked list with everything in it.
If we distribute that huge linked list across 10 linked lists each list will be 1/10 the size. And thus 10 times quicker to search through. So let's do this again. Let's now hash Ross.
And let's say Ross, when we do that the hash code we get back is 2. Well now we dynamically allocate a new node, we put Ross in that node, and we say now array location 2, instead of pointing to null, points to the head of a linked list whose only node is Ross. And we can do this one more time, we can hash Rachel and get hashcode 4. malloc a new node, put Rachel in the node, and say a array location 4 now points to the head of a linked list whose only element happens to be Rachel.
OK but what happens if we have a collision? Let's see how we handle collisions using the separate chaining method. Let's hash Phoebe. We get the hashcode 6. In our previous example we were just storing the strings in the array. This was a problem.
We don't want to clobber Joey, and we've already seen that we can get some clustering problems if we try and step through and probe. But what if we just kind of treat this the same way, right? It's just like adding an element to the head of a linked list. Let's just malloc space for Phoebe.
We'll say Phoebe's next pointer points to the old head of the linked list, and then 6 just points to the new head of the linked list. And now look, we've changed Phoebe in. We can now store two elements with hashcode 6, and we don't have any problems.
That's pretty much all there is to chaining. And chaining is definitely the method that's going to be most effective for you if you are storing data in a hash table. But this combination of arrays and linked lists together to form a hash table really dramatically improves your ability to store large amounts of data, and very quickly and efficiently search through that data.
There's still one more data structure out there that might even be a bit better in terms of guaranteeing that our insertion, deletion, and look up times are even faster. And we'll see that in a video on tries. I'm Doug Lloyd, this is CS50.