[MUSIC PLAYING] DOUG LLOYD: So we've been inching closer and closer that holy grail of data structures, one that we can insert into, delete from, and look up in constant time. Right. That's sort of the goal. We want to be able to do things very, very quickly.
Have we found it here when we're talking about tries? Well, let's take a look. So we've seen several different data structures that handle the mapping of so-called key-value pairs, mapping some piece of data to some other piece of data so we know where to find information in the structure.
So for array, for example, the key is the element index or array location 0 or array 1 and so on. And the value is the data that exists at that location. So what is stored in array 0? What is stored in array 1 versus just 0 and 1, which would be the keys.
With a hash table it's sort of the same idea. With a hash table, we have this hash function that generates hash codes. So the key is the hash code of the data. And the value, particularly we talked about chaining in the video on hash tables, is that linked list of data that hashes to that hashcode. Right. What about another approach this method, though? What about a method where the key is guaranteed to be unique, unlike a hash table, where we could end up with two pieces of data having the same hashcode. And then we have to deal with that by either probing or more preferably chaining to fix that problem.
So now we can guarantee that our key will be unique. And what if our value was just something as easy as true and false that tells us whether or not that piece of information exists in the structure? A Boolean could be as simple as a bit. Realistically it's probably a byte more likely than a bit. But that's a lot smaller than storing maybe a 50-character string, for example.
So tries, similar to hash tables, which combine arrays and linked list, tries combine arrays, structures, and pointers together to store data in an interesting way that's pretty different from anything we've seen so far. Now we use the data as a roadmap to navigate this data structure. And if we can follow the roadmap, if we can follow the data from beginning to end, we'll know whether that data exist in the trie.
And if we can't follow the map from meaning to end at all, the data can't exist. Again, the keys here are guaranteed to be unique. And so unlike a hash table, we'll never have to deal with collisions here. And no two pieces of data have exactly the same roadmap unless that data is identical.
If we insert John, then we search for John. That's two identical pieces of data, right, we're looking through. But otherwise, any two pieces of data are guaranteed to have unique roadmaps through this data structure. And we're going to take a look at a visual of this in just a moment.
We'll do this by trying to create a new data structure, mapping the following key value pairs. In this case, we're not going to use something as simple as a Boolean. We actually will store the string. And that string is going to be the name of a university.
And the key is going to be the year when that university was founded. All years for universities are going to be four digits. And so we'll use those four digits to navigate through this data structure. And we'll see, again, how we do that in just a second.
At the end of the path, we'll see the name of the university that corresponds to that key, those four digits. The basic idea behind a trie is we have a central route. So think about it like a tree. And this is similar in spelling and in concept to a tree. Generally when we think about trees in the real world, they have a root that's in the ground and they grow upward and they have branches and they have leaves. And basically the idea of a trie is exactly the same, except that root is anchored somewhere in the sky. And the leaves are at the bottom.
So it's kind of like taking a tree and just flipping it upside down. But there are still branches. And those would be our pathways, those will be our connections from the root to the leaves. In this case, those paths, those branches are labeled with digits that tell us which way to go from where we are.
If we see a 0, we go down this branch, if we see a 1, we go down this branch, and so and so on. Well, what does this mean? Well, that means that at every junction point and every node in the middle and every branch, there are 10 possible places that we can go. So there are 10 pointers from every location.
And this is where tries can get a little bit intimidating for somebody who's doesn't have a lot of experience in computer science before. But tries are really pretty awesome. And if you have the chance to work with them and you're willing to dig-in and experiment with them, they're really quite interesting data structures to work with. If we want to insert an element into the trie, all we need to do is build the correct path from the root to the leaf. Here's what every step along the way might look like. We're going to define a new data structure for a new node called a trie.
And inside of that data structure there are two pieces. We're going to store the name of a university. And we're going to store an array of pointers to other nodes of the same type. So, again, this is that sort of concept of everywhere we are, we at 10 possible places we can go. If we see a 0, we go down this branch. If we see a 1, this branch, and so on and so on and so on. If we say 9, we go down this branch. So at every junction point, we can go 10 possible places. So every node has to contain 10 pointers to other nodes, to 10 other nodes.
And the data we're storing is just the name of the university. So let's build a trie. Let's insert a couple of items into our trie. So at the very top, this is our root node. This is probably going to be something you're going to globally declare. And you're going to globally maintain a pointer to this node always.
You're going to say, root equals, and you're going to malloc yourself trie node. And you're never going to touch root again. Every time you want to start navigating through, you set another pointer equal to root, such as trav, which is the example I use in many of my videos here on stacks and queues and link lists and so on.
You set another pointer called trav for traversal. And you use trav to navigate through the data structure. So let's see how this might look. So right now, what does a node look like? Well, just as our data structure declaration indicated, we have a string, which in this case is empty. There's nothing here.
And an array of 10 pointers. And right now, we only have 1 node in this trie. There's nothing else in it. So all 10 of those pointers point to null. That's what the red indicates.
Let's insert the string Harvard. Let's insert the University of Harvard into this trie, which was founded in the year 1636. We want to use the key, 1636, to tell us where we're going to store Harvard in the trie. Now, how might we do that?
It might look something like this. We start at the root. And we have these 10 places we can go. The root is just like any other node of the trie. There are 10 places we can go from here.
Where do we probably want to go if the key is 1636? There's really two options. Right. We can build the key from right to left and start with 6. Or we could build the key from left to right and start with 1.
It's probably more intuitive as a human being to understand we'll just go left to right. And so if I want to insert Harvard into this trie, I probably want to start by starting at the root, looking at my 10 options in front of me, and saying I want to go down the 1 path. OK.
Now, 1 path is currently null. So if I want to proceed down that path to insert this element into the trie, I have to malloc a new node, have 1 point there, and then I'm good to go.
So I basically am at a point where I'm standing at the root of the tree or the trie and there are 10 branches. But each branch has a gate in front of it. Right. Because there's nothing else there's. No safe passage. That means that there's nothing down any of those branches. If I want to start building something, I want to remove the gate. I want to remove the gate in front of number 1. And I want to walk down that. And I want to build another place for me to go.
And that's what I've done here. So 1 no longer points to null. I've said it's safe to go down here now. I built another node. And when I get to that node, I have another decision to make. Where am I going to go from here?
Well, I've already gone down the 1. So now I probably want to go down the 6. Right. Again, I have 10 locations I can choose. So let's now go down number 6. So I clear the gate in front of number 6. And I walk down there. And I build another node. And I've reached another junction point.
Again, I have 10 choices for where I can go. I've moved from 1 to 6. So now I probably want to go to 3. 3, there's nowhere I can go. So I have to clear the way and build myself a new space. And then from 3, where do I want to go? I want to go down 6.
And, again, I had to clear the way to do it. So now I've used my key to insert create nodes and start to build this trie. I've started at the root. I've gone down 1636. And now I'm at the bottom there on that node. And you might be able to see it on your screen.
It's highlighted in yellow. That's where I currently am. My key is done. I've exhausted every position in my key. So I can't go any further. So at this point, all I really need to do is say, OK. It's kind of like looking down at the ground, if you're envisioning yourself as this sort of path with different connections. Sort of looking down and sort of spray painting Harvard on the ground. That's the name of this. Know that's what's at this location. If we start at the root and we go down 1 and then 6 and then 3 and then 6, where are we? Well if we look down and we see Harvard, then we know that Harvard was founded in 1636 based on the way we're implementing this data structure. So that was hopefully straightforward. We're going to do two more insertions. And hopefully it'll help to see this done twice more.
Now, let's insert another university. Let's insert Yale into this trie. Yale was founded in 1701. So we'll start at the root, as we always do. And we set a traversal pointer. We're going to use that to move through. The first thing we want to do is go down the 1 path. That's the first digit of our key. Fortunately, though, we don't have to do any work this time. The 1 path has already been cleared. I cleared it previously when I was inserting Harvard at 1636. So I can safely move down 1 and just go there. If can move down the 1.
Now, though, I want to go to 7. I cleared the way at 6. I know I can safely proceed down the 6 path. But I need to proceed on the 7 path. So what do I need to do? Well, just like before, I just need to clear the gate, get out of the way, and build a new node from the 7 path. Just like this.
So now I've moved 1 and then 7. And now notice, I'm sort of on this new subbranch. Right. Everything else from 16 on, I don't care about. I'm not doing 16 anything. I'm doing 17 stuff.
So now from 17 on, I have to sort of blaze new trails here. The next digit my key is 0. I clearly can't get anywhere. I just built this node. So I know there's no paths forward from here. So I have to make one myself.
So I malloc a new node and have 0 point there. And then one more time, I malloc a new node and have one point there. Again, I've exhausted my key, 1701. So I look down and I spray paint Yale. That's the name of this node.
And so now if I ever need to see if Yale is in this trie, I start at the root, I go down 1701, and look down. And if I see Yale spray painted onto the ground, then I know Yale exists in this trie. Let's do one more. Let's insert Dartmouth into this trie, which was founded in 1769.
Start at the root again. My first digit of my key is 1. I can safely move down that path. That already exists. The next digit of my key is 7. I can safely move down that path. It exists as well.
My next is 6. From here, from where I currently am in yellow there in that middle node, 6 is currently locked off. If I want to go down that path, I have to build it myself. So I'll malloc a new node and have 6 point there. And then, again, I'm blazing new trails here.
So I malloc a new node so that from that node-- path number 9-- and then now if I travel 1769, and I look down. There's nothing currently spray painted there. I can write Dartmouth. And I've inserted Dartmouth into the trie.
So that's inserting things into the trie. Now we want to search for things. How do we search for things in the trie? Well, it's pretty much the same idea. Now we just use the digits of the key to see if we can navigate from the root to where we want to go in the trie.
If we hit a dead end at any point, then we know that that element can't exists or else that path would have already been cleared. If we make it all the way to the end, all we need to do is look down and see if that's the element we're looking for. If it is, success. If it's not, fail.
So let's search for Harvard in this trie. We start at the root. And, again, we're going to create a traversal pointer to do our moves for us. From the root we know that the first place we need to go is 1, can we do that? Yes, we can. If safely exists. We can go there.
Now, the next place we need to go is 6. Does the 6 path exist from here? Yeah, it does. We can go down the 6 path. And we end up here.
Can we go down the 3 path from here? Well, as it turns out, yes, that exists too. And can we get on the 6 path from here? Yes, we can.
We haven't quite answered the question yet. There's still one more step, which is now we need to look down and see if that's actually-- if we're looking for Harvard, is that what we find after we exhaust the key? In the example we're using here, the years are always four digits. But you might be using the example where you are storing a dictionary of words.
And so instead of having 10 pointers for my location, you might have 26. One for each letter of the alphabet. And there are some words like bat, which is a subset of batch, for example. And so even if you get to the end of the key and you look down, you might not see what you're looking for.
So you always have to traverse the entire path and then if you were successfully able to traverse the entire path, look down and do one final confirmation. Is that what I'm looking for? Well, I look down after starting at the top and going 1636. I look down. I see Harvard. So, yes, I succeeded. What if what I'm looking for isn't in the trie, though. What if I'm looking for Princeton, which was founded in 1746. And so 1746 becomes my key to navigate through the trie. Well, I start at the root. And the first place I want to goes down the 1 path. Can I do it? Yes, I can.
Can I go down the 7 path from there? Yeah, I can. That exists too. But can I go down the 4 path from here? That's like asking the question, can I proceed down that little square that I've highlighted in yellow? There's nothing there. Right.
There's no way forward down the 4 path. If Princeton was in this trie, that 4 would have been cleared for us already. And so at this point we've reached a dead end. We can't go any further. And so we can say, definitively, no. Princeton does not exist in this trie.
So what does this all mean? Right. There's a lot going on here. There's pointers all over the place. And, as you can see just from the diagram, there's a lot of nodes that are kind of flying around. But notice every time we wanted to check whether something was in the trie, we only had to make 4 moves.
Every time we wanted to insert something in the trie, we have to make 4 moves, possibly mallocing some stuff along the way. But as we saw when we inserted Dartmouth into the trie, sometimes some of the path might already be cleared for us. And so as our trie gets bigger and bigger, we have do less work every time to insert new things because we've already built a lot of the intermediate branches along the way. If we only ever have to look at 4 things, 4 is just a constant. We really are kind of approaching constant time insertion and constant time lookup. The tradeoff, of course, being that this trie, as you can probably tell, is huge. Each one of these nodes takes up a lot of space.
But that's the tradeoff. If we want really quick insertion, really quick deletion, and really quick lookup, we have to have a lot of data flying around. We have to set aside a lot of space and memory for that data structure to exist.
And so that's the tradeoff. But it looks like we might have found it. We might have found that holy grail of data structures with quick insertion, deletion, and lookup. And maybe this will be an appropriate data structure to use for whatever information we're trying to store. I'm Doug Lloyd, this is cs50.