In this blog, we will talk about an extremely interesting algorithm: unlike the usual engineering algorithms we come across, it doesn't give a perfect answer.
Problem Statement: To find the number of unique users connected to your website at any given point in time.
P.S: We are going to go through multiple implementations of this.
Here we store the connection information in a database, where USER_ID is the same for the same user across connections. What we want is the number of unique USER_IDs, and whether you use SQL or NoSQL, the straightforward way is to fire a distinct query on USER_ID.
SELECT DISTINCT USER_ID FROM Connection WHERE STATUS = 'ACTIVE';
What’s the problem with this?
Every time you run this query, it hits the database and goes through all the connections, so it's an O(N) operation if you have N connections. The time complexity is really bad. There is also a space problem: to return distinct USER_IDs, the database has to store the entries it has seen somewhere. For example, every time it finds a unique entry it can add it to a HashMap, and every subsequent entry has to be checked against that map. So the space complexity of this approach is also O(N). This is a really bad way to do things.
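The scan described above can be sketched in Python, with an in-memory set playing the role of the database's internal HashMap. The connection records here are made up for illustration:

```python
# Naive distinct count: scan every connection and remember each
# USER_ID seen so far. Time O(N), extra space O(N) in the worst case.
def count_unique_users(connections):
    seen = set()  # stands in for the HashMap the database builds internally
    for conn in connections:
        if conn["status"] == "ACTIVE":
            seen.add(conn["user_id"])
    return len(seen)

# illustrative data: the same user connected from two devices
connections = [
    {"user_id": 512, "status": "ACTIVE"},
    {"user_id": 513, "status": "ACTIVE"},
    {"user_id": 512, "status": "ACTIVE"},
]
print(count_unique_users(connections))  # 2
```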
So, on top of the previous approach, we can precompute this information. Instead of firing the distinct query whenever you need the number of unique users, whenever a person makes a connection you make an entry in the connection table and also maintain a separate table (the User Table) which stores just the USER_ID.
So what does it do?
Well, every time a connection is made, you take the USER_ID, go to the User Table we created, and check whether it's already there. If not, you insert it and update the counter. If it is already there, say Sup joined from another device, you still make an entry in the connection table but you don't update the counter.
If you assume the User Table is backed by something like a HashMap with really fast access, insertion is O(1). The constant factor increases, but not too badly, so in terms of time complexity it's O(1). In terms of space complexity, though, you still need an additional table, and in the worst case it holds every unique user you've ever seen across all your connections, which is O(N).
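A minimal sketch of this precomputed approach, assuming an in-memory set stands in for the User Table:

```python
# Precompute the count: update on every new connection instead of
# re-scanning. O(1) insertion, but the table is still O(N) space.
class UserCounter:
    def __init__(self):
        self.user_table = set()  # stands in for the separate User Table
        self.counter = 0

    def on_connect(self, user_id):
        if user_id not in self.user_table:
            self.user_table.add(user_id)
            self.counter += 1  # bump only for a first-time user

uc = UserCounter()
uc.on_connect("sup")
uc.on_connect("sup")  # Sup joins from another device: counter unchanged
print(uc.counter)     # 1
```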
Umm...there's a problem!
The problem comes when the system is distributed. You have multiple nodes, and for the sake of example let's say UID512 connected to both Vedantu Server 1 and Vedantu Server 2. Then VS1's database is going to return COUNTER = 1, and so is VS2's.
And that's the problem, because you have just one unique user. The way to get over this is to somehow aggregate these results, merging them so that they give you COUNTER = 1, and return that.
But this is of course extremely slow, because with N servers you have to reach a consensus on the information. You also have a lot of information getting duplicated. In the connection scenario it's not that bad, but suppose you want to know the number of videos UID512 has viewed, or their transaction history: that data will live on multiple boxes, and it's going to get really bloated in multiple places.
This is the problem we have: find the number of unique users in the given search space (USER_IDs). Theoretically, the O(1) time and O(N) space can't be improved further for an exact answer, but what can we do practically, given the search space? Our primary concerns here are reducing the memory footprint and simplifying consensus.
So finally we come to the HyperLogLog algorithm, which is going to solve this problem for us. One disclaimer: it is an approximation algorithm, and we expect the search space (USER_IDs) to be uniformly distributed and random.
NOTE: Why are we choosing USER_IDs instead of CONNECTION_IDs? Because we want to find the number of unique users, not the number of unique connections; every connection is unique anyway.
So USER_ID is what we are going to work with, and we assume it's a 64-bit number. The important thing here is that if it's a 64-bit number, uniformly distributed and random, then we expect it to end in a zero bit half (1/2) of the time.
But let's say I now want two zeros at the end. Then the probability is going to be 1/4, right? Because I just fixed two bits which could have been 00, 01, 10, or 11. In general, if I want 'x' zeros at the end, I expect to go through about 2^x numbers before I see one.
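To make the trailing-zero idea concrete, here is a small helper that counts trailing zero bits; the bit patterns in the example are illustrative:

```python
# Count trailing zero bits of a value. For uniform random numbers,
# P(at least x trailing zeros) = 1 / 2**x, so you expect to scan
# about 2**x numbers before seeing one with x trailing zeros.
def trailing_zeros(n, width=64):
    if n == 0:
        return width  # all bits are zero
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

print(trailing_zeros(0b1000))  # 3
print(trailing_zeros(0b1011))  # 0
```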
Now we turn this into a challenge: we go to every server and ask it, what is the maximum number of trailing zeros you saw across all the USER_IDs you came across?
So let's say it came across these numbers:
The maximum number of trailing zeros it came across is three. So approximately how many requests did we get?
Ans: 2^3 = 8 (so we probably came across 8 requests)
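The single-counter estimator above can be sketched in a few lines; the sample values are made up, with binary 1000 supplying the three trailing zeros:

```python
def trailing_zeros(n):
    # (n & -n) isolates the lowest set bit; its position is the count
    return (n & -n).bit_length() - 1 if n else 64

# Keep only the maximum trailing-zero count ever seen and estimate
# the cardinality as 2**max. Storage: one tiny integer.
def estimate(values):
    return 2 ** max(trailing_zeros(v) for v in values)

print(estimate([0b1011, 0b0110, 0b1000]))  # 8, from the three trailing zeros
```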
The variance is really high in this case because we have just 3 requests; with more requests it would be much better. One thing to notice is that if 1000 comes in again, it won't change the answer, which stays 8.
The space requirement here is negligible, because instead of storing O(N) entries, all we really need to store is the maximum number of trailing zeros we have come across, which is at most 64.
So basically 64 is the log of 2^64, and 6 is the log of 64: storing the count needs only log(log(N)) bits, and that's where the name HyperLogLog comes from :-)
So this algorithm needs a lot of improvement before we can trust it, and one way to do that is to reduce the variance we saw.
For example, with x = 5: we happened to see 5 consecutive trailing zeros, which gives an estimate of 2^5 = 32, while we actually got just 4 requests.
Fun Fact: We always hash when there is any trouble in computer science 😁
Basically, we take the maximum number of trailing zeros from each bucket, average them, and then raise 2 to that power. In this case we got 4, which is much better than 32 😁. Look at the diagram for more clarity.
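Here is a rough sketch of the bucketed estimator under the averaging scheme described above. Note that the published HyperLogLog actually uses a harmonic mean with a bias-correction constant; the bucket count K and the SHA-256 hash here are illustrative choices, not part of the original spec.

```python
import hashlib

K = 16  # number of buckets (illustrative choice)

def trailing_zeros(n):
    return (n & -n).bit_length() - 1 if n else 64

def bucketed_estimate(user_ids):
    maxima = [0] * K
    for uid in user_ids:
        # hash so that even sequential IDs look uniformly random
        h = int.from_bytes(hashlib.sha256(str(uid).encode()).digest()[:8], "big")
        bucket = h % K  # some hash bits pick the bucket...
        maxima[bucket] = max(maxima[bucket], trailing_zeros(h // K))  # ...the rest get ranked
    avg = sum(maxima) / K  # average the per-bucket maxima
    return K * 2 ** avg    # scale by K, since each bucket saw ~1/K of the stream
```

Feeding in 1,000 distinct IDs gives an estimate within a small constant factor of 1,000; the real algorithm's harmonic mean and bias-correction constant tighten this considerably.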
If you think about it, this is really smart. The first thing that came to my mind was to use K hash functions; instead, we mapped the USER_IDs to buckets, and that brings our variance down a lot. The space complexity also comes down to a constant factor now, which is amazing.
The best thing is that if you have a distributed system (multiple servers), you don't need to worry too much, because merging these buckets is now a simple operation, after which you do the same average-and-power computation. The merging part is still there, but it is much, much simpler now.
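Merging can be sketched as an element-wise maximum over the per-server bucket arrays: the merged buckets are exactly what a single server would have recorded had it seen both streams, so the estimate can then be computed once, centrally. The bucket values below are made up:

```python
# Merge two servers' bucket maxima: element-wise max. The result is
# identical to what one server seeing both streams would have stored.
def merge(buckets_a, buckets_b):
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

server1 = [3, 0, 2, 5]  # illustrative per-bucket maxima
server2 = [1, 4, 2, 0]
print(merge(server1, server2))  # [3, 4, 2, 5]
```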
So that's all about the HyperLogLog algorithm. I hope you enjoyed it, and I hope you love this algorithm as much as I do: it's got hashing, it's got randomness, everything a computer science algorithm needs. If you have any doubts or suggestions, leave them in the comment section below.
PS: I am now studying system design in detail. This involves understanding how organizations write software to handle millions of users: specifically, how large organizations stay available, reliable, and fast despite the enormous load on their machines.