Yesterday Google, Today Facebook

June 26th, 2008

So yesterday we delved into my theory of how Google works, but today we get to see the inner workings of how Facebook serves a picture. I came across a very interesting presentation that shows what kind of technology Facebook uses: how they customize their own kernels and file systems, and how they use CDNs (content delivery networks), caching, etc. to improve speed. I guess for starters I’ll have to explain how the internet works before I even get around to explaining one of the roles CDNs play.

“The internet is a bunch of interconnecting tubes”. Although this doesn’t fully do the internet justice, I can see how it might make sense to others. The internet is really a series of interconnected computers. You have thousands and thousands of computers connected to each other all across the world. Communication between them generally takes place between a server and a client. The further the server is from the client, the more computers the data has to pass through to arrive at the client.

A CDN is a network of computers that are generally well distributed across the region(s) it serves. These distributed computers cache, or save, the information that is frequently requested and act as a server for it. This prevents a client computer from having to wait for the data to come all the way from that super far away server. Obviously, there are other uses besides speed, such as sparing the system that generated the content from having to regenerate the same information a second time.

So in Facebook’s image serving system every picture gets cached at three levels, according to the lecture: once at the CDN level, once in memcache, and once by MySQL (although later on, the lecturer says two). In this case the most important reason for caching is to prevent disk reads or MySQL requests. If the request matches something in the cache, the cache simply returns the information it has stored in memory, bypassing any disk reads or MySQL queries. If the requested information isn’t in the cache, the server has to perform either a disk read or a MySQL request, which on a heavily trafficked system can be the difference between a split second and 5 seconds. In this case, if the information isn’t cached, the server hits the “NetApp”, which I visualize as a massive central database, to request the file’s location. This location is then used to retrieve the requested file. The file gets sent back to the user through the pipeline, and is also sent to the cache to be stored.
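The lookup flow described above can be sketched in a few lines of Python. This is only my illustration of the general idea, not Facebook’s code: the names fetch_photo, lookup_location and read_file are made up, and a plain dict stands in for the cache.

```python
def fetch_photo(photo_id, cache, lookup_location, read_file):
    """Return (data, "hit" or "miss") for a photo.

    cache is any dict-like object; lookup_location and read_file are
    hypothetical callables standing in for the slow location lookup
    and the slow disk read.
    """
    data = cache.get(photo_id)
    if data is not None:
        # Served from memory: no location lookup, no disk read.
        return data, "hit"

    # Cache miss: ask where the file lives, then read it.
    location = lookup_location(photo_id)
    data = read_file(location)

    # Store the result so the next request for this photo is a hit.
    cache[photo_id] = data
    return data, "miss"
```

The first request for a photo pays for the lookup and the read; every later request for the same photo is answered from memory.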

Their cache system uses a “most accessed, last out” system. What this means is that the more an image gets accessed, the longer it stays in the cache, which simply makes sense.
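One way to read “most accessed, last out” is a cache that, when full, throws away the entry with the fewest accesses. The lecture doesn’t spell out the exact policy, so the sketch below is just my guess at the idea; the class name LeastAccessedCache is made up.

```python
class LeastAccessedCache:
    """Toy cache that evicts the least-accessed entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}   # key -> cached value
        self.hits = {}   # key -> number of times the key was read

    def get(self, key):
        if key not in self.data:
            return None
        self.hits[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the entry that has been read the fewest times.
            victim = min(self.hits, key=self.hits.get)
            del self.data[victim]
            del self.hits[victim]
        self.data[key] = value
        self.hits.setdefault(key, 0)
```

With a capacity of 2, an image that has been read twice survives when a third image arrives, while an image that was never read gets evicted.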

The lecture also goes into how they created their own file system and kernel, and the reasons why they needed to.

It was a very interesting lecture, and I recommend that my readers check it out.

http://static.flowgram.com/p2.html#2qi3k8eicrfgkv

Google

June 25th, 2008

Ever wonder how Google spell check, related topic suggestions, or ranking works? I do.

My theory (yes, theory; I doubt anyone but the two founders truly knows the secret to how Google works) is that Google collects information on how users behave: the clicks on a link, the number of sites pointing to a link, etc. It uses this information to statistically guess at what the user truly wants based on data that seems to correlate with the user’s behavior.

In the case of a misspelling, instead of performing some sort of Levenshtein-distance check to find the best candidate for the word, Google can simply collect data on what users typed after making the typo, and suggest the word that has most frequently been typed in response to it.
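To make the theory concrete, here is a small Python sketch. It only illustrates my guess and is not how Google actually does it; the function name build_suggestions and the idea of a log of (first query, follow-up query) pairs are assumptions of mine.

```python
from collections import Counter, defaultdict


def build_suggestions(query_pairs):
    """Map each query to the follow-up that users typed most often.

    query_pairs is an iterable of (first_query, follow_up_query)
    tuples taken from a hypothetical search log.
    """
    follow_ups = defaultdict(Counter)
    for first, second in query_pairs:
        if first != second:
            follow_ups[first][second] += 1
    return {
        query: counts.most_common(1)[0][0]
        for query, counts in follow_ups.items()
    }
```

If most people who type “gogle” go on to type “google”, then “google” becomes the suggestion for “gogle”, and no string comparison is ever needed.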

Topic suggestions probably work the same way as misspellings: Google looks at behavioral similarities to suggest content that you might like.

My point is that in this process, Google probably doesn’t need to have the slightest idea of what’s in the content that it’s displaying on its pages.

My belief is that initially Google probably did content parsing to figure out how to sort the content and seed their database, and once the database was well seeded, they collected user behavior and used that to rank relevance.

I came across an article that seems to back this theory of mine, so I decided to post about it today:

http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

I completely disagree with the author on how we can throw the scientific method out of the window now since we have so much data, but I did appreciate the possible insight on how Google does things.

Now hosted with DownTownHost

June 19th, 2008

I just moved my server to DownTownHost. What a WORLD of difference. BlueHost = 8 CPUs with an average server load of 100. DownTownHost = 8 CPUs with an average server load of .07.

Also, I applied for the 4.95 per month plan and it’s offering all the features I want and need. After the 25% discount using the code below it’s 3.71 per month. That’s even better than GoDaddy. For now, I am very happy with the service.

Anyways, if you want to check out the host yourself use this link: DownTownHost

Oh btw, use the code “happy2008” to qualify for a 25% discount.

BlueHost = LackOfDecentHostBluesHost

June 16th, 2008

So I’ve been with BlueHost for 2-3 months now. I can say with certainty that as soon as I find a decent host, I’m moving again. I monitored BlueHost for 10 days, from June 2 to June 12, and the logs show that BlueHost is pretty much overloaded all the time. When I called their tech support, they gave me some lame excuse like “It’s not unusual for servers to be overloaded during peak hours for 5-10 minutes at a time in a shared hosting environment”. Since the server load coincidentally went down by the time I was done waiting for the representative to pick up, I had no choice but to wait for it to go back up before I called again. Unfortunately, the minute I hung up, the server load spiked again. This is when I decided to log the server loads. The logs show that the server is overloaded 50-90% of the time on average. I think that is simply unacceptable. If my web page takes FOREVER to load, I no longer consider it hosted. I think it’s okay if they’re overselling, as long as they keep the load under control. I’ve heard HostGator does a decent job of this, but that doesn’t mean I’m going to switch to HostGator. I want to find an even better host. Anyways, you can take my word for it, or you can click the links to my logs of BlueHost server loads (they’re in the format of: <time>, <server load>):

June 2
June 3
June 4
June 5
June 6
June 7
June 8
June 9
June 10
June 11-12
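For anyone who wants to crunch logs like these, here is a rough Python sketch of how the “overloaded X% of the time” figure can be worked out from lines in the <time>, <server load> format. The function name is made up, the exact time format is not assumed (everything before the last comma is ignored), and the threshold of 8 is simply the number of CPUs on the server.

```python
def overloaded_fraction(lines, cpu_count=8):
    """Return the fraction of log samples whose load exceeds cpu_count.

    Each line is expected to look like "<time>, <server load>".
    """
    loads = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so commas in the time do not matter.
        _time, load = line.rsplit(",", 1)
        loads.append(float(load))
    if not loads:
        return 0.0
    overloaded = sum(1 for load in loads if load > cpu_count)
    return overloaded / len(loads)
```

Multiply the result by 100 to get a percentage for a given day’s log.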

Bluehost Review

June 2nd, 2008

This entry is about my experience with BlueHost so far, after 2 months. I can’t say I’m particularly happy with their service. I’ve experienced high server loads at 3 am, due to the MySQL database backup. I’ve experienced high server loads from 10 am to 3 pm due to peak usage. I’m probably going to experience high server loads during the evening as well. So my question is, when can I expect there to be a normal server load? When nobody surfs the web? What’s the point of having web hosting if that’s the case?

If you don’t know what a server load is, it’s a number that roughly represents how many CPUs’ worth of work is running or waiting to run. Your server performs best when that number is less than the total number of CPUs. I’m pretty envious of my co-worker because his server load at HostGator doesn’t ever seem to exceed 5 and his site doesn’t seem bogged down, whereas my server load seems to exceed 8 like ALL THE TIME and my site is constantly laggy.
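As a rough illustration of that rule of thumb, the Python sketch below compares a load figure against the number of CPUs. The function names are mine, and current_status only works on Unix-like systems, where os.getloadavg is available.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Return the load divided by the number of CPUs."""
    return load_average / cpu_count


def describe(load_average, cpu_count):
    """Label a load as "overloaded" when it exceeds the CPU count."""
    return "overloaded" if load_per_cpu(load_average, cpu_count) > 1.0 else "ok"


def current_status():
    """Describe the load of the machine this runs on (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    return describe(one_minute, cpus)
```

For example, describe(100, 8) gives "overloaded", while describe(0.07, 8) gives "ok".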

I’m writing a script that tracks the server load and displays the information graphically. I’m going to use that information to try to get BlueHost to move me to a better server. If after all that, they still don’t do anything about my server load, I’m probably going to move on to a new hosting company.

Another quote I really like

May 18th, 2008

I stumbled across this quote while traversing the web.

Insanity: doing the same thing over and over again and expecting different results. (Albert Einstein)

I love this quote because it reminds us that, sometimes, in order to solve our problems, we have to think outside the box.

Cross-browser Compatibility

April 23rd, 2008

I was working on a project that required cross-browser compatibility, which generally means at least Firefox and IE. The reason is that IE is still the most commonly used browser (IE7, IE6, etc.), with Firefox coming in second and the other browsers splitting up the rest.

I was trying to code the following structure:

<div>
  <div></div><div></div>
</div>

<div>
  <div></div><div></div>
</div>

Except the two innermost divs were floated left, followed by a break which cleared the float. The code rendered perfectly in Firefox and IE7, but it didn’t render correctly in IE6. So I racked my brain on it for a bit, looked up various reasons why IE6 might render the code differently, and eventually found a solution. The solution is to make the div’s position relative. This solution is completely counter-intuitive and, frankly, doesn’t make much sense. So the moral of the story is, sometimes the solution for things can be very dumb, but regardless, it’s the solution.

Dev and Live Environment

April 12th, 2008

Today’s topic will be the importance of having two environments: one to develop your code in, and one to release to the public. Sometimes it is simply easier to edit, run, and test code in the live environment, but when your code deals with data, this becomes more problematic. Imagine some code that inserts data into the database whenever you run it. If you run it in the dev environment, it’s really no big deal, but if you run it in the live environment, it might pollute the database. Your code should account for that case anyway, but it’s hard to say that during your “updates” you wouldn’t accidentally code it so that it starts inserting a bunch of meaningless data into your database. Having two environments allows you to code and test in whichever way you want without having to worry about the consequences.
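One simple way to keep the two apart is to let a setting decide which database the code talks to, and to default to dev so that a forgotten setting can never touch live data. The Python sketch below is only an illustration; the APP_ENV variable and the database names are made up.

```python
import os

# Hypothetical database names, one per environment.
DATABASES = {
    "dev": "dev_database",
    "live": "live_database",
}


def database_for(environ):
    """Pick the database name for the environment named in environ.

    environ is a mapping such as os.environ. When APP_ENV is not set,
    the dev database is used, so the live one has to be asked for.
    """
    env = environ.get("APP_ENV", "dev")
    if env not in DATABASES:
        raise ValueError("unknown environment: " + env)
    return DATABASES[env]
```

A script would call database_for(os.environ) once at startup and only ever hit the live database when APP_ENV is explicitly set to live.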

Migration from GoDaddy to BlueHost

April 11th, 2008

I have just finished moving my site from GoDaddy to BlueHost. This form of migration was the first one I’ve ever done, and it went quite smoothly.

My first tip for anyone migrating from one host to another is to figure out whether you’re transferring the domain, the web hosting, or both.

Domain hosting is simply the reservation of the domain name, such as google.com, yahoo.com, jacksonleung.com, etc., much like an address, or a telephone number.

Web hosting actually contains all the files and databases behind the domain name, much like the company an address points to, or the customer service representatives behind a telephone number.

If you’re simply changing the hosting, like I did in my case, not only will you have to migrate the database and the files, you’ll most likely have to change the nameservers of your domains to the nameservers of the new web hosting provider.

Afterwards, you have to make sure all the data from your databases was copied correctly from one server to the other, and then make sure all your scripts work with the new database environment. You might also want to move your emails from your old web hosting provider to the new one.
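A quick sanity check for the copied data is to compare the row count of every table on the old and new servers. The Python sketch below shows the idea using sqlite3 as a stand-in, since it ships with Python; with MySQL the list of tables would come from a different query, and the function names here are made up.

```python
import sqlite3


def table_counts(conn):
    """Return {table name: row count} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    return {
        table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        for table in tables
    }


def mismatched_tables(old_conn, new_conn):
    """List tables that are missing or differ in row count after a copy."""
    old_counts = table_counts(old_conn)
    new_counts = table_counts(new_conn)
    all_tables = old_counts.keys() | new_counts.keys()
    return sorted(
        table
        for table in all_tables
        if old_counts.get(table) != new_counts.get(table)
    )
```

An empty list means every table exists on both sides with the same number of rows. It does not prove the rows are identical, but it catches a table that was skipped or cut short.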

Although it might be unnecessary, I like to run through my scripts one last time just to make sure everything works, and after all that, you can cancel the domain / web hosting with the previous provider.

Questions and Answers in the Work Environment

March 21st, 2008

Questions need to be asked humbly, patiently, and clearly.

Answers need to be clear, neutral, and precise.