Archive for June, 2008

Yesterday Google, Today Facebook

Thursday, June 26th, 2008

So yesterday we dived into my theory of how Google works, but today we get to see the inner workings of how Facebook serves a picture. I came across a very interesting presentation. It shows what kind of technology Facebook uses: how they customize their own kernels and file systems, and how they use CDNs (Content Distribution Networks), caching, and so on to improve speed. For starters, I’ll have to explain how the internet works before I even get around to explaining one of the roles CDNs play.

“The internet is a bunch of interconnecting tubes.” Although this doesn’t fully do the internet justice, I can see how it might make sense to some people. The internet is really a network of interconnected computers: thousands and thousands of them connected to each other all across the world. The interaction between computers generally consists of a client talking to a server. The further the server is from the client, the more computers the data has to pass through to arrive at the client.

A CDN is a network of computers that are well distributed across the region(s) it serves. These distributed computers cache, or save, the information that is frequently requested and act as servers for that information. This keeps a client computer from having to wait for the data to come all the way from that faraway origin server. Obviously, there are benefits besides speed, such as sparing the system that generated the content from having to regenerate the same information over and over.
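If it helps, here’s a toy sketch of what a CDN edge node does (Python, with made-up function names; a real CDN is obviously far more involved): answer from local memory when it can, and only go the long way to the origin server on a miss.

    import time

    # Hypothetical origin fetch; in reality this is a request across many network hops.
    def fetch_from_origin(url):
        time.sleep(0.5)                       # simulate the long round trip to a distant server
        return "<bytes of %s>" % url

    class EdgeCache:
        """A toy CDN edge node: answer from local memory when possible."""
        def __init__(self):
            self._store = {}

        def get(self, url):
            if url in self._store:            # cache hit: no trip to the origin
                return self._store[url]
            content = fetch_from_origin(url)  # cache miss: go the long way once
            self._store[url] = content        # remember it for the next nearby client
            return content

    edge = EdgeCache()
    edge.get("http://example.com/logo.png")   # slow: first request travels to the origin
    edge.get("http://example.com/logo.png")   # fast: served from the nearby edge node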

So in Facebook’s image-serving system, according to the lecture, every picture gets cached at three levels: once at the CDN level, once in memcache, and once by MySQL (although later on the lecturer says two). In this case the most important reason for caching is to avoid disk reads and MySQL queries. If the request matches something in the cache, the server simply returns the information it has stored in memory, bypassing any disk reads or MySQL queries. If the requested information isn’t in the cache, the server has to perform a disk read or a MySQL query, which on a heavily trafficked system can be the difference between a split second and 5 seconds. When the information isn’t cached, the server hits the “NetApp”, which I visualize as a massive central database, to request the file’s location. That location is then used to retrieve the requested file, which gets sent back to the user through the pipeline again, but is also written to the cache along the way.
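My rough mental model of that lookup order, as a sketch (this is my interpretation of the lecture, not Facebook’s actual code; the function and key names are mine):

    # Sketch of the lookup order described in the lecture: check the in-memory cache first,
    # fall back to the file-location lookup and a disk read only on a miss.
    # cache, locate_file, and read_from_disk are hypothetical stand-ins, not Facebook's APIs.

    cache = {}  # stands in for memcache: key -> image bytes held in memory

    def locate_file(photo_id):
        """Pretend lookup against the central storage system for the file's location."""
        return "/photos/%d.jpg" % photo_id

    def read_from_disk(path):
        """Pretend disk read; on a busy system this is the expensive step."""
        return b"...image bytes..."

    def serve_photo(photo_id):
        key = "photo:%d" % photo_id
        if key in cache:                 # hit: return straight from memory,
            return cache[key]            # no disk read, no MySQL query
        path = locate_file(photo_id)     # miss: ask where the file lives
        data = read_from_disk(path)      # then pay for the disk read
        cache[key] = data                # and cache it on the way back out
        return data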

Their cache uses a most-accessed-last-out policy. What this means is that the more an image gets accessed, the longer it stays in the cache, which simply makes sense.
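That sounds to me like a least-recently-used eviction policy, where every access refreshes an item’s spot in line. A tiny sketch of the idea (my interpretation, not their implementation):

    from collections import OrderedDict

    class LRUCache:
        """Evict whatever has gone unaccessed the longest."""
        def __init__(self, capacity):
            self.capacity = capacity
            self._items = OrderedDict()

        def get(self, key):
            if key not in self._items:
                return None
            self._items.move_to_end(key)          # every access moves the item back to "recent"
            return self._items[key]

        def put(self, key, value):
            self._items[key] = value
            self._items.move_to_end(key)
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)   # drop the least recently accessed item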

The lecture also goes into how they created their own file system and kernel, and the reasons why they needed to.

It was a very interesting lecture, and I recommend my readers check it out.

http://static.flowgram.com/p2.html#2qi3k8eicrfgkv

Google

Wednesday, June 25th, 2008

Ever wonder how Google’s spell check, related-topic suggestions, or ranking works? I do.

My theory (yes, theory; I doubt anyone but the two founders truly knows the secret to how Google works) is that Google collects information on how users behave: the clicks on a link, the number of sites pointing to a link, and so on. It uses this information to statistically guess at what the user truly wants, based on data that correlate with the user’s behavior.
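To show the shape of what I mean, here’s a toy ranking that scores pages by observed behavior; the data and weights are completely made up, and the real signals are surely far more numerous:

    # Toy ranking by observed behavior: pages that get clicked more for a query and have
    # more sites pointing at them float to the top. Data and weights are entirely made up.
    pages = {
        "example.com/a": {"clicks_for_query": 120, "inbound_links": 3000},
        "example.com/b": {"clicks_for_query": 45,  "inbound_links": 9000},
        "example.com/c": {"clicks_for_query": 300, "inbound_links": 150},
    }

    def score(stats, click_weight=1.0, link_weight=0.05):
        return click_weight * stats["clicks_for_query"] + link_weight * stats["inbound_links"]

    ranked = sorted(pages, key=lambda url: score(pages[url]), reverse=True)
    print(ranked)  # pages ordered by how strongly users' behavior "votes" for them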

In the case of a misspelling, instead of performing some sort of Levenshtein-distance check to find the best candidate for the word, Google can simply collect data on what users typed after the typo and suggest the word that has been most frequently typed in response to it.
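In other words, the fix falls out of counting. A toy sketch of that idea, with entirely made-up query-log data:

    from collections import Counter, defaultdict

    # Pairs of (what a user typed, what the same user typed next), as if pulled from
    # query logs. Entirely made-up data, just to show the counting trick.
    query_pairs = [
        ("recieve", "receive"), ("recieve", "receive"), ("recieve", "recipes"),
        ("teh", "the"), ("teh", "the"), ("teh", "tech"),
    ]

    followups = defaultdict(Counter)
    for typo, next_query in query_pairs:
        followups[typo][next_query] += 1

    def did_you_mean(query):
        """Suggest whatever users most often typed right after this query."""
        if query in followups:
            suggestion, _count = followups[query].most_common(1)[0]
            return suggestion
        return None

    print(did_you_mean("recieve"))  # -> "receive", with no edit-distance math at all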

Topic suggestions probably work the same way as misspellings: Google looks at behavioral similarities to suggest content that you might like.

My point is that in this process, Google probably doesn’t need to have the slightest idea of what’s in the content it’s displaying on its pages.

My belief is that Google probably did content parsing initially to figure out how to sort the content and seed their database, and once the database was well seeded, they collected user behavior and used that to rank relevance.

I came across an article that seems to back this theory of mine, so I decided to post about it today:

http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

I completely disagree with the author that we can throw the scientific method out the window now that we have so much data, but I did appreciate the possible insight into how Google does things.

Now hosted with DownTownHost

Thursday, June 19th, 2008

I just moved my server to DownTownHost. What a WORLD of difference. BlueHost = 8 CPUs with an average server load of 100. DownTownHost = 8 CPUs with an average server load of 0.07.

Also, I signed up for the $4.95-per-month plan, and it offers all the features I want and need. After the 25% discount using the code below, it’s $3.71 per month. That’s even better than GoDaddy. For now, I am very happy with the service.

Anyways, if you want to check out the host yourself, use this link: DownTownHost

Oh btw, use the code “happy2008” to qualify for a 25% discount.

BlueHost = LackOfDecentHostBluesHost

Monday, June 16th, 2008

So I’ve been with BlueHost for 2-3 months now. I can say with certainty that as soon as I find a decent host, I’m moving again. I monitored BlueHost for 10 days, from June 2 – June 12, and the logs show that BlueHost is pretty much overloaded all the time. When I called their tech support, they gave me a lame excuse like “It’s not unusual for servers to be overloaded during peak hours for 5-10 minutes at a time in a shared hosting environment.” Since the server load had coincidentally gone down by the time I was done waiting for the representative to pick up, I had no choice but to wait for it to go back up before I called. Unfortunately, the minute I hung up, the server load spiked again. That’s when I decided to log the server loads. The logs show that the server is overloaded 50-90% of the time on an average day. I think that is simply unacceptable. If my web page takes FOREVER to load, I no longer consider it hosted. I think it’s okay if they’re overselling, as long as they keep the load under control. I’ve heard HostGator does a decent job of this, but that doesn’t mean I’m going to switch to HostGator; I want to find an even better host. Anyways, you can take my word for it, or you can click the links to my logs of BlueHost server loads (they’re in the format <time>, <server load>; there’s a sketch after the list of how a log like this can be tallied into an overload percentage):

June 2
June 3
June 4
June 5
June 6
June 7
June 8
June 9
June 10
June 11-12
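As a sketch of how a log in that format could be tallied (the file name is just a placeholder; the threshold is 8 because the box has 8 CPUs):

    # Tally a "<time>, <server load>" log: how often did the load exceed the CPU count?
    CPUS = 8  # the BlueHost box has 8 CPUs, so anything above 8 means overloaded

    overloaded = total = 0
    with open("bluehost-june2.log") as log:   # placeholder file name for one day's log
        for line in log:
            if not line.strip():
                continue
            _timestamp, load = line.rsplit(",", 1)
            total += 1
            if float(load) > CPUS:
                overloaded += 1

    print("overloaded %.0f%% of the time" % (100.0 * overloaded / total))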

Bluehost Review

Monday, June 2nd, 2008

This entry is about my experience with BlueHost so far, after 2 months. I can’t say I’m particularly happy with their service. I’ve experienced high server loads at 3 am due to MySQL database backups. I’ve experienced high server loads from 10 am to 3 pm due to peak usage. I’m probably going to experience high server loads during the evening as well. So my question is, when can I expect a normal server load? When nobody surfs the web? What’s the point of having web hosting if that’s the case?

If you don’t know what server load is, it’s a number that roughly represents how many CPUs’ worth of work the machine is being asked to do. Your server performs best when that number is less than the total number of CPUs. I’m pretty envious of my co-worker, because his server load at HostGator doesn’t ever seem to exceed 5 and his site doesn’t seem bogged down, whereas my server load seems to exceed 8 like ALL THE TIME and my site is constantly laggy.
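For anyone who wants to see the number themselves on a Linux or similar box, it’s the same load average that uptime reports; a quick check in Python:

    import os

    load_1min, load_5min, load_15min = os.getloadavg()  # the same numbers `uptime` shows
    cpus = os.cpu_count() or 1

    print("1-minute load: %.2f across %d CPUs" % (load_1min, cpus))
    if load_1min > cpus:
        print("more work queued than the CPUs can keep up with -- expect lag")
    else:
        print("load is within what the CPUs can handle")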

I’m writing a script that tracks the server load and displays the information graphically. I’m going to use that information to try to get BlueHost to move me to a better server. If, after all that, they still don’t do anything about my server load, I’m probably going to move on to a new hosting company.
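A minimal version of such a tracker might look something like this (a sketch, not my final script; the output file name and the one-minute interval are arbitrary choices):

    import os
    import time

    LOG_PATH = "serverload.csv"  # arbitrary output file; graph it later with anything

    # Append one "<time>, <server load>" line per minute.
    while True:
        load_1min, _, _ = os.getloadavg()
        with open(LOG_PATH, "a") as f:
            f.write("%s, %.2f\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), load_1min))
        time.sleep(60)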