Monday, February 04, 2013

Hadoop


So this is how fb is doing it:
http://www.wired.com/wiredenterprise/2013/02/facebook-data-team/



http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop/



http://www.cloudera.com/content/cloudera/en/resources.html

I don't understand any of this yet, but I had been wondering for several years how fb (and Google and others) was managing all their data.

We ran into this kind of problem at Cray with the Unix file system (the same holds for Linux).
In the Unix file system there is a top-down inverted tree structure, i.e., everything starts from the "root" and descends.
As your file system gets larger, more and more requests land on the root (call it the root directory, the root inode, whatever).
The directories (folders) directly below the root also get lots of requests, mostly for reading but occasionally for writing.
Cray got around this by duplicating these oft-used directories. But you had to make sure most requests were reads: on a write, you had to freeze all the copies while every one of them got updated.
So Cray delayed the inevitable problem rather than solving it.
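The trick above can be sketched in a few lines. This is my own toy illustration, not Cray's code: reads fan out across several copies of a hot directory, while a write freezes the whole set until every copy is consistent again.

```python
import threading

class ReplicatedDirectory:
    """Toy sketch of the duplicated-directory idea: serve reads from any
    replica, freeze all replicas on a write. All names here are mine."""

    def __init__(self, n_replicas=4):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.lock = threading.Lock()  # the "freeze" taken for writes
        self.next = 0

    def lookup(self, name):
        # Reads can go to any replica; spread them round-robin.
        # (A real implementation would also block in-flight reads
        # during a freeze; this sketch leans on a single write lock.)
        replica = self.replicas[self.next % len(self.replicas)]
        self.next += 1
        return replica.get(name)

    def update(self, name, inode):
        # A write freezes the set until every copy has been rewritten.
        with self.lock:
            for replica in self.replicas:
                replica[name] = inode

d = ReplicatedDirectory()
d.update("passwd", 1234)
print(d.lookup("passwd"))  # 1234
```

The payoff is that four replicas can absorb roughly four times the read traffic on the root, which is exactly why it only delays the problem: writes still touch every copy.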

When I was at Sybase we had similar issues with our databases, but the idea was to make sure the database stayed consistent, and if that meant freezing out some requests, so be it, or at least making them wait while you updated everything. Oracle had the same problem, not that their salesmen would know.
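That "make them wait" policy looks something like the sketch below (my own minimal illustration, not Sybase's code): readers normally proceed, but during a multi-row update they block until every row is consistent again.

```python
import threading

class ConsistentTable:
    """Minimal sketch of consistency-by-freezing: a single lock makes
    readers wait out any in-progress update. Names are illustrative."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:  # waits here if an update is in progress
            return self._rows.get(key)

    def update_many(self, changes):
        with self._lock:  # freeze readers until all rows are written
            for key, value in changes.items():
                self._rows[key] = value

t = ConsistentTable()
t.update_many({"balance:alice": 100, "balance:bob": 50})
print(t.read("balance:alice"))  # 100
```

No reader can ever see Alice's balance updated but Bob's not: the whole batch lands, or the reader waits.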

The problem with relational databases like Sybase, Oracle, and MySQL was that they required a design to meet a specific need. Naturally, someone would figure out some query for which the database was not designed and then complain when it was slow. Siebel, which ran on top of Oracle, was a giant data structure, and their pitch was that they could jam any data you had into it and it would work (so they claimed). It ran slowly, very slowly, and it took up gobs and gobs of hardware. I lost my job at Fannie Mae because of this. (I kept asking for some kind of turnaround times, and upper management, which had hired Siebel people to oversee the project, didn't want to hear it, so they got rid of me. It ran so slowly that they couldn't use it and ended up shelving it.)

Several years ago I heard Facebook was using some sort of variant of MySQL. I couldn't understand how that could solve the massive data problem they had to be facing. A relational database (RDBMS) would have the problems of any relational database, and a file system would have the problem we ran into at Cray.

The standard solutions are: more and faster hardware; duplicate and disperse; put critical data in memory (i.e., an in-core solution) for fast access; and offload as much I/O as possible to other CPUs.
It seems that fb has gone with an in-core solution and somehow distributed it. They could distribute it by local interest: I would guess most people are interested in people who live near them (like kids going to college, who tend to pick schools close to home).
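Distributing by local interest might look like the sketch below. Everything here is my own guesswork, the region names and routing rule included: each user's data lives in an in-memory shard picked by region, so if most of your friends are nearby, most lookups stay on one nearby machine.

```python
# Hypothetical sketch of "distribute by local interest": in-memory
# shards keyed by region. Region names and routing are my invention.
shards = {"us-east": {}, "us-west": {}, "europe": {}}

def home_shard(user_region):
    # Route each user to the shard for their region; fall back to a default.
    return shards.get(user_region, shards["us-east"])

def store(user, region, profile):
    home_shard(region)[user] = profile

def fetch(user, region):
    # If most friends live nearby, most reads hit this one shard.
    return home_shard(region).get(user)

store("alice", "us-east", {"friends": ["bob"]})
print(fetch("alice", "us-east"))  # {'friends': ['bob']}
```

The weakness is the same as any locality scheme: a user whose friends are scattered across regions forces cross-shard lookups, which is presumably where the hard engineering lives.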

Hadoop seems to be the solution. What exactly it is I don't know yet. What I find interesting is that all the big-data companies are using it.
So even though these companies may compete against one another, their problems are so similar that the brains behind what they are doing are shared, and so is the code.



