Twitter performance DB vs Filesystem
If you're new here, you may want to subscribe to my RSS feed. Thanks for visiting!
So much has been said about the twitter performance issue I feel I have to follow up some of the points that have been raised.
One comment especially had me ranting at my screen “I am still stunned by this suggestion for a site which is getting 11000 hits per second and even more discouraged by the number of people who think using files and filesystem in the manner described is a viable architecture option”. I am afraid this comments comes from what I classed in a previous post as ‘a generic programmer’.
Firstly most databases sit on the backbone of a filesystem and rely upon the filesystem for the lowest level of caching. This then at a lower level relies upon some form of FAT and below that the hard drives own small (but important) cache.
A database is just a layer of information organisation. And a filesystem is just another form of information organisation but is specialised for the storage of chunks of data within a hierarchy of folders. This does mean it is supremely efficient at retrieval of information when the information has a fixed known location.
Just to take the twitter example further. If you took the name of your user as a directory and then created a file based upon the current date.
E.g. /storage/_Nick_/10-03-2007.txt
This file because of the nature of twitter would only ever need to be appended too, which filesystems are extremely quick at. To create your timeline you just have /storage/timeline/10-03-2007.txt which you also append too. This would become very large very quickly and make it in efficient to perform read request upon, so you would split it further down to /storage/timeline/{date}/{hour}/{minute}.txt
This is just a overly simplistic example and not something that is possible to use in practice (without much more work.)
Other concepts/ideas
- Memory shared module to perform higher level caching above the file system.
- Try out different file systems (ext, ext2, ReiserFS) as they all have different performance profiles for types of usage, especially where it comes to number of files within a single directory before you get degradation.
- Remove the filesystem completely from the equation and write your own driver that interacts with the hard drive directly. This (and I have done this.) can give amazing performance as you can write your own custom FAT that is equivilient to a database index.
Now before I get millions of flames (too late..) I wanted to just show that there is life beyond SQL, DB2 etc. And hopefully to show that developers should keep an open mind about data storage mechanisms.
Original Articles
- Twitter, Rails, Hammers, and 11,000 Nails per Second
- Twitter Trouble
- 5 Question Interview with Twitter Developer Alex Payne
Comment by Kirit on 21 April 2007:
This reads as if its from somebody who has no clue. I don’t think that is quite the case, but the language is sloppy in the extreme.
By FAT do you mean the old MS-DOS file format? That’s what it sounds like, but I assume that must be wrong.
I think you know the difference between a cache and a data store, but it isn’t apparent from “databases sit on the backbone of a filesystem and rely upon the filesystem for the lowest level of caching”. It isn’t really clear what you mean by “lowest level” here either.
When you say filesystems are “supremely efficient at retrieval of information when the information has a fixed known location” you already know that a file path isn’t a “fixed known location”, so it isn’t clear how you think the filesystem’s indexes would compare against a proper RDBMS’s.
You admit that your ad-hoc filesystem database is “not something that is possible to use in practice (without much more work.)” So what is the nature of this work? Does it pay off more quickly than scaling to more RDBMSs?
Now of course an RDBMS isn’t appropriate for every situation, but a filesystem is not optimised for the sort of abuse you’re talking about either. If you take a look at web systems that have successfully used the filesystem for persistent store you will notice that they deal with it in a completely different way, generally running everything in memory and writing to disk infrequently and reading infrequently - this seems to be the opposite usage pattern you’re proposing.
As a test write a program that tries to read and write to a hundred files at the same time and another that does hundreds of reads and writes to an RDBMS (you’ll need to fork a hundred processes or write a multi-threaded app to make it fair). I think I know which my money is on.
Comment by nick on 21 April 2007:
I think you misssed the point. Your test program shows this. Change the test to this.
Have a database with 10 million rows indexed on date and user name. Then have a file system that stores the same data using /storage/{date}/{username}.
Then randomly access 50 million random records. Which one is faster? Well for an answer I will give another example.
When websites with lots of images need to improve performance what do they do? they move those images out to a separate web server which does not use a database but a file system.
I was not proposing anything, I was trying to open up the subject for discussion. And obviously you require caching above the file system. The point I was trying to make was that any database is just an abstraction layer (read joel on software he loves abstraction layers) and a file system is a abstraction layer, but both can be used when applied to the correct problem.
BTW.. FAT stands for File Allocation Table (yes it was invented for use within MS-DOS) but I used the term as most people associate it with a way to organise files on a disk using some sort of allocation table.
Comment by JulesLt on 21 April 2007:
It read to me more like someone with a significant clue than without - and as I said in my comments on Jens blog the important thing is making informed decisions. I’m just wary of what seems to be a recent ‘database backlash’ - to my mind it seems to stem from OO developers who see a database as ‘persistence’ (hence frequent idea of just storing objects as XML).
All that said - most RDBMS abstract a lot of the data-cacheing - static lookup tables will almost certainly be in memory, and table updates are rarely implemented directly (although obviously for integrity reasons transactions have to be written to disk somewhere before a transaction returns a valid commit).
Twitter, on the other hand, is very read heavy, low concurrent update and with (I would imagine) a very high spread of data i.e. lots of users wanting a very small amount of the total data - which I would imagine would remove a lot of the advantage of memory caching (chances of item you want being in cache would be low).
Referring to your response - let’s say I have 10 million rows stored on file system by /storage/date/username. Now, I know which approach would make it easier for me to change my app to allow display all posts by username. Which doesn’t mean it’s the right answer in all cases - as I said on Jens blog, the key thing is for architects to make informed decisions. I’d say a filesystem approach is only recommended when you’re absolutely sure requirements won’t change.
(But if your code is properly abstracted it shouldn’t be too difficult to move when required)>
Comment by nick on 21 April 2007:
I agree that there does seem to be a slight trend, but I would be the last to cast doubts on good use of RDBMS when I see how well they cope under extreme conditions. I think I will have to follow up this article with one on database abuse. I know of several sites that range in the 50-100 queries per page request (ouch) (nothing I worked on!)
Exactly my point, thanks for the comment Jules.
Comment by L.B.J on 22 April 2007:
So let’s see. It is being proposed that we replace a proper RDBMS (because Rails suffers from the “It’s only a hash in the sky” syndrome) with a conglomeration of files. Except that you seem to overlook the fact that all those files have to be looked up by the system. How does the system do that? It has to look through F.A.T or equivalent on DOS/NTFS (it isn’t much better) systems or a more sophisticated approach in UN*X filesystems. What we have here is an index !! which is looking through files which are essentially surrogates for records in relational parlance.
Once a file has been located, then the process can access information by accessing the disk directly (through system calls of course) but the speed of information retrieval is not being hindered by the lookup through the FAT every time new information needs to be written or read.
In a highly concurrent system you will quickly overwhelm the file system because it is not designed for this. Keep in mind also that every file lookup and eventual access involves the security overhead which is absent in a monolithic DBF type file.
DBMS manage all this work for you and they are more efficient than filesystem level access for every record. This is not some secret magic that everyone in the field has missed, and which web programmers have discovered.
Once you have a file object open (that is what the RDBMS storage is) the lookup and ACID business is taken over by the RDBMS engine, which is specialized just for this purpose.
Instead of using the specialized engine which manages indexes for separate tables, you are proposing that we re-invent the wheel and use a horribly more inefficient method by bottlenecking through the ONE index (the FAT page or what have you) instead of using an RDBMS which gives us Multiple indexes per table, fast lookups, fast writes, concurrency and ACID properties out of the box.
How would you report on this information?
how would you implement concurrency? Locking anyone?
And forget about re-organizing this information EVERRR!
You will end up writing a crappy DBMS on top of primitives which have proven insufficient for high speed data access in the first place, hence the rise of the RDBMS. Perhaps you should pick up a book on RDBMS theory and refresh your knowledge a bit. Anecdotal “evidence” from a couple of PHP programmers writting todo note applications for mom-n-pop business is not evidence enough that filesystems ought to be used even in small projects let alone something the size and scale of Twitter.
The problem with Twitter is that they are using a braindead “framework” to do the messaging part. They should use Rails only for the front-end and use something different for the message passing/storage/retrieval part. Using files to store this info would only further the simplistic and regressive thinking which seems to have gone into their architecture decisions in the first place.
P.S. The problem is not “caching� , filesystem or otherwise, but the speed of read/write operations in a concurrent scenario while maintaining some level of ACID-ity as it were.
Comment by nick on 22 April 2007:
I love it when people are constructive. Another ‘generic programmer’ who fails to A) read the post B) understand the world beyond your simple understanding of a RDBM. C) When you have actually worked on building a database engine from scratch that serves hundreds of thousands of people a day at 10x the performance of a generic ‘RDBM’ you can come back and discuss the merits with me one to one, until then feel free to input into my spam folder.I apologise, was in rant mode. I agree, with most points, but I need to follow up with something more detailed.
P.S. Formatting fixed
Comment by Kirit on 22 April 2007:
What do you mean by ‘generic programmer’? It sounds like an insult. So far it seems to be used against people you disagree with you as if name calling somehow strengthens your argument.
I hope I’m wrong.
Comment by nick on 22 April 2007:
Nope probably not wrong, I wrote that off the back off deleting several other comments of ’stupid idiot’ and L.B.J bore the brunt. I actually agree with nearly all of his points. But the fact is the topic is amazingly complex, I was trying to make a generic point using generic terms and hoping people would make the mental leap that I was trying to get them to make.
Maybe if I get time I will write up a detailed seperate article on the whole subject of the underlying infrastructure and relationships between the various bottlenecks and even dig out some old code (scary idea.) to post.
Comment by L.B.J on 22 April 2007:
Nick, I wasn’t attacking you when I mentioned the PHP programmers. That comment was referring to the anecdotal evidence proffered at Jens’ blog post. Some people were offering their minimal experience as proof that it can be scaled to a larger, more concurrent architecture like Twitter’s.
I have no reason to doubt that you have written a fast database system. I guess I didn’t understand how you were proposing to use the filesystems as they exist to-day. I did read your post and the crux of your argument seems to be that a purely filesystem-based alternative for a Twitter-like site is possible. I am still at odds trying to visualize how this can be done in a way which will compete or better an properly architected RDBMS system. I do think that although FS and DBMS are abstractions for storage, their purposes are different IMO.
My understanding is that FS is optimized for locating abstractions called files, and applying systemwide security and access rules everytime an FS call is made. DB’s are optimized for working on top of this abstraction and generally do away with having to make the same kind of security checks and open/close calls that the file-systems have to deal with. Along with that, the DBMS’es also take care of maintaining a read-ahead capability for a high cache hit ratio, as you well know from your own experience of writing one. Filesystems offer none of this, even the new journaled filesystems are not as advanced as the DBMS’es and sort of borrow some of the techniques from the DB world.
I don’t know Twitter architecture well enough to make any extrapolations on what kind of load a filesystem based approach would generate, but my thinking is that using the indexing structures offered by a filesystem, along with the security overhead and the open/close calls for each file accessed, it would be quite large. How much larger than using an rDBMS? I can’t quantify, but my hunch is that you will get the “hot page” problem quite quickly as i/o system requests line up to access the large numbers of files created by such a methodology, all looking through the same ‘index’ or ‘F.A.T’ or what have you. The file allocation maps could be cached in-memory, but at some point they have to be written out to disk, as corruption of such critical data would slay the beast in one fell swoop if it were to happen. Add to that all the separate files which need to be open/closed every time about 10,000 or so twitterrific clients decide to update status. Each client update request would generate an locate()/open()/read()/close() cycle and seems to me they would generate hard faults for actual writes. This would become even more challenging when the number of total clients looking to update their status reaches an even larger threshold like 100k or 1000k??
Plan9 is the only system that I know of which uses the file I/O paradigm for all sorts of communications, but I don’t know how a mainstream filesystem would handel this as opposed to plan9 which is built from the ground up using the file i/o metaphor for all it’s internal/external/interprocess communication needs.
BTW, I am not a ‘programmer’ per se (generic or otherwise, though I’ve done my share of it, and written toy compilers and rdbms systems during my school days), as my focus is on enterprise databases and large datasets, but I do deal with developer/programmers on a daily basis, and do have to deal with rather simple assumptions made by them about the systems they are programming for.
I would be very interested in reading your follow up on a proposed filesystem based architecture which would take care of issues like concurrency, ACID’ity, reorganization and reporting, all the while offering graceful recovery in case of partial or complete system failure. As I think that it is premature to assume the information people are submitting to twitter is really not ‘that important’ (as some were suggesting on Jens’ blog)
Cheers. (and sorry for the long comment)
Comment by admin on 22 April 2007:
Thanks for the recent comment, and I thank you for your words, I was in a really bad mood after getting some really stupid comments (not yours!) and I decided to write something then and there (a stupid idea, but lesson learnt). And when I went back and re-read yours I realised I had made the mistake I was complaining about of not reading what people say.
I agree with all your points regarding RDBM’s vs Filesystems, and I took too simple approach to trying to making an example. Of course to have any level of performance you need some form of caching extraction layer above the filesystem. Or an abstraction layer that works on top of a partition (something I have done.). Neither are a good result in 98% of cases where RDBM as you state scale better, have better logging better everything. But when you need pure outright performance (and dont have lots of money to waste on hardware) you can take the option of ignoring the ‘norm’ and with a very localised performance aim in mind (which the spec is unlikely to change, rare i know.) then I think you can justify not using SQL etc..
Twitter was a bad example but I used it because it was the current topic of the day (and we all want traffic) and some of the twitter concepts work well as append only (most POSIX filesystems work well append only with locking.) but I know it was taking the concept a bit far.
Anyway, just finished adding a Digg API block to the blog (for fun) will post about the fun getting it to work on nasty old PHP 4 (my vps only supports 4)
Thanks for taking the time to read my rants anyway
Comment by L.B.J on 24 April 2007:
No Worries Nick, check out Blaine’s slides on Twitter’s architecture, seems a bit sparse, but gives a better idea of what they’re dealing with.
http://www.slideshare.net/Blaine/scaling-twitter
cheers,
-L
Comment by Something to read on 28 April 2007:
C.J. Date’s new book seems interesting. Also, see his interview
http://www.oreillynet.com/pub/a/network/2005/07/29/cjdate.html