Wednesday, July 22, 2009

Internet-scale Java Web Applications

I am currently working on 2 application architectures. One is a PHP Facebook app (IFrame) with Postgresql in the backend, the other is a Glassfish/Jersey/Toplink/PostgreSql stack.

When reading the glowing web 2.0 tech stories in the news and sites like highscalability it seems like just about everyone requiring a "internet-scale" architecture is using MySQL, many are using stacks in the line of {Phyton,Django|PHP, Zend}{memcached/MySQL} and take advantage of the new offerings of Amazon or Google to push their infrastructure to the cloud ( Microsoft Azure is another big one, Sun has something cooking, and there are many smaller cloud service providers).

I actually also had in the back of my head to go for EC2 in the near future for my apps - thinking of EC2 just of a vserver with more/less power on demand.

However when thinking about it, I was not so sure anymore if the architecture I am using is even ready for the Cloud - and ready to scale.

When hearing the advocates of BigTable, traditional RDBMS are not suited for such endavours. Nowadays all the hype seems about simple data structures, like hashtables, and doing the Joins in a Application Layer. Another approach is to do sharding - divide the database into shards which are exact replicas of each other, direct e.g. usergroup x to shard z, ensuring that they mostly only need data from this shard.

Where do JEE technologies fit in those high-scalability scenarios and why not Postgres - is the transactional db a scalability killer?

Lets examinate my concrete questions for my 2 use cases:

a) MySQL vs. Postgres

The traditional PHP application goes within Apache with mod_php using process forking - so every request is basically a new php process. Very different from the concept of a container. This implies that in regards to data caching there is nothing out of the PHP box. Maybe not so astonishing anymore that PHP does not have connection pooling support - yes it just wouldn't make sense. Quoting Rasumus Lersdorf, Creator of PHP from a 2002 interview:

A pool of connections has to be owned by a single process. Since most people use the Apache Web server, which is a multi-process pre-forking server, there is simply no way that PHP can do this connection pooling …
If/when the common architecture for PHP is a single-process multithreaded Web server, we might consider putting this functionality into PHP itself, but until that day it really doesn’t make much sense. Even Apache 2 is still going to be a multi-process server with each process being able to carry multiple threads. So we could potentially have a pool of connections in each process.

Connections to MySQL MyISAM storage engine are apparently only 4KB and quite cheap. On the other hand Oracle connections

every single connection takes up 5MB in NT4 for Oracle 8i 816

The truth is, that most of the MySQL-PostgresSQL comparisons found on Google are really outdated. Postgres made huge performance increases from in their last version 8 as well as MySQL had significantly improved their transactional INNODB engine. So in terms of performance it more depends on the optimal configuration and design than MySQL vs. Postgres. Both are good databases and after going over an excellend presentation "Scaling with Postgres" by Robert Treat given at the Percona Performance Conference 2009, I feel in good hands using Postgres.

b) Data caching

1) facebook app with PHP/Postgres

How would i cache to improve performance when i see that direct database access is taking too much. Well actually memcache can be used by many database systems, it just happens that a lot of people use it with MySQL, but it also integrates with Postgres and many more.

Besides memcached i am sure there are other distributed caches usable with PHP.

2) Glassfish/Jersey/Toplink/Postgres

Here I am using JEE JPA with Toplink Essentials. The later does not have clustering support - or at least no production quality. The open source Toplink code base EclipseLink 1.0 was last time i looked at it (ca. Jan 2009) a bit unstable.

So I guess I would have to look at other distributed caches. Fortunately the choices here are not too little - hibernate integrates with EHCache, OSCache to name a few. So I guess I do not have to worry too much about distributed caching for my JEE app right now.

c) Physical Infrastructure

My current vServer provider (which i can absolutly not recommend but this is another story.. ) charges about 100 CHF / a month for 1 GB (3GB burstable) RAM, 60GB Hd, 2 GHz Xeon processor. I am already a bit short of RAM at times, so the next bigger package is dedicated which starts 200 CHF / month , or more realistcally a 350 / month for a dual core Xeon and 4 GB of RAM.

From the Amazon Website:

"As an example, a medium sized website database might be 100 GB in size and expect to average 100 I/Os per second over the course of a month. This would translate to $10 per month in storage costs (100 GB x $0.10/month), and approximately $26 per month in request costs (~2.6 million seconds/month x 100 I/O per second * $0.10 per million I/O)."

Given my app is no video/media sharing the scenario would be a small instance always on, and moderate Elastic Block Storage (EBS) requirements for the data storage. This gives me a rough estimate using their handy calculator:
  • Small (1.7 GB RAM,..) Linux Instance (always on 1 month, 36$ EBS costs): 118 $
  • Large (7.5 GB RAM,..) Linux Instance (always on 1 month, 36$ EBS costs): 363 $
The instance types of EC2 also include high performance CPU instances and different OS. For me right now something between a small and a large instance would be ideal, just in terms of RAM. (I mean just for a single glassfish instance the recommended memory allocation is 1 GB ..). Maybe having 2 small instances would be the best solution in my case.

So overall I guess I will go with EC2. There are a bunch of articles, questions and comparisons out there that list all the pro and cons between dedicated servers, cloud providers, and vservers. Fact is that Amazon has been a leader in the Cloud space and improved their services constantly. Also the usablility with the Management Console has increased significantly.

d) Impacts of EC2 on Application Architecture / Clustering

On the WebServer / DNS tier EC2 offers Elastic Load Balancing. This is 1 public static ip adress per AWS account. The ip adresses of the instances will change upon reboot, but their only private so don't have to worry about this. Furthermore the elastic ip feature implies a load balancing included for you to distribute load to the instances.

One problem with EC2 though is in the application tier because is there's no multicast - makes sense when you think about the potential network flood it would possible generate. This s a problem, because most of the applications/frameworks/application servers usually rely on multicast for their clustering solutions - in order to the discovery of other service instances

I found a nice article on a Terracotta architecture solving this problem. Terracotta provides clustering and caching for Java objects by instrumenting the Java byte-codes and doing things like (pre)fetching content or updating copies. They do this via TCP/IP and therefore enable clustering and distributed caches that do not rely on multicast. What's really cool is that they went recently OSS and you can download their software for free!

How does Terracotta work?

A few interesting quotes from their forum:

Every application node is connected to the Terracotta Server Cluster via a TCP connection. There is no multicast. Terracotta is very efficient over the network. Because it intercepts field-level changes, only the changes to your objects are sent across the wire. In addition, objects do not live everywhere, so Terracotta only sends changes where objects are resident. In the case where you have a well partitioned application, this means that on average, your changes will only be copied to the Terracotta Server Cluster, and not to all of the application nodes (because they don't need a copy of objects they do not have a reference to in Heap)


Just because one has 1000 clients running the same application doesn't mean all data is everywhere. One of the features of Terracotta is that it has a virtual heap. Objects are only where they need to be when they need to be there. Some users do have large numbers of clients and it works quite well. Scale is more of a question of concurrency and rate of change than number of clients.

The Terracotta server uses an efficient mechanism to send changes using Java NIO under the covers to achieve high scalability.

There are integrations with several App Servers, among them Glassfish. Yes!


Summary

Without further ado, my takeaways to this rather long post are:

Postgres does not per se underperform MySQL
Memecache can be used with Postgres
Do not use Persistent DB Connections in PHP ever
EC2 will fit my bill for infrastructure/hosting
Terracotta will be a good candidate for clustering in a EC2 environment without multicast
Hibernate with EHCache, JBoss Cache, OSCache is your distributed cache replacement for Toplink Essentials

No comments:

Post a Comment