Hi readers,
As of today I will continue blogging at uebersoftware.com . Why? I needed some additional functionality and customization that Blogspot could not offer, so I now use my own hosted blog system - yes, it's WordPress - which has all the options I need. Posts and comments have been transferred, so please visit me on the new site for any new content.
Regards Ben
Monday, August 17, 2009
Wednesday, July 22, 2009
Internet-scale Java Web Applications
I am currently working on two application architectures. One is a PHP Facebook app (IFrame) with PostgreSQL in the backend, the other is a Glassfish/Jersey/TopLink/PostgreSQL stack.
Reading the glowing Web 2.0 tech stories in the news and on sites like High Scalability, it seems like just about everyone requiring an "internet-scale" architecture is using MySQL. Many use stacks along the lines of {Python/Django | PHP/Zend} + {memcached/MySQL} and take advantage of the new offerings from Amazon or Google to push their infrastructure to the cloud (Microsoft Azure is another big one, Sun has something cooking, and there are many smaller cloud service providers).
I actually also had it in the back of my head to go for EC2 in the near future for my apps - thinking of EC2 as just a vServer with more or less power on demand.
However, thinking about it more, I was no longer sure whether the architecture I am using is even ready for the cloud - and ready to scale.
Listening to the advocates of BigTable, traditional RDBMSs are not suited for such endeavours. Nowadays all the hype seems to be about simple data structures, like hashtables, with the joins done in an application layer. Another approach is sharding - divide the database into shards that share the same schema but hold disjoint slices of the data, and route e.g. user group x to shard z, ensuring that its requests mostly only need data from that shard.
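The routing idea behind sharding can be sketched in a few lines. This is only an illustration - the shard DSNs and the "user group" key below are made-up placeholders, and a real router would also handle re-sharding and failover:

```python
import zlib

# Placeholder shard connection strings, not real servers.
SHARDS = [
    "postgresql://db0.example.com/app",
    "postgresql://db1.example.com/app",
    "postgresql://db2.example.com/app",
]

def shard_for(user_group: str) -> str:
    """Deterministically route a user group to one shard.

    A stable hash keeps the same group on the same shard across requests
    and processes; Python's built-in hash() is not stable across runs,
    so crc32 is used instead.
    """
    index = zlib.crc32(user_group.encode("utf-8")) % len(SHARDS)
    return SHARDS[index]

# The same group always lands on the same shard:
assert shard_for("usergroup-x") == shard_for("usergroup-x")
```

The point is that the routing decision is cheap and happens in the application layer, so each database only ever sees its own slice of the traffic.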
Where do JEE technologies fit into those high-scalability scenarios, and why not Postgres - is a transactional database a scalability killer?
Let's examine my concrete questions for my two use cases:
a) MySQL vs. Postgres
The traditional PHP application runs within Apache with mod_php using process forking - so every request is basically a new PHP process. Very different from the concept of a container. This implies that in regards to data caching there is nothing out of the PHP box. Maybe it is not so astonishing anymore that PHP does not have connection pooling support - it just wouldn't make sense. Quoting Rasmus Lerdorf, creator of PHP, from a 2002 interview:
A pool of connections has to be owned by a single process. Since most people use the Apache Web server, which is a multi-process pre-forking server, there is simply no way that PHP can do this connection pooling …
If/when the common architecture for PHP is a single-process multithreaded Web server, we might consider putting this functionality into PHP itself, but until that day it really doesn’t make much sense. Even Apache 2 is still going to be a multi-process server with each process being able to carry multiple threads. So we could potentially have a pool of connections in each process.
Connections to the MySQL MyISAM storage engine apparently take only about 4 KB and are quite cheap. Oracle connections, on the other hand:
every single connection takes up 5MB in NT4 for Oracle 8i 816
The truth is that most of the MySQL-PostgreSQL comparisons found on Google are really outdated. Postgres made huge performance gains in its version 8 releases, and MySQL has significantly improved its transactional InnoDB engine. So in terms of performance it depends more on optimal configuration and design than on MySQL vs. Postgres. Both are good databases, and after going over an excellent presentation, "Scaling with Postgres" by Robert Treat, given at the Percona Performance Conference 2009, I feel in good hands using Postgres.
b) Data caching
1) facebook app with PHP/Postgres
How would I cache to improve performance when I see that direct database access is taking too long? Well, actually memcached can be used with many database systems; it just happens that a lot of people use it with MySQL, but it also integrates with Postgres and many more.
Besides memcached, I am sure there are other distributed caches usable with PHP.
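The usual way memcached sits in front of Postgres is the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch - here a plain dict stands in for memcached and a stub function stands in for the Postgres query, both purely for illustration:

```python
import time

cache = {}          # stands in for a memcached client
CACHE_TTL = 60      # entry lifetime in seconds

def query_database(user_id):
    # Placeholder for a real Postgres query (e.g. via a DB driver).
    return {"id": user_id, "name": "user-%d" % user_id}

def get_user(user_id):
    """Cache-aside read: try the cache first, then the database."""
    entry = cache.get(user_id)
    if entry is not None:
        value, stored_at = entry
        if time.time() - stored_at < CACHE_TTL:
            return value                    # cache hit
    value = query_database(user_id)         # cache miss: hit the database
    cache[user_id] = (value, time.time())   # populate the cache
    return value

get_user(42)    # first call: miss, goes to the "database"
get_user(42)    # second call: hit, served from the cache
```

Nothing here is MySQL-specific - the database behind `query_database` could just as well be Postgres, which is exactly the point.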
2) Glassfish/Jersey/Toplink/Postgres
Here I am using JEE JPA with TopLink Essentials. The latter does not have clustering support - or at least none of production quality. The open-source TopLink code base, EclipseLink 1.0, was a bit unstable the last time I looked at it (ca. Jan 2009).
So I guess I would have to look at other distributed caches. Fortunately there is no shortage of choices here - Hibernate integrates with EHCache and OSCache, to name a few. So I guess I do not have to worry too much about distributed caching for my JEE app right now.
c) Physical Infrastructure
My current vServer provider (which I absolutely cannot recommend, but that is another story...) charges about 100 CHF a month for 1 GB (3 GB burstable) RAM, a 60 GB HD and a 2 GHz Xeon processor. I am already a bit short of RAM at times, and the next bigger package is a dedicated server, which starts at 200 CHF/month - or, more realistically, 350 CHF/month for a dual-core Xeon and 4 GB of RAM.
From the Amazon Website:
"As an example, a medium sized website database might be 100 GB in size and expect to average 100 I/Os per second over the course of a month. This would translate to $10 per month in storage costs (100 GB x $0.10/month), and approximately $26 per month in request costs (~2.6 million seconds/month x 100 I/O per second * $0.10 per million I/O)."
Given my app is no video/media sharing site, the scenario would be a small instance, always on, with moderate Elastic Block Storage (EBS) requirements for the data storage. This gives me a rough estimate using their handy calculator:
- Small (1.7 GB RAM, ..) Linux instance (always on for 1 month, $36 EBS costs): $118
- Large (7.5 GB RAM, ..) Linux instance (always on for 1 month, $36 EBS costs): $363
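Amazon's example arithmetic from the quote above is easy to verify:

```python
# Reproduce the EBS cost arithmetic from Amazon's example:
# 100 GB of storage, averaging 100 I/Os per second over a month.
storage_gb = 100
storage_price_per_gb_month = 0.10
seconds_per_month = 30 * 24 * 3600      # ~2.6 million seconds
ios_per_second = 100
price_per_million_ios = 0.10

storage_cost = storage_gb * storage_price_per_gb_month
request_cost = seconds_per_month * ios_per_second / 1e6 * price_per_million_ios

print(round(storage_cost, 2))   # 10.0  -> "$10 per month in storage costs"
print(round(request_cost, 2))   # 25.92 -> "approximately $26 per month"
```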
So overall I guess I will go with EC2. There are a bunch of articles, questions and comparisons out there listing all the pros and cons of dedicated servers, cloud providers and vServers. Fact is that Amazon has been a leader in the cloud space and has constantly improved its services. The usability of the Management Console has also increased significantly.
d) Impacts of EC2 on Application Architecture / Clustering
On the web server / DNS tier, EC2 offers Elastic IP addresses - one public static IP address per AWS account. The IP addresses of the instances change upon reboot, but since those are private addresses you don't have to worry about this. On top of that, Elastic Load Balancing is included to distribute load across the instances for you.
One problem with EC2, though, is in the application tier: there is no multicast - which makes sense when you think about the potential network flood it could generate. This is a problem because most applications/frameworks/application servers rely on multicast for their clustering solutions, namely for the discovery of other service instances.
I found a nice article on a Terracotta architecture solving this problem. Terracotta provides clustering and caching for Java objects by instrumenting the Java bytecode and doing things like (pre)fetching content or updating copies. It does this via TCP/IP and therefore enables clustering and distributed caches that do not rely on multicast. What's really cool is that they recently went open source, so you can download their software for free!
How does Terracotta work?
A few interesting quotes from their forum:
Every application node is connected to the Terracotta Server Cluster via a TCP connection. There is no multicast. Terracotta is very efficient over the network. Because it intercepts field-level changes, only the changes to your objects are sent across the wire. In addition, objects do not live everywhere, so Terracotta only sends changes where objects are resident. In the case where you have a well partitioned application, this means that on average, your changes will only be copied to the Terracotta Server Cluster, and not to all of the application nodes (because they don't need a copy of objects they do not have a reference to in Heap)
Just because one has 1000 clients running the same application doesn't mean all data is everywhere. One of the features of Terracotta is that it has a virtual heap. Objects are only where they need to be when they need to be there. Some users do have large numbers of clients and it works quite well. Scale is more of a question of concurrency and rate of change than number of clients.
The Terracotta server uses an efficient mechanism to send changes using Java NIO under the covers to achieve high scalability.
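The field-level-change idea from the quotes above can be illustrated with a toy diff: instead of shipping the whole object over the wire, ship only the fields that changed. This is just a conceptual sketch, not Terracotta's actual wire protocol:

```python
def field_delta(old: dict, new: dict) -> dict:
    """Return only the fields whose values changed between two snapshots."""
    return {k: v for k, v in new.items() if old.get(k) != v}

# Two snapshots of the same object: only "score" changed.
before = {"name": "Ben", "score": 10, "city": "Zurich"}
after  = {"name": "Ben", "score": 11, "city": "Zurich"}

# Only the changed field would cross the wire, not the whole object:
print(field_delta(before, after))   # {'score': 11}
```

That is why a well-partitioned application sends so little data: each update is a small delta, and it only goes to the Terracotta server plus the nodes that actually hold a copy of the object.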
There are integrations with several App Servers, among them Glassfish. Yes!
Summary
Without further ado, my takeaways from this rather long post are:
- Postgres does not per se underperform MySQL
- memcached can be used with Postgres
- Do not use persistent DB connections in PHP, ever
- EC2 will fit my bill for infrastructure/hosting
- Terracotta is a good candidate for clustering in an EC2 environment without multicast
- Hibernate with EHCache, JBoss Cache or OSCache is your distributed-cache replacement for TopLink Essentials
Saturday, July 18, 2009
FB Series: Integrating JS
Where are we coming from
JavaScript is one of the most popular languages for web programming. Although it has been around since 1995, when it came out of Netscape, it was not until the advent of Ajax and Web 2.0 that JavaScript came into the spotlight and attracted more professional programming attention.
The open-source movement of JavaScript frameworks started in 2005 with Prototype and script.aculo.us. Since then, programming appealing JavaScript-based websites has become so much easier. Nowadays it is literally possible to mash up widgets with little to no JavaScript knowledge and set up a stunning page. I would, however, still recommend that someone who does not know JavaScript first learn the basics and maybe keep a JS reference & DOM reference handy.
Looking at the core functionality of today's frameworks, what do we have?
- DOM Traversal with CSS selectors to locate elements
- DOM Modification: Create/remove/modify elements
- Short syntax for adding/removing Events
- Short syntax for Ajax Request
- Animations (hide/show/toggle) & Transition
- User Interface Widgets (Drag & drop, Tree, Grid, Datepicker, Modal Dialog, Menu / Toolbar, Slider, Tabbed Pane)
- Wide Browser Support
Choosing a JS Framework
On the net you will find a myriad of comparisons of popular frameworks. The most popular open-source and free JS frameworks available today are:
* Prototype (&Scriptaculous for UI)
* Dojo
* jQuery
* YUI
* Mootools
I would say that all of those have their advantages and drawbacks and there's not really an obvious winner. It really depends on what you use it for. Some helpful links that helped me choosing are:
Wikipedia - to get a feeling on the features
unbiased functional comparison
Stackoverflow comparing JQuery Dojo and more
Performance comparison
Choosing a framework, as mentioned, depends a bit on how you want to use JavaScript:
Plug-and-Play:
- Drop in a “calendar widget” or “tabbed navigation”
- Little, to no, JavaScript experience required.
- Just customize some options and go.
- No flexibility.
Some Assembly Required
- Write common utilities
- Click a link, load a page via Ajax
- Build a dynamic menu
- Creating interactive forms
- Use pre-made code to distance yourself from browser bugs.
- Flexible, until you hit a browser bug.
Down-and-Dirty
- Write all JavaScript code from scratch
- Deal, directly, with browser bugs
- Quirksmode is your lifeline
- Excessively flexible, to the point of hindrance.
Of course you can mix different approaches, as you can also mix frameworks, but in terms of maintainability and productivity it would be better if you don't have to.
For my personal use case I had the following requirements:
- Easy to learn with minimal intuitive syntax
- Lightweight solution
- Availability of some widgets: Datepicker, Grid, maybe more
- Appealing Web 2.0 effects
- Production quality
For development, I guess it's essential to still have good JavaScript support in your IDE and a JavaScript debugger - I use Firebug and NetBeans for this. If you are into Eclipse, I strongly suggest taking a look at Aptana; their IDE is really great for JavaScript and PHP, but unfortunately not optimal for Facebook development.
As a last hint: consider using Google AJAX Libraries to speed up JS loading for your clients. However, do not use it in local development - it just does not work reliably there (loads too late, etc.).
FB Series: Choosing a PHP IDE
What are the contenders?
Netbeans 6.7 with PHP support
Aptana 1.5 (standalone or Eclipse plugin)
Zend Studio 6.1.2 (not free ;( )
My experiences:
- Code completion/parsing in NetBeans is better than in Aptana (Aptana also crashes / gets into an endless loop at points when it cannot parse or code-assist a file, resulting in Java heap exhaustion / CPU up to 50%)
- Aptana's browser and debugger integration is better. I like the tab-away browser, and the xdebug debugger works more reliably than in NetBeans - however, it still has its moments when it just does not attach
- Zend Studio: the setup did not work for me; I couldn't get the remote debugger working. Script-only debugging works.
FB Series: Java Guy starting with FB development
I recently started with PHP development. Although I have had some small exposure to the
language before, this was my first real project.
What are the big differences to the Java (web) development experience?
- interpreted (no compiler to check for errors, just a parser)
- dynamically typed (->no "parse-time" type checking in the IDE ..)
- essentially procedural; object handling was completely rewritten from version 4 to 5 and now offers basic OO functionality
- hundreds of globally callable built-in functions; e.g. it is very easy to download & store a file: $contents = file_get_contents($url)
- Very "close" to the HTTP request-response model
- the script execution is determined by the HTTP request (e.g. check ignore_user_abort )
- MySQL and Postgres integration modules for persistence
- Caching, Pooling is different due to the fact that a PHP process in its pure form only lives during one Web request:
There are 3 ways PHP can run in a web server:
- as a CGI wrapper: an instance of the PHP interpreter is created and destroyed for every page request
- as a module (e.g. mod_php, mod_fastcgi) in a multi-process web server like Apache. Apache/FastCGI can reuse PHP instances
- as a plugin to the web server
This means that connection pooling is not really possible.
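The effect of the process-per-request model on pooling can be sketched with a toy model: a long-lived process amortizes the connection cost over many requests, while a process that dies after every request pays it each time. The counts below come from this toy model, not from measurements:

```python
class FakeConnection:
    """Stands in for a real DB connection; counts how often one is opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

def serve_requests_per_process(n):
    # mod_php/CGI style: each request runs in a fresh process, so every
    # request opens (and then throws away) its own connection.
    for _ in range(n):
        conn = FakeConnection()   # open on every request
        del conn                  # process exits, connection is gone

def serve_requests_pooled(n):
    # Container style: one long-lived process keeps a pooled connection.
    pool = [FakeConnection()]     # opened exactly once
    for _ in range(n):
        conn = pool[0]            # reused on every request

FakeConnection.opened = 0
serve_requests_per_process(100)
print(FakeConnection.opened)      # 100 opens

FakeConnection.opened = 0
serve_requests_pooled(100)
print(FakeConnection.opened)      # 1 open
```

With cheap MySQL/MyISAM connections the per-request model is tolerable; with a 5 MB Oracle connection it clearly is not - which is exactly Rasmus's point quoted earlier.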
Caching:
- page scope: not necessary in PHP. All variables within includes are accessible
- request scope: not supported natively, but can be implemented via the session
- session scope: see PHP sessions. Data is stored in a text file in php/tmp or similar. Data must be serializable (it is not possible to serialize PHP built-in objects!)
- application scope: does not exist
- Generally faster code-test iterations.
- Quick infrastructure setup: Apache, php, Apache-php integration (e.g. mod_php), Editor.
- More difficult to manage complex/larger projects because:
- interpreted & dynamically typed: errors only found at runtime
- Exception handling difficult
- OO-shortcomings
- ...
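The session-scope caching mentioned above hinges on serialization: PHP writes session data as serialized text to a file under a temp directory, which is why non-serializable values cannot live there. A rough Python analogue using JSON files - the file naming and key names are made up for illustration:

```python
import json
import os
import tempfile

def session_save(session_id, data, directory=None):
    """Serialize session data to a text file, PHP-session style."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, "sess_%s.json" % session_id)
    with open(path, "w") as f:
        json.dump(data, f)       # raises TypeError for non-serializable objects
    return path

def session_load(session_id, directory=None):
    """Read the session data back from its file."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, "sess_%s.json" % session_id)
    with open(path) as f:
        return json.load(f)

session_save("abc123", {"user_id": 42, "cart": ["book", "pen"]})
print(session_load("abc123"))    # {'user_id': 42, 'cart': ['book', 'pen']}
```

Just as in PHP, anything you put in the session must survive the round-trip through text - file handles, open connections and the like cannot.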
Friday, July 10, 2009
FB Series: Locally debug your PHP Facebook App
Starting with Facebook applications is a rather bumpy road. First there is the scattered and often not up-to-date documentation on the Facebook wiki. It's a pain to go through. But OK - once you manage to get the important bits, it's fine.
What I am really talking about is testing. It's quite astonishing that this is not thought through at all: Facebook does not offer a test environment(!). This means developers have to test their apps in the real Facebook, which means they have to create two apps - one for test, one for prod - with different configurations, canvas URLs etc., as well as create test users - unless they want to annoy their friends with test apps.
If you're like me, you prefer debugging to change-log-upload-test cycles. With Facebook apps this was not straightforward, so here is an easy end-to-end explanation of how to set up your PHP Facebook app for local debugging - which I didn't find on the net.
Prerequisites: you have Apache and PHP 5.x installed.
- Set up your computer for port forwarding. portforward.com has good explanations for all different kinds of configurations and routers.
- Get yourself an alias from dyndns.org, something like myfacebookapp.dyndns.org, pointing to your IP. To keep a dynamic IP up to date, install the DynDNS updater.
- Try accessing your Apache file via your DynDNS alias (myfacebookapp.dyndns.org). Keep in mind that many routers do not support IP loopback, so you might need to go through a proxy to do that. Also make sure to configure your firewall to open the ports you're forwarding.
- If this all worked, you are ready to set up your PHP environment. Start up (or install) your PHP IDE of choice with integrated xdebug support (like NetBeans 6.7 or Aptana 1.5).
- You need to configure your php.ini file to support xdebug. At the end of your php.ini file add:
zend_extension_ts="c:/php/ext/php_xdebug-2.0.5-5.2.dll"
xdebug.remote_enable=on
xdebug.remote_handler=dbgp
xdebug.remote_mode=req
xdebug.remote_host=localhost
xdebug.remote_port=9000
As you can see, I use Windows - change the path to the xdebug .dll/.so as needed. You might need to download it from xdebug.org. Now you should be ready to start a debug session of a file locally. Test it.
- In order to debug HTTP sessions it is important that you have the parameter XDEBUG_SESSION_START= (for NetBeans it's XDEBUG_SESSION_START=netbeans-xdebug) in your request URL. This gets a bit tricky for Facebook apps, since Facebook calls the canvas URL, which I think cannot contain query parameters (I guess...). Anyway, no need to try or puzzle it out - there is a nice Firefox plugin that does that for us.
That's it: Fire up a debugging session in your ide, add a breakpoint in your facebook app and try access your facebook app from facebook. You should end up in your local debugging session.