There is a trick to building fast realtime analytics, regardless of the platform. I've done this with Microsoft Analysis Services, but you can use similar techniques with other platforms as well.
The trick is to have a leading partition that can be populated with near-realtime data, and a historical partition (or partitions) optimised for fast queries. If you keep the leading partition small enough, it will be quick to query as well.
To manage this, your ETL process populates the leading partition, and you build a supplementary process that periodically converts trailing partitions to the query-optimised format. The exact nature of this process will vary with your platform.
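The supplementary conversion process usually amounts to a scheduled job that picks out leading-format partitions old enough to convert. A minimal sketch of that selection logic, assuming daily partitions and invented partition names and formats:

```python
from datetime import date, timedelta

def partitions_to_convert(partitions, today, keep_days=1):
    """Given (name, partition_date, format) tuples, return the names of
    realtime-format partitions old enough to be converted to the
    query-optimised format. keep_days is how long a partition stays
    'leading' before conversion."""
    cutoff = today - timedelta(days=keep_days)
    return [name for name, pdate, fmt in partitions
            if fmt == "realtime" and pdate <= cutoff]

# Example: everything older than yesterday gets converted.
parts = [
    ("p_2024_01_01", date(2024, 1, 1), "optimised"),
    ("p_2024_01_02", date(2024, 1, 2), "realtime"),
    ("p_2024_01_03", date(2024, 1, 3), "realtime"),
]
print(partitions_to_convert(parts, today=date(2024, 1, 3)))
# ['p_2024_01_02']
```

The actual conversion step (re-processing the partition as MOLAP, building bitmap indexes, etc.) is platform-specific and would be invoked for each returned name.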
On MS Analysis Services, the leading partition is implemented as a ROLAP partition that reads directly off the table, while trailing partitions are converted to MOLAP with aggregates. Other OLAP systems work similarly. On Oracle you can create bitmap indexes and materialised view partitions on your trailing partitions to speed up queries. Some other systems have this type of feature as well, although I'm not aware of MySQL supporting it.
At a guess, I'd say the cheapest mainstream platform that will do this is MS Analysis Services, which is only available bundled with SQL Server and cannot be purchased separately. For partitioning on 2008 R2 you will need Enterprise Edition of SQL Server, which runs to about £22,000 per CPU socket in the UK and a bit less on the other side of the pond. Microsoft are shipping a new 'Business Intelligence' edition of SQL Server with 2012; once that hits RTM, the B.I. edition does support partitioned cubes and is substantially cheaper than Enterprise Edition. Depending on your budget and time constraints you may be able to use that instead.
Another aspect of the problem you will have to tackle is change data capture (CDC) - efficiently identifying and pushing new or changed rows into the ETL process. Most DBMS vendors' CDC features only work with their own databases, so if you want a cross-platform CDC solution you may have to go to a third-party application or triggers on the source.
Various third parties punt CDC applications that will migrate changes across database platforms. A list of CDC products can be seen on the Wikipedia entry on the subject. Note that you may still have integration issues: for example, IBM InfoSphere CDC can only trigger external processes on a per-row basis, rather than per batch, which could cause efficiency problems on large data volumes.
You can create a set of triggers on the source tables that push changes out into a staging area. This requires sufficient access to the source database, so it may not be an option on vendor-supported databases.
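As a minimal illustration of the trigger approach, here is a sketch using SQLite from Python (the table, column, and trigger names are invented; a production system would add delete capture and a consumer that drains the staging table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
-- staging area: one row per captured change, tagged with the operation
CREATE TABLE orders_staging (id INTEGER, amount REAL, op TEXT,
                             captured_at TEXT DEFAULT CURRENT_TIMESTAMP);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_staging (id, amount, op)
    VALUES (NEW.id, NEW.amount, 'I');
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_staging (id, amount, op)
    VALUES (NEW.id, NEW.amount, 'U');
END;
""")

conn.execute("INSERT INTO orders (id, amount) VALUES (1, 9.99)")
conn.execute("UPDATE orders SET amount = 19.99 WHERE id = 1")
rows = conn.execute("SELECT id, amount, op FROM orders_staging").fetchall()
print(rows)  # [(1, 9.99, 'I'), (1, 19.99, 'U')]
```

The ETL process then reads and clears orders_staging on its own schedule, rather than touching the source table directly.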
If the data comes from a file (for example, a web server log), you will have to write a client-side process that monitors the tail of the file for new records.
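A tail-monitoring process boils down to remembering a byte offset and reading only complete lines appended since the last check. A minimal sketch (the function name is mine, and real code would also handle log rotation):

```python
def read_new_records(path, offset):
    """Return (new_lines, new_offset): complete lines appended to the
    file since the remembered byte offset. A trailing partial line is
    left for the next call, in case a record is mid-write."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    complete, sep, _partial = data.rpartition(b"\n")
    if not sep:
        return [], offset          # no complete new line yet
    lines = (complete + sep).decode().splitlines()
    return lines, offset + len(complete) + 1
```

The caller polls this in a loop, persisting the offset so the process can resume after a restart without re-reading the whole file.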
It is quite likely you will end up having to implement a pull process that polls the data sources. In that case, work out the tolerable latency and write your process so that it detects changes efficiently enough to be run at that frequency. There is an old saying, sometimes heard in embedded systems circles, to the effect of: 'You know they're getting serious about reliability when they start polling.'
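The usual shape of such a pull process is a high-water mark: each cycle fetches only rows whose version (or timestamp) exceeds the mark, then advances it. A sketch with an invented in-memory source standing in for the real database query:

```python
# Simulated source: rows stamped with a monotonically increasing version.
source = [(1, "row-a"), (2, "row-b"), (3, "row-c")]

def fetch_since(hwm):
    """Stand-in for 'SELECT ... WHERE version > :hwm' on the source."""
    rows = [r for r in source if r[0] > hwm]
    new_hwm = max((r[0] for r in rows), default=hwm)
    return rows, new_hwm

processed = []

def poll_once(hwm):
    """One polling cycle: pull rows changed since the high-water mark,
    push them into the ETL (here, just a list), advance the mark."""
    rows, new_hwm = fetch_since(hwm)
    processed.extend(rows)
    return new_hwm

hwm = poll_once(0)            # first cycle picks up all three rows
source.append((4, "row-d"))
hwm = poll_once(hwm)          # next cycle picks up only the new row
print(hwm, [r[1] for r in processed])
# 4 ['row-a', 'row-b', 'row-c', 'row-d']
```

In production the loop would sleep between cycles for something under the tolerable latency, and the mark would be persisted so a restart does not re-process or miss rows.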
If you're on Windows and have DCD (dead connection detection) enabled, this could be relevant to your case. There is a note on Oracle Support about a poor implementation of DCD on Windows leaving behind a bunch of shadow processes. Without going into details, as I don't think I can republish that info here, the solution was to disable DCD. So keep a monitor on the number of shadow processes and disable DCD in sqlnet.ora. It is fairly quick and painless to test whether that is the cause.
Argh! I wanted to open a chat panel to ask you some more questions, but actually inviting doesn't work unless you are already chatting. Oh well.
You have indicated that these are java strings, which means that there are two problems.
Hopefully the SQL should be fairly recognizable. It exists in a string, and SQL follows a fairly rigid format. Your already-aliased items will follow a simple format too: they will be in columnname AS aliasname form. Skip those, and only add aliases for column names not in that format. If you load in the DB schema, you even get all the valid column names to look for. Finally, if there is an ambiguous item, write some context to stdout and ask the user for input.

I would suggest using Python and the PLY library. You get a full-blown lex/yacc combo, with an easier language than C. Open a new question on SO if you want more help/or to scream at me. ;)