How to do it. How not to do it.
Others have pointed out (codemonkeyism.com/dark-side-nosql/) that NoSQL databases (such as Hadoop HBase, Cassandra, Hypertable, Bigtable) might not succeed because they lack supporting tools, reporting tools in particular.
So let us look at the options available for generating reports from a NoSQL database. I want to draw on my experience with object databases (the products being ObjectStore, Versant, Objectivity, Caché), which also follow the NoSQL approach and have storage concepts similar to those of the new NoSQL databases: navigational access, easy distribution, and great scalability over many nodes.
I see three general paths to solving the problem:
- Write a program that queries the database and exports the results in a format suitable for reporting, e.g. CSV files or relational database tables.
- Provide SQL access to the database (SQL for NoSQL databases).
- Develop a reporting system that supports non-relational data structures.
All three approaches have been tried before with the commercial object databases.
So let's examine what to expect when you follow each of those paths.
1. Export. This is the approach pursued by projects that need reporting urgently. It is not elegant and rather expensive, but you get your reports. Basically you maintain two databases: one (NoSQL) integrated with your application, with a rich and flexible data model, and another, a relational data warehouse, used for reporting and analytical processing. Apart from the problem of data duplication, this approach is a pain in the lower back if your database is big and you make use of the scalability of your NoSQL database, or if the structures you store are complex and heterogeneous. Maintaining the ETL processes is a major headache.
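To make the export path concrete, here is a minimal Java sketch of the flattening step. It assumes a hypothetical invoice structure with nested items; the class names, field names and CSV columns are my own illustration and do not come from any particular database or tool. It shows where the duplication comes from: the invoice fields are repeated on every item row.

```java
import java.util.List;

// Hypothetical sketch: all class, field and column names are illustrative assumptions.
final class ExportSketch {
    // Nested records, as a NoSQL store might hold them: the items live inside the invoice.
    record Item(String product, int quantity, double unitPrice) {}
    record Invoice(String id, String customer, List<Item> items) {}

    // Flattens each invoice into one CSV row per item, repeating the invoice fields.
    // Assumes values contain no commas or line breaks, so no quoting is done.
    static String toCsv(List<Invoice> invoices) {
        StringBuilder csv = new StringBuilder("invoice_id,customer,product,quantity,unit_price\n");
        for (Invoice invoice : invoices) {
            for (Item item : invoice.items()) {
                csv.append(invoice.id()).append(',')
                   .append(invoice.customer()).append(',')
                   .append(item.product()).append(',')
                   .append(item.quantity()).append(',')
                   .append(item.unitPrice()).append('\n');
            }
        }
        return csv.toString();
    }
}
```

Every new nested structure or optional field in the application model needs another mapping like this one, which is a large part of why the ETL side becomes expensive to maintain.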
2. SQL access to the database. The object database vendors tried this, and they all failed miserably. It is just not possible to present the data as a simple table automatically. A relational database keeps all data of one type in one place (the table for that type), whereas a NoSQL database keeps data clustered by access patterns. This means that the data equivalent to one table may be distributed over many nodes. (E.g. an invoice is stored together with its invoice items, the order and other related information on one node, and the information for another invoice and its items on another node.) Creating a table of invoice items means accessing many, many nodes, extracting the invoice item data, and centralizing all that information in one table. And in many cases you would then join that table with the same invoice data you just extracted it from. If you want this to perform efficiently, you need to cache the extracted table, which is more or less the approach of path no. 1.
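The cost of building such a table can be sketched in a few lines of Java. This is only an in-memory model under my own assumptions: the `Node`, `Invoice` and `Item` types are hypothetical, and a real cluster would involve network calls instead of list iteration. The point is that the item "table" exists nowhere as such, so every node has to be visited to assemble it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a cluster; all names are illustrative assumptions.
final class ScatterGatherSketch {
    record Item(String invoiceId, String product, int quantity) {}
    record Invoice(String id, List<Item> items) {}
    // A node stores whole invoices, each clustered together with its items.
    record Node(String name, List<Invoice> invoices) {}
    // The centralized result, plus how many nodes had to be touched to build it.
    record ItemTable(List<Item> rows, int nodesVisited) {}

    // Builds the equivalent of a relational "invoice_item" table by visiting every node.
    static ItemTable buildItemTable(List<Node> cluster) {
        List<Item> rows = new ArrayList<>();
        int visited = 0;
        for (Node node : cluster) {
            visited++; // in a real system, each visit is a remote request
            for (Invoice invoice : node.invoices()) {
                rows.addAll(invoice.items());
            }
        }
        return new ItemTable(rows, visited);
    }
}
```

Joining the resulting rows back to their invoices would then reassemble exactly the grouping that each node already stored.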
3. A reporting tool that supports native access to NoSQL databases. This is the approach we used with ReportsAnywhere, and of course we believe it to be the best. ReportsAnywhere uses a navigational access model, which is also used by many NoSQL databases, so there is no impedance mismatch.
There is one other tool, ReportMill, that has this capability, but only for access via Java objects. Why are there so few? Would it not be possible to adapt another open source reporting tool like BIRT or JFreeReport to a NoSQL database?
I believe not. Experience with ReportsAnywhere has shown that the internal data representation is completely different from that of a relational tool.
Adapting such a tool would really mean duplicating the object-relational impedance mismatch for reporting. You would have a rich data model in the application, a rich data model in the database and then a limited flat data model for reporting. That would be nonsense.
However, if your Java application can access the database, then you already have all you need to create reports. ReportsAnywhere uses your Java business classes as the access layer for the database, and it can use all your methods to compute values for additional fields.
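The idea of reporting straight from business classes can be illustrated with a small Java sketch. This is not the ReportsAnywhere API, which I am not reproducing here; the `render` method, the column map and the sample classes are hypothetical names chosen for illustration. Each report column is simply a function that navigates from an invoice to the value it needs, including values computed by business methods such as `total()`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of navigational reporting; not the API of any real tool.
final class NavigationalReportSketch {
    record Customer(String name) {}
    record Item(String product, int quantity, double unitPrice) {
        double lineTotal() { return quantity * unitPrice; } // computed business value
    }
    record Invoice(String id, Customer customer, List<Item> items) {
        double total() { return items.stream().mapToDouble(Item::lineTotal).sum(); }
    }

    // Example column definitions: each one navigates the object graph from an invoice.
    static Map<String, Function<Invoice, Object>> sampleColumns() {
        Map<String, Function<Invoice, Object>> columns = new LinkedHashMap<>();
        columns.put("invoice", Invoice::id);
        columns.put("customer", invoice -> invoice.customer().name());
        columns.put("total", Invoice::total);
        return columns;
    }

    // Renders a header line plus one line per invoice, evaluating every column function.
    static List<String> render(List<Invoice> invoices, Map<String, Function<Invoice, Object>> columns) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join(" | ", columns.keySet()));
        for (Invoice invoice : invoices) {
            List<String> cells = new ArrayList<>();
            for (Function<Invoice, Object> column : columns.values()) {
                cells.add(String.valueOf(column.apply(invoice)));
            }
            lines.add(String.join(" | ", cells));
        }
        return lines;
    }
}
```

No flat intermediate table is involved: the report reads the same objects the application works with, so a nested structure or a computed field costs one more column function rather than another export mapping.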
Therefore, a reporting tool that uses the native access library of the NoSQL database is the best option.