Big Data is a term used to describe an ecosystem where huge amounts of unstructured data are handled, typically to meet the analytics/reporting needs of organizations. We have had our own share of excitement and problems while implementing such a solution in our IT shop.
A couple of years back, we had a data handling system where large log files were broken into meaningful data domain objects and stored in an RDBMS. Access to the RDBMS was then given to various user-facing applications – through a web service layer.
While this architecture met our real-time requirements very well, the various user groups that rely on analytics and reporting were hugely disappointed. We had to move data from this OLTP layer to a separate OLAP layer and then expose it to batch users. This created huge data-availability delays (running into days), and the data provided was also incomplete, since we were not handling the entire log data.
And thus came the idea of a Big Data implementation using Hadoop, HDFS, Hive, HBase, Solr, Cassandra and other peripheral tools. The batch users were extremely happy: they had access to a much wider range of data, and results reached them in minutes or hours instead of days.
Wait a minute – did I mention RDBMS in the new architecture? Actually, NO!! That was mistake #1.
So the real-time use cases were now supposed to be met with typical NoSQL data stores. We underestimated the complexities involved. It is just not possible to seamlessly replace your DAO and ORM layers with the new architecture components. Even after we rewrote the entire data access code, there was still the problem of linking this log data with other enterprise data that lives in an RDBMS.
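To see why that linking is painful, here is a minimal sketch, assuming a hypothetical setup: parsed log events sit in a NoSQL store (a plain dict keyed by row key stands in for something like an HBase table), while customer reference data sits in an RDBMS (sqlite3 stands in). With no cross-store SQL JOIN available, every lookup has to be coded by hand in the application layer. All table, key, and field names below are invented for illustration.

```python
import sqlite3

# Enterprise reference data lives in an RDBMS (sqlite3 stands in for it here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("c1", "Acme Corp"), ("c2", "Globex")])

# Parsed log events live in the NoSQL store; a dict keyed by row key stands in
# for something like an HBase table. All names here are hypothetical.
log_events = {
    "c1|2014-01-01T10:00": {"customer_id": "c1", "action": "login"},
    "c2|2014-01-01T10:05": {"customer_id": "c2", "action": "upload"},
}

def enrich_events(events, db):
    """Join NoSQL events with RDBMS reference data in application code --
    there is no cross-store JOIN, so every lookup is our responsibility."""
    enriched = []
    for row_key, event in sorted(events.items()):
        row = db.execute("SELECT name FROM customers WHERE customer_id = ?",
                         (event["customer_id"],)).fetchone()
        enriched.append({**event, "row_key": row_key,
                         "customer_name": row[0] if row else None})
    return enriched

result = enrich_events(log_events, conn)
```

One hand-written lookup is manageable; an ORM gives you hundreds of them for free, and that is exactly what disappears when the RDBMS is removed from the picture.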
Consider the two scenarios below:
- Scenario #1 – You have a stranded car (weighing about 1,000 kg) and the need is to push it uphill for 100 metres
- Solution #1.1 – A human asked to do the job would probably take 30 minutes to complete it
- Solution #1.2 – An elephant could push that car uphill in 10 minutes
There is clearly a huge advantage in asking the elephant to do this task. It is analogous to our batch/near-real-time use cases.
- Scenario #2 – You have a laptop (weighing about 1 kg) and the task is to take it uphill for 100 metres.
- Solution #2.1 – A human asked to do the job would probably take 1 minute to complete it.
- Solution #2.2 – An elephant could carry that light weight uphill to the destination in 5 minutes
I hope you get the point here. The elephant gives you incredible benefits in certain situations… not in all of them.
But now that we had implemented this solution and our real-time use cases were failing miserably, we went ahead and made mistake #2.
The elephant was concealed behind a leopard skin, roller skates were strapped to its feet, and a mechanical motor was attached to its rear. You see, the intent was to get the best out of what we had already invested in. In big data terms, this translates to:
- Creating more NoSQL tables
- Changing row keys to match the real-time use cases
- Maintaining more than one copy of enterprise data
- And various other patch jobs
These short-term fixes resulted in an architecture with too many moving parts and a lot of data duplication. Data quality was a casualty as well. Big Data implementations are designed to include a lot of redundancy/duplication in the data so that processing can be done in massively parallel jobs. But this increases the footprint of the data and makes it more denormalized. Real-time use cases are difficult to implement in such situations.
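The row-key and duplication patch jobs above can be sketched in a few lines, again with purely hypothetical names: the same event is written twice, under two different row-key schemes, so that the batch scan and the per-customer real-time lookup each get a cheap prefix scan – at the cost of a doubled footprint and a write path that can leave the copies inconsistent.

```python
# Two copies of the same data, keyed for two different access patterns
# (dicts stand in for NoSQL tables; all names are hypothetical).
batch_table = {}     # row key: timestamp|customer -> good for time-range batch scans
realtime_table = {}  # row key: customer|timestamp -> good for per-customer lookups

def store_event(customer_id, ts, payload):
    # Every write now fans out to two copies; a crash between the two puts
    # leaves them inconsistent -- one way data quality becomes a casualty.
    batch_table[f"{ts}|{customer_id}"] = payload
    realtime_table[f"{customer_id}|{ts}"] = payload

store_event("c1", "2014-01-01T10:00", {"action": "login"})
store_event("c1", "2014-01-02T09:30", {"action": "upload"})
store_event("c2", "2014-01-01T11:15", {"action": "login"})

# The "real-time" lookup is a prefix scan on the second copy.
c1_events = {k: v for k, v in realtime_table.items() if k.startswith("c1|")}
```

Multiply this by every new real-time use case that needs its own key order, and the "too many moving parts" problem follows directly.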
I’m not saying it is impossible to use big data NoSQL backend systems to serve your real-time application needs. But it requires an extremely smart architecture that scales from both the batch and the real-time use-case standpoints.