SAIDAR is a simple application that addresses a common business use case: continuously streaming data that must be stored and be accessible in real time. The data in question has a fairly simple format, such as events generated by a web server, a storage appliance, or application code. This data is primarily subjected to time-series analysis.
The solution offered here is nothing new in terms of architecture or accomplishment; a variety of third-party tools (open source and proprietary) already exist in this space. A good example is .
The intent of this solution is to let developers deploy something quickly without going to their IT groups to request special data storage and management software. It is custom code that can be extended.
In Part-1 of this series, I will explain the architecture. Part-2, with code-level details, will come sometime in the future. The first version of the tool will work only in standalone mode. The next revision will enable multi-node deployment, so that the benefits of cluster-based operation can be fully exploited.
Saidar Architecture Depiction
The architecture has two types of components: Control Center and Services.
- Scheduler – Manages the various scheduling tasks, such as polling for streamed data and in-memory index updates.
- Configurator – Settings for the various features are controlled through this interface.
- Monitoring – Metrics and health tracker
- Ingestion Service – Handles the entry of data into the tool via HTTP, file, or DB channels. It also has user extension points that help convert incoming data into a Saidar-compliant format.
- Indexer Service – Data received through the ingestion service is persisted first as an index; the raw data is sent to a different service. An in-memory copy of the index is updated at regular intervals, and the service supports an index scan feature for client searches. Another smart feature ranks the various nodes of the memory tree and promotes or demotes nodes based on access patterns.
- Data Storage Service – Data that needs to be persisted to disk is packaged in a smart way so that retrieval minimizes the number of disk seeks.
- Query Service – Handles interpreting the client query, searching both the index and data storage, and finally aggregating the response based on the user query. It supports user-defined code, since aggregation is very specific to each implementation.
- Access Service – The client entry point to the tool, exposed through a RESTful web service interface and also over TCP/IP.
In subsequent posts, I will explain more about the technical components and their performance characteristics. The goal is to keep the design simple and easy to deploy and use. The challenge will be seeing to what extent this tool can handle large datasets.