Challenge
After the initial research, SCAND engineers summarized the following major requirements for the requested solution:
- Intensive reporting of application usage data gathered from users’ devices, with the volume of reported data expected to grow over time.
- Ability to calculate both existing and new types of statistics over all collected data, which means the raw data must be stored indefinitely.
- Ability to filter the calculated statistics and build charts based on specific criteria.
- Ability to run statistical computations automatically for specific time periods.
Approach
Processing statistics at this scale requires substantial disk space as well as CPU power. The Amazon Web Services platform was chosen because it provides an integrated set of services and resources for such a solution at a reasonable price: AWS offers relatively cheap storage for a virtually unlimited amount of data and supports automated deployment of Hadoop clusters, both of which were necessary to fulfill the customer’s task.
Description
SCAND developers implemented a data pipeline consisting of four stages.
First stage: raw data collection. To minimize the cost of the reporting API and protect it from unauthorized access, the web development team used the Amazon Cognito service, which provides fine-grained permission management for both authorized and guest users. Reports from the Android applications are collected on the devices and transmitted directly to an S3 bucket with write-only access controlled by Cognito. Because the bucket is write-only, a compromised device cannot read back other users’ reports, which keeps the transfer safe. This stage produces a large number of small report files accumulated in a dedicated “buffer” S3 bucket.
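On the devices this flow is implemented with the AWS SDK for Android; the Python sketch below illustrates the same flow with boto3. The identity pool ID, bucket name, object key, and report payload are hypothetical placeholders, and the write-only restriction itself comes from the IAM role attached to the Cognito guest identity, not from this code.

```python
import boto3

# Obtain temporary credentials for a guest (unauthenticated) identity
# from the Cognito identity pool.
cognito = boto3.client("cognito-identity", region_name="us-east-1")
identity = cognito.get_id(IdentityPoolId="us-east-1:example-pool-id")
creds = cognito.get_credentials_for_identity(
    IdentityId=identity["IdentityId"])["Credentials"]

# Upload one usage report to the buffer bucket; the Cognito role grants
# s3:PutObject only, so the device can never read the bucket back.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
s3.put_object(
    Bucket="example-buffer-bucket",
    Key=f"reports/{identity['IdentityId']}/report-0001.json",
    Body=b'{"event": "app_start", "timestamp": 1700000000}',
)
```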
Second stage: data compression. Before the data is stored in the main bucket, the many small files are consolidated into much larger compressed columnar-format files optimized for partial reads. The conversion is performed by Spark code running on an Amazon EMR Hadoop cluster of Spot Instances that is launched and stopped automatically. This is where AWS clearly benefits the project: the EMR service provides automatic Hadoop cluster management with virtually zero engineering effort.
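A minimal PySpark sketch of such a compaction job might look as follows, assuming the reports are JSON documents carrying an epoch-seconds timestamp field; the bucket paths, column names, and the choice of Parquet as the columnar format are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("report-compaction").getOrCreate()

# Read the many small JSON report files accumulated in the buffer bucket.
reports = spark.read.json("s3://example-buffer-bucket/reports/")

# Derive a partitioning column from the epoch-seconds timestamp, then
# consolidate the rows into large, compressed Parquet files (Snappy by
# default) laid out for partial reads by date.
(reports
    .withColumn("report_date", F.to_date(F.from_unixtime("timestamp")))
    .repartition("report_date")
    .write
    .mode("append")
    .partitionBy("report_date")
    .parquet("s3://example-main-bucket/reports/"))
```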
Third stage: statistics computation. A similar EMR cluster is launched manually by SCAND engineers on demand and runs a distributed Spark program that calculates the required statistics. The output is saved in a large Amazon-hosted SQL database as several tables optimized for time-range searches.
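The sketch below shows what one such statistics job could look like: it reads only the Parquet partitions covering the requested period, aggregates them, and writes the result to the SQL database over JDBC. The statistic itself, the column and table names, and the MySQL endpoint are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-statistics").getOrCreate()

# Read only the Parquet partitions covering the requested period.
reports = (spark.read.parquet("s3://example-main-bucket/reports/")
           .where(F.col("report_date").between("2024-01-01", "2024-01-31")))

# Example statistic: daily active devices and event counts per app screen.
daily_stats = (reports
               .groupBy("report_date", "screen_name")
               .agg(F.countDistinct("device_id").alias("active_devices"),
                    F.count("*").alias("events")))

# Persist the results into the Amazon-hosted SQL database over JDBC;
# the tables are keyed by date so time-range queries stay fast.
(daily_stats.write
    .format("jdbc")
    .option("url", "jdbc:mysql://example-db.rds.amazonaws.com:3306/stats")
    .option("dbtable", "daily_screen_stats")
    .option("user", "stats_writer")
    .option("password", "...")  # injected via cluster configuration in practice
    .mode("append")
    .save())
```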
Fourth stage: statistics presentation. A web frontend running on Apache Tomcat uses the D3 JavaScript library to draw charts from the statistics stored in the SQL database, filtered by user-specified criteria.
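The actual frontend is a Java application on Tomcat, but the kind of time-range query it issues against the precomputed tables can be sketched in Python; the table and column names continue the hypothetical schema from the previous stage.

```python
import pymysql

# Connect to the statistics database (a read-only account is assumed).
conn = pymysql.connect(host="example-db.rds.amazonaws.com",
                       user="stats_reader", password="...", database="stats")

# Fetch chart data for a user-selected screen and time range; the result
# is serialized to JSON and rendered by D3 in the browser.
with conn.cursor() as cur:
    cur.execute(
        """SELECT report_date, active_devices, events
             FROM daily_screen_stats
            WHERE screen_name = %s
              AND report_date BETWEEN %s AND %s
            ORDER BY report_date""",
        ("main_screen", "2024-01-01", "2024-01-31"))
    rows = cur.fetchall()
```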
Result
The Amazon Web Services platform made it possible to implement a complex Big Data pipeline while avoiding the main bottlenecks of Big Data processing: disk space, CPU power, and the complexity of distributed computing.
Amazon S3 is one of the cheapest key-value storage solutions of its class, and Amazon EMR provides an easy, highly automated way to instantly deploy and configure a large Hadoop cluster, dramatically reducing development effort and cost.
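As an illustration of how little effort such a deployment takes, here is a boto3 sketch of launching a transient Spot-Instance EMR cluster like the one used in the second stage; the release label, instance types, counts, and S3 paths are assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient Spot-Instance Hadoop cluster that runs one Spark step
# and terminates itself when the step finishes.
emr.run_job_flow(
    Name="report-compaction",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
    },
    Steps=[{
        "Name": "compact-reports",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit",
                     "s3://example-code-bucket/compact_reports.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```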