Forrester Research defines Big Data Streaming Analytics as, “Tools that allow a business to process and act on massive amounts of information while it’s still moving, as opposed to waiting for data to come to rest in a data warehouse or Hadoop. The technology is being used increasingly as new sources of data become common, such as streaming sensor data from the Internet of Things, streaming social media data like Twitter, and streaming mobile information from apps.”
Stream processing refers to a method of continuous processing that happens as data flows into the system. For example, a system that summarizes stock market statistics from stock ticks and updates them throughout the day is a stream processing system. Often, the key requirement we impose on such a system is that the stream consumer (the processing pipeline) must not fall behind the data input rate. This consideration leads us to design a system that scales horizontally and responds to changes in the data input rate.
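As a minimal sketch of this idea, the snippet below incrementally updates per-symbol summary statistics as each stock tick arrives, doing O(1) work per tick instead of reprocessing the whole history. The symbols and prices are made-up illustration data, not part of any real feed.

```python
# Incrementally update summary statistics per symbol as ticks arrive,
# rather than recomputing over all accumulated data.
from collections import defaultdict

class TickSummary:
    """Keeps a running count, min, max, and mean price per symbol."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"count": 0, "min": float("inf"),
                                          "max": float("-inf"), "mean": 0.0})

    def update(self, symbol, price):
        s = self.stats[symbol]
        s["count"] += 1
        s["min"] = min(s["min"], price)
        s["max"] = max(s["max"], price)
        # incremental mean: constant work per tick, no reprocessing
        s["mean"] += (price - s["mean"]) / s["count"]

summary = TickSummary()
for symbol, price in [("ACME", 10.0), ("ACME", 12.0), ("XYZ", 5.0)]:
    summary.update(symbol, price)

print(summary.stats["ACME"]["mean"])  # 11.0
```

Because each update is constant-time, the consumer's cost grows with the input *rate* rather than the accumulated volume, which is exactly the property that keeps the pipeline from falling behind.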
A stream data platform doesn’t replace your data warehouse; in fact, it feeds the warehouse data in real time for long-term retention and ad-hoc analysis. However, if the application we are building only does real-time analysis of recent events, we may not need a data warehouse at all, and the stream’s retention period may be sufficient.
A stream data platform only makes sense for some use cases, the most obvious one being when multiple consumers of a single data stream can benefit from consuming it concurrently. For example, a data stream with two consumers: one updates a real-time dashboard, while the other stores the records in a data warehouse or simply in S3.
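The fan-out pattern described above can be sketched with plain in-process queues standing in for the stream (real platforms like Kinesis or Kafka instead let each consumer track its own position in a shared log). The record contents here are hypothetical.

```python
# One producer fans records out to two independent consumers:
# a real-time dashboard and a long-term archive (e.g. S3).
from queue import Queue

dashboard_q = Queue()   # consumer 1: real-time dashboard updates
archive_q = Queue()     # consumer 2: long-term storage

def publish(record):
    # each consumer receives its own copy of every record
    dashboard_q.put(record)
    archive_q.put(record)

for event in [{"user": "a", "likes": 1}, {"user": "b", "likes": 3}]:
    publish(event)

print(dashboard_q.qsize(), archive_q.qsize())  # 2 2
```

The key property is that the consumers are decoupled: a slow archive writer doesn’t stall the dashboard, because each reads at its own pace.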
Why not a database? A database stores the current state of your data, but the current state is always the result of actions that took place in the past; those actions are the events. In a real-time analysis application, say one that displays analysis of the last 24 hours of data, you don’t want to keep processing the same data over and over again. Instead, you want to process each new piece of data once, as it arrives.
- Amazon Kinesis
- Apache Kafka
- Apache Flink
- Apache Storm
- Apache Spark
- Apache Samza
EMR and Alternatives
Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark in Amazon EMR, which makes it a good service for us.
The alternative is to deploy and manage a Spark cluster ourselves. In that case we’ll need the spark-ec2 script, which can be found at https://github.com/amplab/spark-ec2; an example walkthrough is at http://rotationsymmetry.github.io/2015/06/16/spark-ec2-script/
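For a rough idea of what self-managing involves, a spark-ec2 launch looks roughly like the command below. The key pair, identity file, cluster name, and sizing are placeholders; check the script’s own `--help` output for the exact flags in your version.

```shell
# Launch a small Spark cluster on EC2 (placeholder key pair / names).
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-keypair.pem \
  --region=us-east-1 \
  --slaves=2 \
  launch my-spark-cluster

# Later: log in, stop, or tear the cluster down.
# ./spark-ec2 -k my-keypair -i my-keypair.pem login my-spark-cluster
# ./spark-ec2 destroy my-spark-cluster
```

Everything EMR would otherwise handle (provisioning, configuration, scaling, teardown) becomes our responsibility with this approach, which is the trade-off to weigh against EMR’s per-hour service premium.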
Some use cases where stream processing can help:
- Network monitoring
- Intelligence and surveillance
- Risk management
- Fraud detection
- Smart order routing
- Transaction cost analysis
- Pricing and analytics
- Market data management
- Algorithmic trading
- Data warehouse augmentation
Our use case as an example: Real-time social media streaming and analysis
- Separate dedicated EC2 instances, each with an Elastic IP, maintain independent Twitter streaming connections and continuously send data to the Avata stream
- Avata stream:
- Option 1: MySQL
    - Benefit: MySQL can easily ignore duplicates on insert, which saves resources in the processing stage
    - Problem: for good performance we need to batch writes into MySQL, and the processing stage needs to batch its reads; this introduces delays between stages that we have to manage
- Option 2: Amazon Kinesis
    - Benefit: a managed stream with native integration with other AWS resources; scalable in both sources and volume (the data source doesn’t matter, which also makes it a better fit for IoT data), and it provides a continuous feed for processing
    - Problem: the stream allows duplicate records to come in, which inflates throughput and cost; we need to eliminate duplicates from the stream so the processing stage doesn’t waste resources
- Process data from the stream continuously (e.g. NLP, sentiment analysis, topic modeling) and save the processed data in a database
- A MySQL database stores the processed data and serves it to applications
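The deduplication step called out above can be sketched as a small filter in front of the expensive processing stage. This assumes each record carries a stable ID (for tweets, the tweet ID); in the real pipeline the records would come from the Kinesis stream, but plain dicts stand in here, and the `dedupe` helper is a hypothetical name.

```python
# Drop duplicate records before the expensive NLP/sentiment stage,
# keyed on a stable per-record ID.
def dedupe(records, seen_ids):
    """Yield each record at most once, tracking IDs already processed."""
    for record in records:
        rid = record["id"]
        if rid in seen_ids:
            continue          # duplicate delivery: skip, save compute
        seen_ids.add(rid)
        yield record

seen = set()
batch = [{"id": "t1", "text": "hello"},
         {"id": "t2", "text": "world"},
         {"id": "t1", "text": "hello"}]   # duplicate delivery
unique = list(dedupe(batch, seen))
print(len(unique))  # 2
```

In production the `seen` set can’t grow forever; it would need to be bounded, e.g. kept per time window or stored in an external cache, since Kinesis only delivers duplicates within a limited span anyway.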