Data engineers have two ways of moving data from source to destination for data analytics: stream processing and batch processing.
Stream processing handles a continuous flow of data from sources such as point-of-sale systems, mobile apps, e-commerce websites, GPS devices, and IoT sensors. In batch processing, by contrast, data is bundled up and processed at regular intervals.
Whether your business needs real-time latency depends on what you need to do with your data. If you’re a book retailer checking a dashboard for inventory, you’re probably fine with data that’s hours old. If you’re analyzing data from a heart monitoring implant, you might want no more than a second’s latency. If you’re doing algorithmic trading in the financial markets, you’ll want up-to-the-microsecond pricing information.
Stream processing vs. batch processing
Stream processing handles data in motion — like moving water through a fire hose in a continuous stream. Batch processing is like opening the fire hose every day at midnight and running it until the tank is empty. For example, a day’s worth of data may be batch processed overnight to produce reports the following day.
Stream vs. batch processing: a comparison
| |Stream processing|Batch processing|
|---|---|---|
|What|Single transaction, record, or set of data points|Large datasets composed of multiple transactions or data points|
|When|Continuously processed as data is received from sources|Processed periodically, often run automatically based on a set schedule|
|How|Find new data, process the data|Examine the dataset, determine the most up-to-date records to include in the batch|
|How fast|Milliseconds to seconds|Minutes to hours|
|Why|Real-time or near real-time interaction with people, sensors, or devices|Periodic in-depth analysis or reporting|
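The difference in the "What" and "When" rows can be illustrated with a minimal Python sketch. It assumes a hypothetical record shape, `(store_id, amount)`, and uses an in-memory list in place of a real data source. The batch function waits for the whole dataset and returns one result; the stream function updates its state per record and emits a fresh result each time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (store_id, sale_amount)
Sale = Tuple[str, float]


def batch_totals(sales: Iterable[Sale]) -> Dict[str, float]:
    """Batch style: read the full dataset, then produce a single result."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
    return dict(totals)


def stream_totals(sales: Iterable[Sale]) -> Iterator[Tuple[str, float]]:
    """Stream style: update state per record and emit the new total at once."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
        yield store, totals[store]


# A tiny in-memory stand-in for a day's transactions.
sales = [("A", 10.0), ("B", 5.0), ("A", 2.5)]

batch_result = batch_totals(sales)          # one answer, after all data is in
stream_result = list(stream_totals(sales))  # one answer per incoming record
```

Both functions end with the same totals; what differs is when a consumer can see them.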
Stream processing: use cases
Many industries use stream processing to add value to their products and services. Streaming data gives companies real-time, actionable insights.
Streaming data from ATMs makes it possible for banks to offer consumers continuous access to their bank accounts without human interaction. The ATM can’t rely on a nightly batch process; it must know the consumer’s account balance at all times.
Fraud detection is another ATM feature made possible by streaming data. If you use an ATM in Philadelphia, and your ATM card is used five minutes later in Tampa, the bank can decline the Tampa transaction once analysis determines it is fraudulent.
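The Philadelphia-to-Tampa scenario can be sketched as an "impossible travel" check applied to each card event as it arrives. This is only an illustration, not how any particular bank works: the `CardEvent` fields, the `FraudDetector` class, and the 900 km/h speed threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict

MAX_SPEED_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


@dataclass
class CardEvent:
    card_id: str
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


class FraudDetector:
    """Keeps the last accepted event per card and checks each new one."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, CardEvent] = {}

    def is_suspicious(self, event: CardEvent) -> bool:
        prev = self.last_seen.get(event.card_id)
        suspicious = False
        if prev is not None:
            distance = haversine_km(prev.lat, prev.lon, event.lat, event.lon)
            hours = (event.timestamp - prev.timestamp) / 3600.0
            if hours <= 0:
                # Same instant (or out of order): flag if locations differ.
                suspicious = distance > 1.0
            else:
                suspicious = distance / hours > MAX_SPEED_KMH
        if not suspicious:
            # Only accepted events update the card's known location.
            self.last_seen[event.card_id] = event
        return suspicious


detector = FraudDetector()
philadelphia = CardEvent("card-1", 39.9526, -75.1652, timestamp=0.0)
tampa = CardEvent("card-1", 27.9506, -82.4572, timestamp=300.0)  # 5 min later
```

Because the check runs per event, the decision is available before the second transaction completes, which a nightly batch job could not offer.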
Hyperpersonalization examines a user’s real-time website browsing behavior to gain an up-to-the-minute 360-degree customer view. This allows e-commerce retailers to upsell and customize the shopping experience. Another trend is to link the website with apps and physical locations. For example, if a customer views a product on a website, and then walks into a store that sells the product, stream processing enables the seller to send a coupon to the customer’s mobile device at that time.
Sensors/monitors and IoT
Streaming data also appears in businesses as ordinary as laundromats. The Washlava laundry tech platform has turned washing machines into IoT devices to create a better laundromat experience. Customers use an app to reserve a machine and pay for their wash, and the wash cycle status is updated in real time on the customer’s app. That means no more waiting around for your laundry. Of course, this is only possible with streaming data monitoring machine availability and status.
In a CIO article, “How big data is disrupting the gaming industry,” Dan Schoenbaum, CEO of Cooladata, talks about the importance of data in gaming. “Graphics and creative storylines are no longer enough,” he says. “Today’s online game developers should be investing in business intelligence to understand user likes, dislikes, what’s off-putting, when they’re leaving and not returning.”
Game developers can get that user information from streaming data during game play. Some game development companies even alter an in-progress game to provide a more satisfying gaming experience and keep players in the game longer.
Streaming data meets the demand for real-time and near real-time responsiveness. But you should consider whether you really need real-time data replication, because it can degrade the performance of data warehouses, bogging down data loading and using processing resources that could be spent creating reports. If your goal is to provide people with information they need to make better decisions, it doesn’t make sense to update your business intelligence faster than the human brain can process it.
Real-time data with webhooks
You may have heard the term “webhooks” or “push API.” Webhooks are another way to connect two applications based on events as they happen. To set up a webhook, a developer registers a URL with the source app; thereafter, whenever a relevant event occurs, the source app pushes data to that URL, where a connected app can pick it up. Webhooks fire as discrete events, so they’re not the same as stream processing and are not recommended for high-volume applications. But developers use webhooks because they can trigger events from system to system, enabling real-time workflows.
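To make the receiving side concrete, here is a minimal sketch of a webhook endpoint built only on Python's standard library. The payload shape (a JSON object with a `type` field), the `received` list standing in for a queue or loader, and the port number are assumptions for illustration; a real source app defines its own payload format and usually a signature check as well.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List

# Stand-in for a queue or warehouse loader (assumption for this sketch).
received: List[dict] = []


def handle_event(body: bytes) -> int:
    """Validate one webhook payload and return the HTTP status to reply with."""
    try:
        event = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    # Assumed payload shape: a JSON object carrying a "type" field.
    if not isinstance(event, dict) or "type" not in event:
        return 400
    received.append(event)
    return 200


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTs from the source app at the registered URL."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status = handle_event(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()


def run(port: int = 8000) -> None:
    """Start the receiver; call this explicitly, it blocks until stopped."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Each POST is handled as one discrete event, which is why this pattern suits low to moderate event volumes rather than a sustained high-throughput stream.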
Stitch simplifies data ingestion
If you want to push events as they happen to your data warehouse, you can use Stitch’s webhooks implementation. Your data source will notify Stitch as events happen, and Stitch’s Input API can ingest the data from the event.
Stitch provides connectors from more than 100 data sources to the most popular data warehouse destinations. The Stitch Incoming Webhooks integration provides a simple and flexible method to integrate webhook APIs with Stitch. Our approach is simple and straightforward, so give Stitch a try, on us — set up a free trial in minutes.