Real-Time Processing
Real-time processing is defined as the processing of an unbounded stream of input data with very short latency requirements, measured in milliseconds or seconds. This incoming data typically arrives in an unstructured or semi-structured format, such as JSON, and has the same processing requirements as batch processing, but with shorter turnaround times to support real-time consumption.
One of the big challenges of real-time processing solutions is to ingest, process, and store messages in real time, especially at high volumes. Processing must be done in such a way that it does not block the ingestion pipeline. The data store must support high-volume writes. Another challenge is being able to act on the data quickly, such as generating alerts in real time or presenting the data in a real-time (or near-real-time) dashboard.
Architecture
A real-time processing architecture has the following logical components.
Real-time message ingestion. The architecture must include a way to capture and store real-time messages to be consumed by a stream processing consumer. In simple cases, this service could be implemented as a simple data store in which new messages are deposited in a folder. But often the solution requires a message broker, such as Azure Event Hubs, that acts as a buffer for the messages. The message broker should support scale-out processing and reliable delivery.
After capturing real-time messages, we turn to stream processing: the solution must process the messages by filtering, aggregating, and otherwise preparing the data for analysis.
In the stream processing solution you also configure where the data will be stored by defining one or more outputs. For example, one output can send the data to a reporting tool such as Power BI to plot the data in real time, while another output can send the data to a data store such as SQL Data Warehouse or Hive.
The following technologies are recommended choices for real-time processing solutions in Azure.
Technology choices for Real-time message ingestion
Let’s start with a quick overview of each of them.
Azure Event Hubs. Azure Event Hubs is a message queuing solution for ingesting millions of event messages per second. The captured event data can be processed by multiple consumers in parallel.
Azure IoT Hub. Azure IoT Hub provides bi-directional communication between Internet-connected devices, and a scalable message queue that can handle millions of simultaneously connected devices.
Apache Kafka. Kafka is an open source message queuing and stream processing application that can scale to handle millions of messages per second from multiple message producers, and route them to multiple consumers. Kafka is available in Azure as an HDInsight cluster type.
For more information, see Real-time message ingestion.
Data storage
Azure Storage Blob Containers or Azure Data Lake Store. Incoming real-time data is usually captured in a message broker (see above), but in some scenarios, it can make sense to monitor a folder for new files and process them as they are created or updated. Additionally, many real-time processing solutions combine streaming data with static reference data, which can be stored in a file store. Finally, file storage may be used as an output destination for captured real-time data for archiving, or for further batch processing in a lambda architecture.
For more information, see Data storage.
Stream processing
Azure Stream Analytics. Azure Stream Analytics can run perpetual queries against an unbounded stream of data. These queries consume streams of data from storage or message brokers, filter and aggregate the data based on temporal windows, and write the results to sinks such as storage, databases, or directly to reports in Power BI. Stream Analytics uses a SQL-based query language that supports temporal and geospatial constructs, and can be extended using JavaScript.
Storm. Apache Storm is an open source framework for stream processing that uses a topology of spouts and bolts to consume, process, and output the results from real-time streaming data sources. You can provision Storm in an Azure HDInsight cluster, and implement a topology in Java or C#.
Spark Streaming. Apache Spark is an open source distributed platform for general data processing. Spark provides the Spark Streaming API, in which you can write code in any supported Spark language, including Java, Scala, and Python. Spark 2.0 introduced the Spark Structured Streaming API, which provides a simpler and more consistent programming model. Spark 2.0 is available in an Azure HDInsight cluster.
For more information, see Stream processing.
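To make the Spark option concrete, the following is a minimal Structured Streaming sketch in Python (PySpark). It assumes a hypothetical Kafka topic named device-telemetry reachable at broker1:9092 and requires the Spark Kafka connector package; the one-minute tumbling window and console sink are purely illustrative, not a prescription for a production job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Minimal Structured Streaming sketch: count events per one-minute tumbling window.
# Requires the spark-sql-kafka connector package on the cluster.
spark = SparkSession.builder.appName("telemetry-window-counts").getOrCreate()

# Assumed source: a Kafka topic named "device-telemetry" on broker1:9092.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "device-telemetry")
          .load())

counts = (events
          .selectExpr("CAST(value AS STRING) AS body", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Write the running counts to the console; a real job would target a data store.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```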
Analytical data store
SQL Data Warehouse, HBase, Spark, or Hive. Processed real-time data can be stored in a relational database such as Azure SQL Data Warehouse, a NoSQL store such as HBase, or as files in distributed storage over which Spark or Hive tables can be defined and queried.
For more information, see Analytical data stores.
Analytics and reporting
Azure Analysis Services, Power BI, and Microsoft Excel. Processed real-time data that is stored in an analytical data store can be used for historical reporting and analysis in the same way as batch processed data. Additionally, Power BI can be used to publish real-time (or near-real-time) reports and visualizations from analytical data sources where latency is sufficiently low, or in some cases directly from the stream processing output.
For more information, see Analytics and reporting.
In a purely real-time solution, most of the processing orchestration is managed by the message ingestion and stream processing components. However, in a lambda architecture that combines batch processing and real-time processing, you may need to use an orchestration framework such as Azure Data Factory or Apache Oozie and Sqoop to manage batch workflows for captured real-time data.
But the main question remains: which one should you use in which kind of scenario? Let’s figure that out below.
Both Azure IoT Hub and Azure Event Hubs are cloud services that can ingest large amounts of data and process or store that data for business insights. The two services are similar in that they both support ingestion of data with low latency and high reliability, but they are designed for different purposes. IoT Hub was developed specifically to address the unique requirements of connecting IoT devices, at scale, to the Azure cloud, while Event Hubs was designed for big data streaming. This is why Microsoft recommends using Azure IoT Hub to connect IoT devices to Azure.
Azure IoT Hub is the cloud gateway that connects IoT devices to gather data to drive business insights and automation. In addition, IoT Hub includes features that enrich the relationship between your devices and your backend systems. Bi-directional communication capabilities mean that while you receive data from devices you can also send commands and policies back to devices, for example, to update properties or invoke device management actions. This cloud-to-device connectivity also powers the important capability of delivering cloud intelligence to your edge devices with Azure IoT Edge. The unique device-level identity provided by IoT Hub helps better secure your IoT solution from potential attacks.
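As a rough illustration of this bi-directional model, the sketch below uses the azure-iot-device Python SDK to send one device-to-cloud telemetry message and then wait for a cloud-to-device message; the connection string and payload are placeholders, not values from this article.

```python
from azure.iot.device import IoTHubDeviceClient, Message

# Placeholder device connection string obtained from IoT Hub (device-level identity).
CONN_STR = "HostName=<your-hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()

# Device-to-cloud: send one telemetry message.
client.send_message(Message('{"temperature": 21.4, "humidity": 0.56}'))

# Cloud-to-device: block until the back end sends a command or property update.
command = client.receive_message()
print("Received cloud-to-device message:", command.data)

client.shutdown()
```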
Azure Event Hubs is the big data streaming service of Azure. It is designed for high throughput data streaming scenarios where customers may send billions of requests per day. Event Hubs uses a partitioned consumer model to scale out your stream and is integrated into the big data and analytics services of Azure including Databricks, Stream Analytics, ADLS, and HDInsight. With features like Event Hubs Capture and Auto-Inflate, this service is designed to support your big data apps and solutions. Additionally, IoT Hub leverages Event Hubs for its telemetry flow path, so your IoT solution also benefits from the tremendous power of Event Hubs.
To summarize, while both solutions are designed for data ingestion at a massive scale, only IoT Hub provides the rich IoT-specific capabilities that are designed for you to maximize the business value of connecting your IoT devices to the Azure cloud. If your IoT journey is just beginning, starting with IoT Hub to support your data ingestion scenarios will assure that you have instant access to the full-featured IoT capabilities once your business and technical needs require them.
The following table provides details about how the two tiers of IoT Hub compare to Event Hubs when you’re evaluating them for IoT capabilities. For more information about the standard and basic tiers of IoT Hub, see How to choose the right IoT Hub tier.
Even if the only use case is device-to-cloud data ingestion, we highly recommend using IoT Hub as it provides a service that is designed for IoT device connectivity.
Let’s look at each of them in a little more detail.
Azure Event Hubs
Azure Event Hubs is a Big Data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored using any real-time analytics provider or batching/storage adapters.
Event Hubs is used in some of the following common scenarios:
- Anomaly detection (fraud/outliers)
- Application logging
- Analytics pipelines, such as clickstreams
- Live dashboarding
- Archiving data
- Transaction processing
- User telemetry processing
- Device telemetry streaming
Why use Event Hubs?
Data is valuable only when there is an easy way to process and get timely insights from data sources. Event Hubs provides a distributed stream processing platform with low latency and seamless integration, with data and analytics services inside and outside Azure to build a complete Big Data pipeline.
Event Hubs represents the “front door” for an event pipeline, often called an event ingestor in solution architectures. An event ingestor is a component or service that sits between event publishers and event consumers to decouple the production of an event stream from the consumption of those events. Event Hubs provides a unified streaming platform with time retention buffer, decoupling the event producers from event consumers.
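For illustration, a minimal publisher sketch using the azure-eventhub Python SDK (v5) might look like the following; the connection string, event hub name, and payloads are placeholders.

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder namespace connection string and event hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>",
    eventhub_name="telemetry",
)

# Batch a few events and send them through the "front door".
with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"deviceId": "sensor-1", "temperature": 21.4}'))
    batch.add(EventData('{"deviceId": "sensor-2", "temperature": 19.8}'))
    producer.send_batch(batch)
```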
The following sections describe key features of the Azure Event Hubs service:
Fully managed PaaS
Event Hubs is a managed service with little configuration or management overhead, so you focus on your business solutions. Event Hubs for Apache Kafka ecosystems gives you the PaaS Kafka experience without having to manage, configure, or run your clusters.
Support for real-time and batch processing
Ingest, buffer, store, and process your stream in real time to get actionable insights. Event Hubs uses a partitioned consumer model, enabling multiple applications to process the stream concurrently and letting you control the velocity of processing.
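On the reading side, a minimal sketch with the same SDK could look like this; each running instance receives events from the partitions assigned to it within its consumer group. The connection string and hub name are again placeholders, and checkpointing is omitted for brevity.

```python
from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>",
    consumer_group="$Default",
    eventhub_name="telemetry",
)

def on_event(partition_context, event):
    # Each callback is tied to one partition, so ordering holds within it.
    print(partition_context.partition_id, event.body_as_str())

with consumer:
    # starting_position="-1" reads from the beginning of each partition.
    consumer.receive(on_event=on_event, starting_position="-1")
```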
Capture your data in near-real time in an Azure Blob storage or Azure Data Lake Store for long-term retention or micro-batch processing. You can achieve this on the same stream you use for deriving real-time analytics. Setting up Capture is fast, there are no administrative costs to run it, and it scales automatically with Event Hubs throughput units. Event Hubs Capture enables you to focus on data processing rather than on data capture.
Azure Event Hubs also integrates with Azure Functions for a serverless architecture.
Scalable
With Event Hubs, you can start with data streams in megabytes, and grow to gigabytes or terabytes. The Auto-inflate feature is one of the many options available to scale the number of throughput units to meet your usage needs.
Rich ecosystem
Event Hubs for Apache Kafka ecosystems enables Apache Kafka (1.0 and above) clients and applications to talk to Event Hubs without having to manage any clusters.
With a broad ecosystem available in various languages (.NET, Java, Python, Go, Node.js), you can easily start processing your streams from Event Hubs. All supported client languages provide low-level integration.
Key architecture components
Event Hubs provides message stream handling capability but has characteristics that are different from traditional enterprise messaging. Event Hubs capabilities are built around high throughput and event processing scenarios. Event Hubs contains the following key components:
Event producers: Any entity that sends data to an event hub. Event publishers can publish events using HTTPS, AMQP 1.0, or Apache Kafka (1.0 and above).
Partitions: Each consumer only reads a specific subset, or partition, of the message stream.
Consumer groups: A view (state, position, or offset) of an entire event hub. Consumer groups enable multiple consuming applications to each have a separate view of the event stream, and to read the stream independently at their own pace and with their own offsets.
Throughput units: Pre-purchased units of capacity that control the throughput capacity of Event Hubs.
Event receivers: Any entity that reads event data from an event hub. All Event Hubs consumers connect via the AMQP 1.0 session, and events are delivered through the session as they become available.
The following figure shows the Event Hubs stream processing architecture:
But what about Kafka?
Kafka on HDInsight
Apache Kafka is an open-source distributed streaming platform that can be used to build real-time data pipelines and streaming applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams. It is horizontally scalable, fault-tolerant, and extremely fast. Kafka on HDInsight provides Kafka as a managed, highly scalable, and highly available service in Azure.
The following are specific characteristics of Kafka on HDInsight:
It is a managed service that provides a simplified configuration process. The result is a configuration that is tested and supported by Microsoft.
Microsoft provides a 99.9% Service Level Agreement (SLA) on Kafka uptime. For more information, see the SLA information for HDInsight document.
It uses Azure Managed Disks as the backing store for Kafka. Managed Disks can provide up to 16 TB of storage per Kafka broker. For information on configuring managed disks with Kafka on HDInsight, see Increase scalability of Kafka on HDInsight.
For more information on managed disks, see Azure Managed Disks.
Kafka was designed with a single dimensional view of a rack. Azure separates a rack into two dimensions – Update Domains (UD) and Fault Domains (FD). Microsoft provides tools that rebalance Kafka partitions and replicas across UDs and FDs.
For more information, see High availability with Kafka on HDInsight.
HDInsight allows you to change the number of worker nodes (which host the Kafka-broker) after cluster creation. Scaling can be performed from the Azure portal, Azure PowerShell, and other Azure management interfaces. For Kafka, you should rebalance partition replicas after scaling operations. Rebalancing partitions allows Kafka to take advantage of the new number of worker nodes.
For more information, see High availability with Kafka on HDInsight.
Azure Log Analytics can be used to monitor Kafka on HDInsight. Log Analytics surfaces virtual machine level information, such as disk and NIC metrics, and JMX metrics from Kafka.
For more information, see Analyze logs for Kafka on HDInsight.
Kafka on HDInsight architecture
The following diagram shows a typical Kafka configuration that uses consumer groups, partitioning, and replication to offer parallel reading of events with fault tolerance:
Apache ZooKeeper manages the state of the Kafka cluster. Zookeeper is built for concurrent, resilient, and low-latency transactions.
Kafka stores records (data) in topics. Records are produced by producers, and consumed by consumers. Producers send records to Kafka brokers. Each worker node in your HDInsight cluster is a Kafka broker.
Topics partition records across brokers. When consuming records, you can use up to one consumer per partition to achieve parallel processing of the data.
Replication is employed to duplicate partitions across nodes, protecting against node (broker) outages. A partition denoted with an (L) in the diagram is the leader for the given partition. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.
Why use Kafka on HDInsight?
The following are common tasks and patterns that can be performed using Kafka on HDInsight:
Replication of Kafka data: Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters.
For information on using MirrorMaker, see Replicate Kafka topics with Kafka on HDInsight.
Publish-subscribe messaging pattern: Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic.
For more information, see Start with Kafka on HDInsight.
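As a hedged sketch of this publish-subscribe pattern, the snippet below uses the kafka-python client; the broker address, topic name, and payload are assumptions made purely for illustration.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["wn0-kafka:9092"]  # assumed HDInsight worker-node broker address

# Producer API: publish a record to a topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("clickstream", b'{"page": "/home", "userId": "42"}')
producer.flush()

# Consumer API: subscribe to the topic and read records as they arrive.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers=BROKERS,
    group_id="analytics",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```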
Stream processing: Kafka is often used with Apache Storm or Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Storm or Spark.
For more information, see Start with Kafka on HDInsight.
Horizontal scale: Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records.
For more information, see Start with Kafka on HDInsight.
In-order delivery: Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order.
For more information, see Start with Kafka on HDInsight.
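A small sketch of that one-consumer-per-partition idea, again with kafka-python and an assumed broker and topic, might look like this:

```python
from kafka import KafkaConsumer, TopicPartition

# Pin this consumer to a single partition so records are processed in order.
consumer = KafkaConsumer(bootstrap_servers=["wn0-kafka:9092"])  # assumed broker
consumer.assign([TopicPartition("clickstream", 0)])

for record in consumer:
    # Records within partition 0 arrive in the order they were written.
    print(record.offset, record.value)
```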
Use cases
Messaging: Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker.
Activity tracking: Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application.
Aggregation: Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data.
Transformation: Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics.
Key differences between Kafka and Event Hubs
While Apache Kafka is software, which you can run wherever you choose, Event Hubs is a cloud service similar to Azure Blob Storage. There are no servers or networks to manage and no brokers to configure. You create a namespace, which is an FQDN in which your topics live, and then create Event Hubs or topics within that namespace. As a cloud service, Event Hubs uses a single stable virtual IP address as the endpoint, so clients do not need to know about the brokers or machines within a cluster.
Scale in Event Hubs is controlled by how many throughput units you purchase, with each throughput unit entitling you to 1 MB per second, or 1000 events per second of ingress. By default, Event Hubs scales up throughput units when you reach your limit with the Auto-Inflate feature; this feature also works with the Event Hubs for Kafka feature.
Security and authentication
Azure Event Hubs requires SSL or TLS for all communication and uses Shared Access Signatures (SAS) for authentication. This requirement is also true for a Kafka endpoint within Event Hubs. For compatibility with Kafka, Event Hubs uses SASL PLAIN for authentication and SASL SSL for transport security.
For more information about security in Event Hubs, see Event Hubs authentication and security.
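For Kafka clients, the commonly documented pattern is to connect to the namespace on port 9093 over SASL_SSL, using the literal user name $ConnectionString and the namespace connection string as the password. The following kafka-python sketch illustrates this; the namespace, connection string, and topic name are placeholders.

```python
import ssl
from kafka import KafkaProducer

# Kafka endpoint of an Event Hubs namespace: port 9093, SASL_SSL with PLAIN,
# user name "$ConnectionString", and the namespace connection string as password.
producer = KafkaProducer(
    bootstrap_servers=["<namespace>.servicebus.windows.net:9093"],
    security_protocol="SASL_SSL",
    ssl_context=ssl.create_default_context(),
    sasl_mechanism="PLAIN",
    sasl_plain_username="$ConnectionString",
    sasl_plain_password="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>",
)

# The Kafka topic name maps to the event hub name.
producer.send("telemetry", b'{"deviceId": "sensor-1", "temperature": 21.4}')
producer.flush()
```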
Other Event Hubs features available for Kafka
The Event Hubs for Kafka feature enables you to write with one protocol and read with another, so that your current Kafka producers can continue publishing via Kafka, and you can add readers with Event Hubs, such as Azure Stream Analytics or Azure Functions.
Additionally, Event Hubs features such as Capture and Geo Disaster-Recovery also work with the Event Hubs for Kafka feature.
PaaS versus IaaS
Before we jump into the relative merits and demerits of these two solutions, it is important to note that Azure Event Hub is a managed service, i.e., PaaS, whereas when you run Kafka on Azure, you need to manage your own Linux VMs, making it an IaaS solution. IaaS means more work, more control, and more integration. It may, however, be worth it for your specific situation, but it is important to know when it is not.
Kafka scales very well: LinkedIn handles half a trillion events per day spread across multiple datacenters. Kafka, however, is not a managed service. You can install Kafka and create a cluster of Linux VMs on Azure, essentially choosing IaaS, i.e., you have to manage your own servers, upgrades, packages, and versions.
On the other hand, PaaS frees you from all such constraints. As a general rule, if a PaaS solution meets your functional needs and its throughput, scalability, and performance meet your current and future SLAs, it is, in my opinion, the more convenient choice. In the case of Azure Event Hub, several other factors come into play, such as integration with legacy systems, choice of language and technology for developing producers and consumers, message size, security, protocol support, and availability in your region; we will look at these later in this article. The PaaS versus IaaS debate is bigger and older than Kafka versus Event Hub, so let us not spend much time on it.
Both offer easy configuration of the retention policy: how long messages will be retained in storage. With Azure Event Hub, multi-tenancy dictates enforcement of limits: with the Standard tier, Azure Event Hub allows message retention for up to 30 days (the default is 24 hours). With Kafka, you specify these limits in configuration files, and you can specify different retention policies for different topics, with no set maximum.
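As an illustration of per-topic retention in Kafka, the sketch below creates a topic with a seven-day retention.ms setting using kafka-python's admin client; the broker address, topic name, partition count, and replication factor are assumptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["wn0-kafka:9092"])  # assumed broker

# Per-topic retention: keep "clickstream" records for 7 days (value is in milliseconds).
admin.create_topics([
    NewTopic(
        name="clickstream",
        num_partitions=8,
        replication_factor=3,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
```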
However, Azure Event Hub costs you less than $100 for 2.5 billion 1 KB events per month. Kafka, on the other hand, is open source and free, but the machines it runs on are not, and neither are the people you hire to maintain those machines. It almost always costs you more than Azure Event Hub.
Enough of that – we got dragged back into the PaaS and IaaS debate. Let us focus on some more technical differences.
On-premises support: Azure Event Hub cannot be installed and used on-premises. Kafka, on the other hand, can be installed on-premises. Though this article is about the differences between Azure Event Hub and Kafka running on Azure, I thought I should point this one out.
Protocol support: Kafka has HTTP REST-based clients, but it does not support AMQP. Azure Event Hub supports AMQP.
Coding paradigms are slightly different (as is to be expected). In my experience, only the Java libraries for Kafka are well maintained and exhaustive. Support in other languages exists, but it is not as strong. This page lists all the clients for Kafka written in various languages.
Kafka is written in Scala. Therefore, whenever you have to track the source code down from the Java API wrappers, it may get very difficult if you are not familiar with Scala. If you are building something where you need answers to non-standard questions, such as whether you can cast an object into a particular data type, and you need to look at the Scala code, it can hurt productivity.
Azure Event Hub, apart from being fully supported by C# and .NET, can also use the Java QPID JMS libraries as a client (because of its AMQP support): see here.
As QPID also has support for C and Python, building clients in those languages is easy as well.
REST support for both means we can build clients in any language, but Kafka prefers Java as the API language.
Disaster Recovery (DR): Azure Event Hub applies replication on the Azure storage unit where the messages are stored, so we can apply features like Geo-Redundant Storage and make replication across regions a single-click solution.
Kafka, however, applies replication only between its cluster nodes and has no easy built-in way to replicate across regions. There are solutions like the Go Mirror Maker tool, but they are additional software layers needed to achieve inter-region replication, adding more complexity and integration points. The advice from Kafka’s creators for on-premises installations is to keep clusters local to datacenters and mirror between datacenters.
Kafka uses ZooKeeper for configuration management. ZooKeeper is known for its lack of proper multi-region support: it performs writes (updates) poorly when there is higher latency between hosts, and spreading hosts across regions means higher latency.
For running Kafka on Azure VMs, adding all Kafka instances to an Availability Set is enough to ensure that no messages are lost, effectively eliminating (to a large degree) the need for geo-redundant storage units. Because two VMs in the same Availability Set are guaranteed to be in different fault and update domains, Kafka messages should be safe from disasters small enough not to impact the entire region.
High Availability (HA) and Fault Tolerance: Until the whole region or site goes down, Kafka is highly available inside a local cluster because it tolerates the failure of multiple nodes in the cluster very well. Again, Kafka’s dependency on ZooKeeper to manage the configuration of topics and partitions means HA is affected across globally distributed regions.
Azure Event Hub is highly available under the umbrella Azure guarantee of HA. Under the hood, Event Hub servers use replication and Availability Sets to achieve HA and fault tolerance.
It may be worth mentioning in this regard that Kafka’s in-cluster fault tolerance allows zero-downtime upgrades, where you can rotate deployments and upgrade one node at a time without taking the entire cluster down. Azure Event Hub, it goes without saying, has the same capability, with the exception that you do not have to worry about upgrades and versions!
Scalability: Kafka’s ability to shard partitions, as well as to increase both (a) the partition count per topic and (b) the number of downstream consumer threads, provides the flexibility to increase throughput when desired, making it highly scalable.
Also, the Kafka team, as of this writing, is working on zero-downtime scalability using the Apache Mesos framework. This framework makes Kafka elastic, which means Kafka running on the Mesos framework can be expanded (scaled) without downtime. Read more on this initiative here.
Azure Event Hub, of course, being hosted on Azure, is automatically scalable depending on the number of throughput units you purchase. Both storage and concurrency scale as needed with Azure Event Hub. Unlike Event Hub, Kafka will probably never be able to scale on storage, but again, these are exactly the kind of points that go back to Kafka being IaaS and Event Hub being PaaS.
Throttling: With Azure Event Hub, you purchase capacity in terms of TUs (Throughput Units), where 1 TU entitles you to ingest 1,000 events per second (or 1 MB per second, whichever is higher) and to egress twice that. When you hit your limit, Azure throttles you evenly across all your senders and receivers (.NET clients will receive a ServerBusyException). You can purchase up to 20 TUs from the Azure portal; more than 20 can be purchased by opening a support ticket.
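The TU arithmetic lends itself to a quick back-of-the-envelope estimate. The following sketch computes a rough TU count from an assumed event rate and average event size, using the 1,000 events per second and 1 MB per second per-TU ingress figures quoted above; it is an estimate, not an official sizing tool.

```python
import math

def required_throughput_units(events_per_sec: float, avg_event_bytes: float) -> int:
    """Rough TU estimate: 1 TU = 1,000 events/s or 1 MB/s of ingress, whichever binds."""
    by_count = events_per_sec / 1_000
    by_bytes = (events_per_sec * avg_event_bytes) / 1_000_000
    return max(1, math.ceil(max(by_count, by_bytes)))

# Example: 15,000 events/s at 1 KB each needs 15 TUs (both limits agree here).
print(required_throughput_units(15_000, 1_000))
```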
If you run Kafka on Azure, there is no question of being throttled. Theoretically, you could reach a limit where the underlying storage account is throttled, but that is practically impossible to reach with event messaging, where each message is a few kilobytes in size and retention policies are in place.
Kafka itself lacks any throttling mechanism or protection from abuse. It is not multi-tenant by nature, so I guess such features must be pretty low in the pecking order when it comes to the roadmap.
Security: Kafka lacks any kind of security mechanism as of today. A ticket to implement basic security features is currently open (as of August 2015: https://issues.apache.org/jira/browse/KAFKA-1682). The goal of this ticket is to encrypt data on the wire (not at rest), authenticate clients when they try to connect to the brokers, and support role-based authorization and ACLs.
Therefore, in IoT scenarios where (a) each publishing device must have its own identity and (b) messages must not be consumed by non-intended receivers, Kafka developers will have to build additional, complicated security instruments, possibly built into the messages themselves. This results in additional complexity and redundant bandwidth consumption.
Azure Event Hub, on the other hand, is very secure – it uses SAS tokens just like other Service Bus components. Each publishing device (think of an IoT scenario) is assigned a unique token. On the consuming side, a client can create a consumer group if the request to create the consumer group is accompanied by a token that grants manage privileges for either the Event Hub, or for the namespace to which the Event Hub belongs. Also, a client is allowed to consume data from a consumer group if the receive request is accompanied by a token that grants receive rights either on that consumer group, or the Event Hub, or the namespace to which the Event Hub belongs.
The current version of Service Bus does not support SAS rules for individual subscriptions; SAS support for this will be added in the future.
Integration with Stream Analytics: Stream Analytics is Microsoft’s solution for real-time event processing. It can be employed to enable complex event processing (CEP) scenarios (in combination with Event Hubs), allowing multiple inputs to be processed in real time to generate meaningful analytics. Technologies like Esper and Apache Storm provide similar capabilities, but with Stream Analytics you get out-of-the-box integration with Event Hubs, SQL databases, and Storage, which makes it very compelling for quick development with all these components. Moreover, it exposes a query processing language very similar to SQL syntax, so the learning curve is minimal. In fact, once you have a job created, you can simply use the Azure Management portal to develop queries and run jobs, eliminating the need for coding in many use cases.
Once you integrate Azure Event Hub with Azure Stream Analytics, the next logical step is to use Azure Machine Learning to take intelligent decisions based on that Analytics data.
With Kafka, you do not get such ease of integration. It is, of course, possible to build such integration, but it is time consuming.
Message size: Azure Event Hub imposes an upper limit on message size of 256 KB, the need for such a policy arising, of course, from its multi-tenant nature. Kafka has no such limitation, but its performance sweet spot is a 10 KB message size, and its default maximum is 1 MB.
In this regard, it is worth mentioning that with both Kafka and Azure Event Hub, you can compress the actual message body and reduce its size using standard compression algorithms (Gzip, Snappy etc.).
Another word of caution in this regard: if you are building an event-driven messaging system, the need to send a massive XML or JSON file as the message usually indicates poor design. Therefore, when comparing these two technologies on message size, we should keep this in mind: a well-designed system exchanges small messages, each less than 10 KB in size.
Storage needs: Kafka writes every message to broker disk, necessitating the attachment of large disks to every VM running the Kafka cluster (see here on how to attach a data disk to a Linux VM). See Disaster Recovery above to understand how stored messages can be affected in case of disaster.
For Azure Event Hub, you need to configure Azure Storage explicitly from the publisher before you can send any messages. 1 TU gives you 84 GB of event storage. If you cross that limit, storage is charged at the standard rate for blob storage. This overage does not apply within the first 24 hour period (if your message retention policy is 24 hours, you can store as much as you want without getting charged for storage).
Again, a design pointer here: a messaging system should not double as a storage system. The current breed of persistent replicated messaging systems has storage built in for messaging robustness, not for long-term retention of data. We have databases, Hadoop, and NoSQL solutions for that, and all of these could be consumers of the Event Hub you use, enabling you to erase messages quickly but still retain their value.
Pricing and Availability: For running Kafka on VMs, you need to know this: http://azure.microsoft.com/en-us/pricing/details/virtual-machines/
For using Azure Event Hub, you need to know this (it has a link to the pricing page at the top): https://azure.microsoft.com/en-us/documentation/articles/event-hubs-availability-and-support-faq/
Performance: Now that we are down to comparing performance, I must go back to my managed multi-tenant service pitch first. You need to know that you are not comparing apples to apples. You can run a massive test on Kafka, but you cannot run it on Azure Event Hub: you will be throttled!
There are so many throughput performance numbers on Apache Kafka out there that I did not want to set up another test and offer more of the same thing. However, in order to compare Azure Event Hub and Kafka effectively, it is important to note that if you purchase 20 TUs of Event Hub, you will get 20,000 messages per second of ingress (or 20 MB per second, whichever is larger). With Kafka, this comparison concludes that a single node with a single thread achieves around 2,550 messages/second, and 25 sending/receiving threads on 4 nodes achieve 30,000 messages/second.
That makes the performance comparable to me, which is very impressive for Event Hub, as it is still a multi-tenant solution, meaning each and every tenant is getting that performance out of it! This, however, needs to be proved and I have not run multi-tenant tests simultaneously.
A point to note about the famous Jay Kreps test is that the message size used there is 100 bytes. Azure Event Hub guarantees 20,000 messages/second or 20 MB per second of ingress to someone who has purchased 20 TUs, which tells us that the expected message size is 1 KB. That is a 10-times difference in message size.
However, I did run a single-tenant test, and I tested something different: how long does a message take to reach the consumer? This approach provides a new angle on performance. Most test results published on the internet deal with ingestion throughput, so I wanted to test something new and different.
Before discussing the results, let me explain an inherent inequality in these tests. The latency numbers here are for end-to-end message publishing and consumption. While in the case of Event Hub the three actors (publisher, highway, and consumer) were separated on the network, in the case of Kafka all three were on the same Linux Azure VM (D4, 8 cores).
With that, let us compare the results:
In spite of the fact that the Azure Event Hub end-to-end test involved multiple network hops, the latency was within a few milliseconds of Kafka’s (even though the messages were traveling within the boundaries of the same machine in Kafka’s case).
In essence, what I have measured shows no difference between the two. I am more convinced that Event Hub, in spite of being a managed service, provides a similar degree of performance compared to Kafka. If you add all the other factors to the equation, I would choose Azure Event Hub over running Kafka either on Azure or my own hardware.
Key selection criteria
To narrow the choices, start by answering these questions:
Do you need two-way communication between your IoT devices and Azure? If so, choose IoT Hub.
Do you need to manage access for individual devices and be able to revoke access to a specific device? If yes, choose IoT Hub.
Choosing between Event Hub and Kafka? At the end of the day, you need to ask yourself: what degree of control do I need?
What is Stream Analytics?
Azure Stream Analytics is an event-processing engine that allows you to examine high volumes of data streaming from devices. Incoming data can come from devices, sensors, websites, social media feeds, applications, and more. It also supports extracting information from data streams and identifying patterns and relationships. You can then use these patterns to trigger other actions downstream, such as raising alerts, feeding information to a reporting tool, or storing the data for later use.
The following are some examples of where Azure Stream Analytics can be used:
Internet of Things (IoT) sensor fusion and real-time analytics on device telemetry
Web logs/clickstream analytics
Geospatial analytics for fleet management and driverless vehicles
Remote monitoring and predictive maintenance of high-value assets
Real-time analytics on Point of Sale data for inventory control and anomaly detection
How does Stream Analytics work?
Azure Stream Analytics starts with a source of streaming data that is ingested into Azure Event Hubs or Azure IoT Hub, or pulled from a data store like Azure Blob Storage. To examine the streams, you create a Stream Analytics job that specifies the input source of the streaming data. The job also specifies a transformation query that defines how to look for data, patterns, or relationships. The transformation query uses a SQL-like query language to filter, sort, aggregate, and join streaming data over a period of time. When executing the job, you can adjust the event ordering options and the duration of time windows used when performing aggregation operations.
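The transformation query itself is authored in the SQL-like Stream Analytics language, but the windowing idea can be illustrated in plain Python: the sketch below groups a few hand-made events into one-minute tumbling windows and counts them, which is conceptually what a windowed aggregation in a Stream Analytics query does. All names and timestamps here are made up for illustration; this is not Stream Analytics code.

```python
from collections import Counter
from datetime import datetime

# Conceptual illustration only: a tumbling-window count like the one a
# Stream Analytics transformation query would compute over a stream.
events = [
    {"deviceId": "sensor-1", "ts": datetime(2019, 5, 1, 12, 0, 12)},
    {"deviceId": "sensor-2", "ts": datetime(2019, 5, 1, 12, 0, 48)},
    {"deviceId": "sensor-1", "ts": datetime(2019, 5, 1, 12, 1, 5)},
]

WINDOW_SECONDS = 60  # one-minute tumbling window

def window_start(ts: datetime) -> datetime:
    # Align each timestamp to the start of its window.
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS)

counts = Counter(window_start(e["ts"]) for e in events)
for start, n in sorted(counts.items()):
    print(start, n)
```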
After analyzing the incoming data, you specify an output for the transformed data, and you can control what to do in response to the information you’ve analyzed. For example, you can take actions like:
Send data to a monitored queue to trigger custom workflows downstream.
Send data to Power BI dashboard for real-time visualization.
Archive data to other Azure storage services.
The following image illustrates the Stream Analytics pipeline. Your Stream Analytics job can use all or a selected set of inputs and outputs. This image shows how data is sent to Stream Analytics, analyzed, and sent on for other actions such as storage or presentation:
Real-time stream processing consumes messages from queue- or file-based storage, processes the messages, and forwards the result to another message queue, file store, or database. Processing may include querying, filtering, and aggregating messages. Stream processing engines must be able to consume endless streams of data and produce results with minimal latency. For more information, see Real time processing.
Key capabilities and benefits
Azure Stream Analytics is designed to be easy to use, flexible, reliable, and scalable to any job size. It is available across multiple datacenters as well as sovereign clouds. The following image illustrates the key capabilities of Azure Stream Analytics:
Ease of getting started
Azure Stream Analytics is easy to get started with. It takes only a few clicks to connect to multiple sources and sinks and to create an end-to-end pipeline. Stream Analytics can connect to Azure Event Hubs and Azure IoT Hub for streaming data ingestion. It can also connect to the Azure Blob storage service to ingest historical data. It can combine data from event hubs with other data sources and processing engines. Job input can also include reference data, that is, static or slow-changing data, and you can join streaming data to this reference data to perform lookup operations.
Stream Analytics can route job output to many storage systems such as Azure Blob storage, Azure SQL Database, Azure Data Lake Store, or Azure Cosmos DB. After storing the output, you can run batch analytics with Azure HDInsight, or send it to another service such as Event Hubs for consumption or to Power BI for real-time visualization by using the Power BI streaming API.
Programmer Productivity
Azure Stream Analytics uses a simple SQL-based query language that has been augmented with powerful temporal constraints to analyze data in motion. To define job transformations, you use this simple, declarative Stream Analytics query language, which lets you author complex temporal queries and analytics using simple SQL constructs. Because the Stream Analytics query language is consistent with SQL, familiarity with SQL is sufficient to get started with creating jobs. You can also create jobs by using developer tools like Azure PowerShell, the Stream Analytics Visual Studio tools, or Azure Resource Manager templates. Using developer tools allows you to develop transformation queries offline and use a CI/CD pipeline to submit jobs to Azure.
The Stream Analytics query language offers a wide array of functions for analyzing and processing streaming data, ranging from simple data manipulation and aggregation functions to complex geospatial functions. You can edit queries in the portal and test them using sample data extracted from the live stream.
You can extend the capabilities of the query language by defining and invoking additional functions. You can define function calls in the Azure Machine Learning service to take advantage of Azure Machine Learning solutions, and integrate JavaScript user-defined functions (UDFs) or user-defined aggregates to perform complex calculations as part of a Stream Analytics query.
Fully managed
Azure Stream Analytics is a fully managed, serverless (PaaS) offering on Azure, which means you don’t have to provision any hardware or manage clusters to run your jobs. Azure Stream Analytics fully manages your job by taking care of setting up the complex compute clusters in the cloud and doing the performance tuning necessary to run it. Integration with Azure Event Hubs and Azure IoT Hub allows jobs to ingest millions of events per second coming from connected devices, clickstreams, and log files, to name a few. Using the partitioning feature of Event Hubs, you can partition computations into logical steps, each with the ability to be further partitioned to increase scalability.
Low Total Cost of Ownership
As a cloud service, Stream Analytics is optimized for cost. There are no upfront costs involved; you only pay for the streaming units you consume and the amount of data processed. There is no commitment or cluster provisioning required, and you can scale your streaming jobs up or down based on your business needs.
Reliability
As a managed service, Stream Analytics guarantees event processing with 99.9% availability, helps prevent data loss, and provides business continuity; refer to the Stream Analytics SLA page for more details. Stream Analytics can process millions of events every second and deliver results with low latency. It guarantees exactly-once event processing and at-least-once delivery of events, and it has built-in recovery capabilities in case the delivery of an event fails. Stream Analytics can internally maintain the state of your job, so you can restart a job from its last output time and get repeatable results, that is, the same results every time. This enables you to go back in time and investigate computations when doing root-cause analysis.
Performance
Azure Stream Analytics is optimized for high performance: it can process streaming data and perform in-memory computations, and it allows you to scale up or scale down to handle real-time and complex event processing applications. Stream Analytics supports performance through partitioning: a complex query can be parallelized and executed on multiple streaming nodes.
What are your options when choosing a technology for real-time processing?
In Azure, all of the following technologies meet the core requirements for real-time processing:
Azure Stream Analytics
HDInsight with Spark Streaming
HDInsight with Storm
Apache Spark in Azure Databricks
Below you can see the main differences between them. Programmability support, pricing models, and supported inputs and sinks are some of the key factors to consider in your decision.
Key Selection Criteria
For real-time processing scenarios, begin choosing the appropriate service for your needs by answering these questions:
Do you prefer a declarative or imperative approach to authoring stream processing logic?
Do you need built-in support for temporal processing or windowing?
Does your data arrive in formats besides Avro, JSON, or CSV? If yes, consider options that support any format using custom code.
Do you need to scale your processing beyond 1 GB/s? If yes, consider the options that scale with the cluster size.
Exploring Different Architectures
- Architecture using Kafka
Get insights from live streaming data with ease. Capture data continuously from any IoT device, or logs from website clickstreams, and process it in near-real time.
1- Easily ingest live streaming data for an application using Apache Kafka cluster in Azure HDInsight.
2- Bring together all your structured data using Azure Data Factory to Azure Blob Storage.
3- Take advantage of Azure Databricks to clean, transform, and analyze the streaming data, and combine it with structured data from operational databases or data warehouses.
4- Use scalable machine learning/deep learning techniques, to derive deeper insights from this data using Python, R or Scala, with inbuilt notebook experiences in Azure Databricks.
5- Leverage native connectors between Azure Databricks and Azure SQL Data Warehouse to access and move data at scale.
6- Build analytical dashboards and embedded reports on top of Azure SQL Data Warehouse to share insights within your organization and use Azure Analysis Services to serve this data to thousands of users.
7- Power users take advantage of the inbuilt capabilities of Azure Databricks and Azure HDInsight to perform root cause determination and raw data analysis.
8- Take the insights from Azure Databricks to Cosmos DB to make them accessible through real time apps.
- Architecture using Cosmos DB and Apache Spark
Azure Cosmos DB is a fast and flexible globally replicated database, well-suited for IoT, gaming, retail, and operational logging applications. A common design pattern in these applications is to use changes to the data to kick off additional actions. These additional actions could be any of the following:
Triggering a notification or a call to an API when a document is inserted or modified.
Stream processing for IoT or performing analytics.
Additional data movement by synchronizing with a cache, search engine, or data warehouse, or archiving data to cold storage.
The change feed support in Azure Cosmos DB enables you to build efficient and scalable solutions for each of these patterns.
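As a rough sketch of consuming the change feed outside of Spark, the snippet below reads it with the azure-cosmos Python SDK; the account endpoint, key, database, and container names are placeholders, and a Spark-based pipeline like the one described next would normally use the Cosmos DB Spark connector instead.

```python
from azure.cosmos import CosmosClient

# Placeholder account endpoint, key, database, and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("iot").get_container_client("telemetry")

# Read all changes from the beginning of the container's change feed.
for doc in container.query_items_change_feed(is_start_from_beginning=True):
    # Each inserted or updated document can trigger downstream actions here.
    print(doc["id"])
```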
The change feed feature allows downstream streaming and batch compute to process the data. This allows your systems to keep in sync with your data while processing it in real time.
Real-time analytics starts from the lambda architecture, which allows you to make sense of data both in batch and in real time by having separate batch and speed layers to push your data into.
The batch layer is composed of a master dataset plus the ability to precompute batch views that are exposed through the serving layer. Meanwhile, the speed layer provides real-time answers to compensate for the serving layer’s latency. All queries can be answered by merging the real-time and batch views.
By using the Cosmos DB change feed you can push the data into the batch layer only, simplifying operations. The Cosmos DB change feed streams the data in real time to HDInsight, where Apache Spark processes it.
We can use the same Spark infrastructure to perform computations over data from Cosmos DB.
The results of batch aggregated queries, for example the number of tweets about machine learning over the past few days or months, are pushed into another Cosmos DB collection, so that many clients or users can query the results concurrently.
Meanwhile, Spark Streaming runs another job that constantly processes the change feed data; this speed layer calculates algorithms or aggregations at lower latency.
For example, while the serving layer provides the number of tweets about machine learning over the past few days, the speed layer provides the number of tweets about machine learning over the past few seconds.
So, we can simplify the lambda architecture by using only Cosmos DB and HDInsight Spark.
References and Additional Resources:
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction
https://www.youtube.com/watch?v=iBHRlYFGLSM
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features
https://dev.to/fedejsoren/amqp-vs-http
https://www.youtube.com/watch?v=tDAoIzF-7q0