Codit Blog

Posted on Tuesday, October 17, 2017 12:53 PM

by Tom Kerkhove

Auto-scaling is not only a great way to optimize your costs, but also a flexible way of doing asynchronous processing. We will look at how Azure Monitor Autoscale allows you to define auto-scaling rules, what the caveats are and what would be good additions to the service.

Building scalable systems is crucial for any cloud platform.

One way to achieve this is to decouple your frontend nodes from your backend processing by using the Competing Consumers pattern. This makes it possible to easily add more processing instances (scale out) when the workload grows, i.e. when messages are piling up in the queue.
Automating things is always great, but it is crucial to be aware of what is going on in your platform. This is often forgotten, but it should be part of your monitoring as well.
Once everything is set up, you can save money by optimizing your resources based on your needs, instead of overprovisioning.

A question I have received a couple of times is - Great! But how do I do that?

Enter Azure Monitor Autoscale

Azure Monitor Autoscale enables you to define rules that will automatically scale your workloads based on specific metrics.

These metrics can come from Service Bus queues, Storage queues, Application Insights, custom metrics and more. Currently, Azure Monitor Autoscale is limited to workloads running on Azure Cloud Services (yes, you've read that right!), App Service Plans and/or Virtual Machine Scale Sets.

When more advanced auto-scaling rules are required, you can define multiple autoscale conditions. This allows you to vary your scaling based on day of the week, time of day or even date ranges.

This is really useful because it allows you to scale more aggressively over the weekend, when more people are buying products than during working hours. Date ranges are also interesting because you can define specific rules for a specific period, for example when you are launching a new marketing campaign and expect more traffic.
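
To make that more concrete, here is a rough sketch of what such scheduled profiles look like in the ARM schema for Microsoft.Insights/autoscaleSettings, expressed as Python dictionaries. The names, capacities, time zone and dates are illustrative assumptions, not values from this post:

# Two scheduled autoscale profiles, sketched against the ARM schema for
# Microsoft.Insights/autoscaleSettings; all names and numbers are made up.
weekend_profile = {
    "name": "Weekend profile",
    "capacity": {"minimum": "4", "maximum": "10", "default": "4"},
    "rules": [],  # scale rules omitted here; see the rule sketch further down
    "recurrence": {
        "frequency": "Week",
        "schedule": {
            "timeZone": "W. Europe Standard Time",
            "days": ["Saturday", "Sunday"],
            "hours": [0],
            "minutes": [0],
        },
    },
}

campaign_profile = {
    "name": "Marketing campaign",
    "capacity": {"minimum": "6", "maximum": "20", "default": "6"},
    "rules": [],
    "fixedDate": {
        "timeZone": "W. Europe Standard Time",
        "start": "2017-11-20T00:00:00",
        "end": "2017-11-27T00:00:00",
    },
}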

Configuring auto-scaling for an Azure Service Bus Queue

Sello hosts an online platform for selling items and would like to improve its scalability. To achieve this, they want to start auto-scaling their worker role based on the message count of their Service Bus queue.

In order to configure it, we need to go to "Azure Monitor" and click on "Autoscale". There you will get an overview of all resources that can be autoscaled and their current status:

As you can see, no auto-scaling is configured yet. We can easily add it by clicking on the specific role we'd like to autoscale.

When no auto-scaling is configured you can easily change the current instance count, or you can enable auto-scaling and define the profile that fits your needs.

Each auto-scaling condition has a name and contains a set of scaling rules that will trigger a scaling action. Next to that, it gives you the capability to limit the instance count to a minimum and maximum number of instances.

When adding a scale rule, you can select the metric you want to scale on and define the criteria that trigger the action you want to perform, being scaling out or in.

A cooldown allows your platform to catch up after the previous scaling activity. This avoids adding even more instances while the previous scale action has actually already mitigated the load.

In this case, we're adding a rule to add 2 instances when the active message count is greater than 2000 with a cooldown of 15 minutes.

Scaling out is great, scaling in is even better! Just follow the same principle; here we're removing 1 instance when the criteria are met.
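
As a rough illustration, the two rules described above could look like this in the same ARM schema, again sketched as Python dictionaries. The resource URI, metric name, time windows and the scale-in threshold are assumptions for the example, not values taken from the actual configuration:

# Sketch of the scale-out and scale-in rules, following the ARM schema for
# Microsoft.Insights/autoscaleSettings metric triggers.
queue_id = ("/subscriptions/<subscription-id>/resourceGroups/sello"
            "/providers/Microsoft.ServiceBus/namespaces/sello-bus")

scale_out_rule = {
    "metricTrigger": {
        "metricName": "ActiveMessageCount",  # the "Active Message Count" metric; exact name may differ
        "metricResourceUri": queue_id,
        "timeGrain": "PT1M",
        "statistic": "Average",
        "timeWindow": "PT5M",
        "timeAggregation": "Average",
        "operator": "GreaterThan",
        "threshold": 2000,
    },
    "scaleAction": {
        "direction": "Increase",
        "type": "ChangeCount",
        "value": "2",          # add 2 instances
        "cooldown": "PT15M",   # wait 15 minutes before scaling again
    },
}

scale_in_rule = {
    "metricTrigger": {
        "metricName": "ActiveMessageCount",
        "metricResourceUri": queue_id,
        "timeGrain": "PT1M",
        "statistic": "Average",
        "timeWindow": "PT10M",
        "timeAggregation": "Average",
        "operator": "LessThan",
        "threshold": 500,      # assumed scale-in threshold
    },
    "scaleAction": {
        "direction": "Decrease",
        "type": "ChangeCount",
        "value": "1",          # remove 1 instance
        "cooldown": "PT15M",
    },
}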

Once everything is configured, your role will start auto-scaling and the configuration looks similar to this:


Creating awareness about auto-scaling

Woohoo, auto-scaling! Awesome!

Well, it's great, but we're not done yet. Be aware of how your platform is auto-scaling. By using the Run History you can get an overview of your recent scaling activities and learn from them. Creating scaling definitions is not easy and they should be re-evaluated frequently.

As you can see below, we can handle the load without any problem but it can be improved by scaling down more aggressively.

A more proactive way of monitoring this is by using notifications, where you can either use email notifications or trigger an HTTP webhook when a scaling action is happening.

This is very handy when you want to create awareness about these actions. An easy way to achieve this is to create a Logic App that handles these events, similar to how I did this for Azure Alerts.

You can use one centralized handler for this or create dedicated handlers, based on your use-case. I personally prefer to use a centralized handler because it makes it easier to maintain if the handling is the same for all.
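
If you'd rather prototype such a handler outside of Logic Apps, a minimal webhook receiver could look like the sketch below. It assumes Flask, and it assumes the notification payload carries an operation and the old/new capacity in its context; verify the exact schema of the notifications before relying on these field names.

# Minimal sketch of a centralized autoscale notification handler.
from flask import Flask, request

app = Flask(__name__)

@app.route("/autoscale-notifications", methods=["POST"])
def handle_autoscale_notification():
    payload = request.get_json(force=True) or {}
    context = payload.get("context", {})
    # Log the scaling action; in a real handler you would forward this to
    # Slack, Teams, a Logic App, ... whatever creates awareness in your team.
    print(
        f"{payload.get('operation', 'Unknown operation')} on "
        f"{context.get('resourceName', '<unknown resource>')}: "
        f"{context.get('oldCapacity', '?')} -> {context.get('newCapacity', '?')} instances"
    )
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)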

When we put everything together, this is a high-level overview of all the settings for auto-scaling.

If we were to add a new autoscale condition, we'd have to specify the period in which it is in effect; during that period all other scaling conditions are ignored.

Caveats

Defining auto-scaling rules is not easy, and it comes with a few caveats:

Be careful what metric you are auto-scaling on and make sure that it's the correct one. Unfortunately, I've seen a case where we were stuck in an infinite scaling loop because we were auto-scaling our worker roles based on the Message Count of a Service Bus queue. However, Message Count not only includes the active messages, but also the dead-lettered messages, which weren't going away. What we ended up doing was changing our auto-scaling metric to Active Message Count, which is what we were actually interested in here.

This brings me to monitoring your auto-scaling. This is not only important to detect issues, as I've just mentioned, but also to learn how your platform is scaling and to continuously improve your scaling criteria. It is something that needs to grow, since it is use-case specific.

Protect your budget and include instance limits on your auto-scaling conditions. This will protect you from burning through your resource budget in case something goes wrong, or when having to wait a little longer is not a problem.

Taking auto-scaling to the next level

Azure Monitor Autoscale is great as it is today, but I see a couple of features that would be nice to have:

  • Scaling Playbooks - Similar to Azure Alerts & Security Center's Security Playbooks, it would be great to have native integration with Azure Logic Apps. This would not only make it easier, but also encourage people to use a centralized workflow for handling these kinds of notifications. Next to that, it would also make it easier to link both resources together, instead of having to copy the URL of the HTTP connector into your Logic App.
  • Event-Driven Auto-scaling - The current auto-scaling is awesome and it provides a variety of metric sources. However, with the launch of Azure Event Grid, it would be great to see Azure Monitor Autoscale evolve to support an event-based approach as well:
    • Autoscale when certain events are pushed by Azure Event Grid, reacting to events instead of polling a specific metric
    • Emit auto-scaling events when actions are started or finalized. That would allow subscribers to react to them instead of relying on a single webhook. This also provides more extensibility: instead of notifying only one webhook, we can open it up to everybody who is interested

That said, I think having both a metric-based and an event-based model would be the sweet spot, as each supports its own use cases.

Conclusion

With Azure Monitor Autoscale it is really easy to define auto-scaling rules that handle all the scaling for you, but you need to be careful with it. Having a good monitoring approach here is the key to success.

Every powerful tool comes with a responsibility.

Thanks for reading,

Tom

Categories: Azure
written by: Tom Kerkhove

Posted on Friday, May 19, 2017 11:29 AM

by Toon Vanhoutte

Receiving data from a SQL table and processing it towards other back-end systems is a very common use case in integration. Logic Apps has all the required functionality in its toolbox to fulfill this integration need. This blog post explains how you can do this in a reliable fashion, in case you're dealing with mission-critical interfaces where no data loss is accepted.

Scenario

Let's discuss the scenario briefly.  We need to consume data from the following table.  All orders with the status New must be processed!

The table can be created with the following SQL statement:
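
A minimal sketch of such a table, created here through pyodbc; the column list is an assumption, with Description standing in for the business payload and PeekedOn added because it will be useful for the peek-lock pattern later in this post:

# Sketch of the Orders table, created here through pyodbc; replace the
# placeholders in the connection string with your own values.
import pyodbc

CONNECTION_STRING = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

CREATE_ORDERS_TABLE = """
CREATE TABLE dbo.Orders (
    Id          INT IDENTITY(1,1) PRIMARY KEY,
    Description NVARCHAR(255) NOT NULL,               -- stands in for the business payload
    Status      NVARCHAR(20)  NOT NULL DEFAULT 'New', -- New / Peeked / Completed / Processed
    PeekedOn    DATETIME2     NULL                    -- used later to requeue stale peeks
);
"""

with pyodbc.connect(CONNECTION_STRING) as connection:
    connection.execute(CREATE_ORDERS_TABLE)
    connection.commit()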

First Attempt

Solution

To receive the data, I prefer to create a stored procedure. This avoids maintaining potentially complex SQL queries within your Logic App. The following stored procedure selects the first order with status New and updates its status to Processed in the same statement. Note that it also returns @@ROWCOUNT, as this will come in handy in the next steps.

The Logic App fires with a Recurrence trigger.  The stored procedure gets executed and via the ReturnCode we can easily determine whether it returned an order or not.  In case an order is retrieved, its further processing can be performed, which will not be covered in this post.
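
A possible shape for that stored procedure, together with a small poller that mimics what the recurrence-triggered Logic App does, is sketched below. The procedure and column names are assumptions based on the description above, not the original code:

# Sketch of the first attempt: a stored procedure that picks up one New order
# and marks it Processed in a single statement, plus a simple poller.
import pyodbc

CREATE_GET_NEW_ORDER = """
CREATE PROCEDURE dbo.GetNewOrder
AS
BEGIN
    SET NOCOUNT ON;

    -- Select one New order and flag it as Processed in the same statement,
    -- returning the picked-up row to the caller.
    WITH NextOrder AS (
        SELECT TOP (1) Id, Description, Status
        FROM dbo.Orders WITH (ROWLOCK, READPAST, UPDLOCK)
        WHERE Status = 'New'
        ORDER BY Id
    )
    UPDATE NextOrder
    SET Status = 'Processed'
    OUTPUT inserted.Id, inserted.Description;

    -- 1 when an order was picked up, 0 when there was nothing to do;
    -- this is the ReturnCode the Logic App's SQL connector exposes.
    RETURN @@ROWCOUNT;
END
"""
# Deploy the procedure once, e.g. connection.execute(CREATE_GET_NEW_ORDER).

def poll_once(connection_string: str) -> None:
    # Roughly what the recurrence-triggered Logic App does on every run.
    with pyodbc.connect(connection_string) as connection:
        cursor = connection.cursor()
        cursor.execute("{CALL dbo.GetNewOrder}")
        order = cursor.fetchone()
        if order is not None:
            print(f"Received order {order.Id}: {order.Description}")
        else:
            print("No new orders, sleeping until the next run")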

Evaluation

If you have a BizTalk background, this is similar to using a polling SQL receive location. There is one very important difference: the BizTalk receive adapter executes the stored procedure within the same distributed transaction in which it persists the data in the MessageBox, whereas Logic Apps is completely built on APIs that have no notion of MSDTC at all.

In failure situations, when a database shuts down or the network connection drops, it could be that the order is already marked as Processed, but never reaches the Logic App. Depending on the returned error code, your Logic App will either end up in a Failed state without a clear description, or it will retry automatically (for error codes 429 and 5xx). In both situations you're facing data loss, which is not acceptable for our scenario.

Second Attempt

Solution

We need to come up with a reliable way of receiving the data. Therefore, I suggest implementing a pattern similar to the Azure Service Bus peek-lock. Data is received in two phases:

  1. You mark the data as Peeked, which means it has been assigned to a receiving process
  2. You mark the data as Completed, which means it has been received by the receiving process

Next to these two explicit processing steps, there must be a background task that reprocesses messages that have had the Peeked status for too long. This makes our solution more resilient.

Let's create the first stored procedure that marks the order as Peeked.

The second stored procedure accepts the OrderId and marks the order as Completed.

The third stored procedure should be executed by a background process, as it sets the status back to New for all orders that have had the Peeked status for more than 1 hour.
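
A possible shape for these three stored procedures is sketched below; the procedure names and the PeekedOn timestamp used to detect stale peeks are assumptions, not the original code from this post:

# Sketch of the three stored procedures behind the peek/complete pattern,
# assuming the Orders table shown earlier. Deploy them with your favourite
# tool; pyodbc works fine for a quick test.

PEEK_NEW_ORDER = """
CREATE PROCEDURE dbo.PeekNewOrder
AS
BEGIN
    SET NOCOUNT ON;

    -- Phase 1: assign one New order to a receiving process.
    WITH NextOrder AS (
        SELECT TOP (1) Id, Description, Status, PeekedOn
        FROM dbo.Orders WITH (ROWLOCK, READPAST, UPDLOCK)
        WHERE Status = 'New'
        ORDER BY Id
    )
    UPDATE NextOrder
    SET Status = 'Peeked', PeekedOn = SYSUTCDATETIME()
    OUTPUT inserted.Id, inserted.Description;

    RETURN @@ROWCOUNT;
END
"""

COMPLETE_ORDER = """
CREATE PROCEDURE dbo.CompleteOrder
    @OrderId INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Phase 2: the receiving process confirms it has received the order.
    UPDATE dbo.Orders
    SET Status = 'Completed'
    WHERE Id = @OrderId AND Status = 'Peeked';
END
"""

REQUEUE_STALE_ORDERS = """
CREATE PROCEDURE dbo.RequeueStaleOrders
AS
BEGIN
    SET NOCOUNT ON;

    -- Background task: give orders that were peeked but never completed
    -- back to the pool after one hour.
    UPDATE dbo.Orders
    SET Status = 'New', PeekedOn = NULL
    WHERE Status = 'Peeked'
      AND PeekedOn < DATEADD(HOUR, -1, SYSUTCDATETIME());
END
"""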

Let's now consume the two stored procedures from within our Logic App. First we peek for a new order and, when we receive one, the order gets completed. The OrderId is retrieved via this expression: @body('Execute_PeekNewOrder_stored_procedure')?['ResultSets']['Table1'][0]['Id']

The background task could be executed by a SQL Agent Job (SQL Server only) or by another Logic App that is fired every hour.

Evaluation

Happy with the result? Not 100%! What if something goes wrong during further downstream processing of the order? The only way to reprocess the message is by changing its status in the origin database, which can be quite a cumbersome experience for operators. Why can't we just resume the Logic App in case of an issue?

Third Attempt

Solution

As explained over here, Logic Apps has an extremely powerful mechanism of resubmitting workflows. Because Logic Apps has - at the time of writing - no triggers for SQL Server, a resubmit of the Recurrence trigger is quite useless. Therefore I only want to complete my order when I'm sure that I'll be able to resubmit it if something fails during its further processing. This can be achieved by splitting the Logic App into two separate workflows.

The first Logic App peeks for the order and parses the result into a JSON representation. This JSON is passed to the next Logic App.

The second Logic App gets invoked by the first one. This Logic App first completes the order and afterwards performs the further processing. In case something goes wrong, a resubmit of the second Logic App can be initiated.

Evaluation

Very happy with the result as:

  • The data is received from the SQL table in a reliable fashion
  • The data can be resumed in case further processing fails

Conclusion

Don't forget that every action is HTTP based, which can have an impact on reliability. Consider a two-phased approach for receiving data in case you cannot afford message loss. The same principle can also be applied to receiving files: read the file content in one action and delete the file in another action. Always think upfront about resume/resubmit scenarios. Triggers are better suited for resubmit than actions, so if there are triggers available: always use them!

This may sound like overkill to you, as these considerations will require some additional effort. My advice is to first determine whether your business scenario must cover such edge-case failure situations. If yes, this post can be a starting point for your final solution design.

Liked this post? Feel free to share with others!
Toon


Categories: Azure
written by: Toon Vanhoutte

Posted on Monday, August 21, 2017 10:47 AM

by Tom Kerkhove

Azure Event Grid is here - In this first article we'll have a look at what it is, dive into the details and discuss certain new scenarios.

Last week Microsoft announced Azure Event Grid (Preview), an event-driven service that allows you to stitch together all your components and design event-driven architectures.

Next to the built-in support for several Azure services, you can also provide your own custom topics and custom webhooks that fit your needs.

By using a combination of filters and multicasting, you can create a flexible event routing mechanism that fits your needs, for example by sending event A to one handler while event B is multicast to multiple handlers. Read more about this here.

Azure resources can act as Event Publishers where they send a variety of events to Event Grid. By using Event Subscriptions you can then subscribe to those events and send them to an Event Handler.

The main scenarios for Azure Event Grid are serverless architectures, automation for IT/operations and integration:

  • Serverless Architectures - Trigger a Logic App when a new blob is uploaded
  • Operations - Listen & react on what happens in your subscription by subscribing to Azure Subscription changes
  • Integration - Extend existing workflows by triggering a Logic App once there is a new record in your database
  • Custom - Create your own by using application topics (aka custom topics)

The pricing for Azure Event Grid is fairly simple - you pay $0.60 per million operations and you get the first 100k operations per month for free. Operations are defined as event ingress, advanced match, delivery attempt, and management calls. Currently you only pay $0.30 per million operations since the service is in public preview; more information is on the pricing page.
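
As a quick back-of-the-envelope illustration of how that adds up:

# Rough cost estimate at the announced general-availability price of
# $0.60 per million operations, with the first 100k operations free.
def monthly_cost(operations: int, price_per_million: float = 0.60) -> float:
    billable = max(0, operations - 100_000)
    return round(billable / 1_000_000 * price_per_million, 2)

print(monthly_cost(5_000_000))  # 5 million operations a month -> about $2.94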

Basically you can see Azure Event Grid as an extension service that allows you to integrate Azure Services with each other more closely while you also have the flexibility to plug in your own custom topics.

Let's have a closer look at what it has to offer.

Diving into Azure Event Grid

Event Handling at Scale

Azure Event Grid is designed as a highly scalable eventing backplane which comes with some serious performance targets:

  • Guaranteed sub-second end-to-end latency (99th percentile)
  • 99.99% availability
  • 10 million events per second, per region
  • 100 million subscriptions per region
  • 50 ms publisher latency for batches of 1M

These are very big numbers, which also indirectly have an impact on the way we design our custom event handlers. They will need to be scalable, protect themselves from being overwhelmed and come with a throttling mechanism.

But then again, designing for the cloud typically means that each component should be highly scalable & resilient so this should not be an exception.

Durable Message Delivery

Every event will be pushed to the required Event Handler based on the configured routing. For this, Azure Event Grid provides durable message delivery with at-least-once delivery.

By using retries with exponential backoff, Event Grid keeps on sending events to the Event Handler until it acknowledges the request with either an HTTP 200 OK or HTTP 202 Accepted.

The Event Handler needs to be capable of processing the event in less than one minute; otherwise Event Grid will consider it failed and retry it. This means that all Event Handlers should be idempotent to avoid creating invalid state in your system.

However, if your Event Handler is unable to process the event in time and Event Grid has been retrying for up to 24 hours (2 hours in public preview), it will expire the event and stop retrying.

In summary, Event Grid guarantees at-least-once delivery for all your events, but you, as the owner of an Event Handler, are still in charge of processing the event in time. This also means that handlers should be able to preserve performance when dealing with load spikes.
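
A minimal handler sketch, assuming Flask and the Event Grid event schema (id, eventType, subject, data); the in-memory deduplication is purely illustrative and real work would be queued rather than done inline:

# Sketch of an Event Grid handler that acknowledges fast and stays idempotent
# by remembering which event ids it has already processed.
from flask import Flask, request, jsonify

app = Flask(__name__)
processed_event_ids = set()  # use a durable store in a real handler

@app.route("/events", methods=["POST"])
def handle_events():
    events = request.get_json(force=True) or []
    for event in events:
        # Answer the subscription validation handshake (see the security section below).
        if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
            return jsonify({"validationResponse": event["data"]["validationCode"]})

        if event["id"] in processed_event_ids:
            continue  # at-least-once delivery: we may see the same event twice
        processed_event_ids.add(event["id"])
        print(f"Handling {event['eventType']} for subject {event.get('subject')}")

    # Return 200/202 well within the one-minute window so Event Grid stops retrying.
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)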

It is also interesting to see what really happens with the expired events. Do they really just go away, or will there be a fallback event stream to which they are forwarded for later processing? In general, I think expiration of events will work, but in certain scenarios having a fallback event stream would be a valuable asset for mission-critical event-driven flows.

You can read more on durable message delivery here.

How about security?

Azure Event Grid offers a variety of security controls on all levels:

  • Managing security on the Event Grid resource itself is done with Role-Based Access Control (RBAC). It allows you to grant granular permissions to the correct people. It's a good practice to use the least-privilege principle, but that is applicable to all Azure resources. More information here.
  • Webhook Validation - Each newly registered webhook needs to be validated by Azure Event Grid first. This is to prove that you have ownership over the endpoint. The service will send a validation token to the webhook, which the webhook implementer needs to send back as a validation. It's important to note that only HTTPS webhooks are supported. More information here.
  • Event Subscription uses Role-based Access Control (RBAC) on the Event Grid resource where the person creating a new subscription needs to have the Microsoft.EventGrid/EventSubscriptions/Write permissions.
  • Publishers need to use SAS tokens or key authentication when they want to publish an event to a topic. SAS tokens allow you to scope the access you grant to a certain resource in Event Grid for a certain amount of time. This is similar to the approach Azure Storage & Azure Service Bus use. A small publishing sketch follows this list.
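
Here is a small sketch of publishing a custom event to a topic with key authentication; the topic endpoint, key and event type are placeholders:

# Publish a custom event to an Event Grid topic using key authentication.
import uuid
from datetime import datetime, timezone

import requests

TOPIC_ENDPOINT = "https://<your-topic>.westus2-1.eventgrid.azure.net/api/events"
TOPIC_KEY = "<topic-access-key>"

event = [{
    "id": str(uuid.uuid4()),
    "eventType": "Sello.Orders.OrderCreated",  # made-up custom event type
    "subject": "sello/orders/1234",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "data": {"orderId": 1234, "amount": 42.50},
    "dataVersion": "1.0",
}]

response = requests.post(TOPIC_ENDPOINT, json=event, headers={"aeg-sas-key": TOPIC_KEY})
response.raise_for_status()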

The current security model looks fine to me, although it would be nice if there were a concept of SAS tokens with a stored access policy, similar to Azure Storage. This would allow us to issue tokens for a certain entity, while still having the capability to revoke access in case we need to, i.e. when a token was compromised.

An alternative to SAS stored access policies would be the ability to create multiple authorization rules, similar to Azure Service Bus, where we can use the key approach for authentication while still having more granular control over who uses which key, and being able to revoke it for one publisher only instead of revoking it for all publishers.

You can read more on security & authentication here.

Imagine the possibilities

Integration with other Azure services

As of today only a few Azure services integrate with Azure Event Grid, but a lot more are coming.

Here are a couple of them that I would love to use:

  • Use API Management as a public-facing endpoint where all events are transformed and sent over to Azure Event Grid. This would allow us to use API Management as a webhook proxy between the 3rd party and Azure Event Grid. More on this later in the post
  • Streamlined event processing for Application Insights custom events where it acts as an Event Publisher. By doing this we can push them to our data store so that we can use it in our Power BI reporting, instead of having to export all telemetry and setting up a processing pipeline for that, as described here
  • Real-time auditing & change notifications for Azure Key Vault
    • Publish events when a new version of a Key or Secret was added to notify dependent processes about this so they can fetch the latest version
    • Real-time auditing by subscribing to changes on the access policies
  • Sending events when alerts in Azure Monitor are triggered would be very useful. In the past I've written about how using webhooks for processing alerts instead of emails is more interesting, as you can trigger an automation workflow such as a Logic App. If an alert would send an event to Azure Event Grid, we could take it even a step further and create dedicated handlers per alert or alert group. You can already achieve this today with Logic Apps & Service Bus Topics, but with Event Grid this comes out of the box and makes it easier to create certain routings
  • Trigger an Azure Data Factory when an event occurs, i.e. when a blob was added to an Azure Storage container
  • Send an event when Azure Traffic Manager detects a probe that is unhealthy

New way of handling webhook events?

When we want to allow 3rd parties to send notifications to a webhook, we need to provide a public endpoint that they can call. Typically, these endpoints just take the event and queue it for later processing, allowing the 3rd party to move on while we handle the event at our own pace.

The "problem" here is that we still need to host API middleware somewhere - be it an Azure Function, Web App, API App, etc. - that just handles this message. Even if you use Azure API Management, you still need to have the middleware running behind the API Management proxy, since you can't push directly to a topic.

Wouldn't it be nice if we can get rid of that host and let API Management push the requests directly to Azure Event Grid so that it can fan-out all the external notifications to the required processors?

That said, this assumes that you don't do any validation or other business logic before the webhook middleware pushes to the topic for processing. If you need this capability, you will have to stick with hosting your own middleware, I'm afraid.

Unified integration between APIs

Currently, when you are using webhooks inside your infrastructure, the Event Publishers often call webhooks directly, creating a spaghetti infrastructure. This is hard to manage, since each Event Publisher needs to have the routing logic inside its own component.

By using Azure Event Grid we can route all events through Azure Event Grid and use it as an event broker, or routing hub if you will, thus decoupling the Event Publishers from the corresponding Event Handlers.

By doing this we can easily change the way we route events to new Event Handlers by simply changing the routing, not the routing logic in the Event Publishers.

Depending on the monitoring Azure Event Grid will provide, it can also offer a more generic approach to monitoring all event handling, instead of relying on the monitoring of each individual component. More on this in my next blog.

You can of course also use Azure Service Bus Topics for this, but that depends on the load you are expecting. As always, pick the technology that best fits the scenario.

Summary

Azure Event Grid is a unique service that has been added to Microsoft Azure and brings a lot to the table. It promises big performance targets and will enable new scenarios, certainly in the serverless landscape.

I'm curious to see how the service will evolve and what publishers & handlers will be coming soon. Personally, I think it's a big announcement, and I will give some more thought to how we can use it when building platforms on Microsoft Azure.

Want to learn more yourself? Here's a good Cloud Cover episode that will give you a high-level overview of Azure Event Grid or read about the concepts of Event Grid.

What features would you like to see being added to the service? In what scenarios do you see Event Grid as a good fit? Feel free to mention them in the comments!

Thanks for reading,

Tom Kerkhove.