azure-samples / streaming-at-scale
How to implement a streaming at scale solution in Azure
License: MIT License
Instead of using DBFS to store the Delta table, Azure Blob Storage should be used.
Rewrite the Azure Functions with .NET Core, since v2 is supposed to give better-performing functions.
In the solution, generation of duplicate events in locust is disabled (23bd9d1#diff-82da803b8f58c37426d86bd254ace7e4R222) for Stream Analytics, as no deduplication mechanism exists. Even now, if the ASA infrastructure fails and is restarted, duplicates could cause the job to fail, since there is a primary key on EventId in Azure SQL.
The rate of generated events sometimes deviates from the target. In the run below, IncomingMessages should be 600k per minute but hovers around 400k:
cd eventhubs-streamanalytics-eventhubs
./create-solution.sh -d algatkssc29 -t 10 -l westeurope
[...]
starting to monitor locusts for 20 seconds...
locust is sending 956.6666666666666 messages/sec
locust is sending 1697.8 messages/sec
locust is sending 2426.714285714286 messages/sec
locust is sending 3016.8888888888887 messages/sec
locust is sending 4457.5 messages/sec
locust is sending 5731.3 messages/sec
locust is sending 6912.6 messages/sec
locust is sending 7900.7 messages/sec
locust is sending 8935.6 messages/sec
locust is sending 9352.6 messages/sec
monitoring done
locust monitor available at: http://51.105.174.161:8089
***** [M] Starting METRICS reporting
Event Hub capacity: 10 throughput units (this determines MAX VALUE below).
Reporting aggregate metrics per minute, offset by 1 minute, for 30 minutes.
IncomingMessages IncomingBytes OutgoingMessages OutgoingBytes ThrottledRequests
---------------- ------------- ---------------- ------------- ------------------
MAX VALUE 600000 600000000 2457600 1200000000 -
---------------- ------------- ---------------- ------------- ------------------
2019-07-08T07:34:27+0200 0 0 0 0 0
2019-07-08T07:35:00+0200 0 0 0 0 0
2019-07-08T07:36:00+0200 296922 596714414 291020 287368004 0
2019-07-08T07:37:00+0200 435748 881065804 415067 409844384 0
2019-07-08T07:38:00+0200 522704 1062592640 508476 502081594 0
2019-07-08T07:39:00+0200 496339 1007033882 481689 475630885 0
2019-07-08T07:40:00+0200 375336 744465210 347935 343561428 487
2019-07-08T07:41:00+0200 390234 776925168 365353 360760676 115
2019-07-08T07:42:00+0200 412583 831936433 396489 391500889 25
2019-07-08T07:43:00+0200 476078 955091213 451125 445464583 0
creating HDInsight cluster
Deployment failed. Correlation ID: xxx. Internal server error occurred while processing the request. Please retry the request or contact support.
I am running through the eventhubs-streamanalytics-cosmosdb sample and get the error below when I run the create-solution.sh script. Any idea how I can fix it?
CC @yorek
thanks.
***** [M] Starting METRICS reporting
./create-solution.sh: line 208: ./05-report-throughput.sh: Permission denied
There was an error, execution halted
Error at line 208
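A likely cause of the "Permission denied" is that 05-report-throughput.sh lost its executable bit (for example after a checkout on a filesystem that drops permissions). A minimal sketch of the failure and fix, using a hypothetical stand-in script rather than the real repo file:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for ./05-report-throughput.sh: a script that exists but is
# not executable, which reproduces the "Permission denied" failure.
tmpdir=$(mktemp -d)
printf '#!/bin/sh\necho throughput-report\n' > "$tmpdir/05-report-throughput.sh"

# Restoring the executable bit (on the real scripts: chmod +x *.sh) fixes it.
chmod +x "$tmpdir/05-report-throughput.sh"
"$tmpdir/05-report-throughput.sh"

rm -rf "$tmpdir"
```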
As per the Event Hubs Kafka samples, please add support for Cosmos DB via the MongoDB API.
Selecting data based on sample queries produces errors:
jdbc:drill:zk=local> select * from azure.`algatssp01ingest/algatssp01ingest-2/2019_07_07_21_11_49_0.avro` limit 10;
Error: INTERNAL_ERROR ERROR: Avro union type must be of the format : ["null", "some-type"]
Document duplicate event generation/deduplication mechanisms in README files and solution bootstrap template
Hey! As we were discussing during OneWeek, it would be nice to document the scenarios that best fit each of the samples. What do you think? Thanks a lot!
The user could have configured the AZ CLI not to return JSON results by default. This conflicts with jq usage, as jq expects JSON input. Make sure to explicitly use the -o json option every time AZ CLI output is piped into jq.
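A minimal sketch of the failure mode and the fix, assuming jq is on the PATH; the JSON payload below simulates what an `az ... -o json` call would return (the resource name is hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

# When the user's AZ CLI default output is "table", piping into jq fails,
# because jq only accepts JSON. Simulated table-style output:
if echo "Name    Location" | jq -e '.name' >/dev/null 2>&1; then
  echo "unexpected: jq accepted table output"
else
  echo "jq rejects non-JSON input"
fi

# Forcing JSON with -o json makes the pipe reliable. Simulated here with
# the JSON an `az group show -o json` call could emit (values hypothetical):
name=$(echo '{"name": "bie3i8", "location": "eastus"}' | jq -r '.name')
echo "parsed name: $name"
```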
Steps to be executed: CIDTM
Configuration:
. Resource Group => bie3i8
. Region => eastus
. EventHubs => TU: 2, Partitions: 2
. Data Explorer => SKU: Standard_D11_v2
. Simulators => 1
Deployment started...
***** [C] Setting up COMMON resources
creating resource group
. name: bie3i8
. location: eastus
creating storage account
. name: bie3i8storage
***** [I] Setting up INGESTION
creating eventhubs namespace
. name: bie3i8eventhubs
. capacity: 2
. capture: False
. auto-inflate: false
creating eventhub instance
. name: bie3i8in-2
. partitions: 2
creating consumer group
. name: dataexplorer
***** [D] Setting up DATABASE
checking Key Vault exists
creating KeyVault bie3i8spkv
checking service principal exists
creating service principal
WARNING: The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
WARNING: 'name' property in the output is deprecated and will be removed in the future. Use 'appId' instead.
getting service principal
ERROR: Service principal 'http://bie3i8adx-reader' doesn't exist
What does "Service principal 'http://bie3i8adx-reader' doesn't exist" mean?
If the "zip" command is not available, the function deployment will fail.
Since the solution uses Azure Functions v1 and the v2 Functions host is now used by default, the deployed function no longer works.
Duplicate data is being generated and, as a result, Azure SQL is (correctly :)) preventing it from being inserted:
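A fail-fast preflight check could surface the missing tool before deployment starts instead of breaking halfway through. A sketch; the `require` helper is hypothetical, not part of the repo scripts:

```shell
#!/usr/bin/env bash

# Hypothetical preflight helper: fail fast when a required tool is missing,
# rather than letting the function deployment break midway.
require() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "required tool '$1' not found; please install it" >&2
    return 1
  }
}

require sh && echo "sh found"
require zip || echo "zip missing: function deployment would fail here"
```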
System.Data.SqlClient.SqlException (0x80131904): Violation of PRIMARY KEY constraint 'PK__#BAEC272__7944C811B84F8B93'. Cannot insert duplicate key in object 'dbo.@payload'. The duplicate key value is (5afa9381-61c1-4d15-ab19-8b616acf65da).
The data for table-valued parameter "@payload" doesn't conform to the table type of the parameter. SQL Server error is: 3602, state: 30
The statement has been terminated.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
at System.Data.SqlClient.SqlCommand.EndExecuteNonQuery(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
ClientConnectionId:43576d0a-f89c-43b9-b61f-4e82783d4c38
Error Number:2627,State:2,Class:14
ClientConnectionId before routing:8ed108b2-dcc0-4a47-bba7-ea13633358d6
Routing Destination:d7f1e4838529.tr2237.eastus1-a.worker.database.windows.net,11044
Add Kafka producer as a load generator to test Event Hubs Kafka API at scale
deploying Helm charts
. chart: zookeeper
Error: looks like "http://storage.googleapis.com/kubernetes-charts-incubator" is not a valid chart repository or cannot be reached: failed to fetch http://storage.googleapis.com/kubernetes-charts-incubator/index.yaml : 403 Forbidden
OutgoingMessages should keep up with IncomingMessages (600k per minute)
cd eventhubs-streamanalytics-cosmosdb
./create-solution.sh -d myrg -t 10 -l northeurope
[...]
starting to monitor locusts for 20 seconds...
locust is sending 872 messages/sec
locust is sending 1621 messages/sec
locust is sending 2697.375 messages/sec
locust is sending 3455.9 messages/sec
locust is sending 5471.5 messages/sec
locust is sending 6810.2 messages/sec
locust is sending 7971.5 messages/sec
locust is sending 9053 messages/sec
locust is sending 9561.8 messages/sec
locust is sending 9753.8 messages/sec
monitoring done
locust monitor available at: http://52.155.179.29:8089
***** [M] Starting METRICS reporting
Event Hub capacity: 12 throughput units (this determines MAX VALUE below).
Reporting aggregate metrics per minute, offset by 1 minute, for 30 minutes.
IncomingMessages IncomingBytes OutgoingMessages OutgoingBytes ThrottledRequests
---------------- ------------- ---------------- ------------- ------------------
MAX VALUE 720000 720000000 2949120 1440000000 -
---------------- ------------- ---------------- ------------- ------------------
2019-07-08T08:41:51+0200 0 0 0 0 0
2019-07-08T08:42:00+0200 0 0 0 0 0
2019-07-08T08:43:00+0200 53900 53219651 49959 49327981 0
2019-07-08T08:44:00+0200 515056 508587368 496420 490177954 0
2019-07-08T08:45:00+0200 598890 591349724 578479 571203437 0
2019-07-08T08:46:00+0200 600572 593017681 561561 554499015 0
2019-07-08T08:47:00+0200 597500 589985683 563482 556393599 0
2019-07-08T08:48:00+0200 600960 593408050 513331 506880384 0
2019-07-08T08:49:00+0200 601521 593959016 593205 585749886 0
2019-07-08T08:50:00+0200 601282 593717804 500150 493861481 0
2019-07-08T08:51:00+0200 604751 597149192 514832 508366889 0
2019-07-08T08:52:00+0200 608397 600748827 472418 466475537 0
2019-07-08T08:53:00+0200 606160 598535776 474963 468996517 0
2019-07-08T08:54:00+0200 607886 600239871 477438 471441458 0
2019-07-08T08:55:00+0200 605485 597875502 471086 465163684 0
2019-07-08T08:56:00+0200 608527 600871623 524951 518354312 0
2019-07-08T08:57:00+0200 607039 599406517 399120 394100780 0
2019-07-08T08:58:00+0200 607360 599739819 471705 465777118 0
2019-07-08T08:59:00+0200 607220 599573247 450127 444471194 0
2019-07-08T09:00:00+0200 609149 601490846 452120 446435778 0
2019-07-08T09:01:00+0200 609889 602221035 480420 474384460 0
2019-07-08T09:02:00+0200 608722 601070463 416933 411690081 0
2019-07-08T09:03:00+0200 603686 596098118 473337 467389152 0
2019-07-08T09:04:00+0200 590519 583094755 444143 438559398 0
2019-07-08T09:05:00+0200 600124 592579242 426720 421356704 0
2019-07-08T09:06:00+0200 606179 598555994 461778 455984029 0
2019-07-08T09:07:00+0200 601860 594294746 428422 423034922 0
2019-07-08T09:08:00+0200 603490 595902290 474911 468950668 0
2019-07-08T09:09:00+0200 602521 594955778 481977 475913196 0
2019-07-08T09:10:00+0200 604696 597083899 385932 381093727 0
***** Done
Use Delta MERGE commands
Hey! Also from a discussion that started during oneweek, maybe having blobs as part of the samples should be nice, since some people also use that for storing raw data as logs. What do you think about that?
Look for the sentence that starts with "As you'll notice when using funciont "Test1"". The word "function" is misspelt.
configmap/container-azm-ms-agentconfig created
deploying Helm
serviceaccount/tiller created
clusterrolebinding.rbac.authorization.k8s.io/tiller created
Error: unknown command "init" for "helm"
Did you mean this?
lint
Run 'helm --help' for usage.
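The `init` subcommand exists only in Helm 2; Helm 3 removed Tiller and `helm init` entirely, which explains the "unknown command" error. A version gate could keep the deploy script working on both; this is a sketch, and the version-parsing helper is an assumption, not code from the repo:

```shell
#!/usr/bin/env bash

# Parse the helm client's major version. Helm 2 prints "Client: v2.16.1+...",
# Helm 3 prints "v3.9.0+..."; grab the first v<digit> token either way.
helm_major() {
  helm version --short 2>/dev/null | grep -o 'v[0-9]' | head -n 1 | tr -d 'v'
}

if [ "$(helm_major)" = "2" ]; then
  # Helm 2 still needs Tiller bootstrapped into the cluster.
  helm init --service-account tiller
else
  echo "Helm 3+ (or no helm found): skipping 'helm init', Tiller no longer exists"
fi
```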
Use Kubernetes to deploy the locust cluster. Also make sure locust is configured to use the distributed feature: https://docs.locust.io/en/stable/running-locust-distributed.html
It's not necessary to use Locust if that is a problem with IoT Hub. Just create an app that can be containerized, and we'll use ACI to create many instances and add load to the test.
This is a great resource for anyone interested in streaming architectures and Azure, but please consider publishing a comparison of cost vs. performance across all the options.
Error from server (NotFound): services "xx-kafka-external-bootstrap" not found
Broker not yet deployed. Retrying...
Use Cosmos DB's merge functionality to perform an upsert in case of event duplication.
eventhubs-databricks-cosmosdb
eventhubs-functions-cosmosdb
eventhubs-streamanalytics-cosmosdb
eventhubskafka-databricks-cosmosdb
Deployment of eventhubs-streamanalytics-azuresql fails in Azure Cloud Shell with:
Error:
***** [D] Setting up DATABASE
retrieving storage connection string
deploying azure sql
. server: myiottestsql
. database: streaming
creating logical server
enabling access from Azure
deploying database db
creating file share
uploading provisioning scripts
uploading provision.sh...
Finished[#############################################################] 100.0000%
uploading db/provision.sql...
Finished[#############################################################] 100.0000%
running provisioning scripts in container instance
../components/azure-sql/create-sql.sh: line 54: uuidgen: command not found
There was an error, execution halted
Error at line 54
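Azure Cloud Shell's container does not ship uuidgen, so create-sql.sh could either declare it a prerequisite or fall back to another generator. A sketch of a fallback, assuming python3 is available (as it is in Cloud Shell); the `new_uuid` helper is hypothetical:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Generate a UUID with uuidgen when present, otherwise fall back to
# python3's uuid module (uuidgen is missing in Azure Cloud Shell).
new_uuid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    python3 -c 'import uuid; print(uuid.uuid4())'
  fi
}

new_uuid
```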
Hello,
I'm curious why there are no Python samples in this repo.
Thank you
I encountered a failure creating the Eventhub Databricks CosmosDb solution - https://github.com/Azure-Samples/streaming-at-scale/blob/main/eventhubs-databricks-cosmosdb/create-solution.sh
I have narrowed the issue down: the Databricks ARM templates no longer exist at the location used by the script.
https://github.com/Azure-Samples/streaming-at-scale/blob/main/components/azure-databricks/create-databricks.sh
Line 22 of the above script references a broken link - https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/101-databricks-all-in-one-template-for-vnet-injection/azuredeploy.json
Hello,
I tried to deploy, but I keep getting the issue below. It appears to occur while deploying the function. Please let me know.
Tried with Deployment command: ./create-solution.sh -d evfnsql -l westeurope
for Incremental deployment: ./create-solution.sh -d evfnsql -l westeurope -s PTMV
Error is in the script: https://github.com/Azure-Samples/streaming-at-scale/blob/main/eventhubs-functions-azuresql/StreamingProcessor-AzureSQL-Test0/StreamingProcessor-AzureSQL/Test0.cs
Output:
Deployment started...
***** [C] Setting up COMMON resources
***** [I] Setting up INGESTION
***** [D] Setting up DATABASE
***** [P] Setting up PROCESSING
creating app service plan
. name: evfnsqlprocessplan
creating function app
. name: evfnsqlprocess
WARNING: No functions version specified so defaulting to 2. In the future, specifying a version will be required. To create a 2.x function you would pass in the flag --functions-version 2
WARNING: Application Insights "evfnsqlprocess" was created for this Function App. You can visit https://portal.azure.com/#resource/subscriptions/........ to view your Application Insights component
building function app
. path: ./StreamingProcessor-AzureSQL-Test0/StreamingProcessor-AzureSQL
There was an error, execution halted
Error at line 25
Stopping and restarting the job results in
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 8.0 failed 4 times, most recent failure: Lost task 7.3 in stage 8.0 (TID 158, 10.139.64.9, executor 1): com.microsoft.sqlserver.jdbc.SQLServerException: Violation of PRIMARY KEY constraint 'pk__rawdata'. Cannot insert duplicate key in object 'dbo.rawdata'. The duplicate key value is (2284bd09-4619-49ea-93d9-0015d9372ab2, 2).
In the Databricks samples, the "processedAt" field is missing. It should contain the UTC datetime of when the row was processed by the streaming engine.
Hi,
We are using the Flink Table API;
is that compatible with the Flink ApplicationInsightsReporter sample?
Or is this meant to be used with a job that uses the Streaming API and implements RichFunction?
For example, creating an instance of ApplicationInsightsReporter and, inside, say, a map function, sending the metric data through it?
Regards
Suleiman
Query/Question
It's regarding streaming-at-scale/eventhubs-capture/.
I tried it by sending basic data as described in the MSDN doc (https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-dotnet-standard-getstarted-send). However, when I try to read the Avro files, it simply fails. I tried downloading the Avro file locally and I have the same issue.
Is there any particular configuration required on the event hub to make Apache Drill work with the content captured from Event Hubs?
apache drill> select * from dfs.`temp/16.avro`;
Error: UNSUPPORTED_OPERATION ERROR: 'complex union' type is not supported
Column: value
Fragment: 0:0
[Error Id: c0f67d7f-a43d-4efc-b643-b1ff48660a4b on host.docker.internal:31010] (state=,code=0)
Note: I also tried to convert to JSON, but the issue remains the same.
Environment: