microsoft / data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

License: MIT License

Scala 13.28% Batchfile 0.22% PowerShell 5.07% Python 0.15% Shell 0.06% Dockerfile 0.09% C# 54.67% JavaScript 26.21% CSS 0.24% HTML 0.02%
spark spark-streaming spark-sql sparksql streaming-data streaming servicefabric nodejs docker hdinsight

data-accelerator's Introduction

Data Accelerator for Apache Spark

Build status badges: Flow, Gateway, DataProcessing, Metrics, SimulatedData, Website

Data Accelerator for Apache Spark democratizes streaming big data using Spark by offering several key features, such as a no-code experience for setting up a data pipeline and a fast dev-test loop for creating complex logic. Our team has been using the project within Microsoft for two years, processing streamed data across many internal deployments that handle data volumes at Microsoft scale. It offers an easy-to-use platform for learning and evaluating streaming needs and requirements. We are thrilled to share this project with the wider community as open source!

Azure Friday: We are now featured on Azure Fridays! See the video here.

Data Accelerator offers three levels of experience:

  • The first requires no code at all, using rules to create alerts on data content.
  • The second lets you quickly write a Spark SQL query, with additions like LiveQuery, time windowing, an in-memory accumulator, and more (see the sketch after this list).
  • The third enables integrating custom code written in Scala or via Azure Functions.
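
For a flavor of the second level, here is a minimal, self-contained Scala sketch of the kind of time-windowed Spark SQL aggregation that tier targets. This is plain Spark SQL, not Data Accelerator's own query dialect; the table, column, and threshold values are illustrative assumptions only.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WindowedAlertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("windowed-aggregation-sketch")
      .master("local[*]") // local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sensor readings standing in for a streamed input table.
    val readings = Seq(
      ("device-1", "2019-06-16 13:56:00", 71.2),
      ("device-1", "2019-06-16 13:56:30", 98.6),
      ("device-2", "2019-06-16 13:56:45", 64.0)
    ).toDF("deviceId", "eventTime", "temperature")
      .withColumn("eventTime", to_timestamp($"eventTime"))

    readings.createOrReplaceTempView("events")

    // Max temperature per device over 1-minute tumbling windows,
    // surfacing an alert row when the peak crosses a threshold.
    spark.sql(
      """SELECT deviceId, window(eventTime, '1 minute') AS win,
        |       MAX(temperature) AS maxTemp
        |FROM events
        |GROUP BY deviceId, window(eventTime, '1 minute')
        |HAVING MAX(temperature) > 90""".stripMargin
    ).show(truncate = false)

    spark.stop()
  }
}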

You can get started locally on Windows, macOS, and Linux by following these instructions.
To deploy to Azure, you can use the ARM template; see the deploy-to-Azure instructions.

The data-accelerator repository contains everything needed to set up an end-to-end data pipeline, and there are many ways you can participate in the project.

Getting Started

To unleash the full power of Data Accelerator, deploy to Azure and check out the cloud mode tutorials.

We have also enabled a "hello world" experience that you can try out locally by running a Docker container. When running locally there are no dependencies on Azure; however, the functionality is very limited and is only there to give you a cursory overview of Data Accelerator. To run Data Accelerator locally, deploy locally and then check out the local mode tutorials.

Data Accelerator for Spark runs on the following:

  • Azure HDInsight with Spark 2.4 (2.3 also supported)
  • Azure Databricks with Spark 2.4
  • Service Fabric (v6.4.637.9590) with
    • .NET Core 2.2
    • ASP.NET
  • App Service with Node 10.6

See the wiki pages for further information on how to build, diagnose and maintain your data pipelines built using Data Accelerator for Spark.

Contributing

If you are interested in fixing issues and contributing to the code base, we would love to partner with you. Try things out, join in the design conversations and make pull requests.

Feedback

Please also see our Code of Conduct.

Security issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at secure@microsoft.com. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.

License

This repository is licensed with the MIT license.

data-accelerator's People

Contributors

acrokat, ankisho, barryt2, birgithi, bryanchen-d, carlbrochu, dependabot[bot], dineshc-msft, gstaneff, jozavala-msft, kjcho-msft, microsoft-github-policy-service[bot], microsoftopensource, msandg, msftgits, ramkd12, rohit489, s-tuli, shibbas, sumabh-msft, tylorhl, tylorhl-msft, v-swmi, vijayupadya, yangyi-msft, yinyuwang


data-accelerator's Issues

Enable ARM extensibility

Is your feature request related to a problem? Please describe.
There is no clear way to extend the functionality other than forking the code under DeploymentCloud.

Describe the solution you'd like
A separate package to use

Web: Enable uploading jar files and CSVs from the web portal

Is your feature request related to a problem? Please describe.
Today the user needs to manually deploy UDF jars and reference-data CSVs to the blob location.

Describe the solution you'd like
Enable the user to choose a file on their local disk, which the web portal then uploads to the right location.

Can't install using docker

Describe the bug
Cannot pull image from DockerHub

docker pull mcr.microsoft.com/datax/dataxlocal
Using default tag: latest
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:latest not found: manifest unknown: manifest tagged by "latest" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1.2
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1.2 not found: manifest unknown: manifest tagged by "v1.2" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1.1
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1.1 not found: manifest unknown: manifest tagged by "v1.1" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1 not found: manifest unknown: manifest tagged by "v1" is not found

To Reproduce
Steps to reproduce the behavior:
Run docker pull mcr.microsoft.com/datax/dataxlocal

Expected behavior
Image can be pulled

Desktop (please complete the following information):

  • OS: Mac
  • docker -v
    Docker version 19.03.12, build 48a66213fe

Publish 1.0.0 packages

Describe the bug
The NuGet packages have a prerelease tag.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the NuGet packages and see that they are marked as prerelease

Expected behavior
Final version 1.0.0 available

Automate localdb creation (cosmosdb for local)

Describe the bug
Automate creation of the localdb file to avoid having to check it in.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the local folder
  2. Notice the localdb file
  3. It can't be code reviewed or easily maintained

Expected behavior
Generated from existing config files

Support for Azure Data Lake Storage Gen2

Is your feature request related to a problem? Please describe.
Add support to read from and write to Azure Data Lake Storage Gen2.

Describe the solution you'd like
Enable configuring both batch and streaming jobs to read and write from blobs
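
As a sketch of what the requested support might look like from the Spark side (not the project's implementation): the Hadoop ABFS driver exposes ADLS Gen2 through abfss:// URIs, so both batch and streaming jobs can target the same paths. Account, container, and path names below are hypothetical, and credential configuration (e.g. fs.azure.account.key.*) is omitted.

import org.apache.spark.sql.SparkSession

object AdlsGen2Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("adls-gen2-sketch").getOrCreate()

    // Hypothetical account/container names; abfss:// is the ADLS Gen2 scheme.
    val inPath  = "abfss://input@myaccount.dfs.core.windows.net/events/"
    val outPath = "abfss://output@myaccount.dfs.core.windows.net/processed/"

    // Batch: read existing blobs, transform, write back.
    val batch = spark.read.json(inPath)
    batch.filter("temperature > 90").write.mode("append").parquet(outPath + "batch/")

    // Streaming: the same source consumed incrementally with the same query shape.
    spark.readStream
      .schema(batch.schema) // streaming reads need an explicit schema
      .json(inPath)
      .filter("temperature > 90")
      .writeStream
      .format("parquet")
      .option("checkpointLocation", outPath + "checkpoints/")
      .start(outPath + "stream/")
      .awaitTermination()
  }
}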


Enable support for Snapshot packages in Git Maven Repo

Describe the bug
There is no way to host a snapshot jar built from master on the Maven repo.

To Reproduce
Steps to reproduce the behavior:

  1. Go to DataProcessing
  2. Open datax-host pom file
  3. Notice it depends on the non-snapshot version

Expected behavior
Maven-hosted snapshot packages are available.

Enable Non-Windows Cloud automated deployment

Describe the bug
Missing non-Windows installation steps and scripts

To Reproduce
Steps to reproduce the behavior:

  1. Go to DeployCloud folder
  2. Notice there are no non-Windows instructions, nor do the scripts work directly on Linux/Mac

Expected behavior
Scripts and docs exist.

Support extended character set on flow names

Is your feature request related to a problem? Please describe.
Support extended character sets in Flow names.


Services: Improve comment support

Is your feature request related to a problem? Please describe.
Add support for /* */ and -- comments

Describe the solution you'd like
The codegen service should ignore anything placed between /* and */, and also ignore content after -- on a given line.
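
A minimal Scala sketch of the requested behavior (not the codegen service's actual code); note a production version must also avoid stripping comment markers that appear inside string literals.

object StripSqlComments {
  // Drop /* ... */ blocks, then anything after -- on each line.
  def strip(query: String): String = {
    val noBlockComments = query.replaceAll("(?s)/\\*.*?\\*/", " ")
    noBlockComments.linesIterator
      .map(_.replaceAll("--.*$", ""))
      .mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    val q =
      """SELECT deviceId, MAX(temperature) AS maxTemp  -- per-device peak
        |/* the service injects windowing here */
        |FROM events GROUP BY deviceId""".stripMargin
    println(strip(q))
  }
}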

Missing dependencies for nuget packages

Describe the bug
The published NuGet packages do not have any dependencies defined.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the nuspec files
  2. Notice there are no dependencies

Expected behavior
Dependencies are present

Versioning should include a combination of ARM template, binaries, and packages

Is your feature request related to a problem? Please describe.
No way to sync ARM template code and packages
No way to roll back to specific version of code
No way to install packages built from master

Describe the solution you'd like
Solution to issues above and versioning support across all components

LiveQuery support of AzureFunction

Is your feature request related to a problem? Please describe.
When doing a LiveQuery, Azure Functions should work.

Describe the solution you'd like
An Azure Function is called when I select a LiveQuery.

Additional context
LiveQuery should POST against Azure Functions.
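
A minimal sketch of such a POST using Java 11's HttpClient from Scala; the function URL and payload shape are hypothetical and would come from the Flow's Azure Function configuration.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LiveQueryFunctionPost {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint and body; substitute the real function key.
    val functionUrl = "https://myfunctionapp.azurewebsites.net/api/score?code=FUNCTION_KEY"
    val payload     = """{"deviceId":"device-1","temperature":98.6}"""

    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(functionUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode}: ${response.body}")
  }
}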

[Local deployment] Aggregate alerts not shown in Metrics - requires db cleanup step in Services

Describe the bug
Sometimes aggregate alerts aren't shown in the corresponding table under metrics.

To Reproduce
Steps to reproduce the behavior:

  1. Set up aggregate alerts as shown in the local deployment tutorial
  2. Go to http://localhost:49080/dashboard
  3. Click on the Flow for which alert was created
  4. Scroll down to corresponding table for the aggregate alert
  5. Observe no data shown in table

Expected behavior
The alert fires, and its time & description are shown in the table.

Version of Data Accelerator
v1.1

Desktop (please complete the following information):

  • macOS Mojave 10.14.4
  • Google Chrome Version 76.0.3809.132 (Official Build) (64-bit)

Additional context
I also tried:

  • Stopping & redeploying Data Accelerator
  • Restarting the job
  • Turning off alerts for all other rules & restarting job
    Along the way, Events Ingested Today & Avg Events/Min were no longer loading, and the job went idle a couple of times without me stopping it.
    Attachment: Data Accelerator local deployment missing alerts & metrics.txt

Support for batch jobs

Is your feature request related to a problem? Please describe.
Running the same queries as a batch job

Describe the solution you'd like
Processing blobs in a batch mode

Support for Databricks

Is your feature request related to a problem? Please describe.
Hosting Data Accelerator should also work with Databricks as the compute resource.

Describe the solution you'd like
Databricks integration

Additional context
The Data Accelerator team is looking into adding support for Databricks.

Build status on main page

Is your feature request related to a problem? Please describe.
Add our build status to the main page

ARM: Expose the location of the Spark cluster as a parameter

Please expose the location of the Spark cluster as a parameter in the config file. There are occasions where customers might run out of core quota in a location and might want to create the cluster in a separate location.

NPMJS packages do not point to our GitHub

Describe the bug
The published NPM packages do not point to this GitHub repo.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the datax-home package on npm
  2. See the readme

Web: Handle saving automatically and avoid navigating away with changes

Is your feature request related to a problem? Please describe.
The editor can lose changes if the user navigates away without having deployed changes to the Flow.

Describe the solution you'd like
Automatically save the document without deploying.
Add an "Are you sure you want to navigate away?" dialog when unsaved changes are detected.

Local: Support Live Query in Local mode

Is your feature request related to a problem? Please describe.
In local mode, running within Docker, the live query feature is disabled.

Describe the solution you'd like
Enable live query for local mode

AzureFunction integration expects 2 keyvault entries and parameters and POST is not supported

Describe the bug
Could not use an Azure Function without manually creating the keyvault entry.

To Reproduce
Steps to reproduce the behavior:

  1. Set up the Azure Function
  2. Notice the keyvault entry is not correct

Expected behavior
Azure function should work


Web: Enable save Flow to local disk in Cloud mode

Is your feature request related to a problem? Please describe.
In order to enable a code-flow scenario (i.e., git), we need a way to save the editor changes to disk before committing to deploy.

Describe the solution you'd like
Add a "Save to disk" button on the editor tab

Describe alternatives you've considered
A full git client could be built into the web portal, but that seems more costly than letting users manage it themselves.

Improve Error messages when Input is sampled and no results are returned

Describe the bug
After resampling data in the live experience, the sample can be empty if the schema is wrong or the Event Hub doesn't send anything. The panel on the right doesn't show a helpful error message.


Unable to complete the setup


Services: Refactor Utilities DotNet projects

Describe the bug
Some utilities projects are duplicated between singular (Utility) and plural (Utilities) versions. Let's align on the plural versions.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Services\DataX.Utilities folder
  2. Notice duplicate folders, e.g. two CosmosDB utilities

Expected behavior
One DLL per area

Integrate Simulator into Web Portal

Is your feature request related to a problem? Please describe.
Easier way to simulate new data

Describe the solution you'd like
The web portal should have options to customize the Simulator service

Provide configuration to expire metrics data

Is your feature request related to a problem? Please describe.
Provide a mechanism to expire metrics data.


Metrics default dashboard view shows constant loading spinner

Describe the bug
When viewing the metrics dashboard (without clicking into a specific Flow) there is a spinning loading indicator that does not finish or provide more information.

To Reproduce
Steps to reproduce the behavior:

  1. Run Data Accelerator local deployment
  2. Go to http://localhost:49080/home
  3. Open the Metrics tab. Do not click on any listed Flows.
  4. Observe spinner

Expected behavior
Loading completes, or a message is displayed indicating that I should click on a Flow to view its metrics.

Screenshots
[screenshot attached]

Version of Data Accelerator
How do I find this information?


Desktop (please complete the following information):

  • macOS Mojave 10.14.4
  • Google Chrome Version 76.0.3809.132 (Official Build) (64-bit)


Restart job: Validate in flows service that the current spark job has stopped before issuing a new start job instance


Repro steps:

  1. Restart a currently running job from the jobs page

Actual:
Sometimes the currently running job continues and a new instance of the same job gets triggered.

Expected:
The restart-job API should validate that the currently running job instance has stopped before starting a new instance.
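
A minimal Scala sketch of the requested validation; the job-state accessors here are hypothetical stubs, whereas the real flows service would query Livy/YARN (or Databricks) for the job's actual state.

import scala.annotation.tailrec

object RestartWithValidation {
  sealed trait JobState
  case object Running extends JobState
  case object Stopped extends JobState

  // Hypothetical accessors standing in for real cluster calls.
  def currentState(jobId: String): JobState = Stopped // stub for this sketch
  def stopJob(jobId: String): Unit = ()
  def startJob(jobId: String): Unit = println(s"started $jobId")

  // Poll until the job reports Stopped, or give up after the retry budget.
  @tailrec
  def awaitStopped(jobId: String, retriesLeft: Int): Boolean =
    currentState(jobId) match {
      case Stopped => true
      case Running if retriesLeft > 0 =>
        Thread.sleep(5000) // poll every 5 seconds
        awaitStopped(jobId, retriesLeft - 1)
      case Running => false // timed out; refuse to double-start
    }

  def restart(jobId: String): Unit = {
    stopJob(jobId)
    if (awaitStopped(jobId, retriesLeft = 24)) startJob(jobId)
    else sys.error(s"$jobId did not stop; refusing to start a second instance")
  }

  def main(args: Array[String]): Unit = restart("iotsample")
}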

Move default configurations files into Services deployment or initial configuration

Is your feature request related to a problem? Please describe.
Common configurations should be part of the services so that they match the code being deployed.

Describe the solution you'd like
Move the json files from the Deployment folder to the Flows folder and include them in the release. The Deployment scripts should be able to use/deploy them from there.

Web: Improve readability for screen readers

Is your feature request related to a problem? Please describe.
Some areas of the web portal have issues with screen readers.

Describe the solution you'd like
Improve readability for screen readers across the web portal

Local: Support Accumulator in Local mode

Is your feature request related to a problem? Please describe.
When running Data Accelerator in Docker, the Accumulator feature does not work.

Describe the solution you'd like
Enable the accumulator feature in local mode

Deploying the iotsample job fails

After setting up Data Accelerator, simply deploy the IoT Sample flow with a new name; several seconds (120s?) later, the error below appears in the Jobs tab:

19/06/16 13:56:07 INFO ShutdownHookManager: Shutdown hook called
19/06/16 13:56:07 INFO ShutdownHookManager: Deleting directory /tmp/spark-8dd197f7-3687-4445-93d7-ddb1a1d357ef
19/06/16 13:56:07 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
19/06/16 13:56:07 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
19/06/16 13:56:07 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
19/06/16 13:56:07 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
stderr:
YARN Diagnostics:
java.lang.Exception: No YARN application is found with tag livy-batch-8-yikjsvds in 120 seconds. Please check your cluster status, it is may be very busy.
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
scala.Option.getOrElse(Option.scala:120)
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
org.apache.livy.Utils$$anon$1.run(Utils.scala:97)
