microsoft / data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

License: MIT License

Scala 13.28% Batchfile 0.22% PowerShell 5.07% Python 0.15% Shell 0.06% Dockerfile 0.09% C# 54.67% JavaScript 26.21% CSS 0.24% HTML 0.02%
spark spark-streaming spark-sql sparksql streaming-data streaming servicefabric nodejs docker hdinsight

data-accelerator's Introduction

Data Accelerator for Apache Spark

Build status badges: Flow, Gateway, DataProcessing, Metrics, SimulatedData, Website

Data Accelerator for Apache Spark democratizes streaming big data using Spark by offering several key features, such as a no-code experience for setting up a data pipeline and a fast dev-test loop for creating complex logic. Our team has been using the project within Microsoft for two years, processing streamed data across many internal deployments that handle data volumes at Microsoft scale. It offers an easy-to-use platform for learning and evaluating streaming needs and requirements. We are thrilled to share this project with the wider community as open source!

Azure Friday: We are now featured on Azure Fridays! See the video here.

Data Accelerator offers three levels of experience:

  • The first requires no code at all, using rules to create alerts on data content.
  • The second lets you quickly write a Spark SQL query, with additions like LiveQuery, time windowing, an in-memory accumulator, and more (see the sketch after this list).
  • The third enables integrating custom code written in Scala or via Azure Functions.
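
For a flavor of the second level, here is a minimal, self-contained Scala sketch of the kind of time-windowed Spark SQL aggregation that tier targets. This is plain Spark SQL, not Data Accelerator's own query dialect; the table, column, and threshold values are illustrative assumptions only.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WindowedAlertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("windowed-aggregation-sketch")
      .master("local[*]") // local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sensor readings standing in for a streamed input table.
    val readings = Seq(
      ("device-1", "2019-06-16 13:56:00", 71.2),
      ("device-1", "2019-06-16 13:56:30", 98.6),
      ("device-2", "2019-06-16 13:56:45", 64.0)
    ).toDF("deviceId", "eventTime", "temperature")
      .withColumn("eventTime", to_timestamp($"eventTime"))

    readings.createOrReplaceTempView("events")

    // Max temperature per device over 1-minute tumbling windows,
    // surfacing an alert row when the peak crosses a threshold.
    spark.sql(
      """SELECT deviceId, window(eventTime, '1 minute') AS win,
        |       MAX(temperature) AS maxTemp
        |FROM events
        |GROUP BY deviceId, window(eventTime, '1 minute')
        |HAVING MAX(temperature) > 90""".stripMargin
    ).show(truncate = false)

    spark.stop()
  }
}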

You can get started locally on Windows, macOS, and Linux by following these instructions.
To deploy to Azure, you can use the ARM template; see the deploy-to-Azure instructions.

The data-accelerator repository contains everything needed to set up an end-to-end data pipeline, and there are many ways you can participate in the project.

Getting Started

To unleash the full power of Data Accelerator, deploy to Azure and check out the cloud mode tutorials.

We have also enabled a "hello world" experience that you can try out locally by running a Docker container. When running locally there are no dependencies on Azure; however, the functionality is very limited and is only there to give you a cursory overview of Data Accelerator. To run Data Accelerator locally, deploy locally and then check out the local mode tutorials.

Data Accelerator for Spark runs on the following:

  • Azure HDInsight with Spark 2.4 (2.3 also supported)
  • Azure Databricks with Spark 2.4
  • Service Fabric (v6.4.637.9590) with
    • .NET Core 2.2
    • ASP.NET
  • App Service with Node 10.6

See the wiki pages for further information on how to build, diagnose and maintain your data pipelines built using Data Accelerator for Spark.

Contributing

If you are interested in fixing issues and contributing to the code base, we would love to partner with you. Try things out, join in the design conversations and make pull requests.

Feedback

Please also see our Code of Conduct.

Security issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at secure@microsoft.com. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.

License

This repository is licensed with the MIT license.

data-accelerator's People

Contributors

acrokat, ankisho, barryt2, birgithi, bryanchen-d, carlbrochu, dependabot[bot], dineshc-msft, gstaneff, jozavala-msft, kjcho-msft, microsoft-github-policy-service[bot], microsoftopensource, msandg, msftgits, ramkd12, rohit489, s-tuli, shibbas, sumabh-msft, tylorhl, tylorhl-msft, v-swmi, vijayupadya, yangyi-msft, yinyuwang


data-accelerator's Issues

Enable ARM extensibility

Is your feature request related to a problem? Please describe.
There is no clear way to extend the functionality other than forking the code under DeploymentCloud.

Describe the solution you'd like
A separate package to use

Web: Enable uploading jar files and CSVs from the web portal

Is your feature request related to a problem? Please describe.
Today the user needs to manually deploy UDF jars and reference-data CSVs to the blob location.

Describe the solution you'd like
Enable the user to choose a file on their local disk, which the web portal then uploads to the right location.

Can't install using docker

Describe the bug
Cannot pull image from DockerHub

docker pull mcr.microsoft.com/datax/dataxlocal
Using default tag: latest
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:latest not found: manifest unknown: manifest tagged by "latest" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1.2
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1.2 not found: manifest unknown: manifest tagged by "v1.2" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1.1
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1.1 not found: manifest unknown: manifest tagged by "v1.1" is not found
❯ docker pull mcr.microsoft.com/datax/dataxlocal:v1
Error response from daemon: manifest for mcr.microsoft.com/datax/dataxlocal:v1 not found: manifest unknown: manifest tagged by "v1" is not found

To Reproduce
Steps to reproduce the behavior:
Run docker pull mcr.microsoft.com/datax/dataxlocal

Expected behavior
Image can be pulled

Desktop (please complete the following information):

  • OS: Mac
  • docker -v
    Docker version 19.03.12, build 48a66213fe

Publish 1.0.0 packages

Describe the bug
The NuGet packages have a prerelease tag.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the NuGet packages and see that they are marked as prerelease

Expected behavior
Final version 1.0.0 available

Automate localdb creation (cosmosdb for local)

Describe the bug
Automate creation of the localdb file to avoid having to check it in.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the local folder
  2. Notice the localdb file
  3. It can't be code reviewed or easily maintained

Expected behavior
Generated from existing config files

Support for Azure Data Lake Storage Gen2

Is your feature request related to a problem? Please describe.
Add support to read from and write to Azure Data Lake Storage Gen2.

Describe the solution you'd like
Enable configuring both batch and streaming jobs to read and write from blobs
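
As a sketch of what the requested support might look like from the Spark side (not the project's implementation): the Hadoop ABFS driver exposes ADLS Gen2 through abfss:// URIs, so both batch and streaming jobs can target the same paths. Account, container, and path names below are hypothetical, and credential configuration (e.g. fs.azure.account.key.*) is omitted.

import org.apache.spark.sql.SparkSession

object AdlsGen2Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("adls-gen2-sketch").getOrCreate()

    // Hypothetical account/container names; abfss:// is the ADLS Gen2 scheme.
    val inPath  = "abfss://input@myaccount.dfs.core.windows.net/events/"
    val outPath = "abfss://output@myaccount.dfs.core.windows.net/processed/"

    // Batch: read existing blobs, transform, write back.
    val batch = spark.read.json(inPath)
    batch.filter("temperature > 90").write.mode("append").parquet(outPath + "batch/")

    // Streaming: the same source consumed incrementally with the same query shape.
    spark.readStream
      .schema(batch.schema) // streaming reads need an explicit schema
      .json(inPath)
      .filter("temperature > 90")
      .writeStream
      .format("parquet")
      .option("checkpointLocation", outPath + "checkpoints/")
      .start(outPath + "stream/")
      .awaitTermination()
  }
}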


Enable support for Snapshot packages in Git Maven Repo

Describe the bug
There is no way to host a snapshot jar built from master on the Maven repo.

To Reproduce
Steps to reproduce the behavior:

  1. Go to DataProcessing
  2. Open datax-host pom file
  3. Notice it depends on the non-snapshot version

Expected behavior
Maven-hosted snapshot packages are available.

Enable Non-Windows Cloud automated deployment

Describe the bug
Missing non-Windows installation steps and scripts

To Reproduce
Steps to reproduce the behavior:

  1. Go to DeployCloud folder
  2. Notice there are no non-Windows instructions, nor do the scripts work directly on Linux/Mac

Expected behavior
Scripts and docs exist.

Support extended character set on flow names

Is your feature request related to a problem? Please describe.
Support extended character sets in Flow names.


Services: Improve comment support

Is your feature request related to a problem? Please describe.
Add support for /* */ and -- comments

Describe the solution you'd like
The codegen service should ignore anything placed between /* and */, and also ignore content after -- on a given line.
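
A minimal Scala sketch of the requested behavior (not the codegen service's actual code); note a production version must also avoid stripping comment markers that appear inside string literals.

object StripSqlComments {
  // Drop /* ... */ blocks, then anything after -- on each line.
  def strip(query: String): String = {
    val noBlockComments = query.replaceAll("(?s)/\\*.*?\\*/", " ")
    noBlockComments.linesIterator
      .map(_.replaceAll("--.*$", ""))
      .mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    val q =
      """SELECT deviceId, MAX(temperature) AS maxTemp  -- per-device peak
        |/* the service injects windowing here */
        |FROM events GROUP BY deviceId""".stripMargin
    println(strip(q))
  }
}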

Missing dependencies for nuget packages

Describe the bug
The published NuGet packages do not have any dependencies defined.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the nuspec files
  2. Notice there are no dependencies

Expected behavior
Dependencies are present

Versioning should include a combination of ARM template, binaries, and packages

Is your feature request related to a problem? Please describe.
No way to sync ARM template code and packages
No way to roll back to specific version of code
No way to install packages built from master

Describe the solution you'd like
Solution to issues above and versioning support across all components

LiveQuery support of AzureFunction

Is your feature request related to a problem? Please describe.
When doing a LiveQuery, Azure Functions should work.

Describe the solution you'd like
An Azure Function is called when I select a LiveQuery.

Additional context
LiveQuery should POST against Azure Functions.
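
A minimal sketch of such a POST using Java 11's HttpClient from Scala; the function URL and payload shape are hypothetical and would come from the Flow's Azure Function configuration.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LiveQueryFunctionPost {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint and body; substitute the real function key.
    val functionUrl = "https://myfunctionapp.azurewebsites.net/api/score?code=FUNCTION_KEY"
    val payload     = """{"deviceId":"device-1","temperature":98.6}"""

    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(functionUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode}: ${response.body}")
  }
}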

[Local deployment] Aggregate alerts not shown in Metrics - requires db cleanup step in Services

Describe the bug
Sometimes aggregate alerts aren't shown in the corresponding table under metrics.

To Reproduce
Steps to reproduce the behavior:

  1. Set up aggregate alerts as shown in the local deployment tutorial
  2. Go to http://localhost:49080/dashboard
  3. Click on the Flow for which alert was created
  4. Scroll down to corresponding table for the aggregate alert
  5. Observe no data shown in table

Expected behavior
The alert fires, and its time & description are shown in the table.

Version of Data Accelerator
v1.1

Desktop (please complete the following information):

  • macOS Mojave 10.14.4
  • Google Chrome Version 76.0.3809.132 (Official Build) (64-bit)

Additional context
I also tried:

  • Stopping & redeploying Data Accelerator
  • Restarting the job
  • Turning off alerts for all other rules & restarting job
    Along the way, Events Ingested Today & Avg Events/Min were no longer loading, and the job went idle a couple of times without me stopping it.
    Attachment: Data Accelerator local deployment missing alerts & metrics.txt

Support for batch jobs

Is your feature request related to a problem? Please describe.
Running the same queries as a batch job

Describe the solution you'd like
Processing blobs in a batch mode

Support for Databricks

Is your feature request related to a problem? Please describe.
Hosting Data Accelerator should also work with Databricks as the compute resource.

Describe the solution you'd like
Databricks integration

Additional context
The Data Accelerator team is looking into adding support for Databricks.

Build status on main page

Is your feature request related to a problem? Please describe.
Add our build status to the main page

ARM: Expose the location of the Spark cluster as a parameter

Please expose the location of the Spark cluster as a parameter in the config file. There are occasions where customers might run out of core quota in a location and might want to create the cluster in a separate location.

NPMJS packages do not point to our GitHub

Describe the bug
The published NPM packages do not point to this GitHub repo.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the datax-home package on npm
  2. See the readme

Web: Handle saving automatically and avoid navigating away with changes

Is your feature request related to a problem? Please describe.
The editor can lose changes if the user navigates away without having deployed changes to the Flow.

Describe the solution you'd like
Automatically save the document without deploying.
Add an "Are you sure you want to navigate away?" dialog when unsaved changes are detected.

Local: Support Live Query in Local mode

Is your feature request related to a problem? Please describe.
In local mode, running within Docker, the live query feature is disabled.

Describe the solution you'd like
Enable live query for local mode

AzureFunction integration expects 2 keyvault entries and parameters and POST is not supported

Describe the bug
Could not use an Azure Function without manually creating the keyvault entry.

To Reproduce
Steps to reproduce the behavior:

  1. Set up the Azure Function
  2. Notice the keyvault entry is not correct

Expected behavior
Azure function should work


Web: Enable save Flow to local disk in Cloud mode

Is your feature request related to a problem? Please describe.
In order to enable a code-flow scenario (i.e., git), we need a way to save the editor changes to disk before committing to deploy.

Describe the solution you'd like
Add a "Save to disk" button on the editor tab

Describe alternatives you've considered
A full git client could be built into the web portal, but that seems more costly than letting users manage it themselves.

Improve Error messages when Input is sampled and no results are returned

Describe the bug
After resampling data in the live experience, the sample can be empty if the schema is wrong or the Event Hub doesn't send anything. The panel on the right doesn't show a helpful error message.


Unable to complete the setup


Services: Refactor Utilities DotNet projects

Describe the bug
Some utilities projects are duplicated between singular (Utility) and plural (Utilities) versions. Let's align on the plural versions.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Services\DataX.Utilities folder
  2. Notice duplicate folders, e.g. two CosmosDB utilities

Expected behavior
One DLL per area

Integrate Simulator into Web Portal

Is your feature request related to a problem? Please describe.
Easier way to simulate new data

Describe the solution you'd like
The web portal should have options to customize the Simulator service

Provide configuration to expire metrics data

Is your feature request related to a problem? Please describe.
Provide a mechanism to expire metrics data.


Metrics default dashboard view shows constant loading spinner

Describe the bug
When viewing the metrics dashboard (without clicking into a specific Flow) there is a spinning loading indicator that does not finish or provide more information.

To Reproduce
Steps to reproduce the behavior:

  1. Run Data Accelerator local deployment
  2. Go to http://localhost:49080/home
  3. Open the Metrics tab. Do not click on any listed Flows.
  4. Observe spinner

Expected behavior
Loading completes, or a message is displayed indicating that I should click on a Flow to view its metrics.

Screenshots
[screenshot attached]

Version of Data Accelerator
How do I find this information?


Desktop (please complete the following information):

  • macOS Mojave 10.14.4
  • Google Chrome Version 76.0.3809.132 (Official Build) (64-bit)


Restart job: Validate in flows service that the current spark job has stopped before issuing a new start job instance


Repro steps:

  1. Restart a currently running job from the jobs page

Actual:
Sometimes the currently running job continues and a new instance of the same job gets triggered.

Expected:
The restart-job API should validate that the currently running job instance has stopped before starting a new instance.
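
A minimal Scala sketch of the requested validation; the job-state accessors here are hypothetical stubs, whereas the real flows service would query Livy/YARN (or Databricks) for the job's actual state.

import scala.annotation.tailrec

object RestartWithValidation {
  sealed trait JobState
  case object Running extends JobState
  case object Stopped extends JobState

  // Hypothetical accessors standing in for real cluster calls.
  def currentState(jobId: String): JobState = Stopped // stub for this sketch
  def stopJob(jobId: String): Unit = ()
  def startJob(jobId: String): Unit = println(s"started $jobId")

  // Poll until the job reports Stopped, or give up after the retry budget.
  @tailrec
  def awaitStopped(jobId: String, retriesLeft: Int): Boolean =
    currentState(jobId) match {
      case Stopped => true
      case Running if retriesLeft > 0 =>
        Thread.sleep(5000) // poll every 5 seconds
        awaitStopped(jobId, retriesLeft - 1)
      case Running => false // timed out; refuse to double-start
    }

  def restart(jobId: String): Unit = {
    stopJob(jobId)
    if (awaitStopped(jobId, retriesLeft = 24)) startJob(jobId)
    else sys.error(s"$jobId did not stop; refusing to start a second instance")
  }

  def main(args: Array[String]): Unit = restart("iotsample")
}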

Move default configurations files into Services deployment or initial configuration

Is your feature request related to a problem? Please describe.
Common configurations should be part of the services so that they match the code being deployed.

Describe the solution you'd like
Move the json files from the Deployment folder to the Flows folder and include them in the release. The Deployment scripts should be able to use/deploy them from there.

Web: Improve readability for screen readers

Is your feature request related to a problem? Please describe.
Some areas of the web portal have issues with screen readers.

Describe the solution you'd like
Improve readability for screen readers across the web portal

Local: Support Accumulator in Local mode

Is your feature request related to a problem? Please describe.
When running Data Accelerator in Docker, the Accumulator feature does not work.

Describe the solution you'd like
Enable the accumulator feature in local mode

Deploying the iotsample job fails

After setting up Data Accelerator, simply deploy the IoT Sample flow with a new name; several seconds (120s?) later, the error below appears in the Jobs tab:

19/06/16 13:56:07 INFO ShutdownHookManager: Shutdown hook called
19/06/16 13:56:07 INFO ShutdownHookManager: Deleting directory /tmp/spark-8dd197f7-3687-4445-93d7-ddb1a1d357ef
19/06/16 13:56:07 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
19/06/16 13:56:07 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
19/06/16 13:56:07 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
19/06/16 13:56:07 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
stderr:
YARN Diagnostics:
java.lang.Exception: No YARN application is found with tag livy-batch-8-yikjsvds in 120 seconds. Please check your cluster status, it is may be very busy.
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
scala.Option.getOrElse(Option.scala:120)
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
org.apache.livy.Utils$$anon$1.run(Utils.scala:97)
