astronomer / astronomer
Helm Charts for the Astronomer Platform, Apache Airflow as a Service on Kubernetes
Home Page: https://www.astronomer.io
License: Other
@aloisbarreras created an MVP with a Node API and Express. Adding real-time ingestion from other sources using the clickstream infrastructure would be great.
@schnie commented on [Tue Oct 03 2017]
We've discussed this internally several times and Alooma has a similar feature. We'd like to be able to optionally slot a mapper function in the middle of our streaming pipelines. This could be useful for clickstream as well as our standard pipelines built on Kafka Connect. Potential use cases would be light data transformations, or enrichment. We're getting this request from customers and potential clients a lot lately.
Some open questions:
@willastronomer commented on [Wed Nov 29 2017]
We could create consume/produce pairs for different languages, create an interface for them that the customer implements, and then pull that code in between the event-router and the integration-worker.
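A minimal sketch of that consume/map/produce shape, assuming an in-memory pipeline; the `Mapper` interface and `run_pipeline` names are hypothetical, not an existing API:

```python
from abc import ABC, abstractmethod


class Mapper(ABC):
    """Customer-implemented hook slotted between consume and produce."""

    @abstractmethod
    def map(self, event):
        """Transform or enrich an event dict; return None to drop it."""


class AddSourceTag(Mapper):
    """Example mapper: light enrichment of each event."""

    def map(self, event):
        event["source"] = "clickstream"
        return event


def run_pipeline(events, mapper):
    # event-router side: consume each event, apply the customer's mapper,
    # then hand off to the integration-worker side (here, just a list).
    out = []
    for event in events:
        mapped = mapper.map(dict(event))
        if mapped is not None:
            out.append(mapped)
    return out
```

The point of the interface is that the platform owns consume and produce, and the customer only supplies the `map` body.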
It would be helpful if Airflow could alert us when it can't schedule a task for X minutes, instead of just silently spinning. If we had a Prometheus counter for tasks scheduled, the alert could be based on that.
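An illustrative sketch of the alert condition in plain Python (not Airflow code): track when a task was last scheduled and fire if nothing has been scheduled inside the window. With a real Prometheus counter (e.g. a hypothetical airflow_tasks_scheduled_total), the equivalent alert rule would check that the counter's rate is zero over the same window.

```python
import time


class SchedulerWatch:
    """Fires when no task has been scheduled for `threshold_minutes`."""

    def __init__(self, threshold_minutes):
        self.threshold = threshold_minutes * 60
        self.last_scheduled = time.monotonic()

    def task_scheduled(self):
        # Called wherever the scheduler would increment the counter.
        self.last_scheduled = time.monotonic()

    def should_alert(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_scheduled > self.threshold
```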
Set up CircleCI to build and push docs.
Get event-api in shape for open/EE.
@schnie commented on [Wed Nov 08 2017]
Several customers/prospects have been asking about cross-domain analytics and how they can track user actions across multiple domains.
An example use case is a company that uses a micro-site or blog to drive traffic into their main site to purchase products. This company could pump events from both sites into schemas within a single data warehouse or cluster and run queries that JOIN the two datasets to see how events/pageviews on one site drive sales on the other.
A solution would be to append a crossDomainId, somewhere in our event processing pipeline, to all events originating from a domain that the customer has set up in our app.
A simple example query could look like:
SELECT count(*)
FROM my_blog.pages
INNER JOIN my.bid_on_item
ON my_blog.pages.context_traits_crossDomainId =
my.bid_on_item.context_traits_crossDomainId
WHERE my_blog.pages.name = 'Introducing Collectibles' AND
my.bid_on_item.product_category = 'Collectible'
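The appending step described above could be a small enrichment in the event pipeline. This sketch assumes a per-customer mapping of domains to a shared ID configured in the app; all names and the event shape are illustrative:

```python
# Hypothetical mapping configured in the app: every domain a customer
# registers points at one shared cross-domain ID.
CROSS_DOMAIN_IDS = {
    "blog.example.com": "cust-123",
    "www.example.com": "cust-123",
}


def append_cross_domain_id(event):
    """Attach context.traits.crossDomainId based on the event's host."""
    domain = event.get("context", {}).get("page", {}).get("host")
    cross_id = CROSS_DOMAIN_IDS.get(domain)
    if cross_id is not None:
        traits = event.setdefault("context", {}).setdefault("traits", {})
        traits["crossDomainId"] = cross_id
    return event
```

Events from both domains then carry the same context_traits_crossDomainId column in the warehouse, which is what the JOIN above relies on.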
@schnie commented on Mon Dec 11 2017
Either create a single user that we use everywhere in base, or come up with a common way/order of adding users to application images.
Full documentation for using the images can be found here.
The project is licensed under the Apache 2 license. For more information on the licenses for each of the individual Astronomer Platform components packaged in the images, please refer to the respective Astronomer Platform documentation for each component.
This URL returns 404: https://astronomerio.github.io/astronomer/
This section of the documentation is hard to follow:
https://open.astronomer.io/airflow/index.html
I noticed this commit 31f95f4#diff-364439c8141492fe1670af4f51c33125 removed the start/stop scripts.
What is the recommended way to start the services now? The commit message mentions the CLI but I could not find any documentation on using the CLI with Open.
I tried docker-compose up, but it does not seem to be pointing to the onbuild image that reads requirements.txt and packages.txt, so dependencies are not installed.
The document also mentions the .astro directory, which I don't believe is created by this version of the platform.
@schnie commented on Sat Dec 09 2017
Our entrypoint script currently uses some bash features. It'd be nice if we could just use the default shell, BusyBox ash. That removes another dependency and reduces surface area.
Include a Prometheus client with the ingestion API to maintain a count of messages consumed.
Set up a system of enabling/disabling features for permissions, release management, and billing authorization.
Users can create/invite and manage users and teams, and assign them access roles with CRUD permissions on various actions in the app.
Our grafana dashboard for the scheduler/workers shows some inaccurate numbers. See: https://issues.apache.org/jira/browse/AIRFLOW-774.
We should fix those bugs on our fork, release it, and submit a PR to apache.
This is kind of important, but since people aren't actually using it in production, there are no complaints. We should get ahead of it.
Users can create and manage airflow deployments.
Users can sign up and log in to the app.
We've heard this question from a few prospects and think it's relevant enough to add to docs.
Can someone answer this/write the guide, and then add it to the docs site?
A user may prefer to develop and work on only Airflow. We need to create an example that loads a bare-bones install of the Airflow Scheduler and Webserver along with the Postgres DB.
The Postgres DB is required if we want users to be able to demo Airflow DAG concurrency.
https://segment.com/blog/exactly-once-delivery
The big takeaway here is that we should be partitioning by messageId so that each message gets processed by the same worker every time; then we can build a log of what messages we have already seen. If we haven't seen a message, we pass it along. If we have, we drop it.
This is more about de-duping messages that were sent multiple times (a true dupe). Kafka can give us exactly-once processing, but if there are duplicate messages, that would just guarantee that we process each of the duplicates once.
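A sketch of that idea: hash the messageId to pick a partition so duplicates always land on the same worker, then keep a per-worker log of seen IDs. The hashing stands in for Kafka's partitioner, and all names here are illustrative:

```python
import hashlib


def partition_for(message_id, num_partitions):
    """Deterministically route a messageId to one partition/worker."""
    digest = hashlib.sha256(message_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class Deduper:
    """Per-worker log of seen messageIds; drops true dupes."""

    def __init__(self):
        self.seen = set()  # in practice a bounded/persistent store

    def process(self, message):
        msg_id = message["messageId"]
        if msg_id in self.seen:
            return None  # already seen: drop it
        self.seen.add(msg_id)
        return message  # first sighting: pass it along
```

Because routing is deterministic, the `seen` set only ever needs to be local to one worker, which is what makes the dedup log tractable.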
Simplify the Airflow EE install; it could be a bash script, Go binary, or Python script.
@cwurtz made an API plugin that covers a few use cases (list DAGs, list DAG runs, backfill using the executor) and can be extended; we want to add it to our Airflow distribution.
Compare with the functionality of https://github.com/teamclairvoyant/airflow-rest-api-plugin/blob/master/plugins/rest_api_plugin.py, which just makes calls to the Python CLI.
@tedmiston commented on Fri Dec 15 2017
It might be possible to get this info through Flower or the Airflow statsd integration?
@tedmiston commented on Mon Dec 18 2017
@schnie Can you add anything you've learned here recently? I think you were working with statsd in Airflow. Is there overlap with that and how we use Prometheus?
@schnie commented on Wed Dec 20 2017
@tedmiston I don't have anything currently that will track memory usage at a task level over time, but I think we could get live monitoring of the Celery workers into Prometheus/Grafana with cAdvisor. I'll tinker around some more and report back.
Users can see a history of all user activity in the app, by user, team, and deployment.
As a superadmin, I want to see a log of analytics.js builds so that I can troubleshoot potential issues with the builder.
Tweak our install requirements to allow minor version upgrades to Celery to support the new RC.
Requested by a user in astronomer/airflow#23 (comment).
Users can manage their account and profile info.
We are looking to replace the Airflow-based Redshift loader with some combination of Flink, Kafka, and Spark.
Fix up event-router for open/EE.
Currently we take any version of Celery >=4.0.2; however, there is significant discussion on the Airflow dev list of stability issues with the latest release of Celery (4.1.0), e.g., the recent thread "Airflow looses track of Dag Tasks" and the issue apache/airflow#2806, where a core contributor recommended pinning to 4.0.
Our dependency is https://github.com/astronomerio/incubator-airflow/blob/148c8f4a26bcc1be745534e0a5981982202db66e/setup.py#L100-L103 via https://github.com/astronomerio/astronomer/blob/f03223f773c93cd09417044742722679dd8b97a8/docker/platform/airflow/Dockerfile#L26
The puckel/docker-airflow repo that we forked docker-airflow-saas and docker-airflow-clickstream from pins to ==4.0.2 as well.
@schnie @andscoop @cwurtz @ryw I think we should pin to 4.0.2 like everyone else for now for similar reasons. Anyone opposed to that?
I've been using your Open setup with moderate success, but I've spent a lot of time reconciling dependencies. It would be great if, instead of a single requirements.txt, you could support multiple virtual environments. I don't believe this is possible out of the box now, but if it is, please let me know.
To add more context, this is to be able to run Singer taps and targets, which are designed to be run from the command line (as opposed to as a Python module). Another package I had trouble with was dbt.
I realize there is a Python operator that has support for virtual environments but given the fact these tools are run from the command line, I am not sure that would serve my use case.
As a workaround, I created a new Dockerfile based on astronomerinc/ap-airflow:latest-onbuild and installed target-stitch and dbt in their separate venvs. I ran a quick test and that seems to be working OK.
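For the task side of that workaround, invoking each CLI tool from its own venv can be a small helper. This sketch assumes the venvs were baked into the image at hypothetical paths like /venvs/target-stitch and /venvs/dbt:

```python
import os
import subprocess


def venv_cmd(venv_path, tool, *args):
    """Build an argv that runs a tool from a specific venv's bin dir,
    bypassing whatever is on the global PATH."""
    return [os.path.join(venv_path, "bin", tool), *args]


def run_in_venv(venv_path, tool, *args):
    """Run the tool and raise if it exits non-zero."""
    return subprocess.run(venv_cmd(venv_path, tool, *args), check=True)


# e.g., from a BashOperator-style task (paths are assumptions):
#   run_in_venv("/venvs/dbt", "dbt", "run")
#   run_in_venv("/venvs/target-stitch", "target-stitch", "--config", "cfg.json")
```

Calling the venv's bin entry point directly means each tool resolves its own dependencies without activating the environment or touching the base image's packages.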
@tedmiston I think you have a checklist somewhere for how to add new clickstream destinations. Can you add it to https://github.com/astronomerio/astronomer/blob/master/docs/pages/clickstream/add_destination.md?
Get event-forwarded ready for open/EE.
We have good content around installing Astronomer Airflow in Kubes, but we don't have very many videos or much material on actually using Airflow in the wild.
After a conversation with a customer, we need to be distinguishing Open from EE. Ideas around this are:
Add warning to docs to highlight that Open is meant for testing and viewing of EE internals, not for dev or production workflows. For this they should be pointed to the Astronomer EE CLI.
Change the airflow-enterprise example to airflow-ee-bundle in order to remove the impression that the example is anywhere near feature parity with an enterprise install.
This is somewhat in line with #39.
Users can create and manage organizations.