
piflow's Introduction



πFlow is an easy-to-use, powerful big data pipeline system.


Features

  • Easy to use
    • provide a WYSIWYG web interface to configure data flow
    • monitor data flow status
    • check the logs of data flow
    • provide checkpoints
  • Strong scalability:
    • Support customized development of data processing components
  • Superior performance
    • based on distributed computing engine Spark
  • Powerful
    • 100+ data processing components available
    • include Spark, MLlib, Hadoop, Hive, HBase, TDengine, OceanBase, openLooKeng, TiDB, Solr, Redis, Memcache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, etc.

Architecture

Requirements

  • JDK 1.8
  • Scala-2.12.18
  • Apache Maven 3.1.0 or newer
  • Spark-3.4.0
  • Hadoop-3.3.0

Compatible with both x86 and ARM architectures; supports deployment on CentOS and Kylin OS.

Getting Started

To Build:

  • install external package

        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
    
  • mvn clean package -Dmaven.test.skip=true

        [INFO] Replacing original artifact with shaded artifact.
        [INFO] Reactor Summary:
        [INFO]
        [INFO] piflow-project ..................................... SUCCESS [  4.369 s]
        [INFO] piflow-core ........................................ SUCCESS [01:23 min]
        [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
        [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
        [INFO] piflow-server ...................................... SUCCESS [02:05 min]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 06:01 min
        [INFO] Finished at: 2020-05-21T15:22:58+08:00
        [INFO] Final Memory: 118M/691M
        [INFO] ------------------------------------------------------------------------
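The four install:install-file commands in the first step can also be generated from a small table, so adding another bundled jar costs only one line. This is an illustrative sketch, not part of the official build: it only prints the commands; from the piflow checkout root, pipe the output to sh to run them.

```shell
# Print one mvn install:install-file command per bundled jar.
# Paths assume the current directory is the piflow checkout root.
cmds=$(while read -r file group artifact version; do
  printf 'mvn install:install-file -Dfile=piflow-bundle/lib/%s -DgroupId=%s -DartifactId=%s -Dversion=%s -Dpackaging=jar\n' \
    "$file" "$group" "$artifact" "$version"
done <<'EOF'
spark-xml_2.11-0.4.2.jar com.databricks spark-xml_2.11 0.4.2
java_memcached-release_2.6.6.jar com.memcached java_memcached-release 2.6.6
ojdbc6-11.2.0.3.jar oracle ojdbc6 11.2.0.3
edtftpj.jar ftpClient edtftp 1.0.0
EOF
)
echo "$cmds"
```

The groupId/artifactId/version triples simply mirror the four commands above.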
    

Run πFlow Server:

  • run piflow server on Intellij:

    • download piflow: git clone https://github.com/cas-bigdatalab/piflow.git

    • import piflow into Intellij

    • edit config.properties file

    • build piflow to generate piflow jar:

      • Edit Configurations --> Add New Configuration --> Maven
      • Name: package
      • Command line: clean package -Dmaven.test.skip=true -X
      • run 'package' (piflow jar file will be built in ../piflow/piflow-server/target/piflow-server-0.9.jar)
    • run HttpService:

      • Edit Configurations --> Add New Configuration --> Application
      • Name: HttpService
      • Main class : cn.piflow.api.Main
      • Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6 (change the path to your Spark home)
      • run 'HttpService'
    • test HttpService:

      • change the piflow server IP and port in the client to match your configuration
      • run /../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
  • run piflow server by release version:

    • download piflow.tar.gz:
      https://github.com/cas-bigdatalab/piflow/releases/download/v1.2/piflow-server-v1.5.tar.gz

    • unzip piflow.tar.gz:
      tar -zxvf piflow.tar.gz

    • edit config.properties

    • run start.sh, stop.sh, restart.sh, status.sh

    • test piflow server

      • set PIFLOW_HOME
        • vim /etc/profile
          export PIFLOW_HOME=/yourPiflowPath
          export PATH=$PATH:$PIFLOW_HOME/bin

        • command
          piflow flow start example/mockDataFlow.json
          piflow flow stop appID
          piflow flow info appID
          piflow flow log appID

          piflow flowGroup start example/mockDataGroup.json
          piflow flowGroup stop groupId
          piflow flowGroup info groupId

  • how to configure config.properties

    #spark and yarn config
    spark.master=yarn
    spark.deploy.mode=cluster
    
    #hdfs default file system
    fs.defaultFS=hdfs://10.0.86.191:9000
    
    #yarn resourcemanager.hostname
    yarn.resourcemanager.hostname=10.0.86.191
    
    #if you want to use hive, set hive metastore uris
    #hive.metastore.uris=thrift://10.0.88.71:9083
    
    #show data in log, set 0 if you do not want to show data in logs
    data.show=10
    
    #server port
    server.port=8002
    
    #h2db port
    h2.port=50002
    
    #If you want to upload a Python stop, set the HDFS configs
    #example hdfs.cluster=hostname:hostIP
    #hdfs.cluster=master:127.0.0.1
    #hdfs.web.url=master:50070
    

Run πFlow Web:

  vim /usr/lib/systemd/system/docker.service
  ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
  systemctl daemon-reload
  systemctl restart docker
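After restarting Docker it is worth checking that the remote API actually answers on the configured port. This is an optional sketch; the host and port are taken from the ExecStart line above and may differ in your setup.

```shell
# Probe the Docker remote API; prints "up" or "down" instead of failing,
# so it is safe to run even when Docker is not listening on TCP.
if curl -fsS --max-time 2 http://127.0.0.1:2375/version >/dev/null 2>&1; then
  api_status=up
else
  api_status=down
fi
echo "docker remote API: $api_status"
```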

RESTful API:

  • flow json

    flow example
        
        {
          "flow": {
            "name": "MockData",
            "executorMemory": "1g",
            "executorNumber": "1",
            "uuid": "8a80d63f720cdd2301723b7461d92600",
            "paths": [
              {
                "inport": "",
                "from": "MockData",
                "to": "ShowData",
                "outport": ""
              }
            ],
            "executorCores": "1",
            "driverMemory": "1g",
            "stops": [
              {
                "name": "MockData",
                "bundle": "cn.piflow.bundle.common.MockData",
                "uuid": "8a80d63f720cdd2301723b7461d92604",
                "properties": {
                  "schema": "title:String, author:String, age:Int",
                  "count": "10"
                },
                "customizedProperties": {}
              },
              {
                "name": "ShowData",
                "bundle": "cn.piflow.bundle.external.ShowData",
                "uuid": "8a80d63f720cdd2301723b7461d92602",
                "properties": {
                  "showNumber": "5"
                },
                "customizedProperties": {}
              }
            ]
          }
        }

  • CURL POST:
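A sketch of such a POST, using the MockData flow shown above. The /flow/start endpoint path and the port are assumptions (8002 matches server.port in the sample config.properties); check your piflow server's API for the exact route.

```shell
# Write the example flow definition to a file. The JSON mirrors the
# MockData flow example above.
cat > /tmp/mockDataFlow.json <<'EOF'
{
  "flow": {
    "name": "MockData",
    "uuid": "8a80d63f720cdd2301723b7461d92600",
    "driverMemory": "1g",
    "executorMemory": "1g",
    "executorNumber": "1",
    "executorCores": "1",
    "paths": [
      {"inport": "", "from": "MockData", "to": "ShowData", "outport": ""}
    ],
    "stops": [
      {
        "name": "MockData",
        "bundle": "cn.piflow.bundle.common.MockData",
        "uuid": "8a80d63f720cdd2301723b7461d92604",
        "properties": {"schema": "title:String, author:String, age:Int", "count": "10"},
        "customizedProperties": {}
      },
      {
        "name": "ShowData",
        "bundle": "cn.piflow.bundle.external.ShowData",
        "uuid": "8a80d63f720cdd2301723b7461d92602",
        "properties": {"showNumber": "5"},
        "customizedProperties": {}
      }
    ]
  }
}
EOF

# With a running piflow server, the flow could then be submitted like this
# (commented out here because it needs a live server; the endpoint is an assumption):
# curl -X POST -H "Content-Type: application/json" \
#      --data @/tmp/mockDataFlow.json http://127.0.0.1:8002/flow/start
```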

  • Command line:

    • set PIFLOW_HOME
      vim /etc/profile
      export PIFLOW_HOME=/yourPiflowPath/piflow-bin
      export PATH=$PATH:$PIFLOW_HOME/bin

    • command example
      piflow flow start yourFlow.json
      piflow flow stop appID
      piflow flow info appID
      piflow flow log appID

      piflow flowGroup start yourFlowGroup.json
      piflow flowGroup stop groupId
      piflow flowGroup info groupId

Docker Quick Start

  • pull piflow images
    docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v1.5

  • show docker images
    docker images

  • run a container with the piflow imageID; all services start automatically. Set HOST_IP and the Docker options shown below.
    docker run -h master -itd --env HOST_IP=*.*.*.* --name piflow-v1.5 -p 6001:6001 -v /usr/bin/docker:/usr/bin/docker -v /var/run/docker.sock:/var/run/docker.sock --add-host docker.host:*.*.*.* [imageID]

  • visit "HOST_IP:6001" in a browser; it may take a while for the services to come up

  • if something goes wrong, all the applications are in the /opt folder

User Interface

  • Login:

  • Dashboard:

  • Flow list:

  • Create flow:

  • Configure flow:

  • Load flow:

  • Monitor flow:

  • Flow logs:

  • Group list:

  • Configure group:

  • Monitor group:

  • Process List:

  • Template List:

  • DataSource List:

  • Schedule List:

  • StopHub List:

Contact Us

piflow's People

Contributors

1017729642, airzihao, bao319, bbbbbbyz, bluejoe2008, chancexin, coco11563, cool-hacker00, day0n, dependabot[bot], hc-teemo, hulululu910, jingtiantian, judy0131, kyofin, leishu-521, lj044500, mayn1y, sosoll7, tianyao-0315, xiaoxiaocn, yanfqidong0604, yg000


piflow's Issues

Icon missing after mounting a custom component

After a custom component is mounted, its icon does not appear on the flow editor page. Restarting piflow-server and then unmounting and remounting the component makes the icon show up.
Error reported the first time: java.io.IOException: Stream closed

What should the driver field be when configuring a JDBC data source?

The mysqlRead settings in the documentation have no driver option, but in practice both mysqlRead and the JDBC data source require a driver value.
I tried the driver's path on the server as well as the driver file name (mysql-connector-java), but both result in java.lang.ClassNotFoundError.

java.lang.ClassNotFoundException: websocket.drawboard.Room$Player when starting the piflow web service

01-Mar-2022 11:47:59.396 SEVERE [localhost-startStop-1] org.apache.catalina.core.ContainerBase.addChildInternal ContainerBase.addChild: start:
org.apache.catalina.LifecycleException: Failed to start component [StandardEngine[Catalina].StandardHost[localhost].StandardContext[/examples]]
at org.apache.catalina.util.LifecycleBase.handleSubClassException(LifecycleBase.java:440)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:198)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:743)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:719)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:705)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1125)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1858)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: websocket/drawboard/Room$Player
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethods(Class.java:1975)
at org.apache.tomcat.websocket.pojo.PojoMethodMapping.&lt;init&gt;(PojoMethodMapping.java:86)
at org.apache.tomcat.websocket.server.WsServerContainer.addEndpoint(WsServerContainer.java:155)
at org.apache.tomcat.websocket.server.WsServerContainer.addEndpoint(WsServerContainer.java:130)
at org.apache.tomcat.websocket.server.WsSci.onStartup(WsSci.java:122)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5144)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
... 10 more
Caused by: java.lang.ClassNotFoundException: websocket.drawboard.Room$Player
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1358)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1180)
... 19 more

Problem with saving flows concurrently

Suppose two users A and B open the same flow (say it has a single component) at the same time. A then builds a complex pipeline, while B's screen still shows only the one component. If B merely drags that one component around, A's complex pipeline is lost.

Problems and fixes when using Scala 2.12.10 and Spark 3.0 with piflow 1.0

1. With the source downloaded from GitHub, directly changing the Scala and Spark versions leads to the following problem on login: (screenshot)
2. Using the dependencies from https://github.com/cas-bigdatalab/piflow/tree/piflow_spark-3.0.0 makes login work, but the stop list is empty and the server reports the following error: (screenshot)
3. Changing the net.liftweb dependency in piflow-bundle, piflow-core and piflow-configure to the version below and reloading fixes the stop problem:

<dependency>
  <groupId>net.liftweb</groupId>
  <artifactId>lift-json_${scala.big.version}</artifactId>
  <version>3.4.1</version>
</dependency>

Web visualization enhancements

For the web UI, it would be helpful to add a SQL Editor, i.e. lightweight online editing and display features similar to Hue, Zeppelin, or DataGear.

Rework the DataFrame passed between stops so it can carry attributes

With attribute-less DataFrames, even passing a single string requires building a new DataFrame.
Consider this scenario:

A series of DataFrames is passed from upstream, and each DataFrame needs a different operation.

If the DataFrames passed along do not carry attribute fields, this requirement cannot be implemented.

More components should be supported

1. Support reading from and writing to Kafka.
2. Support upsert for mainstream databases such as MySQL, PostgreSQL and Oracle.
References:
MySQL: https://blog.csdn.net/a544258023/article/details/94029334
PostgreSQL: https://stackoverflow.com/questions/34643200/spark-dataframes-upsert-to-postgres-table
Oracle: MERGE INTO syntax (no existing implementation found yet; it should be similar to the MySQL approach)
Hopefully the team can extend this part of the Spark logic to support upsert with primary keys; in practice this requirement is quite common.
3. Reading and writing graph databases (JanusGraph, Neo4j) via Gremlin/Cypher.

piflow-run command

Provide a piflow-run command-line tool.

usage: piflow run abc.flow

This command executes abc.flow and outputs log messages.

Expired QR code

Hi, just a heads-up: the QR code at the end of README.md has expired and can no longer be used.

Allow specifying which executor runs a flow

Unlike NiFi's master/worker model, piflow should be able to run on an executor on a designated machine.
This would make it possible to read and write files located on a specific host.
