- Lets you use the power of Spark without actually writing Spark code.
- Includes many workflows that help you write code and get your results in just a few lines.
- For power users, it allows you to tweak every step in the flow.
This assumes you have access to an Apache Spark cluster (and a Cassandra cluster, if working with the Cassandra workflow).
Clone the repo and build with the command:

python setup.py install

To uninstall:

sudo pip uninstall simple_spark_lib
# First, import your libraries
from simple_spark_lib import SimpleSparkCassandraWorkflow
# Define the connection configuration for Cassandra
cassandra_connection_config = {
    'host': '192.168.56.101',
    'username': 'cassandra',
    'password': 'cassandra'
}
# Define the Cassandra schema information
cassandra_config = {
    'cluster': 'rootCSSCluster',
    'tables': {
        # <table alias (Spark's temporary table name)>: '<keyspace>.<table_name>'
        'api_events': 'simpl_events_production.api_events',
    }
}
# Initialize your workflow
workflow = SimpleSparkCassandraWorkflow(appName="Simple Example Worker")
# Set up the workflow with the configurations
workflow.setup(cassandra_connection_config, cassandra_config)
# Run your favourite query
df = workflow.process(query="SELECT * FROM api_events")
print(df.take(10))
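The keys of the `tables` mapping become Spark temporary-table names, while each value is a fully qualified `<keyspace>.<table_name>` string. The helper below is a hypothetical sketch of that naming convention, not part of simple_spark_lib; it only illustrates how an alias/value pair splits into keyspace and table:

```python
# Hypothetical sketch (not part of simple_spark_lib): split each
# '<keyspace>.<table_name>' value into its keyspace and table parts.
cassandra_config = {
    'cluster': 'rootCSSCluster',
    'tables': {
        'api_events': 'simpl_events_production.api_events',
    }
}

def split_table_mapping(config):
    """Map each Spark table alias to its keyspace and table name."""
    mappings = {}
    for alias, qualified_name in config['tables'].items():
        keyspace, table = qualified_name.split('.', 1)
        mappings[alias] = {'keyspace': keyspace, 'table': table}
    return mappings

print(split_table_mapping(cassandra_config))
# → {'api_events': {'keyspace': 'simpl_events_production', 'table': 'api_events'}}
```

This is why the example query above can refer to the table simply as `api_events`: Spark resolves the alias, while the library is assumed to route the read to the configured keyspace and table.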
Run this example with the command:
simple-runner filename.py -d cassandra