- Table comparison for EMR vs Cloudera on EC2 (AWS)
- Pros and Cons if switching to EMR under cost cutting situation
- How to mitigate the "Cons" when switching to EMR. List out the shortfall for the things that cannot be achieved and the workaround for them.
AWS EMR | Cloudera on EC2 | |
---|---|---|
License Cost Impact | No licenses cost, just EC2 + EMR rate costs | Enterprise licenses + EC2 costs |
Scalability | Can do auto scaling + spot instance | Due to license per node, not so easy to do auto scaling |
Storage Impact Filesystem (HDFS) and S3 | S3 (EMRFS), Compute and Storage are loosely decoupled. S3 is the primary storage for the entire data lake. HDFS will become buffer storage. Much lower storage cost on S3 than EBS. Link | HDFS is the primary storage, require EBS disk permanently provisioned. Compute and Storage cannot be decoupled |
Controller of Cluster | Ganglia UI, a web console for monitoring, and managing cluster will be through AWS EMR portal, UI functionality are much less than Cloudera Manager, some orchestration tasks require manual scripting | Cloudera Manager UI, very flexible controller to manage the entire cluster |
AWS Ecosystem integration | Better integrate with other AWS components, such as Kinesis, S3 (EMRFS) and IAM (Identity and Access Management) | Cloudera also has S3Connector, it can only using S3a:// protocol, instead of s3:// (EMRFS), which has different compatibility level, certain DDL cannot be done on s3a:// |
Security | AWS IAM + Ranger + AWS KMS. It covers security, it may not have as fine-grained control as cloudera navigator. Link | Cloudera Navigator + Trustee KMS server + Sentry to protects the clusters, which has more fine-grain level of controls |
Security – Ranger vs Sentry | Though it is from beginning of 2016, at least is done by a third party. Scroll to the bottom, and click on the image for the comparison table. Link | |
Audit | Ranger logs HDFS and Hive access + AWS Cloudtrail for any API call activity to the cluster. In comparison to Navigator, the key question is how fast Ranger in the community is growing, to what extent what hadoop ecosystem it is covering. Link | Cloudera Navigator logs everything happen to the cluster + logging to HDFS, Hive, Impala, etc. It has a comprehensive architecture, as the cloudera agent talks to the audit server. Link |
Encryption | At-rest data encryption. In-transit data encryption. HDFS transparent encryption. Server-side / client side encrpytion. Link | Also has full set of encryption functionality, using Cloudera Navigator, Trustee Key Server, HSM VM, etc |
Interactive Query End user Authentication and Authorization | EMR does not support Impala natively, instead it relies on Presto and Hive. Ranger supports Hive on Tez at the moment, so Hive will be the main query engine used for data explorer / scientist. Until Presto has better authorization mechanism for interactive query, or need to investigate into Data virtualization like Denodo for security governance. Like slide 44 (Data Lake security in below slideshare) | Because Sentry can cover Impala's access control, it is ready for data explorer / scientist to safely consume the services. |
High Availability | Currently no HA for Master node, AWS mention it will be available in Q4 2017. For other data nodes, it can recover crashed node automatically | Cloudera Manager have feature to do NameNode HA, and HA for other Hadoop components |
Backup and Restore | EMR relies on S3 ( EMRFS ), and S3 can do versioning if required. For Hive metadata, it will be stored in external MySQL database, which we can do daily backup or restore. | Cloudera Manager can do HDFS snapshotting. Using external database to store metadata of Hive as well. |
DR | EMRFS is natively sitting on S3, S3 is regionally available across AZs, meaning no copy data is needed during DR, what will happen is to re-spin up a EMR cluster another AZ, as the Hive tables are all external tables sitting in S3. | Using Cloudera BDR, replicating data to S3, during DR, replicate back from S3 to HDFS. The process is managed through Cloudera Manager. |
Informatica Compatibility | Compatible with EMR. Link | Compatible with Cloudera |
BI tools Compatibility | Hive has JDBC / ODBC driver for connectivity | Impala has JDBC / ODBC driver for connectivity |
Kafka vs Kinesis | AWS provide an alternative to Kafka, which is Kinesis and it is a PaaS, which can integrate with EMR. E.g. Inside a Spark streaming, you can refer to an external table which is Kinesis stream through a driver written by AWS. Link | Cloudera Manager support running Kafka through managing the VMs |
Sizing of instances Impact | Following EMR guidelines, in general the instance sizes are smaller because of using EMRFS, as the maintenance of filesystem are mainly on S3 now, HDFS are there but only acts as temporary storage. Link | Cloudera has its guideline of sizing the VM instances, in general slightly higher VM specs are required. |
OS Impact | EMR use Amazon Linux 64bit by default, as it is a semi-PaaS from infrastructure perspective, it also supports custom AMI, but the custom AMI has to be based on Amazon Linux 64bit. So the hardening will be performed on top of Amazon Linux 64bit. Link | It will be based on Redhat linux |
Version upgrade approach | Will need to use a "Blue-green" update method with Route 53, having a global DNS mapping to the cluster endpoint, then spill up new cluster with new version, and change global DNS to point to new cluster, and destroy old one. For more details on EMR endpoint using Route 53, here is one of the blog talking about it, with private DNS server and private zone record. Link | Cloudera Manager has nicely built-in rolling upgrade feature. |
How to leverage S3 storage more | EMR using s3:// (EMRFS) that is fully compatible with S3, as it is a proprietary driver by AWS. It can use S3 like a HDFS filesystem. | Cloudera option, we are currently already leveraging S3 as cold archive as an external data store, but cannot use S3 as a HDFS replacement |
AWS Future Component | AWS Redshift spectrum, directly query S3 storage data, provide JDBC / ODBC and standard database security. AWS Glue (preview), which is AWS version of ETL to do transformation |
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns Link
Challenge for migrating to EMR from other Big Data technologies Link
Pros for AWS EMR
- Auto scaling cluster, use of spot instance. Because of compute and storage separation, can easily scale according to burst demand
- No license cost, included as part of using AWS benefit
- EMR is being treated as transient cluster, as all permanent data are stored on S3, therefore reducing EBS disk cost
- Better AWS Ecosystem integration
- DR failover time should be less than 45 mins, because all data are stored on S3 across AZs already
Cons for AWS EMR
- Losing data locality nature in Hadoop, using S3 instead of local HDFS, for some type of queries, there will be some performance drawback.
- For interactive query authentication / authorization, EMR use Ranger, which supports only Hive. Comparing to Impala with Sentry on cloudera, the performance difference become Impala vs Hive on Tez. Impala should be faster.
- High availability of Master Node, EMR does not yet provide HA for name node until Q4 2017, meaning all ETL transformation result should be saved back to S3, HDFS inside EMR can only be used as temporary and buffer storage
- Ease of use UI / controller , EMR portal UI is not as comprehensive as Cloudera Manager
- Security control UI, the EMR does not as fine grain level of security settings control as Cloudera Manager, instead it provides security configuration at launch wizard.
Mitigation of some of the Cons for AWS EMR
- Interactive query authentication / authorization If Hive on Tez is not fast enough, will need to consider "Redshift" and "Presto", and adding data virtualization layer like Denodo for authorization purpose. This will increase costs though.
- High availability If costs allow, spin up another EMR cluster for interactive query, as all data are in S3 (EMRFS), and AWS allows multiple EMR cluster pointing to the same S3 data.
Appendix – screenshots
Ganglia UI for monitoring EMR cluster
Ranger – RBAC access control for Hive / HDFS
EMRFS
Informatica BDM on AWS