

Data Engineering with Spark⭐ and Hadoop🐘

Big Data🛢️ with Hadoop🐘 and Spark⭐, part of the IBM Data Engineering Professional Certificate




The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hands-on Lab: Getting Started with Hive: Hive is data warehouse software in the Hadoop ecosystem, designed to read, write, and manage large, tabular datasets for data analysis.
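
The lab works with Hive itself, but the same kind of managed table can also be reached from PySpark's Hive integration. A minimal sketch, assuming a Spark build with Hive support; the table name and columns are illustrative:

```python
# Minimal sketch: create and query a Hive-managed table from PySpark.
# Assumes Spark was built with Hive support; table name and schema are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveIntro")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary DOUBLE)")
spark.sql("INSERT INTO employees VALUES (1, 'Alice', 75000.0)")
spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()

spark.stop()
```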

Hands-on lab on Hadoop Map-Reduce: MapReduce is a programming pattern that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. MapReduce is a processing technique and a programming model for distributed computing, originally implemented in Java. Distributed computing is a system with multiple components located on different machines; each component has its own job, but the components communicate with each other to appear as one system to the end user. The MapReduce algorithm consists of two important tasks, Map and Reduce. Many MapReduce programs are written in Java, but MapReduce jobs can also be coded in C++, Python, Ruby, R, and so on.
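
As a concrete illustration, the classic word count can be written as a pair of Python scripts in the Hadoop Streaming style. A minimal sketch; the file names and the streaming invocation are illustrative, not the lab's exact commands:

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts per word; Hadoop delivers the mapper output sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

With Hadoop Streaming, these two scripts are passed as the mapper and reducer of the streaming job, which handles splitting the input, shuffling, and sorting between the two phases.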

Hands-on lab on Hadoop Cluster: A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. The NameNode is the master node of the Hadoop Distributed File System (HDFS). It maintains the metadata of the files in RAM for quick access.
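
For orientation, basic HDFS operations can be driven from Python by shelling out to the hdfs client. A small sketch, assuming the hdfs command-line client is installed and configured; the paths are illustrative:

```python
# Sketch: create a directory on HDFS, upload a local file, and list the result.
# Assumes the `hdfs` command-line client is on the PATH and points at the cluster.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "local_file.txt", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```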




Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Its versatility, offering distributed data processing, real-time streaming, and in-memory processing capabilities, makes it a powerful choice for a wide range of data processing tasks.

Hands-on Lab: Getting Started with Spark using Python: Introduction to Spark.
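
A minimal sketch of what getting started looks like in PySpark; the application name and the toy computation are arbitrary:

```python
# Start a SparkSession and run a trivial distributed computation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GettingStarted").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
print(rdd.sum())  # 45

spark.stop()
```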

Hands-on Lab: Introduction to DataFrames: A DataFrame is two-dimensional. Columns can be of different data types. DataFrames accept many data inputs, including series and other DataFrames. You can pass indexes (row labels) and columns (column labels). Indexes can be numbers, dates, or strings/tuples.
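
A short PySpark sketch of those ideas; the column names and rows are made up for illustration:

```python
# Build a DataFrame from in-memory rows; each column can hold a different data type.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameIntro").getOrCreate()

data = [("Alice", 34, "2021-01-15"), ("Bob", 45, "2020-07-01")]
df = spark.createDataFrame(data, schema=["name", "age", "hire_date"])

df.printSchema()
df.show()
df.filter(df.age > 40).select("name").show()
```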

Hands-On Lab: Introduction to SparkSQL: Spark SQL is a Spark module for structured data processing. It lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.
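
A minimal sketch of mixing the two styles; the view and column names are illustrative:

```python
# Register a DataFrame as a temporary view, then query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLIntro").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 40").show()
```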

Submit Apache Spark Applications Lab: In this lab, you will learn how to submit Apache Spark applications from a Python script. This exercise is straightforward thanks to Docker Compose.
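
A sketch of the shape such a script takes; the file name is illustrative, and the exact master URL depends on how the Docker Compose cluster is configured:

```python
# my_app.py -- a self-contained application that can be handed to spark-submit.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("SubmitExample").getOrCreate()
    df = spark.range(1000)
    print("row count:", df.count())
    spark.stop()

# Submitted with something along the lines of:
#   spark-submit --master local[*] my_app.py
```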

Apache Spark Monitoring and Debugging: Practice monitoring and debugging a Spark application through the Spark web UI.
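
One way to poke at the UI is to keep a small application alive after running a job; by default the driver serves its UI on port 4040 while the application is running (the job below is just filler):

```python
# Run a job, then pause so its stages and tasks can be inspected in the Spark UI.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MonitoringDemo").getOrCreate()

spark.range(10_000_000).selectExpr("sum(id) AS total").show()

input("Inspect the Spark UI (e.g. http://localhost:4040), then press Enter to exit...")
spark.stop()
```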



This practice project focuses on data transformation and integration using PySpark. You will work with two datasets and perform various transformations such as adding columns, renaming columns, dropping unnecessary columns, and joining DataFrames, and finally write the results into a Hive warehouse and an HDFS file system.
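
A condensed sketch of those steps in PySpark; the file paths, column names, and table names here are illustrative, not the project's actual ones:

```python
# Read two datasets, transform them, join them, and persist the result.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year

spark = (SparkSession.builder
         .appName("FinalProjectSketch")
         .enableHiveSupport()
         .getOrCreate())

df1 = spark.read.csv("dataset1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("dataset2.csv", header=True, inferSchema=True)

# Add a column, rename a column, and drop one that is not needed downstream.
df1 = (df1.withColumn("year", year(col("date_column")))
          .withColumnRenamed("amount", "transaction_amount")
          .drop("description"))

# Join the two DataFrames on a shared key.
joined = df1.join(df2, on="customer_id", how="inner")

# Write the result to a Hive warehouse table and to HDFS.
joined.write.mode("overwrite").saveAsTable("final_results")
joined.write.mode("overwrite").parquet("hdfs://namenode:9000/output/final_results")

spark.stop()
```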
