Spark Technology - Education

Spark Technology by Ttacy341(f): 2:04pm On Sep 14, 2017

Introduction
Big data and data science are enabled by scalable, distributed processing frameworks that let organizations analyze petabytes of data on large clusters of commodity hardware. MapReduce, especially its open-source Hadoop implementation, was the first of these frameworks and is perhaps still the most famous.

Apache Spark is a fast, in-memory data processing engine with expressive development APIs in Scala, Java, Python, and R. It lets data workers efficiently run machine learning algorithms that need fast, iterative access to datasets (see the Spark API documentation for more info). Running Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise.

More broadly, Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially for workloads that can keep their working data in memory instead of writing intermediate results to disk between steps.

Spark programs operate on Resilient Distributed Datasets, which the official Spark documentation defines as “a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”

MLlib is Spark’s machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.

Concepts

At the core of Spark is the notion of a Resilient Distributed Dataset (RDD): an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.
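To make "partitioned" and "operated on in parallel" concrete, here is a minimal sketch in plain Python. It is not Spark code and not how Spark is implemented: the helper names partition and map_partitions are made up for illustration, and threads on one machine stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_partitions):
    """Split items into num_partitions contiguous chunks (hypothetical helper)."""
    items = tuple(items)  # tuples: the dataset is never modified in place
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def map_partitions(chunks, func):
    """Apply func to every element, one worker per partition.

    Returns new chunks and leaves the input chunks untouched.
    """
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda chunk: tuple(func(x) for x in chunk), chunks))

parts = partition(range(1, 11), 3)
squared = map_partitions(parts, lambda x: x * x)
total = sum(sum(chunk) for chunk in squared)

print(parts)   # [(1, 2, 3, 4), (5, 6, 7), (8, 9, 10)]
print(total)   # 385
```

In real Spark the partitions live on different machines and the framework handles the scheduling, but the idea is the same: each partition is processed independently, and the original data is never changed.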

Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.

Once an RDD is instantiated, you can apply a series of operations to it. Every operation is one of two types: a transformation or an action. A transformation, as the name suggests, creates a new dataset from an existing RDD and extends the processing Directed Acyclic Graph (DAG) that will later be applied to the partitioned dataset across the YARN cluster. Transformations are lazy, so nothing is computed at this point. An action, on the other hand, executes the DAG and returns a value.
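The difference between transformations and actions described above can be sketched with a toy class in plain Python. ToyRDD is a hypothetical name, not part of Spark; it only borrows the method names map, filter, collect, and count from the real RDD API, and its "DAG" is just a linear list of pending steps.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable data plus a recorded chain of steps."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = tuple(steps)  # lineage: the pending work, in order

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _run(self):
        """Replay the recorded steps over the data (only called by actions)."""
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: execute the recorded steps and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element that double() actually processes

def double(x):
    calls.append(x)
    return x * 2

numbers = ToyRDD(range(1, 6))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)

pending_before = len(calls)       # transformations alone ran nothing
result = evens_doubled.collect()  # the action triggers the work
n = evens_doubled.count()         # a second action replays the steps

print(pending_before)  # 0
print(result)          # [4, 8]
print(n)               # 2
```

Note that the second action replays the whole chain from the start. In this toy model that is the only option; in Spark, keeping an RDD in memory with cache() or persist() is the usual way to avoid that repeated work.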

For more info: https://tekslate.com/tutorials/spark/
