Spark Technology - Education - Nairaland

Spark Technology by Ttacy341(f): 2:04pm On Sep 14, 2017
Introduction
Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially its open-source Hadoop implementation) was the first, and is perhaps still the most famous, of these frameworks.

Apache Spark is a fast, in-memory data processing engine with expressive development APIs in Scala, Java, Python, and R. It lets data workers efficiently run machine learning algorithms that need fast, iterative access to datasets (see the Spark API documentation for more info). Running Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise.

Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark’s in-memory computing capabilities can be leveraged.

Spark programs operate on Resilient Distributed Datasets (RDDs), which the official Spark documentation defines as “a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”

MLlib is Spark’s machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.


Concepts

At the core of Spark is the notion of a Resilient Distributed Dataset (RDD): an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.
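To make "partitioned and operated on in parallel" a bit more concrete, here is a rough single-machine sketch in plain Python. It is not the Spark API: the helper names partition and map_partitions are made up for illustration, and threads stand in for cluster nodes. The point is only that the data is split into chunks, each chunk is processed independently, and the original chunks are never modified.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into num_partitions ordered, roughly equal chunks (hypothetical helper)."""
    items = tuple(items)  # immutable snapshot, loosely like an RDD's contents
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one leftover element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_partitions(chunks, fn):
    """Apply fn to every element, one worker per chunk, returning new chunks."""
    def work(chunk):
        return tuple(fn(x) for x in chunk)

    # Threads stand in for cluster nodes in this toy model.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(work, chunks))


parts = partition(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)

print(parts)    # the original chunks are left unchanged
print(squared)  # a new set of chunks holding the squared values
```

In real Spark the chunks live on different machines and the scheduler decides where each task runs, but the shape of the idea is the same.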

Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.

Once an RDD is instantiated, you can apply a series of operations to it. Every operation is one of two types: a transformation or an action. Transformations, as the name suggests, create new datasets from an existing RDD and build up the processing Directed Acyclic Graph (DAG) that can then be applied to the partitioned dataset across the YARN cluster. Transformations are lazy: nothing is computed when they are declared. An action, on the other hand, executes the DAG and returns a value.
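The split between transformations and actions can be sketched with a small toy class in plain Python. This is an illustration only, not Spark itself: ToyRDD is a hypothetical name, it runs in one process, and its "DAG" is just a linear list of recorded steps. The method names map, filter, collect and count are chosen to mirror the real RDD methods of the same names.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)    # source data is never modified
        self._steps = tuple(steps)  # recorded transformations (a linear "DAG")

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source data as lazy iterators.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def traced_double(x):
    calls.append(x)  # record every time the function actually runs
    return x * 2


rdd = ToyRDD(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # still empty: nothing has executed yet

result = rdd.collect()             # the action triggers the computation
print(calls_before_action, result)
```

Declaring map and filter only records the steps; traced_double does not run until collect is called. Real Spark works the same way in spirit, except that the recorded plan is shipped to the cluster and executed per partition.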

For more info: https://tekslate.com/tutorials/spark/
