|Join Nairaland / LOGIN! / Trending / Recent / New|
Stats: 1,945,189 members, 4,029,316 topics. Date: Tuesday, 16 January 2018 at 04:43 PM
|Spark Technology by Ttacy341(f): 2:04pm On Sep 14, 2017|
Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets (see Spark API Documentation for more info). Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark’s in-memory computing capabilities can be leveraged.
Spark programs operate on Resilient Distributed Datasets, which the official Spark documentation defines as “a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
MLlib is Spark’s machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.
At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated in parallel.
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.
Once an RDD is instantiated, you can apply a series of operations. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied on the partitioned dataset across the YARN cluster. An Action operation, on the other hand, executes DAG and returns a value.
for more info : https://tekslate.com/tutorials/spark/
|Sections: politics (1) business autos (1) jobs (1) career education (1) romance computers phones travel sports fashion health |
religion celebs tv-movies music-radio literature webmasters programming techmarket
Nairaland - Copyright © 2005 - 2018 Oluwaseun Osewa. All rights reserved. See How To Advertise. 30