
Background

This post summarizes and organizes my notes from reading Chapter 12 of Spark: The Definitive Guide.

Definition

  • An RDD is an immutable, partitioned collection of records.
  • There is no concept of rows in an RDD; individual records are just Java/Scala/Python objects, and RDDs carry no schema.
  • All code in Spark compiles down to RDDs.
  • Spark’s Structured APIs automatically store data in an optimized, compressed binary format, whereas with RDDs you have to implement that format inside your objects manually (see the sketch after this list).
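
Below is a minimal Scala sketch of the difference; the SparkSession setup, the Flight case class, and its sample values are made up for illustration. The RDD holds plain objects with no schema attached, while converting it to a DataFrame lets Spark derive a schema and use its optimized internal format.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()

// Records in an RDD are plain Scala objects; no schema is attached to them.
case class Flight(origin: String, dest: String, count: Long)
val rdd = spark.sparkContext.parallelize(Seq(Flight("US", "JP", 1L), Flight("US", "CN", 2L)))

// The Structured APIs derive a schema from the case class and store the data
// in Spark's optimized, compressed binary format.
val df = spark.createDataFrame(rdd)
df.printSchema()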
Read more »

Preface

This post summarizes and organizes my notes from reading Chapter 3, “Overview of the Spark SQL Execution Process”, of the book “Spark SQL内核剖析” (The Internals of Spark SQL). The content below is mainly based on Spark 2.1 and Spark 2.2. In the first part of the chapter, the author introduces the key concepts behind Catalyst, Spark SQL’s optimizer, including InternalRow, the TreeNode hierarchy, and the Expression hierarchy. In the second part, the author walks through the three stages a SQL statement goes through before it becomes an RDD[InternalRow] that Spark can execute: logical planning, physical planning, and code generation and submission.
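
As a quick way to observe these stages, the sketch below (the table name and query are made up) inspects a query’s plans through the Dataset’s queryExecution field and explain():

// Assumes a table named "people" has been registered; the query is only illustrative.
val df = spark.sql("SELECT name, count(*) AS cnt FROM people GROUP BY name")

df.queryExecution.logical        // parsed logical plan
df.queryExecution.analyzed       // analyzed logical plan, resolved against the catalog
df.queryExecution.optimizedPlan  // logical plan after Catalyst optimization rules
df.queryExecution.executedPlan   // physical plan that runs as RDD[InternalRow]

df.explain(true)                 // prints the parsed, analyzed, optimized and physical plans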

Read more »

Preface

This post summarizes and organizes my notes from reading Chapter 7 of “Spark: The Definitive Guide”. Some of the code comes from the book, and some is from my own experiments on my local machine.

Definition of Aggregation

Aggregation is the act of collecting something together.
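
In Spark, that usually means grouping rows by one or more keys and computing summary values per group. A minimal Scala sketch, assuming the flight-data JSON file used elsewhere on this blog and its DEST_COUNTRY_NAME and count columns:

import org.apache.spark.sql.functions.{count, sum}

val flights = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")

flights
  .groupBy("DEST_COUNTRY_NAME")                              // collect rows together by key...
  .agg(count("*").as("flights"), sum("count").as("total"))   // ...then summarize each group
  .show(5)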

Read more »

Introduction to Spark Cache

Preface

In Spark SQL, caching is commonly used when you need to reuse an intermediate computation result. Understanding how Spark cache works helps developers speed up computation and improve efficiency. This post builds up a complete picture of Spark cache by answering three key questions. First, I briefly introduce what Spark cache is, with concrete Scala code examples. Then I describe the scenarios in which developers may want to use it. Finally, I explain the working mechanism of Spark cache and some caveats drawn from my own experience.
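
As a taste of what follows, here is a minimal sketch of caching a reused intermediate result (again assuming the flight-data JSON used elsewhere on this blog):

val df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
val grouped = df.groupBy("DEST_COUNTRY_NAME").count()

grouped.cache()      // mark for caching; this is lazy, nothing is stored yet

grouped.count()      // the first action materializes and caches the result
grouped.show(5)      // later actions reuse the cached data instead of recomputing it

grouped.unpersist()  // release the cached data when it is no longer needed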

Read more »

The Basic Architecture of Spark

"spark架构"

Spark Applications

Components

  • One driver
    • Runs the main() function
    • Maintains information about the Spark application
    • Responds to the user’s program or input
    • Analyzes the work and distributes it to the executors for processing
  • A set of executors (see the configuration sketch after this list)
    • Runs the code the driver assigns to it
    • Reports the state of its computation back to the node running the driver
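
The sketch below shows, with made-up values, how a single application is wired together: the JVM that builds the SparkSession acts as the driver, and the executor-related settings request the executors that will run its work (how these options are honored depends on the cluster manager).

import org.apache.spark.sql.SparkSession

// All values here are illustrative; spark.executor.instances is honored on
// YARN/Kubernetes, while other cluster managers size executors differently.
val spark = SparkSession.builder()
  .appName("architecture-sketch")            // this process becomes the driver
  .config("spark.executor.instances", "4")   // ask for 4 executors
  .config("spark.executor.cores", "2")       // 2 cores per executor
  .config("spark.executor.memory", "2g")     // 2 GB heap per executor
  .getOrCreate()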

Key Concepts

Read more »

A Basic Introduction to Hadoop Components

Preface

Hadoop is a key part of the big data ecosystem, and its components form the foundation on which big data platforms are built. Understanding these underlying components helps with related development work.

Outline

  • HDFS
  • MapReduce
  • Hive
  • HBase

HDFS

The Hadoop Distributed File System (HDFS) implements a distributed file storage system. Simply put, it stores massive files across many different physical disks; users do not need to care where any particular piece of data lives, and can store and read files just as they would on a single-machine file system.
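
A minimal Scala sketch of that idea using Hadoop’s FileSystem API; the NameNode address and the path are made up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:8020")   // hypothetical NameNode address
val fs = FileSystem.get(conf)

// Writing: HDFS decides which DataNodes hold the file's blocks.
val out = fs.create(new Path("/tmp/hello.txt"))
out.write("hello hdfs".getBytes("UTF-8"))
out.close()

// Reading: the caller never has to know where the blocks actually live.
val in = fs.open(new Path("/tmp/hello.txt"))
println(scala.io.Source.fromInputStream(in).mkString)
in.close()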

Read more »

Kafka Basics

Introduction

From the official website:
Technically speaking, event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed.
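
A minimal producer sketch in Scala against that model, using the kafka-clients API; the broker address, topic name, and payload are made up:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Capture an event and append it to a stream (topic) for later processing.
producer.send(new ProducerRecord[String, String]("sensor-events", "device-1", """{"temp": 21.5}"""))
producer.close()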

Read more »

Basic Spark DataFrame Operations

Creating a DataFrame

val df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
df.createOrReplaceTempView("dfTable")

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Build a DataFrame from an explicit schema and a manually constructed Row.
val myManualSchema = new StructType(Array(
  new StructField("some", StringType, true),
  new StructField("col", StringType, true),
  new StructField("names", LongType, false)
))

val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
Read more »

What is a ReadWriteLock

Basically, a ReadWriteLock is a high-level locking mechanism that lets you add thread safety to a data structure while increasing throughput: multiple threads may read the data concurrently, but only one thread at a time may update it, exclusively.
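
A minimal Scala sketch of the idea using java.util.concurrent’s ReentrantReadWriteLock; the Cache class is made up for illustration:

import java.util.concurrent.locks.ReentrantReadWriteLock

// Guard a shared map with a ReadWriteLock: many readers may hold the read
// lock at once, while writers are exclusive.
class Cache[K, V] {
  private val lock = new ReentrantReadWriteLock()
  private val data = scala.collection.mutable.Map.empty[K, V]

  def get(key: K): Option[V] = {
    lock.readLock().lock()               // shared: concurrent reads allowed
    try data.get(key)
    finally lock.readLock().unlock()
  }

  def put(key: K, value: V): Unit = {
    lock.writeLock().lock()              // exclusive: blocks readers and writers
    try data.update(key, value)
    finally lock.writeLock().unlock()
  }
}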

Read more »