MindMap Gallery 23 big data knowledge points and interview summary (1)
A simple template covering Java basics, Hadoop, Hive, data warehouse theory, Impala, data lake theory, and more.
Edited at 2024-01-18 15:05:07
Big data knowledge points and interview summary
java basics
Basic data types
byte, boolean: 1 byte; char, short: 2 bytes; int, float: 4 bytes; long, double: 8 bytes
Exceptions
Throwable
Error
Catastrophic, fatal errors that the program cannot recover from, such as stack overflow (StackOverflowError)
Exception
runtime exception
NullPointerException, ArrayIndexOutOfBoundsException
Compile-time exceptions (non-runtime exceptions)
IOException, ClassNotFoundException
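A small sketch of the distinction above: runtime (unchecked) exceptions need no declaration, while compile-time (checked) exceptions must be caught or declared. The class and method names here are illustrative.

```java
public class ExceptionDemo {
    // Checked exception: the compiler forces callers to handle or declare it.
    static void readConfig() throws java.io.IOException {
        throw new java.io.IOException("config file missing");
    }

    public static void main(String[] args) {
        // Unchecked (runtime) exception: no declaration required.
        try {
            String s = null;
            s.length();                       // throws NullPointerException
        } catch (NullPointerException e) {
            System.out.println("caught runtime exception");
        }
        // Checked exception: must be caught (or main must declare it).
        try {
            readConfig();
        } catch (java.io.IOException e) {
            System.out.println("caught checked exception");
        }
    }
}
```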
Polymorphism
data structure
Common data structures (8 types)
array
Query and modification are fast, addition and deletion are slow
Storage is contiguous; memory usage is heavy and the space cost is high.
Advantages: Random reading and modification are fast because the array is continuous (strong random access and fast search speed)
Disadvantages: insertion and deletion are inefficient, because inserting requires shifting all subsequent elements in memory; the size is fixed and cannot easily grow dynamically.
linked list
Fast addition and deletion, slow search
Storage is non-contiguous; memory use is flexible and the space cost is low.
Advantages: fast insertion and deletion, high memory utilization, no fixed size, flexible expansion
Disadvantages: Cannot search randomly, start from the first one every time, low query efficiency
Hash table
Addition, deletion and lookup are all fast; hashing the data wastes some storage space
queue
First in, first out; insertion at the tail and removal at the head are fast, other access is slow.
stack
First in, last out; push and pop at the top are fast, access to elements other than the top is slow.
red black tree
Both additions, deletions and searches are fast, and the algorithm structure is complex (see binary tree for details)
Binary tree
Addition, deletion and search are fast, but the deletion algorithm has a complex structure.
The time complexity is O(logn) in the best case and O(n) in the worst case.
special binary tree
full binary tree
If a binary tree has K levels and 2^K - 1 nodes, it is a full binary tree
complete binary tree
A complete binary tree with h levels has the maximum number of nodes in levels 1 to h-1, and all nodes in level h are packed to the left
binary search tree
Values in the left subtree are all smaller than the root, and values in the right subtree are all larger than the root; an in-order traversal therefore yields the keys in ascending order
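The in-order property can be sketched with a minimal BST (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal binary search tree: left < node < right, so an
// in-order traversal visits the keys in ascending order.
public class BstDemo {
    static class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root;
    }

    static void inorder(Node n, List<Integer> out) {
        if (n == null) return;
        inorder(n.left, out);
        out.add(n.key);
        inorder(n.right, out);
    }

    public static void main(String[] args) {
        Node root = null;
        for (int k : new int[]{5, 2, 8, 1, 9}) root = insert(root, k);
        List<Integer> sorted = new ArrayList<>();
        inorder(root, sorted);
        System.out.println(sorted); // [1, 2, 5, 8, 9]
    }
}
```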
red black tree
A balanced binary tree is either empty or a tree whose left and right subtree heights differ by at most 1, with both subtrees themselves balanced. A red-black tree is a self-balancing binary search tree
Features:
time complexity
The worst-case time complexity of insertion, deletion and lookup is O(log N)
Height is O(log N)
Nodes are either black or red
The root node must be black
If a node is red, its child nodes must be black.
For any node, the number of black nodes on the path to the leaf node must be the same.
Each leaf node (leaf node is the NULL pointer or NULL node at the end of the tree) is black
Optimal binary tree (Huffman tree)
The weighted path length of the tree reaches the minimum
B-tree
A balanced multi-way search tree (more than two search paths per node); unlike a binary tree it is a multi-way tree, with O(log n) operations
B+ tree
A self-balancing tree data structure that keeps data ordered; search, sequential access, insertion and deletion are all O(log n). The B+ tree stores actual data only in leaf nodes, which eliminates some defects of the B-tree: non-leaf nodes hold only indexes, all data lives in the leaf nodes, and the leaves are linked in order.
The low tree height and linked leaves support range search
Why does MySQL use B+ trees?
Less disk IO and supports range lookup
The difference between B-trees and B+ trees
1) In a B+ tree, all data resides in the leaf nodes 2) The leaf nodes of a B+ tree carry bidirectional pointers, and the data in the leaves is linked in ascending order, which makes range searches convenient
bitmap
Saves storage space, but is inconvenient for describing complex data relationships
Collections
Traversing collection problem
When traversing a collection with an iterator, structurally modifying the collection (adding or removing elements) outside the iterator throws a ConcurrentModificationException.
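The fail-fast behavior, and the safe alternative via the iterator's own remove(), can be sketched as:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Modifying a collection directly while iterating triggers a fail-fast
// ConcurrentModificationException; Iterator.remove() is the safe way
// to delete during traversal.
public class FailFastDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(List.of("a", "b", "c"));
        boolean failed = false;
        try {
            for (String s : list) {
                if (s.equals("a")) list.remove(s);   // structural change mid-iteration
            }
        } catch (java.util.ConcurrentModificationException e) {
            failed = true;
        }
        System.out.println("fail-fast triggered: " + failed);

        // Safe removal through the iterator itself.
        List<String> list2 = new ArrayList<>(List.of("a", "b", "c"));
        for (Iterator<String> it = list2.iterator(); it.hasNext(); ) {
            if (it.next().equals("b")) it.remove();
        }
        System.out.println(list2); // [a, c]
    }
}
```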
Collection
List
ArrayList
Features: Storage elements are in order, querying is fast and addition and deletion are slow.
The underlying array is implemented, and the capacity can grow automatically. The default expansion is 1.5 times the original capacity.
Not thread-safe; fine for single-threaded use, not recommended with multiple threads. Collections.synchronizedList(List l) returns a thread-safe wrapper around a List.
The implementation calls Arrays.copyOf() heavily, and uses System.arraycopy() when expanding capacity.
Vector
Array-based, with methods modified by the synchronized keyword: thread-safe but slow. The default initial capacity is 10, and by default the capacity doubles on expansion.
LinkedList
Implementation of doubly linked list structure
Query is slow, addition and deletion are fast
Similarities and differences among LinkedList, ArrayList, and Vector:
Set
SortedSet
TreeSet
Implemented on a red-black tree; elements are kept in sorted order; not thread-safe
HashSet
Subclass: LinkedHashSet
HashSet is backed by a HashMap, so its data structure is array + linked list + red-black tree. It allows null, forbids duplicates, is unordered, and is not thread-safe. Elements are placed according to their hash, giving high lookup, deletion and access performance. Collections.synchronizedSet() returns a thread-safe wrapper around a Set.
Elements can only be accessed through iterators
queue
Map
HashMap
HashMap
The bottom layer of HashMap is implemented using arrays, linked lists, and red-black trees.
When a chain in one bucket of the array reaches 8 nodes, the chain is converted into a red-black tree; when a red-black tree shrinks below 6 nodes, it reverts to a linked list.
The default initial capacity of HashMap is 16. When the number of stored elements exceeds load factor * current capacity, the capacity is doubled and the position of every element in the array is recalculated.
It is not thread safe, you can use ConcurrentHashMap
ConcurrentHashMap
jdk1.8
Implemented with a node array + linked lists + red-black trees, using CAS (optimistic locking) and synchronized
Subclass
LinkedHashMap
HashTable
Subclass
Properties
Thread safety
Thread safety is achieved by locking the entire hash table; this guarantees safety, but makes concurrent execution inefficient.
SortedMap
TreeMap
Design Patterns
Singleton mode, 3 common types
Lazy mode, instantiated on the first call
Eager ("hungry") mode: the class instantiates itself when it is initialized; thread-safe
Registration mode
proxy mode
static proxy
dynamic proxy
jdk dynamic proxy
CGlib
Factory pattern
builder pattern
adapter mode
iterator pattern
Common categories
String
Represents a string
The String class is modified with final, so it cannot be inherited; it represents an immutable character sequence.
The character content of the String object is stored in a character array value[]
If a String constant is concatenated with another constant, the result lives in the constant pool of the method area, and the pool never holds two strings with the same content; if any variable takes part in the concatenation, the result is stored on the heap.
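The pool-vs-heap distinction can be observed with reference comparisons:

```java
// Constant folding vs runtime concatenation: compile-time constant
// expressions are interned in the string pool, while concatenation
// involving a variable builds a new object on the heap.
public class StringPoolDemo {
    public static void main(String[] args) {
        String ab = "ab";
        String folded = "a" + "b";         // constant folding -> pooled "ab"
        System.out.println(ab == folded);  // true

        String a = "a";
        String concat = a + "b";           // runtime concat -> new heap object
        System.out.println(ab == concat);  // false
        System.out.println(ab == concat.intern()); // true: intern() returns the pooled instance
    }
}
```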
JVM
javac
The compiler that compiles Java source into bytecode (.class files) that the JVM can load
components
jvm memory partition
program counter
A small memory area that acts as the line-number indicator of the bytecode the current thread is executing; changing the counter's value selects the next bytecode instruction to execute. Each thread has its own program counter, and the counters of different threads do not affect each other. It is the only area where OutOfMemoryError cannot occur. When the thread is executing a Java method, the counter records the address of the virtual machine bytecode instruction being executed; when a native method is executed, the counter is undefined (empty).
java virtual machine stack
Thread-private; stores stack frames, which hold local variables, operand stacks, dynamic links, method exit information, etc.
native method stack
Call operating system class library
heap
Thread sharing, basically all objects are in the Java heap, the main area for garbage collection
method area
Thread-shared; mainly stores class information, constants, static variables, and JIT-compiled code
Runtime constant pool
Part of the method area. Storing literal and symbolic references
Multithreading
program
A set of instructions written in a certain language to complete a specific task, that is, a static piece of code, a static object
process
The process of executing a program once or the program being executed is a dynamic process with its own process of creation, existence and demise (life cycle)
Features:
Programs are static, processes are dynamic
Process is the smallest unit of resource allocation. When the system is running, each process will be allocated a different memory area.
thread
The process is further subdivided into threads, which are an execution path of the program.
Features:
If a process executes multiple threads at the same time, it supports multi-threading.
Threads serve as scheduling and execution units. Each thread has an independent stack and program counter (pc), and the overhead of thread switching is small.
Multiple threads in a process share the same memory unit/memory address space -> they allocate objects from the same heap and can access the same objects and variables. This makes communication between threads simpler and more efficient. But sharing between multiple threads will bring security issues
Parallelism and Concurrency
parallel:
Two or more events occur simultaneously
concurrent:
Two or more events occur within the same time interval
Thread creation
Thread
Inherit the Thread class and override the run() method; the code in run() is the thread body. Instantiate the subclass and call start() to launch the thread, which then invokes run(). Thread itself implements the Runnable interface.
Thread common methods
(1) void start(): starts the thread and causes the object's run() method to be executed
(2) run(): the operations the thread performs when scheduled
(3) String getName(): returns the thread's name
(4) void setName(String name): sets the thread's name
(5) static Thread currentThread(): returns the current thread; inside a Thread subclass this is `this`; usually used in the main thread and in Runnable implementations
(6) static void yield(): the thread yields: pauses the currently executing thread and gives the chance to run to a thread of the same or higher priority; if no thread of the same priority is queued, the call is ignored
(7) join(): when one thread calls another thread's join() method, the calling thread blocks until the joined thread finishes; this lets even low-priority threads get executed
(8) static void sleep(long millis): makes the currently active thread give up the CPU for the specified time (in milliseconds), giving other threads a chance to run; the thread re-queues when the time is up; throws InterruptedException
(9) stop(): forcibly ends the thread's life cycle; not recommended
(10) boolean isAlive(): returns a boolean indicating whether the thread is still alive
Thread priority:
When a thread is created, it inherits the priority of the parent thread.
Low priority only has a low probability of being scheduled, and does not necessarily need to be called after a high-priority thread.
1. Thread priority levels: MAX_PRIORITY: 10, MIN_PRIORITY: 1, NORM_PRIORITY: 5
2. Methods involved: getPriority() returns the thread's priority value; setPriority(int newPriority) changes the thread's priority
Thread state Thread.State
New
ready
run
block
die
Runnable
1) Define a subclass that implements the Runnable interface.
2) Override the run method of the Runnable interface in the subclass.
3) Create a thread object through the parameterized constructor of the Thread class.
4) Pass the Runnable subclass object as the actual parameter to the Thread constructor.
5) Call the Thread object's start method: this starts the thread, which calls the Runnable implementation's run method.
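The steps above can be sketched as (class and thread names are illustrative):

```java
// Implement Runnable, override run(), wrap in a Thread, and start it.
public class RunnableDemo implements Runnable {
    @Override
    public void run() {
        System.out.println("running in: " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new RunnableDemo(), "worker-1"); // Runnable passed to Thread's constructor
        t.start();   // start() schedules the thread, which invokes run()
        t.join();    // wait for the worker to finish
        System.out.println("done");
    }
}
```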
callable
A Callable can be executed with an ExecutorService, or passed as the parameter of a FutureTask
public class MyCallable implements Callable<T> {
    @Override
    public T call() throws Exception {
        // Define the work to perform here
    }
}
MyCallable myCallable = new MyCallable();
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<T> future = executor.submit(myCallable);
T result = future.get();
Callable<Process> task = () -> {
    // Execute the asynchronous task
    Runtime runtime = Runtime.getRuntime();
    Process process = runtime.exec("/Users/mac/Desktop/qc-java-runtime/src/main/java/com/qc/runtime/shell.sh");
    return process;
};
// Wrap the Callable in a FutureTask
FutureTask<Process> future = new FutureTask<>(task);
// Start a new thread to run the asynchronous task
new Thread(future).start();
// Get the result of the asynchronous task
Process result = future.get();
System.out.println(result);
Lock
For concurrent work you need a way to keep two tasks from interfering on the same resource (i.e. from racing on shared state). The standard way to prevent such conflicts is locking: the first task to access a resource locks it, and other tasks cannot access it until it is unlocked, at which point another task can lock and use it.
Synchronized
Any object can serve as a synchronization lock; every object carries a single built-in lock (monitor). Lock used by synchronized methods: static methods lock ClassName.class, non-static methods lock this. Synchronized code blocks: the lock object is specified explicitly, often this or ClassName.class.
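A sketch of a synchronized instance method (equivalent to locking `this`); without the lock, the concurrent increments below would race and lose updates:

```java
// A synchronized instance method locks `this`; a static one would
// lock the Class object instead.
public class SyncCounterDemo {
    private int count = 0;

    // Equivalent to wrapping the body in synchronized (this) { ... }
    synchronized void increment() { count++; }

    int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        SyncCounterDemo counter = new SyncCounterDemo();
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) counter.increment();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.get()); // 20000: no lost updates
    }
}
```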
deadlock
Different threads each hold synchronization resources the other needs and refuse to give them up; each waits for the other to release its resources, forming a deadlock.
After a deadlock occurs, no exception or prompt will occur, but all threads are blocked and cannot continue.
lock release
The execution of the synchronization method and synchronization code block of the current thread ends.
The current thread encounters break and return in a synchronized code block or synchronized method, which terminates the continued execution of the code block and method.
An unhandled Error or Exception occurs in the current thread in the synchronized code block or synchronized method, resulting in an abnormal end.
The current thread executes the wait() method of the thread object in the synchronization code block and synchronization method. The current thread pauses and releases the lock.
Network communication protocol
OSI model (protocol)
The model is too ideal and cannot be promoted on the Internet
OSI layering
Application layer
presentation layer
session layer
transport layer
Network layer
data link layer
physical layer
TCP/IP model (protocol)
de facto international standard
TCP/IP layering
Application layer
transport layer
Network layer
physical data link layer
Comparison of two models
Data encapsulation
Data dismantling
IP and port
Important network transport layer protocols UDP and TCP
UDP (User Datagram Protocol)
1. Encapsulates data, source and destination into packets without establishing a connection
2. Each datagram is limited to 64K in size
3. The sender does not care whether the receiver is ready, and the receiver does not acknowledge receipt, so it is unreliable
4. Can broadcast
5. No resources need to be released after sending; low overhead, fast
TCP (Transmission Control Protocol)
1. Before using TCP, a TCP connection must be established to form a data transmission channel
2. Before transmission, a "three-way handshake" is used; point-to-point communication is reliable
3. Two application processes communicate over TCP: a client and a server
4. Large amounts of data can be transmitted over the connection
5. After transmission completes, the established connection must be released, so it is less efficient
Three-way handshake (establishing connection)
1. The client sends a TCP connection request. The segment carries a seq sequence number randomly generated by the sender, and the SYN (synchronize) flag is set to 1, indicating that a TCP connection is requested. (SYN=1, seq=x, x randomly generated)
2. The server replies to the client's request. The reply carries its own randomly generated seq, sets SYN to 1, and includes an ACK field whose value is the client's seq plus 1, so that when the client receives it, it knows its connection request has been acknowledged. (SYN=1, ACK=x+1, seq=y, y randomly generated) Adding 1 to the acknowledged sequence number identifies which connection is being confirmed.
3. After the client receives the server's reply, it increments its own sequence number and acknowledges by adding 1 to the server's seq. (ACK=y+1, seq=x+1)
Wave four times (disconnect)
1. The client sends a segment requesting to close the TCP connection. It carries a randomly generated seq and sets the FIN flag to 1, indicating that the connection should be closed. (FIN=1, seq=x, x randomly generated by the client)
2. The server replies to the client's close request. The reply carries its own randomly generated seq and an ACK field whose value is the client's seq plus 1, so that when the client receives it, it knows its close request has been acknowledged. (ACK=x+1, seq=y, y randomly generated by the server)
3. After acknowledging, the server does not close the connection immediately: it first makes sure all data still being transmitted to the client has been delivered. Once that is confirmed, it sends its own segment with FIN set to 1 and a new randomly generated seq. (FIN=1, ACK=x+1, seq=z, z randomly generated by the server)
4. After receiving the server's FIN, the client replies with a randomly generated seq and an ACK equal to the server's seq plus 1, completing the acknowledgement. (ACK=z+1, seq=h, h randomly generated by the client) This completes the four-way close.
Network Socket
The combination of ip and port forms a socket
The essence of network communication: communication between sockets
Classification
stream socket
Provide reliable byte stream service using TCP
TCP-based Socket programming steps
client
1. Create a Socket: construct a Socket object from the specified server's IP address and port number. If the server responds, a communication line from client to server is established; if the connection fails, an exception is thrown.
2. Open the input/output streams attached to the Socket: getInputStream() obtains the input stream and getOutputStream() obtains the output stream for data transmission.
3. Read from / write to the Socket according to some protocol: the input stream reads what the server put on the line (but not what the client itself wrote), and the output stream writes information onto the line.
4. Close the Socket: disconnect the client from the server and release the line.
Service-Terminal
1. Call ServerSocket(int port): create a server-side socket bound to the specified port, used to listen for client requests.
2. Call accept(): listen for connection requests; when a client requests a connection, accept it and return a communication Socket object.
3. Call getOutputStream() and getInputStream() on the Socket object: obtain the output and input streams and start sending and receiving network data.
4. Close the ServerSocket and Socket objects: once the client access completes, close the communication socket.
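The client and server steps above can be combined into one runnable sketch; port 0 asks the OS for a free ephemeral port, and the names are illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal TCP echo: the server binds, accepts one connection and echoes
// a line; the client connects, writes, and reads the reply.
public class TcpEchoDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {        // 1. bind
            Thread serverThread = new Thread(() -> {
                try (Socket conn = server.accept();              // 2. accept
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()));
                     PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                    out.println("echo: " + in.readLine());       // 3. read/write
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            serverThread.start();

            try (Socket client = new Socket("localhost", server.getLocalPort()); // client connects
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println("hello");
                System.out.println(in.readLine()); // echo: hello
            }
            serverThread.join();                                  // 4. close (try-with-resources)
        }
    }
}
```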
datagram socket
Use UDP to provide "best effort" data services
UDP-based network programming
1. The DatagramSocket and DatagramPacket classes implement network programs based on the UDP protocol.
2. UDP datagrams are sent and received through a DatagramSocket; the system does not guarantee that a datagram will safely reach its destination, or when it will arrive.
3. A DatagramPacket object encapsulates a UDP datagram, which contains the sender's IP address and port number and the receiver's IP address and port number.
4. Every datagram in UDP carries complete address information, so no connection needs to be established between sender and receiver, much like mailing a package.
process
1. Use DatagramSocket and DatagramPacket
2. Set up the sending end and the receiving end
3. Create the data packet
4. Call the socket's send and receive methods
5. Close the socket
Note: the sending end and the receiving end are two independently running programs.
URL
reflection
concept
Java reflection operates at runtime: for any class, all its properties and methods can be discovered; for any object, any of its methods and properties can be invoked. This ability to dynamically obtain information and dynamically invoke an object's methods is called the reflection mechanism of the Java language.
Get an instance of the class class
1) Premise: the concrete class is known. Obtain it through the class attribute of the class: safest, most reliable, best performance. Example: Class clazz = String.class;
2) Premise: an instance of the class is known. Call the instance's getClass() method to obtain the Class object. Example: Class clazz = "www.atguigu.com".getClass();
3) Premise: the fully qualified class name is known and the class is on the classpath. Obtain it via the static method Class.forName(); may throw ClassNotFoundException. Example: Class clazz = Class.forName("java.lang.String");
4) Other ways (not required): ClassLoader cl = this.getClass().getClassLoader(); Class clazz4 = cl.loadClass("fully qualified class name");
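The three main ways above yield the same singleton Class instance, which can then be queried reflectively:

```java
import java.lang.reflect.Method;

// All three ways of obtaining a Class object return the same instance:
// there is exactly one Class object per loaded class.
public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        Class<?> c1 = String.class;                      // 1. class literal
        Class<?> c2 = "hello".getClass();                // 2. instance.getClass()
        Class<?> c3 = Class.forName("java.lang.String"); // 3. forName (may throw ClassNotFoundException)
        System.out.println(c1 == c2 && c2 == c3);        // true

        // Dynamic invocation: call String.length() reflectively.
        Method length = c1.getMethod("length");
        System.out.println(length.invoke("hadoop"));     // 6
    }
}
```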
Common methods of class class
Types with Class objects
(1) class: Outer classes, members (member inner classes, static inner classes), local inner classes, anonymous inner classes (2) interface: interface (3)[]: array (4) enum: enumeration (5) annotation: annotation @interface (6) primitive type: basic data type (7)void
example
Class c1 = Object.class;
Class c2 = Comparable.class;
Class c3 = String[].class;
Class c4 = int[][].class;
Class c5 = ElementType.class;
Class c6 = Override.class;
Class c7 = int.class;
Class c8 = void.class;
Class c9 = Class.class;
int[] a = new int[10];
int[] b = new int[100];
Class c10 = a.getClass();
Class c11 = b.getClass();
hadoop
hdfs
hdfs advantages and disadvantages
advantage
Fault tolerance (replication mechanism)
Suitable for processing big data, can handle petabyte-level data, and can handle millions of files
Deployable on cheap machines
shortcoming
Not suitable for low-latency data access
It cannot store large numbers of small files efficiently: the NameNode's memory is limited, and with too many small files the seek time exceeds the file processing time.
Concurrent writes and random file modification are not supported.
hdfs architecture composition
namenode
Manage hdfs namespace
Configure replica policy
Manage block mapping information
Handle client requests
datanode
Store data block
Execute read and write requests for data blocks
client
File splitting, split the file into blocks when uploading the file
Interact with namenode to obtain file location information
Interact with datanode, read or write data
client provides commands to manage hdfs
The client provides commands to access HDFS, such as add, delete and check operations on HDFS.
secondaryNamenode
Assists the NameNode: periodically merges the fsimage and edits files and pushes the result back to the NameNode
Assisted recovery of Namenode in emergency situations
hdfs file block size
hadoop 1.x: 64 MB; hadoop 2.x/3.x: 128 MB
The optimal state is when the addressing time is 1% of the transmission time
Why HDFS file block size cannot be too large or too small
1. If it is too small, it will increase the seeking time.
2. If the block is too large, the data transmission time will be significantly greater than the time to locate the starting block position, causing the program to process this data block very slowly.
shell
upload
hadoop fs -moveFromLocal local file hdfs directory (cut from local and paste to hdfs)
hadoop fs -copyFromLocal local file hdfs directory (copy locally to hdfs)
hadoop fs -put local file hdfs directory (copy locally and upload to hdfs)
hadoop fs -appendToFile local file hdfs file (local file data is appended to the end of the hdfs file)
download
hadoop fs -copyToLocal hdfs file local directory (file copied to local)
hadoop fs -get hdfs file local directory (file copied to local)
operating hdfs
hadoop fs -ls directory (display information under the directory)
hadoop fs -mkdir directory (create directory)
-chgrp, -chmod, -chown (modify file ownership)
hadoop fs -cat file (display file contents)
hadoop fs -cp file directory (copy the file to another directory)
hadoop fs -tail file (displays 1kb of data at the end of the file)
hadoop fs -rm delete files or folders
hadoop fs -rm -r directory (recursively delete the directory and the contents under the directory)
hadoop fs -du
hadoop fs -setrep (sets the replication factor, which is only recorded in the NameNode's metadata; if the number of DataNodes is smaller than the requested replication factor, the actual number of replicas can only be as large as the number of DataNodes)
hdfs api operation: through the FileSystem object
hdfs reading and writing process
Reading process
write process
Illustration
hive
concept
Hive is a data warehouse management tool based on Hadoop, which maps structured data into a table and provides SQL-like query functions.
principle
Essence: converts HQL into MapReduce tasks
metastore
Metadata: including table name, database to which the table belongs (default default), table owner, column/partition field, table type (whether it is an external table), directory where the table's data is located, etc.
client
Provide jdbc/ODBC interface, webUI access, interface access, etc.
driver
ParserSQL Parser
Converts SQL into an abstract syntax tree (AST); this step is usually done with a third-party tool such as ANTLR. The AST is then analyzed to check whether the SQL is semantically correct, e.g. whether the tables and column names exist.
Compiler (Logical Plan)
Convert AST into logical execution plan
Optimizer (Query Optimizer)
Optimize logical execution plans
Execution
Converts the logical execution plan into a runnable physical execution plan; for Hive this means MR/Tez/Spark tasks
type of data
Basic data types
hive's String type can theoretically store 2G characters
Collection data type
Collection data type table creation statement
create table test(
    name string,
    friends array<string>,
    children map<string, int>,
    address struct<street:string, city:string>
)
row format delimited fields terminated by ','  -- column delimiter
collection items terminated by '_'             -- delimiter for MAP, STRUCT and ARRAY elements (data splitting symbol)
map keys terminated by ':'                     -- separator between map key and value
lines terminated by '\n';                      -- line separator
type conversion
Hive's atomic data types can be implicitly converted, similar to Java's widening conversions. For example, where an expression expects INT, a TINYINT is automatically converted to INT. Hive will not perform the reverse (narrowing) conversion: where an expression expects TINYINT, an INT is not automatically converted, and an error is returned unless a CAST operation is used.
Implicit conversion rules
1. Any integer type can be implicitly converted to a wider type: e.g. TINYINT can be converted to INT, and INT to BIGINT.
2. All integer types, FLOAT and STRING types can be implicitly converted to DOUBLE. (Implicit conversion of String and double can easily cause data skew)
3.TINYINT, SMALLINT, and INT can all be converted to FLOAT.
4. The BOOLEAN type cannot be converted to any other type.
DDL
Create table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
explain
(1) CREATE TABLE creates a table with the given name; if a table with the same name already exists, an exception is thrown, which the user can suppress with IF NOT EXISTS.
(2) EXTERNAL lets the user create an external table and specify a LOCATION pointing to the actual data. When a table is dropped, an internal (managed) table's metadata and data are deleted together, while an external table loses only its metadata: the data is kept.
(3) COMMENT: adds comments to tables and columns.
(4) PARTITIONED BY creates a partitioned table.
(5) CLUSTERED BY creates a bucketed table.
(6) SORTED BY (rarely used) sorts one or more columns within each bucket.
(7) ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]. Users can use a custom SerDe or a built-in SerDe when creating a table; if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. The SerDe, together with the column definitions, determines how the table's concrete column data is interpreted. SerDe is short for Serialize/Deserialize: Hive uses the SerDe to serialize and deserialize row objects.
(8) STORED AS specifies the storage file type. Common types: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage). For plain-text data use STORED AS TEXTFILE; for data that needs compression use STORED AS SEQUENCEFILE.
(9) LOCATION: specifies the table's storage location on HDFS.
(10) AS: followed by a query statement; creates the table from the query's results.
(11) LIKE allows users to copy the existing table structure, but does not copy the data.
Partition Table
The essence of the partition table is to correspond to a folder on HDFS, and the partition in Hive is a sub-directory.
DML
load
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
insert
basic insert
insert into(overwrite) table ...
Multiple table insert
from tableName insert into table t1 .... select .... from tableName, insert into table t2 .... select .... from tableName
Query and create tables
create table tableName as select ...
Upload data to create table
1. Upload data to hdfs
2. Create the table and specify its location on hdfs
Import data into Hive
Note: Use export first to export, and then import the data.
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
Truncate deletes data in the table
Truncate works only on managed (internal) tables; data in external tables cannot be deleted this way
Commonly used functions
max
min
sum
avg
count
sort
global sort
order by performs a global sort with a single reducer: very slow on large data volumes, efficient on small ones
Partition sorting
distribute by is similar to a custom partitioner in an MR job: the hash code of the distribute-by field is taken modulo the number of reducers, and rows with the same remainder go to the same partition.
local sorting
sort by sorts within each reducer; the overall result is not globally ordered. It is usually combined with distribute by, with distribute by written first and sort by after.
cluster by
cluster by can replace distribute by plus sort by when both use the same field in ascending order; cluster by cannot specify ASC or DESC.
partition, bucket
bucket
Bucketing must be enabled: set hive.enforce.bucketing=true;
Create bucket table
Inserting data into a bucket table is the same as inserting data into a normal table
Bucketing rules
The bucket field's hash code, modulo the number of buckets, determines which bucket a row lands in.
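A minimal Python sketch of this bucketing rule (an illustration of the semantics, not Hive's actual Java hashing; for integer columns the Java hashCode is the value itself, so the rule reduces to a plain modulo):

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # Hive's rule: hash(bucket column) mod num_buckets.
    # For integers the Java hashCode equals the value, so this is key % n.
    return key % num_buckets

rows = [1001, 1002, 1003, 1004, 1005]
print({k: bucket_id(k, 4) for k in rows})
# {1001: 1, 1002: 2, 1003: 3, 1004: 0, 1005: 1}
```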
The difference between bucketing and partitioning
Partitioning is for the data storage path, that is, the folder; bucketing is for the data files.
Bucket sampling survey
grammar
TABLESAMPLE(BUCKET x OUT OF y)
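The clause's semantics can be sketched in Python (illustrative, assuming the table has z buckets): z/y buckets are read, starting at bucket x and stepping by y.

```python
def sampled_buckets(x: int, y: int, z: int) -> list:
    # TABLESAMPLE(BUCKET x OUT OF y) on a table with z buckets reads
    # z/y buckets: bucket x, then x+y, x+2y, ... (1-based, as in Hive).
    assert x <= y, "Hive requires x <= y"
    return [x + k * y for k in range(z // y)]

print(sampled_buckets(1, 2, 4))  # buckets 1 and 3 of a 4-bucket table
```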
Partition
Convert between rows and columns
row to column
CONCAT(string A/col, string B/col…)
CONCAT_WS(separator, str1, str2,...)
COLLECT_SET(col)
Function description
example
sql implementation
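The logic of these functions can be sketched in Python over a small hypothetical data set (the names, constellations and blood types below are made up for illustration): CONCAT_WS joins values with a separator, COLLECT_SET gathers deduplicated values per group.

```python
from collections import defaultdict

# Hypothetical rows: (name, constellation, blood_type)
rows = [("Sun Wukong", "Aries", "A"),
        ("Lao Wang", "Sagittarius", "A"),
        ("Song Song", "Aries", "A"),
        ("Zhu Bajie", "Aries", "B")]

# Roughly: SELECT CONCAT_WS(",", constellation, blood),
#                 CONCAT_WS("|", COLLECT_SET(name))
#          FROM t GROUP BY constellation, blood
groups = defaultdict(list)
for name, constellation, blood in rows:
    key = ",".join([constellation, blood])   # CONCAT_WS(",", ...)
    if name not in groups[key]:              # COLLECT_SET deduplicates
        groups[key].append(name)

result = {k: "|".join(v) for k, v in groups.items()}
print(result)
# {'Aries,A': 'Sun Wukong|Song Song', 'Sagittarius,A': 'Lao Wang', 'Aries,B': 'Zhu Bajie'}
```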
Column to row
Function description
example
sql implementation
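Column-to-row typically combines SPLIT, EXPLODE and LATERAL VIEW; a Python sketch of the effect on hypothetical movie/category rows (the data is made up for illustration):

```python
# EXPLODE(SPLIT(category, ",")) turns one row into one row per category;
# LATERAL VIEW joins each exploded value back to the original columns.
rows = [("Person of Interest", "Suspense,Action,SciFi"),
        ("Lie to Me", "Suspense,Crime")]

exploded = [(movie, cat)
            for movie, cats in rows
            for cat in cats.split(",")]   # SPLIT, then EXPLODE
print(exploded)
```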
window function
OVER()
current row
n preceding
n rows before the current row
n following
n rows after the current row
UNBOUNDED
starting point
unbounded preceding means the frame starts at the first row of the partition
unbounded following means the frame ends at the last row of the partition
LAG(col,n,default_val)
Go nth line forward
lead(col,n,default_val)
The value of col from the nth row after the current row
NTILE(n)
Distribute the rows in the ordered partition to the specified data groups. Each group is numbered, starting from 1. For each row, NTILE returns the number of the group to which the row belongs. Note: n must be of type int.
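A Python sketch of NTILE's distribution rule (illustrative, matching the description above; earlier groups receive the extra rows when the split is uneven):

```python
def ntile(n: int, ordered_rows: list) -> list:
    # Split the ordered partition into n groups numbered from 1;
    # the first `extra` groups each get one additional row.
    size, extra = divmod(len(ordered_rows), n)
    out, i = [], 0
    for g in range(1, n + 1):
        count = size + (1 if g <= extra else 0)
        out.extend((row, g) for row in ordered_rows[i:i + count])
        i += count
    return out

print(ntile(3, ["a", "b", "c", "d", "e"]))
# [('a', 1), ('b', 1), ('c', 2), ('d', 2), ('e', 3)]
```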
example
data
Requirements and implementation
1. Query the customers and total number of customers who purchased in April 2017
2. Check the customer’s purchase details and monthly purchase total
3. In the above scenario, costs must be accumulated according to date.
4. Query the last purchase time of each customer
5. Query the order information of the previous 20% time
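Requirement 3 is a running total, i.e. SUM(cost) OVER(PARTITION BY name ORDER BY orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW); a Python sketch of that frame on hypothetical data for a single customer:

```python
from itertools import accumulate

# Hypothetical purchases for one customer, already sorted by date.
costs = [("2017-01-01", 10), ("2017-02-03", 23), ("2017-04-06", 42)]

# Each row's value is the sum of all costs up to and including that row.
running = list(accumulate(c for _, c in costs))
print(running)  # [10, 33, 75]
```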
rank
If the order is the same, it will be repeated and the total number will remain unchanged.
dense_rank
If the order is the same, there will be no duplication, and the total number will be reduced.
row_number
will be calculated in order
example
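The three numbering functions can be contrasted with a small Python sketch (the scores are hypothetical and already sorted descending): rank repeats on ties and leaves gaps, dense_rank repeats without gaps, row_number just counts.

```python
def rank_all(scores: list) -> list:
    # Returns (score, rank, dense_rank, row_number) for a sorted list.
    out, prev, rank, dense = [], None, 0, 0
    for i, s in enumerate(scores, start=1):
        if s != prev:
            rank, dense, prev = i, dense + 1, s
        out.append((s, rank, dense, i))
    return out

print(rank_all([95, 95, 86, 86, 80]))
# [(95, 1, 1, 1), (95, 1, 1, 2), (86, 3, 2, 3), (86, 3, 2, 4), (80, 5, 3, 5)]
```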
function
built-in functions
Custom function
udf(User-Defined-Function)
Features: One in, one out
Programming steps
1. Inherit the class org.apache.hadoop.hive.ql.exec.UDF
2. The evaluate function needs to be implemented; the evaluate function supports overloading;
3. Create a function in the command line window of hive
1.Upload jar
add jar linux_jar_path
2.Create function
create [temporary] function [dbname.]function_name AS class_name;
4. Delete the function in the command line window of hive
Drop [temporary] function [if exists] [dbname.]function_name;
Note: UDF must have a return type and can return null, but the return type cannot be void;
udaf(User-Defined Aggregation Function)
More in and one out
aggregate function
udtf (User-Defined Table-Generating Functions)
One in, many out
lateral view explode()
Compression and storage
Hive commonly used compression and storage
Parquet file format is commonly used in Hive; lzo or snappy is used for compression.
compression
Compression format table
Compression performance comparison
Compression parameter settings
storage
Common file formats in hive
textFile, parquet, orc, sequencefile
File format classification
Row storage
Features
When querying a whole row that meets the conditions, column storage must fetch each field's value from its separately stored column and reassemble the row, while row storage only needs to locate one value and the rest sit adjacent to it; so row storage is faster for whole-row queries.
textfile
In the default format, the data is not compressed, resulting in high disk overhead and high data parsing overhead.
sequenceFile
Column storage
Features
Because each field's data is stored together, a query that needs only a few fields reads far less data; and since all values in a column share one data type, compression algorithms can be tailored to each column.
parquet
orc
composition
Orc files generally consist of one or more stripes. Each stripe is generally the size of an HDFS block. Each stripe contains multiple records. These records are stored independently according to columns.
stripe
index data
row data
stripe footer
Tuning
fetch capture
It means that in some cases, the query does not need to go through the MR task, such as select * from table. In this case, hive can simply read the file in the table directory and output it to the console.
fetch capture enable settings
Set hive.fetch.task.conversion = more in the hive-default.xml.template file
Once enabled, whole-table reads, limit queries, and single-field selects skip the MR job.
local mode
Hive input data is small, and the time consumed to trigger the query execution task may be much longer than the actual job execution time. In this case, you can turn on the local mode to process all tasks on a single machine. Task execution time for small data sets will be significantly shortened
parameter settings
set hive.exec.mode.local.auto=true; //Enable local mr
Table optimization
Large table, small table join
Putting the small table first and the big table last used to reduce the chance of memory-overflow errors; newer Hive versions optimize this automatically, so the table order no longer matters.
map join (small table join large table)
map join enable settings
map join is enabled by default
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize=25000000;//The default small table size is 25M
map join principle
join big table
1. Empty key filtering
A large amount of abnormal data with empty keys enters the same reducer and is extremely slow to process. Sometimes it even overflows the memory. You can choose to filter out the empty keys first.
2. Empty key conversion
When the many null keys are not abnormal data and must stay in the result set, assign random values to those keys so the rows are distributed evenly across the reducers.
group by
Enable map-side aggregation
parameter settings
1. Enable aggregation on the map side
set hive.map.aggr = true
2. Aggregate the number of items on the map side
set hive.groupby.mapaggr.checkinterval = 100000
3. Perform load balancing when there is data skew (default false)
set hive.groupby.skewindata = true
Principle: two jobs are started. The map side of the first job distributes keys to reducers at random; each reducer aggregates internally and outputs partial results. The second job takes the first job's output, sends identical keys to the same reducer, and performs the final aggregation.
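The two-job idea can be simulated in Python with key salting (an illustration of the principle, not Hive's implementation): job 1 spreads a hot key across reducers with a random prefix and pre-aggregates; job 2 strips the prefix and finishes the aggregation.

```python
import random
from collections import Counter

random.seed(0)
records = ["hot_key"] * 1000 + ["rare_key"] * 3  # skewed hypothetical data

# Job 1: salt each key with a random prefix, then partially count.
partial = Counter(f"{random.randint(0, 3)}#{k}" for k in records)

# Job 2: remove the salt and merge the partial counts.
final = Counter()
for salted, cnt in partial.items():
    final[salted.split("#", 1)[1]] += cnt

print(final)  # Counter({'hot_key': 1000, 'rare_key': 3})
```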
count(distinct)
Small data doesn’t matter
Large amount of data
With a large data volume, count(distinct) goes through a single reducer no matter how many reducers are configured, stalling the job; use group by to deduplicate first, then count.
Cartesian Product
Avoid Cartesian products: if a join has no on condition, or an invalid one, Hive can only use a single reducer to process the data.
Row and column filtering
Column processing
In select, only take the required columns and use select * as little as possible
row processing
If the filter condition on the secondary table is written in the where clause after the join, the two tables are joined in full first and filtered afterwards. Put the filter in a subquery on the secondary table first, then join.
dynamic partitioning
Enable dynamic partition settings
1. Enable dynamic partitioning function settings (enabled by default)
hive.exec.dynamic.partition=true
2. Set to non-strict mode (the mode of dynamic partitioning, the default is strict, which means that at least one partition must be specified as Static partitioning, nonstrict mode means that all partition fields are allowed to use dynamic partitioning. )
hive.exec.dynamic.partition.mode=nonstrict
3. The maximum number of dynamic partitions that can be created on all nodes executing MR. Default 1000
hive.exec.max.dynamic.partitions=1000
4. Maximum number of dynamic partitions each MR node may create. Set it according to the actual data: if the source holds one year of data, the day field has 365 distinct values, so the value must be greater than 365; the default of 100 would cause an error.
hive.exec.max.dynamic.partitions.pernode=100
5. The maximum number of HDFS files that can be created in the entire MR Job. Default 100000
hive.exec.max.created.files=100000
6. Whether to throw an exception when an empty partition is generated. Generally no settings are required. Default false
hive.error.on.empty.partition=false
Partition
bucket
Data skew
1. Set a reasonable number of Maps
Parallel execution
By default, hive can only execute one stage at a time. However, a specific job may contain many stages, and these stages may not be completely dependent on each other. That is to say, some stages can be performed in parallel, and the job can be completed by executing it in parallel. Faster.
Turn on parameter settings
set hive.exec.parallel=true; //Enable parallel execution of tasks set hive.exec.parallel.thread.number=16; //The maximum degree of parallelism allowed for the same sql, the default is 8.
strict mode
jvm reuse
speculative execution
compression
Data warehouse theory
concept
It is a strategic collection that provides all types of data support for the decision-making process at all levels of the enterprise.
feature
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of decision making
Topic oriented
Integration
Non-volatile (cannot be changed)
time-varying
etl
ETL: Extract, Transform, Load
Data warehouse stratification
source data
There are no changes to the data in this layer. It directly uses the peripheral system data structure and data and is not open to the public. It is a temporary storage layer, which is a temporary storage area for interface data to prepare for the subsequent data processing.
data warehouse
Also known as the detail layer, the data in the DW layer should be consistent, accurate, and clean data that has been cleaned (removed of impurities) from the source system data.
Data application
Data sources directly read by front-end applications; data calculated and generated based on report and thematic analysis requirements
why layering
space for time
A large amount of preprocessing improves the responsiveness (user experience) of applications, so a data warehouse holds much redundant data. Without layering, any change to a source system's business rules would affect the entire cleaning process, an enormous amount of rework.
Layering simplifies data cleaning process
Layered management simplifies data cleaning: the original one-step job is divided into several steps, splitting one complex task into multiple simple ones and turning a big black box into a white box. Each layer's processing logic is simple and easy to understand, so the correctness of each step is easier to guarantee; when data errors occur, usually only one step needs partial adjustment.
data mart
Data mart architecture
independent data mart
Dependent data mart
Data warehouse stratification
Data warehouse layering principle
1. In order to facilitate data analysis, it is necessary to shield the underlying complex business and expose the data to the analysis layer in a simple, complete and integrated manner.
2. The impact of underlying business changes and upper-level demand changes on the model is minimized. The impact of business system changes is weakened at the basic data layer. Combined with the top-down construction method, the impact of demand changes on the model is weakened.
3. High cohesion and loose coupling, that is, high cohesion of data within a topic or within each complete system, and loose coupling of data between topics or between complete systems.
4. Build the basic data layer of the warehouse to isolate the underlying business data integration work from the upper-layer application development work, laying the foundation for large-scale development of the warehouse. The warehouse hierarchy will be clearer and the externally exposed data will be more unified.
Data warehouse stratification
ods (Operational Data Store)
Access the original data intact
DW (Data Warehouse)
DWD (Data Warehouse Detail) data detail layer
The granularity is consistent with the ods layer and provides certain data quality guarantees. What the dwd layer needs to do is data cleaning, integration, and standardization. Dirty data, junk data, data with inconsistent specifications, inconsistent status definitions, and inconsistent naming specifications will all be processed. At the same time, in order to improve the usability of data, some dimension degradation techniques will be used to degrade dimensions into fact tables and reduce the association between dimension tables and fact tables. At this layer, some data will also be aggregated to bring data in the same subject area into the same table to improve the usability of the data.
DWM (Data WareHouse Middle) data middle layer
Builds on the DWD layer by aggregating common core dimensions into statistical indicators. Computing wide-table indicators directly from DWD or ODS would mean too much computation with too few reusable dimensions, so the usual practice is to first compute several small intermediate tables in DWM and then join them into one wide table. Because the boundary between DWM and DWS is hard to define, the DWM layer can also be dropped, keeping only DWS.
DWS (Data WareHouse Servce) data service layer
Coarser granularity than the detail layer: DWD data is integrated and summarized into service data for analyzing a subject area, generally as wide tables. The DWS layer should cover about 80% of application scenarios. Also called the data mart or wide-table layer, it serves subsequent business queries, OLAP analysis, data distribution, etc.
APP data application layer
The data is mainly provided for data products and data analysis. It is generally stored in ES, PostgreSql, Redis and other systems for use by online systems. It may also be stored in Hive or Druid for data analysis and data mining. For example, the report data we often talk about is usually placed here.
DIM dimension layer
If you have dimension tables, you can design this layer separately.
High cardinality data
Generally, it is similar to user data table, product table data table. The amount of data may be tens of millions or hundreds of millions
Low-cardinality data
Generally, it is a configuration table, such as the Chinese meaning corresponding to the enumeration value, or the date dimension table. The amount of data may be single digits, hundreds, thousands, or tens of thousands.
Data warehouse modeling method
dimensional modeling
significance
Build a model based on the needs of analyzed decisions, and the constructed data model serves the analysis needs, focusing on solving the problem of users completing analysis quickly, and also has good response performance for large-scale complex queries.
fact table
periodic snapshot fact table
transaction fact table
Cumulative snapshot fact table
fact table without facts
aggregate fact table
Merge fact tables
dimension table
star schema
All dimension tables are connected to the fact table, and dimension tables are not associated with other dimension tables.
constellation model
Extended from the star schema, multiple fact tables share dimension information
snowflake model
The fact table joins to dimension tables, and a dimension table may in turn have its own dimension tables; the many joins lead to lower performance.
Dimensional modeling process
Select business process
Declaration granularity
The same fact table must have the same granularity. Do not mix multiple different granularities in the same fact table. Different fact tables are created for different granularity data. For data whose requirements are unclear, we establish atomic granularity.
Confirm dimensions
Confirm the facts
paradigm modeling
6 paradigms of database
First normal form (1NF)
What is emphasized is the atomicity of the column, that is, the column cannot be divided into other columns.
Second normal form (2NF)
To satisfy the first normal form, two conditions must be met. First, the table must have a primary key; second, the columns not included in the primary key must completely depend on the primary key, and cannot only depend on part of the primary key.
Third normal form (3NF)
To satisfy 3NF, a table must satisfy 2NF and every non-key column must depend directly on the primary key, with no transitive dependencies. That is, it must not happen that non-key column A depends on non-key column B, which in turn depends on the primary key.
Boyce-Codd normal form (BCNF)
Fourth normal form (4NF)
Fifth normal form (5NF)
Modeling steps
1. Conceptual model
2. Logic model
3.Physical model
Data warehouse core construction ideas
At the design, development, deployment and application levels, avoid duplicated construction and redundant indicators, thereby keeping data definitions standardized and unified, and ultimately achieve full-link association of data assets, standard data output and a unified public data layer.
Data center construction process
impala
type of data
shell command
Architecture
module
impalad
Receive the client's request, execute the query and return to the central coordination point
The daemon process on the child node is responsible for maintaining communication with the statestore and reporting work
statestore
Responsible for collecting resource information distributed in each impalad process, the health status of each node, and synchronizing node information
Responsible for the coordination and scheduling of queries
catalog
Distribute table metadata information to each impalad
Receive all requests from statestore
hiveMetastore
hdfs
ddl
Impala does not support WITH DBPROPERTIES (for example: adding creator information and creation date to the library)
Delete database
drop database
impala does not support alter database
dml
Data import (basically the same as hive), impala does not support load data local inpath…
Data export (impala does not support insert overwrite... syntax to export data, you can use impala -o instead)
Export and import commands are not supported
Inquire
The basic syntax is roughly the same as the hive query statement.
impala does not support bucketing
impala does not support cluster by, sort by, distributed by
Impala does not support the COLLECT_SET(col) and explode(col) functions
Impala supports windowing functions
impala custom function
udf
Storage and compression
optimization
1. Try to deploy stateStore and catalog on the same server as much as possible to ensure their communication.
2. Improve efficiency by limiting the memory of the impala daemon (default 256) and the number of statestore threads
3. SQL optimization, call the execution plan before executing SQL
4. Choose the appropriate file format for storage to improve query efficiency
5. Avoid generating many small files (if there are small files generated by other programs, you can use an intermediate table to store the small file data in the intermediate table. Then insert the data from the intermediate table into the final table through insert...select...)
6. Use appropriate partitioning technology and calculate according to partition granularity
7. Use compute stats to collect table information; when a table or partition changes significantly, recompute its statistics, because differences in row counts and distinct-value counts can lead Impala to choose a different join order for queries on that table.
8. Optimization of network io
–a. Avoid sending the entire data to the client
–b. Do conditional filtering as much as possible
–c. Use limit clause
–d. When outputting files, avoid using beautified output
–e. Try to use less full metadata refresh
9. Use profile to output the underlying information plan and optimize the environment accordingly.
Data Lake Theory
spark
Built-in module
spark core
Implemented the basic functions of Spark, including modules such as task scheduling, memory management, error recovery, and interaction with storage systems. Spark Core also includes API definitions for Resilient Distributed DataSet (RDD).
spark sql
Spark's package for working with structured data; it supports HQL
spark streaming
Streaming computation component; its API corresponds closely to Spark Core's
spark mlib
Provides common machine learning libraries
spark graphx
graph computation
operating mode
local mode
standalone mode
yarn mode
yarn-client
Features:
The driver runs on the client and is suitable for interaction and debugging.
yarn-cluster
Features
The driver runs inside the ApplicationMaster started by the ResourceManager; suitable for production environments.
sparkcore
rdd
compute
partitions
partitioner
dependencies
operator
conversion operator
value type
map(func)
mapPartitions(func)
The function type of func must be Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func)
The function type of func must be (Int, Interator[T]) => Iterator[U]
flatMap(func)
glom()
Form each partition into an array and form a new RDD type RDD[Array[T]]
groupBy()
Group by the return value of the function passed in. Put the values corresponding to the same key into an iterator.
filter
sample(withReplacement, fraction, seed)
Randomly sample a fraction of data with the specified random seed. withReplacement indicates whether the extracted data is replaced. true means sampling with replacement, false means sampling without replacement. seed is used to specify the random number generator seed.
distinct([numTasks]))
Returns a new RDD after deduplicating the source RDD. By default, only 8 parallel tasks operate, but this can be changed by passing an optional numTasks parameter.
coalesce(numPartitions)
Reduce the number of partitions to improve the execution efficiency of small data sets after filtering large data sets.
repartition(numPartitions)
All data is randomly shuffled through the network again based on the number of partitions.
The difference between coalesce and repartition
1. coalesce repartitions and lets you choose whether to shuffle, via the parameter shuffle: Boolean = false/true.
2. repartition actually calls coalesce with shuffle = true.
sortBy(func,[ascending], [numTasks])
Use func to process the data first, and sort according to the processed data comparison results. The default is positive order.
pipe(command, [envVars])
The pipeline, for each partition, executes a shell script and returns an RDD of output.
Double value type
union
subtract(otherDataset)
Computes the difference set: elements that also appear in the other RDD are removed; elements present only in the source RDD remain
intersection(otherDataset)
Returns a new RDD after intersecting the source RDD and parameter RDD.
cartesian(otherDataset)
Cartesian Product
zip(otherDataset)
Combine two RDDs into an RDD in the form of Key/Value. By default, the number of partitions and elements of the two RDDs are the same, otherwise an exception will be thrown.
key/value type
partitionBy(partitioner)
Partitions a pair RDD. If the original partitioner is the same as the given one, no repartitioning happens; otherwise a ShuffledRDD is produced, which incurs a shuffle.
reduceByKey(func, [numTasks])
Called on a (K, V) RDD, returns a (K, V) RDD in which the values for each key are aggregated with the given reduce function. The number of reduce tasks can be set through the second, optional parameter.
groupByKey()
groupByKey also operates on each key, but only generates one seq.
The difference between reduceByKey and groupByKey
1. reduceByKey: Aggregation based on key, there is a combine (pre-aggregation) operation before shuffle, and the return result is RDD[k,v]. 2. groupByKey: Group by key and shuffle directly. 3. Development guidance: reduceByKey is recommended over groupByKey. But you need to pay attention to whether it will affect the business logic.
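The difference can be sketched in Python over two explicit "partitions" (illustrative, not Spark's implementation): both produce the same result, but reduceByKey's map-side combine means fewer values cross the shuffle.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# groupByKey: every value crosses the shuffle, then is combined.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_by_key = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey: combine inside each partition first (pre-aggregation),
# so the shuffle carries at most one value per key per partition.
partitions = [pairs[:2], pairs[2:]]
shuffled = defaultdict(list)
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v          # map-side combine
    for k, v in local.items():
        shuffled[k].append(v)
reduce_by_key = {k: sum(vs) for k, vs in shuffled.items()}

print(group_by_key, reduce_by_key)  # same result, less shuffled data
```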
aggregateByKey
Parameters: (zeroValue:U,[partitioner: Partitioner]) (seqOp: (U, V) => U,combOp: (U, U) => U)
In a k-v RDD, values are grouped and merged by key. Within each partition, every value is folded into the accumulator (starting from the initial value) using the seq function, producing a per-partition result per key. The per-partition results for the same key are then merged with the combine function (the first two values are combined, the result is combined with the next, and so on), and each key with its final value is output as a new k-v pair.
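A Python sketch of the seqOp/combOp flow over explicit partitions (illustrative; the helper name and data are made up). The example computes, per key, the max within each partition and then the sum of the partition maxima:

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # seqOp folds each value into the per-partition accumulator
    # (starting from zero in every partition); combOp then merges
    # the partition accumulators for the same key.
    merged = {}
    for part in partitions:
        acc = defaultdict(lambda: zero)
        for k, v in part:
            acc[k] = seq_op(acc[k], v)       # within a partition
        for k, a in acc.items():
            merged[k] = comb_op(merged[k], a) if k in merged else a
    return merged

parts = [[("a", 3), ("a", 2), ("c", 4)], [("b", 3), ("c", 6), ("c", 8)]]
print(aggregate_by_key(parts, 0, max, lambda x, y: x + y))
# {'a': 3, 'c': 12, 'b': 3}
```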
foldByKey
Parameters: (zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Simplified operation of aggregateByKey, seqop and combop are the same
combineByKey[C]
Parameters: (createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
Function: Merge V into a set for the same K
sortByKey([ascending], [numTasks])
When called on an RDD of (K, V), K must implement the Ordered interface and return an RDD of (K, V) sorted by key.
mapValues
For types of the form (K, V), only operate on V
join(otherDataset, [numTasks])
Called on RDDs of type (K, V) and (K, W), it returns an RDD of (K, (V, W)) in which all elements corresponding to the same key are paired together.
cogroup(otherDataset, [numTasks])
Called on RDDs of type (K,V) and (K,W), return an RDD of type (K,(Iterable<V>,Iterable<W>))
action
reduce(func)
Aggregate all elements in the RDD through the func function, first aggregate the data within the partition, and then aggregate the data between partitions
collect()
In the driver, return all elements of the dataset as an array.
count()
Returns the number of elements in the RDD
first()
Returns the first element in the RDD
take(n)
Returns an array consisting of the first n elements of the RDD
takeOrdered(n)
Returns an array consisting of the first n elements sorted by this RDD
aggregate
Parameters: (zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
The aggregate function aggregates the elements in each partition through seqOp and the initial value, and then uses the combine function to combine the result of each partition with the initial value (zeroValue). The final type returned by this function does not need to be consistent with the element type in the RDD
fold(num)(func)
Folding operation, simplified operation of aggregate, seqop and combop are the same.
spark kernel
spark core components
driver
excutor
spark general running process
1) After the task is submitted, the Driver program starts first; 2) the Driver registers the application with the cluster manager; 3) the cluster manager allocates and starts Executors according to the task's configuration; 4) the Driver executes the main function; Spark is lazily evaluated, so when an action operator is reached computation is triggered backwards: stages are divided by wide dependencies, each stage corresponds to a TaskSet containing multiple Tasks, and available Executor resources are found for scheduling; 5) following the data-locality principle, each Task is dispatched to its designated Executor; during execution the Executors keep communicating with the Driver, reporting task status.
Deployment mode
standalone
hadoop yarn
yarn-client
yarn-cluster
mesos
k8s
wordcount
sc.textFile("xx").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).saveAsTextFile("xx")
optimization
Data skew
memory model
Common algorithms
kafka
Hbase
definition
HBase is a highly reliable, high-performance, column-oriented, scalable, distributed NoSQL storage system supporting real-time reads and writes.
Features
1. Mass storage
HBase is suited to storing PB-scale data and can return data within tens to hundreds of milliseconds.
2. Column storage
HBase storage only has column family storage, which contains many columns.
3. Extremely easy to expand
HBase's scalability shows in two aspects: scaling the upper processing layer (RegionServer) and scaling storage (HDFS). Adding RegionServer machines horizontally increases HBase's upper-layer processing capacity and lets HBase serve more Regions.
4. High concurrency
Since most HBase deployments run on cheap PCs, single-IO latency is actually not small, generally tens to hundreds of ms. High concurrency here mainly means that under concurrent load, HBase's per-IO latency barely degrades, giving high-concurrency, low-latency service.
5. Sparse
Sparse is mainly for the flexibility of Hbase columns. In the column family, you can specify as many columns as you want. When the column data is empty, it will not occupy storage space.
Hbase architecture
Architecture diagram
client
Contains the interface for accessing HBase, and also includes cache to speed up HBase access. For example, the cache contains .META information.
zookeeper
Mainly responsible for high availability of master, monitoring of reginserver, metadata entry and maintenance of cluster configuration.
ZK ensures that only one Master is running in the cluster; when the Master fails, a new Master is elected through a competition mechanism.
Monitors RegionServer status through ZK; when a RegionServer is abnormal, the Master is notified of its online/offline status via callbacks.
Unified entry information for storing metadata through zk
hdfs
Provide underlying data storage services for HBase
Provides underlying distributed storage services for metadata and table data
Multiple copies of data ensure high reliability and availability
HMaster
Assign Region to RegionServer
Maintain load balancing across the cluster
Maintain cluster metadata information
Discover the failed Region and allocate the failed Region to the normal RegionServer
When a RegionServer fails, coordinates the splitting of the corresponding HLog
HRegionServer
Mainly handles users’ read and write requests
Manage the Region allocated by the master
Handle read and write requests from clients
Responsible for interacting with the underlying HDFS and storing data to HDFS
Responsible for splitting the Region after it becomes larger
Responsible for merging Storefiles
Role
HMaster
1. Monitor RegionServer
2. Handling RegionServer failover
3. Handle metadata changes
4. Handle region allocation or transfer
5. Load balancing data during idle time
6. Publish your location to clients through Zookeeper
RegionServer
1. Responsible for storing the actual data of HBase
2. Process the Region assigned to it
3. Flush cache to HDFS
4. Maintain Hlog
5. Perform compression
6. Responsible for processing Region fragmentation
Other components
Write-Ahead logs (wal)
HBase's modification record. When data is written to HBase, it is not written directly to disk; it is kept in memory for a period of time (time and data-volume thresholds are configurable). But data held only in memory has a higher probability of being lost, so each edit is first appended to a file called the Write-Ahead logfile and only then written to memory. When the system fails, the data can be reconstructed from this log file.
region
A shard of an HBase table: the table is divided into regions by RowKey range, and the regions are stored on RegionServers; one RegionServer can host multiple different regions.
store
A Store holds the HFiles; one Store corresponds to one column family of the HBase table.
MemStore
As the name suggests, this is in-memory storage used to hold the current data operations; after an edit is saved to the WAL, the RegionServer stores the key-value pairs in the MemStore.
HFile
This is the actual physical file holding the raw data on disk, the real storage file. StoreFiles are stored on HDFS in HFile format.
HBase operations
HBase data structure
RowKey
As in other NoSQL databases, the rowkey is the primary key for accessing data. There are three ways to access HBase rows:
1.Access through a single rowkey
2. Through a rowkey range (scan)
3. Full table scan
The row key (RowKey) can be any string (maximum length 64KB; in practice usually 10–100 bytes). Internally, HBase stores the RowKey as a byte array, and data is stored in lexicographic (byte) order of the RowKey. When designing RowKeys, you must exploit this sort property and store rows that are frequently read together next to each other (locality).
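The lexicographic byte ordering above can be demonstrated in plain Java (an illustration, not HBase code — for ASCII keys, String's natural ordering matches HBase's byte-array comparison): "10" sorts before "2", which is why numeric keys are usually zero-padded to a fixed width to keep related rows physically adjacent.

```java
import java.util.Arrays;

public class RowKeyOrder {
    static String[] sortedAsHBaseWould(String... keys) {
        String[] copy = keys.clone();
        // HBase compares rowkeys as byte arrays; for ASCII strings this is
        // the same as lexicographic String ordering.
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded numbers interleave badly...
        System.out.println(Arrays.toString(sortedAsHBaseWould("2", "10", "1")));   // [1, 10, 2]
        // ...padding to a fixed width restores numeric order.
        System.out.println(Arrays.toString(sortedAsHBaseWould("0002", "0010", "0001"))); // [0001, 0002, 0010]
    }
}
```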
column family
Column family: every column in an HBase table belongs to some column family. Column families are part of the table's schema (columns are not) and must be defined before using the table. Column names are prefixed by their column family; for example, courses:history and courses:math both belong to the courses column family.
cell
A cell is uniquely identified by {rowkey, column family:column, version}. The data in a cell is untyped and stored entirely as bytes.
time stamp
In HBase, a storage unit determined by rowkey and column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase (automatically at write time, as the current system time in milliseconds) or explicitly by the client; an application that wants to avoid version conflicts must generate its own unique timestamps. Within each cell, versions are sorted in reverse chronological order, so the newest data comes first. To avoid the management burden (storage and indexing) of too many versions, HBase provides two version-recycling policies: keep the last n versions, or keep the versions within a recent time window (e.g. the last seven days). This can be configured per column family.
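The versioning behavior above can be modeled in a few lines (a toy model, not HBase internals): versions of one cell are keyed by timestamp, iterated newest-first, and trimmed to the last n versions.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class CellVersions {
    static final int MAX_VERSIONS = 3; // "keep the last n versions" policy
    // timestamp -> value, iterated in descending timestamp order
    static NavigableMap<Long, String> cell = new TreeMap<Long, String>().descendingMap();

    static void put(long timestamp, String value) {
        cell.put(timestamp, value);
        while (cell.size() > MAX_VERSIONS) {
            cell.pollLastEntry(); // recycle the oldest version
        }
    }

    static String get() {
        return cell.firstEntry().getValue(); // newest version is listed first
    }

    public static void main(String[] args) {
        put(1L, "v1"); put(2L, "v2"); put(3L, "v3"); put(4L, "v4");
        System.out.println(get());       // v4
        System.out.println(cell.size()); // 3 — v1 was recycled
    }
}
```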
Namespaces
namespace structure
Table
All tables are members of a namespace, i.e. a table must belong to some namespace; if none is specified, it goes into the default namespace.
RegionServer group
A namespace contains the default RegionServer Group.
Permission
Permissions and namespaces allow us to define access control lists (ACLs).
Quota
Quotas can enforce the number of regions a namespace can contain.
HBase data principles
HBase reading principle
1) The Client first accesses ZooKeeper to find the location of the meta table's region, then reads the meta table itself; the meta table stores the region information of user tables;
2) Find the corresponding region information in the meta table based on namespace, table name and rowkey;
3) Find the regionserver corresponding to this region;
4) Find the corresponding region;
5) First find the data from MemStore, if not, then read it from BlockCache;
6) If the data is not in the BlockCache either, it is read from the StoreFile;
7) If the data is read from StoreFile, it is not returned directly to the client, but is first written to BlockCache and then returned to the client.
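Steps 5–7 above can be sketched as a runnable lookup chain (plain maps standing in for the MemStore, BlockCache, and StoreFile — not HBase code): check memory first, then the cache, and only then disk, populating the cache on the way back.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadPath {
    static Map<String, String> memStore = new HashMap<>();
    static Map<String, String> blockCache = new HashMap<>();
    static Map<String, String> storeFile = new HashMap<>();

    static String get(String rowkey) {
        if (memStore.containsKey(rowkey)) return memStore.get(rowkey);   // 5) MemStore first
        if (blockCache.containsKey(rowkey)) return blockCache.get(rowkey); // 5) then BlockCache
        String value = storeFile.get(rowkey);                             // 6) then StoreFile
        if (value != null) blockCache.put(rowkey, value);                 // 7) cache before returning
        return value;
    }

    public static void main(String[] args) {
        storeFile.put("row1", "v1");
        System.out.println(get("row1"));                    // v1 (read from StoreFile)
        System.out.println(blockCache.containsKey("row1")); // true — now cached for next time
    }
}
```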
HBase writing process
1) Client sends a write request to HregionServer;
2) HregionServer writes data to HLog (write ahead log). For data persistence and recovery;
3) HregionServer writes data to memory (MemStore);
4) Feedback that the Client is written successfully.
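The write order above (WAL first, memory second) can be sketched as follows — an illustration only, with a list standing in for the HLog: because the intent is persisted before memory is touched, a crash after step 2 can still be replayed from the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WritePath {
    static List<String> hlog = new ArrayList<>();       // stands in for the WAL
    static Map<String, String> memStore = new HashMap<>();

    static void put(String rowkey, String value) {
        hlog.add(rowkey + "=" + value); // 2) persist the edit to the HLog first
        memStore.put(rowkey, value);    // 3) then apply it to the MemStore
        // 4) success is acknowledged to the client only after both steps
    }

    public static void main(String[] args) {
        put("row1", "v1");
        System.out.println(hlog.size() == 1 && memStore.containsKey("row1")); // true
    }
}
```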
Data flush process
1) When the MemStore data reaches the threshold (the default is 128M, the old version is 64M), the data is flushed to the hard disk, the data in the memory is deleted, and the historical data in the HLog is deleted;
2) And store the data in HDFS;
3) Make mark points in HLog.
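The flush steps above can be modeled as a toy threshold check (sizes are illustrative, not real HBase accounting): once the MemStore crosses the threshold, its contents become an immutable StoreFile snapshot and the in-memory copy is dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlushModel {
    static final long FLUSH_THRESHOLD = 128L * 1024 * 1024; // 128M default threshold
    static Map<String, String> memStore = new HashMap<>();
    static List<Map<String, String>> storeFiles = new ArrayList<>();
    static long memStoreBytes = 0;

    static void put(String rowkey, String value) {
        memStore.put(rowkey, value);
        memStoreBytes += rowkey.length() + value.length();
        if (memStoreBytes >= FLUSH_THRESHOLD) flush(); // 1) threshold reached
    }

    static void flush() {
        storeFiles.add(new HashMap<>(memStore)); // 2) persist a snapshot (to HDFS in reality)
        memStore.clear();                        //    delete the in-memory data
        memStoreBytes = 0;                       // 3) the HLog can be marked/trimmed here
    }

    public static void main(String[] args) {
        put("row1", "v1");
        flush(); // force a flush for the demo
        System.out.println(storeFiles.size());  // 1
        System.out.println(memStore.isEmpty()); // true
    }
}
```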
data merging process
1) When the number of data blocks reaches 4, the HMaster triggers a merge operation, and the Region loads the data blocks locally to merge them;
2) When the merged data exceeds 256M, it is split, and the resulting Regions are assigned to different HRegionServers to manage;
3) When an HRegionServer goes down, the HLog on it is split and assigned to different HRegionServers to load, and .META. is updated;
4) Note: the HLog is synchronized to HDFS.
Integrated use of hive and HBase
Integrated through hive-hbase-handler-1.2.2.jar
Create a Hive table associated with an HBase table, so that inserting data into the Hive table also affects the HBase table.
A table hbase_emp_table already exists in HBase; create an external table in Hive associated with it, so that Hive can be used to analyze the data in the HBase table.
HBase optimization
High availability (multiple Hmasters)
In HBase, the HMaster monitors the RegionServer lifecycle and balances RegionServer load. If the HMaster goes down, the entire HBase cluster falls into an unhealthy state, and that state must not last long. Therefore HBase supports a high-availability (multi-HMaster) configuration.
pre-partitioned
Manual prepartitioning
create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
Represents five partitions [min,1000),[1000,2000),[2000,3000),[3000,4000),[4000,max]
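Given the split points above, the sketch below (plain Java, not HBase code) shows how a rowkey maps to one of the five regions: a row lands in the region whose start key is the greatest split point not exceeding it, with keys compared as strings.

```java
public class PreSplit {
    // Split points from the create statement above
    static final String[] SPLITS = {"1000", "2000", "3000", "4000"};

    static int regionOf(String rowkey) {
        int region = 0; // region 0 is [min, "1000")
        for (String split : SPLITS) {
            if (rowkey.compareTo(split) >= 0) region++; // past this boundary
            else break;
        }
        return region;
    }

    public static void main(String[] args) {
        System.out.println(regionOf("0999")); // 0 -> [min, 1000)
        System.out.println(regionOf("2500")); // 2 -> [2000, 3000)
        System.out.println(regionOf("9999")); // 4 -> [4000, max)
    }
}
```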
Generate hexadecimal sequence pre-partition
create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Indicates that hexadecimal is divided into 15 partitions
Pre-partition according to the rules set in the file
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
The content of splits.txt is as follows
aaaa
bbbb
cccc
dddd
javaAPI creates partition
// Custom algorithm generating a series of hash values as split keys,
// stored in a two-dimensional array
byte[][] splitKeys = ...; // some hash value function
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance
HTableDescriptor tableDesc = new HTableDescriptor(tableName);
// Create a pre-partitioned HBase table from the descriptor and the hash split keys
hAdmin.createTable(tableDesc, splitKeys);
rowkey design
The rowkey is the unique identifier of a piece of data. It must be designed carefully so that rowkeys are evenly distributed across regions, preventing data skew.
1. Generate random numbers, hashes, and hash values
eg: the original rowkey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1;
the original rowkey 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1;
the original rowkey 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1.
Before doing this, we generally sample the data set to determine the hashed rowkey values used as the split points between partitions.
2. String reversal
eg:20170524000001 converted to 10000042507102
3. String concatenation
20170524000001_a12e 20170524000001_93i7
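The three techniques above can be written out as runnable helpers (the helper names are my own, not an HBase API): hash the key, reverse it, or concatenate a salt suffix. All three trade scan locality for a more even spread across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyDesign {
    // 1. Hashing: SHA1 spreads sequential keys uniformly (40 hex chars)
    static String sha1(String rowkey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(rowkey.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 2. String reversal: puts the fast-changing digits first
    static String reverse(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    // 3. String concatenation: append a salt suffix
    static String salt(String rowkey, String suffix) {
        return rowkey + "_" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(sha1("1001").length());          // 40
        System.out.println(reverse("20170524000001"));      // 10000042507102
        System.out.println(salt("20170524000001", "a12e")); // 20170524000001_a12e
    }
}
```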
Memory optimization
HBase has a large memory overhead at runtime; after all, Tables can be cached in memory. Generally, 70% of the available memory is allocated to HBase's Java heap. However, a very large heap is not recommended, because if a GC pause lasts too long, the RegionServer will be unavailable for a long time. Generally 16–48G of memory is enough. And if the framework occupies too much memory and leaves the system short, the framework will itself be dragged down by system services.
Basic optimization
1. Allow appending content to HDFS files
hdfs-site.xml, hbase-site.xml
Property: dfs.support.append
Explanation: Turning on HDFS append synchronization cooperates well with HBase data synchronization and persistence. The default value is true.
2. Optimize the maximum number of open files allowed by DataNode
hdfs-site.xml
Property: dfs.datanode.max.transfer.threads
Explanation: HBase generally operates on a large number of files at the same time. Depending on cluster size and data activity, set to 4096 or higher. Default value: 4096.
3. Optimize the waiting time of data operations with high latency
hdfs-site.xml
Property: dfs.image.transfer.timeout
Explanation: If some data operation has very high latency and the socket needs to wait longer, it is recommended to set this value larger (default 60000 milliseconds) to ensure the socket does not time out.
4. Optimize data writing efficiency
mapred-site.xml
Properties: mapreduce.map.output.compress, mapreduce.map.output.compress.codec
Explanation: Enabling these two settings can greatly improve file-write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
5. Set the number of RPC listeners
hbase-site.xml
Property: hbase.regionserver.handler.count
Explanation: The default value is 30; it specifies the number of RPC listeners. Adjust it according to the number of client requests; increase it when there are many reads and writes.
6. Optimize HStore file size
hbase-site.xml
Property: hbase.hregion.max.filesize
Explanation: The default value is 10737418240 (10GB). When running HBase MR jobs you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task runs too long. The meaning of this value: when an HFile reaches this size, the region is split in two.
7. Optimize hbase client cache
hbase-site.xml
Property: hbase.client.write.buffer
Explanation: Specifies the HBase client write buffer. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally we set a certain buffer size to reduce the RPC count.
8. Specify scan.next to scan the number of rows obtained by HBase
hbase-site.xml
Property: hbase.client.scanner.caching
Explanation: Specifies the default number of rows fetched by the scan.next method. The larger the value, the greater the memory consumption.
9.flush, compact, split mechanism
When the MemStore reaches the threshold, the data in the Memstore is flushed into the Storefile; the compact mechanism merges the flushed small files into a large Storefile. split means that when the Region reaches the threshold, the overly large Region will be divided into two.
Attributes involved:
hbase.hregion.memstore.flush.size: 134217728
That is, 128M is the default MemStore threshold: when the total size of all MemStores in a single HRegion exceeds this value, all MemStores of that HRegion are flushed. The RegionServer processes flushes asynchronously by adding requests to a queue, following a producer–consumer model. The problem: if the queue cannot be consumed fast enough and a large backlog of requests builds up, memory can spike suddenly and, in the worst case, trigger OOM.
hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38
That is, when the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files, in descending order of size, until the memory used by MemStores falls below lowerLimit.
bloom filter
flink
DataX
presto
Oozie
data assets
springboot
Interviewer Shrimp and Pig Heart Interview Questions
Linux
Self introduction
1-minute self-introduction: In my previous company I mainly built the China Continent Insurance data platform, which ingested insurance data from Xiaomi, JD.com, Che300, Continent Insurance's internal systems, and bank channels. The data warehouse was divided into many subject areas, including orders, customers, claims, overdue, approvals, and complaints; the business I developed involved the order, user, and overdue subjects. My work covered data warehouse modeling, data processing, data quality monitoring, and subject-metric calculation. The main technologies applied included Hive, Spark, Hue, Impala, Hadoop, Informatica, etc. The platform added about 1TB of data per day.
Business introduction