MindMap Gallery 23 big data knowledge points and interview summary (1)
A simple template covering Java basics, Hadoop, Hive, data warehouse theory, Impala, data lake theory, and more.
Edited at 2024-01-18 15:05:07
Big data knowledge points and interview summary
java basics
Basic data types
byte, boolean: 1 byte; char, short: 2 bytes; int, float: 4 bytes; long, double: 8 bytes
Exceptions
Throwable
Error
Catastrophic, fatal errors that the program cannot recover from, such as stack overflow (StackOverflowError)
Exception
runtime exception
NullPointerException, ArrayIndexOutOfBoundsException
Compile-time exceptions (non-runtime exceptions)
IOException, ClassNotFoundException
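A small sketch of the distinction above: runtime (unchecked) exceptions need no declaration, while compile-time (checked) exceptions must be caught or declared. The class and method names here are illustrative.

```java
public class ExceptionDemo {
    // Checked exception: the compiler forces callers to handle or declare it.
    static void readConfig() throws java.io.IOException {
        throw new java.io.IOException("config file missing");
    }

    public static void main(String[] args) {
        // Unchecked (runtime) exception: no declaration required.
        try {
            String s = null;
            s.length();                       // throws NullPointerException
        } catch (NullPointerException e) {
            System.out.println("caught runtime exception");
        }
        // Checked exception: must be caught (or main must declare it).
        try {
            readConfig();
        } catch (java.io.IOException e) {
            System.out.println("caught checked exception");
        }
    }
}
```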
Polymorphism
data structure
Common data structures (8 types)
array
Query and modification are fast, addition and deletion are slow
Storage is contiguous; memory usage is heavy and the space cost is high.
Advantages: Random reading and modification are fast because the array is continuous (strong random access and fast search speed)
Disadvantages: insertion and deletion are inefficient, because inserting requires shifting all subsequent elements in memory; the size is fixed and cannot easily grow dynamically.
linked list
Fast addition and deletion, slow search
Storage is non-contiguous; memory use is flexible and the space cost is low.
Advantages: fast insertion and deletion, high memory utilization, no fixed size, flexible expansion
Disadvantages: Cannot search randomly, start from the first one every time, low query efficiency
Hash table
Addition, deletion and lookup are all fast; hashing the data wastes some storage space
queue
First in, first out; insertion at the tail and removal at the head are fast, other access is slow.
stack
First in, last out; push and pop at the top are fast, access to elements other than the top is slow.
red black tree
Both additions, deletions and searches are fast, and the algorithm structure is complex (see binary tree for details)
Binary tree
Addition, deletion and search are fast, but the deletion algorithm has a complex structure.
The time complexity is O(logn) in the best case and O(n) in the worst case.
special binary tree
full binary tree
If a binary tree has K levels and 2^K - 1 nodes, it is a full binary tree
complete binary tree
A complete binary tree with h levels has the maximum number of nodes in levels 1 to h-1, and all nodes in level h are packed to the left
binary search tree
Values in the left subtree are all smaller than the root, and values in the right subtree are all larger than the root; an in-order traversal therefore yields the keys in ascending order
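The in-order property can be sketched with a minimal BST (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal binary search tree: left < node < right, so an
// in-order traversal visits the keys in ascending order.
public class BstDemo {
    static class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root;
    }

    static void inorder(Node n, List<Integer> out) {
        if (n == null) return;
        inorder(n.left, out);
        out.add(n.key);
        inorder(n.right, out);
    }

    public static void main(String[] args) {
        Node root = null;
        for (int k : new int[]{5, 2, 8, 1, 9}) root = insert(root, k);
        List<Integer> sorted = new ArrayList<>();
        inorder(root, sorted);
        System.out.println(sorted); // [1, 2, 5, 8, 9]
    }
}
```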
red black tree
A balanced binary tree is either empty or a tree whose left and right subtree heights differ by at most 1, with both subtrees themselves balanced. A red-black tree is a self-balancing binary search tree
Features:
time complexity
The worst-case time complexity of insertion, deletion and lookup is O(log N)
Height is O(log N)
Nodes are either black or red
The root node must be black
If a node is red, its child nodes must be black.
For any node, the number of black nodes on the path to the leaf node must be the same.
Each leaf node (leaf node is the NULL pointer or NULL node at the end of the tree) is black
Optimal binary tree (Huffman tree)
The weighted path length of the tree reaches the minimum
B-tree
A balanced multi-way search tree (more than two search paths per node); unlike a binary tree it is a multi-way tree, with O(log n) operations
B+ tree
A self-balancing tree data structure that keeps data ordered; search, sequential access, insertion and deletion are all O(log n). The B+ tree stores actual data only in leaf nodes, which eliminates some defects of the B-tree: non-leaf nodes hold only indexes, all data lives in the leaf nodes, and the leaves are linked in order.
The low tree height and linked leaves support range search
Why does MySQL use B+ trees?
Less disk IO and supports range lookup
The difference between B-trees and B+ trees
1) In a B+ tree, all data resides in the leaf nodes 2) The leaf nodes of a B+ tree carry bidirectional pointers, and the data in the leaves is linked in ascending order, which makes range searches convenient
bitmap
Saves storage space, but is inconvenient for describing complex data relationships
Collections
Traversing collection problem
When traversing a collection with an iterator, structurally modifying the collection (adding or removing elements) outside the iterator throws a ConcurrentModificationException.
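The fail-fast behavior, and the safe alternative via the iterator's own remove(), can be sketched as:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Modifying a collection directly while iterating triggers a fail-fast
// ConcurrentModificationException; Iterator.remove() is the safe way
// to delete during traversal.
public class FailFastDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(List.of("a", "b", "c"));
        boolean failed = false;
        try {
            for (String s : list) {
                if (s.equals("a")) list.remove(s);   // structural change mid-iteration
            }
        } catch (java.util.ConcurrentModificationException e) {
            failed = true;
        }
        System.out.println("fail-fast triggered: " + failed);

        // Safe removal through the iterator itself.
        List<String> list2 = new ArrayList<>(List.of("a", "b", "c"));
        for (Iterator<String> it = list2.iterator(); it.hasNext(); ) {
            if (it.next().equals("b")) it.remove();
        }
        System.out.println(list2); // [a, c]
    }
}
```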
Collection
List
ArrayList
Features: Storage elements are in order, querying is fast and addition and deletion are slow.
The underlying array is implemented, and the capacity can grow automatically. The default expansion is 1.5 times the original capacity.
Not thread-safe; fine for single-threaded use, not recommended with multiple threads. Collections.synchronizedList(List l) returns a thread-safe wrapper around a List.
The implementation calls Arrays.copyOf() heavily, and uses System.arraycopy() when expanding capacity.
Vector
Array-based, with methods modified by the synchronized keyword: thread-safe but slow. The default initial capacity is 10, and by default the capacity doubles on expansion.
LinkedList
Implementation of doubly linked list structure
Query is slow, addition and deletion are fast
Similarities and differences among LinkedList, ArrayList, and Vector:
Set
SortedSet
TreeSet
Implemented on a red-black tree; elements are kept in sorted order; not thread-safe
HashSet
Subclass: LinkedHashSet
HashSet is backed by a HashMap, so its data structure is array + linked list + red-black tree. It allows null, forbids duplicates, is unordered, and is not thread-safe. Elements are placed according to their hash, giving high lookup, deletion and access performance. Collections.synchronizedSet() returns a thread-safe wrapper around a Set.
Elements can only be accessed through iterators
queue
Map
HashMap
HashMap
The bottom layer of HashMap is implemented using arrays, linked lists, and red-black trees.
When a chain in one bucket of the array reaches 8 nodes, the chain is converted into a red-black tree; when a red-black tree shrinks below 6 nodes, it reverts to a linked list.
The default initial capacity of HashMap is 16. When the number of stored elements exceeds load factor * current capacity, the capacity is doubled and the position of every element in the array is recalculated.
It is not thread safe, you can use ConcurrentHashMap
ConcurrentHashMap
jdk1.8
Implemented with a node array + linked lists + red-black trees, using CAS (optimistic locking) and synchronized
Subclass
LinkedHashMap
HashTable
Subclass
Properties
Thread safety
Thread safety is achieved by locking the entire hash table; this guarantees safety, but makes concurrent execution inefficient.
SortedMap
TreeMap
Design Patterns
Singleton mode, 3 common types
Lazy mode, instantiated on the first call
Eager ("hungry") mode: the class instantiates itself when it is initialized; thread-safe
Registration mode
proxy mode
static proxy
dynamic proxy
jdk dynamic proxy
CGlib
Factory pattern
builder pattern
adapter mode
iterator pattern
Common categories
String
Represents a string
The String class is modified with final, so it cannot be inherited; it represents an immutable character sequence.
The character content of the String object is stored in a character array value[]
If a String constant is concatenated with another constant, the result lives in the constant pool of the method area, and the pool never holds two strings with the same content; if any variable takes part in the concatenation, the result is stored on the heap.
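The pool-vs-heap distinction can be observed with reference comparisons:

```java
// Constant folding vs runtime concatenation: compile-time constant
// expressions are interned in the string pool, while concatenation
// involving a variable builds a new object on the heap.
public class StringPoolDemo {
    public static void main(String[] args) {
        String ab = "ab";
        String folded = "a" + "b";         // constant folding -> pooled "ab"
        System.out.println(ab == folded);  // true

        String a = "a";
        String concat = a + "b";           // runtime concat -> new heap object
        System.out.println(ab == concat);  // false
        System.out.println(ab == concat.intern()); // true: intern() returns the pooled instance
    }
}
```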
JVM
javac
The compiler that compiles Java source into bytecode (.class files) that the JVM can load
components
jvm memory partition
program counter
A small memory area that acts as the line-number indicator of the bytecode the current thread is executing; changing the counter's value selects the next bytecode instruction to execute. Each thread has its own program counter, and the counters of different threads do not affect each other. It is the only area where OutOfMemoryError cannot occur. When the thread is executing a Java method, the counter records the address of the virtual machine bytecode instruction being executed; when a native method is executed, the counter is undefined (empty).
java virtual machine stack
Thread-private; stores stack frames, which hold local variables, operand stacks, dynamic links, method exit information, etc.
native method stack
Call operating system class library
heap
Thread sharing, basically all objects are in the Java heap, the main area for garbage collection
method area
Thread-shared; mainly stores class information, constants, static variables, and JIT-compiled code
Runtime constant pool
Part of the method area. Storing literal and symbolic references
Multithreading
program
A set of instructions written in a certain language to complete a specific task, that is, a static piece of code, a static object
process
The process of executing a program once or the program being executed is a dynamic process with its own process of creation, existence and demise (life cycle)
Features:
Programs are static, processes are dynamic
Process is the smallest unit of resource allocation. When the system is running, each process will be allocated a different memory area.
thread
The process is further subdivided into threads, which are an execution path of the program.
Features:
If a process executes multiple threads at the same time, it supports multi-threading.
Threads serve as scheduling and execution units. Each thread has an independent stack and program counter (pc), and the overhead of thread switching is small.
Multiple threads in a process share the same memory unit/memory address space -> they allocate objects from the same heap and can access the same objects and variables. This makes communication between threads simpler and more efficient. But sharing between multiple threads will bring security issues
Parallelism and Concurrency
parallel:
Two or more events occur simultaneously
concurrent:
Two or more events occur within the same time interval
Thread creation
Thread
Inherit the Thread class and override the run() method; the code in run() is the thread body. Instantiate the subclass and call start() to launch the thread, which then invokes run(). Thread itself implements the Runnable interface.
Thread common methods
(1) void start(): starts the thread and causes the object's run() method to be executed
(2) run(): the operations the thread performs when scheduled
(3) String getName(): returns the thread's name
(4) void setName(String name): sets the thread's name
(5) static Thread currentThread(): returns the current thread; inside a Thread subclass this is `this`; usually used in the main thread and in Runnable implementations
(6) static void yield(): the thread yields: pauses the currently executing thread and gives the chance to run to a thread of the same or higher priority; if no thread of the same priority is queued, the call is ignored
(7) join(): when one thread calls another thread's join() method, the calling thread blocks until the joined thread finishes; this lets even low-priority threads get executed
(8) static void sleep(long millis): makes the currently active thread give up the CPU for the specified time (in milliseconds), giving other threads a chance to run; the thread re-queues when the time is up; throws InterruptedException
(9) stop(): forcibly ends the thread's life cycle; not recommended
(10) boolean isAlive(): returns a boolean indicating whether the thread is still alive
Thread priority:
When a thread is created, it inherits the priority of the parent thread.
Low priority only has a low probability of being scheduled, and does not necessarily need to be called after a high-priority thread.
1. Thread priority levels: MAX_PRIORITY: 10, MIN_PRIORITY: 1, NORM_PRIORITY: 5
2. Methods involved: getPriority() returns the thread's priority value; setPriority(int newPriority) changes the thread's priority
Thread state Thread.State
New
ready
run
block
die
Runnable
1) Define a subclass that implements the Runnable interface.
2) Override the run method of the Runnable interface in the subclass.
3) Create a thread object through the parameterized constructor of the Thread class.
4) Pass the Runnable subclass object as the actual parameter to the Thread constructor.
5) Call the Thread object's start method: this starts the thread, which calls the Runnable implementation's run method.
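The steps above can be sketched as (class and thread names are illustrative):

```java
// Implement Runnable, override run(), wrap in a Thread, and start it.
public class RunnableDemo implements Runnable {
    @Override
    public void run() {
        System.out.println("running in: " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new RunnableDemo(), "worker-1"); // Runnable passed to Thread's constructor
        t.start();   // start() schedules the thread, which invokes run()
        t.join();    // wait for the worker to finish
        System.out.println("done");
    }
}
```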
callable
A Callable can be executed with an ExecutorService, or passed as the parameter of a FutureTask
public class MyCallable implements Callable<T> {
    @Override
    public T call() throws Exception {
        // Define the work to perform here
    }
}
MyCallable myCallable = new MyCallable();
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<T> future = executor.submit(myCallable);
T result = future.get();
Callable<Process> task = () -> {
    // Execute the asynchronous task
    Runtime runtime = Runtime.getRuntime();
    Process process = runtime.exec("/Users/mac/Desktop/qc-java-runtime/src/main/java/com/qc/runtime/shell.sh");
    return process;
};
// Wrap the Callable in a FutureTask
FutureTask<Process> future = new FutureTask<>(task);
// Start a new thread to run the asynchronous task
new Thread(future).start();
// Get the result of the asynchronous task
Process result = future.get();
System.out.println(result);
Lock
For concurrent work you need a way to keep two tasks from interfering on the same resource (i.e. from racing on shared state). The standard way to prevent such conflicts is locking: the first task to access a resource locks it, and other tasks cannot access it until it is unlocked, at which point another task can lock and use it.
Synchronized
Any object can serve as a synchronization lock; every object carries a single built-in lock (monitor). Lock used by synchronized methods: static methods lock ClassName.class, non-static methods lock this. Synchronized code blocks: the lock object is specified explicitly, often this or ClassName.class.
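A sketch of a synchronized instance method (equivalent to locking `this`); without the lock, the concurrent increments below would race and lose updates:

```java
// A synchronized instance method locks `this`; a static one would
// lock the Class object instead.
public class SyncCounterDemo {
    private int count = 0;

    // Equivalent to wrapping the body in synchronized (this) { ... }
    synchronized void increment() { count++; }

    int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        SyncCounterDemo counter = new SyncCounterDemo();
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) counter.increment();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.get()); // 20000: no lost updates
    }
}
```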
deadlock
Different threads each hold synchronization resources the other needs and refuse to give them up; each waits for the other to release its resources, forming a deadlock.
After a deadlock occurs, no exception or prompt will occur, but all threads are blocked and cannot continue.
lock release
The execution of the synchronization method and synchronization code block of the current thread ends.
The current thread encounters break and return in a synchronized code block or synchronized method, which terminates the continued execution of the code block and method.
An unhandled Error or Exception occurs in the current thread in the synchronized code block or synchronized method, resulting in an abnormal end.
The current thread executes the wait() method of the thread object in the synchronization code block and synchronization method. The current thread pauses and releases the lock.
Network communication protocol
OSI model (protocol)
The model is too ideal and cannot be promoted on the Internet
OSI layering
Application layer
presentation layer
session layer
transport layer
Network layer
data link layer
physical layer
TCP/IP model (protocol)
de facto international standard
TCP/IP layering
Application layer
transport layer
Network layer
physical data link layer
Comparison of two models
Data encapsulation
Data dismantling
IP and port
Important network transport layer protocols UDP and TCP
UDP (User Datagram Protocol)
1. Encapsulates data, source and destination into packets without establishing a connection
2. Each datagram is limited to 64K in size
3. The sender does not care whether the receiver is ready, and the receiver does not acknowledge receipt, so it is unreliable
4. Can broadcast
5. No resources need to be released after sending; low overhead, fast
TCP (Transmission Control Protocol)
1. Before using TCP, a TCP connection must be established to form a data transmission channel
2. Before transmission, a "three-way handshake" is used; point-to-point communication is reliable
3. Two application processes communicate over TCP: a client and a server
4. Large amounts of data can be transmitted over the connection
5. After transmission completes, the established connection must be released, so it is less efficient
Three-way handshake (establishing connection)
1. The client sends a TCP connection request. The segment carries a seq sequence number randomly generated by the sender, and the SYN (synchronize) flag is set to 1, indicating that a TCP connection is requested. (SYN=1, seq=x, x randomly generated)
2. The server replies to the client's request. The reply carries its own randomly generated seq, sets SYN to 1, and includes an ACK field whose value is the client's seq plus 1, so that when the client receives it, it knows its connection request has been acknowledged. (SYN=1, ACK=x+1, seq=y, y randomly generated) Adding 1 to the acknowledged sequence number identifies which connection is being confirmed.
3. After the client receives the server's reply, it increments its own sequence number and acknowledges by adding 1 to the server's seq. (ACK=y+1, seq=x+1)
Wave four times (disconnect)
1. The client sends a segment requesting to close the TCP connection. It carries a randomly generated seq and sets the FIN flag to 1, indicating that the connection should be closed. (FIN=1, seq=x, x randomly generated by the client)
2. The server replies to the client's close request. The reply carries its own randomly generated seq and an ACK field whose value is the client's seq plus 1, so that when the client receives it, it knows its close request has been acknowledged. (ACK=x+1, seq=y, y randomly generated by the server)
3. After acknowledging, the server does not close the connection immediately: it first makes sure all data still being transmitted to the client has been delivered. Once that is confirmed, it sends its own segment with FIN set to 1 and a new randomly generated seq. (FIN=1, ACK=x+1, seq=z, z randomly generated by the server)
4. After receiving the server's FIN, the client replies with a randomly generated seq and an ACK equal to the server's seq plus 1, completing the acknowledgement. (ACK=z+1, seq=h, h randomly generated by the client) This completes the four-way close.
Network Socket
The combination of ip and port forms a socket
The essence of network communication: communication between sockets
Classification
stream socket
Provide reliable byte stream service using TCP
TCP-based Socket programming steps
client
1. Create a Socket: construct a Socket object from the specified server's IP address and port number. If the server responds, a communication line from client to server is established; if the connection fails, an exception is thrown.
2. Open the input/output streams attached to the Socket: getInputStream() obtains the input stream and getOutputStream() obtains the output stream for data transmission.
3. Read from / write to the Socket according to some protocol: the input stream reads what the server put on the line (but not what the client itself wrote), and the output stream writes information onto the line.
4. Close the Socket: disconnect the client from the server and release the line.
Service-Terminal
1. Call ServerSocket(int port): create a server-side socket bound to the specified port, used to listen for client requests.
2. Call accept(): listen for connection requests; when a client requests a connection, accept it and return a communication Socket object.
3. Call getOutputStream() and getInputStream() on the Socket object: obtain the output and input streams and start sending and receiving network data.
4. Close the ServerSocket and Socket objects: once the client access completes, close the communication socket.
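The client and server steps above can be combined into one runnable sketch; port 0 asks the OS for a free ephemeral port, and the names are illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal TCP echo: the server binds, accepts one connection and echoes
// a line; the client connects, writes, and reads the reply.
public class TcpEchoDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {        // 1. bind
            Thread serverThread = new Thread(() -> {
                try (Socket conn = server.accept();              // 2. accept
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()));
                     PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                    out.println("echo: " + in.readLine());       // 3. read/write
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            serverThread.start();

            try (Socket client = new Socket("localhost", server.getLocalPort()); // client connects
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println("hello");
                System.out.println(in.readLine()); // echo: hello
            }
            serverThread.join();                                  // 4. close (try-with-resources)
        }
    }
}
```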
datagram socket
Use UDP to provide "best effort" data services
UDP-based network programming
1. The DatagramSocket and DatagramPacket classes implement network programs based on the UDP protocol.
2. UDP datagrams are sent and received through a DatagramSocket; the system does not guarantee that a datagram will safely reach its destination, or when it will arrive.
3. A DatagramPacket object encapsulates a UDP datagram, which contains the sender's IP address and port number and the receiver's IP address and port number.
4. Every datagram in UDP carries complete address information, so no connection needs to be established between sender and receiver, much like mailing a package.
process
1. Use DatagramSocket and DatagramPacket
2. Set up the sending end and the receiving end
3. Create the data packet
4. Call the socket's send and receive methods
5. Close the socket
Note: the sending end and the receiving end are two independently running programs.
URL
reflection
concept
Java reflection operates at runtime: for any class, all its properties and methods can be discovered; for any object, any of its methods and properties can be invoked. This ability to dynamically obtain information and dynamically invoke an object's methods is called the reflection mechanism of the Java language.
Get an instance of the class class
1) Premise: the concrete class is known. Obtain it through the class attribute of the class: safest, most reliable, best performance. Example: Class clazz = String.class;
2) Premise: an instance of the class is known. Call the instance's getClass() method to obtain the Class object. Example: Class clazz = "www.atguigu.com".getClass();
3) Premise: the fully qualified class name is known and the class is on the classpath. Obtain it via the static method Class.forName(); may throw ClassNotFoundException. Example: Class clazz = Class.forName("java.lang.String");
4) Other ways (not required): ClassLoader cl = this.getClass().getClassLoader(); Class clazz4 = cl.loadClass("fully qualified class name");
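The three main ways above yield the same singleton Class instance, which can then be queried reflectively:

```java
import java.lang.reflect.Method;

// All three ways of obtaining a Class object return the same instance:
// there is exactly one Class object per loaded class.
public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        Class<?> c1 = String.class;                      // 1. class literal
        Class<?> c2 = "hello".getClass();                // 2. instance.getClass()
        Class<?> c3 = Class.forName("java.lang.String"); // 3. forName (may throw ClassNotFoundException)
        System.out.println(c1 == c2 && c2 == c3);        // true

        // Dynamic invocation: call String.length() reflectively.
        Method length = c1.getMethod("length");
        System.out.println(length.invoke("hadoop"));     // 6
    }
}
```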
Common methods of class class
Types with Class objects
(1) class: Outer classes, members (member inner classes, static inner classes), local inner classes, anonymous inner classes (2) interface: interface (3)[]: array (4) enum: enumeration (5) annotation: annotation @interface (6) primitive type: basic data type (7)void
example
Class c1 = Object.class;
Class c2 = Comparable.class;
Class c3 = String[].class;
Class c4 = int[][].class;
Class c5 = ElementType.class;
Class c6 = Override.class;
Class c7 = int.class;
Class c8 = void.class;
Class c9 = Class.class;
int[] a = new int[10];
int[] b = new int[100];
Class c10 = a.getClass();
Class c11 = b.getClass();
hadoop
hdfs
hdfs advantages and disadvantages
advantage
Fault tolerance (replication mechanism)
Suitable for processing big data, can handle petabyte-level data, and can handle millions of files
Deployable on cheap machines
shortcoming
Not suitable for low-latency data access
It cannot store large numbers of small files efficiently: the NameNode's memory is limited, and with too many small files the seek time exceeds the file processing time.
Concurrent writes and random file modification are not supported.
hdfs architecture composition
namenode
Manage hdfs namespace
Configure replica policy
Manage block mapping information
Handle client requests
datanode
Store data block
Execute read and write requests for data blocks
client
File splitting, split the file into blocks when uploading the file
Interact with namenode to obtain file location information
Interact with datanode, read or write data
client provides commands to manage hdfs
The client provides commands to access HDFS, such as add, delete and check operations on HDFS.
secondaryNamenode
Assists the NameNode: periodically merges the fsimage and edits files and pushes the result back to the NameNode
Assisted recovery of Namenode in emergency situations
hdfs file block size
hadoop 1.x: 64 MB; hadoop 2.x/3.x: 128 MB
The optimal state is when the addressing time is 1% of the transmission time
Why HDFS file block size cannot be too large or too small
1. If it is too small, it will increase the seeking time.
2. If the block is too large, the data transmission time will be significantly greater than the time to locate the starting block position, causing the program to process this data block very slowly.
shell
upload
hadoop fs -moveFromLocal local file hdfs directory (cut from local and paste to hdfs)
hadoop fs -copyFromLocal local file hdfs directory (copy locally to hdfs)
hadoop fs -put local file hdfs directory (copy locally and upload to hdfs)
hadoop fs -appendToFile local file hdfs file (local file data is appended to the end of the hdfs file)
download
hadoop fs -copyToLocal hdfs file local directory (file copied to local)
hadoop fs -get hdfs file local directory (file copied to local)
operating hdfs
hadoop fs -ls directory (display information under the directory)
hadoop fs -mkdir directory (create directory)
-chgrp, -chmod, -chown (modify file ownership)
hadoop fs -cat file (display file contents)
hadoop fs -cp file directory (copy the file to another directory)
hadoop fs -tail file (displays 1kb of data at the end of the file)
hadoop fs -rm delete files or folders
hadoop fs -rm -r directory (recursively delete the directory and the contents under the directory)
hadoop fs -du
hadoop fs -setrep (sets the replication factor, which is only recorded in the NameNode's metadata; if the number of DataNodes is smaller than the requested replication factor, the actual number of replicas can only be as large as the number of DataNodes)
hdfs api operation: through the FileSystem object
hdfs reading and writing process
Reading process
write process
Illustration
hive
concept
Hive is a data warehouse management tool based on Hadoop, which maps structured data into a table and provides SQL-like query functions.
principle
Essence: converts HQL into MapReduce tasks
metastore
Metadata: including table name, database to which the table belongs (default default), table owner, column/partition field, table type (whether it is an external table), directory where the table's data is located, etc.
client
Provide jdbc/ODBC interface, webUI access, interface access, etc.
driver
ParserSQL Parser
Converts SQL into an abstract syntax tree (AST); this step is usually done with a third-party tool such as ANTLR. The AST is then analyzed to check whether the SQL is semantically correct, e.g. whether the tables and column names exist.
Compiler (Logical Plan)
Convert AST into logical execution plan
Optimizer (Query Optimizer)
Optimize logical execution plans
Execution
Converts the logical execution plan into a runnable physical execution plan; for Hive this means MR/Tez/Spark tasks
type of data
Basic data types
hive's String type can theoretically store 2G characters
Collection data type
Collection data type table creation statement
create table test(
    name string,
    friends array<string>,
    children map<string, int>,
    address struct<street:string, city:string>
)
row format delimited fields terminated by ','  -- column delimiter
collection items terminated by '_'             -- delimiter for MAP, STRUCT and ARRAY elements (data splitting symbol)
map keys terminated by ':'                     -- separator between map key and value
lines terminated by '\n';                      -- line separator
type conversion
Hive's atomic data types can be implicitly converted, similar to Java's widening conversions. For example, where an expression expects INT, a TINYINT is automatically converted to INT. Hive will not perform the reverse (narrowing) conversion: where an expression expects TINYINT, an INT is not automatically converted, and an error is returned unless a CAST operation is used.
Implicit conversion rules
1. Any integer type can be implicitly converted to a wider type: e.g. TINYINT can be converted to INT, and INT to BIGINT.
2. All integer types, FLOAT and STRING types can be implicitly converted to DOUBLE. (Implicit conversion of String and double can easily cause data skew)
3.TINYINT, SMALLINT, and INT can all be converted to FLOAT.
4. The BOOLEAN type cannot be converted to any other type.
DDL
Create table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
explain
(1) CREATE TABLE creates a table with the given name; if a table with the same name already exists, an exception is thrown, which the user can suppress with IF NOT EXISTS.
(2) EXTERNAL lets the user create an external table and specify a LOCATION pointing to the actual data. When a table is dropped, an internal (managed) table's metadata and data are deleted together, while an external table loses only its metadata: the data is kept.
(3) COMMENT: adds comments to tables and columns.
(4) PARTITIONED BY creates a partitioned table.
(5) CLUSTERED BY creates a bucketed table.
(6) SORTED BY (rarely used) sorts one or more columns within each bucket.
(7) ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]. Users can use a custom SerDe or a built-in SerDe when creating a table; if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. The SerDe, together with the column definitions, determines how the table's concrete column data is interpreted. SerDe is short for Serialize/Deserialize: Hive uses the SerDe to serialize and deserialize row objects.
(8) STORED AS specifies the storage file type. Common types: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage). For plain-text data use STORED AS TEXTFILE; for data that needs compression use STORED AS SEQUENCEFILE.
(9) LOCATION: specifies the table's storage location on HDFS.
(10) AS: followed by a query statement; creates the table from the query's results.
(11) LIKE allows users to copy the existing table structure, but does not copy the data.
Partition Table
The essence of the partition table is to correspond to a folder on HDFS, and the partition in Hive is a sub-directory.
DML
load
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
insert
basic insert
insert into(overwrite) table ...
Multiple table insert
from tableName insert into table t1 .... select .... from tableName, insert into table t2 .... select .... from tableName
Query and create tables
create table tableName as select ...
Upload data to create table
1. Upload data to hdfs
2. Create the table and specify its location on hdfs
Import data into Hive
Note: Use export first to export, and then import the data.
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
Truncate deletes data in the table
Truncate works only on managed (internal) tables; data in external tables cannot be deleted this way
Commonly used functions
max
min
sum
avg
count
sort
global sort
order by performs a global sort with a single reducer: very slow on large data volumes, efficient on small ones
Partition sorting
distribute by is similar to a custom partitioner in an MR job: the hash code of the distribute-by field is taken modulo the number of reducers, and rows with the same remainder go to the same partition.
local sorting
sort by sorts within each reducer; the overall result is not globally ordered. It is usually combined with distribute by, with distribute by written first and sort by after.
cluster by
cluster by can replace distribute by plus sort by when both use the same field in ascending order; cluster by cannot specify ASC or DESC.
partition, bucket
bucket
Bucketing must be enabled: set hive.enforce.bucketing=true;
Create bucket table
Inserting data into a bucket table is the same as inserting data into a normal table
Bucketing rules
The bucket field's hash code, modulo the number of buckets, determines which bucket a row lands in.
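A minimal Python sketch of this bucketing rule (an illustration of the semantics, not Hive's actual Java hashing; for integer columns the Java hashCode is the value itself, so the rule reduces to a plain modulo):

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # Hive's rule: hash(bucket column) mod num_buckets.
    # For integers the Java hashCode equals the value, so this is key % n.
    return key % num_buckets

rows = [1001, 1002, 1003, 1004, 1005]
print({k: bucket_id(k, 4) for k in rows})
# {1001: 1, 1002: 2, 1003: 3, 1004: 0, 1005: 1}
```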
The difference between bucketing and partitioning
Partitioning is for the data storage path, that is, the folder; bucketing is for the data files.
Bucket sampling survey
grammar
TABLESAMPLE(BUCKET x OUT OF y)
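The clause's semantics can be sketched in Python (illustrative, assuming the table has z buckets): z/y buckets are read, starting at bucket x and stepping by y.

```python
def sampled_buckets(x: int, y: int, z: int) -> list:
    # TABLESAMPLE(BUCKET x OUT OF y) on a table with z buckets reads
    # z/y buckets: bucket x, then x+y, x+2y, ... (1-based, as in Hive).
    assert x <= y, "Hive requires x <= y"
    return [x + k * y for k in range(z // y)]

print(sampled_buckets(1, 2, 4))  # buckets 1 and 3 of a 4-bucket table
```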
Partition
Convert between rows and columns
row to column
CONCAT(string A/col, string B/col…)
CONCAT_WS(separator, str1, str2,...)
COLLECT_SET(col)
Function description
example
sql implementation
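The logic of these functions can be sketched in Python over a small hypothetical data set (the names, constellations and blood types below are made up for illustration): CONCAT_WS joins values with a separator, COLLECT_SET gathers deduplicated values per group.

```python
from collections import defaultdict

# Hypothetical rows: (name, constellation, blood_type)
rows = [("Sun Wukong", "Aries", "A"),
        ("Lao Wang", "Sagittarius", "A"),
        ("Song Song", "Aries", "A"),
        ("Zhu Bajie", "Aries", "B")]

# Roughly: SELECT CONCAT_WS(",", constellation, blood),
#                 CONCAT_WS("|", COLLECT_SET(name))
#          FROM t GROUP BY constellation, blood
groups = defaultdict(list)
for name, constellation, blood in rows:
    key = ",".join([constellation, blood])   # CONCAT_WS(",", ...)
    if name not in groups[key]:              # COLLECT_SET deduplicates
        groups[key].append(name)

result = {k: "|".join(v) for k, v in groups.items()}
print(result)
# {'Aries,A': 'Sun Wukong|Song Song', 'Sagittarius,A': 'Lao Wang', 'Aries,B': 'Zhu Bajie'}
```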
Column to row
Function description
example
sql implementation
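Column-to-row typically combines SPLIT, EXPLODE and LATERAL VIEW; a Python sketch of the effect on hypothetical movie/category rows (the data is made up for illustration):

```python
# EXPLODE(SPLIT(category, ",")) turns one row into one row per category;
# LATERAL VIEW joins each exploded value back to the original columns.
rows = [("Person of Interest", "Suspense,Action,SciFi"),
        ("Lie to Me", "Suspense,Crime")]

exploded = [(movie, cat)
            for movie, cats in rows
            for cat in cats.split(",")]   # SPLIT, then EXPLODE
print(exploded)
```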
window function
OVER()
current row
n preceding
n rows before the current row
n following
n rows after the current row
UNBOUNDED
starting point
unbounded preceding means the frame starts at the first row of the partition
unbounded following means the frame ends at the last row of the partition
LAG(col,n,default_val)
Go nth line forward
lead(col,n,default_val)
The value of col from the nth row after the current row
NTILE(n)
Distribute the rows in the ordered partition to the specified data groups. Each group is numbered, starting from 1. For each row, NTILE returns the number of the group to which the row belongs. Note: n must be of type int.
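A Python sketch of NTILE's distribution rule (illustrative, matching the description above; earlier groups receive the extra rows when the split is uneven):

```python
def ntile(n: int, ordered_rows: list) -> list:
    # Split the ordered partition into n groups numbered from 1;
    # the first `extra` groups each get one additional row.
    size, extra = divmod(len(ordered_rows), n)
    out, i = [], 0
    for g in range(1, n + 1):
        count = size + (1 if g <= extra else 0)
        out.extend((row, g) for row in ordered_rows[i:i + count])
        i += count
    return out

print(ntile(3, ["a", "b", "c", "d", "e"]))
# [('a', 1), ('b', 1), ('c', 2), ('d', 2), ('e', 3)]
```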
example
data
Requirements and implementation
1. Query the customers and total number of customers who purchased in April 2017
2. Check the customer’s purchase details and monthly purchase total
3. In the above scenario, costs must be accumulated according to date.
4. Query the last purchase time of each customer
5. Query the order information of the previous 20% time
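Requirement 3 is a running total, i.e. SUM(cost) OVER(PARTITION BY name ORDER BY orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW); a Python sketch of that frame on hypothetical data for a single customer:

```python
from itertools import accumulate

# Hypothetical purchases for one customer, already sorted by date.
costs = [("2017-01-01", 10), ("2017-02-03", 23), ("2017-04-06", 42)]

# Each row's value is the sum of all costs up to and including that row.
running = list(accumulate(c for _, c in costs))
print(running)  # [10, 33, 75]
```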
rank
If the order is the same, it will be repeated and the total number will remain unchanged.
dense_rank
If the order is the same, there will be no duplication, and the total number will be reduced.
row_number
will be calculated in order
example
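The three numbering functions can be contrasted with a small Python sketch (the scores are hypothetical and already sorted descending): rank repeats on ties and leaves gaps, dense_rank repeats without gaps, row_number just counts.

```python
def rank_all(scores: list) -> list:
    # Returns (score, rank, dense_rank, row_number) for a sorted list.
    out, prev, rank, dense = [], None, 0, 0
    for i, s in enumerate(scores, start=1):
        if s != prev:
            rank, dense, prev = i, dense + 1, s
        out.append((s, rank, dense, i))
    return out

print(rank_all([95, 95, 86, 86, 80]))
# [(95, 1, 1, 1), (95, 1, 1, 2), (86, 3, 2, 3), (86, 3, 2, 4), (80, 5, 3, 5)]
```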
function
built-in functions
Custom function
udf(User-Defined-Function)
Features: One in, one out
Programming steps
1. Inherit the class org.apache.hadoop.hive.ql.exec.UDF
2. The evaluate function needs to be implemented; the evaluate function supports overloading;
3. Create a function in the command line window of hive
1.Upload jar
add jar linux_jar_path
2.Create function
create [temporary] function [dbname.]function_name AS class_name;
4. Delete the function in the command line window of hive
Drop [temporary] function [if exists] [dbname.]function_name;
Note: UDF must have a return type and can return null, but the return type cannot be void;
udaf(User-Defined Aggregation Function)
More in and one out
aggregate function
udtf (User-Defined Table-Generating Functions)
One in, many out
lateral view explode()
Compression and storage
Hive commonly used compression and storage
Parquet file format is commonly used in Hive; lzo or snappy is used for compression.
compression
Compression format table
Compression performance comparison
Compression parameter settings
storage
Common file formats in hive
textFile, parquet, orc, sequencefile
File format classification
Row storage
Features
When querying a whole row that meets the conditions, column storage must fetch each field's value from its separately stored column and reassemble the row, while row storage only needs to locate one value and the rest sit adjacent to it; so row storage is faster for whole-row queries.
textfile
In the default format, the data is not compressed, resulting in high disk overhead and high data parsing overhead.
sequenceFile
Column storage
Features
Because each field's data is stored together, a query that needs only a few fields reads far less data; and since all values in a column share one data type, compression algorithms can be tailored to each column.
parquet
orc
composition
Orc files generally consist of one or more stripes. Each stripe is generally the size of an HDFS block. Each stripe contains multiple records. These records are stored independently according to columns.
stripe
index data
row data
stripe footer
Tuning
fetch capture
It means that in some cases, the query does not need to go through the MR task, such as select * from table. In this case, hive can simply read the file in the table directory and output it to the console.
fetch capture enable settings
Set hive.fetch.task.conversion = more in the hive-default.xml.template file
Once enabled, whole-table reads, limit queries, and single-field selects skip the MR job.
local mode
Hive input data is small, and the time consumed to trigger the query execution task may be much longer than the actual job execution time. In this case, you can turn on the local mode to process all tasks on a single machine. Task execution time for small data sets will be significantly shortened
parameter settings
set hive.exec.mode.local.auto=true; //Enable local mr
Table optimization
Large table, small table join
Putting the small table first and the big table last used to reduce the chance of memory-overflow errors; newer Hive versions optimize this automatically, so the table order no longer matters.
map join (small table join large table)
map join enable settings
map join is enabled by default
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize=25000000;//The default small table size is 25M
map join principle
join big table
1. Empty key filtering
A large amount of abnormal data with empty keys enters the same reducer and is extremely slow to process. Sometimes it even overflows the memory. You can choose to filter out the empty keys first.
2. Empty key conversion
When the many null keys are not abnormal data and must stay in the result set, assign random values to those keys so the rows are distributed evenly across the reducers.
group by
Enable map-side aggregation
parameter settings
1. Enable aggregation on the map side
set hive.map.aggr = true
2. Aggregate the number of items on the map side
set hive.groupby.mapaggr.checkinterval = 100000
3. Perform load balancing when there is data skew (default false)
set hive.groupby.skewindata = true
Principle: two jobs are started. The map side of the first job distributes keys to reducers at random; each reducer aggregates internally and outputs partial results. The second job takes the first job's output, sends identical keys to the same reducer, and performs the final aggregation.
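The two-job idea can be simulated in Python with key salting (an illustration of the principle, not Hive's implementation): job 1 spreads a hot key across reducers with a random prefix and pre-aggregates; job 2 strips the prefix and finishes the aggregation.

```python
import random
from collections import Counter

random.seed(0)
records = ["hot_key"] * 1000 + ["rare_key"] * 3  # skewed hypothetical data

# Job 1: salt each key with a random prefix, then partially count.
partial = Counter(f"{random.randint(0, 3)}#{k}" for k in records)

# Job 2: remove the salt and merge the partial counts.
final = Counter()
for salted, cnt in partial.items():
    final[salted.split("#", 1)[1]] += cnt

print(final)  # Counter({'hot_key': 1000, 'rare_key': 3})
```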
count(distinct)
Small data doesn’t matter
Large amount of data
With a large data volume, count(distinct) goes through a single reducer no matter how many reducers are configured, stalling the job; use group by to deduplicate first, then count.
Cartesian Product
Avoid Cartesian products: if a join has no on condition, or an invalid one, Hive can only use a single reducer to process the data.
Row and column filtering
Column processing
In select, only take the required columns and use select * as little as possible
row processing
If the filter condition on the secondary table is written in the where clause after the join, the two tables are joined in full first and filtered afterwards. Put the filter in a subquery on the secondary table first, then join.
dynamic partitioning
Enable dynamic partition settings
1. Enable dynamic partitioning function settings (enabled by default)
hive.exec.dynamic.partition=true
2. Set to non-strict mode (the mode of dynamic partitioning, the default is strict, which means that at least one partition must be specified as Static partitioning, nonstrict mode means that all partition fields are allowed to use dynamic partitioning. )
hive.exec.dynamic.partition.mode=nonstrict
3. The maximum number of dynamic partitions that can be created on all nodes executing MR. Default 1000
hive.exec.max.dynamic.partitions=1000
4. Maximum number of dynamic partitions each MR node may create. Set it according to the actual data: if the source holds one year of data, the day field has 365 distinct values, so the value must be greater than 365; the default of 100 would cause an error.
hive.exec.max.dynamic.partitions.pernode=100
5. The maximum number of HDFS files that can be created in the entire MR Job. Default 100000
hive.exec.max.created.files=100000
6. Whether to throw an exception when an empty partition is generated. Generally no settings are required. Default false
hive.error.on.empty.partition=false
Partition
bucket
Data skew
1. Set a reasonable number of Maps
Parallel execution
By default, hive can only execute one stage at a time. However, a specific job may contain many stages, and these stages may not be completely dependent on each other. That is to say, some stages can be performed in parallel, and the job can be completed by executing it in parallel. Faster.
Turn on parameter settings
set hive.exec.parallel=true; //Enable parallel execution of tasks set hive.exec.parallel.thread.number=16; //The maximum degree of parallelism allowed for the same sql, the default is 8.
strict mode
jvm reuse
speculative execution
compression
Data warehouse theory
concept
It is a strategic collection that provides all types of data support for the decision-making process at all levels of the enterprise.
feature
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of decision making
Topic oriented
Integration
Non-volatile (cannot be changed)
time-varying
etl
ETL: Extract, Transform, Load
Data warehouse stratification
source data
There are no changes to the data in this layer. It directly uses the peripheral system data structure and data and is not open to the public. It is a temporary storage layer, which is a temporary storage area for interface data to prepare for the subsequent data processing.
data warehouse
Also known as the detail layer, the data in the DW layer should be consistent, accurate, and clean data that has been cleaned (removed of impurities) from the source system data.
Data application
Data sources directly read by front-end applications; data calculated and generated based on report and thematic analysis requirements
why layering
space for time
A large amount of preprocessing improves the responsiveness (user experience) of applications, so a data warehouse holds much redundant data. Without layering, any change to a source system's business rules would affect the entire cleaning process, an enormous amount of rework.
Layering simplifies data cleaning process
Layered management simplifies data cleaning: the original one-step job is divided into several steps, splitting one complex task into multiple simple ones and turning a big black box into a white box. Each layer's processing logic is simple and easy to understand, so the correctness of each step is easier to guarantee; when data errors occur, usually only one step needs partial adjustment.
data mart
Data mart architecture
independent data mart
Dependent data mart
Data warehouse stratification
Data warehouse layering principle
1. In order to facilitate data analysis, it is necessary to shield the underlying complex business and expose the data to the analysis layer in a simple, complete and integrated manner.
2. The impact of underlying business changes and upper-level demand changes on the model is minimized. The impact of business system changes is weakened at the basic data layer. Combined with the top-down construction method, the impact of demand changes on the model is weakened.
3. High cohesion and loose coupling, that is, high cohesion of data within a topic or within each complete system, and loose coupling of data between topics or between complete systems.
4. Build the basic data layer of the warehouse to isolate the underlying business data integration work from the upper-layer application development work, laying the foundation for large-scale development of the warehouse. The warehouse hierarchy will be clearer and the externally exposed data will be more unified.
Data warehouse stratification
ods (Operational Data Store)
Access the original data intact
DW (Data Warehouse)
DWD (Data Warehouse Detail) data detail layer
The granularity is consistent with the ods layer and provides certain data quality guarantees. What the dwd layer needs to do is data cleaning, integration, and standardization. Dirty data, junk data, data with inconsistent specifications, inconsistent status definitions, and inconsistent naming specifications will all be processed. At the same time, in order to improve the usability of data, some dimension degradation techniques will be used to degrade dimensions into fact tables and reduce the association between dimension tables and fact tables. At this layer, some data will also be aggregated to bring data in the same subject area into the same table to improve the usability of the data.
DWM (Data WareHouse Middle) data middle layer
Builds on the DWD layer by aggregating common core dimensions into statistical indicators. Computing wide-table indicators directly from DWD or ODS would mean too much computation with too few reusable dimensions, so the usual practice is to first compute several small intermediate tables in DWM and then join them into one wide table. Because the boundary between DWM and DWS is hard to define, the DWM layer can also be dropped, keeping only DWS.
DWS (Data WareHouse Servce) data service layer
Coarser granularity than the detail layer: DWD data is integrated and summarized into service data for analyzing a subject area, generally as wide tables. The DWS layer should cover about 80% of application scenarios. Also called the data mart or wide-table layer, it serves subsequent business queries, OLAP analysis, data distribution, etc.
APP data application layer
The data is mainly provided for data products and data analysis. It is generally stored in ES, PostgreSql, Redis and other systems for use by online systems. It may also be stored in Hive or Druid for data analysis and data mining. For example, the report data we often talk about is usually placed here.
DIM dimension layer
If you have dimension tables, you can design this layer separately.
High cardinality data
Generally, it is similar to user data table, product table data table. The amount of data may be tens of millions or hundreds of millions
Low-cardinality data
Generally, it is a configuration table, such as the Chinese meaning corresponding to the enumeration value, or the date dimension table. The amount of data may be single digits, hundreds, thousands, or tens of thousands.
Data warehouse modeling method
dimensional modeling
significance
Build a model based on the needs of analyzed decisions, and the constructed data model serves the analysis needs, focusing on solving the problem of users completing analysis quickly, and also has good response performance for large-scale complex queries.
fact table
periodic snapshot fact table
transaction fact table
Cumulative snapshot fact table
fact table without facts
aggregate fact table
Merge fact tables
dimension table
star schema
All dimension tables are connected to the fact table, and dimension tables are not associated with other dimension tables.
constellation model
Extended from the star schema, multiple fact tables share dimension information
snowflake model
The fact table joins to dimension tables, and a dimension table may in turn have its own dimension tables; the many joins lead to lower performance.
Dimensional modeling process
Select business process
Declaration granularity
The same fact table must have the same granularity. Do not mix multiple different granularities in the same fact table. Different fact tables are created for different granularity data. For data whose requirements are unclear, we establish atomic granularity.
Confirm dimensions
Confirm the facts
paradigm modeling
6 paradigms of database
First normal form (1NF)
What is emphasized is the atomicity of the column, that is, the column cannot be divided into other columns.
Second normal form (2NF)
To satisfy the first normal form, two conditions must be met. First, the table must have a primary key; second, the columns not included in the primary key must completely depend on the primary key, and cannot only depend on part of the primary key.
Third normal form (3NF)
To satisfy 3NF, a table must satisfy 2NF and every non-key column must depend directly on the primary key, with no transitive dependencies. That is, it must not happen that non-key column A depends on non-key column B, which in turn depends on the primary key.
Boyce-Codd normal form (BCNF)
Fourth normal form (4NF)
Fifth normal form (5NF)
Modeling steps
1. Conceptual model
2. Logic model
3.Physical model
Data warehouse core construction ideas
At the design, development, deployment and application levels, avoid duplicated construction and redundant indicators, thereby keeping data definitions standardized and unified, and ultimately achieve full-link association of data assets, standard data output and a unified public data layer.
Data center construction process
impala
type of data
shell command
Architecture
module
impalad
Receive the client's request, execute the query and return to the central coordination point
The daemon process on the child node is responsible for maintaining communication with the statestore and reporting work
statestore
Responsible for collecting resource information distributed in each impalad process, the health status of each node, and synchronizing node information
Responsible for the coordination and scheduling of queries
catalog
Distribute table metadata information to each impalad
Receive all requests from statestore
hiveMetastore
hdfs
ddl
Impala does not support WITH DBPROPERTIES (for example: adding creator information and creation date to the library)
Delete database
drop database
impala does not support alter database
dml
Data import (basically the same as hive), impala does not support load data local inpath…
Data export (impala does not support insert overwrite... syntax to export data, you can use impala -o instead)
Export and import commands are not supported
Inquire
The basic syntax is roughly the same as the hive query statement.
impala does not support bucketing
impala does not support cluster by, sort by, distributed by
Impala does not support the COLLECT_SET(col) and explode(col) functions
Impala supports windowing functions
impala custom function
udf
Storage and compression
optimization
1. Try to deploy stateStore and catalog on the same server as much as possible to ensure their communication.
2. Improve efficiency by limiting the memory of the impala daemon (default 256) and the number of statestore threads
3. SQL optimization, call the execution plan before executing SQL
4. Choose the appropriate file format for storage to improve query efficiency
5. Avoid generating many small files (if there are small files generated by other programs, you can use an intermediate table to store the small file data in the intermediate table. Then insert the data from the intermediate table into the final table through insert...select...)
6. Use appropriate partitioning technology and calculate according to partition granularity
7. Use compute stats to collect table information; when a table or partition changes significantly, recompute its statistics, because differences in row counts and distinct-value counts can lead Impala to choose a different join order for queries on that table.
8. Optimization of network io
–a. Avoid sending the entire data to the client
–b. Do conditional filtering as much as possible
–c. Use limit clause
–d. When outputting files, avoid using beautified output
–e. Try to use less full metadata refresh
9. Use profile to output the underlying information plan and optimize the environment accordingly.
Data Lake Theory
spark
Built-in module
spark core
Implemented the basic functions of Spark, including modules such as task scheduling, memory management, error recovery, and interaction with storage systems. Spark Core also includes API definitions for Resilient Distributed DataSet (RDD).
spark sql
Spark's package for working with structured data; it supports HQL
spark streaming
Streaming computation component; its API corresponds closely to Spark Core's
spark mlib
Provides common machine learning libraries
spark graphx
graph computation
operating mode
local mode
standalone mode
yarn mode
yarn-client
Features:
The driver runs on the client and is suitable for interaction and debugging.
yarn-cluster
Features
The driver runs inside the ApplicationMaster started by the ResourceManager; suitable for production environments.
sparkcore
rdd
compute
partitions
partitioner
dependencies
operator
conversion operator
value type
map(func)
mapPartitions(func)
The function type of func must be Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func)
The function type of func must be (Int, Interator[T]) => Iterator[U]
flatMap(func)
glom()
Form each partition into an array and form a new RDD type RDD[Array[T]]
groupBy()
Group by the return value of the function passed in. Put the values corresponding to the same key into an iterator.
filter
sample(withReplacement, fraction, seed)
Randomly sample a fraction of data with the specified random seed. withReplacement indicates whether the extracted data is replaced. true means sampling with replacement, false means sampling without replacement. seed is used to specify the random number generator seed.
distinct([numTasks]))
Returns a new RDD after deduplicating the source RDD. By default, only 8 parallel tasks operate, but this can be changed by passing an optional numTasks parameter.
coalesce(numPartitions)
Reduce the number of partitions to improve the execution efficiency of small data sets after filtering large data sets.
repartition(numPartitions)
All data is randomly shuffled through the network again based on the number of partitions.
The difference between coalesce and repartition
1. coalesce repartitions and lets you choose whether to shuffle, via the parameter shuffle: Boolean = false/true.
2. repartition actually calls coalesce with shuffle = true.
sortBy(func,[ascending], [numTasks])
Use func to process the data first, and sort according to the processed data comparison results. The default is positive order.
pipe(command, [envVars])
The pipeline, for each partition, executes a shell script and returns an RDD of output.
Double value type
union
subtract(otherDataset)
Computes the difference set: elements that also appear in the other RDD are removed; elements present only in the source RDD remain
intersection(otherDataset)
Returns a new RDD after intersecting the source RDD and parameter RDD.
cartesian(otherDataset)
Cartesian Product
zip(otherDataset)
Combine two RDDs into an RDD in the form of Key/Value. By default, the number of partitions and elements of the two RDDs are the same, otherwise an exception will be thrown.
key/value type
partitionBy(partitioner)
Partitions a pair RDD. If the original partitioner is the same as the given one, no repartitioning happens; otherwise a ShuffledRDD is produced, which incurs a shuffle.
reduceByKey(func, [numTasks])
Called on a (K, V) RDD, returns a (K, V) RDD in which the values for each key are aggregated with the given reduce function. The number of reduce tasks can be set through the second, optional parameter.
groupByKey()
groupByKey also operates on each key, but only generates one seq.
The difference between reduceByKey and groupByKey
1. reduceByKey: Aggregation based on key, there is a combine (pre-aggregation) operation before shuffle, and the return result is RDD[k,v]. 2. groupByKey: Group by key and shuffle directly. 3. Development guidance: reduceByKey is recommended over groupByKey. But you need to pay attention to whether it will affect the business logic.
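The difference can be sketched in Python over two explicit "partitions" (illustrative, not Spark's implementation): both produce the same result, but reduceByKey's map-side combine means fewer values cross the shuffle.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# groupByKey: every value crosses the shuffle, then is combined.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_by_key = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey: combine inside each partition first (pre-aggregation),
# so the shuffle carries at most one value per key per partition.
partitions = [pairs[:2], pairs[2:]]
shuffled = defaultdict(list)
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v          # map-side combine
    for k, v in local.items():
        shuffled[k].append(v)
reduce_by_key = {k: sum(vs) for k, vs in shuffled.items()}

print(group_by_key, reduce_by_key)  # same result, less shuffled data
```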
aggregateByKey
Parameters: (zeroValue:U,[partitioner: Partitioner]) (seqOp: (U, V) => U,combOp: (U, U) => U)
In a k-v RDD, values are grouped and merged by key. Within each partition, every value is folded into the accumulator (starting from the initial value) using the seq function, producing a per-partition result per key. The per-partition results for the same key are then merged with the combine function (the first two values are combined, the result is combined with the next, and so on), and each key with its final value is output as a new k-v pair.
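A Python sketch of the seqOp/combOp flow over explicit partitions (illustrative; the helper name and data are made up). The example computes, per key, the max within each partition and then the sum of the partition maxima:

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # seqOp folds each value into the per-partition accumulator
    # (starting from zero in every partition); combOp then merges
    # the partition accumulators for the same key.
    merged = {}
    for part in partitions:
        acc = defaultdict(lambda: zero)
        for k, v in part:
            acc[k] = seq_op(acc[k], v)       # within a partition
        for k, a in acc.items():
            merged[k] = comb_op(merged[k], a) if k in merged else a
    return merged

parts = [[("a", 3), ("a", 2), ("c", 4)], [("b", 3), ("c", 6), ("c", 8)]]
print(aggregate_by_key(parts, 0, max, lambda x, y: x + y))
# {'a': 3, 'c': 12, 'b': 3}
```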
foldByKey
Parameters: (zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Simplified operation of aggregateByKey, seqop and combop are the same
combineByKey[C]
Parameters: (createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
Function: Merge V into a set for the same K
sortByKey([ascending], [numTasks])
When called on an RDD of (K, V), K must implement the Ordered interface and return an RDD of (K, V) sorted by key.
mapValues
For types of the form (K, V), only operate on V
join(otherDataset, [numTasks])
Called on RDDs of type (K, V) and (K, W), it returns an RDD of (K, (V, W)) in which all elements corresponding to the same key are paired together.
cogroup(otherDataset, [numTasks])
Called on RDDs of type (K,V) and (K,W), return an RDD of type (K,(Iterable<V>,Iterable<W>))
action
reduce(func)
Aggregate all elements in the RDD through the func function, first aggregate the data within the partition, and then aggregate the data between partitions
collect()
In the driver, return all elements of the dataset as an array.
count()
Returns the number of elements in the RDD
first()
Returns the first element in the RDD
take(n)
Returns an array consisting of the first n elements of the RDD
takeOrdered(n)
Returns an array consisting of the first n elements sorted by this RDD
aggregate
Parameters: (zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
The aggregate function aggregates the elements in each partition through seqOp and the initial value, and then uses the combine function to combine the result of each partition with the initial value (zeroValue). The final type returned by this function does not need to be consistent with the element type in the RDD
fold(num)(func)
Folding operation, simplified operation of aggregate, seqop and combop are the same.
spark kernel
spark core components
driver
excutor
spark general running process
1) After the task is submitted, the Driver program starts first; 2) the Driver registers the application with the cluster manager; 3) the cluster manager allocates and starts Executors according to the task's configuration; 4) the Driver executes the main function; Spark is lazily evaluated, so when an action operator is reached computation is triggered backwards: stages are divided by wide dependencies, each stage corresponds to a TaskSet containing multiple Tasks, and available Executor resources are found for scheduling; 5) following the data-locality principle, each Task is dispatched to its designated Executor; during execution the Executors keep communicating with the Driver, reporting task status.
Deployment mode
standalone
hadoop yarn
yarn-client
yarn-cluster
mesos
k8s
wordcount
sc.textFile("xx").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).saveAsTextFile("xx")
optimization
Data skew
memory model
Common algorithms
kafka
Hbase
definition
HBase is a highly reliable, high-performance, column-oriented, scalable, distributed NoSQL storage system supporting real-time reads and writes.
Features
1. Mass storage
HBase is suited to storing PB-scale data and can return data within tens to hundreds of milliseconds.
2. Column storage
HBase storage only has column family storage, which contains many columns.
3. Extremely easy to expand
HBase's scalability shows in two aspects: scaling the upper processing layer (RegionServer) and scaling storage (HDFS). Adding RegionServer machines horizontally increases HBase's upper-layer processing capacity and lets HBase serve more Regions.
4. High concurrency
Since most HBase deployments run on cheap PCs, single-IO latency is actually not small, generally tens to hundreds of ms. High concurrency here mainly means that under concurrent load, HBase's per-IO latency barely degrades, giving high-concurrency, low-latency service.
5. Sparse
Sparse is mainly for the flexibility of Hbase columns. In the column family, you can specify as many columns as you want. When the column data is empty, it will not occupy storage space.
Hbase architecture
Architecture diagram
client
Contains the interface for accessing HBase, and also includes cache to speed up HBase access. For example, the cache contains .META information.
zookeeper
Mainly responsible for high availability of master, monitoring of reginserver, metadata entry and maintenance of cluster configuration.
ZK ensures that only one Master is running in the cluster; when the Master fails, a new Master is elected through a competition mechanism.
Monitors RegionServer status through ZK; when a RegionServer is abnormal, the Master is notified of its online/offline status via callbacks.
Unified entry information for storing metadata through zk
hdfs
Provide underlying data storage services for HBase
Provides underlying distributed storage services for metadata and table data
Multiple copies of data ensure high reliability and availability
HMaster
Assign Region to RegionServer
Maintain load balancing across the cluster
Maintain cluster metadata information
Discover the failed Region and allocate the failed Region to the normal RegionServer
When a RegionServer fails, coordinates the splitting of the corresponding HLog
HRegionServer
Mainly handles users’ read and write requests
Manage the Region allocated by the master
Handle read and write requests from clients
Responsible for interacting with the underlying HDFS and storing data to HDFS
Responsible for splitting the Region after it becomes larger
Responsible for merging Storefiles
Role
HMaster
1. Monitor RegionServer
2. Handling RegionServer failover
3. Handle metadata changes
4. Handle region allocation or transfer
5. Load balancing data during idle time
6. Publish your location to clients through Zookeeper
RegionServer
1. Responsible for storing the actual data of HBase
2. Process the Region assigned to it
3. Flush cache to HDFS
4. Maintain Hlog
5. Perform compression
6. Responsible for processing Region fragmentation
Other components
Write-Ahead logs (wal)
HBase's modification record. When data is written to HBase, it is not written directly to disk; it is kept in memory for a period of time (time and data-volume thresholds are configurable). But data held only in memory has a higher probability of being lost, so each edit is first appended to a file called the Write-Ahead logfile and only then written to memory. When the system fails, the data can be reconstructed from this log file.
region
A shard of an HBase table: the table is divided into regions by RowKey range, and the regions are stored on RegionServers; one RegionServer can host multiple different regions.
store
A Store holds the HFiles; one Store corresponds to one column family of the HBase table.
MemStore
As the name suggests, this is in-memory storage used to hold the current data operations; after an edit is saved to the WAL, the RegionServer stores the key-value pairs in the MemStore.
HFile
This is the actual physical file holding the raw data on disk, the real storage file. StoreFiles are stored on HDFS in HFile format.
HBase operations
HBase data structure
RowKey
As in other NoSQL databases, the rowkey is the primary key for accessing data. There are three ways to access HBase rows:
1.Access through a single rowkey
2. Through a rowkey range (scan)
3. Full table scan
The row key (RowKey) can be any string (maximum length 64KB; in practice usually 10–100 bytes). Internally, HBase stores the RowKey as a byte array, and data is stored in lexicographic (byte) order of the RowKey. When designing RowKeys, you must exploit this sort property and store rows that are frequently read together next to each other (locality).
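The lexicographic byte ordering above can be demonstrated in plain Java (an illustration, not HBase code — for ASCII keys, String's natural ordering matches HBase's byte-array comparison): "10" sorts before "2", which is why numeric keys are usually zero-padded to a fixed width to keep related rows physically adjacent.

```java
import java.util.Arrays;

public class RowKeyOrder {
    static String[] sortedAsHBaseWould(String... keys) {
        String[] copy = keys.clone();
        // HBase compares rowkeys as byte arrays; for ASCII strings this is
        // the same as lexicographic String ordering.
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded numbers interleave badly...
        System.out.println(Arrays.toString(sortedAsHBaseWould("2", "10", "1")));   // [1, 10, 2]
        // ...padding to a fixed width restores numeric order.
        System.out.println(Arrays.toString(sortedAsHBaseWould("0002", "0010", "0001"))); // [0001, 0002, 0010]
    }
}
```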
column family
Column family: every column in an HBase table belongs to some column family. Column families are part of the table's schema (columns are not) and must be defined before using the table. Column names are prefixed by their column family; for example, courses:history and courses:math both belong to the courses column family.
cell
A cell is uniquely identified by {rowkey, column family:column, version}. The data in a cell is untyped and stored entirely as bytes.
time stamp
In HBase, a storage unit determined by rowkey and column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase (automatically at write time, as the current system time in milliseconds) or explicitly by the client; an application that wants to avoid version conflicts must generate its own unique timestamps. Within each cell, versions are sorted in reverse chronological order, so the newest data comes first. To avoid the management burden (storage and indexing) of too many versions, HBase provides two version-recycling policies: keep the last n versions, or keep the versions within a recent time window (e.g. the last seven days). This can be configured per column family.
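The versioning behavior above can be modeled in a few lines (a toy model, not HBase internals): versions of one cell are keyed by timestamp, iterated newest-first, and trimmed to the last n versions.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class CellVersions {
    static final int MAX_VERSIONS = 3; // "keep the last n versions" policy
    // timestamp -> value, iterated in descending timestamp order
    static NavigableMap<Long, String> cell = new TreeMap<Long, String>().descendingMap();

    static void put(long timestamp, String value) {
        cell.put(timestamp, value);
        while (cell.size() > MAX_VERSIONS) {
            cell.pollLastEntry(); // recycle the oldest version
        }
    }

    static String get() {
        return cell.firstEntry().getValue(); // newest version is listed first
    }

    public static void main(String[] args) {
        put(1L, "v1"); put(2L, "v2"); put(3L, "v3"); put(4L, "v4");
        System.out.println(get());       // v4
        System.out.println(cell.size()); // 3 — v1 was recycled
    }
}
```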
Namespaces
namespace structure
Table
All tables are members of a namespace, i.e. a table must belong to some namespace; if none is specified, it goes into the default namespace.
RegionServer group
A namespace contains the default RegionServer Group.
Permission
Permissions and namespaces allow us to define access control lists (ACLs).
Quota
Quotas can enforce the number of regions a namespace can contain.
HBase data principles
HBase reading principle
1) The Client first accesses ZooKeeper to find the location of the meta table's region, then reads the meta table itself; the meta table stores the region information of user tables;
2) Find the corresponding region information in the meta table based on namespace, table name and rowkey;
3) Find the regionserver corresponding to this region;
4) Find the corresponding region;
5) First find the data from MemStore, if not, then read it from BlockCache;
6) If the data is not in the BlockCache either, it is read from the StoreFile;
7) If the data is read from StoreFile, it is not returned directly to the client, but is first written to BlockCache and then returned to the client.
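Steps 5–7 above can be sketched as a runnable lookup chain (plain maps standing in for the MemStore, BlockCache, and StoreFile — not HBase code): check memory first, then the cache, and only then disk, populating the cache on the way back.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadPath {
    static Map<String, String> memStore = new HashMap<>();
    static Map<String, String> blockCache = new HashMap<>();
    static Map<String, String> storeFile = new HashMap<>();

    static String get(String rowkey) {
        if (memStore.containsKey(rowkey)) return memStore.get(rowkey);   // 5) MemStore first
        if (blockCache.containsKey(rowkey)) return blockCache.get(rowkey); // 5) then BlockCache
        String value = storeFile.get(rowkey);                             // 6) then StoreFile
        if (value != null) blockCache.put(rowkey, value);                 // 7) cache before returning
        return value;
    }

    public static void main(String[] args) {
        storeFile.put("row1", "v1");
        System.out.println(get("row1"));                    // v1 (read from StoreFile)
        System.out.println(blockCache.containsKey("row1")); // true — now cached for next time
    }
}
```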
HBase writing process
1) Client sends a write request to HregionServer;
2) HregionServer writes data to HLog (write ahead log). For data persistence and recovery;
3) HregionServer writes data to memory (MemStore);
4) Feedback that the Client is written successfully.
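The write order above (WAL first, memory second) can be sketched as follows — an illustration only, with a list standing in for the HLog: because the intent is persisted before memory is touched, a crash after step 2 can still be replayed from the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WritePath {
    static List<String> hlog = new ArrayList<>();       // stands in for the WAL
    static Map<String, String> memStore = new HashMap<>();

    static void put(String rowkey, String value) {
        hlog.add(rowkey + "=" + value); // 2) persist the edit to the HLog first
        memStore.put(rowkey, value);    // 3) then apply it to the MemStore
        // 4) success is acknowledged to the client only after both steps
    }

    public static void main(String[] args) {
        put("row1", "v1");
        System.out.println(hlog.size() == 1 && memStore.containsKey("row1")); // true
    }
}
```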
Data flush process
1) When the MemStore data reaches the threshold (the default is 128M, the old version is 64M), the data is flushed to the hard disk, the data in the memory is deleted, and the historical data in the HLog is deleted;
2) And store the data in HDFS;
3) Make mark points in HLog.
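The flush steps above can be modeled as a toy threshold check (sizes are illustrative, not real HBase accounting): once the MemStore crosses the threshold, its contents become an immutable StoreFile snapshot and the in-memory copy is dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlushModel {
    static final long FLUSH_THRESHOLD = 128L * 1024 * 1024; // 128M default threshold
    static Map<String, String> memStore = new HashMap<>();
    static List<Map<String, String>> storeFiles = new ArrayList<>();
    static long memStoreBytes = 0;

    static void put(String rowkey, String value) {
        memStore.put(rowkey, value);
        memStoreBytes += rowkey.length() + value.length();
        if (memStoreBytes >= FLUSH_THRESHOLD) flush(); // 1) threshold reached
    }

    static void flush() {
        storeFiles.add(new HashMap<>(memStore)); // 2) persist a snapshot (to HDFS in reality)
        memStore.clear();                        //    delete the in-memory data
        memStoreBytes = 0;                       // 3) the HLog can be marked/trimmed here
    }

    public static void main(String[] args) {
        put("row1", "v1");
        flush(); // force a flush for the demo
        System.out.println(storeFiles.size());  // 1
        System.out.println(memStore.isEmpty()); // true
    }
}
```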
data merging process
1) When the number of data blocks reaches 4, the HMaster triggers a merge operation, and the Region loads the data blocks locally to merge them;
2) When the merged data exceeds 256M, it is split, and the resulting Regions are assigned to different HRegionServers to manage;
3) When an HRegionServer goes down, the HLog on it is split and assigned to different HRegionServers to load, and .META. is updated;
4) Note: the HLog is synchronized to HDFS.
Integrated use of hive and HBase
Integrated through hive-hbase-handler-1.2.2.jar
Create a Hive table associated with an HBase table, so that inserting data into the Hive table also affects the HBase table.
A table hbase_emp_table already exists in HBase; create an external table in Hive associated with it, so that Hive can be used to analyze the data in the HBase table.
HBase optimization
High availability (multiple Hmasters)
In HBase, the HMaster monitors the RegionServer lifecycle and balances RegionServer load. If the HMaster goes down, the entire HBase cluster falls into an unhealthy state, and that state must not last long. Therefore HBase supports a high-availability (multi-HMaster) configuration.
pre-partitioned
Manual prepartitioning
create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
Represents five partitions [min,1000),[1000,2000),[2000,3000),[3000,4000),[4000,max]
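Given the split points above, the sketch below (plain Java, not HBase code) shows how a rowkey maps to one of the five regions: a row lands in the region whose start key is the greatest split point not exceeding it, with keys compared as strings.

```java
public class PreSplit {
    // Split points from the create statement above
    static final String[] SPLITS = {"1000", "2000", "3000", "4000"};

    static int regionOf(String rowkey) {
        int region = 0; // region 0 is [min, "1000")
        for (String split : SPLITS) {
            if (rowkey.compareTo(split) >= 0) region++; // past this boundary
            else break;
        }
        return region;
    }

    public static void main(String[] args) {
        System.out.println(regionOf("0999")); // 0 -> [min, 1000)
        System.out.println(regionOf("2500")); // 2 -> [2000, 3000)
        System.out.println(regionOf("9999")); // 4 -> [4000, max)
    }
}
```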
Generate hexadecimal sequence pre-partition
create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Indicates that hexadecimal is divided into 15 partitions
Pre-partition according to the rules set in the file
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
The content of splits.txt is as follows
aaaa
bbbb
cccc
dddd
javaAPI creates partition
// Custom algorithm generating a series of hash values as split keys,
// stored in a two-dimensional array
byte[][] splitKeys = ...; // some hash value function
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance
HTableDescriptor tableDesc = new HTableDescriptor(tableName);
// Create a pre-partitioned HBase table from the descriptor and the hash split keys
hAdmin.createTable(tableDesc, splitKeys);
rowkey design
The rowkey is the unique identifier of a piece of data. It must be designed carefully so that rowkeys are evenly distributed across regions, preventing data skew.
1. Generate random numbers, hashes, and hash values
eg: the original rowkey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1;
the original rowkey 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1;
the original rowkey 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1.
Before doing this, we generally sample the data set to determine the hashed rowkey values used as the split points between partitions.
2. String reversal
eg:20170524000001 converted to 10000042507102
3. String concatenation
20170524000001_a12e 20170524000001_93i7
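The three techniques above can be written out as runnable helpers (the helper names are my own, not an HBase API): hash the key, reverse it, or concatenate a salt suffix. All three trade scan locality for a more even spread across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyDesign {
    // 1. Hashing: SHA1 spreads sequential keys uniformly (40 hex chars)
    static String sha1(String rowkey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(rowkey.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 2. String reversal: puts the fast-changing digits first
    static String reverse(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    // 3. String concatenation: append a salt suffix
    static String salt(String rowkey, String suffix) {
        return rowkey + "_" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(sha1("1001").length());          // 40
        System.out.println(reverse("20170524000001"));      // 10000042507102
        System.out.println(salt("20170524000001", "a12e")); // 20170524000001_a12e
    }
}
```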
Memory optimization
HBase has a large memory overhead at runtime; after all, Tables can be cached in memory. Generally, 70% of the available memory is allocated to HBase's Java heap. However, a very large heap is not recommended, because if a GC pause lasts too long, the RegionServer will be unavailable for a long time. Generally 16–48G of memory is enough. And if the framework occupies too much memory and leaves the system short, the framework will itself be dragged down by system services.
Basic optimization
1. Allow appending content to HDFS files
hdfs-site.xml, hbase-site.xml
Property: dfs.support.append
Explanation: Turning on HDFS append synchronization cooperates well with HBase data synchronization and persistence. The default value is true.
2. Optimize the maximum number of open files allowed by DataNode
hdfs-site.xml
Property: dfs.datanode.max.transfer.threads
Explanation: HBase generally operates on a large number of files at the same time. Depending on cluster size and data activity, set to 4096 or higher. Default value: 4096.
3. Optimize the waiting time of data operations with high latency
hdfs-site.xml
Property: dfs.image.transfer.timeout
Explanation: If some data operation has very high latency and the socket needs to wait longer, it is recommended to set this value larger (default 60000 milliseconds) to ensure the socket does not time out.
4. Optimize data writing efficiency
mapred-site.xml
Properties: mapreduce.map.output.compress, mapreduce.map.output.compress.codec
Explanation: Enabling these two settings can greatly improve file-write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
5. Set the number of RPC listeners
hbase-site.xml
Property: hbase.regionserver.handler.count
Explanation: The default value is 30; it specifies the number of RPC listeners. Adjust it according to the number of client requests; increase it when there are many reads and writes.
6. Optimize HStore file size
hbase-site.xml
Property: hbase.hregion.max.filesize
Explanation: The default value is 10737418240 (10GB). When running HBase MR jobs you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task runs too long. The meaning of this value: when an HFile reaches this size, the region is split in two.
7. Optimize hbase client cache
hbase-site.xml
Property: hbase.client.write.buffer
Explanation: Specifies the HBase client write buffer. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally we set a certain buffer size to reduce the RPC count.
8. Specify scan.next to scan the number of rows obtained by HBase
hbase-site.xml
Property: hbase.client.scanner.caching
Explanation: Specifies the default number of rows fetched by the scan.next method. The larger the value, the greater the memory consumption.
9.flush, compact, split mechanism
When the MemStore reaches the threshold, the data in the Memstore is flushed into the Storefile; the compact mechanism merges the flushed small files into a large Storefile. split means that when the Region reaches the threshold, the overly large Region will be divided into two.
Attributes involved:
hbase.hregion.memstore.flush.size: 134217728
That is, 128M is the default MemStore threshold: when the total size of all MemStores in a single HRegion exceeds this value, all MemStores of that HRegion are flushed. The RegionServer processes flushes asynchronously by adding requests to a queue, following a producer–consumer model. The problem: if the queue cannot be consumed fast enough and a large backlog of requests builds up, memory can spike suddenly and, in the worst case, trigger OOM.
hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38
That is, when the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files, in descending order of size, until the memory used by MemStores falls below lowerLimit.
bloom filter
flink
DataX
presto
Oozie
data assets
springboot
Interviewer Shrimp and Pig Heart Interview Questions
Linux
Self introduction
1-minute self-introduction: In my previous company I mainly built the China Continent Insurance data platform, which ingested insurance data from Xiaomi, JD.com, Che300, Continent Insurance's internal systems, and bank channels. The data warehouse was divided into many subject areas, including orders, customers, claims, overdue, approvals, and complaints; the business I developed involved the order, user, and overdue subjects. My work covered data warehouse modeling, data processing, data quality monitoring, and subject-metric calculation. The main technologies applied included Hive, Spark, Hue, Impala, Hadoop, Informatica, etc. The platform added about 1TB of data per day.
Business introduction