MindMap Gallery data analysis
A detailed Python data analysis map, covering installation, Jupyter, Matplotlib, NumPy, pandas, and more.
Edited at 2023-10-24 00:06:19
data analysis
1||| Basics
Using Jupyter Notebook
1. Jupyter installation and extension setup
1. First install jupyter in the terminal
pip install jupyter
pip install jupyter --upgrade
2. Install jupyter extension library
pip install autopep8 # Install the module of pep8 code specification
pip install jupyter_nbextensions_configurator # Extension tool configuration items
pip install jupyter_contrib_nbextensions # Install jupyter extension package
pip install yapf #Install the third-party function modules that the expansion package depends on
3. After installing the extension packages, run the following commands to enable them.
"""enable the extension packages"""
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user
jupyter nbextension enable code_prettify/code_prettify
4. After the environment is set up, enter the `jupyter notebook` command on the command line, and a browser window will automatically pop up to open Jupyter Notebook.
# input the command
jupyter notebook
5. An error may occur at this step (error screenshot omitted); consult a reference article for the fix
6. After opening Jupyter Notebook, tick the desired options under the `Nbextensions` tab
7. Configure environment variables
Variable name: LANG
Variable value: zh_CN.UTF8
2. Create a file
Two methods:
1. Navigate to the target folder --> clear the address bar and type cmd --> enter jupyter notebook in the terminal
2. Shift + right-click an empty area in the folder --> click "Open PowerShell window here"; usage is the same as the cmd terminal
3. cell cell operation
1. What is a cell? A **cell** is one In/Out pair treated as a code unit. An * before the cell's line number indicates the code is still running
2. Two modes
3. Shortcut key operation
Common shortcut keys for both modes
Command mode (press Esc to enter)
Others (awareness only)
Edit mode (press Enter to enter)
Others (awareness only)
4. Mouse operation
5. Markdown demo: mastering headings and indentation is enough
6. Other operations: `function_name?` displays the function's documentation and source
7. Tab auto-completion: if no completion appears while typing code, press **Tab** to view code hints
4. appendix Jupyter common shortcut keys
1. command mode (press Esc key)
2. edit mode (Press enter key)
2||| matplotlib plotting
Matplotlib
Install module pip install matplotlib
Basic usage
Import module import matplotlib.pyplot as plt
Steps
line chart
Drawing and display steps
Add accessibility
Prepare data
Add x,y scale
Add grid display
Add description
Draw multiple images
1 multiple plots
2 Set graphic style
3 Show legend plt.legend(loc="best")
4 Multiple coordinate system display
plt.subplots (object-oriented drawing method)
matplotlib.pyplot.subplots(nrows=1, ncols=1, **fig_kw) Create a plot with multiple axes (coordinate systems/drawing areas)
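A minimal sketch of the object-oriented style: one `plt.subplots` call returns the figure and a grid of axes, and each axes is drawn on independently. The curves and labels are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# 2 rows x 2 columns; axes is a 2x2 ndarray of Axes objects
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("sin")
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("cos")
axes[1, 0].scatter(x[::10], np.sin(x[::10]))
axes[1, 0].set_title("scatter")
axes[1, 1].bar(["a", "b", "c"], [3, 1, 2])
axes[1, 1].set_title("bar")

fig.tight_layout()
```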
More methods on the axes subcoordinate system: Reference
Notice:
summary
Common graphics drawing
1. line chart
2. Histogram (Bar Chart)
`plt.bar` method
The `plt.bar` method has the following common parameters
More references
1 Drawing of bar graph
2 horizontal bar chart
3 Grouped Bar Chart
4 stacked bar chart
Case URL
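The grouped and stacked variants above differ only in how the bars are positioned: grouped bars are shifted sideways by the bar width, stacked bars use the `bottom` parameter. A sketch with made-up quarterly data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

labels = ["Q1", "Q2", "Q3"]
a = np.array([20, 35, 30])  # hypothetical series A
b = np.array([25, 32, 34])  # hypothetical series B
x = np.arange(len(labels))
width = 0.35

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

# grouped: shift each series by half the bar width
ax1.bar(x - width / 2, a, width, label="A")
ax1.bar(x + width / 2, b, width, label="B")
ax1.set_xticks(x)
ax1.set_xticklabels(labels)
ax1.set_title("grouped")
ax1.legend(loc="best")

# stacked: draw B on top of A via the bottom parameter
ax2.bar(labels, a, label="A")
ax2.bar(labels, b, bottom=a, label="B")
ax2.set_title("stacked")
ax2.legend(loc="best")
```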
3. Scatter plot
4. pie chart
a. parameter
1. `x`: the proportion sequence of the pie chart.
2. `labels`: the name text of each wedge.
3. `explode`: how far each wedge is offset from the center (used to pull slices out).
Other parameters:
4. `autopct`: format of the percentage text, e.g. how many decimals to keep.
5. `shadow`: whether to draw a shadow.
6. `textprops`: text properties (color, size, etc.).
b. return value
1. `patches`: the wedge objects of the pie chart.
2. `texts`: the label text objects of the wedges.
3. `autotexts`: the percentage text objects of the wedges.
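The parameters and return values above can be seen together in one call (note `autotexts` is only returned when `autopct` is given). The proportions and names are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sizes = [40, 30, 20, 10]        # hypothetical proportions
names = ["A", "B", "C", "D"]

# explode pulls the first wedge 10% of the radius out of the pie
patches, texts, autotexts = ax.pie(
    sizes,
    labels=names,
    explode=(0.1, 0, 0, 0),
    autopct="%.1f%%",            # one decimal place on the percentage text
    shadow=True,
    textprops={"fontsize": 10},
)
ax.set_title("pie demo")
```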
5. radar chart
1. plt.polar draws radar chart
2. Precautions
1. `polar` does not close the line automatically, so when drawing we need to append the 0th value of `theta` and `values` at the end, so that the line closes back to the first point.
2. `polar` only draws the line; to fill it with color, call the `fill` function.
3. `polar`'s default tick labels are angles; to show text instead, set them via `xticks`.
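All three precautions show up in one short sketch (close the line, `fill` for color, `xticks` for text labels). The categories and scores are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

labels = ["speed", "power", "skill", "stamina", "luck"]  # hypothetical axes
values = [3, 4, 2, 5, 4]

# one angle per axis; repeat the first point at the end to close the line
theta = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
theta_closed = np.append(theta, theta[0])
values_closed = values + values[:1]

ax = plt.subplot(projection="polar")
ax.plot(theta_closed, values_closed)              # polar only draws the line
ax.fill(theta_closed, values_closed, alpha=0.3)   # fill adds the color
ax.set_xticks(theta)                              # replace angle labels
ax.set_xticklabels(labels)                        # ...with text
```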
6. summary
- Line chart [know]: shows changing trends in data and reflects how things change over time. plt.plot()
- Scatter plot [know]: shows whether variables have a quantitative correlation and reveals outliers (distribution). plt.scatter()
- Bar chart [know]: draws discrete data; the size of each value and the differences between values are visible at a glance (statistics/comparison). plt.bar(x, width, align="center")
- Histogram [know]: draws continuous data to show the distribution of one or more data sets (statistics). plt.hist(x, bins)
- Pie chart [know]: expresses the proportion of different categories, compared through the size of the arcs. plt.pie(x, labels, autopct, colors)
3||| Numpy library
1. Install
Install
It can be installed with `pip install numpy`; in an Anaconda environment it is installed by default.
Tutorial address
Official website
Chinese documentation
2. ndarray
1||| Properties of ndarray
Array properties reflect information inherent in the array itself.
| Attribute name | Attribute explanation |
| :------------------: | :------------------------: |
| **ndarray.shape** | Tuple of array dimensions |
| ndarray.ndim | Array dimensions |
| ndarray.size | Number of elements in the array |
| ndarray.itemsize | The length of an array element (bytes) |
| **ndarray.dtype** | Type of array elements |
2||| shape of ndarray score.shape
3||| Type of ndarray score.dtype
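The attributes in the table above on a hypothetical 3x4 score array:

```python
import numpy as np

# a made-up 3x4 score table to illustrate the attributes
score = np.array([[80, 89, 86, 67],
                  [78, 97, 89, 67],
                  [90, 94, 78, 67]])

print(score.shape)     # (3, 4)  tuple of array dimensions
print(score.ndim)      # 2       number of dimensions
print(score.size)      # 12      number of elements
print(score.itemsize)  # bytes per element (platform dependent)
print(score.dtype)     # element type, e.g. int64
```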
3. Basic operations
1. How to generate an array
np.ones(shape, dtype)
np.zeros(shape, dtype)
2. Generate from existing array
Generation method
np.array(object, dtype)
np.asarray(a, dtype)
3. Generate fixed range array
a. Arithmetic array — specified number
np.linspace (start, stop, num, endpoint)
parameter:
- start: the starting value of the sequence
- stop: the termination value of the sequence
- num: the number of equally spaced samples to be generated, the default is 50
- endpoint: whether the sequence contains the stop value, the default is True
b. Arithmetic array — specifies the step size
np.arange(start,stop, step, dtype) step: step size, default value is 1
c. Geometric sequence np.logspace(start, stop, num)
num: the number of values to generate, the default is 50
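The three range generators side by side (values chosen for illustration):

```python
import numpy as np

# arithmetic sequence by count: 5 evenly spaced values from 0 to 100 inclusive
a = np.linspace(0, 100, 5)   # [  0.,  25.,  50.,  75., 100.]

# arithmetic sequence by step: from 10 (inclusive) to 50 (exclusive), step 10
b = np.arange(10, 50, 10)    # [10, 20, 30, 40]

# geometric sequence: 3 values from 10**0 to 10**2
c = np.logspace(0, 2, 3)     # [  1.,  10., 100.]
```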
4. Random array module np.random module
a. normal distribution
create
np.random.randn(d0, d1, …, dn) Function: Returns one or more sample values from the standard normal distribution
np.random.normal(loc=0.0, scale=1.0, size=None)
loc: float, the mean of the distribution (the center of the whole distribution)
scale: float, the standard deviation (the width of the distribution; the larger the scale, the shorter and fatter the curve; the smaller, the taller and thinner)
size: int or tuple of ints, the output shape; the default None outputs a single value
parameter
np.random.standard_normal(size=None) Returns an array of standard normal distributions of the specified shape.
b. Evenly distributed
np.random.rand(d0, d1, ..., dn) Returns a set of uniformly distributed numbers within [0.0, 1.0).
np.random.uniform(low=0.0, high=1.0, size=None) Function: Randomly sample from a uniform distribution [low, high). Note that the definition domain is closed on the left and open on the right, that is, it includes low but does not include high.
Parameter introduction:
low: sampling lower bound, float type, default value is 0;
high: sampling upper bound, float type, default value is 1;
size: the number of output samples, int or tuple type, For example, size=(m,n,k), then m*n*k samples are output, and 1 value is output by default.
Return value: ndarray type, its shape is consistent with the description in the parameter size.
np.random.randint(low, high=None, size=None, dtype='l')
Number range: If high is not None, take a random integer between [low, high), otherwise take a random integer between [0, low).
Randomly sample from a uniform distribution and generate an integer or N-dimensional array of integers
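The three sampling functions above in one sketch; the mean and bounds are made up:

```python
import numpy as np

# normal distribution: hypothetical mean 1.75, std 1, 10000 samples
heights = np.random.normal(loc=1.75, scale=1.0, size=10000)

# uniform distribution on [-1, 1): a 2x3 array
u = np.random.uniform(low=-1, high=1, size=(2, 3))

# random integers in [0, 10)
ints = np.random.randint(0, 10, size=5)

print(heights.mean())  # close to 1.75
print(u.shape)         # (2, 3)
```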
5. Array operations
A. Modify array dimensions ndarray.reshape(shape, order); shape is the target shape, order is the read order ('C' row-major, 'F' column-major)
B. reshape and resize methods
reshape converts the array to the specified shape and returns the result; the shape of the original array does not change
resize converts the array to the specified shape in place; it modifies the array itself and returns nothing
C. flatten and ravel methods
Both methods convert multi-dimensional arrays into one-dimensional arrays, but have the following differences:
1. flatten converts the array to one dimension and returns a copy, so later modifications to the return value do not affect the original array.
2. ravel converts the array to one dimension and returns a view (which can be understood as a reference), so later modifications to the return value do affect the original array.
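The copy-vs-view difference is easy to demonstrate:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

f = a.flatten()  # copy: later writes do NOT touch a
r = a.ravel()    # view: later writes DO touch a

f[0] = 100
r[1] = 200

print(a[0, 0])  # still 0  (flatten returned a copy)
print(a[0, 1])  # 200      (ravel returned a view)
```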
D. ndarray.T
transpose of array
Swap the rows and columns of an array
E. Array operations
1||| Index, slice
- Direct indexing and slicing
- Object[:, :] -- row first then column
2||| Array cutting (understanding) An array can be split through `hsplit`, `vsplit` and `array_split`
hsplit: Cut horizontally. Used to specify how many columns to divide into. You can use numbers to represent how many parts it is divided into, or you can use an array to represent where to divide.
vsplit: Cut vertically. Used to specify how many rows to divide into. You can use a number to represent how many parts to divide into, or an array to represent where to divide.
split/array_split(array, indices_or_sections, axis): cutting with an explicit axis; `axis=1` means by columns, `axis=0` means by rows
3||| Data splicing
`vstack`: stacks arrays vertically; the arrays must have the same number of columns
`hstack`: stacks arrays horizontally; the arrays must have the same number of rows
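Splitting and stacking are inverse operations, which makes a compact round-trip demo:

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

# hsplit cuts along columns, vsplit along rows
left, right = np.hsplit(a, 2)    # two 4x2 pieces
top, bottom = np.vsplit(a, 2)    # two 2x4 pieces

# array_split with an explicit axis: axis=1 cuts by columns
parts = np.array_split(a, 2, axis=1)

# vstack needs equal column counts; hstack needs equal row counts
v = np.vstack([top, bottom])     # back to the original 4x4
h = np.hstack([left, right])
```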
4||| Array broadcast (emphasis)
(1) broadcast mechanism
1. Broadcasting **expands the array with the smaller shape** so that it matches the shape of the larger array, allowing element-wise functions and operators to be applied.
2. Broadcasting lets two or more arrays be operated on together even when their shapes are not exactly the same, as long as, in each dimension, one of the following conditions holds:
- 1. the arrays have equal length in that dimension, or
- 2. one of the arrays has length 1 in that dimension. Once a dimension has unequal lengths and neither is 1, the arrays cannot be broadcast together.
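A concrete instance of the two conditions: shapes (3, 1) and (3,) broadcast to (3, 3) because each dimension is either equal or 1.

```python
import numpy as np

a = np.array([[0], [10], [20]])   # shape (3, 1)
b = np.array([1, 2, 3])           # shape (3,), treated as (1, 3)

# dims compared right-to-left: (3,1) + (1,3) -> (3,3);
# in each dimension the lengths are equal or one of them is 1
c = a + b
print(c)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# shapes (3, 2) and (3,) would NOT broadcast: trailing dims are 2 vs 3,
# neither equal nor 1
```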
(2) general function
1||| unary function
2||| binary function
3||| aggregate function
4||| boolean array function
5||| More
4||| pandas Official website
1. Basics
1. import import pandas as pd
2. Data reading and storage Official documentation
1||| CSV
a. pd.read_csv reads csv
pandas.read_csv(filepath_or_buffer, sep =',', usecols )
filepath_or_buffer: file path
sep: separator, separated by "," by default
usecols: Specify the column names to be read, in list form
encoding : encoding
parameter
b. pd.to_csv save csv
DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
path_or_buf: file path
sep: separator, separated by "," by default
columns: select the required column index
header: boolean or list of string, default True, whether to write the column index value
index: whether to write the row index
mode: 'w' overwrite, 'a' append
parameter
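A round trip through the two functions above, with a small made-up frame and the parameters from the lists (`index=False`, `usecols`):

```python
import pandas as pd

# write a small made-up frame to CSV, then read selected columns back
df = pd.DataFrame({"name": ["a", "b"],
                   "open": [1.0, 2.0],
                   "close": [1.5, 2.5]})
df.to_csv("demo.csv", index=False)   # index=False: don't write the row index

# usecols: only read the listed columns
loaded = pd.read_csv("demo.csv", usecols=["name", "close"])
print(loaded.columns.tolist())  # ['name', 'close']
```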
2||| json
a. pd.read_json reads json
b. Convert JSON of a given format into the default Pandas DataFrame format
orient: string, indicates the expected JSON string format.
split : dict like {index -> [index], columns -> [columns], data -> [values]}
split separates the data into three parts: index to index, column names to column names, and data to data
records : list like [{column -> value}, ... , {column -> value}]
records are output in the form of `columns:values`
index : dict like {index -> {column -> value}}
index is output in the form `index:{columns:values}...`
columns : dict like {column -> {index -> value}}
The default format. Columns are output in the form `columns:{index:values}`
lines: boolean, default False
Read json object per line
typ: default 'frame'; specifies whether to convert into a Series or a DataFrame
c. pd.to_json save json
(1) DataFrame.to_json(path_or_buf=None, orient=None,lines=False)
1||| Store Pandas objects in json format
2||| orient: stored json form, {'split','records','index','columns','values'}
3||| path_or_buf=None: file address
4||| lines: An object is stored as a line
(2) save format
a. split See comments for results
b. index See comments for results
c. table See comments for results
3||| HDF5
a. Install pip install tables
b. **Reading and storing HDF5 files requires specifying a key whose value is the DataFrame to be stored**
c. read Variable=pd.read_hdf("file name")
d. storage Variable.to_hdf("storage file name", key="variable")
4||| Excel
a. read
b. save
3. pandas data structure
a. Creation of Series
1||| Created from existing data
2||| Specify row index name
3||| Created from dictionary data
b. Series properties
1||| index
2||| values
c. DataFrame
1||| Creation of DataFrame
a. Created from existing data
b. Add row and column index
2||| DataFrame properties
a. shape
b. index DataFrame row index list
c. columns DataFrame column index list
d. values Directly obtain the value of the array
e. T Transpose
f. tail(N): displays the last N rows; with no argument the default is the last 5 rows
3||| DataFrame index
a. Modify row and column index values
b. Reset index
c. Set a column value as a new index
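The creation and index operations above in one sketch; the index and column names are made up:

```python
import pandas as pd
import numpy as np

# Series created from dictionary data
s = pd.Series({"a": 1, "b": 2})

# DataFrame from existing data, with explicit row and column indexes
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["r1", "r2"],
                  columns=["c1", "c2", "c3"])

df2 = df.reset_index()    # reset index: move the row index into a column
df3 = df.set_index("c1")  # set a column value as the new index

print(df.shape)        # (2, 3)
print(df3.index.name)  # c1
```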
4. index object
a. Index object
b. Pandas hierarchical index
c. MultiIndex index object
value
1||| outer selection
2||| inner selection
5. summary
6. Summary of basic operations
2. Index sorting and calculation
a. Basic operations
1||| Index operations
1||| Use row and column index directly (column first, row second)
2||| Use indexing with loc or iloc
2||| Assignment operation
3||| sort
1||| DataFrame sorting
2||| Series sorting
4||| Summarize
b. DataFrame operations
1||| Arithmetic operations
2||| logic operation
3||| logical operation function
4||| statistical operations
(1) describe
(2) statistical function
(3) Cumulative statistics function
5||| summary
3. Data cleaning
a. Data quality guidelines, 4 key points: completeness, comprehensiveness, legality, uniqueness
1. Completeness: Whether there are null values in a single piece of data, and whether the statistical fields are complete.
2. Comprehensiveness: look at all values in a column. For example, when we select a column in an Excel sheet, we can see its average, maximum, and minimum values, and use common sense to judge whether the column has problems, such as in its data definition, unit labeling, or the values themselves.
3. Legality: the legality of data type, content, and size. For example, there are non-ASCII characters in the data, the gender is unknown, the age is over 150 years old, etc.
4. Uniqueness: Whether there are duplicate records in the data, because the data usually comes from the aggregation of different channels, and duplication is common. Both row data and column data need to be unique. For example, a person cannot be recorded multiple times, and a person's weight cannot be recorded multiple times in column indicators.
b. Handle missing data
(1) Missing values
1||| None: missing value for Python object type
2||| NaN: Missing value of numeric type
3||| Difference between NaN and None in Pandas
4||| Pandas conversion rules for different types of missing values
(2) Handle missing data
a. Methods and instructions
| Method | Description |
| :-- | :-- |
| isnull() | Creates a boolean mask marking missing values. |
| notnull() | The opposite of the isnull() operation. |
| dropna() | Returns the data with missing values removed. |
| fillna() | Returns a copy of the data with missing values filled. |
b. Determine null value isnull() and notnull()
c. Remove missing values dropna() method
d. Fill missing values fillna() method
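The three steps (detect, remove, fill) on a tiny made-up frame; the fill value here is a column mean, one common choice:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 5.0, 6.0]})

mask = df.isnull()                   # boolean mask of missing values
dropped = df.dropna()                # rows containing any NaN removed
filled = df.fillna(df["a"].mean())   # fill NaN with a chosen statistic (2.0)

print(mask["a"].tolist())  # [False, True, False]
print(len(dropped))        # 1 (only the last row is complete)
```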
c. Duplicate data
(1) Handle duplicate data
(2) Filter duplicate rows
d. Data replacement
(1) replacement value
1||| replace replaces based on the content of the value
2||| Regular replacement
(2) function data replacement
1||| map
2||| apply
3||| applymap
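The three replacement functions differ in scope: `map` is element-wise on a Series, `apply` works along an axis, `applymap` is element-wise on the whole DataFrame (in pandas >= 2.1 it is also available as `DataFrame.map`). A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# map: element-wise on a single Series
df["x2"] = df["x"].map(lambda v: v * 2)

# apply: along an axis (here: one result per column)
col_max = df[["x", "y"]].apply(lambda col: col.max())

# applymap: element-wise on the whole DataFrame
squared = df[["x", "y"]].applymap(lambda v: v ** 2)
```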
e. String operations
(1) String methods
(2) Regular expression method
(3) pandas string functions
4. Grouped aggregation and time series
A. Group aggregation
a. GroupBy object
b. Aggregation (agg)
c. filter
B. Data merging (know)
a. Data merge (pd.merge)
(1) Inner join (inner): the intersection of the keys in both tables
(2) Outer join (outer): the union of the keys in both tables
(3) Left join (left): all keys of the left table
(4) Right join (right): all keys of the right table
Sample code and results See note
(5) Handle duplicate column names
(6) Join by index
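The four join types and the duplicate-column handling above, on two small made-up tables sharing a `key` column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")  # keys b, c
outer = pd.merge(left, right, on="key", how="outer")  # keys a, b, c, d
left_j = pd.merge(left, right, on="key", how="left")  # keys a, b, c

# duplicate column names get suffixes
dup = pd.merge(left, left, on="key", suffixes=("_x", "_y"))
print(dup.columns.tolist())  # ['key', 'l_x', 'l_y']
```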
b. Data merging (pd.concat)
(1) NumPy concatenate (np.concatenate)
(2) pd.concat
c. Reshape
(1) stack
(2) unstack
C. Time series
Time and date data types
(1) datetime
(2) Conversion between string and datetime
Time series
1||| datetime.strptime can convert strings to dates using format codes
2||| datetime.strptime is the best way to parse a date with a known format, but writing the format string every time is tedious, especially for common date formats. In that case you can use the parser.parse method from the dateutil third-party package (installed automatically with pandas)
3||| dateutil Can parse almost any human-understandable date representation
4||| In the internationally accepted format, it is very common for the day to appear in front of the month. This problem can be solved by passing dayfirst=True:
5||| pandas is usually used to process arrays of dates, whether as the axis index or as columns of a DataFrame. The to_datetime method can parse many different date representations, and parses standard formats such as ISO 8601 very quickly
6||| It can also handle missing values (None, empty string, etc.)
7||| NaT (Not a Time) is the null value of timestamp data in pandas.
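Points 1, 5, 6, and 7 above in a short sketch (the dates themselves are arbitrary):

```python
import pandas as pd
from datetime import datetime

# strptime: parse a string with a known format
dt = datetime.strptime("2023-10-24", "%Y-%m-%d")

# to_datetime: parses whole sequences (ISO 8601 fast path) and
# converts missing values (None, empty string) to NaT
idx = pd.to_datetime(["2023-10-24 00:06:19", None])

print(dt.year)           # 2023
print(idx[1] is pd.NaT)  # True: NaT is the timestamp null value
```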
(3) Time series basics
1||| The most basic time series type in pandas is a Series indexed by timestamps (usually represented as Python strings or datetime objects)
2||| These datetime objects are actually placed in a DatetimeIndex
3||| Like other Series, arithmetic operations between time series with different indexes are automatically aligned by date:
(4) Indexing, selection, subset construction
1||| When you select data based on label index, Time series are very similar to other pandas.Series
2||| There is a more convenient usage: Pass in a string that can be interpreted as a date:
(5) Date range, frequency, and movement
(6) Generate date range
1||| pandas.date_range generates a DatetimeIndex of a specified length at a specified frequency
2||| By default, date_range produces daily time points. If you pass only a start or end date, you must also pass the number of periods.
3||| The start and end dates define the strict boundaries of the date index.
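Both calling styles (start + periods, and start + end as strict boundaries, with an explicit frequency); the dates are arbitrary:

```python
import pandas as pd

# start + periods: five daily time points
a = pd.date_range(start="2023-10-01", periods=5)

# start + end define strict boundaries; freq="W-SUN" keeps only Sundays
b = pd.date_range(start="2023-10-01", end="2023-10-31", freq="W-SUN")

print(len(a))  # 5
print(a[0])    # 2023-10-01 00:00:00
```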
(7) Resampling and frequency conversion resample is a flexible and efficient method. Can be used to process very large time series.
5||| pyecharts Official documentation
1. dynamic visualization
A. Install pip install pyecharts
B. View version number pyecharts.__version__
C. Versions: the two versions are mutually incompatible
1||| two versions
a. v1
b. v0.5.X
2||| Support chain calls
D. import
(1) import pyecharts
(2) from pyecharts.charts import Bar Function import (Bar is a bar chart)
E. Can be used in Jupyter Notebook
2. Global configuration items
6||| seaborn
graphics