MindMap Gallery CFA Level 2, Big Data
CFA Level 2, steps of big data analysis, data mining, data analysis, machine learning
Edited at 2020-04-18 14:51:21
Big Data in Investment Management
data
data:4Vs
volume
variety
velocity
veracity
Textual big data
topics
sentiment
Steps
Structured Data
Conceptualization of the modeling task
Data collection.
Data preparation and wrangling.
cleaning and preprocessing of the raw data
Data exploration
data analysis, feature selection, and feature engineering.
Model training
Unstructured Data
Text problem formulation
Data (text) curation
web spidering (scraping or crawling) programs
Text preparation and wrangling.
structured inputs
steps of text cleansing process
1. remove html tags
2. remove punctuation
3. remove numbers
4. remove white spaces
steps of text wrangling
normalization process
lowercasing
stop words
the,is,a
stemming
lemmatization
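The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop word set holds only the three words listed above, and naive_stem is a hypothetical suffix-stripping helper standing in for a real stemmer.

```python
# Illustrative only: STOP_WORDS and naive_stem are assumptions for this sketch.
STOP_WORDS = {"the", "is", "a"}

def naive_stem(token: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text: str) -> list:
    tokens = text.lower().split()                         # lowercasing, then split on white space
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [naive_stem(t) for t in tokens]                # stemming

tokens = normalize("The market is rallying and analysts raised targets")
print(tokens)  # ['market', 'rally', 'and', 'analyst', 'rais', 'target']
```

Note that stemming turns "raised" into the non-word "rais". Lemmatization would instead map it to the dictionary form "raise", at the cost of needing a vocabulary lookup.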
Text exploration
word clouds
Model training
Data collection
databases
application programming interface (API)
Data Preparation and Wrangling
Data Preparation (Cleansing)
Incompleteness error
Missing values and NAs
Impute with the mean, median, or mode of the variable, or simply assume zero
Invalidity error
Invalidity error is where the data are outside of a meaningful range, resulting in invalid data.
verifying other administrative data records
Inaccuracy error
Inaccuracy error is where the data are not a measure of true value.
“Don’t Know”
rectified with the help of business records and administrators
Inconsistency error
Example: the title is "Ms. Li" but the gender is recorded as male
Non-uniformity error
Non-uniformity error is where the data are not present in an identical format
Date of Birth column is present in various formats: 12/5/1970; 15 Jan, 1975
converting the data points into a preferable standard format.
Duplication error
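Three of the cleansing fixes above (duplication, incompleteness, non-uniformity) can be sketched with the Python standard library. The records, the choice of median imputation, and the assumption that "12/5/1970" means month/day/year are all illustrative, not taken from the source.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"id": 1, "income": 50000, "dob": "12/5/1970"},
    {"id": 2, "income": None,  "dob": "15 Jan, 1975"},
    {"id": 3, "income": 70000, "dob": "3/8/1982"},
    {"id": 3, "income": 70000, "dob": "3/8/1982"},  # duplicate row
]

# Assumed input formats; "12/5/1970" is read as month/day/year here.
DATE_FORMATS = ("%m/%d/%Y", "%d %b, %Y")

def to_iso(date_text):
    """Non-uniformity fix: convert any known format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_text!r}")

# Duplication fix: keep the first occurrence of each identical row.
seen, unique = set(), []
for row in records:
    key = (row["id"], row["income"], row["dob"])
    if key not in seen:
        seen.add(key)
        unique.append(row)

# Incompleteness fix: impute missing income with the median of known values.
fill = median(r["income"] for r in unique if r["income"] is not None)

cleaned = [
    {
        "id": r["id"],
        "income": fill if r["income"] is None else r["income"],
        "dob": to_iso(r["dob"]),
    }
    for r in unique
]
```

Invalidity, inaccuracy, and inconsistency errors usually cannot be fixed mechanically like this; as noted above, they are rectified against other records.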
Data Wrangling (Preprocessing)
transformations
Extraction
Extract new variables from existing variables
For example, extract Age from Date of Birth, or create a new field "Total loans as a proportion of income"
Aggregation
Salary + Other Income -> Total Income
Filtration
Filter rows for "Beijing" from the "city" column
Selection
Only select useful features
Conversion
Convert the amount to USD uniformly
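The five transformations above can be sketched in one pass over a small dataset. Everything here is hypothetical: the customer rows, the as-of date, the field names, and the fixed exchange rate USD_PER_CNY are assumptions made for illustration.

```python
from datetime import date

USD_PER_CNY = 0.14        # hypothetical fixed exchange rate
AS_OF = date(2020, 4, 18)  # hypothetical valuation date

customers = [
    {"name": "A", "city": "Beijing", "dob": date(1970, 12, 5),
     "salary": 300000, "other_income": 50000, "currency": "CNY", "note": "n/a"},
    {"name": "B", "city": "Shanghai", "dob": date(1975, 1, 15),
     "salary": 60000, "other_income": 5000, "currency": "USD", "note": "n/a"},
    {"name": "C", "city": "Beijing", "dob": date(1982, 3, 8),
     "salary": 200000, "other_income": 0, "currency": "CNY", "note": "n/a"},
]

def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday

prepared = []
for row in customers:
    if row["city"] != "Beijing":                      # filtration: keep Beijing rows
        continue
    total = row["salary"] + row["other_income"]       # aggregation: total income
    rate = USD_PER_CNY if row["currency"] == "CNY" else 1.0
    prepared.append({                                 # selection: drop unused fields
        "name": row["name"],
        "age": age_on(row["dob"], AS_OF),             # extraction: age from date of birth
        "total_income_usd": round(total * rate, 2),   # conversion: uniform USD
    })
```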
Outliers
Standard deviation
3 standard deviations from the mean
interquartile range (IQR)
the 75th and the 25th percentile values of the data
outside of 1.5 × IQR: outliers
outside of 3 × IQR: extreme values
trimming
a 5% trimmed dataset is one for which the 5% highest and the 5% lowest values have been removed.
winsorization
The process of replacing extreme values and outliers in a dataset with the maximum (for large value outliers) and minimum (for small value outliers) values of data points that are not outliers.
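A minimal sketch of IQR-based outlier detection, trimming, and winsorization follows. It assumes the 1.5 × IQR distance is measured from the 25th and 75th percentiles (the common fence convention), which the outline does not state explicitly. The data are made up, and the trim uses 10% rather than 5% only because the sample has ten points.

```python
from statistics import quantiles

data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]  # hypothetical sample

# Quartiles and IQR (statistics.quantiles uses the "exclusive" method by default).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < low_fence or x > high_fence]
inliers = [x for x in data if low_fence <= x <= high_fence]

def winsorize(values):
    """Replace outliers with the max/min of the non-outlier values."""
    lo, hi = min(inliers), max(inliers)
    return [min(max(x, lo), hi) for x in values]

def trim(values, pct):
    """Drop the pct highest and pct lowest values."""
    ordered = sorted(values)
    k = int(len(ordered) * pct)
    return ordered[k:len(ordered) - k]

winsorized = winsorize(data)
trimmed = trim(data, 0.10)
```

Winsorization keeps the sample size unchanged (95 becomes 14 here), whereas trimming discards observations from both tails whether or not they are outliers.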
Scaling
Normalization
Sensitive to outliers and suitable for situations where the distribution is unknown
Standardization
Suitable for samples that follow a normal distribution
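The two scaling methods can be compared on a toy series. Normalization here means min-max rescaling to [0, 1], and standardization means subtracting the mean and dividing by the standard deviation. The values are invented, and using the sample standard deviation rather than the population one is an assumption of this sketch.

```python
from statistics import mean, stdev

values = [2, 4, 6, 8, 10]  # hypothetical feature values

# Normalization: (x - min) / (max - min), result lies in [0, 1].
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

# Standardization: (x - mean) / standard deviation, result has mean 0 and sd 1.
mu, sigma = mean(values), stdev(values)
standardized = [(x - mu) / sigma for x in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because normalization depends on the minimum and maximum, a single extreme value compresses every other point toward one end of the range, which is why it is described above as sensitive to outliers.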
Unstructured (Text) Data
text, images, videos, and audio files
For use in an ML model, unstructured data must first be converted into structured data
Text processing
cleansing
regular expression (regex)
a series that contains characters in a particular order
steps
1 Remove html tags:
2 Remove Punctuation
Take care with punctuation that carries meaning: periods, question marks, etc. indicate sentence breaks or semantics; hyphens, underscores, etc. keep words intact; and periods can indicate abbreviations.
3 Remove Numbers
information extraction (IE)
4 Remove white spaces
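The four cleansing steps map naturally onto regular expressions. The sketch below uses Python's re module with deliberately simple patterns chosen for illustration. It removes all punctuation and numbers outright, so it ignores the caveat above about punctuation that carries meaning; a real pipeline would treat those cases separately.

```python
import re

def cleanse(raw):
    text = re.sub(r"<[^>]+>", " ", raw)       # 1. remove html tags
    text = re.sub(r"[^\w\s]", " ", text)      # 2. remove punctuation (underscores are kept)
    text = re.sub(r"\d+", " ", text)          # 3. remove numbers
    return re.sub(r"\s+", " ", text).strip()  # 4. collapse and trim white spaces

cleaned = cleanse("<b>Shares</b> fell 5%   today; volume hit 300 lots.")
print(cleaned)  # Shares fell today volume hit lots
```

Each removed item is replaced with a space rather than an empty string so that neighbouring words do not fuse together; the final step then collapses the extra spaces.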
preprocessing