Lecture Administration and visualization: Chapter 5.2 - Feature engineering


The lecture "Administration and visualization: Chapter 5.2 - Feature engineering" covers the feature engineering toolbox, variable data types, numeric variables, quantization or binning, and related topics.

Feature engineering
• "Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data." – Jason Brownlee
• "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." – Andrew Ng
The dream ...
[Figure: Raw data → Dataset → Model → Task]
... The reality
[Figure: pipeline with the labels Raw data, ML-ready dataset, Features, Model and Task, with question marks between some of the steps]
Feature engineering toolbox
• Just kidding :)
Variable data types
Numeric variables
Binarization
• Counts can quickly accumulate without bound
• Convert them into binary values (0, 1) to indicate presence
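A minimal sketch of binarization with NumPy. The `counts` array is made-up illustration data (for example, how often a user played each song), not something from the lecture.

```python
import numpy as np

# Hypothetical play counts for five (user, song) pairs.
counts = np.array([0, 1, 3, 250, 0])

# Binarize: 1 if the count is positive, 0 otherwise.
binary = (counts > 0).astype(int)

print(binary)  # [0 1 1 1 0]
```

The very large count (250) and the small ones (1, 3) all collapse to the same value, which is the point: only presence is kept.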
Quantization or Binning
• Group the counts into bins
• Maps a continuous number to a discrete one
• Bin size
  • Fixed-width binning
    • E.g., age ranges: 0–12 years old, 12–17 years old, 18–24 years old, 25–34 years old
  • Adaptive-width binning
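The sketch below shows two common ways to map a number to a bin index with NumPy. The `ages` values and the exact bin `edges` are assumptions chosen for illustration; in particular, the edges treat age 12 as belonging to the second range.

```python
import numpy as np

# Hypothetical ages, in years.
ages = np.array([3, 15, 21, 29, 34, 47, 62])

# Bins of width 10: integer division gives the bin index
# (0 for ages 0-9, 1 for ages 10-19, and so on).
decade_bin = ages // 10

# Custom bin edges: np.digitize returns, for each age, the index i
# such that edges[i-1] <= age < edges[i]. Ages at or above the last
# edge get index len(edges).
edges = np.array([0, 12, 18, 25, 35])
custom_bin = np.digitize(ages, edges)

print(decade_bin)  # [0 1 2 2 3 4 6]
print(custom_bin)  # [1 2 3 4 4 5 5]
```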
Equal Width Binning
• Divides the continuous variable into several categories whose bins (ranges) all have the same width
• Pros
  • Easy to compute
• Cons
  • When there are large gaps in the counts, many bins end up empty with no data
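To see the empty-bin problem, here is a small sketch using `np.histogram` on made-up, heavily skewed counts. The data values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical skewed counts: most values are small, one is very large.
counts = np.array([1, 2, 2, 3, 5, 8, 1000])

# Ten equal-width bins spanning [min, max].
hist, edges = np.histogram(counts, bins=10)

print(hist)   # [6 0 0 0 0 0 0 0 0 1]
print(edges)  # the 11 bin boundaries, each 99.9 apart
```

Eight of the ten bins are empty, and six of the seven values share a single bin, so the binned feature carries very little information.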
Adaptive-width binning
• Equal frequency binning
• Quantiles: values that divide the data into equal portions (continuous intervals with equal probabilities)
• Some q-quantiles have special names
  • The only 2-quantile is called the median
  • The 4-quantiles are called quartiles → Q
  • The 6-quantiles are called sextiles → S
  • The 8-quantiles are called octiles
  • The 10-quantiles are called deciles → D
Example: quartiles
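A possible quartile example in NumPy, assuming made-up data. `np.quantile` computes the three quartile boundaries and `np.digitize` assigns each value to one of the four resulting bins.

```python
import numpy as np

# Hypothetical skewed counts.
counts = np.array([1, 2, 3, 4, 5, 8, 13, 1000])

# Quartile boundaries: the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.quantile(counts, [0.25, 0.5, 0.75])

# Assign each value to a quartile bin (0, 1, 2 or 3).
quartile_bin = np.digitize(counts, [q1, q2, q3])

print(q1, q2, q3)
print(quartile_bin)               # [0 0 1 1 2 2 3 3]
print(np.bincount(quartile_bin))  # [2 2 2 2]
```

Unlike equal-width binning, every bin receives the same number of values here, even though one value is far larger than the rest. With many tied values the bins can still come out uneven.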
Log Transformation
• Original number = x
• Transformed number x' = log10(x)
• Back-transformed number = 10^x'
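The transform and its inverse can be sketched as follows. The input values are illustrative assumptions and must be strictly positive, since log10 is undefined for zero and negative numbers.

```python
import numpy as np

# Hypothetical strictly positive values spanning several orders of magnitude.
x = np.array([1.0, 10.0, 100.0, 1000.0, 100000.0])

x_log = np.log10(x)      # transformed values x' = log10(x)
x_back = 10.0 ** x_log   # back-transformed values 10^x'

print(x_log)  # [0. 1. 2. 3. 5.]
```

The range 1 to 100000 is compressed to 0 to 5, which is why the log transform is often applied to heavy-tailed counts.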
Box-Cox transformation
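The slide content for this topic did not survive extraction. As a reminder, the Box-Cox transform is usually defined for x > 0 as (x^λ − 1) / λ when λ ≠ 0 and ln(x) when λ = 0. Below is a minimal NumPy sketch for a fixed λ. The function name `box_cox` and the sample data are assumptions for illustration. In practice λ is usually estimated from the data, for example with `scipy.stats.boxcox`.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return np.log(x)            # limiting case: natural log
    return (x ** lam - 1.0) / lam   # general case

# Hypothetical data.
x = np.array([1.0, 4.0, 9.0, 16.0])

print(box_cox(x, 0))    # same as np.log(x)
print(box_cox(x, 0.5))  # [0. 2. 4. 6.], i.e. 2 * (sqrt(x) - 1)
print(box_cox(x, 1))    # [ 0.  3.  8. 15.], i.e. x - 1
```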
Feature Scaling or Normalization
• Models that are smooth functions of the input, such as linear regression and logistic regression, are affected by the scale of the input
• Feature scaling or normalization changes the scale of the features
Min-max scaling
• Squeezes (or stretches) all values of each feature into the range [0, 1]
• Adds robustness to very small standard deviations; zeros in sparse data are preserved only for features whose minimum is 0
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
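The same result can be reproduced by hand with the formula x' = (x − min) / (max − min), applied per column. This is a sketch in plain NumPy. The `X_test` row is an assumed example of new data, and the code assumes no feature is constant (otherwise max − min is 0).

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature (column) minimum and maximum, learned from the training data.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# x' = (x - min) / (max - min)
X_train_minmax = (X_train - col_min) / (col_max - col_min)

# New data reuses the training min and max, so it can fall outside [0, 1].
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = (X_test - col_min) / (col_max - col_min)

print(X_train_minmax)
print(X_test_minmax)
```

Note that the zeros in the second and third columns of `X_train` do not stay zero, because those columns have a negative minimum.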
Standard (Z) Scaling
• After standardization, a feature has a mean of 0 and a variance of 1 (an assumption of many learning algorithms)
Standardization with scikit-learn:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])
>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
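Standardization is the per-feature formula z = (x − mean) / std. The NumPy sketch below reproduces it by hand on the same matrix. It uses the population standard deviation (`ddof=0`, NumPy's default) and assumes no feature has zero variance.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature (column) mean and population standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)

# z = (x - mean) / std
X_scaled = (X - mean) / std

print(X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

When there is a separate test set, the usual practice is to compute `mean` and `std` on the training data only and reuse them to transform the test data.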
l2 Normalization
• Rescales each sample (row) so that its l2 norm equals 1
• The l2 norm is also known as the Euclidean norm
• It measures the length of the vector in coordinate space
from pandas import read_csv
from sklearn.preprocessing import Normalizer

path = r'./pima-indians-diabetes.csv'
# The names must match the columns of the CSV file.
names = ['preg', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

# Rescale every row to unit l2 norm.
Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)
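What l2 normalization does to each row can be sketched with NumPy alone. The small matrix `X` is made-up data chosen so the lengths are easy to check by hand, and the code assumes no row is all zeros.

```python
import numpy as np

# Hypothetical samples, one per row.
X = np.array([[3., 4.],
              [1., 0.],
              [6., 8.]])

# Euclidean (l2) length of each row: sqrt(sum of squares).
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each row by its length so every row has unit l2 norm.
X_normalized = X / norms

print(norms.ravel())   # [ 5.  1. 10.]
print(X_normalized)    # rows [0.6, 0.8], [1.0, 0.0], [0.6, 0.8]
```

The first and third rows point in the same direction, so they become identical after normalization: only the direction of each sample is kept, not its magnitude.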