# [scikit-learn] Feature Binarization

www.MyException.Cn  Shared by a user on: 2013-09-03

### 1. First, build a test dataset

```
# coding: utf-8
import numpy
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

def t2():
    # Build a small DataFrame with one string column and two numeric columns.
    testdata = pd.DataFrame({'pet': ['chinese', 'english', 'english', 'math'],
                             'age': [6, 5, 2, 2],
                             'salary': [7, 5, 2, 5]})
    print(testdata)
    return testdata

# Keep the DataFrame at module level so the later snippets can use it.
testdata = t2()
```

### 2. Handling numeric categorical variables

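`OneHotEncoder` expects 2-D input, so pass the one-column DataFrame `testdata[['age']]`. The 1-D Series `testdata.age` is not equivalent: current scikit-learn releases reject 1-D input with a `ValueError`.

`OneHotEncoder(sparse_output=False).fit_transform(testdata[['age']])  # before scikit-learn 1.2 the keyword was sparse`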

The columns correspond to the sorted ages 2, 5 and 6, so for `age = [6, 5, 2, 2]` the result is:

``array([[0., 0., 1.], [0., 1., 0.], [1., 0., 0.], [1., 0., 0.]])``
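To see which output column belongs to which value, the fitted encoder can be inspected. The following is a minimal sketch, assuming scikit-learn 0.20 or later (where the `categories_` attribute exists); it leaves the output sparse and calls `.toarray()` so it does not depend on the `sparse` / `sparse_output` keyword. The variable names `encoder`, `encoded` and `age_categories` are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same 'age' column as the test dataset in this article.
testdata = pd.DataFrame({'age': [6, 5, 2, 2]})

# Keep the default sparse output and densify it afterwards.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(testdata[['age']]).toarray()

# categories_ holds one array per input column, sorted in ascending order;
# column i of `encoded` corresponds to categories_[0][i].
age_categories = encoder.categories_[0].tolist()

print(age_categories)
print(encoded)
```

Each row has a single 1 in the column of its own age, which is why the first row (age 6) lights up the last column.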

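To encode several columns, you can encode each one separately and concatenate the results:

```
import numpy

# Encode each column on its own, then stack the results horizontally.
# (sparse_output was named sparse before scikit-learn 1.2.)
result1 = OneHotEncoder(sparse_output=False).fit_transform(testdata[['age']])
result2 = OneHotEncoder(sparse_output=False).fit_transform(testdata[['salary']])
final_output = numpy.hstack((result1, result2))
print(final_output)
```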

Alternatively, pass both columns in a single call:

`result = OneHotEncoder(sparse_output=False).fit_transform(testdata[['age', 'salary']])`

The result is:

``array([[0., 0., 1., 0., 0., 1.], [0., 1., 0., 0., 1., 0.], [1., 0., 0., 1., 0., 0.], [1., 0., 0., 0., 1., 0.]])``
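The two approaches above can be checked against each other. This is a small sketch, assuming scikit-learn 0.20 or later (needed for `inverse_transform`); it uses `.toarray()` instead of the `sparse` / `sparse_output` keyword so it is not tied to one release. The names `separate`, `together`, `same` and `recovered` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same numeric columns as the test dataset in this article.
testdata = pd.DataFrame({'age': [6, 5, 2, 2], 'salary': [7, 5, 2, 5]})

# Encode each column on its own, then concatenate horizontally.
separate = np.hstack([
    OneHotEncoder().fit_transform(testdata[[col]]).toarray()
    for col in ['age', 'salary']
])

# Encode both columns in a single call.
encoder = OneHotEncoder()
together = encoder.fit_transform(testdata[['age', 'salary']]).toarray()

# Both routes should produce the same 4 x 6 matrix.
same = np.array_equal(separate, together)

# inverse_transform maps the one-hot rows back to the original values.
recovered = encoder.inverse_transform(together)

print(same)
print(recovered)
```

The single-call form is usually more convenient because one fitted encoder remembers the categories of every column, which also makes decoding possible.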

### 3. Handling string categorical variables

• Method 1: use LabelEncoder() to convert the strings to consecutive integer codes, then binarize them with OneHotEncoder()

• Method 2: binarize directly with LabelBinarizer()

```
# Method 1: LabelEncoder() + OneHotEncoder()
a = LabelEncoder().fit_transform(testdata['pet'])
# Note: reshape turns a into the 2-D array that OneHotEncoder requires.
# (sparse_output was named sparse before scikit-learn 1.2.)
OneHotEncoder(sparse_output=False).fit_transform(a.reshape(-1, 1))

# Method 2: use LabelBinarizer() directly
LabelBinarizer().fit_transform(testdata['pet'])
```

Both methods produce the same matrix (method 1 as floats, method 2 as integers):

``array([[1., 0., 0.], [0., 1., 0.], [0., 1., 0.], [0., 0., 1.]])``
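As a cross-check of the methods above, the sketch below prints the label order that `LabelBinarizer` uses for its columns and compares its output with `OneHotEncoder` applied to the string column directly. That last step is an assumption about the installed version: it needs scikit-learn 0.20 or later, where `OneHotEncoder` accepts strings without a prior `LabelEncoder` pass. The names `binarizer`, `by_binarizer`, `labels` and `by_onehot` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Same 'pet' column as the test dataset in this article.
testdata = pd.DataFrame({'pet': ['chinese', 'english', 'english', 'math']})

binarizer = LabelBinarizer()
by_binarizer = binarizer.fit_transform(testdata['pet'])

# classes_ lists the labels in the order of the output columns.
labels = binarizer.classes_.tolist()

# Assumption: scikit-learn >= 0.20, which one-hot encodes string columns directly.
by_onehot = OneHotEncoder().fit_transform(testdata[['pet']]).toarray()

print(labels)
print(np.array_equal(by_binarizer, by_onehot))
```

One caveat worth keeping in mind: with exactly two distinct labels, `LabelBinarizer` returns a single 0/1 column rather than two columns, so the two methods only agree when there are three or more classes, as there are here.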