Introduction
This post covers the basic usage of LightGBM; the underlying principles are not discussed here and will be covered in a separate article.

References
Parameter Overview
Common parameters for creating a Dataset
- feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is pandas DataFrame, data columns names are used.
- free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset. (Used to free memory once the Dataset has been built.)
- weight (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Weight for each instance. (A weight for every training sample.)
- reference (Dataset or None, optional (default=None)) – If this is a Dataset for validation, the training data should be used as reference. (For a validation Dataset, pass the training Dataset as reference.)
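A minimal sketch of putting these options together (the file and column names here are made up for illustration):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical data: the column 'label' is the target, the rest are features
df = pd.read_csv('train.csv')
X_train, X_valid, y_train, y_valid = train_test_split(
    df.drop('label', axis=1), df['label'], test_size=0.2)

lgb_train = lgb.Dataset(X_train, y_train,
                        weight=[1.0] * len(X_train),  # one weight per instance
                        feature_name='auto',          # take names from the DataFrame columns
                        free_raw_data=False)          # keep the raw data so it can be reused
# a validation Dataset should use the training Dataset as its reference
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)
```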
Commonly used model parameters
Only the most commonly used parameters are introduced here; for the full descriptions see the official Parameters page (and its Chinese translation).
- objective (the task type)
  - regression
  - binary: binary log loss classification (i.e. logistic regression). Requires labels in {0, 1}; see the cross-entropy application for general probability labels in [0, 1]
- learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
- num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: num_leaves > 1
  - max number of leaves in one tree (more leaves fit the training data better, but too many easily lead to over-fitting)
  - the relationship between tree depth and leaf count is illustrated below, assuming a full tree; in practice the depth rarely needs to exceed 7, but because trees are usually not full, a tree with 100+ leaves can still be rather deep
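For reference, a full binary tree of depth d has 2^d leaves, so the leaf count grows very quickly with depth; a quick check:

```python
# number of leaves in a full binary tree at each depth
for depth in range(4, 8):
    print(f'depth {depth}: {2 ** depth} leaves')  # depth 7 -> 128 leaves
```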

- metric (the evaluation metric(s) to report)
- feature_fraction (feature sub-sampling), default = 1.0, type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
  - LightGBM will randomly select part of the features on each iteration if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree (each tree is trained on a subset of the features rather than all of them)
  - can be used to speed up training
  - can be used to deal with over-fitting
- feature_fraction_seed, default = 2, type = int
  - random seed for feature_fraction
  - different seeds make the trees pick different feature subsets, which is useful for model fusion/ensembling
- bagging_fraction (data sub-sampling), default = 1.0, type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
  - like feature_fraction, but this will randomly select part of the data without resampling (each iteration is trained on this fraction of the rows, drawn without replacement)
  - can be used to speed up training
  - can be used to deal with over-fitting
  - Note: to enable bagging, bagging_freq should be set to a non-zero value as well
- bagging_freq, default = 0, type = int, aliases: subsample_freq
  - frequency for bagging: 0 means disable bagging; k means perform bagging at every k-th iteration
  - Note: to enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well
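Putting the sampling parameters together, an illustrative configuration (the values here are only examples) might look like this:

```python
params = {
    'objective': 'regression',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'feature_fraction': 0.8,     # each tree sees 80% of the features
    'feature_fraction_seed': 2,  # change the seed to get differently-built models for fusion
    'bagging_fraction': 0.8,     # each bagging round uses 80% of the rows
    'bagging_freq': 5,           # re-sample the rows every 5 iterations
}
```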
Commonly used training parameters
- num_iterations (number of boosting iterations), default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
  - number of boosting iterations
- categorical_feature (specifies which features are categorical), default = "", type = multi-int or string, aliases: cat_feature, categorical_column, cat_column
  - used to specify categorical features
  - use numbers for indices, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features
  - add a name: prefix for column names, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
  - Note: only supports categorical features with int type
  - Note: indices start from 0 and do not count the label column when the passing type is int
  - Note: all values should be less than Int32.MaxValue (2147483647)
  - Note: using large values can be memory consuming; the tree decision rule works best when categorical features are encoded as consecutive integers starting from zero
  - Note: all negative values will be treated as missing values
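In the Python API the same option is usually passed to lgb.train (or lgb.Dataset); a small sketch, assuming params and lgb_train as defined in the examples further below:

```python
# by column index: columns 0, 1 and 2 are treated as categorical
gbm = lgb.train(params, lgb_train, categorical_feature=[0, 1, 2])

# by column name (assuming the DataFrame has columns named 'c1', 'c2' and 'c3')
gbm = lgb.train(params, lgb_train, categorical_feature=['c1', 'c2', 'c3'])
```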
Suggestions for faster training, for higher accuracy and for preventing over-fitting
The concrete recommendation lists are not reproduced here; they can be found in the official Parameters Tuning guide. Among those suggestions, the third and the fifth usually give good results.
LightGBM Examples
This part walks through concrete usage of LightGBM. All of the content comes from the official examples (Python-package Examples).
Simple LightGBM Example (Regression)
This is a simple example of using LightGBM for regression. It covers the following points:
- creating the dataset (1. loading the data, 2. building the LightGBM Dataset)
- basic training and prediction (setting the parameters)
- evaluating on a validation set during training
- early stopping
- saving the model to a file (as a txt file)
Preparation and dataset creation
First, the imports.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
```

Next, create the dataset: load the data and then build the LightGBM Dataset.

```python
df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
```

Take a quick look at the contents and size of the data.
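For example (the exact output depends on the data files):

```python
print(df_train.shape, df_test.shape)  # number of rows and columns in each file
print(df_train.head())                # column 0 is the label, the rest are features
```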
```python
# ---------------------
# split the data
# ---------------------
x_train = df_train.drop(0, axis=1)  # training features
y_train = df_train[0]               # training labels
x_test = df_test.drop(0, axis=1)    # test features
y_test = df_test[0]                 # test labels
# ------------------------------
# create dataset for lightgbm
# ------------------------------
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)
```
Training and saving the model
Next we set the parameters, train the model and save it.
First, the parameters.

```python
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},  # l1 (mean absolute error) and l2 (mean squared error) are two different error metrics
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
```
Start training: pass in the parameters, the training data and the validation data.
```python
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)
```

In the printed log, l1 and l2 are the values of the two error metrics.
Once training is finished, the model can be saved.

```python
# save the model
gbm.save_model('Regressionmodel.txt')
```
Prediction
Finally, use the trained model to predict. The input can be a DataFrame.

```python
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration)
```
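The mean_squared_error imported earlier can then be used to evaluate the predictions; for example, the RMSE:

```python
print('The RMSE of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
```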
Simple LightGBM Example (Classification)
The section above covered regression; here we look at a multi-class model, focusing only on what is different.
Parameter settings
For a multi-class model the parameters to pay attention to are objective, metric and num_class. In particular, num_class (the number of classes) must be set.

```python
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 6,
    'metric': {'multi_logloss'},  # multi-class log loss
    'num_leaves': 15,
    'learning_rate': 0.1,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
```
Training then works exactly the same way as above. Next, how to inspect the importance of each feature.
Feature importance

```python
# print the feature importances
gbm.feature_importance()
```
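Two small follow-ups, sketched under the assumption of a gbm trained with the multi-class parameters above and the corresponding test features x_test: predict returns one probability per class, and the importances are easier to read when paired with the feature names.

```python
import numpy as np

# predict returns an array of shape (n_samples, num_class) with class probabilities
proba = gbm.predict(x_test, num_iteration=gbm.best_iteration)
y_pred = np.argmax(proba, axis=1)  # take the most probable class

# pair each feature name with its importance (number of splits by default)
for name, imp in zip(gbm.feature_name(), gbm.feature_importance()):
    print(name, imp)
```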
Advanced LightGBM Example
This part goes a step further with LightGBM (again following the official example). It covers:
- setting per-instance sample weights
- three ways to save a model (1. as a txt file, 2. as json, 3. with pickle)
- continuing training from a saved model (reload the saved model and keep training)
- decaying the learning rate during training
- changing other parameters during training
- using a custom objective function and a custom evaluation function during training
- printing feature importances (with user-defined feature names)
Loading the data and setting sample weights
Preparation: import the required libraries.
```python
import json
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
try:
    import cPickle as pickle
except BaseException:
    import pickle
```
First load the training data and assign a weight to every instance.

```python
# load the data
df_train = pd.read_csv('../binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('../binary_classification/binary.test', header=None, sep='\t')
# load the per-instance weights
W_train = pd.read_csv('../binary_classification/binary.train.weight', header=None)[0]
W_test = pd.read_csv('../binary_classification/binary.test.weight', header=None)[0]
```

The weight vectors have the same length as the number of samples, and here every weight is 1. Next, build the corresponding Datasets.

```python
y_train = df_train[0]
y_test = df_test[0]
x_train = df_train.drop(0, axis=1)
x_test = df_test.drop(0, axis=1)
# create the datasets, keeping the raw data so it can be reused later
lgb_train = lgb.Dataset(x_train, y_train, weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(x_test, y_test, weight=W_test, free_raw_data=False, reference=lgb_train)
```
Training and saving the model
Next, train the model briefly and save it. First, the parameters.

```python
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# generate feature names, used to name the features during training
num_train, num_feature = x_train.shape
feature_name = ['feature_' + str(col) for col in range(num_feature)]
print(feature_name)
"""
['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27']
"""
```
Now train; this time the AUC is also reported.

```python
# train the model
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,
                feature_name=feature_name,
                categorical_feature=[21])
```

Next comes saving the model. Three ways are shown here.
- Method 1: save as a txt file

```python
# save the model (method 1)
gbm.save_model('BinaryModel.txt')
# load the saved model
bst = lgb.Booster(model_file='BinaryModel.txt')
# a model loaded from a text file can only predict with the best (or saved) iteration
y_pred = bst.predict(x_test)
```
- Method 2: dump to json

```python
# convert the model to json and write it to a file (method 2)
model_json = gbm.dump_model()
with open('BinaryModel.json', 'w+') as f:
    json.dump(model_json, f, indent=4)
```
- Method 3: pickle

```python
# dump the model with pickle (method 3)
with open('model.pkl', 'wb') as fout:
    pickle.dump(gbm, fout)
# load the pickled model to predict
with open('model.pkl', 'rb') as fin:
    pkl_bst = pickle.load(fin)
# a model loaded with pickle can predict with any iteration; here iteration 7 is used
y_pred = pkl_bst.predict(x_test, num_iteration=7)
```
Continuing training
A saved model can be trained further; pass it to lgb.train as init_model.

```python
# iterations 10-20
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='BinaryModel.txt',
                valid_sets=lgb_eval,
                categorical_feature=[21])
```

The log shows training continuing from iteration 11 to 20, since the first run above already trained 10 iterations.
Decaying the learning rate during training
The learning rate can also be made to shrink gradually over the course of training.

```python
# iterations 20-30
# learning-rate decay
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                valid_sets=lgb_eval,
                categorical_feature=[21])
```

Changing parameters during training
Besides the learning rate, other parameters can be changed during training as well; here the value of bagging_fraction is modified.

```python
# iterations 30-40
# change parameters during training via a callback
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                categorical_feature=[21],
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
```

Custom objective and evaluation functions

```python
# iterations 40-50
# custom objective function (binary log-likelihood loss)
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))  # sigmoid to turn raw scores into probabilities
    grad = preds - labels               # first-order gradient
    hess = preds * (1. - preds)         # second-order gradient (hessian)
    return grad, hess

# custom evaluation function (binary error rate)
def binary_error(preds, train_data):
    labels = train_data.get_label()
    return 'error', np.mean(labels != (preds > 0.5)), False
```
Use the two functions above during training.

```python
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,  # custom objective function
                feval=binary_error,  # custom evaluation function
                valid_sets=lgb_eval,
                categorical_feature=[21])
```
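One caveat worth adding (a sketch, not part of the original example): when a custom fobj is used, the booster's predictions are raw scores rather than probabilities, so the sigmoid has to be applied by hand.

```python
raw_scores = gbm.predict(x_test, num_iteration=gbm.best_iteration)  # raw scores, not probabilities
proba = 1. / (1. + np.exp(-raw_scores))                             # apply the sigmoid ourselves
y_pred = (proba > 0.5).astype(int)                                  # final 0/1 predictions
```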

Analysing feature importance
With all of the above done, we can look at how important each feature is to the model.

```python
# print the feature names
print('Feature names : ', gbm.feature_name())
"""
Feature names : ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27']
"""
```
These feature names are the ones we set at the very beginning. Now print the importance of each feature.

```python
print('Feature importances : ', list(gbm.feature_importance()))
"""
Feature importances : [67, 31, 11, 86, 19, 173, 26, 18, 9, 57, 16, 19, 4, 50, 40, 14, 0, 38, 15, 27, 5, 0, 160, 15, 133, 201, 130, 136]
"""
```
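The same information can also be visualised with the built-in plotting helper (a sketch, assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

lgb.plot_importance(gbm, max_num_features=10)  # bar chart of the 10 most important features
plt.show()
```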
Grid search (LightGBM with scikit-learn)
This part covers combining LightGBM with scikit-learn and using GridSearchCV to search for parameters, again walking through a concrete example.
Example adapted from: Microsoft LightGBM with parameter tuning (~0.823)
Imports

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
```
Load the data and create the Datasets

```python
# load the data
df_train = pd.read_csv('../binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('../binary_classification/binary.test', header=None, sep='\t')
y_train = df_train[0]
y_test = df_test[0]
x_train = df_train.drop(0, axis=1)
x_test = df_test.drop(0, axis=1)
# create the datasets
lgb_train = lgb.Dataset(x_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(x_test, y_test, free_raw_data=False, reference=lgb_train)
```
Define the model parameters and the search space
First store the parameters the model will use in a dict, which makes them easy to reuse later.

```python
params = {'boosting_type': 'gbdt',
          'max_depth': -1,
          'objective': 'binary',
          'nthread': 3,
          'num_leaves': 64,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 1,
          'subsample_freq': 1,
          'colsample_bytree': 0.8,
          'reg_alpha': 5,
          'reg_lambda': 10,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 5,
          'scale_pos_weight': 1,
          'num_class': 1,
          'metric': 'binary_error'}
```
Next, define the ranges of parameters to search over.
```python
gridParams = {
    'learning_rate': [0.005],
    'n_estimators': [40],
    'num_leaves': [6, 8, 12, 16],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'random_state': [501],  # updated from 'seed'
    'colsample_bytree': [0.65, 0.66],
    'subsample': [0.7, 0.75],
    'reg_alpha': [1, 1.2],
    'reg_lambda': [1, 1.2, 1.4],
}
```
Finally, define the estimator. Rather than passing the whole dict in one go, the individual values are taken out of it.

```python
mdl = lgb.LGBMClassifier(boosting_type='gbdt',
                         objective='binary',
                         n_jobs=3,  # updated from 'nthread'
                         silent=True,
                         max_depth=params['max_depth'],
                         max_bin=params['max_bin'],
                         subsample_for_bin=params['subsample_for_bin'],
                         subsample=params['subsample'],
                         subsample_freq=params['subsample_freq'],
                         min_split_gain=params['min_split_gain'],
                         min_child_weight=params['min_child_weight'],
                         min_child_samples=params['min_child_samples'],
                         scale_pos_weight=params['scale_pos_weight'])
```
Running the search and saving the best parameters
Now the parameter search can be run.

```python
# create the grid
grid = GridSearchCV(mdl, gridParams,
                    verbose=0,
                    cv=4,
                    n_jobs=2)
# run the grid search
grid.fit(x_train, y_train)
```
Then we look at the parameters that scored best:

We print the best score and copy the best parameters back into the initial dict.

```python
print(grid.best_params_)
print(grid.best_score_)
"""
{'boosting_type': 'gbdt', 'colsample_bytree': 0.65, 'learning_rate': 0.005, 'n_estimators': 40, 'num_leaves': 16, 'objective': 'binary', 'random_state': 501, 'reg_alpha': 1.2, 'reg_lambda': 1, 'subsample': 0.75}
0.6068571428571429
"""
```
After copying the values over, the params dict can be used directly the next time we train the classifier.

```python
# using the parameters already set above, replace them with the best ones from the grid search
params['colsample_bytree'] = grid.best_params_['colsample_bytree']
params['learning_rate'] = grid.best_params_['learning_rate']
# params['max_bin'] = grid.best_params_['max_bin']
params['num_leaves'] = grid.best_params_['num_leaves']
params['reg_alpha'] = grid.best_params_['reg_alpha']
params['reg_lambda'] = grid.best_params_['reg_lambda']
params['subsample'] = grid.best_params_['subsample']
# params['subsample_for_bin'] = grid.best_params_['subsample_for_bin']
```
Check the final parameter set:

```python
print('Fitting with params: ')
print(params)
"""
Fitting with params:
{'boosting_type': 'gbdt', 'max_depth': -1, 'objective': 'binary', 'nthread': 3, 'num_leaves': 16, 'learning_rate': 0.005, 'max_bin': 512, 'subsample_for_bin': 200, 'subsample': 0.75, 'subsample_freq': 1, 'colsample_bytree': 0.65, 'reg_alpha': 1.2, 'reg_lambda': 1, 'min_split_gain': 0.5, 'min_child_weight': 1, 'min_child_samples': 5, 'scale_pos_weight': 1, 'num_class': 1, 'metric': 'binary_error'}
"""
```
Training with the best parameters and blending the models
Having found the best parameters, we now train with them k times, each time on a different train/validation split, and average the predictions of the k models to get the final result.

```python
# fit k models with early stopping on different training/validation splits
k = 4
predsValid = 0
predsTrain = 0
predsTest = 0
for i in range(0, k):
    print('Fitting model', i)
    # prepare the data for this fold
    XX_train, XX_vali, yy_train, yy_vali = train_test_split(x_train, y_train,
                                                            test_size=0.4)
    lgb_train = lgb.Dataset(XX_train, yy_train)
    lgb_eval = lgb.Dataset(XX_vali, yy_vali, reference=lgb_train)
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    100000,
                    valid_sets=[lgb_train, lgb_eval],
                    early_stopping_rounds=50,
                    verbose_eval=4)
    # plot importance
    lgb.plot_importance(gbm)
    plt.show()
    # predict
    predsTrain += gbm.predict(XX_train,
                              num_iteration=gbm.best_iteration) / k
    predsValid += gbm.predict(XX_vali,
                              num_iteration=gbm.best_iteration) / k
    # averaging over the k folds is effectively an equal-weight blend
    predsTest += gbm.predict(x_test,
                             num_iteration=gbm.best_iteration) / k
```
At the end of every fold, the feature importance plot is shown.

Saving the predictions
Finally, write the blended test predictions to a file.

```python
pd.DataFrame(np.int32(predsTest > 0.5)).to_csv('sub.csv', header=None, index=None)
```
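Since y_test is available in this example, a quick sanity check of the blended predictions is also possible (an illustrative addition, not part of the original kernel):

```python
accuracy = np.mean((predsTest > 0.5) == y_test)
print('Blended test accuracy:', accuracy)
```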
It is still well worth reading the reference material given at the beginning of this post.