Time Series Processing

王茂南

April 30, 2018, 22:36:03

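Abstract: This article introduces time series analysis with pandas. I am recording it here for my own later reference. Now that I work on more and more things, I find that I forget anything I have not looked at for a month or two, so it is worth writing down.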

This post covers some of the methods pandas provides for processing and analyzing time series.

Time Series Processing

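With time series data, we may run into the following situations:

A stretch of time is missing and needs to be filled in
Two time series are misaligned and need to be aligned
Table a and table b use different time intervals and need to be resampled

This section tries to solve the problems above.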

Timestamp

A timestamp represents a single point in time. We can create one directly with pd.Timestamp().

import pandas as pd

pd.Timestamp('2018/1/1')
>> Timestamp('2018-01-01 00:00:00')

pd.Timestamp('2018/1/1 10:20:32')
>> Timestamp('2018-01-01 10:20:32')
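
Once created, a Timestamp can be inspected and moved around. The following is a minimal sketch of that idea; the variable names are only illustrative and do not come from the original examples.

```python
import pandas as pd

ts = pd.Timestamp('2018/1/1 10:20:32')

# Individual date and time fields are exposed as attributes.
fields = (ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)

# Adding a Timedelta moves the timestamp by a fixed duration.
next_day = ts + pd.Timedelta(days=1)

print(fields)
print(next_day)
```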

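DatetimeIndex

A single time point is a Timestamp, but when time points come as a list, pandas converts them to a DatetimeIndex. In that case we no longer create them with pd.Timestamp() and use pd.to_datetime() instead.

Put simply, pd.to_datetime() turns a series of text strings into datetimes that pandas can recognize. This step matters because dates are written in many different ways when data is stored.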

date_list = ['2017-1-1','2017-1-2','2017-1-3']
pd.to_datetime(date_list)
>>  DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq=None)

For the day-month-year order commonly used in Europe, pass dayfirst=True so that the first field is read as the day.

date_list = ['1-1-2017','1-2-2017','1-3-2017']
pd.to_datetime(date_list,dayfirst=True)
>> DatetimeIndex(['2017-01-01', '2017-02-01', '2017-03-01'], dtype='datetime64[ns]', freq=None)
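
When the layout of the strings is known in advance, an alternative to dayfirst=True is to spell the layout out with the format= argument of pd.to_datetime(). This is only a sketch: the zero-padded input strings are an assumption made for the example.

```python
import pandas as pd

# Day-month-year strings, zero-padded (an assumption for this sketch).
date_list = ['01-01-2017', '01-02-2017', '01-03-2017']

# An explicit format removes any ambiguity about which field is the day.
parsed = pd.to_datetime(date_list, format='%d-%m-%Y')

print(parsed)
```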

For a pandas DataFrame, the conversion can also be driven by column names, as in the following example.

date_list = pd.DataFrame({'year':[2017,2017,2018],'month':[1,2,3],'day':[4,5,6],'hour':[7,8,9]})
pd.to_datetime(date_list)
>>
0   2017-01-04 07:00:00
1   2017-02-05 08:00:00
2   2018-03-06 09:00:00
dtype: datetime64[ns]

To convert a DataFrame like the one above, the columns year, month, and day are required. The columns hour, minute, second, millisecond, microsecond, and nanosecond are optional.

Handling invalid data

There are three ways to handle invalid data:

# Raise an error on invalid data
pd.to_datetime(['2017-1-1', 'invalid'], errors='raise')
ValueError: Unknown string format

# Ignore invalid data and return the input unchanged (errors='ignore' is deprecated since pandas 2.2)
pd.to_datetime(['2017-1-1', 'invalid'], errors='ignore')
>> array(['2017-1-1', 'invalid'], dtype=object)

# Convert invalid data to NaT
pd.to_datetime(['2017-1-1', 'invalid'], errors='coerce')
>> DatetimeIndex(['2017-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)
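
In practice, errors='coerce' is often followed by a step that finds or drops the rows that failed to parse. A minimal sketch, assuming the raw strings live in a Series (the names raw, bad_rows, and clean are made up for this example):

```python
import pandas as pd

raw = pd.Series(['2017-01-01', 'invalid', '2017-01-03'])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors='coerce')

# Look up the original strings that could not be parsed.
bad_rows = raw[parsed.isna()]

# Keep only the successfully parsed timestamps.
clean = parsed.dropna()

print(bad_rows)
print(clean)
```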

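Generating timestamps with pandas.date_range

The default parameters of pandas.date_range are shown below. This is the signature from the pandas version in use when this article was written; later releases replace closed with inclusive.

pandas.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs)

The commonly used parameters are:

start= : the start time
end= : the end time
periods= : the number of timestamps to generate; if None, both start and end must be given.
freq= : the interval between timestamps.
tz= : the time zone.

Of these, freq= is the key parameter. The frequencies we can set include: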

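freq='s': second
freq='min': minute
freq='H': hour
freq='D': calendar day
freq='W': week, anchored on Sunday by default
freq='M': month end (the last calendar day of each month)
freq='BM': business month end (the last business day of each month)

Recent pandas releases (2.2 and later) rename several of these aliases, for example 'H' to 'h', 'M' to 'ME', and 'BM' to 'BME'. The older spellings used in this article are deprecated there.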
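
To see what the weekly and month-end frequencies actually produce, here is a small sketch. It passes offset objects (pd.offsets.MonthEnd and pd.offsets.BMonthEnd) rather than string aliases for the monthly cases, which is a choice made for this example so that it does not depend on how the aliases are spelled in a given pandas version.

```python
import pandas as pd

# 'W' is anchored on Sunday: the first stamp is the first Sunday on or after the start date.
weekly = pd.date_range('2018-01-01', periods=3, freq='W')

# March 2018 ends on a Saturday, so the two month-end rules give different dates.
month_end = pd.date_range('2018-03-01', periods=1, freq=pd.offsets.MonthEnd())
business_month_end = pd.date_range('2018-03-01', periods=1, freq=pd.offsets.BMonthEnd())

print(weekly)
print(month_end[0], business_month_end[0])
```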

Let's look at some examples.

# From 2018-01-01 to 2018-01-02 at hourly intervals
pd.date_range('2018/1/1','2018/1/2',freq='H')
>>
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00', '2018-01-01 03:00:00',
               '2018-01-01 04:00:00', '2018-01-01 05:00:00',
               '2018-01-01 06:00:00', '2018-01-01 07:00:00',
               '2018-01-01 08:00:00', '2018-01-01 09:00:00',
               '2018-01-01 10:00:00', '2018-01-01 11:00:00',
               '2018-01-01 12:00:00', '2018-01-01 13:00:00',
               '2018-01-01 14:00:00', '2018-01-01 15:00:00',
               '2018-01-01 16:00:00', '2018-01-01 17:00:00',
               '2018-01-01 18:00:00', '2018-01-01 19:00:00',
               '2018-01-01 20:00:00', '2018-01-01 21:00:00',
               '2018-01-01 22:00:00', '2018-01-01 23:00:00',
               '2018-01-02 00:00:00'],
              dtype='datetime64[ns]', freq='H')

# Starting at 2018-01-01, generate 10 timestamps at intervals of 1 hour 30 minutes
pd.date_range('2018/1/1',periods=10,freq='1H30min')
>>
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:30:00',
               '2018-01-01 03:00:00', '2018-01-01 04:30:00',
               '2018-01-01 06:00:00', '2018-01-01 07:30:00',
               '2018-01-01 09:00:00', '2018-01-01 10:30:00',
               '2018-01-01 12:00:00', '2018-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

The resulting DatetimeIndex supports selection and slicing.

a = pd.date_range('2018/1/1',periods=10,freq='1H30min')
>>
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:30:00',
               '2018-01-01 03:00:00', '2018-01-01 04:30:00',
               '2018-01-01 06:00:00', '2018-01-01 07:30:00',
               '2018-01-01 09:00:00', '2018-01-01 10:30:00',
               '2018-01-01 12:00:00', '2018-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
a[:5]
>>
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:30:00',
               '2018-01-01 03:00:00', '2018-01-01 04:30:00',
               '2018-01-01 06:00:00'],
              dtype='datetime64[ns]', freq='90T')

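DateOffset objects (shifting everything at once)

Sometimes, after generating timestamps, we need to adjust all of them together. This is what DateOffset is for. Let's see how to use it.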

from pandas import offsets

a = pd.date_range('2018/1/1',periods=10,freq='1H30min')
>>
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:30:00',
               '2018-01-01 03:00:00', '2018-01-01 04:30:00',
               '2018-01-01 06:00:00', '2018-01-01 07:30:00',
               '2018-01-01 09:00:00', '2018-01-01 10:30:00',
               '2018-01-01 12:00:00', '2018-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

# Use a DateOffset to add 1 month + 1 day + 1 hour to every element of a
a + offsets.DateOffset(months=1,days=1,hours=1)
>> 
DatetimeIndex(['2018-02-02 01:00:00', '2018-02-02 02:30:00',
               '2018-02-02 04:00:00', '2018-02-02 05:30:00',
               '2018-02-02 07:00:00', '2018-02-02 08:30:00',
               '2018-02-02 10:00:00', '2018-02-02 11:30:00',
               '2018-02-02 13:00:00', '2018-02-02 14:30:00'],
              dtype='datetime64[ns]', freq='90T')
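
A DateOffset follows the calendar, which is what distinguishes it from a fixed-length Timedelta. The sketch below contrasts the two on a month-end date; the date and variable names are chosen only for illustration.

```python
import pandas as pd

ts = pd.Timestamp('2018-01-31')

# DateOffset(months=1) lands on the last valid day of February.
calendar_shift = ts + pd.DateOffset(months=1)

# Timedelta(days=30) always adds exactly 30 days.
fixed_shift = ts + pd.Timedelta(days=30)

print(calendar_shift)
print(fixed_shift)
```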

Period

We now have a reasonably good picture of Timestamp and DatetimeIndex. pandas also provides Period and PeriodIndex objects, which define a span of time.

# A span of one year
In[104]: pd.Period(2018)
Out[104]: Period('2018', 'A-DEC')

# A span of one month
In[105]: pd.Period('2018/1')
Out[105]: Period('2018-01', 'M')

# A span of one day
In[106]: pd.Period('2018/1/1')
Out[106]: Period('2018-01-01', 'D')

# A span of one hour
In[112]: pd.Period('2018/1/1 01')
Out[112]: Period('2018-01-01 01:00', 'H')

# A span of one minute
In[113]: pd.Period('2018/1/1 01:01')
Out[113]: Period('2018-01-01 01:01', 'T')

# A span of one second
In[114]: pd.Period('2018/1/1 01:01:01')
Out[114]: Period('2018-01-01 01:01:01', 'S')

Similarly, we can generate a sequence of periods with pandas.period_range().

In[115]: pd.period_range('2018/1','2019/1',freq='M')
Out[115]:
PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
             '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',
             '2019-01'],
            dtype='period[M]', freq='M')

So what is the difference between Timestamp and Period?

A Timestamp is a point in time
A Period is a span of time

The following example illustrates this.

In[15]: pd.Period('2017-1-1')
Out[15]: Period('2017-01-01', 'D')

In[16]: pd.Timestamp('2017-1-1')
Out[16]: Timestamp('2017-01-01 00:00:00')

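As you can see, the first (the Period) represents the whole day 2017-01-01, while the second (the Timestamp) represents only the instant 2017-01-01 00:00:00.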
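
The relationship between the two can also be checked in code. This is a minimal sketch using the start_time and end_time attributes of a Period and the to_period() method of a Timestamp; the afternoon timestamp is an arbitrary choice for the example.

```python
import pandas as pd

day = pd.Period('2017-1-1', freq='D')
moment = pd.Timestamp('2017-1-1 15:30:00')

# A Period has a first and a last instant; a Timestamp is a single instant.
inside = day.start_time <= moment <= day.end_time

# Converting the Timestamp to a daily Period gives back the same day.
same_day = moment.to_period('D') == day

print(day.start_time, day.end_time)
print(inside, same_day)
```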

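Indexing time series

So far we have been discussing how to build time indexes. What is such an index good for? Mainly, it makes time series data easier to work with.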

In[116]: index = pd.date_range('2018/1/1',periods=20,freq='M')

In[117]: index
Out[117]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31',
               '2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31'],
              dtype='datetime64[ns]', freq='M')

In[118]: import numpy as np

In[119]: data = pd.Series(np.random.random(len(index)),index=index)

In[120]: data
Out[120]:
2018-01-31    0.286474
2018-02-28    0.186806
2018-03-31    0.510955
2018-04-30    0.595571
2018-05-31    0.341229
2018-06-30    0.015576
2018-07-31    0.561802
2018-08-31    0.725536
2018-09-30    0.261595
2018-10-31    0.853460
2018-11-30    0.953558
2018-12-31    0.435351
2019-01-31    0.851568
2019-02-28    0.451887
2019-03-31    0.200710
2019-04-30    0.343604
2019-05-31    0.567475
2019-06-30    0.635831
2019-07-31    0.932632
2019-08-31    0.930855
Freq: M, dtype: float64

Now let's retrieve some data.

# Retrieve the data for 2018
In[121]: data['2018']
Out[121]:
2018-01-31    0.286474
2018-02-28    0.186806
2018-03-31    0.510955
2018-04-30    0.595571
2018-05-31    0.341229
2018-06-30    0.015576
2018-07-31    0.561802
2018-08-31    0.725536
2018-09-30    0.261595
2018-10-31    0.853460
2018-11-30    0.953558
2018-12-31    0.435351
Freq: M, dtype: float64

# Retrieve 2018/1 through 2018/3
In[124]: data['2018/01':'2018/03']
Out[124]:
2018-01-31    0.286474
2018-02-28    0.186806
2018-03-31    0.510955
Freq: M, dtype: float64
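
The same partial-string lookups can be written with .loc, which makes it explicit that the selection is by index label. The sketch below uses deterministic values instead of random ones, and passes a pd.offsets.MonthEnd() object as the frequency; both are choices made for this example rather than part of the session above.

```python
import pandas as pd

# Twenty month-end dates starting at 2018-01-31, with values 0..19.
index = pd.date_range('2018-01-31', periods=20, freq=pd.offsets.MonthEnd())
data = pd.Series(range(20), index=index)

# All rows whose date falls in 2018.
year_2018 = data.loc['2018']

# Rows from January through March 2018; both ends of the slice are included.
first_quarter = data.loc['2018-01':'2018-03']

print(len(year_2018))
print(first_quarter)
```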

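Shifting time data

With time series data we sometimes need to shift values in time, for example moving the value at 2018-1-1 to 2018-1-3, a shift of two days. Let's see how to do this.

First we generate some data.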

In[127]: index = pd.date_range('2018/1/1',periods=5,freq='M')

In[128]: index
Out[128]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')

In[130]: data = pd.Series(np.random.random(len(index)),index=index)

In[131]: data
Out[131]:
2018-01-31    0.950807
2018-02-28    0.439691
2018-03-31    0.171534
2018-04-30    0.545591
2018-05-31    0.867047
Freq: M, dtype: float64

# Shift the values two periods later; the index stays the same
In[132]: data.shift(2)
Out[132]:
2018-01-31         NaN
2018-02-28         NaN
2018-03-31    0.950807
2018-04-30    0.439691
2018-05-31    0.171534
Freq: M, dtype: float64

# Shift the values two periods earlier; the index stays the same
In[133]: data.shift(-2)
Out[133]:
2018-01-31    0.171534
2018-02-28    0.545591
2018-03-31    0.867047
2018-04-30         NaN
2018-05-31         NaN
Freq: M, dtype: float64      

# Shift by two days: with freq given, the index itself moves and no NaN is introduced
In[134]: data.shift(2,freq='D')
Out[134]:
2018-02-02    0.950807
2018-03-02    0.439691
2018-04-02    0.171534
2018-05-02    0.545591
2018-06-02    0.867047
dtype: float64
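
A common use of shift() is comparing each value with the one before it. The following sketch computes a day-over-day change on a small hand-made series; the numbers are invented for the example.

```python
import pandas as pd

index = pd.date_range('2018-01-01', periods=4, freq='D')
data = pd.Series([10.0, 12.0, 15.0, 11.0], index=index)

# shift(1) lines each value up with the previous day's value,
# so the subtraction gives the change from one day to the next.
change = data - data.shift(1)

print(change)
```

The first entry is NaN because there is no earlier value to compare with. For this particular case, data.diff() gives the same result.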

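Resampling time data

Besides shifting, resampling is also used frequently. resample can raise or lower the frequency of a time-indexed series, which is very useful. For example, when a time series is very large, sampling it at a lower frequency yields a smaller dataset that still covers the same time range. Resampling is also an important tool for aligning several datasets recorded at different frequencies.

First we generate the data to work with.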

In[135]: date_list = pd.date_range('2017-1-1',periods=20,freq='D')
     ...: data = pd.Series(np.random.rand(len(date_list)),index=date_list)
     ...:

In[136]: data
Out[136]:
2017-01-01    0.341977
2017-01-02    0.016024
2017-01-03    0.056727
2017-01-04    0.305734
2017-01-05    0.458855
2017-01-06    0.518707
2017-01-07    0.183312
2017-01-08    0.811064
2017-01-09    0.834227
2017-01-10    0.452781
2017-01-11    0.469344
2017-01-12    0.197777
2017-01-13    0.434135
2017-01-14    0.924444
2017-01-15    0.611765
2017-01-16    0.945784
2017-01-17    0.772590
2017-01-18    0.623635
2017-01-19    0.132294
2017-01-20    0.618067
Freq: D, dtype: float64

Now let's see how to resample.

# Downsample to 2-day bins and use the sum of each bin as the new value
In[137]: data.resample('2D').sum()
Out[137]:
2017-01-01    0.358000
2017-01-03    0.362462
2017-01-05    0.977562
2017-01-07    0.994376
2017-01-09    1.287008
2017-01-11    0.667121
2017-01-13    1.358579
2017-01-15    1.557549
2017-01-17    1.396225
2017-01-19    0.750360
dtype: float64

# Downsample to 2-day bins and use the mean of each bin as the new value
In[138]: data.resample('2D').mean()
Out[138]:
2017-01-01    0.179000
2017-01-03    0.181231
2017-01-05    0.488781
2017-01-07    0.497188
2017-01-09    0.643504
2017-01-11    0.333560
2017-01-13    0.679289
2017-01-15    0.778775
2017-01-17    0.698113
2017-01-19    0.375180
dtype: float64

# Downsample to 2-day bins and use the maximum of each bin as the new value
In[139]: data.resample('2D').max()
Out[139]:
2017-01-01    0.341977
2017-01-03    0.305734
2017-01-05    0.518707
2017-01-07    0.811064
2017-01-09    0.834227
2017-01-11    0.469344
2017-01-13    0.924444
2017-01-15    0.945784
2017-01-17    0.772590
2017-01-19    0.618067
dtype: float64

# Downsample to 2-day bins and list the first (open), maximum (high), minimum (low), and last (close) value of each bin
In[140]: data.resample('2D').ohlc()
Out[140]:
                open      high       low     close
2017-01-01  0.341977  0.341977  0.016024  0.016024
2017-01-03  0.056727  0.305734  0.056727  0.305734
2017-01-05  0.458855  0.518707  0.458855  0.518707
2017-01-07  0.183312  0.811064  0.183312  0.811064
2017-01-09  0.834227  0.834227  0.452781  0.452781
2017-01-11  0.469344  0.469344  0.197777  0.197777
2017-01-13  0.434135  0.924444  0.434135  0.924444
2017-01-15  0.611765  0.945784  0.611765  0.945784
2017-01-17  0.772590  0.772590  0.623635  0.623635
2017-01-19  0.132294  0.618067  0.132294  0.618067
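
To make the meaning of the four ohlc() columns concrete, the sketch below compares them with the first, max, min, and last aggregations on a tiny series. The values are invented so that the result is easy to check by hand.

```python
import pandas as pd

index = pd.date_range('2017-01-01', periods=4, freq='D')
data = pd.Series([3.0, 1.0, 2.0, 5.0], index=index)

# open/high/low/close for each 2-day bin.
bars = data.resample('2D').ohlc()

# The same four numbers, computed with explicit aggregations.
manual = data.resample('2D').agg(['first', 'max', 'min', 'last'])

print(bars)
print(manual)
```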

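The examples above lower the frequency (downsampling). The following ones raise it (upsampling). These outputs appear to come from a separate session in which data was regenerated, so the values differ from those shown above.

# Raise the frequency from days to hours and forward-fill the new rows with the preceding value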
In[11]: data.resample('H').ffill()
Out[11]:
2017-01-01 00:00:00    0.384984
2017-01-01 01:00:00    0.384984
2017-01-01 02:00:00    0.384984
2017-01-01 03:00:00    0.384984
2017-01-01 04:00:00    0.384984

                         ...

2017-01-19 21:00:00    0.143571
2017-01-19 22:00:00    0.143571
2017-01-19 23:00:00    0.143571
2017-01-20 00:00:00    0.837088
Freq: H, Length: 457, dtype: float64

# Raise the frequency from days to hours without filling the new rows
In[12]: data.resample('H').asfreq()
Out[12]:
2017-01-01 00:00:00    0.384984
2017-01-01 01:00:00         NaN
2017-01-01 02:00:00         NaN

                         ...

2017-01-19 23:00:00         NaN
2017-01-20 00:00:00    0.837088
Freq: H, Length: 457, dtype: float64

# Raise the frequency from days to hours, forward-filling at most 3 new rows after each original value
In[13]: data.resample('H').ffill(limit=3)
Out[13]:
2017-01-01 00:00:00    0.384984
2017-01-01 01:00:00    0.384984
2017-01-01 02:00:00    0.384984
2017-01-01 03:00:00    0.384984
2017-01-01 04:00:00         NaN
2017-01-01 05:00:00         NaN

                         ...

2017-01-19 21:00:00         NaN
2017-01-19 22:00:00         NaN
2017-01-19 23:00:00         NaN
2017-01-20 00:00:00    0.837088
Freq: H, Length: 457, dtype: float64
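
Coming back to the problems listed at the start (a missing stretch of time, and two tables with different intervals), the pieces above can be combined roughly as follows. This is only a sketch under assumptions: the two series a and b, their values, and the choice of mean and forward fill are all made up for illustration, and other aggregation or fill rules may suit real data better.

```python
import pandas as pd

# Hypothetical series a: one reading every 30 minutes.
a = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range('2018-01-01', periods=4, freq='30min'))

# Hypothetical series b: one reading every 60 minutes, with the 01:00 reading missing.
b = pd.Series([10.0, 30.0],
              index=pd.to_datetime(['2018-01-01 00:00', '2018-01-01 02:00']))

# Downsample a to 60-minute means.
a_hourly = a.resample('60min').mean()

# Put b on a regular 60-minute grid, then forward-fill the gap.
b_hourly = b.resample('60min').asfreq().ffill()

# Combine on the shared time index; times present in only one series get NaN in the other.
aligned = pd.concat({'a': a_hourly, 'b': b_hourly}, axis=1)

print(aligned)
```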