重塑层次化索引
层次化索引为DataFrame数据的重排任务提供了一种良好一致性的方式。主要功能:
- stack: 将数据的列”旋转“为行
- unstack:将数据的行“旋转”为列
接下来看一个行列索引均为字符串的例子:
In [169]: data = DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['Ohio','Colorado'],name='sta
...: te'),columns=pd.Index(['one','two','three'],name='number'))
In [170]: data
Out[170]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
将数据转换为行,得到一个Series:
In [171]: result = data.stack()
In [172]: result
Out[172]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
对于一个层次化索引的Series,可以使用unstack
函数将其转换为DataFrame:
In [173]: result.unstack()
Out[173]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
默认情况下,unstack
与stack
操作的是最内层。传入分层级别的编号或者名称即可对其他级别进行操作。
In [174]: result.unstack(0)
Out[174]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [175]: result.unstack('state')
Out[175]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
**如果不是所有的级别值都能在各分组中找到的话,则unstack
操作可能会引入缺失数据:
In [180]: s1 = Series([0,1,2,3], index=['a','b','c','d'])
In [182]: s2 = Series([4,5,6], index=['c','d','e'])
In [183]: data2 = pd.concat([s1,s2], keys=['one','two'])
In [184]: data2
Out[184]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
In [185]: data2.unstack()
Out[185]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
而stack则相反,会默认滤除缺失数据
In [186]: data2.unstack().stack()
Out[186]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
In [187]: data2.unstack().stack(dropna=False)
Out[187]:
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
在对DataFrame进行unstack
操作时,作为旋转轴的级别将会成为结果中最低级别:
In [198]: df.unstack()
Out[198]:
side left right
number one two three one two three
state
Ohio 0 1 2 5 6 7
Colorado 3 4 5 8 9 10
In [192]: df.unstack('state')
Out[192]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
对于这种特性,我是这么理解的。由于数据的特殊性存在着行索引名和列索引名,在unstack
操作时将行“旋转”为列,在经过旋转后number
变成了内层的列索引名。
相反如果将该数据通过stack
方法将列转为行:
In [272]: df.unstack().stack()
Out[272]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
将“长格式”旋转为“宽格式”
多个时间序列数据通常是以“长格式(long)”或’堆叠格式(stacked)"存储在数据库和CSV中。
In [274]: data = pd.read_csv('examples/macrodata.csv')
In [275]: data.head
Out[275]:
<bound method NDFrame.head of year quarter realgdp realcons realinv ... tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 ... 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 ... 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 ... 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 ... 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 ... 3.50 5.2 180.007 2.31 1.19
5 1960.0 2.0 2834.390 1792.9 298.152 ... 2.68 5.2 180.671 0.14 2.55
6 1960.0 3.0 2839.022 1785.8 296.375 ... 2.36 5.6 181.528 2.70 -0.34
7 1960.0 4.0 2802.616 1788.2 259.764 ... 2.29 6.3 182.287 1.21 1.08
8 1961.0 1.0 2819.264 1787.7 266.405 ... 2.37 6.8 182.992 -0.40 2.77
9 1961.0 2.0 2872.005 1814.3 286.246 ... 2.29 7.0 183.691 1.47 0.81
10 1961.0 3.0 2918.419 1823.1 310.227 ... 2.32 6.8 184.524 0.80 1.52
11 1961.0 4.0 2977.830 1859.6 315.463 ... 2.60 6.2 185.242 0.80 1.80
12 1962.0 1.0 3031.241 1879.4 334.271 ... 2.73 5.6 185.874 2.26 0.47
13 1962.0 2.0 3064.709 1902.5 331.039 ... 2.78 5.5 186.538 0.13 2.65
14 1962.0 3.0 3093.047 1917.9 336.962 ... 2.78 5.6 187.323 2.11 0.67
15 1962.0 4.0 3100.563 1945.1 325.650 ... 2.87 5.5 188.013 0.79 2.08
16 1963.0 1.0 3141.087 1958.2 343.721 ... 2.90 5.8 188.580 0.53 2.38
17 1963.0 2.0 3180.447 1976.9 348.730 ... 3.03 5.7 189.242 2.75 0.29
18 1963.0 3.0 3240.332 2003.8 360.102 ... 3.38 5.5 190.028 0.78 2.60
19 1963.0 4.0 3264.967 2020.6 364.534 ... 3.52 5.6 190.668 2.46 1.06
20 1964.0 1.0 3338.246 2060.5 379.523 ... 3.51 5.5 191.245 0.13 3.38
21 1964.0 2.0 3376.587 2096.7 377.778 ... 3.47 5.2 191.889 0.90 2.57
22 1964.0 3.0 3422.469 2135.2 386.754 ... 3.53 5.0 192.631 1.29 2.25
23 1964.0 4.0 3431.957 2141.2 389.910 ... 3.76 5.0 193.223 2.05 1.71
24 1965.0 1.0 3516.251 2188.8 429.145 ... 3.93 4.9 193.709 1.28 2.65
25 1965.0 2.0 3563.960 2213.0 429.119 ... 3.84 4.7 194.303 2.54 1.30
26 1965.0 3.0 3636.285 2251.0 444.444 ... 3.93 4.4 194.997 0.89 3.04
27 1965.0 4.0 3724.014 2314.3 446.493 ... 4.35 4.1 195.539 2.90 1.46
28 1966.0 1.0 3815.423 2348.5 484.244 ... 4.62 3.9 195.999 4.99 -0.37
29 1966.0 2.0 3828.124 2354.5 475.408 ... 4.65 3.8 196.560 2.10 2.55
.. ... ... ... ... ... ... ... ... ... ... ...
173 2002.0 2.0 11538.770 7997.8 1810.779 ... 1.70 5.8 288.028 1.56 0.14
174 2002.0 3.0 11596.430 8052.0 1814.531 ... 1.61 5.7 288.783 2.66 -1.05
175 2002.0 4.0 11598.824 8080.6 1813.219 ... 1.20 5.8 289.421 3.08 -1.88
176 2003.0 1.0 11645.819 8122.3 1813.141 ... 1.14 5.9 290.019 1.31 -0.17
177 2003.0 2.0 11738.706 8197.8 1823.698 ... 0.96 6.2 290.704 1.09 -0.13
178 2003.0 3.0 11935.461 8312.1 1889.883 ... 0.94 6.1 291.449 2.60 -1.67
179 2003.0 4.0 12042.817 8358.0 1959.783 ... 0.90 5.8 292.057 3.02 -2.11
180 2004.0 1.0 12127.623 8437.6 1970.015 ... 0.94 5.7 292.635 2.35 -1.42
181 2004.0 2.0 12213.818 8483.2 2055.580 ... 1.21 5.6 293.310 3.61 -2.41
182 2004.0 3.0 12303.533 8555.8 2082.231 ... 1.63 5.4 294.066 3.58 -1.95
183 2004.0 4.0 12410.282 8654.2 2125.152 ... 2.20 5.4 294.741 2.09 0.11
184 2005.0 1.0 12534.113 8719.0 2170.299 ... 2.69 5.3 295.308 4.15 -1.46
185 2005.0 2.0 12587.535 8802.9 2131.468 ... 3.01 5.1 295.994 1.85 1.16
186 2005.0 3.0 12683.153 8865.6 2154.949 ... 3.52 5.0 296.770 9.14 -5.62
187 2005.0 4.0 12748.699 8888.5 2232.193 ... 4.00 4.9 297.435 0.40 3.60
188 2006.0 1.0 12915.938 8986.6 2264.721 ... 4.51 4.7 298.061 2.60 1.91
189 2006.0 2.0 12962.462 9035.0 2261.247 ... 4.82 4.7 298.766 3.97 0.85
190 2006.0 3.0 12965.916 9090.7 2229.636 ... 4.90 4.7 299.593 -1.58 6.48
191 2006.0 4.0 13060.679 9181.6 2165.966 ... 4.92 4.4 300.320 3.30 1.62
192 2007.0 1.0 13099.901 9265.1 2132.609 ... 4.95 4.5 300.977 4.58 0.36
193 2007.0 2.0 13203.977 9291.5 2162.214 ... 4.72 4.5 301.714 2.75 1.97
194 2007.0 3.0 13321.109 9335.6 2166.491 ... 4.00 4.7 302.509 3.45 0.55
195 2007.0 4.0 13391.249 9363.6 2123.426 ... 3.01 4.8 303.204 6.38 -3.37
196 2008.0 1.0 13366.865 9349.6 2082.886 ... 1.56 4.9 303.803 2.82 -1.26
197 2008.0 2.0 13415.266 9351.0 2026.518 ... 1.74 5.4 304.483 8.53 -6.79
198 2008.0 3.0 13324.600 9267.7 1990.693 ... 1.17 6.0 305.270 -3.16 4.33
199 2008.0 4.0 13141.920 9195.3 1857.661 ... 0.12 6.9 305.952 -8.79 8.91
200 2009.0 1.0 12925.410 9209.2 1558.494 ... 0.22 8.1 306.547 0.94 -0.71
201 2009.0 2.0 12901.504 9189.0 1456.678 ... 0.18 9.2 307.226 3.37 -3.19
202 2009.0 3.0 12990.341 9256.0 1486.398 ... 0.12 9.6 308.013 3.56 -3.44
[203 rows x 14 columns]>
这种类型的数据就是多个时间序列的长格式。接下来要对它进行时间序列规整和数据清洗:
在清洗之前需要把这种数据库格式转换为DataFrame格式以方便处理
我们通过DataFrame的pivot
函数来实现转换
In [282]: periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,name='date')
In [283]: columns = pd.Index(['realgdp','infl','unemp'], name='item')
In [284]: data = data.reindex(columns=columns)
In [285]: data.index = periods.to_timestamp('D','end')
In [286]: ldata = data.stack().reset_index().rename(columns={0: 'value'})
In [288]: pivoted = ldata.pivot('date', 'item', 'value')
In [289]: pivoted.head
Out[289]:
<bound method NDFrame.head of item infl realgdp unemp
date
1959-03-31 23:59:59.999999999 0.00 2710.349 5.8
1959-06-30 23:59:59.999999999 2.34 2778.801 5.1
1959-09-30 23:59:59.999999999 2.74 2775.488 5.3
1959-12-31 23:59:59.999999999 0.27 2785.204 5.6
1960-03-31 23:59:59.999999999 2.31 2847.699 5.2
1960-06-30 23:59:59.999999999 0.14 2834.390 5.2
1960-09-30 23:59:59.999999999 2.70 2839.022 5.6
1960-12-31 23:59:59.999999999 1.21 2802.616 6.3
1961-03-31 23:59:59.999999999 -0.40 2819.264 6.8
1961-06-30 23:59:59.999999999 1.47 2872.005 7.0
1961-09-30 23:59:59.999999999 0.80 2918.419 6.8
1961-12-31 23:59:59.999999999 0.80 2977.830 6.2
1962-03-31 23:59:59.999999999 2.26 3031.241 5.6
1962-06-30 23:59:59.999999999 0.13 3064.709 5.5
1962-09-30 23:59:59.999999999 2.11 3093.047 5.6
1962-12-31 23:59:59.999999999 0.79 3100.563 5.5
1963-03-31 23:59:59.999999999 0.53 3141.087 5.8
1963-06-30 23:59:59.999999999 2.75 3180.447 5.7
1963-09-30 23:59:59.999999999 0.78 3240.332 5.5
1963-12-31 23:59:59.999999999 2.46 3264.967 5.6
1964-03-31 23:59:59.999999999 0.13 3338.246 5.5
1964-06-30 23:59:59.999999999 0.90 3376.587 5.2
1964-09-30 23:59:59.999999999 1.29 3422.469 5.0
1964-12-31 23:59:59.999999999 2.05 3431.957 5.0
1965-03-31 23:59:59.999999999 1.28 3516.251 4.9
1965-06-30 23:59:59.999999999 2.54 3563.960 4.7
1965-09-30 23:59:59.999999999 0.89 3636.285 4.4
1965-12-31 23:59:59.999999999 2.90 3724.014 4.1
1966-03-31 23:59:59.999999999 4.99 3815.423 3.9
1966-06-30 23:59:59.999999999 2.10 3828.124 3.8
... ... ... ...
2002-06-30 23:59:59.999999999 1.56 11538.770 5.8
2002-09-30 23:59:59.999999999 2.66 11596.430 5.7
2002-12-31 23:59:59.999999999 3.08 11598.824 5.8
2003-03-31 23:59:59.999999999 1.31 11645.819 5.9
2003-06-30 23:59:59.999999999 1.09 11738.706 6.2
2003-09-30 23:59:59.999999999 2.60 11935.461 6.1
2003-12-31 23:59:59.999999999 3.02 12042.817 5.8
2004-03-31 23:59:59.999999999 2.35 12127.623 5.7
2004-06-30 23:59:59.999999999 3.61 12213.818 5.6
2004-09-30 23:59:59.999999999 3.58 12303.533 5.4
2004-12-31 23:59:59.999999999 2.09 12410.282 5.4
2005-03-31 23:59:59.999999999 4.15 12534.113 5.3
2005-06-30 23:59:59.999999999 1.85 12587.535 5.1
2005-09-30 23:59:59.999999999 9.14 12683.153 5.0
2005-12-31 23:59:59.999999999 0.40 12748.699 4.9
2006-03-31 23:59:59.999999999 2.60 12915.938 4.7
2006-06-30 23:59:59.999999999 3.97 12962.462 4.7
2006-09-30 23:59:59.999999999 -1.58 12965.916 4.7
2006-12-31 23:59:59.999999999 3.30 13060.679 4.4
2007-03-31 23:59:59.999999999 4.58 13099.901 4.5
2007-06-30 23:59:59.999999999 2.75 13203.977 4.5
2007-09-30 23:59:59.999999999 3.45 13321.109 4.7
2007-12-31 23:59:59.999999999 6.38 13391.249 4.8
2008-03-31 23:59:59.999999999 2.82 13366.865 4.9
2008-06-30 23:59:59.999999999 8.53 13415.266 5.4
2008-09-30 23:59:59.999999999 -3.16 13324.600 6.0
2008-12-31 23:59:59.999999999 -8.79 13141.920 6.9
2009-03-31 23:59:59.999999999 0.94 12925.410 8.1
2009-06-30 23:59:59.999999999 3.37 12901.504 9.2
2009-09-30 23:59:59.999999999 3.56 12990.341 9.6
[203 rows x 3 columns]>
前两个参数值分别用作行和列索引的列名,最后一个参数值则是用于填充DataFrame的数据列的列名。
In [290]: ldata['value2'] = np.random.randn(len(ldata))
In [291]: ldata[:10]
Out[291]:
date item value value2
0 1959-03-31 23:59:59.999999999 realgdp 2710.349 0.648022
1 1959-03-31 23:59:59.999999999 infl 0.000 -0.748731
2 1959-03-31 23:59:59.999999999 unemp 5.800 -1.365055
3 1959-06-30 23:59:59.999999999 realgdp 2778.801 -0.618768
4 1959-06-30 23:59:59.999999999 infl 2.340 -0.014367
5 1959-06-30 23:59:59.999999999 unemp 5.100 -1.255695
6 1959-09-30 23:59:59.999999999 realgdp 2775.488 -0.943252
7 1959-09-30 23:59:59.999999999 infl 2.740 -0.262365
8 1959-09-30 23:59:59.999999999 unemp 5.300 1.050823
9 1959-12-31 23:59:59.999999999 realgdp 2785.204 -1.224203
如果忽略最后一个参数,得到的DataFrame就会带有层次化的列:
In [292]: pivoted = ldata.pivot('date','item')
In [293]: pivoted[:5]
Out[293]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 23:59:59.999999999 0.00 2710.349 5.8 -0.748731 0.648022 -1.365055
1959-06-30 23:59:59.999999999 2.34 2778.801 5.1 -0.014367 -0.618768 -1.255695
1959-09-30 23:59:59.999999999 2.74 2775.488 5.3 -0.262365 -0.943252 1.050823
1959-12-31 23:59:59.999999999 0.27 2785.204 5.6 0.773106 -1.224203 -0.302338
1960-03-31 23:59:59.999999999 2.31 2847.699 5.2 -0.403129 0.791051 -0.018440
In [294]: pivoted['value'][:5]
Out[294]:
item infl realgdp unemp
date
1959-03-31 23:59:59.999999999 0.00 2710.349 5.8
1959-06-30 23:59:59.999999999 2.34 2778.801 5.1
1959-09-30 23:59:59.999999999 2.74 2775.488 5.3
1959-12-31 23:59:59.999999999 0.27 2785.204 5.6
1960-03-31 23:59:59.999999999 2.31 2847.699 5.2