读写文本格式的数据
通过cat
输出文件内容:
In [20]: cat ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
由于文件是csv
格式,可以使用read_csv
读取文件并返回DataFrame
:
In [23]: df = pd.read_csv('ex1.csv')
In [24]: df
Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
如果想读取没有标题行的文件:
In [26]: cat ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [27]: pd.read_csv('ex2.csv', header=None)
Out[27]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
或者可以直接定义标题
In [28]: pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
Out[28]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
还可以将多个列做成层次化索引
In [29]: cat csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,e,9,10
In [30]: parsed = pd.read_csv('csv_mindex.csv', index_col=['key1','key2'])
In [31]: parsed
Out[31]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two e 9 10
有些表格分隔会比较杂乱,如下面这个:
In [32]: cat ex3.csv
A B C
Aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
Ccc -0.265273 -0.386314 -0.217601
可以在读取的时候传入一个正则来调整这些不规则的分隔
In [32]: cat ex3.csv
A B C
Aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
Ccc -0.265273 -0.386314 -0.217601
In [33]: result = pd.read_csv('ex3.csv', sep='\s+')
In [34]: result
Out[34]:
A B C
Aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
Ccc -0.265273 -0.386314 -0.217601
如何在读取文件的时候找到缺失值并标记为NaN
In [59]: cat ex5.csv
Something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,-1.#IND,8,world
2,three,7,8,9,10,NULL
In [57]: result = pd.read_csv('ex5.csv')
In [58]: pd.isnull(result)
Out[58]:
Something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False True
pandas默认会把NA
,-1.#IND
,NULL
当做是缺失值。
我们也可以自己定义:
In [63]: cat ex5.csv
Something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,-1.#IND,8,world
2,three,7,8,9,10,NULL
4,four,11,12,13,14,空
In [64]: result = pd.read_csv('ex5.csv')
In [65]: result
Out[65]:
Something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 7 8 9.0 10 NaN
4 four 11 12 13.0 14 空 #在这里加了条‘空’的字符串
na_values
可以接受一组用于表示缺失值的字符串:
In [61]: result = pd.read_csv('ex5.csv', na_values=['空'])
In [62]: result
Out[62]:
Something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 7 8 9.0 10 NaN
4 four 11 12 13.0 14 NaN
逐块读取文本文件
如果在处理大型文件的时候,不想直接读取大文件而选择读取部分数据的时候可以通过nrows
来读取指定行数
In [69]: pd.read_csv('ex6.csv',nrows=5)
Out[69]:
One two three four key
0 0.467758 -0.758257 -1.824459 -1.824726 L
1 -0.359277 -1.995273 1.252799 -0.853329 B
2 -0.509925 1.259425 0.559245 0.992874 G
3 -0.509925 1.259425 0.559245 0.992874 R
4 -0.509925 1.259425 0.559245 0.992874 Q
如果想逐块读取行数,可设置chunksize来实现
In [70]: chunker = pd.read_csv('ex6.csv',chunksize=5)
In [71]: chunker
Out[71]: <pandas.io.parsers.TextFileReader at 0x117ef4eb8>
由于read_csv
函数返回的是一个TextFileReader
对象,所以想获取数据还需要解析它才行。
In [73]: for c in chunker:
...: print(c)
...:
One two three four key
0 0.467758 -0.758257 -1.824459 -1.824726 L
1 -0.359277 -1.995273 1.252799 -0.853329 B
2 -0.509925 1.259425 0.559245 0.992874 G
3 -0.509925 1.259425 0.559245 0.992874 R
4 -0.509925 1.259425 0.559245 0.992874 Q
One two three four key
5 -0.509925 1.259425 0.559245 0.992874 D
6 -0.509925 1.259425 0.559245 0.992874 A
7 -0.509925 1.259425 0.559245 0.992874 B
8 -0.509925 1.259425 0.559245 0.992874 E
9 -0.509925 1.259425 0.559245 0.992874 C
One two three four key
10 -0.509925 1.259425 0.559245 0.992874 F
11 -0.509925 1.259425 0.559245 0.992874 M
12 -0.509925 1.259425 0.559245 0.992874 V
将数据写到文本格式中
同样数据也可以被输出为分隔符格式的文本,然后存储在文件中
我们可以利用DataFrame
中的to_csv
函数来实现
In [76]: data.to_csv('ex5out.csv')
In [77]: cat ex5out.csv
,Something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,7,8,9.0,10,
4,four,11,12,13.0,14,空
当然,也可以使用其他分隔符
In [79]: data.to_csv('ex5out_02.csv',sep='|')
In [80]: cat ex5out_02.csv
|Something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|7|8|9.0|10|
4|four|11|12|13.0|14|空
手工处理分隔符格式
有时候也会遇到read_csv
不能读取的文本,如下面这种:
data = pd.read_csv('ex7.csv')
---------------------------------------------------------------------------
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
In [83]: cat ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"
由于数据格式不准确导致无法解析,这个时候就需要用到手动去处理了。通过Python自带的csv.reader
来解决此问题:
In [85]: f = open('ex7.csv')
In [86]: reader = csv.reader(f)
In [87]: for line in reader:
...: print(line)
...:
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
先把数据放进了reader
对象中,然后再解决数据不规整的问题:
In [89]: lines = list(csv.reader(open('ex7.csv')))
...: header, values = lines[0],lines[1:]
...: data_dict = {h: v for h, v in zip(header, zip(*values))}
...:
In [90]: data_dict
Out[90]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
通过遍历,把reader
中的数据按照正确格式放进了字典data_dict
中,把不规整的数据给过滤掉了。
接下来可以把字典转换成DataFrame
In [92]: frame = DataFrame(data_dict)
In [93]: frame
Out[93]:
a b c
0 1 2 3
1 1 2 3
JSON数据
In [102]: obj="""
...: {"name":"Wes",
...: "places_lived":["United States", "Spain","Germany"],
...: "pet":null,
...: "siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
...: {"name":"Katie","age":33,"pet":"Cisco"}]
...: }
...: """
In [103]: result = json.loads(obj)
In [104]: result
Out[104]:
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'},
{'name': 'Katie', 'age': 33, 'pet': 'Cisco'}]}
将JSON对象转换成DataFrame
In [105]: siblings = DataFrame(result['siblings'], columns=['name','age'])
In [106]: siblings
Out[106]:
name age
0 Scott 25
1 Katie 33
书上记载pandas还未完全开发出能高效导入和导出JSON的功能,需要进一步确认。
##XML和HTML:Web信息收集
利用pandas的read_html
来读取html,它会自动结合lxml
和Beautiful Soup
将数据解析为DataFrame
。所以在这之前必须安装好它们。
pip3 install lxml
pip3 install beautifulsoup4 html5lib
read_html
会尝试解析<table>
标签内的表格数据
In [17]: tables = pd.read_html('examples/fdic_failed_bank_list.html')
In [18]: failures = tables[0]
In [19]: failures.head()
Out[19]:
Bank Name ... Updated Date
0 Allied Bank ... November 17, 2016
1 The Woodbury Banking Company ... November 17, 2016
2 First CornerStone Bank ... September 6, 2016
3 Trust Company Bank ... September 6, 2016
4 North Milwaukee State Bank ... June 16, 2016
接下来可以对数据做一下清洗以及分析,从数据中我们分析下从2000-2010倒闭银行的数量:
In [20]: close_stamps = pd.to_datetime(failures['Closing Date']) #把failures['Closing Date']全部转换成了DateTime格式放入了Series中
#因为数据中记录了银行的倒闭时间,也就是说知道了‘Closing Date’中有多少个数据就有多少家银行倒闭了。
# 下面以年的形式统计了每年倒闭的银行数
In [21]: close_stamps.dt.year.value_counts()
Out[21]:
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, dtype: int64
利用lxml.objectify解析XML
解析XML将会是一个非常常见的数据分析操作,以后也会经常遇到这种XML文件。
我们需要用到lxml
库中的objectify
对象。
In [24]: path = 'Performance_MNR.xml'
In [25]: from lxml import objectify
In [26]: parsed = objectify.parse(open(path))
In [27]: root = parsed.getroot()
In [28]: skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']
In [29]: for elt in root.INDICATOR:
...: el_data = {}
...: for child in elt.getchildren():
...: if child.tag in skip_fields:
...: continue
...: el_data[child.tag] = child.pyval
...: data.append(el_data)
...:
In [30]: perf = DataFrame(data)
In [31]: perf
Out[31]:
AGENCY_NAME CATEGORY ... YTD_ACTUAL YTD_TARGET
0 Metro-North Railroad Service Indicators ... 96.9 95
1 Metro-North Railroad Service Indicators ... 96 95
2 Metro-North Railroad Service Indicators ... 96.3 95
3 Metro-North Railroad Service Indicators ... 96.8 95
4 Metro-North Railroad Service Indicators ... 96.6 95
5 Metro-North Railroad Service Indicators ... 96.2 95
6 Metro-North Railroad Service Indicators ... 96.2 95
7 Metro-North Railroad Service Indicators ... 96.2 95
8 Metro-North Railroad Service Indicators ... 95.9 95
9 Metro-North Railroad Service Indicators ... 96 95
10 Metro-North Railroad Service Indicators ... 96.1 95
11 Metro-North Railroad Service Indicators ... 96 95
12 Metro-North Railroad Service Indicators ... 92.6 96.2
13 Metro-North Railroad Service Indicators ... 94.6 96.2
14 Metro-North Railroad Service Indicators ... 95.4 96.2
15 Metro-North Railroad Service Indicators ... 95.9 96.2
16 Metro-North Railroad Service Indicators ... 96.2 96.2
17 Metro-North Railroad Service Indicators ... 96.4 96.2
18 Metro-North Railroad Service Indicators ... 96.5 96.2
19 Metro-North Railroad Service Indicators ... 96.4 96.2
20 Metro-North Railroad Service Indicators ... 96.3 96.2
21 Metro-North Railroad Service Indicators ... 96.2 96.2
22 Metro-North Railroad Service Indicators ... 96.1 96.2
23 Metro-North Railroad Service Indicators ... 96 96.2
24 Metro-North Railroad Service Indicators ... 98 96.3
25 Metro-North Railroad Service Indicators ... 95.6 96.3
26 Metro-North Railroad Service Indicators ... 96.1 96.3
27 Metro-North Railroad Service Indicators ... 96.6 96.3
28 Metro-North Railroad Service Indicators ... 96.8 96.3
29 Metro-North Railroad Service Indicators ... 96.9 96.3
.. ... ... ... ... ...
618 Metro-North Railroad Service Indicators ... 95.14
619 Metro-North Railroad Service Indicators ... 95.38
620 Metro-North Railroad Service Indicators ... 95.7
621 Metro-North Railroad Service Indicators ... 96
622 Metro-North Railroad Service Indicators ... 96.21
623 Metro-North Railroad Service Indicators ... 96.5
624 Metro-North Railroad Service Indicators ... 97.95 97
625 Metro-North Railroad Service Indicators ... 98.92 97
626 Metro-North Railroad Service Indicators ... 99.29 97
627 Metro-North Railroad Service Indicators ... 99.47 97
628 Metro-North Railroad Service Indicators ... 99.58 97
629 Metro-North Railroad Service Indicators ... 98.19 97
630 Metro-North Railroad Service Indicators ... 98.46 97
631 Metro-North Railroad Service Indicators ... 98.69 97
632 Metro-North Railroad Service Indicators ... 98.3 97
633 Metro-North Railroad Service Indicators ... 97.55 97
634 Metro-North Railroad Service Indicators ... 97.47 97
635 Metro-North Railroad Service Indicators ... 96.84 97
636 Metro-North Railroad Service Indicators ... 100 97
637 Metro-North Railroad Service Indicators ... 100 97
638 Metro-North Railroad Service Indicators ... 98.86 97
639 Metro-North Railroad Service Indicators ... 98.76 97
640 Metro-North Railroad Service Indicators ... 90.91 97
641 Metro-North Railroad Service Indicators ... 97
642 Metro-North Railroad Service Indicators ... 97
643 Metro-North Railroad Service Indicators ... 97
644 Metro-North Railroad Service Indicators ... 97
645 Metro-North Railroad Service Indicators ... 97
646 Metro-North Railroad Service Indicators ... 97
647 Metro-North Railroad Service Indicators ... 97
[648 rows x 12 columns]
**我们也可以通过StringIO
来获取HTML
的标记
下面来看一个简单的例子:
In [33]: from io import StringIO
In [36]: tag = '<a href="http://www.googlge.com">Google</a>'
In [37]: root = objectify.parse(StringIO(tag)).getroot()
In [38]: root.get('href')
Out[38]: 'http://www.googlge.com'
In [39]: root.text
Out[39]: 'Google'