Python数据分析之Pandas-数据文件加载与存储（一）

pandas-read-file

读写文本格式的数据

通过cat 输出文件内容：

In [20]: cat ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

由于文件是csv格式，可以使用read_csv读取文件并返回DataFrame：

In [23]: df = pd.read_csv('ex1.csv')

In [24]: df
Out[24]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

如果想读取没有标题行的文件:

In [26]: cat ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [27]: pd.read_csv('ex2.csv', header=None)
Out[27]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

或者可以直接定义标题

In [28]: pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
Out[28]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

还可以将多个列做成层次化索引

In [29]: cat csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,e,9,10

In [30]: parsed = pd.read_csv('csv_mindex.csv', index_col=['key1','key2'])

In [31]: parsed
Out[31]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  e          9      10

有些表格分隔会比较杂乱，如下面这个:

In [32]: cat ex3.csv
            A              B                C
Aaa -0.264438   -1.026059           -0.619500
bbb 0.927272    0.302904            -0.032399
Ccc -0.265273   -0.386314           -0.217601

可以在读取的时候传入一个正则来调整这些不规则的分隔

In [32]: cat ex3.csv
            A              B                C
Aaa -0.264438   -1.026059           -0.619500
bbb 0.927272    0.302904            -0.032399
Ccc -0.265273   -0.386314           -0.217601
In [33]: result = pd.read_csv('ex3.csv', sep='\s+')

In [34]: result
Out[34]:
            A         B         C
Aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
Ccc -0.265273 -0.386314 -0.217601

如何在读取文件的时候找到缺失值并标记为NaN

In [59]: cat ex5.csv
Something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,-1.#IND,8,world
2,three,7,8,9,10,NULL

In [57]: result = pd.read_csv('ex5.csv')

In [58]: pd.isnull(result)
Out[58]:
   Something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False     True

pandas默认会把NA,-1.#IND,NULL当做是缺失值。
我们也可以自己定义：

In [63]: cat ex5.csv
Something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,-1.#IND,8,world
2,three,7,8,9,10,NULL
4,four,11,12,13,14,空
In [64]: result = pd.read_csv('ex5.csv')

In [65]: result
Out[65]:
  Something   a   b     c   d message
0       one   1   2   3.0   4     NaN
1       two   5   6   NaN   8   world
2     three   7   8   9.0  10     NaN
4      four  11  12  13.0  14       空  #在这里加了条‘空’的字符串

na_values可以接受一组用于表示缺失值的字符串:

In [61]: result = pd.read_csv('ex5.csv', na_values=['空'])

In [62]: result
Out[62]:
  Something   a   b     c   d message
0       one   1   2   3.0   4     NaN
1       two   5   6   NaN   8   world
2     three   7   8   9.0  10     NaN
4      four  11  12  13.0  14     NaN

逐块读取文本文件

如果在处理大型文件的时候，不想直接读取大文件而选择读取部分数据的时候可以通过nrows来读取指定行数

In [69]: pd.read_csv('ex6.csv',nrows=5)
Out[69]:
        One       two     three      four key
0  0.467758 -0.758257 -1.824459 -1.824726   L
1 -0.359277 -1.995273  1.252799 -0.853329   B
2 -0.509925  1.259425  0.559245  0.992874   G
3 -0.509925  1.259425  0.559245  0.992874   R
4 -0.509925  1.259425  0.559245  0.992874   Q

如果想逐块读取行数，可设置chunksize来实现

In [70]: chunker = pd.read_csv('ex6.csv',chunksize=5)

In [71]: chunker
Out[71]: <pandas.io.parsers.TextFileReader at 0x117ef4eb8>

由于read_csv函数返回的是一个TextFileReader对象，所以想获取数据还需要解析它才行。

In [73]: for c in chunker:
    ...:     print(c)
    ...:
        One       two     three      four key
0  0.467758 -0.758257 -1.824459 -1.824726   L
1 -0.359277 -1.995273  1.252799 -0.853329   B
2 -0.509925  1.259425  0.559245  0.992874   G
3 -0.509925  1.259425  0.559245  0.992874   R
4 -0.509925  1.259425  0.559245  0.992874   Q
        One       two     three      four key
5 -0.509925  1.259425  0.559245  0.992874   D
6 -0.509925  1.259425  0.559245  0.992874   A
7 -0.509925  1.259425  0.559245  0.992874   B
8 -0.509925  1.259425  0.559245  0.992874   E
9 -0.509925  1.259425  0.559245  0.992874   C
         One       two     three      four key
10 -0.509925  1.259425  0.559245  0.992874   F
11 -0.509925  1.259425  0.559245  0.992874   M
12 -0.509925  1.259425  0.559245  0.992874   V

将数据写到文本格式中

同样数据也可以被输出为分隔符格式的文本，然后存储在文件中
我们可以利用DataFrame中的to_csv函数来实现

In [76]: data.to_csv('ex5out.csv')

In [77]: cat ex5out.csv
,Something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,7,8,9.0,10,
4,four,11,12,13.0,14,空

当然，也可以使用其他分隔符

In [79]: data.to_csv('ex5out_02.csv',sep='|')

In [80]: cat ex5out_02.csv
|Something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|7|8|9.0|10|
4|four|11|12|13.0|14|空

手工处理分隔符格式

有时候也会遇到read_csv不能读取的文本，如下面这种：

data = pd.read_csv('ex7.csv')
---------------------------------------------------------------------------
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

In [83]: cat ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

由于数据格式不准确导致无法解析，这个时候就需要用到手动去处理了。通过Python自带的csv.reader来解决此问题：

In [85]: f = open('ex7.csv')

In [86]: reader = csv.reader(f)

In [87]: for line in reader:
    ...:     print(line)
    ...:
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

先把数据放进了reader对象中，然后再解决数据不规整的问题：

In [89]: lines = list(csv.reader(open('ex7.csv')))
    ...: header, values = lines[0],lines[1:]
    ...: data_dict = {h: v for h, v in zip(header, zip(*values))}
    ...:

In [90]: data_dict
Out[90]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

通过遍历，把reader中的数据按照正确格式放进了字典data_dict中，把不规整的数据给过滤掉了。
接下来可以把字典转换成DataFrame

In [92]: frame = DataFrame(data_dict)

In [93]: frame
Out[93]:
   a  b  c
0  1  2  3
1  1  2  3

JSON数据

In [102]: obj="""
     ...: {"name":"Wes",
     ...: "places_lived":["United States", "Spain","Germany"],
     ...: "pet":null,
     ...: "siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
     ...:             {"name":"Katie","age":33,"pet":"Cisco"}]
     ...:      }
     ...:      """

In [103]: result = json.loads(obj)

In [104]: result
Out[104]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'},
  {'name': 'Katie', 'age': 33, 'pet': 'Cisco'}]}

将JSON对象转换成DataFrame

In [105]: siblings = DataFrame(result['siblings'], columns=['name','age'])

In [106]: siblings
Out[106]:
    name  age
0  Scott   25
1  Katie   33

书上记载pandas还未完全开发出能高效导入和导出JSON的功能，需要进一步确认。

##XML和HTML:Web信息收集
利用pandas的read_html来读取html，它会自动结合lxml和Beautiful Soup将数据解析为DataFrame。所以在这之前必须安装好它们。

pip3 install lxml
pip3 install beautifulsoup4 html5lib

read_html会尝试解析<table>标签内的表格数据

In [17]: tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [18]: failures = tables[0]

In [19]: failures.head()
Out[19]:
                      Bank Name  ...       Updated Date
0                   Allied Bank  ...  November 17, 2016
1  The Woodbury Banking Company  ...  November 17, 2016
2        First CornerStone Bank  ...  September 6, 2016
3            Trust Company Bank  ...  September 6, 2016
4    North Milwaukee State Bank  ...      June 16, 2016

接下来可以对数据做一下清洗以及分析，从数据中我们分析下从2000-2010倒闭银行的数量：

In [20]: close_stamps = pd.to_datetime(failures['Closing Date']) #把failures['Closing Date']全部转换成了DateTime格式放入了Series中

#因为数据中记录了银行的倒闭时间，也就是说知道了‘Closing Date’中有多少个数据就有多少家银行倒闭了。
# 下面以年的形式统计了每年倒闭的银行数
In [21]: close_stamps.dt.year.value_counts()
Out[21]:
2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

利用lxml.objectify解析XML

解析XML将会是一个非常常见的数据分析操作，以后也会经常遇到这种XML文件。
我们需要用到lxml库中的objectify对象。

In [24]: path = 'Performance_MNR.xml'

In [25]: from lxml import objectify

In [26]: parsed = objectify.parse(open(path))

In [27]: root = parsed.getroot()

In [28]: skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']

In [29]: for elt in root.INDICATOR:
    ...:     el_data = {}
    ...:     for child in elt.getchildren():
    ...:         if child.tag in skip_fields:
    ...:             continue
    ...:         el_data[child.tag] = child.pyval
    ...:     data.append(el_data)
    ...:

In [30]: perf = DataFrame(data)

In [31]: perf
Out[31]:
              AGENCY_NAME            CATEGORY  ... YTD_ACTUAL YTD_TARGET
0    Metro-North Railroad  Service Indicators  ...       96.9         95
1    Metro-North Railroad  Service Indicators  ...         96         95
2    Metro-North Railroad  Service Indicators  ...       96.3         95
3    Metro-North Railroad  Service Indicators  ...       96.8         95
4    Metro-North Railroad  Service Indicators  ...       96.6         95
5    Metro-North Railroad  Service Indicators  ...       96.2         95
6    Metro-North Railroad  Service Indicators  ...       96.2         95
7    Metro-North Railroad  Service Indicators  ...       96.2         95
8    Metro-North Railroad  Service Indicators  ...       95.9         95
9    Metro-North Railroad  Service Indicators  ...         96         95
10   Metro-North Railroad  Service Indicators  ...       96.1         95
11   Metro-North Railroad  Service Indicators  ...         96         95
12   Metro-North Railroad  Service Indicators  ...       92.6       96.2
13   Metro-North Railroad  Service Indicators  ...       94.6       96.2
14   Metro-North Railroad  Service Indicators  ...       95.4       96.2
15   Metro-North Railroad  Service Indicators  ...       95.9       96.2
16   Metro-North Railroad  Service Indicators  ...       96.2       96.2
17   Metro-North Railroad  Service Indicators  ...       96.4       96.2
18   Metro-North Railroad  Service Indicators  ...       96.5       96.2
19   Metro-North Railroad  Service Indicators  ...       96.4       96.2
20   Metro-North Railroad  Service Indicators  ...       96.3       96.2
21   Metro-North Railroad  Service Indicators  ...       96.2       96.2
22   Metro-North Railroad  Service Indicators  ...       96.1       96.2
23   Metro-North Railroad  Service Indicators  ...         96       96.2
24   Metro-North Railroad  Service Indicators  ...         98       96.3
25   Metro-North Railroad  Service Indicators  ...       95.6       96.3
26   Metro-North Railroad  Service Indicators  ...       96.1       96.3
27   Metro-North Railroad  Service Indicators  ...       96.6       96.3
28   Metro-North Railroad  Service Indicators  ...       96.8       96.3
29   Metro-North Railroad  Service Indicators  ...       96.9       96.3
..                    ...                 ...  ...        ...        ...
618  Metro-North Railroad  Service Indicators  ...      95.14
619  Metro-North Railroad  Service Indicators  ...      95.38
620  Metro-North Railroad  Service Indicators  ...       95.7
621  Metro-North Railroad  Service Indicators  ...         96
622  Metro-North Railroad  Service Indicators  ...      96.21
623  Metro-North Railroad  Service Indicators  ...       96.5
624  Metro-North Railroad  Service Indicators  ...      97.95         97
625  Metro-North Railroad  Service Indicators  ...      98.92         97
626  Metro-North Railroad  Service Indicators  ...      99.29         97
627  Metro-North Railroad  Service Indicators  ...      99.47         97
628  Metro-North Railroad  Service Indicators  ...      99.58         97
629  Metro-North Railroad  Service Indicators  ...      98.19         97
630  Metro-North Railroad  Service Indicators  ...      98.46         97
631  Metro-North Railroad  Service Indicators  ...      98.69         97
632  Metro-North Railroad  Service Indicators  ...       98.3         97
633  Metro-North Railroad  Service Indicators  ...      97.55         97
634  Metro-North Railroad  Service Indicators  ...      97.47         97
635  Metro-North Railroad  Service Indicators  ...      96.84         97
636  Metro-North Railroad  Service Indicators  ...        100         97
637  Metro-North Railroad  Service Indicators  ...        100         97
638  Metro-North Railroad  Service Indicators  ...      98.86         97
639  Metro-North Railroad  Service Indicators  ...      98.76         97
640  Metro-North Railroad  Service Indicators  ...      90.91         97
641  Metro-North Railroad  Service Indicators  ...                    97
642  Metro-North Railroad  Service Indicators  ...                    97
643  Metro-North Railroad  Service Indicators  ...                    97
644  Metro-North Railroad  Service Indicators  ...                    97
645  Metro-North Railroad  Service Indicators  ...                    97
646  Metro-North Railroad  Service Indicators  ...                    97
647  Metro-North Railroad  Service Indicators  ...                    97

[648 rows x 12 columns]

**我们也可以通过StringIO 来获取HTML的标记
下面来看一个简单的例子：

In [33]: from io import StringIO

In [36]: tag = '<a href="http://www.googlge.com">Google</a>'

In [37]: root = objectify.parse(StringIO(tag)).getroot()

In [38]: root.get('href')
Out[38]: 'http://www.googlge.com'

In [39]: root.text
Out[39]: 'Google'

loannes's blog