Reading and Writing Data in Text Format
import numpy as np
import pandas as pd
!pwd
/Users/zhangyangfenbi.com/Desktop/code/conda_book
!cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
file_path = 'examples/ex1.csv'
pd.read_csv(file_path)
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
# read_table also works, but its default delimiter is different, so sep must be specified
pd.read_table(file_path, sep=',')
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
!cat examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
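# pandas can assign default column names, or you can define the column names yourself
pd.read_csv('examples/ex2.csv', header=None)
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |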
df2 = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'e'])
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
df2 = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'e'], index_col='e')
| e | a | b | c | d |
|---|---|---|---|---|
| hello | 1 | 2 | 3 | 4 |
| world | 5 | 6 | 7 | 8 |
| foo | 9 | 10 | 11 | 12 |
!cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])
| key1 | key2 | value1 | value2 |
|---|---|---|---|
| one | a | 1 | 2 |
| | b | 3 | 4 |
| | c | 5 | 6 |
| | d | 7 | 8 |
| two | a | 9 | 10 |
| | b | 11 | 12 |
| | c | 13 | 14 |
| | d | 15 | 16 |
# Inspect the raw lines of a whitespace-delimited text file
[' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n']
# Use a regular expression to match a delimiter that is not fixed
| | A | B | C |
|---|---|---|---|
| aaa | -0.264438 | -1.026059 | -0.619500 |
| bbb | 0.927272 | 0.302904 | -0.032399 |
| ccc | -0.264273 | -0.386314 | -0.217601 |
| ddd | -0.871858 | -0.348382 | 1.100491 |
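A minimal sketch of the regular-expression delimiter, assuming the fields are separated by a varying number of spaces. The string `raw` is a hypothetical in-memory stand-in for the file, rebuilt from the values shown in the output above; the original file name is not given here.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the text file: columns are separated by
# a variable amount of whitespace, so no single fixed delimiter works.
raw = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# sep=r"\s+" treats any run of whitespace as one delimiter.
# The header has one field fewer than the data rows, so pandas
# infers that the first column is the index.
df = pd.read_csv(StringIO(raw), sep=r"\s+")
print(df)
```

Because the header row is one field shorter than the data rows, no `index_col` argument is needed in this case.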
!cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# Use skiprows to skip the specified rows
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
!cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
# pandas marks missing values (here, NA and the empty field) as NaN
res1 = pd.read_csv('examples/ex5.csv')
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
res1.isnull()
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | True |
| 1 | False | False | False | True | False | False |
| 2 | False | False | False | False | False | False |
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
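`na_values` also accepts a dict, so each column can have its own missing-value markers. The sketch below assumes a hypothetical `sentinels` mapping and reads the same content as examples/ex5.csv from an in-memory string so that it runs on its own.

```python
from io import StringIO

import pandas as pd

# Same content as examples/ex5.csv, held in memory for a self-contained run.
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Hypothetical per-column sentinels: treat 'foo' and 'NA' as missing in
# 'message', and 'two' as missing in 'something'.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}

df = pd.read_csv(StringIO(raw), na_values=sentinels)
print(df)
```

The default markers (such as an empty field) still apply, so column `c` keeps its NaN in the second row.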
Reading Text Files in Pieces
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
<pandas.io.parsers.TextFileReader at 0x10c0cc828>
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
dtype: float64
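The counts above look like the result of aggregating over the chunks, so here is one way such a loop might be written. The column name `key` and the randomly generated data are assumptions made for illustration; the contents of examples/ex6.csv are not shown here.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical data standing in for a large file: a single column,
# assumed to be named 'key', holding one letter per row.
rng = np.random.default_rng(0)
keys = rng.choice(list("ELOQX"), size=5000)
raw = "key\n" + "\n".join(keys) + "\n"

# With chunksize set, read_csv returns an iterator of DataFrames
# with at most 1000 rows each instead of one large DataFrame.
chunker = pd.read_csv(StringIO(raw), chunksize=1000)

# Accumulate the per-chunk value counts into a running total.
tot = pd.Series(dtype=float)
for piece in chunker:
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

`fill_value=0` matters here: without it, a key missing from either side of the addition would produce NaN instead of a count.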
Writing Data to Text Format
data = pd.read_csv('examples/ex5.csv')
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
pwd
'/Users/zhangyangfenbi.com/Desktop/code/conda_book'
data.to_csv('/Users/zhangyangfenbi.com/Desktop/tmp.csv')
!cat '/Users/zhangyangfenbi.com/Desktop/tmp.csv'
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
dates = pd.date_range('1/1/2000', periods=7)
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
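Output like the above can be produced by writing a Series indexed by `dates`. This is a sketch under assumptions: the name `ts` is hypothetical, and `header=False` is passed explicitly because recent pandas versions write a header row for a Series by default.

```python
import sys

import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=7)

# Hypothetical Series: the integers 0..6 indexed by the seven dates.
ts = pd.Series(np.arange(7), index=dates)

# Series also has to_csv; header=False suppresses the header row.
ts.to_csv(sys.stdout, header=False)
```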
Working with Delimited Formats Manually
import csv
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
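A minimal sketch of reading delimited text with the standard `csv` module, assuming a small quoted file with the three rows printed above. The string `raw` is a hypothetical stand-in for that file.

```python
import csv
from io import StringIO

# Hypothetical stand-in for the file: a header row and two data rows.
raw = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

# csv.reader yields each line as a list of strings, with quotes removed.
lines = list(csv.reader(StringIO(raw)))
for line in lines:
    print(line)

# Split off the header, then transpose the rows into per-column tuples.
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
```

Note that `csv.reader` leaves every field as a string; type conversion has to be done separately.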
JSON Data
# JSON has become a standard format, mainly used for HTTP requests and for sending data between applications
str
import json
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
asjson = json.dumps(res)
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
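The round trip between a JSON string and Python objects can be sketched as follows. The string `obj` is reconstructed from the output shown above, and turning the `siblings` list into a DataFrame is an extra illustration rather than a step taken from the cells here.

```python
import json

import pandas as pd

# JSON text reconstructed from the output above; note null rather than None.
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

# json.loads parses a JSON string into Python dicts, lists and scalars.
res = json.loads(obj)

# json.dumps serializes the Python object back into a JSON string.
asjson = json.dumps(res)

# A list of dicts can be passed to DataFrame, selecting a subset of fields.
siblings = pd.DataFrame(res["siblings"], columns=["name", "age"])
print(siblings)
```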
Web Scraping
The book mainly covers lxml and urllib2.
However, bs4 (Beautiful Soup) and requests are more widely used today, so it is enough to skim this part.
Binary Data Formats
frame = pd.read_csv('examples/ex1.csv')
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
frame.to_pickle('examples/frame_pickle_zhangyang')
pd.read_pickle('examples/frame_pickle_zhangyang')
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
HTML and Web APIs
import requests
<Response [200]>
type(resp.text)
str
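Since `resp.text` is a plain string, a JSON response body can be decoded and loaded into a DataFrame. The sketch below avoids any network access by using a hypothetical string `text` in place of a real response body; the field names `number`, `title` and `state` are invented for illustration and do not come from any particular API.

```python
import json

import pandas as pd

# Hypothetical stand-in for resp.text: a JSON array of objects.
text = (
    '[{"number": 1, "title": "first issue", "state": "open"},'
    ' {"number": 2, "title": "second issue", "state": "closed"}]'
)

# Decode the JSON text into a list of dicts.
records = json.loads(text)

# Build a DataFrame, keeping only the fields of interest.
issues = pd.DataFrame(records, columns=["number", "title", "state"])
print(issues)
```

With a real `requests` response, `resp.json()` performs the decoding step directly.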
Working with Databases
import sqlite3
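A minimal sketch of loading a SQL query result into a DataFrame with `sqlite3`, assuming an in-memory database. The table name `test`, its columns and the rows are hypothetical.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database, so nothing is written to disk.
con = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# read_sql runs the query and returns the result set as a DataFrame,
# taking the column names from the cursor description.
df = pd.read_sql("SELECT * FROM test", con)
con.close()
print(df)
```

For databases other than SQLite, pandas expects a SQLAlchemy connectable rather than a raw DBAPI connection.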