Visualizing linear relationships

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

tips = sns.load_dataset("tips")

Plotting linear regression

Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, regplot() and lmplot(), are closely related and share much of their core functionality. It is important to understand the ways they differ, however, so that you can quickly choose the correct tool for a particular job.

# Start with the simplest example.
# The confidence interval defaults to 95%.

sns.regplot(x='total_bill',y='tip',data=tips)

<matplotlib.axes._subplots.AxesSubplot at 0x110f64d68>

sns.lmplot(x='total_bill',y='tip',data=tips)

<seaborn.axisgrid.FacetGrid at 0x110f15780>

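regplot() accepts x and y in a variety of formats (NumPy arrays, pandas Series, or column names of a DataFrame passed as data), whereas lmplot() requires data and takes x and y as column-name strings, that is, long-form ("tidy") input. Apart from this difference, regplot() offers a subset of lmplot()'s features.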
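The difference in accepted input can be sketched as follows. This is a minimal illustration on synthetic data; the variable names (x, y, frame) and the Agg backend are assumptions made so the snippet runs without a display, not part of the original notes.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is needed
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(size=50)

# regplot() takes raw arrays directly and returns a matplotlib Axes.
ax = sns.regplot(x=x, y=y)

# lmplot() needs a DataFrame plus column names and returns a FacetGrid.
frame = pd.DataFrame({"x": x, "y": y})
grid = sns.lmplot(x="x", y="y", data=frame)
```

Because lmplot() returns a FacetGrid, it is the one to reach for when the regression should be split across panels with col or row.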

sns.lmplot(x='size',y='tip',data=tips)

<seaborn.axisgrid.FacetGrid at 0x112574400>

Fitting different kinds of models

anscombe = sns.load_dataset("anscombe")
anscombe.head()

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

sns.lmplot(x='x',y='y',data=anscombe.query("dataset == 'I'"),ci=None,scatter_kws={"s": 80})

<seaborn.axisgrid.FacetGrid at 0x1a1e166978>

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           ci=None, scatter_kws={"s": 80})

# The linear fit above is clearly inadequate. The relationship looks polynomial,
# so pass order=2, which makes seaborn fit the curve with numpy.polyfit().
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),order = 2 ,
           ci=None, scatter_kws={"s": 80})

<seaborn.axisgrid.FacetGrid at 0x1a1e3b3cc0>
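What a higher polynomial order buys can be checked directly with numpy.polyfit(). The sketch below uses a made-up exact parabola standing in for dataset II (the numbers are assumptions, not the Anscombe values) and compares the residual sum of squares of a degree-1 and a degree-2 fit.

```python
import numpy as np

# Hypothetical data: an exact parabola, standing in for Anscombe's dataset II.
x = np.arange(4.0, 15.0)
y = -0.1 * (x - 11.0) ** 2 + 9.0

def rss(order):
    """Residual sum of squares of a degree-`order` polynomial fit."""
    coefficients = np.polyfit(x, y, order)
    return float(np.sum((np.polyval(coefficients, x) - y) ** 2))

linear_rss = rss(1)      # a straight line cannot follow the curvature
quadratic_rss = rss(2)   # a parabola reproduces the data almost exactly
print(linear_rss, quadratic_rss)
```

The quadratic residual collapses to numerical noise while the linear one stays large, which is the same effect the order=2 plot shows visually.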

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           ci=None, scatter_kws={"s": 80});

## An outlier is present

# robust=True fits with a different loss function that downweights relatively large residuals
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           robust=True, ci=None, scatter_kws={"s": 80});

Visualizing the distribution of a dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

Univariate distributions

distplot() draws a histogram and fits a kernel density estimate (KDE).

x=np.random.normal(size=100)
sns.distplot(x)

# Disable the KDE
sns.distplot(x,kde=False,rug=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a21577f60>

# Seaborn guesses a default number of bins, but it is usually better to choose it explicitly
sns.distplot(x,bins=20,kde=False,rug=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a215fdbe0>

Kernel density estimation

The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plots encode the density of observations on one axis with height along the other axis:

Put simply, it shows the density of the data.
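To make the idea concrete, a Gaussian KDE can be written by hand: place one small Gaussian bump on every observation and average them. This is a minimal sketch; the helper name gaussian_kde_1d and the fixed bandwidth of 0.4 are assumptions for illustration, and seaborn's own bandwidth selection may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(size=100)
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density integrates to 1; a Riemann sum over the grid approximates that.
area = float(density.sum() * (grid[1] - grid[0]))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely.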

sns.distplot(x,hist=False,rug=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a216fada0>

sns.kdeplot(x,shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a2181cd68>

Bivariate distributions

It can also be useful to visualize a bivariate distribution of two variables. The easiest way to do this in seaborn is to just use the jointplot() function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables along with the univariate (or marginal) distribution of each on separate axes.

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()

	x	y
0	1.620467	2.511505
1	-0.529253	0.247477
2	-1.361914	0.225665
3	-1.188358	0.785273
4	1.158663	0.180673

sns.jointplot(x='x',y='y',data=df)

<seaborn.axisgrid.JointGrid at 0x1a21b3a668>

x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k")


Kernel density estimation

This works much like the univariate kernel density estimate.

sns.jointplot(x='x',y='y',kind='kde',data=df)

<seaborn.axisgrid.JointGrid at 0x1a21c534a8>

Visualizing pairwise relationships in a dataset

I did not fully understand this part.

iris = sns.load_dataset("iris")
sns.pairplot(iris);

g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);


Visualizing categorical data

In seaborn, there are several different ways to visualize a relationship involving categorical data. Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two ways to make these plots. There are a number of axes-level functions for plotting categorical data in different ways and a figure-level interface, catplot(), that gives unified higher-level access to them.

catplot()

Categorical scatterplots:

stripplot() (with kind="strip"; the default)
swarmplot() (with kind="swarm")

Categorical distribution plots:

boxplot() (with kind="box")
violinplot() (with kind="violin")
boxenplot() (with kind="boxen")

Categorical estimate plots:

pointplot() (with kind="point")
barplot() (with kind="bar")
countplot() (with kind="count")


import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

Categorical scatterplots

tips = sns.load_dataset("tips")
tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.catplot(x='day',y='total_bill',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a156aa940>

# The jitter parameter controls the magnitude of jitter or disables it altogether:
sns.catplot(x="day", y="total_bill", jitter=False, data=tips);

Beeswarm plots: swarmplot()

sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);

# The hue parameter is also supported for grouping, but style is not
sns.catplot(x="day", y="total_bill", kind="swarm",hue='sex',data=tips);

# The order parameter controls the order of the categories along the axis
sns.catplot(x='smoker',y='tip',order=['Yes','No'],data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a2196da58>

# x and y are interchangeable; swap them for a horizontal layout
sns.catplot(x='tip',y='smoker',order=['Yes','No'],data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a21ae6f28>

Categorical distribution plots

Box plots

The first is the familiar boxplot(). This kind of plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and then observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data.

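The "whiskers" extend up to 1.5 IQR beyond the box (the IQR is the distance between the first and third quartiles), and points outside that range are drawn individually.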
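The whisker rule is easy to reproduce with NumPy. The sample below is made up for illustration, and the variable names are assumptions; it computes the quartiles, the 1.5 IQR fences, the whisker ends, and the points that would be drawn as individual outliers.

```python
import numpy as np

# Hypothetical sample with one obvious extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
# Whiskers stop at the most extreme observations inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]
```

Note that the whiskers end at actual observations (here 1.0 and 9.0), not at the fences themselves, which is why every mark in a box plot corresponds to a data point.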

sns.catplot(x='day',y='total_bill',kind='box',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a21bafda0>

sns.catplot(x='day',y='total_bill',kind='box',hue='sex',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a21975d30>

A related function, boxenplot(), draws a plot that is similar to a box plot but optimized for showing more information about the shape of the distribution. It is best suited for larger datasets:

sns.catplot(x='day',y='total_bill',kind='boxen',hue='sex',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a21ba7128>

Violinplots

A violinplot() combines a boxplot with the kernel density estimation procedure described in the distributions tutorial.

(The kernel density estimation procedure?)

sns.catplot(x='total_bill',y='day',kind='violin',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a220c1cf8>

sns.catplot(x="day", y="total_bill", hue="sex",
            kind="violin", split=True, data=tips);

Categorical estimate plots

For other applications, rather than showing the distribution within each category, you might want to show an estimate of the central tendency of the values. Seaborn has two main ways to show this information. Importantly, the basic API for these functions is identical to that for the ones discussed above.

Bar plots

titanic = sns.load_dataset("titanic")
titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

sns.catplot(x='sex',y='survived',kind='bar',hue='class',data=titanic)

<seaborn.axisgrid.FacetGrid at 0x110b9ecf8>

sns.catplot(x='deck',kind='count',palette='ch:.25',data=titanic)

<seaborn.axisgrid.FacetGrid at 0x1a23085cc0>

Point plots

An alternative style for visualizing the same information is offered by the pointplot() function. This function also encodes the value of the estimate with height on the other axis, but rather than showing a full bar, it plots the point estimate and confidence interval. Additionally, pointplot() connects points from the same hue category. This makes it easy to see how the main relationship is changing as a function of the hue semantic, because your eyes are quite good at picking up on differences of slopes:

sns.catplot(x='sex',y='survived',kind='point',hue='class',data=titanic)

<seaborn.axisgrid.FacetGrid at 0x1a230770b8>

# The plot can be styled further with parameters such as palette, markers and linestyles
sns.catplot(x='class',y='survived',kind='point',hue='sex',data=titanic
           ,palette={'male':'g','female':'m'}
           ,markers=['^','o'],linestyles=['-','--'])

<seaborn.axisgrid.FacetGrid at 0x1a23683278>

Plotting "wide-form" data

While using "long-form" or "tidy" data is preferred, these functions can also be applied to "wide-form" data in a variety of formats, including pandas DataFrames or two-dimensional numpy arrays. These objects should be passed directly to the data parameter:

iris = sns.load_dataset("iris")
sns.catplot(data=iris, orient="h", kind="box")

<seaborn.axisgrid.FacetGrid at 0x1a23843da0>
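The difference between the two layouts can be sketched with pandas alone. The tiny table below is made up (the column names only echo the iris dataset), and melt() is one common way, assumed here for illustration, to turn wide-form data into long-form data.

```python
import pandas as pd

# Hypothetical wide-form table: one column per measured variable.
wide = pd.DataFrame({"sepal_length": [5.1, 4.9],
                     "sepal_width": [3.5, 3.0]})

# melt() stacks the columns into (variable, value) pairs: long-form data.
long = wide.melt(var_name="measurement", value_name="cm")
```

In the long form each row is a single observation, so the result could be passed to catplot() with x="measurement" and y="cm".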

Multi-panel categorical plots

sns.catplot(x="day", y="total_bill", hue="smoker",
            col="time", aspect=.6,
            kind="swarm", data=tips);


g = sns.catplot(x="fare", y="survived", row="class",
                kind="box", orient="h", height=1.5, aspect=4,
                data=titanic.query("fare > 0"))
g.set(xscale="log");


A supplementary note

seaborn.catplot(x=None, y=None, hue=None, data=None, row=None, col=None, col_wrap=None, estimator=<function mean>, ci=95, n_boot=1000, units=None, order=None, hue_order=None, row_order=None, col_order=None, kind='strip', height=5, aspect=1, orient=None, color=None, palette=None, legend=True, legend_out=True, sharex=True, sharey=True, margin_titles=False, facet_kws=None)

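x, y: variable names
data: the dataset
row, col: categorical variables that control how the grid is faceted
col_wrap: maximum number of panels shown per row
estimator: function that maps the vector of values in each category to a scalar
ci: size of the confidence interval
n_boot: number of bootstrap iterations used when computing confidence intervals
..
order, hue_order: order of the categorical levels
row_order, col_order: order of the facet rows and columns
kind: the kind of plot to draw ("point", "bar", "strip", "swarm", "box", "violin", or "boxen")
height: height of each panel (named size before seaborn 0.9)
aspect: aspect ratio of each panel
orient: orientation
color: color
palette: color palette
legend: whether to draw the legend for hue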
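How estimator, ci and n_boot fit together can be sketched with a percentile bootstrap. This is an illustration under stated assumptions: the helper bootstrap_ci and the 0/1 sample are hypothetical, and seaborn's internal implementation may differ in its details.

```python
import numpy as np

def bootstrap_ci(values, estimator=np.mean, ci=95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for `estimator` applied to `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement n_boot times and re-apply the estimator.
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_estimates = np.apply_along_axis(estimator, 1, resamples)
    tail = (100 - ci) / 2
    low, high = np.percentile(boot_estimates, [tail, 100 - tail])
    return float(low), float(high)

survived = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 1])  # hypothetical 0/1 outcomes
low, high = bootstrap_ci(survived)
```

The estimator collapses each resample to one number, n_boot sets how many resamples are drawn, and ci sets how wide a slice of the resulting estimates is reported.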

Seaborn-relationships

Posted on 2019-01-09

Preface

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The content comes from the official Seaborn tutorial.

Visualizing statistical relationships

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="darkgrid")

relplot()
lineplot()
scatterplot()

Relating variables with scatter plots

tips = sns.load_dataset("tips") # load_dataset() loads an example dataset from the online repository, which is very handy for practice
tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.relplot(x="total_bill", y="tip", data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a1ff4a4e0>

sns.relplot(x="total_bill", y="tip",hue='size', data=tips) # hue groups the points by the given variable and draws each group in a different color

<seaborn.axisgrid.FacetGrid at 0x1a1fbbdc88>

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,100),data=tips) # size and sizes are usually used together

<seaborn.axisgrid.FacetGrid at 0x1a203be5f8>

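Faceting like this looks very useful: the col parameter splits the data across columns of panels and the row parameter splits it across rows.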

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,50),col='time',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a205eb1d0>

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,50),col='time',row='sex',data=tips)

<seaborn.axisgrid.FacetGrid at 0x1a209b8ef0>

sns.relplot(x="total_bill", y="tip",hue='time',col ='time',palette = ['b','r'],data=tips) # specify the plot colors

<seaborn.axisgrid.FacetGrid at 0x1a21a2b390>

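Emphasizing continuity with line plots

# Line plots take many of the same parameters as the scatter plots above (hue, size and, through relplot(), col), so refer to those examples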

df = pd.DataFrame(dict(time=np.arange(500),value=np.random.randn(500).cumsum()))
sns.relplot(x='time',y='value',kind='line',data=df)

<seaborn.axisgrid.FacetGrid at 0x1a22642c88>


sns.lineplot(x='time',y='value',data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x1a22879c18>

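Aggregation and representing uncertainty

When there are multiple measurements for the same value of x, seaborn aggregates them by plotting the mean and the 95% confidence interval around the mean.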

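This behavior should be adapted to the dataset. The confidence intervals are computed by bootstrapping, which can be time-intensive for larger datasets, so in that case they can be disabled with ci=None.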
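The aggregation step itself amounts to a group-by on the x variable. The following sketch uses a made-up table of repeated measurements (the column names only echo the fmri dataset) and computes the per-timepoint mean and standard deviation with pandas.

```python
import pandas as pd

# Hypothetical repeated measurements: three signals at each of two timepoints.
measurements = pd.DataFrame({
    "timepoint": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per x value: the line follows `mean`, the band can follow `std`.
summary = measurements.groupby("timepoint")["signal"].agg(["mean", "std"])
```

Plotting the mean column as the line and the standard deviation as the band is roughly what ci='sd' shows, without any bootstrapping cost.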

fmri = sns.load_dataset("fmri")
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)

<seaborn.axisgrid.FacetGrid at 0x1a229c42b0>

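# Especially with larger data, the spread of the distribution at each timepoint can be shown by plotting the standard deviation instead of a confidence interval
sns.relplot(x="timepoint", y="signal", kind="line",ci='sd',data=fmri)

<seaborn.axisgrid.FacetGrid at 0x1a22b6ce10>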

Plotting subsets of data with semantic mappings

fmri.head()

	subject	timepoint	event	region	signal
0	s13	18	stim	parietal	-0.017552
1	s5	14	stim	parietal	-0.080883
2	s12	18	stim	parietal	-0.081033
3	s11	18	stim	parietal	-0.046134
4	s10	18	stim	parietal	-0.037970

sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=fmri);

sns.lineplot(x='timepoint',y='signal',hue='region',style='event',data=fmri)

<matplotlib.axes._subplots.AxesSubplot at 0x1a23077320>

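# Identify the subsets with markers instead of dashes
sns.relplot(x="timepoint", y="signal", hue="region", style="event",
            dashes=False, markers=True, kind="line", data=fmri);

You can also plot each sampling unit separately without distinguishing the units through semantics. This avoids cluttering the legend.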

sns.relplot(x="timepoint", y="signal", hue="region",
            units="subject", estimator=None,
            kind="line", data=fmri.query("event == 'stim'"))

<seaborn.axisgrid.FacetGrid at 0x1a234eef98>

The default colormap and legend handling in lineplot() also depend on whether the hue semantic is categorical or numeric.

dots = sns.load_dataset("dots").query("align == 'dots'")
sns.relplot(x="time", y="firing_rate",
            hue="coherence", style="choice",
            kind="line", data=dots)

<seaborn.axisgrid.FacetGrid at 0x1a2370e7b8>

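Plotting with date data

Line plots are often used to visualize data associated with real dates and times. These functions pass the data down in its original format to the underlying matplotlib functions, so they can take advantage of matplotlib's ability to format dates in tick labels.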

df = pd.DataFrame(dict(time=pd.date_range("2017-1-1", periods=500),
                       value=np.random.randn(500).cumsum()))
g = sns.relplot(x="time", y="value", kind="line", data=df)
g.fig.autofmt_xdate()


sns.relplot(x="timepoint", y="signal", hue="subject",
            col="region", row="event", height=3,
            kind="line", estimator=None, data=fmri);


sns.relplot(x="timepoint", y="signal", hue="event", style="event",
            col="subject", col_wrap=5,
            height=3, aspect=.75, linewidth=2.5,
            kind="line", data=fmri.query("region == 'frontal'"));

ch08 - Data visualization

Posted on 2019-01-09

About data visualization

import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
import matplotlib
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

The simplest example

plt.plot(np.arange(10))

[<matplotlib.lines.Line2D at 0x120132198>]

Figure and Subplot

fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)

plt.plot(np.random.randn(50).cumsum(),'k--')

[<matplotlib.lines.Line2D at 0x120b7ad68>]

_ = ax1.hist(np.random.randn(100),bins=20,color='k',alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))

<matplotlib.collections.PathCollection at 0x1203c5208>

fig

fig, axes = plt.subplots(2, 3)
axes

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x120a9bcc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120f54c18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120f7b2e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x120fa1438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120fc9b00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120ffc198>]],
      dtype=object)

Adjusting the spacing around subplots

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i,j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0.1,hspace=0.1)

Colors, markers, and line styles

plt.figure()

<Figure size 432x288 with 0 Axes>

from numpy.random import randn
plt.plot(randn(50).cumsum(),linestyle='--',color='b')

[<matplotlib.lines.Line2D at 0x1213f3a90>]

plt.plot(randn(30).cumsum(), color='k', linestyle='dashed', marker='o')

[<matplotlib.lines.Line2D at 0x1213f8fd0>]

plt.plot(randn(30).cumsum(),'ko--') # 'ko--' packs three settings into one format string: color='k', marker='o', linestyle='--'

[<matplotlib.lines.Line2D at 0x1214baf60>]
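The equivalence between the format string and the explicit keywords can be checked on the returned Line2D objects. This is a small sketch; the Agg backend and the variable names are assumptions made so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(5)
fig, ax = plt.subplots()

# Same styling, written two ways.
(short_form,) = ax.plot(data, "ko--")
(long_form,) = ax.plot(data, color="k", marker="o", linestyle="--")
```

The format string is convenient for quick plots, while the keyword form is easier to read and also accepts colors that have no one-letter code.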

plt.close('all')

data = np.random.randn(30).cumsum()
plt.plot(data, 'k--', label='Default')
plt.plot(data, 'k-', drawstyle='steps-post', label='steps-post')
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x1216ab9e8>

Ticks, labels, and legends

Setting the details

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(randn(1000).cumsum())

[<matplotlib.lines.Line2D at 0x1226302b0>]


ticks  = ax.set_xticks([0,250,500,750,1000])
labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'],
                            rotation=30, fontsize='small')

ax.set_title('My First Title of Matplotlib')

Text(0.5,1,'My First Title of Matplotlib')

ax.set_xlabel('Stage')

Text(0.5,3.2,'Stage')

fig

Adding legends

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(randn(100).cumsum(), 'k', label='one')
ax.plot(randn(100).cumsum(), 'k--', label='two')
ax.plot(randn(100).cumsum(), 'k.', label='three')

[<matplotlib.lines.Line2D at 0x1223a1940>]

ax.legend(loc='best')
fig

Annotations and drawing on a subplot

from datetime import datetime

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

data = pd.read_csv('examples/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']

spx.plot(ax=ax, style='k-')

crisis_data = [
    (datetime(2007, 10, 11), 'Peak of bull market'),
    (datetime(2008, 3, 12), 'Bear Stearns Fails'),
    (datetime(2008, 9, 15), 'Lehman Bankruptcy')
]

for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 75),
                xytext=(date, spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')

# Zoom in on 2007-2010
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])

ax.set_title('Important dates in the 2008-2009 financial crisis')

Text(0.5,1,'Important dates in the 2008-2009 financial crisis')


fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]],
                   color='g', alpha=0.5)

ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)

<matplotlib.patches.Polygon at 0x123523eb8>

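Saving plots to a file

fig.savefig('/Users/zhangyangfenbi.com/Desktop/demo.png')

Closing remarks

matplotlib is really a fairly low-level tool; plots are assembled piece by piece. The book also introduces the plotting functions built into pandas, but since I have already covered Seaborn I will skip the pandas part here and fill in the remaining Seaborn gaps later.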

Reading The Great Gatsby

Posted on 2019-01-08

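Preface

Last year, when I was reading Haruki Murakami's Norwegian Wood, Nagasawa says in a conversation with Watanabe:

Anyone who has read The Great Gatsby three times seems like someone who could be my friend.

At the time I wondered what kind of book could be that impressive and figured I would read it someday. I finished it tonight, so here is a short note, even though an ordinary young reader like me did not get any great sparks out of it.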

Synopsis

The following is from Wikipedia:

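The main events of the novel take place in the summer of 1922. Nick Carraway, a Yale graduate and World War I veteran (and the narrator), moves from the Midwest to New York to sell bonds for a living. He rents a small house in West Egg, Long Island, next door to Gatsby. Jay Gatsby is a young, mysterious millionaire who often throws lavish parties yet rarely shows himself; many people come to eat and drink at his house, but he remains a lonely man. Nick drives to East Egg to visit his cousin Daisy Fay Buchanan, whose husband, Tom Buchanan, was also Nick's college classmate. They introduce Nick to Miss Jordan Baker, a charming but somewhat selfish young golfer; Nick believes he has fallen in love with her. She tells Nick that Tom has a mistress, Myrtle Wilson, who lives in the "valley of ashes", an industrial dumping ground between West Egg and New York City. Before long, Nick goes with Tom and Myrtle to the apartment where they meet, where a debauched party is held. Myrtle repeatedly brings up Daisy's name, and Tom, enraged, breaks her nose.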

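One day that summer, Nick receives an invitation to one of Gatsby's parties. There he runs into Jordan Baker and finally meets Gatsby, discovering that Gatsby served in the same division as he did during the war. Nick learns from Jordan that Gatsby fell in love with Daisy in 1917, but because he had to go to war, Daisy eventually married Tom Buchanan. After being discharged, he made a great deal of money and bought a mansion on Long Island, from which he gazes across the bay at Daisy's home, hoping to rekindle the romance. Gatsby's extravagant lifestyle and wild parties are merely meant to attract Daisy and win her back. Gatsby asks Nick to arrange a meeting with Daisy. Nick invites Daisy to tea at his house without telling her that Gatsby will be there. After an awkward reunion, Gatsby and Daisy rekindle their old feelings. They are together again, but Tom soon grows suspicious. At a dinner, Daisy speaks to Gatsby sweetly and without concealment, confirming Tom's suspicions. Although Tom has a mistress himself, he is furious at his wife's infidelity. Tom forces everyone to go to New York City, confronts Gatsby in a suite at the Plaza Hotel, and tells him that he and Daisy share a history Gatsby could never understand. He also reveals that Gatsby earned his fortune through bootlegging and other shady dealings. Daisy feels she cannot bear it and only wants to leave, and Tom tells Gatsby to drive her home.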

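Driving home, Nick, Jordan, and Tom pass through the "valley of ashes" and find that Tom's mistress Myrtle has been struck and killed by Gatsby's car. Afterwards Nick learns from Gatsby that Daisy was driving, but Gatsby is unwilling to expose the woman he loves. Myrtle's husband George, mistakenly believing that the car's owner was his wife's lover, sets out to find him. Tom misleads George, who, after learning that the car belongs to Gatsby, goes to the mansion, shoots him, and then kills himself. Nick holds a funeral for Gatsby, ends his relationship with Jordan, and, disillusioned with the Eastern way of life, returns home to the Midwest.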

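A few impressions

The deeper value of this book, or what it stands for, is not something I am qualified to judge, but I found the story gripping and it stays with you.

A few points worth noting:

It fits China too

This story could be set entirely in present-day China: a small-time second-generation rich kid from a second- or third-tier city watching the story unfold between a nouveau riche who came out of nowhere, a first-generation tycoon from a first-tier city, and a rich, beautiful girl.

About love, about ideals, about class...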

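Opening and ending

In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
"Whenever you feel like criticizing any one," he told me, "just remember that all the people in this world haven't had the advantages that you've had."

Don't judge.

So we beat on, boats against the current, borne back ceaselessly into the past.

That era, this era: will the world get better?

It will!

A resolution

I will read the English edition when I get the chance...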

ch07 - Data wrangling

Posted on 2019-01-08

Combining datasets

import numpy as np
import pandas as pd
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

Hierarchical indexing

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
d  2    0.281746
   3    0.769023
dtype: float64

data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

data['b']

1   -0.555730
3    1.965781
dtype: float64

data['b':'c']

b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
dtype: float64

data.loc[['b', 'd']]

b  1   -0.555730
   3    1.965781
d  2    0.281746
   3    0.769023
dtype: float64

data.loc[:, 2]

a    0.478943
c    0.092908
d    0.281746
dtype: float64

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11


frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame['Ohio']

	color	Green	Red
key1	key2
a	1	0	1
a	2	3	4
b	1	6	7
b	2	9	10

Database-style DataFrame merges

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

df1

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

df2

	key	data2
0	a	0
1	b	1
2	d	2

pd.merge(df1,df2) # if no key is specified, merge uses the overlapping column names as keys

	key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0
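The implicit key selection can be confirmed by comparing it with an explicit on= argument. The sketch rebuilds the two frames from above so that it is self-contained; the names implicit and explicit are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

implicit = pd.merge(df1, df2)            # key inferred from shared column names
explicit = pd.merge(df1, df2, on='key')  # same join, key spelled out
```

Spelling out on= is usually the safer habit, since the inferred key silently changes when another shared column appears in either frame.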

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
print(df3)
print(df4)

  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
  rkey  data2
0    a      0
1    b      1
2    d      2

pd.merge(df3,df4,left_on='lkey',right_on='rkey') # inner join by default; the how parameter selects the join type

	lkey	data1	rkey	data2
0	b	0	b	1
1	b	1	b	1
2	b	6	b	1
3	a	2	a	0
4	a	4	a	0
5	a	5	a	0

# Left join
pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='left')

	lkey	data1	rkey	data2
0	b	0	b	1.0
1	b	1	b	1.0
2	a	2	a	0.0
3	c	3	NaN	NaN
4	a	4	a	0.0
5	a	5	a	0.0
6	b	6	b	1.0
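The effect of each how option is easiest to see from the number of rows it returns. The sketch below rebuilds df3 and df4 and counts the rows for the four common join types; the dictionary name row_counts is illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(df3, df4, left_on='lkey', right_on='rkey', how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
```

The inner join keeps only the six matching rows, the left join adds the unmatched 'c' row, the right join adds the unmatched 'd' row, and the outer join keeps both.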

## To join on multiple keys, pass a list of column names
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left,right,on=['key1','key2'],how='left')

	key1	key2	lval	rval
0	foo	one	1	4.0
1	foo	one	1	5.0
2	foo	two	2	NaN
3	bar	one	3	6.0

pd.merge(left, right, on='key1')

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

	key1	key2_left	lval	key2_right	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

Merging on index

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

left1

	key	value
0	a	0
1	b	1
2	a	2
3	a	3
4	b	4
5	c	5

right1

	group_val
a	3.5
b	7.0

pd.merge(left1,right1,left_on='key',right_index=True)

	key	value	group_val
0	a	0	3.5
2	a	2	3.5
3	a	3	3.5
1	b	1	7.0
4	b	4	7.0

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

lefth

	key1	key2	data
0	Ohio	2000	0.0
1	Ohio	2001	1.0
2	Ohio	2002	2.0
3	Nevada	2001	3.0
4	Nevada	2002	4.0

righth

		event1	event2
Nevada	2001	0	1
Nevada	2000	2	3
Ohio	2000	4	5
	2000	6	7
	2001	8	9
	2002	10	11

pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)

	key1	key2	data	event1	event2
0	Ohio	2000	0.0	4	5
0	Ohio	2000	0.0	6	7
1	Ohio	2001	1.0	8	9
2	Ohio	2002	2.0	10	11
3	Nevada	2001	3.0	0	1

pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')

	key1	key2	data	event1	event2
0	Ohio	2000	0.0	4.0	5.0
0	Ohio	2000	0.0	6.0	7.0
1	Ohio	2001	1.0	8.0	9.0
2	Ohio	2002	2.0	10.0	11.0
3	Nevada	2001	3.0	0.0	1.0
4	Nevada	2002	4.0	NaN	NaN
4	Nevada	2000	NaN	2.0	3.0

## Using the index on both sides also works

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

	Ohio	Nevada	Missouri	Alabama
a	1.0	2.0	NaN	NaN
b	NaN	NaN	7.0	8.0
c	3.0	4.0	9.0	10.0
d	NaN	NaN	11.0	12.0
e	5.0	6.0	13.0	14.0

left2.join(right2) ### join() is a more convenient way to merge on the index

	Ohio	Nevada	Missouri	Alabama
a	1.0	2.0	NaN	NaN
c	3.0	4.0	9.0	10.0
e	5.0	6.0	13.0	14.0

left1.join(right1, on='key')

	key	value	group_val
0	a	0	3.5
1	b	1	7.0
2	a	2	3.5
3	a	3	3.5
4	b	4	7.0
5	c	5	NaN

Concatenating along an axis

arr=np.arange(12).reshape(3,4)
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])


s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

pd.concat([s1,s2,s3]) # axis=0 by default

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

s4=pd.concat([s1*5,s3])
# join_axes was removed in pandas 1.0; reindex the result to pick the rows instead
pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])

	0	1
a	0.0	0.0
c	NaN	NaN
b	1.0	5.0
e	NaN	NaN

result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

s1

a    0
b    1
dtype: int64

result.unstack()

	a	b	c	d	e	f	g
one	0.0	1.0	NaN	NaN	NaN	NaN	NaN
two	NaN	NaN	2.0	3.0	4.0	NaN	NaN
three	NaN	NaN	NaN	NaN	NaN	5.0	6.0

pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'],sort=True)

	one	two	three
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df1

	one	two
a	0	1
b	2	3
c	4	5

df2

	three	four
a	5	6
c	7	8

pd.concat([df1, df2], axis=1,sort=True)

	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

pd.concat({'level1': df1, 'level2': df2}, axis=1,sort=True) # the dict keys are used as the keys option

	level1		level2
	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

pd.concat([df1,df2],axis=1,keys=['l1','l2'],names=['zz1','zz2'],sort=True)

zz1	l1		l2
zz2	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

df1

	a	b	c	d
0	1.246435	1.007189	-1.296221	0.274992
1	0.228913	1.352917	0.886429	-2.001637
2	-0.371843	1.669025	-0.438570	-0.539741

df2

	b	d	a
0	0.476985	3.248944	-1.021228
1	-0.577087	0.124121	0.302614

pd.concat([df1,df2],sort=True)

	a	b	c	d
0	1.246435	1.007189	-1.296221	0.274992
1	0.228913	1.352917	0.886429	-2.001637
2	-0.371843	1.669025	-0.438570	-0.539741
0	-1.021228	0.476985	NaN	3.248944
1	0.302614	-0.577087	NaN	0.124121

pd.concat([df1,df2],ignore_index=True,sort=True)

	a	b	c	d
0	1.246435	1.007189	-1.296221	0.274992
1	0.228913	1.352917	0.886429	-2.001637
2	-0.371843	1.669025	-0.438570	-0.539741
3	-1.021228	0.476985	NaN	3.248944
4	0.302614	-0.577087	NaN	0.124121

Combining data with overlap

np.where?

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1]=np.nan  # positional access; b[-1] on a string-indexed Series is deprecated in recent pandas

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

np.where(pd.isnull(a),b,a) # fill the missing values in a with the values of b at the same position

array([0. , 2.5, 2. , 3.5, 4.5, nan])

b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

df1

	a	b	c
0	1.0	NaN	2
1	NaN	2.0	6
2	5.0	NaN	10
3	NaN	6.0	14

df2

	a	b
0	5.0	NaN
1	4.0	3.0
2	NaN	4.0
3	3.0	6.0
4	7.0	8.0

df1.combine_first(df2) # "patch" missing data in the calling object with data from the object passed in

	a	b	c
0	1.0	NaN	2.0
1	4.0	2.0	6.0
2	5.0	4.0	10.0
3	3.0	6.0	14.0
4	7.0	8.0	NaN
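For two Series that share the same index, the patching rule can be written out by hand with where(). This is a minimal sketch on made-up values; the names patched and by_hand are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([0.0, 1.0, np.nan], index=['x', 'y', 'z'])

patched = a.combine_first(b)      # label-aligned patching
by_hand = a.where(a.notna(), b)   # keep a where present, otherwise take b
```

Unlike the hand-written version, combine_first() also aligns on labels and returns the union of both indexes, which is why the DataFrame result above gains row 4 from df2.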

Reshaping and pivoting

Reshaping with hierarchical indexing

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                    name='number'))
data

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

result.unstack()

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

Pivoting "long" format to "wide" format

data = pd.read_csv('examples/macrodata.csv')
data.head()
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                         name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0: 'value'})

ldata[:10]

	date	item	value
0	1959-03-31	realgdp	2710.349
1	1959-03-31	infl	0.000
2	1959-03-31	unemp	5.800
3	1959-06-30	realgdp	2778.801
4	1959-06-30	infl	2.340
5	1959-06-30	unemp	5.100
6	1959-09-30	realgdp	2775.488
7	1959-09-30	infl	2.740
8	1959-09-30	unemp	5.300
9	1959-12-31	realgdp	2785.204

pivoted = ldata.pivot(index='date', columns='item', values='value')  # index, columns, values (keyword-only since pandas 2.0)
pivoted.head()

item	infl	realgdp	unemp
date
1959-03-31	0.00	2710.349	5.8
1959-06-30	2.34	2778.801	5.1
1959-09-30	2.74	2775.488	5.3
1959-12-31	0.27	2785.204	5.6
1960-03-31	2.31	2847.699	5.2

ldata['value2'] = np.random.randn(len(ldata))

ldata[:10]

	date	item	value	value2
0	1959-03-31	realgdp	2710.349	-0.894813
1	1959-03-31	infl	0.000	-1.741494
2	1959-03-31	unemp	5.800	-1.052256
3	1959-06-30	realgdp	2778.801	1.436603
4	1959-06-30	infl	2.340	-0.576207
5	1959-06-30	unemp	5.100	-2.420294
6	1959-09-30	realgdp	2775.488	-1.062330
7	1959-09-30	infl	2.740	0.237372
8	1959-09-30	unemp	5.300	0.000957
9	1959-12-31	realgdp	2785.204	0.065253

pivoted = ldata.pivot(index='date', columns='item')
pivoted.head()  # the columns now carry a hierarchical index

	value			value2
item	infl	realgdp	unemp	infl	realgdp	unemp
date
1959-03-31	0.00	2710.349	5.8	-1.741494	-0.894813	-1.052256
1959-06-30	2.34	2778.801	5.1	-0.576207	1.436603	-2.420294
1959-09-30	2.74	2775.488	5.3	0.237372	-1.062330	0.000957
1959-12-31	0.27	2785.204	5.6	-1.367524	0.065253	-0.030280
1960-03-31	2.31	2847.699	5.2	-0.642437	0.940489	1.040179

pivoted['value'].head()

item	infl	realgdp	unemp
date
1959-03-31	0.00	2710.349	5.8
1959-06-30	2.34	2778.801	5.1
1959-09-30	2.74	2775.488	5.3
1959-12-31	0.27	2785.204	5.6
1960-03-31	2.31	2847.699	5.2
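
One way to read pivot is as a shortcut for set_index followed by unstack. The sketch below checks that on a small, invented long-format table; the values are hypothetical and are not taken from macrodata.csv.

```python
import pandas as pd

# Hypothetical long-format data (values invented for illustration)
ldata = pd.DataFrame({
    'date': ['2000-03-31', '2000-03-31', '2000-06-30', '2000-06-30'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [1.0, 4.0, 2.0, 3.9],
})

# pivot: one row per date, one column per item
pivoted = ldata.pivot(index='date', columns='item', values='value')

# The same reshaping spelled out with a hierarchical index
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted)
```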

Data transformation

Removing duplicates

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
data

	k1	k2
0	one	1
1	one	1
2	one	2
3	two	3
4	two	3
5	two	4
6	two	4

data.duplicated()  # flags each row that duplicates an earlier row

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

data.drop_duplicates()  # drop the duplicate rows

	k1	k2
0	one	1
2	one	2
3	two	3
5	two	4

data['v1'] = np.arange(7)
data.drop_duplicates(['k1'])  # deduplicate on the k1 column only

	k1	k2	v1
0	one	1	0
3	two	3	3

	k1	k2	v1
0	one	1	0
1	one	1	1
2	one	2	2
3	two	3	3
4	two	3	4
5	two	4	5
6	two	4	6

data = pd.Series([1, 2, 3, -99])
data

0     1
1     2
2     3
3   -99
dtype: int64

data.replace(-99, np.nan)

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

data.replace({1: 100, -99: np.nan})

0    100.0
1      2.0
2      3.0
3      NaN
dtype: float64

Renaming axis indexes

data = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
data

	Missouri	Alabama
b	7.0	8.0
c	9.0	10.0
d	11.0	12.0
e	13.0	14.0

data.index.map(str.upper)

Index(['B', 'C', 'D', 'E'], dtype='object')

Discretization and binning

# discretization
age = np.arange(10)
bins = [2,5,8]
cats = pd.cut(age,bins)
cats

[NaN, NaN, NaN, (2, 5], (2, 5], (2, 5], (5, 8], (5, 8], (5, 8], NaN]
Categories (2, interval[int64]): [(2, 5] < (5, 8]]
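
The output above shows two properties of pd.cut worth spelling out: each bin is closed on the right by default, and values outside every bin become NaN. A minimal sketch of both, reusing age and bins from above (the names left_closed and labeled, and the labels 'low' and 'high', are made up for this example):

```python
import numpy as np
import pandas as pd

age = np.arange(10)
bins = [2, 5, 8]

# Default: right-closed bins (2, 5] and (5, 8]; a code of -1 marks values outside every bin
cats = pd.cut(age, bins)
print(cats.codes)

# right=False switches to left-closed bins [2, 5) and [5, 8)
left_closed = pd.cut(age, bins, right=False)
print(left_closed.codes)

# Optional human-readable labels, one per bin
labeled = pd.cut(age, bins, labels=['low', 'high'])
print(labeled.value_counts())
```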

Detecting and filtering outliers

from pandas import DataFrame
data=DataFrame(np.random.randn(1000,4))
data.describe()

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	0.002621	-0.023747	-0.003461	-0.002610
std	0.998586	0.962207	1.012928	0.996423
min	-3.024110	-2.657202	-3.105636	-3.530912
25%	-0.670724	-0.684972	-0.691494	-0.707701
50%	0.022038	0.023472	0.024927	0.020683
75%	0.649798	0.639806	0.693491	0.672463
max	3.897527	3.160760	3.144389	3.003284

# find the values in one column whose absolute value exceeds 3
col=data[3]
col[abs(col)>3]

208   -3.530912
969    3.003284
Name: 3, dtype: float64

data[(np.abs(data) > 3).any(axis=1)][:2]  # rows containing at least one value beyond +/-3

	0	1	2	3
136	-1.202724	-0.286215	-3.105636	-0.369009
148	-3.024110	-1.168413	-0.888664	0.111410

data[np.abs(data) > 3] = np.sign(data) * 3  # cap only the outlying values at -3 or 3
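
As a sanity check on the capping step, the sketch below compares the boolean-mask assignment with DataFrame.clip, which is swapped in here as an alternative and is not what the notes use. The seed and the names capped and clipped are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Cap with a boolean mask: only cells beyond +/-3 are overwritten
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative spelling with clip
clipped = data.clip(lower=-3, upper=3)

print(capped.abs().max())
```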

Computing indicator/dummy variables

df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'value': np.arange(6)})
df

	key	value
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5

pd.get_dummies(df['key'])

	a	b	c
0	0	1	0
1	0	1	0
2	1	0	0
3	0	0	1
4	1	0	0
5	0	1	0
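
A common follow-up, sketched here under the assumption that the indicator columns should sit next to the remaining data: give the dummy columns a prefix and join them back. The names dummies and joined are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'value': np.arange(6)})

# prefix keeps the origin of each indicator column visible
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the indicator columns back onto the remaining data
joined = df[['value']].join(dummies)
print(joined)
```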

String manipulation

String object methods

# to be studied together with Python string encoding

Regular expressions

import re  # see the earlier blog post

ch06-IO

Posted on 2019-01-06

Reading and writing data in text format

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

!pwd

/Users/zhangyangfenbi.com/Desktop/code/conda_book

!cat examples/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

file_path = 'examples/ex1.csv'
df=pd.read_csv(file_path)
df

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

# read_table works too, but its default separator differs, so sep must be given explicitly
pd.read_table('examples/ex1.csv', sep=',')

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

!cat examples/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

# pandas can assign default column names, or you can supply your own
df1=pd.read_csv('examples/ex2.csv',header=None)
df1

	0	1	2	3	4
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

df2 = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'e'])
df2

	a	b	c	d	e
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

df2 = pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'e'], index_col='e')
df2

	a	b	c	d
e
hello	1	2	3	4
world	5	6	7	8
foo	9	10	11	12

!cat examples/csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])

		value1	value2
key1	key2
one	a	1	2
	b	3	4
	c	5	6
	d	7	8
two	a	9	10
	b	11	12
	c	13	14
	d	15	16

# list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

# a regular expression can match a separator that varies in width
res = pd.read_table('examples/ex3.txt', sep=r'\s+')
res

	A	B	C
aaa	-0.264438	-1.026059	-0.619500
bbb	0.927272	0.302904	-0.032399
ccc	-0.264273	-0.386314	-0.217601
ddd	-0.871858	-0.348382	1.100491

!cat examples/ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

# use skiprows to skip the given lines
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

!cat examples/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

# pandas marks missing values automatically (NA and empty fields become NaN)
res1=pd.read_csv('examples/ex5.csv')
res1

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

res1.isnull()

	something	a	b	c	d	message
0	False	False	False	False	False	True
1	False	False	False	True	False	False
2	False	False	False	False	False	False

result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

Reading text files in pieces

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.TextFileReader at 0x10c0cc828>

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype='float64')  # explicit dtype for the empty accumulator
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

# Top5
tot[:5]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
dtype: float64
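
The chunked aggregation above can be tried without examples/ex6.csv. The sketch below uses a small in-memory CSV as a hypothetical stand-in and checks that per-chunk counts add up to the counts from a single full read; csv_text and whole are names invented for this example.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for examples/ex6.csv: 20 rows of A/B/A/C
csv_text = "key\n" + "\n".join(list("ABAC") * 5)

tot = pd.Series(dtype="float64")
# chunksize makes read_csv return an iterator of DataFrames of at most 6 rows each
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    # add() aligns on the labels; fill_value=0 covers keys missing from one side
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

# Reference result from reading everything at once
whole = pd.read_csv(io.StringIO(csv_text))["key"].value_counts()

print(tot.sort_values(ascending=False))
```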

Writing data out to text format

data = pd.read_csv('examples/ex5.csv')
data

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

pwd

'/Users/zhangyangfenbi.com/Desktop/code/conda_book'

data.to_csv('/Users/zhangyangfenbi.com/Desktop/tmp.csv')

!cat '/Users/zhangyangfenbi.com/Desktop/tmp.csv'

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

import sys
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0
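
Missing values are written as empty strings in the outputs above. If a visible marker is preferred, to_csv accepts na_rep. The sketch below writes to an in-memory buffer with a small invented frame and reads the text back; the marker 'NULL' is one of the strings read_csv treats as missing by default.

```python
import io
import numpy as np
import pandas as pd

# Small invented frame with one missing number and one missing string
data = pd.DataFrame({'a': [1, 5],
                     'c': [3.0, np.nan],
                     'message': [np.nan, 'world']})

buf = io.StringIO()
data.to_csv(buf, index=False, na_rep='NULL')  # write NULL instead of an empty field
text = buf.getvalue()
print(text)

# Reading the text back turns the NULL markers into NaN again
roundtrip = pd.read_csv(io.StringIO(text))
```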

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv')
!cat examples/tseries.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

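Working with delimited formats manually

import csv

# a context manager closes the file automatically
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)

# csv.writer usage is omitted here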

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
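
Once the rows are available as lists, a typical next step is to split off the header and transpose the remaining rows into columns. This is a sketch on an in-memory string that mirrors the printed rows above; raw and data_dict are names chosen for this example, and the real ex7.csv may differ in quoting.

```python
import csv
import io

# In-memory stand-in whose parsed rows match the output printed above
raw = 'a,b,c\n1,2,3\n1,2,3\n'

lines = list(csv.reader(io.StringIO(raw)))
header, values = lines[0], lines[1:]

# zip(*values) transposes the rows into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
```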

JSON data

# JSON has become a common format, used mainly for HTTP requests and for sending data between applications
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

str

import json

res = json.loads(obj)
res

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

asjson = json.dumps(res)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
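
A parsed JSON object is plain Python data, so a list of dicts inside it can be handed to the DataFrame constructor. A minimal sketch using the same document as above; the column selection is just one possible choice.

```python
import json
import pandas as pd

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

res = json.loads(obj)  # JSON null becomes Python None

# A list of dicts maps to rows; columns= selects and orders the fields
siblings = pd.DataFrame(res['siblings'], columns=['name', 'age'])
print(siblings)
```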

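Collecting information from the web

The book mainly covers lxml and urllib2, but bs4 and requests are more widely used nowadays, so a quick skim of this part is enough.

Binary data formats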

frame = pd.read_csv('examples/ex1.csv')
frame

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

frame.to_pickle('examples/frame_pickle_zhangyang')
pd.read_pickle('examples/frame_pickle_zhangyang')

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

HTML and web APIs

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

type(resp.text)

str

Working with databases

import sqlite3
import pymongo
import MySQLdb

## Example: connecting to a MySQL database

# open the database connection
db = MySQLdb.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')

# get a cursor with the cursor() method
cursor = db.cursor()

# run a SQL statement with execute()
cursor.execute("SELECT VERSION()")

# fetch a single row with fetchone()
data = cursor.fetchone()

print("Database version : %s " % data)

# close the database connection
db.close()
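
The MySQL example needs a running server. For a version that runs anywhere, the same cursor workflow can be sketched with the sqlite3 module imported above and an in-memory database. The table name test and the inserted rows are invented for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database: nothing is written to disk
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL, c INTEGER)')

# Hypothetical rows, inserted with parameter placeholders
rows = [('Atlanta', 1.25, 6), ('Tallahassee', 2.6, 3)]
con.executemany('INSERT INTO test VALUES (?, ?, ?)', rows)
con.commit()

# Query, then build a DataFrame from the rows and the cursor's column metadata
cursor = con.execute('SELECT * FROM test')
fetched = cursor.fetchall()
columns = [d[0] for d in cursor.description]
frame = pd.DataFrame(fetched, columns=columns)
con.close()

print(frame)
```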

ch05-pandas basics

Posted on 2019-01-05

Basic pandas data structures

import pandas as pd
from pandas import Series,DataFrame

import numpy as np

Series

obj = Series([4, 7, -1, 3])

obj

0    4
1    7
2   -1
3    3
dtype: int64

obj.values

array([ 4,  7, -1,  3])

obj.index

RangeIndex(start=0, stop=4, step=1)

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

obj2['a']

-5

'b' in obj2

True

obj2>0

d     True
b     True
a    False
c     True
dtype: bool

obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

type(sdata)

dict

obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

type(obj3)

pandas.core.series.Series

states = ['California', 'Ohio', 'Oregon', 'Texas']  # California has no matching key in sdata
obj4=Series(sdata,index=states)
obj4

#NaN:not a number

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

print(obj3, obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64 California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj3+obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
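
The NaN entries above come from label alignment: a label present in only one operand has nothing to be added to. As a sketch of one way around that, Series.add with fill_value substitutes a default for the missing side. Note that a label stays NaN when it is missing on both sides, as California is here.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain +: any label missing on either side gives NaN
total = obj3 + obj4

# add with fill_value: a label missing on one side is treated as 0
filled = obj3.add(obj4, fill_value=0)
print(filled)
```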

obj4.name = 'zhangyang'

obj4.index.name = 'pk'

obj4

pk
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: zhangyang, dtype: float64

DataFrame

# pass a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

type(data)

dict

frame = DataFrame(data)
frame

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

# if columns are specified, the DataFrame's columns are arranged in that order
frame1=DataFrame(data,columns=['year','state','pop'])
frame1

	year	state	pop
0	2000	Ohio	1.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

# a requested column that has no data in the dict is filled with NaN
frame2=DataFrame(data,columns=['year','state','pop','debt'])
frame2

	year	state	pop	debt
0	2000	Ohio	1.5	NaN
1	2001	Ohio	1.7	NaN
2	2002	Ohio	3.6	NaN
3	2001	Nevada	2.4	NaN
4	2002	Nevada	2.9	NaN
5	2003	Nevada	3.2	NaN

frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

frame['year']

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

frame.loc[1]

state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object

frame2.debt = 10

frame2.debt

0    10
1    10
2    10
3    10
4    10
5    10
Name: debt, dtype: int64

# frame2.debt can be treated as a Series here
print(frame2.debt.values, 'and', frame2.debt.index)

[10 10 10 10 10 10] and RangeIndex(start=0, stop=6, step=1)

frame3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame3

	year	state	pop	debt
one	2000	Ohio	1.5	NaN
two	2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN
six	2003	Nevada	3.2	NaN

val=Series([1.2,3.1,-1],index=['two','five','one'])
frame3.debt=val
print(frame3.debt)

one     -1.0
two      1.2
three    NaN
four     NaN
five     3.1
six      NaN
Name: debt, dtype: float64

# the del keyword can be used to delete a column
frame3['tmp'] = frame3.state == 'Ohio'  ## operator precedence: the == comparison runs first and its boolean result is then assigned
frame3

	year	state	pop	debt	tmp
one	2000	Ohio	1.5	-1.0	True
two	2001	Ohio	1.7	1.2	True
three	2002	Ohio	3.6	NaN	True
four	2001	Nevada	2.4	NaN	False
five	2002	Nevada	2.9	3.1	False
six	2003	Nevada	3.2	NaN	False

del frame3['tmp']

frame3

	year	state	pop	debt
one	2000	Ohio	1.5	-1.0
two	2001	Ohio	1.7	1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	3.1
six	2003	Nevada	3.2	NaN

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame4 = DataFrame(pop)
frame4

	Nevada	Ohio
2000	NaN	1.5
2001	2.4	1.7
2002	2.9	3.6

frame4.T

	2000	2001	2002
Nevada	NaN	2.4	2.9
Ohio	1.5	1.7	3.6

frame4.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

Index objects

### Index objects are immutable

obj = Series(range(3),index=['a','b','c'])
obj

a    0
b    1
c    2
dtype: int64

index = obj.index
print(index)
index[1]='d'

Index(['a', 'b', 'c'], dtype='object')



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-84-f2a2752a2674> in <module>()
      1 index = obj.index
      2 print(index)
----> 3 index[1]='d'


/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):


TypeError: Index does not support mutable operations

## immutability makes it safe to share an Index object among several data structures
## besides being array-like, an Index also behaves like a fixed-size set
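
The two remarks above can be checked in a few lines. This is a minimal sketch; the variable names mutated, extended, common and dup are introduced only for this example.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# In-place assignment is rejected with a TypeError
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Set-like behaviour: membership tests, and set operations that return new Index objects
has_b = 'b' in index
extended = index.union(pd.Index(['d']))
common = index.intersection(pd.Index(['b', 'z']))

# Unlike a Python set, an Index may hold duplicate labels
dup = pd.Index(['foo', 'foo', 'bar'])
print(mutated, has_b, list(extended), list(common), dup.is_unique)
```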

Essential functionality

Reindexing

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
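
Besides a constant fill_value, reindex can also fill gaps from neighbouring labels through its method parameter, which is useful for ordered data. A short sketch with invented values, where 'ffill' carries the last valid value forward:

```python
import pandas as pd

# Invented series with gaps in its integer index
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' fills each new label from the closest preceding label
filled = colors.reindex(range(6), method='ffill')
print(filled)
```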

frame=DataFrame(np.arange(9).reshape((3,3))
                ,index=['a','c','e']
                ,columns=['Ohio', 'Texas', 'California'])
frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
e	6	7	8

frame.reindex(['a', 'b', 'c', 'e'])

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
e	6.0	7.0	8.0

frame.reindex(columns=['Texas', 'California', 'Ohio'])  # this returns a new object

	Texas	California	Ohio
a	1	2	0
c	4	5	3
e	7	8	6

states = ['Texas', 'California', 'Ohio']

Dropping entries from an axis

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data.drop(['one', 'two'], axis=1)

	three	four
Ohio	2	3
Colorado	6	7
Utah	10	11
New York	14	15

Indexing, selection, and filtering

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

print(obj['b'])
print(obj.iloc[1])  # positional access; plain obj[1] is deprecated on a non-integer index
print(obj[2:4])

1.0
1.0
c    2.0
d    3.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

data[['two', 'three']]

	two	three
Ohio	1	2
Colorado	5	6
Utah	9	10
New York	13	14

data[:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

data[data['three'] > 5]

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data[data < 5] = 0
data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Selection with loc and iloc

data.loc['Colorado', ['one', 'two']]

one    0
two    5
Name: Colorado, dtype: int64

data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data.iloc[2]  # iloc selects directly by integer position

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
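
One difference between the two indexers is easy to miss: a label slice with loc includes its end point, while a positional slice with iloc excludes it, like ordinary Python slicing. A small sketch on a fresh copy of the 4x4 frame (by_label and by_pos are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label slice: 'Utah' itself is included
by_label = data.loc[:'Utah', 'two']

# Positional slice: stops before position 2
by_pos = data.iloc[:2, 1]

print(by_label)
print(by_pos)
```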

Integer Indexes

ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

ser2[-1]

2.0

ser.loc[:1]

0    0.0
1    1.0
dtype: float64

ser.iloc[:1]

0    0.0
dtype: float64

Arithmetic and Data Alignment

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
print(s1,s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64 a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1

	b	c	d
Ohio	0.0	1.0	2.0
Texas	3.0	4.0	5.0
Colorado	6.0	7.0	8.0

df2

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

df1 + df2  ## a value is computed only where both the row and column labels match; everywhere else the result is NaN

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

df1

	A
0	1
1	2

df2

	B
0	3
1	4

df2-df1

	A	B
0	NaN	NaN
1	NaN	NaN

Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))

df1

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

df1.loc[1, 'b'] = np.nan

df1

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	NaN	6.0	7.0
2	8.0	9.0	10.0	11.0

df2

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	6.0	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

df1+df2

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

df1.add(df2, fill_value=0)

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	6.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

1 / df1

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250000	NaN	0.166667	0.142857
2	0.125000	0.111111	0.100000	0.090909

df1.rdiv(1)

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250000	NaN	0.166667	0.142857
2	0.125000	0.111111	0.100000	0.090909

df1.reindex(columns=df2.columns, fill_value=0)

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	NaN	6.0	7.0
2	8.0	9.0	10.0	11.0

Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

arr[0]

array([0., 1., 2., 3.])

arr - arr[0]  # broadcasting: arr[0] is subtracted from every row

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

series = frame.loc['Utah']
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series  # the same applies to a DataFrame

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

frame + series2

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

series3 = frame['d']

frame.sub(series3, axis='index')

	b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0
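
To make the two broadcasting directions concrete, the sketch below rebuilds the frame and runs both forms side by side. By default the Series index is matched against the columns and the operation is repeated for every row; axis='index' matches on the row labels instead. The names by_row and by_col are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the row labelled 'Utah' is subtracted from every row
by_row = frame - frame.loc['Utah']

# axis='index': column 'd' is subtracted from every column
by_col = frame.sub(frame['d'], axis='index')

print(by_row)
print(by_col)
```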

Function Application and Mapping

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	2.006862	-1.308834	1.440639
Ohio	0.001529	0.026818	0.706586
Texas	-0.461218	0.365081	-0.898180
Oregon	-0.280978	0.402707	-1.396092

np.abs(frame)

	b	d	e
Utah	2.006862	1.308834	1.440639
Ohio	0.001529	0.026818	0.706586
Texas	0.461218	0.365081	0.898180
Oregon	0.280978	0.402707	1.396092

f = lambda x: x.max() - x.min()
frame.apply(f, axis='columns')  # each row or column of the DataFrame is passed to the function as a Series

Utah      3.315696
Ohio      0.705058
Texas     1.263261
Oregon    1.798799
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis='index')

	b	d	e
min	-0.461218	-1.308834	-1.396092
max	2.006862	0.402707	1.440639

format = lambda x: '%.2f' % x
frame.applymap(format)  # element-wise; since pandas 2.1 the preferred spelling is frame.map(format)

	b	d	e
Utah	2.01	-1.31	1.44
Ohio	0.00	0.03	0.71
Texas	-0.46	0.37	-0.90
Oregon	-0.28	0.40	-1.40

frame['e'].map(format)

Utah       1.44
Ohio       0.71
Texas     -0.90
Oregon    -1.40
Name: e, dtype: object

Sorting and Ranking

obj = pd.Series(np.random.randn(4), index=['d', 'a', 'b', 'c'])
obj

d   -1.066429
a    0.005021
b   -0.257605
c   -1.705094
dtype: float64

obj.sort_values()

c   -1.705094
d   -1.066429
b   -0.257605
a    0.005021
dtype: float64

obj.sort_index()

a    0.005021
b   -0.257605
c   -1.705094
d   -1.066429
dtype: float64

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

	d	a	b	c
three	0	1	2	3
one	4	5	6	7

frame.sort_index(axis=0)

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

frame.sort_index(axis=1)

	a	b	c	d
three	1	2	3	0
one	5	6	7	4

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values(ascending=False)  ## the ascending parameter chooses ascending or descending order

2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

	b	a
0	4	0
1	7	1
2	-3	0
3	2	1

frame.sort_values(by='b')  # sort by the values of one column

	b	a
2	-3	0
3	2	1
0	4	0
1	7	1

frame.sort_values(by=['a', 'b'])

	b	a
2	-3	0
0	4	0
3	2	1
1	7	1

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()  # by default, ties are broken by giving each tied value the average rank

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
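
To compare the tie-breaking rules at a glance, the sketch below ranks the same Series with each value of method and collects the results in one table. The name ranks is chosen for this example.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all in ascending order
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

In the table, 'dense' is the only method whose ranks always increase by 1 between groups, because it does not skip the positions consumed by ties.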

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

	b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5

frame.rank(axis=1)

	b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

Axis Indexes with Duplicate Labels

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique

False

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

	0	1	2
a	-1.513555	0.286993	0.982033
a	1.211395	-1.512109	1.007934
b	-0.609349	0.729770	1.106319
b	-0.427720	0.354752	0.286622

df.loc['b']  # selects all rows carrying the given label

	0	1	2
b	-0.609349	0.729770	1.106319
b	-0.427720	0.354752	0.286622

Summarizing and Computing Descriptive Statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

df.sum()

one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.idxmax()

one    b
two    d
dtype: object

df.cumsum()  ## cumulative sum, down each column by default

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

df.cumsum(axis=1)

	one	two
a	1.40	NaN
b	7.10	2.60
c	NaN	NaN
d	0.75	-0.55

df.describe()  ### awesome, really handy

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Correlation and Covariance

## skipping this part for now

Unique Values, Value Counts, and Membership

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values, sort=False)

c    3
b    2
d    1
a    3
dtype: int64

obj.isin(['b', 'c'])

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[obj.isin(['b', 'c'])]

0    c
5    b
6    b
7    c
8    c
dtype: object

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2])

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

	Qu1	Qu2	Qu3
0	1	2	1
1	3	3	5
2	4	1	2
3	3	2	4
4	4	3	4

res = data.apply(pd.Series.value_counts).fillna(0)  ## count how many times each value occurs in each column
res

	Qu1	Qu2	Qu3
1	1.0	1.0	1.0
2	0.0	2.0	1.0
3	2.0	2.0	0.0
4	2.0	0.0	2.0
5	0.0	0.0	1.0

final

Time to put this into practice

ch04_NumPy basics

Posted on 2018-12-31

import this

import numpy as np

# create an ndarray
data0 = [6,7,8.1,2]
data =np.array(data0)
data

array([6. , 7. , 8.1, 2. ])

import numpy as np
data2=[[1,2,3],[2.1,6,3]]
arr = np.array(data2)
arr.dtype

dtype('float64')

np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

np.ones((3, 2))

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

np.empty((3, 2))  ### np.empty() returns uninitialized garbage values

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

np.arange(5)

array([0, 1, 2, 3, 4])

test = np.arange(5)
test.dtype

dtype('int64')

# ones_like() takes an array and creates an array of ones with the same shape and dtype
a=[[1,2,3],[4,5,6]]
b=np.ones_like(a)
b

array([[1, 1, 1],
       [1, 1, 1]])

# create a square N x N matrix (ones on the diagonal, zeros elsewhere)
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

np.array([1, 2, 3], dtype=np.int64)

array([1, 2, 3])

# explicit type conversion
data = np.array([1,2,3],dtype=np.int64)
print(data.dtype)
print(data.astype(np.float64).dtype)

int64
float64

num_str = np.array(['1.2', '3.25', '4.1'], dtype=np.bytes_)  # np.string_ was removed in NumPy 2.0; np.bytes_ is the same type
print(num_str)
num_str.astype(float)

[b'1.2' b'3.25' b'4.1']





array([1.2 , 3.25, 4.1 ])

# arithmetic is applied element-wise
arr=np.array([[1,2,3],[4,5,6]])

arr*arr

array([[ 1,  4,  9],
       [16, 25, 36]])

# arithmetic with a scalar
arr=np.array([[1,2,3],[4,5,6]])
arr * 2

array([[ 2,  4,  6],
       [ 8, 10, 12]])

Basic indexing and slicing

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr = np.arange(10)
arr[5:7] = 99
arr

array([ 0,  1,  2,  3,  4, 99, 99,  7,  8,  9])
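
A point that follows from the slice assignment above: a NumPy slice is a view on the original array, not a copy, so writing through the slice changes the source. A minimal sketch (the names arr_slice and independent are illustrative):

```python
import numpy as np

arr = np.arange(10)

# A slice is a view: it shares memory with arr
arr_slice = arr[5:8]
arr_slice[1] = 12345  # also changes arr[6]

# An explicit copy is independent of the original
independent = arr[5:8].copy()
independent[:] = 0  # arr is unaffected

print(arr)
```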

arr0 = np.ones((5, 2))
arr0[0][1]

1.0

# a comma-separated list of indices selects a single element
arr1=np.ones((5,2))
arr1[1,1]

1.0

# higher-dimensional data
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

old_value = arr3d[0].copy()
arr3d[0] = 12

arr3d[0]

array([[12, 12, 12],
       [12, 12, 12]])

arr3d[0] = old_value

arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

arr2d[:2, 1:2]

array([[2],
       [5]])

arr2d[:, :2]

array([[1, 2],
       [4, 5],
       [7, 8]])

Boolean indexing

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'])
data = np.random.randn(6,5)
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'], dtype='<U4')

data

array([[ 1.04334945, -0.51882989,  0.39479822, -1.26167769, -0.60706667],
       [-0.10854399, -0.77654652, -0.90842022, -0.91657036, -1.57115294],
       [ 0.78047305, -0.55011782, -0.72659944, -0.78787495,  2.10762613],
       [-0.94467982,  1.4091048 ,  0.4530369 , -1.83722786, -0.14625949],
       [ 0.34030044, -1.12975372,  1.03528971,  0.8180118 ,  0.42579557],
       [-0.07116101,  0.83523538, -0.61881987, -0.5052446 ,  1.06253317]])

names == 'Bob'

array([ True, False, False,  True, False, False])

data[names == 'Bob']

array([[ 0.24865102,  0.11944466,  0.40557113, -1.24757741,  0.16418035],
       [-0.0478229 , -0.30082172, -1.18252039, -1.17703784, -0.40956047]])

data[names == 'Bob', :2]

array([[ 0.24865102,  0.11944466],
       [-0.0478229 , -0.30082172]])

demo = np.array([1,2,3,-1,-5])
demo[demo<0] = 1
demo

array([1, 2, 3, 1, 1])
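
Boolean masks can also be combined. The sketch below uses | (or) and ~ (not); the Python keywords and/or do not work element-wise on arrays. A deterministic array is swapped in for the random data so the result is easy to follow, and mask, selected and not_bob are names introduced here.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'])
data = np.arange(12).reshape((6, 2))  # deterministic stand-in for the random data

# Combine conditions with | and wrap each comparison in parentheses
mask = (names == 'Bob') | (names == 'Will')
selected = data[mask]

# Negate a condition with ~
not_bob = data[~(names == 'Bob')]

print(selected)
print(not_bob)
```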

Fancy indexing

arr = np.empty((8, 4))

for i in range(8):
    arr[i] = i 
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

# pass a list of integers or an ndarray to select rows in that order
arr[[4, 2, 0, 1]]

array([[4., 4., 4., 4.],
       [2., 2., 2., 2.],
       [0., 0., 0., 0.],
       [1., 1., 1., 1.]])

arr1 = np.arange(32).reshape((8, 4))
arr1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

arr1[[4, 2], [3, 2]]

array([19, 10])
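
Passing two index lists pairs them up element by element, so the result above is the two elements at (4, 3) and (2, 2), not a 2x2 block. A short sketch of both behaviours; `np.ix_` is one way (not the only one) to get the rectangular selection:

```python
import numpy as np

arr1 = np.arange(32).reshape((8, 4))

# Paired indices: picks elements (4, 3) and (2, 2)
pairs = arr1[[4, 2], [3, 2]]

# Rectangular block: rows 4 and 2, then columns 3 and 2 of those rows
block = arr1[[4, 2]][:, [3, 2]]

# np.ix_ builds the same block selection in a single indexing step
block_ix = arr1[np.ix_([4, 2], [3, 2])]

print(pairs)
print(block)
```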

Transposing arrays and swapping axes

import numpy as np

arr = np.arange(15).reshape((3, 5))

arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

arr = np.random.randn(6, 3)
arr

array([[-1.14899925,  2.01403377, -0.579223  ],
       [ 1.29437371, -0.37256935, -0.1998847 ],
       [ 0.88795876,  0.38322303, -0.77289001],
       [ 0.84318194,  1.57318664, -0.14691985],
       [ 0.09926862, -0.84374676,  0.47847472],
       [ 0.30721121,  0.7380255 ,  1.09155033]])

np.dot(arr.T, arr)

array([[ 4.59926208, -0.98662632, -0.02053929],
       [-0.98662632,  8.0735063 , -1.21754486],
       [-0.02053929, -1.21754486,  2.41481777]])
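
The product `np.dot(arr.T, arr)` is always a symmetric matrix, which is visible in the output above. The sketch below checks this on a small deterministic array (an assumption chosen for readability) and also shows `swapaxes`, which covers the axis-swapping half of this section's title:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# Inner product of the columns; the @ operator is equivalent to np.dot here
gram = np.dot(arr.T, arr)
same_as_operator = np.array_equal(gram, arr.T @ arr)
is_symmetric = np.array_equal(gram, gram.T)

# swapaxes exchanges two axes of a higher-dimensional array
cube = np.arange(24).reshape(2, 3, 4)
swapped = cube.swapaxes(1, 2)         # shape becomes (2, 4, 3)

print(gram)
print(swapped.shape)
```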

Universal functions: fast element-wise array functions

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

x = np.random.randn(8)
x

array([-1.52621084,  0.91491997,  1.8613378 ,  0.50723883,  0.26956039,
       -0.65576259,  0.81621241, -0.71835102])

y = np.random.randn(8)
y

array([-0.61305033, -0.99195929, -0.89955148, -0.63491395,  1.54908888,
       -1.82440893,  0.08511608, -0.60391516])

np.maximum(x, y)

array([-0.61305033,  0.91491997,  1.8613378 ,  0.50723883,  1.54908888,
       -0.65576259,  0.81621241, -0.60391516])
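
`np.maximum` is a binary ufunc: it compares two arrays position by position. It is easy to confuse with `np.max`, which reduces a single array to its largest value. A small sketch with hand-picked values (assumptions, not the random draws above):

```python
import numpy as np

x = np.array([-1.5, 0.9, 1.8, 0.5])
y = np.array([-0.6, -1.0, -0.9, 1.5])

# Element-wise: the larger of x[i] and y[i] at each position
elementwise = np.maximum(x, y)

# Reduction: the single largest value in x
largest_in_x = np.max(x)

print(elementwise)
print(largest_in_x)
```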

Data processing using arrays

Replacing explicit loops with array expressions is called vectorization.

points = np.arange(-5, 5, 0.01)  # 1000 equally spaced points

xs, ys = np.meshgrid(points, points)

xs

array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])

ys

array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])

z = np.sqrt(xs**2 + ys**2)
z

array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
        7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
        7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568]])

import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.OrRd)
# Raw string so that the backslash in \sqrt is not treated as an escape
plt.title(r'Image plot of $\sqrt{x^2+y^2}$ for a grid of values')

Text(0.5,1,'Image plot of $\\sqrt{x^2+y^2}$ for a grid of values')

[Figure: image plot of sqrt(x^2 + y^2) for a grid of values]
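
To make the meaning of vectorization concrete, the sketch below computes the same distance grid twice on a small 5x5 grid (the grid size is an assumption chosen to keep the loop short): once with a single array expression and once with explicit Python loops.

```python
import numpy as np

points = np.arange(-2, 3)             # -2, -1, 0, 1, 2
xs, ys = np.meshgrid(points, points)

# Vectorized: one expression over whole arrays
z_vec = np.sqrt(xs**2 + ys**2)

# Loop version: same result, one element at a time
z_loop = np.empty((len(points), len(points)))
for i, y in enumerate(points):        # rows follow ys
    for j, x in enumerate(points):    # columns follow xs
        z_loop[i, j] = np.sqrt(x**2 + y**2)

print(np.allclose(z_vec, z_loop))
```

On the 1000x1000 grid used above, the loop version would run a million Python-level iterations, which is why the array expression is preferred.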

Expressing conditional logic as array operations

np.where?

xarr = np.random.randn(5)
xarr

array([-0.05191135,  0.46807508,  1.5955647 , -1.21585517,  0.68848672])

yarr = np.random.randn(5)
yarr

array([-1.60333056,  2.16303939, -0.37219312, -1.85605698,  0.41180341])

np.where(xarr >= 0, xarr, yarr)  # take xarr where the condition is True, otherwise yarr

array([-1.60333056,  0.46807508,  1.5955647 , -1.85605698,  0.68848672])
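
`np.where(cond, a, b)` is the array form of the expression `a if cond else b`. A sketch comparing it with the equivalent list comprehension, using rounded stand-in values (assumptions) instead of the random draws above:

```python
import numpy as np

xarr = np.array([-0.05, 0.47, 1.60, -1.22, 0.69])
yarr = np.array([-1.60, 2.16, -0.37, -1.86, 0.41])

# Vectorized conditional
vectorized = np.where(xarr >= 0, xarr, yarr)

# Pure-Python equivalent, evaluated one pair at a time
looped = np.array([x if x >= 0 else y for x, y in zip(xarr, yarr)])

print(vectorized)
```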

# Replace every positive value with 1 and every negative value with -1
arr=np.random.randn(4,4)
arr

array([[ 0.32359596, -1.15124188,  0.12417984, -1.34511765],
       [-0.41019678,  1.0543996 ,  2.6307449 ,  0.74725061],
       [ 1.03418855, -0.58064793, -0.61019497, -1.13773196],
       [-0.64005234,  0.73911588,  1.15966556, -0.26103626]])

np.where(arr > 0, 1, -1)

array([[ 1, -1,  1, -1],
       [-1,  1,  1,  1],
       [ 1, -1, -1, -1],
       [-1,  1,  1, -1]])
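
The second and third arguments of `np.where` can each be a scalar or an array, so it is also possible to replace only part of the data. The sketch below uses a small hand-written array (an assumption) that includes a zero, because `arr > 0` sends zero to the "else" branch while `np.sign` keeps it at zero:

```python
import numpy as np

arr = np.array([[0.3, -1.2],
                [0.0, 2.6]])

# Scalars on both sides: positive -> 1, everything else (including 0) -> -1
signs = np.where(arr > 0, 1, -1)

# Scalar and array mixed: positive -> 1, other values kept as they are
only_positive_replaced = np.where(arr > 0, 1, arr)

# np.sign treats zero as its own case
sign_values = np.sign(arr)

print(signs)
print(only_positive_replaced)
print(sign_values)
```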

Mathematical and statistical methods

arr = np.random.randn(5, 4)
arr

array([[ 0.3794937 , -0.91051976,  0.54977469,  0.98390242],
       [ 1.24989257, -0.14989659, -0.70528342,  0.66344849],
       [ 0.15440786,  0.75716823, -1.54809025,  0.05263153],
       [ 0.63369665, -1.47415409, -1.35897948, -0.24638285],
       [ 0.36552553,  1.44667304,  1.80073603,  0.70854674]])

arr.mean()

0.1676295519907499

np.mean(arr)

0.1676295519907499

arr.sum()

3.352591039814998

arr.sum(axis=1)  # accepts an axis argument; axis=1 sums across each row

array([ 1.00265105,  1.05816104, -0.58388263, -2.44581977,  4.32148134])

arr[0]

array([ 0.3794937 , -0.91051976,  0.54977469,  0.98390242])

arr.sum(0)

array([ 2.7830163 , -0.33072917, -1.26184243,  2.16214634])

np.sum?

arr = np.arange(9).reshape((3, 3))
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
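
With this small integer array the effect of the `axis` argument can be checked by hand. A sketch, assuming the 3x3 array was meant as input for axis-wise reductions and cumulative methods such as `cumsum` and `cumprod` (the source does not say which):

```python
import numpy as np

arr = np.arange(9).reshape((3, 3))

col_sums = arr.sum(axis=0)        # collapse the rows: one sum per column
row_sums = arr.sum(axis=1)        # collapse the columns: one sum per row

# Cumulative methods keep the shape and accumulate along the given axis
running_sum = arr.cumsum(axis=0)  # running total down each column
running_prod = arr.cumprod(axis=1)  # running product along each row

print(col_sums, row_sums)
print(running_sum)
print(running_prod)
```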

Methods for Boolean arrays

# In sum() and similar reductions, Boolean values are coerced to 0 and 1, so sum() can count the True values
arr=np.random.randn(10)
arr

array([ 0.62662142, -0.23714492,  2.52986602,  0.66838534,  0.47275484,
        1.81467714, -0.39454002, -2.59347451,  0.90815739, -0.0813537 ])

arr > 0

array([ True, False,  True,  True,  True,  True, False, False,  True,
       False])

(arr > 0).sum()

(arr < 0).any()  # True if at least one element is True

True

(arr > 0).all()  # True only if every element is True

False
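
Because True counts as 1, `sum()` on a mask gives a count and `mean()` gives a proportion. A minimal sketch with hand-picked values (assumptions, not the random draws above):

```python
import numpy as np

arr = np.array([0.6, -0.2, 2.5, 0.7, -0.4])
positive = arr > 0

n_positive = positive.sum()             # number of True values
share_positive = positive.mean()        # fraction of True values
n_alt = np.count_nonzero(positive)      # same count via a dedicated function

print(n_positive, share_positive)
```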

Sorting

arr = np.random.randn(10)
arr

array([ 0.23588362, -0.45045835,  1.22450303, -0.2419639 , -0.23873288,
       -1.09141889, -0.87760038, -0.53059957,  0.15428331, -1.43959318])

arr.sort()
arr

array([-1.43959318, -1.09141889, -0.87760038, -0.53059957, -0.45045835,
       -0.2419639 , -0.23873288,  0.15428331,  0.23588362,  1.22450303])

# Sorting can be done along any axis
arr=np.random.randn(5,5)
arr

array([[ 0.34871828,  1.03879317,  0.21363644,  0.05765405,  1.01230602],
       [ 0.20640237, -0.2323433 ,  0.2214327 ,  1.16611884,  0.5123435 ],
       [ 0.4660787 , -0.16572832,  0.03096976,  1.07155177, -1.90712269],
       [-0.45824044, -0.25984925, -1.37214123,  1.14006713, -0.70677386],
       [-2.51549148,  0.1314714 ,  1.68439925, -0.92174553,  1.03215197]])

arr.sort?

arr.sort(axis=1)
arr

array([[ 0.05765405,  0.21363644,  0.34871828,  1.01230602,  1.03879317],
       [-0.2323433 ,  0.20640237,  0.2214327 ,  0.5123435 ,  1.16611884],
       [-1.90712269, -0.16572832,  0.03096976,  0.4660787 ,  1.07155177],
       [-1.37214123, -0.70677386, -0.45824044, -0.25984925,  1.14006713],
       [-2.51549148, -0.92174553,  0.1314714 ,  1.03215197,  1.68439925]])

arr.sort(1)
arr

array([[ 0.05765405,  0.21363644,  0.34871828,  1.01230602,  1.03879317],
       [-0.2323433 ,  0.20640237,  0.2214327 ,  0.5123435 ,  1.16611884],
       [-1.90712269, -0.16572832,  0.03096976,  0.4660787 ,  1.07155177],
       [-1.37214123, -0.70677386, -0.45824044, -0.25984925,  1.14006713],
       [-2.51549148, -0.92174553,  0.1314714 ,  1.03215197,  1.68439925]])
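
The `arr.sort()` method sorts in place and returns `None`, which is why the second call above changes nothing: the rows were already sorted. When the original order must be kept, `np.sort` returns a sorted copy. A sketch on small hand-written arrays (assumptions):

```python
import numpy as np

arr = np.array([3, 1, 2])

sorted_copy = np.sort(arr)        # new array; arr itself is not modified
unchanged = arr.tolist()          # snapshot taken before the in-place sort

result = arr.sort()               # in place; the return value is None

grid = np.array([[3, 1],
                 [0, 2]])
by_column = np.sort(grid, axis=0)  # sort down each column
by_row = np.sort(grid, axis=1)     # sort along each row

descending = np.sort(arr)[::-1]    # reverse a sorted copy for descending order

print(sorted_copy, result)
print(by_column)
print(by_row)
```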

Unique values

arr = np.array([3, 3, 2, 1, 1, 54, 223, 3, 2, 3])

np.unique(arr)  # find the unique values in the array and return them sorted

array([  1,   2,   3,  54, 223])
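
`np.unique` can also report how often each value occurs, and `np.isin` tests membership element by element. A short sketch on the same array:

```python
import numpy as np

arr = np.array([3, 3, 2, 1, 1, 54, 223, 3, 2, 3])

# return_counts=True adds a second array with the frequency of each value
values, counts = np.unique(arr, return_counts=True)

# The pure-Python equivalent of the unique values
python_unique = sorted(set(arr.tolist()))

# Element-wise membership test against a list of candidates
is_two_or_three = np.isin(arr, [2, 3])

print(values)
print(counts)
```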

Linear algebra

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y
np.dot(x,y)

array([[ 28.,  64.],
       [ 67., 181.]])

np.dot(x, np.ones(3))

array([ 6., 15.])

arr = np.random.normal(size=(4, 4))
arr

array([[ 1.11433974, -0.2520489 , -0.2349691 , -0.94610534],
       [ 2.28170964,  0.78521532, -2.05844323, -0.40333454],
       [-0.1225117 , -0.9144343 ,  0.25932307,  0.283972  ],
       [-0.63086567, -1.17039446, -0.20103388, -0.21096491]])
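
The notes stop after creating the random matrix. One plausible next step, shown here only as an assumption about where the example was heading, is to use `numpy.linalg` on a square matrix. The sketch uses a small fixed matrix so the results can be verified by hand:

```python
import numpy as np
from numpy.linalg import inv, det

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])

# For 2-D arrays the @ operator gives the same result as np.dot
same = np.allclose(x @ y, np.dot(x, y))

# Assumption: a fixed invertible matrix instead of the random one above
mat = np.array([[4., 7.],
                [2., 6.]])
mat_inv = inv(mat)

# A matrix times its inverse should be the identity, up to rounding
identity = mat @ mat_inv

print(same)
print(np.round(identity, 10))
print(det(mat))
```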

Random walk

nstep = 100
draws = np.random.randint(0,2,size=nstep)
draws

array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

steps = np.where(draws > 0, 1, -1)
steps

array([ 1,  1,  1,  1, -1, -1, -1, -1,  1,  1, -1, -1, -1,  1,  1, -1, -1,
       -1, -1,  1,  1,  1, -1,  1, -1, -1, -1, -1,  1,  1, -1,  1, -1,  1,
        1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,
        1, -1, -1, -1,  1, -1, -1,  1, -1,  1, -1, -1, -1, -1,  1, -1,  1,
        1,  1,  1,  1,  1, -1,  1, -1, -1, -1,  1,  1, -1, -1, -1,  1,  1,
       -1,  1,  1,  1,  1, -1,  1, -1, -1,  1,  1, -1,  1, -1,  1])

walk = steps.cumsum()

walk.min()

-19

walk.max()

(np.abs(walk) >= 5).argmax()
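
The last expression finds the first step at which the walk is at least 5 away from the origin: `argmax` on a Boolean array returns the index of the first True. The sketch below puts the whole walk together. The seeded `default_rng` generator is an assumption added for reproducibility (the notes use the older `np.random.randint`), and the `any()` guard is there because `argmax` returns 0 when no element is True.

```python
import numpy as np

rng = np.random.default_rng(0)        # assumption: fixed seed for repeatable output
nsteps = 100

draws = rng.integers(0, 2, size=nsteps)   # 0 or 1 with equal probability
steps = np.where(draws > 0, 1, -1)        # map to +1 / -1 steps
walk = steps.cumsum()                     # position after each step

crossed = np.abs(walk) >= 5
if crossed.any():
    first_crossing = crossed.argmax()     # index of the first True
    print("first reached distance 5 at step", first_crossing)
else:
    first_crossing = None                 # argmax alone would misleadingly give 0
    print("never reached distance 5")

print(walk.min(), walk.max())
```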
