| Yang |



seaborn-liner

Posted on 2019-01-10

Visualizing linear relationships

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

tips = sns.load_dataset("tips")

Plotting linear regressions

Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, regplot() and lmplot(), are closely related and share much of their core functionality. It is important to understand the ways they differ, however, so that you can quickly choose the correct tool for a particular job.

# Start with the simplest example
# The default confidence interval is 95%

sns.regplot(x='total_bill',y='tip',data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x110f64d68>

png

sns.lmplot(x='total_bill',y='tip',data=tips)
<seaborn.axisgrid.FacetGrid at 0x110f15780>

png

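regplot() accepts x and y in a variety of formats, including simple numpy arrays, pandas Series objects, or references to variables in a DataFrame passed to data. lmplot(), by contrast, requires data, with x and y given as strings naming columns of long-form ("tidy") data. Apart from this input flexibility, regplot() offers only a subset of lmplot()'s features.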

sns.lmplot(x='size',y='tip',data=tips)
<seaborn.axisgrid.FacetGrid at 0x112574400>

png

Fitting different kinds of models

anscombe = sns.load_dataset("anscombe")
anscombe.head()
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
sns.lmplot(x='x',y='y',data=anscombe.query("dataset == 'I'"),ci=None,scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x1a1e166978>

png

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
ci=None, scatter_kws={"s": 80})

png

# The previous fit is clearly inadequate. We can guess that this is a polynomial relationship and use the order parameter, which calls numpy.polyfit()
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),order = 2 ,
ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x1a1e3b3cc0>

png
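To see what order=2 buys us without drawing anything, here is a minimal sketch that calls numpy.polyfit() directly on the dataset II values. The numbers are typed in from the standard Anscombe quartet rather than loaded with sns.load_dataset(), and fit_rss is a hypothetical helper name used only for this illustration.

```python
import numpy as np

# Anscombe's quartet, dataset II (values typed in by hand for this sketch).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])


def fit_rss(x, y, order):
    """Fit a polynomial of the given order and return (coefficients, residual sum of squares)."""
    coefs = np.polyfit(x, y, order)
    residuals = y - np.polyval(coefs, x)
    return coefs, float(np.sum(residuals ** 2))


linear_coefs, linear_rss = fit_rss(x, y, 1)
quad_coefs, quad_rss = fit_rss(x, y, 2)

# The quadratic fit leaves almost no residual, the straight line does not.
print("order=1 RSS:", linear_rss)
print("order=2 RSS:", quad_rss)
```

The residual sum of squares drops by several orders of magnitude when moving from order 1 to order 2, which matches what the two plots suggest visually.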

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
ci=None, scatter_kws={"s": 80});

## An outlier is present

png

# Use a different loss function to downweight relatively large residuals (robust=True requires statsmodels)
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
robust=True, ci=None, scatter_kws={"s": 80});

png

seaborn-distribution

Posted on 2019-01-10

Visualizing the distribution of a dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

Univariate distributions

The distplot() function draws a histogram and fits a kernel density estimate (KDE).

x=np.random.normal(size=100)
sns.distplot(x)

# Turn off the KDE
sns.distplot(x,kde=False,rug=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a21577f60>

png

# Seaborn guesses a default number of bins, but it is usually better to set it ourselves
sns.distplot(x,bins=20,kde=False,rug=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a215fdbe0>

png

Kernel density estimation

The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plots encode the density of observations on one axis with height along the other axis:

Put simply, it shows the density of the observations.
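To make "density" a little more concrete, here is a rough sketch of a Gaussian KDE written with plain numpy: every observation contributes a small bell curve, and the curves are averaged. The function name gaussian_kde_1d and the normal-reference bandwidth rule are assumptions chosen for this illustration; seaborn may choose its bandwidth differently.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One standardized distance per (grid point, sample) pair.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to 1.
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


rng = np.random.RandomState(0)
x = rng.normal(size=100)
grid = np.linspace(-6, 6, 601)

# Normal-reference rule of thumb (an assumption of this sketch).
bandwidth = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)
density = gaussian_kde_1d(x, grid, bandwidth)

# A density should integrate to roughly 1 over the grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("bandwidth:", bandwidth, "area:", area)
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which is the same trade-off as choosing the number of histogram bins.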

sns.distplot(x,hist=False,rug=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a216fada0>

png

sns.kdeplot(x,shade=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2181cd68>

png

Bivariate distributions

It can also be useful to visualize a bivariate distribution of two variables. The easiest way to do this in seaborn is to just use the jointplot() function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables along with the univariate (or marginal) distribution of each on separate axes.

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()
x y
0 1.620467 2.511505
1 -0.529253 0.247477
2 -1.361914 0.225665
3 -1.188358 0.785273
4 1.158663 0.180673
sns.jointplot(x='x',y='y',data=df)
<seaborn.axisgrid.JointGrid at 0x1a21b3a668>

png

x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k")

png

Kernel density estimation

This is much the same as the kernel density estimate for a univariate distribution.

sns.jointplot(x='x',y='y',kind='kde',data=df)
<seaborn.axisgrid.JointGrid at 0x1a21c534a8>

png

Visualizing pairwise relationships in a dataset

I did not fully follow this part...

iris = sns.load_dataset("iris")
sns.pairplot(iris);

png

g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

png

seaborn-categorical

Posted on 2019-01-10

Visualizing categorical data

In seaborn, there are several different ways to visualize a relationship involving categorical data. Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two ways to make these plots. There are a number of axes-level functions for plotting categorical data in different ways and a figure-level interface, catplot(), that gives unified higher-level access to them.

catplot()

Categorical scatterplots:

  • stripplot() (with kind="strip"; the default)
  • swarmplot() (with kind="swarm")

Categorical distribution plots:

  • boxplot() (with kind="box")
  • violinplot() (with kind="violin")
  • boxenplot() (with kind="boxen")

Categorical estimate plots:

  • pointplot() (with kind="point")
  • barplot() (with kind="bar")
  • countplot() (with kind="count")

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

Categorical scatterplots

tips = sns.load_dataset("tips")
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
sns.catplot(x='day',y='total_bill',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a156aa940>

png

#The jitter parameter controls the magnitude of jitter or disables it altogether:
sns.catplot(x="day", y="total_bill", jitter=False, data=tips);

png

Beeswarm, i.e. swarmplot()

sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);

png

# The hue parameter is also supported for grouping, but style is not
sns.catplot(x="day", y="total_bill", kind="swarm",hue='sex',data=tips);

png

# The order parameter sets the order of the categories along the axis
sns.catplot(x='smoker',y='tip',order=['Yes','No'],data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a2196da58>

png

# The x and y axes are interchangeable; here is another way to present the same data
sns.catplot(x='tip',y='smoker',order=['Yes','No'],data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a21ae6f28>

png

Categorical distribution plots

Box plots

The first is the familiar boxplot(). This kind of plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and then observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data.

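The "whiskers" extend to the points within 1.5 IQRs (the IQR is the distance between the first and third quartiles), and observations outside that range are drawn as individual points.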
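The whisker rule is easy to check by hand. The following is a minimal sketch with numpy, assuming the usual 1.5 × IQR convention described above; box_stats is a hypothetical helper, and the plotting library may compute quartiles with a slightly different interpolation.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Return quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme observations inside the fences.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": np.sort(outliers),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Note that the whiskers stop at actual observations (here 1 and 9), while 30 lies beyond the upper fence and is reported separately.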

sns.catplot(x='day',y='total_bill',kind='box',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a21bafda0>

png

sns.catplot(x='day',y='total_bill',kind='box',hue='sex',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a21975d30>

png

A related function, boxenplot(), draws a plot that is similar to a box plot but optimized for showing more information about the shape of the distribution. It is best suited for larger datasets:

sns.catplot(x='day',y='total_bill',kind='boxen',hue='sex',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a21ba7128>

png

Violinplots

A different approach is violinplot(), which combines a boxplot with the kernel density estimation procedure described in the distributions tutorial.

The kernel density estimation procedure?

sns.catplot(x='total_bill',y='day',kind='violin',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a220c1cf8>

png

sns.catplot(x="day", y="total_bill", hue="sex",
kind="violin", split=True, data=tips);

png

Categorical estimate plots

For other applications, rather than showing the distribution within each category, you might want to show an estimate of the central tendency of the values. Seaborn has two main ways to show this information. Importantly, the basic API for these functions is identical to that for the ones discussed above.

Bar plots

titanic = sns.load_dataset("titanic")
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
sns.catplot(x='sex',y='survived',kind='bar',hue='class',data=titanic)
<seaborn.axisgrid.FacetGrid at 0x110b9ecf8>

png

sns.catplot(x='deck',kind='count',palette='ch:.25',data=titanic)
<seaborn.axisgrid.FacetGrid at 0x1a23085cc0>

png

Point plots

An alternative style for visualizing the same information is offered by the pointplot() function. This function also encodes the value of the estimate with height on the other axis, but rather than showing a full bar, it plots the point estimate and confidence interval. Additionally, pointplot() connects points from the same hue category. This makes it easy to see how the main relationship is changing as a function of the hue semantic, because your eyes are quite good at picking up on differences of slopes:

sns.catplot(x='sex',y='survived',kind='point',hue='class',data=titanic)
<seaborn.axisgrid.FacetGrid at 0x1a230770b8>

png

# The plot can of course be styled further with palette, markers, linestyles, and similar parameters
sns.catplot(x='class',y='survived',kind='point',hue='sex',data=titanic
,palette={'male':'g','female':'m'}
,markers=['^','o'],linestyles=['-','--'])
<seaborn.axisgrid.FacetGrid at 0x1a23683278>

png

Plotting "wide-form" data

While using "long-form" or "tidy" data is preferred, these functions can also be applied to "wide-form" data in a variety of formats, including pandas DataFrames or two-dimensional numpy arrays. These objects should be passed directly to the data parameter:
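For comparison, here is a small sketch of how a wide-form frame can be turned into long-form with pandas melt(). The three rows are illustrative values in the style of the iris columns, and the column names measurement and value are names chosen for this example.

```python
import pandas as pd

# Wide-form: one column per measured variable.
wide = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
})

# Long-form: one row per observation, with the variable name stored in a column.
long = wide.melt(var_name="measurement", value_name="value")
print(long)
```

With the long-form frame you would pass x="measurement" and y="value" explicitly, whereas the wide-form frame goes straight into data.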

iris = sns.load_dataset("iris")
sns.catplot(data=iris, orient="h", kind="box")
<seaborn.axisgrid.FacetGrid at 0x1a23843da0>

png

Multi-panel categorical plots

sns.catplot(x="day", y="total_bill", hue="smoker",
col="time", aspect=.6,
kind="swarm", data=tips);

png

g = sns.catplot(x="fare", y="survived", row="class",
kind="box", orient="h", height=1.5, aspect=4,
data=titanic.query("fare > 0"))
g.set(xscale="log");

png

A few additions

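seaborn.catplot(x=None, y=None, hue=None, data=None, row=None, col=None, col_wrap=None, estimator=<function mean>, ci=95, n_boot=1000, units=None, order=None, hue_order=None, row_order=None, col_order=None, kind='strip', height=5, aspect=1, orient=None, color=None, palette=None, legend=True, legend_out=True, sharex=True, sharey=True, margin_titles=False, facet_kws=None)

  • x, y: variable names
  • data: the dataset
  • row, col: categorical variables that determine how the panels are faceted
  • col_wrap: the maximum number of panels shown in one row
  • estimator: the function that maps a vector to a scalar within each category
  • ci: the confidence interval
  • n_boot: the number of bootstrap iterations used when computing the confidence interval
    ..
  • order, hue_order: the order of the categories
  • row_order, col_order: the order of the facet rows and columns
  • kind: the kind of plot to draw ("point", "bar", "strip", "swarm", "box", "violin", or "boxen")
  • height: the height of each panel
  • aspect: the aspect ratio
  • orient: the orientation
  • color: the color
  • palette: the color palette
  • legend: the legend panel for hue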
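The estimator, ci, and n_boot parameters fit together roughly as follows. This is a sketch of a percentile bootstrap written with numpy, not seaborn's actual implementation; bootstrap_ci and its seed argument are hypothetical names used only here.

```python
import numpy as np


def bootstrap_ci(values, estimator=np.mean, ci=95, n_boot=1000, seed=0):
    """Return (point estimate, lower bound, upper bound) from a percentile bootstrap."""
    values = np.asarray(values, dtype=float)
    rng = np.random.RandomState(seed)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.randint(0, len(values), size=(n_boot, len(values)))
    boot_estimates = np.array([estimator(values[row]) for row in idx])
    half = (100 - ci) / 2
    low, high = np.percentile(boot_estimates, [half, 100 - half])
    return estimator(values), low, high


sample = np.random.RandomState(1).normal(loc=10, scale=2, size=200)
point, low, high = bootstrap_ci(sample)
print(point, low, high)
```

The estimator turns each resample (a vector) into one number (a scalar), n_boot is how many resamples are drawn, and ci decides which percentiles of those numbers become the interval. This also shows why the interval gets expensive for large datasets: the estimator runs n_boot times.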

Seaborn-relationships

Posted on 2019-01-09

Preface

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The content comes from the official Seaborn tutorial.

Visualizing statistical relationships

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="darkgrid")
  • relplot()
  • lineplot()
  • scatterplot()

Relating variables with scatter plots

tips = sns.load_dataset("tips")   # load_dataset() loads a dataset from an online repository, which makes practicing much easier
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
sns.relplot(x="total_bill", y="tip", data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a1ff4a4e0>

png

sns.relplot(x="total_bill", y="tip",hue='size', data=tips) # the hue parameter groups the data by the given variable and draws each group in a different color
<seaborn.axisgrid.FacetGrid at 0x1a1fbbdc88>

png

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,100),data=tips)  # size and sizes are usually used together
<seaborn.axisgrid.FacetGrid at 0x1a203be5f8>

png

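The grouped plots below look very useful: the col parameter splits the data into columns of panels, and the row parameter splits it into rows.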

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,50),col='time',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a205eb1d0>

png

sns.relplot(x="total_bill", y="tip",hue='size',size='size',sizes=(15,50),col='time',row='sex',data=tips)
<seaborn.axisgrid.FacetGrid at 0x1a209b8ef0>

png

sns.relplot(x="total_bill", y="tip",hue='time',col ='time',palette = ['b','r'],data=tips) # specify the plot colors
<seaborn.axisgrid.FacetGrid at 0x1a21a2b390>

png

Emphasizing continuity with line plots

# Many of the parameters used with lineplot() match those of relplot() (hue, size, col, and so on), so refer to the examples above

df = pd.DataFrame(dict(time=np.arange(500),value=np.random.randn(500).cumsum()))
sns.relplot(x='time',y='value',kind='line',data=df)
<seaborn.axisgrid.FacetGrid at 0x1a22642c88>

png

sns.lineplot(x='time',y='value',data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x1a22879c18>

png

Aggregation and representing uncertainty

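If there are multiple measurements for the same value of the x variable, seaborn by default aggregates them by plotting the mean and the 95% confidence interval around the mean.

The confidence intervals are computed by bootstrapping, which can be time-intensive for larger datasets, so it is possible to disable them (ci=None).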
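The aggregation itself can be imitated with a pandas groupby. The sketch below uses synthetic data (the column names timepoint and signal only mirror the fmri dataset) and computes the mean plus a one-standard-deviation band per x value, which is roughly what ci='sd' draws; it is an illustration rather than seaborn's own code path.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Synthetic data: 20 repeated measurements at each of 5 x values.
df = pd.DataFrame({
    "timepoint": np.repeat(np.arange(5), 20),
    "signal": rng.normal(size=100),
})

# One row per x value: the line follows "mean", the band spans sd_low..sd_high.
summary = df.groupby("timepoint")["signal"].agg(["mean", "std", "count"])
summary["sd_low"] = summary["mean"] - summary["std"]
summary["sd_high"] = summary["mean"] + summary["std"]
print(summary)
```

A standard-deviation band like this is cheap to compute, whereas the default bootstrapped interval repeats the estimate many times for every x value.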

fmri = sns.load_dataset("fmri")
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)
<seaborn.axisgrid.FacetGrid at 0x1a229c42b0>

png

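# Especially with larger data, you can represent the spread of the distribution at each timepoint by plotting the standard deviation instead of a confidence interval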
sns.relplot(x="timepoint", y="signal", kind="line",ci='sd',data=fmri)
<seaborn.axisgrid.FacetGrid at 0x1a22b6ce10>

png

Plotting subsets of data with semantic mappings

fmri.head()
subject timepoint event region signal
0 s13 18 stim parietal -0.017552
1 s5 14 stim parietal -0.080883
2 s12 18 stim parietal -0.081033
3 s11 18 stim parietal -0.046134
4 s10 18 stim parietal -0.037970
sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=fmri);

png

sns.lineplot(x='timepoint',y='signal',hue='region',style='event',data=fmri)
<matplotlib.axes._subplots.AxesSubplot at 0x1a23077320>

png

# Mark the subsets
sns.relplot(x="timepoint", y="signal", hue="region", style="event",
dashes=False, markers=True, kind="line", data=fmri);

png

You can also plot each sampling unit separately without distinguishing the units through semantics. This avoids cluttering the legend.

sns.relplot(x="timepoint", y="signal", hue="region",
units="subject", estimator=None,
kind="line", data=fmri.query("event == 'stim'"))
<seaborn.axisgrid.FacetGrid at 0x1a234eef98>

png

The default colormap and the handling of the legend in lineplot() also depend on whether the hue semantic is categorical or numeric.

dots = sns.load_dataset("dots").query("align == 'dots'")
sns.relplot(x="time", y="firing_rate",
hue="coherence", style="choice",
kind="line", data=dots)
<seaborn.axisgrid.FacetGrid at 0x1a2370e7b8>

png

Plotting with date data

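Line plots are often used to visualize data associated with real dates and times. These functions pass the data down in its original format to the underlying matplotlib functions, so they can take advantage of matplotlib's ability to format dates in tick labels.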

df = pd.DataFrame(dict(time=pd.date_range("2017-1-1", periods=500),
value=np.random.randn(500).cumsum()))
g = sns.relplot(x="time", y="value", kind="line", data=df)
g.fig.autofmt_xdate()

png

Showing multiple relationships with facets

sns.relplot(x="timepoint", y="signal", hue="subject",
col="region", row="event", height=3,
kind="line", estimator=None, data=fmri);

png

sns.relplot(x="timepoint", y="signal", hue="event", style="event",
col="subject", col_wrap=5,
height=3, aspect=.75, linewidth=2.5,
kind="line", data=fmri.query("region == 'frontal'"));

png

ch08-Data visualization

Posted on 2019-01-09

About data visualization

import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
import matplotlib
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

The simplest example

plt.plot(np.arange(10))
[<matplotlib.lines.Line2D at 0x120132198>]

png

Figure and Subplot

fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)

png

plt.plot(np.random.randn(50).cumsum(),'k--')
[<matplotlib.lines.Line2D at 0x120b7ad68>]

png

_ = ax1.hist(np.random.randn(100),bins=20,color='k',alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
<matplotlib.collections.PathCollection at 0x1203c5208>
fig

png

fig, axes = plt.subplots(2, 3)
axes
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x120a9bcc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120f54c18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120f7b2e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x120fa1438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120fc9b00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x120ffc198>]],
      dtype=object)

png

Adjusting the spacing around subplots

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i,j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0.1,hspace=0.1)

png

Colors, markers, and line styles

plt.figure()
<Figure size 432x288 with 0 Axes>




from numpy.random import randn
plt.plot(randn(50).cumsum(),linestyle='--',color='b')
[<matplotlib.lines.Line2D at 0x1213f3a90>]

png

plt.plot(randn(30).cumsum(), color='k', linestyle='dashed', marker='o')
[<matplotlib.lines.Line2D at 0x1213f8fd0>]

png

plt.plot(randn(30).cumsum(),'ko--')   # 'ko--' packs the arguments into one string: color='k', marker='o', linestyle='--'
[<matplotlib.lines.Line2D at 0x1214baf60>]

png
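A quick way to convince yourself that the format string and the keyword arguments are equivalent is to compare the resulting line objects. This is a small sketch; the Agg backend is selected only so that it runs without a display, and line_props is a hypothetical helper for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

data = np.random.RandomState(0).randn(30).cumsum()

fig, ax = plt.subplots()
# Same styling requested in two ways: format string vs. explicit keywords.
(short_form,) = ax.plot(data, "ko--")
(explicit,) = ax.plot(data, color="k", marker="o", linestyle="--")


def line_props(line):
    """Collect the style properties that the format string encodes."""
    return (line.get_color(), line.get_marker(), line.get_linestyle())


print(line_props(short_form), line_props(explicit))
plt.close(fig)
```

The format string is handy for quick plots, while the keyword form is easier to read and extends to options the string cannot express.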

plt.close('all')
data = np.random.randn(30).cumsum()
plt.plot(data, 'k--', label='Default')
plt.plot(data, 'k-', drawstyle='steps-post', label='steps-post')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1216ab9e8>

png

Ticks, labels, and legends

Setting the details

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(randn(1000).cumsum())
[<matplotlib.lines.Line2D at 0x1226302b0>]

png

ticks  = ax.set_xticks([0,250,500,750,1000])
labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'],
rotation=30, fontsize='small')
ax.set_title('My First Title of Matplotlib')
Text(0.5,1,'My First Title of Matplotlib')
ax.set_xlabel('Stage')
Text(0.5,3.2,'Stage')
fig

png

Adding a legend

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(randn(100).cumsum(), 'k', label='one')
ax.plot(randn(100).cumsum(), 'k--', label='two')
ax.plot(randn(100).cumsum(), 'k.', label='three')
[<matplotlib.lines.Line2D at 0x1223a1940>]

png

ax.legend(loc='best')
fig

png

Annotations and drawing on a subplot

from datetime import datetime

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

data = pd.read_csv('examples/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']

spx.plot(ax=ax, style='k-')

crisis_data = [
(datetime(2007, 10, 11), 'Peak of bull market'),
(datetime(2008, 3, 12), 'Bear Stearns Fails'),
(datetime(2008, 9, 15), 'Lehman Bankruptcy')
]

for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 75),
                xytext=(date, spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')

# Zoom in on 2007-2010
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])

ax.set_title('Important dates in the 2008-2009 financial crisis')
Text(0.5,1,'Important dates in the 2008-2009 financial crisis')

png

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]],
color='g', alpha=0.5)

ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)
<matplotlib.patches.Polygon at 0x123523eb8>

png

Saving plots to file

fig.savefig('/Users/zhangyangfenbi.com/Desktop/demo.png')

Closing notes

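matplotlib is really a fairly low-level tool: plots are assembled piece by piece. The book goes on to introduce the plotting built into pandas, but since I have already covered Seaborn I will skip the pandas part and fill in the remaining Seaborn gaps later.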

Reading The Great Gatsby

Posted on 2019-01-08

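Preface

I remember that when I read Haruki Murakami's Norwegian Wood last year, Nagasawa says to Watanabe in one of their conversations:

Anyone who has read The Great Gatsby three times sounds like someone who could be a friend of mine.

At the time I wondered what kind of book could be that impressive and told myself I would find time to read it. I finished it tonight, so here is a short note, even though an ordinary young man like me did not get any great sparks out of it...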

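Synopsis

The content comes from Wikipedia:

The main events of the novel take place in the summer of 1922. Nick Carraway, a Yale graduate and World War I veteran (and the novel's narrator), comes from the Midwest to New York to make a living selling bonds. He rents a small house in West Egg, Long Island, next door to Gatsby. Jay Gatsby is a young, mysterious millionaire who often throws lavish parties yet rarely shows himself; many people come to his house to eat and drink, but he remains a lonely man. Nick drives to East Egg to visit his cousin Daisy Fay Buchanan, whose husband, Tom Buchanan, was also Nick's college classmate. They introduce Nick to Miss Jordan Baker, a charming but somewhat selfish young golfer; Nick believes he has fallen in love with her. She tells Nick that Tom has a mistress, Myrtle Wilson, who lives in the "valley of ashes", an industrial dumping ground between West Egg and New York City. Soon afterwards, Nick goes with Tom and Myrtle to the apartment where they meet, and a dissolute party follows. Myrtle brings up Daisy's name several times, and Tom, in a rage, breaks Myrtle's nose.

One day that summer, Nick receives an invitation to one of Gatsby's parties. At the party he runs into Jordan Baker and finally meets Gatsby, discovering that Gatsby served in the same division as he did during the war. Nick learns from Jordan that Gatsby fell in love with Daisy in 1917, but because he had to leave for the army, Daisy eventually married Tom Buchanan. After being discharged, he made a great deal of money and bought a mansion on Long Island, from which he gazes across the bay at Daisy's home, hoping to rekindle the relationship. Gatsby's extravagant lifestyle and wild parties are only meant to attract Daisy and win her back. Gatsby asks Nick to arrange a meeting with Daisy. Nick invites Daisy to his house for tea without telling her that Gatsby will be there. After an awkward meeting, Gatsby and Daisy rekindle their old feelings. They are together again, but Tom soon grows suspicious. At a dinner, Daisy speaks to Gatsby so tenderly and openly that Tom's suspicions are confirmed. Although Tom has a mistress of his own, he is furious at his wife's infidelity. Tom pushes everyone to drive to New York City, where he confronts Gatsby in a suite at the Plaza Hotel, telling him that the history between himself and Daisy is something Gatsby could never understand. He also reveals that Gatsby made his fortune by bootlegging and other shady dealings. Daisy feels she cannot bear it and only wants to leave, and Tom tells Gatsby to drive her home.

As Nick, Jordan, and Tom drive home through the "valley of ashes", they find that Tom's mistress Myrtle has been struck and killed by Gatsby's car. Afterwards, Nick learns from Gatsby that Daisy was driving, but Gatsby is unwilling to expose the woman he loves. Myrtle's husband, George, mistakenly believes that the owner of the car is his wife's lover and sets out to find him. Tom misleads George, who, after learning that the car belongs to Gatsby, goes to the mansion, shoots him, and then kills himself. Nick arranges Gatsby's funeral, ends his relationship with Jordan, and, disillusioned with the Eastern way of life, returns to his home in the Midwest.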

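A few impressions

The deeper value or meaning of this book is not for me to judge, but I found the story gripping, and it lingers afterwards.

Here are a few points:

It fits China too

This story could be set entirely in present-day China... a small-time rich kid from a second- or third-tier city watching a story unfold among a nouveau riche from who knows where, a first-generation tycoon from a first-tier city, and a beautiful, wealthy young woman.

About love, about ideals, about class...

Beginning and end

In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
"Whenever you feel like criticizing any one," he told me, "just remember that all the people in this world haven't had the advantages that you've had."

Don't judge.

So we beat on, boats against the current, borne back ceaselessly into the past.

That era, this era: will this world get better?

It will!

Setting a goal

I will read the English edition when I get the chance...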

ch07-Data wrangling

Posted on 2019-01-08

Combining datasets

import numpy as np
import pandas as pd
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

Hierarchical indexing

data = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
d  2    0.281746
   3    0.769023
dtype: float64
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
data['b']
1   -0.555730
3    1.965781
dtype: float64
data['b':'c']
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
dtype: float64
data.loc[['b', 'd']]
b  1   -0.555730
   3    1.965781
d  2    0.281746
   3    0.769023
dtype: float64
data.loc[:, 2]
a    0.478943
c    0.092908
d    0.281746
dtype: float64
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['Ohio']
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

Database-style DataFrame merges

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1,df2)  # if no key is specified, the overlapping column names are used as the keys
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
print(df3)
print(df4)
  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
  rkey  data2
0    a      0
1    b      1
2    d      2
pd.merge(df3,df4,left_on='lkey',right_on='rkey') # inner join by default; the how parameter decides the kind of join
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
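To see how the how parameter changes the result, the following sketch counts the rows produced by each join type. It rebuilds frames with the same keys as df3 and df4 so that it runs on its own; the variable names left, right, and row_counts are just for this example.

```python
import pandas as pd

# Same keys as df3 / df4 above, rebuilt here so the sketch is self-contained.
left = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                     "data1": range(7)})
right = pd.DataFrame({"rkey": ["a", "b", "d"],
                      "data2": range(3)})

# Count the rows each join type produces.
row_counts = {
    how: len(pd.merge(left, right, left_on="lkey", right_on="rkey", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # inner: 6, left: 7, right: 7, outer: 8
```

The inner join drops 'c' (left only) and 'd' (right only), the left join keeps 'c', the right join keeps 'd', and the outer join keeps both.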
# Left join
pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='left')
lkey data1 rkey data2
0 b 0 b 1.0
1 b 1 b 1.0
2 a 2 a 0.0
3 c 3 NaN NaN
4 a 4 a 0.0
5 a 5 a 0.0
6 b 6 b 1.0
## To join on multiple keys, just pass a list
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
pd.merge(left,right,on=['key1','key2'],how='left')
key1 key2 lval rval
0 foo one 1 4.0
1 foo one 1 5.0
2 foo two 2 NaN
3 bar one 3 6.0
pd.merge(left, right, on='key1')
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

Merging on index

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1
group_val
a 3.5
b 7.0
pd.merge(left1,right1,left_on='key',right_index=True)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
righth
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
## Using the index on both sides works fine too

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
left2.join(right2)    ### a quicker way to merge on index
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
c 3.0 4.0 9.0 10.0
e 5.0 6.0 13.0 14.0
left1.join(right1, on='key')
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN

Concatenating along an axis

arr=np.arange(12).reshape(3,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr, arr], axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1,s2,s3]) # axis=0 by default
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
s4=pd.concat([s1*5,s3])
# join_axes was removed in pandas 1.0; reindexing the result selects the same labels
pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
0 1
a 0.0 0.0
c NaN NaN
b 1.0 5.0
e NaN NaN
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
s1
a    0
b    1
dtype: int64
result.unstack()
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'],sort=True)
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
columns=['three', 'four'])
df1
one two
a 0 1
b 2 3
c 4 5
df2
three four
a 5 6
c 7 8
pd.concat([df1, df2], axis=1,sort=True)
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
pd.concat({'level1': df1, 'level2': df2}, axis=1,sort=True) # the dict keys are used as the values of the keys option
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat([df1,df2],axis=1,keys=['l1','l2'],names=['zz1','zz2'],sort=True)
zz1  l1        l2
zz2 one two three four
a     0   1   5.0  6.0
b     2   3   NaN  NaN
c     4   5   7.0  8.0
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
df2
b d a
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
pd.concat([df1,df2],sort=True)
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
0 -1.021228 0.476985 NaN 3.248944
1 0.302614 -0.577087 NaN 0.124121
pd.concat([df1,df2],ignore_index=True,sort=True)
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
3 -1.021228 0.476985 NaN 3.248944
4 0.302614 -0.577087 NaN 0.124121

Combining data with overlap

np.where?
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
index=['f', 'e', 'd', 'c', 'b', 'a'])
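b.iloc[-1] = np.nan  # positional assignment; b[-1] relies on a positional fallback that newer pandas versions drop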
a
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
b
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
np.where(pd.isnull(a),b,a) # fill the nulls in a with the values of b at the same index
array([0. , 2.5, 2. , 3.5, 4.5, nan])
b[:-2].combine_first(a[2:])
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2) # "patch" the missing data in the calling object with data from the argument
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
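When both objects share the same index, combine_first() gives the same values as the where-based patching shown earlier. A small sketch with the two Series from this section (rebuilt here so that it runs on its own; same_with_where is a name chosen for this example):

```python
import numpy as np
import pandas as pd

idx = ["f", "e", "d", "c", "b", "a"]
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=idx)
b = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan], index=idx)

# Keep a's values where present, otherwise take b's value at the same label.
patched = a.combine_first(b)
same_with_where = a.where(a.notna(), b)

print(patched)
```

The difference shows up when the indexes do not match: combine_first() aligns on labels and returns the union of both indexes, while np.where() works purely by position.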

Reshaping and pivoting

Reshaping with hierarchical indexing

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number'))
data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result = data.stack()
result
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
result.unstack()
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
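By default unstack() moves the innermost level back to the columns, but you can also name the level to move. A short sketch on the same data (rebuilt so that it runs on its own; by_state is a name chosen for this example):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))

stacked = data.stack()                 # columns -> inner index level
roundtrip = stacked.unstack()          # inner level ("number") -> columns again
by_state = stacked.unstack("state")    # move the "state" level to the columns instead

print(by_state)
```

Unstacking by "state" effectively transposes the original frame: the numbers become the rows and the states become the columns.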

Pivoting "long" format to "wide" format

data = pd.read_csv('examples/macrodata.csv')
data.head()
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300
9 1959-12-31 realgdp 2785.204
pivoted = ldata.pivot(index='date', columns='item', values='value')  ## index, columns, values (keyword arguments; newer pandas no longer accepts them positionally)
pivoted.head()
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
date item value value2
0 1959-03-31 realgdp 2710.349 -0.894813
1 1959-03-31 infl 0.000 -1.741494
2 1959-03-31 unemp 5.800 -1.052256
3 1959-06-30 realgdp 2778.801 1.436603
4 1959-06-30 infl 2.340 -0.576207
5 1959-06-30 unemp 5.100 -2.420294
6 1959-09-30 realgdp 2775.488 -1.062330
7 1959-09-30 infl 2.740 0.237372
8 1959-09-30 unemp 5.300 0.000957
9 1959-12-31 realgdp 2785.204 0.065253
pivoted = ldata.pivot(index='date', columns='item')
pivoted.head()  # the columns now have a hierarchical index
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 -1.741494 -0.894813 -1.052256
1959-06-30 2.34 2778.801 5.1 -0.576207 1.436603 -2.420294
1959-09-30 2.74 2775.488 5.3 0.237372 -1.062330 0.000957
1959-12-31 0.27 2785.204 5.6 -1.367524 0.065253 -0.030280
1960-03-31 2.31 2847.699 5.2 -0.642437 0.940489 1.040179
1
pivoted['value'].head()
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
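
As a rough illustration of what pivot does, the sketch below uses a small made-up long-format table (the values and the names `long_df`, `wide_a`, `wide_b`, `back` are assumptions, not the macrodata file) and shows that pivot matches set_index followed by unstack, with melt going the other way.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation.
long_df = pd.DataFrame({
    "date": ["2000-01", "2000-01", "2000-02", "2000-02"],
    "item": ["infl", "unemp", "infl", "unemp"],
    "value": [0.5, 5.8, 0.7, 5.1],
})

# pivot is shorthand for set_index followed by unstack.
wide_a = long_df.pivot(index="date", columns="item", values="value")
wide_b = long_df.set_index(["date", "item"])["value"].unstack("item")

# melt reverses the operation, from wide back to long.
back = wide_a.reset_index().melt(id_vars="date", value_name="value")
```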

Data Transformation

Removing Duplicates

data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
data
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
data.duplicated()  # flag each row that repeats an earlier row
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
data.drop_duplicates()  # drop the duplicate rows
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
data['v1']=np.arange(7)
data.drop_duplicates(['k1'])  # deduplicate on the k1 column only
k1 k2 v1
0 one 1 0
3 two 3 3
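
Which duplicate survives is controlled by the keep parameter. The sketch below rebuilds the same frame under a hypothetical name `df` and compares the three settings.

```python
import pandas as pd

df = pd.DataFrame({"k1": ["one"] * 3 + ["two"] * 4,
                   "k2": [1, 1, 2, 3, 3, 4, 4],
                   "v1": range(7)})

first = df.drop_duplicates(subset=["k1", "k2"])               # keep the first occurrence (default)
last = df.drop_duplicates(subset=["k1", "k2"], keep="last")   # keep the last occurrence
none = df.drop_duplicates(subset=["k1", "k2"], keep=False)    # drop every row that has a duplicate
```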
1
2


k1 k2 v1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6
1
2
data=pd.Series([1,2,3,-99])
data
0     1
1     2
2     3
3   -99
dtype: int64
1
data.replace(-99,np.nan)
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
1
data.replace({1:100,-99:np.nan})
0    100.0
1      2.0
2      3.0
3      NaN
dtype: float64

Renaming Axis Indexes

data = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])
data
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
1
data.index.map(str.upper)
Index(['B', 'C', 'D', 'E'], dtype='object')

Discretization and Binning

# discretization: assign each age to a bin
age = np.arange(10)
bins = [2,5,8]
cats = pd.cut(age,bins)
cats
[NaN, NaN, NaN, (2, 5], (2, 5], (2, 5], (5, 8], (5, 8], (5, 8], NaN]
Categories (2, interval[int64]): [(2, 5] < (5, 8]]
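
The output above shows that age 2 falls outside the first bin, because bins are open on the left by default. A small sketch of the options that change this, using the same `age` and `bins` (the label names "low" and "high" are made up):

```python
import numpy as np
import pandas as pd

age = np.arange(10)
bins = [2, 5, 8]

# right=False closes the bins on the left instead: [2, 5) and [5, 8).
left_closed = pd.cut(age, bins, right=False)

# include_lowest=True pulls the left edge (2) into the first bin.
with_lowest = pd.cut(age, bins, include_lowest=True)

# labels replace the interval notation with names of our choosing.
named = pd.cut(age, bins, labels=["low", "high"])
counts = pd.Series(named).value_counts()
```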

Detecting and Filtering Outliers

from pandas import DataFrame
data=DataFrame(np.random.randn(1000,4))
data.describe()
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.002621 -0.023747 -0.003461 -0.002610
std 0.998586 0.962207 1.012928 0.996423
min -3.024110 -2.657202 -3.105636 -3.530912
25% -0.670724 -0.684972 -0.691494 -0.707701
50% 0.022038 0.023472 0.024927 0.020683
75% 0.649798 0.639806 0.693491 0.672463
max 3.897527 3.160760 3.144389 3.003284
# find the values in one column whose absolute value exceeds 3
col=data[3]
col[abs(col)>3]
208   -3.530912
969    3.003284
Name: 3, dtype: float64
data[(np.abs(data)>3).any(axis=1)][:2]
0 1 2 3
136 -1.202724 -0.286215 -3.105636 -0.369009
148 -3.024110 -1.168413 -0.888664 0.111410
data[np.abs(data)>3]=np.sign(data)*3  # cap only the out-of-range cells at -3 or 3
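
Capping is meant to touch only the cells outside [-3, 3], not whole rows. A minimal sketch, assuming a seeded generator purely for reproducibility, that caps with an element-wise mask and checks the result against DataFrame.clip:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only so the run is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Element-wise mask: np.sign gives -1 or +1, so outliers become -3 or +3.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# DataFrame.clip expresses the same bound in one call.
clipped = data.clip(lower=-3, upper=3)
```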

Computing Indicator/Dummy Variables

df = DataFrame({'key':['b','b','a','c','a','b'],'value':np.arange(6)})
df
key value
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
1
pd.get_dummies(df['key'])
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
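
In practice the indicator columns are usually joined back onto the rest of the data. A short sketch with the same `df` (the names `dummies` and `with_dummies` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "value": np.arange(6)})

# prefix labels the indicator columns; dtype=int keeps 0/1 rather than booleans.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

# Join the indicators back onto the remaining columns.
with_dummies = df[["value"]].join(dummies)
```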

String Manipulation

String Object Methods

# to be studied together with Python string encoding

Regular Expressions

import re
# see the earlier blog post

ch06-IO

Posted on 2019-01-06

Reading and Writing Data in Text Format

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
1
!pwd
/Users/zhangyangfenbi.com/Desktop/code/conda_book
1
!cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
1
2
3
file_path = 'examples/ex1.csv'
df=pd.read_csv(file_path)
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# read_table works too, but its default delimiter is a tab, so sep must be given explicitly
pd.read_table('examples/ex1.csv',sep=',')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1
!cat examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# let pandas assign default column names, or supply your own
df1=pd.read_csv('examples/ex2.csv',header=None)
df1
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1
2
df2=pd.read_csv('examples/ex2.csv',names=['a','b','c','d','e'])
df2
a b c d e
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1
2
df2=pd.read_csv('examples/ex2.csv',names=['a','b','c','d','e'],index_col='e')
df2
a b c d
e
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
1
!cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
1
pd.read_csv('examples/csv_mindex.csv',index_col=['key1','key2'])
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
list(open('examples/ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']
# use a regular expression when the delimiter is a variable amount of whitespace
res = pd.read_table('examples/ex3.txt',sep=r'\s+')
res
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
1
!cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# skip the given rows with skiprows
pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1
!cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
# pandas marks missing values (empty fields and sentinels such as NA) as NaN
res1=pd.read_csv('examples/ex5.csv')
res1
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
1
res1.isnull()
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False
1
2
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
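
na_values also accepts a dict, so that each column gets its own sentinels. The sketch below inlines the contents of ex5.csv as shown above so it runs without the file; the sentinel choices are arbitrary.

```python
import io
import pandas as pd

# Same contents as examples/ex5.csv above, inlined to keep the sketch self-contained.
csv_text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# A dict applies different sentinels to different columns.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
result = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

# keep_default_na=False turns off the built-in sentinels such as "NA".
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```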

Reading Text Files in Pieces

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x10c0cc828>
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

# accumulate the value counts of the 'key' column chunk by chunk
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

# top 5
tot[:5]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
dtype: float64
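
The same accumulation pattern can be tried without ex6.csv. This sketch generates a tiny made-up CSV in memory (10 rows, keys cycling through A, B, C) and reads it four rows at a time.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows with a repeating 'key' column.
csv_text = "key,val\n" + "".join(f"{'ABC'[i % 3]},{i}\n" for i in range(10))

total = pd.Series(dtype="float64")
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each piece is an ordinary DataFrame with at most 4 rows.
    total = total.add(piece["key"].value_counts(), fill_value=0)

total = total.sort_values(ascending=False)
```

fill_value=0 matters here: a key missing from one chunk would otherwise turn the running total into NaN.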

Writing Data to Text Format

data = pd.read_csv('examples/ex5.csv')
data
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
1
pwd
'/Users/zhangyangfenbi.com/Desktop/code/conda_book'
1
data.to_csv('/Users/zhangyangfenbi.com/Desktop/tmp.csv')
1
!cat '/Users/zhangyangfenbi.com/Desktop/tmp.csv'
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
1
2
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
1
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
1
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
1
2
3
4
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv')
!cat examples/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

Working with Delimited Formats Manually

import csv

# the with block closes the file once the loop finishes
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)

# csv.writer omitted
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
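
Since the writer was left out above, here is a small sketch of both directions with a non-default delimiter. The semicolon-separated text is made up for illustration.

```python
import csv
import io

# Hypothetical semicolon-delimited text with one quoted field.
text = 'a;b;c\n1;"x;y";3\n'

reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
header, *rows = list(reader)

# Turn the rows into a dict of columns.
columns = {name: list(col) for name, col in zip(header, zip(*rows))}

# Writing takes the same options; fields containing the delimiter are quoted.
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
```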

JSON Data

# JSON has become a standard format, mainly used to send data in HTTP requests and between applications
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
str
1
2
3
4
import json

res = json.loads(obj)
res
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
1
2
asjson = json.dumps(res)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
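
Once parsed, a list of JSON objects maps naturally onto DataFrame rows. A minimal sketch using a trimmed copy of the object above:

```python
import json
import pandas as pd

# Trimmed copy of the JSON document used above.
obj = """
{"name": "Wes",
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}
"""

result = json.loads(obj)

# A list of dicts becomes one row per dict; columns= selects a subset of the fields.
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])
```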

Web Scraping

The book mainly covers lxml and urllib2.
These days bs4 (Beautiful Soup) and requests are used more often, so skimming this part is enough.

Binary Data Formats

frame = pd.read_csv('examples/ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1
2
frame.to_pickle('examples/frame_pickle_zhangyang')
pd.read_pickle('examples/frame_pickle_zhangyang')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

HTML and Web APIs

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
1
type(resp.text)
str

Interacting with Databases

import sqlite3
import pymongo
import MySQLdb

# Example: connecting to a MySQL database

# open the database connection
db = MySQLdb.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')

# get a cursor with cursor()
cursor = db.cursor()

# run a SQL statement with execute()
cursor.execute("SELECT VERSION()")

# fetch a single row with fetchone()
data = cursor.fetchone()

print("Database version : %s " % data)

# close the database connection
db.close()
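
The MySQL example needs a running server. As a stand-in, the sketch below follows the same connect, execute, fetch pattern against an in-memory SQLite database; the table and its rows are invented for the example.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE test (a VARCHAR(20), b REAL, c INTEGER)")
rows = [("Atlanta", 1.25, 6), ("Tallahassee", 2.6, 3), ("Sacramento", 1.7, 5)]
con.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
con.commit()

# pandas can run the query and build the DataFrame in one step.
frame = pd.read_sql("SELECT * FROM test WHERE c > 4", con)
con.close()
```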

ch05-pandas Basics

Posted on 2019-01-05

Basic pandas Data Structures

import pandas as pd
from pandas import Series,DataFrame

import numpy as np

Series

1
obj=Series([4,7,-1,3])
1
obj
0    4
1    7
2   -1
3    3
dtype: int64
1
obj.values
array([ 4,  7, -1,  3])
1
obj.index
RangeIndex(start=0, stop=4, step=1)
1
2
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
1
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
1
obj2['a']
-5
1
'b' in obj2
True
1
obj2>0
d     True
b     True
a    False
c     True
dtype: bool
1
obj2[obj2>0]
d    4
b    7
c    3
dtype: int64
1
2
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata
{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
1
type(sdata)
dict
1
2
obj3=Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
1
type(obj3)
pandas.core.series.Series
states = ['California', 'Ohio', 'Oregon', 'Texas']  # 'California' is not a key in sdata
obj4=Series(sdata,index=states)
obj4

# NaN: not a number
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
1
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
1
print(obj3,obj4)
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64 California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
1
obj3+obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
1
obj4.name = 'zhangyang'
1
obj4.index.name='pk'
1
obj4
pk
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: zhangyang, dtype: float64

DataFrame

# pass a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
data
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
1
type(data)
dict
1
2
frame = DataFrame(data)
frame
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
# if columns is given, the DataFrame's columns are arranged in that order
frame1=DataFrame(data,columns=['year','state','pop'])
frame1
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
# a requested column that is not in the data is filled with NaN
frame2=DataFrame(data,columns=['year','state','pop','debt'])
frame2
year state pop debt
0 2000 Ohio 1.5 NaN
1 2001 Ohio 1.7 NaN
2 2002 Ohio 3.6 NaN
3 2001 Nevada 2.4 NaN
4 2002 Nevada 2.9 NaN
5 2003 Nevada 3.2 NaN
1
frame.year
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
1
frame['year']
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
1
frame.loc[1]
state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object
1
frame2.debt=10
1
frame2.debt
0    10
1    10
2    10
3    10
4    10
5    10
Name: debt, dtype: int64
# frame2.debt is itself a Series
print(frame2.debt.values,'and',frame2.debt.index)
[10 10 10 10 10 10] and RangeIndex(start=0, stop=6, step=1)
1
2
3
4
frame3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four',
'five', 'six'])
frame3
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
1
2
3
val=Series([1.2,3.1,-1],index=['two','five','one'])
frame3.debt=val
print(frame3.debt)
one     -1.0
two      1.2
three    NaN
four     NaN
five     3.1
six      NaN
Name: debt, dtype: float64
# the del keyword can be used to delete a column
frame3['tmp'] = frame3.state == 'Ohio'  # == is evaluated before =, so tmp holds booleans
frame3
year state pop debt tmp
one 2000 Ohio 1.5 -1.0 True
two 2001 Ohio 1.7 1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 NaN False
five 2002 Nevada 2.9 3.1 False
six 2003 Nevada 3.2 NaN False
1
del frame3['tmp']
1
frame3
year state pop debt
one 2000 Ohio 1.5 -1.0
two 2001 Ohio 1.7 1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 3.1
six 2003 Nevada 3.2 NaN
1
2
3
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop
{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
1
2
frame4=DataFrame(pop)
frame4
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
1
frame4.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
1
frame4.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

Index Objects

# Index objects are immutable

obj = Series(range(3),index=['a','b','c'])
obj
a    0
b    1
c    2
dtype: int64
1
2
3
index = obj.index
print(index)
index[1]='d'
Index(['a', 'b', 'c'], dtype='object')



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-84-f2a2752a2674> in <module>()
      1 index = obj.index
      2 print(index)
----> 3 index[1]='d'


/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):


TypeError: Index does not support mutable operations
# Immutability makes it safe to share Index objects among several data structures
# Besides being array-like, an Index also behaves like a fixed-size set
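
A short sketch of both points, using a hypothetical second index `other`: assignment fails, while the set-like methods return new Index objects.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"  # Index does not support item assignment
except TypeError as exc:
    error = str(exc)

# Set-like operations return new Index objects instead of mutating in place.
other = pd.Index(["b", "c", "d"])
both = index.intersection(other)
either = index.union(other)
has_b = "b" in index
```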

Essential Functionality

Reindexing

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
1
2
obj2 = obj.reindex(['a','b','c','d','e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
1
obj.reindex(['a','b','c','d','e'],fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
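
Besides a constant fill_value, reindex can fill gaps from neighbouring labels, which is handy for ordered data. A minimal sketch with made-up values:

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# method='ffill' carries the last valid value forward into the new labels.
filled = obj.reindex(range(6), method="ffill")
```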
1
2
3
4
frame=DataFrame(np.arange(9).reshape((3,3))
,index=['a','c','e']
,columns=['Ohio', 'Texas', 'California'])
frame
Ohio Texas California
a 0 1 2
c 3 4 5
e 6 7 8
1
frame.reindex(['a','b','c','e'])
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
e 6.0 7.0 8.0
frame.reindex(columns=[ 'Texas','California','Ohio'])  # reindex returns a new object
Texas California Ohio
a 1 2 0
c 4 5 3
e 7 8 6
1
states=['Texas','California','Ohio']

Dropping Entries from an Axis

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
1
2
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
1
2
3
4
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
data.drop(['one','two'],axis=1)
three four
Ohio 2 3
Colorado 6 7
Utah 10 11
New York 14 15

Indexing, Selection, and Filtering

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
1
2
3
print(obj['b'])
print(obj[1])
print(obj[2:4])
1.0
1.0
c    2.0
d    3.0
dtype: float64
1
2
3
4
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
1
data[['two','three']]
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14
1
data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
1
data[data['three']>5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
2
data[data<5]=0
data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Selection with loc and iloc

1
data.loc['Colorado',['one','two']]
one    0
two    5
Name: Colorado, dtype: int64
1
data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.iloc[2]    # iloc selects by integer position
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

Integer Indexes

1
2
ser = pd.Series(np.arange(3.))
ser
0    0.0
1    1.0
2    2.0
dtype: float64
1
2
ser2 = pd.Series(np.arange(3.),index=['a','b','c'])
ser2
a    0.0
b    1.0
c    2.0
dtype: float64
1
ser2[-1]
2.0
1
ser.loc[:1]
0    0.0
1    1.0
dtype: float64
1
ser.iloc[:1]
0    0.0
dtype: float64
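
The difference between the two slices above is easier to see when the integer labels are not 0, 1, 2. A sketch with deliberately shifted labels (10, 20, 30 are arbitrary):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.), index=[10, 20, 30])  # integer labels, deliberately not 0..2

by_label = ser.loc[10:20]    # label slice: both endpoints are included
by_position = ser.iloc[0:1]  # position slice: the end is excluded
last = ser.iloc[-1]          # position -1 is always the last element
```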

Arithmetic and Data Alignment

1
2
3
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
print(s1,s2)
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64 a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
1
s1+s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
1
2
3
4
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
1
df1
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
1
df2
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
df1+df2   # values are computed only where both row and column labels match; otherwise NaN
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
1
2
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
1
df1
A
0 1
1 2
1
df2
B
0 3
1 4
1
df2-df1
A B
0 NaN NaN
1 NaN NaN

Arithmetic methods with fill values

1
2
3
4
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
1
df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
1
df1.loc[1,'b']=np.nan
1
df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 NaN 6.0 7.0
2 8.0 9.0 10.0 11.0
1
df2
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1
df1+df2
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
1
df1.add(df2, fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 6.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1
1 / df1
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 NaN 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
1
df1.rdiv(1)
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 NaN 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
1
df1.reindex(columns=df2.columns,fill_value=0)
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 NaN 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0

Operations between DataFrame and Series

1
2
arr = np.arange(12.).reshape((3, 4))
arr
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
1
arr[0]
array([0., 1., 2., 3.])
arr-arr[0]  # arr[0] is broadcast across every row
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
1
2
3
4
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
1
2
series = frame.loc['Utah']
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series   # broadcasting works the same way for a DataFrame
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
1
2
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
b    0
e    1
f    2
dtype: int64
1
frame+series2
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
1
series3 = frame['d']
1
frame.sub(series3,axis='index')
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
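
A common use of both broadcasting directions is centering data. The sketch below rebuilds the same `frame` and subtracts column means and row means; the names `col_centered` and `row_centered` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Default: the Series index is matched against the columns and broadcast down the rows.
col_centered = frame - frame.mean()

# axis='index' matches on the row labels instead and broadcasts across the columns.
row_centered = frame.sub(frame.mean(axis="columns"), axis="index")
```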

Function Application and Mapping

1
2
3
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b d e
Utah 2.006862 -1.308834 1.440639
Ohio 0.001529 0.026818 0.706586
Texas -0.461218 0.365081 -0.898180
Oregon -0.280978 0.402707 -1.396092
1
np.abs(frame)
b d e
Utah 2.006862 1.308834 1.440639
Ohio 0.001529 0.026818 0.706586
Texas 0.461218 0.365081 0.898180
Oregon 0.280978 0.402707 1.396092
f = lambda x:x.max()-x.min()
frame.apply(f,axis='columns')  # each row (or column) is passed to the function as a Series
Utah      3.315696
Ohio      0.705058
Texas     1.263261
Oregon    1.798799
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis='index')
b d e
min -0.461218 -1.308834 -1.396092
max 2.006862 0.402707 1.440639
1
2
format = lambda x: '%.2f' % x
frame.applymap(format)
b d e
Utah 2.01 -1.31 1.44
Ohio 0.00 0.03 0.71
Texas -0.46 0.37 -0.90
Oregon -0.28 0.40 -1.40
1
frame['e'].map(format)
Utah       1.44
Ohio       0.71
Texas     -0.90
Oregon    -1.40
Name: e, dtype: object

Sorting and Ranking

1
2
obj = pd.Series(np.random.randn(4), index=['d', 'a', 'b', 'c'])
obj
d   -1.066429
a    0.005021
b   -0.257605
c   -1.705094
dtype: float64
1
obj.sort_values()
c   -1.705094
d   -1.066429
b   -0.257605
a    0.005021
dtype: float64
1
obj.sort_index()
a    0.005021
b   -0.257605
c   -1.705094
d   -1.066429
dtype: float64
1
2
3
4
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame
d a b c
three 0 1 2 3
one 4 5 6 7
1
frame.sort_index(axis=0)
d a b c
one 4 5 6 7
three 0 1 2 3
1
frame.sort_index(axis=1)
a b c d
three 1 2 3 0
one 5 6 7 4
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values(ascending=False)  # ascending selects ascending or descending order; NaN sorts last
2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64
1
2
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
b a
0 4 0
1 7 1
2 -3 0
3 2 1
frame.sort_values(by='b')   # sort by the values in one column
b a
2 -3 0
3 2 1
0 4 0
1 7 1
1
frame.sort_values(by=['a', 'b'])
b a
2 -3 0
0 4 0
3 2 1
1 7 1
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()  # default: tied values share the mean of their ranks
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
1
obj.rank(method='first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
1
2
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
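
The tie-breaking methods are easiest to compare side by side. A sketch that ranks the same `obj` with each method and collects the results in one frame (the name `ranks` is illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({
    "average": obj.rank(),              # tied values share the mean of their ranks
    "min": obj.rank(method="min"),      # tied values all get the lowest rank
    "max": obj.rank(method="max"),      # tied values all get the highest rank
    "first": obj.rank(method="first"),  # ties broken by order of appearance
    "dense": obj.rank(method="dense"),  # like 'min', but ranks rise by 1 between groups
})
print(ranks)
```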
1
2
3
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
1
frame.rank(axis=1)
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0

Axis Indexes with Duplicate Labels

1
2
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
1
obj.index.is_unique
False
1
2
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
0 1 2
a -1.513555 0.286993 0.982033
a 1.211395 -1.512109 1.007934
b -0.609349 0.729770 1.106319
b -0.427720 0.354752 0.286622
df.loc['b']   # select every row labeled 'b'
0 1 2
b -0.609349 0.729770 1.106319
b -0.427720 0.354752 0.286622

Summarizing and Computing Descriptive Statistics

1
2
3
4
5
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
1
df.sum()
one    9.25
two   -5.80
dtype: float64
1
df.sum(axis='columns')
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
1
df.mean(axis='columns', skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
1
df.idxmax()
one    b
two    d
dtype: object
df.cumsum()  # cumulative sum, down each column by default
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
1
df.cumsum(axis=1)
one two
a 1.40 NaN
b 7.10 2.60
c NaN NaN
d 0.75 -0.55
df.describe()   # really handy
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
1
2
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
1
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

Correlation and Covariance

# skipping this part for now

Unique Values, Value Counts, and Membership

1
2
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
1
2
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
1
obj.value_counts()
a    3
c    3
b    2
d    1
dtype: int64
1
pd.value_counts(obj.values, sort=False)
c    3
b    2
d    1
a    3
dtype: int64
1
obj.isin(['b', 'c'])
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
1
obj[obj.isin(['b', 'c'])]
0    c
5    b
6    b
7    c
8    c
dtype: object
1
2
3
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])
1
2
3
4
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})
data
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
res=data.apply(pd.Series.value_counts).fillna(0)   # count how often each value occurs in each column
res
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0

final

Time to put this into practice.

ch04_NumPy Basics

Posted on 2018-12-31
1
import this
import numpy as np

# create an ndarray
data0 = [6,7,8.1,2]
data =np.array(data0)
data
array([6. , 7. , 8.1, 2. ])
1
2
3
4
import numpy as np
data2=[[1,2,3],[2.1,6,3]]
arr = np.array(data2)
arr.dtype
dtype('float64')
1
np.zeros((3,2))
array([[0., 0.],
       [0., 0.],
       [0., 0.]])
1
np.ones((3,2))
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
np.empty((3,2))
# np.empty() returns uninitialized garbage values
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
1
np.arange(5)
array([0, 1, 2, 3, 4])
1
2
test = np.arange(5)
test.dtype
dtype('int64')
# ones_like() takes an array and creates an all-ones array with the same shape and dtype
a=[[1,2,3],[4,5,6]]
b=np.ones_like(a)
b
array([[1, 1, 1],
       [1, 1, 1]])
# create a square N x N identity matrix (ones on the diagonal, zeros elsewhere)
np.eye(5)
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
1
np.array([1,2,3],dtype=np.int64)
array([1, 2, 3])
# explicit type conversion with astype
data = np.array([1,2,3],dtype=np.int64)
print(data.dtype)
print(data.astype(np.float64).dtype)
int64
float64
num_str = np.array(['1.2','3.25','4.1'],dtype=np.bytes_)
print(num_str)
num_str.astype(float)
[b'1.2' b'3.25' b'4.1']





array([1.2 , 3.25, 4.1 ])
# arithmetic is applied element-wise
arr=np.array([[1,2,3],[4,5,6]])

arr*arr
array([[ 1,  4,  9],
       [16, 25, 36]])
# arithmetic with a scalar
arr=np.array([[1,2,3],[4,5,6]])
arr * 2
array([[ 2,  4,  6],
       [ 8, 10, 12]])

Basic Indexing and Slicing

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1
2
3
arr = np.arange(10)
arr[5:7] = 99
arr
array([ 0,  1,  2,  3,  4, 99, 99,  7,  8,  9])
arr0 = np.ones((5,2))
arr0[0][1]
1.0
# pass a comma-separated list of indices to select a single element
arr1=np.ones((5,2))
arr1[1,1]
1.0
# higher-dimensional data
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
1
arr3d
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
1
2
old_value = arr3d[0].copy()
arr3d[0]=12
1
arr3d[0]
array([[12, 12, 12],
       [12, 12, 12]])
1
arr3d[0]=old_value
1
arr3d[0]
array([[1, 2, 3],
       [4, 5, 6]])
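
The .copy() call above matters because basic slices are views, not copies. A minimal sketch on a 1-D array (the names `view` and `copy` are illustrative):

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]         # basic slicing returns a view onto the same buffer
view[:] = 12            # so writing through it changes arr as well

copy = arr[5:8].copy()  # an explicit copy is independent
copy[:] = 0             # arr is unaffected
```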
1
2
arr2d=np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
1
arr2d[:2]
array([[1, 2, 3],
       [4, 5, 6]])
1
arr2d[:2,1:2]
array([[2],
       [5]])
1
arr2d[:,:2]
array([[1, 2],
       [4, 5],
       [7, 8]])

Boolean Indexing

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'])
data = np.random.randn(6,5)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'], dtype='<U4')
1
data
array([[ 1.04334945, -0.51882989,  0.39479822, -1.26167769, -0.60706667],
       [-0.10854399, -0.77654652, -0.90842022, -0.91657036, -1.57115294],
       [ 0.78047305, -0.55011782, -0.72659944, -0.78787495,  2.10762613],
       [-0.94467982,  1.4091048 ,  0.4530369 , -1.83722786, -0.14625949],
       [ 0.34030044, -1.12975372,  1.03528971,  0.8180118 ,  0.42579557],
       [-0.07116101,  0.83523538, -0.61881987, -0.5052446 ,  1.06253317]])
1
names == 'Bob'
array([ True, False, False,  True, False, False])
1
data[names=='Bob']
array([[ 0.24865102,  0.11944466,  0.40557113, -1.24757741,  0.16418035],
       [-0.0478229 , -0.30082172, -1.18252039, -1.17703784, -0.40956047]])
1
data[names=='Bob',:2]
array([[ 0.24865102,  0.11944466],
       [-0.0478229 , -0.30082172]])
1
2
3
demo = np.array([1,2,3,-1,-5])
demo[demo<0] = 1
demo
array([1, 2, 3, 1, 1])
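
Several conditions can be combined into one mask. In this sketch `data` is a small deterministic stand-in for the random array used above, so the selected rows are predictable.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe"])
data = np.arange(12).reshape((6, 2))  # deterministic stand-in for the random data

# Combine conditions with & (and), | (or), ~ (not); Python's and/or do not work on arrays.
mask = (names == "Bob") | (names == "Will")
selected = data[mask]
others = data[~mask]
```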

Fancy Indexing

arr=np.empty((8,4))
for i in range(8):
    arr[i] = i
arr
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
# pass a list or ndarray of integers to select rows in that order
arr[[4,2,0,1]]
array([[4., 4., 4., 4.],
       [2., 2., 2., 2.],
       [0., 0., 0., 0.],
       [1., 1., 1., 1.]])
1
2
arr1=np.arange(32).reshape((8,4))
arr1
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
1
arr1[[4,2],[3,2]]
array([19, 10])
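
Two index lists select element pairs, not a rectangular block, which is often a surprise. A sketch of the difference on the same 8x4 array, using np.ix_ as one way to get the block:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Two index lists are paired up element by element: (4, 3) and (2, 2).
pairs = arr[[4, 2], [3, 2]]

# For the block formed by rows 4, 2 and columns 3, 2, index in two steps
# or build an open mesh with np.ix_.
block_a = arr[[4, 2]][:, [3, 2]]
block_b = arr[np.ix_([4, 2], [3, 2])]
```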

Transposing Arrays and Swapping Axes

import numpy as np
1
arr = np.arange(15).reshape((3,5))
1
arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
1
arr.T
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
1
2
arr= np.random.randn(6,3)
arr
array([[-1.14899925,  2.01403377, -0.579223  ],
       [ 1.29437371, -0.37256935, -0.1998847 ],
       [ 0.88795876,  0.38322303, -0.77289001],
       [ 0.84318194,  1.57318664, -0.14691985],
       [ 0.09926862, -0.84374676,  0.47847472],
       [ 0.30721121,  0.7380255 ,  1.09155033]])
1
np.dot(arr.T,arr)
array([[ 4.59926208, -0.98662632, -0.02053929],
       [-0.98662632,  8.0735063 , -1.21754486],
       [-0.02053929, -1.21754486,  2.41481777]])

Universal Functions: Fast Element-Wise Array Functions

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1
np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
1
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
1
2
x = np.random.randn(8)
x
array([-1.52621084,  0.91491997,  1.8613378 ,  0.50723883,  0.26956039,
       -0.65576259,  0.81621241, -0.71835102])
y = np.random.randn(8)
y
array([-0.61305033, -0.99195929, -0.89955148, -0.63491395,  1.54908888,
       -1.82440893,  0.08511608, -0.60391516])
np.maximum(x,y)
array([-0.61305033,  0.91491997,  1.8613378 ,  0.50723883,  1.54908888,
       -0.65576259,  0.81621241, -0.60391516])
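Because the arrays above are random, the results differ on every run. The following sketch uses fixed inputs (chosen here only for illustration) to show a unary ufunc, two binary ufuncs, and `np.modf`, which returns two arrays.

```python
import numpy as np

x = np.array([1.5, -2.25, 3.0])
y = np.array([2.0, -3.0, 0.5])

larger = np.maximum(x, y)     # element-wise maximum of the two arrays
total = np.add(x, y)          # same as x + y
frac, whole = np.modf(x)      # fractional and integral parts, both keep the sign of x

print(larger)   # [ 2.   -2.25  3.  ]
print(total)    # [ 3.5  -5.25  3.5 ]
print(frac)     # [ 0.5  -0.25  0.  ]
print(whole)    # [ 1. -2.  3.]
```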

Data processing with arrays

Replacing explicit loops with array expressions is called vectorization.

points = np.arange(-5, 5, 0.01)  # 1000 equally spaced points
xs,ys=np.meshgrid(points,points)
xs
array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])
ys
array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])
z=np.sqrt(xs**2+ys**2)
z
array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
        7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
        7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568]])
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.OrRd)
# raw string, so the backslash in \sqrt is not read as an escape sequence
plt.title(r'Image plot of $\sqrt{x^2+y^2}$ for a grid of values')
Text(0.5,1,'Image plot of $\\sqrt{x^2+y^2}$ for a grid of values')

png
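To see what the vectorized expression replaces, the grid computation can be checked against an explicit double loop. This sketch uses a small 5x5 grid (an assumption, to keep it readable) instead of the 1000x1000 one.

```python
import numpy as np

points = np.arange(-2, 3)               # [-2, -1, 0, 1, 2]
xs, ys = np.meshgrid(points, points)    # xs varies along columns, ys along rows

# Vectorized: one expression over whole arrays
z_vec = np.sqrt(xs ** 2 + ys ** 2)

# Loop version of the same computation, for comparison
z_loop = np.empty((len(points), len(points)))
for i, yv in enumerate(points):
    for j, xv in enumerate(points):
        z_loop[i, j] = (xv ** 2 + yv ** 2) ** 0.5

print(np.allclose(z_vec, z_loop))       # True
```

On large grids the vectorized form is usually much faster, because the loop runs in compiled code rather than in the Python interpreter.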

Expressing conditional logic as array operations

np.where?
xarr = np.random.randn(5)
xarr
array([-0.05191135,  0.46807508,  1.5955647 , -1.21585517,  0.68848672])
yarr=np.random.randn(5)
yarr
array([-1.60333056,  2.16303939, -0.37219312, -1.85605698,  0.41180341])
np.where(xarr >= 0, xarr, yarr)  # take the value from xarr where the condition holds, otherwise from yarr
array([-1.60333056,  0.46807508,  1.5955647 , -1.85605698,  0.68848672])
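`np.where(cond, a, b)` is the array version of the ternary expression `a if cond else b`. The sketch below compares it with an equivalent list comprehension on fixed values (rounded stand-ins for the random arrays above).

```python
import numpy as np

xarr = np.array([-0.5, 0.4, 1.6, -1.2, 0.7])
yarr = np.array([-1.6, 2.2, -0.4, -1.9, 0.4])
cond = xarr >= 0

# Pure-Python version: one element at a time
slow = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# Vectorized version: the same selection in a single call
fast = np.where(cond, xarr, yarr)

print(fast)   # [-1.6  0.4  1.6 -1.9  0.7]
```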
# Replace every positive value with 1 and every other value with -1
arr=np.random.randn(4,4)
arr
array([[ 0.32359596, -1.15124188,  0.12417984, -1.34511765],
       [-0.41019678,  1.0543996 ,  2.6307449 ,  0.74725061],
       [ 1.03418855, -0.58064793, -0.61019497, -1.13773196],
       [-0.64005234,  0.73911588,  1.15966556, -0.26103626]])
np.where(arr>0,1,-1)
array([[ 1, -1,  1, -1],
       [-1,  1,  1,  1],
       [ 1, -1, -1, -1],
       [-1,  1,  1, -1]])
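The second and third arguments of `np.where` can mix scalars and arrays. The sketch below, on a small fixed array chosen for illustration, also shows how `np.where(arr > 0, 1, -1)` differs from `np.sign` when a value is exactly zero.

```python
import numpy as np

arr = np.array([[0.3, -1.2],
                [0.0,  1.1]])

signs = np.where(arr > 0, 1, -1)         # zero is not > 0, so it maps to -1
only_pos = np.where(arr > 0, 1.0, arr)   # replace positives with 1.0, keep the rest
sign_fn = np.sign(arr)                   # maps zero to 0 instead

print(signs)
print(only_pos)
print(sign_fn)
```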

Mathematical and statistical methods

arr=np.random.randn(5,4)
arr
array([[ 0.3794937 , -0.91051976,  0.54977469,  0.98390242],
       [ 1.24989257, -0.14989659, -0.70528342,  0.66344849],
       [ 0.15440786,  0.75716823, -1.54809025,  0.05263153],
       [ 0.63369665, -1.47415409, -1.35897948, -0.24638285],
       [ 0.36552553,  1.44667304,  1.80073603,  0.70854674]])
arr.mean()
0.1676295519907499
np.mean(arr)
0.1676295519907499
arr.sum()
3.352591039814998
arr.sum(axis=1)  # accepts an axis argument; axis=1 sums across the columns, giving one total per row
array([ 1.00265105,  1.05816104, -0.58388263, -2.44581977,  4.32148134])
arr[0]
array([ 0.3794937 , -0.91051976,  0.54977469,  0.98390242])
arr.sum(0)
array([ 2.7830163 , -0.33072917, -1.26184243,  2.16214634])
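The axis argument names the dimension that gets collapsed. A sketch on a 5x4 array of consecutive integers (used here so the totals are easy to verify by hand):

```python
import numpy as np

arr = np.arange(20).reshape((5, 4))

col_totals = arr.sum(axis=0)   # collapse the rows: one total per column, shape (4,)
row_totals = arr.sum(axis=1)   # collapse the columns: one total per row, shape (5,)
row_means = arr.mean(axis=1)

print(col_totals)   # [40 45 50 55]
print(row_totals)   # [ 6 22 38 54 70]
print(row_means)    # [ 1.5  5.5  9.5 13.5 17.5]
```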
np.sum?
arr=np.arange(9).reshape((3,3))
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
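The post does not show what this 3x3 array is used for. One plausible use, assumed here, is to demonstrate methods such as `cumsum` and `cumprod`, which return an array of intermediate results with the same shape instead of a single aggregate.

```python
import numpy as np

arr = np.arange(9).reshape((3, 3))

running_sum = arr.cumsum(axis=0)     # cumulative sum down each column
running_prod = arr.cumprod(axis=1)   # cumulative product along each row

print(running_sum)
# [[ 0  1  2]
#  [ 3  5  7]
#  [ 9 12 15]]
print(running_prod)
# [[  0   0   0]
#  [  3  12  60]
#  [  6  42 336]]
```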

Methods for boolean arrays

# In sum() and similar reductions, booleans are coerced to 1 (True) and 0 (False), so sum() counts the True values
arr=np.random.randn(10)
arr
array([ 0.62662142, -0.23714492,  2.52986602,  0.66838534,  0.47275484,
        1.81467714, -0.39454002, -2.59347451,  0.90815739, -0.0813537 ])
arr > 0
array([ True, False,  True,  True,  True,  True, False, False,  True,
       False])
(arr>0).sum()
6
(arr < 0).any()  # True if at least one element is True
True
(arr > 0).all()  # True only if every element is True
False
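The same coercion makes `mean()` on a boolean mask return the fraction of elements that satisfy the condition. A sketch with fixed values (chosen for illustration) so the counts can be checked:

```python
import numpy as np

arr = np.array([0.6, -0.2, 2.5, 0.7, -0.4])
mask = arr > 0

n_pos = mask.sum()                  # number of True values
n_pos_alt = np.count_nonzero(mask)  # same count without the integer conversion
frac_pos = mask.mean()              # share of True values

print(n_pos, n_pos_alt, frac_pos)   # 3 3 0.6
```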

Sorting

arr=np.random.randn(10)
arr
array([ 0.23588362, -0.45045835,  1.22450303, -0.2419639 , -0.23873288,
       -1.09141889, -0.87760038, -0.53059957,  0.15428331, -1.43959318])
arr.sort()
arr
array([-1.43959318, -1.09141889, -0.87760038, -0.53059957, -0.45045835,
       -0.2419639 , -0.23873288,  0.15428331,  0.23588362,  1.22450303])
# A multidimensional array can be sorted along any axis
arr=np.random.randn(5,5)
arr
array([[ 0.34871828,  1.03879317,  0.21363644,  0.05765405,  1.01230602],
       [ 0.20640237, -0.2323433 ,  0.2214327 ,  1.16611884,  0.5123435 ],
       [ 0.4660787 , -0.16572832,  0.03096976,  1.07155177, -1.90712269],
       [-0.45824044, -0.25984925, -1.37214123,  1.14006713, -0.70677386],
       [-2.51549148,  0.1314714 ,  1.68439925, -0.92174553,  1.03215197]])
arr.sort?
arr.sort(axis=1)
arr
array([[ 0.05765405,  0.21363644,  0.34871828,  1.01230602,  1.03879317],
       [-0.2323433 ,  0.20640237,  0.2214327 ,  0.5123435 ,  1.16611884],
       [-1.90712269, -0.16572832,  0.03096976,  0.4660787 ,  1.07155177],
       [-1.37214123, -0.70677386, -0.45824044, -0.25984925,  1.14006713],
       [-2.51549148, -0.92174553,  0.1314714 ,  1.03215197,  1.68439925]])
arr.sort(1)
arr
array([[ 0.05765405,  0.21363644,  0.34871828,  1.01230602,  1.03879317],
       [-0.2323433 ,  0.20640237,  0.2214327 ,  0.5123435 ,  1.16611884],
       [-1.90712269, -0.16572832,  0.03096976,  0.4660787 ,  1.07155177],
       [-1.37214123, -0.70677386, -0.45824044, -0.25984925,  1.14006713],
       [-2.51549148, -0.92174553,  0.1314714 ,  1.03215197,  1.68439925]])
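Note that `arr.sort()` sorts in place and returns `None`, which is why the cells above display `arr` on a separate line. The top-level `np.sort` returns a sorted copy instead. A short sketch on fixed values (illustrative only):

```python
import numpy as np

arr = np.array([3.0, -1.0, 2.0, 0.0])

sorted_copy = np.sort(arr)   # new sorted array; arr is left untouched
untouched = arr.copy()       # snapshot showing arr has not changed yet

result = arr.sort()          # sorts arr in place and returns None

grid = np.array([[3, 1, 2],
                 [0, 5, 4]])
by_column = np.sort(grid, axis=0)   # sort each column independently

print(result)      # None
print(arr)         # [-1.  0.  2.  3.]
print(by_column)
```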

Unique values

arr = np.array([3,3,2,1,1,54,223,3,2,3])
np.unique(arr)  # find the unique values in the array and return them sorted
array([  1,   2,   3,  54, 223])
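Two related tools are often useful alongside `np.unique`: the `return_counts` option, which reports how often each value occurs, and `np.isin`, which tests membership element by element. A sketch on the same array:

```python
import numpy as np

arr = np.array([3, 3, 2, 1, 1, 54, 223, 3, 2, 3])

values, counts = np.unique(arr, return_counts=True)
in_set = np.isin(arr, [2, 3])   # True where the element is 2 or 3

print(values)   # [  1   2   3  54 223]
print(counts)   # [2 2 4 1 1]
print(in_set)
```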

Linear algebra

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y
np.dot(x,y)
array([[ 28.,  64.],
       [ 67., 181.]])
np.dot(x,np.ones(3))
array([ 6., 15.])
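The same products can be written with the `@` operator or the `dot` method. The sketch below repeats the computation above in all three forms; the inner dimensions (3 and 3) must match, and the result takes the outer ones (2x2).

```python
import numpy as np

x = np.array([[1., 2., 3.], [4., 5., 6.]])     # shape (2, 3)
y = np.array([[6., 23.], [-1, 7], [8, 9]])     # shape (3, 2)

prod_func = np.dot(x, y)
prod_method = x.dot(y)
prod_op = x @ y                                # infix matrix multiplication

row_sums = x @ np.ones(3)                      # 2-D times 1-D gives a 1-D result

print(prod_op)
print(row_sums)   # [ 6. 15.]
```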

arr = np.random.normal(size=(4,4))
arr
array([[ 1.11433974, -0.2520489 , -0.2349691 , -0.94610534],
       [ 2.28170964,  0.78521532, -2.05844323, -0.40333454],
       [-0.1225117 , -0.9144343 ,  0.25932307,  0.283972  ],
       [-0.63086567, -1.17039446, -0.20103388, -0.21096491]])
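The post stops after creating this random matrix. A typical next step, assumed here rather than taken from the source, is to apply `numpy.linalg` routines such as `inv`, `det` and `solve`. The sketch uses a small fixed matrix so the results can be checked by hand.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])

X_inv = np.linalg.inv(X)            # matrix inverse
identity = X @ X_inv                # identity matrix, up to rounding error
det = np.linalg.det(X)              # 2*3 - 1*1 = 5

b = np.array([3.0, 5.0])
solution = np.linalg.solve(X, b)    # solves X @ solution = b without forming the inverse

print(np.allclose(identity, np.eye(2)))   # True
print(solution)
```

For solving linear systems, `solve` is generally preferable to multiplying by an explicit inverse.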

Random walk

nstep = 100
draws = np.random.randint(0,2,size=nstep)
draws
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
steps = np.where(draws>0,1,-1)
steps
array([ 1,  1,  1,  1, -1, -1, -1, -1,  1,  1, -1, -1, -1,  1,  1, -1, -1,
       -1, -1,  1,  1,  1, -1,  1, -1, -1, -1, -1,  1,  1, -1,  1, -1,  1,
        1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,
        1, -1, -1, -1,  1, -1, -1,  1, -1,  1, -1, -1, -1, -1,  1, -1,  1,
        1,  1,  1,  1,  1, -1,  1, -1, -1, -1,  1,  1, -1, -1, -1,  1,  1,
       -1,  1,  1,  1,  1, -1,  1, -1, -1,  1,  1, -1,  1, -1,  1])
walk=steps.cumsum()
walk.min()
-19
walk.max()
4
(np.abs(walk)>=5).argmax()
40
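The last expression gives the first step at which the walk is at least 5 away from the origin: `argmax` on a boolean array returns the index of the first `True`. One caveat is that it also returns 0 when there is no `True` at all, so a guard with `any()` is safer. The sketch below repeats the walk with a seeded generator (the seed and the name `first_crossing` are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(12345)       # seeded so the walk is reproducible
nsteps = 100

draws = rng.integers(0, 2, size=nsteps)  # 0 or 1 with equal probability
steps = np.where(draws > 0, 1, -1)       # map to +1 / -1 steps
walk = steps.cumsum()                    # position after each step

crossed = np.abs(walk) >= 5
# argmax alone cannot distinguish "crossed at step 0" from "never crossed"
first_crossing = int(crossed.argmax()) if crossed.any() else None

print(walk.min(), walk.max(), first_crossing)
```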
zhangyang