Python Data Analysis for Beginners

Date: 2023-06-22 14:07:00

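1. numpy

The numpy module is introduced below from five angles:

1) Creating arrays

2) Array attributes and functions

3) Accessing array elements: basic indexing, slicing, boolean indexing, and fancy indexing

4) Statistical functions and linear algebra

5) Random number generation

1.1 Creating arrays
Creating a one-dimensional array

numpy's arange() function creates an ordered one-dimensional array; it is an extended version of the built-in range.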
In [1]: import numpy as np

In [2]: ls1 = range(10)

In [3]: list(ls1)

Out[3]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]: type(ls1)

Out[4]: range

In [5]: ls2 = np.arange(10)

In [6]: list(ls2)

Out[6]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]: type(ls2)

Out[7]: numpy.ndarray

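The sequence produced by arange is not a plain list but a one-dimensional array.
If the elements do not form a regular ordered sequence and are entered by hand, create the array with array() instead.

In [8]: arr1 = np.array((1, 20, 13, 28, 22))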

In [9]: arr1

Out[9]: array([ 1, 20, 13, 28, 22])

In [10]: type(arr1)

Out[10]: numpy.ndarray

The one-dimensional array above is built from a tuple.

In [11]: arr2 = np.array([1, 1, 2, 3, 5, 8, 13, 21])

In [12]: arr2

Out[12]: array([ 1, 1, 2, 3, 5, 8, 13, 21])

In [13]: type(arr2)

Out[13]: numpy.ndarray

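The one-dimensional array above is built from a list.

Creating a two-dimensional array

A two-dimensional array is created from a list of lists or a tuple of tuples.

In [14]: arr3 = np.array(((1, 1, 2, 3), (5, 8, 13, 21), (34, 55, 89, 144)))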

In [15]: arr3

Out[15]:

array([[ 1, 1, 2, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144]])

The example above nests tuples inside a tuple.

In [16]: arr4 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [17]: arr4

Out[17]:

array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12]])

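The example above nests lists inside a list.

Higher-dimensional arrays are rarely needed in the data analysis that follows, so their creation is not covered here; they are built with the same nesting approach.
The arrays above were all entered by hand. numpy also provides several special arrays:

In [18]: np.ones(3) # a one-dimensional array filled with ones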

Out[18]: array([ 1., 1., 1.])

In [19]: np.ones([3,4]) # a 3×4 two-dimensional array filled with ones

Out[19]:

array([[ 1., 1., 1., 1.],

[ 1., 1., 1., 1.],

[ 1., 1., 1., 1.]])

In [20]: np.zeros(3) # a one-dimensional array filled with zeros

Out[20]: array([ 0., 0., 0.])

In [21]: np.zeros([3,4]) # a 3×4 two-dimensional array filled with zeros

Out[21]:

array([[ 0., 0., 0., 0.],

[ 0., 0., 0., 0.],

[ 0., 0., 0., 0.]])

In [22]: np.empty(3) # an uninitialized one-dimensional array (its contents are arbitrary, not guaranteed to be zero)

Out[22]: array([ 0., 0., 0.])

In [23]: np.empty([3,4]) # an uninitialized 3×4 two-dimensional array

Out[23]:

array([[ 0., 0., 0., 0.],

[ 0., 0., 0., 0.],

[ 0., 0., 0., 0.]])

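1.2 Array attributes and functions

In [25]: arr3.shape # shape gives the number of rows and columns

Out[25]: (3, 4)

In [26]: arr3.dtype # dtype gives the data type of the elements

Out[26]: dtype('int32')

Flattening

In [27]: a = arr3.ravel() # ravel flattens a multidimensional array into one dimension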

In [28]: a

Out[28]: array([ 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144])

In [29]: b = arr3.flatten() # flatten also flattens the array

In [30]: b

Out[30]: array([ 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144])

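The difference: ravel returns a view of the original array whenever possible, so no data is copied, but changes made through the view also change the original array. flatten returns an independent copy, so changing it leaves the original array untouched.
The following example compares the two methods: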

In [31]: b[:3] = 0

In [32]: arr3

Out[32]:

array([[ 1, 1, 2, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144]])

Changing b leaves the original array unchanged (flatten).

In [33]: a[:3] = 0

In [34]: arr3

Out[34]:

array([[ 0, 0, 0, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144]])

Changing a changes the original array as well (ravel).
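The view-versus-copy behaviour can be checked directly. The sketch below is illustrative only: it rebuilds an array with the same values as arr3 and uses np.shares_memory to test whether two arrays overlap in memory. The variable names are assumptions made for this example.

```python
import numpy as np

arr = np.array([[1, 1, 2, 3], [5, 8, 13, 21], [34, 55, 89, 144]])

view = arr.ravel()       # for a contiguous array this is a view of arr
copy = arr.flatten()     # always an independent copy

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copy)

copy[0] = -1             # does not touch arr
view[1] = -2             # writes through to arr[0, 1]

# ravel cannot always return a view: flattening the transpose
# in row-major order forces a copy.
shares_transposed = np.shares_memory(arr, arr.T.ravel())

print(shares_view, shares_copy, shares_transposed)
print(arr[0])
```

In other words, ravel avoids a copy only when the memory layout allows it, so code that relies on write-through behaviour should check rather than assume.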

In [35]: arr4

Out[35]:

array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12]])

In [36]: arr4.ndim # number of dimensions

Out[36]: 2

In [37]: arr4.size # total number of elements

Out[37]: 12

In [38]: arr4.T # transpose of the array

Out[38]:

array([[ 1, 5, 9],

[ 2, 6, 10],

[ 3, 7, 11],

[ 4, 8, 12]])

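If the array holds complex numbers, the real attribute returns the real parts and the imag attribute returns the imaginary parts.
With the array attributes and methods covered, here are some functions that operate on arrays:

In [39]: len(arr4) # number of rows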

Out[39]: 3

In [40]: arr3

Out[40]:

array([[ 0, 0, 0, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144]])

In [41]: arr4

Out[41]:

array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12]])

In [42]: np.hstack((arr3,arr4))

Out[42]:

array([[ 0, 0, 0, 3, 1, 2, 3, 4],

[ 5, 8, 13, 21, 5, 6, 7, 8],

[ 34, 55, 89, 144, 9, 10, 11, 12]])

This joins arr3 and arr4 horizontally; the two arrays must have the same number of rows.

In [43]: np.vstack((arr3,arr4))

Out[43]:

array([[ 0, 0, 0, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144],

[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12]])

This joins arr3 and arr4 vertically; the two arrays must have the same number of columns.

In [44]: np.column_stack((arr3,arr4)) # same result as hstack for two-dimensional arrays

Out[44]:

array([[ 0, 0, 0, 3, 1, 2, 3, 4],

[ 5, 8, 13, 21, 5, 6, 7, 8],

[ 34, 55, 89, 144, 9, 10, 11, 12]])

In [45]: np.row_stack((arr3,arr4)) # same result as vstack (row_stack is an alias of vstack)

Out[45]:

array([[ 0, 0, 0, 3],

[ 5, 8, 13, 21],

[ 34, 55, 89, 144],

[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12]])

The reshape() and resize() methods change the number of rows and columns of an array:

In [46]: arr5 = np.array(np.arange(24))

In [47]: arr5 # a one-dimensional array

Out[47]:

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,

17, 18, 19, 20, 21, 22, 23])

In [48]: a = arr5.reshape(4,6)

In [49]: a

Out[49]:

array([[ 0, 1, 2, 3, 4, 5],

[ 6, 7, 8, 9, 10, 11],

[12, 13, 14, 15, 16, 17],

[18, 19, 20, 21, 22, 23]])

reshape turns the one-dimensional array into a two-dimensional array with 4 rows and 6 columns, returned as a new array object.

In [50]: a.resize(6,4)

In [51]: a

Out[51]:

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11],

[12, 13, 14, 15],

[16, 17, 18, 19],

[20, 21, 22, 23]])

resize changes the shape of the array in place.

Conversions: tolist converts an array to a list, and astype() casts the array to another data type. Both are shown below:

In [53]: b = a.tolist()

In [54]: b

Out[54]:

[[0, 1, 2, 3],

[4, 5, 6, 7],

[8, 9, 10, 11],

[12, 13, 14, 15],

[16, 17, 18, 19],

[20, 21, 22, 23]]

In [55]: type(b)

Out[55]: list

In [56]: c = a.astype(float)

In [57]: c

Out[57]:

array([[ 0., 1., 2., 3.],

[ 4., 5., 6., 7.],

[ 8., 9., 10., 11.],

[ 12., 13., 14., 15.],

[ 16., 17., 18., 19.],

[ 20., 21., 22., 23.]])

In [58]: a.dtype

Out[58]: dtype('int32')

In [59]: c.dtype

Out[59]: dtype('float64')

1.3 Accessing array elements

Elements are accessed by indexing and slicing. For a one-dimensional array this works the same way as for lists and tuples:
In [60]: arr7 = np.array(np.arange(10))

In [61]: arr7

Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [62]: arr7[3] # the 4th element

Out[62]: 3

In [63]: arr7[:3] # the first 3 elements

Out[63]: array([0, 1, 2])

In [64]: arr7[3:] # the 4th element and everything after it

Out[64]: array([3, 4, 5, 6, 7, 8, 9])

In [65]: arr7[-2:] # the last 2 elements

Out[65]: array([8, 9])

In [66]: arr7[::2] # every second element, starting from the first

Out[66]: array([0, 2, 4, 6, 8])

Accessing elements of a two-dimensional array:

In [67]: arr8 = np.array(np.arange(12)).reshape(3,4)

In [68]: arr8

Out[68]:

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11]])

In [69]: arr8[1] # the 2nd row

Out[69]: array([4, 5, 6, 7])

In [70]: arr8[:2] # the first 2 rows

Out[70]:

array([[0, 1, 2, 3],

[4, 5, 6, 7]])

In [71]: arr8[[0,2]] # the 1st and 3rd rows

Out[71]:

array([[ 0, 1, 2, 3],

[ 8, 9, 10, 11]])

In [72]: arr8[:,0] # the 1st column

Out[72]: array([0, 4, 8])

In [73]: arr8[:,-2:] # the last 2 columns

Out[73]:

array([[ 2, 3],

[ 6, 7],

[10, 11]])

In [74]: arr8[:,[0,2]] # the 1st and 3rd columns

Out[74]:

array([[ 0, 2],

[ 4, 6],

[ 8, 10]])

In [75]: arr8[1,2] # the element at row 2, column 3

Out[75]: 6

Boolean indexing uses True/False values as the index. Note that the boolean index must be an array object.

In [76]: log = np.array([True,False,False,True,True,False])

In [77]: arr9 = np.array(np.arange(24)).reshape(6,4)

In [78]: arr9

Out[78]:

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11],

[12, 13, 14, 15],

[16, 17, 18, 19],

[20, 21, 22, 23]])

In [79]: arr9[log] # the rows where log is True

Out[79]:

array([[ 0, 1, 2, 3],

[12, 13, 14, 15],

[16, 17, 18, 19]])

In [80]: arr9[~log] # the rows where log is False (use ~ to invert a boolean array; unary minus is not supported on boolean arrays)

Out[80]:

array([[ 4, 5, 6, 7],

[ 8, 9, 10, 11],

[20, 21, 22, 23]])

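Consider a scenario where a one-dimensional array holds region labels and a two-dimensional array holds the observations. How do we select the observations for a target region?

In [81]: area = np.array(['A','B','A','C','A','B','D'])

In [82]: area

Out[82]:

array(['A', 'B', 'A', 'C', 'A', 'B', 'D'],

dtype='<U1')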

In [83]: observes = np.array(np.arange(21)).reshape(7,3)

In [84]: observes

Out[84]:

array([[ 0, 1, 2],

[ 3, 4, 5],

[ 6, 7, 8],

[ 9, 10, 11],

[12, 13, 14],

[15, 16, 17],

[18, 19, 20]])

In [85]: observes[area == 'A']

Out[85]:

array([[ 0, 1, 2],

[ 6, 7, 8],

[12, 13, 14]])

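This returns all observations for region A.

In [86]: observes[(area == 'A') | (area == 'D')] # each condition joined by & (and) or | (or) must be wrapped in parentheses
Out[86]:
array([[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14],
[18, 19, 20]])
This returns all observations for regions A and D.

Boolean indexing can also be combined with basic indexing or slicing:

In [87]: observes[area == 'A'][:,[0,2]]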

Out[87]:

array([[ 0, 2],

[ 6, 8],

[12, 14]])

This returns all rows for region A, keeping only columns 1 and 3.
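When the target set contains several regions, chaining == comparisons with | becomes verbose. The following sketch, which is not part of the original walkthrough, builds the same mask with np.isin and uses ~ to select the complement. The array values mirror the area and observes arrays above.

```python
import numpy as np

area = np.array(['A', 'B', 'A', 'C', 'A', 'B', 'D'])
observes = np.arange(21).reshape(7, 3)

# Two equivalent ways to build the row mask for regions A and D.
mask_or = (area == 'A') | (area == 'D')
mask_isin = np.isin(area, ['A', 'D'])

selected = observes[mask_isin]     # rows for regions A and D
rest = observes[~mask_isin]        # every other row

print(selected)
print(rest)
```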

Fancy indexing: an array (or list) of indices is used as the index to pull elements out of the original array.
In [88]: arr10 = np.arange(1,29).reshape(7,4)

In [89]: arr10

Out[89]:

array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],

[ 9, 10, 11, 12],

[13, 14, 15, 16],

[17, 18, 19, 20],

[21, 22, 23, 24],

[25, 26, 27, 28]])

In [90]: arr10[[4,1,3,5]] # the given rows, in the given order

Out[90]:

array([[17, 18, 19, 20],

[ 5, 6, 7, 8],

[13, 14, 15, 16],

[21, 22, 23, 24]])

In [91]: arr10[[4,1,5]][:,[0,2,3]] # the given rows and columns

Out[91]:

array([[17, 19, 20],

[ 5, 7, 8],

[21, 23, 24]])

In [92]: arr10[[4,1,5],[0,2,3]]

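Out[92]: array([17, 7, 24])
Note that this result is very different from the previous one: the previous command returns a two-dimensional array, while this one pairs the row and column indices element by element and returns a one-dimensional array.

For a simpler way to get the two-dimensional block of the specified rows and columns, use the ix_() function:
In [93]: arr10[np.ix_([4,1,5],[0,2,3])]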

Out[93]:

array([[17, 19, 20],

[ 5, 7, 8],

[21, 23, 24]])

This matches the result of arr10[[4,1,5]][:,[0,2,3]].
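The three indexing forms can be compared side by side. This is a small sketch using the same values as arr10; the names chained, paired, and mesh are chosen here for illustration.

```python
import numpy as np

arr10 = np.arange(1, 29).reshape(7, 4)
rows, cols = [4, 1, 5], [0, 2, 3]

chained = arr10[rows][:, cols]       # rows first, then columns: 3x3 block
paired = arr10[rows, cols]           # pairs (4,0), (1,2), (5,3): 1-D result
mesh = arr10[np.ix_(rows, cols)]     # open mesh of rows x cols: 3x3 block

print(paired)
print(np.array_equal(chained, mesh))
```

np.ix_ builds index arrays of shapes (3, 1) and (1, 3) that broadcast against each other, which is why the result is the full block rather than three paired elements.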

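1.4 Statistical functions and linear algebra

Common aggregate statistics include the minimum, maximum, median, mean, variance, and standard deviation. First, element-wise computations:
In [94]: arr11 = 5-np.arange(1,13).reshape(4,3)

In [95]: arr12 = np.random.randint(1,10,size = 12).reshape(4,3)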
In [96]: arr11

Out[96]:

array([[ 4, 3, 2],

[ 1, 0, -1],

[-2, -3, -4],

[-5, -6, -7]])

In [97]: arr12

Out[97]:

array([[1, 3, 7],

[7, 3, 7],

[3, 7, 4],

[6, 1, 2]])

In [98]: arr11 ** 2 # square of each element

Out[98]:

array([[16, 9, 4],

[ 1, 0, 1],

[ 4, 9, 16],

[25, 36, 49]])

In [99]: np.sqrt(arr11) # square root of each element

Out[99]:

array([[ 2. , 1.73205081, 1.41421356],

[ 1. , 0. , nan],

[ nan, nan, nan],

[ nan, nan, nan]])
Negative values have no real square root, so nan is returned for them.

In [100]: np.exp(arr11) # exponential of each element

Out[100]:

array([[ 5.45981500e+01, 2.00855369e+01, 7.38905610e+00],

[ 2.71828183e+00, 1.00000000e+00, 3.67879441e-01],

[ 1.35335283e-01, 4.97870684e-02, 1.83156389e-02],

[ 6.73794700e-03, 2.47875218e-03, 9.11881966e-04]])

In [101]: np.log(arr12) # natural logarithm of each element

Out[101]:

array([[ 0. , 1.09861229, 1.94591015],

[ 1.94591015, 1.09861229, 1.94591015],

[ 1.09861229, 1.94591015, 1.38629436],

[ 1.79175947, 0. , 0.69314718]])

In [102]: np.abs(arr11) # absolute value of each element

Out[102]:

array([[4, 3, 2],

[1, 0, 1],

[2, 3, 4],

[5, 6, 7]])

Element-wise operations between arrays of the same shape:

In [103]: arr11 + arr12 # addition

Out[103]:

array([[ 5, 6, 9],

[ 8, 3, 6],

[ 1, 4, 0],

[ 1, -5, -5]])

In [104]: arr11 - arr12 # subtraction

Out[104]:

array([[ 3, 0, -5],

[ -6, -3, -8],

[ -5, -10, -8],

[-11, -7, -9]])

In [105]: arr11 * arr12 # multiplication

Out[105]:

array([[ 4, 9, 14],

[ 7, 0, -7],

[ -6, -21, -16],

[-30, -6, -14]])

In [106]: arr11 / arr12 # division

Out[106]:

array([[ 4. , 1. , 0.28571429],

[ 0.14285714, 0. , -0.14285714],

[-0.66666667, -0.42857143, -1. ],

[-0.83333333, -6. , -3.5 ]])

In [107]: arr11 // arr12 # floor division

Out[107]:

array([[ 4, 1, 0],

[ 0, 0, -1],

[-1, -1, -1],

[-1, -6, -4]], dtype=int32)

In [108]: arr11 % arr12 # remainder

Out[108]:

array([[0, 0, 2],

[1, 0, 6],

[1, 4, 0],

[1, 0, 1]], dtype=int32)

Next, the statistical functions:

In [109]: np.sum(arr11) # sum of all elements

Out[109]: -18

In [110]: np.sum(arr11,axis = 0) # sum of each column

Out[110]: array([ -2, -6, -10])

In [111]: np.sum(arr11, axis = 1) # sum of each row

Out[111]: array([ 9, 0, -9, -18])

In [112]: np.cumsum(arr11) # cumulative sum over all elements in row-major order (left to right within a row, then the next row)
Out[112]: array([ 4, 7, 9, 10, 10, 9, 7, 4, 0, -5, -11, -18], dtype=int32)

In [113]: np.cumsum(arr11, axis = 0) # cumulative sum down each column, returned as a two-dimensional array

Out[113]:

array([[ 4, 3, 2],

[ 5, 3, 1],

[ 3, 0, -3],

[ -2, -6, -10]], dtype=int32)

In [114]: np.cumprod(arr11, axis = 1) # cumulative product along each row, returned as a two-dimensional array

Out[114]:

array([[ 4, 12, 24],

[ 1, 0, 0],

[ -2, 6, -24],

[ -5, 30, -210]], dtype=int32)

In [115]: np.min(arr11) # minimum of all elements

Out[115]: -7

In [116]: np.max(arr11, axis = 0) # maximum of each column

Out[116]: array([4, 3, 2])

In [117]: np.mean(arr11) # mean of all elements

Out[117]: -1.5

In [118]: np.mean(arr11, axis = 1) # mean of each row

Out[118]: array([ 3., 0., -3., -6.])

In [119]: np.median(arr11) # median of all elements

Out[119]: -1.5

In [120]: np.median(arr11, axis = 0) # median of each column

Out[120]: array([-0.5, -1.5, -2.5])

In [121]: np.var(arr12) # variance of all elements

Out[121]: 5.354166666666667

In [122]: np.std(arr12, axis = 1) # standard deviation of each row

Out[122]: array([ 2.49443826, 1.88561808, 1.69967317, 2.1602469 ])

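numpy's statistical functions are flexible: they can compute a statistic over all elements or along a chosen axis (per row or per column). Other commonly used functions include sign, ceil (the smallest integer >= x), floor (the largest integer <= x), modf (splits floats into fractional and integer parts, returned as two separate arrays), cos, arccos, sin, arcsin, tan, and arctan.
A particularly handy function is where(), which works like the IF function in Excel and allows flexible transformations: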
In [123]: arr11

Out[123]:

array([[ 4, 3, 2],

[ 1, 0, -1],

[-2, -3, -4],

[-5, -6, -7]])

In [124]: np.where(arr11 < 0, 'negative','positive')

Out[124]:

array([['positive', 'positive', 'positive'],

['positive', 'positive', 'negative'],

['negative', 'negative', 'negative'],

['negative', 'negative', 'negative']],

dtype='<U8')

np.where calls can also be nested to express more complex logic.
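As a sketch of the nesting mentioned above, the inner np.where below handles the zero case that the two-way version labels 'positive'. The label strings are arbitrary choices for this illustration.

```python
import numpy as np

arr11 = 5 - np.arange(1, 13).reshape(4, 3)

# Outer where: negative or not; inner where: zero or positive.
labels = np.where(arr11 < 0, 'negative',
                  np.where(arr11 == 0, 'zero', 'positive'))

print(labels)
```

Each extra category adds one more level of nesting, reading like a chain of if / elif / else applied element by element.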

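Other functions
unique(x): the unique elements of x, returned in sorted order
intersect1d(x,y): the elements common to x and y (the intersection)
union1d(x,y): the union of x and y
setdiff1d(x,y): the set difference, that is, elements in x that are not in y
setxor1d(x,y): the symmetric difference, that is, elements that are in exactly one of the two arrays
in1d(x,y): tests whether each element of x is contained in y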
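A quick sketch of these set functions on two small arrays (the values are made up for illustration). np.isin is used for the membership test; it performs the same check as in1d and is the form recommended in recent NumPy releases.

```python
import numpy as np

x = np.array([3, 1, 2, 2, 5])
y = np.array([2, 5, 7])

uniq = np.unique(x)            # sorted unique values of x
common = np.intersect1d(x, y)  # values in both x and y
union = np.union1d(x, y)       # values in x or y
only_x = np.setdiff1d(x, y)    # values in x but not in y
either = np.setxor1d(x, y)     # values in exactly one of x, y
member = np.isin(x, y)         # per-element membership of x in y

print(uniq, common, union, only_x, either, member)
```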

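Linear algebra

Like R, numpy makes linear algebra computations such as determinants, inverses, traces, eigenvalues, and eigenvectors very convenient. Note that most of these functions live not in the top-level numpy namespace but in its linalg submodule.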
In [125]: arr13 = np.array([[1,2,3,5],[2,4,1,6],[1,1,4,3],[2,5,4,1]])

In [126]: arr13

Out[126]:

array([[1, 2, 3, 5],

[2, 4, 1, 6],

[1, 1, 4, 3],

[2, 5, 4, 1]])

In [127]: np.linalg.det(arr13) # determinant of the square matrix

Out[127]: 51.000000000000021

In [128]: np.linalg.inv(arr13) # inverse of the square matrix

Out[128]:

array([[-2.23529412, 1.05882353, 1.70588235, -0.29411765],

[ 0.68627451, -0.25490196, -0.7254902 , 0.2745098 ],

[ 0.19607843, -0.21568627, 0.07843137, 0.07843137],

[ 0.25490196, 0.01960784, -0.09803922, -0.09803922]])

In [129]: np.trace(arr13) # trace (sum of the diagonal elements); note that trace is not in the linalg submodule

Out[129]: 10

In [130]: np.linalg.eig(arr13) # a tuple of the eigenvalues and the eigenvectors

Out[130]:

(array([ 11.35035004, -3.99231852, -0.3732631 , 3.01523159]),

array([[-0.4754174 , -0.48095078, -0.95004728, 0.19967185],

[-0.60676806, -0.42159999, 0.28426325, -0.67482638],

[-0.36135292, -0.16859677, 0.08708826, 0.70663129],

[-0.52462832, 0.75000995, 0.09497472, -0.07357122]]))

In [131]: np.linalg.qr(arr13) # QR decomposition of the matrix

Out[131]:

(array([[-0.31622777, -0.07254763, -0.35574573, -0.87645982],

[-0.63245553, -0.14509525, 0.75789308, -0.06741999],

[-0.31622777, -0.79802388, -0.38668014, 0.33709993],

[-0.63245553, 0.580381 , -0.38668014, 0.33709993]]),

array([[-3.16227766, -6.64078309, -5.37587202, -6.95701085],

[ 0. , 1.37840488, -1.23330963, -3.04700025],

[ 0. , 0. , -3.40278524, 1.22190924],

[ 0. , 0. , 0. , -3.4384193 ]]))

In [132]: np.linalg.svd(arr13) # singular value decomposition of the matrix

Out[132]:

(array([[-0.50908395, 0.27580803, 0.35260559, -0.73514132],

[-0.59475561, 0.4936665 , -0.53555663, 0.34020325],

[-0.39377551, -0.10084917, 0.70979004, 0.57529852],

[-0.48170545, -0.81856751, -0.29162732, -0.11340459]]),

array([ 11.82715609, 4.35052602, 3.17710166, 0.31197297]),

array([[-0.25836994, -0.52417446, -0.47551003, -0.65755329],

[-0.10914615, -0.38326507, -0.54167613, 0.74012294],

[-0.18632462, -0.68784764, 0.69085326, 0.12194478],

[ 0.94160248, -0.32436807, -0.05655931, -0.07050652]]))

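QR decomposition
QR decomposition factors a matrix into an orthonormal matrix and an upper triangular matrix. The orthonormal matrix Q satisfies Q^T Q = I; the name "QR" comes from the conventional symbol Q for this orthonormal matrix.
MATLAB performs QR decomposition with the qr function, using the syntax [Q,R]=qr(A), where Q is the orthonormal matrix and R is the upper triangular matrix. The matrix A need not be square: if A is m×n, then Q is m×m and R is m×n.
Singular value decomposition
Singular value decomposition (SVD) is another orthogonal factorization. SVD is the most reliable decomposition, but it takes roughly ten times the computation time of QR. In MATLAB, [U,S,V]=svd(A) returns two orthogonal matrices U and V and a diagonal matrix S. As with QR, A need not be square.
SVD is used for solving least-squares problems and for data compression.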
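The factorizations can be sanity-checked in numpy by multiplying the factors back together. This is a minimal sketch using the same matrix as arr13. One difference from the MATLAB call worth noting: np.linalg.svd returns the singular values as a one-dimensional array and the third factor already transposed, which is named Vt here.

```python
import numpy as np

A = np.array([[1, 2, 3, 5],
              [2, 4, 1, 6],
              [1, 1, 4, 3],
              [2, 5, 4, 1]], dtype=float)

Q, R = np.linalg.qr(A)
U, S, Vt = np.linalg.svd(A)

qr_ok = np.allclose(Q @ R, A)                   # A = Q R
orth_ok = np.allclose(Q.T @ Q, np.eye(4))       # Q has orthonormal columns
upper_ok = np.allclose(R, np.triu(R))           # R is upper triangular
svd_ok = np.allclose(U @ np.diag(S) @ Vt, A)    # A = U diag(S) V^T

print(qr_ok, orth_ok, upper_ok, svd_ok)
```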

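1.5 Random number generation

Statistics often deals with the distribution of data, such as the normal, exponential, chi-square, binomial, and Poisson distributions. This section shows how to generate random numbers from such distributions.

Histogram of the normal distribution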

In [137]: import matplotlib.pylab # plotting module

In [138]: np.random.seed(1234) # set the random seed

In [139]: N = 10000 # sample size

In [140]: randnorm = np.random.normal(size = N) # draw normal random numbers

In [141]: counts, bins, path = matplotlib.pylab.hist(randnorm, bins = int(np.sqrt(N)), density = True, color = 'blue') # draw the histogram (bins must be an integer; density replaces the removed normed argument)

The bar heights and bin edges of the histogram are stored in counts and bins.

In [142]: sigma = 1; mu = 0

In [143]: norm_dist = (1/(sigma*np.sqrt(2*np.pi)))*np.exp(-((bins-mu)**2)/(2*sigma**2)) # normal density function

In [144]: matplotlib.pylab.plot(bins,norm_dist,color = 'red') # plot the normal density curve
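The comparison between the histogram and the density curve can also be made numerically, without plotting. The sketch below is an illustration rather than the original method: it uses np.histogram with density=True and the newer np.random.default_rng generator, so the drawn numbers differ from those produced by np.random.seed above.

```python
import numpy as np

rng = np.random.default_rng(1234)      # assumption: Generator API instead of np.random.seed
sample = rng.normal(size=10000)

mu, sigma = 0.0, 1.0
counts, edges = np.histogram(sample, bins=100, density=True)

# With density=True the bar areas sum to 1.
area = np.sum(counts * np.diff(edges))

# Evaluate the normal density at the bin centers and compare.
centers = (edges[:-1] + edges[1:]) / 2
pdf = np.exp(-(centers - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
max_gap = np.max(np.abs(counts - pdf))

print(area, max_gap)
```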

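Gambling with the binomial distribution

Toss 9 coins at once. If fewer than 5 land heads, you lose 8 yuan; otherwise you win 8 yuan. Starting with a stake of 1000 yuan, what might happen after 10000 rounds?
In [146]: np.random.seed(1234)

In [147]: binomial = np.random.binomial(9,0.5,10000) # draw binomial random numbers
In [148]: money = np.zeros(10000) # array holding the stake after each round
In [149]: money[0] = 1000 # the initial stake is 1000 yuan
In [150]: for i in range(1,10000):

 ...:     if binomial[i] < 5:

 ...:         money[i] = money[i-1] - 8  # fewer than 5 heads: lose 8 yuan from the previous stake

 ...:     else:

 ...:         money[i] = money[i-1] + 8  # at least 5 heads: win 8 yuan on top of the previous stake

In [151]: matplotlib.pylab.plot(np.arange(10000), money)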
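The loop can be replaced by array operations. The sketch below is one possible vectorized form, using np.where for the win/lose decision and np.cumsum for the running total, and it checks that the result matches the loop. It draws its numbers with np.random.default_rng, an assumption made for this example, so the sequence differs from the seeded session above.

```python
import numpy as np

draws = np.random.default_rng(1234).binomial(9, 0.5, 10000)

# Loop version, as in the session above.
money_loop = np.zeros(10000)
money_loop[0] = 1000
for i in range(1, 10000):
    money_loop[i] = money_loop[i - 1] + (-8 if draws[i] < 5 else 8)

# Vectorized version: +8 or -8 per round, then a running sum.
steps = np.where(draws[1:] < 5, -8, 8)
money_vec = 1000 + np.concatenate(([0], np.cumsum(steps)))

print(np.array_equal(money_loop, money_vec))
```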

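Random walk with random integers
Where will a drunkard be after walking 10000 steps from the starting position, if every step is random, that is, either +1 or -1?

In [152]: np.random.seed(1234) # set the random seed

In [153]: position = 0 # starting position

In [154]: walk = [] # empty list

In [155]: steps = 10000 # walk 10000 steps

In [156]: for i in np.arange(steps):

 ...:     step = 1 if np.random.randint(0,2) else -1  # each step is random

 ...:     position = position + step  # accumulate the steps

 ...:     walk.append(position)   # record the position after each step

In [157]: matplotlib.pylab.plot(np.arange(10000), walk) # plot the random walk

The code above can also be written with the where and cumsum functions covered earlier: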

In [158]: np.random.seed(1234)

In [159]: step = np.where(np.random.randint(0,2,10000)>0,1,-1)

In [160]: position = np.cumsum(step)

In [161]: matplotlib.pylab.plot(np.arange(10000), position)

2. pandas

2.1 Getting started

import pandas as pd

mydataset = {
'sites': ["Google", "Runoob", "Wiki"],
'number': [1, 2, 3]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

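2.2 pandas data structures: Series

A pandas Series is like a single column of a table. It resembles a one-dimensional array and can hold any data type.
pandas.Series( data, index, dtype, name, copy)

Parameters:

data: the data (for example an ndarray).

index: the index labels; if omitted, they default to integers starting from 0.

dtype: the data type; inferred automatically by default.

name: the name of the Series.

copy: whether to copy the input data; defaults to False.

1. Create a Series from a list:
import pandas as pd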

a = [1, 2, 3]

myvar = pd.Series(a)

print(myvar)

2. We can also specify the index labels:

Example
import pandas as pd

a = ["Google", "Runoob", "Wiki"]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
print(myvar["y"])
The last print statement outputs:

Runoob

We can also create a Series from key/value pairs, such as a dictionary:

Example
import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites)

print(myvar)
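The dictionary keys become the index labels.
If only part of the dictionary is needed, pass the wanted keys as the index:
import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites, index = [1, 2])

print(myvar)
Only the first two entries are kept.
Setting the name parameter of a Series:

Example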
import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites, index = [1, 2], name="RUNOOB-Series-TEST" )

print(myvar)
This sets the name of the Series.
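One detail worth seeing in code: when the index passed alongside a dictionary contains a key that the dictionary lacks, the Series gets a NaN for that label instead of raising an error. A small sketch follows; the extra key 4 and the name "sites" are made up for illustration.

```python
import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

# Key 3 is dropped because it is not requested; key 4 is missing from the dict.
s = pd.Series(sites, index=[1, 2, 4], name="sites")

print(s)
print(s.loc[2])          # label-based lookup
print(s.isna().sum())    # number of missing entries
```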

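2.3 DataFrame

A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series sharing one index.
pandas.DataFrame( data, index, columns, dtype, copy)

Parameters:

data: the data (ndarray, Series, map, lists, dict, and so on).

index: the index values, also called row labels.

columns: the column labels; defaults to RangeIndex (0, 1, 2, …, n).

dtype: the data type.

copy: whether to copy the input data.
A pandas DataFrame is a two-dimensional structure, similar to a two-dimensional array.
1. Example: creating a DataFrame from lists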
import pandas as pd

data = [['Google',10],['Runoob',12],['Wiki',13]]

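df = pd.DataFrame(data, columns=['Site', 'Age'])
df['Age'] = df['Age'].astype(float)  # cast only the numeric column; dtype=float on the whole frame fails on the string column in recent pandas

print(df)

2. A DataFrame can also be built from a list of dictionaries (key/value pairs), where the keys become the column names: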
import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

print (df)
    a   b     c
0  1   2   NaN
1  5  10  20.0
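Entries with no corresponding data are filled with NaN.

3. The loc attribute returns the data for a given row. If no index is set, the first row has index 0, the second row has index 1, and so on:

Example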
import pandas as pd

data = {
"calories": [420, 380, 390],
 "duration": [50, 40, 45]
}

# load the data into a DataFrame
df = pd.DataFrame(data)

# first row
print(df.loc[0])
# second row
print(df.loc[1])
# first and second rows
print(df.loc[[0, 1]])
# set custom row labels
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)  # the rows are labelled day1, day2, day3
Output:

      calories  duration
day1       420        50
day2       380        40
day3       390        45
# select a row by its label
print(df.loc["day2"])
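loc selects by label, while its counterpart iloc selects by integer position. The sketch below contrasts the two on the same data; iloc is not covered in the text above and is added here only for comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = df.loc["day2"]        # row whose label is "day2"
by_position = df.iloc[1]         # second row, regardless of its label
cell = df.loc["day3", "calories"]                    # single value by row and column label
subset = df.loc[["day1", "day3"], ["duration"]]      # sub-frame by label lists

print(by_label.equals(by_position))
print(cell)
print(subset)
```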

2.4 CSV files

pandas handles CSV files conveniently. The examples here use a file named nba.csv.
1. Example
import pandas as pd

df = pd.read_csv('nba.csv')

print(df.to_string())

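to_string() renders the entire DataFrame as a string. Without it, printing a large DataFrame shows only the first 5 and last 5 rows, with the middle replaced by ….
print(df)
2. The to_csv() method saves a DataFrame as a CSV file:

Example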
import pandas as pd

# three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]

# dictionary of columns (named data to avoid shadowing the built-in dict)
data = {'name': nme, 'site': st, 'age': ag}

df = pd.DataFrame(data)

# save the DataFrame
df.to_csv('site.csv')
3. Inspecting data with head()
head( n ) returns the first n rows; if n is omitted, it returns 5 rows.

Example: read the first 5 rows, then the first 10
import pandas as pd

df = pd.read_csv('nba.csv')

print(df.head())
print(df.head(10))

4. tail()
tail( n ) returns the last n rows; if n is omitted, it returns 5 rows. Fields of empty rows are returned as NaN.
Example: read the last 5 rows, then the last 10
import pandas as pd

df = pd.read_csv('nba.csv')

print(df.tail())
print(df.tail(10))

5. info()
The info() method prints basic information about the table:
import pandas as pd

df = pd.read_csv('nba.csv')

df.info()  # info() prints its report itself and returns None

Output:

RangeIndex: 458 entries, 0 to 457       # 458 rows, numbered from 0
Data columns (total 9 columns):         # 9 columns
 #   Column    Non-Null Count  Dtype    # data type of each column
 0   Name      457 non-null    object
 1   Team      457 non-null    object
 2   Number    457 non-null    float64
 3   Position  457 non-null    object
 4   Age       457 non-null    float64
 5   Height    457 non-null    object
 6   Weight    457 non-null    float64
 7   College   373 non-null    object   # non-null counts the cells that are not empty
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)           # summary of the types

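2.5 JSON

JSON (JavaScript Object Notation) is a lightweight, text-based data-interchange format, similar in purpose to XML. It is easy for people to read and write and easy for machines to parse and generate, and it is independent of any programming language. JSON is smaller and simpler to parse than XML, which makes it well suited to data exchange, for example between a website's front end and back end.
Its syntax rules are:
(1) Key/value pairs ( key:value ) represent object properties and their values.
(2) Commas (,) separate items.
(3) Curly braces {} enclose objects.
(4) Square brackets [ ] enclose arrays.
JSON can represent strings, numbers, objects, and arrays, among other types. Objects and arrays are the two most commonly used structured types.
Example
import pandas as pd

df = pd.read_json('sites.json')

print(df.to_string())
to_string() renders the entire DataFrame as a string. JSON-style data already held in memory can also be converted directly.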
import pandas as pd

# JSON-style data held in a dictionary
s = {
"col1":{"row1":1,"row2":2,"row3":3},
"col2":{"row1":"x","row2":"y","row3":"z"}
}

# convert the dictionary to a DataFrame
df = pd.DataFrame(s)
print(df)

Reading JSON data from a URL:

Example
import pandas as pd

URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)
Output:

     id    name             url  likes
0  A001    菜鸟教程  www.runoob.com     61
1  A002  Google  www.google.com    124
2  A003      淘宝  www.taobao.com     45

Nested JSON data
Suppose we have a file of nested JSON data, nested_list.json:
Contents of nested_list.json
{
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "math": 60,
        "physics": 66,
        "chemistry": 61
    },
    {
        "id": "A002",
        "name": "James",
        "math": 89,
        "physics": 76,
        "chemistry": 51
    },
    {
        "id": "A003",
        "name": "Jenny",
        "math": 79,
        "physics": 90,
        "chemistry": 78
    }]
}
Load the whole file with read_json:

Example
import pandas as pd

df = pd.read_json('nested_list.json')

print(df)
Output:

          school_name   class                                           students
0  ABC primary school  Year 1  {'id': 'A001', 'name': 'Tom', 'math': 60, 'phy...
1  ABC primary school  Year 1  {'id': 'A002', 'name': 'James', 'math': 89, 'p...
2  ABC primary school  Year 1  {'id': 'A003', 'name': 'Jenny', 'math': 79, 'p...

The students column still holds raw dictionaries, so we use json_normalize() to expand the nested records:

Example
import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the data
df_nested_list = pd.json_normalize(data, record_path =['students'])
print(df_nested_list)
Output:

     id   name  math  physics  chemistry
0  A001    Tom    60       66         61
1  A002  James    89       76         51
2  A003  Jenny    79       90         78

data = json.loads(f.read()) loads the data with Python's json module.

json_normalize() is called with record_path set to ['students'], which expands the nested students records.

The result does not yet include the school_name and class fields. To show them, pass them through the meta parameter:
Example
import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the data
df_nested_list = pd.json_normalize(
    data,
    record_path =['students'],
    meta=['school_name', 'class']
)
print(df_nested_list)
Output:

     id   name  math  physics  chemistry         school_name   class
0  A001    Tom    60       66         61  ABC primary school  Year 1
1  A002  James    89       76         51  ABC primary school  Year 1
2  A003  Jenny    79       90         78  ABC primary school  Year 1

Next, let us read more complex JSON data that nests both lists and dictionaries. The data file nested_mix.json is shown below:

Contents of nested_mix.json
{
    "school_name": "local primary school",
    "class": "Year 1",
    "info": {
        "president": "John Kasich",
        "address": "ABC road, London, UK",
        "contacts": {
            "email": "admin@e.com",
            "tel": "123456789"
        }
    },
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "math": 60,
        "physics": 66,
        "chemistry": 61
    },
    {
        "id": "A002",
        "name": "James",
        "math": 89,
        "physics": 76,
        "chemistry": 51
    },
    {
        "id": "A003",
        "name": "Jenny",
        "math": 79,
        "physics": 90,
        "chemistry": 78
    }]
}
Converting nested_mix.json to a DataFrame:

Example
import pandas as pd
import json

# load the data with Python's json module
with open('nested_mix.json','r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(
    data,
    record_path =['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)

print(df)
Output:

     id   name  math  physics  chemistry   class info.president info.contacts.tel
0  A001    Tom    60       66         61  Year 1    John Kasich         123456789
1  A002  James    89       76         51  Year 1    John Kasich         123456789
2  A003  Jenny    79       90         78  Year 1    John Kasich         123456789

Reading one field from nested data
The example file nested_deep.json is shown below; we read only the nested math field:

Contents of nested_deep.json
{
"school_name": "local primary school",
"class": "Year 1",
"students": [
{
    "id": "A001",
    "name": "Tom",
    "grade": {
        "math": 60,
        "physics": 66,
        "chemistry": 61
    }

},
{
    "id": "A002",
    "name": "James",
    "grade": {
        "math": 89,
        "physics": 76,
        "chemistry": 51
    }
   
},
{
    "id": "A003",
    "name": "Jenny",
    "grade": {
        "math": 79,
        "physics": 90,
        "chemistry": 78
    }
}]

}
Here we use the glom module to handle the nesting; glom lets us reach attributes of nested objects with dotted paths.

glom must be installed before first use:

pip3 install glom
Example
import pandas as pd
from glom import glom

df = pd.read_json('nested_deep.json')

data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)
Output:

0    60
1    89
2    79
Name: students, dtype: int64
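If installing glom is not an option, json_normalize can reach the same field, because it flattens nested dictionaries inside each record into dotted column names. The sketch below is an alternative to the glom approach, not a replacement for it, and it keeps the data in memory instead of reading nested_deep.json so that it runs on its own.

```python
import pandas as pd

# In-memory copy of the nested_deep.json structure shown above.
data = {
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom",
         "grade": {"math": 60, "physics": 66, "chemistry": 61}},
        {"id": "A002", "name": "James",
         "grade": {"math": 89, "physics": 76, "chemistry": 51}},
        {"id": "A003", "name": "Jenny",
         "grade": {"math": 79, "physics": 90, "chemistry": 78}},
    ],
}

# Each student's nested "grade" dict becomes columns such as "grade.math".
df = pd.json_normalize(data, record_path=["students"], meta=["school_name", "class"])
math = df["grade.math"]

print(df.columns.tolist())
print(math)
```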

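2.6 Data cleaning

Data cleaning is the process of handling data that is unusable as it stands.

Many datasets contain missing values, badly formatted values, wrong values, or duplicates. To make an analysis more accurate, these need to be dealt with first.

In this tutorial we use pandas to clean data.
Test data: https://static.runoob.com/download/property-data.csv

Cleaning empty values with pandas
To delete rows that contain empty fields, use the dropna() method:

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:

axis: defaults to 0, which drops any row containing an empty value; axis=1 drops columns instead.
how: defaults to 'any', which drops a row (or column) if any value in it is NA; how='all' drops it only when every value is NA.
thresh: the minimum number of non-NA values a row (or column) needs in order to be kept.
subset: the columns to check; pass a list of column names for several columns.
inplace: if True, the DataFrame itself is modified and the method returns None.

isnull() tests whether each cell is empty.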
Example
import pandas as pd

df = pd.read_csv('property-data.csv')

print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())

In this example pandas treats n/a and NA as missing, but not na, which is not what we want. We can specify which strings count as missing:
import pandas as pd

missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)

print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())

The next example deletes rows that contain empty values.
import pandas as pd

df = pd.read_csv('property-data.csv')

new_df = df.dropna()

print(new_df.to_string())
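Note: by default, dropna() returns a new DataFrame and does not modify the original.

To modify the original DataFrame, use the inplace = True parameter.

We can also remove only the rows where a particular column is empty:

Example
Remove rows where the ST_NUM column is empty: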

import pandas as pd

df = pd.read_csv('property-data.csv')

df.dropna(subset=['ST_NUM'], inplace = True)

print(df.to_string())

The fillna() method replaces empty fields:
Example
Replace empty fields with 12345:
import pandas as pd

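df = pd.read_csv('property-data.csv')

df.fillna(12345, inplace = True)

print(df.to_string())
We can also replace values in a single column:
Example
Replace empty values in the PID column with 12345:

import pandas as pd

df = pd.read_csv('property-data.csv')

df['PID'] = df['PID'].fillna(12345)  # assign back; inplace=True on a selected column may not update df in recent pandas

print(df.to_string())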

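A common way to fill empty cells is with the column's mean, median, or mode.
pandas provides mean(), median(), and mode() to compute a column's mean (the average of all values), median (the middle value after sorting), and mode (the most frequent value).
Example
Compute the column mean with mean() and use it to fill the empty cells:

import pandas as pd

df = pd.read_csv('property-data.csv')

x = df["ST_NUM"].mean()

df["ST_NUM"] = df["ST_NUM"].fillna(x)

print(df.to_string())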
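The same filling steps can be tried without downloading the CSV file. The sketch below uses a tiny made-up DataFrame whose column names follow property-data.csv; the values are invented for illustration. It assigns results back to the frame rather than using inplace=True on a selected column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for property-data.csv with two missing cells.
df = pd.DataFrame({
    "PID": [100001.0, np.nan, 100003.0],
    "ST_NUM": [104.0, np.nan, 201.0],
})

# Fill one column with its mean (NaN is skipped when computing the mean).
mean_value = df["ST_NUM"].mean()
df["ST_NUM"] = df["ST_NUM"].fillna(mean_value)

# Fill another column with a constant, using a per-column mapping.
df = df.fillna({"PID": 12345})

print(df)
```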

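Cleaning badly formatted data with pandas
Cells with the wrong format make analysis difficult or even impossible.

We can either remove the rows containing such cells or convert every cell in the column to the same format.

The following example normalizes dates:
Example
import pandas as pd

# the third date is in a different format
data = {
"Date": ['2020/12/01', '2020/12/02' , '20201226'],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# format='mixed' needs pandas 2.0 or later; older versions parse each element without it
df['Date'] = pd.to_datetime(df['Date'], format='mixed')

print(df.to_string())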
Cleaning wrong data with pandas
Wrong values are also common; they can be replaced or removed.

The following example replaces a wrong age:
Example
import pandas as pd

person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 40, 12345] # the age 12345 is wrong
}

df = pd.DataFrame(person)
df.loc[2, 'age'] = 30 # fix the value

print(df.to_string())
A condition can be used as well:
Example
Set every age above 120 to 120:

import pandas as pd

person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 200, 12345]
}

df = pd.DataFrame(person)

for x in df.index:
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 120

print(df.to_string())

Rows with wrong data can also be deleted:
Example
Delete rows where age is above 120:

import pandas as pd

person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 40, 12345] # the age 12345 is wrong
}

df = pd.DataFrame(person)

for x in df.index:
    if df.loc[x, "age"] > 120:
        df.drop(x, inplace = True)

print(df.to_string())
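The row-by-row loops above can usually be replaced by a boolean mask, in the same spirit as the numpy boolean indexing shown earlier. The following sketch shows three equivalent-in-effect forms; it works on copies so that the original frame stays intact for comparison.

```python
import pandas as pd

person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345],
}
df = pd.DataFrame(person)

# Cap ages above 120 with a boolean mask and loc.
capped = df.copy()
capped.loc[capped["age"] > 120, "age"] = 120

# The same capping with clip, which returns a new Series.
clipped = df["age"].clip(upper=120)

# Drop the offending rows instead by keeping only the valid ones.
kept = df[df["age"] <= 120]

print(capped)
print(kept)
```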

Cleaning duplicate data with pandas
Duplicates can be handled with the duplicated() and drop_duplicates() methods.

duplicated() returns True for a row that repeats an earlier row and False otherwise:
import pandas as pd

person = {
"name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)

print(df.duplicated())
Output:

0    False
1    False
2     True
3    False
dtype: bool
To remove duplicates, call drop_duplicates() directly.

Example
import pandas as pd

persons = {
"name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
"age": [50, 40, 40, 23]
}

df = pd.DataFrame(persons)

df.drop_duplicates(inplace = True)
print(df)
Output:

     name  age
0  Google   50
1  Runoob   40
3  Taobao   23

3.matplotlib

3.1 matplotlib pyplot

import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 6])
ypoints = np.array([0, 100])

plt.plot(xpoints, ypoints)
plt.show()
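The example above uses pyplot's plot() function, the most basic function for drawing two-dimensional figures.

plot() draws points and lines. Its syntax is:

# a single line
plot([x], y, [fmt], *, data=None, **kwargs)
# multiple lines
plot([x], y, [fmt], [x2], y2, [fmt2], …, **kwargs)
Parameters:

x, y: the points or line nodes; x is the x-axis data and y is the y-axis data, given as lists or arrays.
fmt: optional; a format string defining the basic style (color, marker, and line style).
**kwargs: optional; properties of the two-dimensional plot, such as the label or the line width.
plot(x, y) # 2-D line plot of y against x, using the default style
plot(x, y, 'bo') # plot y against x with blue filled circles
plot(y) # x takes the values 0..N-1
plot(y, 'r+') # red plus signs
Color characters: 'b' blue, 'm' magenta, 'g' green, 'y' yellow, 'r' red, 'k' black, 'w' white, 'c' cyan, '#008000' an RGB hex string. When no colors are given for several curves, different colors are chosen automatically.
Line styles: '-' solid, '--' dashed, '-.' dash-dot, ':' dotted.
Markers: '.' point, ',' pixel (tiny point), 'o' filled circle, 'v' triangle down, '^' triangle up, '>' triangle right, '<' triangle left, and so on.

To draw just two points rather than a line, pass the o format, which marks each point with a filled circle:

Draw the two points (1, 3) and (8, 10)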
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints, 'o')
plt.show()

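Any number of points can be drawn, as long as both axes have the same number of points.

Draw an irregular line through (1, 3), (2, 8), (6, 1), and (8, 10); the corresponding arrays are [1, 2, 6, 8] and [3, 8, 1, 10].

Example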
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

plt.plot(xpoints, ypoints)
plt.show()

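If the x values are not given, x defaults to 0, 1, 2, 3, ..., N-1, where N is the number of y values.

The following example draws a sine curve and a cosine curve by passing two x, y pairs to plt.plot(): the first pair, x and y, is the sine function, and the second pair, x and z, is the cosine function.

Example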
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0,4*np.pi,0.1)
# start,stop,step
y = np.sin(x)
z = np.cos(x)
plt.plot(x,y,x,z)
plt.show()

3.2 Matplotlib markers

To give the plotted points a custom marker, use the marker parameter of plot().

The following example uses a filled circle as the marker:
Example
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([1,3,4,5,8,9,6,1,3,4,5,2,4])

plt.plot(ypoints, marker = 'o')
plt.show()

The marker parameter accepts the following symbols:
Marker               Description
"."                  point
","                  pixel
"o"                  filled circle
"v"                  triangle down
"^"                  triangle up
"<"                  triangle left
">"                  triangle right
"1"                  tri down
"2"                  tri up
"3"                  tri left
"4"                  tri right
"8"                  octagon
"s"                  square
"p"                  pentagon
"P"                  plus (filled)
"*"                  star
"h"                  hexagon 1
"H"                  hexagon 2
"+"                  plus
"x"                  x
"X"                  x (filled)
"D"                  diamond
"d"                  thin diamond
"|"                  vertical line
"_"                  horizontal line
0 (TICKLEFT)         tick left
1 (TICKRIGHT)        tick right
2 (TICKUP)           tick up
3 (TICKDOWN)         tick down
4 (CARETLEFT)        caret left
5 (CARETRIGHT)       caret right
6 (CARETUP)          caret up
7 (CARETDOWN)        caret down
8 (CARETLEFTBASE)    caret left (centered at base)
9 (CARETRIGHTBASE)   caret right (centered at base)
10 (CARETUPBASE)     caret up (centered at base)
11 (CARETDOWNBASE)   caret down (centered at base)
"None", " " or ""    no marker
'$...$'              renders the given string as the marker; for example "$f$" uses the letter f.
Reference: https://www.runoob.com/matplotlib/matplotlib-marker.html

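The fmt parameter
The fmt parameter defines the basic format: marker, line style, and color.

fmt = '[marker][line][color]'
For example, in o:r, o selects the filled-circle marker, : selects a dotted line, and r selects red.

Example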
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

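plt.plot(ypoints, 'o:r')
plt.show()
Line styles:

Line style  Description
'-'         solid line
':'         dotted line
'--'        dashed line
'-.'        dash-dot line

Colors:
Color  Description
'r'    red
'g'    green
'b'    blue
'c'    cyan
'm'    magenta
'y'    yellow
'k'    black
'w'    white

Marker size and color
Marker size and color can be customized with these parameters:
markersize (ms): the size of the marker.
markerfacecolor (mfc): the fill color of the marker.
markeredgecolor (mec): the edge color of the marker.
Setting the marker size:

Example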
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

plt.plot(ypoints, marker = 'o', ms = 20)
plt.show()

Setting the marker edge color:
Example
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

plt.plot(ypoints, marker = 'o', ms = 20, mec = 'r')
plt.show()

Setting the marker fill color:

Example
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

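plt.plot(ypoints, marker = 'o', ms = 20, mfc = 'r')
plt.show()

3.3 Matplotlib lines

The style of a line can be customized, including its type, color, and width.

Line type
The line type is set with the linestyle parameter, abbreviated ls.
Name               Short form   Description
'solid' (default)  '-'          solid line
'dotted'           ':'          dotted line
'dashed'           '--'         dashed line
'dashdot'          '-.'         dash-dot line
'None'             '' or ' '    no line

Example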
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

plt.plot(ypoints, linestyle = 'dotted')
plt.show()

Using the short form:

Example
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

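plt.plot(ypoints, ls = '-.')
plt.show()
Line color
The line color is set with the color parameter, abbreviated c.

Colors:

Color  Description
'r'    red
'g'    green
'b'    blue
'c'    cyan
'm'    magenta
'y'    yellow
'k'    black
'w'    white
Custom colors can be used as well, for example SeaGreen or #8FBC8F; the full set follows the HTML color names and hex values.

Example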
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

plt.plot(ypoints, color = 'r')
plt.show()

Line width
The line width is set with the linewidth parameter, abbreviated lw. The value is a number and may be a float, such as 1, 2.0, or 5.67.

Example
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([6, 2, 13, 10])

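plt.plot(ypoints, linewidth = 12.5)
plt.show()
Multiple lines
Several lines can be drawn either by calling plot() once per line or by passing several x, y pairs to a single plot() call.

Example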
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 7, 5, 9])
y2 = np.array([6, 2, 13, 10])

plt.plot(y1)
plt.plot(y2)

plt.show()
Example
import matplotlib.pyplot as plt
import numpy as np

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 7, 5, 9])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 13, 10])

plt.plot(x1, y1, x2, y2)
plt.show()

3.4 Matplotlib axis labels and titles

The xlabel() and ylabel() methods set the labels of the x axis and the y axis.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])
plt.plot(x, y)

plt.xlabel("x - label")
plt.ylabel("y - label")

plt.show()

Title
The title() method sets the title.

Example
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])
plt.plot(x, y)

plt.title("RUNOOB TEST TITLE")
plt.xlabel("x - label")
plt.ylabel("y - label")

plt.show()

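3.5 Matplotlib grid lines

pyplot's grid() method configures the grid lines of a chart.

Its syntax is:

matplotlib.pyplot.grid(visible=None, which='major', axis='both', **kwargs)
Parameters:

visible: optional, defaults to None. A boolean: True shows the grid lines and False hides them. If any **kwargs are given, the grid is turned on. (Before Matplotlib 3.5 this parameter was named b.)
which: optional; one of 'major', 'minor', or 'both' (default 'major'), selecting which grid lines the change applies to.
axis: optional; which direction of grid lines to show: 'both' (default), 'x', or 'y'.
**kwargs: optional; grid line properties such as color='r', linestyle='-', and linewidth=2, which set the color, style, and width of the grid lines.
plt.grid()
The following example adds a simple grid and passes axis='x' so that only the grid lines for the x axis are shown:

Example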
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

plt.title("RUNOOB grid() Test")
plt.xlabel("x - label")
plt.ylabel("y - label")

plt.plot(x, y)

plt.grid(axis='x') # show only the grid lines for the x axis

plt.show()

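The following example adds a simple grid and sets the style of the grid lines. The format is:

grid(color = 'color', linestyle = 'linestyle', linewidth = number)
Parameters:

color: 'b' blue, 'm' magenta, 'g' green, 'y' yellow, 'r' red, 'k' black, 'w' white, 'c' cyan, '#008000' an RGB hex string.

linestyle: '-' solid, '--' dashed, '-.' dash-dot, ':' dotted.

linewidth: the width of the line, given as a number.

Example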
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

plt.title("RUNOOB grid() Test")
plt.xlabel("x - label")
plt.ylabel("y - label")

plt.plot(x, y)

plt.grid(color = 'r', linestyle = '--', linewidth = 0.5)

plt.show()

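3.6 Matplotlib multiple plots

pyplot's subplot() and subplots() methods draw several subplots in one figure.

subplot() requires the position of each subplot to be specified when it is drawn, while subplots() creates all of them at once; afterwards you only need to call methods on the returned ax objects.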
subplot
subplot(nrows, ncols, index, **kwargs)
subplot(pos, **kwargs)
subplot(**k
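As a rough sketch of the subplots() workflow described above (the source text is cut off at this point, so this is an illustration rather than the original example), the code below creates a figure with two side-by-side axes and draws on each through its own ax object. The Agg backend is selected only so that the snippet runs without a display; in an interactive session that line can be dropped and plt.show() called at the end.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 4 * np.pi, 0.1)

# One row, two columns: axes is an array holding the two Axes objects.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot(x, np.sin(x))
axes[0].set_title("sin")

axes[1].plot(x, np.cos(x), 'r--')
axes[1].set_title("cos")

fig.tight_layout()
```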
