Unlocking the Potential of plt.scatter() With More Than Two Variables

plt.scatter() is one of the most widely used functions in Matplotlib for drawing scatter plots. It can also represent more than two variables in a single plot, which makes it both powerful and flexible.

In this article, you'll learn how to get more out of plt.scatter() by representing more than two variables. You'll also review the customization options that plt.scatter() offers and see how to combine NumPy with Matplotlib to create more detailed scatter plots.

Representing More Than Two Variables

plt.scatter() can represent more than two variables through its parameters x, y, s, c, marker, cmap, and alpha. Consider an example with five variables, mapped as follows: price (x-axis), average number sold (y-axis), profit margin (marker size), product type (marker shape), and sugar content (marker color).

| Variable | Represented by |
| --- | --- |
| Price | X-axis |
| Average number sold | Y-axis |
| Profit margin | Marker size |
| Product type | Marker shape |
| Sugar content | Marker color |

Encoding extra variables in marker size, shape, and color is what lets a single scatter plot convey more than two dimensions of the data.
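
The mapping in the table can be sketched as follows. All data values and the product type names ("drink" and "snack") are invented for illustration, and the factor of 10 applied to the profit margin is an arbitrary scaling choice. Because a single call to plt.scatter() accepts only one marker, this sketch calls it once per product type and shares vmin and vmax so that colors stay comparable across calls.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for six products; every value is invented for illustration.
price = np.array([2.50, 1.25, 4.00, 3.25, 5.00, 4.40])
avg_sold = np.array([34, 62, 49, 22, 13, 19])
profit_margin = np.array([20, 35, 40, 20, 27, 15])  # percent
sugar_content = np.array([15, 35, 0, 25, 45, 5])    # percent
product_type = np.array(["drink", "snack", "drink", "snack", "drink", "snack"])

# One marker per product type (hypothetical choice of shapes).
markers = {"drink": "d", "snack": "o"}

fig = plt.figure()
for kind, marker in markers.items():
    is_kind = product_type == kind  # Boolean mask selecting one product type
    points = plt.scatter(
        x=price[is_kind],
        y=avg_sold[is_kind],
        s=profit_margin[is_kind] * 10,  # scaled up so size differences are visible
        c=sugar_content[is_kind],
        cmap="viridis",
        vmin=sugar_content.min(),       # shared limits keep colors comparable
        vmax=sugar_content.max(),
        marker=marker,
        alpha=0.7,
        label=kind,
    )

plt.colorbar(points, label="Sugar content (%)")
plt.xlabel("Price")
plt.ylabel("Average number sold")
plt.legend(title="Product type")
```

Call plt.show() afterward to display the figure when running this as a script.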

Exploring plt.scatter() Further

plt.scatter() also offers further customization of scatter plots. In this section, you'll use NumPy arrays with plt.scatter() to draw a more detailed scatter plot.

In the following example, you'll generate random data points and then separate them into two distinct regions within the same scatter plot.

A commuter who’s keen on collecting data has collated the arrival times for buses at her local bus stop over a six-month period. The timetabled arrival times are at 15 minutes and 45 minutes past the hour, but she noticed that the true arrival times follow a normal distribution around these times:


This plot shows the relative likelihood of a bus arriving at each minute within an hour. This probability distribution can be represented using NumPy and np.linspace():
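
One way to build such a distribution is sketched below. The means come from the timetable, but the standard deviations of 5 and 7 minutes are assumed values, since the text doesn't state the spread the commuter measured.

```python
import matplotlib.pyplot as plt
import numpy as np

mean = 15, 45  # timetabled arrival times, in minutes past the hour
sd = 5, 7      # assumed standard deviations in minutes (not given in the text)

x = np.linspace(0, 59, 60)  # one value per minute within the hour
first_distribution = np.exp(-0.5 * ((x - mean[0]) / sd[0]) ** 2)
second_distribution = np.exp(-0.5 * ((x - mean[1]) / sd[1]) ** 2)

y = first_distribution + second_distribution
y = y / y.max()  # rescale so the most likely arrival time has a value of 1

plt.plot(x, y)
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
```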


You’ve created two normal distributions centered on 15 and 45 minutes past the hour and summed them. You set the most likely arrival time to a value of 1 by dividing by the maximum value.


You can now simulate bus arrival times using this distribution. To do this, you can create random times and random relative probabilities using the built-in random module. In the code below, you will also use list comprehensions:
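
A minimal sketch of this simulation, assuming arrival times are whole minutes from 0 to 59 and using illustrative variable names:

```python
import random

import numpy as np

n_buses = 40

# Random arrival minute for each bus: randint() includes both endpoints.
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])

# Random relative probability for each bus: random() returns a float in [0, 1).
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])
```

Converting the lists to NumPy arrays makes it possible to index and filter them with Boolean masks later on.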


You’ve simulated 40 bus arrivals, which you can visualize with the following scatter plot:
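
A short sketch of that plot is shown here. It regenerates the simulated data so that it runs on its own, and the axis labels are illustrative.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

n_buses = 40
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

# Each point is one simulated bus: arrival minute against relative probability.
points = plt.scatter(x=bus_times, y=bus_likelihood)
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
```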


Your plot will look different since the data you’re generating is random. However, not all of these points are likely to be close to the reality that the commuter observed from the data she gathered and analyzed. You can plot the distribution she obtained from the data with the simulated bus arrivals:
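
The overlay could look like the sketch below, which draws the distribution as a line and the simulated arrivals as points on the same axes. As before, the standard deviations of 5 and 7 minutes are assumed values, and the legend labels are illustrative.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

# Distribution of observed arrivals (standard deviations are assumptions).
x = np.linspace(0, 59, 60)
y = np.exp(-0.5 * ((x - 15) / 5) ** 2) + np.exp(-0.5 * ((x - 45) / 7) ** 2)
y = y / y.max()

# Simulated arrivals.
n_buses = 40
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

(curve,) = plt.plot(x, y, label="Observed distribution")
points = plt.scatter(x=bus_times, y=bus_likelihood, label="Simulated arrivals")
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
plt.legend()
```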




To keep the simulation realistic, you need to make sure that the random bus arrivals match the data and the distribution obtained from those data. You can filter the randomly generated points by keeping only the ones that fall within the probability distribution. You can achieve this by creating a mask for the scatter plot:
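
A minimal sketch of such a mask follows. It rebuilds the distribution and the simulated arrivals so that it runs on its own. The standard deviations (5 and 7 minutes), the colors, and the "x" marker are assumptions chosen for illustration.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

# Distribution of observed arrivals (standard deviations are assumptions).
x = np.linspace(0, 59, 60)
y = np.exp(-0.5 * ((x - 15) / 5) ** 2) + np.exp(-0.5 * ((x - 45) / 7) ** 2)
y = y / y.max()

# Simulated arrivals.
n_buses = 40
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

# y[bus_times] looks up the distribution's value at each bus's arrival minute.
in_region = bus_likelihood < y[bus_times]
out_region = bus_likelihood >= y[bus_times]

plt.plot(x, y)
plt.scatter(
    x=bus_times[in_region],
    y=bus_likelihood[in_region],
    color="green",
)
plt.scatter(
    x=bus_times[out_region],
    y=bus_likelihood[out_region],
    color="red",
    marker="x",
)
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
```

Indexing y with bus_times works here because the arrival times are integers from 0 to 59 and y holds exactly one value per minute.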


The variables in_region and out_region are NumPy arrays containing Boolean values based on whether the randomly generated likelihoods fall above or below the distribution y. You then plot two separate scatter plots, one with the points that fall within the distribution and another for the points that fall outside the distribution. The data points that fall above the distribution are not representative of the real data:




You’ve segmented the data points from the original scatter plot based on whether they fall within the distribution and used a different color and marker to identify the two sets of data.

Reviewing the Key Input Parameters

You've learned how to use plt.scatter() to represent more than two variables. Here's a summary of the key input parameters to remember:

| Parameter | Description |
| --- | --- |
| x and y | These parameters represent the two main variables and can be any array-like data types, such as lists or NumPy arrays. These are required parameters. |
| s | This parameter defines the size of the marker. It can be a float if all the markers have the same size or an array-like data structure if the markers have different sizes. |
| c | This parameter represents the color of the markers. It will typically be either an array of colors, such as RGB values, or a sequence of values that will be mapped onto a colormap using the parameter cmap. |
| marker | This parameter specifies the marker type, and a single call uses one marker for all points. Common choices include 'o' for circles, '^' for upward-pointing triangles, 's' for squares, 'D' for diamonds, '<' for left-pointing triangles, '>' for right-pointing triangles, and '1' for downward-pointing tri markers (tri_down). |
| cmap | This parameter specifies the colormap used when c contains values rather than colors. Examples include 'viridis', 'plasma', 'inferno', 'magma', or any other valid colormap name. |
| alpha | This parameter specifies the transparency of the markers. It should be a float value between 0 (fully transparent) and 1 (fully opaque). |

In short, plt.scatter() lets you represent more than two variables in one plot and gives you fine-grained control over how each marker looks.
