Welcome to my repository! This comprehensive guide serves as your key to mastering the powerful 🐼 Pandas library, simplifying data operations 🖥️, and unleashing the full potential of your 📊 data analysis tasks.
✨ About This Guide:
Pandas 🐼 is a go-to library for data manipulation, analysis, and transformation🚀. However, its rich set of functions can be intimidating for beginners. This guide is designed to make your journey with Pandas both accessible and enjoyable🥂. It offers a hands-on approach, packed with practical tips, tricks, and real-world examples to help you become proficient in using Pandas for 🎯 a wide range of tasks.
📌 More...
pd.DataFrame(): This function creates a DataFrame from your data, with optional parameters such as the row index and column names.
import pandas as pd
# Sample data
data = {'ProductID': [101, 102, 103, 104],
        'ProductName': ['Laptop', 'Tablet', 'Phone', 'Printer'],
        'Price': [999.99, 299.99, 499.99, 149.99]}
# Define the row index and column names
index_labels = ['A', 'B', 'C', 'D']
column_names = ['ProductID', 'ProductName', 'Price']
# Create the DataFrame
df = pd.DataFrame(data, index=index_labels, columns=column_names)
# Display the DataFrame
print(df)
pd.set_option()
`pd.set_option()` lets you control how Pandas displays output. The options below adjust the maximum number of rows and columns shown, numeric precision, column width, and floating-point formatting to match your preferences.
# Increase the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Increase the maximum number of columns displayed
pd.set_option('display.max_columns', 100)
# 1. Decimal precision
pd.set_option('display.precision', 2)
# 2. Maximum column width
pd.set_option('display.max_colwidth', 50)
# 3. Floating-point formatting
pd.set_option('display.float_format', '{:.2f}'.format)
# 4. Total display width in characters
pd.set_option('display.width', 120)
# 5. Whether to wrap wide DataFrames across multiple lines
pd.set_option('display.expand_frame_repr', True)
Adjusting these settings to your needs can make working with Pandas DataFrames much more comfortable.
pd.read_csv(): Explore the world of data with ease! Read data from a CSV file and convert it into a DataFrame.
import pandas as pd
# Example: read a CSV file that uses a semicolon delimiter, limit the columns, and parse dates
file_path = 'path/to/your/another_file.csv'
# Read the CSV using a semicolon as the delimiter, limiting the columns and handling dates
df = pd.read_csv(file_path, sep=';', skiprows=2, parse_dates=['Tanggal'], na_values='Not Available', usecols=['Tanggal', 'Column1', 'Column2'])
# Display the first five rows of the DataFrame
print(df.head())
🎀 More Explanation:
- sep=';': Use a semicolon (;) as the delimiter in the CSV file.
- skiprows=2: Skip the first two rows of the CSV file.
- parse_dates=['Tanggal']: Convert the 'Tanggal' column to the datetime data type.
- na_values='Not Available': Treat the value 'Not Available' as NaN to handle missing data.
- usecols=['Tanggal', 'Column1', 'Column2']: Read only the columns listed.
- pd.read_excel(): Bring the world of spreadsheets into your DataFrame with pd.read_excel().
- df.to_csv(): Save your analysis results back to a shareable CSV format.
- df.to_excel(): Showcase your hard work in a neatly formatted Excel file.
- df.to_sql(): Let Pandas assist you in saving your DataFrame to your SQL database.
- df.info(): Hand over control to Pandas to provide a comprehensive summary of your DataFrame's structure.
- df.head(): Take a sneak peek at your initial data to gain a quick understanding.
- df.tail(): Scroll through the tail end of your data to inspect the outcomes.
- df.describe(): Display statistical prowess with df.describe() and uncover all the details.
- df.shape: Observe your DataFrame's dimensions, how many rows and columns it has.
- df.columns: Create a list of column names in the blink of an eye.
- df.dtypes: Use dtypes to reveal the secrets of column data types.
- df.isnull(): Let Pandas unveil the places where your data is missing.
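A minimal sketch combining a few of the inspection helpers above, using a small made-up DataFrame (the column names here are purely illustrative):
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({'city': ['Jakarta', 'Bandung', 'Medan'],
                   'population': [10560000, 2440000, 2435000]})

print(df.head())          # First rows
print(df.shape)           # (rows, columns)
print(df.dtypes)          # Data type of each column
print(df.isnull().sum())  # Missing values per column
df.info()                 # Structure summary
print(df.describe())      # Summary statistics for numeric columns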
df['column'].unique(): Returns the unique values of a Series (a column in a DataFrame), i.e., the list of distinct values in that column.
🔥 Example
import pandas as pd
# Creating a larger DataFrame with multiple columns
data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'C'],
    'Count': [10, 5, 8, 12, 5, 7, 10, 15],
    'Status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive', 'Active', 'Inactive', 'Active']
}
df = pd.DataFrame(data)
# Print the output of df.items() (the successor of df.iteritems(), which was removed in pandas 2.0)
print("Output of df.items():")
for column_name, column_values in df.items():
    print(f"Column Name: {column_name}")
    print(column_values)
    print()
# Get unique values and their index for each column using enumerate
for column_name, column_values in df.items():
    unique_values = column_values.unique()
    print(f"\nUnique values in '{column_name}' column:")
    for index, value in enumerate(unique_values):
        print(f"  Index {index}: {value}")
🥇 More Details about `df.items()` & `enumerate`!
df.items() is a Pandas method for iterating over the columns of a DataFrame (it replaces the older df.iteritems(), which was removed in pandas 2.0). It yields a (column_name, Series) pair for each column, giving access to the values in that column.
In this example:
for column_name, column_values in df.items():
    print(f"Column Name: {column_name}")
    print("Column Values:")
    print(column_values)
    print()
This loop walks through every column of the DataFrame (df), printing the column name and the column values (as a Series object).
enumerate is a built-in Python function that yields an index together with a value while iterating over an iterable, which is handy when you need both during iteration.
In this example:
for index, value in enumerate(unique_values):
    print(f"  Index {index}: {value}")
This loop iterates over the unique values of a column (obtained with unique()), printing the index and value of each element; unique_values holds those unique values.
In short, df.items() iterates over the columns, and enumerate is used inside the loop to get the index and value of each unique value within a column. Together they give a more detailed summary of the DataFrame, including the unique values and their indices for every column.
Output:
Output of df.items():
Column Name: Category
0 A
1 B
2 A
3 C
4 B
5 D
6 A
7 C
Name: Category, dtype: object
Column Name: Count
0 10
1 5
2 8
3 12
4 5
5 7
6 10
7 15
Name: Count, dtype: int64
Column Name: Status
0 Active
1 Inactive
2 Active
3 Active
4 Inactive
5 Active
6 Inactive
7 Active
Name: Status, dtype: object
Unique values in 'Category' column:
Index 0: A
Index 1: B
Index 2: C
Index 3: D
Unique values in 'Count' column:
Index 0: 10
Index 1: 5
Index 2: 8
Index 3: 12
Index 4: 7
Index 5: 15
Unique values in 'Status' column:
Index 0: Active
Index 1: Inactive
df['column'].nunique(): Returns the number of unique values in a Series, telling you how varied the data in the column is.
import pandas as pd
# Assuming you have a DataFrame df with a 'Category' column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'C']}
df = pd.DataFrame(data)
# Calculate the number of unique values in 'Category' column
num_unique_values = df['Category'].nunique()
# Print the result with formatted output
print(f"Number of unique values in 'Category' column: {num_unique_values}")
Output:
Number of unique values in 'Category' column: 4
df['column'].value_counts(): Counts how often each value appears in a Series, giving insight into the frequency of each value in the data.
import pandas as pd
# Assuming you have a DataFrame df with a 'Category' column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'C']}
df = pd.DataFrame(data)
# Get value counts for each unique value in 'Category' column
value_counts = df['Category'].value_counts()
# Print formatted output for better readability
print("Value counts for each unique value in 'Category' column:")
for category, count in value_counts.items():
    print(f"{category}: {count}")
Output:
Value counts for each unique value in 'Category' column:
A: 3
B: 2
C: 2
D: 1
- df['column']: Pick and extract the necessary column from the DataFrame.
- df[['column1', 'column2']]: Bundle your data by selecting multiple columns at once.
🔥 Example
import pandas as pd
# Create a DataFrame
data = {'Nama': ['Alice', 'Bob', 'Charlie', 'David'],
        'Usia': [25, 30, 35, 28],
        'Kota': ['Jakarta', 'Surabaya', 'Bandung', 'Medan'],
        'Pekerjaan': ['Guru', 'Dokter', 'Insinyur', 'Desainer']}
df = pd.DataFrame(data)
# Define a function that determines the status of each row
def determine_status(row):
    if row['Usia'] >= 30:
        return 'Dewasa'
    elif row['Usia'] < 30 and row['Pekerjaan'] != 'Dokter':
        return 'Muda'
    else:
        return 'Undefined'
# Add a new column based on the function
df['Status'] = df.apply(determine_status, axis=1)
# Select rows that meet specific criteria
result_complex_filter = df[(df['Usia'] > 25) & (df['Pekerjaan'] == 'Dokter')]
# Display the results
print("Original DataFrame:")
print(df)
print("\nDataFrame after the complex filter:")
print(result_complex_filter)
- df.loc[]: Perform selection based on row and column labels.
- df.iloc[]: Perform selection based on row and column integer positions.
🔥 Example 1
import pandas as pd
# Create a DataFrame
data = {'Nama': ['Alice', 'Bob', 'Charlie', 'David'],
        'Usia': [25, 30, 35, 28],
        'Kota': ['Jakarta', 'Surabaya', 'Bandung', 'Medan'],
        'Pekerjaan': ['Guru', 'Dokter', 'Insinyur', 'Dokter']}
df = pd.DataFrame(data)
# Complex selection with df.loc[] (label-based, accepts a boolean Series)
result_loc = df.loc[(df['Usia'] > 25) & (df['Pekerjaan'] == 'Dokter'), ['Nama', 'Usia', 'Kota']]
# Complex selection with df.iloc[] (position-based, so convert the mask to a plain boolean array)
mask = ((df['Usia'] > 25) & (df['Pekerjaan'] == 'Dokter')).to_numpy()
result_iloc = df.iloc[mask, [0, 1, 2]]
# Display the results
print("Original DataFrame:")
print(df)
print("\nSelection result using df.loc[]:")
print(result_loc)
print("\nSelection result using df.iloc[]:")
print(result_iloc)
🔥 Example 2
import pandas as pd
# Create a DataFrame
data = {'Nama': ['Alice', 'Bob', 'Charlie', 'David'],
        'Usia': [25, 30, 35, 28],
        'Kota': ['Jakarta', 'Surabaya', 'Bandung', 'Medan']}
df = pd.DataFrame(data)
# Add a new column based on a condition
df['Status'] = df['Usia'].apply(lambda x: 'Dewasa' if x >= 30 else 'Muda')
# Add a column with the length of each name
df['Panjang_Nama'] = df['Nama'].apply(len)
# Select rows that meet a specific criterion
result_filtered = df[df['Usia'] > 25]
# Save the DataFrame to a CSV file
df.to_csv('dataframe_example.csv', index=False)
# Read the DataFrame back from the CSV file
df_from_csv = pd.read_csv('dataframe_example.csv')
# Display the results
print("Original DataFrame:")
print(df)
print("\nDataFrame after adding columns and filtering:")
print(result_filtered)
print("\nDataFrame after writing to CSV and reading it back:")
print(df_from_csv)
- df.drop(): Easily remove unwanted columns or rows.
- df.rename(): Personalize your DataFrame by changing column or index names.
- df.set_index(): Give your DataFrame the index that suits you.
- df.reset_index(): Reset the index and return your DataFrame to its original state.
- df.sort_values(): Use sort_values() to arrange data as you desire, as in the example below; a short sketch of the other helpers follows it.
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
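And a short sketch of the other reshaping helpers listed above, using a small made-up DataFrame:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Ben', 'Cara'], 'tmp': [0, 0, 0]})

df = df.drop(columns=['tmp'])              # Remove an unwanted column
df = df.rename(columns={'name': 'Name'})   # Rename a column
df = df.set_index('id')                    # Use 'id' as the index
df = df.reset_index()                      # Back to the default integer index
print(df.sort_values('Name', ascending=False))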
- df.groupby(): Create groups based on specific columns to perform aggregation operations.
- df.agg(): Let Pandas extract the information you need from grouped data.
- df.pivot(): Present your data in a beautiful pivot format.
- df.melt(): Transform data into a long format and discover the pattern.
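A minimal sketch of groupby() and agg(), using a made-up sales table:
import pandas as pd

sales = pd.DataFrame({'region': ['West', 'West', 'East', 'East'],
                      'revenue': [100, 150, 80, 120]})

# Group by region and aggregate revenue in two ways at once
summary = sales.groupby('region')['revenue'].agg(['sum', 'mean'])
print(summary)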
Pandas offers several methods for combining two DataFrames. Here is how the three main methods, merge, concat, and join, differ.
- merge is used when you want to combine DataFrames based on the values in specific columns (joining on key columns).
- By default, merge performs an "inner join", keeping only the rows whose key values match in both DataFrames.
🔥 Example
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key', how='inner')
print(merged_df)
- concat is used to combine DataFrames vertically (adding rows) or horizontally (adding columns).
- By default, concat stacks DataFrames vertically (axis=0); pass axis=1 to concatenate them side by side.
🔥 Example
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Use ignore_index to reset the index after concatenation
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)
By setting ignore_index=True
, it resets the index to a default integer index after concatenation. This can be helpful if you want a clean, continuous index for the resulting DataFrame.
- join is used to combine DataFrames based on their indexes.
- By default, join performs a "left join", keeping every index label from the first DataFrame together with the matching values from the second.
🔥 Example
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['X', 'Z'])
# Explicitly specify the column to join on
joined_df = df1.join(df2[['C', 'D']], how='left')
print(joined_df)
In this version, df2[['C', 'D']]
is used to select only the 'C' and 'D' columns from df2
before performing the join.
🎀 Summary
- merge combines DataFrames based on columns,
- concat combines them vertically or horizontally,
- join combines them based on their indexes.
Which method to choose depends on the specific needs of your analysis.
- df.fillna(): Fill missing values with policies you define (a fillna() sketch follows the replace example below).
- df.replace(to_replace, value): Replace values with new ones as per your requirements.
Example:
short_movies['genre'] = short_movies['genre'].replace(
    ['Uncategorized', 'Comedies', 'Dramas', 'International Movies', 'Action',
     'Music', 'Classic Movies', 'Anime Features', 'Horror Movies'], "other")
short_movies.sample(10)
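And a minimal fillna() sketch on a made-up column with missing values:
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [10, np.nan, 7, np.nan]})

df['score'] = df['score'].fillna(df['score'].mean())  # Fill NaN with the column mean
print(df)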
- df['column'].str.upper(): Elevate all letters to uppercase!
- df['column'].str.contains(): Unearth patterns in text with df['column'].str.contains().
- df['column'].str.title(): Capitalize the first letter of each word in the string.
- df['column'].str.capitalize(): Capitalize the first letter of the string.
- df['column'].str.swapcase(): Swap the case of each character (e.g., "aBc" becomes "AbC").
- df['column'].str.strip(): Remove leading and trailing whitespace.
- df['column'].str.lstrip(): Remove leading whitespace.
- df['column'].str.rstrip(): Remove trailing whitespace.
🌟 More Functions! 🌟
- df['column'].str.strip('char'): Remove specific characters from the start and end.
- df['column'].str.lstrip('char'): Remove specific characters from the start.
- df['column'].str.rstrip('char'): Remove specific characters from the end.
- df['column'].str.len(): Calculate the length (number of characters) of each string.
- df['column'].str.slice(start, stop): Extract a substring based on the specified start and stop positions.
- df['column'].str.replace(old, new): Replace a substring with a new value.
- df['column'].str.replace(r'pattern', new, regex=True): Replace substrings based on a regular expression pattern.
- df['column'].str.contains('substring'): Check whether each string contains a specific substring; returns a boolean.
- df['column'].str.startswith('prefix'): Check whether each string starts with a specific prefix; returns a boolean.
- df['column'].str.endswith('suffix'): Check whether each string ends with a specific suffix; returns a boolean.
- df['column'].str.split('delimiter'): Split each string into a list of substrings based on a delimiter.
- df['column'].str.join('separator'): Join the list elements in each row with a specified separator.
- df['column'].str.cat(sep='separator'): Concatenate the strings in a column, separated by a specified separator.
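A quick sketch exercising a few of the string helpers above on a made-up text column:
import pandas as pd

df = pd.DataFrame({'name': ['  alice smith ', 'BOB LEE', 'cara-jones']})

df['clean'] = df['name'].str.strip().str.title()      # Trim whitespace, capitalize words
df['has_dash'] = df['name'].str.contains('-')         # Boolean pattern check
df['first_part'] = df['clean'].str.split(' ').str[0]  # First token after splitting
print(df)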
- df['column'].dt.year: Peek at the year effortlessly from a datetime column.
- df['column'].dt.month: Unearth the month from datetime data.
- df['column'].dt.day: Extracts the day component from the datetime column.
- df['column'].dt.hour: Extracts the hour component from the datetime column.
- df['column'].dt.minute: Extracts the minute component from the datetime column.
- df['column'].dt.second: Extracts the second component from the datetime column.
🌟 More Functions! 🌟
- df['column'].dt.microsecond: Extracts the microsecond component from the datetime column.
- df['column'].dt.date: Extracts the date component (without the time) from the datetime column.
- df['column'].dt.time: Extracts the time component (without the date) from the datetime column.
- df['column'].dt.dayofweek: Returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
- df['column'].dt.day_name(): Returns the day of the week as a string (e.g., 'Monday', 'Tuesday').
- df['column'].dt.is_month_start: Returns a Boolean indicating if the date is the start of the month.
- df['column'].dt.is_month_end: Returns a Boolean indicating if the date is the end of the month.
- df['column'].dt.is_year_start: Returns a Boolean indicating if the date is the start of the year.
- df['column'].dt.is_year_end: Returns a Boolean indicating if the date is the end of the year.
- df['column'].dt.dayofyear: Returns the day of the year as an integer.
- df['column'].dt.isocalendar().week: Returns the week number of the year (the older .dt.week accessor was removed in recent pandas versions).
- df['column'].dt.weekday: Returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
- df['column'].dt.quarter: Returns the quarter of the year as an integer (1-4).
- df['column'].dt.to_period('M'): Converts the datetime to a period with a specified frequency (e.g., 'M' for monthly).
- df['column'].dt.to_period('D'): Converts the datetime to a period with a daily frequency.
- df['column'].dt.strftime('format_string'): Formats the datetime as a string using a custom format.
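The period and formatting helpers at the end of this list are not covered by the examples below, so here is a small sketch of them on a made-up Series:
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-01-15', '2023-02-20']))

print(dates.dt.to_period('M'))        # Monthly periods: 2023-01, 2023-02
print(dates.dt.strftime('%d %B %Y'))  # Custom string format, e.g. "15 January 2023"
print(dates.dt.isocalendar().week)    # Modern replacement for the removed .dt.week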
🔥 Example 1
import pandas as pd
# Sample DataFrame with datetime data
data = {
'date_column': ['2023-01-01', '2023-02-15', '2023-03-20', '2023-04-10'],
'value': [100, 150, 200, 250]
}
df = pd.DataFrame(data)
# Example 1: Date Parsing and Conversion
df['date_column'] = pd.to_datetime(df['date_column'])
# Example 2: Date Extraction
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day
# Example 3: Resampling (Monthly Sum)
monthly_data = df.set_index('date_column').resample('M').sum()
# Example 4: Time Difference Calculation
df['days_since_event'] = (df['date_column'] - pd.to_datetime('2023-01-01')).dt.days
# Example 5: Date Range Generation
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
# Example 6: Offset by 10 days
df['date_column_offset_10_days'] = df['date_column'] + pd.DateOffset(days=10)
# Example 7: Offset by 2 months
df['date_column_offset_2_months'] = df['date_column'] + pd.DateOffset(months=2)
🔥 Example 2
import pandas as pd
# Sample DataFrame with datetime data
data = {
'date_column': ['2023-01-01', '2023-02-15', '2023-03-20', '2023-04-10'],
}
df = pd.DataFrame(data)
df['date_column'] = pd.to_datetime(df['date_column'])
# Extract the day of the week (0 = Monday, 6 = Sunday)
df['day_of_week'] = df['date_column'].dt.dayofweek
# Filter for weekdays (Monday to Friday, day_of_week 0 to 4)
weekdays_df = df[df['day_of_week'].isin([0, 1, 2, 3, 4])]
# Filter for weekends (Saturday and Sunday, day_of_week 5 and 6)
weekends_df = df[df['day_of_week'].isin([5, 6])]
# Print the results
print("Weekdays:")
print(weekdays_df)
print("\nWeekends:")
print(weekends_df)
- df['column_name'] = df['column_name'].astype('int64')
- df['date_column'] = pd.to_datetime(df['date_column'])
🔥 Example 1
import pandas as pd
# Create a DataFrame
data = {'Nama': ['Alice', 'Bob', 'Charlie', 'David'],
'Usia': [25, 30, 35, 28],
'Kota': ['Jakarta', 'Surabaya', 'Bandung', 'Medan'],
'Pekerjaan': ['Guru', 'Dokter', 'Insinyur', 'Desainer'],
'Gaji': [50000, 80000, 70000, 60000],
'Tanggal_Absen': ['2022-01-01', '2022-02-15', '2022-03-20', '2022-04-10']}
df = pd.DataFrame(data)
# Convert the 'Tanggal_Absen' column to datetime
df['Tanggal_Absen'] = pd.to_datetime(df['Tanggal_Absen'])
# Extract day name and select rows with weekend dates
df['Hari'] = df['Tanggal_Absen'].dt.day_name()
weekend_df = df[df['Tanggal_Absen'].dt.dayofweek.isin([5, 6])]
# Select rows with weekday dates
weekday_df = df[df['Tanggal_Absen'].dt.dayofweek.isin([0, 1, 2, 3, 4])]
# Display the DataFrame with weekend and weekday dates
print("DataFrame with Weekend Dates:")
print(weekend_df[['Nama', 'Hari', 'Tanggal_Absen']])
print("\nDataFrame with Weekday Dates:")
print(weekday_df[['Nama', 'Hari', 'Tanggal_Absen']])
🔥 Example 2
import pandas as pd
# Create a DataFrame with birthdates
data = {'Nama': ['Alice', 'Bob', 'Charlie', 'David'],
'Tanggal_Lahir': ['1992-02-29', '1996-03-15', '1988-02-29', '1995-04-10']}
df = pd.DataFrame(data)
# Convert the 'Tanggal_Lahir' column to datetime
df['Tanggal_Lahir'] = pd.to_datetime(df['Tanggal_Lahir'])
# Extract year and check for leap year
df['Tahun_Kabisat'] = df['Tanggal_Lahir'].dt.year
df['Is_Kabisat'] = df['Tahun_Kabisat'].apply(lambda year: (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0))
# Select rows where the birthdate is on a leap year
leap_year_df = df[df['Is_Kabisat']]
# Display the DataFrame with leap year birthdays
print("DataFrame with Birthdays on Leap Years:")
print(leap_year_df[['Nama', 'Tanggal_Lahir', 'Tahun_Kabisat']])
- df['category_column'] = df['category_column'].astype('category')
- df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
- df['boolean_column'] = df['boolean_column'].astype(bool)
- df['string_column'] = df['string_column'].astype(str)
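A minimal sketch of these conversions on a made-up DataFrame:
import pandas as pd

df = pd.DataFrame({'category_column': ['x', 'y', 'x'],
                   'numeric_column': ['1', '2', 'oops'],
                   'boolean_column': [1, 0, 1]})

df['category_column'] = df['category_column'].astype('category')
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # 'oops' becomes NaN
df['boolean_column'] = df['boolean_column'].astype(bool)
print(df.dtypes)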
The .format() method is used in Python to format strings. It is more flexible and powerful than the older approach based on the % operator. Key elements of .format() include:
- Curly braces {} act as placeholders for the values that will be inserted into the string.
Example:
print("Hello, {}!".format("World"))
- The order of the arguments passed to .format() matches the order of the placeholders in the string.
Example:
print("Name: {}, Age: {}".format("John", 25))
- You can number the placeholders to control the order of the arguments.
Example:
print("Name: {0}, Age: {1}".format("John", 25))
- You can use special numeric formats, such as {:.2f}, to control the number of decimal places for floating-point numbers.
Example:
value = 3.14159
print("Pi value: {:.2f}".format(value))
- In the example above, :.2f means two digits after the decimal point.
- With Python 3.6 and later, you can use f-strings as a more readable and concise alternative.
Example:
name = "John"
age = 25
print(f"Name: {name}, Age: {age}")
- In f-strings, you can write variables directly inside the string using {}.
In short, the .format() method is a flexible way to insert values into strings, and numeric formats such as .2f control how decimal numbers are displayed. F-strings are a modern, more concise alternative that does the same thing, especially in newer Python versions.
🔥 Example
import pandas as pd
# Assuming you have a DataFrame df with a 'Category' column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'A', 'C']}
df = pd.DataFrame(data)
# Calculate the number of unique values in 'Category' column
num_unique_values = df['Category'].nunique()
# Additional variables
total_entries = len(df)
percentage_unique = (num_unique_values / total_entries) * 100
# Using f-string for formatting with additional variables and formatting options
print(f"""
Summary of 'Category' Column:
- Total entries: {total_entries}
- Number of unique values: {num_unique_values}
- Percentage of unique values: {percentage_unique:.2f}%
""")
# Using .format() method for formatting
print("""
Summary of 'Category' Column:
- Total entries: {}
- Number of unique values: {}
- Percentage of unique values: {:.2f}%
""".format(total_entries, num_unique_values, percentage_unique))
# Using % operator for formatting (old-style formatting)
print("""
Summary of 'Category' Column:
- Total entries: %d
- Number of unique values: %d
- Percentage of unique values: %.2f%%
""" % (total_entries, num_unique_values, percentage_unique))
Lambda functions let you define small, one-line functions, which pair naturally with apply():
multiply = lambda a, b: a * b
print("Result of multiplication:", multiply(5, 6))
check_even_odd = lambda x: 'Even' if x % 2 == 0 else 'Odd'
print("Number is:", check_even_odd(8))
df.apply() can also aggregate down columns or across rows; a minimal sketch (the df_apply DataFrame here is made up for illustration):
import numpy as np
import pandas as pd

df_apply = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # Illustrative data
sum_by_column = df_apply.apply(np.sum, axis=0)  # Sum for each column
sum_by_row = df_apply.apply(np.sum, axis=1)     # Sum for each row
Custom functions with conditional logic can also be passed to apply(), as in the salary example below.
🔥 Example
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 35],
'Salary': [50000, 60000, 45000, 70000]}
df = pd.DataFrame(data)
# Define a custom function with a conditional statement
def categorize_salary(salary):
    if salary >= 60000:
        return 'High'
    elif 50000 <= salary < 60000:
        return 'Medium'
    else:
        return 'Low'
# Apply the custom function to create a new column
df['Salary_Category'] = df['Salary'].apply(categorize_salary)
# Display the DataFrame with the new column
print(df)
With axis=1, apply() passes each row to the function, so the function can combine several columns (this snippet assumes a df with 'province' and 'total_order' columns):
def segment_new(province, total_order):
    return 'Potential' if (province == 'DKI Jakarta') and (total_order >= 75) else 'Not Potential'
df['segment_apply'] = df.apply(lambda x: segment_new(x['province'], x['total_order']), axis=1)
A lambda can also create a Boolean flag column, here marking whether 'city' equals 'Jakarta Pusat':
df['is_jakpus'] = df['city'].apply(lambda x: x == 'Jakarta Pusat')
df.head()
String slicing inside apply() extracts the year from an 'order_month' string column:
df['year'] = df['order_month'].apply(lambda x: x[:4])
df.head()
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 35],
'Salary': [50000, 60000, 45000, 70000]}
df = pd.DataFrame(data)
# Apply a one-liner with a lambda function and conditional statement
df['Salary_Category'] = df['Salary'].apply(lambda x: 'High' if x >= 60000 else 'Medium' if 50000 <= x < 60000 else 'Low')
# Display the DataFrame
print(df)
The .iterrows() method iterates over the rows of a DataFrame. It yields each row index together with the row data as a Series.
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

colors = []
for _, row in netflix_movies.iterrows():
    if row['genre'] == 'Children':
        colors.append('red')
    elif row['genre'] == 'Documentaries':
        colors.append('blue')
    elif row['genre'] == 'Stand-Up':
        colors.append('green')
    else:
        colors.append('black')

fig = plt.figure(figsize=(12, 8))
plt.scatter(netflix_movies['release_year'], netflix_movies['duration'], c=colors)
plt.xlabel("Release Year")
plt.ylabel("Duration")

# plt.scatter ignores a per-point list of labels, so build the legend handles manually
legend_handles = [Line2D([0], [0], marker='o', linestyle='', color=color, label=genre)
                  for genre, color in [('Children', 'red'), ('Documentaries', 'blue'),
                                       ('Stand-Up', 'green'), ('Other', 'black')]]
plt.legend(handles=legend_handles)
plt.show()
With the legend handles added, the plot now shows a legend explaining which color represents each genre.
Case Study 1: Netflix Movie Data (DataCamp)
# Print the head of the homelessness data
print(homelessness.head())
# Print information about homelessness
print(homelessness.info())
# Print the shape of homelessness
print(homelessness.shape)
# Print a description of homelessness
print(homelessness.describe())
# Import pandas using the alias pd
import pandas as pd
# Print the values of homelessness
print(homelessness.values)
# Print the column index of homelessness
print(homelessness.columns)
# Print the row index of homelessness
print(homelessness.index)
# Importing pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Read in the Netflix CSV as a DataFrame
netflix_df = pd.read_csv("netflix_data.csv")
# Subset the DataFrame for type "Movie"
netflix_subset = netflix_df[netflix_df["type"] == "Movie"]
# Select only the columns of interest
netflix_movies = netflix_subset[["title", "country", "genre", "release_year", "duration"]]
# Filter for durations shorter than 60 minutes
short_movies = netflix_movies[netflix_movies.duration < 60]
# Define an empty list
colors = []
# Iterate over rows of netflix_movies
for label, row in netflix_movies.iterrows():
    if row["genre"] == "Children":
        colors.append("red")
    elif row["genre"] == "Documentaries":
        colors.append("blue")
    elif row["genre"] == "Stand-Up":
        colors.append("green")
    else:
        colors.append("black")
# Inspect the first 10 values in your list
colors[:10]
# Set the figure style and initialize a new figure
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus release_year
plt.scatter(netflix_movies.release_year, netflix_movies.duration, c=colors)
# Create a title and axis labels
plt.title("Movie Duration by Year of Release")
plt.xlabel("Release year")
plt.ylabel("Duration (min)")
# Show the plot
plt.show()
# Are we certain that movies are getting shorter?
answer = "no"
- Correlation relates to the linearity of two variables -> it does not imply causality.
SEABORN (pandas + matplotlib) --> https://seaborn.pydata.org/
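A tiny Seaborn sketch, assuming the netflix_movies DataFrame from the case study above is available; hue= colors the points by genre and builds the legend automatically:
import seaborn as sns
import matplotlib.pyplot as plt

# One line gives a scatter plot colored by genre, with the legend handled for you
sns.scatterplot(data=netflix_movies, x='release_year', y='duration', hue='genre')
plt.show()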
Exploring and analyzing data often means dealing with missing values, incorrect data types, and outliers.
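A small sketch of those three checks on made-up data (the 'amount' column is purely illustrative):
import pandas as pd

# Made-up data with a missing value, a wrong dtype, and an outlier
df = pd.DataFrame({'amount': ['10', '12', None, '11', '500']})

print(df.isnull().sum())                                     # Missing values per column
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')  # Fix the data type

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df['amount'] < q1 - 1.5 * iqr) | (df['amount'] > q3 + 1.5 * iqr)])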
©️ 2023 | A-Rahmawati