Pandas

before we start this portion of the lesson: check that you have pip installed, since we are going to be installing some libraries today! if you aren't sure whether you have pip, check by running this command:

pip

if your terminal says "command not found" (or something similar on linux), run this:

python3 -m ensurepip --default-pip

Overview: Pandas is a powerful tool in Python that is used for data analysis and manipulation. In this lesson, we will explore how to use Pandas to work with datasets, analyze them, and visualize the results.

Learning Objectives:

By the end of this lesson, students should be able to:

  • Understand what Pandas is and why it is useful for data analysis
  • Load data into Pandas and create tables to store it
  • Use different functions in Pandas to manipulate data, such as filtering, sorting, and grouping
  • Visualize data using graphs and charts

Question #1: who here has used numpy?

(should be all of you, because all of you have used it in this class before.)

what is pandas?

no not this

this:

  • Pandas is a Python library used for data analysis and manipulation.
  • it can handle different types of data, including CSV files and databases.
  • it also allows you to create tables to store and work with your data.
  • it has functions for filtering, sorting, and grouping data to make it easier to work with.
  • it also has tools for visualizing data with graphs and charts.
  • it is widely used in the industry for data analysis and is a valuable skill to learn.
  • companies that use Pandas include JPMorgan Chase, Google, NASA, the New York Times, and many others.
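
to make the "tables" idea concrete, here is a minimal sketch that builds a small DataFrame by hand instead of loading a file. the names and values are typed in for illustration only (they imitate the style of example.csv) and are not a real dataset.

```python
import pandas as pd

# hypothetical rows, typed in by hand for illustration only
people = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith', 'Mike Johnson'],
    'Age': [32, 27, 45],
    'Gender': ['Male', 'Female', 'Male'],
})

print(people)         # the whole table
print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # data type of each column
```

each key in the dictionary becomes a column, and each list becomes that column's values.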

Question #2 & 3:

  • which companies use pandas?
  • what is pandas?

but why is pandas useful?

  • it provides tools for handling and manipulating tabular data, which is a common format for storing and analyzing data.
  • it can handle different types of data, including CSV files and databases.
  • it allows you to perform tasks such as filtering, sorting, and grouping data, making it easier to analyze and work with.
  • it has functions for handling missing data and can fill in or remove missing values, which is important for accurate data analysis.
  • it also has tools for creating visualizations such as graphs and charts, making it easier to communicate insights from the data.
  • it is fast and efficient, even for large datasets, which is important for time-critical data analysis.
  • it is widely used in the industry and has a large community of users and developers, making it easy to find support and resources.

Question #4:

  • why is pandas useful?

how do i flipping use it? it's so hard, my puny brain can't understand it

it is actually really simple

here is pandas doing some simple analysis:

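import pandas as pd

# read your own CSV file into a DataFrame (this sketch assumes it has Age, Gender, and Salary columns)
df = pd.read_csv('your_file.csv')

# show the first five rows
print(df.head())

# average of the Age column
print("Average age:", df['Age'].mean())

# keep only the rows where Gender is 'Female'
females = df[df['Gender'] == 'Female']
print(females)

# sort from highest to lowest Salary
sorted_data = df.sort_values(by='Salary', ascending=False)
print(sorted_data)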
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 import pandas as pd
      3 # Read the CSV file into a Pandas DataFrame
      4 df = pd.read_csv('example.csv')

ModuleNotFoundError: No module named 'pandas'

uh oh! no pandas 😢

if you see this error, enter these into your terminal:

pip install wheel
pip install pandas

on stack overflow, it said pandas is distributed through pip as a wheel, so you need that too.

link to full forum if curious: https://stackoverflow.com/questions/33481974/importerror-no-module-named-pandas
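
a quick way to confirm the install worked, sketched with python's built-in importlib. the list of package names is just the two libraries this lesson uses; change it to whatever you need.

```python
import importlib.util

# libraries this lesson relies on
needed = ['pandas', 'matplotlib']

# find_spec returns None when a package cannot be found for import
missing = [name for name in needed if importlib.util.find_spec(name) is None]

if missing:
    print('still missing:', ', '.join(missing))
else:
    print('all set!')
```

if anything shows up as missing, run pip install for that package and try again.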

ps: run this so the examples work on your laptop:

wget https://raw.githubusercontent.com/KKcbal/amongus/master/_notebooks/files/example.csv

example code on how to load a csv and count the people in each age group

import pandas as pd

# read the CSV file
df = pd.read_csv('example.csv')

# print the first five rows
print(df.head())

# define a function to assign each age to an age group
def assign_age_group(age):
    if age < 30:
        return '<30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    else:
        return '>50'

# apply the function to the Age column to create a new column with age groups
df['Age Group'] = df['Age'].apply(assign_age_group)

# group by age group and count the number of people in each group
age_counts = df.groupby('Age Group')['Name'].count()

# print the age group counts
print(age_counts)
           Name  Age  Gender Occupation
0      John Doe   32    Male   Engineer
1    Jane Smith   27  Female    Teacher
2  Mike Johnson   45    Male    Manager
3      Sara Lee   38  Female     Doctor
4     David Kim   23    Male    Student
Age Group
30-40    7
40-50    4
<30      7
Name: Name, dtype: int64
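
if all you need is the count per group, value_counts is a shorter route than groupby plus count. a minimal sketch, using five made-up ages so it runs without example.csv:

```python
import pandas as pd

# made-up ages for illustration; example.csv is not needed here
df = pd.DataFrame({'Age': [32, 27, 45, 38, 23]})

def assign_age_group(age):
    # same cutoffs as the function above
    if age < 30:
        return '<30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    return '>50'

df['Age Group'] = df['Age'].apply(assign_age_group)

# value_counts counts how often each label appears
age_counts = df['Age Group'].value_counts()
print(age_counts)
```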

how to manipulate the data in pandas.

import pandas as pd

# load the csv file
df = pd.read_csv('example.csv')

# print the first five rows
print(df.head())

# filter the data to include only people aged 30 or older
df_filtered = df[df['Age'] >= 30]

# sort the data by age in descending order
df_sorted = df.sort_values('Age', ascending=False)

# group the data by gender and calculate the mean age for each group
age_by_gender = df.groupby('Gender')['Age'].mean()

# print the filtered data
print(df_filtered)

# print the sorted data
print(df_sorted)

# print the mean age by gender
print(age_by_gender)
           Name  Age  Gender Occupation
0      John Doe   32    Male   Engineer
1    Jane Smith   27  Female    Teacher
2  Mike Johnson   45    Male    Manager
3      Sara Lee   38  Female     Doctor
4     David Kim   23    Male    Student
                Name  Age  Gender               Occupation
0           John Doe   32    Male                 Engineer
2       Mike Johnson   45    Male                  Manager
3           Sara Lee   38  Female                   Doctor
6       Robert Green   41    Male                Architect
7        Emily Davis   35  Female        Marketing Manager
8   Carlos Hernandez   47    Male             Entrepreneur
10         Kevin Lee   31    Male               Accountant
12     Jacob Johnson   34    Male                   Lawyer
13   Maria Rodriguez   39  Female               Consultant
15    Victoria Brown   42  Female  Human Resources Manager
17        Sophie Lee   30  Female          Project Manager
                Name  Age  Gender               Occupation
8   Carlos Hernandez   47    Male             Entrepreneur
2       Mike Johnson   45    Male                  Manager
15    Victoria Brown   42  Female  Human Resources Manager
6       Robert Green   41    Male                Architect
13   Maria Rodriguez   39  Female               Consultant
3           Sara Lee   38  Female                   Doctor
7        Emily Davis   35  Female        Marketing Manager
12     Jacob Johnson   34    Male                   Lawyer
0           John Doe   32    Male                 Engineer
10         Kevin Lee   31    Male               Accountant
17        Sophie Lee   30  Female          Project Manager
5        Anna Garcia   29  Female       Software Developer
14       Mark Taylor   28    Male             Web Designer
1         Jane Smith   27  Female                  Teacher
11      Rachel Baker   26  Female               Journalist
9     Melissa Nguyen   25  Female         Graphic Designer
16        Ethan Chen   24    Male       Research Assistant
4          David Kim   23    Male                  Student
Gender
Female    32.333333
Male      33.888889
Name: Age, dtype: float64
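
filters can also be combined and chained with a sort. a small sketch, using the first five rows from the output above typed in by hand:

```python
import pandas as pd

# the first five rows of the output above, typed in by hand
df = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith', 'Mike Johnson', 'Sara Lee', 'David Kim'],
    'Age': [32, 27, 45, 38, 23],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
})

# combine conditions with & (and) or | (or); each condition needs its own parentheses
men_30_plus = df[(df['Gender'] == 'Male') & (df['Age'] >= 30)]

# chain a sort onto the filtered result
oldest_first = men_30_plus.sort_values('Age', ascending=False)
print(oldest_first)
```

the parentheses matter: without them python evaluates & before == and the filter fails.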

how do i put it into a chart 😩 here is how:

import pandas as pd
import matplotlib.pyplot as plt

# read the CSV file
df = pd.read_csv('example.csv')

# create a bar chart of the number of people in each age group
age_groups = ['<30', '30-40', '40-50', '>50']
# the last bin edge is infinity so the edges always increase, even when the oldest person is under 50
age_counts = pd.cut(df['Age'], bins=[0, 30, 40, 50, float('inf')], labels=age_groups, include_lowest=True).value_counts()
plt.bar(age_counts.index, age_counts.values)
plt.title('Number of people in each age group')
plt.xlabel('Age group')
plt.ylabel('Number of people')
plt.show()

# create a pie chart of the gender distribution
gender_counts = df['Gender'].value_counts()
plt.pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')
plt.title('Gender distribution')
plt.show()

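# create a scatter plot of age vs. income
# (assumes your CSV has an 'Income' column; example.csv does not, so swap in a numeric column you have)
plt.scatter(df['Age'], df['Income'])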
plt.title('Age vs. Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[55], line 2
      1 import pandas as pd
----> 2 import matplotlib.pyplot as plt
      4 # read the CSV file
      5 df = pd.read_csv('example.csv')

ModuleNotFoundError: No module named 'matplotlib'

uh oh! another error! install this library:

pip install matplotlib

then try again:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# read the CSV file
df = pd.read_csv('example.csv')

# define age groups
age_groups = ['<30', '30-40', '40-50', '>50']

# create a new column with the age group for each person
df['Age Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 50, np.inf], labels=age_groups, include_lowest=True)

# group by age group and count the number of people in each group
age_counts = df.groupby('Age Group')['Name'].count()

# create a bar chart of the age counts
age_counts.plot(kind='bar')

# set the title and axis labels
plt.title('Number of People in Each Age Group')
plt.xlabel('Age Group')
plt.ylabel('Number of People')

# show the chart
plt.show()

magic!!!!!!
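
one detail worth knowing about pd.cut: by default each bin includes its right edge, so someone aged exactly 30 lands in '<30'. the sketch below uses four made-up ages to compare the default with right=False, which includes the left edge instead.

```python
import pandas as pd

# made-up ages chosen to sit on or near the bin edges
ages = pd.Series([29, 30, 31, 50])
labels = ['<30', '30-40', '40-50', '>50']
bins = [0, 30, 40, 50, float('inf')]

# default: each bin includes its right edge, so 30 lands in '<30'
default_groups = pd.cut(ages, bins=bins, labels=labels)

# right=False: each bin includes its left edge, so 30 lands in '30-40'
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'default': default_groups, 'right=False': left_closed}))
```

pick whichever matches what your labels are supposed to mean.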

Completed Hacks

Questions:

  1. make your own data using your brain, google, or chatgpt; it should look different than mine.
  2. modify my code or write your own
  3. output your data other than a bar graph.
  4. answer the questions below, the more explained the better.

Questions

1. What are the two primary data structures in pandas and how do they differ?

The two primary data structures in pandas are Series and DataFrame. They differ in their dimensions: a Series is a 1D labeled array that holds a single data type, while a DataFrame is a 2D table made up of multiple Series (its columns) that share a common index.
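
A minimal sketch of the difference, using made-up names and ages:

```python
import pandas as pd

# a Series: one labeled column of values with a single data type
ages = pd.Series([32, 27, 45], name='Age')

# a DataFrame: several columns that share the same row index
df = pd.DataFrame({'Name': ['John', 'Jane', 'Mike'], 'Age': [32, 27, 45]})

print(type(df['Age']))     # selecting one column gives back a Series
print(ages.ndim, df.ndim)  # number of dimensions of each structure
```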

2. How do you read a CSV file into a pandas DataFrame?

df = pd.read_csv('example.csv')

3. How do you select a single column from a pandas DataFrame?

df['column-name']

4. How do you filter rows in a pandas DataFrame based on a condition? Booleans can be used as well.

df[df['column-name'] == value]

5. How do you group rows in a pandas DataFrame by a particular column?

grouped = df.groupby('Name')

6. How do you aggregate data in a pandas DataFrame using functions like sum and mean?

Sum:

sum_df = df.groupby('group').sum()

Mean:

mean_df = df.groupby('group').mean(numeric_only=True)
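
Several aggregates can also be computed in one call with agg. This is a sketch on made-up data; 'group' and 'value' are placeholder column names.

```python
import pandas as pd

# made-up data; 'group' and 'value' are placeholder column names
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'value': [1, 3, 2, 4, 6],
})

# compute the sum and the mean of 'value' for each group in one step
summary = df.groupby('group')['value'].agg(['sum', 'mean'])
print(summary)
```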

7. How do you handle missing values in a pandas DataFrame?

One can handle missing values in a pandas DataFrame by dropping them with dropna() or filling them in with fillna(). It also helps to find the missing values first with isna().
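
A short sketch of the three approaches on a made-up table with one missing age. Filling with the column mean is only one possible choice, picked here for illustration.

```python
import numpy as np
import pandas as pd

# made-up table with one missing age
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [30, np.nan, 40]})

# find: count the missing values in each column
print(df.isna().sum())

# drop: remove every row that has a missing value
dropped = df.dropna()

# fill: replace missing ages with the mean of the known ages
filled = df.fillna({'Age': df['Age'].mean()})

print(dropped)
print(filled)
```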

8. How do you merge two pandas DataFrames together?

merged_df = pd.merge(df1, df2, on='key')
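
The how parameter controls which rows survive the merge. A sketch with two made-up tables that only partly share keys:

```python
import pandas as pd

# two made-up tables; key 3 is only in people, key 4 is only in jobs
people = pd.DataFrame({'key': [1, 2, 3], 'Name': ['John', 'Jane', 'Mike']})
jobs = pd.DataFrame({'key': [1, 2, 4], 'Occupation': ['Engineer', 'Teacher', 'Doctor']})

# default (how='inner'): keep only keys found in both tables
inner = pd.merge(people, jobs, on='key')

# how='left': keep every row of people; a missing job becomes NaN
left = pd.merge(people, jobs, on='key', how='left')

print(inner)
print(left)
```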

9. How do you export a pandas DataFrame to a CSV file?

df.to_csv('__.csv', index=False)
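
To see what to_csv actually writes, this sketch sends the output to an in-memory text buffer instead of a file and then reads it back. The two book rows are copied from the books dataset used later on this page.

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['Things Fall Apart', 'Fairy tales'], 'pages': [209, 784]})

# write to an in-memory buffer so no file is created
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)

# read the same text back into a new DataFrame
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```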

10. What is the difference between a Series and a DataFrame in Pandas?

A Series is a 1D labeled array, whereas a DataFrame is a 2D labeled data structure. A Series holds a single data type, while each column of a DataFrame can have its own data type.

note: all hacks are due saturday night. the earlier you get them in, the higher your score will be. if you miss the due date, you will get a 0. there will be no tolerance.

no questions answered

  • Tonight: 2.9
  • Friday Night: 2.8
  • Saturday Night: 2.7
  • Sunday Night: 0.0

questions answered

  • Tonight: 3.0
  • Friday Night: 2.9
  • Saturday Night: 2.8
  • Sunday Night: 0.0

My Own Dataset about Books and Year Released

import pandas as pd
df = pd.read_json('files/books.json')
print(df)
   author                    country  \
   0             Chinua Achebe                    Nigeria   
   1   Hans Christian Andersen                    Denmark   
   2           Dante Alighieri                      Italy   
   3                   Unknown  Sumer and Akkadian Empire   
   4                   Unknown          Achaemenid Empire   
   ..                      ...                        ...   
   95                    Vyasa                      India   
   96             Walt Whitman              United States   
   97           Virginia Woolf             United Kingdom   
   98           Virginia Woolf             United Kingdom   
   99     Marguerite Yourcenar             France/Belgium   

                        imageLink  language  \
   0       images/things-fall-apart.jpg   English   
   1             images/fairy-tales.jpg    Danish   
   2       images/the-divine-comedy.jpg   Italian   
   3   images/the-epic-of-gilgamesh.jpg  Akkadian   
   4         images/the-book-of-job.jpg    Hebrew   
   ..                               ...       ...   
   95       images/the-mahab-harata.jpg  Sanskrit   
   96        images/leaves-of-grass.jpg   English   
   97           images/mrs-dalloway.jpg   English   
   98      images/to-the-lighthouse.jpg   English   
   99     images/memoirs-of-hadrian.jpg    French   

                                                    link  pages  \
   0   https://en.wikipedia.org/wiki/Things_Fall_Apart\n    209   
   1   https://en.wikipedia.org/wiki/Fairy_Tales_Told...    784   
   2       https://en.wikipedia.org/wiki/Divine_Comedy\n    928   
   3   https://en.wikipedia.org/wiki/Epic_of_Gilgamesh\n    160   
   4         https://en.wikipedia.org/wiki/Book_of_Job\n    176   
   ..                                                ...    ...   
   95        https://en.wikipedia.org/wiki/Mahabharata\n    276   
   96    https://en.wikipedia.org/wiki/Leaves_of_Grass\n    152   
   97       https://en.wikipedia.org/wiki/Mrs_Dalloway\n    216   
   98  https://en.wikipedia.org/wiki/To_the_Lighthouse\n    209   
   99  https://en.wikipedia.org/wiki/Memoirs_of_Hadri...    408   

                 title  year  
   0       Things Fall Apart  1958  
   1             Fairy tales  1836  
   2       The Divine Comedy  1315  
   3   The Epic Of Gilgamesh -1700  
   4         The Book Of Job  -600  
   ..                    ...   ...  
   95            Mahabharata  -700  
   96        Leaves of Grass  1855  
   97           Mrs Dalloway  1925  
   98      To the Lighthouse  1927  
   99     Memoirs of Hadrian  1951  

   [100 rows x 8 columns]
import pandas as pd
df = pd.read_json('files/books.json')
cols_to_print = [ 'title','author', 'pages', 'year']
df = df[cols_to_print]
rows_to_print = [0,1,2,3,4,5, 6, 7, 8]
df = df.iloc[rows_to_print]

print(df)
                        title                   author  pages  year
0            Things Fall Apart            Chinua Achebe    209  1958
1                  Fairy tales  Hans Christian Andersen    784  1836
2            The Divine Comedy          Dante Alighieri    928  1315
3        The Epic Of Gilgamesh                  Unknown    160 -1700
4              The Book Of Job                  Unknown    176  -600
5  One Thousand and One Nights                  Unknown    288  1200
6                  Njál's Saga                  Unknown    384  1350
7          Pride and Prejudice              Jane Austen    226  1813
8               Le Père Goriot         Honoré de Balzac    443  1835
df_sorted = df.sort_values('pages', ascending=False)
df = df_sorted
print(df)
            title                   author  pages  year
    2    The Divine Comedy          Dante Alighieri    928  1315
    1          Fairy tales  Hans Christian Andersen    784  1836
    8       Le Père Goriot         Honoré de Balzac    443  1835
    7  Pride and Prejudice              Jane Austen    226  1813
    0    Things Fall Apart            Chinua Achebe    209  1958
import matplotlib.pyplot as plt

# bar chart: one bar per book, bar height is the page count
plt.bar(df['title'], df['pages'])
plt.title('Pages per Book')
plt.xlabel('Title')
plt.ylabel('Pages')
plt.xticks(rotation=45, ha='right')  # tilt the long titles so they stay readable
plt.show()

# scatter plot: page count vs. year released
plt.scatter(df['pages'], df['year'])
plt.title('Page vs. Year')
plt.xlabel('Page count')
plt.ylabel('Year')
plt.show()

scatterplot

Data Analysis / Predictive Analysis

  • How can Numpy and Pandas be used to preprocess data for predictive analysis?

Numpy and Pandas can be used to preprocess data for predictive analysis in several ways. Both can load data from files such as CSVs and help clean it. The data can then be transformed, organized, and categorized, which makes it easier to keep only the relevant and necessary information.
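
A minimal preprocessing sketch on made-up data with three common problems: a duplicate row, a missing value, and numbers stored as text. Filling the gap with the median is an assumption made for illustration, not the only option.

```python
import numpy as np
import pandas as pd

# made-up raw data: a duplicate row, a gap, and numbers stored as text
raw = pd.DataFrame({
    'Name': ['John', 'Jane', 'Jane', 'Mike'],
    'Age': ['32', '27', '27', None],
})

clean = raw.drop_duplicates().copy()                       # remove the repeated row
clean['Age'] = pd.to_numeric(clean['Age'])                 # text to numbers (None becomes NaN)
clean['Age'] = clean['Age'].fillna(clean['Age'].median())  # fill the gap with the median

# hand the numeric column to NumPy as a plain array
ages = clean['Age'].to_numpy()
print(clean)
print(np.mean(ages))
```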

  • What machine learning algorithms can be used for predictive analysis, and how do they differ?

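Machine learning algorithms that can be used for predictive analysis include linear regression, logistic regression, decision trees, and neural networks. They differ in what they are used for: linear regression predicts a continuous number, logistic regression predicts a category (such as yes or no), decision trees make predictions by splitting the data with a series of rules, and neural networks can learn more complex patterns but usually need more data and computing power.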

  • Can you discuss some real-world applications of predictive analysis in different industries?

Some real-world applications of predictive analysis are in healthcare and marketing. In healthcare, predictive analysis is used to identify patients who are at risk of readmission by keeping track of patient behaviors and the medicines they may need. In marketing, marketers keep track of performance and engagement, including how users behave and engage with content.

  • Can you explain the role of feature engineering in predictive analysis, and how it can improve model accuracy?

Feature engineering is the process of selecting, extracting, transforming, and creating the variables (features) in a dataset, and it plays a significant role in accuracy. Techniques such as correlation analysis, scaling, and normalization (bringing values into a common range) help the model work with useful, comparable inputs, so transforming, selecting, and creating features can improve model accuracy.
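
Two of these techniques, sketched on a made-up table: min-max scaling of a numeric column and one-hot encoding of a text column. The column name Age_scaled is introduced here for illustration.

```python
import pandas as pd

# made-up data for illustration
df = pd.DataFrame({
    'Age': [20, 30, 40],
    'Gender': ['Male', 'Female', 'Male'],
})

# min-max scaling: squeeze Age into the range 0 to 1
age_min, age_max = df['Age'].min(), df['Age'].max()
df['Age_scaled'] = (df['Age'] - age_min) / (age_max - age_min)

# one-hot encoding: turn the text column into one 0/1 column per category
features = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(features)
```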

  • How can machine learning models be deployed in real-time applications for predictive analysis?

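Machine learning models can be deployed in real-time applications for predictive analysis by serving them through a web framework such as Django or Flask, which lets an application send new data to the model and get a prediction back.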

  • Can you discuss some limitations of Numpy and Pandas, and when it might be necessary to use other data analysis tools?

Some limitations of Numpy and Pandas are that they have limited support for very large datasets. Both hold data in memory, so a dataset that is larger than the available memory needs more memory and processing power than one computer may have. Numpy and Pandas provide functions for tasks such as cleaning data, but they do not include natural language processing or predictive modeling. Both also run on a single machine and do not distribute the computation across multiple servers, so other tools may be needed in those cases.
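
One workaround that stays inside pandas is to read a file in pieces with the chunksize parameter of read_csv. In this sketch a ten-row text string stands in for a large file on disk.

```python
import io
import pandas as pd

# stand-in for a big file on disk: ten rows of made-up numbers
csv_text = 'value\n' + '\n'.join(str(n) for n in range(10))

total = 0
# chunksize makes read_csv yield small DataFrames instead of one big one
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()

print(total)
```

Only one chunk is in memory at a time, which helps when the full file would not fit.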

  • How can predictive analysis be used to improve decision-making and optimize business processes?

Predictive analysis can be used to improve decision-making and optimize business processes because it helps identify patterns and anticipate what is likely to happen in certain situations. For example, tracking engagement and user activity helps a business decide where to focus in order to optimize its processes.

Numpy

from skimage import io
import matplotlib.pyplot as plt

photo = io.imread('../images/waldo.jpg')
type(photo)

plt.imshow(photo)
<matplotlib.image.AxesImage at 0x7f980ac869d0>

waldo.jpg

plt.imshow(photo[210:350, 425:500])
<matplotlib.image.AxesImage at 0x7ff43431c6a0>

waldo.jpg
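
The crop works because the photo is a NumPy array, so it can be sliced by rows and columns. The sketch below uses an all-black array as a stand-in for the image; its 400 by 600 size is an assumption, since the real dimensions of waldo.jpg are not shown here.

```python
import numpy as np

# stand-in for a photo: 400 rows x 600 columns x 3 color channels, all black
photo = np.zeros((400, 600, 3), dtype=np.uint8)

# rows 210 to 349 and columns 425 to 499, the same slice as above
crop = photo[210:350, 425:500]

print(photo.shape)
print(crop.shape)
```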

Another example of a numpy function is random.randint, which here generates a random integer from 0 to 99 (the upper bound, 100, is excluded).

from numpy import random

x = random.randint(100)

print(x)
43

The numpy random module can be used for large data sets and simulations, and it is useful in many other cases and settings.
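
As a small simulation sketch, the code below rolls a six-sided die 1000 times. It uses NumPy's newer Generator interface (default_rng) rather than the randint call above; the seed value 42 is arbitrary and only makes the run repeatable.

```python
import numpy as np

# a seeded generator gives the same "random" numbers on every run
rng = np.random.default_rng(seed=42)

# 1000 simulated rolls of a six-sided die; the upper bound 7 is excluded
rolls = rng.integers(1, 7, size=1000)

print(rolls[:10])    # the first ten rolls
print(rolls.mean())  # the average roll
```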