Beginners Guide to Pandas and Numpy: Data Manipulation Lesson

Data Science | Poshan Pandey

Pandas is a Python package built on top of NumPy, with plotting support that relies on Matplotlib, and it is used for data manipulation and visualization. It is widely used across the Python data science community. Tabular (rectangular) data is the most common form of data for analysis, and pandas is designed to work with it in the form of data frames. When you first receive a dataset, you want to explore it quickly and get a sense of its contents. For that, pandas provides several methods and attributes. To use pandas in your Python file, first import it:

import pandas as pd

To read a CSV file with pandas, use:

pd.read_csv("csv_file_location")
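To try this without a file on disk, here is a minimal sketch that reads CSV text from memory. The CSV content and the name `dogs` are invented for illustration and are not from the original dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content, invented for illustration only
csv_text = "name,height_cm\nBella,56\nCharlie,43\n"

# read_csv accepts a file path or any file-like object
dogs = pd.read_csv(io.StringIO(csv_text))
print(dogs)
```

With a real file you would pass its path instead, and usually assign the result to a variable so you can work with it afterwards.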

Once the data is loaded, you can run the following methods on it. Try them yourself.

Let us consider a small data frame and see what each method does to it.

The first is head(), which returns the first few rows of the data frame (five by default). It is most useful when the data frame has many rows; with only a few rows it makes little difference.
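A minimal sketch of head(), assuming a small made-up `dogs` data frame (all names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
})

first_rows = dogs.head()   # first five rows by default
first_two = dogs.head(2)   # or ask for a specific number of rows
print(first_rows)
print(first_two)
```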

Another one is the info() method. It displays the column names, the data type each column contains, and the number of non-missing values in each column, which tells you whether any values are missing.
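As a rough illustration, the sketch below calls info() on a made-up data frame with one missing weight. The data is hypothetical, and the extra `isna().sum()` line is an addition here that counts missing values per column directly.

```python
import pandas as pd

# Hypothetical sample data with one missing weight, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "weight_kg": [25.0, None, 22.0],
})

# Prints column names, non-null counts, and dtypes
dogs.info()

# Count the missing values in each column
missing_per_column = dogs.isna().sum()
print(missing_per_column)
```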

A data frame's shape is an attribute containing a tuple with the number of rows followed by the number of columns. Because shape is an attribute and not a method, we write it without parentheses.
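For instance, with a small hypothetical data frame (values invented for illustration), shape can be read and unpacked like this:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

# shape is an attribute, so there are no parentheses
n_rows, n_cols = dogs.shape
print(n_rows, n_cols)
```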

describe() is another method; it computes summary statistics such as the mean and median for numerical columns, giving a quick overview of the numeric variables. Individual summary statistics are also available as separate methods, such as median, mode, min, max, std, and var.
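A short sketch of describe() and a few individual statistics, again on invented data (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [40, 50, 60],
    "weight_kg": [20, 25, 30],
})

# Summary table covering only the numeric columns
summary = dogs.describe()
print(summary)

# Individual statistics can be computed per column
mean_height = dogs["height_cm"].mean()
median_height = dogs["height_cm"].median()
max_weight = dogs["weight_kg"].max()
```

In the describe() output, the row labelled "50%" is the median.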

A data frame has three main components: values, index, and columns.

The values attribute contains the data frame's values as a two-dimensional NumPy array.

The index and columns attributes hold the row labels and the column names respectively. The index represents the rows, and its labels are often just row numbers.
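The three components can be inspected directly. A minimal sketch on a made-up data frame (the data is hypothetical):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

values = dogs.values          # two-dimensional NumPy array of the data
row_labels = dogs.index       # row labels; here the default 0, 1, 2
column_names = dogs.columns   # column labels

print(values)
print(row_labels)
print(column_names)
```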

Sorting and Subsetting

To sort the rows of a data frame by the values in a column, use:

dataframe_name.sort_values("name_of_column", ascending = True)

You can also sort by one column first and then by another by passing a list of column names:

dataframe_name.sort_values(["name_of_column", "name_of_second_column"], ascending=[True, True])
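To make the sorting calls concrete, here is a sketch on invented data; the `dogs` frame and its columns are assumptions for illustration. Note that sorting by several columns takes a list, and `ascending` can be a matching list of booleans.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 23, 29, 30],
})

# Heaviest dogs first
by_weight = dogs.sort_values("weight_kg", ascending=False)

# Breed alphabetically, then heaviest first within each breed
by_breed_then_weight = dogs.sort_values(
    ["breed", "weight_kg"], ascending=[True, False]
)

print(by_weight)
print(by_breed_then_weight)
```

sort_values returns a new data frame and leaves the original order of `dogs` unchanged.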

Now, to select a certain column or columns from the whole data frame, index it with the column name, or with a list of column names, in square brackets.
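A minimal sketch of column selection, using a hypothetical `dogs` data frame (data invented for illustration):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
})

# Single brackets with one name give a Series
names = dogs["name"]

# Double brackets (a list of names) give a DataFrame
names_and_breeds = dogs[["name", "breed"]]

print(names)
print(names_and_breeds)
```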

You can also use logical conditions to select only the rows that satisfy them.
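Row filtering with logical conditions might look like the sketch below. The data is made up for illustration; when combining conditions with `&` or `|`, each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "height_cm": [56, 43, 46, 59],
})

# A comparison yields a boolean Series, which selects matching rows
tall_dogs = dogs[dogs["height_cm"] > 50]

# Combine conditions with & (and) or | (or), each wrapped in parentheses
tall_labradors = dogs[(dogs["breed"] == "Labrador") & (dogs["height_cm"] > 57)]

print(tall_dogs)
print(tall_labradors)
```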

If you want to filter on multiple values of a categorical variable, the easiest way is to use the isin() method.
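A small sketch of isin(), assuming an invented `dogs` data frame with a `breed` column:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
})

# Keep rows whose breed is any of the listed values
wanted_breeds = ["Labrador", "Poodle"]
labs_and_poodles = dogs[dogs["breed"].isin(wanted_breeds)]

print(labs_and_poodles)
```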

You can also add a new column to your data frame, for example one computed from existing columns. This is also called mutating or transforming the data frame.
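One common way to add a column is to assign to a new column name. The sketch below derives a height in metres from a height in centimetres; the data and column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "height_cm": [56, 43],
})

# Assigning to a new column name creates the column
dogs["height_m"] = dogs["height_cm"] / 100

print(dogs)
```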

There is a lot more to pandas. Other topics and approaches will be added shortly.

PS: This article is based on the DataCamp data science course, and most of the images belong to DataCamp.

Now, It's Time to Explore Numpy:

Numerical Python, popularly known as NumPy, is a fundamental package for scientific computing with Python. It is an open-source Python library whose array type provides an alternative to the regular Python list. It offers powerful N-dimensional arrays and routines to manipulate them. A NumPy array is similar to a Python list but has additional features, such as the ability to perform a calculation over an entire array at once, which is both convenient and fast. NumPy also has derived objects such as masked arrays and matrices. It supports efficient computation on arrays and matrices and provides high-level mathematical functions that operate on them. NumPy arrays typically consume less memory than equivalent Python lists. Libraries such as SciPy, Matplotlib, OpenCV, scikit-image, scikit-learn, and pandas use NumPy arrays to store multi-dimensional data.

To install NumPy on your device, make sure you have already installed Python and pip, and then run the following in the terminal:

pip install numpy

Now, to actually use NumPy in your program, you first need to import it:

import numpy

Remember that the whole statement must be in lowercase. Whenever we use a function from numpy, we call it as below:

numpy.array

For ease, you can import numpy under a shorter name and use that name to call the function, like below:

import numpy as np
np.array

Now let's initialize our first arrays, first_array and second_array:

first_array = np.array([1, 2, 3, 4])
second_array = np.array([5, 6, 7, 8])

You can also initialize matrices as follows:

first_matrix = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8]])

We can access these elements as follows:

print(first_matrix[0])

This will print [1 2 3 4 5 6 7 8], the first row of the matrix (the first array inside the array).

NumPy has functions, such as np.zeros and np.ones, to automatically create arrays and matrices of 0's and 1's.
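A minimal sketch using np.zeros and np.ones (the variable names are illustrative):

```python
import numpy as np

# One-dimensional array of three zeros (floats by default)
zeros_vector = np.zeros(3)

# Two-by-three matrix of ones; the shape is passed as a tuple
ones_matrix = np.ones((2, 3))

print(zeros_vector)
print(ones_matrix)
```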

Many functions and attributes are available for arrays and matrices, such as sort, concatenate, ndim, size, and shape, as follows:

>>> array_example = np.array([[[0, 1, 2, 3],
...                            [4, 5, 6, 7]],
...
...                           [[0, 1, 2, 3],
...                            [4, 5, 6, 7]],
...
...                           [[0, 1, 2, 3],
...                            [4, 5, 6, 7]]])

>>> array_example.ndim
3
>>> array_example.size
24
>>> array_example.shape
(3, 2, 4)
>>> np.concatenate((first_array, second_array))
array([1, 2, 3, 4, 5, 6, 7, 8])
>>> first_array.shape
(4,)
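Sorting is mentioned above, so here is a brief sketch of it. The array contents are made up; np.sort returns a sorted copy, while the array's own sort() method sorts in place.

```python
import numpy as np

values = np.array([3, 1, 4, 1, 5])

# np.sort returns a new sorted array and leaves the original untouched
sorted_copy = np.sort(values)

# The sort() method sorts the array it is called on, in place
in_place = values.copy()
in_place.sort()

print(sorted_copy)
print(values)
```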

There are also functions to reshape an array, as follows:

>>> first_array = np.arange(6)
>>> print(first_array)
[0 1 2 3 4 5]

>>> second_array = first_array.reshape(3, 2)
>>> print(second_array)
[[0 1]
 [2 3]
 [4 5]]

Indexing and slicing of an array work with square brackets, in much the same way as for Python lists.
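A short sketch of indexing and slicing on one- and two-dimensional arrays (the arrays and names are invented for illustration):

```python
import numpy as np

numbers = np.array([10, 20, 30, 40, 50])

first_item = numbers[0]     # indexing starts at 0
last_item = numbers[-1]     # negative indices count from the end
middle = numbers[1:4]       # slice from index 1 up to, not including, 4

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

single_value = matrix[1, 2]   # row 1, column 2
first_column = matrix[:, 0]   # every row, column 0
row_tail = matrix[0, 1:]      # row 0, columns 1 onward

print(middle, first_column, row_tail)
```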

Basic Array Operations:
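As a minimal sketch, arithmetic between arrays works element by element, and aggregate methods reduce an array to a single value. The arrays reuse the values of first_array and second_array from earlier; the other names are illustrative.

```python
import numpy as np

first_array = np.array([1, 2, 3, 4])
second_array = np.array([5, 6, 7, 8])

# Element-wise arithmetic, with no explicit loop
added = first_array + second_array
multiplied = first_array * second_array
doubled = first_array * 2

# Aggregations over the whole array
total = first_array.sum()
largest = second_array.max()
average = first_array.mean()

print(added, multiplied, doubled)
print(total, largest, average)
```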

Finally, I hope you have learned the basics of the NumPy and pandas packages and will be able to apply them in real-life situations.

Here are some of the references for you to dive deep into pandas and NumPy:
