We all are surrounded by data. Data is a magical thing that is growing just the same as Carbon-dioxide has grown in our environment, with the difference that it is good for us, not as bad as growing global warming. I say this because if we have a huge amount of data, then we will have the power to make better decisions in the future.

We all generate new data every day, either by liking a post or commenting on the other’s post, or when we upload a new post on any social site. 

Nowadays, companies are very sensitive about data, as collecting, storing, processing, and analyzing the data is vital for making better decisions. Every company have to take web development seriously as everything is online in the present and will be in the future world. 

There are so many tools and programming languages that help us to do the above mentioned tasks. Excel is a powerful spreadsheet tool for doing data analyses. 

But it has so many limitations when it comes to tackling a huge amount of data. Most companies use EXCEL + VBA Scripting to do some complex calculations, but it also has various limitations. 

So data analysts always try to adopt new ways that help them to speed up their work and generate quality analysis. To do that, data analysts use Programming languages that are far more powerful than any other spreadsheet tool. Python and R are the most preferred programming languages for doing data analysis.

In this blog, I will not talk about the R programming language, but we will explore the power of Python. You will learn how to handle large amounts of data using a real-life example through this blog.

Requirements to start Programming

What you will require before starting the actual programming:

  1. Python should be installed in your system
  2. You should have an editor where you write the python code. I suggest you install Jupyter Notebook.
  3. Install Numpy and Pandas Library before starting the coding.
  4. Last but the most important point is that you should have the curiosity to go beyond the limits of using data. Curiosity is key!

Now that you have all the requirements aligned, let’s start the journey of data analysis.

Setting Workspace

  • Open your Jupyter notebook and import the following statements:
    1. import numpy as np
    2. import pandas as pd
    3. import os
  • Execute the cell by pressing Shift + Enter

Importing Data

View the file format of your data. And add code accordingly:

If you have a CSV file, then write the following code:

df = pd.read_csv(r“Actual_path_of_your_csv_file”)

If you have an Excel file, then write following code:

df = pd.read_excel(open(r“Actual_path_of_your_excel_file”, “rb”), sheet_name=“Name_of_sheet_which_you_want_to_import”)

I have an excel sheet, so I used the second option in the following example.

Basic Functions about the Data

Now you have imported the data into Python. Next step is that you need to apply so that you have a Bird Eye View of your data.

Shape Function

The shape function shows you the total number of rows and columns in your imported file. Write df.shape in your Jupyter notebook cell and execute the cell by pressing Shift+Enter.

If you are only interested in Rows, then write df.shape[0]

If you are only interested in Columns, then write df.shape[1] 

Head Function

If you want to see the top few records, then you can use head(). Write df.head() in your Jupyter notebook cell and execute the cell by pressing Shift+Enter. It will return a data frame with the top five records.

If you want to see more than 5 records, then you can mention the number in round brackets df.head(10) now it returns the top 10 records. 

Tail Function

If you want to see the few records from the bottom, then you can use tail(). Write df.tail() in your Jupyter notebook cell and execute the cell by pressing Shift+Enter. It will return a data frame with the bottom five records.

If you want to see more than five records, then you can mention the number in round brackets df.tail(10); now, it returns the top 10 records. 

Getting all Column Names

If you want to get the names of all columns, then you just simply write df.columns, and it will return the all column name.

Getting the Specific Column

You can extract any column by using its name. After applying the code below which will return you a list of values that are stored in the column.

Syntax:

Dataframe[“Column_name”]

Example:

df[“Candidate Name”]

Check the Data Type of Column

Now, as we know that we store data in columns, and we will be curious to know about the data type of column before applying any operations on it. So for that, write the following code in your Jupyter Notebook cell:

Syntax:

Dataframe[“Column_name”].dtype()

Example:

df[“Candidate Age”].dtype()

Sum Function

If you have some numeric columns in your data and you just want to know the Sum by adding each value of that particular column, then you can use the sum() function.

Before applying this formula make sure that column type is not String

Syntax:

Dataframe[“Column_name”].sum()

Example:

df[” Total Valid Votes”].sum()

In the following example, I sum up all the valid votes which are polled in 117 constituencies of Punjab.

Finding the Average of a Particular Column

If you want to find the average of a column then you can use mean() function

Syntax:

Dataframe[“Column_name”].mean()

Example:

df[” Total Valid Votes”].mean()

In the following example, I got the average votes which are polled for each candidate.

Finding the Maximum Value of a Particular Column

If you want to find the maximum value of a column then you can use max() function

Syntax:

Dataframe[“Column_name”].max()

Example:

df[” Total Valid Votes”].max()

In the following example, I got the maximum votes poll for the candidate.

Finding the Minimum Value of a Particular Column

If you want to find the minimum value of a column then you can use min() function

Syntax:

Dataframe[“Column_name”].min()

Example:

df[” Total Valid Votes”].min()

In the following example, I got the minimum votes polled for the candidate.

Finding the Standard Deviation of a Particular Column

If you want to find the minimum value of a column then you can use std() function

Syntax:

Dataframe[“Column_name”].std()

Example:

df[” Total Valid Votes”].std()

Basic String Functions

Now let us discuss some very useful string functions which are helpful in your day-to-day job. But before applying these string functions, make sure the column type is String.

Finding the Length of String

If you want to find the minimum value of a column then you can use std() function

Syntax:

Dataframe[“Column_name”].str.len()

Example:

df[“Constituency Name”].str.len()

It will return the list which has numeric values, and these numeric values represent the length of the corresponding String. You can add this list as a New Column if you want to show the length of the String in your data.

Capitalizing the First Character of each Word

As you know that we can not have a Title Case(Capitalize first character of each word) function in excel but python have. So for that use title() function

Syntax:

Dataframe[“Column_name”].str.title()

Example:

df[“Candidate Name”].str.title()

Upper Case

You can use upper() function to make a string characters uppercase

Syntax:

Dataframe[“Column_name”].str.upper()

Example:

df[“Candidate Name”].str.upper()

Lower Case

You can use lower() function to make a string characters lowercase

Syntax:

Dataframe[“Column_name”].str.lower()

Example:

df[“Candidate Name”].str.lower()

Getting Specific Record

To get the specific record from your data, you may confirm that your data has at least one column which has a unique value. The concept is similar to a Primary key in SQL. You can also mix up multiple columns to get a specific record.

Like in my example I extract the records by using Constituency Name and Candidate Name following code:

df[(df[“Constituency Name”] == “Sultanpur Lodhi “) & (df[“Candidate Name”] == “SAJJAN SINGH CHEEMA”)] 

Getting a Group of Records

Sometimes you might want to extract the data which belongs to the same category. Like in the following example, I want to extract the data for Sultanpur Lodhi Constituency, and I want Candidate names in the title case, and then I will export this data as sultapur-lodhi-2017.csv

Now sultapur-lodhi-2017.csv file contains data only from Sultanpur Lodhi Constituency.

Wrapping up

So, in this blog, you have learned some basic functions to analyze a huge amount of data. I have just given you a small tour of data analysis in Python. There are tons of things that are uncovered and are there to explore. 

To read more blogs, visit www.webdew.com. If you are looking for web design and web development services, our web team will be thrilled to get you what you want! Contact us to know more. 

Frequently Asked Questions

When dealing with very large data sets in software development, various methods and techniques can be used to manage and process the data efficiently.
Batch processing involves dividing large data sets into smaller batches or chunks for processing. This method is suitable for tasks that can be split into discrete units of work, such as data extraction, transformation, and loading (ETL) processes.
Parallel processing involves dividing the data into smaller segments and processing them concurrently using multiple processors or nodes in a distributed computing environment.
Distributed computing frameworks are designed to handle large data sets across a cluster of computers or nodes.

Generally, the types of big data are: Structured Data, Unstructured Data and Semi-Structured Data.
Structured Data is highly organized and is typically stored in relational databases or tabular formats where data elements are organized into rows and columns.
Unstructured Data lacks a specific structure or format and includes text, images, audio, and video content.
Semi-Structured Data has some level of structure but does not conform to a rigid schema. It is often tagged or labeled, making it easier to parse and query.

There are four primary types of data processing: Batch processing, Real-time processing, Stream processing, Interactive processing.
Batch processing involves collecting, processing, and storing a large volume of data over a period of time.
Real-time processing involves handling data as it is generated or received.
Stream processing focuses on handling continuous streams of data. It involves the real-time analysis and transformation of data as it flows through a system.
Interactive processing involves handling user requests for data in real time or near real time.

The “5 V's” of big data are: Volume, Velocity, Variety, Veracity and Value.
Volume refers to the sheer size or quantity of data. Big data often involves massive volumes of data that exceed the capacity of traditional data storage and processing systems.
Velocity represents the speed at which data is generated, collected, and processed.
Variety refers to the diverse types and formats of data.
Veracity relates to the quality and trustworthiness of data.
Value represents the ultimate goal of big data analysis—to derive meaningful insights and value from the data.