
Basic data analysis techniques every data analyst should know, using Python.


1. Basic filtering

Now that we have loaded the pandas module and created a dataset, we can start with the first technique. When you want to get a subset of your data based on the values in a column, we are talking about filtering data.
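Since the setup code isn't shown here, below is a minimal sketch of such a dataset, assuming columns ID, Value and date. The exact values are made up, but they are consistent with the examples in the rest of this article:

import pandas as pd

# Hypothetical sample data; values chosen to match the examples below
df = pd.DataFrame({
    'ID': ['A1', 'A1', 'B1', 'B1', 'C1', 'C1'],
    'Value': [100, 120, 90, 80, 110, 95],
    'date': pd.to_datetime(['2019-04-05', '2019-04-12', '2019-04-08',
                            '2019-04-14', '2019-04-16', '2019-04-02']),
})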

In pandas we have multiple ways to do that; for now we look at the most common ones:

  1. Using boolean indexing with square brackets []
  2. Using boolean indexing with .loc

So filtering with square brackets works as follows: you pass the condition to the dataframe between the square brackets:

df[condition]
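For example, a sketch of a basic filter, returning all the rows where Value is greater than 100:

# Keep only the rows where the condition evaluates to True
df[df['Value'] > 100]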

And gives us the following output:

Filtering with square brackets

Filtering with .loc looks quite similar:
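For the same condition, a sketch with .loc:

# Same condition, same result, different syntax
df.loc[df['Value'] > 100]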

And as expected, it gives us the same output, since we applied the same filter.

Filtering with .loc

Which one is preferred? For basic filters, as we saw above, there's no difference or preference; it comes down to what you prefer syntax-wise. But when you want to apply more advanced selection of your data, .loc provides for that and can do more complex selecting and slicing. But that's not something to worry about right now.


2. Filtering with conditions

We applied our first filter, which was pretty straightforward. But let's say you want to apply a filter with multiple conditions. How would we do that in pandas? For that, we have to look at Python operators.

2.1 The & operator

For example, you want to filter all the rows where ID is equal to C1 and Value is above 100.

To apply that filter, we have to chain the two conditions with the & operator.
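A sketch of that filter; note that each condition needs its own parentheses, because & binds more strongly than the comparison operators:

# Rows where ID equals C1 AND Value is above 100
df[(df['ID'] == 'C1') & (df['Value'] > 100)]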

And will return the following output:

Filtering with and operator

As expected, we get one row back, since only this row met the conditions we set in our filter.

2.2 The | operator

The | operator in Python stands for or and will return True if one of the conditions is met.

We can show this by applying the following filter: give us all the rows where date is later than 2019-04-10 or Value is greater than 100.

In Python code this would look like the following:
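Sketched with the sample dataframe from above (the date column holds datetime values, so the string is interpreted as a date):

# Rows where date is later than 2019-04-10 OR Value is greater than 100
df[(df['date'] > '2019-04-10') | (df['Value'] > 100)]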

And will return the following output:

Filtering with or operator

As expected, all the rows that are returned have a value greater than 100 or have a date after 2019-04-10.


3. Aggregation

Sometimes there’s the need to aggregate data so you can create certain overviews or to do some calculation. In pandas we use groupby for this.

So what is groupby exactly? If we quote the pandas documentation:

By “group by” we are referring to a process involving one or more of the following steps:
* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.

So basically, it's making groups out of your data based on some indicator, which enables you to perform actions on those groups.

3.1 Groupby #1: get the total sum

Let's look at an example. Say we want to get the total value of each group, based on ID. That would look like the following in Python code:
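Grouping the rows on ID, selecting the Value column and applying sum:

# Total Value per ID; with the hypothetical sample data above this gives
# A1: 220, B1: 170, C1: 205
df.groupby('ID')['Value'].sum()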

Which will give us the following output:

Aggregation with sum

So if we look at our DataFrame again, we can see that this is correct:

Original dataframe

For example, for ID A1 the total value is 100 + 120 = 220, which is correct.

3.2 Groupby #2: get the highest date

Pandas provides a big range of functions you can use on your groups after using groupby. Let's look at one more. For example, we can get the highest date per group by using the .max() function.

That would look like this:
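Sketched on our sample dataframe:

# Highest (most recent) date per ID
df.groupby('ID')['date'].max()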

And would give us the following output:

Aggregation with max date

4. Joins

Joins combine two dataframes in a side-by-side manner based on a common column. Most of the time these columns are referred to as key columns.

The term join originates from the database language SQL, where it was needed because data modelling in SQL databases is mostly done using relational modelling.

There are many types of joins, and your output will depend on which type of join you perform. Because this is an introductory tutorial, we will look at the most common one: the inner join. In later parts of this series we will look at more complex joins.

The inner join is derived from Venn diagrams, where it represents the inner (intersection) part of both sets. So when we translate this to our data, an inner join returns the rows which are present in both dataframes.

4.1 Our datasets

Because we want to combine two dataframes, we will create new data. These two imaginary datasets represent a customers master table and an orders table.

With the following code we create two new dataframes:
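A sketch of that setup; Customer_ID is the shared key column, while the other column names and the values are assumptions:

# dfA: customers master table (hypothetical values)
dfA = pd.DataFrame({
    'Customer_ID': ['C1', 'C2', 'C3'],
    'Name': ['Alice', 'Bob', 'Carol'],
    'City': ['Amsterdam', 'Berlin', 'Copenhagen'],
})

# dfB: orders table (hypothetical values; note that C4 has no match in dfA)
dfB = pd.DataFrame({
    'Order_ID': ['O1', 'O2', 'O3', 'O4'],
    'Customer_ID': ['C1', 'C2', 'C1', 'C4'],
    'Amount': [250, 120, 75, 300],
})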

And they look like the following:

dfA: Customers master
dfB: Orders

So one logical analysis we could do on this new data would be to get the name and city of the customer next to each order in the orders table. This is a typical join problem: matching two dataframes row-wise and enriching the data with more columns. In this case, our key column is Customer_ID.

In pandas we use the merge method for joining. We will pass the following arguments to this method:

  1. Which dataframes you want to join (dfA, dfB).
  2. What are the key columns (Customer_ID).
  3. Which type of join you want to perform (Inner).

There are more arguments we can use in the merge method than the ones listed above, but for now these are sufficient.

The merge we want to perform looks like the following in pandas:
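Sketched with the hypothetical dfA and dfB from above:

# Inner join: keep only orders whose Customer_ID also appears in dfA
pd.merge(dfB, dfA, on='Customer_ID', how='inner')

With the sample data above, the order for customer C4 is dropped, because C4 does not appear in the customers master; that is the inner join at work.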

And the output is as we expected: the name and city columns are added next to each corresponding Customer_ID.

Inner join

So that was it for this part: basic data analysis techniques every data analyst should know, using Python.

You can find the code of this article on my GitHub in the form of a Jupyter Notebook: Link

If this article was useful for you, please consider giving it a like and sharing it with friends and/or colleagues.

For any questions or other discussion, feel free to comment.

Expect part II soon, where we go a bit more advanced.
