In my daily job as a data analyst, I see all kinds of data and all kinds of analysis requests from clients. What I've noticed is that certain basic techniques are needed in most projects, regardless of the type of project you are working on. I'm convinced every data analyst/scientist should have a good understanding of these techniques. So the goal of this article is to take you through them and explain them at a basic level.
These are the topics we will go through and discuss:
Basic filtering
Filtering with multiple conditions
Groupby
Joins
For our analysis, we will make use of the pandas library in Python. If you haven't installed this library yet, use one of the following commands to install pandas:
```shell
# If you use Anaconda, type the following in the Anaconda prompt
conda install -c anaconda pandas

# If you use pip, type the following in the command prompt
pip install pandas
```
Furthermore, I assume you already have a basic knowledge of Python and the pandas library. But no worries if you haven't touched either of them; we will go through everything from the bottom up.
To be able to go through the mentioned techniques, we need data. We could import a csv file or an excel file, but for now we'll keep it simple and just create a small dataset with pandas.
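The original snippet is not shown here, so below is a minimal sketch of how such a dataset could be created; the exact IDs, values and dates are assumptions chosen for illustration:

```python
import pandas as pd

# A small example dataset with IDs, values and dates
# (the exact rows are an assumption for illustration)
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "B1", "C1", "C1"],
    "Value": [100, 120, 90, 80, 140, 70],
    "date": pd.to_datetime([
        "2019-04-05", "2019-04-12", "2019-04-08",
        "2019-04-15", "2019-04-11", "2019-04-09",
    ]),
})
print(df)
```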
As can be seen above, it contains IDs, values and dates.
1. Basic filtering
Now that we've loaded the pandas module and created a dataset, we can start with the first technique. When you want to get a subset of your data based on the values in a column, we are talking about filtering data.
In pandas we have multiple ways to do that; for now we'll look at the most common ones.
The logic behind filtering in pandas is that you pass a condition to the dataframe between square brackets:
And gives us the following output:
Filtering with .loc looks quite similar:
And as expected, it gives us the same output, since we applied the same filter.
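As a sketch with an assumed small dataframe, both ways of filtering could look like this:

```python
import pandas as pd

# Assumed example data for illustration
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "C1"],
    "Value": [100, 120, 90, 140],
})

# Square brackets: pass the boolean condition directly to the dataframe
subset_brackets = df[df["ID"] == "A1"]

# .loc: the same condition, the same result
subset_loc = df.loc[df["ID"] == "A1"]

print(subset_brackets.equals(subset_loc))  # True
```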
Which one should you prefer? For basic filters like the ones above, there's no real difference; it comes down to which syntax you prefer. But when you want to do more advanced selecting of your data, .loc provides for that and can do more complex selecting and slicing. That's not something to worry about right now, though.
2. Filtering with conditions
We applied our first filter, which was pretty straightforward. But let's say you want to apply a filter with multiple conditions. How would we do that in pandas? For that, we have to look at Python's operators.
2.1 The & operator
For example, you want to filter all the rows where ID is equal to C1 and Value is above 100.
To apply that filter, we have to chain the two conditions with the & operator. That would look like the following:
And will return the following output:
As expected, we get one row back, since only this row met the conditions we set in our filter.
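With assumed data, the chained filter could be sketched like this. Note that each condition needs its own parentheses, because & binds more tightly than the comparison operators:

```python
import pandas as pd

# Assumed example data for illustration
df = pd.DataFrame({
    "ID": ["A1", "C1", "C1"],
    "Value": [120, 140, 70],
})

# Chain the two conditions with &; wrap each one in parentheses
filtered = df[(df["ID"] == "C1") & (df["Value"] > 100)]
print(filtered)  # only the C1 row with Value 140 remains
```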
2.2 The | operator
The | operator in Python stands for or and will return True if one of the conditions is met.
We can show this by applying the following filter: give us all the rows where date is later than 2019-04-10 or Value is greater than 100.
In Python code this would look like the following:
And will return the following output:
As expected, all the rows that are returned have a Value greater than 100 or a date after 2019-04-10.
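With assumed dates and values, the or filter could be sketched like this:

```python
import pandas as pd

# Assumed example data for illustration
df = pd.DataFrame({
    "Value": [100, 120, 90, 140],
    "date": pd.to_datetime(["2019-04-05", "2019-04-12",
                            "2019-04-08", "2019-04-11"]),
})

# Keep every row where at least one of the two conditions is met
filtered = df[(df["date"] > "2019-04-10") | (df["Value"] > 100)]
print(filtered)
```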
3. Groupby
Sometimes there's a need to aggregate data so you can create certain overviews or do some calculations. In pandas we use groupby for this.
By "group by" we are referring to a process involving one or more of the following steps:
* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.
So basically it’s making groups out of your data based on some indicator, to enable yourself to do some actions on these groups.
3.1 Groupby #1: get the total sum
Let's look at an example. Say we want to get the total Value of each group, based on ID. That would look like the following in Python code:
Which will give us the following output:
So if we look at our DataFrame again, we can see that this is correct. For ID A1, for example, the total value is 100 + 120 = 220.
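As a sketch with assumed data, the sum per group could look like this:

```python
import pandas as pd

# Assumed example data for illustration
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "B1"],
    "Value": [100, 120, 90, 80],
})

# Split the rows into groups by ID, then sum the Value of each group
totals = df.groupby("ID")["Value"].sum()
print(totals)  # A1 -> 220, B1 -> 170
```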
3.2 Groupby #2: get the highest date
Pandas provides a wide range of functions you can use on your groups after using groupby. Let's look at one more. For example, we can get the highest date per group by using the .max() function.
That would look like this:
And would give us the following output:
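With assumed dates, getting the latest date per group could be sketched as:

```python
import pandas as pd

# Assumed example data for illustration
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "B1"],
    "date": pd.to_datetime(["2019-04-05", "2019-04-12",
                            "2019-04-08", "2019-04-15"]),
})

# .max() on the grouped date column returns the latest date per ID
latest = df.groupby("ID")["date"].max()
print(latest)
```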
4. Joins
Joins combine two dataframes side by side based on a common column. Most of the time, these columns are referred to as key columns.
The term join originates from the database language SQL; it was needed because data modelling in SQL databases is mostly done with relational modelling.
There are many types of joins, and your output will depend on which type of join you perform. Because this is an introductory tutorial, we will look at the most common one: the inner join. In later parts of this series we will look at more complex joins.
The inner join takes its name from Venn diagrams, where it represents the inner (intersection) part of both sets. Translated to our data, an inner join returns the rows that are present in both dataframes.
4.1 Our datasets
Because we want to combine two dataframes, we will create new data. These two imaginary datasets represent a customer master table and an orders table.
With the following code we create two new dataframes:
And they look like the following:
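Since the original snippet is not shown, here is a sketch of two such dataframes; the names, cities and order numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical customer master table
dfA = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Carol"],
    "City": ["Amsterdam", "Berlin", "Copenhagen"],
})

# Hypothetical orders table
dfB = pd.DataFrame({
    "Order_ID": [101, 102, 103],
    "Customer_ID": [1, 1, 3],
})
```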
So one logical analysis we could do on this new data would be to get the name and city of each customer next to each order in the orders table. This is a typical join problem: matching two dataframes row-wise and enriching the data with more columns. In this case, our key column is Customer_ID.
In pandas we use the merge method for joining. We will pass the following arguments to this method:
* Which dataframes we want to join (dfA, dfB).
* What the key columns are (Customer_ID).
* Which type of join we want to perform (inner).
There are more arguments we can use in the merge method than the ones listed above, but for now these are sufficient.
The merge we want to perform looks like the following in pandas:
And the output is as we expected: the name and city columns are added next to each corresponding Customer_ID.
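Putting it together with assumed data, the inner join could be sketched like this:

```python
import pandas as pd

# Hypothetical customer master table (dfA) and orders table (dfB)
dfA = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Carol"],
    "City": ["Amsterdam", "Berlin", "Copenhagen"],
})
dfB = pd.DataFrame({
    "Order_ID": [101, 102, 103],
    "Customer_ID": [1, 1, 3],
})

# Inner join on the key column: only Customer_IDs present in both survive
merged = dfB.merge(dfA, on="Customer_ID", how="inner")
print(merged)
```

Customer 2 has no orders, so Bob does not appear in the result; that is exactly the intersection behaviour of the inner join.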
So that was it for this part: basic data analysis techniques every data analyst should know, using Python.
You can find the code for this article on my GitHub in the form of a Jupyter Notebook: Link
If this article was useful to you, please consider giving it a like and sharing it with friends and/or colleagues.
For any questions or other discussion, feel free to comment.
Expect part II soon, where we go a bit more advanced.