What we talk about when we talk about Data Manipulation

Part 1

“I manipulate data for a living.” This sentence always ends in a sudden silence from both parties: the one who says it and the one who absorbs it. In this series of posts about Data Manipulation with Python, I will share some basic but useful functions for “herding”, channeling or, in short, manipulating data the way we please.

In Python, there is usually more than one package for almost everything and, as a result, it can be a little confusing to decide which package to use. But when we talk about Data Manipulation, everybody points (respectfully) to the famous Pandas package, as it is the combination of all the good and pretty things in the world 🙂

In this post I will talk about how to combine Lambda with a list, integrate it in a dataset, and pack/unpack it in a dataset as needed.

Imagine that we have a dataset (namely “df”) as shown below:

Fard-48651 | Boom hinge shaft cover

We may need to tidy up the column Description and get rid of the unnecessary words.

First of all, we need to homogenize our dataset, as datasets are usually the result of integrating several sources and can come in different formats. Because string comparisons in Python are sensitive to case and whitespace, it is good practice to make the data as uniform as possible before starting a project. To do so, we get rid of unnecessary spaces inside the string and at both ends of the string, and also lower-case the whole string:


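df = df.dropna()  # drop rows with missing values (x.split() would fail on NaN)
for c in df:
    # collapse runs of whitespace, trim both ends and lower-case; assumes every column holds strings
    df[c] = df[c].apply(lambda x: ' '.join(x.split()).strip().lower())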


When we apply a Lambda function to a whole dataset, it looks at the data column-wise or row-wise, and x stands for each column/row in the dataset. But when we apply Lambda to a single column (as we did here), it treats the data cell-wise, and x stands for each cell of the input column.
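
To make the difference concrete, here is a small sketch. The frame "demo" and its column names are made up for illustration; the only point is what the Lambda receives in each case.

```python
import pandas as pd

# Hypothetical two-column frame, used only to illustrate apply()
demo = pd.DataFrame({"a": ["x", "yy"], "b": ["zzz", "w"]})

# DataFrame.apply (default axis=0): the lambda receives a whole column
per_column = demo.apply(lambda col: type(col).__name__)

# DataFrame.apply with axis=1: the lambda receives a whole row
per_row = demo.apply(lambda row: type(row).__name__, axis=1)

# Series.apply: the lambda receives one cell at a time
per_cell = demo["a"].apply(lambda cell: type(cell).__name__)

print(per_column.tolist())  # ['Series', 'Series']
print(per_row.tolist())     # ['Series', 'Series']
print(per_cell.tolist())    # ['str', 'str']
```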

Now that the dataset is in a uniform format, we can proceed with extracting info from the Description column by using a predefined set (namely “KeyWords”). I will use Lambda again to look for keywords inside the column “Description”:


df['Found_Keywords'] = df['Description'].apply(lambda x: [w for w in KeyWords if w in x])


By using a list as my container of choice, I have filtered out the undesired words from the Description column and have stored only the desired values in the column Found_Keywords. The output would be something like this:

Fard-48651 | boom hinge shaft cover | [shaft, hinge, cover]
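
For anyone who wants to try this step end to end, here is a minimal sketch. The contents of KeyWords and the column name "Part" are my assumptions for illustration; only "Description" and "Found_Keywords" come from the example above.

```python
import pandas as pd

# Hypothetical sample row; the column name "Part" is an assumption
df = pd.DataFrame({
    "Part": ["fard-48651"],
    "Description": ["boom hinge shaft cover"],
})

# Hypothetical keyword set
KeyWords = {"shaft", "hinge", "cover", "door", "key"}

# Keep only the keywords that occur in each description
df["Found_Keywords"] = df["Description"].apply(
    lambda x: [w for w in KeyWords if w in x]
)

# A set has no guaranteed iteration order, so sort before printing
print(sorted(df.loc[0, "Found_Keywords"]))  # ['cover', 'hinge', 'shaft']
```

Keep in mind that "w in x" is a substring test on the description, so a keyword like "key" would also match "keyboard". Testing against x.split() instead restricts the match to whole words.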

Now that we have a list of “clean” values in the Found_Keywords column, we can play around with the list based on our requirements. We may need to:

Expand the list into separate columns:


import pandas

# Expand each list into its own columns (0, 1, 2, ...) and attach them to df by index;
# merging from df keeps the original columns first
Expanded_df = df.merge(df['Found_Keywords'].apply(pandas.Series), left_index=True, right_index=True)


The result would be as below:

Fard-48651 | boom hinge shaft cover | [shaft, hinge, cover] | shaft | hinge | cover
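
Applying pandas.Series to every row can get slow on larger datasets. As an alternative sketch (a different technique from the one above, on made-up sample data), the new columns can be built directly from the lists. The "Keyword_" prefix is my own naming choice.

```python
import pandas as pd

# Hypothetical sample data with lists of different lengths
df = pd.DataFrame({
    "Description": ["boom hinge shaft cover", "shaft door"],
    "Found_Keywords": [["shaft", "hinge", "cover"], ["shaft", "door"]],
})

# Build one column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(df["Found_Keywords"].tolist(), index=df.index)
wide = wide.add_prefix("Keyword_")

# Attach the new columns to the original frame by index
Expanded_df = df.join(wide)

print(Expanded_df.columns.tolist())
# ['Description', 'Found_Keywords', 'Keyword_0', 'Keyword_1', 'Keyword_2']
```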

Expand the list into separate rows:


Expanded_df = df.explode('Found_Keywords').reset_index(drop=True)


With the output as below:

Fard-48651 | boom hinge shaft cover | shaft
Fard-48651 | boom hinge shaft cover | hinge
Fard-48651 | boom hinge shaft cover | cover
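
One detail worth knowing: a row whose list is empty is not dropped by explode, it turns into a single row with a missing value. The sketch below shows this on made-up data (explode needs pandas 0.25 or later); the variable names are mine.

```python
import pandas as pd

# Hypothetical sample data: the second description matched no keyword
df = pd.DataFrame({
    "Description": ["boom hinge shaft cover", "plain bracket"],
    "Found_Keywords": [["shaft", "hinge", "cover"], []],
})

# One row per keyword; the empty list becomes one row holding NaN
exploded = df.explode("Found_Keywords").reset_index(drop=True)
print(len(exploded))  # 4

# Drop the rows that carry no keyword, if they are not wanted
with_keywords = exploded.dropna(subset=["Found_Keywords"])
print(with_keywords["Found_Keywords"].tolist())  # ['shaft', 'hinge', 'cover']
```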

Gather up all the values in all the lists as a single set:

So, if the Found_Keywords column has values such as [shaft, hinge, cover] and [shaft, door, key] and we need to combine them into a single set, we can again use the ‘explode’ function:
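
One way to write this, sketched on a hypothetical two-row dataset (the name "all_keywords" is my own, and the dropna call is an extra guard against empty lists):

```python
import pandas as pd

# Hypothetical sample data with the two lists mentioned above
df = pd.DataFrame({
    "Found_Keywords": [["shaft", "hinge", "cover"], ["shaft", "door", "key"]],
})

# Flatten all lists into one long column, then collapse duplicates with set()
all_keywords = set(df["Found_Keywords"].explode().dropna())

# Sorted only to get a stable printout; the set itself is unordered
print(sorted(all_keywords))  # ['cover', 'door', 'hinge', 'key', 'shaft']
```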



And we will end up with a set such as: {‘shaft’, ‘hinge’, ‘cover’, ‘door’, ‘key’}


In this post we talked about combining Lambda with lists in the context of Pandas data frames, and how useful this combination can be for shaping data and extracting info.

Lambdas are pretty flexible and concise. With Lambda, we can define a custom function, apply it to each column/row or to each cell in the dataset, and shape data as we need.

Lists (as a user-friendly container) come in pretty handy when we want to loop through data, extract/store the desired values and keep them together.