VC Healy's Blog

VC Healy's Blog

Pandas and the webscrape

Work has finally kicked into high gear and it has shown the flaws in the python code I have written. Well, not really flaws but more than the output, a spreadsheet, was needing a bit of manual handling, and although this would have been fine if I was only hitting one server and one process that updated a handful of devices it has turned into another type of animal.

A paginated table is scraped page by page but page one had a blank column at the start and this wasn't present in further pages. I had missed the header=None from the read_table method. That was an easy fix.

Page table headers. Not one row but two rows that appended to the dataframe with each page. Manually sorting a column in excel and then deleting those rows eats up a lot time. After a bit of searching and trying various ways to remove a row if it has specific text, I couldn't believe how many ways, on websites that should know better, that didn't work were astonishing. Simply put, this used twice did work.

df = df[df[0] != '**TEXT HERE **']
Now, I have the data rows that I want. Each data row had a single string, the string had three components separated by a space. All I was interested in was the first component and the following sorted the single column into three.

df = pd.DataFrame(df[0].str.split(' ',2).tolist(), columns = ['ID','Component2', 'Component3'])

Yes, I could have removed the second and third columns, but I wanted to focus on the information I knew was right and didn't want to break something else at this point.

The ID column had multiples of a handful of unique entries and this was the interesting information for me. I needed two things here. With a temporary dataframe, a count of the unique ID, and a list of the unique ID.

This latter needed a bit of work, convert the column into a dict using groupby. Convert the dict into a new dataframe.

To prevent a ValueError

ValueError: If using all scalar values, you must pass an index

an index has to be passed. The simplest way to do this is to use the keys (the Unique IDs that are the keys in the dict and lump the other columns into a second column.

df_dict = {k: v for (k, v) in df.groupby('ID')} df_uniques = pd.DataFrame(df_dict.items(), columns=['ID', 'COMPONENT2']) df = df.append(df_uniques)

The naming of the columns were intentional, as it allows the dataframe append to add to the bottom of the spreadsheet columns already there from the main dataframe for a cleaner look.

and the count Unique ID

# Count the Unique ID no_of_id = df['ID'].nunique() dfcount = pd.DataFrame({'ID':['Number of Unique ISAM'], 'COMPONENT2': [no_of_isam]}) df = df.append(dfcount)

That's it one process done, dataframe created, but the indexing, the first column for each page scraped was all over the place. A simple fix against the dataframe.

df.reset_index(drop=True, inplace=True)
That gets a proper count of the entries in the initial dataframe and the number of unique ID. I already had the code set to add a new sheet per process for the multiple processes on each server, and yes, a multi-sheet spreadsheet. I also had the code written to run over multiple servers.

The time-saving.
The above running over three servers now runs at less than three minutes, but it was over 20 minutes manually handling the previous spreadsheets. This task will be expanded to thirteen servers and manual handling of that magnitude would be unsustainable. Especially as it would have to be run twice a day, during business hours.

Is this better? Yes.

Could I do better?
Of course, this is just the start of my data science trek.
Those additional dataframes for the list of unique ID and their count. I want to all the processes on the server generating these extra dataframes all onto one sheet of a spreadsheet, they will need labeling though to ensure that the ID are associated to the correct process.

That is for another day.

 
Share this