Jun 13 2020
Name That Algorithm #4
One project that I have been working on recently is gathering property tax information. While the reason isn’t particularly interesting, what is interesting is my recent optimization. Previously, I had been archiving the raw html files. To get the information in a human readable CSV format, I ran a Node.js parser which utilized cheerio. I was only concerned with a couple of the fields, so I just plucked those out and plopped them in a CSV.
After about 6 months of using this tool, the pain points started to set in. Processing an html file in Node and cheerio ended up taking roughly a half second per file. When you have approximately 40,000 files, this takes a long time – over five hours.
There are two low hanging fruit to attack:
- Removing cheerio (and potentially Node)
- Ditching the raw html files by preprocessing all of this data upfront.
Cheerio claims that it is a blazing fast DOM. However, in my case, I didn’t need an entire DOM. I just want to linearly parse my data. Manipulating and rendering is not needed. I had a feeling that the extra bells and whistles held some overhead.
It is common in coding interviews to have data preprocessing as part of a solution to a whiteboard problem. With this being a common in big-data, I knew this to be a solution to my issue
Let’s name the algorithm for processing this data. Instead of showing an algorithm that requires optimizing, I am going to show some data that needs to be processed.
Consider some tabular data of tax payments on a property. In my case this was an HTML table:
The tool I chose for the job is Pythons built-in
It is fairly easy to use the
handle_data override to extract the above into an array.
I knew that this was going to by snappy since this underlying parser is compiled into bytecode.
Now that Python has processed my data into a flat array, we need to label this information, similar to the table above. More specifically, I want something like:
What algorithm is this?
Scroll down for the answer.
I felt it makes sense to use
Since it is guaranteed that empty cells (
<td></td>) parse as empty string, we don’t need to worry about offset issues.
The result of chunk yields:
Now it’s pretty trivial to get this in the desired object format.
Chop off the 0th index and then access each sub-index per attribute.
You can even use the
invokeMap algorithm if you want to get fancy.