Fun with Python: Doing Real Work With Excel Spreadsheets

Rick Lowrey
7 min readAug 15, 2020
Vineyard View, Gaston OR; Photo by author, 2017

I recently had a request from my lovely ex-engineer-now-artist girlfriend (yes, the same girlfriend) that she wanted to update some Cost-Of-Goods-Sold values in an Excel spreadsheet based on the value of another column in the same row.

Now, before you say “Why not just run a macro?”, let me explain. No, let me sum up. The spreadsheet was from an outside vendor source, and had to be re-downloaded periodically from the same source. There were about 5000 rows of data. The reference values were based on a limited number of items (about 50 different possibilities) each of which was associated with a specific price, which could change. And, a replacement file had to be re-uploaded to the same place once her changes were made. So, having a separate file to maintain her reference prices would be easier and cleaner in the long run.

As I had mentioned in my last article “Some Basics And Some Real Work Using Google App Script”, I had just finished a Python course that had the usual read-a-csv-write-json simplistic scenarios. Feeling very confident in my ability to figure out anything that came my way, I hitched up my britches and pitched in.

Of course the first thing to decide was, how was I going to read the data? I asked her to save the Excel file to a flat CSV. That way I could use my newly-acquired Python skills to slurp the data in, twiddle the column, then write out a new CSV all ready for her to import into Excel. Seems pretty easy, right?

The first roadblock was when I tried to read the CSV file she had exported. Python complained that it found characters that weren’t UTF-8 encoded. After a little sniffing around on the web, I found a reference to a handy library called “chardet” to help detect what character set this thing was in. However, there were very few examples that matched my requirement, so I delved into the documentation. I spent a goodly amount of time playing with it, but the best I could do was to get a low-confidence response that the file was in Thai. Let’s just say that I went down the code page/charset rabbit hole a lot further than I intended (or should have…but it was so interesting!). There had to be an easier way.

Next, I researched into whether there were any libraries that were specifically designed to read Excel files. It was no surprise that there were several. The first I tried was “xlrd”, and it worked great. Very simple to use, I was able to treat the incoming rows as a list with attributes, it looked promising. However, I had an issue where one of the columns I was interested in had come out as a formula rather than the actual value. Not a winner. Additionally, I could not update the data. I’d have to use a separate library called “xlwt” to create an entirely new object and populate THAT so I could write out the file. Again, there had to be an easier way.

I finally wound up using a library called “openpyxl”. It was possible to set an option when I opened the file to load the workbook, access the sheet and force the formulas to evaluate and supply the actual end value.

Huzzah! It looked like a viable approach.

Now, after I had successfully proven to myself that I could read the spreadsheet file, I concentrated on how I was going to provide the “decode” values. As in, I had to use the value in one spreadsheet column, do a lookup, and provide a value to put into a different spreadsheet column. Sounds like it’s tailor-made for JSON.

Fortunately, JSON is pretty simple to read and load into a dictionary object.

Here’s a subset of the entire JSON file:

Once you have both of those objects, comparing them is a pretty simple iteration through the worksheet object.

In this case, I wanted to determine whether or not there was a match between the value in a specific column, and the JSON values I had read into a dictionary from the other file. Python has a handy “in” keyword for use with dictionaries to see if a key exists:

So, I determined I could read a spreadsheet, read a JSON file with lookup values, compare the values and update my rows in memory from the lookup. Great!

From a workflow standpoint, having hard-coded file names wasn’t really the best option. Keeping in mind that the lookup JSON data was supposed to be “cost of goods sold” for a set of products, it was entirely likely that the values would change over time, so I wanted to be able to have an archive. And, the spreadsheet that was supplied would likely have a different name each time it was provided. So, I wanted to be able to supply command-line arguments for both of these. It turns out there’s a wonderful Python library called “argparse” that gives you a ton of flexibility.

Creating a parser is pretty simple.

The “description” parameter supplies some text that will display if you run the script with a “-h” argument. More on that later.

Once you have the parser object, you can pretty much add as many arguments as you like. You have the option of setting them as required, whether there’s a default value, and a host of others (here’s a link to the documentation).

The first argument I wanted was to specify where to find the prices file. The statement looked like this:

The first two parameters are what can be used on the command line as either a short version or a long version. Some people like to use the more verbose one for readability and self-documentation. The “help” parameter supplies some text that will display if you run the script with a “-h” argument. The “nargs=” parameter indicates that there could be zero or one value supplied. The “default” parameter supplies the value to be used if the parameter/value is not supplied on the command line.

Next, I wanted to give the name of the input spreadsheet. This one would have to be supplied, not optional and no default value would be accepted.

Finally, I wanted to be able to test this without necessarily generating an output spreadsheet, so I needed a “flag” style parameter.

The “action” parameter in the above tells the parser to basically hold a boolean value for later use.

But how to access these values? First, you create an object variable to hold them.

Once you have this object, you can reference the various values using “dot” notation. For example, here’s how we open the JSON file:

Notice that the reference after the “dot” is the same as the verbose argument name from above. Depending on what kind of value you supply, you can use these as numbers, as strings, or as booleans:

So, once I determine that this is not a test, I can save the on-memory workbook with a new name that I assemble from today’s date and time:

Now, since I want to be able to run this script and get an idea of how many prices are actually being updated, I put in some counters and a print statement:

This is the end of the script.

However, from a workflow standpoint, there are a couple of other things you may want to do.

When you execute this from the Linux command line with the — help argument, you get this:

rick@mylinuxmint: ~$python3 setCOGS.py --help
usage: setCOGS2.py [-h] [-p [PRICESFILE]] -i INPUTFILE [-t]
Add COGS to a products spreadsheet from a prices file.
optional arguments:
-h, --help show this help message and exit
-p [PRICESFILE], --pricesfile [PRICESFILE]
Prices file. Must be valid json.
-i INPUTFILE, --inputfile INPUTFILE
Product Prices file. Must be valid .xlsx file.
-t, --test Allows you to run without generating a new file.

If you exclude any parameters you get this:

rick@mylinuxmint: ~$python3 setCOGS.py
usage: setCOGS.py [-h] [-p [PRICESFILE]] -i INPUTFILE [-t]
setCOGS.py: error: the following arguments are required: -i/ - inputfile

This is expected, since we said the — inputfile parameter is required above.

If you specify all the arguments, it looks like this:

rick@mylinuxmint: ~$python3 setCOGS.py -p cogsprices.json -i cogs-8–12–2020.xlsx -t
Starting at: 08/14/2020 19:11:39
Ended at: 08/14/2020 19:11:39
File NOT saved
Changed 1148 out of 4639 rows.

So it gives us when we started, when it ended, that the file was NOT saved since this is a test, and how many rows would have been changed.

Finally, when we do this for real:

rick@mylinuxmint: ~$python3 setCOGS.py -p cogsprices.json -i cogs-8–12–2020.xlsx
Starting at: 08/14/2020 19:13:26
Ended at: 08/14/2020 19:13:27
File saved as new_file_20200814191327.xlsx
Changed 1148 out of 4639 rows.

And, voila. I have a file I can send back to my girlfriend for re-uploading, and I’ve saved her a bunch of work.

Conclusion

Python is great for manipulating data, and the plethora of libraries means you probably won’t have to “re-invent the wheel” when you need something specific done. Our particular script accepts parameters, it is documented, it runs fast, and it’s simple to use. This is a classic scenario of data wrangling that turns into something we could consider “production ready”. Here’s the complete script…only 30 lines, including print statements!

Coda:

Later that evening, my lovely ex-engineer-now-artist girlfriend said “Oh, no! Oh dear! I have some price changes already, and some new products to include!”

“Not to worry, my dear…all I have to do is include them in the cogsprices.json file and run it again!”

<add rows to the JSON file, re-run the script>

“My hero!”

Okay, so maybe it didn’t really quite happen like that, but you get the idea. We’ve abstracted the things we KNOW will be dynamically changing so our script doesn’t have to change. Plus, we’ve included some handy help to give us a hint about how to run this script later on (after we’ve forgotten how to run it). And, in a pinch, we could even supply this script to someone else so they could do it by themselves and we wouldn’t ever have to touch it again.

Thanks for reading!

--

--