How to Reduce Electric Utility Outages

Wednesday, February 16th, 2011 Mike Ficco

A couple of weeks ago we had a heavy, wet snow that took down many trees and in the process interrupted power for a significant number of people.  By some estimates nearly 300,000 households, businesses, and schools in the metropolitan Washington D.C. area lost power for the better part of a day.  At the time, there were a number of news reports comparing the power situation to that of a third-world country.  In addition, there were many disturbing reports of inadequately staffed call centers and, therefore, difficulty in reporting dangerous situations like live power lines on the ground.

Background

The power company responsible for the majority of the outages was already under the microscope.  They had a several-year history of reliability problems, and just last year a major snowstorm caused long outages for many of their customers.  When extended outages occurred again, there was a not-too-surprising customer outcry, followed quickly by a great deal of posturing by both the power company and politicians.

… but that was weeks ago …

The reporting and talking continues.  Hearings are being held.  Excuses and promises are being made and made again. Enough.

Engineering Observations

Time to make some “engineering” observations.  I experienced first-hand both last year’s and this year’s extended power outages.  I was extremely irritated BOTH times when the power company blamed the outage on an “act of God”.  I disagree with that assessment.  Getting hit by a meteor – now that’s an act of God.  Getting snow in the winter just doesn’t seem to qualify.  Furthermore, if you are genuinely surprised to get snow and ice during a Washington D.C. winter, you probably shouldn’t be allowed to be in charge of anything important.

Here is some further insight into this “act of God”:  both last year and this year, ALL my other utilities continued working after the power failed.  That would be my gas, water, sewage, telephone, and cable TV.  That is, my 5 other utilities performed significantly better than my power company.  So much for an act of God…

The good news for the most recent extended outage was that it was cold outside.  Since we don’t have any bears or coyotes in the area, we could save some of the contents of our refrigerator by putting them outside.  The bad news was that it was cold outside and we had no heat.  To be clear, my furnace and water heater are gas powered.  We had gas and we had water.  We had hot water.  We had no heat, because the furnace blower required the conspicuously absent electricity to push warm air through the house.  Fortunately, I had a fireplace and firewood.  We slept on the floor in front of the fireplace.  The furnace thermostat hit 52 degrees, and on the second night without power I considered letting the faucets drip to prevent the pipes from freezing.

One (at least one) of the local radio stations periodically aired listener comments on the situation.  Most were outraged, but there was the occasional defender of the power company.  EVERYONE appreciated the efforts of the front line workers who risked their lives in the cold weather and dangerous conditions.  However, in my opinion, any comment that excused the power company from blame was very wrong to the point of being irresponsible.  Some comments went so far as to call people soft or whiners and told them to get tough or buy a generator.  Again, no one wants to diminish the contributions of the maintenance staff who worked so diligently to get power back on.  BUT it is unacceptable to do or say anything that excuses the executives who, for the last 20 or 30 years, “saved” money and presumably reaped large bonuses by minimizing regular maintenance and infrastructure improvements.

Calling those who complain about having no power “whiners” should have no place in any legitimate conversation.  It deflects accountability from the inability of the power company to fulfill the social contract at the basis of their very existence.  This is a public utility.  As such, the public cannot “vote with their feet” and select another company to deliver power to their home.  In return for being awarded this monopoly, the power company is expected to reliably deliver power – not blame God or repeatedly promise to get better.

I’ve worked for several companies that cut corners or skimped on design, implementation, or testing time in order to save money.  Of course the intention was not so much saving money as boosting profits.  My experience has been that these attempts usually backfire because customers don’t buy the junk that results from such corner cutting.  Backfire for the company, that is.  Many of my previous employers no longer exist.  Along the way, however, some executives got rich from bonus plans that were structured poorly from the company’s point of view.  Some also profited handsomely by selling hyped corporate stock before reality came crashing down.

Retail companies with competitors are very different from public utilities.  Those who manage public utilities have a social responsibility to provide quality service to their customers.  The politicians who oversee the utilities, therefore, have a responsibility to their constituents to ensure bonus plans, maintenance schedules, infrastructure improvements, etc. are in line with the end goal of providing quality service.  If a public utility does not provide quality service, whether through the ineptitude or the malicious intent of its managers, those managers should be banned for life from involvement with any public utility.

But enough about managing a company to provide reliable service…

How to Reduce Electric Utility Outages

At some point, customers of even the best managed power company will lose service.  In my situation, both last year and this year, the power company was not able to – or was unwilling to – give us any realistic idea when power would come back on.  This year they made a blanket statement of “11:00 Saturday night”.  In my book (What Every Engineer Should Know About Career Management) I call this “big bang scheduling” and I assert that big bang scheduling is unacceptable.  A schedule with specific milestones is vastly superior because you can easily tell when implementers start to fall behind schedule or when events take an unexpected turn.  Either the power company was completely inept or they wanted to hide from the public when they were going off plan or off schedule.

During an outage, somewhere inside the gigantic power company people make decisions about what neighborhoods get worked on in what order.  If the power company doesn’t have a map of their power grid with known failures highlighted, they certainly should.  How could they hope to work in an efficient fashion without such an annotated grid?  I propose that all power companies… NAY!  All public utilities… be required to host an outage web site.  This should be a graphic presentation of the very same data used by the utility to schedule work.
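
To make the idea concrete, here is a minimal sketch of what one record behind such a map might look like.  Everything below – the names, the fields, the triage rule – is my own illustration in C, not any utility’s actual data model.

    /* A minimal sketch of one record behind the proposed outage map.
       Every name and field is illustrative. */
    #include <stdlib.h>
    #include <time.h>

    typedef enum {
        FAULT_REPORTED, CREW_ASSIGNED, CREW_ON_SITE, RESTORED
    } outage_status;

    typedef struct {
        unsigned      feeder_id;           /* which circuit failed           */
        double        latitude, longitude; /* where the fault is             */
        unsigned      customers_out;       /* households currently dark      */
        outage_status status;
        time_t        reported_at;         /* when the fault was logged      */
        time_t        estimated_restore;   /* a real milestone, not one      */
                                           /* blanket "11:00 Saturday night" */
    } outage_record;

    /* One plausible triage rule: largest outages first.  Sorting the
       work queue this way is exactly the decision the public could
       then see and critique. */
    static int by_customers_out(const void *a, const void *b) {
        const outage_record *x = a, *y = b;
        if (y->customers_out > x->customers_out) return 1;
        if (y->customers_out < x->customers_out) return -1;
        return 0;
    }

    /* Usage: qsort(records, n, sizeof *records, by_customers_out); */

The point is that if a work queue like this exists at all, putting it on a public map is a presentation problem, not a data problem.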

I hear the power company now!  We can’t do that.  It’s too much work, etc.  Malarkey!  They BETTER be privately doing something like this already if they hope to repair damage in anything remotely resembling an efficient sequence of activity.  Graphically presenting this information to the public would accomplish two incredibly important objectives:

  • Homeowners could track the progress of work and see where their neighborhood was in the sequence of repairs.  They could then make educated decisions about emptying the refrigerator or checking into a hotel.
  • The public could critique the power company’s allocation of resources and staging of repairs.  The utility would probably dread such critique, but only because they are inept.  If done well, a public display of effective and efficient repairs could provide a wonderful boost of confidence that utility payments are being well spent.

My power company, unfortunately, reminds me of some of my former employers.  Unlike with those, however, I can’t just leave in hopes of finding a better situation.  I’m stuck with these guys until the public utility commission wakes from its coma, learns enough about engineering to properly oversee the power company, and properly aligns the thinking of the utility executives.  To help this process, in a future blog I will provide a short engineering specification of the outage web site concept presented above.

Done!   See How Utility Outages SHOULD Be Handled

Cut And Paste Engineering

Thursday, September 9th, 2010 Mike Ficco

Several years ago I was involved in a project that expected to have a large production volume.  The development group was working with a few prototypes but the manufacturing team was not yet fully engaged.  Part of my work required a unique device serial number for security and other purposes.  Unfortunately, our prototypes had no serial numbers since they were not produced by the normal manufacturing process.  I needed a serial number so I came up with a relatively simple solution.  On power up I would read the area of non-volatile memory that was intended to hold the serial number and other information.  If the information passed a validity check I would write the serial number into another special area of non-volatile memory.  If the validity check failed I would instead write a fictitious serial number[1].  All other code made use of this “special memory” that I created and managed.  My immediate development problem was solved and all the code would automatically start using real serial numbers as soon as the equipment was being made on a production line.  Problem solved!
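
For the curious, the power-up logic was roughly the following.  This is a from-memory sketch, not the shipping code: the region numbers, the nv_read/nv_write accessors, the record layout, and the checksum are all stand-ins.

    /* Sketch of the power-up serial number initialization described
       above.  All names and layouts are stand-ins for illustration. */
    #include <stddef.h>
    #include <stdint.h>

    #define REGION_FACTORY  0   /* written by the production line   */
    #define REGION_RUNTIME  1   /* the privately managed copy       */

    typedef struct {
        char     serial[16];    /* serial written at manufacture    */
        uint32_t checksum;      /* validity check over the serial   */
    } factory_record;

    /* Hypothetical non-volatile memory accessors. */
    extern int nv_read(unsigned region, void *buf, size_t len);
    extern int nv_write(unsigned region, const void *buf, size_t len);

    /* A fictitious serial from a non-existent shift and line [1]. */
    #define FAKE_SERIAL "S99L99-000000"

    static uint32_t checksum32(const void *p, size_t n) {
        const uint8_t *b = p;
        uint32_t sum = 0;
        while (n--) sum = (sum << 1) + *b++;   /* toy placeholder */
        return sum;
    }

    /* On power-up: copy the real serial if the factory record is
       valid, otherwise fall back to a recognizably fake one. */
    void init_serial_number(void) {
        factory_record rec;
        nv_read(REGION_FACTORY, &rec, sizeof rec);

        if (checksum32(rec.serial, sizeof rec.serial) == rec.checksum) {
            nv_write(REGION_RUNTIME, rec.serial, sizeof rec.serial);
        } else {
            char fake[sizeof rec.serial] = FAKE_SERIAL;  /* zero-padded */
            nv_write(REGION_RUNTIME, fake, sizeof fake);
        }
    }

The detail that let the rest of the system stay oblivious is in the last few lines: every other module reads only REGION_RUNTIME and never needs to know whether manufacturing has caught up.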

The product shipped and actually did get produced in high volume.

Fast-forward about eight years.

A coworker came into my office and asked if I could take a look at some old code.  We walked to his office, where he brought up a page of code I had not seen for many years.  It was my power-up serial number initialization function.  Well, more accurately, it was a descendant of my code.  After many years and several million devices, the code was still present.  The operating system had changed at least three times, and several people – perhaps more than a dozen – had their fingers in the surrounding code.  The details had changed, and the content of the structure that was validity checked had changed.  Even the method of doing the validity check was different.  Yet there was my fake serial number and my privately managed memory.

The developer said he didn’t understand what the code accomplished or why it was needed.  He had asked the person who previously worked on the code, and that person didn’t know either.  Some of the original project folks had left the company.  I had gone off to new products and problems.  Others had seen the code but did not know the rationale behind it.  Eventually my name came up as he continued to ask questions, so he thought he would come talk to me.

I quickly explained what the code was for and that it was no longer needed.  I also took the opportunity to congratulate him for being conscientious in wanting to get the code right and bold enough to ask about the code when so many who came before him had not.

This is a true story, and you may or may not have enjoyed it.  The problem is that such stories of tracking down the meaning of mysterious code are far too rare.  More often, code proliferates and becomes progressively more convoluted as programmers are afraid to touch or delete what they don’t understand.

One very popular coding technique is to copy an existing piece of code that solves a problem similar to the one on which you are working.  Over time a large code base becomes fabricated from bits and pieces of old code.  This is outstanding in that it is something like code reuse.  It is beyond horrible in that such reuse is occasionally perverted into a bloated and unreliable mess.  It seems to be a basic instinct of most programmers to let poorly understood code remain.  I have seen developers and managers too fearful – and I truly mean fearful – to remove bizarre code because it might be doing something worthwhile.

Last year I worked on a one-chip-wonder microcontroller.  I inherited over 90K of buggy code that needed additional features.  Four months later I had 5K of reliable code that had all the needed features.  Not all of my projects result in an 18-fold code reduction, but this basic scenario has played itself out over and over.  A great deal of my work finding and fixing bugs on legacy products has involved removing large amounts of code.

Let me leak out a well-kept secret:  If you want your code to be reliable, you have to understand what it does.

You are not a very good developer – at least not one confident in your ability – if you are afraid to touch some mysterious code for fear of breaking it.  Poke it!  Tweak it!  Test it!  Figure out what it does and determine if it does that correctly.  You or your manager may worry that this is wasted time, but my personal experience has been that mysterious and bloated code is often the cause of problems and takes forever to debug.

Well-understood code is not only shipped faster, but you can also ship it with pride.


[1] Our serial numbers were to be based on the production shift, production line, and the date and time of production.  I worked with the manufacturing group to create a serial number based on a non-existent production shift and line, guaranteeing my fictitious serial number would never be mistaken for a genuine one.
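
For illustration, the scheme might look something like this in C; the field widths, separators, and shift/line ranges are my guesses, since the actual format isn’t given here.

    /* Illustrative encoding of shift, line, and production date/time
       into a serial number.  Real widths and ranges are guesses. */
    #include <stdio.h>
    #include <time.h>

    void make_serial(char out[24], int shift, int line, time_t when) {
        struct tm t = *gmtime(&when);   /* UTC keeps serials unambiguous */
        snprintf(out, 24, "S%02dL%02d-%04d%02d%02d%02d%02d",
                 shift, line,
                 t.tm_year + 1900, t.tm_mon + 1, t.tm_mday,
                 t.tm_hour, t.tm_min);
    }

    /* If the factory ran, say, shifts 1-3 on lines 1-20, then
       make_serial(buf, 99, 99, when) can never collide with a real
       serial number – which is exactly the property I needed. */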